Tag Archives: RDBMS

Set up Python for SQL Server


Summary

In SQL Server 2016, Microsoft extended SQL Server engine capabilities to be able to execute an external code written in R language. Sql Server 2017 added Python language to the mix and 2019 integrated Java. The new functionality is a part of Sql Server’s Machine Learning Services.
This blog explains how to set up SQL Server 2019 so we can use ML services and
the Python programming language.

Prepare the environment for Python

The first step is to add Machine Learning Service Extensions / Python to our SQL Server environment.

Figure 1, ML services

The next step is to allow Sql Server to execute external scripts by turning on the external scripts enabled user option on the server level

sp_configure 
     @configname = 'show advanced options'
    ,@configvalue = 1
GO
RECONFIGURE WITH OVERRIDE;
GO
 
sp_configure 
     @configname = 'external scripts enabled'
    ,@configvalue = 1            
GO
RECONFIGURE WITH OVERRIDE;
GO

At this stage, we should be ready to go.

Let’s explore how the SQL Server engine relates to the above-mentioned ML(Machine Learning)  services. The two components are “bound” by the Extensibility framework.

Launchpad

Launchpad service, one of several services in Sql Server, has a job to manage and execute external scripts – the scripts written in Python language. It instantiates an external script session as a separate, satellite process.


Figure 2, Extensibility framework

The purpose of the extensibility framework is to provide an interface between the SQL Server engine and Python (and other languages). Figure 2 shows how different components interact and demonstrates security concepts.
From the super high-level point of view, the process goes like this:

  1. Our application calls system stored proc. sys.sp_execute_external_script. The proc is designed to encapsulate code written in one of the external languages, Python in this case. The stored proc executes within the SQL Server engine – Sqlservr.exe. The engine knows that the code it needs to run contains a program written in Python and therefore starts the Launchpad.exe process. Sqlservr.exe uses Named Pipes protocol(often used for the communication between processes on the same machine) to pass the code to be executed.
  2. Launchpad service now call initiates the DLL specific to the language that needs to be run. In this case, it’s PythonLauncher.dll.
  3. PythonLauncher.dll now runs Python.exe/dll which compiles the program to bytecode and executes it.
  4. Python.exe then talks to yet another program, BxlServer.exe. Bxl stands for Binary Exchange Language. This program coordinates with the Python runtime to manage exchanges of data and storage of the working results.
  5. sqlsatellite.dll communicates with SQL Server engine over TCP/IP protocol and ODBC interface. It retrieves input data sets*, sends back result sets and Python’s standard I/O streams – stdout and stderr.
  6. BxlServer.exe communicates with Python.exe passing messages back and forth from Sql Server through sqlsatellite.dll.
  7. Sql Server gets the result set, closes the related tasks and processes, and sends the results back to the client e.g SSMS

Security

Security architecture for the extensibility framework can be observed in Figure 2.

  • The client uses Windows/SQL login to establish a connection with SQL Server and execute Python code encapsulated in sys.sp_execute_external_script stored proc.
  • Launchpad service runs as a separate process under its own account – NT Service\MSSQLLaunchpad$<instanceName>. The Windows account has all the necessary permissions to run external scripts and is a member of the SQLRUserGroup2019 Windows group.
    How to interact with the file system(Import/Export csv, xlsx, ..) or to access resources outside the Server env will be explained later in this blog
  • Launchpad, when invokes the appropriate runtime environment, in our case, the PythonLauncher.dll, initiates an AppContainer and assigns the process to it. By default, the Launcher service creates 21 containers AppContainer00AppContainer20. The number of containers can be changed through Sql Server Configuration Manager/Sql Server Launchpad/Advanced/Security Contexts Count property.
    AppContainer is an internal mechanism that provides an isolated execution environment for the applications assigned to it. An application running in an AppContainer can only access resources specifically granted to it. This approach prevents the application from influencing, or being influenced by, other application processes
  • Launchpad maps the User(used to make a connection) to an AppContainer that now defines credentials unique to user/application pair. This also means that the application, in this case, launchpad.exe cannot impersonate db Users and gain more access than it should have.

So, in a nutshell, a Client connects to Sql Server using Sql/Win login mapped to db user. The user runs sys.sp_execute_external_script with the Python code.  Sql Server runs launchpad.exe which runs under it’s own win account. The launcher then initiate one of twenty AppContainers

Firewall rules – how to access data from the Internet

As mentioned before, there are 20 separate application containers – AppContainer00 – AppContainer20 These objects restrict access to the applications assigned to them, in this case launchpad.exe.

Figure 3, Windows Firewall outbound rules related to the AppContainers

If we disable these rules, our Python code, encapsulated in sys.sp_execute_external_script, will be able to communicate with the “outside world” and e.g get up-to-date FX (Foreign Exchange) rates, BitCoin price for different currencies etc. – see. forex-python module. We can then e.g describe the result and send back to the Client/top level procedure. It is also possible to “join” the data to a tSQL query result, within the same query batch, and then to send it back to the caller – amazing stuff 🙂

EXEC sys.sp_execute_external_script
        @language = N'Python' --specify language to use
        ,@script   = N'
from pandas_datareader import data

start_date = "2021-11-20"
end_date = "2021-12-01"

UsToAus = data.DataReader("DEXUSAL", "fred",start=start_date, end=end_date).reset_index()

print(UsToAus)
'       
       ,@output_data_1_name =N'UsToAus'
WITH RESULT SETS  (([Date] DATE,[US to AUS]  FLOAT) )


In the example above, the code connects to FRED (Federal Reserve Bank of St. Louis, Missouri, USA) and pulls the exchange rates between the USD and  AUD for a given timeframe. 

File system accessibility – how to import/export result

Another cool thing that you can do with Python is to easily export results to disk  e.g save a query result (report) in a form of an Excel file, to a location where it can be picked up(pulled) by the authorised users. It is also possible to read e.g a lookup table – extended customer information, states, employee information etc,  from disk and join it with the tSQL query result-set, again within the same sp call.
To enable interaction with the file system we need to allow “Sql Server’s” Python to access specific folders. By default, Python can only access its working directory and its subdirectories, in this case c:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019\PYTHON_SERVICES\. As mentioned before, the Launchpad.exe service runs under its own account. It starts python.exe runtime environment in a process, again, under the  same account. Finally it instantiates a  new AppContainer object to contain its activities. This container now acts as a windows app. with very limited access. Windows apps can access only those resources (files, folders, registry keys, and DCOM interfaces) to which they have been explicitly granted access (“ALL APPLICATION PACKAGES”).

Figure 4, ALL APPLICATION PACKAGES access rights

Now, we can use Pivoting with Python script from another blog to export results to disk.

Figure 5, export results to disk

Memory allocations

It is important to understand how ML Services fits into Sql Server memory allocation scheme to be able to access the possible impact it can have on the database server performance.

As we can see from Figure 2,  Python code runs outside of the processes managed by Sql Server. This means that it doesn’t use memory allocated to Sql Server.

The query below shows that, by default, the maximum memory that ML services can use, outside the memory allocated to Sql Server, is 20 % e.g Total RAM 256GB, Sql Server allocated mem = 200GB, ML can use up to 56 * 0.2 = ~11GB

SELECT * 
FROM sys.resource_governor_external_resource_pools
WHERE [name] = N'default'


Figure 6, External resource pools

We can increase the max_memory_percent to e.g 50% by altering  the “default” external resource pool

ALTER EXTERNAL RESOURCE POOL "default" 
   WITH (max_memory_percent = 50);
GO
ALTER RESOURCE GOVERNOR 
RECONFIGURE;
GO

File locations and configurations

  • launchpad.exe: ..\MSSQL15.SQL2019\MSSQL\Binn\Launchpad.exe
  • pythonlauncher.log ..\MSSQL15.SQL2019\MSSQL\Log\ExtensibilityLog\pythonlauncher.log 
  • Config file (Python) :  ..\MSSQL15.SQL2019\MSSQL\Binn\pythonlauncher.confg
    Python home directory, encoding standard, env patrh, working directory – this is where python stores intermediate results during the calculations, etc.

Python version

SQL Server 2019 supports Python 3.7.1. The package comes with the Anaconda 4.6.8 distribution. The language runtimes and various packages e.g  pandas, are installed under the root directory:

c:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019\PYTHON_SERVICES

To check Python version use command prompt:

c:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019\PYTHON_SERVICES>python --version

To check Anaconda distribution version:

c:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019\PYTHON_SERVICES\Scripts>conda.exe --version

But wait, there is more 🙂

With SQL Server 2019 CU3+ (latest CU) we can use the Language Extension feature to “add” different versions of Python to the mix. In fact, we can add any language extension.
For example, I use Python – version 3.9.7 that is installed as a part of the Anaconda 4.10.3 distribution, for work, not related to Sql Server platform. Wouldn’t it  be cool if it was possible to use that installation in SQL Server?

Note:  There are two packages developed by Microsoft, specifically for out-of-the-box Python installation: revoscalepy and microsoftml. These packages are not available in the distribution mentioned above . 

Language extensions is a set of dll’s written in C++ that act as a bridge between Sql Server and the external runtime. This technology allows us to use relational data(query result sets) in the external code. In a nutshell, the current Python language extension program is an open source code available here. However, the code currently supports only versions 3.7.x. So, to be able to use Python 3.9 as the external code, we need to build our own “bridge”.

Here is an excellent article on how to do it. It looks complex , but once you start you’ll not be able to stop – it’s just so much fun.

Figure 7, Python 3.9 Language extension for SQL Server

In the next blog, I’ll present various dev tools we can use to construct our Python code in Sql Server. 🙂

Thanks for reading.

Dean Mincic

PIVOT, Multi Pivot & Dynamic Pivot in SQL Server


Summary

Pivoting is a technique used to rotate(transpose) rows to columns. It turns the unique values from one column in one table or table expression into multiple columns in another table. SQL Server 2005 introduced the PIVOT operator as a syntax extension for table expression in the FROM clause. PIVOT, the relational operator is a T-Sql proprietary operator and is not part of ANSI SQL Standard.

PIVOT operator structure

Rotating(Pivoting) one table or table expression into another  table requires three different elements

  1. Grouping element
  2. Aggregating element
  3. Spreading element

The PIVOT operator accepts only Aggregating and Spreading elements. To avoid possible logical errors we must have a clear understanding of all three parameters, especially the Grouping element.

The following example demonstrates the three elements in action.

Let’s say we want to present the sum of freight(Shipping cost) values per order year for each country that ordered our products.
Set up dbo.Orders_TestPivot table. The script can be found here.

The PIVOT queries below transpose columns from a table expression (ShipCountry, Freight, and OrderYear) into a new table.
The queries are logically identical although they use different types of table expressions. The version on the left uses a Derived query and the one on the right uses a Common table expression(CTE).
More on table expressions can be found here:
My personal preference is the CTE version, so I’ll use that in the following examples. 🙂

A derived query table expression Common Table Expression
SELECT PVT.Shipcountry
      ,PVT.[2018]
      ,PVT.[2019]
      ,PVT.[2020]  
FROM (
      SELECT ord.Freight
            ,OrderYear
            ,ord.Shipcountry
      FROM dbo.Orders_testPivot ord    
    ) AS tabExpr 
PIVOT ( 
        SUM(Freight)  
        FOR OrderYear 
        IN ([2018],[2019],[2020]) 
        ) AS PVT
;WITH tabExpr AS 
(
    SELECT ord.Shipcountry
          ,ord.Freight
          ,OrderYear
    FROM dbo.Orders_testPivot ord    
)
    SELECT  PVT.Shipcountry
           ,PVT.[2018]
           ,PVT.[2019]
           ,PVT.[2020]   
    FROM tabExpr
    PIVOT ( 
           SUM(Freight)  
           FOR OrderYear 
           IN ([2018],[2019],[2020]) 
          ) AS PVT;

The figure below visually maps the elements of the PIVOT operator and the final result set.

Figure 1, PIVOT Operation

My personal way of thinking when creating a PIVOT query is;

  1. Sketch the final result set and visualise all three elements required for the PIVOT operation
  2. Define a table expression(CTE) that returns:
    1. Spreading element – what we want to see on columns – OrderYear
    2. Aggregate element – what we want to see in the intersection of each row and column – Freight
    3. Grouping element* – what we want to see on rows – ShipCountry
  3. Add  PIVOT operator. The pivot operator returns a table result – in our example, the table result has alias PVT.
    1. Include aggregate function applied to the aggregate element – SUM(Freight).
    2. Include the FOR clause and the spreading column – FOR OrderYear.
    3. Specify the IN clause and the list of distinct, comma-separated values that appear in the spreading element. [2018],[2019],[2020] . In our example, we have a list of irregular identifiers* that needs to be delimited.
      If we added a non-existing value to the IN list e.g [2099], the query would execute with no error but with the NULL aggregated values 🙂
    4. Specify an alias for the PIVOT result table – PVT
  4. Specify the final SELECT. The columns are selected from the PIVOT result table. The sequence of the selected columns is not relevant.

Note: Irregular identifiers:
We use identifiers to name(identify) SQL Server’s objects i.e stored procedures, tables, views, constraints, column names, attributes ..etc. There is a set of rules for creating identifiers i.e The first character cannot be numeric, so e.g 2018 is an Irregular identifier. To be able to use irregular identifiers we need to “fix” their boundaries/limits or deLimit them. To do that we can use double quotation marks – 2018 or tSQL specific – square brackets;  [2018]. More about SQL Server Identifiers can be found here.

An interesting thing about the PIVOT operator is that it does not include a grouping element. The grouping element is “everything else” that is not a spreading or an aggregating element. In our example, the grouping element is the ShipCountry column selected in the table expression.
If we selected e.g ShipCity along with ShipCountry as the two columns that are not a spreading or an aggregate element, the result would be different.

;WITH tabExpr AS 
(
    SELECT ord.Shipcountry --grouping element 
          ,ord.Freight   
          ,OrderYear
          ,ord.Shipcity  --grouping element
    FROM dbo.Orders_testPivot ord    
)
    SELECT  PVT.Shipcountry
           ,PVT.[2018]
           ,PVT.[2019]
           ,PVT.[2020]   
    FROM tabExpr
    PIVOT ( 
            SUM(Freight)  
            FOR OrderYear 
            IN ([2018],[2019],[2020]) 
        ) AS PVT;


Figure 2, Group By ShipCountry and ShipCity

This behavior can cause logical errors, especially if we apply the PIVOT operator directly on a table.

In the next experiment, we are not using a table expression to prepare the data set for the PIVOT operator. Instead, PIVOT now operates over the entire table. It implicitly(automatically) groups data by all columns except the orderDate and Freight columns. As we can see in Figure 3, the query produces an unexpected result

--the skewed result
SELECT PVT.Shipcountry
      ,PVT.[2018]
      ,PVT.[2019]
      ,PVT.[2020]  
FROM dbo.Orders_testPivot
PIVOT ( 
       SUM(Freight)  
       FOR OrderYear 
        IN ([2018],[2019],[2020]) 
      ) AS PVT;


Figure 3, PIVOT operation directly on a table

To avoid possible logical errors, it is always a good practice to first construct a table expression with the implicitly defined PIVOT elements(grouping, spreading, and aggregating), and then apply the PIVOT operator to the prepared data set.

Horizontal Aggregates

Horizontal aggregates are aggregates across different columns per group.

It would be nice to add a total freight per country across the spreading element – the order years. Here is the base table definition.

There are several ways to achieve this e.g  We can simply check for NULL values and add the column values – total_freight = ISNULL(PVT.[2018]) + ISNULL(…

However, it would be cool to implement it by using Table constructor to create a virtual correlated query that summarizes freight data for all order years, per row – uh, that was a mouthful 🙂 

Here is what we want to achieve:

We can extend the original pivot CTE from above with the total_freight column.

;WITH tabExpr AS 
(
  SELECT ord.Shipcountry,
         ord.Freight,
         ord.OrderYear
  FROM dbo.Orders_testPivot ord
)
SELECT PVT.Shipcountry,
       PVT.[2018],
       PVT.[2019],
       PVT.[2020],
       total_freight = ( SELECT SUM(tab.freight_per_year)
                         FROM (VALUES (PVT.[2018]),
                                      (PVT.[2019]),
                                      (PVT.[2020])
                               ) AS tab(freight_per_year) )
FROM tabExpr
PIVOT(
      SUM(Freight)
      FOR OrderYear IN ([2018], [2019], [2020])
     ) AS PVT;

Multi aggregate pivot

A PIVOT operator can handle only one aggregate element at a time.  This means that if we want to use more aggregate elements we need to add more PIVOT operators to our query – a PIVOT operator per aggregate element 😐
In the previous example, our aggregate element was Freight when we calculated the total shipping costs in different countries per year.
This time, we want to calculate the average value of the orders placed in different countries per year and add the results to our query.
Figure 4 shows the desired result

Figure 4, Multi aggregate PIVOT- two aggregate elements

From the result, we can see that the second result set is just “appended” to the first. Basically, we just combined the two PIVOT results using an INNER JOIN table operator and an equality predicate on the ShipCountry column.
The final query uses column aliases to indicate the different data sets.
Figure 6, Multi aggregate PIVOT operation

The query in Figure 6 can be found here.

Dynamic PIVOT

A disadvantage of the PIVOT operator is that its IN clause only accepts a static list of spreading values. It does not support e.g a sub-query as input. This means that we need to know in advance all the distinct values in the spreading element. The “hard coding” may not necessarily be a problem in cases when we deal with a spreading element with known spreading values e.g OrderYear.
Going back to the first example, we can easily expand the IN list with the spreading values that are not available yet.

/*A future order  
 Note: If truncated, open the code in a new Window */
INSERT INTO dbo.Orders_testPivot (Custid,Orderdate,Shipperid,OrderValue,Freight,Shipname,Shipaddress,Shipcity,Shipregion,Shipcountry)
    VALUES (51,'20210201',2,$220,$350,N'Future shipment :)',N'10178 106 St',N'Edmonton',N'Alberta',N'Canada');
GO

;WITH tabExpr AS 
(
    SELECT ord.Shipcountry
          ,ord.Freight   
          ,OrderYear
    FROM dbo.Orders_testPivot ord    
)
    SELECT  PVT.Shipcountry
           ,PVT.[2018]
           ,PVT.[2019]
           ,PVT.[2020]
           ,PVT.[2021]
           ,PVT.[2022]
    FROM tabExpr
    PIVOT ( 
            SUM(Freight)  
            FOR OrderYear 
            IN ([2018],[2019],[2020],[2021],[2022]) 
        ) AS PVT;

Things get more complex when we cannot predict all possible spreading values. In these situations, we can first design a query that will give us a distinct list of spreading values and then use that list to dynamically construct the final PIVOT query, the Dynamic Pivot.
A typical scenario in which we use Dynamic pivoting is when transposing attributes of an EAV*(Entity-Attribute-Value) data model.

EAV* is one of the open-schema data models (XML, JSON …) that, in some cases, can provide more flexibility than the relational model. Here is an interesting post about EAV.

Let’s say we have a list of Products. Each product is different and can have a specific set of attributes. e.g a bicycle can have a specific type of tires and a hard drive can have a specific capacity..etc. Business frequently adds new products and product attributes. In the next example, I used a simplified EAV model to store the products. The table script can be found here.

Our next task is to return a row for each distinct product, a column for each distinct product attribute, and in the intersection of each product and attribute, we want to see the value of the attribute.

Figure 7 shows the desired output for all products and for a specific product
Figure 7, Dynamic pivot result

In this scenario, we cannot know all the possible Attributes(the spreading element values). Moreover, the list of attributes is constantly changing, so hard-coding the IN list is no longer an option.
The following is a  dynamic pivot query that can give us the result in Figure 7.

DECLARE @sprdElements AS NVARCHAR(MAX) --comma separated, delimited, distinct list of product attributes
        ,@tSql AS NVARCHAR(MAX)        --query text
        ,@ObjectName VARCHAR(255);     --specific product name

SET @ObjectName = NULL -- 'BMC Road Bike' --specific product

--comma separated list of attributes for a product
;WITH dsitSpreadElList AS
(
    SELECT DISTINCT Attribute
    FROM Products
    WHERE ObjectName = @ObjectName
        OR @ObjectName IS NULL
)
SELECT @sprdElements = COALESCE(@sprdElements+', ','')+'['+ CAST( Attribute AS NVARCHAR(255))+']'
--SELECT @sprdElements = STRING_AGG('['+Attribute+']',',') --Available in SQL2017+
FROM dsitSpreadElList;

--print @sprdElements

SET @tSql =N';WITH TabExp AS
             (
                SELECT ObjectName -- grouping element
                      ,Attribute  -- spreading element
                      ,[Value]    -- aggregating element
                FROM dbo.Products
                WHERE ObjectName = @ObjName
                  OR @ObjName IS NULL
	     )
             SELECT ObjectName,'+@sprdElements +N'
             FROM TabExp
             PIVOT (
                    MAX([Value])
                    FOR Attribute IN (' + @sprdElements +N') 
                    ) AS pvt';

 EXEC sys.sp_executesql
     @stmt = @tSql
    ,@params = N'@ObjName VARCHAR(255)'
    ,@ObjName = @ObjectName;

NOTE: To extract a known Attribute value, in this case, we can use MAX() or MIN() aggregate functions. Both functions will operate on a single value and will return a single value. Keep in mind that MIN and MAX as well as all other aggregate functions except COUNT(*), ignore NULL values.

The new attributes will be automatically handled by the dynamic query.

INSERT INTO dbo.Products
   VALUES ('BMC Road Bike',
           'Gearing',
           CAST(CAST('Triple chain-ring 50/39/30' AS VARCHAR(255)
                    ) AS SQL_VARIANT
               )
   );

A couple of versions of the dynamic query can be downloaded here.

Conclusion

Pivoting is a technique used to transpose rows to columns. PIVOT is a tSql proprietary operator and is not part of the ANSI Standard. PIVOT operator accepts two parameters; Spreading element or what we want to see on columns and aggregating element or what we want to see in the intersection of each distinct row and column. The grouping element is the third parameter involved in pivot operation. It is what we want to see on rows. The grouping element is not a formal part of the PIVOT operator and represents all columns that are not defined as spreading or aggregating elements. The implicit nature of the grouping element can lead to logical errors. This is why is recommended to construct a table expression for the PIVOT operator that provides exactly three elements needed for the operation.
A PIVOT operator is limited to only one aggregate function. To perform multi aggregate pivot we need to introduce a PIVOT operator per aggregation.
The IN clause of the PIVOT operator accepts only a hard-coded, comma-separated list of spreading element values. In the situations when the values are not known, we use dynamic sql to construct the query.

 

Thanks for reading.

Dean Mincic

ORDER BY – Changes everything

SQL programming has a few unique aspects such as Three value logic(3VL), set-based programming logic, and logical processing order. Divination from the programming patterns based on the unique nature of the language usually leads to poor-performing SQL code.

RDBMS fundamentals

SQL Server as an RDBMS(Relational Database Management System), and its SQL language dialect/variant TSQL(Transact-Structured Query Language) has a strong foundation in mathematics. TSQL as SQL Servers language for data management and manipulation is based on two cornerstones of mathematics – Set theory and Predicate logic.

Edgar F. Codd was a computer scientist who invented the relational model for database management systems. He used mathematical theories (Set theory and predicate logic) to create a general data management theory.
IBM’s System R is the very first database system built on Codd’s relational model. In order to “force” DB vendors to follow the relational principles when designing DB systems, Codd published his 12 rules which describe a relational database.
Various database vendors(Oracle, Sybase, SQL server..) implemented the Relational principles similarly but differently. This means that the DB systems “slightly” deviate from Codd’s relational theory, and the theory itself “slightly” deviates* from the original set theory and predicate logic – but that’s ok 🙂

*e.g One of the differences between predicate logic and relational algebra is that the first support the two-valued logic (True/False). The relational model supports three-way logic(True/false and Unknown)

Although tsql querying design patterns are derived from Relational algebra, Tsql does not strictly follow relational rules. One of the differences, that is also related to this article is that a Relation is not sorted → There is no relevance to the order of the elements in a set. However, Tsql allows us to sort(ORDER BY) the final result set (the one that is returned to the Client). We are allowed to replace the Relational result(as a “final product” of the relational operations in a Relational database based on relational theory…. ) with a non-relational result – CURSOR. That is why Order by “Changes everything” 🙂

The logical query processing sequence

To understand the ORDER BY operation, it is important to understand the logical query processing sequence. The concept is unique to the SQL world and defines the order of execution of different query segments.  It’s also known as Theoretical execution order because the sequence may be changed by SQL Srv optimizer due to many different reasons i.e data access patterns, query simplification, etc. The query sequence:

  1. FROM – The FROM phase identifies the query’s source tables. This phase processes CROSS JOIN, ON Filter, and in the case of OUTER JOINs it adds the non-matched rows back to the result set.
  2. WHERE Filter. Removes all those rows from the previous stage for which the predicate evaluates false.
  3. GROUP BY – Performs grouping and aggregation calculations. This phase generates one row per group.
  4. HAVING – This is the last of the three query filters. It filters the results of aggregate functions.
  5. SELECT – Processes the elements in the Select clause i.e Evaluates expressions within the Select statement and/or removes duplicates(DISTINCT) and/or filters the specified percentage/number of rows(TOP) defined by the ORDER BY clause. The phase returns TABLE.
  6. ORDER BY – Sorts the rows according to the list of columns specified and returns CURSOR.
  7. FOR XML/FOR JSON The phase converts tabular results from the SELECT statement into XML/JSON output.

For the queries that include SET operations, the logical sequence will be.

  1. Query 1
  2. SET operation (union, except, intersect) or multi-set Union All
  3. Query 2
  4. Order By

Order by and …

ORDER BY operation guarantees ordered result set – Cursor. TSQL deviates from ANSI standard and allows us to sort results using the columns and/or expressions based on columns that are not part of the SELECT statement(This is not true if the query uses the DISTINCT clause). ANSI standard allows sorting only by using the columns mentioned in the select statement. ORDER BY can also be used as a logical sorting operator when operating as a part of TOP/Offset functions.

Because ORDER BY converts query result-set into a CURSOR, it cannot be used to define table expressions;

  • Views
  • ITVF(Inline table-valued functions aka parameterized views)
  • Derived tables
  • Subqueries
  • CTE(common table expressions)

… unless used as a logical sorting with(top,offset,for XML/JSON)

The CURSOR must be returned to the client application that is able to “consume” cursor records, one at a time and in order.

ORDER BY treats NULL values as equal. NULLs are sorted FIRST in ASC order.

SET operations (Union, Except, Intersect) and multi-set Union All

When using SET-based operations (UNION, EXCEPT, INTERSECT) and/or multi-set operation UNION ALL  we cannot use the ORDER BY operator in the individual statements. The following query will fail the parsing phase.

DECLARE @t1 TABLE (id INT, val VARCHAR(10));
DECLARE @t2 TABLE (id1 INT, val1 VARCHAR(10));
INSERT INTO @t1
    VALUES (1,'A'),(1,'B'),(2,'S')
INSERT INTO @t2
   VALUES (1,'A'),(1,'B1'),(2,'S'),(3,'M')

SELECT * FROM @t1 ORDER BY val DESC
INTERSECT 
SELECT * FROM @t2

Msg 156, Level 15, State 1, Line 24
Incorrect syntax near the keyword ‘INTERSECT’

SQL Server’s SET operations correspond to operators defined in mathematical set theory. Therefore, the operands must be sets – represented as Tables (Set → Relation → Table)
ORDER BY changes query output to something that ANSI SQL defined as CURSOR. Set operations do not support ordered structures.

We can use the ORDER BY clause to format the FINAL result-set for presentation purposes. The ORDER BY in the example below logically does not belong to the last query. It operates on (Set1 U Set2)

SELECT * FROM @t1 --Set1
INTERSECT --U
SELECT * FROM @t2 --Set2
ORDER BY val DESC

TOP/OFFSET-FETCH

The TOP option is unique to tsql and does not follow ANSI guidelines. The directive simply allows us to select a number of rows or a percentage of rows to return. The option is supposed to be paired up with ORDER BY – always.
This time ORDER BY does not serve a presentation purpose. This time it serves as a logical “TOP sorting” operator. It is supposed to answer the question “TOP according to what order?”. The result of a TOP/ORDER BY operation is a table – not a cursor.
Unfortunately, tsql allows us to use TOP without ORDER BY, and in that case, the general query result will be random and non-deterministic TOP(n) rows / TOP(n) PERCENT result set.

It is a good practice to specify ORDER BY(SELECT NULL) if our intention is to select random and/or non-deterministic TOP(n) rows. Query Optimiser will remove the logical ordering during the simplification phase. The purpose of this is to let other developers know that this was done intentionally.

Test data:

IF OBJECT_ID('tempdb.dbo.#testTopOrderBy','U') IS NOT NULL
DROP TABLE #testTopOrderBy
GO
CREATE TABLE #testTopOrderBy(CustomerId INT UNIQUE NOT NULL
,OrderId INT
,OrderDate DATE)
GO
INSERT INTO #testTopOrderBy
   SELECT t.*
   FROM (VALUES (100,1001,'20170101')
                ,(106,1005,'20170210')
                ,(105,1007,'20170101')
                 
                ,(103,1008,'20170410')
                ,(104,1011,'20170511')
                ,(102,1022,'20170101')
                ,(101,1223,'20170210') ) t(custId,OrdId,odate)
CustomerId  OrderId     OrderDate
----------- ----------- ----------
100         1001        2017-01-01
106         1005        2017-02-10
105         1007        2017-01-01
103         1008        2017-04-10
104         1011        2017-05-11
102         1022        2017-01-01
101         1223        2017-02-10

(7 rows affected)

Test 1: The query returns two randomly chosen rows(TOP without ORDER BY)

SELECT TOP(2) *
FROM #testTopOrderBy
ORDER BY (SELECT NULL) – include this line to let other developers know about your intentions
CustomerId  OrderId     OrderDate
----------- ----------- ----------
100         1001        2017-01-01
106         1005        2017-02-10

Test 2: The query returns a non-deterministic result – “non-unique” ORDER BY is used aka. non-deterministic Order By.

SELECT TOP(2) *
FROM #testTopOrderBy
ORDER BY OrderDate
CustomerId  OrderId     OrderDate
----------- ----------- ----------
100         1001        2017-01-01
105         1007        2017-01-01
(2 rows affected)
--row not included
102         1022        2017-01-01

The query “decided” to select Customers 100 and 105 but not 102. In this case, if we had a clustered index on customerId, the query would probably select customers 100 and 102.

Test 3: Two ways to make the “TOP(n)” query return a deterministic result.

--Deterministic ORDER BY
SELECT TOP(2) *
FROM #testTopOrderBy
ORDER BY CustomerId,OrderDate
CustomerId OrderId OrderDate
----------- ----------- ----------
100 1001 2017-01-01
101 1223 2017-02-10

(1) The deterministic ORDER BY provides more specific information about which TOP(2) order dates to select. Now the query knows exactly which TOP(2) rows to select. a combination of Order By columns that provides a deterministic result-set is called Tiebreaker.

--TOP with TIES
SELECT TOP(2) WITH TIES *
FROM #testTopOrderBy
ORDER BY OrderDate
CustomerId OrderId OrderDate
----------- ----------- ----------
100 1001 2017-01-01
105 1007 2017-01-01
102 1022 2017-01-01

(2) The “WITH TIES” guarantees deterministic results based on Order by column(s). WITH TIES must have ORDER BY defined. In the case of non-deterministic Order By, WITH TIES will add all relevant rows to the result set in order to provide determinism. In the example above it added an extra row.

OFFSET-FETCH (tsql 2012+)

OFSET-FETCH is an ANSI SQL-supported feature. Basically, the function is an advanced version of the TOP function. The option allows us to select(fetch) only a window or a page of results from the result set.
Opposed to the similar, TOP function, OFFSET–FETCH can not operate without Order By. On the other hand, OFFSET-FETCH does not support WITH TIES therefore it cannot guarantee a deterministic result set.
The following query fails because offset-fetch always operates with ORDER BY:

SELECT *
FROM #testTopOrderBy
OFFSET 0 ROWS FETCH FIRST 2 ROWS ONLY

Msg 102, Level 15, State 1, Line 25 Incorrect syntax near ‘0’. Msg 153, Level 15, State 2, Line 25 Invalid usage of the option FIRST in the FETCH statement.

If we want to select random 2 rows from a table, we can use

SELECT *
FROM #testTopOrderBy
ORDER BY (SELECT NULL)
OFFSET 0 ROWS FETCH FIRST 2 ROWS ONLY

The query below will return the same result as the query from Test 3.

SELECT *
FROM #testTopOrderBy
ORDER BY CustomerId,OrderDate
OFFSET 0 ROWS FETCH FIRST 2 ROWS ONLY
CustomerId OrderId OrderDate
----------- ----------- ----------
100 1001 2017-01-01
101 1223 2017-02-10
(2 rows affected)

Same as TOP, OFFSET-FETCH returns a table therefore it can be used in other table expressions, e.g., (the query does not make much sense – it’s only for testing purposes …:)

SELECT a.*
FROM #testTopOrderBy a
INNER JOIN (
            SELECT CustomerId
            FROM #testTopOrderBy
            ORDER BY CustomerId,OrderDate
            OFFSET 0 ROWS FETCH FIRST 2 ROWS ONLY
           ) x --the result is a table expression
ON a.CustomerId = x.CustomerId

..and this will fail. Table operators (JOIN and non-ANSI APPLY, PIVOT, and UNPIVOT) operate only between tables(table expressions).

SELECT a.*
FROM #testTopOrderBy a
INNER JOIN (
            SELECT CustomerId
            FROM #testTopOrderBy
            ORDER BY CustomerId,OrderDate
           ) x --the result is a cursor
ON a.CustomerId = x.CustomerId

Msg 1033, Level 15, State 1, Line 38
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions unless TOP, OFFSET, or FOR XML is also specified.

VIEWS

A view is a named virtual table that is defined by a query and used as a table
A view represents a table, and a table is a logical entity that has no order to its rows. This is the reason why it’s not allowed to use presentation ORDER BY in a view. VIEWS ARE NOT ORDERED!
The view definition below will fail

Test 4. Views are not ordered

CREATE VIEW dbo.TestViews
AS
SELECT v.CustId
      ,v.OrderId
FROM (VALUES(1,1000),(2,10001),(2,2003)) v(CustId,OrderId)
ORDER BY CustId

Msg 1033, Level 15, State 1, Procedure TestViews, Line 7 [Batch Start Line 1]
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions unless TOP, OFFSET, or FOR XML is also specified.

Now, some programmers might think that they somehow managed to override RDBMS fundamentals by defining a view with a TOP clause specifying all 100% rows and ORDER BY.

Test 5, Views with TOP 100% ordered rows

CREATE VIEW testViews
AS
    SELECT TOP(100) PERCENT
            [name]
           ,[object_id]
           ,create_date
    FROM sys.objects
    ORDER BY create_date DESC

This will pass the parsing stage and will be stored as a valid view definition. In reality, the view definition does not make much sense(see the TOP/ORDER BY example above). During the simplification phase, the query optimizer will remove the top 100 %/ORDER BY as an unnecessary logic.

SELECT * FROM  testViews
name                     object_id   create_date
------------------------ ----------- -----------------------
sysrscols                3           2014-02-20 20:48:35.270
sysrowsets               5           2009-04-13 12:59:11.093
sysclones                6           2014-02-20 20:48:35.477
sysallocunits            7           2009-04-13 12:59:11.077
sysfiles1                8           2003-04-08 09:13:38.093
sysseobjvalues           9           2014-02-20 20:48:35.610
syspriorities            17          2014-02-20 20:48:35.367
sysdbfrag                18          2014-02-20 20:48:35.563
sysfgfrag                19          2014-02-20 20:48:35.243
... etc...

The result is not ordered by create_date DESC
In some cases i.e when we use TOP and ORDER by the clustered index key columns, the result may be appearing to be sorted – QO may decide to use a sorted table scan. Again, there is no guarantee that the returned rows will be ordered.
e.g. During the final SELECT phase, QO may decide to use SQL Server’s storage engine advanced scan functionality also known as merry-go-round scan or scan piggybacking to return the same rows to two or more concurrent sessions. Without diving into the details of how the advanced scan works(will cover that interesting optimization mechanism in one of the next posts), the different sessions will have the same result sets but in a different order.

The presentation ORDER BY will convert the un-ordered view result into an ordered CURSOR.

SELECT *
FROM testViews
ORDER BY create_date DESC

Window Functions

Windows functions are introduced in the ANSI SQL-2003 standard. Microsoft partially implemented  the functionality in tSql 2005 and expanded to a full range of window functions in tSql 2012,
A Window function operates on a number of rows(window) in a set(table).  Window functions use ORDER BY as a logical operator which defines the behavior of the aggregate operations performed by the functions. In This scenario, ORDER BY, as a part of the functions does not perform the final sort.

It is important to understand where the Window function sits in the logical query execution sequence.

(5) SELECT
(5.1) Evaluate expressions in the SELECT statement
(5.2) WINDOWS Fn can be executed from here…
(5.3) DISTINCT
(5.4) TOP
(5.5) ….to here

(6) ORDER BY (presentation)

The example below demonstrates how window functions are positioned in the logical query exec. sequence.

Test Data.

IF OBJECT_ID('tempdb.dbo.#testWindowFnLogicalSeq','U') IS NOT NULL
    DROP TABLE #testWindowFnLogicalSeq
GO
CREATE TABLE #testWindowFnLogicalSeq( 
        OrderPackDesc VARCHAR(50) 
        ,OrderValue INT)
GO
INSERT INTO #testWindowFnLogicalSeq
    SELECT *
    FROM (VALUES ('Media Pack',100)
                ,('Media Pack',90)
                ,(NULL,10)
                ,('MS Surfice accessories',200)
                ,('MS Surfice accessories',50)
        ) tab(ordrPck,val)
OrderPackDesc                                      OrderValue
-------------------------------------------------- -----------
Media Pack                                         100
Media Pack                                         90
NULL                                               10
MS Surfice accessories                             200
MS Surfice accessories                             50

(5 rows affected)

The query below sums the OrderValues per OrderPackDescription. The OrderPackDesc = NULL will be summarised as a distinct value resulting OrderValue=10. ISNULL(OrderPackDesc,’Media Pack’) evaluates the NULL value as “Media Pack”

SELECT DISTINCT OrderPackDescEXPR = ISNULL(OrderPackDesc,'Media Pack')
               ,OrdVal = SUM(OrderValue) OVER(PARTITION BY OrderPackDesc
                                              ORDER BY( SELECT NULL))
FROM #testWindowFnLogicalSeq
OrderPackDescEXPR                                  OrdVal
-------------------------------------------------- -----------
Media Pack                                         10
Media Pack                                         190
MS Surfice accessories                             250

(3 rows affected)

This could lead to a logical error. We want to find out the sums of order values per order pack. We also want to treat the missing OrderPacks as ‘Media Pack’. The previous query sums the “missing order packs” separately. This is a typical logical error. The error would be more difficult to find if we had OrderValue = 190 for the unknown orderPackDesc. In that case, we will have a total of 190 instead of 380 for the ‘Media Pack’.
Taking into consideration the query execution sequence we can modify our query as:

SELECT DISTINCT  OrderPackDescEXPR = ISNULL(OrderPackDesc,'Media Pack')
                ,OrdVal = SUM(OrderValue) OVER(PARTITION BY ISNULL(OrderPackDesc,'Media Pack') ORDER BY( SELECT NULL))
FROM #testWindowFnLogicalSeq
OrderPackDescEXPR                                  OrdVal
-------------------------------------------------- -----------
Media Pack                                         200
MS Surfice accessories                             250

(2 rows affected)

…OVER(PARTITION BY ( ISNULL(OrderPackDesc,’Media Pack’) … is evaluated BEFORE the WINDOW fn.
Now our window fn, partition knows how to treat the missing values before the aggregation part and the execution sequence works in your favor 🙂
The example also shows that DISTINCT executes after WINDOW functions.
NOTE: It is not possible to use alias OrderPackDescEXPR in other expressions within the same query, like  … OVER(Partition by OrderPackDescEXPR ..) The limitation is another unique aspect of SQL language – All-at-once operation.
E.g in most programming languages, to swap values between variables we use a third, temp var. A→ temp, B→A  , temp→ B . In SQL we can swap table column values without using the temp var storage..

DECLARE @test TABLE(Age1 tinyint, Age2 tinyint)
    INSERT INTO @test
        VALUES (65,25)
SELECT * FROM @test
    UPDATE @test
        SET Age1 = Age2
           ,Age2 = Age1
SELECT * FROM @test

Window aggregate functions

Starting from SQL 2012 the ORDER BY clause can be used along with the window aggregate functions. ORDER BY dictates how the aggregate functions operate within the window partitions. It is important to understand the ORDER BY in this content to be able to avoid logical errors.

Create some test data

IF OBJECT_ID('tempdb.dbo.#testWindowFnLogicalSeq','U') IS NOT NULL
    DROP TABLE #testWindowFnLogicalSeq
GO
CREATE TABLE #testWindowFnLogicalSeq(
     CustomerId INT
    ,OrderId INT
    ,OrderValue INT)
GO
;WITH createNumbers as
(
    SELECT TOP 15 ROW_NUMBER()OVER(ORDER BY(SELECT NULL)) + 1000 rn
    FROM sys.columns c1
    CROSS JOIN sys.columns c2
    ORDER BY (SELECT NULL)
)
INSERT INTO #testWindowFnLogicalSeq
    SELECT CustomerId = (rn % 3)+ 1000 -- 3 customers
          ,OrderId = rn + 1000 --some unique orders
          ,OrderValue = ABS(CAST(CHECKSUM(NewId()) AS BIGINT)) % 1500 -- "random" numbers between 0 - 1499
FROM createNumbers
GO
SELECT * FROM #testWindowFnLogicalSeq ORDER BY CustomerId,OrderId
CustomerId  OrderId     OrderValue
----------- ----------- -----------
1000        2002        870
1000        2005        169
1000        2008        859
1000        2011        516
1000        2014        11
1001        2003        692
1001        2006        683
1001        2009        986
1001        2012        403
1001        2015        71
1002        2001        419
1002        2004        213
1002        2007        609
1002        2010        289
1002        2013        38

(15 row(s) affected)
** The OrderValues column contain random values.

The following query explains how ORDER BY affects SUM – window aggregate function

SELECT CustomerId
       ,OrderId
       ,OrderValue
       ,SUM(OrderValue) OVER() AS [GrandTotal OrderValue sum]
       ,SUM(OrderValue) OVER(PARTITION BY CustomerId) [Total OrderValue per cust.]
       --cumulative sums
       ,SUM(OrderValue) OVER(ORDER BY CustomerId) [Cumulative OrderValue per cust.]
       ,SUM(OrderValue) OVER(PARTITION BY CustomerId
                             ORDER BY OrderId) [Cumulative OrderValue per cust. & ord.]
FROM #testWindowFnLogicalSeq
ORDER BY CustomerID,OrderId --presentaiton purpose
CustomerId  OrderId     OrderValue  GrandTotal OrderValue sum Total OrderValue per cust. Cumulative OrderValue per cust. Cumulative OrderValue per cust. & ord.
----------- ----------- ----------- ------------------------- -------------------------- ------------------------------- --------------------------------------
1000        2002        870         6828                      2425                       2425                            870
1000        2005        169         6828                      2425                       2425                            1039
1000        2008        859         6828                      2425                       2425                            1898
1000        2011        516         6828                      2425                       2425                            2414
1000        2014        11          6828                      2425                       2425                            2425
1001        2003        692         6828                      2835                       5260                            692
1001        2006        683         6828                      2835                       5260                            1375
1001        2009        986         6828                      2835                       5260                            2361
1001        2012        403         6828                      2835                       5260                            2764
1001        2015        71          6828                      2835                       5260                            2835
1002        2001        419         6828                      1568                       6828                            419
1002        2004        213         6828                      1568                       6828                            632
1002        2007        609         6828                      1568                       6828                            1241
1002        2010        289         6828                      1568                       6828                            1530
1002        2013        38          6828                      1568                       6828                            1568

ORDER BY activates cumulative aggregations on the ordered column over a defined partition. The same applies to other aggregate functions MAX, MIN, AVG, and COUNT.
Failing to understand this behavior can lead to logical error 🙂

CONCLUSION

The main purpose of ORDER BY is to force a query to return an ordered result – CURSOR. This changes the relational nature of SQL language – query result sets are NEVER ordered. The presentation ORDER BY is an expensive operation (it needs to collect the entire result-set and put it in the requested order). The same syntax (ORDER BY) is used as a logical operator for TOP/OFFSET FETCH/WINDOW Functions/FOR XML formatter. In those cases, the purpose of the operator is completely different and DOES NOT guarantee the ordered output.

Thank you for reading.