SQL programming has a few unique aspects such as Three value logic(3VL), set-based programming logic, and logical processing order. Divination from the programming patterns based on the unique nature of the language usually leads to poor-performing SQL code.
RDBMS fundamentals
SQL Server as an RDBMS(Relational Database Management System), and its SQL language dialect/variant TSQL(Transact-Structured Query Language) has a strong foundation in mathematics. TSQL as SQL Servers language for data management and manipulation is based on two cornerstones of mathematics – Set theory and Predicate logic.
Edgar F. Codd was a computer scientist who invented the relational model for database management systems. He used mathematical theories (Set theory and predicate logic) to create a general data management theory.
IBM’s System R is the very first database system built on Codd’s relational model. In order to “force” DB vendors to follow the relational principles when designing DB systems, Codd published his 12 rules which describe a relational database.
Various database vendors(Oracle, Sybase, SQL server..) implemented the Relational principles similarly but differently. This means that the DB systems “slightly” deviate from Codd’s relational theory, and the theory itself “slightly” deviates* from the original set theory and predicate logic – but that’s ok 🙂
*e.g One of the differences between predicate logic and relational algebra is that the first support the two-valued logic (True/False). The relational model supports three-way logic(True/false and Unknown)
Although tsql querying design patterns are derived from Relational algebra, Tsql does not strictly follow relational rules. One of the differences, that is also related to this article is that a Relation is not sorted → There is no relevance to the order of the elements in a set. However, Tsql allows us to sort(ORDER BY) the final result set (the one that is returned to the Client). We are allowed to replace the Relational result(as a “final product” of the relational operations in a Relational database based on relational theory…. ) with a non-relational result – CURSOR. That is why Order by “Changes everything” 🙂
The logical query processing sequence
To understand the ORDER BY operation, it is important to understand the logical query processing sequence. The concept is unique to the SQL world and defines the order of execution of different query segments. It’s also known as Theoretical execution order because the sequence may be changed by SQL Srv optimizer due to many different reasons i.e data access patterns, query simplification, etc. The query sequence:
- FROM – The FROM phase identifies the query’s source tables. This phase processes CROSS JOIN, ON Filter, and in the case of OUTER JOINs it adds the non-matched rows back to the result set.
- WHERE Filter. Removes all those rows from the previous stage for which the predicate evaluates false.
- GROUP BY – Performs grouping and aggregation calculations. This phase generates one row per group.
- HAVING – This is the last of the three query filters. It filters the results of aggregate functions.
- SELECT – Processes the elements in the Select clause i.e Evaluates expressions within the Select statement and/or removes duplicates(DISTINCT) and/or filters the specified percentage/number of rows(TOP) defined by the ORDER BY clause. The phase returns TABLE.
- ORDER BY – Sorts the rows according to the list of columns specified and returns CURSOR.
- FOR XML/FOR JSON The phase converts tabular results from the SELECT statement into XML/JSON output.
For the queries that include SET operations, the logical sequence will be.
- Query 1
- SET operation (union, except, intersect) or multi-set Union All
- Query 2
- Order By
Order by and …
ORDER BY operation guarantees ordered result set – Cursor. TSQL deviates from ANSI standard and allows us to sort results using the columns and/or expressions based on columns that are not part of the SELECT statement(This is not true if the query uses the DISTINCT clause). ANSI standard allows sorting only by using the columns mentioned in the select statement. ORDER BY can also be used as a logical sorting operator when operating as a part of TOP/Offset functions.
Because ORDER BY converts query result-set into a CURSOR, it cannot be used to define table expressions;
- Views
- ITVF(Inline table-valued functions aka parameterized views)
- Derived tables
- Subqueries
- CTE(common table expressions)
… unless used as a logical sorting with(top,offset,for XML/JSON)
The CURSOR must be returned to the client application that is able to “consume” cursor records, one at a time and in order.
ORDER BY treats NULL values as equal. NULLs are sorted FIRST in ASC order.
SET operations (Union, Except, Intersect) and multi-set Union All
When using SET-based operations (UNION, EXCEPT, INTERSECT) and/or multi-set operation UNION ALL we cannot use the ORDER BY operator in the individual statements. The following query will fail the parsing phase.
DECLARE @t1 TABLE (id INT, val VARCHAR(10));
DECLARE @t2 TABLE (id1 INT, val1 VARCHAR(10));
INSERT INTO @t1
VALUES (1,'A'),(1,'B'),(2,'S')
INSERT INTO @t2
VALUES (1,'A'),(1,'B1'),(2,'S'),(3,'M')
SELECT * FROM @t1 ORDER BY val DESC
INTERSECT
SELECT * FROM @t2
Msg 156, Level 15, State 1, Line 24
Incorrect syntax near the keyword ‘INTERSECT’
SQL Server’s SET operations correspond to operators defined in mathematical set theory. Therefore, the operands must be sets – represented as Tables (Set → Relation → Table)
ORDER BY changes query output to something that ANSI SQL defined as CURSOR. Set operations do not support ordered structures.
We can use the ORDER BY clause to format the FINAL result-set for presentation purposes. The ORDER BY in the example below logically does not belong to the last query. It operates on (Set1 U Set2)
SELECT * FROM @t1 --Set1
INTERSECT --U
SELECT * FROM @t2 --Set2
ORDER BY val DESC
TOP/OFFSET-FETCH
The TOP option is unique to tsql and does not follow ANSI guidelines. The directive simply allows us to select a number of rows or a percentage of rows to return. The option is supposed to be paired up with ORDER BY – always.
This time ORDER BY does not serve a presentation purpose. This time it serves as a logical “TOP sorting” operator. It is supposed to answer the question “TOP according to what order?”. The result of a TOP/ORDER BY operation is a table – not a cursor.
Unfortunately, tsql allows us to use TOP without ORDER BY, and in that case, the general query result will be random and non-deterministic TOP(n) rows / TOP(n) PERCENT result set.
It is a good practice to specify ORDER BY(SELECT NULL) if our intention is to select random and/or non-deterministic TOP(n) rows. Query Optimiser will remove the logical ordering during the simplification phase. The purpose of this is to let other developers know that this was done intentionally.
Test data:
IF OBJECT_ID('tempdb.dbo.#testTopOrderBy','U') IS NOT NULL
DROP TABLE #testTopOrderBy
GO
CREATE TABLE #testTopOrderBy(CustomerId INT UNIQUE NOT NULL
,OrderId INT
,OrderDate DATE)
GO
INSERT INTO #testTopOrderBy
SELECT t.*
FROM (VALUES (100,1001,'20170101')
,(106,1005,'20170210')
,(105,1007,'20170101')
,(103,1008,'20170410')
,(104,1011,'20170511')
,(102,1022,'20170101')
,(101,1223,'20170210') ) t(custId,OrdId,odate)
CustomerId OrderId OrderDate
----------- ----------- ----------
100 1001 2017-01-01
106 1005 2017-02-10
105 1007 2017-01-01
103 1008 2017-04-10
104 1011 2017-05-11
102 1022 2017-01-01
101 1223 2017-02-10
(7 rows affected)
Test 1: The query returns two randomly chosen rows(TOP without ORDER BY)
SELECT TOP(2) *
FROM #testTopOrderBy
ORDER BY (SELECT NULL) – include this line to let other developers know about your intentions
CustomerId OrderId OrderDate
----------- ----------- ----------
100 1001 2017-01-01
106 1005 2017-02-10
Test 2: The query returns a non-deterministic result – “non-unique” ORDER BY is used aka. non-deterministic Order By.
SELECT TOP(2) *
FROM #testTopOrderBy
ORDER BY OrderDate
CustomerId OrderId OrderDate
----------- ----------- ----------
100 1001 2017-01-01
105 1007 2017-01-01
(2 rows affected)
--row not included
102 1022 2017-01-01
The query “decided” to select Customers 100 and 105 but not 102. In this case, if we had a clustered index on customerId, the query would probably select customers 100 and 102.
Test 3: Two ways to make the “TOP(n)” query return a deterministic result.
--Deterministic ORDER BY
SELECT TOP(2) *
FROM #testTopOrderBy
ORDER BY CustomerId,OrderDate
CustomerId OrderId OrderDate
----------- ----------- ----------
100 1001 2017-01-01
101 1223 2017-02-10
(1) The deterministic ORDER BY provides more specific information about which TOP(2) order dates to select. Now the query knows exactly which TOP(2) rows to select. a combination of Order By columns that provides a deterministic result-set is called Tiebreaker.
--TOP with TIES
SELECT TOP(2) WITH TIES *
FROM #testTopOrderBy
ORDER BY OrderDate
CustomerId OrderId OrderDate
----------- ----------- ----------
100 1001 2017-01-01
105 1007 2017-01-01
102 1022 2017-01-01
(2) The “WITH TIES” guarantees deterministic results based on Order by column(s). WITH TIES must have ORDER BY defined. In the case of non-deterministic Order By, WITH TIES will add all relevant rows to the result set in order to provide determinism. In the example above it added an extra row.
OFFSET-FETCH (tsql 2012+)
OFSET-FETCH is an ANSI SQL-supported feature. Basically, the function is an advanced version of the TOP function. The option allows us to select(fetch) only a window or a page of results from the result set.
Opposed to the similar, TOP function, OFFSET–FETCH can not operate without Order By. On the other hand, OFFSET-FETCH does not support WITH TIES therefore it cannot guarantee a deterministic result set.
The following query fails because offset-fetch always operates with ORDER BY:
SELECT *
FROM #testTopOrderBy
OFFSET 0 ROWS FETCH FIRST 2 ROWS ONLY
Msg 102, Level 15, State 1, Line 25 Incorrect syntax near ‘0’. Msg 153, Level 15, State 2, Line 25 Invalid usage of the option FIRST in the FETCH statement.
If we want to select random 2 rows from a table, we can use
SELECT *
FROM #testTopOrderBy
ORDER BY (SELECT NULL)
OFFSET 0 ROWS FETCH FIRST 2 ROWS ONLY
The query below will return the same result as the query from Test 3.
SELECT *
FROM #testTopOrderBy
ORDER BY CustomerId,OrderDate
OFFSET 0 ROWS FETCH FIRST 2 ROWS ONLY
CustomerId OrderId OrderDate
----------- ----------- ----------
100 1001 2017-01-01
101 1223 2017-02-10
(2 rows affected)
Same as TOP, OFFSET-FETCH returns a table therefore it can be used in other table expressions, e.g., (the query does not make much sense – it’s only for testing purposes …:)
SELECT a.*
FROM #testTopOrderBy a
INNER JOIN (
SELECT CustomerId
FROM #testTopOrderBy
ORDER BY CustomerId,OrderDate
OFFSET 0 ROWS FETCH FIRST 2 ROWS ONLY
) x --the result is a table expression
ON a.CustomerId = x.CustomerId
..and this will fail. Table operators (JOIN and non-ANSI APPLY, PIVOT, and UNPIVOT) operate only between tables(table expressions).
SELECT a.*
FROM #testTopOrderBy a
INNER JOIN (
SELECT CustomerId
FROM #testTopOrderBy
ORDER BY CustomerId,OrderDate
) x --the result is a cursor
ON a.CustomerId = x.CustomerId
Msg 1033, Level 15, State 1, Line 38
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions unless TOP, OFFSET, or FOR XML is also specified.
VIEWS
A view is a named virtual table that is defined by a query and used as a table
A view represents a table, and a table is a logical entity that has no order to its rows. This is the reason why it’s not allowed to use presentation ORDER BY in a view. VIEWS ARE NOT ORDERED!
The view definition below will fail
Test 4. Views are not ordered
CREATE VIEW dbo.TestViews
AS
SELECT v.CustId
,v.OrderId
FROM (VALUES(1,1000),(2,10001),(2,2003)) v(CustId,OrderId)
ORDER BY CustId
Msg 1033, Level 15, State 1, Procedure TestViews, Line 7 [Batch Start Line 1]
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions unless TOP, OFFSET, or FOR XML is also specified.
Now, some programmers might think that they somehow managed to override RDBMS fundamentals by defining a view with a TOP clause specifying all 100% rows and ORDER BY.
Test 5, Views with TOP 100% ordered rows
CREATE VIEW testViews
AS
SELECT TOP(100) PERCENT
[name]
,[object_id]
,create_date
FROM sys.objects
ORDER BY create_date DESC
This will pass the parsing stage and will be stored as a valid view definition. In reality, the view definition does not make much sense(see the TOP/ORDER BY example above). During the simplification phase, the query optimizer will remove the top 100 %/ORDER BY as an unnecessary logic.
SELECT * FROM testViews
name object_id create_date
------------------------ ----------- -----------------------
sysrscols 3 2014-02-20 20:48:35.270
sysrowsets 5 2009-04-13 12:59:11.093
sysclones 6 2014-02-20 20:48:35.477
sysallocunits 7 2009-04-13 12:59:11.077
sysfiles1 8 2003-04-08 09:13:38.093
sysseobjvalues 9 2014-02-20 20:48:35.610
syspriorities 17 2014-02-20 20:48:35.367
sysdbfrag 18 2014-02-20 20:48:35.563
sysfgfrag 19 2014-02-20 20:48:35.243
... etc...
The result is not ordered by create_date DESC
In some cases i.e when we use TOP and ORDER by the clustered index key columns, the result may be appearing to be sorted – QO may decide to use a sorted table scan. Again, there is no guarantee that the returned rows will be ordered.
e.g. During the final SELECT phase, QO may decide to use SQL Server’s storage engine advanced scan functionality also known as merry-go-round scan or scan piggybacking to return the same rows to two or more concurrent sessions. Without diving into the details of how the advanced scan works(will cover that interesting optimization mechanism in one of the next posts), the different sessions will have the same result sets but in a different order.
The presentation ORDER BY will convert the un-ordered view result into an ordered CURSOR.
SELECT *
FROM testViews
ORDER BY create_date DESC
Window Functions
Windows functions are introduced in the ANSI SQL-2003 standard. Microsoft partially implemented the functionality in tSql 2005 and expanded to a full range of window functions in tSql 2012,
A Window function operates on a number of rows(window) in a set(table). Window functions use ORDER BY as a logical operator which defines the behavior of the aggregate operations performed by the functions. In This scenario, ORDER BY, as a part of the functions does not perform the final sort.
It is important to understand where the Window function sits in the logical query execution sequence.
(5) SELECT
(5.1) Evaluate expressions in the SELECT statement
(5.2) WINDOWS Fn can be executed from here…
(5.3) DISTINCT
(5.4) TOP
(5.5) ….to here
(6) ORDER BY (presentation)
The example below demonstrates how window functions are positioned in the logical query exec. sequence.
Test Data.
IF OBJECT_ID('tempdb.dbo.#testWindowFnLogicalSeq','U') IS NOT NULL
DROP TABLE #testWindowFnLogicalSeq
GO
CREATE TABLE #testWindowFnLogicalSeq(
OrderPackDesc VARCHAR(50)
,OrderValue INT)
GO
INSERT INTO #testWindowFnLogicalSeq
SELECT *
FROM (VALUES ('Media Pack',100)
,('Media Pack',90)
,(NULL,10)
,('MS Surfice accessories',200)
,('MS Surfice accessories',50)
) tab(ordrPck,val)
OrderPackDesc OrderValue
-------------------------------------------------- -----------
Media Pack 100
Media Pack 90
NULL 10
MS Surfice accessories 200
MS Surfice accessories 50
(5 rows affected)
The query below sums the OrderValues per OrderPackDescription. The OrderPackDesc = NULL will be summarised as a distinct value resulting OrderValue=10. ISNULL(OrderPackDesc,’Media Pack’) evaluates the NULL value as “Media Pack”
SELECT DISTINCT OrderPackDescEXPR = ISNULL(OrderPackDesc,'Media Pack')
,OrdVal = SUM(OrderValue) OVER(PARTITION BY OrderPackDesc
ORDER BY( SELECT NULL))
FROM #testWindowFnLogicalSeq
OrderPackDescEXPR OrdVal
-------------------------------------------------- -----------
Media Pack 10
Media Pack 190
MS Surfice accessories 250
(3 rows affected)
This could lead to a logical error. We want to find out the sums of order values per order pack. We also want to treat the missing OrderPacks as ‘Media Pack’. The previous query sums the “missing order packs” separately. This is a typical logical error. The error would be more difficult to find if we had OrderValue = 190 for the unknown orderPackDesc. In that case, we will have a total of 190 instead of 380 for the ‘Media Pack’.
Taking into consideration the query execution sequence we can modify our query as:
SELECT DISTINCT OrderPackDescEXPR = ISNULL(OrderPackDesc,'Media Pack')
,OrdVal = SUM(OrderValue) OVER(PARTITION BY ISNULL(OrderPackDesc,'Media Pack') ORDER BY( SELECT NULL))
FROM #testWindowFnLogicalSeq
OrderPackDescEXPR OrdVal
-------------------------------------------------- -----------
Media Pack 200
MS Surfice accessories 250
(2 rows affected)
…OVER(PARTITION BY ( ISNULL(OrderPackDesc,’Media Pack’) … is evaluated BEFORE the WINDOW fn.
Now our window fn, partition knows how to treat the missing values before the aggregation part and the execution sequence works in your favor 🙂
The example also shows that DISTINCT executes after WINDOW functions.
NOTE: It is not possible to use alias OrderPackDescEXPR in other expressions within the same query, like … OVER(Partition by OrderPackDescEXPR ..) The limitation is another unique aspect of SQL language – All-at-once operation.
E.g in most programming languages, to swap values between variables we use a third, temp var. A→ temp, B→A , temp→ B . In SQL we can swap table column values without using the temp var storage..
DECLARE @test TABLE(Age1 tinyint, Age2 tinyint)
INSERT INTO @test
VALUES (65,25)
SELECT * FROM @test
UPDATE @test
SET Age1 = Age2
,Age2 = Age1
SELECT * FROM @test
Window aggregate functions
Starting from SQL 2012 the ORDER BY clause can be used along with the window aggregate functions. ORDER BY dictates how the aggregate functions operate within the window partitions. It is important to understand the ORDER BY in this content to be able to avoid logical errors.
Create some test data
IF OBJECT_ID('tempdb.dbo.#testWindowFnLogicalSeq','U') IS NOT NULL
DROP TABLE #testWindowFnLogicalSeq
GO
CREATE TABLE #testWindowFnLogicalSeq(
CustomerId INT
,OrderId INT
,OrderValue INT)
GO
;WITH createNumbers as
(
SELECT TOP 15 ROW_NUMBER()OVER(ORDER BY(SELECT NULL)) + 1000 rn
FROM sys.columns c1
CROSS JOIN sys.columns c2
ORDER BY (SELECT NULL)
)
INSERT INTO #testWindowFnLogicalSeq
SELECT CustomerId = (rn % 3)+ 1000 -- 3 customers
,OrderId = rn + 1000 --some unique orders
,OrderValue = ABS(CAST(CHECKSUM(NewId()) AS BIGINT)) % 1500 -- "random" numbers between 0 - 1499
FROM createNumbers
GO
SELECT * FROM #testWindowFnLogicalSeq ORDER BY CustomerId,OrderId
CustomerId OrderId OrderValue
----------- ----------- -----------
1000 2002 870
1000 2005 169
1000 2008 859
1000 2011 516
1000 2014 11
1001 2003 692
1001 2006 683
1001 2009 986
1001 2012 403
1001 2015 71
1002 2001 419
1002 2004 213
1002 2007 609
1002 2010 289
1002 2013 38
(15 row(s) affected)
** The OrderValues column contain random values.
The following query explains how ORDER BY affects SUM – window aggregate function
SELECT CustomerId
,OrderId
,OrderValue
,SUM(OrderValue) OVER() AS [GrandTotal OrderValue sum]
,SUM(OrderValue) OVER(PARTITION BY CustomerId) [Total OrderValue per cust.]
--cumulative sums
,SUM(OrderValue) OVER(ORDER BY CustomerId) [Cumulative OrderValue per cust.]
,SUM(OrderValue) OVER(PARTITION BY CustomerId
ORDER BY OrderId) [Cumulative OrderValue per cust. & ord.]
FROM #testWindowFnLogicalSeq
ORDER BY CustomerID,OrderId --presentaiton purpose
CustomerId OrderId OrderValue GrandTotal OrderValue sum Total OrderValue per cust. Cumulative OrderValue per cust. Cumulative OrderValue per cust. & ord.
----------- ----------- ----------- ------------------------- -------------------------- ------------------------------- --------------------------------------
1000 2002 870 6828 2425 2425 870
1000 2005 169 6828 2425 2425 1039
1000 2008 859 6828 2425 2425 1898
1000 2011 516 6828 2425 2425 2414
1000 2014 11 6828 2425 2425 2425
1001 2003 692 6828 2835 5260 692
1001 2006 683 6828 2835 5260 1375
1001 2009 986 6828 2835 5260 2361
1001 2012 403 6828 2835 5260 2764
1001 2015 71 6828 2835 5260 2835
1002 2001 419 6828 1568 6828 419
1002 2004 213 6828 1568 6828 632
1002 2007 609 6828 1568 6828 1241
1002 2010 289 6828 1568 6828 1530
1002 2013 38 6828 1568 6828 1568
ORDER BY activates cumulative aggregations on the ordered column over a defined partition. The same applies to other aggregate functions MAX, MIN, AVG, and COUNT.
Failing to understand this behavior can lead to logical error 🙂
CONCLUSION
The main purpose of ORDER BY is to force a query to return an ordered result – CURSOR. This changes the relational nature of SQL language – query result sets are NEVER ordered. The presentation ORDER BY is an expensive operation (it needs to collect the entire result-set and put it in the requested order). The same syntax (ORDER BY) is used as a logical operator for TOP/OFFSET FETCH/WINDOW Functions/FOR XML formatter. In those cases, the purpose of the operator is completely different and DOES NOT guarantee the ordered output.
Thank you for reading.