Category Archives: SQL Server Development

Adding non-NULL value columns ONLINE

Adding non-NULL value columns with values to a busy table – ONLINE

Adding a non-null column to a big table on a busy database can be tricky business. It might seem to be an easy task of altering table and adding a column ( an integer data type column – 4 bytes more or less, who cares) but it’s not. DBAs are always extra careful when perform those sorts of actions since it can cause serious implications to the db applications. For example, adding a non-null column to a busy table which contains 10M rows means that every row must be updated with a value. The table may be accessed by many different applications many times in a short period of time. These requests can be seriously blocked by the alter table/add a not null column/update column with a value, causing app timeouts, deadlocks and all sorts of headaches.

One of the common approaches is to add a NULL column and then to update the column values across all rows using the batch update scenario (update N number of rows per transaction) and then to alter column as non-null. This operation is usually performed “offline” or when the table is “quiet”.

Starting with SQL 2012+(Enterprise Edition)  the operation is not “painful” any more. Adding column(s) to a busy table is now instantaneous and has less impact to the OLTP operations. In fact, the operation can be performed online.
This still requires taking into the consideration other factors that can impact the operation such as; data type of the new column, row length etc.

The task for today will be to add a new , non-null column of integer data type with the default value 46 to a busy table that has  1M rows 🙂

Test table

IF OBJECT_ID('dbo.addANonNullColOnline','U') IS NOT NULL
    DROP TABLE dbo.addANonNullColOnline
GO
--create a test table
SELECT TOP 1000000 id = IDENTITY(INTEGER,1,1)
                   ,val = CAST(c2.[name] +'AAAA' AS VARCHAR(4)) --fill all 4 bytes
                            COLLATE DATABASE_DEFAULT
    INTO dbo.addANonNullColOnline
FROM sys.columns c1
    CROSS JOIN sys.columns c2
        CROSS JOIN sys.columns c3
GO
--set one value to NULL
UPDATE dbo.addANonNullColOnline
    SET val = NULL
WHERE id = 2
id          val
----------- -----
1           bitp
2           NULL
3           colg
4           hbco
5           maxi
6           null
7           offs
...
...
10000000    runt

The table contains 1M rows. One row, id=2 has a NULL value for column “val”. The rest of the column values are taking the maximum available space of 4bytes.
NOTE: Varchar stores non-unicode characters e.g VARCHAR(4) =’AAAA’ takes 4bytes. (1 byte per non-unicode character).

The query below provides basic information about the table rows:

SELECT index_level
       ,page_count
       ,record_count
       ,min_record_size_in_bytes
       ,max_record_size_in_bytes
       ,avg_record_size_in_bytes
       ,SizeInMB = ROUND( ((record_count * avg_record_size_in_bytes) /1024.00 )/1024.00,0)
FROM sys.dm_db_index_physical_stats(db_id('test'),object_id('dbo.addANonNullColOnline','U'),null,null,'detailed')

The metadata information shows that the “shortest” row (id = 2) takes 11bytes of disk space, and the “longest” row(all other rows) takes 19bytes. The whole table(all rows) takes around 18MB of the disk space.
Before we add a new, not null column to the table, lets show how the rows are physically stored.

Rows physical storage

In this case, rows are stored using the FixedVar storage format. The name comes from the order in which Sql Server storage engine stores different data types – fixed data types(integer, decimal, char..) and then variable data types(nvarchar, varchar, varbinary..).The format has been used since Sql Server 7.0(Sphinx) released in 1998. (smile).
FixedVar format internally adds a number of additional bytes to every row on page. The extra space per row is used to provide information about rows itself like (if the row has been deleted/ghost ,the number of columns, columns with NULL values, position of the variable columns if any,  timestamp and pointer to the verison store if optimistic locking is enabled.. etc..).
For the fixed columns, storage engine will add:

6bytes + CEILING(TotalNoOfColumns / 8.00bytes)

..and for the variable part of the row

2bytes(if at least one of the varColumns has non-null value) + 2bytes * NoOfVariableNonNULLColumns.

In our example, the minimum row size(11bytes , row id=2) and the maximum row size( all other rows 19 bytes) will be..

--min_record_size_in_bytes
SELECT [FixedPart: 6 + CEILING(2/8.00)] = 6 + CEILING(2/8.00)
      ,[FixedColumnValue(INT) bytes] = 4
      ,'---'='---'
      ,[VariablePart: 0 + 2*0] = 0 +2* 0 --no non-null variableColumns
      ,[VariableColValue(VARCHAR(4)] = NULL
      ,TotalRowSize = 6 + CEILING(2/8.00) + 4 + 0
GO
--max_record_size_in_bytes
SELECT  [FixedPart: 6 + CEILING(2/8.00)] = 6 + CEILING(2/8.00)
       ,[FixedColumnValue(INT) bytes] = 4
       ,'---'='---'
       ,[VariablePart: 2 + 2*1] = 2 +2* 1
       ,[VariableColValue(VARCHAR(4) bytes] = 4
       ,TotalRowSize = 6 + CEILING(2/8.00) + 4 + 4 + 4
GO

The image below shows the structure of a FixedVar row (Pro Sql Server Internals by Dimitri Korotkevitch)

Now, down to the bits and bytes, lets find the first data page(out of 2598) that is allocated to the table and try to reconcile the min and a max row. The undocumented internal column/fn %%Lockres%% gives us FileId:PageId:SlotId for the selected rows.

SELECT TOP(5) f.*
              ,%%lockres%% as [FileId:PageId:SlotId]
FROM dbo.addANonNullColOnline f 
ORDER BY id

The query selects first few rows from the test table.

id          val  FileId:PageId:SlotId
----------- ---- --------------------------------
1           bitp 1:16576:0
2           NULL 1:16576:1
3           colg 1:16576:2
4           hbco 1:16576:3
5           maxi 1:16576:4

(5 row(s) affected)

With the PageID of the page we want to examine, the next step is to allow sql server to redirect the page content output to the screen.  To achieve this we can use DBCC(Database Console Command). The system change will affect only the current session.

--check the current state of the flag
DBCC TRACESTATUS(3604)
GO
DBCC TRACEON(3604,-1) --sets up traceflag 3604 value to true on a session level
GO
--displays the first page
DBCC PAGE (0,1,16576,3) --WITH TABLERESULTS
GO

The page snapshot below shows the first page header(first 98bytes of the total page size – 8192bytes) and the first couple of rows. The relevant information for this exercise is highlighted.

The page is read from the disk and put into the buffer pool. The relevant information on the page:

  1.  File No : Page number (1:16576)
  2.  m_lsn = (525:62471:2) – Log Sequence Number.
  3. The position of the first row.
    Slot(0) – the first row on the page
    Offset 0x60 – the physical position from the beginning of the page (Hex 0x60 = 96bytes)
    Length = 19* Happens to be the maximum record size. All rows exept row no 2 (id =2) are 19bytes.
  4. The first row -hexadecimal(byte swapped values)

      30000800 01000000 02000001 00130062 697470

       03 – TagA 1 byte
       00– TagB 1 byte
       0008 – Fsize bytes  – location of the end of the last fixed datatype column.
                       8bytes = 1b(TagA) +1b(TagB) + 2bytes(Fsize itself) + 4b(INT column)
      00000001 – The fixed data type column values. One int column with value 1.
      0002 – Total number of columns (2bytes)  =decimal 2
      00 – NullBitmap 1byte = CEILING(TotalNoOfColumns /  8.00bytes) = CEILING(1/8.00bytes)

     TOTAL Fixed part of the row: 1+1+2+4+2+1 = 11bytes

     00012bytes – No. of variable type columns = decimal 1 (If all of the variable data type columns are NULL, this will be 0bytes.)
     0013 2bytes – Every two bytes stores the location of the end of a variable,not NULL data type column. In this case there is only one variable column = decimal 19 – the var. column ends on 19th byte.
(2bytes * NoOfVariableNonNULLColumns)
     62 697470
 – 4 bytes – The actual variable column value ‘bitp’

SELECT CONVERT(VARBINARY(4), 'bitp')
Column 2 value
--------------
0x62697470

(1 row(s) affected)

    TOTAL Variable part of the row: 2+2+ = 8bytes

    Total row1 size = 11bytes(Fixed part) + 8bytes(variable part) = 19bytes

  1. Column “id” value, first row
  2. Column “val” value, first row
  3. The position of the second row.
    Slot(1) – second row on the page
    Offset 0x73 – the physical position from the beginning of the page (Hex 0x73 = 115bytes). This is header(96bytes) + first row(19bytes)
    Length = 11*  Happens to be the minimum record size. All othere rows are 19bytes.
  4. Second row -hexadecimal(byte swapped values)

      10000800 02000000 020002

      01 – TagA 1 byte
      00– TagB 1 byte
      0008 – Fsize bytes  – location of the end of the last fixed datatype column.
                      8bytes = 1b(TagA) +1b(TagB) + 2bytes(Fsize itself) + 4b(INT column)
      00000002 – The fixed data type column values. One int column with value 2.
      0002 – Total number of columns (2bytes)  =decimal 2
      02 – NullBitmap 1byte = CEILING(TotalNoOfColumns /  8.00bytes) = CEILING(1/8.00bytes)  02hex = 0010bin.  This indicates that the second column (second bit is on) in the row contains NULL value.

      TOTAL Fixed part of the row: 1+1+2+4+2+1 = 11bytes

      Since the only variable column contains NULL value, the variable part does not take any space.

      Total row2 size =11bytes(Fixed part) = 11bytes

    1. Column “id” value, second row
    2. Column “val” value, second row
    3. Slot 1 Column 2 Offset 0x0 Length 0 Length (physical) 0
      This shows that the position of the column 2 in the row 2 is 0 and the physical length of the column is also 0. This means that the value is not physically stored on the page. The NULL value during the materialisation of the column value comes from theNullBitmap byte 02

Add a new column

Finally, we will add a column to our “hot” table. The column will be a non-null integer with the default value of 46.

ALTER TABLE dbo.addANonNullColOnline
    ADD NewCol INT NOT NULL
        CONSTRAINT DF_NewCol DEFAULT(46)
GO
SELECT *
FROM addANonNullColOnline

The action will “add” a new column and 1M 46-es in a few milliseconds. We can confirm that the new column with the values is real(materialised).

id         val   NewCol
----------- ----- -----------
1           bitp 46
2           NULL 46
3           colg 46
4           hbco 46
5           maxi 46
6           null 46
7           offs 46
...
...
10000000    runt 46

How it works

After adding a new, integer data type column, we expect to see increase in row size( by 4 bytes). We also expect that Sql Server storage engine somehow managed to add 1M 4byte integer values to the table  in a fraction of second.

Let’s check rows metadata again.

SELECT index_level
       ,page_count
       ,record_count
       ,min_record_size_in_bytes
       ,max_record_size_in_bytes
       ,avg_record_size_in_bytes
       ,SizeInMB = ROUND( ((record_count * avg_record_size_in_bytes) /1024.00 )/1024.00,0)
FROM sys.dm_db_index_physical_stats(db_id('test'),object_id('dbo.addANonNullColOnline','U'),null,null,'detailed')

The query result shows that there were no changes in the row sizes.
So, what actually happened? Knowing that table rows are stored on data pages(RowStore – FixedVar storage format), we would expect increase in min and max row sizes. Using the formulas above…

min_record_size_in_bytes (row id=2)  = 6 + CEILING(2/8.00) + 4 + 4    +  0 + 2*0  =   6 + 1 + 8  =  15
max_record_size_in_bytes  = 6 + CEILING(2/8.00) + 4 + 4    +  2 + 2*1  + 4  =  6 + 1 + 8  + 4  + 4  =  23

The formulas are correct but the result is not 🙂

If we check the first page again, we’ll notice that
m_lsn (log sequential number) has not changed
– The new column (NewCol) is not stored on the page (offset from the beginning of the rows is 0bytes)

The unchanged m_lsn shows us that there were no changes on the page.

More about m_lsn

m_lsn – represents the last Log Sequential Number related to the last transaction that changed the page. The set of log records which describe the changes(old and new  version of the data, transaction, affected row etc.) starts with the same lsn. The log information allow SQL Server to recover databases to transnationally consistent state in case of system crash or any unexpected shut down.
SQL Server uses WAL(Write-Ahead-Log) model which guarantees that no data modifications will be written to disk before the associated log record is written to the disk.
Example: A transaction T1  updates a few columns in a row. The high level sequence of events will be:

  1. A Data access operator requests from the buffer pool manager(BPM) data page that contains rows to be updated .  BPM checks if the page already exists in the buffer pool. If not, BPM requests I/O operations to retrieve the page from the disk.
  2. The update operation(transaction T1) updates relevant row on the page in the buffer pool. The page header information m_lsn is updated. The operation creates a set of new log records(starting with the the same m_lsn) that describes changes made to the page. The log records are created in the log buffer.
  3. Transaction T1 commits the changes. Log buffer manager flushes the content of log buffer(~64kb) to the disk(log file).
  4. Client application receives confirmation that the transaction T1 has been successfully committed.

New column metadata

After creating a new, non-null column, instead of updating all 1M rows with some value, Sql Server 2012+(Enterprise Ed) stores the information as metadata.

SELECT TableName = object_name(p.object_id)
      ,ColumnType = type_name(pc.system_type_id)
      ,pc.max_length
      ,pc.[precision]
      ,pc.scale
      ,pc.is_nullable
      ,has_default
      ,default_value
FROM sys.system_internals_partitions p
    INNER JOIN sys.system_internals_partition_columns pc
        ON p.partition_id = pc.partition_id
WHERE p.object_id = object_id('dbo.addANonNullColOnline');

The two metadata columns hold the information about the added column. This is where 46 is materialised from, not from the page, but from the metadata.
This new behavior occurs automatically and does not impact the system.

default_value has_default

The defaut_value / has_default columns which belongs to sys.system_internals_partition_columns system view are not related to DF_NewCol Default Constraint defined during the table alteration.
The metadata that describes the default constraints can be found using the query below

SELECT parentTable = object_name(dc.parent_object_id)
      ,dc.name
      ,dc.type_desc
      ,dc.[definition]
FROM  sys.default_constraints dc
WHERE dc.name = 'DF_NewCol'

If we drop the constraint, has_default/ default_value will stay intact.

row UPDATE

An interesting thing will happen when we update a row. Let’s update row id=2 and replace the NULL value with something.

UPDATE dbo.addANonNullColOnline
    SET val = 'Hi 5'
WHERE id = 2;
id         val   NewCol
----------- ----- -----------
1           bitp 46
2           Hi 5 46
3           colg 46
4           hbco 46
5           maxi 46
6           null 46
7           offs 46
...
...
10000000    runt 46

…and now the max,min rowsizes are changed.

Whenever we update one or more values in a row, the default value (NewCol = 46) will be written on the data page. The rows that are not updated will still materialise value 46 from the metadata.

If you inspect the first page, you’ll find that the page has a new m_lsn and that all the column values of the row are written on the page
Before the change, the shortest row was row2 (id=2), size = 11b. Now, after the update, the row2 size increased to 23b. The NewColumn default value (46) now  became a physical part of the row.
Using the formulas mentioned before, the new min/max row sizes are:

max_record_size_in_bytes (row id=2)  = 6 + CEILING(2/8.00) + 4 + 4  +  2 + 2*1  + 4 =  {6 + 1 + 8 } + {4  + 4 }= 23
**4bytes for  the NewCol value of 64 and 4bytes for ‘Hi 5’  value that replaced NULL value

min_record_size_in_bytes (all rows expect row id=2)  = 6 + CEILING(2/8.00) + 4  +  2 + 2*1  + 4  =  6 + 1 + 4  + 4  + 4  =  19

Table row source(s)

The figure below shows that the test table materialises column values from two different sources – data pages and metadata.

We see that only the updated row (id=2) has all the columns physically stored on the data page.
E.g row 1 is physically stored in Slot 0, but the new column is not part of the row
Slot 0 Column 3 Offset 0x00 Length 4 Length (physical) 0
NewCol = 46

During the production life, all the table rows may be gradually updated.  The has_default default_value values(1 and 46 respectively) metadata remain but will not be in use.
Metadata will be removed if we REBUILD the table

ALTER TABLE dbo.addANonNullColOnline
    REBUILD;

Restrictions

There are a few data types that cannot be added online.

  • Varchar(max)
  • NVarchar(max)
  • Varbinary(max)
  • XML
  • hierarchy_id
  • geometry
  • geography

Default expressions that are not runtime constant (expressions that require a different value for each row) cannot be added online. e.g NewId(),NewSequentialId() ..

Row-overflow row storage format is not supported. New columns must not increase  the maximum row size over 8060 bytes limit.

Conclusion

Starting from Sql Server 2012 (Enterprise Ed) it is possible to add, ONLINE, non null columns with values to a busy table. Sql Server storage engine uses metadata to store the default values for the added columns. When requested, the table columns are materialised from data pages and metadata. There are restrictions on the data types that may be used. e.g  columns with BLOB datatypes cannot be added online.. etc. Only the runtime constant expressions can be used as a default constraints.

Thank you for reading

Dean Mincic

ORDER BY – Changes everything

SQL programming has a few unique aspects such as Three value logic(3VL), set-based programming logic, and logical processing order. Divination from the programming patterns based on the unique nature of the language usually leads to poor-performing SQL code.

RDBMS fundamentals

SQL Server as an RDBMS(Relational Database Management System), and its SQL language dialect/variant TSQL(Transact-Structured Query Language) has a strong foundation in mathematics. TSQL as SQL Servers language for data management and manipulation is based on two cornerstones of mathematics – Set theory and Predicate logic.

Edgar F. Codd was a computer scientist who invented the relational model for database management systems. He used mathematical theories (Set theory and predicate logic) to create a general data management theory.
IBM’s System R is the very first database system built on Codd’s relational model. In order to “force” DB vendors to follow the relational principles when designing DB systems, Codd published his 12 rules which describe a relational database.
Various database vendors(Oracle, Sybase, SQL server..) implemented the Relational principles similarly but differently. This means that the DB systems “slightly” deviate from Codd’s relational theory, and the theory itself “slightly” deviates* from the original set theory and predicate logic – but that’s ok 🙂

*e.g One of the differences between predicate logic and relational algebra is that the first support the two-valued logic (True/False). The relational model supports three-way logic(True/false and Unknown)

Although tsql querying design patterns are derived from Relational algebra, Tsql does not strictly follow relational rules. One of the differences, that is also related to this article is that a Relation is not sorted → There is no relevance to the order of the elements in a set. However, Tsql allows us to sort(ORDER BY) the final result set (the one that is returned to the Client). We are allowed to replace the Relational result(as a “final product” of the relational operations in a Relational database based on relational theory…. ) with a non-relational result – CURSOR. That is why Order by “Changes everything” 🙂

The logical query processing sequence

To understand the ORDER BY operation, it is important to understand the logical query processing sequence. The concept is unique to the SQL world and defines the order of execution of different query segments.  It’s also known as Theoretical execution order because the sequence may be changed by SQL Srv optimizer due to many different reasons i.e data access patterns, query simplification, etc. The query sequence:

  1. FROM – The FROM phase identifies the query’s source tables. This phase processes CROSS JOIN, ON Filter, and in the case of OUTER JOINs it adds the non-matched rows back to the result set.
  2. WHERE Filter. Removes all those rows from the previous stage for which the predicate evaluates false.
  3. GROUP BY – Performs grouping and aggregation calculations. This phase generates one row per group.
  4. HAVING – This is the last of the three query filters. It filters the results of aggregate functions.
  5. SELECT – Processes the elements in the Select clause i.e Evaluates expressions within the Select statement and/or removes duplicates(DISTINCT) and/or filters the specified percentage/number of rows(TOP) defined by the ORDER BY clause. The phase returns TABLE.
  6. ORDER BY – Sorts the rows according to the list of columns specified and returns CURSOR.
  7. FOR XML/FOR JSON The phase converts tabular results from the SELECT statement into XML/JSON output.

For the queries that include SET operations, the logical sequence will be.

  1. Query 1
  2. SET operation (union, except, intersect) or multi-set Union All
  3. Query 2
  4. Order By

Order by and …

ORDER BY operation guarantees ordered result set – Cursor. TSQL deviates from ANSI standard and allows us to sort results using the columns and/or expressions based on columns that are not part of the SELECT statement(This is not true if the query uses the DISTINCT clause). ANSI standard allows sorting only by using the columns mentioned in the select statement. ORDER BY can also be used as a logical sorting operator when operating as a part of TOP/Offset functions.

Because ORDER BY converts query result-set into a CURSOR, it cannot be used to define table expressions;

  • Views
  • ITVF(Inline table-valued functions aka parameterized views)
  • Derived tables
  • Subqueries
  • CTE(common table expressions)

… unless used as a logical sorting with(top,offset,for XML/JSON)

The CURSOR must be returned to the client application that is able to “consume” cursor records, one at a time and in order.

ORDER BY treats NULL values as equal. NULLs are sorted FIRST in ASC order.

SET operations (Union, Except, Intersect) and multi-set Union All

When using SET-based operations (UNION, EXCEPT, INTERSECT) and/or multi-set operation UNION ALL  we cannot use the ORDER BY operator in the individual statements. The following query will fail the parsing phase.

DECLARE @t1 TABLE (id INT, val VARCHAR(10));
DECLARE @t2 TABLE (id1 INT, val1 VARCHAR(10));
INSERT INTO @t1
    VALUES (1,'A'),(1,'B'),(2,'S')
INSERT INTO @t2
   VALUES (1,'A'),(1,'B1'),(2,'S'),(3,'M')

SELECT * FROM @t1 ORDER BY val DESC
INTERSECT 
SELECT * FROM @t2

Msg 156, Level 15, State 1, Line 24
Incorrect syntax near the keyword ‘INTERSECT’

SQL Server’s SET operations correspond to operators defined in mathematical set theory. Therefore, the operands must be sets – represented as Tables (Set → Relation → Table)
ORDER BY changes query output to something that ANSI SQL defined as CURSOR. Set operations do not support ordered structures.

We can use the ORDER BY clause to format the FINAL result-set for presentation purposes. The ORDER BY in the example below logically does not belong to the last query. It operates on (Set1 U Set2)

SELECT * FROM @t1 --Set1
INTERSECT --U
SELECT * FROM @t2 --Set2
ORDER BY val DESC

TOP/OFFSET-FETCH

The TOP option is unique to tsql and does not follow ANSI guidelines. The directive simply allows us to select a number of rows or a percentage of rows to return. The option is supposed to be paired up with ORDER BY – always.
This time ORDER BY does not serve a presentation purpose. This time it serves as a logical “TOP sorting” operator. It is supposed to answer the question “TOP according to what order?”. The result of a TOP/ORDER BY operation is a table – not a cursor.
Unfortunately, tsql allows us to use TOP without ORDER BY, and in that case, the general query result will be random and non-deterministic TOP(n) rows / TOP(n) PERCENT result set.

It is a good practice to specify ORDER BY(SELECT NULL) if our intention is to select random and/or non-deterministic TOP(n) rows. Query Optimiser will remove the logical ordering during the simplification phase. The purpose of this is to let other developers know that this was done intentionally.

Test data:

IF OBJECT_ID('tempdb.dbo.#testTopOrderBy','U') IS NOT NULL
DROP TABLE #testTopOrderBy
GO
CREATE TABLE #testTopOrderBy(CustomerId INT UNIQUE NOT NULL
,OrderId INT
,OrderDate DATE)
GO
INSERT INTO #testTopOrderBy
   SELECT t.*
   FROM (VALUES (100,1001,'20170101')
                ,(106,1005,'20170210')
                ,(105,1007,'20170101')
                 
                ,(103,1008,'20170410')
                ,(104,1011,'20170511')
                ,(102,1022,'20170101')
                ,(101,1223,'20170210') ) t(custId,OrdId,odate)
CustomerId  OrderId     OrderDate
----------- ----------- ----------
100         1001        2017-01-01
106         1005        2017-02-10
105         1007        2017-01-01
103         1008        2017-04-10
104         1011        2017-05-11
102         1022        2017-01-01
101         1223        2017-02-10

(7 rows affected)

Test 1: The query returns two randomly chosen rows(TOP without ORDER BY)

SELECT TOP(2) *
FROM #testTopOrderBy
ORDER BY (SELECT NULL) – include this line to let other developers know about your intentions
CustomerId  OrderId     OrderDate
----------- ----------- ----------
100         1001        2017-01-01
106         1005        2017-02-10

Test 2: The query returns a non-deterministic result – “non-unique” ORDER BY is used aka. non-deterministic Order By.

SELECT TOP(2) *
FROM #testTopOrderBy
ORDER BY OrderDate
CustomerId  OrderId     OrderDate
----------- ----------- ----------
100         1001        2017-01-01
105         1007        2017-01-01
(2 rows affected)
--row not included
102         1022        2017-01-01

The query “decided” to select Customers 100 and 105 but not 102. In this case, if we had a clustered index on customerId, the query would probably select customers 100 and 102.

Test 3: Two ways to make the “TOP(n)” query return a deterministic result.

--Deterministic ORDER BY
SELECT TOP(2) *
FROM #testTopOrderBy
ORDER BY CustomerId,OrderDate
CustomerId OrderId OrderDate
----------- ----------- ----------
100 1001 2017-01-01
101 1223 2017-02-10

(1) The deterministic ORDER BY provides more specific information about which TOP(2) order dates to select. Now the query knows exactly which TOP(2) rows to select. a combination of Order By columns that provides a deterministic result-set is called Tiebreaker.

--TOP with TIES
SELECT TOP(2) WITH TIES *
FROM #testTopOrderBy
ORDER BY OrderDate
CustomerId OrderId OrderDate
----------- ----------- ----------
100 1001 2017-01-01
105 1007 2017-01-01
102 1022 2017-01-01

(2) The “WITH TIES” guarantees deterministic results based on Order by column(s). WITH TIES must have ORDER BY defined. In the case of non-deterministic Order By, WITH TIES will add all relevant rows to the result set in order to provide determinism. In the example above it added an extra row.

OFFSET-FETCH (tsql 2012+)

OFSET-FETCH is an ANSI SQL-supported feature. Basically, the function is an advanced version of the TOP function. The option allows us to select(fetch) only a window or a page of results from the result set.
Opposed to the similar, TOP function, OFFSET–FETCH can not operate without Order By. On the other hand, OFFSET-FETCH does not support WITH TIES therefore it cannot guarantee a deterministic result set.
The following query fails because offset-fetch always operates with ORDER BY:

SELECT *
FROM #testTopOrderBy
OFFSET 0 ROWS FETCH FIRST 2 ROWS ONLY

Msg 102, Level 15, State 1, Line 25 Incorrect syntax near ‘0’. Msg 153, Level 15, State 2, Line 25 Invalid usage of the option FIRST in the FETCH statement.

If we want to select random 2 rows from a table, we can use

SELECT *
FROM #testTopOrderBy
ORDER BY (SELECT NULL)
OFFSET 0 ROWS FETCH FIRST 2 ROWS ONLY

The query below will return the same result as the query from Test 3.

SELECT *
FROM #testTopOrderBy
ORDER BY CustomerId,OrderDate
OFFSET 0 ROWS FETCH FIRST 2 ROWS ONLY
CustomerId OrderId OrderDate
----------- ----------- ----------
100 1001 2017-01-01
101 1223 2017-02-10
(2 rows affected)

Same as TOP, OFFSET-FETCH returns a table therefore it can be used in other table expressions, e.g., (the query does not make much sense – it’s only for testing purposes …:)

SELECT a.*
FROM #testTopOrderBy a
INNER JOIN (
            SELECT CustomerId
            FROM #testTopOrderBy
            ORDER BY CustomerId,OrderDate
            OFFSET 0 ROWS FETCH FIRST 2 ROWS ONLY
           ) x --the result is a table expression
ON a.CustomerId = x.CustomerId

..and this will fail. Table operators (JOIN and non-ANSI APPLY, PIVOT, and UNPIVOT) operate only between tables(table expressions).

SELECT a.*
FROM #testTopOrderBy a
INNER JOIN (
            SELECT CustomerId
            FROM #testTopOrderBy
            ORDER BY CustomerId,OrderDate
           ) x --the result is a cursor
ON a.CustomerId = x.CustomerId

Msg 1033, Level 15, State 1, Line 38
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions unless TOP, OFFSET, or FOR XML is also specified.

VIEWS

A view is a named virtual table that is defined by a query and used as a table
A view represents a table, and a table is a logical entity that has no order to its rows. This is the reason why it’s not allowed to use presentation ORDER BY in a view. VIEWS ARE NOT ORDERED!
The view definition below will fail

Test 4. Views are not ordered

CREATE VIEW dbo.TestViews
AS
SELECT v.CustId
      ,v.OrderId
FROM (VALUES(1,1000),(2,10001),(2,2003)) v(CustId,OrderId)
ORDER BY CustId

Msg 1033, Level 15, State 1, Procedure TestViews, Line 7 [Batch Start Line 1]
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions unless TOP, OFFSET, or FOR XML is also specified.

Now, some programmers might think that they somehow managed to override RDBMS fundamentals by defining a view with a TOP clause specifying all 100% rows and ORDER BY.

Test 5, Views with TOP 100% ordered rows

CREATE VIEW testViews
AS
    SELECT TOP(100) PERCENT
            [name]
           ,[object_id]
           ,create_date
    FROM sys.objects
    ORDER BY create_date DESC

This will pass the parsing stage and will be stored as a valid view definition. In reality, the view definition does not make much sense(see the TOP/ORDER BY example above). During the simplification phase, the query optimizer will remove the top 100 %/ORDER BY as an unnecessary logic.

SELECT * FROM  testViews
name                     object_id   create_date
------------------------ ----------- -----------------------
sysrscols                3           2014-02-20 20:48:35.270
sysrowsets               5           2009-04-13 12:59:11.093
sysclones                6           2014-02-20 20:48:35.477
sysallocunits            7           2009-04-13 12:59:11.077
sysfiles1                8           2003-04-08 09:13:38.093
sysseobjvalues           9           2014-02-20 20:48:35.610
syspriorities            17          2014-02-20 20:48:35.367
sysdbfrag                18          2014-02-20 20:48:35.563
sysfgfrag                19          2014-02-20 20:48:35.243
... etc...

The result is not ordered by create_date DESC
In some cases i.e when we use TOP and ORDER by the clustered index key columns, the result may be appearing to be sorted – QO may decide to use a sorted table scan. Again, there is no guarantee that the returned rows will be ordered.
e.g. During the final SELECT phase, QO may decide to use SQL Server’s storage engine advanced scan functionality also known as merry-go-round scan or scan piggybacking to return the same rows to two or more concurrent sessions. Without diving into the details of how the advanced scan works(will cover that interesting optimization mechanism in one of the next posts), the different sessions will have the same result sets but in a different order.

The presentation ORDER BY will convert the un-ordered view result into an ordered CURSOR.

SELECT *
FROM testViews
ORDER BY create_date DESC

Window Functions

Windows functions are introduced in the ANSI SQL-2003 standard. Microsoft partially implemented  the functionality in tSql 2005 and expanded to a full range of window functions in tSql 2012,
A Window function operates on a number of rows(window) in a set(table).  Window functions use ORDER BY as a logical operator which defines the behavior of the aggregate operations performed by the functions. In This scenario, ORDER BY, as a part of the functions does not perform the final sort.

It is important to understand where the Window function sits in the logical query execution sequence.

(5) SELECT
(5.1) Evaluate expressions in the SELECT statement
(5.2) WINDOWS Fn can be executed from here…
(5.3) DISTINCT
(5.4) TOP
(5.5) ….to here

(6) ORDER BY (presentation)

The example below demonstrates how window functions are positioned in the logical query exec. sequence.

Test Data.

IF OBJECT_ID('tempdb.dbo.#testWindowFnLogicalSeq','U') IS NOT NULL
    DROP TABLE #testWindowFnLogicalSeq
GO
CREATE TABLE #testWindowFnLogicalSeq( 
        OrderPackDesc VARCHAR(50) 
        ,OrderValue INT)
GO
INSERT INTO #testWindowFnLogicalSeq
    SELECT *
    FROM (VALUES ('Media Pack',100)
                ,('Media Pack',90)
                ,(NULL,10)
                ,('MS Surfice accessories',200)
                ,('MS Surfice accessories',50)
        ) tab(ordrPck,val)
OrderPackDesc                                      OrderValue
-------------------------------------------------- -----------
Media Pack                                         100
Media Pack                                         90
NULL                                               10
MS Surfice accessories                             200
MS Surfice accessories                             50

(5 rows affected)

The query below sums the OrderValues per OrderPackDescription. The OrderPackDesc = NULL will be summarised as a distinct value resulting OrderValue=10. ISNULL(OrderPackDesc,’Media Pack’) evaluates the NULL value as “Media Pack”

SELECT DISTINCT OrderPackDescEXPR = ISNULL(OrderPackDesc,'Media Pack')
               ,OrdVal = SUM(OrderValue) OVER(PARTITION BY OrderPackDesc
                                              ORDER BY( SELECT NULL))
FROM #testWindowFnLogicalSeq
OrderPackDescEXPR                                  OrdVal
-------------------------------------------------- -----------
Media Pack                                         10
Media Pack                                         190
MS Surfice accessories                             250

(3 rows affected)

This could lead to a logical error. We want to find out the sums of order values per order pack. We also want to treat the missing OrderPacks as ‘Media Pack’. The previous query sums the “missing order packs” separately. This is a typical logical error. The error would be more difficult to find if we had OrderValue = 190 for the unknown orderPackDesc. In that case, we will have a total of 190 instead of 380 for the ‘Media Pack’.
Taking into consideration the query execution sequence we can modify our query as:

SELECT DISTINCT  OrderPackDescEXPR = ISNULL(OrderPackDesc,'Media Pack')
                ,OrdVal = SUM(OrderValue) OVER(PARTITION BY ISNULL(OrderPackDesc,'Media Pack') ORDER BY( SELECT NULL))
FROM #testWindowFnLogicalSeq
OrderPackDescEXPR                                  OrdVal
-------------------------------------------------- -----------
Media Pack                                         200
MS Surfice accessories                             250

(2 rows affected)

…OVER(PARTITION BY ( ISNULL(OrderPackDesc,’Media Pack’) … is evaluated BEFORE the WINDOW fn.
Now our window fn, partition knows how to treat the missing values before the aggregation part and the execution sequence works in your favor 🙂
The example also shows that DISTINCT executes after WINDOW functions.
NOTE: It is not possible to use alias OrderPackDescEXPR in other expressions within the same query, like  … OVER(Partition by OrderPackDescEXPR ..) The limitation is another unique aspect of SQL language – All-at-once operation.
E.g in most programming languages, to swap values between variables we use a third, temp var. A→ temp, B→A  , temp→ B . In SQL we can swap table column values without using the temp var storage..

DECLARE @test TABLE(Age1 tinyint, Age2 tinyint)
    INSERT INTO @test
        VALUES (65,25)
SELECT * FROM @test
    UPDATE @test
        SET Age1 = Age2
           ,Age2 = Age1
SELECT * FROM @test

Window aggregate functions

Starting from SQL 2012 the ORDER BY clause can be used along with the window aggregate functions. ORDER BY dictates how the aggregate functions operate within the window partitions. It is important to understand the ORDER BY in this content to be able to avoid logical errors.

Create some test data

IF OBJECT_ID('tempdb.dbo.#testWindowFnLogicalSeq','U') IS NOT NULL
    DROP TABLE #testWindowFnLogicalSeq
GO
CREATE TABLE #testWindowFnLogicalSeq(
     CustomerId INT
    ,OrderId INT
    ,OrderValue INT)
GO
;WITH createNumbers as
(
    SELECT TOP 15 ROW_NUMBER()OVER(ORDER BY(SELECT NULL)) + 1000 rn
    FROM sys.columns c1
    CROSS JOIN sys.columns c2
    ORDER BY (SELECT NULL)
)
INSERT INTO #testWindowFnLogicalSeq
    SELECT CustomerId = (rn % 3)+ 1000 -- 3 customers
          ,OrderId = rn + 1000 --some unique orders
          ,OrderValue = ABS(CAST(CHECKSUM(NewId()) AS BIGINT)) % 1500 -- "random" numbers between 0 - 1499
FROM createNumbers
GO
SELECT * FROM #testWindowFnLogicalSeq ORDER BY CustomerId,OrderId
CustomerId  OrderId     OrderValue
----------- ----------- -----------
1000        2002        870
1000        2005        169
1000        2008        859
1000        2011        516
1000        2014        11
1001        2003        692
1001        2006        683
1001        2009        986
1001        2012        403
1001        2015        71
1002        2001        419
1002        2004        213
1002        2007        609
1002        2010        289
1002        2013        38

(15 row(s) affected)
** The OrderValues column contain random values.

The following query explains how ORDER BY affects SUM – window aggregate function

SELECT CustomerId
       ,OrderId
       ,OrderValue
       ,SUM(OrderValue) OVER() AS [GrandTotal OrderValue sum]
       ,SUM(OrderValue) OVER(PARTITION BY CustomerId) [Total OrderValue per cust.]
       --cumulative sums
       ,SUM(OrderValue) OVER(ORDER BY CustomerId) [Cumulative OrderValue per cust.]
       ,SUM(OrderValue) OVER(PARTITION BY CustomerId
                             ORDER BY OrderId) [Cumulative OrderValue per cust. & ord.]
FROM #testWindowFnLogicalSeq
ORDER BY CustomerID,OrderId --presentaiton purpose
CustomerId  OrderId     OrderValue  GrandTotal OrderValue sum Total OrderValue per cust. Cumulative OrderValue per cust. Cumulative OrderValue per cust. & ord.
----------- ----------- ----------- ------------------------- -------------------------- ------------------------------- --------------------------------------
1000        2002        870         6828                      2425                       2425                            870
1000        2005        169         6828                      2425                       2425                            1039
1000        2008        859         6828                      2425                       2425                            1898
1000        2011        516         6828                      2425                       2425                            2414
1000        2014        11          6828                      2425                       2425                            2425
1001        2003        692         6828                      2835                       5260                            692
1001        2006        683         6828                      2835                       5260                            1375
1001        2009        986         6828                      2835                       5260                            2361
1001        2012        403         6828                      2835                       5260                            2764
1001        2015        71          6828                      2835                       5260                            2835
1002        2001        419         6828                      1568                       6828                            419
1002        2004        213         6828                      1568                       6828                            632
1002        2007        609         6828                      1568                       6828                            1241
1002        2010        289         6828                      1568                       6828                            1530
1002        2013        38          6828                      1568                       6828                            1568

ORDER BY activates cumulative aggregations on the ordered column over a defined partition. The same applies to other aggregate functions MAX, MIN, AVG, and COUNT.
Failing to understand this behavior can lead to logical error 🙂

CONCLUSION

The main purpose of ORDER BY is to force a query to return an ordered result – CURSOR. This changes the relational nature of SQL language – query result sets are NEVER ordered. The presentation ORDER BY is an expensive operation (it needs to collect the entire result-set and put it in the requested order). The same syntax (ORDER BY) is used as a logical operator for TOP/OFFSET FETCH/WINDOW Functions/FOR XML formatter. In those cases, the purpose of the operator is completely different and DOES NOT guarantee the ordered output.

Thank you for reading.

Semi Joins, anti-joins and Nulls in Sql Server


Summary

SQL Joins are table operators(binary operations in Relational Algebra) used to combine columns from one or more tables. The expression(predicate) that defines the columns which are used to join the tables is called Join Predicate. The result of a join is a set (relational database implementation of a set).
ANSI standard recognizes five types of joins: Inner Join, Left Outer Join, Right Outer Join, Full Outer Join, and Cross Join.
Joins are typically used to retrieve data from the normalized tables in a relation, e.g. one-to-many, zero-one-to-many, etc., usually with an equality predicate between primary and foreign key columns.
One of the most complex tasks for the Query Optimiser is “join ordering” i.e. finding the optimal join sequence when constructing the execution plan(a query requesting data from n tables requires n-1 joins)

Semi-join is one of a few operators in relational algebra that does not have representation in Tsql language. Some of the  “missing” operators are:

  • Semi join
  • Anti-join (anti-semi-join)
  • Natural join
  • Division

Semi-join is a type of join whose result set contains only the columns from one of the “semi-joined” tables. Each row from the first table(left table if Left Semi Join) will be returned a maximum of once if matched in the second table. The duplicate rows from the first table will be returned if matched once in the second table. A distinct row from the first table will be returned no matter how many times matched in a second table.
Below is the pseudo-code representation of the above statement.

SemiJoinsPCode

Anti-semi-join will do the exact opposite. The join will select any rows from the first table that do not have at least one matching row in the second table.

SQL Server engine has three physical(showplan) operators that can be used to perform semi-join/anti-semi-join logical operations when recognized by the  Query Optimiser.

The table below maps the physical operators and the semi-join algorithms that they support.
PhysicalOpSemiJoins
*The full list of the SQL Server showplan operators can be found here.

There are a number of scenarios when Query Optimiser decides to implement a semi-join algorithm to optimize query requests. Typically, the logical operations that represent semi-joins are: IN, NOT IN, EXISTS, NOT EXISTS. The EXCEPT and INTERSECT set operators may use the same physical, Semi Join operators to perform different logical operations.

The following examples illustrate a few of these scenarios.

Set up the test environment:

CREATE DATABASE testSemiJoins
GO
USE testSemiJoins
GO

IF EXISTS (SELECT 1 FROM sys.tables t WHERE name = 'tab1')
    DROP TABLE dbo.tab1;
GO
CREATE TABLE tab1  (ProductTypeId INT,ProductName VARCHAR(100))
GO

IF EXISTS (SELECT 1 FROM sys.tables t WHERE name = 'tab2')
    DROP TABLE dbo.tab2;
GO
CREATE TABLE tab2  (ProductId INT,Name VARCHAR(100),StorageSizeGB SMALLINT)
GO
--insert some rows
INSERT INTO tab1
    SELECT tab.id,tab.name
    FROM (VALUES (1,'Hard disk')
                ,(1,'Hard disk')
                ,(2,'SSD disk')
                ,(3,'Flash Memory')
                ,(4,'EPROM')
                ,(5,NULL)
                ,(6,'Floppy Disk')
                ,(7,'RAM Memory')
                ,(7,'RAM Memory')
                ) tab(id,name)
GO
INSERT INTO tab2
    SELECT tab.id,tab.name,tab.sizeGb
    FROM (VALUES (1,'Hard disk',250)
                ,(1,'SSD disk',120)
                ,(2,'SSD disk',240)
                ,(3,'Flash Memory',8)
                ,(4,NULL,NULL)
                ,(5,'EPROM',0)
                ) tab(id,name,sizeGb)
GO
Tab1AndTab2

Left Semi Join

The Left Semi Join operator returns each row from the first (top table in the execution plan) input where there is at least one matching row in the second(bottom) input.

SELECT * 
FROM tab1 t1
WHERE t1.ProductName IN (SELECT t2.Name FROM dbo.tab2 t2)
--or using a correlated query (same execution plan)
SELECT * 
FROM tab1 t1
WHERE t1.ProductName IN (SELECT t2.Name FROM dbo.tab2 t2 where t2.Name = t1.ProductName)
LeftSemiJoin

Left Anti Semi Join

The Left Anti Semi Join operator returns each row from the first (top) input when there are no matching rows in the second (bottom) input.

SELECT ProductName FROM tab1
EXCEPT
SELECT Name FROM tab2
--or
SELECT * 
FROM tab1 t1
WHERE NOT EXISTS (SELECT t2.Name FROM dbo.tab2 t2 WHERE t1.ProductName = t2.Name)
--WHERE t1.ProductName NOT IN (SELECT t2.Name FROM dbo.tab2 t2)

LeftAntiSemiJoin

Both of the queries use the same physical operator – Loop Join( Left Anti Semi Join) to perform different logical operations. The second query performs a logical Left Anti Semi Join whereas the first query performs an operation based on the Difference of Sets  (Set A – SetB) operation.
Set operators EXCEPT, INTERSECT, and UNION treat NULL values as equal(non-distinct) whereas operator EXISTS evaluates NULL=NULL as FALSE(even if the 3VL result is UNKNOWN). More about the behavior here.
Also, it is worth mentioning that all set operators (except the multi-set operator UNION ALL) remove duplicates. 

Right Semi Join

The Right Semi Join operator returns each row from the second (bottom) input when there is a matching row in the first (top) input.

SELECT * 
FROM tab1 t1
WHERE  EXISTS (SELECT t2.Name FROM dbo.tab2 t2 WHERE t1.ProductName = t2.Name)
OPTION(HASH JOIN)
--or based on the Intersection Of Sets operation
SELECT ProductName FROM tab1
INTERSECT
SELECT Name FROM tab2 
OPTION(HASH JOIN)
-- When building the hash table, the hash match join chooses the table with fewer rows. 
-- In this example, that is the tab2 table.

RightSemiJoin

NOTE: The INTERSECT set operator treats NULL values as equal and therefore a NULL value appears in the result set(as being one of the  common values in tab1 and tab2) .

Right Anti Semi Join

The Right Anti Semi Join operator outputs each row from the second (bottom) input when a matching row in the first (top) input does not exist.

SELECT * 
FROM tab1 t1
WHERE NOT EXISTS (SELECT t2.Name FROM dbo.tab2 t2 WHERE t1.ProductName = t2.Name)
OPTION(MERGE JOIN)

RightAntiSemiJoin

Anti semi Joins, NOT IN, and NULLs

Again, things get “special” when it comes to working with NULLs.
If we execute the Left Anti Semi Join query again, but this time using NOT IN instead of EXISTS, we will get the empty result set. This is one of the common logical errors that can cause unexpected results.

--NOT IN (does not work)
SELECT * 
FROM tab1 t1
WHERE t1.ProductName NOT IN (SELECT t2.Name FROM dbo.tab2 t2)

--correlated NOT IN (works ok)
SELECT * 
FROM tab1 t1
WHERE t1.ProductName NOT IN (SELECT t2.Name FROM dbo.tab2 t2 WHERE t2.Name = t1.ProductName)

--hardcoded NOT IN (does not work)
SELECT * 
FROM tab1 t1
WHERE t1.ProductName NOT IN ('Hard disk','SSD disk','SSD disk','Flash Memory',NULL,'EPROM')

--NOT EXISTS (works ok)
SELECT * 
FROM tab1 t1
WHERE NOT EXISTS (SELECT t2.Name FROM dbo.tab2 t2 WHERE t1.ProductName = t2.Name)

The image below shows the query execution plans, the predicate property of the significant physical operators, and the result sets.

AntiJoinNot_IN_NULL1

Just as a reminder, the Nested Loop algorithm compares each row from the OUTER table (tab1 in the example) to each row from the INNER table (tab2). Only the rows that satisfy the join predicate will be returned.

The query result sets should contain all rows from the first (top) input when there are no matching rows in the second (bottom) input. The NULLs are treated as distinct. The results should be the same but there are not. Why?

The answer lies in the way the operator NOT IN was implemented.

Query1 (NOT IN)
From the Nested Loop’s Predicate property we can see that the operator uses a sequence of OR logical expressions to implement the request.

t1.ProductName IS NULL
OR t2.Name IS NULL
OR t2.ProductName = t2.Name

To make the tab1 table rows qualify for the output, the results of all the expressions in the predicate must evaluate to FALSE.

The expression t2.Name IS NULL will always evaluate to TRUE for every iteration, resulting in the empty result-set. This is because of a NULL value in the tab2.Name column.

This is important to know to be able to avoid possible logical errors. One way to prevent the behavior is to set NOT NULL column property on the tab2.Name column. The other way is to use  ISNULL() function in the query to prevent NULL expressions.

SELECT * 
FROM tab1 t1
WHERE t1.ProductName NOT IN (SELECT ISNULL(t2.Name,'') FROM dbo.tab2 t2)

--first replace all NULLs with known values, e.g '' and then add NOT NULL column property or a CHECK constraint
UPDATE tab2
    SET Name = ''
WHERE Name IS NULL
GO

ALTER TABLE tab2
  ALTER COLUMN Name VARCHAR(100) NOT NULL
GO
--or
ALTER TABLE tab2
    WITH CHECK 
    ADD CONSTRAINT CK_NameNotNull CHECK(Name IS NOT NULL)
GO

Query2 (“Correlated” NOT IN)
The query uses the same execution plan as NOT IN, except now the Nested Loop’s Predicate property adds an extra AND expression to the sequence of OR expressions.

t2.Name = t1.ProductName
AND t1.ProductName IS NULL
OR t2.Name IS NULL
OR t2.Name = t1.ProductName

The same logic applies as with the first query. The predicate(as a list of the logical expressions) must evaluate to FALSE for the tab1 rows to qualify for the output. As before, at least one of the expressions on the left side of the AND operator will always evaluate to TRUE (t2.Name IS NULL is TRUE for every iteration). However, the additional expression (the one on the right side of the AND operator) may evaluate to FALSE making the whole predicate FALSE and allowing the qualified rows to be returned.

The query returns the correct result.

Query3 (“Hard-coded” NOT IN)
From the lack of the Constant Scan Predicate property in the execution plan, we can conclude that the Query Optimiser has decided to return an empty result set knowing that at least one of the “hard-coded, NOT IN values” is NULL.
To get the predicate property and analyze its logic, we can run the same query, but this time without the NULL value in the NOT IN list  i. e.

SELECT * 
FROM tab1 t1
WHERE t1.ProductName NOT IN ('Hard disk','SSD disk','SSD disk','Flash Memory','EPROM')

The Constant Scan operator has the Predicate property now.

LeftAntiJoinNOTIN_HC

The predicate is constructed of n not-equal expressions (n is a number of distinct values in the NOT IN list, in this case 4) and n-1 AND operators.

t1.ProductName <>’EPROM’
AND t1.ProductName <>’Flash Memory’
AND t1.ProductName <>’Hard disk’
AND 1.ProductName <>’SSD disk’
— AND 1.ProductName <>NULL –if we had NULL here, the predicate would evaluate to UNKNOWN  for every row resulting in an empty result set.

There are a couple of interesting points here. The first one is the result set.

The query excludes t1.ProductName  = NULL from the result set. The reason for excluding the NULL  is because of the way “hard coded NOT IN” was implemented.
The NULL values from the left table will be excluded, whereas the NULL value(s) in the NOT IN list will cause an empty result set.
In this example, the algorithm excludes tab1.ProductName NULL value by evaluating predicate as UNKNOWN i.e

NULL <>’EPROM’ (1)
AND NULL <>’Flash Memory’   (2)
AND NULL<>’Hard disk’  (3)
AND NULL<>’SSD disk’  (4)

NULL (1) AND NULL (2) AND NULL (3) AND NULL (4)  is  NULL

It is good to know the differences in the results to prevent possible logical errors.

The other interesting thing is the number of expressions in the predicate that directly depend on the number of hard-coded literals. The question is: Is there a maximum number of literals that can be processed? The number is related to a maximum string length containing SQL Statements(batch size). The Maximum Capacity Specifications for SQL Server can be found here. Another reason may be running out of stack size(stack overflow) causing the errors 8623 and/or 8632

Query4(NOT EXISTS) 
The NOT EXIST operator works straightforwardly here. All rows from tabt1 that do not match the rows from tab2 (where the correlated predicate t1.ProductName = t2.Name evaluates to FALSE)  will be returned. NULLs are treated as distinct.
The query returns the correct result.

Conclusion

Semi-join is a join that has no representation in tsql. The result of the operation contains only columns from one of the tables. All returned rows from the first table must be matched at least once in the second table. The rows from the first table will be returned only once even if the second table contains more than one match. Semi-joins are usually implemented using IN or EXISTS operators. Any of the three physical operators that SQL Server uses to perform joins can be used to implement Semi Joins. The set operators EXCEPT and INTERSECT may use the same Semi Join physical operators to perform set operations.
Logical Anti semi join operation may be implemented using NOT IN or NOT EXISTS operators. The NOT IN operator may cause unexpected results when it operates on the tables that contain NULL values in the joined columns.

Thanks for reading.

Dean Mincic

 

 

 

Three valued logic – NULL in Sql Server


In SQL the logical expressions (predicates) can evaluate TRUE, FALSE and UNKNOWN. The third result is unique to “SQL world“ and is caused by NULL value.  This is called Three Valued Logic (3VL).   NULL represents the missing or UNKNOWN value.

The logic could be confusing because Boolean, two valued logic expressions can evaluate only to TRUE or FALSE – a comparison between two KNOWN values gives us TRUE or FALSE.

With NULL as a mark or a placeholder, we have an UNKNOWN value, so the question is how to compare a known with an unknown value or an unknown with an unknown.

The logical expressions with NULLs  can evaluate to NULL, TRUE and FALSE e.g.

SELECT IIF(1=NULL,'true','unknown')      AS  [1 = NULL]
      ,IIF(NULL=NULL,'true','unknown')   AS  [NULL = NULL]
      ,IIF(NULL+22<=33,'true','unknown') AS  [NULL*22 <= 33]

nullExpressionResult

The result of NULL negations can be confusing too

  • The opposite of True is False (NOT True = False)
  • The opposite of False is True (NOT False = True)
  • The opposite of UNKNOWN is UNKNOWN (NOT NULL = NULL)

Conjunctive(AND) predicates with Nulls evaluates:

  • True  AND NULL  => NULL
  • False AND NULL => False

and Disjunctive (OR) predicates with Null evaluates:

  • True OR NULL => True
  • False OR NULL => NULL
  • NULL OR NULL => NULL

The logic is confusing and it was source of an ongoing discussion about NULLS for a long time.
The important thing to note is to put the logic in the context of DB design and to understand how Sql Server treats NULLs in different scenarios. Knowing that we can  benefit from using the UNKNOWNS and avoid possible logical errors.

Sql Server is INCONSISTENT when evaluates the logical expressions with NULL values – The same expressions may produce different results TRUE, FALSE and sometimes UNKNOWN.
The following examples shows how SqlServer deals with NULLs in different elements of Tsql language.

 QUERY FILTERS (ON,WHERE,HAVING)

Create a couple of test tables:

CREATE TABLE #Products(ProductId INT IDENTITY(1,1)
                           PRIMARY KEY CLUSTERED
                       ,Name VARCHAR(500) NOT NULL
                       ,Price DECIMAL(8,2) NULL CHECK(Price > 0.00)           
                       ,ProductCategoryId INT NULL)
GO
CREATE TABLE #ProductCategories(ProductCategoryId INT
                                   UNIQUE CLUSTERED
                               ,Name VARCHAR(500) NOT NULL)
GO

..and add some data

INSERT INTO #Products(Name,Price, ProductCategoryId)
     SELECT tab.* 
     FROM (VALUES ('Lego',65.99,200)
                 ,('iPad',750.00,300)
                 ,('Gyroscope',NULL,NULL)
                 ,('Beer',3.55,NULL)
                 ,('PC',2500.00,300)) tab(name,price,catid)
GO
INSERT INTO #ProductCategories(ProductCategoryId,Name)
    SELECT tab.*
    FROM (VALUES (200,'KidsGames')
                ,(300,'Gadgets')
                ,(NULL,'Liquor')
                ,(500,'Camping')) tab(id,name)
GO

Test1: Select all the products over $500.00.
The product with the UNKNOWN(NULL) price will be excluded from the result-set as it was treated as FALSE. Only these rows for which the predicate evaluates to TRUE will be included in the final result-set.

SELECT * 
FROM #Products 
WHERE Price  > 500

Null1

We’ll get the similar result if we try to select all products with not unknown price like:

SELECT * 
FROM #Products 
WHERE Price <> NULL;

The result will be an empty set. Not NULL is still UNKNOWN and is treated as FALSE.

The UNKNOWN result is treated as FALSE when evaluated by ON filter.

SELECT p.*,pc.* 
FROM #Products p
  INNER JOIN #ProductCategories pc
    ON p.ProductCategoryId = pc.ProductCategoryId

As we can see from the result-set, the product “Gyroscope” is excluded from the result as NULL = NULL evaluated to UNKNOWN and was treated as FALSE.

Null2

CHECK Constraint

In case of CHECK constraint, SQL Server treats UNKNOWN result as TRUE.
The product “Gyroscope” has UNKNOWN price. The CHECK constraint on the price column allows only those values which evaluates to TRUE when compared with 0.  In this case Sql Server evaluated;

NULL >0 => True

opposed to the previous example when

.. NULL > 500 => False

To confirm the CHECK constraint behavior we can insert a new product into #Products table without getting a Check constraint error:

INSERT INTO #Products(Name,Price, ProductCategoryId)
    SELECT 'Microsoft Surface Pro',NULL,300

This can make sense if we put the NULL logic into the context of the DB design, e.g. All products must have price greater than $0 or NULL if the price hasn’t been decided yet.

Null3UNIQUE Constraint and Set operators

It was mentioned earlier that  NULL = NULL evaluates to UNKNOWN. This means that NULLs are distinct( no two NULLs are equal)

The UNIQUE constraint treats NULL as a distinct value in the context of all other values in the column i.e ProductCategoryId can have only one NULL value and the distinct integers. However, allowing only single NULL value(inserting another NULL value will throw Violation of UNIQUE KEY constraint error), the constraint treats all UNKNOWN values(NULLS) as equal or not distinct.

INSERT INTO #ProductCategories
        SELECT NULL, 'Musical Instruments'

UniqueConstraintError

The table #ProductCategories has an UNIQUE constraint (enforced by a clustered index) on ProductCategoryId column.

The SET operators UNION, UNION ALL* , EXCEPT,  INTERSECT

Sql Server’s  Set operators treat NULLs as non distinct, in other words  as equal.
e.g. Select the rows from tables @Tab1 and @Tab2 that are common for both tables.

DECLARE @Tab1 TABLE (ID INT, Val VARCHAR(5))
DECLARE @Tab2 TABLE (ID INT, Val VARCHAR(5))

INSERT INTO @Tab1(ID,Val)
    SELECT t1.id, t1.Val
    FROM (VALUES (1,NULL),(2,NULL),(3,NULL),(4,'Dean'),(5,'Peter'),(6,'Bill')) t1(id,Val)

INSERT INTO @Tab2(ID,Val)
    SELECT t2.id, t2.Val
    FROM (VALUES (1,NULL),(2,NULL),(3,NULL),(4,'Dean'),(5,'Bahar'),(6,'Nick')) t2(id,Val)

SELECT ID AS keyValue, Val AS Value FROM @Tab1
INTERSECT 
SELECT ID,Val FROM @Tab2

NullsIntersection

As we can see from the result-set, the SET operator treats NULL values as equal e.g  Element (1,NULL ) = Element (1,NULL) and therefore belongs to “the red zone”, whereas e.g Element(5,Bahar) <>(5,Peter) and therefore was left out from the result-set.

*UNION ALL is a multi set operator that combines two sets into one non distinct set (takes all elements from both sets)

 Aggregate Functions

All Aggregate functions except COUNT(*) ignore NULLs.

DECLARE @Tab3 TABLE (ID INT NULL)
INSERT INTO @Tab3
  SELECT t.* 
  FROM (VALUES(1),(2),(NULL),(NULL),(3) ) t(i)

  SELECT  AVG(ID)   AS [AVG(ID)]
         ,COUNT(ID) AS [COUNT(ID)]
         ,SUM(ID)   AS [SUM(ID)] 
         ,MAX(ID)   AS [MAX(ID)]
         ,MIN(ID)   AS [MIN(ID)]
         ,'***'AS [***],COUNT(*) AS [COUNT(*)]
          from @Tab3

NullAgg

Note: COUNT(col1) eliminate NULLs as the other aggregate functions. However, if we try something like: SELECT COUNT(NULL+1) the function returns 0 eliminating the result of the expression, which is NULL 🙂
Another interesting thing with this, is that SELECT COUNT(NULL) throws an error – Null does not have any data type (COUNT() always does implicit conversion to INT) and e.g. SELECT COUNT(CONVERT(varchar(max),NULL)) returns 0 – now COUNT() has a datatype to work with.

Cursors (Order by)

When a query requests sorting the result-set becomes cursor – sets are not ordered. Sql Server treats NULL values as distinct and sorts the unknowns first if the sort is Ascending (Descending sort order puts NULL values  last)

SELECT * 
FROM #Products
ORDER by ProductCategoryId

NullOrder

*Just a small digression. One way to show NULL values last when  sorting in Ascending direction is to use tsql’s ability to sort record-set by non-selected columns and/or expressions.

SELECT *  ---,[NullsLast]= IIF(ProductCategoryId IS NULL,1,0)
FROM #Products
ORDER BY  CASE 
            WHEN ProductCategoryId IS NULL THEN 1
            ELSE 0
          END 
          ,ProductCategoryId

The NULL expressions results can explain the “unexpected” results  when used in anti-semi joins, but that will be covered in a separate post.

Conclusion

Three valued logic is unique to DB wold. ANSI SQL Standard provides rules/guidelines about NULL values. However, different RDBMS may handle NULLs  differently in different situations. It is important to understand how Sql Server treats NULL values in different elements of tsql language. Using NULLs  in the context of DB design can make 3VL work for you.

Thanks for reading.

Dean Mincic

Short–Circuit evaluation in Sql Server


The Short-Circuit(SC) or The minimal evaluation is the way programing languages evaluates arguments involved in Boolean expressions.
SC means that the arguments are evaluated from left to right until the value of the expression is determined. The rest of the arguments (if any left) will not be evaluated.
For example, c# evaluates the arguments (if logical operators && and || are used) from left to right.The evaluation of the expression stops as soon as the outcome is determined.

..
if(object_A !=null && object_A.methodA()) == "some value"{
 //Perform some action ;
}

With SqlServer(and RDBMS systems in general) things are different when it comes to SC. In short, in Sql Server we can not relay on the minimal evaluation. Essentially, this is because of declarative nature of TSQL language. In TSQL we almost always describe only WHAT to do (e.g Give me the list of all orders processed on a particular day from the specific geographic area made by Customers who..etc) but not HOW to do it (Query optimiser-QO and other Sql Server engine components) will find the optimal way to fulfill your request.
Therefore, the order of evaluation of the query predicates may be sequential, from left to right, but it’s not guaranteed. QO may change the ordering and/or push the predicate evaluation down the query tree e.g to the data access operator (e.g index seek).
The following example shows a few case scenarios of how Sql Server evaluates predicates(logical expressions).

Create a test table:

USE tempdb
GO
IF EXISTS(SELECT 1 FROM tempdb.sys.objects 
          WHERE object_id=OBJECT_ID('TestShortCircuit'))
    DROP TABLE dbo.TestShortCircuit
GO
CREATE TABLE dbo.TestShortCircuit(
     CurrencyId      NCHAR(3) NOT NULL 
    ,Name            NVARCHAR(30) NOT NULL
    ,CONSTRAINT PK_CurrencyId PRIMARY KEY CLUSTERED(CurrencyId))
GO
INSERT INTO dbo.TestShortCircuit(CurrencyId,Name)
    SELECT tab.id, tab.name
    FROM (VALUES  (N'AUD',N'Australian Dollar')
                 ,(N'EUR',N'EURO')
                 ,(N'HA!',N'98765')
                 ,(N'USD',N'US Dollar')
                 ) tab(id,name)
GO

Test 1: Conjunctive predicates (multiple predicates separated by AND Logical operators) and predicate evaluation sequence.

SELECT * 
FROM dbo.TestShortCircuit
WHERE  Name = 98765 AND CurrencyId = N'HA!'
--WHERE  Name = 98765
/*Msg 245, Level 16, State 1, Line 28
Conversion failed when converting the nvarchar value 'Australian Dollar' to data type int.*/

The query executes with no error.

Test1Plan
Figure 1: Conjunctive predicates – second expression evaluated first

Switching the predicate places will have no effect to the QO’s reasoning.
If we ran the same query without the second expression the query would fail due to conversion error.

Successful query: Query optimiser knows that the second expression, CurrencyId = N’HA!’ includes Primary Key constraint on the column(enforced by an unique clustered index) and therefore can evaluate to true maximum once. QO decides to “push” second predicate (CurrencyId = N’HA!’e) down to the query tree to the Index Seek data access operator making it “Seek Predicate”.

If the Seek operator does not find clustered N’HA!’ the WHERE filter evaluates to false making the first expression redundant. The first expression is shown in the Predicate section of the execution plan only for consistency purpose.

On the other hand , if the operator finds the row, the first predicate will be evaluated (Name=98765). QO will then try to convert the column value(N’98765′) to INT and than to compare with the given argument(98765) . If the conversion succeed(like in the example) the query will execute with no errors. However, if the conversion fails, the query will return error

--this will return conversion error
SELECT * 
FROM dbo.TestShortCircuit 
WHERE  Name = 98765 AND CurrencyId = N'USD'

Test 1 shows that we cannot know which expression will be evaluated first(if evaluated at all) and we must not relay on the minimal evaluation.

Test 2: Disjunctive predicates (multiple predicates separated by OR Logical operators) and predicate evaluation sequence).

The query below will execute with no conversion errors.

SELECT * 
FROM dbo.TestShortCircuit 
WHERE N'A'= N'A' OR Name = 98765

The successful execution of the query may lead us to believe that Sql Server uses minimal evaluation, but it’s not the case.

If the predicates switch the places, the query will still run with no errors. This wouldn’t be possible if Name=98765 predicate evaluated first, due to conversion error. The answer lies, again, in the way QO handles the query. In this case, during the optimisation, QO simplifies the query by excluding all predicates knowing that A=A will always evaluate to true for all rows. It simply uses table scan operator to return all rows.

Test1PlanORFigure 2: Disjunctive predicates, query simplification and SC

Test2 again shows that we the sequence of predicate evaluation is controlled by QO and is not short-circuited.

Sidenote:  the query below will fail due to conversion error (the order of the predicates is irrelevant). The QO cannot use clustered index seek because of the OR operator.  Even if we had an index on Name column, the index wouldn’t be used because of the implicit conversion on the column.

SELECT * 
FROM dbo.TestShortCircuit 
WHERE CurrencyId = N'HA!' OR Name = 98765

Conclusion:

It is not recommended to relay on the minimal evaluation in Sql Server (I would say in any DB engine) because of declarative nature of SQL language. Query Optimiser,  one of Sql Server’s relational engine components, will transform/simplify the query and during the process may reorganise the “WHERE” predicates making the order of the logical expressions irrelevant.

Thanks for reading.

Dean Mincic