Category Archives: TSQL Programming

The Category contains topics related to the tsql language

GUID in Sql Server


In SQL Server, a GUID/UUID – Globally Unique Identifier/Universally Unique Identifier is a 16byte binary value represented as UNIQUIEIDENTIFIER data type. The idea is to generate and store values that are unique across different database servers and networks. There are several ways to create these values e.g using client code i.e System.Guid.NewGuid() method in .NET, NEWID() / NEWSEQUENTIALID() functions in Sql Server etc.
There are many SQL Server databases designed to use GUID as surrogate, primary key and/or clustered index key. This is to enforce entity(table) integrity and/or logical order of the rows. These, somehow common decisions in the industry may cause design and performance issues. This post explores several such scenarios and the possible reasoning behind them.

GUID structure

A GUID is a 16byte(128bit) unsigned integer. It’s a lot of combinations of digits,  2128 or 1038 to be precise. The available space is logically divided into five segments separated by hyphens. This is how the 16byte space is represented to us.
In Sql Sever, in addition to the uniqueidentifier, we can also use binary, unicode and non-unicode datatypes (variable or fixed length) to store GUID values – more on that in the following section.

Lets store a big, positive numeric value into a UNIQUEIDENTIFIER and in a BINARY data type of the same size(16bytes),  and see what happens.

Figure 1, A big numeric value stored as  binary and uniqueidentifier 

We can see that the number stored as UNIQUEIDENTIFIER (16byte binary) data type  is represented in hexadecimal form and formatted in 8 – 4 – 4 – 4 – 12 pattern. That is exactly 16bytes (4b-2b-2b-2b-6b) with each byte represented as a pair* of characters i.e the far right byte holds hexadecimal number 03. So, 16bytes are stored and 32+4 characters are displayed(including the hyphens).

*NOTE:  In SSMS, binary values are represented as hexadecimal numbers with the 0x prefix. In fact, many computer languages allow programmers to indicate that a value within a program is a hexadecimal number. In the representation above, each byte is represented as a pair of characters(hex numbers) i.e from Figure1, FD03 – two far right bytes, FD00 -> 64,768(decimal) , 0003 -> 3(decimal) and together they give 64,771(decimal)   – More on hex to dec conversion can be found here.

Interestingly, UNIQUEIDENTIFIER does not store information exactly the same as a raw, BINARY data type.
Figure1 shows that UNIQUEIDENTIFIER stores first 16 hexadecimal digits(the first 8 rightmost bytes) exactly the same as BINARY. The remaining three segments store bytes in reversed order.

The first 8bytes(the last two segments) are an 8 element byte array. In general,  array elements are stored in index order and that is why they match the row binary storage pattern. The remaining three segments contain re-shuffled byte ordering and that is a feature of the UNIQUEIDENTIFIER data type. On top of that, Microsoft OS  byte encoding (Endianness) pattern depends on the underlying CPU architecture i.e Intel processors implement the little-endian encoding. This deviates from the RFC standard, which imposes the use of the Big Endian byte ordering.

To make things more confusing, lets compare BINARY and UNIQUEIDENTIFIER byte footprints on the data page. For this experiment I’ll store the values from the Figure 1 in a table. Execute this code and observe the results.
Figure 2, UNIQUEIDENTIFIER values on a data page

It seems that the UNIQUEIDENTIFIER value is stored exactly the same as the BINARY value without byte rearrangements. However, the same value is presented with the reshuffled bytes.

RFC4122 is a standard that defines how to structure UUID. This is a set of rules on how to generate values, represented in the four segments (presented as a five segment string), which together form a globally unique number.
The functions like System.Guid.NewGuid() or NEWID() are based on these rules.

Ways to store GUIDs in Sql Server

UNIQUEIDENTIFIER is a data type designed to store GUID values. In practice however, programmers sometimes choose binary, unicode or non-unicode data types to store those values. The string data types require significantly more space. There are also concerns about sorting and casting.

  • BINARY(16) – 16bytes
  • CHAR(36)  – 36bytes
  • NCHAR(72) – 72bytes

The script below shows different storage space requirement for the same GUID value:

Figure 3, space required to store a GUID value

Figure 3 shows that if we decide to use e.g unicode data type to store GUID we’ll need 4.5x more space than if we use UNIQUEIDENTIFIER or BINARY data type.

BINARY data type conversions

According to Sql Server Data Type Conversion Chart (download here), UNIQUEIDENTIFIER value can be implicitly converted to string and binary datatypes.
In the next experiment, client code generates a GUID value and passes it to a stored procedure as a parameter of type string . The stored proc then inserts the guid value into a table as: string, binary and the uniqueidentifier.
The Python code can be found here and the TSQL script can be found here.

After running the code above and  SELECT * FROM dbo.TestGUIDstore;
Figure 4, string GUID to BINARY conversion 

The test shows a peculiar value in the uid_bin16 column. This is not the expected GUID value  stored as BINARY(16). So what is it?

Storing GUID values using explicit conversion from string to BINARY(16) results with the loss of the GUID values.

NOTE: Just a couple observations, not directly related to this post. The python script above connects to the DB using pyodbc library (odbc provider). All versions of the provider, by default, have autocommit connection string property set to False. This sets the IMPLICIT_TRANSACTIONS to ON. Try to exclude the property param from the conn. str. and see what happens. The second interesting thing is the way “execute” method executes the query(RPC call, parameterised batch request). More on the two topics can be found here(Data Providers and User Options) and here(Client Requests and Sql Events in SQL Server)

From the previous experiment:

…and if we try to convert the binary value back to GUID i.e this time to UNIQUEIDENTIFIER

The retrieved GUID is totally different than the one originally stored in the table.

The reason for this behavior lies in the explicit conversion from string  to BINARY(16).
What Python program sent, ‘D9DD9BA5-535C-46C3-888E-5961388C089E’, is a string representation of a GUID.  The format is  recognisable by UNIQUEIDENTIFIER data type. However, for BINARY data type, this is just a set of ASCII characters not a hexstring – a hexadecimal value represented as a string. Moreover, the ASCII set of characters requires 36bytes to be stored as a hexstring.

Just to illustrate the point(and because I found it interesting 🙂 ), the following script takes out each char from the GUID script, get its ascii code, convert the code into a hexadecimal value and put it back into its position. i.e char ‘D’  is ASCII 68 and hex 44.

Now we can compare the binary value stored in uid_bin16 column with the output from the script above:

Not only that the GUID is not stored correctly but, now we can see that the half of the input string got truncated (it simply needs more space then 16bytes, as mentioned above).

If you still want to store GUIDs as BINARY data type, one of the techniques is to remove hyphens and then convert the string to BINARY(16).
Note: un-comment --,uid_bin16 = CONVERT(BINARY(16),REPLACE(@guid ,'-',''),2)  from the table definition code and run the Python script again. Inspect the stored values.
The following script demonstrates the same conversion approach.

The third parameter – Style, and in this context, defines how CONVERT function treats the input string. More information here. In this case, Style = 2, instructs the function to treat the input string as a hexstring with no 0x suffix. This is why we get the correct conversion.
Keep in mind that if you need to pull the binary information from db and pass it to the client as GUIDs, the following conversion to UNIQUEIDENTIFIER will result in a similar but different GUID, as explained before.

Of course, you can always “STUFF” a few hyphens into string representation of the BINARY to get  GUID string shape.

The safest way to store a GUID as BINARY and to be able to retrieve the binary value, unchanged, as a UNIQUEIDENTIFIER is to first convert the input string to UNIQUEIDENTIFIER and then to BINARY(16) before storing it in db. To retrieve GUID we just need to convert the BINARY(16) back to UNIQUEIDENTIFIER and the GUID will be unchanged.
Uncomment --,uid_bin16 = CONVERT(UNIQUEIDENTIFIER,@guid) in the code and repeat the test above. This time, the conversion back to the GUID is correct.

Finally, it’s worth to familiarise with the nuances when comparing and sorting guids/uniqueidentifiers using i.e System.Guid.CompareTo() vs  SqlTypes.SqlGuid.CompareTo()  methods. This is well explained here.

How to generate GUID in Sql Server

In Sql Server 7 Microsoft expanded replication services capabilities with the Merge replication. Replication in general, provides loosely consistent data that adds more flexibility around network availability. This is opposed to distributed transactions which use the two phase commit protocol that guarantees data consistency but potentially keeps system locked for a long time i.e case of failed and/or in-doubt transactions. Merge replication allows both, publisher and subscriber(s) to independently modify published articles i.e tables. System synchronises the changes between the participants. To uniquely and globally identify rows across the published articles, Microsoft implemented* a new datatype – UNIQUEIDENTIFIER along with a new column property – ROWGUID and a new function for generating random guids – NEWID().

Currently, SQL Server offers two system functions for generating GUIDs

  • NEWID()
  • NEWSEQUENTIALID() -available since SQL Server 2005

Both functions are based on Windows functions, UuidCreate() and UuidCreateSequential()  respectively.

*NOTE: Merge replication is not the only reason why Microsoft implemented UNIQUEIDENTIFIER and the support for GUIDs. The ability to manage globally unique values has become an important way of identifying data, objects, software applications, and applets in distributed systems (based on [Inside Sql Server 7.0, Microsoft Press,1999])


NEWID() is a system function that generates a random, globally unique GUID value of type UNIQUEIDENTIFIER. It’s based on UuidCreate() Windows OS  function. NEWID() is compliant with the RFC4122 standard.

NEWID() is not a foldable function therefore it’s executed separately for each inserted row. This is opposed to the Runtime constant scalar functions  i.e GETDATE(), GETSYSDATE(), RAND(), the functions that are executed only once per query.

NOTE: The runtime constant scalar functions are evaluated only once, early in the query execution. The results are cached and used for all resulting rows.. This process is known as Constant Folding.

The following script demonstrates how the function works and how it’s different to a foldable function.

Figure 5, NEWID() , not foldable function

ROWGUIDCOL – is a column property( or a designator for a GUID column) similar to $IDENTITY. It is possible to have multiple UNIQUEIDENTIFIER columns per table, but only one can have the ROWGUIDCOL property.  The designator provides a generic way for the client code to retrieve the GUID column, usually with the unique values, from a table.


NEWSEQUENTIALID() is a system function that creates a globally unique GUID that is greater than any GUID previously generated by this function on a particular computer and on a particular Sql Server instance on that computer. The output of the function is of type UNIQUEIDENTIFIER. NEWSEQUENTIALID() is based on Windows UuidCreateSequential() system function.

There are a few interesting quirks and features about this function.

  • NEWSEQUENTIALID() system function cannot be invoked independently i.e SELECT NEWSEQUENTIALID();  It can only be used as a default constraint of a column in a table and the column must be of a UNIQUEINDENTIFIER data type. Also, it is not possible to combine this function with other operators to form a complex scalar expression.

  • All GUID values generated by NEWSEQUENTIALID() on the same computer are ever increasing. From SQL Server perspective ,this means that all sequential guids generated across all instances, databases and tables on the same server, are ever increasing. The “shared counter” is due to the fact that the function is based on a OS function.
    Also, the ever increasing sequence continues after the OS restart. To demonstrate the point run this code.
    Figure 6, NEWSEQUENTIALID() – shared counter
  • NEWSEQUENTIALID() is not guaranteed to be globally unique if initiated on a system with no network card. There is a possibility that another computer without an ethernet address generates the identical GUID. This is based on the underlying UuidCreateSequential()  windows function behavior.
  • NEWSEQUENTIALID() is not compliant with the  RFC4122 standard
  • The sequence of ever increasing GUIDs will be interrupted after the OS system restart. The new sequence may start from a higher range – FIgure 6, or it can start from a lower range. This is something that we cannot control.
  • UuidCreateSequential() outputs sequential guids with different byte order than NEWSEQUENTIALID(). The reason is that the  function outputs UNIQUEIDENTIFIER data type that, as it was mentioned before, re-shuffles certain bytes – see Figure 1. This may create problems with sorting in the situations when the client code generates sequential guids and stores it in database as UNIQUEIDENTIFIER(s). This article explains how to avoid this problem.
  • Sequential guids generated, and re-shuffled as explained above, by the application that runs on the same server as the DB server will be in the same sort order as the sequential guids generated by newsequentialid() across all Sql Server instances on that server.

GUID & DB Design

Relational database systems such as SQL Server have a strong foundation in mathematics and in relational theory – hence the R in RDBMS, but they also have their own principles. For example, a Set is an unordered  collection of unique, no-duplicated items. In relational theory, a relation is defined as a set of n-tuples. A tuple in mathematics is a finite sequence of elements. It has an order and allows duplicates. It was later decided that it would be more convenient, from the computer programming perspective, to use attribute names instead of the ordering. The concept has changed but the name “tuple” remained. Back to RDBMS, a table is a visual representation of a relation and a row is a similar to the concept of a tuple. These concepts are similar but not the same. E.g A table may contain duplicate values whereas relation cannot have two identical tuples etc.

The consistency of an RDBMS is enforced by constraints which are declared as part of db schema e.g Primary Key constraint enforces consistency of an entity. A tuple must have a minimal set of attributes that makes it unique within a relation. A row in a table does not need to be unique, and this is where, in my opinion, the big debate about natural keys vs surrogate keys begins.
We are designing databases with performances in mind. This includes deviations from the rigid rules of database design. The more we know about product’s internals , the more we try to  get  the most of it by adjusting our design and queries to it. Paradoxically, the declarative nature of SQL language teaches us to give instructions on what to do not how to do it. I guess , the truth is always somewhere in between :).
The above may explain use of IDENTITY columns and primary keys that follow clustered index key guidelines: static, unique, narrow and ever-increasing.

It is a common practice among developers to use GUIDs values as primary keys and/or clustered index keys. This choice ticks only one box from the PK/index key properties mentioned above – it’s unique and possibly static. Pretty much everything else is not ideal – GUID is not narrow(16bytes), not ever-increasing( unless generated by UuidCreateSequential()/NEWSEQUENTIALID() on the same PC). GUIDs are generally not good from the core database design practices(meaningless, not intuitive surrogate keys) nor from the db performance perspective(fragmentation, unnecessary disk space consumption, possible query regression etc).So why they are so “popular”?

GUID as Primary Key

The idea is to create a unique value e.g a new productId, on one of the application layers without performing a round-trip to database in order to ask for a new Id. So, the generated GUID value becomes a PK value for the new product in e.g Products table. The PK may be implemented as a unique non-clustered index. The index is likely to be highly fragmented since the generated GUIDs are completely random. Also, keep in mind that the PK is highly likely to be a part of one or more referential integrity constraints(Foreign Keys) in tables like e.g ProductInventory, ProductListPriceHistory, etc. Moreover, the Foreign keys may be, at the same time, part of the composite PKs on the foreign tables – Figure 7. This design may have negative effect on many tables and the database  performance in general.

An alternative approach may be to define GUID column as an Alternate Keyenforced by a unique NCI and to use INT or BIGINT along with the IDENTITY property or a Sequencer as a surrogate PK . The key can be enforced by the unique clustered index. This way we can avoid excessive fragmentation and enforce referential integrity in more optimal way – Figure 7,rowguid column.

Figure 7, GUID column as an Alternate Key – Adventure Works

*Alternate Key represent column(s) that uniquely identify rows in a table. A table can have more than one column or combinations of columns that can uniquely identify every row in that table. Only one choice can be set as the PK. All other options are called Alternate Keys.

GUID values can be created by SQL Server during the INSERT operations.  E.g Client code constructs a new product (product name, description, weight, color, etc..) and INSERTs the information(a new row) into Products table. The NEWID() fn automatically creates and assigns a GUID value to the new row through a DEFAULT constraint on e.g ProductId column. Client code can also generate and supply GUID for the new product. The two methods can be mixed since the generated GUIDs are globally unique.

What I often see in different production environments is that the GUID values are used as PK values even if there is no need for the globally unique values.
Very few of them had better security in mind i.e  It is safer to expose a GUID then a numeric value when querying db through a public API. The exposed numeric value in the URL may potentially be used to harm the system. E.g http://myApp/productid/88765 can suggest that there is productId =88764 etc. , but with a GUID value, these guesses will not be possible – Figure 7, data access point.

In most db designs, at least in the ones I’ve had opportunity to work on,  GUIDs are used only because it was convenient from the application code design perspective.

When application and the database becomes larger and more complex, these early decisions can cause performance problems. Usually these problems are solved by, so called quick-fixes/wins. As the rule of thumb, the first “victim” of those “wins” is always data integrity e.g adding NOLOCK table hints everywhere, removing referential integrity(FK), replacing INNER JOINS with LEFT JOINS, etc. This inevitably leads to a new set of bugs that are not easy to detect and fix. This last paragraph may be too much, but this is what I am seeing in the industry.
Use GUIDs with caution and with the cost-benefit in mind 🙂

GUID as PK and the Clustered index key

Sometimes developers decide to use GUID values as a PK enforced by the clustered index. This means that the primary key column is at the same time the clustered index key. Data pages(leaf level) of a clustered index are logically ordered by the clustered index key values.
One of the reasons for this design may be ability to easily merge data from different databases in the distributed database environment. The same idea can be implemented more efficiently using GUID as an alternative key as explained earlier.
More often, the design is inherited from Sql Server’s default behavior when the PK is created and automatically implemented as the clustered index key unless otherwise specified.

Using GUID as clustered index key leads to extensive page and index fragmentation. This is due to its randomness. E.g every time client app inserts a new Product, a new row must be placed in a specific position i.e specific memory location on a data page. This is to maintain the logical order of the key values. The pages(nodes) are part of a doubly linked list data structure.  If there is not enough space on the designated page for the new row, the page must be split into two pages to make necessary space for the new row. Physical position of the newly allocated page(8KB memory space) in the data file does not follow the order of the index key (it is not physically next to the original page). This is known as the logical fragmentation. Splitting data pages introduces yet another type of fragmentation, the physical fragmentation which defines the negative effect of the wasted space per page after the split. The increased number of “half full” pages along with the process of splitting the pages has negative impact on query performance.

The “potential collateral damage” of the decision to use GUID as clustered index key are non-unique non-clustered indexes.
A non-clustered index that is built on a clustered index,  at the leaf level, contains row locators- the clustered index key values. These unique values are used as pointers to the clustered index structure and the actual rows – more information can be found here – The data Access Pattern section.
A non-unique NCI can have many duplicate index key values. Each of the key values is “coupled” with a unique pointer – in this case a GUID value. Since GUID values are random the new rows can be inserted in any position within the range of the duplicated values. This introduces the fragmentation of the NCI. The more duplicated values, the more fragmentation.

The fragmentation can be “postponed” by using the FILLFACTOR setting. The setting instructs Sql Server what percentage of each data page should be used to store data. The “extra” free space per page can “delay” page splits. The FILLFACTOR value isn’t maintained when inserting new data. It is only effective when we create or rebuild an index. So once it’s full, and between the index rebuilds, the data page will be split again during the next insert.

Things are different with the sequential GUID. Sequential GUIDs are generated in ascending order. The “block” of compact, ever-increasing GUIDs is formed on a server and between the OS restarts. Sequential GUIDs created by the Client code on a different server will fall into a separate “block” of guids – see Figure 6. As mentioned before ,sequential GUIDs can be created by Sql Server – NEWSEQUENTIALID() fn. initiated by a DEFAULT constraint and/or by the client code. The compact “blocks” of guids will reduce fragmentation.


In SQL Server, GUID is a 16byte binary value stored as UNIQUIEIDENTIFIER data type. NEWID() and NEWSEQUENTIALID() are the two system functions that can be used to create GUIDs in Sql server. The latter is not compliant with the RFC4122 standard. Both GUID types can be created by the client code using functions: UUidCreate(), UuidCreateSequential(). .NET sorts Guid values differently than Sql Server. UNIQUEIDENTIFIER data type re-shuffles first 8 bytes(the first three segments). .NET’s SqlGuid Struct represents a GUID to be stored or retrieved from a db.
GUID values are often used as primary key/the clustered index key values. The randomness of the GUID values introduces logical and physical data fragmentation, which then leads to query performance regression. Sequential GUIDs can reduce fragmentation but still need to be used carefully and with the cost-benefit approach in mind.

Thanks for reading.

Dean Mincic

PIVOT, Multi Pivot & Dynamic Pivot in SQL Server

PIVOT, Multi aggregate Pivot & Dynamic Pivot in SQL Server


Pivoting is a technique used to rotate(transpose) rows to columns. It turns the unique values from one column in one table or table expression into multiple columns in another table. SQL Server 2005 introduced PIVOT operator as syntax extension for table expression in the FROM clause. PIVOT, a relational operator is T-Sql proprietary operator and is not part of ANSI SQL Standard.

PIVOT operator structure

Rotating(Pivoting) one table or table expression into another  table requires three different elements

  1. Groping element
  2. Aggregating element
  3. Spreading element

The PIVOT operator accepts only Aggregating and Spreading elements. To avoid possible logical errors we must have a clear understanding of all three parameters, especially the Grouping element.

The flowing example demonstrates the three elements in action.

Let’s say we want to present the sum of freight(Shipping cost) values per order year for each country that ordered our products.
Set up  dbo.Orders_TestPivot table. The script can be found here.

The PIVOT queries below transpose columns from a table expression (ShipCountry, Freight and OrderYear) into a new table.
The queries are logically identical although they use different types of table expressions. The version on the left uses Derived query and the one on the right uses Common table expression(CTE).
More on table expressions can be found here:
My personal preference is the CTE version, so i’ll use that in the following examples. 🙂

Derived query table expression Common Table Expression

The figure below visually maps the elements of the PIVOT operator and the final result set.

Figure 1, PIVOT Operation

My personal way of thinking when creating a PIVOT query is;

  1. Sketch the final result-set and visualise all three elements required for PIVOT operation
  2. Define a table expression(CTE) that returns:
    1. Spreading element – what we want to see on columns – OrderYear
    2. Aggregate element – what we want to see in the intersection of each row and column – Freight
    3. Grouping element* – what we want to see on rows – ShipCountry
  3. Add  PIVOT operator. The pivot operator returns a table result – in our example the table result has alias PVT.
    1. Include aggregate function applied to the aggregate element – SUM(Freight).
    2. Include the FOR clause and the spreading column – FOR OrderYear.
    3. Specify the IN clause and the list of distinct, comma separated values that appear in the spreading element. [2018],[2019],[2020] . In our example we have a list of irregular identifiers* that needs to be delimited.
      If we added a non existing value to the IN list e.g [2099], the query would execute with no error but with the NULL aggregated values 🙂
    4. Specify an alias for the PIVOT result table – PVT
  4. Specify the final SELECT. The columns are selected from PIVOT result table. The sequence of the selected columns is not relevant.

Note: Irregular identifiers:
We use identifiers to name(identify) Sql Server’s objects i.e stored procedures, tables, views, constraints, column names, attributes ..etc. There is a set of rules for creating identifiers i.e The first character cannot be numeric, so e.g 2018 is an Irregular identifier. To be able to use irregular identifiers we need to “fix” their boundaries/limits or to deLimit them. To do that we can use double quotation marks – 2018 or tSQL specific – square brackets;  [2018]. More about Sql Server Identifiers can be found here.

An interesting thing about PIVOT operator is that it does not include grouping element. The grouping element is “everything else” that is not a spreading or an aggregating element. In our example the grouping element is ShipCountry column selected in the table expression.
If we selected e.g ShipCity along with ShipCountry as the two columns that are not a spreading or an aggregate element, the result would be different.

Figure 2, Group By ShipCountry and ShipCity

This behavior can cause logical errors, especially if we apply PIVOT operator directly on a table.

In the next experiment, we are not using a table expression to prepare data-set for the PIVOT operator. Instead, PIVOT now operates over the entire table. It implicitly(automatically) groups data by all columns except the orderDate and Freght columns. As we can see on Figure 3, the query produces an unexpected result

Figure 3, PIVOT operation directly on a table

To avoid possible logical errors, it is always a good practice to first construct a table expression with the implicitly defined PIVOT elements(grouping, spreading and aggregating), and then to apply the PIVOT operator on the prepared data-set.

Multi aggregate pivot

A PIVOT operator can handle only one aggregate element at a time.  This means that if we want to use more aggregate elements we need to add more PIVOT operators to our query – a PIVOT operator per aggregate element 😐
In the previous example our aggregate element was Freight when we calculated the total shipping costs in different countries per year.
This time, we want to calculate the average value of the orders placed in different countries per year and to add the results to our query.
Figure 4 shows the desired result
Figure 4, Multi aggregate PIVOT- two aggregate elements

From the result we can see that the second result-set is just “appended” to the first. Basically, we just combined the two PIVOT results using an INNER JOIN table operator and an equality predicate on ShipCountry column.
The final query uses column aliases to indicate the different data-sets.
Figure 6, Multi aggregate PIVOT operation

The query in Figure 6 can be found here.

Dynamic PIVOT

A disadvantage of the PIVOT operator is that its IN clause only accepts a static list of spreading values. It does not support e.g a sub-query as input. This means that we need to know in advance all the distinct values in the spreading element. The “hard-coding” may not necessarily be a problem in cases when we deal with a spreading element with the known spreading values e.g OrderYear.
Going back to the first example, we can easily expand the IN list with the spreading values that are not available yet.

The things get more complex when we cannot predict all possible spreading values. In these situations we can fist design a query that will give us a distinct list of spreading values, and then use that list to dynamically construct the final PIVOT query, the Dynamic Pivot.
A typical scenario in which we use Dynamic pivoting is when transposing attributes of an EAV*(Entity-Attribute-Value) data model .

EAV* is one of the open-schema data models (xml, json, clr) that, in some cases, can provide more flexibility than the relational model. Here is an interesting post about EAV.

Lets say we have a list of Products. Each product is different and can have a specific set of attributes. e.g a bicycle can have specific type of tires and a hard-drive can have a specific capacity..etc. Business frequently adds new products and product attributes. In the next example I used a simplified EAV model to store the products.The table script can be found here.

Our next task is to return a row for each distinct product, a column for each distinct product attribute and in the intersection of each product and attribute we want to see the  value of the attribute.

Figure 7 shows the desired output for all products and for a specific product
Figure 7, Dynamic pivot result

In this scenario we cannot know all the possible Attributes(the spreading element values). Moreover, the list of attributes is constantly changing, so hard-coding the IN list is no longer an option.
The following is a  dynamic pivot query that can give us the result in Figure 7.

NOTE: To extract a known Attribute value, in this case we can use MAX() or MIN() aggregate functions. Both functions will operate on a single value and will return a single value. Keep in mind that MIN and MAX as well as all other aggregate functions except COUNT(*), ignores NULL values.

The new attributes will be automatically handled by the dynamic query.

A couple of versions of the dynamic query can be downloaded here.


Pivoting is a technique used to transpose rows to columns. PIVOT is tSql proprietary operator and is not part of ANSI Standard. PIVOT operator accepts two parameters; Spreading element or what we want to see on columns and aggregating element or what we want to see in the intersection of each distinct row and column. Grouping element is the third parameter involved in pivot operation. It is what we want to see on rows. The grouping element is not formal part of the PIVOT operator and represents all columns that are not defined as spreading or aggregating elements. The implicit nature of the grouping element can lead to logical errors. This is why is recommended  to construct a table expression for the PIVOT operator that provides exactly three elements needed for the operation.
A PIVOT operator is limited to only one aggregate function. To perform multi aggregate pivot we need to introduce a PIVOT operator per aggregation.
The IN clause of the PIVOT operator accepts only a hard-coded, comma separated list of spreading element values. In the situations when the values are not known, we use dynamic sql to construct the query.


Thanks for reading.

Dean Mincic

Mutex in Sql Server

Mutex in Sql Server


RDBMS systems are multi-user systems which serve many clients at the same time. Being able to process large amount of requests is very important up to the point that we often trade data consistency to improve concurrency. However, there are situations when access to a particular segment of code needs to be serialized. This is similar to the Critical section in concurrent programming when a concurrent accesses to shared resources can lead to unexpected behavior. To protect the code e.g in c#  we use lock/monitor, mutex or semaphore and in Sql Server we use dummy lock tables, isolation levels/lock hints or application locks. This post presents four different ways of protecting the critical code in Sql Server.

What is Mutex?

Mutex stands for mutually exclusive. It is a construct used to serialize access to the shared resources. Mutex is a locking mechanism that prevents race conditions allowing access to the protected code (critical section) to only one process/thread at a time.

Thread safe code – c# example

The next example shows a simple use of mutex class to serialize access to a “critical section”.
The c# console application code can be found here.

The program performs division of two random numbers. The operation that follows sets operands, num1 and num2 values to 0. This is done 5 times in a For loop.

The critical section is executed concurrently by multiple threads* causing DivideByZeroException exception. Thread1(t1) and Thread2(t2)  started executing the code at almost the same time. (t1) has performed the division and assigned value 0 to num2 variable. At the same time, (t2) was in the middle of the division when (t1) set divisor(num2) value to 0 . The new condition caused Division by zero exception.

*Note: The runtime environment starts execution of the program with the Main () method in one thread and then creates three more threads using System.Threading.Thread class.

To avoid this situation we need to serialize access to the code above. One way to do that is to use the Mutex class to provide exclusive access to the critical section. (un-comment mutex objects in the code)

Now, treads (t1) and (t2) executes the code one at a time without causing the exception. The code is now “thread safe” 🙂

Mutex in SQL Server

There are several ways to serialize access to a critical section in Sql server. Although some approaches are more proper than others, it’s good to understand them all because, sometimes a specific situation can limit our options.

Set up test environment

The code used in the experiments can be downloaded by following the links below.

  • Test table – The main test table used to simulate effects of the concurrent inserts. (download here)
  • Dummy table – table used to present one of the mutex implementation techniques. (download here)
  • ITVF – an inline function used to track table locks requested by the concurrent connections.
    (download here)

There are many scenarios that can be used to demonstrate the effects of concurrent query execution on the critical section i.e Lost updates, double inserts etc. In this post I’ll focus on the concurrent inserts only.

The base query  for the following experiments. (also available here)

Essentially, the query logic encapsulates a read and write query with the latter being executed if the first returns an empty set.

In this scenario, more than one connection is trying to insert an unique  combination of ShipperId and IdentifierValue into table. Only one “unique” insert is allowed and that is enforced by the Unique constraint on the two columns.

I’ll be executing the query from the context of the two different SSMS sessions. To simulate concurrent code execution, before each experiment we define the exact time when the code will run i.e WAITFOR TIME '14:13:40' . We also need to capture the two SIDs (session IDs) for which we want to collect metadata i.e DECLARE @spId VARCHAR(1000) = '54,64';

Insert race condition

An insert race condition is situation where multiple instances of the same code execute a conditional insert at the same time. The condition for the insert can evaluate to true for all concurrent calls causing multiple inserts of the same values. This can lead to logical errors – duplicate rows or violation of Unique constraint/Primary key etc.

So, lets execute the base query code as is and demonstrate the Insert race condition.
Figure 1, Constraint violation caused by the insert race condition

Table hints and isolation levels

The first* method to serialize access to a critical section is to use combinations of table hints and/or isolation levels. This will permit only one code execution at a time.

*NOTE: The only reason why I put this method as the first solution is because, for me personally, it was the most interesting approach to research. However, in production environment, depending on the situation, I would probably first try to implement Application locks explained in the following section.

Previous, unsuccessful attempt to insert a new row follows the sequence presented in Figure 2 below. The list is compiled using dbo.itvfCheckLocks outputs.

Figure 2, Locking pattern – key violation error

One of the first thing that comes to mind is to elevate transaction isolation level.
If we used REPEATABLE READ tran. isolation level the outcome will be exactly the same. The isolation level would keep S locks, if acquired during the first – read query, until the end of transaction. That way repeatable read prevents the inconsistent analysis aka repeatable read anomaly. However, in this this case, there won’t be any S locks acquired and hold because the requested row  (ShipperId=50009, IdentifierValue=4) does not exists.

The next isolation level is SERIALIZABLE. For the test I’ll use (HOLDLOCK) table hint. The hint acts as SERIALIZABLE transaction isolation level, only the scope of the isolation is reduced to a table level. However, opposed to i.e NOLOCK,  HOLDLOCK “holds” its locks (sticks to its guns 🙂 ) until the end of the transaction.

The complete code can be found here.

 “Reset” the dbo.ShipperIdentifier table, run the test query again and observe the results.
Figure 3, Deadlock situation and SERIALIZABLE isolation level

This time Session 54 successfully completed the insert and Session 64 was chosen to be a deadlock victim. Figure 4 shows the deadlock diagram.

Figure 4 – the deadlock diagram – serializable isolation level

NOTE: I’ve used MS Profiler to get the graphical plan. Use Deadlock graph, Lock:Deadlock and Lock:Deadlock events.Once you get the event, right click on the Deadlock graph row/ Extract event data to save the diagram for further analysis

If we correlate information from Figure 4 and dbo.itvfCheckLocks we can conclude the following;

Similar to the situation presented in Figure 2, both sessions used NCI to access the requested information. During the read phase both sessions acquired IS locks on an index page where IdentifierValue = 4 (and the subsequent, IdentifierValue=5) suppose to be. The sessions have also acquired RangeS-S locks on the NCI key range, IdentifierValue=5. The locks are compatible and both sessions evaluated conditional expression to TRUE.
During the write phase – INSERT query, both sessions have acquired X locks on the two new rows to be inserted in the Clustered index -Session(64) on a new Id=10001 and Session(54) on Id=10002.
Now, in order to acquire X lock on the new row(s) to be inserted in the NCI, the existing RangeS-S locks must be first converted to RangeI-N locks. And this is the point where the dead lock happens – see Figure 4. Because RangeI-N and RangeS-S are not compatible, Sessions 64 and 54 waits on each other to release its RangeS-S locks. After certain period of time, in this case Sql Server engine decided to “kill” session(64) and let (54) to successfully finish.

The idea is to acquire non-compatible locks during the read phase and to keep the locks until the end of the transaction – see lock compatibility matrix. We can use UPDLOCK table hint to force lock manager to use U locks instead S locks, in our case RangeS-U instead RangeS-S.
Change the test query code and reset the test environment. Find the new code here.


If we run the concurrent code again, we’ll see that one of sessions acquired RangeS-U lock on the non-clustered index Key (ShipperId=50009, ShipperIdentifier = 5  and Id =3). Both sessions have acquired IU locks on the NCI page which “hosts” the above key(UI locks are compatible). Other Session now must WAIT until first Session releases the RangeS-U locks before it enters the conditional branching and perform the read query.
The first session releases the RangeS-U lock at the end of the transaction. At this point the new row (ShipperIdentifier = 4) has already been inserted in the table (NCI and CI ). The blocked session now can continue and acquire its own RangeS-U and IU locks. This time the read query can find the requested row. The conditional expression evaluates to FALSE and skips the INSERT query.

We managed to serialize access to the critical code by acquiring non-compatible locks at the beginning of the process and holding the locks until the end of the code segment.

Application locks

Another way to prevent concurrent access to a critical section in sequel is to use Application locks. This “special” type of locks is designed to serialize access to a critical section purely from the code perspective – very much like mutex in c# demonstrated earlier.

Application locks is a mechanism which allows application to acquire an app-lock on a critical section within a transaction or a connection(session). The locks do not affect tables/pages/rows but purely the code they encompass.

The available application lock types are:  S(Shared) IS(Intent Share), U(Update), X(Exclusive) and IX(Intent Exclusive). The rules follow the standard compatibility matrix. More on the application locks can be found here.

Application locks are implemented through system stored procedures:

  • sys.sp_getapplock – used to acquire locks
    • @Resource: specifies the case sensitive name of the application lock.
    • @LockMode: specifies the lock type S,IS,U,X,IX
    • @LockOwner: specifies the scope of the lock -Transaction or
    • @LockTimeout: specifies the timeout in milliseconds. Stored proc. will return an error if it cannot acquire the lock in this interval
    • @DbPrincipal: specifies security context. The caller must be member of one of the following security objects
        • database_principal
        • dbo  – special database user principal
        • db_owner – fixed db role
        • (DEFAULT – Public db role)
  • sys.spreleaseapplock. – used to release locks
    • @Resource:
    • @LockOwner: specifies the scope of the lock -Transaction or
    • @DbPrincipal

In the next example we use a new test query which implements application locks. Reset the test environment and concurrently execute two instances of the new test query.

My concurrent sessions were 54 and 56. Even if executed at the same time, Session 54 has acquired an app lock first making the second session(Sid=56) wait until 54 releases the app lock resource. The allowed wait time(@LockTImeout) is set to 1.5s.
Below is the output of the query execution.

Figure 6, Session 54 – Application locksFigure 7, Session 56 – Application locks

As we can see, the application lock has serialized access to the critical section within explicit transaction. Application lock did not affect “standard” data locking routine defined by transaction’s isolation level and the query itself. The lock used a non-compatible mode (@LockMode = ‘Exclusive’) which prevented concurrent access.
If we used one of the compatible lock modes – ‘Shared‘, ‘IntentShared‘ or ‘IntentExclusive‘, the test would fail causing Violation of UNIQUE KEY constraint UC_ShippierId_IdentifierValue … similar to one presented in Figure1.

My personal opinion is that this is the cleanest way to serialize access to a critical section.

The next two methods are more workarounds than proper solutions.

Dummy lock tables

This method includes a dummy table, table that is not part of the database schema(at leas not logically). Its sole purpose is to be exclusively locked by one of the concurrent sessions allowing only one session to access the subsequent code/queries at a time.

To execute the test, reset the test table and use dummy table sql script from here.

In my experiment I had two sessions, Sid=66 and , Sid=65. The former had exclusively locked  the dummy table before Sid=65 requested the lock. This pattern ensured that only one session can execute the protected code at a time.
Similar to Application locks, the dummy table routine does not restrict access to the objects (tables, views..etc) within a critical section, through different access paths. i.e Session(88) attempts to update a row in dbo.ShipperIdentifier table during the above action. The concurrent update will follow standard Transaction isolation level rules regardless of the status of dummy table.
Figure 8, Dummy table pattern

Figure 9, Dummy table – blocked session

Tables and Loops

The last method encapsulates critical section in an infinite loop.  A conditional branching within the loop checks for existence of a dummy table (or a global aka double hash, temporary table). If the table does not exists, the current session will be able to access the “protected code” and subsequently to drop the table and exit the loop. However, if the table already exists, the concurrent session(s) will keep looping, constantly checking if the table still exists. Once the table gets dropped by the current session(the only session that can access the DROP TABLE code), a concurrent session will be able to create table and to access the critical section.

As mentioned before, this method is more of a workaround than a proper solution and can introduce a number of performance issues i.e excessive drop/create table actions, increased CPU workload etc.

The complete script can be found hereReset the test environment and run the script in two separate SSMS sessions and at the same time.
Figure 10, Tables and Loops 🙂

From the output we can see that Session Sid=66 was the first one to create the dummy table and to access the critical section. At the same time, Session Sid=65 was constantly trying to enter the code segment by checking the existence of the dummy table. It made 8486 attempts in order to access the critical section. Finally, it accessed the code in a serial manner without causing any constraint violation..


Sometimes access to particular segments of code needs to be serialized between concurrent client connections. A protected segment of code is also known as critical section. In concurrent programming we use objects/constructs like mutex, semaphore or locks in order to serialize threads’ access to the shared resources making them thread safe.  In SQL programming critical sections/queries  are of declarative type usually describing what we want to achieve but not how. Therefore,  serializing sql code i.e one or more queries encapsulated in an explicit transaction, comes down to ensuring that only one session/connection can access the same code through the same object i.e the same stored procedure at the same time. However, the same protected code can be concurrently accessed by other sessions through different objects i.e views, other stored procs, dynamic queries etc.
Sql Server’s application locks closely resemble mutexes in application programming. Implemented through a couple of system stored procedures, application locks are easy to understand and implement. There are many different ways to achieve the same goal e.g by controlling the types of locks (UPDLOCK) and/or mimicking behavior of the ANSI transnational isolation levels applied only to specific table(s) (SERIALIZABLE aka HOLDLOCK) within a critical section. Other solutions may seem like workarounds implementing more imperative approach such  as  Tables and Loops.

Thanks for reading.

Dean Mincic

Recursive CTE

Recursive Common Table Expressions , table expressions and more…


Common Table Expressions were introduced in SQL Server 2005. They represent one of several types of table expressions available in Sql Server. A recursive CTE is a type of CTE that references itself. It is usually used to resolve hierarchies.
In this post I will try to explain how CTE recursion works, where it sits within the group of table expressions available in Sql Server and a few case scenarios where the recursion shines.

Table Expressions

A table expression is a named query expression that represents a relational table.  Sql Server supports four types of table expressions;

  • Derived tables
  • Views
  • ITVF (Inline Table Valued Functions aka parameterised views)
  • CTE (Common Table Expressions)
    • Recursive CTE

In general, table expressions are not materialised on the disk. They are virtual tables present only in RAM memory (they may be spilled to disk as a result of i.e memory pressure, size of a virtual table etc..). The visibility of the table expressions may vary i.e views and ITVF are db objects visible on a database level, whereas they scope is always on an SQL statement level – table expressions cannot operate across different sql statements within a batch.

Benefits of table expressions are not related to query execution performances but to the logical aspect of the code !

Derived Tables

Derived tables are table expressions also known as sub-queries. The expressions are defined in the FROM clause of an outer query. The scope of derived tables  is always the outer query.

The following code represents a derived table called AUSCust.

The derived table AUSCust is visible only to the outer query and the scope is limited to the sql statement.


Views (sometimes referred to as virtual relations)  are reusable table expressions. A view definition is stored as an Sql Server object along with objects such as; user defined tables, triggers, functions, stored procedures etc.
The main advantage of Views over other types of table expressions is their re-usability i.e derived queries and CTE have scope limited  to a single statement.
Views are not materialised, meaning that the rows produced by views are not stored permanently on disk. Indexed views is Sql Server(similar but not the same as the materialised views in other db platforms)  are special type of views that can have their result-set permanently stored on disk – more on indexed views can be found here.

Just a few basic guidelines on how to define SQL Views.

  • SELECT * in the context of a View definition behaves differently then when used as a query element in a batch.

    The view definition will include all columns from the underlying table, dbo.T1 at the time of the view creation. This means that if we change the table schema (i.e add and/or remove columns) the changes will not be visible to the view – the view definition will not automatically change to support the table changes. This can cause errors in the situations when i.e a view try to select non-existing columns from an underlying table.
    To fix the problem, we can one of the two system procedures: sys.sp_refreshview or sys.sp_refreshsqlmodule.
    To prevent this behavior follow the best practice and explicitly name the columns in  the definition of the view.

  • Views are table expressions and therefore cannot be ordered. Views are not cursors! It is possible, though, to “abuse” the TOP/ORDER BY construct in the view  definition in attempt to force sorted output.  e.g .

    Query optimiser will discard the TOP/ORDER BY since the result of a table expression is always a table – selecting TOP(100 PERCENT) doesn’t make any sense anyway. The idea behind Table structures is derived from a concept in Relational database theory known as Relation.

  • During processing a query that references a view, the query from the view definition gets unfolded or Expanded  and implemented in the context of the main query. The consolidated code(query) will then be optimised and executed.

ITVF (Inline Table Valued Functions)

ITVFs are are reusable table expressions that support input parameters. The functions can be treated as parameterised views.

CTE (Common Table Expressions)

Common table expressions are similar to derived tables but with several important advantages;

A CTE is defined using a WITH statement, followed by a table expression definition. To avoid the ambiguity (TSQL uses WITH keyword for other purposes i.e WITH ENCRYPTION etc) the statement preceding CTE’s WITH clause MUST be terminated with a semi-column. This is not necessary if the WITH clause is the very first statement in a batch i.e in a VIEW/ITVF definition)

NOTE: Semi-column, the statement terminator is supported by ANSI standard and it is highly recommended to be used as a part of TSQL programming practice.

Recursive CTE

SQL Server supports recursive querying capabilities implemented trough Recursive CTEs since version 2005(Yukon).

Elements of a recursive CTE

  1. Anchor member(s) – Query definitions that;
    1. returns a valid relational result table
    2. is executed ONLY ONCE at the beginning of query execution
    3. is positioned always before the first recursive member definition
    4. the last anchor member must be followed by UNION ALL operator. The operator combines the last anchor member with the first recursive member
  2. UNION ALL multi-set operator. The operator operates on
  3. Recursive member(s) – Query definitions that;
    1. returns a valid relational result table
    2. have reference to the CTE name. The reference to the CTE name logically represents the previous result set in a sequence of executions. i.e The first “previous” result set in a sequence is the result the anchor member returned.
  4. CTE Invocation – Final statement that invokes recursion
  5. Fail-safe mechanism – MAXRECURSION option prevents database system from the infinite loops. This is an optional element.

Termination check

CTE’s recursive member has no explicit recursion termination check.
In many programming languages, we can design method that calls itself – a recursive method. Every recursive method needs to be terminated when a certain conditions are satisfied. This is Explicit recursion termination. After this point the method begins to return values. Without termination point recursion can end up calling itself “endlessly”.
CTE’s recursive member termination check is implicit , meaning that the recursion stops when no rows are returned from the previous CTE execution.

Here is a classic example of a recursion in imperative programming. The code below calculates the factorial of an integer using a recursive function(method) call.

Complete console program code can be found here.


As mentioned above, recursive CTEs as well as any recursive operation may cause infinite loops if not designed correctly. This situation can have negative impact on database performance. Sql Server engine has a fail-safe mechanism that does not allow infinite executions.

By default, the number of times recursive member can be invoked is limited to 100 (this does not count the once-off anchor execution). The code will fail upon 101st execution of the recursive member.

Msg 530, Level 16, State 1, Line xxx
The statement terminated. The maximum recursion 100 has been exhausted before statement completion.

The number of recursions is manged by  MAXRECURSION n query option. The option can override the default number of maximum allowed recursions. Parameter (n) represents the recursion level.    0<=n <=32767

Important note:MAXRECURSION 0 – disables the recursion limit!

Figure 1 shows an example of a recursive CTE with its elements

Figure 1, Recursive CTE elements

Declarative recursion is quite different than traditional, imperative recursion. Apart of the different code structure, we can observe the difference between the explicit and the implicit termination check. In the CalculateFactorial example, the explicit termination point is clearly defined by the condition: if (number == 0) then return 1.
In the case of recursive CTE above, the termination point is implicitly defined by the INNER JOIN operation, more specifically by the result of the logical expression in its ON clause: ON e.MgrId = c.EmpId. The result of the table operation drives the number of recursions. This will become more clear in the following sections.

Use recursive CTE to resolve Employee hierarchy

There are many scenarios when we can use recursive CTEs i.e to separate elements etc. The most common scenario I have come across during many years of sequeling has been to use recursive CTE to resolve various hierarchical problems.

The Employee tree hierarchy is a classic example of a hierarchical problem that can be solved using Recursive CTEs.


Let’s say we have an organisation with 12 employees. The following business rules applies;

  • An employee must have unique id, EmpId
    • enforced by: Primary Key constraint on EmpId column
  • An employee can be be managed by 0 or 1 manager.
    • enforced by: PK on EmpId, FK on MgrId and NULLable MgrId column
  • A manager can manage one or more employees.
    • enforced by: Foreign Key constraint(self referenced) on MgrId column
  • A manager cannot manage himself.
    • enforced by: CHECK constraint on MgrId column

The tree hierarchy is implemented in a table called dbo.Employees. The scripts can be found here.

Figure 2, Employees table

Lets present the way recursive CTE operate by answering the question: Who are the direct and indirect subordinates of the manager with EmpId = 3?

From the hierarchy tree in Figure 2  we can clearly see that Manager (EmpId = 3) directly manages employees; EmpId=7, EmpId=8 and EmpId=9 and indirectly manages; EmpId=10, EmpId=11 and EmpId=12.

Figure 3 shows the EmpId=3 hierarchy and the expected result. The code can be found here.

Figure 3, EmpId=3 direct and indirect subordinates

So, how did we get the final result.

The recursive part in the current iteration always references its previous result from the previous iteration. The result is a table expression(or virtual table) called cte1(the table on the right side of the INNER JOIN). As we can see, cte1 contains the anchor part as well. In the very first run(the first iteration), recursive part cannot reference its previous result because there was no previous iteration. This is why in the first iteration only the anchor part executes and only once during the whole process. The anchor query result-set gives recursive part its previous result in the second iteration. The anchor acts as a flywheel if you will 🙂

The final result builds up through iterations i.e Anchor result + iteration 1 result + iteration 2 result …

The logical execution sequence

The test query is executed by following the logical sequence below:

  1. The SELECT statement outside the cte1 expression invokes the recursion. The anchor query executes and returns a virtual table called cte1.  The recursive part returns an empty table since it has no its previous result. Remember, the expressions in set based approach are evaluated all at once.
    Figure 4, cte1 value after 1st iteration
  2. The second iteration begins.This is the first recursion. The anchor part played its part in the first iteration and from now on returns only empty sets. However, the recursive part can now reference it’s previous result(cte1 value after the first iteration) in the INNER JOIN operator. The table operation produces the result of the second iteration as shown in the figure below.
    FIgure 5, cte1 value after 2nd iteration
  3. Second iteration produces a non-empty set, so the process continues with the third iteration – the second recursion. Recursive element now references the cte1 result from the second iteration.

    FIgure 6, cte1 value after 3rd iteration
  4. An interesting thing happens in the 4th iteration – the third recursion attempt. Following the previous pattern, the recursive element uses the cte1 result from the previous iteration. However, this time there are no rows returned as a result of the INNER JOIN operation, and the recursive element returns an empty set. This is the implicit termination point mentioned before. In this case, INNER JOIN’s logical expression evaluation dictates the number of recursions.
    Because the last cte1 result is an empty result-set, the 4th iteration(or 3rd recursion) is “canceled” and the process is successfully finished.

    Figure 7, The final iteration

    The logical cancellation of the 3rd recursion (the last recursion that produced an empty result-set does not count) will become more clear in the following, recursive CTE execution plan analysis section.We can add  OPTION(MAXRECURSION 2)  query option at the end of the query which will limit the number of allowed recursions to 2. The query will produce the correct result proving that only two recursions are required for this task.Note: From the physical execution perspective, the result-set is progressively(as rows bubble up) sent to the network buffers and back to the client application.

Finally, the answer on the question above is :
There are six employees who directly or indirectly report to the Emp = 3. Three employees, EmpId= 7, EmpId=8 and EmpId=9 are direct subordinates and EmpId=10, EmpId=11 and EmpId=12 are indirect subordinates.

Knowing the mechanics of recursive CTE, we can easily solve the following problems.

  • find all the employees who are hierarchically above the EmpId = 10 (code here)
  • find EmpId=8 ‘s direct and the second level subordinates(code here)

In the second example we control depth of the hierarchy by restricting the number of recursions.
Anchor element gives us the first level of hierarchy, in this case, the direct subordinates. Each recursion then moves one hierarchy level down from the first level. In the example, the starting point is EmpId=8 and his/hers direct subordinates. The first recursion moves one more level down the hierarchy where EmpId=8 ‘s second level subordinates “live”.

Circular reference problem

One of the interesting things with hierarchies is that the members of a hierarchy can form a closed loop where the last element in the hierarchy references the first element. The closed loop is also known as circular reference.
In the cases like this, the implicit termination point, like the INNER JOIN operation explained earlier, will simply not work because it will always return a non-empty result-set for the next recursion to go on. The recursion part will keep rolling until it hits Sql Server’s fail-safe, the MAXRECURSION query option.

To demonstrate circular reference situation using previously set up test environment, we’ll need to

  • Remove Primary and Foreign key constraints from dbo.Employees table to allow the closed loops scenarios.
  • Create a circular reference (EmpId=10 will manage his indirect manager , EmpId = 3)
  • Extend the test query used in the previous examples, to be able to analyse hierarchy of the elements in the closed loop.

The extended test query can be found here.

Before continuing with the circular ref. example, lets see how the extended test query works. Comment out the WHERE clause predicates(the last two lines) and run the query against the original dbo.Employee table

Figure 8, Detecting existence of circular loops in hierarchies

The result of the extended query is exactly the same as the result presented in the previous experiment in Figure 3. The output is extended to include the following columns

  • pth – Graphically represents the current hierarchy. Initially, within the anchor part, it simply adds the first subordinate to MgrId=3, the manager we’re starting from. Now, each recursive element takes the previous pth value and adds the next subordinate to it.
  • recLvl – represents current level of recursion. Anchor execution is counted as recLvl=0
  • isCircRef – detects existence of a circular reference in the current hierarchy(row). As a part of recursive element, it searches for the existence of an EmpId that was previously included in the pth string.
    i.e if the previous pth looks like 3->8->10 and the current recursion adds ” ->3 “, (3->8 >10 -> 3) meaning that EmpId=3 is not only an indirect superior to EmpId=10, but is also EmpId=10’s subordinate – I am boss or your boss, and you are my boss kind of situation 😐

Lets now make necessary changes on dbo.Employees to see the extended test query in action.

Remove PK and FK constraints to allow circular references and add a “bad boy circular ref” to the table.

Run the extended test query, and analyse the results (don’t forget to un-commet previously commented WHERE clause at the end of the script)
The script will execute 100 recursions before gets interrupted by the default MAXRECURSION. The final result will be restricted to two recursions .. AND cte1.recLvl <= 2;  which is required  to resolve EmpId=3’s hierarchy.

Figure 9 shows a closed loop hierarchy, maximum allowed number of recursions exhausted error and the output that shows the closed loop.

Figure 10, Circular reference detected

A few notes about the circular reference script.
The script is just an idea of how to find closed loops in hierarchies. It reports only the fist occurrence of a circular reference – try to remove WHERE clause and observe the result.
In my opinion, the script (or a similar versions of the script) can be used in production environment for i.e troubleshooting purposes or as a prevention from creating circular references in an existing hierarchy. However, it needs to be secured by appropriate MAXRECURSION n, where n is expected depth of the hierarchy.

This script is non-relational and relies on a traversal technique. It is always the best approach to use declarative constraints (PK, FK, CHECK..) to prevent any closed loops in data.

Execution plan analysis

This segment explains how Sql Server’s query optimiser(QO) implements a recursive CTE. There is a common pattern that QO uses when constructing the execution plan. Run the original test query and include the actual execution plan

Like the test query, the execution plan has two branches: the anchor branch and the recursive branch.  Concatenation operator, which implements the UNION ALL operator, connects results from the two parts forming the query result.

Let’s try to reconcile the logical execution sequence mentioned before and the actual implementation of the process.

Figure 11, Recursive CTE execution plan

Following the data flow (right to left direction) the process looks like:

Anchor element (executed only once)

  1. Clustered Index Scan operator – system performs index scan. In this example, it applies expression MgrId = @EmpId as a residual predicate. Selected rows(columns EmpId and MgrId) are passed (row by row) back to the previous operator.
  2. Compute Scalar The operator adds a column to the output. In this example, the added column’s name is [Expr1007]. This represents the Number of Recursions. The column has initial value of 0; [Expr1007]=0
  3. Concatenation – combines inputs from the two branches. In the first iteration, the operator receives rows only from the anchor branch. It also changes the names of the output columns. In this example the new column names are:
    1. [Expr1010] = [Expr1007] or [Expr1009]*    *[Expr1009] holds number of recursions assigned in the recursive branch. It does not have value in the first iteration.
    2. [Recr1005] = EmpId(from the anchor part) or EmpId(from the recursive part)
    3. [Recr1006] = MgrId(from the anchor part) or MgrId (from the recursive part)
  4. Index Spool (Lazy Spool) This operator stores the result received from the Concatenation operator in a worktable. It has property “Logical Operation” set to “Lazy Spool”. This means that the operator returns its input rows immediately and does not accumulate all rows until it gets the final result set (Eager Spool) . The worktable is structured as a clustered index with the key column [Expr1010] – the recursion number. Because the index key is not unique, the system adds an internal, 4 byte uniquifier to the index key to ensure that all rows in the index are, from the physical implementation perspective, uniquely identifiable. The operator also has property “With Stack” set to “True” which makes this version of the spool operator a Stack Spool  A Stack Spool operator always has two components –  an Index Spool that builds the index structure and a Table Spool that acts as a consumer of the rows stored in the worktable that was built by the Index Spool.
    At this stage, the Index Spool operator returns rows to the SELECT operator and stores the same rows in the worktable.
  5. SELECT operator returns EmpId and MgrId ([Recr1005] , [Recr1006]). It excludes [Expr1010] from the result. The rows are sent to the network buffer as they arrive from the operators downstream

After exhausting all rows from the Index Scan operator, the Concatenation operator switches context to the recursive branch. The anchor branch will not be executed again during the process.

Recursive element

  1. Table Spool (Lazy Spool). The operator has no inputs and, as mentioned in (4) acts as a consumer of the rows produced by the Index Spool and stored in a clustered worktable. It has property “Primary Node” set to 0 which points to the Index Spool Node Id. It highlights the dependency of the two operators. The operator
    1. removes rows it read in the previous recursion. This is the first recursion and there are no previously read rows to be deleted. The worktable contains three rows (Figure 4).
    2. Read rows sorted by the index key + uniquifier in descending order. In this example, the first row read is EmpId=9, MgrId=3.

    Finally, the operator renames the output column names. [Recr1003] =[Recr1005],  [Recr1004] =[Recr1006] and [Expr1010] becomes [Expr1008].
    NOTE: The table spool operator may be observed as the cte1 expression on the right side of the INNER JOIN (figure 4)

  2. Compute Scalar The operator adds 1 to the current number of recursions previously stored in column [Expr1007].The result is stored in a new column, [Expr1009]. [Expr1009] = [Expr1007] + 1 =  0 + 1 = 1. The operator outputs three columns, the two from the table spool ([Recr1003] and [Recr1004]) and [Expr1009]
  3. Nested Loop(I) operator receives rows from its outer input, which is the Compute Scalar from the previous step, and then use [Recr1003] – represents EmpId from the Table Spool operator, as a residual predicate in the Index Scan operator positioned in the Loop’s inner input. The inner input executes once for each row from the outer input.
  4. Index Scan operator returns all qualified rows from dbo.Employees table (two columns; EmpId and MgrId) to the nested loop operator.
  5. Nested Loop(II): The operator combines [Exp1009] from the outer input and EmpId and MgrId from the inner input and passes the three column rows to the next operator.
  6. Assert operator is used to check for conditions that require query to be aborted with an error message. In the case of recursive CTEs , assert operator implements “MAXRECURSION n” query option. It checks whether the recursive part reached the allowed (n) number of recursions or not. If the current number of recursions, [Exp1009](see step 7) is greater than (n), the operator returns 0 causing a run time error. In this example, Sql Server uses its default MAXRECURSION value of 100. The expression looks like: CASE WHEN [Expr1009]> 100 THEN 0 ELSE NULL If we decide to exclude the failsafe by adding MAXRECURSION 0, the assert operator will not be included in the plan.
  7. Concatenation combines inputs from the two branches. This time it receives input from the recursive part only and outputs columns/rows as shown in the step 3.
  8. Index Spool (Lazy Spool) adds the output from concatenation operator to the worktable and then passes it to the SELECT operator. At this point the worktable contains the total of 4 rows: three rows from the anchor execution and one from the first recursion. Following the clustered index structure of the worktable, the new row is stored at the end of the worktable
  9. The process now resumes from step 6. The table spool operator removes previously read rows (the first three rows) from the worktable and reads the last inserted row, the fourth row.


CTE(Common table expressions) is a type of table expressions available in Sql Server. A CTE is an independent table expression that can be named and referenced once or more in the main query.
One of the most important uses of CTEs is to write recursive queries. Recursive CTEs always follow the same structure – anchor query, UNION ALL multi-set operator, recursive member and the statement that invokes recursion. Recursive CTE is declarative recursion and as such has different properties than its imperative counterpart e.g declarative recursion termination check is of implicit nature – the recursion process stops when there are no rows returned in the previous cte.


Thanks for reading.

Dean Mincic

Bookmark lookup tipping point

Bookmark lookup critical point


It is common for production environments to have queries – query plans, that use non-covered, non-clustered indexes in combination with a bookmark(Key or RID) lookup operator. The combination of  the physical operators is one way how query optimiser can use a non-covered index to provide information required by a query. However, sometimes, for the same query,  query optimiser decides to scan the whole (cluster or heap) table instead. This drastic change in the plan shape may have negative impact on our query performance.
This post attempts to explain the mechanism behind QO decision on when to switch between the two plan shapes. The concept is known as The Tipping Point and represents the point at which the number of page reads required by the bookmark lookup operator exceeds a certain point at which a clustered index/heap table scan becomes less expensive than the non-clustered index seek.

How bookmark lookup works

Before diving into the tipping point specifics, it would be good to understand how bookmark lookup operator works in combination with a non-clustered , non covered index. Bookmark lookup (Key or RID)  is a physical operator used to find data rows in the base table(cluster or heap) using a clustered index key(Key lookup) or row locator(RID lookup).
Lets create a test environment we can use throughout the blog.

Create test environment

Create a test table

Insert 100,000 rows

The test objects properties that are interesting for our experiment:

  • Unique clustered index on EmployeeId.
  • Unique, non-clustered index on the SearchValue column.
  • SearchValue column contains unique, ever increasing integer values. The  values match EmployeeId values.
  • The row size is exactly 500bytes. 493bytes is used by the five fixed length columns + 7bytes row overhead. More on this can be found here.

Key lookup scenario

The query below returns 500 rows (all columns) for a range predicate.

Note: Traceflag 652 . The traceflag disables page pre-fetching scans (read-ahead). Disabling the storage engine feature will help us to reconcile the number of I/O operations reported by STATISTICS IO with the number of rows selected by the query. More on the trace flag later in the blog.

Analyse key lookup query plan

The figure below consolidates three sets of information related to our test query – a graphical execution plan shape, basic properties of the two physical operators and the number of IO reads performed on the test table.

Figure 1, Key lookup, index seek plan

We read the query plan as the following sequence of events.

  • Query optimiser chose a key lookup/non-clustered index seek routine to implement query request.
  • Nested Loop operator requested, for its outer input, all valid rows(rows that are passed the seek predicate ..SearchValue BETWEEN 1000 AND 1499.. ) on NCI_SearchValue index. The index seek(index bTree traverse) was executed once resulting in 500  rows and two columns – SearchValue and EmployeeId. The latter  also acts as a pointer to the full size rows stored in the clustered index leaf level.
  • Nested Loop operator requested, through its inner input, the rest of the columns selected by the query – Name, Surname and Note. The search(key lookup operator), was executed 500 times, once per row in the outer input returning a  new set of 500 rows – a row per key search. Each execution traversed the clustered index bTree structure using EmployeeId as a seek predicate, in order to pin-point the qualifying rows.
  • For each key lookup search, Nested Loop operator combined the two outputs, the SearchValue and EmployeeId from the outer input with the Name, Surname and Note from the inner input forming the shape of the final result set.

The next thing we need to understand is the relationship between the number of I/O reads required to implement the above routine and the number of rows processed in the process.

Figure 1 shows that the number of I/O reads required for the operation was 1503 logical reads.  A logical read is a situation when Sql Server processes an 8Kb page from a RAM memory segment called buffer cache. If the requested page is not already in the buffer cache,  storage engine needs to perform a physical read operation in order to get the same 8Kb page from the disk.
The properties of the two physical operators(NCI seek and Key lookup) shows that the system read 500 rows from the non-clustered structure,in one go and performed 500 operations on the clustered index, returning a row per operation.

Now we need to dive a little bit deeper into Sql Server’s storage protocol in order to find all physical pages that were processed during the operations. Query below gives us high level overview of the index structures used in the test. The non-clustered index bTree has two levels and total of 175 pages whereas clustered index bTree has three levels and the total of 6273 pages.

Figure 3, the total number of pages per index 

The next query gives us a detailed view of the index pages across bTree levels – The Doubly Linked List data structure.

Finally, the following query gives us a sneak-peek of the actual rows  on the index/data pages.

Data access pattern

The following diagram represents data access pattern used by the key lookup routine.

Figure 2, Nonclustered index & key lookup internals

Data access pattern analysis

Our next goal is to reconcile the total number of logical I/O reads previously reported by the STATISTICS IO – (1503) with the diagram above.
Lets find out the total number of I/O read operations required to generate one row(out of a total of 500 rows). The following sequence of events is derived from Figure 2.

  1. Nested loop operator requests all eligible rows from the index seek operator. The non-clustered index is a non-covered index and can provide only SearchValue and EmployeeId columns. The index seek operator use the two seek predicates to find the range of values, SearchValue >=1000 AND SearchValue <=1499.
  2. Non-clustered index traverse. Index seek operator reads the Root page(PageId=1:16448) of the non-clustered index making the first I/O read (NoOfReads = 1).
  3. Index seek operation, following the lower boundary(SearchValue = 1000) of the search range, finds a pointer(PageId) which points to the index leaf level page which contains the full(two columns) index row. (ROOT PAGE: SearchValue = 597, PageId = 1:16456). It also knows that the PageId=1:16456 alone cannot provide complete range of the requested values but only SearchValues >=1000 AND SearchValues <1157. It “remembers” the following pointer, PageId = 1:16457 which can provide the rest of the values, SearchValue >=1157 and SearchValue <= 1499.
  4. Index seek operator performs the second I/O read(PageId =1:16457) following Path (A).  The total number of I/O reads is now (NoOfReads = 2).
  5. After storing first 165 rows found on PageId =1:16456 in memory, the operator follows Path (B). The operation is known as “partial index scan“. The operator knows that the subsequent page(PageId=1:16457) contains the rest of the requested rows(335 rows). The current page also has pointers to previous and next page(doubly linked list). The Path (B) makes the third read (NoOfReads = 3).
  6. Nested loop operator received all 500 requested rows from its outer input , the NCI index seek operator.
  7. Nested Loop operator now performs the Key Lookup operation over clustered index, 500 times, once per row collected from the outer input.
  8. Clustered index traverse (singleton search). On its very first execution, the key lookup operator uses the first row from the Nested Loop outer input, (EmployeeId = 1000) and performs its first page read(PageId = 1:2680). The page is the root level of clustered index bTree. The operation follows Path(C) increasing the total number of reads (NoOfReads = 4).
  9. Clustered root page provides a pointer(PageId = 1:832) which points to the first index intermediate level. The page maps all EnployeeIds between NULL and less then 4305. The row with EmployeeId=1000 falls into the range. Following Path(D) the operator makes its second read and increases the total number of reads ( NoOfReads = 5)
  10. Intermediate page 1:832 provides information about a leaf level page(PageId=1:1029) that contains the first full row.
  11. The process now follows Path(E) and make its final, third clustered index read ( NoOfReads = 6)
  12. The full row is then passed from the Nested Loop operator to the SELECT operator.
  13. Nested loop operator continue to progresses through the list of 500 rows from its outer input repeating steps 8 – 11 until all 500 rows have been processed.

The total number of reads is
Total No Of Reads = Index Seek operation (3 reads) + 500 * Key Lookup operation (3 reads) = 3 + 500 * 3 = 1503.
This is an exact match with the number of logical reads reported by STATISTICS IO.

Important observations

From the storage level perspective, one of the main differences between the two access patterns is that the Clustered index seek(partial scan) is a sequential read I/O operation, whereas Key lookup(singleton clustered index seek) is a random read I/O operation. Sequential reads are generally less expensive (less mechanical movements on the disk level) than the random reads. This is also true for RAM/SSD although they don’t have moving parts. This a very high level observation on data storage systems. 🙂

The number of random reads depends on size of row. The wider the row the more random reads key lookup needs to perform to get the same result-set. More about this later in the post.

Read ahead optimisation

Earlier in the blog I used TRACEFLAG 652 to disable the page pre-fetching aka read ahead functionality. Read ahead is an asynchronous I/O mechanism build to overcome the gap between CPU and I/O performances. Because CPU is many times faster than any storage system, Sql Server’s storage engine tries to read up to 64 sequential pages(8 extents) before they are requested by a query. This provides more logical reads, but on the other hand is faster than performing physical reads only when required. Although the number of read-ahead pages may vary, the mechanism reads pages from the index intermediate level(s) in order to identify the leaf level pages to be read in advance. In our case, the functionality would, if not turned off, “added a few” extra pages to the STATISTICS IO report and we wouldn’t be able to reconcile the reads with the diagram in Figure 2.
Try to run the same test query without turning on the traceflag. The number of logical reads should be greater than 1503.

The tipping point

The tipping point is the point which represents the critical number of rows after which query optimiser decides to perform cluster index scan instead non-clustered/key lookup data access pattern.
As previously shown, in non-clustered index/key lookup scenario, the number of rows requested by a query relates to the number of random reads. The main concern when determining the tipping point is actually the number of data pages that needs to be “randomly” read from clustered index leaf level – a read per each row requested . In our example this number is 500. The approach excludes the clustered index leaf level page reads and the non-cluster reads all together.
The number of pages that represents the tipping point is between 1/4 and 1/3 of clustered index data pages(leaf level). If a query requests less than 1/4 (25%) of the number of clustered index leaf level pages, query optimiser is most likely to allow random reads and non-clustered index/key lookup data access pattern. However, if a query requests more than 1/3(33%) pages, query optimiser will implement a clustered index scan data access pattern.
Figure 3, The tipping point

Knowing that a random read corresponds to a selected row, we can express the tipping point as a number of rows;

                      1/4(no of data pages) <= Tipping point (rows)  => 1/3(no of data pages)

In our example, the tipping point is expected to be somewhere between 1562 and 2083 rows.
So where exactly is our tipping point?
One approach is to apply binary search algorithm to the tipping point interval and perform trial and error approach until the query plan changes.
The other approach is to construct some kind of a program to do that for you 🙂

The script runs the same query for different SearchValue values. The values are within the expected tipping point range, starting from the lower boundary. Query result is suppressed  by FMTONLY ON session setting. OPTION(RECOMPILE)* in this context ensures that the value of local variable @start is known to Query Optimiser when creating execution plan for the query*. For each query run, program checks the current query plan RelOp element. If the element’s attribute @PhysicalOp has value set to ‘Clustered Index Scan‘ the program terminates and selects the current SearchValue value. The value represents the Tipping point we are looking for.
Figure 4, The exact Tipping point number of rows

Note:  Instead using OPTION(RECOMPILE) we could use Dynamic string execution.

The approach constructs and optimise the query during run-time. In this case, local variable @start gets evaluated and is treated as a literal within the dynamic string. It is most likely that the query plan will not be parameterised and the individual plans(one per execution) will be cached as Adhoc plans. This may lead to the Plan pollution situation, but this is a topic for a separate blog 🙂

Lets check the tipping point number of rows.

Figure 5, Query plan change

In terms of the number of pages, tipping point is expected in the range from 25%33% of the total number of clustered index data pages(leaf level). In our example, the range was between 1562 and 2083 pages.
From the number of rows point of view, tipping point was 1677 rows which is  (1677 /100000)*100 = ~1.7% of the total number of rows in the table(clustered index leaf level). This means that Sql Server is very conservative when to use bookmark lookup data access pattern, although the percentage of rows depends on the row size and probably other conditions i.e memory pressure, parallel query execution ..etc.

Tipping point & row size

As mentioned above, the number of random reads in non-clustered/key lookup access pattern depends on the size of a row. The wider the row the more random reads key lookup needs to perform to get the same result-set.

Lets increase our test table row size from 500(493bytes + 7bytes overhead) to 1000bytes. The new rows size will expand the table footprint.

The number of clustered index data pages required to store 100000 rows is now doubled, 12502 to be precise (Figure 3 query). The Tipping point is now expected to be in the range from (1/4) * 12502 and (1/3) * 12502 or 3125 and 4167 rows.

If we run query (Figure 4), we’ll get the exact tipping point , 3228 rows.

Figure 6, Query plan change (row size 1000bytes)

The interesting thing here is that now, with the wider rows, the tipping point represents (3228/100000)*100 = ~3.2% of the total number of rows in the table which is almost double than 1.7% calculated for the 500byte rows.

The tipping point experiments can be put in the context of a cached plan. If we wrap our test query into a stored proceudre and make local variable @start  to be stored proc’s input parameter, on the very first sp call query optimiser will create and cache query plan using the value passed

the query execution plan will be crated using the


The concept known as The Tipping Point represents the point at which the number of page reads required by the bookmark lookup operator exceeds a certain point at which a clustered index/heap table scan becomes less expensive than the non-clustered index seek. In this context, a bookmark operator(Key or RID) is coupled with a non-clustered , non-covered index – Index Seek operator. The latter performs sequential I/O operations whereas the first performs a number of Random Access I/O read operations. Random I/O reads are generally more expensive than sequential I/O read operations regardless of the storage system (mechanical HDD, SSD, RAM ..etc). Query optimiser allows bookmark lookup/index seek data access pattern only if the number of clustered index pages needed to be randomly accessed does not exceed 1/4 of the total number of clustered index data pages(leaf level). If the number of pages exceeds 1/3 of the total number of the clustered index data pages, Query optimiser will choose Clustered index scan data access instead. This is also true for Rid Lookup/Table scan access pattern when table is a heap.
The range of data pages between 1/4(25%) and 1/3(33.3%) of the total data pages defines The Tipping Point space. In this scenario, the number of randomly accessed pages relates to the total number of the selected rows. However,  25% – 33% of pages represents only a fraction of the total number of rows – for 500byte row size, between 1.6% and 2%. The range also depends on the row size. For the same number of rows and with the row size set to 1000bytes, the range increases to 3% – 4% of  the total number of rows.

I wish to thank to my dear colleague and a great SQL Server enthusiast Jesin Jayachandran for inspiring me to write this blog.

Thanks for reading.

Dean Mincic

Conditional branching and OPTION(Recompile)

Conditional branching, OPTION(Recompile) and procedure plan


There are many cases when programmers use conditional branching in tsql code to execute different queries or similar queries with different predicates based on a certain condition. In this post I’ll try to explain how Query optimiser handle queries in different conditional branches and how it relates to the option(recompile) hint and the procedure plan.

Conditional branching in stored procedures

Our TSQL code may implement logic which use conditional branching to decide what business rule to apply. Take for example a simple, non-production process that selects all orders and their details associated with a productId. If the product is not included in any of the  sales Orders, the code returns nothing or a warning message.

Create test data

The script below creates a sample table with the following ProductId data distribution.

Figure 1, ProductId data distribution 

The figure above reads as follows i.e
ProductId = 0 participates in 100 Orders. The number of orders makes 0.1% of all orders. The same applies for ProductId 100,200,300 …4900, or 50 different ProductIds.
ProductId=40000 participates in 20,000 orders. The number of orders makes 20% of all orders. The same applies for ProductId 60000 and 80000, or 3 different ProductIds.

The script used to check data distribution …

Test stored procedure

Experiment 1
Proc. plan is generated for all branch paths

The first experiment shows that QO (query optimser) builds query plans for all code branches regardless of which one is executed on the very first sproc call. This is expected since procedure plan( a set of query execution plans) is generated and cached without knowing which code path will be chosen.

Execute stored proc without passing @ProductID parameter value. The parameter is an optional param and will have default value NULL.

GO 100 is to ensure that the plan stays in memory for some time. This is not needed for server level environments.
The 2nd query selects the procedure’s cashed plan(this is a set of estimated plans – no run-time values 🙂 ). In this example, the complete proc plan has two main query plans branched from T-SQL(COND WITH QUERY) operator.

Figure 2, Proc plan contains query plans for all code branches

The cached procedure plan shows that the second branch query plan is optimised by using the same parameter value (ProductId = NULL).

Note: The estimated number of rows is 1. Because the initial, NULL value is not represented by a RANGE_HI_KEY, Sql Server will use AVG_RANGE_ROWS value (stats histogram) instead. If we used i.e ProductId =2008 instead of NULL, the estimated number of rows would be 100 – use DBCC SHOW_STATISTICS('dbo.TestBranchPlans' ,'NCI_ProductId') to observe this behavior. 

The stored procedure call did not return any rows and the execution plan was built for @ProductionId = NULL. All subsequent calls may have sub-optimal plans as presented below.

FIgure 2, Plan stability problem (Parameter sniffing problem)

*Accuracy[%] = (No of Actual Rows /  No. Of estimated rows ) * 100
This feature is available in SSMS 18.x+

Accurracy = 100%  – Ideal case, The estimated number of rows was spot on
Accurracy <100% – Overestimate. The estimated number of rows is higher than the actual no. of rows
Accurracy >100% – Underestimate . The estimated number of rows is lower than the actual number of rows.

Figure 2 shows negative effect of the cached, sub-optimal procedure plan, on the subsequent procedure calls.

Unstable procedure plan

Previous experiment showed how Sql Server builds query plans for all code paths without knowing which one will be executed. Query optimiser use the value passed into @ProductID parameter to create query plans for all queries in the batch that references it. In the test we called stored procedure without passing @ProductId, The absence of the value instructed the code to use parameter’s optional value, @ParameterId = NULL. As a consequence, QO used an atypical value to build and cache a sub-optimal query plan, which then had a negative impact on all subsequent procedure calls.
But the effect could be the same even if we passed any value to @ProductId.
Figure1 shows that the values in ProductId column are not evenly distributed (which is usually true for the production systems 🙂 ). However, most of the ProductIds, 76% (50 out of 66 different ProductIds) returns the same number of rows(100 rows). There is 15% ProductIds (10 out of 66) that returns 500rows and only 3% ProductIds (3 out of 66) that returns 10,000 and 20,000 rows.

Lets say that our procedure call pattern assumes similar probability of passing “small”, more selective*(returns only 100 rows)  and “big”, less selective(returns 20,000 rows) ProductId values.

*Selectivity represents uniqueness of values in a column. High selectivity = high uniqueness = low number of matching values. Selectivity = (rows that pass the predicate / total rows in the table). Selectivity[ProductId=200] = 100 / 100,000 =0.001(high selectivity)  and [ProductId = 40000] = 20000/100000 = 0.2 (low selectivity)

A cached procedure plan compiled for a “small” ProductId has a negative impact on the procedure executions with a “big” ProductId and vice versa.

Figure 3, Cached plan for a small ProductId

We would get the similar results if we passed a “big” ProductId first, and then made a procedure call passing a “small” value to the parameter, only this time the cached procedure plan would work in a favor of the “big” parameter.
This situation is known as “parameter sniffing” problem. The result is an unstable procedure plan.
One of a several different ways to resolve the problem is to instruct query processor to recompile the statement in question, on every procedure execution.


OPTION(RECOMPILE) is a statement level command that instructs query processor to pause batch execution, discard any stored query plans for the query, build a new plan, only now using the run-time values (parameters, local variables..), perform “constant folding” with passed in parameters.

Experiment 2
Conditional branching and OPTION(RECOMPILE)

In this experiment I’ll use OPTION(RECOMPILE) to stabilise the procedure plan. Lets repeat the last test, but this time we instruct query processor to recompile statement in question

Figure 4, Stable procedure plan with Option(Recompile)

Note: Using OPTION(recompile) to stabilise the plan comes with a certain cost. It adds a small overhead to the query compile time, can have some impact on CPU. It is a good idea, if possible, to re-write/decouple stored proc in order to prevent high variations in the procedure plans.
Personally, I’ve witnessed great results with the option(recompile) in the frequently executed stored procedures with the plan stability problem.

Where is my procedure plan?

In the next test we’ll run our stored procedure with the option(recompile), with a parameter value that does not match any existing ProductId value. This call will execute the first code branch and exit the batch.

Figure 5, Incomplete procedure plan 

So, now we need to answer question “why we are not getting our cached procedure plan (cached_plan is NULL)”.    🙂

Deferred Query compilation

When a client app executes stored procedure for the first time, query processor may decide not to create query plans for each individual query. This behavior is called Deferred Query Compilation. Some of the possible reasons are related to the conditional branching.
If the first call does not execute a code branch that contains at least one option(recompile) statement – there may be more than one statement in a code branch, the query plans for the branch will not be generated and cached. This makes procedure plan, as a set of individual query plans, incomplete.
Dynamic management function sys.dm_exec_query_plan(plan_handle) returns all individual query plans for the entire batch. If one or more query plans are not generated, the function returns NULL. This makes sense since we do not have complete procedure plan.
The following test demonstrate this behavior.

1. Create a new test stored proc dbo.TestCodeBranching1

2. Run the script below.

Figure 6, Individual query plans

The sequence of events as follows:

  • The first branch got executed during the very first stored procedure call.
  • Query processor finds an Option(recompile) statement within  the second code branch and decides not to create execution plans for any of the queries in the code path.
  • Dynamic management fn, sys.dm_exec_query_plan(plan_handle) did not return cached procedure plan because the plan was incomplete.
  • Dynamic management function sys.dm_exec_text_query_plan(plan_handle, statement_start_offset, statement_end_offset) returned all available single query plans within the batch. In our case we have only one cached query plan. The plan belongs to the code path that was executed on the first call.

How the sys.dm_exec_text_query_plan query works?

The query collects data from the following objects:
sys.dm_exec_query_stats – gets various statistical information for cached query plans. We use sql_handle column (uniquely identifies the batch or stored procedure that the query is part of), plan_handle (uniquely identifies query execution plan for the queries that belongs to the batch), statement_start_offset, statement_end_offset ( define, in bytes, the starting and and ending position of a query within batch)
sys.dm_exec_sql_text – gets the text of the batch of the queries identified by sql_handle.  It also provides info about proc name, database etc..
-sys.dm_exec_text_query_plan  – returns execution plan for a batch of statements or for specific statement(s) in the batch. statement_start_offset(0 beginning of the batch) and statement_end_offset (-1 end of the batch) define the start and end position of the query within the batch defined by plan_handle.


Conditional branching as an imperative construct in TSQL has a specific treatment by Sql Server’s query processor. A procedure batch with conditional branching may be optimised and cached for all code paths regardless of which branch is executed on the first procedure call. This may produce sub-optimal plans for a number of queries within the non-executed code branches. It is important to be aware of the behavior in order to prevent potential sub-optimal query executions.
Sql Server may choose not to compile statements(deferred query compilation) within the non-executed code paths if at least one of the queries within the code paths has the OPTION(recompile) hint – this is also true for temp tables. This will make the procedure plan (as a set of query plans)  incomplete, hence sys.dm_exec_query_plan function returns NULL for the plan handle. However, queries from the executed code branch will be cached and the query plans will be available through sys.dm_exec_text_query_plan.

Thanks for reading.

Dean Mincic

Statistics used in the cached execution plans

Statistics used in the cached execution plans – Stored Procedures


Query optimisation process sometimes requires understanding on how Sql Server’s Query engine compiles, re-compiles and executes sql batches. Some of the most important elements used by Query optimiser when constructing a good plan are the “Interesting statistics”. These are statistical information used by Query optimiser  when constructing a good enough query execution plan.
This blog attempts to explain what are the “interesting statistics”, when they are updated and how the statistical information relates to the query recompilation process. The topic is related to Temporary tables statistics when used in stored procedures.

Batch compilation and recompilation

To begin with, let’s analyse the batch compilation/recompilation diagram (By Arun Marathe, Jul 2004, Batch Compilation, Recompilation and Plan Caching Issues in Sql Server 2005). The idea is to create a set of  experiments that will capture the behavior of a stored procedure  through the different phases of the query compilation/recompilation process, particularly those related to the statistics that are used to generate the execution plan.

Figure 1, Batch Compilation/Recompilation diagram

I’ve used AdwentureWorks database to set up the test environment and MS Profiler to capture various Events relevant for the experiments.

MS Profiler events

    1. Attention (Errors and Warnings)
    2. Auto Stats (Performance)
    3. SP:CacheHit (Stored Procedures)
    4. SP:CacheInsert  (Stored Procedures)
    5. SP:CacheMiss  (Stored Procedures)
    6. SP:CacheRemove  (Stored Procedures)
    7. SP:Completed  (Stored Procedures)
    8. SP:Recompile  (Stored Procedures)
    9. SP:Starting  (Stored Procedures)
    10. RPC: Starting (Stored Procedures)*
    11. RPC:Completed (Stored Procedures)*
    12. SP:StmtStarting  (Stored Procedures)
  1. SQL:StmtRecompile (TSQL)
  2. SQL:StmtStarting  (TSQL)

Database Objects
Set AdventureWorks DB compatibility level to 140 – Sql Server 2017. The version provides easy access to the information about the interesting statistics saved with the query plan (SSMS – SELECT Plan Operator, Properties,OptimizerStatsUsage).

Below is the set of Sql Server object definitions used for the testing.

Information about the statistics/indexes on the tables can be retrieved using the queries below.

The following examples assume the default settings for the Sql  Server’s options related to the statistics:

A bit of theory first before proceeding with the tests. : )


colmodctr is an ever increasing counter that tracks the changes made on tables (a counter per column excluding the non-persistent computed columns). colmodctr is not transactionally consistent which means that is not affected by the rolled back changes i.e if a transaction inserts 10 rows in a table and then roll-back, the counter will still report 10 changes.
Sql Server Statistics (automatically/manually created/updated) on a column(s) will store the snapshot value of the colmodctr for the leftmost column in the stats-blob.
The counter is a very important since it’s one of the elements that drives the query recompilation decisions related to the statistics changed reasons. 

colmodctr counter can be accessed through the following system views.

Figure 2, colmodctr, system views – standard and hidden

One way  to access the hidden tables is to; Open a separate SSMS instance, close the object explorer, create a single connection using Server name: i.e ADMIN:(local)
NOTE: The structure of the hidden tables and the tables’ accessibility is not documented and may be changed in the future versions.

Recompile thresholds (RT)

RT concept defines the number of changes on a table column needed to be done in order to indicate the statistical information of that column as stale. 
The changes includes the column values changes through the DML operations such as INSERT, UPDATE, DELETE… i.e Inserting 10 new rows in a table is considered as 10 changes(identified by the colmodctr counters mentioned before).
If the table does not have statistical information i. e HEAP table with no indexes and no manually created statistics, and the query plans that references the table does not load/automatically create interesting statistics, the only relevant change when performing the RT crossing test will be the change in the number of rows inserted and/or deleted.

colmodctr(current) – colmodctr(snapshot) |  >= RT


 | cardinality(current) – cardinality(snapshot) |  >= RT

current     – refers to the current value of the modification counter
snapshot – refers to the value of the mod. counter captured during the last plan compilation(recopilation).
cardinality* – the number of rows in the table.

*cardinality has different meaning in the different contexts: Cardinality may represent the uniqueness of data values in a particular column – the lower the cardinality the more duplicated values in the column.
Cardinality is also a way to define the relationship between two entities in a data model. It is also known as the degree of relationship i 1-1, 1-m, m-n.

The Threshold Crossing  Test evaluates to TRUE if the number of changes is greater than the predefined RT value (see Figure 3)

Recompilation thresholds(RT) for all the tables referenced in the query are stored along with the query plan.

RT depends on the table type(permanent vs temporary) and the number of rows in the table.

Figure 3, Recompile thresholds

Special case. RT = 1 if the table has 0 rows (with or without statistics)

NOTE: Starting from SQL Server 2008 R2 SP1, Microsoft introduced TF2371. The trace flag activates the dynamic recompile threshold calculation. The higher number of rows in a table, the lower the RT. The functionality is implemented to allow automatic statistics updates to kick off more frequently for the big tables. i.e RT for a 10,000 row table is 500 + 0.20*10,000 = 2,500 – the number of changes required to trigger query recompile. For a table with 100M rows, the RT is 20,000,500. For some applications the RT may be too high, resulting in the sub-optimal plans due to the lack of query recompilation. Hence the TF2371.
Starting from SQL Server 2016, the TF2371 is turned on by default.

Here is a couple of examples to illustrate Figure3.
If there is a table A that contains 230 rows, RT for the table will be set to 500. This means that if we i.e insert 500 rows, the total number of rows (c)  will change to 730 (c>=230+500) which is enough changes to make the table’s statistics stale.
The change itself does not mean much if there are no queries that references the table : )
The query plans may or may not initiate the auto-statistics creation on the specific table columns. Also, the referenced tables may not have any statistical information i.e HEAP table with no non-clustered indexes.


Experiment 1 (stats change before query execution)

In this experiment we will make “enough” changes to the ListPrice column (dbo.Products table) BEFORE running the stored procedure for the first time, 
The column is a key column in NCI_Products_ListPrice, the non-clustered index and has statistical information attached to it (the stats object name is the same as the NCI)

Lets begin the experiment by creating the test objects and checking the statistical information on the tables.

Step 1, Check the initial stats/rowmodctr information

Figure 4, Initial rowmodctr information

Step 2, Check stats BLOB and make changes on dbo.Products table

Run the DBCC  command below before and after the UPDATE to confirm that there were no changes in the stats BLOB information.

NOTE: rowmodctr is not transactionally consistent.

Figure 5, stats BLOB information

Figure 6, rowmodctr after the initial dbo.Products update

The changes are detected and available through Sql Server’s metadata.

Step 3, Run the stored procedure and observe the captured events by the Profiler.

Figure 7, Statistics refresh

Following the batch compilation diagram we can identify the following steps.

  1. Cache Lookup step resulted in the SP:CasheMiss event. dbo.TestQueryExecution stored proc. does not exist in the cache.
  2. Query Compilation Begins. SQL Server engine is about to load all of the  “interesting statistics”. The loaded statistics can be retrieved from the Actual Execution Plan, the SELECT physical  operator – OptimiserStatsUsage property.
  3. Query engine checks if any of the loaded  interesting statistics are stale. If yes, the system stops the batch compilation process and refreshes the statistics. In our case the system has 
    • Identified the number of changes made on the ListPrice column. From the stats/index information gathered after the initial update, the number of changes (rowmodctr/Modifications) is 610
    • Performed RT crossing test.  The test passed since the number of changes(610) exceeded the RT for tables with the number of rows greater than 500. RT = 500 + 0.20 * 504 ~ 601, 601 < 610
    • Executed StatMan, an internal process which automatically maintains statistics. The process updated the stale statistics NCI_Products_ListPrice on dbo.Product table

      If we check the stats blob from the Step 2, we will see that the Updated column changed its value to the current date – the stats blob has been updated.
      The AutoStats event reported the UPDATE of the statistics with EventSubClass = 1 – Other. More on the event can be found here.

  4. Query Optimiser starts to generate the query plan – a plan for each query statement.
    • The second query in the batch has a predicate on the Name column of the dbo.Products table. In an attempt to make better cardinality estimates on the rows that needs to be processed, Query Optimiser decided to automatically create statistical information on the column.
      The system stops the batch compilation process and again executes the StatsMan process to create the new statistics.

      After creating the stats, QO decided not to use it  : (
      Below is the list of the “interesting statistics” loaded during the Query compilation process. The list does not include automatically created stats on the Name column.

      As a result of the updated statistics on the ListPrice column , the rowmodctr for the column was reset. 

    • QO sets the new recompilation thresholds(RT) for all tables used in the queries.
      1. RT(dbo. SalesOrderDetail) = 500 + 0.20(121317) =24763.4 (~24764)
      2. RT(dbo.Products) = 500 + 0.20(504)= 600.8(~601)
        This meas that QO will initiate query recompile due to “Statistics changed” reason if
        1. dbo. SalesOrderDetail
          1. 24764 or more inserted/deleted rows
          2. 24764 or more changes on: SalesOrderDetailID, ProductID columns
        2. dbo.Products
          1. 601 or more inserted rows
          2. 601 or more changes on: ProductID, ListPrice, Name columns
  5. The query execution starts. The query plans are constructed and cached. SP:CacheInsert event reported that the stored procedure has been cached.

Experiment 2 (stats change during the query execution)

In this experiment we will make “enough” changes to the Name column (dbo.Products table) HALFWAY THROUGH the stored procedure execution.

Step 1 Set up the environment

  • Run the script to reset the test environment
  • Add a WAITFOR statement between the two queries in the stored procedure
  • Use PowerShell to execute the stored procedure. Add HostName property. Use the HostName to capture only the events related to the PS call. This will prevent MS Profiler from capturing events related to the UPDATE statement that will run in parallel.
  • Add an ApplicationName filter to the Profiler trace (ApplicationName LIKE experiment)

Step 2, Run the PowerShell cmdlet, switch to SSMS and run the UPDATE queries below. The queries will generate enough changes to make the automatically created statistics on the Name column stale.

Step 3. Analyse the captured MS Profiler trace.
Figure 8, Query recompile

  • The first thing that is different from the last run is the SP:CacheHit event. The event shows that our stored procedure was found in the Plan cache. The previously set RTs and the interesting statistics are part of the cached information.
    NOTE: Auto created statistic on the Name column was not used during the initial query compilation – the stats are not part of the interesting stats.
  • This time there were no changes on the columns that would initiate statistics updates, no new auto created stats and the existing cached query plan does not need to be recompiled due to “statistic changed” reasons. The process proceeds with the query execution.
  • The first query is successfully executed following by the  WAITFOR statement. During the statement execution (6 seconds delay) a separate query has made enough changes on the Name column(dbo.Products) to pass the RT crossing test for the table and flag the auto created statistics on the column as stale. Even if not used by QO during the plan generation, the stats are marked as stale.
  • (1) The query execution stops at the  “Any stats stale?”  step . The System initiates the query recompile process – SP: Recompile due to 2 – Statistics changed reason. The event is followed by the statement level SQL:StmtRecompile event which indicates that only the second query needs to be recompiled.
  • (2) Again, the StatsMan process kicks in and updates the stale statistics. The RTs are set (in this case the number of row  has not changed ,hence the RTs stayed the same).Rowmodctr value for the Name column is reset. to 0 
  • (3) The AutoStats event reported Statistics Update  having EventSubClass = 1 – Other
  • (4) The SP:StmtStarting event reports that the second query has been recompiled and the batch execution continues.

Experiment 3 (tables with no stats on columns)

The experiment demonstrates how queries get recompiled when referencing tables with no statistics. The recompiles due to the “statistics changed” reasons are initiated by the RT-table cardinality crossing test results only.
As previously mentioned, the cardinality based RT crossing test is defined as

 | cardinality(current) – cardinality(snapshot) |  >= RT

Lets create a test table and a stored procedure to perform the above experiment.

Step 1, set up the test environment

Add some data to the table..

The initial statistical information looks like (find how to retrieve the metadata related to the statistical information at the beginning of the post)

Figure 9, rowmodctr with no statistical information

Step Run the stored proc for the first time. The RT is set to 500.

Step 3 Make enough changes to the table to pass the cardinality crossing test. Insert 500 rows. Do not use explicit transaction yet.

Step 3 Run the stored procedure again and observe the query execution behavior in Profiler.

Figure 10, Query recompile, table cardinality change – no stats

  • The new rowmodctr information looks like

    The new number of rows (rowcnt) is recorded along with the number of changes, rowmodctr=730. In this case the rowmodctr value is not relevant since the RT crossing test depends only on changes in the table cardinality. This will be more visible if we ROLLBACK the row insertion operation which will be covered later.
  • The second execution followed the “Cashe lookup = Success” path (see the batch compilation diagram) and failed to pass the very last step  “Any stats stale?“.
  • At this stage, the system has detected that the RT cardinality crossing test has passed due to the number of changes(new rows) inserted in the table.
  • The system stopped the execution process and  initiated the stored proc/statement recompile – SP:Recompile, SQL:StmtRecompile.  As in the previous examples, the reason for the recompile was 2 – Statistics changed.
    NOTE: The recompile process is not followed by the StatMan process since the query does not have any statsBlob information to be refreshed/created.

Experiment 3.1 (rowmodcnt not in use)

The next example shows that the RT cardinality crossing test is not related to rowmodctr as it may seem from the previous example where the number of changes followed table cardinality changes.

  • Follow the steps from the previous example.
  • Execute the INSERT query  from the Step 3 within an explicit transaction
  • Observe that there are no query recompiles due to “statistic change since there were no table cardinality changes – the ROLLBACK “canceled” row insertions.
  • The statistical information shows that the rowmodctr= 720.


Query compilation, execution and recompilation sequence among other steps includes; loading interesting statistics – the statistical information on different table columns that Query Optimiser may find useful when creating a good plan and auto-creating statistical information on the columns that participate in i.e WHERE filter, GROUP BY ..etc. 
Sql Server query engine also checks the validity of the loaded statistical information during the initial stored procedure compilation and again during the stored procedure execution phase. If the loaded statistics are found to be stale, the former pauses stored procedure compilation, refreshes(re-samples/refreshes) the loaded statistical information and continues compilation process. If Query engine detects stale loaded statistics during the execution phase,  the process stops, refreshes(re-samples/updates) statistics and restarts compilation process – query recompilation. The re-compiles are done per query not per batch.
The examples in this blog showed that the statistical information can be automatically maintained by the queries that use them. Statistics can be also maintained manually.
To mark statistics as “Stale”, QO uses the Recompile Threshold(RT) crossing test. The test tracks the number of changes on the significant(leftmost) columns within the statistic BLOBs. The information is stored in an ever-increasing, non transactionally consistent counter – “rowmodctr”.  The RTs are stored per table and within the compiled query.
The RT crossing test can be based only on the changes in the number of rows in a table.


Thanks for reading.

Dean Mincic

Temporary tables statistics when used in stored procedures

Temporary tables statistics when used in stored procedures


Sometimes when we design solutions which implements complex business rules we tend to use temporary objects, temporary tables in particular. Decoupling complex queries into the smaller “intermediate” results may help optimiser to come up with a better plan since it needs to resolve simpler queries. It can also make code more readable and maintainable.  This approach, of course, needs to be carefully planned since the excessive use of temporary objects may degrade query performances, deviate from set based design principles and do more damage than good. Some of the common patterns “When to break down big queries” can be found here.
Depending on a task, we might decide to use temporary tables over temp variables. This may be due to the different scope and/or due to the fact that temp tables are more “robust” and supports statistics.

This blog explores specific case scenarios that include temp tables used in stored procedures and the unique behavior of related statistical information that can lead to suboptimal query plans.

Sample data

Let’s create a couple of test tables and a stored procedure that we’ll use through the article.
Platform: Microsoft SQL Server 2017 (RTM) Developer Ed.

The code above is a “complex query” that we decided to decouple by using a temp table. First, we store all relevant products in the temp table and then join the table with a big table in order to calculate the required aggregates. This is one of the common temp tables use cases. We expect our stored procedure to be called with widely varying parameters i.e For all products names that start with ‘B'(4 products) we expect 2000 matching rows that needs to be aggregated (500 orders per product), and for those products whose name starts with ‘A’ (100 products), we expect 25x more -50000.
Since temp tables support statistics and our DB is set up to perform auto create statistics/auto update stats, we also expect to get the good plans for the varying input parameters. In other words we expect query plan to be recompiled if needed – In addition, we also expect query to recompile due to the correctness related reasons (the temp table is created and destroyed every time we run the query)

Test scenario

Run the code from Session1.

Figure 1, The plan created for the first input parameter

As we can see, the plan is good. The temp table Carnality estimates are good, Nested Loop physical operator estimates are aligned with the actual no of rows and the Sort operator was granted enough memory to perform distinct sort.
Now, say another client runs the same stored proc in the context of a different session with a different parameter.
Run the code from Session2

Figure 2, The original plan did not change for the new parameter

Session2 Client gets a horrible plan. Starting from the temp table carnality, everything seems to be off. Nested Loop physical operator estimates are now totally wrong, and the Sort operator has no enough granted memory to perform sort operation and needs to spill 231 pages(~2MB)  to the disk.
If we execute the same stored proc in the context of  a session passing @StartsWith parameter value N’F’, we’ll get the same plan but again with wrong estimates and even worse Sort operator spills.


From the previous example we can see that the first passed parameter dictates the query plan structure. The plan “never changes” for the consecutive calls. This looks like parameter sniffing problem, but our query predicates do not include any parameters.
To investigate and explain this behavior let’s start by monitoring the following parameters:

  • Query compilation/recompilation  – Using MS Profiler
  • Temp table’s automatically created statistics – Using database console command SHOW_STATISTICS.

Set up Profiler
Include the following events to Profiler’s trace

Figure 3, Profiler – events and properties

Add temp table statistic tracking to the query

Start trace and run the stored procedure again.
Note: Altering stored procedure automatically removes existing, cached batch.

The query result shows auto created statistics on the temp table’s ProductName column.
Figure 4, First parameter statistics

The plan created is the same as on Figure1

The Profiler trace shows the following sequence of events:

Figure 5, Profiler -The first plan compilation

This plan is exactly what we expected to see.
The unit of compilation and caching is a batch – in this case the content of our stored procedure. During batch compilation the individual statements are compiled one after another.

Short detour: What happens when we create a stored proc? After hitting F5(executing CREATE PROCEDURE …), Sql Server parses the procedure definition for syntactical accuracy. If the statements are syntactically correct, the text of the stored procedure is stored in an internal table accessible through  sys.sql_modules system view. At this stage, it is possible to reference non-existing objects within the batch i.e it is possible to include query that inserts data in a non-existing table, or to select data from a non existing function. Later, when we execute stored procedure for the first time, the process known as “Deferred Name Resolution” will check the names of the referenced objects and consequently initiate  Recompile of the query segment which references the objects.

Follow the steps 1- 6 on the trace above.

  1. (not captured by the trace). The execution environment has  performed the initial batch compilation. Each query within batch is compiled separately(Starting from SQL Server Yukon :). The initial compilation was incomplete and only skeleton plan was created – CREATE TABLE query is the only one fully compiled. The second, INSERT query and the following SELECT query could not be compiled since the temp table did not exist before the very first execution. The skipped compilations are deferred until the execution time.
  2. Batch execution started and the first query was executed (a brand new temp table was created)
  3. The second query (INSERT INTO..#Temp) has to be recompiled due to Correctness related reasons. The EventSubClass property describes the cause of Recompilations
        • A new object(temp table) has been created. The query recompile has been initiated.
        • Query Optimiser loads all “interesting” statistics that can help building a good plan. The statistic belongs to the tables included in the query. If needed, QO will pause the process to automatically create additional statistics on the columns that can help building the plan*. For this query QO does not need to auto-create stats.
  4. The INSERT INTO …#Temp.. query is now recompiled and executed.
  5. The third query(the one we are analysing) has to be recompiled due to the similar reasons as previous query. After loading all “interesting” stats – which belongs to dbo.Transactions, QO decided to stop the execution process and to auto-create statistics on #temp table columns: ProductId and ProductName
  6. StatMan – an internal process, creates auto-stats on ProductId column(#temp table). ProductId column is used in  JOIN predicate.
  7. StatMan creates auto-stats on ProductName column(#temp table). ProductName column is used for the grouping. The stat header and histogram are selected in the query output.
    Auto stat name convention: _WA_Sys_prefix. * In our example:  _WA_Sys_00000002_AA7DA731,  00000002 – Column ordinal position, B8CC2078 – Hexadecimal #Temp table ObjectID. SELECT CAST(0xAA7DA731 as INTEGER) /* Object_id = -1434605775 */
  8. The SELECT query is now recompiled and executed.
  9. (Batch execution successfully completed)

*NOTE: _WA – Stands for Waterloo, Canada(Sybase, the Father of Sql Server was developed in Waterloo) . This is mentioned in 70-462.

Clear Profiler’s trace and execute the same stored procedure, this time with parameter  ‘A’

Figure 6, Stale statistics

Strangely enough, the statistical information (header and histogram) are totally wrong. i.e Table carnality should be 100, not 10 and the histogram steps should show product names that starts with ‘A’ not ‘C’.

Figure 6 – Profiler, No query recompiles

The trace shows no recompiles due to schema changes(CREATE TABLE #Temp) and no stat. updates. From the optimality point of view, Query engine executed the first query(input param ‘C’). The result set is correct, but it seems that everything else is wrong.  The first question is why we are missing recompiles due to schema changes (a brand new temp table was created and populated with the new values). The second question is why the auto-created statistics on the temp table were never updated. Is this why our query never gets recompiled due to plan optimality related reasons.

Temporary tables caching

Temp table caching mechanism may explain the lack of the schema related recompiles in the previous example. This may also explain the stale statistics.
The caching feature was introduced in SQL Server 2005. This functionality provides the caching of temporary objects(temp tables, table variables and TVFs) across repeated calls of routines(Stored procedures, triggers and UDFs).
In-short, When a stored procedure execution ends, SQL Server truncates* and renames the table, keeping only one IAM and one data page. The structure will be used by the subsequent calls instead of allocating new pages from scratch when the object is created again.

NOTE: For the temp objects smaller than 8MB, the truncation happens immediately after module execution ends. For the larger temp objects, SQL Serer performs “deferred drop” and immediately returns control to the application.

The caching mechanism works for the temp tables that is created by using CREATE TABLE or SELECT INTO statements. The caching is not possible when there is an explicit DDL on a temp table after it is created i.e ALTER #table ADD Constraint, CREATE STATISTICS** on table colums or there is a named constraint on a temp table(this is generally a bad idea since it can lead to constraint name conflict). Temporary tables are also not cached if they are part of dynamic SQL or ad-hoc batch.

NOTE: Statistics created using an explicit CREATE STATISTICS statement are not linked to a cached temporary object. Auto-Created statistics(like the ones presented in the example) are linked to a cached temp object. UPDATE STATISTICS does not prevent temp table caching.

The following query shows that there is one cached temp objects currently not in use

We track temp table name by adding  SELECT OBJECT_ID('tempdb.dbo.#temp')  in the example stored proc. This will show that temp table object_id never changes – an internal process renames temp table name to hexadecimal form at the end of the stored procedure. This would happen even if we explicitly dropped the table.

If a stored procedure is executed concurrently, multiple separate cached objects may be created in tempdb.  There is a cached temp object per execution context.

This feature explains why our plan has not been recompiled due to schema changes related reasons. But what about statistics? As previously mentioned, it is expected for query processor to be able to detect the statistical changes that can affect the cached plan validity and to adapt to it. But it didn’t t happen, why?

Recompilation Reasons

There are many different reasons that can cause a query to be recompiled(EventSubClass Figure 5) . The list of all possible recompilation reasons (SP:Recompile event) can be retrieved using the query below.

For the purpose of this analysis, we are interested only in Schema changed and Statistic changed reasons. Temp table caching can explain the absence of the former reason.
Updating a statistics (both manual and auto-update) may cause Optimality(data) related recompilation of the plans that uses these statistics.

RT – Recompile thresholds

SQL Server query processor is designed to adapt the cached query plans to the data changes. The changes are tracked by statistical information(histograms) linked to the “interesting” table columns. Statistical information changes may cause query recompilation due to plan optimality-related reasons.
During the query optimisation phase, Query processor identifies and loads the “interesting” statistics. This helps QP to create good enough plans. In our case, QP loads the following, existing statistics.

  • NCI_Name (Statistics on dbo.Products Name column crated as a part of the NCI index on the column)
  • NCI_ProductId(Statistics on dbo.Transactions ProductId column also created with the NCI index on the column)
  • (Auto Created)#temp table auto-created statistics on #temp.ProductName and #temp.ProductId

Use the “undocumented”  trace flag to find out the interesting statistics used in our example:

The table contents are constantly changing (INSERT, UPDATE, DELETE, MERGE). SQL Server query processor tracks those changes in order to estimate their impact  on the existing cached plans.
For each “interesting statistics” loaded, Query processor creates a snapshot of a counter that counts the number of table modifications. Generally, every table referenced in a query will have such counters loaded.
The table contents are tracked:

  • Directly     – using table cardinality metadata information
  • Indirectly – using statistics on table columns.
 Each column has a RT (recompilation threshold) associated with it. RT is a function of table cardinality.

 RT = f(n) , n -number of rows in a table (table cardinality)

Essentially, RT for a table defines the frequency with which queries that refer to the table recompile.

During the first query compilation, for each interesting statistics, a snapshot value of a counter(colmodctr) that counts number of table modification is stored – a counter per column that participate in the interesting statistics.

Along with colmodctr snapshot values, Query processor sets the Recompile thresholds for every colmodctr created.
Before using an existing cached plan, Query optimiser performs The threshold crossing test defined by the formula below.

 | colmodctr(current) – colmodctr(snapshot) |  >= RT

current     – refers to the current value of the modification counter
snapshot – refers to the value of the mod. counter captured during the last plan compilation(recopilation).
If the threshold crossing test evaluates to TRUE (the number of changes exceeds  the pre-defined RT), for any of the interesting statistics, the query plan is recompiled.
The threshold crossing test will be performed even if query processor does not load any interesting statistics. In that case the test will be based simply on table cardinalities.

 | cardinality(current) – cardinality(snapshot) |  >= RT

The following table shows how RT is calculated for permanent and temporary tables

Temporary Tables
Table cardinality Recompile Threshold Passing RT test caridnality
n<6 6 >= (n+6)
6 <= n <= 500 500 >= (n+500)
n > 500 500 +0.20*n >= (n +500 +0.20*n)
Permanent tables
Table cardinality Recompile Threshold Passing RT test caridnality
6 <= n <= 500 500 >= (n+500)
n > 500 500 +0.20*n >= (n +500 +0.20*n)


colmodctr is the name of a counter which counts number of modifications that a table has undergone. The counter is per column and is non transnational (if we insert 10 rows and rollback the transaction, colmodctr value will increase by 20). The counter is also ever increasing.

So far, Temp table caching can explain why our query does not get recompiled due to schema changes. Now, let’s find out why our query use stale statistics and why it does not perform optimality-related query recompilation.

Automatically created statistics are cashed along with the cached temp tables. In this case,  auto-created statistics are NOT reset at the end of the stored procedure execution cycle nor at the beginning of a consecutive stored proc. execution.
This is why statistics on a cached temporary table may belong to one of the previous stored procedure calls and are absolutely not related to the context of the current temp table.
If we used a “normal”(dbo.Temp) table instead of #temp table, the table would be dropped and recreated on each stored procedure call. Consequently, auto-created statistics would be repeatedly dropped and created which would make them always up-to date.

The modification counters which count changes on the interesting columns, in our example columns: ProductName, and ProductId will not cause query recompiles even if the number of the changes passes the RT threshold test. This behavior is unique to the temp tables(Cached or non cached*) when used within stored procedures.

Note*: To prevent temp table caching, perform a DDL operation on the newly created temp table. e.g

Err 2667: The statistics will be internally present, but not available through sp_helpstats, DBCC SHOW_STATISTICS and tempdb.sys.stats. My assumption would be that the auto-crated stats. BLOB entry gets “disconnected” from temp table’s object_id -> For non-cached temp tables, the object_id changes on every sp run.
What is also interesting with the non-cached table is that it does not force our query to  be recompiled due to schema change reasons as we might expected.

Back to the original example, to be able to track the modification counters, expand dbo.testTempTableStats stored procedure with the dynamic query below.

Now execute the query with the parameter ‘C’

Figure 7, colmodctr – modification counters

On the first query execution, query optimiser sets up the RT values for the two auto-created columns. The Recompilation thresholds are based on temp table’s cardinality (NoOfRows = 10). Based on previously mentioned formula ( 6 < n < 500, RT = 500 ), in order to pass RT crossing test we’ll need at least 510(500+10) changes on ether of the two columns to initiate query recomilation. This means that, if we execute stored proc 6 times passing parameter ‘A‘ (populates temp table with 100 rows), we will reach 6*100 = 600 changes which is more than we need to pass the test.

The 6th execution shows the following results, but no expected recompile.

Figure 8, colmodctr – modification counters do not initiate query recompiles

The table below shows colmodctr counter’s progress:

Figure 9, modification counter progress

As we can see, the query was not recompiled due to the optimality reasons. Non-cached temp tables will show the same behavior. The colmodctr simply does not work for temp tables within stored procedures.
If we repeat the experiment, but this time we pass parameter ‘B'(param ‘B’ results in 4 rows in the temp table) for the very first stored procedure call, we will get the similar output as on Figure 7. RT value is now set to 6 ,  1<n<6, RT = 6, which means that we need at least 10(6+4) changes on the interesting temp table columns to initiate query recompilation.
If we execute the same stored procedure(with param ‘B’), say two more times, the modification counters will show 16 changes and, as expected, with no recompile.
But if we execute stored procedure with a parameter that generates 10 or more rows in the temp table, i.e ‘C’  or ‘E’ (10 and 27 rows) our query will recompile.