Category Archives: SQL Server Development

Bookmark lookup tipping point

Bookmark lookup critical point


Summary

It is common for production environments to have queries – query plans, that use non-covered, non-clustered indexes in combination with a bookmark(Key or RID) lookup operator. The combination of  the physical operators is one way how query optimiser can use a non-covered index to provide information required by a query. However, sometimes, for the same query,  query optimiser decides to scan the whole (cluster or heap) table instead. This drastic change in the plan shape may have negative impact on our query performance.
This post attempts to explain the mechanism behind QO decision on when to switch between the two plan shapes. The concept is known as The Tipping Point and represents the point at which the number of page reads required by the bookmark lookup operator exceeds a certain point at which a clustered index/heap table scan becomes less expensive than the non-clustered index seek.

How bookmark lookup works

Before diving into the tipping point specifics, it would be good to understand how bookmark lookup operator works in combination with a non-clustered , non covered index. Bookmark lookup (Key or RID)  is a physical operator used to find data rows in the base table(cluster or heap) using a clustered index key(Key lookup) or row locator(RID lookup).
Lets create a test environment we can use throughout the blog.

Create test environment

Create a test table

Insert 100,000 rows

The test objects properties that are interesting for our experiment:

  • Unique clustered index on EmployeeId.
  • Unique, non-clustered index on the SearchValue column.
  • SearchValue column contains unique, ever increasing integer values. The  values match EmployeeId values.
  • The row size is exactly 500bytes. 493bytes is used by the five fixed length columns + 7bytes row overhead. More on this can be found here.

Key lookup scenario

The query below returns 500 rows (all columns) for a range predicate.

Note: Traceflag 652 . The traceflag disables page pre-fetching scans (read-ahead). Disabling the storage engine feature will help us to reconcile the number of I/O operations reported by STATISTICS IO with the number of rows selected by the query. More on the trace flag later in the blog.

Analyse key lookup query plan

The figure below consolidates three sets of information related to our test query – a graphical execution plan shape, basic properties of the two physical operators and the number of IO reads performed on the test table.


Figure 1, Key lookup, index seek plan

We read the query plan as the following sequence of events.

  • Query optimiser chose a key lookup/non-clustered index seek routine to implement query request.
  • Nested Loop operator requested, for its outer input, all valid rows(rows that are passed the seek predicate ..SearchValue BETWEEN 1000 AND 1499.. ) on NCI_SearchValue index. The index seek(index bTree traverse) was executed once resulting in 500  rows and two columns – SearchValue and EmployeeId. The latter  also acts as a pointer to the full size rows stored in the clustered index leaf level.
  • Nested Loop operator requested, through its inner input, the rest of the columns selected by the query – Name, Surname and Note. The search(key lookup operator), was executed 500 times, once per row in the outer input returning a  new set of 500 rows – a row per key search. Each execution traversed the clustered index bTree structure using EmployeeId as a seek predicate, in order to pin-point the qualifying rows.
  • For each key lookup search, Nested Loop operator combined the two outputs, the SearchValue and EmployeeId from the outer input with the Name, Surname and Note from the inner input forming the shape of the final result set.

The next thing we need to understand is the relationship between the number of I/O reads required to implement the above routine and the number of rows processed in the process.

Figure 1 shows that the number of I/O reads required for the operation was 1503 logical reads.  A logical read is a situation when Sql Server processes an 8Kb page from a RAM memory segment called buffer cache. If the requested page is not already in the buffer cache,  storage engine needs to perform a physical read operation in order to get the same 8Kb page from the disk.
The properties of the two physical operators(NCI seek and Key lookup) shows that the system read 500 rows from the non-clustered structure,in one go and performed 500 operations on the clustered index, returning a row per operation.

Now we need to dive a little bit deeper into Sql Server’s storage protocol in order to find all physical pages that were processed during the operations. Query below gives us high level overview of the index structures used in the test. The non-clustered index bTree has two levels and total of 175 pages whereas clustered index bTree has three levels and the total of 6273 pages.

Figure 3, the total number of pages per index 

The next query gives us a detailed view of the index pages across bTree levels – The Doubly Linked List data structure.

Finally, the following query gives us a sneak-peek of the actual rows  on the index/data pages.

Data access pattern

The following diagram represents data access pattern used by the key lookup routine.


Figure 2, Nonclustered index & key lookup internals

Data access pattern analysis

Our next goal is to reconcile the total number of logical I/O reads previously reported by the STATISTICS IO – (1503) with the diagram above.
Lets find out the total number of I/O read operations required to generate one row(out of a total of 500 rows). The following sequence of events is derived from Figure 2.

  1. Nested loop operator requests all eligible rows from the index seek operator. The non-clustered index is a non-covered index and can provide only SearchValue and EmployeeId columns. The index seek operator use the two seek predicates to find the range of values, SearchValue >=1000 AND SearchValue <=1499.
  2. Non-clustered index traverse. Index seek operator reads the Root page(PageId=1:16448) of the non-clustered index making the first I/O read (NoOfReads = 1).
  3. Index seek operation, following the lower boundary(SearchValue = 1000) of the search range, finds a pointer(PageId) which points to the index leaf level page which contains the full(two columns) index row. (ROOT PAGE: SearchValue = 597, PageId = 1:16456). It also knows that the PageId=1:16456 alone cannot provide complete range of the requested values but only SearchValues >=1000 AND SearchValues <1157. It “remembers” the following pointer, PageId = 1:16457 which can provide the rest of the values, SearchValue >=1157 and SearchValue <= 1499.
  4. Index seek operator performs the second I/O read(PageId =1:16457) following Path (A).  The total number of I/O reads is now (NoOfReads = 2).
  5. After storing first 165 rows found on PageId =1:16456 in memory, the operator follows Path (B). The operation is known as “partial index scan“. The operator knows that the subsequent page(PageId=1:16457) contains the rest of the requested rows(335 rows). The current page also has pointers to previous and next page(doubly linked list). The Path (B) makes the third read (NoOfReads = 3).
  6. Nested loop operator received all 500 requested rows from its outer input , the NCI index seek operator.
  7. Nested Loop operator now performs the Key Lookup operation over clustered index, 500 times, once per row collected from the outer input.
  8. Clustered index traverse (singleton search). On its very first execution, the key lookup operator uses the first row from the Nested Loop outer input, (EmployeeId = 1000) and performs its first page read(PageId = 1:2680). The page is the root level of clustered index bTree. The operation follows Path(C) increasing the total number of reads (NoOfReads = 4).
  9. Clustered root page provides a pointer(PageId = 1:832) which points to the first index intermediate level. The page maps all EnployeeIds between NULL and less then 4305. The row with EmployeeId=1000 falls into the range. Following Path(D) the operator makes its second read and increases the total number of reads ( NoOfReads = 5)
  10. Intermediate page 1:832 provides information about a leaf level page(PageId=1:1029) that contains the first full row.
  11. The process now follows Path(E) and make its final, third clustered index read ( NoOfReads = 6)
  12. The full row is then passed from the Nested Loop operator to the SELECT operator.
  13. Nested loop operator continue to progresses through the list of 500 rows from its outer input repeating steps 8 – 11 until all 500 rows have been processed.

The total number of reads is
Total No Of Reads = Index Seek operation (3 reads) + 500 * Key Lookup operation (3 reads) = 3 + 500 * 3 = 1503.
This is an exact match with the number of logical reads reported by STATISTICS IO.

Important observations

From the storage level perspective, one of the main differences between the two access patterns is that the Clustered index seek(partial scan) is a sequential read I/O operation, whereas Key lookup(singleton clustered index seek) is a random read I/O operation. Sequential reads are generally less expensive (less mechanical movements on the disk level) than the random reads. This is also true for RAM/SSD although they don’t have moving parts. This a very high level observation on data storage systems. 🙂

The number of random reads depends on size of row. The wider the row the more random reads key lookup needs to perform to get the same result-set. More about this later in the post.

Read ahead optimisation

Earlier in the blog I used TRACEFLAG 652 to disable the page pre-fetching aka read ahead functionality. Read ahead is an asynchronous I/O mechanism build to overcome the gap between CPU and I/O performances. Because CPU is many times faster than any storage system, Sql Server’s storage engine tries to read up to 64 sequential pages(8 extents) before they are requested by a query. This provides more logical reads, but on the other hand is faster than performing physical reads only when required. Although the number of read-ahead pages may vary, the mechanism reads pages from the index intermediate level(s) in order to identify the leaf level pages to be read in advance. In our case, the functionality would, if not turned off, “add a few” extra pages to the STATISTICS IO report and we wouldn’t be able to reconcile the reads with the diagram on Figure 2.
Try to run the same test query without turning on the traceflag. The number of logical reads should be greater than 1503.

The tipping point

The tipping point is the point which represents the critical number of rows after which query optimiser decides to perform cluster index scan instead non-clustered/key lookup data access pattern.
As previously shown, in non-clustered index/key lookup scenario, the number of rows requested by a query relates to the number of random reads. The main concern when determining the tipping point is actually the number of data pages that needs to be “randomly” read from clustered index leaf level – a read per each row requested . In our example this number is 500. The approach excludes the clustered index leaf level page reads and the non-cluster reads all together.
The number of pages that represents the tipping point is between 1/4 and 1/3 of clustered index data pages(leaf level). If a query requests less than 1/4 (25%) of the number of clustered index leaf level pages, query optimiser is most likely to allow random reads and non-clustered index/key lookup data access pattern. However, if a query requests more than 1/3(33%) pages, query optimiser will implement a clustered index scan data access pattern.
Figure 3, The tipping point

Knowing that a random read corresponds to a selected row, we can express the tipping point as a number of rows;

                      1/4(no of data pages) <= Tipping point (rows)  => 1/3(no of data pages)

In our example, the tipping point is expected to be somewhere between 1562 and 2083 rows.
So where exactly is our tipping point?
One approach is to apply binary search algorithm to the tipping point interval and perform trial and error approach until the query plan changes.
The other approach is to construct some kind of a program to do that for you 🙂

The script runs the same query for different SearchValue values. The values are within the expected tipping point range, starting from the lower boundary. Query result is suppressed  by FMTONLY ON session setting. OPTION(RECOMPILE)* in this context ensures that the value of local variable @start is known to Query Optimiser when creating execution plan for the query*. For each query run, program checks the current query plan RelOp element. If the element’s attribute @PhysicalOp has value set to ‘Clustered Index Scan‘ the program terminates and selects the current SearchValue value. The value represents the Tipping point we are looking for.
Figure 4, The exact Tipping point number of rows

Note:  Instead using OPTION(RECOMPILE) we could use Dynamic string execution.

The approach constructs and optimise the query during run-time. In this case, local variable @start gets evaluated and is treated as a literal within the dynamic string. It is most likely that the query plan will not be parameterised and the individual plans(one per execution) will be cached as Adhoc plans. This may lead to the Plan pollution situation, but this is a topic for a separate blog 🙂

Lets check the tipping point number of rows.

Figure 5, Query plan change

In terms of the number of pages, tipping point is expected in the range from 25%33% of the total number of clustered index data pages(leaf level). In our example, the range was between 1562 and 2083 pages.
From the number of rows point of view, tipping point was 1677 rows which is  (1677 /100000)*100 = ~1.7% of the total number of rows in the table(clustered index leaf level). This means that Sql Server is very conservative when to use bookmark lookup data access pattern, although the percentage of rows depends on the row size and probably other conditions i.e memory pressure, parallel query execution ..etc.

Tipping point & row size

As mentioned above, the number of random reads in non-clustered/key lookup access pattern depends on the size of a row. The wider the row the more random reads key lookup needs to perform to get the same result-set.

Lets increase our test table row size from 500(493bytes + 7bytes overhead) to 1000bytes. The new rows size will expand the table footprint.

The number of clustered index data pages required to store 100000 rows is now doubled, 12502 to be precise (Figure 3 query). The Tipping point is now expected to be in the range from (1/4) * 12502 and (1/3) * 12502 or 3125 and 4167 rows.

If we run query (Figure 4), we’ll get the exact tipping point , 3228 rows.


Figure 6, Query plan change (row size 1000bytes)

The interesting thing here is that now, with the wider rows, the tipping point represents (3228/100000)*100 = ~3.2% of the total number of rows in the table which is almost double than 1.7% calculated for the 500byte rows.

The tipping point experiments can be put in the context of a cached plan. If we wrap our test query into a stored proceudre and make local variable @start  to be stored proc’s input parameter, on the very first sp call query optimiser will create and cache query plan using the value passed

the query execution plan will be crated using the

Conclusion

The concept known as The Tipping Point represents the point at which the number of page reads required by the bookmark lookup operator exceeds a certain point at which a clustered index/heap table scan becomes less expensive than the non-clustered index seek. In this context, a bookmark operator(Key or RID) is coupled with a non-clustered , non-covered index – Index Seek operator. The latter performs sequential I/O operations whereas the first performs a number of Random Access I/O read operations. Random I/O reads are generally more expensive than sequential I/O read operations regardless of the storage system (mechanical HDD, SSD, RAM ..etc). Query optimiser allows bookmark lookup/index seek data access pattern only if the number of clustered index pages needed to be randomly accessed does not exceed 1/4 of the total number of clustered index data pages(leaf level). If the number of pages exceeds 1/3 of the total number of the clustered index data pages, Query optimiser will choose Clustered index scan data access instead. This is also true for Rid Lookup/Table scan access pattern when table is a heap.
The range of data pages between 1/4(25%) and 1/3(33.3%) of the total data pages defines The Tipping Point space. In this scenario, the number of randomly accessed pages relates to the total number of the selected rows. However,  25% – 33% of pages represents only a fraction of the total number of rows – for 500byte row size, between 1.6% and 2%. The range also depends on the row size. For the same number of rows and with the row size set to 1000bytes, the range increases to 3% – 4% of  the total number of rows.

I wish to thank to my dear colleague and a great SQL Server enthusiast Jesin Jayachandran for inspiring me to write this blog.

Thanks for reading.

Dean Mincic

Conditional branching and OPTION(Recompile)

Conditional branching, OPTION(Recompile) and procedure plan


Summary

There are many cases when programmers use conditional branching in tsql code to execute different queries or similar queries with different predicates based on a certain condition. In this post I’ll try to explain how Query optimiser handle queries in different conditional branches and how it relates to the option(recompile) hint and the procedure plan.

Conditional branching in stored procedures

Our Tsql code may implement logic which use conditional branching to decide what business rule to apply. Take for example a simple, non-production process that selects the order details associated with a productId. If the product is not included in any of the  sales Orders, return nothing – or a warning message.

Create test data

The script below creates a sample table with the following ProductId data distribution.


Figure 1, ProductId data distribution 

The figure above reads as follows i.e
ProductId = 0 participates in 100 Orders. The number of orders makes 0.1% of all orders. The same applies for ProductId 100,200,300 …4900, or 50 different ProductIds.
ProductId=40000 participates in 10,000 orders. The number of orders makes 20% of all orders. The same applies for ProductId 60000 and 80000, or 3 different ProductIds.

The script used to check data distribution …

Test stored procedure

Experiment 1
Proc. plan is generated for all branch paths

The first experiment shows that QO (query optimser) builds query plans for all code branches regardless of which one is executed on the very first sproc call. This is expected since procedure plan( a set of query execution plans) is generated and cached without knowing which code path will be chosen.

Execute stored proc without passing @ProductID parameter value. The parameter is an optional param and will have default value NULL.

Notes:
GO 100 is to ensure that the plan stays in memory for some time. This is not needed for server level environments.
The 2nd query selects the procedure’s cashed plan(this is a set of estimated plans – no run-time values 🙂 ). In this example, the complete proc plan has two main query plans branched from T-SQL(COND WITH QUERY) operator.


Figure 2, Proc plan contains query plans for all code branches

The cached procedure plan shows that the second branch query plan is optimised by using the same parameter value (ProductId = NULL).

Note: The estimated number of rows is 1. Because the initial, NULL value is not represented by a RANGE_HI_KEY, Sql Server will use AVG_RANGE_ROWS value (stats histogram) instead. If we used i.e ProductId =2008 instead of NULL, the estimated number of rows would be 100 – use DBCC SHOW_STATISTICS('dbo.TestBranchPlans' ,'NCI_ProductId') to observe this behavior. 

The stored procedure call did not return any rows and the execution plan was built for @ProductionId = NULL. All subsequent calls may have sub-optimal plans as presented below.

FIgure 2, Plan stability problem (Parameter sniffing problem)

*Accuracy[%] = (No of Actual Rows /  No. Of estimated rows ) * 100
This feature is available in SSMS 18.x+

Accurracy = 100%  – Ideal case, The estimated number of rows was spot on
Accurracy <100% – Overestimate. The estimated number of rows is higher than the actual no. of rows
Accurracy >100% – Underestimate . The estimated number of rows is lower than the actual number of rows.

Figure 2 shows negative effect of the cached, sub-optimal procedure plan, on the subsequent procedure calls.

Unstable procedure plan

Previous experiment showed how Sql Server builds query plans for all code paths without knowing which one will be executed. Query optimiser use the value passed into @ProductID parameter to create query plans for all queries in the batch that references it. In the test we called stored procedure without passing @ProductId, The absence of the value instructed the code to use parameter’s optional value, @ParameterId = NULL. As a consequence, QO used an atypical value to build and cache a sub-optimal query plan, which then had a negative impact on all subsequent procedure calls.
But the effect could be the same even if we passed any value to @ProductId.
Figure1 shows that the values in ProductId column are not evenly distributed (which is usually true for the production systems 🙂 ). However, most of the ProductIds, 76% (50 out of 66 different ProductIds) returns the same number of rows(100 rows). There is 15% ProductIds (10 out of 66) that returns 500rows and only 3% ProductIds (3 out of 66) that returns 10,000 and 20,000 rows.

Lets say that our procedure call pattern assumes similar probability of passing “small”, more selective*(returns only 100 rows)  and “big”, less selective(returns 20,000 rows) ProductId values.

*Selectivity represents uniqueness of values in a column. High selectivity = high uniqueness = low number of matching values. Selectivity = (rows that pass the predicate / total rows in the table). Selectivity[ProductId=200] = 100 / 100,000 =0.001(high selectivity)  and [ProductId = 40000] = 20000/100000 = 0.2 (low selectivity)

A cached procedure plan compiled for a “small” ProductId has a negative impact on the procedure executions with a “big” ProductId and vice versa.


Figure 3, Cached plan for a small ProductId

We would get the similar results if we passed a “big” ProductId first, and then made a procedure call passing a “small” value to the parameter, only this time the cached procedure plan would work in a favor of the “big” parameter.
This situation is known as “parameter sniffing” problem. The result is an unstable procedure plan.
One of a several different ways to resolve the problem is to instruct query processor to recompile the statement in question, on every procedure execution.

OPTION(RECOMPILE)

OPTION(RECOMPILE) is a statement level command that instructs query processor to pause batch execution, discard any stored query plans for the query, build a new plan, only now using the run-time values (parameters, local variables..), perform “constant folding” with passed in parameters.

Experiment 2
Conditional branching and OPTION(RECOMPILE)

In this experiment I’ll use OPTION(RECOMPILE) to stabilise the procedure plan. Lets repeat the last test, but this time we instruct query processor to recompile statement in question


Figure 4, Stable procedure plan with Option(Recompile)

Note: Using OPTION(recompile) to stabilise the plan comes with a certain cost. It adds a small overhead to the query compile time, can have some impact on CPU. It is a good idea, if possible, to re-write/decouple stored proc in order to prevent high variations in the procedure plans.
Personally, I’ve witnessed great results with the option(recompile) in the frequently executed stored procedures with the plan stability problem.

Where is my procedure plan?

In the next test we’ll run our stored procedure with the option(recompile), with a parameter value that does not match any existing ProductId value. This call will execute the first code branch and exit the batch.


Figure 5, Incomplete procedure plan 

So, now we need to answer question “why we are not getting our cached procedure plan (cached_plan is NULL)”.    🙂

Deferred Query compilation

When a client app executes stored procedure for the first time, query processor may decide not to create query plans for each individual query. This behavior is called Deferred Query Compilation. Some of the possible reasons are related to the conditional branching.
If the first call does not execute a code branch that contains at least one option(recompile) statement – there may be more than one statement in a code branch, the query plans for the branch will not be generated and cached. This makes procedure plan, as a set of individual query plans, incomplete.
Dynamic management function sys.dm_exec_query_plan(plan_handle) returns all individual query plans for the entire batch. If one or more query plans are not generated, the function returns NULL. This makes sense since we do not have complete procedure plan.
The following test demonstrate this behavior.

1. Create a new test stored proc dbo.TestCodeBranching1

2. Run the script below.

Results
Figure 6, Individual query plans

The sequence of events as follows:

  • The first branch got executed during the very first stored procedure call.
  • Query processor finds an Option(recompile) statement within  the second code branch and decides not to create execution plans for any of the queries in the code path.
  • Dynamic management fn, sys.dm_exec_query_plan(plan_handle) did not return cached procedure plan because the plan was incomplete.
  • Dynamic management function sys.dm_exec_text_query_plan(plan_handle, statement_start_offset, statement_end_offset) returned all available single query plans within the batch. In our case we have only one cached query plan. The plan belongs to the code path that was executed on the first call.

How the sys.dm_exec_text_query_plan query works?

The query collects data from the following objects:
sys.dm_exec_query_stats – gets various statistical information for cached query plans. We use sql_handle column (uniquely identifies the batch or stored procedure that the query is part of), plan_handle (uniquely identifies query execution plan for the queries that belongs to the batch), statement_start_offset, statement_end_offset ( define, in bytes, the starting and and ending position of a query within batch)
sys.dm_exec_sql_text – gets the text of the batch of the queries identified by sql_handle.  It also provides info about proc name, database etc..
-sys.dm_exec_text_query_plan  – returns execution plan for a batch of statements or for specific statement(s) in the batch. statement_start_offset(0 beginning of the batch) and statement_end_offset (-1 end of the batch) define the start and end position of the query within the batch defined by plan_handle.

Conclusion

Conditional branching as an imperative construct in TSQL has a specific treatment by Sql Server’s query processor. A procedure batch with conditional branching may be optimised and cached for all code paths regardless of which branch is executed on the first procedure call. This may produce sub-optimal plans for a number of queries within the non-executed code branches. It is important to be aware of the behavior in order to prevent potential sub-optimal query executions.
Sql Server may choose not to compile statements(deferred query compilation) within the non-executed code paths if at least one of the queries within the code paths has the OPTION(recompile) hint – this is also true for temp tables. This will make the procedure plan (as a set of query plans)  incomplete, hence sys.dm_exec_query_plan function returns NULL for the plan handle. However, queries from the executed code branch will be cached and the query plans will be available through sys.dm_exec_text_query_plan.

Thanks for reading.

Dean Mincic

Statistics used in the cached execution plans

Statistics used in the cached execution plans – Stored Procedures


Summary

Query optimisation process sometimes requires understanding on how Sql Server’s Query engine compiles, re-compiles and executes sql batches. Some of the most important elements used by Query optimiser when constructing a good plan are the “Interesting statistics”. These are statistical information used by Query optimiser  when constructing a good enough query execution plan.
This blog attempts to explain what are the “interesting statistics”, when they are updated and how the statistical information relates to the query recompilation process. The topic is related to Temporary tables statistics when used in stored procedures.

Batch compilation and recompilation

To begin with, let’s analyse the batch compilation/recompilation diagram (By Arun Marathe, Jul 2004, Batch Compilation, Recompilation and Plan Caching Issues in Sql Server 2005). The idea is to create a set of  experiments that will capture the behavior of a stored procedure  through the different phases of the query compilation/recompilation process, particularly those related to the statistics that are used to generate the execution plan.


Figure 1, Batch Compilation/Recompilation diagram

I’ve used AdwentureWorks database to set up the test environment and MS Profiler to capture various Events relevant for the experiments.

MS Profiler events

    1. Attention (Errors and Warnings)
    2. Auto Stats (Performance)
    3. SP:CacheHit (Stored Procedures)
    4. SP:CacheInsert  (Stored Procedures)
    5. SP:CacheMiss  (Stored Procedures)
    6. SP:CacheRemove  (Stored Procedures)
    7. SP:Completed  (Stored Procedures)
    8. SP:Recompile  (Stored Procedures)
    9. SP:Starting  (Stored Procedures)
    10. RPC: Starting (Stored Procedures)*
    11. RPC:Completed (Stored Procedures)*
    12. SP:StmtStarting  (Stored Procedures)
  1. SQL:StmtRecompile (TSQL)
  2. SQL:StmtStarting  (TSQL)

Database Objects
Set AdventureWorks DB compatibility level to 140 – Sql Server 2017. The version provides easy access to the information about the interesting statistics saved with the query plan (SSMS – SELECT Plan Operator, Properties,OptimizerStatsUsage).

Below is the set of Sql Server object definitions used for the testing.

Information about the statistics/indexes on the tables can be retrieved using the queries below.

The following examples assume the default settings for the Sql  Server’s options related to the statistics:
 AUTO_CREATE_STATISTICS ON
– AUTO_UPDATE_STATISTICS ON
AUTO_UPDATE_STATISTICS_ASYNC OFF 

A bit of theory first before proceeding with the tests. : )

colmodctr

colmodctr is an ever increasing counter that tracks the changes made on tables (a counter per column excluding the non-persistent computed columns). colmodctr is not transactionally consistent which means that is not affected by the rolled back changes i.e if a transaction inserts 10 rows in a table and then roll-back, the counter will still report 10 changes.
Sql Server Statistics (automatically/manually created/updated) on a column(s) will store the snapshot value of the colmodctr for the leftmost column in the stats-blob.
The counter is a very important since it’s one of the elements that drives the query recompilation decisions related to the statistics changed reasons. 

colmodctr counter can be accessed through the following system views.


Figure 2, colmodctr, system views – standard and hidden

One way  to access the hidden tables is to; Open a separate SSMS instance, close the object explorer, create a single connection using Server name: i.e ADMIN:(local)
NOTE: The structure of the hidden tables and the tables’ accessibility is not documented and may be changed in the future versions.

Recompile thresholds (RT)

RT concept defines the number of changes on a table column needed to be done in order to indicate the statistical information of that column as stale. 
The changes includes the column values changes through the DML operations such as INSERT, UPDATE, DELETE… i.e Inserting 10 new rows in a table is considered as 10 changes(identified by the colmodctr counters mentioned before).
If the table does not have statistical information i. e HEAP table with no indexes and no manually created statistics, and the query plans that references the table does not load/automatically create interesting statistics, the only relevant change when performing the RT crossing test will be the change in the number of rows inserted and/or deleted.

colmodctr(current) – colmodctr(snapshot) |  >= RT

or

 | cardinality(current) – cardinality(snapshot) |  >= RT

current     – refers to the current value of the modification counter
snapshot – refers to the value of the mod. counter captured during the last plan compilation(recopilation).
cardinality* – the number of rows in the table.

*cardinality has different meaning in the different contexts: Cardinality may represent the uniqueness of data values in a particular column – the lower the cardinality the more duplicated values in the column.
Cardinality is also a way to define the relationship between two entities in a data model. It is also known as the degree of relationship i 1-1, 1-m, m-n.

The Threshold Crossing  Test evaluates to TRUE if the number of changes is greater than the predefined RT value (see Figure 3)

Recompilation thresholds(RT) for all the tables referenced in the query are stored along with the query plan.

RT depends on the table type(permanent vs temporary) and the number of rows in the table.


Figure 3, Recompile thresholds

Special case. RT = 1 if the table has 0 rows (with or without statistics)

NOTE: Starting from SQL Server 2008 R2 SP1, Microsoft introduced TF2371. The trace flag activates the dynamic recompile threshold calculation. The higher number of rows in a table, the lower the RT. The functionality is implemented to allow automatic statistics updates to kick off more frequently for the big tables. i.e RT for a 10,000 row table is 500 + 0.20*10,000 = 2,500 – the number of changes required to trigger query recompile. For a table with 100M rows, the RT is 20,000,500. For some applications the RT may be too high, resulting in the sub-optimal plans due to the lack of query recompilation. Hence the TF2371.
Starting from SQL Server 2016, the TF2371 is turned on by default.

Here is a couple of examples to illustrate Figure3.
If there is a table A that contains 230 rows, RT for the table will be set to 500. This means that if we i.e insert 500 rows, the total number of rows (c)  will change to 730 (c>=230+500) which is enough changes to make the table’s statistics stale.
The change itself does not mean much if there are no queries that references the table : )
The query plans may or may not initiate the auto-statistics creation on the specific table columns. Also, the referenced tables may not have any statistical information i.e HEAP table with no non-clustered indexes.

Experiments

Experiment 1 (stats change before query execution)

In this experiment we will make “enough” changes to the ListPrice column (dbo.Products table) BEFORE running the stored procedure for the first time, 
The column is a key column in NCI_Products_ListPrice, the non-clustered index and has statistical information attached to it (the stats object name is the same as the NCI)

Lets begin the experiment by creating the test objects and checking the statistical information on the tables.

Step 1, Check the initial stats/rowmodctr information

Figure 4, Initial rowmodctr information

Step 2, Check stats BLOB and make changes on dbo.Products table

Run the DBCC  command below before and after the UPDATE to confirm that there were no changes in the stats BLOB information.

NOTE: rowmodctr is not transactionally consistent.

Figure 5, stats BLOB information

Figure 6, rowmodctr after the initial dbo.Products update

The changes are detected and available through Sql Server’s metadata.

Step 3, Run the stored procedure and observe the captured events by the Profiler.

Figure 7, Statistics refresh

Following the batch compilation diagram we can identify the following steps.

  1. Cache Lookup step resulted in the SP:CasheMiss event. dbo.TestQueryExecution stored proc. does not exist in the cache.
  2. Query Compilation Begins. SQL Server engine is about to load all of the  “interesting statistics”. The loaded statistics can be retrieved from the Actual Execution Plan, the SELECT physical  operator – OptimiserStatsUsage property.
  3. Query engine checks if any of the loaded  interesting statistics are stale. If yes, the system stops the batch compilation process and refreshes the statistics. In our case the system has 
    • Identified the number of changes made on the ListPrice column. From the stats/index information gathered after the initial update, the number of changes (rowmodctr/Modifications) is 610
    • Performed RT crossing test.  The test passed since the number of changes(610) exceeded the RT for tables with the number of rows greater than 500. RT = 500 + 0.20 * 504 ~ 601, 601 < 610
    • Executed StatMan, an internal process which automatically maintains statistics. The process updated the stale statistics NCI_Products_ListPrice on dbo.Product table

      If we check the stats blob from the Step 2, we will see that the Updated column changed its value to the current date – the stats blob has been updated.
      The AutoStats event reported the UPDATE of the statistics with EventSubClass = 1 – Other. More on the event can be found here.

  4. Query Optimiser starts to generate the query plan – a plan for each query statement.
    • The second query in the batch has a predicate on the Name column of the dbo.Products table. In an attempt to make better cardinality estimates on the rows that needs to be processed, Query Optimiser decided to automatically create statistical information on the column.
      The system stops the batch compilation process and again executes the StatsMan process to create the new statistics.

      After creating the stats, QO decided not to use it  : (
      Below is the list of the “interesting statistics” loaded during the Query compilation process. The list does not include automatically created stats on the Name column.

      As a result of the updated statistics on the ListPrice column , the rowmodctr for the column was reset. 

    • QO sets the new recompilation thresholds(RT) for all tables used in the queries.
      1. RT(dbo. SalesOrderDetail) = 500 + 0.20(121317) =24763.4 (~24764)
      2. RT(dbo.Products) = 500 + 0.20(504)= 600.8(~601)
        This meas that QO will initiate query recompile due to “Statistics changed” reason if
        1. dbo. SalesOrderDetail
          1. 24764 or more inserted/deleted rows
          2. 24764 or more changes on: SalesOrderDetailID, ProductID columns
        2. dbo.Products
          1. 601 or more inserted rows
          2. 601 or more changes on: ProductID, ListPrice, Name columns
  5. The query execution starts. The query plans are constructed and cached. SP:CacheInsert event reported that the stored procedure has been cached.

Experiment 2 (stats change during the query execution)

In this experiment we will make “enough” changes to the Name column (dbo.Products table) HALFWAY THROUGH the stored procedure execution.

Step 1 Set up the environment

  • Run the script to reset the test environment
  • Add a WAITFOR statement between the two queries in the stored procedure
  • Use PowerShell to execute the stored procedure. Add HostName property. Use the HostName to capture only the events related to the PS call. This will prevent MS Profiler from capturing events related to the UPDATE statement that will run in parallel.
  • Add an ApplicationName filter to the Profiler trace (ApplicationName LIKE experiment)

Step 2, Run the PowerShell cmdlet, switch to SSMS and run the UPDATE queries below. The queries will generate enough changes to make the automatically created statistics on the Name column stale.

Step 3. Analyse the captured MS Profiler trace.
Figure 8, Query recompile

  • The first thing that is different from the last run is the SP:CacheHit event. The event shows that our stored procedure was found in the Plan cache. The previously set RTs and the interesting statistics are part of the cached information.
    NOTE: Auto created statistic on the Name column was not used during the initial query compilation – the stats are not part of the interesting stats.
  • This time there were no changes on the columns that would initiate statistics updates, no new auto created stats and the existing cached query plan does not need to be recompiled due to “statistic changed” reasons. The process proceeds with the query execution.
  • The first query is successfully executed following by the  WAITFOR statement. During the statement execution (6 seconds delay) a separate query has made enough changes on the Name column(dbo.Products) to pass the RT crossing test for the table and flag the auto created statistics on the column as stale. Even if not used by QO during the plan generation, the stats are marked as stale.
  • (1) The query execution stops at the  “Any stats stale?”  step . The System initiates the query recompile process – SP: Recompile due to 2 – Statistics changed reason. The event is followed by the statement level SQL:StmtRecompile event which indicates that only the second query needs to be recompiled.
  • (2) Again, the StatsMan process kicks in and updates the stale statistics. The RTs are set (in this case the number of row  has not changed ,hence the RTs stayed the same).Rowmodctr value for the Name column is reset. to 0 
  • (3) The AutoStats event reported Statistics Update  having EventSubClass = 1 – Other
  • (4) The SP:StmtStarting event reports that the second query has been recompiled and the batch execution continues.

Experiment 3 (tables with no stats on columns)

The experiment demonstrates how queries get recompiled when referencing tables with no statistics. The recompiles due to the “statistics changed” reasons are initiated by the RT-table cardinality crossing test results only.
As previously mentioned, the cardinality based RT crossing test is defined as

 | cardinality(current) – cardinality(snapshot) |  >= RT

Lets create a test table and a stored procedure to perform the above experiment.

Step 1, set up the test environment

Add some data to the table..

The initial statistical information looks like (find how to retrieve the metadata related to the statistical information at the beginning of the post)


Figure 9, rowmodctr with no statistical information

Step Run the stored proc for the first time. The RT is set to 500.

Step 3 Make enough changes to the table to pass the cardinality crossing test. Insert 500 rows. Do not use explicit transaction yet.

Step 3 Run the stored procedure again and observe the query execution behavior in Profiler.

Figure 10, Query recompile, table cardinality change – no stats

  • The new rowmodctr information looks like

    The new number of rows (rowcnt) is recorded along with the number of changes, rowmodctr=730. In this case the rowmodctr value is not relevant since the RT crossing test depends only on changes in the table cardinality. This will be more visible if we ROLLBACK the row insertion operation which will be covered later.
  • The second execution followed the “Cashe lookup = Success” path (see the batch compilation diagram) and failed to pass the very last step  “Any stats stale?“.
  • At this stage, the system has detected that the RT cardinality crossing test has passed due to the number of changes(new rows) inserted in the table.
  • The system stopped the execution process and  initiated the stored proc/statement recompile – SP:Recompile, SQL:StmtRecompile.  As in the previous examples, the reason for the recompile was 2 – Statistics changed.
    NOTE: The recompile process is not followed by the StatMan process since the query does not have any statsBlob information to be refreshed/created.

Experiment 3.1 (rowmodcnt not in use)

The next example shows that the RT cardinality crossing test is not related to rowmodctr as it may seem from the previous example where the number of changes followed table cardinality changes.

  • Follow the steps from the previous example.
  • Execute the INSERT query  from the Step 3 within an explicit transaction
  • Observe that there are no query recompiles due to “statistic change since there were no table cardinality changes – the ROLLBACK “canceled” row insertions.
  • The statistical information shows that the rowmodctr= 720.

Conclusion

Query compilation, execution and recompilation sequence among other steps includes; loading interesting statistics – the statistical information on different table columns that Query Optimiser may find useful when creating a good plan and auto-creating statistical information on the columns that participate in i.e WHERE filter, GROUP BY ..etc. 
Sql Server query engine also checks the validity of the loaded statistical information during the initial stored procedure compilation and again during the stored procedure execution phase. If the loaded statistics are found to be stale, the former pauses stored procedure compilation, refreshes(re-samples/refreshes) the loaded statistical information and continues compilation process. If Query engine detects stale loaded statistics during the execution phase,  the process stops, refreshes(re-samples/updates) statistics and restarts compilation process – query recompilation. The re-compiles are done per query not per batch.
The examples in this blog showed that the statistical information can be automatically maintained by the queries that use them. Statistics can be also maintained manually.
To mark statistics as “Stale”, QO uses the Recompile Threshold(RT) crossing test. The test tracks the number of changes on the significant(leftmost) columns within the statistic BLOBs. The information is stored in an ever-increasing, non transactionally consistent counter – “rowmodctr”.  The RTs are stored per table and within the compiled query.
The RT crossing test can be based only on the changes in the number of rows in a table.

 

Thanks for reading.

Dean Mincic

Temporary tables statistics when used in stored procedures

Temporary tables statistics when used in stored procedures


Summary

Sometimes when we design solutions which implements complex business rules we tend to use temporary objects, temporary tables in particular. Decoupling complex queries into the smaller “intermediate” results may help optimiser to come up with a better plan since it needs to resolve simpler queries. It can also make code more readable and maintainable.  This approach, of course, needs to be carefully planned since the excessive use of temporary objects may degrade query performances, deviate from set based design principles and do more damage than good. Some of the common patterns “When to break down big queries” can be found here.
Depending on a task, we might decide to use temporary tables over temp variables. This may be due to the different scope and/or due to the fact that temp tables are more “robust” and supports statistics.

This blog explores specific case scenarios that include temp tables used in stored procedures and the unique behavior of related statistical information that can lead to suboptimal query plans.

Sample data

Let’s create a couple of test tables and a stored procedure that we’ll use through the article.
Platform: Microsoft SQL Server 2017 (RTM) Developer Ed.

The code above is a “complex query” that we decided to decouple by using a temp table. First, we store all relevant products in the temp table and then join the table with a big table in order to calculate the required aggregates. This is one of the common temp tables use cases. We expect our stored procedure to be called with widely varying parameters i.e For all products names that start with ‘B'(4 products) we expect 2000 matching rows that needs to be aggregated (500 orders per product), and for those products whose name starts with ‘A’ (100 products), we expect 25x more -50000.
Since temp tables support statistics and our DB is set up to perform auto create statistics/auto update stats, we also expect to get the good plans for the varying input parameters. In other words we expect query plan to be recompiled if needed – In addition, we also expect query to recompile due to the correctness related reasons (the temp table is created and destroyed every time we run the query)

Test scenario

Run the code from Session1.


Figure 1, The plan created for the first input parameter

As we can see, the plan is good. The temp table Carnality estimates are good, Nested Loop physical operator estimates are aligned with the actual no of rows and the Sort operator was granted enough memory to perform distinct sort.
Now, say another client runs the same stored proc in the context of a different session with a different parameter.
Run the code from Session2


Figure 2, The original plan did not change for the new parameter

Session2 Client gets a horrible plan. Starting from the temp table carnality, everything seems to be off. Nested Loop physical operator estimates are now totally wrong, and the Sort operator has no enough granted memory to perform sort operation and needs to spill 231 pages(~2MB)  to the disk.
If we execute the same stored proc in the context of  a session passing @StartsWith parameter value N’F’, we’ll get the same plan but again with wrong estimates and even worse Sort operator spills.

Analysis

From the previous example we can see that the first passed parameter dictates the query plan structure. The plan “never changes” for the consecutive calls. This looks like parameter sniffing problem, but our query predicates do not include any parameters.
To investigate and explain this behavior let’s start by monitoring the following parameters:

  • Query compilation/recompilation  – Using MS Profiler
  • Temp table’s automatically created statistics – Using database console command SHOW_STATISTICS.

Set up Profiler
Include the following events to Profiler’s trace


Figure 3, Profiler – events and properties

Add temp table statistic tracking to the query

Start trace and run the stored procedure again.
Note: Altering stored procedure automatically removes existing, cached batch.

The query result shows auto created statistics on the temp table’s ProductName column.
Figure 4, First parameter statistics

The plan created is the same as on Figure1

The Profiler trace shows the following sequence of events:

Figure 5, Profiler -The first plan compilation

This plan is exactly what we expected to see.
The unit of compilation and caching is a batch – in this case the content of our stored procedure. During batch compilation the individual statements are compiled one after another.

Short detour: What happens when we create a stored proc? After hitting F5(executing CREATE PROCEDURE …), Sql Server parses the procedure definition for syntactical accuracy. If the statements are syntactically correct, the text of the stored procedure is stored in an internal table accessible through  sys.sql_modules system view. At this stage, it is possible to reference non-existing objects within the batch i.e it is possible to include query that inserts data in a non-existing table, or to select data from a non existing function. Later, when we execute stored procedure for the first time, the process known as “Deferred Name Resolution” will check the names of the referenced objects and consequently initiate  Recompile of the query segment which references the objects.

Follow the steps 1- 6 on the trace above.

  1. (not captured by the trace). The execution environment has  performed the initial batch compilation. Each query within batch is compiled separately(Starting from SQL Server Yukon :). The initial compilation was incomplete and only skeleton plan was created – CREATE TABLE query is the only one fully compiled. The second, INSERT query and the following SELECT query could not be compiled since the temp table did not exist before the very first execution. The skipped compilations are deferred until the execution time.
  2. Batch execution started and the first query was executed (a brand new temp table was created)
  3. The second query (INSERT INTO..#Temp) has to be recompiled due to Correctness related reasons. The EventSubClass property describes the cause of Recompilations
        • A new object(temp table) has been created. The query recompile has been initiated.
        • Query Optimiser loads all “interesting” statistics that can help building a good plan. The statistic belongs to the tables included in the query. If needed, QO will pause the process to automatically create additional statistics on the columns that can help building the plan*. For this query QO does not need to auto-create stats.
  4. The INSERT INTO …#Temp.. query is now recompiled and executed.
  5. The third query(the one we are analysing) has to be recompiled due to the similar reasons as previous query. After loading all “interesting” stats – which belongs to dbo.Transactions, QO decided to stop the execution process and to auto-create statistics on #temp table columns: ProductId and ProductName
  6. StatMan – an internal process, creates auto-stats on ProductId column(#temp table). ProductId column is used in  JOIN predicate.
  7. StatMan creates auto-stats on ProductName column(#temp table). ProductName column is used for the grouping. The stat header and histogram are selected in the query output.
    Auto stat name convention: _WA_Sys_prefix. * In our example:  _WA_Sys_00000002_AA7DA731,  00000002 – Column ordinal position, B8CC2078 – Hexadecimal #Temp table ObjectID. SELECT CAST(0xAA7DA731 as INTEGER) /* Object_id = -1434605775 */
  8. The SELECT query is now recompiled and executed.
  9. (Batch execution successfully completed)

*NOTE: _WA – Stands for Waterloo, Canada(Sybase, the Father of Sql Server was developed in Waterloo) . This is mentioned in 70-462.

Clear Profiler’s trace and execute the same stored procedure, this time with parameter  ‘A’

Figure 6, Stale statistics

Strangely enough, the statistical information (header and histogram) are totally wrong. i.e Table carnality should be 100, not 10 and the histogram steps should show product names that starts with ‘A’ not ‘C’.

Figure 6 – Profiler, No query recompiles

The trace shows no recompiles due to schema changes(CREATE TABLE #Temp) and no stat. updates. From the optimality point of view, Query engine executed the first query(input param ‘C’). The result set is correct, but it seems that everything else is wrong.  The first question is why we are missing recompiles due to schema changes (a brand new temp table was created and populated with the new values). The second question is why the auto-created statistics on the temp table were never updated. Is this why our query never gets recompiled due to plan optimality related reasons.

Temporary tables caching

Temp table caching mechanism may explain the lack of the schema related recompiles in the previous example. This may also explain the stale statistics.
The caching feature was introduced in SQL Server 2005. This functionality provides the caching of temporary objects(temp tables, table variables and TVFs) across repeated calls of routines(Stored procedures, triggers and UDFs).
In-short, When a stored procedure execution ends, SQL Server truncates* and renames the table, keeping only one IAM and one data page. The structure will be used by the subsequent calls instead of allocating new pages from scratch when the object is created again.

NOTE: For the temp objects smaller than 8MB, the truncation happens immediately after module execution ends. For the larger temp objects, SQL Serer performs “deferred drop” and immediately returns control to the application.

The caching mechanism works for the temp tables that is created by using CREATE TABLE or SELECT INTO statements. The caching is not possible when there is an explicit DDL on a temp table after it is created i.e ALTER #table ADD Constraint, CREATE STATISTICS** on table colums or there is a named constraint on a temp table(this is generally a bad idea since it can lead to constraint name conflict). Temporary tables are also not cached if they are part of dynamic SQL or ad-hoc batch.

NOTE: Statistics created using an explicit CREATE STATISTICS statement are not linked to a cached temporary object. Auto-Created statistics(like the ones presented in the example) are linked to a cached temp object. UPDATE STATISTICS does not prevent temp table caching.

The following query shows that there is one cached temp objects currently not in use

We track temp table name by adding  SELECT OBJECT_ID('tempdb.dbo.#temp')  in the example stored proc. This will show that temp table object_id never changes – an internal process renames temp table name to hexadecimal form at the end of the stored procedure. This would happen even if we explicitly dropped the table.

If a stored procedure is executed concurrently, multiple separate cached objects may be created in tempdb.  There is a cached temp object per execution context.

This feature explains why our plan has not been recompiled due to schema changes related reasons. But what about statistics? As previously mentioned, it is expected for query processor to be able to detect the statistical changes that can affect the cached plan validity and to adapt to it. But it didn’t t happen, why?

Recompilation Reasons

There are many different reasons that can cause a query to be recompiled(EventSubClass Figure 5) . The list of all possible recompilation reasons (SP:Recompile event) can be retrieved using the query below.

For the purpose of this analysis, we are interested only in Schema changed and Statistic changed reasons. Temp table caching can explain the absence of the former reason.
Updating a statistics (both manual and auto-update) may cause Optimality(data) related recompilation of the plans that uses these statistics.

RT – Recompile thresholds

SQL Server query processor is designed to adapt the cached query plans to the data changes. The changes are tracked by statistical information(histograms) linked to the “interesting” table columns. Statistical information changes may cause query recompilation due to plan optimality-related reasons.
During the query optimisation phase, Query processor identifies and loads the “interesting” statistics. This helps QP to create good enough plans. In our case, QP loads the following, existing statistics.

  • NCI_Name (Statistics on dbo.Products Name column crated as a part of the NCI index on the column)
  • NCI_ProductId(Statistics on dbo.Transactions ProductId column also created with the NCI index on the column)
  • (Auto Created)#temp table auto-created statistics on #temp.ProductName and #temp.ProductId

Use the “undocumented”  trace flag to find out the interesting statistics used in our example:

The table contents are constantly changing (INSERT, UPDATE, DELETE, MERGE). SQL Server query processor tracks those changes in order to estimate their impact  on the existing cached plans.
For each “interesting statistics” loaded, Query processor creates a snapshot of a counter that counts the number of table modifications. Generally, every table referenced in a query will have such counters loaded.
The table contents are tracked:

  • Directly     – using table cardinality metadata information
  • Indirectly – using statistics on table columns.
 Each column has a RT (recompilation threshold) associated with it. RT is a function of table cardinality.

 RT = f(n) , n -number of rows in a table (table cardinality)

Essentially, RT for a table defines the frequency with which queries that refer to the table recompile.

During the first query compilation, for each interesting statistics, a snapshot value of a counter(colmodctr) that counts number of table modification is stored – a counter per column that participate in the interesting statistics.

Along with colmodctr snapshot values, Query processor sets the Recompile thresholds for every colmodctr created.
Before using an existing cached plan, Query optimiser performs The threshold crossing test defined by the formula below.

 | colmodctr(current) – colmodctr(snapshot) |  >= RT

current     – refers to the current value of the modification counter
snapshot – refers to the value of the mod. counter captured during the last plan compilation(recopilation).
If the threshold crossing test evaluates to TRUE (the number of changes exceeds  the pre-defined RT), for any of the interesting statistics, the query plan is recompiled.
The threshold crossing test will be performed even if query processor does not load any interesting statistics. In that case the test will be based simply on table cardinalities.

 | cardinality(current) – cardinality(snapshot) |  >= RT

The following table shows how RT is calculated for permanent and temporary tables

Temporary Tables
Table cardinality Recompile Threshold Passing RT test caridnality
n<6 6 >= (n+6)
6 <= n <= 500 500 >= (n+500)
n > 500 500 +0.20*n >= (n +500 +0.20*n)
Permanent tables
Table cardinality Recompile Threshold Passing RT test caridnality
6 <= n <= 500 500 >= (n+500)
n > 500 500 +0.20*n >= (n +500 +0.20*n)


colmodctr

colmodctr is the name of a counter which counts number of modifications that a table has undergone. The counter is per column and is non transnational (if we insert 10 rows and rollback the transaction, colmodctr value will increase by 20). The counter is also ever increasing.

So far, Temp table caching can explain why our query does not get recompiled due to schema changes. Now, let’s find out why our query use stale statistics and why it does not perform optimality-related query recompilation.

Automatically created statistics are cashed along with the cached temp tables. In this case,  auto-created statistics are NOT reset at the end of the stored procedure execution cycle nor at the beginning of a consecutive stored proc. execution.
This is why statistics on a cached temporary table may belong to one of the previous stored procedure calls and are absolutely not related to the context of the current temp table.
If we used a “normal”(dbo.Temp) table instead of #temp table, the table would be dropped and recreated on each stored procedure call. Consequently, auto-created statistics would be repeatedly dropped and created which would make them always up-to date.

The modification counters which count changes on the interesting columns, in our example columns: ProductName, and ProductId will not cause query recompiles even if the number of the changes passes the RT threshold test. This behavior is unique to the temp tables(Cached or non cached*) when used within stored procedures.

Note*: To prevent temp table caching, perform a DDL operation on the newly created temp table. e.g

Err 2667: The statistics will be internally present, but not available through sp_helpstats, DBCC SHOW_STATISTICS and tempdb.sys.stats. My assumption would be that the auto-crated stats. BLOB entry gets “disconnected” from temp table’s object_id -> For non-cached temp tables, the object_id changes on every sp run.
What is also interesting with the non-cached table is that it does not force our query to  be recompiled due to schema change reasons as we might expected.

Back to the original example, to be able to track the modification counters, expand dbo.testTempTableStats stored procedure with the dynamic query below.

Now execute the query with the parameter ‘C’

Figure 7, colmodctr – modification counters

On the first query execution, query optimiser sets up the RT values for the two auto-created columns. The Recompilation thresholds are based on temp table’s cardinality (NoOfRows = 10). Based on previously mentioned formula ( 6 < n < 500, RT = 500 ), in order to pass RT crossing test we’ll need at least 510(500+10) changes on ether of the two columns to initiate query recomilation. This means that, if we execute stored proc 6 times passing parameter ‘A‘ (populates temp table with 100 rows), we will reach 6*100 = 600 changes which is more than we need to pass the test.

The 6th execution shows the following results, but no expected recompile.


Figure 8, colmodctr – modification counters do not initiate query recompiles

The table below shows colmodctr counter’s progress:

Figure 9, modification counter progress

As we can see, the query was not recompiled due to the optimality reasons. Non-cached temp tables will show the same behavior. The colmodctr simply does not work for temp tables within stored procedures.
If we repeat the experiment, but this time we pass parameter ‘B'(param ‘B’ results in 4 rows in the temp table) for the very first stored procedure call, we will get the similar output as on Figure 7. RT value is now set to 6 ,  1<n<6, RT = 6, which means that we need at least 10(6+4) changes on the interesting temp table columns to initiate query recompilation.
If we execute the same stored procedure(with param ‘B’), say two more times, the modification counters will show 16 changes and, as expected, with no recompile.
But if we execute stored procedure with a parameter that generates 10 or more rows in the temp table, i.e ‘C’  or ‘E’ (10 and 27 rows) our query will recompile.

The experiment shows that Query processor does track temp table cardinality changes.
The threshold crossing test will be based on temp table cardinality.

 | cardinality(current) – cardinality(snapshot) |  >= RT

In our case RT =6, the expected current table cardinality is 10 , | 10 – 4 | >=6, and the  RT crossing test evaluates to TRUE.

The interesting thing about this is if our code uses a temp table that has relatively stable number of rows i.e ~500, it is likely that we’ll end up with one plan that will never change due to optimality reasons. If our sp creates a query plan for @StartsWith = ‘C’ parameter that sets up RT=500, query optimiser will expect temp table’s cardinality to increase to at least 510 rows, which will never happen if our table has maximum of 500 rows. The difference will be more visible i.e in  case when Query optimiser sets RT to 500 +20%*n (total number of rows – see the RT tables above). The expected cardinality that will initiate recompile will be n + 500 +0.2*n.

Possible solutions

WITH RECOMPILE

One of the obvious solutions for this problem would be to create stored procedure WITH RECOMPILE. This will

  • Prevent temp table caching
  • Prevent stored procedure query plan caching
  • Recompile all queries in the stored procedure on each execution.

This method may work out of the box, but may also be expensive since the whole batch  that can contain quite a few complex queries must compile on each call, and then, during the execution, separate queries may be recompiled due to previously mentioned reasons.

EXECUTE .. WITH RECOMPILE

In situation we know that the cached query  plan works for the most common range of input parameters. If we are able to identify those parameters for which the plan would be sub-optimal, we can create a wrapper stored procedure that controls the main stored procedure execution properties.

A stable plan will be cached for the parameters Z,B,C and E and a second, non-cached plan will be constructed each time we pass any other parameter value than the four mentioned above.

OPTION RECOMPILE and UPDATE STATISTICS

To overcome the problem of the stale statistics, we can add OPTION RECOMPILE to our original query. This will force our query to recompile.

If we repeat the first test and pass parameters ‘C’ and ‘F’ respectively, we can see that “F” param‘s option recompile updates only temp table’s cardinality, from 10 rows to 358 rows. The process skips to update auto-created statistics on ProductName and ProductId.


Figure 10, Stale statistics, correct cardinality

OPTION(RECOMPILE) forces query to recompile, but it does not detect stale statistics. It also “activates” rowmodcnt (modification_counter). In this case, after the first execution(param ‘C’), RT is set to 500. QO needs to detect at least 510 changes on the significant columns  to initiate the recompile. In our case, when modification counter reaches 510+ changes(Two consecutive calls passing param. “A”), including temp table truncates,  QO will update statistics as a part of the query recompilation.

1st Execution Param “C” : RowNo=10: RT=500; modification_counter = 0
2nd Execution Param “A” : RowNo=358: RT=500; modification_counter = 10(truncate) + 358 = 368
3rd Execution Param “A” : RowNo=358: RT=500; modification_counter = 358(Truncate) +  368 = 726** RT crossing test evaluates to TRUE -> Update Statistics.

For non-cached temp tables, OPTION RECOMPILE does update auto-created statistics as well.

We can force query optimiser  to update auto created statistics using UPDATE STATISTICS  #temp. The complete solution would be …

It is a bit unusual place to put UPDATE STATISTICS statement since we expect auto-created statistics to be constructed during our query compilation/recompilation process, not before. But then, if we put the statement after the query, UPDATE STATISTICS will not be able to have effect on the preceding query.

BATCH MODE MEMORY GRANT FEEDBACK (Sql Server 2017+)

This solution is just an experimental one. In Sql Server 2017 Microsoft introduced Adoptive Query Processing, a few new query processing features set to improve query performance.  The new features (v1.0) are

  • Batch mode memory grant feedback
  • Batch mode adaptive joins and
  • Interleaved execution

More about these interesting set of features can be found here.
The first two features are available in queries that references one or more objects with columnstore indexes. Opposed to row-mode  execution style, when rows are passed/processed between iterators(operators) one at a time, the batch processing operates on 64bytes structure(8 pages). The allocated memory space can host between 64 and 900 rows depending on the number of columns selected.
To activate the features(and the batch processing) we can create a “dummy” columnstore index with an int column and LEFT JOIN it with our query. The new join will not change the query logic nor the result-set.

If  we run the original test again, first with param ‘C’ and then with param ‘F’, we’ll notice a couple of interesting changes in the query execution plan.

The first execution creates a new plan

Figure 11, Adaptive Query Processing features

We can notice a new Adaptive Join operator and a new, batch execution mode for Hash Match Aggregate and Compute scalar  operators.
The next execution (param ‘F’) will result in underestimated memory grants for Hash Match Aggregate operator. This is due to stale statistics, as it was shown earlier.
Now, if we execute the same query again(param ‘F’), the Batch mode memory grant feedback feature will FIX the problem with the operator by allocating enough memory for the operation. This will fix the problem with the Aggregate operator and prevent data spills. However, the original problem with the stale statistics etc… will stay.

Conclusion

Temporary tables when used in stored procedures may have totally wrong statistics. This can lead to performance degradation due to sub-optimal plans including not enough accurate memory grants, wrong temp table cardinality estimates etc. This is partially due to the temp table caching mechanism.  The feature is extremely useful since it provides more efficient temp table reuse across frequent stored procedure calls, particularly in OLTP workloads.  Between stored proc. calls, the column modification counters on interesting temp table columns do not initiate query recompiles ( even if the RT crossing test evaluates to true).  However, RT crossing test that depends on temp table’s cardinality changes, if evaluated to true, will initiate query recompiles.
Forcing Query optimiser to perform a statement recompile does not resolve the problem. For the cached temp tables it fixes cardinality estimates but does not update relevant statistics. For non-cached tables it resolves the problem (everything is up to date). Combining UPDATE STATISTICS #tempTable and OPTION RECOMPILE is one way to workaround the problem.
The example demonstrates a common scenario when a temp table is used to reduce query plan complexity.  The table stores relatively small number of rows and drives a more complex query. If the cached plan is sensitive to the cardinality and statistical distribution of the values in the temp table, the plan reuse may result in stored procedure performance degradation.
For the larger temp tables with widely differing numbers of rows on every execution, it is likely that the main query recompiles due to temp table cardinality changes.
For the query plans that are less dependent on the temp table statistics and cardinality, we might consider to use table variables instead(lightweight, no statistics, cardinality presented as 1row etc.)

 

Thanks for reading.

Dean Mincic

Adding non-NULL value columns ONLINE

Adding non-NULL value columns with values to a busy table – ONLINE

Adding a non-null column to a big table on a busy database can be tricky business. It might seem to be an easy task of altering table and adding a column ( an integer data type column – 4 bytes more or less, who cares) but it’s not. DBAs are always extra careful when perform those sorts of actions since it can cause serious implications to the db applications. For example, adding a non-null column to a busy table which contains 10M rows means that every row must be updated with a value. The table may be accessed by many different applications many times in a short period of time. These requests can be seriously blocked by the alter table/add a not null column/update column with a value, causing app timeouts, deadlocks and all sorts of headaches.

One of the common approaches is to add a NULL column and then to update the column values across all rows using the batch update scenario (update N number of rows per transaction) and then to alter column as non-null. This operation is usually performed “offline” or when the table is “quiet”.

Starting with SQL 2012+(Enterprise Edition)  the operation is not “painful” any more. Adding column(s) to a busy table is now instantaneous and has less impact to the OLTP operations. In fact, the operation can be performed online.
This still requires taking into the consideration other factors that can impact the operation such as; data type of the new column, row length etc.

The task for today will be to add a new , non-null column of integer data type with the default value 46 to a busy table that has  1M rows 🙂

Test table

The table contains 1M rows. One row, id=2 has a NULL value for column “val”. The rest of the column values are taking the maximum available space of 4bytes.
NOTE: Varchar stores non-unicode characters e.g VARCHAR(4) =’AAAA’ takes 4bytes. (1 byte per non-unicode character).

The query below provides basic information about the table rows:

The metadata information shows that the “shortest” row (id = 2) takes 11bytes of disk space, and the “longest” row(all other rows) takes 19bytes. The whole table(all rows) takes around 18MB of the disk space.
Before we add a new, not null column to the table, lets show how the rows are physically stored.

Rows physical storage

In this case, rows are stored using the FixedVar storage format. The name comes from the order in which Sql Server storage engine stores different data types – fixed data types(integer, decimal, char..) and then variable data types(nvarchar, varchar, varbinary..).The format has been used since Sql Server 7.0(Sphinx) released in 1998. (smile).
FixedVar format internally adds a number of additional bytes to every row on page. The extra space per row is used to provide information about rows itself like (if the row has been deleted/ghost ,the number of columns, columns with NULL values, position of the variable columns if any,  timestamp and pointer to the verison store if optimistic locking is enabled.. etc..).
For the fixed columns, storage engine will add:

6bytes + CEILING(TotalNoOfColumns / 8.00bytes)

..and for the variable part of the row

2bytes(if at least one of the varColumns has non-null value) + 2bytes * NoOfVariableNonNULLColumns.

In our example, the minimum row size(11bytes , row id=2) and the maximum row size( all other rows 19 bytes) will be..

The image below shows the structure of a FixedVar row (Pro Sql Server Internals by Dimitri Korotkevitch)

Now, down to the bits and bytes, lets find the first data page(out of 2598) that is allocated to the table and try to reconcile the min and a max row. The undocumented internal column/fn %%Lockres%% gives us FileId:PageId:SlotId for the selected rows.

The query selects first few rows from the test table.

With the PageID of the page we want to examine, the next step is to allow sql server to redirect the page content output to the screen.  To achieve this we can use DBCC(Database Console Command). The system change will affect only the current session.

The page snapshot below shows the first page header(first 98bytes of the total page size – 8192bytes) and the first couple of rows. The relevant information for this exercise is highlighted.

The page is read from the disk and put into the buffer pool. The relevant information on the page:

  1.  File No : Page number (1:16576)
  2.  m_lsn = (525:62471:2) – Log Sequence Number.
  3. The position of the first row.
    Slot(0) – the first row on the page
    Offset 0x60 – the physical position from the beginning of the page (Hex 0x60 = 96bytes)
    Length = 19* Happens to be the maximum record size. All rows exept row no 2 (id =2) are 19bytes.
  4. The first row -hexadecimal(byte swapped values)

      30000800 01000000 02000001 00130062 697470

       03 – TagA 1 byte
       00– TagB 1 byte
       0008 – Fsize bytes  – location of the end of the last fixed datatype column.
                       8bytes = 1b(TagA) +1b(TagB) + 2bytes(Fsize itself) + 4b(INT column)
      00000001 – The fixed data type column values. One int column with value 1.
      0002 – Total number of columns (2bytes)  =decimal 2
      00 – NullBitmap 1byte = CEILING(TotalNoOfColumns /  8.00bytes) = CEILING(1/8.00bytes)

     TOTAL Fixed part of the row: 1+1+2+4+2+1 = 11bytes

     00012bytes – No. of variable type columns = decimal 1 (If all of the variable data type columns are NULL, this will be 0bytes.)
     0013 2bytes – Every two bytes stores the location of the end of a variable,not NULL data type column. In this case there is only one variable column = decimal 19 – the var. column ends on 19th byte.
(2bytes * NoOfVariableNonNULLColumns)
     62 697470
 – 4 bytes – The actual variable column value ‘bitp’

    TOTAL Variable part of the row: 2+2+ = 8bytes

    Total row1 size = 11bytes(Fixed part) + 8bytes(variable part) = 19bytes

  1. Column “id” value, first row
  2. Column “val” value, first row
  3. The position of the second row.
    Slot(1) – second row on the page
    Offset 0x73 – the physical position from the beginning of the page (Hex 0x73 = 115bytes). This is header(96bytes) + first row(19bytes)
    Length = 11*  Happens to be the minimum record size. All othere rows are 19bytes.
  4. Second row -hexadecimal(byte swapped values)

      10000800 02000000 020002

      01 – TagA 1 byte
      00– TagB 1 byte
      0008 – Fsize bytes  – location of the end of the last fixed datatype column.
                      8bytes = 1b(TagA) +1b(TagB) + 2bytes(Fsize itself) + 4b(INT column)
      00000002 – The fixed data type column values. One int column with value 2.
      0002 – Total number of columns (2bytes)  =decimal 2
      02 – NullBitmap 1byte = CEILING(TotalNoOfColumns /  8.00bytes) = CEILING(1/8.00bytes)  02hex = 0010bin.  This indicates that the second column (second bit is on) in the row contains NULL value.

      TOTAL Fixed part of the row: 1+1+2+4+2+1 = 11bytes

      Since the only variable column contains NULL value, the variable part does not take any space.

      Total row2 size =11bytes(Fixed part) = 11bytes

    1. Column “id” value, second row
    2. Column “val” value, second row
    3. Slot 1 Column 2 Offset 0x0 Length 0 Length (physical) 0
      This shows that the position of the column 2 in the row 2 is 0 and the physical length of the column is also 0. This means that the value is not physically stored on the page. The NULL value during the materialisation of the column value comes from theNullBitmap byte 02

Add a new column

Finally, we will add a column to our “hot” table. The column will be a non-null integer with the default value of 46.

The action will “add” a new column and 1M 46-es in a few milliseconds. We can confirm that the new column with the values is real(materialised).

How it works

After adding a new, integer data type column, we expect to see increase in row size( by 4 bytes). We also expect that Sql Server storage engine somehow managed to add 1M 4byte integer values to the table  in a fraction of second.

Let’s check rows metadata again.

The query result shows that there were no changes in the row sizes.
So, what actually happened? Knowing that table rows are stored on data pages(RowStore – FixedVar storage format), we would expect increase in min and max row sizes. Using the formulas above…

min_record_size_in_bytes (row id=2)  = 6 + CEILING(2/8.00) + 4 + 4    +  0 + 2*0  =   6 + 1 + 8  =  15
max_record_size_in_bytes  = 6 + CEILING(2/8.00) + 4 + 4    +  2 + 2*1  + 4  =  6 + 1 + 8  + 4  + 4  =  23

The formulas are correct but the result is not 🙂

If we check the first page again, we’ll notice that
m_lsn (log sequential number) has not changed
– The new column (NewCol) is not stored on the page (offset from the beginning of the rows is 0bytes)

The unchanged m_lsn shows us that there were no changes on the page.

More about m_lsn

m_lsn – represents the last Log Sequential Number related to the last transaction that changed the page. The set of log records which describe the changes(old and new  version of the data, transaction, affected row etc.) starts with the same lsn. The log information allow SQL Server to recover databases to transnationally consistent state in case of system crash or any unexpected shut down.
SQL Server uses WAL(Write-Ahead-Log) model which guarantees that no data modifications will be written to disk before the associated log record is written to the disk.
Example: A transaction T1  updates a few columns in a row. The high level sequence of events will be:

  1. A Data access operator requests from the buffer pool manager(BPM) data page that contains rows to be updated .  BPM checks if the page already exists in the buffer pool. If not, BPM requests I/O operations to retrieve the page from the disk.
  2. The update operation(transaction T1) updates relevant row on the page in the buffer pool. The page header information m_lsn is updated. The operation creates a set of new log records(starting with the the same m_lsn) that describes changes made to the page. The log records are created in the log buffer.
  3. Transaction T1 commits the changes. Log buffer manager flushes the content of log buffer(~64kb) to the disk(log file).
  4. Client application receives confirmation that the transaction T1 has been successfully committed.

New column metadata

After creating a new, non-null column, instead of updating all 1M rows with some value, Sql Server 2012+(Enterprise Ed) stores the information as metadata.

The two metadata columns hold the information about the added column. This is where 46 is materialised from, not from the page, but from the metadata.
This new behavior occurs automatically and does not impact the system.

default_value has_default

The defaut_value / has_default columns which belongs to sys.system_internals_partition_columns system view are not related to DF_NewCol Default Constraint defined during the table alteration.
The metadata that describes the default constraints can be found using the query below

If we drop the constraint, has_default/ default_value will stay intact.

row UPDATE

An interesting thing will happen when we update a row. Let’s update row id=2 and replace the NULL value with something.

…and now the max,min rowsizes are changed.

Whenever we update one or more values in a row, the default value (NewCol = 46) will be written on the data page. The rows that are not updated will still materialise value 46 from the metadata.

If you inspect the first page, you’ll find that the page has a new m_lsn and that all the column values of the row are written on the page
Before the change, the shortest row was row2 (id=2), size = 11b. Now, after the update, the row2 size increased to 23b. The NewColumn default value (46) now  became a physical part of the row.
Using the formulas mentioned before, the new min/max row sizes are:

max_record_size_in_bytes (row id=2)  = 6 + CEILING(2/8.00) + 4 + 4  +  2 + 2*1  + 4 =  {6 + 1 + 8 } + {4  + 4 }= 23
**4bytes for  the NewCol value of 64 and 4bytes for ‘Hi 5’  value that replaced NULL value

min_record_size_in_bytes (all rows expect row id=2)  = 6 + CEILING(2/8.00) + 4  +  2 + 2*1  + 4  =  6 + 1 + 4  + 4  + 4  =  19

Table row source(s)

The figure below shows that the test table materialises column values from two different sources – data pages and metadata.

We see that only the updated row (id=2) has all the columns physically stored on the data page.
E.g row 1 is physically stored in Slot 0, but the new column is not part of the row
Slot 0 Column 3 Offset 0x00 Length 4 Length (physical) 0
NewCol = 46

During the production life, all the table rows may be gradually updated.  The has_default default_value values(1 and 46 respectively) metadata remain but will not be in use.
Metadata will be removed if we REBUILD the table

Restrictions

There are a few data types that cannot be added online.

  • Varchar(max)
  • NVarchar(max)
  • Varbinary(max)
  • XML
  • hierarchy_id
  • geometry
  • geography

Default expressions that are not runtime constant (expressions that require a different value for each row) cannot be added online. e.g NewId(),NewSequentialId() ..

Row-overflow row storage format is not supported. New columns must not increase  the maximum row size over 8060 bytes limit.

Conclusion

Starting from Sql Server 2012 (Enterprise Ed) it is possible to add, ONLINE, non null columns with values to a busy table. Sql Server storage engine uses metadata to store the default values for the added columns. When requested, the table columns are materialised from data pages and metadata. There are restrictions on the data types that may be used. e.g  columns with BLOB datatypes cannot be added online.. etc. Only the runtime constant expressions can be used as a default constraints.

Thank you for reading

Dean Mincic

ORDER BY – Changes everything

SQL programming has a few unique aspects such as Three value logic(3VL), set based programming logic and logical processing order. Divination from the programming patterns based on the unique nature of the language usually leads to poor performing SQL code.

RDBMS fundamentals

Sql Server as a RDBMS(Relational Database Management System), and its SQL language dialect/variant TSQL(Transact-Structured Query Language) has strong foundation in mathematics. TSQL as Sql Servers language for data management and manipulation is based on two cornerstones of mathematics – Set theory and Predicate logic.

Edgar F. Codd was a computer scientist who invented the relational model for database management systems. He used mathematical theories (Set theory and predicate logic) to create a general theory of data management.
IBM’s System R is the very first database system build on Codd’s relational model. In order to “force” DB vendors to follow the relational principles when designing db systems, Codd published his 12 rules which describes a relational database.
Various database vendors(Oracle, Sybase, Sql server..) implemented the Relational principles in a similar but different way. This means that the db systems “slightly” deviate from Codd’s relational theory, and the theory itself “slightly” deviates* from the original set theory and predicate logic – but that’s ok 🙂

*e.g One of the differences between predicate logic and relational algebra is that the first supports the two-valued logic (True/False). Relational model supports three way logic(True/false and Unknown)

Although tsql querying design patterns are derived from the Relational algebra, Tsql does not strictly follow relational rules. One of the differences, that is also related to this article is the that a Relation is not sorted → There is no relevance to the order of the elements in a set. However, Tsql allow us to sort(ORDER BY) the final result-set(the one that is returned to Client). We are allowed to replace the Relational result(as a “final product” of the relational operations in a Relational database based on relational theory…. ) with a non- relational result – CURSOR. That is why Order by “Changes everything” 🙂

Logical query processing sequence

To understand ORDER BY operation, it is important to understand the logical query processing sequence. The concept is unique to SQL world and defines the order of execution of different query segments.  It’s also known as Theoretical execution order because the sequence may be changed by SQL Srv optimiser due to may different reasons i.e data access patterns, query simplification etc. The query sequence:

  1. FROM – The FROM phase identifies the query’s source tables. This phase processes CROSS JOIN, ON Filter and in case of OUTER JOINs it adds the non-matched rows back to the result-set.
  2. WHERE Filter. Removes all those rows from the previous stage for which the predicate evaluates false.
  3. GROUP BY – Performs grouping and aggregation calculations. This phase generates one row per group.
  4. HAVING – This is the last of the three query filters. It filters the results of aggregate functions.
  5. SELECT – Processes the elements in the Select clause i.e Evaluates expressions within Select statement and/or removes duplicates(DISTINCT) and/or filters the specified percentage/number of rows(TOP) defined by the ORDER BY clause. The phase returns TABLE.
  6. ORDER BY – Sorts the rows according to the list of columns specified and returns CURSOR.
  7. FOR XML/FOR JSON The phase converts tabular result from the SELECT statement into XML/JSON outupt.

For the queries that include SET operations, the logical sequence will be.

  1. Query 1
  2. SET operation (union, except, intersect) or multi-set Union All
  3. Query 2
  4. Order By

Order by and …

ORDER BY operation guarantees ordered result set – Cursor. TSQL deviates from ANSI standard and allows us to sort results using the columns and/or expressions based on columns that are not part of the SELECT statement(This is not true if query uses DISTINCT clause) . ANSI standard allows sorting only by using the columns mentioned in the select statement. ORDER BY can also be used as a logical sorting operator when operating as a part of TOP/Offset functions.

Because ORDER BY converts query result-set into a CURSOR, it cannot be used to define table expressions;

  • Views
  • ITVF(Inline table valued functions aka parameterised views)
  • Derived tables
  • Subqueries
  • CTE(common table expressions)

… unless used as a logical sorting with(top,offset,for xml/json)

The CURSOR must be returned to the client application that is able to “consume” cursor records, one at a time and in order.

ORDER BY treats NULL values as equal. NULLs are sorted FIRST in ASC order.

SET operations (Union, Except, Intersect) and multi-set Union All

When using SET based operations (UNION, EXCEPT, INTERSECT) and/or multi-set operation UNION ALL  we cannot use ORDER BY operator in the individual statements. The following query will fail the parsing phase.

Msg 156, Level 15, State 1, Line 24
Incorrect syntax near the keyword ‘INTERSECT’

Sql Server’s SET operations correspond to operators defined in mathematical set theory. Therefore, the operands must be sets – represented as Tables (Set → Relation → Table)
ORDER BY changes query output to something that ANSI SQL defined as CURSOR. Set operations do not support ordered structures.

We can use ORDER BY clause to format the FINAL result-set for the presentation purposes. The ORDER BY in the example below logically does not belong to the last query. It operates on (Set1 U Set2)

TOP/OFFSET-FETCH

TOP option is unique to tsql and does not follow ANSI guidelines. The directive simply allows us to select a number of rows or percentage of rows to return. The option is supposed to be paired up with ORDER BY – always.
This time ORDER BY does not serve a presentation purpose. This time it serves as a logical “TOP sorting” operator. It supposed to answer the question “TOP according to what order?”. The result of a TOP/ORDER BY operation is a table – not a cursor.
Unfortunately, tsql allows us to use TOP without ORDER BY and in that case the general query result will be random and non-deterministic TOP(n) rows / TOP(n) PERCENT result set.

It is a good practice to specify ORDER BY(SELECT NULL) if our intention is to select random and/or non-deterministic TOP(n) rows. Query Optimiser will remove the logical ordering during the simplification phase. The purpose of this is to let other developers know that this was done intentionally.

Test data:

Test 1: The query returns two randomly chosen rows(TOP without ORDER BY)

Test 2: The query returns a non-deterministic result – “non unique” ORDER BY is used aka. non-deterministic Order By.

The query “decided” to select Customers 100 and 105 but not 102. In this case, if we had a clustered index on customerId, the query would probably select customers 100 and 102.

Test 3: Two ways to make the “TOP(n)” query to return a deterministic result.

(1) The deterministic order by provides more specific information about which TOP(2) order dates to select. Now the query knows exactly which TOP(2) rows to select. a combination of Order By columns that provides a deterministic result-set is called Tiebreaker.

(2) The “WITH TIES” guaranties deterministic result based on Order by column(s). WITH TIES must have ORDER BY defined. In case of non-deterministic Order By, WITH TIES will add all relevant rows to the result-set in order to provide determinism. In the example above it added an extra row.

OFSET-FETCH (tsql 2012+)

OFSET-FETCH is an ANSI SQL supported feature. Basically, the function is an advanced version of the TOP function. The option allows us to select(fetch) only a window or a page of results from the result set.
Opposed to the similar, TOP function, OFFSET–FETCH can not operate without Order By. On the other hand OFFSET-FETCH does not support WITH TIES therefore it cannot guarantee deterministic result set.
The following query fails because offset-fetch always operate with order by:

Msg 102, Level 15, State 1, Line 25 Incorrect syntax near ‘0’. Msg 153, Level 15, State 2, Line 25 Invalid usage of the option FIRST in the FETCH statement.

If we want to select random 2 rows from a table, we can use

The query below will return the same result as the query from the Test3.

Same as TOP, OFFSET-FETCH returns a table therefore it can be used in other table expressions, e.g.. (the query does not make much sense – its only for testing purposes …:)

..and this will fail. Table operators (JOIN and non-ANSI APPLY,PIVOT and UNPIVOT)  operates only between tables(table expressions).

Msg 1033, Level 15, State 1, Line 38
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.

VIEWS

A view is a named virtual table that is defined by a query and used as a table
A view represents a table, and a table is a logical entity that has no order to its rows. This is the reason why it’s not allowed to use presentation ORDER BY in a view. VIEWS ARE NOT ORDERED!
The view definition below will fail

Test 4. Views are not ordered

Msg 1033, Level 15, State 1, Procedure TestViews, Line 7 [Batch Start Line 1]
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.

Now, some programmers might think that they somehow managed to override RDBMS fundamentals by defining a view with TOP clause specifying all 100% rows and ORDER BY.

Test 5, Views with TOP 100% ordered rows

This will pass the parsing stage and will be stored as a valid view definition. In reality the view definition does not make much sense(see the TOP/ORDER BY example above). During the simplification phase, query optimizer will remove the top100%/ORDER BY as an unnecessary logic.

The result is not ordered by create_date DESC
In some cases i.e when we use TOP and ORDER by the clustered index key columns, the result may be appearing to be sorted – QO may decide to use sorted table scan. Again, there is no guarantee that the returned rows will be ordered.
e.g. During the final SELECT phase, QO may decide to use Sql Server’s storage engine advanced scan functionality also known as merry-go-round scan or scan piggybacking to return the same rows to two or more concurrent sessions. Without diving into the details of how the advanced scan works(will cover that interesting optimisation mechanism in one of the next posts), the different sessions will have the same result-sets but with different order.

The presentation ORDER BY will convert un-ordered view result into an ordered CURSOR.

Window Functions

Windows functions are intorduced in ANSI SQL-2003 standard. Microsoft partially implemented  the functionality in tSql 2005 and expanded to a full range of window functions in tSql 2012,
A Window function operates on a number of rows(window) in a set(table).  Window functions use ORDER BY as a logical operator which defines the behaviour of the aggregate operations performed by the functions. In This scenario, ORDER BY, as a part of the functions does not perform the final sort.

It is important to understand where Window function sits in the logical query execution sequence.

(5) SELECT
(5.1) Evaluate expressions in the SELECT statement
(5.2) WINDOWS Fn can be executed from here…
(5.3) DISTINCT
(5.4) TOP
(5.5) ….to here

(6) ORDER BY (presentation)

The example below demonstrates how window functions are positioned in the logical query exec. sequence.

Test Data.

The query below sums the OrderValues per OrderPackDescription. The OrderPackDesc = NULL will be summarised as a distinct value resulting OrderValue=10. ISNULL(OrderPackDesc,’Media Pack’) evaluates the NULL value as “Media Pack”

This could lead to a logical error. We want to find out the sums of order values per order pack. We also want to treat the missing OrderPacks as ‘Media Pack’. The previous query sums the “missing order packs” separately. This is a tipical logical error. The error would be more difficult to find if we had OrderValue = 190 for the unknown orderPackDesc. In that case we will have total of 190 instead of 380 for ‘Media Pack’.
Taking into the consideration the query execution sequence we can modify our query as:

…OVER(PARTITION BY ( ISNULL(OrderPackDesc,’Media Pack’) … is evaluated BEFORE the WINDOW fn.
Now our window fn, partition knows how to treat the missing values before the aggregation part and the execution sequence works in your favor 🙂
The example also shows that DISTINCT executes after WINDOW functions.
NOTE: It is not possible to use alias OrderPackDescEXPR in other expressions within the same query, like  … OVER(Partition by OrderPackDescEXPR ..) The limitation is another unique aspect of SQL language – All-at-once operation.
E.g in most programming languages, to swap values between variables we use a third, temp var. A→ temp, B→A  , temp→ B . In SQL we can swap table column values without using the temp var storage..

Window aggregate funtions

Starting from SQL 2012 the ORDER BY clause can be used along with the window aggregate functions. ORDER BY dictates how the aggragate functions operate within the window partitions. It it important to understand the ORDER BY in this content to be able toavoid logical errors.

Create some test data

The following query explains how ORDER BY affects SUM – window aggregate function