Bookmark lookup tipping point

Bookmark lookup critical point


Summary

It is common for production environments to have queries – query plans, that use non-covered, non-clustered indexes in combination with a bookmark(Key or RID) lookup operator. The combination of  the physical operators is one way how query optimiser can use a non-covered index to provide information required by a query. However, sometimes, for the same query,  query optimiser decides to scan the whole (cluster or heap) table instead. This drastic change in the plan shape may have negative impact on our query performance.
This post attempts to explain the mechanism behind QO decision on when to switch between the two plan shapes. The concept is known as The Tipping Point and represents the point at which the number of page reads required by the bookmark lookup operator exceeds a certain point at which a clustered index/heap table scan becomes less expensive than the non-clustered index seek.

How bookmark lookup works

Before diving into the tipping point specifics, it would be good to understand how bookmark lookup operator works in combination with a non-clustered , non covered index. Bookmark lookup (Key or RID)  is a physical operator used to find data rows in the base table(cluster or heap) using a clustered index key(Key lookup) or row locator(RID lookup).
Lets create a test environment we can use throughout the blog.

Create test environment

Create a test table

Insert 100,000 rows

The test objects properties that are interesting for our experiment:

  • Unique clustered index on EmployeeId.
  • Unique, non-clustered index on the SearchValue column.
  • SearchValue column contains unique, ever increasing integer values. The  values match EmployeeId values.
  • The row size is exactly 500bytes. 493bytes is used by the five fixed length columns + 7bytes row overhead. More on this can be found here.

Key lookup scenario

The query below returns 500 rows (all columns) for a range predicate.

Note: Traceflag 652 . The traceflag disables page pre-fetching scans (read-ahead). Disabling the storage engine feature will help us to reconcile the number of I/O operations reported by STATISTICS IO with the number of rows selected by the query. More on the trace flag later in the blog.

Analyse key lookup query plan

The figure below consolidates three sets of information related to our test query – a graphical execution plan shape, basic properties of the two physical operators and the number of IO reads performed on the test table.


Figure 1, Key lookup, index seek plan

We read the query plan as the following sequence of events.

  • Query optimiser chose a key lookup/non-clustered index seek routine to implement query request.
  • Nested Loop operator requested, for its outer input, all valid rows(rows that are passed the seek predicate ..SearchValue BETWEEN 1000 AND 1499.. ) on NCI_SearchValue index. The index seek(index bTree traverse) was executed once resulting in 500  rows and two columns – SearchValue and EmployeeId. The latter  also acts as a pointer to the full size rows stored in the clustered index leaf level.
  • Nested Loop operator requested, through its inner input, the rest of the columns selected by the query – Name, Surname and Note. The search(key lookup operator), was executed 500 times, once per row in the outer input returning a  new set of 500 rows – a row per key search. Each execution traversed the clustered index bTree structure using EmployeeId as a seek predicate, in order to pin-point the qualifying rows.
  • For each key lookup search, Nested Loop operator combined the two outputs, the SearchValue and EmployeeId from the outer input with the Name, Surname and Note from the inner input forming the shape of the final result set.

The next thing we need to understand is the relationship between the number of I/O reads required to implement the above routine and the number of rows processed in the process.

Figure 1 shows that the number of I/O reads required for the operation was 1503 logical reads.  A logical read is a situation when Sql Server processes an 8Kb page from a RAM memory segment called buffer cache. If the requested page is not already in the buffer cache,  storage engine needs to perform a physical read operation in order to get the same 8Kb page from the disk.
The properties of the two physical operators(NCI seek and Key lookup) shows that the system read 500 rows from the non-clustered structure,in one go and performed 500 operations on the clustered index, returning a row per operation.

Now we need to dive a little bit deeper into Sql Server’s storage protocol in order to find all physical pages that were processed during the operations. Query below gives us high level overview of the index structures used in the test. The non-clustered index bTree has two levels and total of 175 pages whereas clustered index bTree has three levels and the total of 6273 pages.

Figure 3, the total number of pages per index 

The next query gives us a detailed view of the index pages across bTree levels – The Doubly Linked List data structure.

Finally, the following query gives us a sneak-peek of the actual rows  on the index/data pages.

Data access pattern

The following diagram represents data access pattern used by the key lookup routine.


Figure 2, Nonclustered index & key lookup internals

Data access pattern analysis

Our next goal is to reconcile the total number of logical I/O reads previously reported by the STATISTICS IO – (1503) with the diagram above.
Lets find out the total number of I/O read operations required to generate one row(out of a total of 500 rows). The following sequence of events is derived from Figure 2.

  1. Nested loop operator requests all eligible rows from the index seek operator. The non-clustered index is a non-covered index and can provide only SearchValue and EmployeeId columns. The index seek operator use the two seek predicates to find the range of values, SearchValue >=1000 AND SearchValue <=1499.
  2. Non-clustered index traverse. Index seek operator reads the Root page(PageId=1:16448) of the non-clustered index making the first I/O read (NoOfReads = 1).
  3. Index seek operation, following the lower boundary(SearchValue = 1000) of the search range, finds a pointer(PageId) which points to the index leaf level page which contains the full(two columns) index row. (ROOT PAGE: SearchValue = 597, PageId = 1:16456). It also knows that the PageId=1:16456 alone cannot provide complete range of the requested values but only SearchValues >=1000 AND SearchValues <1157. It “remembers” the following pointer, PageId = 1:16457 which can provide the rest of the values, SearchValue >=1157 and SearchValue <= 1499.
  4. Index seek operator performs the second I/O read(PageId =1:16457) following Path (A).  The total number of I/O reads is now (NoOfReads = 2).
  5. After storing first 165 rows found on PageId =1:16456 in memory, the operator follows Path (B). The operation is known as “partial index scan“. The operator knows that the subsequent page(PageId=1:16457) contains the rest of the requested rows(335 rows). The current page also has pointers to previous and next page(doubly linked list). The Path (B) makes the third read (NoOfReads = 3).
  6. Nested loop operator received all 500 requested rows from its outer input , the NCI index seek operator.
  7. Nested Loop operator now performs the Key Lookup operation over clustered index, 500 times, once per row collected from the outer input.
  8. Clustered index traverse (singleton search). On its very first execution, the key lookup operator uses the first row from the Nested Loop outer input, (EmployeeId = 1000) and performs its first page read(PageId = 1:2680). The page is the root level of clustered index bTree. The operation follows Path(C) increasing the total number of reads (NoOfReads = 4).
  9. Clustered root page provides a pointer(PageId = 1:832) which points to the first index intermediate level. The page maps all EnployeeIds between NULL and less then 4305. The row with EmployeeId=1000 falls into the range. Following Path(D) the operator makes its second read and increases the total number of reads ( NoOfReads = 5)
  10. Intermediate page 1:832 provides information about a leaf level page(PageId=1:1029) that contains the first full row.
  11. The process now follows Path(E) and make its final, third clustered index read ( NoOfReads = 6)
  12. The full row is then passed from the Nested Loop operator to the SELECT operator.
  13. Nested loop operator continue to progresses through the list of 500 rows from its outer input repeating steps 8 – 11 until all 500 rows have been processed.

The total number of reads is
Total No Of Reads = Index Seek operation (3 reads) + 500 * Key Lookup operation (3 reads) = 3 + 500 * 3 = 1503.
This is an exact match with the number of logical reads reported by STATISTICS IO.

Important observations

From the storage level perspective, one of the main differences between the two access patterns is that the Clustered index seek(partial scan) is a sequential read I/O operation, whereas Key lookup(singleton clustered index seek) is a random read I/O operation. Sequential reads are generally less expensive (less mechanical movements on the disk level) than the random reads. This is also true for RAM/SSD although they don’t have moving parts. This a very high level observation on data storage systems. 🙂

The number of random reads depends on size of row. The wider the row the more random reads key lookup needs to perform to get the same result-set. More about this later in the post.

Read ahead optimisation

Earlier in the blog I used TRACEFLAG 652 to disable the page pre-fetching aka read ahead functionality. Read ahead is an asynchronous I/O mechanism build to overcome the gap between CPU and I/O performances. Because CPU is many times faster than any storage system, Sql Server’s storage engine tries to read up to 64 sequential pages(8 extents) before they are requested by a query. This provides more logical reads, but on the other hand is faster than performing physical reads only when required. Although the number of read-ahead pages may vary, the mechanism reads pages from the index intermediate level(s) in order to identify the leaf level pages to be read in advance. In our case, the functionality would, if not turned off, “add a few” extra pages to the STATISTICS IO report and we wouldn’t be able to reconcile the reads with the diagram on Figure 2.
Try to run the same test query without turning on the traceflag. The number of logical reads should be greater than 1503.

The tipping point

The tipping point is the point which represents the critical number of rows after which query optimiser decides to perform cluster index scan instead non-clustered/key lookup data access pattern.
As previously shown, in non-clustered index/key lookup scenario, the number of rows requested by a query relates to the number of random reads. The main concern when determining the tipping point is actually the number of data pages that needs to be “randomly” read from clustered index leaf level – a read per each row requested . In our example this number is 500. The approach excludes the clustered index leaf level page reads and the non-cluster reads all together.
The number of pages that represents the tipping point is between 1/4 and 1/3 of clustered index data pages(leaf level). If a query requests less than 1/4 (25%) of the number of clustered index leaf level pages, query optimiser is most likely to allow random reads and non-clustered index/key lookup data access pattern. However, if a query requests more than 1/3(33%) pages, query optimiser will implement a clustered index scan data access pattern.
Figure 3, The tipping point

Knowing that a random read corresponds to a selected row, we can express the tipping point as a number of rows;

                      1/4(no of data pages) <= Tipping point (rows)  => 1/3(no of data pages)

In our example, the tipping point is expected to be somewhere between 1562 and 2083 rows.
So where exactly is our tipping point?
One approach is to apply binary search algorithm to the tipping point interval and perform trial and error approach until the query plan changes.
The other approach is to construct some kind of a program to do that for you 🙂

The script runs the same query for different SearchValue values. The values are within the expected tipping point range, starting from the lower boundary. Query result is suppressed  by FMTONLY ON session setting. OPTION(RECOMPILE)* in this context ensures that the value of local variable @start is known to Query Optimiser when creating execution plan for the query*. For each query run, program checks the current query plan RelOp element. If the element’s attribute @PhysicalOp has value set to ‘Clustered Index Scan‘ the program terminates and selects the current SearchValue value. The value represents the Tipping point we are looking for.
Figure 4, The exact Tipping point number of rows

Note:  Instead using OPTION(RECOMPILE) we could use Dynamic string execution.

The approach constructs and optimise the query during run-time. In this case, local variable @start gets evaluated and is treated as a literal within the dynamic string. It is most likely that the query plan will not be parameterised and the individual plans(one per execution) will be cached as Adhoc plans. This may lead to the Plan pollution situation, but this is a topic for a separate blog 🙂

Lets check the tipping point number of rows.

Figure 5, Query plan change

In terms of the number of pages, tipping point is expected in the range from 25%33% of the total number of clustered index data pages(leaf level). In our example, the range was between 1562 and 2083 pages.
From the number of rows point of view, tipping point was 1677 rows which is  (1677 /100000)*100 = ~1.7% of the total number of rows in the table(clustered index leaf level). This means that Sql Server is very conservative when to use bookmark lookup data access pattern, although the percentage of rows depends on the row size and probably other conditions i.e memory pressure, parallel query execution ..etc.

Tipping point & row size

As mentioned above, the number of random reads in non-clustered/key lookup access pattern depends on the size of a row. The wider the row the more random reads key lookup needs to perform to get the same result-set.

Lets increase our test table row size from 500(493bytes + 7bytes overhead) to 1000bytes. The new rows size will expand the table footprint.

The number of clustered index data pages required to store 100000 rows is now doubled, 12502 to be precise (Figure 3 query). The Tipping point is now expected to be in the range from (1/4) * 12502 and (1/3) * 12502 or 3125 and 4167 rows.

If we run query (Figure 4), we’ll get the exact tipping point , 3228 rows.


Figure 6, Query plan change (row size 1000bytes)

The interesting thing here is that now, with the wider rows, the tipping point represents (3228/100000)*100 = ~3.2% of the total number of rows in the table which is almost double than 1.7% calculated for the 500byte rows.

The tipping point experiments can be put in the context of a cached plan. If we wrap our test query into a stored proceudre and make local variable @start  to be stored proc’s input parameter, on the very first sp call query optimiser will create and cache query plan using the value passed

the query execution plan will be crated using the

Conclusion

The concept known as The Tipping Point represents the point at which the number of page reads required by the bookmark lookup operator exceeds a certain point at which a clustered index/heap table scan becomes less expensive than the non-clustered index seek. In this context, a bookmark operator(Key or RID) is coupled with a non-clustered , non-covered index – Index Seek operator. The latter performs sequential I/O operations whereas the first performs a number of Random Access I/O read operations. Random I/O reads are generally more expensive than sequential I/O read operations regardless of the storage system (mechanical HDD, SSD, RAM ..etc). Query optimiser allows bookmark lookup/index seek data access pattern only if the number of clustered index pages needed to be randomly accessed does not exceed 1/4 of the total number of clustered index data pages(leaf level). If the number of pages exceeds 1/3 of the total number of the clustered index data pages, Query optimiser will choose Clustered index scan data access instead. This is also true for Rid Lookup/Table scan access pattern when table is a heap.
The range of data pages between 1/4(25%) and 1/3(33.3%) of the total data pages defines The Tipping Point space. In this scenario, the number of randomly accessed pages relates to the total number of the selected rows. However,  25% – 33% of pages represents only a fraction of the total number of rows – for 500byte row size, between 1.6% and 2%. The range also depends on the row size. For the same number of rows and with the row size set to 1000bytes, the range increases to 3% – 4% of  the total number of rows.

I wish to thank to my dear colleague and a great SQL Server enthusiast Jesin Jayachandran for inspiring me to write this blog.

Thanks for reading.

Dean Mincic

Leave a Reply

Your email address will not be published.