Bookmark lookup critical point
Summary
It is common for production environments to have queries – query plans – that use non-covered, non-clustered indexes in combination with a bookmark (Key or RID) lookup operator. This combination of physical operators is one way the query optimiser can use a non-covered index to provide the information required by a query. However, sometimes, for the same query, the query optimiser decides to scan the whole table (clustered index or heap) instead. This drastic change in the plan shape can have a negative impact on query performance.
This post attempts to explain the mechanism behind the query optimiser's decision to switch between the two plan shapes. The concept is known as The Tipping Point and represents the point at which the number of page reads required by the bookmark lookup operator makes a clustered index/heap table scan less expensive than the non-clustered index seek.
How bookmark lookup works
Before diving into the tipping point specifics, it is worth understanding how the bookmark lookup operator works in combination with a non-clustered, non-covered index. A bookmark lookup (Key or RID) is a physical operator used to find data rows in the base table (clustered index or heap) using a clustered index key (Key lookup) or a row locator (RID lookup).
Let's create a test environment we can use throughout the blog.
Create test environment
Create a test table
--DB compatibility level 150
DROP TABLE IF EXISTS dbo.TestBookmarkLookupthreshold;
GO
CREATE TABLE dbo.TestBookmarkLookupthreshold(
     EmployeeId INT IDENTITY(1,1) PRIMARY KEY CLUSTERED    --4
    ,[Name] CHAR(120) NOT NULL                             --120
    ,Surname CHAR(120) NOT NULL                            --120
    ,Notes CHAR(245)                                       --245
    ,SearchValue INT NOT NULL                               --4
     INDEX NCI_SearchValue UNIQUE NONCLUSTERED(SearchValue)
)
GO
Insert 100,000 rows
;WITH generateRows AS
(
    SELECT TOP(100000) n = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
    FROM sys.columns c
        CROSS APPLY sys.columns c1
        CROSS APPLY sys.columns c2
    ORDER BY (SELECT NULL)
)
INSERT INTO DBO.TestBookmarkLookupthreshold (
     [Name]
    ,Surname
    ,Notes
    ,SearchValue
)
SELECT   [Name] = 'MyNameIs_' + CAST(n AS VARCHAR(10))
        ,Surname = 'MyLastNameIs_' + CAST(n+100 AS VARCHAR(10))
        ,Notes = 'MyNotes..'
        ,SearchValue = n
FROM generateRows
ORDER BY n
GO
The test object's properties that are relevant for our experiment:
- Unique clustered index on EmployeeId.
- Unique, non-clustered index on the SearchValue column.
- SearchValue column contains unique, ever increasing integer values. The values match EmployeeId values.
- The row size is exactly 500 bytes: 493 bytes are used by the five fixed-length columns + 7 bytes of row overhead. More on this can be found here. A quick way to verify this figure is sketched below.
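If you want to double-check the 500-byte figure, the index physical stats DMV reports the average record size on the clustered index leaf level. A minimal sketch, assuming the table has been created and populated as above:

--Sanity check of the row size on the clustered index (index_id = 1) leaf level (index_level = 0).
SELECT  avg_record_size_in_bytes
       ,record_count
       ,page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.TestBookmarkLookupthreshold','U'), 1, NULL, 'DETAILED')
WHERE index_level = 0;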
Key lookup scenario
The query below returns 500 rows (all columns) for a range predicate.
DBCC TRACEON(652);
GO
SET STATISTICS XML,IO ON;
    SELECT *
    FROM dbo.TestBookmarkLookupthreshold
    WHERE SearchValue BETWEEN 1000 AND 1499
SET STATISTICS XML,IO OFF
DBCC TRACEOFF(652);
GO
Note: Trace flag 652 disables page pre-fetching scans (read-ahead). Disabling this storage engine feature will help us reconcile the number of I/O operations reported by STATISTICS IO with the number of rows selected by the query. More on the trace flag later in the blog.
Analyse key lookup query plan
The figure below consolidates three sets of information related to our test query – a graphical execution plan shape, basic properties of the two physical operators and the number of IO reads performed on the test table.
Figure 1, Key lookup, index seek plan
We read the query plan as the following sequence of events.
- The query optimiser chose a key lookup/non-clustered index seek routine to implement the query request.
- The Nested Loop operator requested, for its outer input, all valid rows (rows that pass the seek predicate ..SearchValue BETWEEN 1000 AND 1499..) on the NCI_SearchValue index. The index seek (index bTree traverse) was executed once, resulting in 500 rows and two columns – SearchValue and EmployeeId. The latter also acts as a pointer to the full-size rows stored in the clustered index leaf level.
- The Nested Loop operator requested, through its inner input, the rest of the columns selected by the query – Name, Surname and Notes. The search (key lookup operator) was executed 500 times, once per row in the outer input, returning a new set of 500 rows – a row per key search. Each execution traversed the clustered index bTree structure using EmployeeId as a seek predicate in order to pinpoint the qualifying row.
- For each key lookup search, the Nested Loop operator combined the two outputs, the SearchValue and EmployeeId from the outer input with the Name, Surname and Notes from the inner input, forming the shape of the final result set. (A purely illustrative rewrite of this routine is sketched right after this list.)
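To make the routine more tangible, the index seek + key lookup combination can be spelled out as an explicit query. This is only an illustrative sketch of the logic, not the optimiser's internal rewrite of our test query:

--Illustration only: the seek + key lookup routine written as an explicit per-row lookup.
SELECT  nci.SearchValue, nci.EmployeeId     --columns available from NCI_SearchValue (outer input)
       ,lkp.[Name], lkp.Surname, lkp.Notes  --columns fetched row-by-row from the clustered index (inner input)
FROM dbo.TestBookmarkLookupthreshold AS nci
    CROSS APPLY (   --conceptually executed once per qualifying outer row, like the key lookup
        SELECT t.[Name], t.Surname, t.Notes
        FROM dbo.TestBookmarkLookupthreshold AS t
        WHERE t.EmployeeId = nci.EmployeeId
    ) AS lkp
WHERE nci.SearchValue BETWEEN 1000 AND 1499;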
The next thing we need to understand is the relationship between the number of I/O reads required to implement the above routine and the number of rows processed along the way.
Figure 1 shows that the operation required 1503 logical reads. A logical read occurs when Sql Server processes an 8KB page from a RAM memory segment called the buffer cache. If the requested page is not already in the buffer cache, the storage engine needs to perform a physical read operation in order to get the same 8KB page from disk.
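To see physical reads alongside the logical ones, the buffer cache can be flushed before re-running the query. A rough sketch, for a dev/test environment only (DBCC DROPCLEANBUFFERS requires elevated permissions and should never be run in production):

--Dev/test only: flush clean pages from the buffer cache, then compare the
--'logical reads' and 'physical reads'/'read-ahead reads' figures from STATISTICS IO.
CHECKPOINT;
DBCC DROPCLEANBUFFERS;
GO
SET STATISTICS IO ON;
    SELECT *
    FROM dbo.TestBookmarkLookupthreshold
    WHERE SearchValue BETWEEN 1000 AND 1499;
SET STATISTICS IO OFF;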
The properties of the two physical operators (NCI Index Seek and Key Lookup) show that the system read 500 rows from the non-clustered structure in one go and then performed 500 separate operations on the clustered index, returning a row per operation.
Now we need to dive a little deeper into Sql Server's storage internals in order to find all the physical pages that were processed during these operations. The query below gives us a high-level overview of the index structures used in the test. The non-clustered index bTree has two levels and a total of 175 pages, whereas the clustered index bTree has three levels and a total of 6273 pages.
SELECT  idx.[name]
       ,idx.[type_desc]
       ,pstats.index_level
       ,pstats.index_depth
       ,pstats.avg_record_size_in_bytes --all fixed length data type columns
       ,pstats.record_count
       ,pstats.page_count
       ,idx.index_id
       ,TotalPageNo = SUM(pstats.page_count) OVER(PARTITION BY idx.index_id) --total no of pages per index
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.TestBookmarkLookupthreshold','U'), NULL, NULL, 'DETAILED') pstats
    LEFT OUTER JOIN sys.indexes idx
        ON idx.object_id = pstats.object_id
            AND idx.index_id = pstats.index_id;
Figure 3, the total number of pages per index
The next query gives us a detailed view of the index pages across bTree levels – The Doubly Linked List data structure.
SELECT  idx.[name]
       ,idx.[type_desc]
       ,aloc.page_type
       ,aloc.page_type_desc
       ,aloc.previous_page_page_id
       ,aloc.allocated_page_page_id
       ,aloc.next_page_page_id
       ,aloc.extent_page_id
       ,idx.index_id
       ,aloc.page_level
FROM sys.dm_db_database_page_allocations(DB_ID(), OBJECT_ID(N'dbo.TestBookmarkLookupthreshold','U'), NULL, NULL, 'DETAILED') aloc
    LEFT OUTER JOIN sys.indexes idx
        ON idx.object_id = aloc.object_id
            AND idx.index_id = aloc.index_id
WHERE aloc.page_type IS NOT NULL
    AND aloc.index_id = 1 --2 (0 - heap, 1 - clustered index, > 1 non-clustered index)
ORDER BY aloc.page_level DESC
Finally, the following query gives us a sneak peek at the actual rows on the index/data pages.
DBCC TRACEON(3604); --allows DBCC PAGE command to send its output to SSMS
GO
DBCC PAGE ('Test2019DB', 1, 832, 3) --replace db name/page_id with your values
DBCC TRACEOFF(3604);
/*
DBCC PAGE (
     { database_name | database_id | 0 }
    ,file_number
    ,page_number
    [,print_option = {0|1|2|3} ]
) [WITH TABLERESULTS]
*/
GO
Data access pattern
The following diagram represents data access pattern used by the key lookup routine.
--test query
SELECT *
FROM dbo.TestBookmarkLookupthreshold
WHERE SearchValue BETWEEN 1000 AND 1499
Figure 2, Nonclustered index & key lookup internals
Data access pattern analysis
Our next goal is to reconcile the total number of logical I/O reads previously reported by STATISTICS IO (1503) with the diagram above.
Let's find out the total number of I/O read operations required to produce one row (out of the total of 500 rows). The following sequence of events is derived from Figure 2.
1. The Nested Loop operator requests all eligible rows from the Index Seek operator. The non-clustered index is a non-covered index and can provide only the SearchValue and EmployeeId columns. The Index Seek operator uses the two seek predicates to find the range of values, SearchValue >= 1000 AND SearchValue <= 1499.
2. Non-clustered index traverse. The Index Seek operator reads the root page (PageId = 1:16448) of the non-clustered index, making the first I/O read (NoOfReads = 1).
3. The index seek operation, following the lower boundary (SearchValue = 1000) of the search range, finds a pointer (PageId) to the index leaf level page which contains the full (two-column) index rows (ROOT PAGE: SearchValue = 597, PageId = 1:16456). It also knows that PageId = 1:16456 alone cannot provide the complete range of the requested values, but only SearchValue >= 1000 AND SearchValue < 1157. It "remembers" the following pointer, PageId = 1:16457, which can provide the rest of the values, SearchValue >= 1157 AND SearchValue <= 1499.
4. The Index Seek operator performs the second I/O read (PageId = 1:16456) following Path (A). The total number of I/O reads is now (NoOfReads = 2).
5. After storing the first 165 rows found on PageId = 1:16456 in memory, the operator follows Path (B). The operation is known as a "partial index scan". The operator knows that the subsequent page (PageId = 1:16457) contains the rest of the requested rows (335 rows). The current page also has pointers to the previous and next page (doubly linked list). Path (B) makes the third read (NoOfReads = 3).
6. The Nested Loop operator has now received all 500 requested rows from its outer input, the NCI Index Seek operator.
7. The Nested Loop operator now performs the Key Lookup operation over the clustered index, 500 times, once per row collected from the outer input.
8. Clustered index traverse (singleton search). On its very first execution, the Key Lookup operator uses the first row from the Nested Loop outer input (EmployeeId = 1000) and performs its first page read (PageId = 1:2680). The page is the root level of the clustered index bTree. The operation follows Path (C), increasing the total number of reads (NoOfReads = 4).
9. The clustered index root page provides a pointer (PageId = 1:832) to the first intermediate index level. The page maps all EmployeeIds from NULL up to (but not including) 4305. The row with EmployeeId = 1000 falls into that range. Following Path (D), the operator makes its second read and increases the total number of reads (NoOfReads = 5).
10. Intermediate page 1:832 provides information about the leaf level page (PageId = 1:1029) that contains the first full row.
11. The process now follows Path (E) and makes its final, third clustered index read (NoOfReads = 6).
12. The full row is then passed from the Nested Loop operator to the SELECT operator.
13. The Nested Loop operator continues to progress through the list of 500 rows from its outer input, repeating steps 8 - 11 until all 500 rows have been processed.
The total number of reads is
Total No Of Reads = Index Seek operation (3 reads) + 500 * Key Lookup operation (3 reads) = 3 + 500 * 3 = 1503.
This is an exact match with the number of logical reads reported by STATISTICS IO.
Important observations
From the storage level perspective, one of the main differences between the two access patterns is that the clustered index seek (partial scan) is a sequential read I/O operation, whereas the key lookup (singleton clustered index seek) is a random read I/O operation. Sequential reads are generally less expensive than random reads (fewer mechanical movements at the disk level). This is also true for RAM/SSD, although they don't have moving parts. This is a very high level observation on data storage systems. 🙂
The number of random reads depends on the size of the row. The wider the row, the more random reads the key lookup needs to perform to get the same result set. More about this later in the post.
Read ahead optimisation
Earlier in the blog I used trace flag 652 to disable page pre-fetching, aka read-ahead. Read-ahead is an asynchronous I/O mechanism built to bridge the gap between CPU and I/O performance. Because the CPU is many times faster than any storage system, Sql Server's storage engine tries to read up to 64 sequential pages (8 extents) before they are requested by a query. This produces more logical reads, but on the other hand it is faster than performing physical reads only when they are required. Although the number of read-ahead pages may vary, the mechanism reads pages from the index intermediate level(s) in order to identify the leaf level pages to be read in advance. In our case, the functionality would, if not turned off, have added a few extra pages to the STATISTICS IO report and we wouldn't be able to reconcile the reads with the diagram in Figure 2.
Try to run the same test query without turning on the traceflag. The number of logical reads should be greater than 1503.
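For convenience, that check is sketched below. Per the note above, with read-ahead left enabled STATISTICS IO should report read-ahead reads and more logical reads than the 1503 we reconciled earlier:

--Same test query, this time without TF 652, i.e. with read-ahead enabled.
SET STATISTICS IO ON;
    SELECT *
    FROM dbo.TestBookmarkLookupthreshold
    WHERE SearchValue BETWEEN 1000 AND 1499;
SET STATISTICS IO OFF;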
The tipping point
The tipping point represents the critical number of rows after which the query optimiser decides to perform a clustered index scan instead of the non-clustered index seek/key lookup data access pattern.
As previously shown, in the non-clustered index/key lookup scenario, the number of rows requested by a query drives the number of random reads. The main concern when determining the tipping point is the number of data pages that need to be "randomly" read from the clustered index leaf level – one read per requested row. In our example this number is 500. The estimate excludes the clustered index non-leaf (root and intermediate) page reads and the non-clustered index reads altogether.
The number of pages that represents the tipping point lies between 1/4 and 1/3 of the clustered index data pages (leaf level). If a query requests fewer than 1/4 (25%) of the number of clustered index leaf level pages, the query optimiser is most likely to allow random reads and the non-clustered index seek/key lookup data access pattern. However, if a query requests more than 1/3 (33%) of those pages, the query optimiser will implement a clustered index scan data access pattern.
Figure 3, The tipping point
Knowing that a random read corresponds to a selected row, we can express the tipping point as a number of rows:
1/4 (no. of data pages) <= Tipping point (rows) <= 1/3 (no. of data pages)
In our example, the tipping point is expected to be somewhere between 1562 and 2083 rows.
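Rather than hard-coding the 6250 data pages, the expected boundaries can be derived directly from the clustered index leaf page count. A small sketch:

--Expected tipping point range, derived from the clustered index leaf (data) page count.
SELECT  DataPages                 = page_count
       ,TippingPointLowerBoundary = page_count / 4
       ,TippingPointUpperBoundary = page_count / 3
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.TestBookmarkLookupthreshold','U'), 1, NULL, 'DETAILED')
WHERE index_level = 0; --leaf level of the clustered index (index_id = 1)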
So where exactly is our tipping point?
One approach is to apply a binary search over the tipping point interval and use trial and error until the query plan changes.
The other approach is to construct some kind of a program to do that for you 🙂
SET NOCOUNT ON;
GO

DECLARE  @start INT
        ,@end INT
        ,@IndexScan NVARCHAR(1000);

SELECT   @start = 6250.00 / 4
        ,@end   = 6250.00 / 3;

SELECT   [Tipping point range lower boundary]  = @start
        ,[Tipping point range higher boundary] = @end;

;WHILE @start <= @end
BEGIN
    SET FMTONLY ON
        SELECT * --thisIsMyLittleQuery
        FROM DBO.TestBookmarkLookupthreshold
        WHERE SearchValue <= @start
        OPTION (RECOMPILE)
    SET FMTONLY OFF

    ;WITH XMLNAMESPACES (default 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
    ,individualPlans AS
    (
        SELECT   DatabaseName  = DB_NAME(s2.[dbid])
                ,ProcName      = OBJECT_NAME(s2.objectid, s2.[dbid])
                ,sql_statement = SUBSTRING( s2.[text]
                                           ,(s1.statement_start_offset / 2) + 1 -- /2: 2 bytes per character; s2.[text] is nvarchar(max). Start position is byte 0
                                           ,((CASE -- check for the end of the batch
                                                  WHEN s1.statement_end_offset = -1
                                                      THEN DATALENGTH(s2.[text]) -- -1 represents the end of the batch (the last query)
                                                  ELSE s1.statement_end_offset
                                              END - s1.statement_start_offset) / 2) + 1
                                          )
                ,query_plan    = CAST(s3.query_plan AS XML)
        FROM sys.dm_exec_query_stats AS s1
            CROSS APPLY sys.dm_exec_sql_text(s1.[sql_handle]) AS s2
            CROSS APPLY sys.dm_exec_text_query_plan(s1.plan_handle, s1.statement_start_offset, s1.statement_end_offset) s3 -- get the query plans
        WHERE s2.text LIKE '%SELECT * --thisIsMyLittleQuery%'
    )
    SELECT @IndexScan = query_plan.value('(/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple/QueryPlan/RelOp/@PhysicalOp)[1]','nvarchar(max)')
    FROM individualPlans
    WHERE sql_statement LIKE 'SELECT * --thisIsMyLittleQuery%';

    IF @IndexScan = 'Clustered Index Scan'
    BEGIN
        SELECT [TippingPoint (NoOfrows)] = @start
        BREAK;
    END

    SET @start += 1;
END
The script runs the same query for different SearchValue values. The values are within the expected tipping point range, starting from the lower boundary. The query result is suppressed by the FMTONLY ON session setting. OPTION (RECOMPILE) in this context ensures that the value of the local variable @start is known to the query optimiser when creating the execution plan for the query. For each run, the program checks the current query plan's RelOp element. If the element's attribute @PhysicalOp has the value 'Clustered Index Scan', the program terminates and selects the current SearchValue value. That value represents the tipping point we are looking for.
Figure 4, The exact Tipping point number of rows
Note: Instead of using OPTION (RECOMPILE) we could use dynamic string execution.
/*
SET FMTONLY ON
    SELECT * --thisIsMyLittleQuery
    FROM DBO.TestBookmarkLookupthreshold
    WHERE SearchValue <= @start
    OPTION (RECOMPILE)
SET FMTONLY OFF
*/
DECLARE @startStr VARCHAR(10) = CAST(@start AS VARCHAR(10)); --EXEC() only concatenates variables/literals
EXEC ('SET FMTONLY ON
       SELECT * --thisIsMyLittleQuery
       FROM DBO.TestBookmarkLookupthreshold
       WHERE SearchValue <= ' + @startStr + '
       SET FMTONLY OFF')
This approach constructs and optimises the query at run-time. In this case, the local variable @start gets evaluated and is treated as a literal within the dynamic string. It is most likely that the query plan will not be parameterised and the individual plans (one per execution) will be cached as Adhoc plans. This may lead to plan cache pollution, but that is a topic for a separate blog 🙂
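If you want to see that effect, the plan cache can be inspected after running the dynamic string version a few times. A sketch, assuming the --thisIsMyLittleQuery marker comment is still part of the statement text:

--Inspect the single-use Adhoc plans produced by the dynamic string execution.
SELECT  cp.objtype
       ,cp.usecounts
       ,cp.size_in_bytes
       ,st.[text]
FROM sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.[text] LIKE '%thisIsMyLittleQuery%'
    AND st.[text] NOT LIKE '%dm_exec_cached_plans%' --exclude this very query
    AND cp.objtype = 'Adhoc';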
Let's check the tipping point number of rows.
-- check the tipping point TP = 1677, Row Size 500bytes.
DBCC TRACEON(652);
GO
SET STATISTICS XML,IO ON;
SELECT * FROM dbo.TestBookmarkLookupthreshold WHERE SearchValue < 1677;
GO
SELECT * FROM dbo.TestBookmarkLookupthreshold WHERE SearchValue < 1678;
GO
SET STATISTICS XML,IO OFF;
DBCC TRACEOFF(652);
GO
In terms of the number of pages, the tipping point is expected in the range from 25% to 33% of the total number of clustered index data pages (leaf level). In our example, the range was between 1562 and 2083 pages.
From the number of rows point of view, the tipping point was 1677 rows, which is (1677 / 100000) * 100 = ~1.7% of the total number of rows in the table (clustered index leaf level). This shows that Sql Server is very conservative about using the bookmark lookup data access pattern, although the percentage of rows depends on the row size and probably other conditions, e.g. memory pressure, parallel query execution, etc.
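One way to appreciate why the optimiser is so conservative is to force the non-clustered index well past the tipping point with an index hint and compare the I/O figures against the scan. A sketch (the hint is for experimentation only, not a recommendation):

SET STATISTICS IO ON;
    --optimiser's choice for ~5000 rows: clustered index scan
    SELECT *
    FROM dbo.TestBookmarkLookupthreshold
    WHERE SearchValue <= 5000;

    --experiment only: force the non-clustered index and therefore ~5000 random key lookups,
    --then compare the logical reads with the scan above
    SELECT *
    FROM dbo.TestBookmarkLookupthreshold WITH (INDEX(NCI_SearchValue))
    WHERE SearchValue <= 5000;
SET STATISTICS IO OFF;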
Tipping point & row size
As mentioned above, the number of random reads in the non-clustered index/key lookup access pattern depends on the size of the row. The wider the row, the more random reads the key lookup needs to perform to get the same result set.
Let's increase our test table row size from 500 bytes (493 bytes + 7 bytes of overhead) to 1000 bytes. The new row size will expand the table's footprint.
ALTER TABLE dbo.TestBookmarkLookupthreshold
    ALTER COLUMN Notes CHAR(745);
GO
ALTER INDEX NCI_SearchValue ON dbo.TestBookmarkLookupthreshold
    REBUILD;
GO
--this is why it's always better to name db constraints/indexes :)
ALTER INDEX [PK__TestBook__7AD04F11C1954CB9] ON dbo.TestBookmarkLookupthreshold
    REBUILD;
GO
The number of clustered index data pages required to store the 100,000 rows has now doubled, 12502 to be precise (Figure 3 query). The tipping point is now expected to be in the range between (1/4) * 12502 and (1/3) * 12502, or 3125 and 4167 rows.
If we run the query from Figure 4, we'll get the exact tipping point: 3228 rows.
-- check the tipping point TP = 3228, Row Size 1000bytes.
DBCC TRACEON(652);
GO
SET STATISTICS XML,IO ON;
SELECT * FROM dbo.TestBookmarkLookupthreshold WHERE SearchValue < 3228;
GO
SELECT * FROM dbo.TestBookmarkLookupthreshold WHERE SearchValue < 3229;
GO
SET STATISTICS XML,IO OFF;
DBCC TRACEOFF(652);
GO
Figure 6, Query plan change (row size 1000bytes)
The interesting thing here is that now, with the wider rows, the tipping point represents (3228 / 100000) * 100 = ~3.2% of the total number of rows in the table, which is almost double the 1.7% calculated for the 500-byte rows.
The tipping point experiments can also be put in the context of a cached plan. If we wrap our test query into a stored procedure and turn the local variable @start into the procedure's input parameter, on the very first call the query optimiser will create and cache the query plan using the parameter value passed in (parameter sniffing). Every subsequent call will then reuse that cached plan regardless of whether the new parameter value falls below or above the tipping point, so we may end up running a key lookup plan for a request that would be far better served by a clustered index scan, or vice versa.
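A minimal sketch of that scenario, with a hypothetical procedure name chosen purely for illustration:

--Hypothetical procedure used only to illustrate the cached plan scenario.
CREATE OR ALTER PROCEDURE dbo.usp_GetBySearchValue
    @start INT
AS
BEGIN
    SET NOCOUNT ON;
    SELECT *
    FROM dbo.TestBookmarkLookupthreshold
    WHERE SearchValue <= @start;
END
GO

--First call: 500 rows is well below the tipping point, so the plan that gets
--compiled and cached uses the index seek + key lookup shape.
EXEC dbo.usp_GetBySearchValue @start = 500;

--Subsequent call: the cached (sniffed) plan is reused even though 50000 rows is
--far beyond the tipping point and a clustered index scan would be cheaper.
EXEC dbo.usp_GetBySearchValue @start = 50000;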
Conclusion
The concept known as The Tipping Point represents the point at which the number of page reads required by the bookmark lookup operator makes a clustered index/heap table scan less expensive than the non-clustered index seek. In this context, a bookmark lookup operator (Key or RID) is coupled with a non-clustered, non-covered index and its Index Seek operator. The latter performs sequential I/O operations, whereas the bookmark lookup performs a number of random access I/O read operations. Random I/O reads are generally more expensive than sequential I/O reads regardless of the storage system (mechanical HDD, SSD, RAM, etc.). The query optimiser allows the bookmark lookup/index seek data access pattern only if the number of clustered index pages that need to be randomly accessed does not exceed 1/4 of the total number of clustered index data pages (leaf level). If the number of pages exceeds 1/3 of the total number of clustered index data pages, the query optimiser will choose a clustered index scan data access pattern instead. The same applies to the RID Lookup/Table Scan access pattern when the table is a heap.
The range of data pages between 1/4 (25%) and 1/3 (33.3%) of the total data pages defines The Tipping Point space. In this scenario, the number of randomly accessed pages relates to the total number of selected rows. However, 25% - 33% of the pages represents only a fraction of the total number of rows – for a 500-byte row size, between 1.6% and 2%. The range also depends on the row size. For the same number of rows and with the row size set to 1000 bytes, the range increases to 3% - 4% of the total number of rows.
I wish to thank my dear colleague and great SQL Server enthusiast Jesin Jayachandran for inspiring me to write this blog.
Thanks for reading.
Dean Mincic