- Character encoding, strings, and code points
- Sorting rules
Collation is one of those settings in SQL Server that most of the developers would rather avoid understanding and just go with the defaults. At some point during the production life of an application, collations may decide to “strike back” causing unpredictable errors and frustration. This blog is a high-level overview of SQL Server’s Collation, how it works and what is the basic idea behind it.
Character encoding, strings, and code points
Words in a text are created from Characters. Characters are grouped into Character sets aka repertoires.
Computers, because they only deal with numbers, use Character encoding to represent character sets. Each character is encoded(represented as something else) as a unique number also known as Code point e.g letter “A” might be encoded as decimal 65 or 0100 0001 in binary.
Character string data type stores a sequence of characters. In terms of length, SQL Server offers two types of character string data types:
- Fixed-length : CHAR(n), NCHAR(n)
- Variable-length : VARCHAR(n), NVARCHAR(n) ,VARCHAR(max), NVARCHAR(max), text*, ntext*
*text and ntext will be deprecated in future versions of Sql Server
VARCHAR(max) and NVARCHAR(max) aka the LOB data types are variable-length data types that can store up to 2GB of data.
Broadly speaking, there are two main standards for mapping code points to characters: non-Unicode and Unicode.
In SQL Server, most of the non-Unicode characters require 1byte(1) of storage space. The characters are encoded as numbers between 0 and 255. This means that there can be a maximum of 256 distinct, non-Unicode encoded characters stored in a single byte. e.g Character D is encoded as decimal 68 or binary 01000100 and m is encoded as decimal 109 or binary 01101101.
The problem with this is that the total of characters in all the world’s languages exceeds 256 available code points. So how to cram thousands of different characters into just one byte? Well, someone came up with the idea to create Code pages.
1:Most of the available code pages in Sql Server support only Single-Byte Character Sets. However, there are several code pages that allow for Double-Byte Character Sets – the characters that require 2bytes of storage space. SQL Server 2019 supports a new code page for the UTF-8 character encoding. This code page supports characters that may require 1, 2,3 or 4bytes of storage space.
This means that e.g VARCHAR(50), where 50 represents the number of bytes, depending on the collation, may not be able to store 50 non-Unicode character length string despite the common perception.
For varchar(n)/nvarchar(n)/char(n)/nchar(n), (n) represents the storage size in bytes, not the number of characters that can be stored i.e try to cram this bad boy ” 😆” character into, say NCHAR(1).
A code page is basically a character set that represents an ordered list of 256 different code points which define characters specific to a group of similar languages. As of SqlServer 2019, there are 17 + 1 different code pages available.
SELECT [CodePage] = COLLATIONPROPERTY([name],'CodePage') ,[No. of Collations] = COUNT(*) FROM sys.fn_helpcollations() GROUP BY COLLATIONPROPERTY([name],'CodePage');
Each code page has many collations associated with it (one or more collations can define rules such as case sensitivity and sorting over the same code page). CodePage = 0 encapsulates collations that only support Unicode. These collations cannot be used for non-Unicode encoded characters – see an example here. More on the collations in the next paragraph.
The ASCII character Set
As mentioned above, in the non-Unicode world, different code pages support characters relevant to different languages. The 256 available code points are split into two groups.
- Standard ASCII Character set – code points from 0 to 127 (decimal). These code points require 7bits of storage space where every single bit represents a unique character. But we have one more bit that we can use in a byte …
- Extended ASCII Character set – code points from 128-255(decimal). This is an 8-bit character set where the code points represent different characters depending on the code page.
The good thing is that the standard ASCII char. set contains all the characters (alphabet, numbers, punctuation symbols, etc) that we need to write in the English language. On the other hand, the extended ASCII character set covers characters that are specific to languages such as Greek, Spanish, Serbian, Hebrew, Turkish, etc. This means that the non-Unicode covers most of the world’s languages, except Chinese, Japanese, and Korean which require more than one byte to store a character.
The figure below shows a comparison between Cyrillic and Greek ASCII character set vs the first 255 Unicode code points. T-SQL code used for this experiment along with the output (txt) files can be found here: T-sql script, ASCII_Cyrillic_output, ASCII_Greek_output, and Unicode_output.
Figure 2. The first 128 Unicode code points are the same as ASCII
The first 0-127 ASCII characters share the same code points across different code pages and in the Unicode world. This means that, from the encoding perspective, it is irrelevant if we go with non-Unicode or Unicode standard as long as our application uses the first 126 ASCII characters only. However, from a database perspective, we also need to take into consideration SQL Server collations as these “bad boys” define not only code pages, but the rules on how to sort and compare characters.
Figure 2 shows that e.g character “w” is always encoded as decimal 119 regardless of the code page/Unicode (hex 0x77) – see ASCII and Unicode. Characters from 128-255 may have different encodings depending on the code page and/or Unicode
If a system uses a non-Unicode system to store e.g Cyrillic characters specific to a particular language, we need to be sure that both, client and SQL Server use the same code page to encode text. Otherwise, the client app may send e.g Cyrillic character “ч” (like ch in chocolate) and SQL Server, if set up to use a Greek code page, may decode and store it as a Greek “χ” (chi) – Figure 2.
Some time ago, someone come up with an idea to create a universal character encoding standard that will contain all possible characters from all the world’s languages and more. The idea was for each character to have their own unique code point that never changes e.g Cyrillic “ч” is always encoded as decimal 1095 and the Greek letter “χ” is always encoded as decimal 967. It basically means that the code points are the same regardless of platform, device, or application. The Unicode standard is maintained by the Unicode Consortium.
There are several different implementations of the Unicode standard depending on the way code points(numbers) are stored in memory;
- UTF-8 – By far, the most popular implementation. it is a variable-length encoding that requires up to 4bytes(32bits) per character. It uses 1byte per character for the standard ASCII characters and 2bytes or 4bytes for others.
- UTF-16 – SQL Server default Unicode implementation. Although the standard allows 1,114,111 different characters, SQL Server’s NCHAR/NVARCHAR can store the Unicode characters in the code point range only* from 0 to 65,535. This is also known as BMP – Basic Multilingual Plane. Every character in this range requires 2byte of storage space, hence i.e NCHAR(1) requires 2 bytes. The characters above this range would require 4 bytes.
- UTF-32 – Opposed to the variable-length encodings, this one always uses 32bits(4bytes) per character. Hmm, that can waste a lot of space.
*Characters with code points above decimal 65,535 are called Supplementary characters(SC). SQL Server 2012 introduced a new set of collations that enable Unicode data types NVARCHAR, NCHAR, and SQL_VARIANT to store the whole Unicode range, the BMP, and the Supplementary characters.
SQL Server collations have two main purposes:
- To implement Code pages for the non-Unicode characters. This does not apply to the Unicode characters, because, as mentioned above, they have their own code points that are “set in stone”.
- To define rules on how to sort and compare characters. These rules apply to both non-Unicode and Unicode characters.
Different collations can implement the same Code page e.g from the code above (Figure 1), we can see that there are e.g 894 different collations based on the same Code Page, 1252. Each of the 894 collations implements different sorting rules for the same code e.g The example below demonstrates how the result of the LIKE operator depends on the collation of its operands.
USE master go ALTER DATABASE testCollation COLLATE Latin1_General_100_CI_AI; GO USE testCollation GO DROP TABLE IF EXISTS #compareNonUnicode; GO CREATE TABLE #compareNonUnicode(txtCI VARCHAR(20) COLLATE Latin1_General_100_CI_AI ,txtCS VARCHAR(20) COLLATE Latin1_General_100_CS_AS ); GO INSERT INTO #compareNonUnicode (txtCI,txtCS) SELECT 'Hello World','Hello World'; GO SELECT * FROM #compareNonUnicode WHERE txtCS LIKE 'h%' -- no match GO SELECT * FROM #compareNonUnicode WHERE txtCI LIKE 'h%' --match GO
The comparison rules apply for the Unicode data types as well – try to use NVARCHAR(20) instead of use VARCHAR(20) in the code above).
There is a set of collations that support only the Unicode encodings and cannot be set on a database level. These collations have Code Page = 0 (see Figure 1). Here is an interesting example of how these collations works with the non-Unicode data.
SQL Server collations can be divided into two sets depending on the sorting rules they implement.
- Windows Collation – based on Windows OS system locale. These collations use the same algorithms for sorting Unicode and non-Unicode data.
- SQL Collation – based on previous versions of SQL Server. This set of collations use different algorithms for sorting Unicode and non-Unicode data. This can produce different results when comparing the same character string encoded as non-Unicode and Unicode. SQL Collation names begin with “SQL_%“.
This script lists all SQL and Windows collation sets available in SQL Server.
--SQL and Windows collation sets SELECT CodePage = COLLATIONPROPERTY([name],'CodePage') ,CollationSet = IIF(name LIKE '%SQL%','SQL Collation','Windows Collation') ,[name] ,[description] FROM sys.fn_helpcollations() ORDER BY CodePage ,CollationSet --No. of SQL vs Windows collation sets per code page SELECT [CodePage] = COLLATIONPROPERTY([name],'CodePage') ,NoOfWinCollations = SUM(IIF([name] NOT LIKE '%SQL%',1,0)) ,SQLCollations = SUM(IIF([name] LIKE '%SQL%',1,0)) ,TotalNo = COUNT(*) FROM sys.fn_helpcollations() GROUP BY COLLATIONPROPERTY([name],'CodePage') ORDER BY [CodePage]
Both Windows and SQL collation sets also support Binary based collation (‘%_BIN’ or ‘%_BIN2′). Binary collations compare characters by comparing their code points.
Windows vs SQL Collation sorting quirks
It is interesting that, when installing a brand new SQL Server instance, the default collation is initially set to SQL_Latin1_General_CP1_CI_AS, if the OS is using the U.S. English Locale.
From the SQL Server 2000 Retired Technical documentation, we can learn that the Db installer chooses the Windows collation that supports the Windows locale of the computer on which the instance of SQL Server is being installed. This is followed by the note below:
“The Setup program does not set the instance default collation to the Windows collation Latin1_General_CI_AS if the computer is using the U.S. English locale. Instead, it sets the instance default collation to the SQL collation SQL_Latin1_General_Cp1_CI_AS. This may change in a future release“.
Well, it hasn’t been changed yet. 🙂
A later document states:
During SQL Server setup, the default installation collation setting is determined by the operating system (OS) locale. For backward compatibility reasons, the default collation is set to the oldest available version that’s associated with each specific locale. To take full advantage of SQL Server features, change the default installation settings to use Windows collations. For example, for the OS locale “English (United States)” (code page 1252), the default collation during setup is SQL_Latin1_General_CP1_CI_AS, and it can be changed to its closest Windows collation counterpart, Latin1_General_100_CI_AS_SC.
Different sorting rules
This example demonstrates how an SQL collation uses different sorting rules for the non-Unicode strings. The query output shows the different sort orders for the same values encoded as the Unicode/non-Unicode e.g SQL Collation sorts a non-Unicode “Mountain BIke A-F100” before “Mountain BIke ABC” because, it treats the hyphen as a separate character, whereas the windows collation, for the same string, use a “word sort” sorting rules that ignores the hyphen, hence Mountain BIke ABC is less than “Mountain BIke A-F100“
If for some reason, we want to return cursor* to the client, the order of the cursor elements may differ depending on the collation and the encoding of the sorted column(s).
Note: A query with a presentation ORDER BY clause i.e the one that is not coupled with the TOP clause, returns an object with rows organised in a specific order. ANSI recognises such an object as a CURSOR.
The query optimiser cannot use proper statistics
Now, when we think about sorting, we cannot avoid thinking about indexes. So the next “quirk” demonstrates how a simple query that returns a single row, with an index on the searched column, and with a predicate that appears to be SARGable, may “decide” to do a full clustered index scan instead of the expected index seek/key lookup.
Note: Client requests & sql events in Sql server post explains parameterised batch requests.
The example shows how exactly the same query can behave differently depending on the database collation. If the DB uses the default SQL collation, not the one recommended by Microsoft for OS locale “English (United States)” (code page 1252) – the most commonly used locale, the query will be implemented through a sub-optimal plan.
Fire up MS Profiler and include the following events:
- Performance: Showplan XML Statistics Profile
- Sessions: Existing Connection
- Stored Procedures: RPC Completed
- TSQL: Exec Prepared SQL, Prepare SQL, SQL:StmtCompleted, Unprepare SQL
With the testCollation database set to use SQL_Latin1_General_CP1_CI_AS, we have the following output.
Figure 4, SQL collation and sub-optimal plan
So, the client code executes a parameterised batch. It then passes a Unicode parameter value “Goldberg“. The query executes in the context of testCollation DB set to use SQL_Latin1_General_CP1_CI_AS collation. The collation implements different sorting rules for the Unicode and non-Unicode characters. On the left side of the predicate, we have non-Unicode (column:LastName“) values that we compare to a Unicode value.
declare @p1 int
exec sp_prepexec @p1 output,N‘@P1 nvarchar(16)’,N’SELECT * FROM dbo.Employees WHERE LastName = @P1;’,N‘Goldberg’
–note: @p1 != @P1, @p1- is a prepared query handle int value.
Query Optimiser, because of the different sorting rules, cannot compare the values and access the available index through a seek operation. Instead, it decides to perform a full scan on the clustered index. Using a residual predicate operation on the Clustered Index Scan Operator, it implicitly converts each LastName value to NVARCHAR(50), the Unicode value, before comparing it with the query parameter value N”Goldberg”.
Interestingly enough, if we set up our test DB to use Windows-based collation(Latin1_General_100_CI_AS_SC), for the same OS locale, the batch request will be implemented as expected (Index seek/Key lookup).
Change tsql script to set the Windows collation on the testCollation db, and repeat the test.
... ALTER DATABASE testCollation --COLLATE SQL_Latin1_General_CP1_CI_AS; COLLATE Latin1_General_100_CI_AS_SC; --windows collation; GO ...
Following a similar execution sequence, the Query optimiser was able to use the Index on the searched column and construct a more optimal plan.
Figure 5, Windows collation produces an optimal plan
From the experiments above, we can see that we need to be aware of the DB collation characteristics and how they may affect query execution.
It is worth mentioning that it’s always a good idea to define proper parameter data types when making RPC calls whether it is a sproc or a parameterised batch. In this case, it was supposed to be varchar(50)).
.. query = "SELECT * FROM dbo.Employees WHERE LastName = CAST(? AS VARCHAR(50));" ..
Comparing a non-Unicode string to a Unicode string, in this case, the LastName column VARCHAR to an NVARCHAR requires the Implicit Conversion operation. This conversion follows the Data type precedence rules that say that the varchar value must be converted to nvarchar which has higher type precedence. This means that the query must convert every LastName value to nvarchar before evaluating the predicate; LastName = N”Goldberg”. However, QA is still able to utilise the index seek. More on that in the next segment.
More on the ODBC param. batch requests and dynamic index seek
This is not directly related to the topic of this post, but I found it super interesting, so here it is 🙂
In addition to the query behavior presented above, we can observe a few more interesting things.
ODBC driver used by Python’s pyodbc library implements parameterised batch requests using the sys.sp_prepexec system stored proc- see Figure 4. The sproc implements sys.sp_prepare and sys.sp_execute. It is interesting that the plan generated by the sys.sp_prepexec does not “sniff” the first parameter passed – in our case, Mr. “Goldberg”. We would expect QO to use Histogram info to find out how many Goldbergs there are in the LastName column. In this particular case, it would use the AVG_RANGE_ROWS = 1.
Instead, QO used General (All) Density(aka Density Vector) = 0.003703704. The reciprocal of the density vector represents the number of unique LastNames. (1/0.003703704 = 270). So how many Goldbergs QO estimated? Density Vector * TableCardinality–> 0.003703704 x 296 = 1.0963. This information is available in the properties of the Index Seek operator.
If we run the same code using e.g .NET Framework Data Provider for SQL Server (pshell example), the parameterised batch will be implemented using sys.sp_executesql system sproc and QO will use histogram data to make a more accurate estimate of the number of qualified rows for the “LastName” predicate.
Dynamic index seek
The shape of the plan presented in Figure 5, includes ConstantScan and ComputeScalar operators which interact with the Index Seek operator. This plan shape implements an index seek with a dynamic seek range data access pattern.
Let’s generate the same plan in SSMS.
DBCC TRACEON(2486); SET STATISTICS XML ON; DECLARE @p1 INT SET @p1=-1 EXEC sp_prepexec @p1 OUTPUT,N'@P1 nvarchar(16)',N'SELECT * FROM dbo.Employees WHERE LastName = @P1 OPTION(RECOMPILE);',N'Goldberg' SET STATISTICS XML OFF; DBCC TRACEOFF(2486); EXEC sys.sp_unprepare @p1 GO
The undocumented traceflag 2486 exposes the Expressions values (Expr1003, Expr1004, and Expr1005) assigned during the runtime. Values are visible in the XML query plan.
<ValueVector> <ColumnReference Column="Expr1004" /> <ColumnReference Column="Expr1005" /> <ColumnReference Column="Expr1003" /> </ValueVector> <ScalarOperator ScalarString="GetRangeThroughConvert(N'Goldberg',N'Goldberg',(62))"> <Intrinsic FunctionName="GetRangeThroughConvert"> <ScalarOperator> <Const ConstValue="N'Goldberg'" /> </ScalarOperator> <ScalarOperator> <Const ConstValue="N'Goldberg'" /> </ScalarOperator> <ScalarOperator> <Const ConstValue="(62)" /> </ScalarOperator> </Intrinsic> </ScalarOperator>
An interesting thing about this output is that it exposes an Intrinsic(built-in) Function “GetRangeThroughConvert”. The purpose of this function is to narrow down the set of the LastName candidates for the final result(dynamic seek range). It is visible in the Seek Predicates plan segment. This significantly reduces the number of the LastName column values to be implicitly converted from VARCHAR to NVARCHAR. Without this optimisation, QO would decide to go with a full clustered index scan performing the conversion for all LastName values.
Once the function reduces the number of LastName candidates, the system performs implicit conversion through the Residual Predicate.
Figure 6, GetRangeThroughConvert built-in function
The only thing missing is Expr1003 = 62. The number(bitmask) defines the actual test performed – dynamic seek range is always presented in a generic way Start: Column >Expr, End Column<Expr. In this case, 62 includes the Expr values in the interval narrowing down the seek range to just one value, Mr. “Goldberg”.
Words in a text are created from Characters. Characters are encoded as numbers. Some complex characters may be presented as a combination of two or more numbers. Non-Unicode characters generally support 256 different characters, the first 128 being ASCII charset. The second “half” depends on the Code page. Different code pages define different sets of characters. Unicode characters implement a universal character encoding standard where each character has its unique,set in stone, code.
Collations in SQL Server have two main purposes: to implement code pages for the non-Unicode characters and to define sorting rules for Unicode and non-Unicode characters. Conversions between different collations may result in errors or in the sub-optimal query execution plans.
It is very important that the application code and the database are on the same page (pun intended) and the same characters are understood in the same way.
Thanks for reading.