Tag Archives: database design

Set up Python for SQL Server


Summary

In SQL Server 2016, Microsoft extended SQL Server engine capabilities to be able to execute an external code written in R language. Sql Server 2017 added Python language to the mix and 2019 integrated Java. The new functionality is a part of Sql Server’s Machine Learning Services.
This blog explains how to set up SQL Server 2019 so we can use ML services and
the Python programming language.

Prepare the environment for Python

The first step is to add Machine Learning Service Extensions / Python to our SQL Server environment.

Figure 1, ML services

The next step is to allow Sql Server to execute external scripts by turning on the external scripts enabled user option on the server level

sp_configure 
     @configname = 'show advanced options'
    ,@configvalue = 1
GO
RECONFIGURE WITH OVERRIDE;
GO
 
sp_configure 
     @configname = 'external scripts enabled'
    ,@configvalue = 1            
GO
RECONFIGURE WITH OVERRIDE;
GO

At this stage, we should be ready to go.

Let’s explore how the SQL Server engine relates to the above-mentioned ML(Machine Learning)  services. The two components are “bound” by the Extensibility framework.

Launchpad

Launchpad service, one of several services in Sql Server, has a job to manage and execute external scripts – the scripts written in Python language. It instantiates an external script session as a separate, satellite process.


Figure 2, Extensibility framework

The purpose of the extensibility framework is to provide an interface between the SQL Server engine and Python (and other languages). Figure 2 shows how different components interact and demonstrates security concepts.
From the super high-level point of view, the process goes like this:

  1. Our application calls system stored proc. sys.sp_execute_external_script. The proc is designed to encapsulate code written in one of the external languages, Python in this case. The stored proc executes within the SQL Server engine – Sqlservr.exe. The engine knows that the code it needs to run contains a program written in Python and therefore starts the Launchpad.exe process. Sqlservr.exe uses Named Pipes protocol(often used for the communication between processes on the same machine) to pass the code to be executed.
  2. Launchpad service now call initiates the DLL specific to the language that needs to be run. In this case, it’s PythonLauncher.dll.
  3. PythonLauncher.dll now runs Python.exe/dll which compiles the program to bytecode and executes it.
  4. Python.exe then talks to yet another program, BxlServer.exe. Bxl stands for Binary Exchange Language. This program coordinates with the Python runtime to manage exchanges of data and storage of the working results.
  5. sqlsatellite.dll communicates with SQL Server engine over TCP/IP protocol and ODBC interface. It retrieves input data sets*, sends back result sets and Python’s standard I/O streams – stdout and stderr.
  6. BxlServer.exe communicates with Python.exe passing messages back and forth from Sql Server through sqlsatellite.dll.
  7. Sql Server gets the result set, closes the related tasks and processes, and sends the results back to the client e.g SSMS

Security

Security architecture for the extensibility framework can be observed in Figure 2.

  • The client uses Windows/SQL login to establish a connection with SQL Server and execute Python code encapsulated in sys.sp_execute_external_script stored proc.
  • Launchpad service runs as a separate process under its own account – NT Service\MSSQLLaunchpad$<instanceName>. The Windows account has all the necessary permissions to run external scripts and is a member of the SQLRUserGroup2019 Windows group.
    How to interact with the file system(Import/Export csv, xlsx, ..) or to access resources outside the Server env will be explained later in this blog
  • Launchpad, when invokes the appropriate runtime environment, in our case, the PythonLauncher.dll, initiates an AppContainer and assigns the process to it. By default, the Launcher service creates 21 containers AppContainer00AppContainer20. The number of containers can be changed through Sql Server Configuration Manager/Sql Server Launchpad/Advanced/Security Contexts Count property.
    AppContainer is an internal mechanism that provides an isolated execution environment for the applications assigned to it. An application running in an AppContainer can only access resources specifically granted to it. This approach prevents the application from influencing, or being influenced by, other application processes
  • Launchpad maps the User(used to make a connection) to an AppContainer that now defines credentials unique to user/application pair. This also means that the application, in this case, launchpad.exe cannot impersonate db Users and gain more access than it should have.

So, in a nutshell, a Client connects to Sql Server using Sql/Win login mapped to db user. The user runs sys.sp_execute_external_script with the Python code.  Sql Server runs launchpad.exe which runs under it’s own win account. The launcher then initiate one of twenty AppContainers

Firewall rules – how to access data from the Internet

As mentioned before, there are 20 separate application containers – AppContainer00 – AppContainer20 These objects restrict access to the applications assigned to them, in this case launchpad.exe.

Figure 3, Windows Firewall outbound rules related to the AppContainers

If we disable these rules, our Python code, encapsulated in sys.sp_execute_external_script, will be able to communicate with the “outside world” and e.g get up-to-date FX (Foreign Exchange) rates, BitCoin price for different currencies etc. – see. forex-python module. We can then e.g describe the result and send back to the Client/top level procedure. It is also possible to “join” the data to a tSQL query result, within the same query batch, and then to send it back to the caller – amazing stuff 🙂

EXEC sys.sp_execute_external_script
        @language = N'Python' --specify language to use
        ,@script   = N'
from pandas_datareader import data

start_date = "2021-11-20"
end_date = "2021-12-01"

UsToAus = data.DataReader("DEXUSAL", "fred",start=start_date, end=end_date).reset_index()

print(UsToAus)
'       
       ,@output_data_1_name =N'UsToAus'
WITH RESULT SETS  (([Date] DATE,[US to AUS]  FLOAT) )


In the example above, the code connects to FRED (Federal Reserve Bank of St. Louis, Missouri, USA) and pulls the exchange rates between the USD and  AUD for a given timeframe. 

File system accessibility – how to import/export result

Another cool thing that you can do with Python is to easily export results to disk  e.g save a query result (report) in a form of an Excel file, to a location where it can be picked up(pulled) by the authorised users. It is also possible to read e.g a lookup table – extended customer information, states, employee information etc,  from disk and join it with the tSQL query result-set, again within the same sp call.
To enable interaction with the file system we need to allow “Sql Server’s” Python to access specific folders. By default, Python can only access its working directory and its subdirectories, in this case c:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019\PYTHON_SERVICES\. As mentioned before, the Launchpad.exe service runs under its own account. It starts python.exe runtime environment in a process, again, under the  same account. Finally it instantiates a  new AppContainer object to contain its activities. This container now acts as a windows app. with very limited access. Windows apps can access only those resources (files, folders, registry keys, and DCOM interfaces) to which they have been explicitly granted access (“ALL APPLICATION PACKAGES”).

Figure 4, ALL APPLICATION PACKAGES access rights

Now, we can use Pivoting with Python script from another blog to export results to disk.

Figure 5, export results to disk

Memory allocations

It is important to understand how ML Services fits into Sql Server memory allocation scheme to be able to access the possible impact it can have on the database server performance.

As we can see from Figure 2,  Python code runs outside of the processes managed by Sql Server. This means that it doesn’t use memory allocated to Sql Server.

The query below shows that, by default, the maximum memory that ML services can use, outside the memory allocated to Sql Server, is 20 % e.g Total RAM 256GB, Sql Server allocated mem = 200GB, ML can use up to 56 * 0.2 = ~11GB

SELECT * 
FROM sys.resource_governor_external_resource_pools
WHERE [name] = N'default'


Figure 6, External resource pools

We can increase the max_memory_percent to e.g 50% by altering  the “default” external resource pool

ALTER EXTERNAL RESOURCE POOL "default" 
   WITH (max_memory_percent = 50);
GO
ALTER RESOURCE GOVERNOR 
RECONFIGURE;
GO

File locations and configurations

  • launchpad.exe: ..\MSSQL15.SQL2019\MSSQL\Binn\Launchpad.exe
  • pythonlauncher.log ..\MSSQL15.SQL2019\MSSQL\Log\ExtensibilityLog\pythonlauncher.log 
  • Config file (Python) :  ..\MSSQL15.SQL2019\MSSQL\Binn\pythonlauncher.confg
    Python home directory, encoding standard, env patrh, working directory – this is where python stores intermediate results during the calculations, etc.

Python version

SQL Server 2019 supports Python 3.7.1. The package comes with the Anaconda 4.6.8 distribution. The language runtimes and various packages e.g  pandas, are installed under the root directory:

c:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019\PYTHON_SERVICES

To check Python version use command prompt:

c:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019\PYTHON_SERVICES>python --version

To check Anaconda distribution version:

c:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019\PYTHON_SERVICES\Scripts>conda.exe --version

But wait, there is more 🙂

With SQL Server 2019 CU3+ (latest CU) we can use the Language Extension feature to “add” different versions of Python to the mix. In fact, we can add any language extension.
For example, I use Python – version 3.9.7 that is installed as a part of the Anaconda 4.10.3 distribution, for work, not related to Sql Server platform. Wouldn’t it  be cool if it was possible to use that installation in SQL Server?

Note:  There are two packages developed by Microsoft, specifically for out-of-the-box Python installation: revoscalepy and microsoftml. These packages are not available in the distribution mentioned above . 

Language extensions is a set of dll’s written in C++ that act as a bridge between Sql Server and the external runtime. This technology allows us to use relational data(query result sets) in the external code. In a nutshell, the current Python language extension program is an open source code available here. However, the code currently supports only versions 3.7.x. So, to be able to use Python 3.9 as the external code, we need to build our own “bridge”.

Here is an excellent article on how to do it. It looks complex , but once you start you’ll not be able to stop – it’s just so much fun.

Figure 7, Python 3.9 Language extension for SQL Server

In the next blog, I’ll present various dev tools we can use to construct our Python code in Sql Server. 🙂

Thanks for reading.

Dean Mincic

GUID in Sql Server


Summary

In SQL Server, a GUID/UUID – Globally Unique Identifier/Universally Unique Identifier is a 16byte binary value represented as a UNIQUIEIDENTIFIER data type. The idea is to generate and store values that are unique across different database servers and networks. There are several ways to create these values e.g using client code i.e System.Guid.NewGuid() method in .NET, NEWID() / NEWSEQUENTIALID() functions in SQL Server etc.
There are many SQL Server databases designed to use GUID as a surrogate, primary key, and/or clustered index key. This is to enforce entity(table) integrity and/or logical order of the rows. These, somehow common decisions in the industry may cause design and performance issues. This post explores several such scenarios and the possible reasoning behind them.

GUID structure

A GUID is a 16byte(128bit) unsigned integer. It’s a lot of combinations of digits,  2128 or 1038 to be precise. The available space is logically divided into five segments separated by hyphens. This is how the 16byte space is represented to us.
In SQL Server, in addition to the uniqueidentifier, we can also use binary, Unicode, and non-Unicode datatypes (variable or fixed length) to store GUID values – more on that in the following section.

Let’s store a big, positive numeric value into a UNIQUEIDENTIFIER and in a BINARY data type of the same size(16bytes),  and see what happens.

DECLARE @BigNumeric DECIMAL(28,0) = 1234567898765321234567898765;   
/*
For the demonstration I used a positive integer that requires more than 8bytes of space 
*/

DECLARE @binary16 BINARY(16)
       ,@guid UNIQUEIDENTIFIER

SET @binary16 = @BigNumeric
SET @guid = @binary16 --implicit conversion from a numeric to uniqueidentifier is not permitted

SELECT [decimal] = @BigNumeric
      ,[binary(16)] = @binary16
      , [GUID] = @guid;

--to convert back to numeric
/*
SELECT CONVERT(DECIMAL(28,0), @binary16) --bin -> numeric
SELECT CONVERT(DECIMAL(28,0),CONVERT(BINARY(16), @guid)) --guid -> bin -> numeric
*/
GO


Figure 1, A big numeric value stored as  binary and uniqueidentifier 

We can see that the number stored as UNIQUEIDENTIFIER (16byte binary) data type is represented in hexadecimal form and formatted in an 8 – 4 – 4 – 4 – 12 pattern. That is exactly 16bytes (4b-2b-2b-2b-6b) with each byte represented as a pair* of characters i.e the far right byte holds hexadecimal number 03. So, 16bytes are stored and 32+4 characters are displayed(including the hyphens).

*NOTE:  In SSMS, binary values are represented as hexadecimal numbers with the 0x prefix. In fact, many computer languages allow programmers to indicate that a value within a program is a hexadecimal number. In the representation above, each byte is represented as a pair of characters(hex numbers) i.e from Figure1, FD03 – two far-right bytes, FD00 -> 64,768(decimal) , 0003 -> 3(decimal) and together they give 64,771(decimal)   – More on hex to dec conversion can be found here.

Interestingly, UNIQUEIDENTIFIER does not store information exactly the same as a raw, BINARY data type.
Figure1 shows that UNIQUEIDENTIFIER stores the first 16 hexadecimal digits(the first 8 rightmost bytes) exactly the same as BINARY. The remaining three segments store bytes in reversed order.

The first 8bytes(the last two segments) are an 8-element byte array. In general,  array elements are stored in index order and that is why they match the row binary storage pattern. The remaining three segments contain re-shuffled byte ordering and that is a feature of the UNIQUEIDENTIFIER data type. On top of that, the Microsoft OS  byte encoding (Endianness) pattern depends on the underlying CPU architecture i.e Intel processors implement the little-endian encoding. This deviates from the RFC standard, which imposes the use of the Big Endian byte ordering.

To make things more confusing, let’s compare BINARY and UNIQUEIDENTIFIER byte footprints on the data page. For this experiment, I’ll store the values from Figure 1 in a table. Execute this code and observe the results.

Figure 2, UNIQUEIDENTIFIER values on a data page

It seems that the UNIQUEIDENTIFIER value is stored exactly the same as the BINARY value without byte rearrangements. However, the same value is presented with the reshuffled bytes.

RFC4122 is a standard that defines how to structure UUID. This is a set of rules on how to generate values, represented in the four segments (presented as a five-segment string), which together form a globally unique number.
The functions like System.Guid.NewGuid() or NEWID() are based on these rules.

Ways to store GUIDs in SQL Server

UNIQUEIDENTIFIER is a data type designed to store GUID values. In practice, however, programmers sometimes choose binary, Unicode, or non-Unicode data types to store those values. The string data types require significantly more space. There are also concerns about sorting and casting.

  • BINARY(16) – 16bytes
  • CHAR(36)  – 36bytes
  • NCHAR(72) – 72bytes

The script below shows different storage space requirements for the same GUID value:

DECLARE @guid UNIQUEIDENTIFIER = '18eb083c-39dd-4859-ad7f-0392077c32cb'

SELECT [GUID] = @guid
      ,[uniqueidentifier(bytes)] = DATALENGTH(@guid)
      ,[binary(bytes)]           = DATALENGTH(CONVERT(VARBINARY(100),@guid))
      ,[non-unicode(bytes)]      = DATALENGTH(CONVERT(VARCHAR(100),@guid))
      ,[unicode(bytes)]          = DATALENGTH(CONVERT(NVARCHAR(100),@guid));
GO


Figure 3, space required to store a GUID value

Figure 3 shows that if we decide to use e.g Unicode data type to store GUID we’ll need 4.5x more space than if we use UNIQUEIDENTIFIER or BINARY data type.

BINARY data type conversions

According to SQL Server Data Type Conversion Chart (download here), UNIQUEIDENTIFIER value can be implicitly converted to a string and binary datatypes.
In the next experiment, client code generates a GUID value and passes it to a stored procedure as a parameter of type string. The stored proc then inserts the guid value into a table as string, binary, and the uniqueidentifier.
The Python code can be found here and the TSQL script can be found here.

...python code segment..

ui = uuid.uuid4() #create a guid value

sql = "EXECUTE dbo.TestGUIDinsert @guid = ?" #define tsql command and params
params = ui

#execute stored proc 
cursor.execute(sql,params) ##RPC call - parameterised batch request
...

After running the code above and SELECT * FROM dbo.TestGUIDstore;

Figure 4, string GUID to BINARY conversion 

The test shows a peculiar value in the uid_bin16 column. This is not the expected GUID value stored as BINARY(16). So what is it?

Storing GUID values using explicit conversion from string to BINARY(16) results in the loss of the GUID values.

NOTE: Just a couple observations, not directly related to this post. The Python script above connects to the DB using pyodbc library (odbc provider). All versions of the provider, by default, have the autocommit connection string property set to False. This sets the IMPLICIT_TRANSACTIONS to ON. Try to exclude the property param from the conn. str. and see what happens. The second interesting thing is the way the “execute” method executes the query(RPC call, parameterized batch request). More on the two topics can be found here(Data Providers and User Options) and here(Client Requests and SQL Events in SQL Server)

From the previous experiment:

--Default: Style =0
SELECT CONVERT(BINARY(16),'D9DD9BA5-535C-46C3-888E-5961388C089E',0) 

--Result: 0x44394444394241352D353335432D3436

…and if we try to convert the binary value back to GUID i.e this time to UNIQUEIDENTIFIER

SELECT CONVERT(UNIQUEIDENTIFIER,0x44394444394241352D353335432D3436)

--Ressult: 44443944-4239-3541-2D35-3335432D3436

The retrieved GUID is totally different than the one originally stored in the table.

The reason for this behavior lies in the explicit conversion from string to BINARY(16).
What Python program sent, ‘D9DD9BA5-535C-46C3-888E-5961388C089E’, is a string representation of a GUID.  The format is recognizable by UNIQUEIDENTIFIER data type. However, for BINARY data type, this is just a set of ASCII characters, not a hex string – a hexadecimal value represented as a string. Moreover, the ASCII set of characters requires 36bytes to be stored as a hex string.

Just to illustrate the point(and because I found it interesting 🙂 ), the following script takes out each char from the GUID script, gets its ASCII code, converts the code into a hexadecimal value, and put it back into its position. i.e char ‘D’  is ASCII 68 and hex 44.

DECLARE @guidString CHAR(36) = 'D9DD9BA5-535C-46C3-888E-5961388C089E';
DECLARE @result VARCHAR(72)=''; --stores the final result

;WITH N1(n) AS --generate 6 rows
(
    SELECT 1 UNION ALL SELECT 1 UNION ALL  SELECT 1 UNION ALL
    SELECT 1 UNION ALL SELECT 1 UNION ALL  SELECT 1 
),numbers(n1) AS --use cartasian product (6x6) to get enough rows to match the GUIDs 36 chars
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) --construct a sequence of numbers 1-36
    FROM N1,N1 a
),getHexAscii AS 
(
SELECT n1
      ,chars = SUBSTRING(@guidString,n1 ,1) --use seq. no. to navigate to the chars to be "extracted"
      ,[ascii] = ASCII(SUBSTRING(@guidString,n1 ,1)) --ascii code of the "extracted" char
      ,hexAscii = FORMAT(ASCII(SUBSTRING(@guidString,n1 ,1)),'X') --the ascii code formated as hex
FROM numbers
)
--consolidate the chars back to a hexstring
SELECT @result += getHexAscii.hexAscii 
FROM getHexAscii
ORDER BY n1 ASC;

SELECT  [GUID_hexString(72chars)] = @result

Now we can compare the binary value stored in the uid_bin16 column with the output from the script above:

0x44394444394241352D353335432D3436  (stored in the table)
  44394444394241352D353335432D343643332D383838452D353936313338384330383945 (script)

Not only that the GUID is not stored correctly but, now we can see that half of the input string got truncated (it simply needs more space than 16bytes, as mentioned above).

If you still want to store GUIDs as a BINARY data type, one of the techniques is to remove hyphens and then convert the string to BINARY(16).
Note: un-comment –,uid_bin16 = CONVERT(BINARY(16),REPLACE(@guid ,’-‘,”),2)  from the table definition code and run the Python script again. Inspect the stored values.
The following script demonstrates the same conversion approach.

DECLARE @guidString CHAR(36) = 'D9DD9BA5-535C-46C3-888E-5961388C089E';
DECLARE @guidBinary BINARY(16)
--Style 2
SELECT  @guidBinary = CONVERT(BINARY(16),REPLACE(@guidString ,'-',''),2)
SELECT  @guidBinary --Result: 0xD9DD9BA5535C46C3888E5961388C089E

The third parameter – Style, in this context, defines how the CONVERT function treats the input string. More information can be found here. In this case, Style = 2, instructs the function to treat the input string as a hex string with no 0x suffix. This is why we get the correct conversion.
Keep in mind that if you need to pull the binary information from DB and pass it to the client as GUIDs, the following conversion to UNIQUEIDENTIFIER will result in a similar but different GUID, as explained before.

DECLARE @guidString CHAR(36) = 'D9DD9BA5-535C-46C3-888E-5961388C089E';
DECLARE @guidBinary BINARY(16)
SELECT  @guidBinary = CONVERT(BINARY(16),REPLACE(@guidString ,'-',''),2) 
SELECT  @guidBinary --Result: 0xD9DD9BA5535C46C3888E5961388C089E

--and convert to GUID
SELECT CONVERT(UNIQUEIDENTIFIER,@guidBinary)
/*
  BINARY:          0xD9DD9BA5 535C 46C3 888E 5961388C089E
  UNIQUEIDENTIFIER:  A59BDDD9-5C53-C346-888E-5961388C089E
*/

Of course, you can always “STUFF” a few hyphens into the string representation of the BINARY to get a GUID string shape.

DECLARE @guidString CHAR(36) = 'D9DD9BA5-535C-46C3-888E-5961388C089E';
DECLARE @guidBinary BINARY(16)
SELECT  @guidBinary = CONVERT(BINARY(16),REPLACE(@guidString ,'-',''),2)
SELECT  @guidBinary --Result: 0xD9DD9BA5535C46C3888E5961388C089E

--and back to string (Style 2 (string representation without the leading 0x )
SET @guidString = CONVERT(VARCHAR(36),@guidBinary,2) 
SET @guidString = STUFF(STUFF(STUFF(STUFF(@guidString,9,0,'-'),14,0,'-'),19,0,'-'),24,0,'-')
SELECT @guidString

--Result: D9DD9BA5-535C-46C3-888E-5961388C089E

The safest way to store a GUID as BINARY and to be able to retrieve the binary value, unchanged, as a UNIQUEIDENTIFIER is to first convert the input string to UNIQUEIDENTIFIER and then to BINARY(16) before storing it in DB. To retrieve GUID we just need to convert the BINARY(16) back to UNIQUEIDENTIFIER and the GUID will be unchanged.
Uncomment –,uid_bin16 = CONVERT(UNIQUEIDENTIFIER,@guid) in the code and repeat the test above. This time, the conversion back to the GUID is correct.

DECLARE @guidString CHAR(36) = 'D9DD9BA5-535C-46C3-888E-5961388C089E';
DECLARE @guid_bin16 BINARY(16)

--convert to binary (and store in db)
SET @guid_bin16 = CONVERT(BINARY(16),CONVERT(UNIQUEIDENTIFIER,@guidString))
SELECT [BINARY] = @guidString;

--retrieve the bin value from db, convert back to GUID and sent to client
SELECT [GUID] = CONVERT(UNIQUEIDENTIFIER,@guid_bin16)

Finally, it’s worth familiarising with the nuances when comparing and sorting guids/uniqueidentifiers using i.e System.Guid.CompareTo() vs  SqlTypes.SqlGuid.CompareTo()  methods. This is well explained here.

How to generate GUID in SQL Server

In SQL Server 7 Microsoft expanded replication services capabilities with the Merge replication. Replication in general provides loosely consistent data that adds more flexibility around network availability. This is opposed to distributed transactions which use the two-phase commit protocol that guarantees data consistency but potentially keeps the system locked for a long time i.e case of failed and/or in-doubt transactions. Merge replication allows both, publisher and subscriber(s) to independently modify published articles i.e tables. The system synchronizes the changes between the participants. To uniquely and globally identify rows across the published articles, Microsoft implemented* a new datatype – UNIQUEIDENTIFIER along with a new column property – ROWGUID and a new function for generating random guids – NEWID().

Currently, SQL Server offers two system functions for generating GUIDs

  • NEWID()
  • NEWSEQUENTIALID() -available since SQL Server 2005

Both functions are based on Windows functions, UuidCreate() and UuidCreateSequential()  respectively.

*NOTE: Merge replication is not the only reason why Microsoft implemented UNIQUEIDENTIFIER and the support for GUIDs. The ability to manage globally unique values has become an important way of identifying data, objects, software applications, and applets in distributed systems (based on [Inside SQL Server 7.0, Microsoft Press,1999])

NEWID()

NEWID() is a system function that generates a random, globally unique GUID value of type UNIQUEIDENTIFIER. It’s based on UuidCreate() Windows OS  function. NEWID() is compliant with the RFC4122 standard.

NEWID() is not a foldable function therefore it’s executed separately for each inserted row. This is opposed to the Runtime constant scalar functions i.e GETDATE(), GETSYSDATE(), RAND(), the functions that are executed only once per query.

NOTE: The runtime constant scalar functions are evaluated only once, early in the query execution. The results are cached and used for all resulting rows.. This process is known as Constant Folding.

The following script demonstrates how the function works and how it’s different to a foldable function.

DROP TABLE IF EXISTS #testNEWID;
GO
CREATE TABLE #testNEWID(GuidId UNIQUEIDENTIFIER ROWGUIDCOL --column property
                            DEFAULT NEWID()
                        ,DateCreated DATETIME2
                            DEFAULT SYSDATETIME()
)
GO
 
--set based insert (sysdatetime folded - same value for all rows
--GuidId is different for each row)
INSERT INTO #testNEWID (GuidId,DateCreated)
    SELECT TOP (5) guidId = NEWID()
                  ,DateCreated =  SYSDATETIME()
    FROM sys.columns;
GO
SELECT * FROM #testNEWID
GO
--reset test
TRUNCATE TABLE #testNEWID
GO
--separate batch insert (dateCreated & GuidId different for each row)
INSERT INTO #testNEWID
    DEFAULT VALUES;
 GO 10
 SELECT * FROM #testNEWID
 GO


Figure 5, NEWID() , not foldable function

ROWGUIDCOL – is a column property( or a designator for a GUID column) similar to $IDENTITY. It is possible to have multiple UNIQUEIDENTIFIER columns per table, but only one can have the ROWGUIDCOL property.  The designator provides a generic way for the client code to retrieve the GUID column, usually with unique values, from a table.

--get the last generated GUID
 SELECT ROWGUIDCOL
 FROM #testNEWID
 WHERE DateCreateD = (SELECT MAX(DateCreated)
                      FROM #testNEWID)

NEWSEQUENTIALID()

NEWSEQUENTIALID() is a system function that creates a globally unique GUID that is greater than any GUID previously generated by this function on a particular computer and on a particular SQL Server instance on that computer. The output of the function is of type UNIQUEIDENTIFIER. NEWSEQUENTIALID() is based on Windows UuidCreateSequential() system function.

There are a few interesting quirks and features of this function.

  • NEWSEQUENTIALID() system function cannot be invoked independently i.e SELECT NEWSEQUENTIALID();  It can only be used as a default constraint of a column in a table and the column must be of a UNIQUEINDENTIFIER data type. Also, it is not possible to combine this function with other operators to form a complex scalar expression.
CREATE TABLE #t(seqGuid UNIQUEIDENTIFIER 
                    DEFAULT NEWSEQUENTIALID()
);
INSERT INTO #t
    DEFAULT_VALUES;
GO 2
  • All GUID values generated by NEWSEQUENTIALID() on the same computer are ever-increasing. From the SQL Server perspective, this means that all sequential guids generated across all instances, databases, and tables on the same server, are ever-increasing. The “shared counter” is due to the fact that the function is based on an OS function.
    Also, the ever-increasing sequence continues after the OS restart. To demonstrate the point run this code
    .
    Figure 6, NEWSEQUENTIALID() – shared counter
  • NEWSEQUENTIALID() is not guaranteed to be globally unique if initiated on a system with no network card. There is a possibility that another computer without an ethernet address generates the identical GUID. This is based on the underlying UuidCreateSequential()  windows function behavior.
  • NEWSEQUENTIALID() is not compliant with the  RFC4122 standard
  • The sequence of ever-increasing GUIDs will be interrupted after the OS system restart. The new sequence may start from a higher range – Figure 6, or it can start from a lower range. This is something that we cannot control.
  • UuidCreateSequential() outputs sequential guids with different byte order than NEWSEQUENTIALID(). The reason is that the function outputs UNIQUEIDENTIFIER data type that, as it was mentioned before, re-shuffles certain bytes – see Figure 1. This may create problems with sorting in situations when the client code generates sequential guids and stores it in the database as UNIQUEIDENTIFIER(s). This article explains how to avoid this problem.
  • Sequential guids generated, and re-shuffled as explained above, by the application that runs on the same server as the DB server will be in the same sort order as the sequential guids generated by newsequentialid() across all SQL Server instances on that server.

GUID & DB Design

Relational database systems such as SQL Server have a strong foundation in mathematics and in relational theory – hence the R in RDBMS, but they also have their own principles. For example, a Set is an unordered collection of unique, no-duplicated items. In relational theory, a relation is defined as a set of n-tuples. A tuple in mathematics is a finite sequence of elements. It has an order and allows duplicates. It was later decided that it would be more convenient, from the computer programming perspective, to use attribute names instead of ordering. The concept has changed but the name “tuple” remained. Back to RDBMS, a table is a visual representation of a relation and a row is similar to the concept of a tuple. These concepts are similar but not the same. E.g A table may contain duplicate values whereas a relation cannot have two identical tuples etc.

The consistency of an RDBMS is enforced by constraints that are declared as part of the DB schema e.g Primary Key constraint enforces the consistency of an entity. A tuple must have a minimal set of attributes that makes it unique within a relation. A row in a table does not need to be unique, and this is where, in my opinion, the big debate about natural keys vs surrogate keys begins.
We are designing databases with performances in mind. This includes deviations from the rigid rules of database design. The more we know about the internals of the RDBMS we use, the more we try to get the most out of it by adjusting our design and queries to it. Paradoxically, the declarative nature of SQL language teaches us to give instructions on what to do not how to do it. I guess, the truth is always somewhere in between :).
The above may explain the use of IDENTITY columns and primary keys that follow clustered index key guidelines: static, unique, narrow, and ever-increasing.

It is a common practice among developers to use GUIDs values as primary keys and/or clustered index keys. This choice ticks only one box from the PK/index key properties mentioned above – it’s unique and possibly static. Pretty much everything else is not ideal – GUID is not narrow(16bytes), not ever-increasing( unless generated by UuidCreateSequential()/NEWSEQUENTIALID() on the same PC). From a database design point of view, GUIDs are generally not good( they are meaningless, not intuitive surrogate keys) nor from the DB performances perspective(they cause fragmentation, unnecessary disk space consumption, possible query regression, etc). So why they are so “popular”?

GUID as Primary Key

The idea is to create a unique value e.g a new productId, on one of the application layers without performing a round-trip to the database in order to ask for a new Id. So, the generated GUID value becomes a PK value for the new product in e.g Products table. The PK may be implemented as a unique non-clustered index. The index is likely to be highly fragmented since the generated GUIDs are completely random. Also, keep in mind that the PK is highly likely to be a part of one or more referential integrity constraints(Foreign Keys) in tables like e.g ProductInventory, ProductListPriceHistory, etc. Moreover, the Foreign keys may be, at the same time, part of the composite PKs on the foreign tables – Figure 7. This design may have a negative effect on many tables and the database performance in general.

An alternative approach may be to define GUID column as an Alternate Keyenforced by a unique NCI and to use INT or BIGINT along with the IDENTITY property or a Sequencer as a surrogate PK. The key can be enforced by the unique clustered index. This way we can avoid excessive fragmentation and enforce referential integrity in a more optimal way – Figure 7,rowguid column.

Figure 7, GUID column as an Alternate Key – Adventure Works

*Alternate Key represents column(s) that uniquely identify rows in a table. A table can have more than one column or combinations of columns that can uniquely identify every row in that table. Only one choice can be set as the PK. All other options are called Alternate Keys.

GUID values can be created by SQL Server during the INSERT operations.  E.g Client code constructs a new product (product name, description, weight, color, etc..) and INSERTs the information(a new row) into the Products table. The NEWID() fn automatically creates and assigns a GUID value to the new row through a DEFAULT constraint on e.g ProductId column. Client code can also generate and supply GUID for the new product. The two methods can be mixed since the generated GUIDs are globally unique.

What I often see in different production environments is that the GUID values are used as PK values even if there is no need for the globally unique values.
Very few of them had better security in mind i.e  It is safer to expose a GUID than a numeric value when querying DB through a public API. The exposed numeric value in the URL may potentially be used to harm the system. E.g http://myApp/productid/88765 can suggest that there is productId =88764 etc., but with a GUID value, these guesses will not be possible – Figure 7, data access point.

In most DB designs, at least in the ones I’ve had an opportunity to work on,  GUIDs are used only because it was convenient from the application code design perspective.

When the application and the database become larger and more complex, these early decisions can cause performance problems. Usually, these problems are solved by, so-called quick fixes/wins. As the rule of thumb, the first “victim” of those “wins” is always data integrity e.g adding NOLOCK table hints everywhere, removing referential integrity(FK), replacing INNER JOINS with LEFT JOINS, etc. This inevitably leads to a new set of bugs that are not easy to detect and fix. This last paragraph may be too much, but this is what I am seeing in the industry.
Use GUIDs with caution and with the cost-benefit in mind 🙂

GUID as PK and the Clustered index key

Sometimes developers decide to use GUID values as a PK enforced by the clustered index. This means that the primary key column is at the same time the clustered index key. Data pages(leaf level) of a clustered index are logically ordered by the clustered index key values.
One of the reasons for this design may be the ability to easily merge data from different databases in the distributed database environment. The same idea can be implemented more efficiently using GUID as an alternative key as explained earlier.
More often, the design is inherited from SQL Server’s default behavior when the PK is created and automatically implemented as the clustered index key unless otherwise specified.

Using GUID as clustered index key leads to extensive page and index fragmentation. This is due to its randomness. E.g every time the client app inserts a new Product, a new row must be placed in a specific position i.e specific memory location on a data page. This is to maintain the logical order of the key values. The pages(nodes) are part of a doubly linked list data structure.  If there is not enough space on the designated page for the new row, the page must be split into two pages to make the necessary space for the new row. The physical position of the newly allocated page (8KB memory space) in the data file does not follow the order of the index key (it is not physically next to the original page). This is known as logical fragmentation. Splitting data pages introduces yet another type of fragmentation, physical fragmentation which defines the negative effect of the wasted space per page after the split. The increased number of “half full” pages along with the process of splitting the pages has a negative impact on query performance.

The “potential collateral damage” of the decision to use GUID as clustered index key is non-unique non-clustered indexes.
A non-clustered index that is built on a clustered index,  at the leaf level, contains row locators- the clustered index key values. These unique values are used as pointers to the clustered index structure and the actual rows – more information can be found here – The data Access Pattern section.
A non-unique NCI can have many duplicate index key values. Each of the key values is “coupled” with a unique pointer – in this case, a GUID value. Since GUID values are random the new rows can be inserted in any position within the range of the duplicated values. This introduces the fragmentation of the NCI. The more duplicated values, the more fragmentation.

The fragmentation can be “postponed” by using the FILLFACTOR setting. The setting instructs SQL Server what percentage of each data page should be used to store data. The “extra” free space per page can “delay” page splits. The FILLFACTOR value isn’t maintained when inserting new data. It is only effective when we create or rebuild an index. So once it’s full, and between the index rebuilds, the data page will be split again during the next insert.

Things are different with the sequential GUID. Sequential GUIDs are generated in ascending order. The “block” of compact, ever-increasing GUIDs is formed on a server and between the OS restarts. Sequential GUIDs created by the Client code on a different server will fall into a separate “block” of guids – see Figure 6. As mentioned before, sequential GUIDs can be created by SQL Server – NEWSEQUENTIALID() fn. initiated by a DEFAULT constraint and/or by the client code. The compact “blocks” of guids will reduce fragmentation.

Conclusion

In SQL Server, GUID is a 16byte binary value stored as UNIQUIEIDENTIFIER data type. NEWID() and NEWSEQUENTIALID() are the two system functions that can be used to create GUIDs in SQL server. The latter is not compliant with the RFC4122 standard. Both GUID types can be created by the client code using functions: UUidCreate(), UuidCreateSequential(). .NET sorts Guid values differently than SQL Server. UNIQUEIDENTIFIER data type re-shuffles first 8 bytes(the first three segments). .NET’s SqlGuid Struct represents a GUID to be stored or retrieved from a DB.
GUID values are often used as primary key/clustered index key values. The randomness of the GUID values introduces logical and physical data fragmentation, which then leads to query performance regression. Sequential GUIDs can reduce fragmentation but still need to be used carefully and with the cost-benefit approach in mind.

Thanks for reading.

Dean Mincic

Client Requests & Sql events in Sql Server

Client Requests & Sql events in Sql Server


Summary

Very often we, the sequel guys, use MS Profiler/Sql Server’s Extended events to intercept and analyse the traffic between client applications and our precious Sql Server. Sometimes we capture an hour/day of traffic, and sometimes we just want to capture a slow running stored procedure’s  input param values.  Also, we are involved in  the data layer code reviews in order to keep Client data requests as optimal as possible. This post is a sort of an internal notes/documentation about different types of requests  that can be sent from  Clients to Sql Server and the corresponding events visible in MS Profiler. The article also gives a high overview on the space between ADO.NET and Sql Server

MDAC – Microsoft Data Access Components

To start off, lets first take a brief look at the MDAC framework and where ADO.NET sits in the stack. MDAC technology allows applications to communicate with almost any data-store.
MDAC architecture implements three logical layers:

  • Programming interface layer (ADO, ADO.NET)
  • DB access layer (ODBC, OLEDB, .NET Managed providers)
  • Data store

All of the layers are accessible through the MDAC API. (+ MS Sql Server network library)*


Figure 1, MDAC architecture

NOTE: Microsoft Network Library(Net-lib) is also part of MDAC and sits between TDS(Tabular data stream) Formatter and  the network protocol. The DLLs are specifically used by SQL Server to communicate with the Clients on the network level.  It supports the network protocols: Named Pipes, TCP/IP, Shared Memory, VIA).

ADO.NET

ADO.NET provides a generic interface(set of classes that expose data access services)  to a number of different data-stores (SQL Server,CSV, Excel..) from .NET applications. It is a set of libraries included in MS .NET Framework.
The main objectives of ADO.NET is to;

  • connect to a data source
  • submit queries
  • process results
  • provide a powerful disconnected data structure to perform data operations offline.

There are two main  components of ADO.NET for accessing and manipulating data

    • .NET Framework data providers – a set of class libraries that understand how to interact with specific data platform e.g Sql Server, or with a common data layer i.e OLE DB. Data providers transport data between proprietary data platforms and the generic ADO.NET data layer. ADO.NET framework includes three providers
      • The MS Sql Server Provider – exposed through the System.Data.SqlClient namespace
      • The OLE DB provider – exposed through the System.Data.OleDb namespace.
      • The ODBC provider – exposed through the System.Data.Odbc namespace.
    • DataSet – a class(System.Data namespace) that provides a disconnected representation of result sets from the Data Source. The object provides a consistent relational programming model regardless of the data source. It includes objects like: DataTable, DataRow,  DataView, DataColumn, PrimaryKey ..etc

The key classes within each provider(each provider has its own specific class names for the same objects) include:

  • Command (SqlCommand),
  • Connection (SqlConnection),
  • DataAdapter (SqlDataAdapter)
  • DataReader (SqlDataReader).


Figure 2, ADO.NET components, high overview

TDS (Tabular data stream)

TDS is an application layer protocol used to transfer tsql command between client and server. The protocol includes facilities for;  authentication and identification, encrypted negotiation, issuing of Sql batch(language events) and stored procedures(RPC) calls, managing and distributing the outcome of atomic transactions. TDS describe the names, types and optional descriptions of the rows being returned.   It was developed by Sysbase in 1984 and adopted by Microsoft in 1990. MS-TDS(Microsoft’s version of the protocol) specification can be found here.

Figure 3,Communication flow in TDS

More on how the Client talks to Sql Server

Client communicates with Sql Server in a layered fashion. Each layer exchanges information with its neighboring layer. Sometimes the layers can be bypassed.  Only the Transport protocol layer enables physical communication between Client and Sql Server. For all other layers the exchange of information is done locally – in RAM memory and  between different dlls.
The layers are:

  1. Application Layer (C# code, etc)
  2. High level data access API (ADO,ADO.NET…)
  3. Client data interface(OLEDB,ODBC, db-lib ..)
  4. Client TDS Formatter. The tsql commands sent to Sql server has to be in TDS format.
  5. Client net protocol (Shared memory, TCP/IP sockets, net-lib)
  6. Client transport protocol (TCP/IP ..)
  7. Server transport protocol (TCP/IP ..)
  8. Server net protocol (Shared memory, TCP/IP sockets, net-lib)
  9. Server data interface
  10. SQL Server

NOTES:  App level(A) can bypass ADO.NET  level(B) if we directly access teh Client data interface (direct API calls to OLE-DB, ODBC..etc).
Client protocol(E) and Client transport protocol(F) can be bypassed if the Client and Sql Server communicate through the Shared memory protocol. In this scenario Client and SQL Server run on the same machine. The components exchange information through the RAM memory space. The network elements are not required.

Client data requests

The only way clients can interact with Sql Server is by sending requests that contain TSQL commands for the database engine. The requests are sent over MS-TDS protocol(Microsoft’s version of TDS).
There are three main forms of the requests

  • Batch requests
  • RPC (Remote Procedure Requests)
  • Bulk Load Request

In the following examples I’ll use ADO.NET components to present the different  types of Client requests. I’ll  use MS Profiler to capture and analyse Events which corresponds to the requests.

Events

When Sql Server receives a message from a client(through TDS), it can be thought of as an event occurring – see  Figure 3, First client sends a connection request(a login event) and gets back success or failure response. Once connected, client requests typically fall into one of two categories:

  • Language events
  • RPC events

Language Events

A language event is a tsql batch sent from the client to the server. From ADO.NET perspective a tsql batch* is a set of tsql commands defined in the CommandText property of the SqlCommand Class. The command can be executed using one of the commands; ExecuteReader, ExecuteScalar, ExecuteNonQuery and ExectueXMLReader.

The commandType enum property of the SqlCommand Class specifies how the command string is interpreted. The Field values can be:

  • CommandType.StoredProcedure (contains the name of a stored procedure or a scalar function)
  • CommandType.Text (contains an tsql batch – one or more tsql statements, no GO directives)
  • CommandType.TableDirect (contains a table name. Only supported by the .NET Framework Data Provider for OLD DB)

NOTE: A tsql batch can be defined as one or more tsql statements before the GO* directive(end of batch signal). This is specific to Microsoft SQL stack tools i.e SSMS, sqlcmd, osql. isql- Sybase It is not part of tSql language and may not be recognised by other similar tools. It is not permitted to use GO directive within CommandText property.
After a TDS request reaches SQL Server, the db  engine will create a TASK to handle the request. One task handles only one batch(a task may spawn a number of sub-tasks to support parallel executions). When ADO.NET “executes” a batch(one or more tsql commands), one TASK will handle the request. When SSMS executes i.e two batches(the same tsql command twice – GO 2, or two tsql statements divided by the GO terminator), Sql Server engine(SQLOS) allocates two separate TASKS for the job.

Non-parameterised batch request

This type of request does not have parameters. It can contain local variable declarations. The request is treated by Sql Server engine as an ad-hoc query.
Adhoc queries  can be cached and reused only if a subsequent batch matches exactly – exact textual match.  i.e 

SELECT * FROM dbo.Products WHERE [name]='bread';
--and
SELECT * FROM dbo.Products WHERE [name]='butter';

The query plans above are  compiled and cashed separately. This behavior can lead to the problem known as “the cache pollution”.
Sql server optimiser may decide to automatically parameterise an ad-hoc query, but it’s very conservative when making those decisions. The query template must be considered to be safe, meaning that the same query plan doesn’t change even if the actual parameter values change – the paramterisation must not degrade query performances. Also, query engine decides on the auto-parameter data type (smallint in the example below). The datatype may be changed due to change in the input parameter size(PurshaseOrderId = 3420 can fit in 2bytes-smallint, but PurshaseOrderId = 210 fits into 1byte-tinyint). In that case the optimiser cashes another copy of the plan that includes the new data type.
In addition, many query elements disallow automatic parameterisation, such as; JOIN, IN, INTO, DISTINCT, TOP.. just to name a few.

--this query may be autoparameterised since it uses the same plan regardless of the 
--predicate's value (PurchaseOrderID is a clustered primary key)
USE AdventureWorks2012
GO
SELECT PurchaseOrderID
      ,[Status]
      ,OrderDate
      ,SubTotal
FROM Purchasing.PurchaseOrderHeader
WHERE PurchaseOrderID = 3420; --843, --1 ..
--the plan template looks like
/*
(@1smallint)SELECT[PurchaseOrderID],[Status],[OrderDate],[SubTotal] 
            FROM [Purchasing].[PurchaseOrderHeader] 
            WHERE [PurchaseOrderID]=@1
*/

-- the query below uses a predicate with an uneven distribution of values. This means that the
-- plan shape depends on the predicate value and therefore the auto parameterisation would degrade 
-- the query's performance
USE AdventureWorks2012
GO
SELECT  ProductID
       ,PurchaseOrderID
       ,OrderQty
 FROM Purchasing.PurchaseOrderDetail 
 WHERE ProductID = 707 --319 --391

Let’s execute the second query through a simple console application. The predicate value will be passed as a part of the batch.  Pass 707 as a command line argument; (ProductId = 707)
NOTE: We can also use SSMS to run the batch.

using System;
using System.Data;
using System.Data.SqlClient;

namespace TestClientRequest_batch
{
    class Program
    {
        static void Main(string[] args)
        {
            int ProductId = 0;

            //get arguments
            if (args.Length != 0)
            {
                ProductId = Convert.ToInt16(args[0]);
            }

            using (SqlConnection conn = new SqlConnection())
            {
                conn.ConnectionString = @"Data Source=tcp:(local);
                                          Initial Catalog=AdventureWorks2012;
                                          Integrated Security=True;
                                          Application Name=TestClientRequests";

                using (SqlCommand cmd = new SqlCommand())
                {
                    cmd.Connection = conn;
                    cmd.CommandType = CommandType.Text;
                    cmd.CommandText = @"SELECT  ProductID
                                               ,PurchaseOrderID
                                               ,OrderQty
                                        FROM Purchasing.PurchaseOrderDetail 
                                        WHERE ProductID= " + ProductId.ToString();
                    conn.Open();

                    //use sql reader to execute the query 
                    using (SqlDataReader dr = cmd.ExecuteReader())
                    {

                        Console.WriteLine("ProductID PurchaseOrderID OrderQty\t");
                        while (dr.Read())
                        {
                            Console.WriteLine(dr[0].ToString() + "\t " + dr[1].ToString() + "\t\t\t" + dr[2].ToString());
                        }
                    }
                }
                Console.ReadKey(); //prevent auto close
            }
        }
    }
}

Figure 4, MS Profiler – tsql batch execution events 

The Client request is executed as a tsql batch. SQL:BatchStarting signals that the batch execution is starting.  SQL:StmtStarting event was fired for each statement in the batch(once in this example).

RPC Events

There are two basic ADO.NET client request types which invoke RPC(Remote Procedure Call) events

  • Parameterised batch request  – uses sys.sp_executesql Sql Server’s system extended stored proc
  • Stored procedure and  scalar user defined function call

NOTE: RPC(Remote Procedure Call) name came from Sybase and in context of the Client request represents a type of formatting of the query inside a TDS packet before it is sent to Sql Server. The requests are pre-formated by the driver/provider i.e OLEDB, ODBC,JDBC. ..etc
Windows RPC(Remote Procedure Call) is completely different technology which implements the interprocess communication(IPC) that allows applications to talk to each other locally or over the network. More on the RPC  here.

Parameterised batch request

Parameterised type of request does have parameter values passed to tSQL statements. Parameter input is treated as a literal value, not  as an executable code.
This type of request significantly improves the way Sql Server process the tsql commands. The sql batch  separates the “body” of the query from the data values managed by that query.
Lets change SqlCommand object from the previous example to use a parameterised tsql statement and see what happens.

...
using (SqlCommand cmd = new SqlCommand())
{
    cmd.Connection = conn;
    cmd.CommandType = CommandType.Text;
    cmd.CommandText = @"SELECT  ProductID
                               ,PurchaseOrderID
                               ,OrderQty
                        FROM Purchasing.PurchaseOrderDetail 
                        WHERE ProductID= @ProductId";

    SqlParameter ProdId = new SqlParameter("@ProductId", SqlDbType.Int);
    cmd.Parameters.Add(ProdId).Value = ProductId;

    conn.Open();
...

Figure 5, MS Profiler – parameterised batch RPC event

This time ADO.NET constructs the request as a RPC(Remote Procedure Call) type. The client sends an RPC message data stream(TDS protocol level) to Sql server. The message contains: Procedure name( or in this case the ProcID) and the Parameters.  ProcID identifies Sql Server’s internal sp to be executed. In the example above ProcID = 10. The id value maps sys.sp_exeuctesql extended stored procedure in Sql Server.

Figure 6, sys.sp_executesql extended stored procedure

NOTE: We cannot use SSMS to invoke RPC event types since it’s not possible to send Parameters as a part of the TDS binary stream.

The parameterised batch request results with a cached plan, just like a regular stored procedure call.

SELECT   cp.usecounts
        ,cp.cacheobjtype
        ,cp.objtype
        ,txt.[text]
        ,qp.query_plan
FROM    sys.dm_exec_cached_plans cp
    CROSS APPLY sys.dm_exec_sql_text(plan_handle) txt
    CROSS APPLY sys.dm_exec_query_plan(plan_handle) qp
WHERE   cacheobjtype = 'Compiled Plan'
    AND [text] NOT LIKE '%dm_exec_cached_plans%'
    AND [text] LIKE '%FROM Purchasing.PurchaseOrderDetail%'

Figure 7, parameterised batch request – cached plan

Stored procedure call

This type of RPC request is similar to the parameterised batch request. The main difference is that the commandType enum property of the SqlCommand Class is set to “StoredProcedure“, and the CommandText property value represents the name* of the stored procedure.

Quick note: In order to avoid cache misses/compile locks, use the two part naming convention when referencing  sql objects – dbo.uspGetPurchaseOrderDetails. More about sp caching can be found here.

Let’s encapsulate previous tsql batch in a stored proc…

USE AdventureWorks2012;
GO
CREATE PROCEDURE dbo.uspGetPurchaseOrderDetails
     @ProductId int 
AS 
BEGIN
     SET NOCOUNT ON;

     SELECT   ProductID
             ,PurchaseOrderID
             ,OrderQty
     FROM Purchasing.PurchaseOrderDetail 
     WHERE ProductID= @ProductId;  

     RETURN;
END

and then modify CommandType and CommandText properties as follows…

.....
using (SqlCommand cmd = new SqlCommand())
                {
                    cmd.Connection = conn;
                    cmd.CommandType = CommandType.StoredProcedure;
                    cmd.CommandText = @"dbo.uspGetPurchaseOrderDetails";

                    SqlParameter ProdId = new SqlParameter("@ProductId", SqlDbType.Int);
                    cmd.Parameters.Add(ProdId).Value = ProductId;

                    conn.Open();
...

Figure 8, MS Profiler – RPC Stored proc call

Using RPC: Starting and/or SP:Starting  events we can find out the value of the stored procedure’s input parameter(s).  It is interesting to notice the absence of the stored procedure’s ObjectId within the RPC call. The initial request doesn’t know whatID (object_id) is used by Sql Server to identify the stored procedure.

SELECT  o.[name]
        ,o.[object_id]
        ,spdef.[definition]
FROM sys.objects o 
    INNER JOIN  sys.all_sql_modules spdef
        ON spdef.[object_id] = o.[object_id]
WHERE o.[name] = N'uspGetPurchaseOrderDetails' 
/* --result
name                        ObjectId      Definition
-----------------------------------------------------
uspGetPurchaseOrderDetails  1364199910    CREATE PROC..

*/

Not directly related to the topic but nevertheless interesting observation is the SP:CacheMiss event. The event describes the failed cache lookup for the stored procedure on the very first run as the plan was not generated and stored in the cache.

Now, let’s execute a  stored procedure that has a nested sp.

--------------------- top level stored proc --------------
USE AdventureWorks2012
GO

ALTER PROCEDURE dbo.uspGetPurchaseOrderDetails
        @ProductId int 
       ,@ProdName dbo.[Name] OUTPUT --NVARCHAR(50)
AS 
BEGIN
    SET NOCOUNT ON;
      
    SELECT   ProductID
            ,PurchaseOrderID
            ,OrderQty
    FROM Purchasing.PurchaseOrderDetail 
    WHERE ProductID= @ProductId

    --get the product name
    EXEC dbo.uspGetProductName
            @ProductId
           ,@ProductName = @ProdName OUTPUT      
    RETURN;
END
go 
--------------------- nested stored proc --------------

CREATE  PROCEDURE dbo.uspGetProductName
      @ProductId INT
     ,@ProductName dbo.[Name] OUTPUT --NVARCHAR(50)
AS
BEGIN
    SET NOCOUNT ON;

    IF @ProductId IS NULL
        RETURN;

    SELECT @ProductName = [Name]
    FROM Production.Product 
    WHERE ProductID = @ProductId;
    
    RETURN;
END 
-----------------------------------------------------

… add an output parameter to the code ..

...
using (SqlCommand cmd = new SqlCommand())
{
          cmd.Connection = conn;
          cmd.CommandType = CommandType.StoredProcedure;
          cmd.CommandText = @"dbo.uspGetPurchaseOrderDetails";

          SqlParameter ProdId = new SqlParameter("@ProductId", SqlDbType.Int);
          SqlParameter ProdName = new SqlParameter("@ProdName", SqlDbType.NVarChar,-1); // -1 varchar max
               
          ProdName.Direction = ParameterDirection.Output;
                    
          cmd.Parameters.Add(ProdId).Value = ProductId;
          cmd.Parameters.Add(ProdName);

          conn.Open();

          //use sql reader to execute the query 
          using (SqlDataReader dr = cmd.ExecuteReader())
          {
               Console.WriteLine("ProductID PurchaseOrderID OrderQty\t");
               while (dr.Read())
               {
                 Console.WriteLine(dr[0].ToString() + "\t " + dr[1].ToString() + "\t\t" + dr[2].ToString());
               }
                        
           }
//data reader object must be closed before retrieving the output parameter
Console.WriteLine(ProdName.Value.ToString());
}
....

From the Profiler’s trace below we can see that the only one RPC request was initiated for the top level stored procedure.

Figure 9, RPC request and nested stored procedures

From the trace above we can observe a few interesting things

  • Only the first, top level stored procedure call is initiated through the RPC event. This is the request initiated by ADO.NET.
  • The nested stored procedure call was captured through SP:Starting and SP:StmtStarting events. The input parameter values are not visible(more about this later).
  • RPC Output parameter* event is triggered AFTER the RPC:Completed event. This explains why we need to close Data reader object before retrieving the output parameter(see the c# code above)

NOTE: When the RPC is invoked, some or all of its parameters are designated as output parameters. All output parameters will have values returned from the server. For each output parameter, there is a corresponding return value, sent via the RETURNVALUE token. More about this can be found here.

How to capture nested stored procedure parameter values

In the example below, the nested stored procedure call was captured through the SP:Starting and SP:StmtStarting events. it is not possible to capture nested procedure(s) input parameter values only by examining this two events. Sql Profiler is not a debugging tool and it shows only what has been executed. This means that it can show only the parameters values that are initially passed from the data layer, not during the procedure execution.
So how can we capture a nested stored procedure input parameter values?  Well, we can get the values if

  • The top level stored proc parameter(which we can get through the RPC event) value is passed unchanged to the nested sp.
  • The nested stored proc parameters are passed as literals i.e .. EXECUTE dbo.MyStoredProc @name=’John Doe’, @Age=42 ..

However, there is another, more interesting way to capture the parameter values.

We can include the Showplan XML Statistic Profile  event class to our trace. The event class triggers when Sql Server executes an SQL statement and shows complete, compile-time data, including the parameter values we are interested in.

Figure 10, Showplan XML Statistic Profile event class

We can export the plan(XML format) and then search for the ParameterList element.

<ColumnReference  Column="@ProductName" 
                  ParameterDataType="nvarchar(50)" 
                  ParameterCompiledValue="NULL" 
                  ParameterRuntimeValue="N'HL Crankarm'">
</ColumnReference>
<ColumnReference  Column="@ProductId" 
                  ParameterDataType="int" 
                  ParameterCompiledValue="(319)" 
                  ParameterRuntimeValue="(319)">
</ColumnReference>

Now we can analyse the parameters’ runtime/compiled values and the hunt for the parameter sniffing issues can begin 🙂

NOTE: Showplan XML Statistic Profile is an expensive event to track. It may be a good idea to use the server side tracking instead or the Profiler UI. ..

Conclusion

In my opinion,  It is beneficial for the Sql database developers to be familiar with the database access patterns from the client end. Knowing how different components i.e ADO.NET, OLEDB, ODBC, TDS etc. fit together can help when making design decisions. The way Client constructs its data requests dictates the database engine behavior i.e Sql Server treats non-parametersied queries as ad-hoc queries whereas it caches the execution plans for the parameterised queries the same way as it does for stored procedures. Understanding how different events( captured by Profiler or Extended events) relate to the client requests can help us find and understand the root causes  for the slow running queries, excessive locking etc. that may not be on the Sql Server side but on the data layer.

Thanks for reading

Dean Mincic