Tag Archives: Machine learning services

Set up Python for SQL Server


Summary

In SQL Server 2016, Microsoft extended SQL Server engine capabilities to be able to execute an external code written in R language. Sql Server 2017 added Python language to the mix and 2019 integrated Java. The new functionality is a part of Sql Server’s Machine Learning Services.
This blog explains how to set up SQL Server 2019 so we can use ML services and
the Python programming language.

Prepare the environment for Python

The first step is to add Machine Learning Service Extensions / Python to our SQL Server environment.

Figure 1, ML services

The next step is to allow Sql Server to execute external scripts by turning on the external scripts enabled user option on the server level

sp_configure 
     @configname = 'show advanced options'
    ,@configvalue = 1
GO
RECONFIGURE WITH OVERRIDE;
GO
 
sp_configure 
     @configname = 'external scripts enabled'
    ,@configvalue = 1            
GO
RECONFIGURE WITH OVERRIDE;
GO

At this stage, we should be ready to go.

Let’s explore how the SQL Server engine relates to the above-mentioned ML(Machine Learning)  services. The two components are “bound” by the Extensibility framework.

Launchpad

Launchpad service, one of several services in Sql Server, has a job to manage and execute external scripts – the scripts written in Python language. It instantiates an external script session as a separate, satellite process.


Figure 2, Extensibility framework

The purpose of the extensibility framework is to provide an interface between the SQL Server engine and Python (and other languages). Figure 2 shows how different components interact and demonstrates security concepts.
From the super high-level point of view, the process goes like this:

  1. Our application calls system stored proc. sys.sp_execute_external_script. The proc is designed to encapsulate code written in one of the external languages, Python in this case. The stored proc executes within the SQL Server engine – Sqlservr.exe. The engine knows that the code it needs to run contains a program written in Python and therefore starts the Launchpad.exe process. Sqlservr.exe uses Named Pipes protocol(often used for the communication between processes on the same machine) to pass the code to be executed.
  2. Launchpad service now call initiates the DLL specific to the language that needs to be run. In this case, it’s PythonLauncher.dll.
  3. PythonLauncher.dll now runs Python.exe/dll which compiles the program to bytecode and executes it.
  4. Python.exe then talks to yet another program, BxlServer.exe. Bxl stands for Binary Exchange Language. This program coordinates with the Python runtime to manage exchanges of data and storage of the working results.
  5. sqlsatellite.dll communicates with SQL Server engine over TCP/IP protocol and ODBC interface. It retrieves input data sets*, sends back result sets and Python’s standard I/O streams – stdout and stderr.
  6. BxlServer.exe communicates with Python.exe passing messages back and forth from Sql Server through sqlsatellite.dll.
  7. Sql Server gets the result set, closes the related tasks and processes, and sends the results back to the client e.g SSMS

Security

Security architecture for the extensibility framework can be observed in Figure 2.

  • The client uses Windows/SQL login to establish a connection with SQL Server and execute Python code encapsulated in sys.sp_execute_external_script stored proc.
  • Launchpad service runs as a separate process under its own account – NT Service\MSSQLLaunchpad$<instanceName>. The Windows account has all the necessary permissions to run external scripts and is a member of the SQLRUserGroup2019 Windows group.
    How to interact with the file system(Import/Export csv, xlsx, ..) or to access resources outside the Server env will be explained later in this blog
  • Launchpad, when invokes the appropriate runtime environment, in our case, the PythonLauncher.dll, initiates an AppContainer and assigns the process to it. By default, the Launcher service creates 21 containers AppContainer00AppContainer20. The number of containers can be changed through Sql Server Configuration Manager/Sql Server Launchpad/Advanced/Security Contexts Count property.
    AppContainer is an internal mechanism that provides an isolated execution environment for the applications assigned to it. An application running in an AppContainer can only access resources specifically granted to it. This approach prevents the application from influencing, or being influenced by, other application processes
  • Launchpad maps the User(used to make a connection) to an AppContainer that now defines credentials unique to user/application pair. This also means that the application, in this case, launchpad.exe cannot impersonate db Users and gain more access than it should have.

So, in a nutshell, a Client connects to Sql Server using Sql/Win login mapped to db user. The user runs sys.sp_execute_external_script with the Python code.  Sql Server runs launchpad.exe which runs under it’s own win account. The launcher then initiate one of twenty AppContainers

Firewall rules – how to access data from the Internet

As mentioned before, there are 20 separate application containers – AppContainer00 – AppContainer20 These objects restrict access to the applications assigned to them, in this case launchpad.exe.

Figure 3, Windows Firewall outbound rules related to the AppContainers

If we disable these rules, our Python code, encapsulated in sys.sp_execute_external_script, will be able to communicate with the “outside world” and e.g get up-to-date FX (Foreign Exchange) rates, BitCoin price for different currencies etc. – see. forex-python module. We can then e.g describe the result and send back to the Client/top level procedure. It is also possible to “join” the data to a tSQL query result, within the same query batch, and then to send it back to the caller – amazing stuff 🙂

EXEC sys.sp_execute_external_script
        @language = N'Python' --specify language to use
        ,@script   = N'
from pandas_datareader import data

start_date = "2021-11-20"
end_date = "2021-12-01"

UsToAus = data.DataReader("DEXUSAL", "fred",start=start_date, end=end_date).reset_index()

print(UsToAus)
'       
       ,@output_data_1_name =N'UsToAus'
WITH RESULT SETS  (([Date] DATE,[US to AUS]  FLOAT) )


In the example above, the code connects to FRED (Federal Reserve Bank of St. Louis, Missouri, USA) and pulls the exchange rates between the USD and  AUD for a given timeframe. 

File system accessibility – how to import/export result

Another cool thing that you can do with Python is to easily export results to disk  e.g save a query result (report) in a form of an Excel file, to a location where it can be picked up(pulled) by the authorised users. It is also possible to read e.g a lookup table – extended customer information, states, employee information etc,  from disk and join it with the tSQL query result-set, again within the same sp call.
To enable interaction with the file system we need to allow “Sql Server’s” Python to access specific folders. By default, Python can only access its working directory and its subdirectories, in this case c:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019\PYTHON_SERVICES\. As mentioned before, the Launchpad.exe service runs under its own account. It starts python.exe runtime environment in a process, again, under the  same account. Finally it instantiates a  new AppContainer object to contain its activities. This container now acts as a windows app. with very limited access. Windows apps can access only those resources (files, folders, registry keys, and DCOM interfaces) to which they have been explicitly granted access (“ALL APPLICATION PACKAGES”).

Figure 4, ALL APPLICATION PACKAGES access rights

Now, we can use Pivoting with Python script from another blog to export results to disk.

Figure 5, export results to disk

Memory allocations

It is important to understand how ML Services fits into Sql Server memory allocation scheme to be able to access the possible impact it can have on the database server performance.

As we can see from Figure 2,  Python code runs outside of the processes managed by Sql Server. This means that it doesn’t use memory allocated to Sql Server.

The query below shows that, by default, the maximum memory that ML services can use, outside the memory allocated to Sql Server, is 20 % e.g Total RAM 256GB, Sql Server allocated mem = 200GB, ML can use up to 56 * 0.2 = ~11GB

SELECT * 
FROM sys.resource_governor_external_resource_pools
WHERE [name] = N'default'


Figure 6, External resource pools

We can increase the max_memory_percent to e.g 50% by altering  the “default” external resource pool

ALTER EXTERNAL RESOURCE POOL "default" 
   WITH (max_memory_percent = 50);
GO
ALTER RESOURCE GOVERNOR 
RECONFIGURE;
GO

File locations and configurations

  • launchpad.exe: ..\MSSQL15.SQL2019\MSSQL\Binn\Launchpad.exe
  • pythonlauncher.log ..\MSSQL15.SQL2019\MSSQL\Log\ExtensibilityLog\pythonlauncher.log 
  • Config file (Python) :  ..\MSSQL15.SQL2019\MSSQL\Binn\pythonlauncher.confg
    Python home directory, encoding standard, env patrh, working directory – this is where python stores intermediate results during the calculations, etc.

Python version

SQL Server 2019 supports Python 3.7.1. The package comes with the Anaconda 4.6.8 distribution. The language runtimes and various packages e.g  pandas, are installed under the root directory:

c:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019\PYTHON_SERVICES

To check Python version use command prompt:

c:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019\PYTHON_SERVICES>python --version

To check Anaconda distribution version:

c:\Program Files\Microsoft SQL Server\MSSQL15.SQL2019\PYTHON_SERVICES\Scripts>conda.exe --version

But wait, there is more 🙂

With SQL Server 2019 CU3+ (latest CU) we can use the Language Extension feature to “add” different versions of Python to the mix. In fact, we can add any language extension.
For example, I use Python – version 3.9.7 that is installed as a part of the Anaconda 4.10.3 distribution, for work, not related to Sql Server platform. Wouldn’t it  be cool if it was possible to use that installation in SQL Server?

Note:  There are two packages developed by Microsoft, specifically for out-of-the-box Python installation: revoscalepy and microsoftml. These packages are not available in the distribution mentioned above . 

Language extensions is a set of dll’s written in C++ that act as a bridge between Sql Server and the external runtime. This technology allows us to use relational data(query result sets) in the external code. In a nutshell, the current Python language extension program is an open source code available here. However, the code currently supports only versions 3.7.x. So, to be able to use Python 3.9 as the external code, we need to build our own “bridge”.

Here is an excellent article on how to do it. It looks complex , but once you start you’ll not be able to stop – it’s just so much fun.

Figure 7, Python 3.9 Language extension for SQL Server

In the next blog, I’ll present various dev tools we can use to construct our Python code in Sql Server. 🙂

Thanks for reading.

Dean Mincic

Pivoting with Python in Sql Server


Summary

In SQL Server 2016, Microsoft introduced a new system stored procedure sys.sp_execute_external_script. The idea was to extend the capabilities of SQL Server engine to be able to execute external code i.e code written in R, Java, or Python. SQL 2017 supports R and Python. The new functionality is a part of Sql Server’s Machine Learning Services. The purpose of this blog is to “tickle devs imagination” on how to use Python for Pivoting and more..

From a super high-level point of view, the process goes like this: we call sys.sp_execute_external_script indicating that we want to use e.g Python language, and pass in our python code. We also define a data set(an Sql query) that the code will use as an input data source. The code performs analytical tasks over the input data source and returns a result-set in the form of a pandas DataFrame. We use python’s methods to “tweak” the data frame to match the final shape of the output sql dataset. Optionally, we describe the output(define column names and their data types) by using WITH RESULT SET stored procedure option.

So, I thought it would be cool to try to do pivoting/multi-pivoting using Python code. What I discovered are the amazing things you can do with Python in SQL Server.

NOTE: More information about how Sql Server engine executes external code can be found here.

Prepare the environment for Python

First thing, we need to install Sql Server Machine Learning Services and Language Extensions.
Figure 8, Sql Server ML Services

Make sure that the SQL Server Launchpad service is up and running.
The next step is to allow Sql Server to execute the external scripts and we are good to go.

sp_configure 
     @configname = 'show advanced options'
    ,@configvalue = 1
GO
RECONFIGURE WITH OVERRIDE;
GO

sp_configure 
     @configname = 'external scripts enabled'
    ,@configvalue = 1            
GO
RECONFIGURE WITH OVERRIDE;
GO

Python’s Pivot

Let us present the sum of freight(Shipping cost) values per order year for each country that ordered our products, but this time using Python instead tSQL’s PIVOT operator – you can find the tSQL example here.
Set up  dbo.Orders_TestPivot  test table and run the python script.


Figure 1, Python’s pivot result

Note: During the testing, I found it difficult to use only SSMS to write Python code (similar to working with dynamic sequel) with no debugger, IntelliSense, etc. I used the Visual Studio Code tool with Python 3.8. Here is the code I used for testing. 

The system stored procedure sp_execute_external_script is similar to sp_executesql, but along with the code to be executed, parameter definitions, and parameter  values, we also pass the following values(from our pivot script):

@input_data_1 – There are a couple of interesting things with the query used as a base for Python Pivoting.

  1. Python does define Pivot grouping element, therefore, we don’t need a table expression that implicitly defines Pivot elements where the grouping element is everything else but spreading and aggregate element – see pivot operation directly on a table.
  2. The query result-set(in our case named df) is internally transformed to DataFrame object – a table-like structure defined within pandas. Pandas is an open-source data analysis library build on top of the Python language. DataFrame does not support all Sql Server data types e.g MONEY and DECIMAL are not supported and that’s why the two columns Freight and OrderValue need to be converted to FLOAT.
    Supported types : Supported types: bit, tinyint, smallint, int, bigint, uniqueidentifier, real, float, char, varchar, nchar, nvarchar, varbinary, date, datetime, smalldatetime.

How it works

As mentioned before, after passing the input query, the query gets executed and the resultset, natively in a form of a table expression, gets transformed into a DataFrame object named df. The code below runs the pivot_table method(far more powerful than tSQL’s PIVOT operator 🙂 ) on the DataFrame object. The final result is then stored in the dfpivot_out variable of type DataFrame, previously defined as an output dataset name.

       ....    
        ,@script   = N'

dfpivot_out = df.pivot_table(index = ["Shipcountry"], \
                              columns =["OrderYear"], \
                              values = ["Freight"], \
                              aggfunc={"Freight":sum}, \
                              fill_value=None).reset_index(level="Shipcountry")

## dfpivot_out =dfpivot_out.reset_index(level="Shipcountry") ##we can reshape the data frame in a separate statement.
print(dfpivot_out)
'

....

Note: Python code above starts with no indentation 🙂

Pivot_table

In our example, we are passing four parameters to the pivot_table method.

index – This parameter explicitly defines a list of grouping element(s). Due to the difference between DataFrame and sequel’s table expression structures, the Index column will not be visible in the final output (see reset_index method)
columns – defines a list of spreading elements.
values – defines a list of columns whose values will be aggregated.
aggfunc – defines a list of pairs (value column: aggregate function). Basically, we can apply different aggregate functions on different aggregate columns defined in the values list.

Before explaining the reset_index() method, remove the method from the code and comment out WITH RESULT SET option.

...
values = ["Freight"], \
aggfunc={"Freight":sum}) ##.reset_index(level="Shipcountry")
...
        ,@output_data_1_name =N'dfpivot_out'
    WITH --RESULT SETS  ((Shipcountry NVARCHAR(15),[2018] MONEY, [2019]  MONEY,[2020] MONEY));

After running the code, have a look at the result of the print statement under the Messages pane in SSMS. This is how DataFrame graphically looks like

Figure 2, panda’s DataFrame shape

The index values are not presented as a DataFrame column. There are many ways to manipulate the DataFrame output to match the sql result-set shape. One way is to use reset_index(level=”Shipcountry”) method on the DataFrame. This will “convert” the index into a column. The new, default index will be created with the unique, ever-increasing integer values, starting from 0.
Run the code in its original form and compare the print output.

Multi aggregate Pivot with Python

This time we want to calculate the total Freight and the average order value in different countries per year. Again, compare the tSQL approach with the Python code.

Compare tSQL example with the Python code. (Just a few “tweaks” to the code above and there you go 🙂 )

...
        ,@script   = N'
import numpy as np
dfpivot_out = df.pivot_table(index = ["Shipcountry"], \
                              columns =["OrderYear"], \
                              values = ["Freight","OrderValue"], \
                              aggfunc={"Freight":sum,"OrderValue":np.average}).reset_index(level="Shipcountry")
print(dfpivot_out)
' 
...
   WITH RESULT SETS  ((Shipcountry  varchar(2000),[2018] FLOAT, [2019] FLOAT,[2020] FLOAT
                                              ,[Avg Order Value(2018)] FLOAT
                                              ,[Avg Order Value(2019)] FLOAT
                                              ,[Avg Order Value(2020)] FLOAT))

Note: For this example, I’ve imported another python library, – numpy, to be able to use its average aggregate function.

… and here is another one.
Find total Freight and average order value in different countries and different regions per year. The code can be found here.

Conclusion

Playing with pivot operations is just a tip of the iceberg.  There are many different functions available in Python that we can use in SQL Server for all sorts of data analysis.  The good thing is that the data does not need to be moved away from SQL Server. However, It is still important to completely understand how python code executes in SQL environment i.e performance impact on the existing workload etc. Nevertheless, I found Python very intuitive and easy to work with, so, sorry c#, but I seem to have found a new second-best friend  🙂

Thanks for reading.

Dean Mincic