sqlmlutils/Python
Sean Leonard b888ff1e87
Migrate Pipeline to Self-Hosted Runners and Remediate Tests (#104)
* GH runner pool support
with fixed tests.

* skip slow python test
remove legacy TravisCI skip tags.

* Fix R coercion to logical 1 error

* add sql MI

* update spacing.

* reduce code duplication and remove erroneous space
that caused issues

* remove conditional

* consolidate code and add conditional

* remove continue on error.

* Concise step title.

* fix R version
2022-08-29 11:32:21 -07:00
..
samples Add Travis CI for Python (#74) 2020-08-07 16:17:27 -07:00
sqlmlutils Update Python Support for Python 3.10.5 with Integration Test Fixes (#101) 2022-08-15 17:05:37 -07:00
tests Migrate Pipeline to Self-Hosted Runners and Remediate Tests (#104) 2022-08-29 11:32:21 -07:00
LICENSE.txt Change pymssql -> pyODBC (#58) 2020-03-24 10:10:46 -07:00
MANIFEST Add and modify files for PyPI release 2019-10-15 12:07:48 -07:00
MANIFEST.in Add and modify files for PyPI release 2019-10-15 12:07:48 -07:00
README.md Fix tests and remove binaries (#86) 2021-03-15 20:32:30 -07:00
buildandinstall.cmd Fix tests and remove binaries (#86) 2021-03-15 20:32:30 -07:00
buildandinstall.sh Add Travis CI for Python (#74) 2020-08-07 16:17:27 -07:00
requirements.txt Set max wheel version for pep425tags (#83) 2020-10-13 10:49:13 -07:00
setup.cfg Add and modify files for PyPI release 2019-10-15 12:07:48 -07:00
setup.py Update Python Support for Python 3.10.5 with Integration Test Fixes (#101) 2022-08-15 17:05:37 -07:00

README.md

sqlmlutils

sqlmlutils is a python package to help execute Python code on a SQL Server machine. It is built to work with ML Services for SQL Server.

Installation

To install from PyPI, run:

pip install sqlmlutils

To install from file, download the latest release from https://github.com/microsoft/sqlmlutils/releases:

pip install sqlmlutils-1.1.0.zip

If you are developing on your own branch and want to rebuild and install the package, you can use the buildandinstall.cmd script that is included.

Getting started

Shown below are the important functions sqlmlutils provides:

SQLPythonExecutor functions:
  execute_function_in_sql         # Execute a python function inside the SQL database
  execute_script_in_sql           # Execute a python script inside the SQL database
  execute_sql_query               # Execute a sql query in the database and return the resultant table

  create_sproc_from_function      # Create a stored procedure based on a Python function inside the SQL database
  create_sproc_from_script        # Create a stored procedure based on a Python script inside the SQL database
  check_sproc                     # Check whether a stored procedure exists in the SQL database
  drop_sproc                      # Drop a stored procedure from the SQL database
  execute_sproc                   # Execute a stored procedure in the SQL database 

SQLPackageManager functions:
  install                         # Install a Python package on the SQL database
  uninstall                       # Remove a Python package from the SQL database
  list                            # Enumerate packages that are installed on the SQL database
  _get_packages_by_user           # Enumerate external libraries installed by specific user in specific scope

Examples

Execute in SQL

Execute a python function in database
import sqlmlutils

def foo():
    return "bar"

# For Linux SQL Server, you must specify the ODBC Driver and the username/password because there is no Trusted_Connection/Implied Authentication support yet.
# connection = sqlmlutils.ConnectionInfo(driver="ODBC Driver 13 for SQL Server", server="localhost", database="master", uid="username", pwd="password")

connection = sqlmlutils.ConnectionInfo(server="localhost", database="master")

sqlpy = sqlmlutils.SQLPythonExecutor(connection)
result = sqlpy.execute_function_in_sql(foo)
assert result == "bar"
Generate a scatter plot without the data leaving the machine
import sqlmlutils
from PIL import Image


def scatter_plot(input_df, x_col, y_col):
    import matplotlib.pyplot as plt
    import io

    title = x_col + " vs. " + y_col

    plt.scatter(input_df[x_col], input_df[y_col])
    plt.xlabel(x_col)
    plt.ylabel(y_col)
    plt.title(title)

    # Save scatter plot image as a png
    buf = io.BytesIO()
    plt.savefig(buf, format="png")
    buf.seek(0)

    # Returns the bytes of the png to the client
    return buf

# For Linux SQL Server, you must specify the ODBC Driver and the username/password because there is no Trusted_Connection/Implied Authentication support yet.
# connection = sqlmlutils.ConnectionInfo(driver="ODBC Driver 13 for SQL Server", server="localhost", database="AirlineTestDB", uid="username", pwd="password")

connection = sqlmlutils.ConnectionInfo(server="localhost", database="AirlineTestDB")

sqlpy = sqlmlutils.SQLPythonExecutor(connection)

sql_query = "select top 100 * from airline5000"
plot_data = sqlpy.execute_function_in_sql(func=scatter_plot, input_data_query=sql_query,
                                          x_col="ArrDelay", y_col="CRSDepTime")
im = Image.open(plot_data)
im.show()
Perform linear regression on data stored in SQL Server without the data leaving the machine

You can use the AirlineTestDB (supplied as a .bak file above) to run these examples.

import sqlmlutils

def linear_regression(input_df, x_col, y_col):
    from sklearn import linear_model

    X = input_df[[x_col]]
    y = input_df[y_col]

    lr = linear_model.LinearRegression()
    lr.fit(X, y)

    return lr

# For Linux SQL Server, you must specify the ODBC Driver and the username/password because there is no Trusted_Connection/Implied Authentication support yet.
# connection = sqlmlutils.ConnectionInfo(driver="ODBC Driver 13 for SQL Server", server="localhost", database="AirlineTestDB", uid="username", pwd="password")

connection = sqlmlutils.ConnectionInfo(server="localhost", database="AirlineTestDB")

sqlpy = sqlmlutils.SQLPythonExecutor(connection)
sql_query = "select top 1000 CRSDepTime, CRSArrTime from airline5000"
regression_model = sqlpy.execute_function_in_sql(linear_regression, input_data_query=sql_query,
                                                 x_col="CRSDepTime", y_col="CRSArrTime")
print(regression_model)
print(regression_model.coef_)
Execute a SQL Query from Python
import sqlmlutils
import pytest

# For Linux SQL Server, you must specify the ODBC Driver and the username/password because there is no Trusted_Connection/Implied Authentication support yet.
# connection = sqlmlutils.ConnectionInfo(driver="ODBC Driver 13 for SQL Server", server="localhost", database="AirlineTestDB", uid="username", pwd="password")

connection = sqlmlutils.ConnectionInfo(server="localhost", database="AirlineTestDB")

sqlpy = sqlmlutils.SQLPythonExecutor(connection)
sql_query = "select top 10 * from airline5000"
data_table = sqlpy.execute_sql_query(sql_query)
assert len(data_table.columns) == 30
assert len(data_table) == 10

Stored Procedure

Create and call a T-SQL stored procedure based on a Python function
import sqlmlutils
import pytest

def principal_components(input_table: str, output_table: str):
    import sqlalchemy
    from urllib import parse
    import pandas as pd
    from sklearn.decomposition import PCA

    # Internal ODBC connection string used by process executing inside SQL Server
    connection_string = "Driver=SQL Server;Server=localhost;Database=AirlineTestDB;Trusted_Connection=Yes;"
    engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect={}".format(parse.quote_plus(connection_string)))

    input_df = pd.read_sql("select top 200 ArrDelay, CRSDepTime from {}".format(input_table), engine).dropna()  
        

    pca = PCA(n_components=2)
    components = pca.fit_transform(input_df)

    output_df = pd.DataFrame(components)
    output_df.to_sql(output_table, engine, if_exists="replace")


# For Linux SQL Server, you must specify the ODBC Driver and the username/password because there is no Trusted_Connection/Implied Authentication support yet.
# connection = sqlmlutils.ConnectionInfo(driver="ODBC Driver 13 for SQL Server", server="localhost", database="AirlineTestDB", uid="username", pwd="password")

connection = sqlmlutils.ConnectionInfo(server="localhost", database="AirlineTestDB")

input_table = "airline5000"
output_table = "AirlineDemoPrincipalComponents"

sp_name = "SavePrincipalComponents"

sqlpy = sqlmlutils.SQLPythonExecutor(connection)

if sqlpy.check_sproc(sp_name):
    sqlpy.drop_sproc(sp_name)

sqlpy.create_sproc_from_function(sp_name, principal_components)

# You can check the stored procedure exists in the db with this:
assert sqlpy.check_sproc(sp_name)

sqlpy.execute_sproc(sp_name, input_table=input_table, output_table=output_table)

sqlpy.drop_sproc(sp_name)
assert not sqlpy.check_sproc(sp_name)

Package Management

Python package management with sqlmlutils is supported in SQL Server 2019 CTP 2.4 and later.
Install and remove packages from SQL Server
import sqlmlutils

# For Linux SQL Server, you must specify the ODBC Driver and the username/password because there is no Trusted_Connection/Implied Authentication support yet.
# connection = sqlmlutils.ConnectionInfo(driver="ODBC Driver 13 for SQL Server", server="localhost", database="AirlineTestDB", uid="username", pwd="password")

connection = sqlmlutils.ConnectionInfo(server="localhost", database="AirlineTestDB")
pkgmanager = sqlmlutils.SQLPackageManager(connection)
pkgmanager.install("astor")

def import_astor():
    import astor

# import the astor package to make sure it installed properly
sqlpy = sqlmlutils.SQLPythonExecutor(connection)
val = sqlpy.execute_function_in_sql(import_astor)

pkgmanager.uninstall("astor")

Notes for Developers

Running the tests

  1. Make sure a SQL Server with an updated ML Services Python is running on localhost.
  2. Restore the AirlineTestDB from the .bak file in this repo
  3. Make sure Trusted (Windows) authentication works for connecting to the database
  4. Setup a user with db_owner role (and not server admin) with uid: "AirlineUser" and password "FakeT3sterPwd!"

Notable TODOs and open issues

  1. Testing from a Linux client has not been performed.
  2. The way we get dependencies of a package to install is sort of hacky (parsing pip output)