Accelerate real time big data analytics with Spark to SQL DB connector

The Spark to SQL DB connector enables SQL databases, including Azure SQL Database and SQL Server, to act as input data source or output data sink for Spark jobs.

It allows you to utilize real time transactional data in big data analytics and persist results for adhoc queries or reporting. Comparing to the built-in JDBC connector, the Spark to SQL DB connector provides the ability to bulk insert data into SQL databases. It can outperform row by row insertion with more than 10x faster performance. The Spark to SQL DB connector also support AAD authentication. It allows you securely connect to your Azure SQL databases from Azure Databricks using your AAD account. The Spark to SQL DB connector provides similar interfaces with the built-in JDBC connector. It is very easy to migrate your existing Spark jobs to use this new connector.

Download

To get started, download the Spark to SQL DB connector from the sqldb-spark repository on GitHub.

Official Supported Versions

Apache Spark - 2.0.2 or later
Scala - 2.10 or later
Microsoft JDBC Driver for SQL Server 6.2 or later

Dataflow

The Spark to SQL DB connector utilizes the Microsoft JDBC Driver for SQL Server to move data between Spark worker nodes and SQL databases The dataflow is as following:

The Spark master node connect to SQL Server or Azure SQL Databases and load data from a specific table or using a specific SQL query
Spark master node distribute data to worker nodes for transformation.
Worker node connect to SQL Server or Azure SQL Databases and write data to the database. User can choose to use row-by-row insertion or bulk insert.

Build the Spark to SQL DB connector

Currently, the connector project uses maven. To build the connector without dependencies, you can run:

mvn clean package

You can also download the latest versions of the JAR from the release folder .

Connect Spark to SQL DB using AAD authentication

You can connect to Azure SQL Databases using Azure Active Directory (AAD) authentication. Use AAD authentication to centrally manage identities of database users and as an alternative to SQL Server authentication.

The following AAD authentication modes are supported:

ActiveDirectoryIntegrated
ActiveDirectoryPassword
SqlPassword

Connecting using ActiveDirectoryPassword Authentication Mode

If you are using the ActiveDirectoryPassword authentication mode you will need azure-activedirectory-library-for-java and its dependencies. See Connecting using ActiveDirectoryPassword Authentication Mode section for more details.

Connecting using ActiveDirectoryIntegrated Authentication Mode

If you are using the ActiveDirectoryIntegrated mode, you will need to install the Active Directory Authentication Library for SQL Server (ADALSQL.DLL) and sqljdbc_auth.dll.

Connecting using Access Token

If you are using the access token based authentication mode, you will need azure-activedirectory-library-for-java and its dependencies to run the examples from this article. See Connecting using Access Token section for more details.

Write data to SQL database using Bulk Insert

The traditional jdbc connector writes data into SQL databases using row-by-row insertion. You can use Spark to SQL DB connector to write data to SQL database using bulk insert. It will significantly improve the write performance when loading large data sets.