This project provides a client library that allows Azure SQL DB or SQL Server to act as an input source or output sink for Spark jobs.

Перейти к файлу

Gerhard Brueckl 47f5cdc19f Update README.md (#92 ) fix docs for sqlContext.azurePushdownQuery which was renamed to sqlContext.sqlDBQuery		2020-08-26 11:16:07 -07:00
docs	fix incorrect sqlDB examples	2020-02-15 12:14:28 -05:00
lib	Added Scalastyle config	2018-04-09 16:05:40 -07:00
releases/azure-sqldb-spark-1.0.0	Added release jars	2018-04-09 10:02:25 -07:00
samples	Added sample databricks notebook	2018-04-05 14:35:20 -07:00
src	Update setFireTriggers to have the correct configs	2020-05-01 11:33:25 -04:00
.gitignore	Fix maven build and test errors	2018-04-09 14:42:01 -07:00
.travis.yml	Explicitly specify Linux dist to be Trusty	2020-04-20 11:10:31 -07:00
LICENSE	Initial commit	2018-02-28 15:37:19 -08:00
README.md	Update README.md (#92 )	2020-08-26 11:16:07 -07:00
pom.xml	Set Scala versions using properties (#72 )	2020-06-03 12:56:13 -07:00

README.md

Updated Jun 2020: This project is not being actively maintained. Instead, Apache Spark Connector for SQL Server and Azure SQL is now available, with support for Python and R bindings, an easier-to use interface to bulk insert data, and many other improvements. We encourage you to actively evaluate and use the new connector.

Spark connector for Azure SQL Databases and SQL Server

The Spark connector for Azure SQL Database and SQL Server enables SQL databases, including Azure SQL Databases and SQL Server, to act as input data source or output data sink for Spark jobs. It allows you to utilize real time transactional data in big data analytics and persist results for adhoc queries or reporting.

Comparing to the built-in Spark connector, this connector provides the ability to bulk insert data into SQL databases. It can outperform row by row insertion with 10x to 20x faster performance. The Spark connector for Azure SQL Databases and SQL Server also supports AAD authentication. It allows you securely connecting to your Azure SQL databases from Azure Databricks using your AAD account. It provides similar interfaces with the built-in JDBC connector. It is easy to migrate your existing Spark jobs to use this new connector.

How to connect to Spark using this library

This connector uses Microsoft SQLServer JDBC driver to fetch data from/to the Azure SQL Database. Results are of the DataFrame type.

All connection properties in Microsoft JDBC Driver for SQL Server are supported in this connector. Add connection properties as fields in the com.microsoft.azure.sqldb.spark.config.Config object.

Reading from Azure SQL Database or SQL Server

import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val config = Config(Map(
  "url"            -> "mysqlserver.database.windows.net",
  "databaseName"   -> "MyDatabase",
  "dbTable"        -> "dbo.Clients"
  "user"           -> "username",
  "password"       -> "*********",
  "connectTimeout" -> "5", //seconds
  "queryTimeout"   -> "5"  //seconds
))

val collection = sqlContext.read.sqlDB(config)
collection.show()

Writing to Azure SQL Database or SQL Server

import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._
 
// Aquire a DataFrame collection (val collection)

val config = Config(Map(
  "url"          -> "mysqlserver.database.windows.net",
  "databaseName" -> "MyDatabase",
  "dbTable"      -> "dbo.Clients"
  "user"         -> "username",
  "password"     -> "*********"
))

import org.apache.spark.sql.SaveMode
collection.write.mode(SaveMode.Append).sqlDB(config)

Pushdown query to Azure SQL Database or SQL Server

For SELECT queries with expected return results, please use Reading from Azure SQL Database using Scala

import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.query._
val query = """
              |UPDATE Customers
              |SET ContactName = 'Alfred Schmidt', City= 'Frankfurt'
              |WHERE CustomerID = 1;
            """.stripMargin

val config = Config(Map(
  "url"          -> "mysqlserver.database.windows.net",
  "databaseName" -> "MyDatabase",
  "user"         -> "username",
  "password"     -> "*********",
  "queryCustom"  -> query
))

sqlContext.sqlDBQuery(config)

Bulk Copy to Azure SQL Database or SQL Server

import com.microsoft.azure.sqldb.spark.bulkcopy.BulkCopyMetadata
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

/** 
  Add column Metadata.
  If not specified, metadata will be automatically added
  from the destination table, which may suffer performance.
*/
var bulkCopyMetadata = new BulkCopyMetadata
bulkCopyMetadata.addColumnMetadata(1, "Title", java.sql.Types.NVARCHAR, 128, 0)
bulkCopyMetadata.addColumnMetadata(2, "FirstName", java.sql.Types.NVARCHAR, 50, 0)
bulkCopyMetadata.addColumnMetadata(3, "LastName", java.sql.Types.NVARCHAR, 50, 0)

val bulkCopyConfig = Config(Map(
  "url"               -> "mysqlserver.database.windows.net",
  "databaseName"      -> "MyDatabase",
  "user"              -> "username",
  "password"          -> "*********",
  "databaseName"      -> "MyDatabase",
  "dbTable"           -> "dbo.Clients",
  "bulkCopyBatchSize" -> "2500",
  "bulkCopyTableLock" -> "true",
  "bulkCopyTimeout"   -> "600"
))

df.bulkCopyToSqlDB(bulkCopyConfig, bulkCopyMetadata)
//df.bulkCopyToSqlDB(bulkCopyConfig) if no metadata is specified.

Requirements

Official supported versions

Component	Versions Supported
Apache Spark	2.0.2 or later
Scala	2.10 or later
Microsoft JDBC Driver for SQL Server	6.2 to 7.4 ^
Microsoft SQL Server	SQL Server 2008 or later
Azure SQL Databases	Supported

^ Driver version 8.x not tested

Download

Download from Maven

You can download the latest version from here

You can also use the following coordinate to import the library into Azure SQL Databricks: com.microsoft.azure:azure-sqldb-spark:1.0.2

Build this project

Currently, the connector project uses maven. To build the connector without dependencies, you can run:

mvn clean package

Contributing & Feedback

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

To give feedback and/or report an issue, open a GitHub Issue.

Apache®, Apache Spark, and Spark® are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.