Welcome to the azure-cosmosdb-spark wiki!
NOTE: There is a new Cosmos DB Spark connector available for Spark 3
--------------------------------------------------------------------
The new Cosmos DB Spark connector has been released. The Maven coordinates (which can be used to install the connector in Databricks) are `com.azure.cosmos.spark:azure-cosmos-spark_3-1_2-12:4.6.0`.
The source code for the new connector is located here: https://github.com/Azure/azure-sdk-for-java/tree/master/sdk/cosmos/azure-cosmos-spark_3_2-12
A migration guide to change applications which used the Spark 2.4 connector is located here: https://github.com/Azure/azure-sdk-for-java/blob/master/sdk/cosmos/azure-cosmos-spark_3_2-12/docs/migration.md
Additional documentation for the new connector:
- Quick start introduction: https://github.com/Azure/azure-sdk-for-java/blob/master/sdk/cosmos/azure-cosmos-spark_3_2-12/docs/quick-start.md
- Configuration reference: https://github.com/Azure/azure-sdk-for-java/blob/master/sdk/cosmos/azure-cosmos-spark_3_2-12/docs/configuration-reference.md
- End-to-end samples: https://github.com/Azure/azure-sdk-for-java/blob/master/sdk/cosmos/azure-cosmos-spark_3_2-12/Samples/Python/NYC-Taxi-Data/01_Batch.ipynb
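For a quick orientation, here is a minimal PySpark read with the new connector, assuming a notebook `spark` session; the account, database, and container values are placeholders, and the quick start linked above covers full setup:

```python
# Placeholder account values; replace with your own before running.
cosmos_config = {
    "spark.cosmos.accountEndpoint": "https://<your-account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<your-account-key>",
    "spark.cosmos.database": "<your-database>",
    "spark.cosmos.container": "<your-container>",
}

# Load the container's documents into a Spark DataFrame.
df = spark.read.format("cosmos.oltp").options(**cosmos_config).load()
```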
--------------------------------------------------------------------
This wiki contains the following resources for your reference:
Configuration and Setup
- Spark Connector Configuration
- Spark to Cosmos DB Connector Setup (In progress)
- Configuring Power BI Direct Query to Azure Cosmos DB via Apache Spark (HDI)
Troubleshooting
Performance
Change Feed
- Stream Processing Changes using Azure Cosmos DB Change Feed and Apache Spark
- Change Feed Demos
- Structured Stream Demos
Introduction
This project provides a client library that allows Azure Cosmos DB to act as an input source or output sink for Spark jobs. Fast connectivity between Apache Spark and Azure Cosmos DB accelerates your ability to solve fast-moving data science problems, since your data can be quickly persisted and retrieved using Azure Cosmos DB. With the Spark to Cosmos DB connector, you can more easily address scenarios including (but not limited to) blazing fast IoT workloads, updateable columns when performing analytics, push-down predicate filtering, and advanced analytics and data science against your fast-changing data, all backed by a geo-replicated managed document store with guaranteed SLAs for consistency, availability, low latency, and throughput.
This package is provided as a technical preview.
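For concreteness, here is a minimal sketch of Cosmos DB as both input source and output sink with `azure-cosmosdb-spark`; the connection values and collection names below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cosmosdb-spark-demo").getOrCreate()

# Placeholder connection details; replace with your own account values.
base_config = {
    "Endpoint": "https://<your-account>.documents.azure.com:443/",
    "Masterkey": "<your-account-key>",
    "Database": "DepartureDelays",
}

# Cosmos DB as an input source: load documents into a DataFrame.
read_config = dict(base_config, Collection="flights")
df = spark.read.format("com.microsoft.azure.cosmosdb.spark").options(**read_config).load()

# Cosmos DB as an output sink: upsert the (possibly transformed) rows
# into another collection.
write_config = dict(base_config, Collection="flights_processed", Upsert="true")
df.write.format("com.microsoft.azure.cosmosdb.spark").options(**write_config).mode("append").save()
```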
Common Scenarios
Common scenarios for using Apache Spark and Cosmos DB together include:
- Distributed Aggregations and Analytics
- Push-down Predicate Filtering
- Blazing Fast IoT Scenarios
- Updateable Columns
Below are more details surrounding these scenarios; if you're ready to use `azure-cosmosdb-spark`, please refer to the Azure Cosmos DB Spark Connector User Guide.
Distributed Aggregations and Analytics
While Azure Cosmos DB has aggregations (`SUM`, `MIN`, `MAX`, `COUNT`, and `AVG`, with `GROUP BY`, `DISTINCT`, etc. in progress), connecting Apache Spark to Cosmos DB allows you to easily and quickly perform distributed aggregations, which is important for larger implementations. For example, the query below calculates a distributed median using Apache Spark's `PERCENTILE_APPROX` function:
```sql
select destination, percentile_approx(delay, 0.5) as median_delay
from df
where delay < 0
group by destination
order by percentile_approx(delay, 0.5)
```
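To run this query in a notebook, the source DataFrame needs to be registered as a temporary view. A minimal PySpark sketch, assuming a hypothetical DataFrame `df` of flight records (with `destination` and `delay` columns) already loaded from Cosmos DB:

```python
# Hypothetical flights DataFrame loaded from Cosmos DB via the connector;
# register it so Spark SQL can reference it as "df".
df.createOrReplaceTempView("df")

median_delays = spark.sql("""
    SELECT destination, percentile_approx(delay, 0.5) AS median_delay
    FROM df
    WHERE delay < 0
    GROUP BY destination
    ORDER BY percentile_approx(delay, 0.5)
""")
median_delays.show()
```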
Push-down Predicate Filtering
Queries from Apache Spark push down predicates to Azure Cosmos DB, taking advantage of the fact that Cosmos DB indexes every attribute by default.
For example, if you only want the flights departing from Seattle (SEA), `azure-cosmosdb-spark` will:
- Send the query to Azure Cosmos DB.
- Because all attributes within Azure Cosmos DB are automatically indexed, only the flights pertaining to Seattle are quickly returned to the Spark worker nodes.
- This way, as you perform your data science work, you transfer only the data you need (see the sketch below).
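A minimal sketch of that pattern with the Spark 2.4 connector, where the filter travels to Cosmos DB via the `query_custom` option; the account details and collection name are placeholders:

```python
# Placeholder connection details; replace with your own account values.
read_config = {
    "Endpoint": "https://<your-account>.documents.azure.com:443/",
    "Masterkey": "<your-account-key>",
    "Database": "DepartureDelays",
    "Collection": "flights",
    # The predicate is evaluated inside Cosmos DB against its automatic
    # per-attribute indexes, so only SEA departures reach the workers.
    "query_custom": "SELECT c.* FROM c WHERE c.origin = 'SEA'",
}

sea_flights = (spark.read
    .format("com.microsoft.azure.cosmosdb.spark")
    .options(**read_config)
    .load())
```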
Blazing Fast IoT Scenarios
Azure Cosmos DB is designed for high-throughput, low-latency IoT environments. Consider a flight delays scenario: using Spark and Cosmos DB together, you can:
- Handle high throughput of concurrent alerts (e.g. weather, flight information, global safety alerts, etc.)
- Send this information downstream for device notifications, RESTful services, etc. (e.g. an alert on your phone about an impending flight delay), including via the change feed
- At the same time, as you are building up ML models against your data, you can also make sense of the latest information
Updateable Columns
Related to the blazing fast IoT scenarios noted above, let's dive into updateable columns:
As a new piece of information comes in (e.g. a flight delay changes from 5 min to 30 min), you want to be able to quickly re-run your ML models to reflect this newest information; for example, you can predict the impact of the 30-minute delay on all of the downstream flights. This refresh can be quickly initiated via the Azure Cosmos DB Change Feed, as sketched below.
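A hedged sketch of picking up those changes as a structured stream with `azure-cosmosdb-spark`; the change feed options follow the connector's configuration, but the connection values, query name, and checkpoint path are placeholders:

```python
# Placeholder connection details; replace with your own account values.
source_config = {
    "Endpoint": "https://<your-account>.documents.azure.com:443/",
    "Masterkey": "<your-account-key>",
    "Database": "DepartureDelays",
    "Collection": "flights",
    # Read the Cosmos DB change feed instead of scanning the collection.
    "ReadChangeFeed": "true",
    "ChangeFeedQueryName": "flight-delay-updates",
    "ChangeFeedStartFromTheBeginning": "false",
    "ChangeFeedCheckpointLocation": "/tmp/changefeed-checkpoint",
}

# Each micro-batch contains only documents inserted or updated since the
# last checkpoint (e.g. a delay that changed from 5 to 30 minutes),
# which can then trigger the downstream ML model refresh.
changes = (spark.readStream
    .format("com.microsoft.azure.cosmosdb.spark.streaming.CosmosDBSourceProvider")
    .options(**source_config)
    .load())
```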