Below are the results of some query test runs using the different Spark to Cosmos DB connector methods.
Cosmos DB Configurations Used for Query Tests
Below are the Cosmos DB configurations used:
- Single partition collection: 10,000 RUs
  - airport.codes has 512 documents
  - DepartureDelays.flights has 1.05M documents (single collection)
- Partitioned collection: 250,000 RUs
  - DepartureDelays.flights (pColl) has 1.39M documents (partitioned collection)
Apache Spark Configurations Used for Query Tests
Below are the Apache Spark configurations used:
- dev box: Single VM Spark cluster (one master, one worker) on an Azure DS11 v2 VM (14 GB RAM, 2 cores) running Ubuntu 16.04 LTS using Spark 2.1.
- HDI Cluster: HDI 3.5 multi-node Spark cluster (2 masters, multiple workers, 3 zookeepers) using Spark 2.0.2.
Queries and Collections Used
The queries were:
- Q1: SELECT c.City FROM c WHERE c.State='WA'
- Q2a: SELECT TOP 100 c.date, c.delay, c.distance, c.origin, c.destination FROM c
- Q2b: SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA' LIMIT 100
- Q3: SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'
- Q4: SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c
Note the slight differences between Q2a and Q2b: Q2a via pyDocumentDB uses a Cosmos DB SQL query (i.e. using TOP), while Q2b via azure-cosmosdb-spark uses a Spark SQL query (i.e. using LIMIT).
The queries were run against the following collections:
- C1: airport.codes
- C2: DepartureDelays.flights
- C3: DepartureDelays.flights_pcoll
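As a rough sketch of the difference between Q2a and Q2b, the snippet below shows how each query might be issued from Python. It assumes `client` is a pyDocumentDB `DocumentClient` and `spark` is a SparkSession with the azure-cosmosdb-spark package available. The function names, `collection_link`, and `read_config` are hypothetical and are not taken from the test runs.

```python
# Q2a: Cosmos DB SQL, sent to the service as-is (TOP caps the result set).
Q2A_COSMOS_SQL = (
    "SELECT TOP 100 c.date, c.delay, c.distance, c.origin, c.destination FROM c"
)

# Q2b: Spark SQL, run against a DataFrame registered as the temp view `c`.
Q2B_SPARK_SQL = (
    "SELECT c.date, c.delay, c.distance, c.origin, c.destination "
    "FROM c WHERE c.origin = 'SEA' LIMIT 100"
)


def run_q2a(client, collection_link):
    """Run Q2a through pyDocumentDB.

    `client` is assumed to be a pydocumentdb DocumentClient built by the caller;
    `collection_link` is a hypothetical link such as 'dbs/<db>/colls/<coll>'.
    """
    return list(client.QueryDocuments(collection_link, Q2A_COSMOS_SQL))


def run_q2b(spark, read_config):
    """Run Q2b through azure-cosmosdb-spark.

    `read_config` is an assumed dict of connector options (endpoint, key,
    database, collection) supplied by the caller.
    """
    df = (
        spark.read.format("com.microsoft.azure.cosmosdb.spark")
        .options(**read_config)
        .load()
    )
    df.createOrReplaceTempView("c")  # lets the Spark SQL text refer to `c`
    return spark.sql(Q2B_SPARK_SQL)
```

Both helpers take the client objects as parameters, so connection details stay with the caller.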
The query results below are from executing df.count().
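First and second response times of this kind could be captured with a small helper like the one below. This is a minimal sketch, not the harness used for these runs: `time_first_and_second` is a hypothetical name, and the lambda stands in for an action such as df.count().

```python
import time
from datetime import timedelta


def time_first_and_second(action):
    """Call `action` twice and return (first, second) elapsed times as timedeltas."""
    timings = []
    for _ in range(2):
        start = time.perf_counter()  # monotonic clock, suited to measuring intervals
        action()
        timings.append(timedelta(seconds=time.perf_counter() - start))
    return timings[0], timings[1]


# Stand-in workload; with Spark this would be something like `lambda: df.count()`.
first, second = time_first_and_second(lambda: sum(range(100_000)))
print("Response Time (First): ", first)
print("Response Time (Second):", second)
```

Printing a `timedelta` yields the H:MM:SS.ffffff layout, the same shape as the pyDocumentDB timings in the tables.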
pyDocumentDB Performance
Below are the results of connecting Spark to Cosmos DB via pyDocumentDB from the dev box:
Single Collection
Below are the results from querying a single collection:
Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
---|---|---|---|---|
Q1 | 7 | C1 | 0:00:00.225645 | 0:00:00.006784 |
Q2a | 100 | C2 | 0:00:00.214985 | 0:00:00.009669 |
Q3 | 14,808 | C2 | 0:00:01.498699 | 0:00:01.323917 |
Q4 | 1,048,575 | C2 | 0:01:37.518344 | |
Partitioned Collection
Below are the results from querying a partitioned collection (25 partitions):
Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
---|---|---|---|---|
Q2a | 100 | C3 | 0:00:00.774820 | 0:00:00.508290 |
Q3 | 23,078 | C3 | 0:00:05.146107 | 0:00:03.234670 |
Q4 | 1,391,578 | C3 | 0:02:36.335267 | |
azure-cosmosdb-spark Performance
Below are the results of connecting Spark to Cosmos DB via azure-cosmosdb-spark:
Single Collection
Below are the results from querying a single collection from the dev box:
Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
---|---|---|---|---|
Q2 | 100 | C2 | 00:00:01.183 | 00:00:00.958 |
Q3 | 14,808 | C2 | 00:00:01.802 | 00:00:01.558 |
Q4 | 1,048,575 | C2 | 00:00:56.642 | 00:00:54.931 |
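The two single-collection tables use slightly different time formats, so comparing them means converting both to seconds. The sketch below does that for the first-run Q4 timings on C2, using only figures copied from the tables above; `parse_response_time` is a hypothetical helper name.

```python
from datetime import timedelta


def parse_response_time(text):
    """Convert 'H:MM:SS.ffffff' or 'HH:MM:SS.mmm' into a timedelta."""
    hours, minutes, seconds = text.strip().split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


# First-run Q4 timings against C2, copied from the tables above.
rows = 1_048_575
pydocumentdb_q4 = parse_response_time("0:01:37.518344")
connector_q4 = parse_response_time("00:00:56.642")

for name, elapsed in [
    ("pyDocumentDB", pydocumentdb_q4),
    ("azure-cosmosdb-spark", connector_q4),
]:
    secs = elapsed.total_seconds()
    print(f"{name}: {secs:.3f} s, {rows / secs:,.0f} rows/s")

# Dividing two timedeltas yields a plain float ratio.
ratio = pydocumentdb_q4 / connector_q4
print(f"pyDocumentDB / azure-cosmosdb-spark: {ratio:.2f}x")
```

For these two first-run figures the ratio comes out to roughly 1.7x. It describes only this pair of measurements on the dev box.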
Partitioned Collection
Below are the results from querying a partitioned collection (25 partitions):
Dev Box Results
Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
---|---|---|---|---|
Q2 | 100 | C3 | 0:00:00.774820 | 0:00:00.508290 |
Q3 | 23,078 | C3 | 0:00:05.146107 | 0:00:03.234670 |
Q4 | 1,391,578 | C3 | 0:02:36.335267 | |
HDI Cluster: 2 workers
Query | # of rows | Collection | Response Time (First) | Response Time (Second) |
---|---|---|---|---|
Q2 | 100 | C3 | 00:00:01.286 | 00:00:00.868 |
Q3 | 23,078 | C3 | 00:00:01.582 | 00:00:01.339 |
Q4 | 1,391,578 | C3 | 00:00:16.955 | 00:00:12.982 |
HDI Cluster: for Q4, multiple worker configurations
Re-running Q4 while scaling the cluster to see the impact of parallel queries:
# of workers | Response Time (First) | Response Time (Second) |
---|---|---|
4 | 00:00:11.129 | 00:00:09.958 |
6 | 00:00:10.028 | 00:00:10.495 |
8 | 00:00:10.323 | 00:00:09.723 |
12 | 00:00:08.899 | 00:00:09.153 |
20 | 00:00:10.210 | 00:00:10.398 |
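To read the scaling numbers more easily, the timings can be expressed as a speedup over the 2-worker HDI run. The sketch below does this with values copied from the HDI tables above; `speedups` and `Q4_RUNS` are hypothetical names introduced for illustration.

```python
# (workers, first-run seconds, second-run seconds) for Q4 on C3,
# copied from the HDI tables above.
Q4_RUNS = [
    (2, 16.955, 12.982),
    (4, 11.129, 9.958),
    (6, 10.028, 10.495),
    (8, 10.323, 9.723),
    (12, 8.899, 9.153),
    (20, 10.210, 10.398),
]


def speedups(runs, column):
    """Return {workers: baseline_time / time}, with the first row as the baseline."""
    baseline = runs[0][column]
    return {run[0]: baseline / run[column] for run in runs}


first = speedups(Q4_RUNS, 1)
second = speedups(Q4_RUNS, 2)
for workers, *_ in Q4_RUNS:
    print(
        f"{workers:>2} workers: first x{first[workers]:.2f}, "
        f"second x{second[workers]:.2f}"
    )
```

In these runs the speedup over the 2-worker baseline stays below 2x at every cluster size.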