azure-cosmosdb-spark

Содержание

Cosmos DB Configurations Used for Query Tests
Apache Spark Configurations Used for Query Tests
Query and Collections Used
pyDocumentDB Performance

Single Collection
Partitioned Collection

azure-cosmosdb-spark Performance

Single Collection
Partitioned Collection

Dev Box Results
HDI Cluster: 2 workers
HDI Cluster: for Q4, multiple worker configurations

Below are the results of some query test runs using the different Spark to Cosmos DB connector methods.

Cosmos DB Configurations Used for Query Tests

Below are the Cosmos DB configurations used:

Single partition collection: 10,000 RUs
airport.codes has 512 documents
DepartureDelays.flights has 1.05M documents (single collection)
Partitioned Collection: 250,000 RUs
DepartureDelays.flights (pColl) has 1.39M documents (partitioned collection)

Apache Spark Configurations Used for Query Tests

Below are the Apache Spark configurations used:

dev box: Single VM Spark cluster (one master, one worker) on Azure DS11 v2 VM (14GB RAM, 2 cores) running Ubuntu 16.04 LTS using Spark 2.1.
HDI Cluster: HDI 3.5 multi-node Spark cluster (2 master, multiple workers, 3 zookeepers) using Spark 2.0.2

Query and Collections Used

The queries were:

Q1: SELECT c.City FROM c WHERE c.State='WA'
Q2a: SELECT TOP 100 c.date, c.delay, c.distance, c.origin, c.destination FROM c
Q2b: SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA' LIMIT 100
Q3: SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c WHERE c.origin = 'SEA'
Q4: SELECT c.date, c.delay, c.distance, c.origin, c.destination FROM c

Note the slight differences between Q2a and Q2b; this is because Q2a via pyDocumentDB is using a Cosmos DB SQL query (i.e. using TOP) while Q2b via azure-cosmosdb-spark is using a Spark SQL query (i.e. using LIMIT).

against the following collections:

C1: airport.codes
C2: DepartureDelays.flights
C3: DepartureDelays.flights_pcoll

The query results below are from executing df.count()

pyDocumentDB Performance

Below are the results of connecting Spark to Cosmos DB via pyDocumentDB from the dev box:

Single Collection

Below are the results from querying a single collection

Query	# of rows	Collection	Response Time (First)	Response Time (Second)
Q1	7	C1	0:00:00.225645	0:00:00.006784
Q2a	100	C2	0:00:00.214985	0:00:00.009669
Q3	14,808	C2	0:00:01.498699	0:00:01.323917
Q4	1,048,575	C2	0:01:37.518344

Partitioned Collection

Below are the results from querying a partitioned collection (25 partitions)

Query	# of rows	Collection	Response Time (First)	Response Time (Second)
Q2a	100	C3	0:00:00.774820	0:00:00.508290
Q3	23,078	C3	0:00:05.146107	0:00:03.234670
Q4	1,391,578	C3	0:02:36.335267

azure-cosmosdb-spark Performance

Below are the results of connecting Spark to Cosmos DB via azure-cosmosdb-spark:

Single Collection

Below are the results from querying a single collection from the dev box:

Query	# of rows	Collection	Response Time (First)	Response Time (Second)
Q2	100	C2	00:00:01.183	00:00:00.958
Q3	14,808	C2	00:00:01.802	00:00:01.558
Q4	1,048,575	C2	00:00:56.642	00:00:54.931

Partitioned Collection

Below are the results from querying a partitioned collection (25 partitions):

Dev Box Results

Query	# of rows	Collection	Response Time (First)	Response Time (Second)
Q2	100	C3	0:00:00.774820	0:00:00.508290
Q3	23,078	C3	0:00:05.146107	0:00:03.234670
Q4	1,391,578	C3	0:02:36.335267

HDI Cluster: 2 workers

Query	# of rows	Collection	Response Time (First)	Response Time (Second)
Q2	100	C3	00:00:01.286	00:00:00.868
Q3	23,078	C3	00:00:01.582	00:00:01.339
Q4	1,391,578	C3	00:00:16.955	00:00:12.982

HDI Cluster: for Q4, multiple worker configurations

Re-running Q4 but scaling the cluster to see the impact of parallel queries

# of workers	Response Time (First)	Response Time (Second)
4	00:00:11.129	00:00:09.958
6	00:00:10.028	00:00:10.495
8	00:00:10.323	00:00:09.723
12	00:00:08.899	00:00:09.153
20	00:00:10.210	00:00:10.398