Benchmarking

Generate Data

  1. Download the TPC-H benchmark tool. Follow the registration instructions and download the tool to a local disk with at least 300GB of free space.

  2. Build the dbgen tool.

    • Decompress the zip file, then navigate to the dbgen folder.
    • For Linux, the TPC-H README contains instructions on how to build the tool.
    • For Windows, generate dbgen.exe using Visual Studio:
      • (1). In the dbgen folder, open tpch.sln with Visual Studio.
      • (2). Build the dbgen project (there is no need to build qgen); this generates dbgen.exe in the Debug folder.
  3. Generate the data.

    • For Linux, the TPC-H README contains instructions on how to generate the database tables.

    • For Windows,

      • (1). Copy dbgen.exe to the dbgen folder.
      • (2). The following will generate a 300GB TPC-H dataset:
      cd /d \path\to\dbgen
      dbgen.exe -vf -s 300
      

      Note: Since there is no parallelization option for TPC-H dbgen, generating a 300GB dataset could take up to 40 hours to complete.

    • After data generation completes, there should be 8 tables (customer, lineitem, nation, orders, part, partsupp, region, supplier), each created as a file with the .tbl extension.

  4. Convert the TPC-H dataset to parquet format.

    • You can use a simple Spark application to convert the TPC-H dataset to parquet format. Run the following spark-submit command to submit the application, adjusting it to match how you submit applications in your environment. A sketch of what such a converter might look like follows the command.
        <spark-submit> --master local[*] --class com.microsoft.tpch.ConvertTpchCsvToParquetApp microsoft-spark-benchmark-<version>.jar <path-to-source-directory-with-TPCH-tables> <path-to-destination-directory-to-save-parquet-file>
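
    • As a rough illustration, a standalone converter might look like the sketch below. This is a minimal sketch, assuming pipe-delimited .tbl input and inferred schemas; the actual ConvertTpchCsvToParquetApp in the benchmark sources may declare explicit per-table schemas and differ in other details.

      import org.apache.spark.sql.SparkSession

      // Minimal sketch: read each pipe-delimited .tbl file and write it back out
      // as parquet. Note that dbgen emits a trailing "|" per row, which shows up
      // as an extra null column unless the schema is declared explicitly.
      object ConvertTpchCsvToParquetSketch {
        def main(args: Array[String]): Unit = {
          val Array(srcDir, dstDir) = args
          val spark = SparkSession.builder().appName("TPC-H to parquet").getOrCreate()
          val tables = Seq("customer", "lineitem", "nation", "orders",
                           "part", "partsupp", "region", "supplier")
          for (table <- tables) {
            spark.read
              .option("sep", "|")
              .option("inferSchema", "true")
              .csv(s"$srcDir/$table.tbl")
              .write
              .parquet(s"$dstDir/$table")
          }
          spark.stop()
        }
      }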

Cluster Run

TPC-H timing results are written to stdout in the following form: TPCH_Result,<language>,<test type>,<query number>,<iteration>,<total time taken for iteration in milliseconds>,<time taken to run query in milliseconds>
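
To aggregate these results, the stdout lines can be filtered and split on commas. The helper below is a hypothetical Scala sketch based only on the format above; TpchResultParser is not part of the benchmark:

    object TpchResultParser {
      // Parses: TPCH_Result,<language>,<test type>,<query number>,<iteration>,
      //         <total iteration time in ms>,<query time in ms>
      case class TpchResult(language: String, testType: String, query: Int,
                            iteration: Int, totalMs: Long, queryMs: Long)

      def parse(line: String): Option[TpchResult] = line.split(",") match {
        case Array("TPCH_Result", lang, test, q, iter, total, query) =>
          Some(TpchResult(lang, test, q.toInt, iter.toInt, total.toLong, query.toLong))
        case _ => None
      }
    }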

  • Cold Run
    • Each <query + iteration> uses a new spark-submit.
  • Warm Run
    • Each query uses a new spark-submit.
    • Each iteration reuses the Spark session after creating the DataFrame (therefore skipping the load phase that does file enumeration); see the sketch after this list.
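
File enumeration over a large dataset is itself costly, which is why the warm run skips it. Below is a minimal Scala sketch of a warm-run timing loop; the query is a hypothetical stand-in for the actual TPC-H queries, and the path and iteration count are placeholders:

    import org.apache.spark.sql.SparkSession

    object WarmRunSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("TPC-H warm run").getOrCreate()
        // Load phase: the DataFrame is created once and reused across iterations.
        val lineitem = spark.read.parquet("/path/to/dataset/lineitem")
        for (iteration <- 1 to 10) {
          val start = System.nanoTime()
          // Stand-in aggregation; the real benchmark executes a TPC-H query here.
          lineitem.groupBy("l_returnflag").count().collect()
          val elapsedMs = (System.nanoTime() - start) / 1000000
          // Mirrors the TPCH_Result line; both timings coincide in this sketch.
          println(s"TPCH_Result,Scala,Functional,1,$iteration,$elapsedMs,$elapsedMs")
        }
        spark.stop()
      }
    }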

CSharp

  1. Ensure that the Microsoft.Spark.Worker is properly installed in your cluster.
  2. Build microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar and the CSharp Tpch benchmark application by following the build instructions.
  3. Upload run_csharp_benchmark.sh, the Tpch benchmark application, and microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar to the cluster.
  4. Run the benchmark by invoking:
    run_csharp_benchmark.sh \
    <number of cold iterations> \
    <num_executors> \
    <driver_memory> \
    <executor_memory> \
    <executor_cores> \
    </path/to/Tpch.dll> \
    </path/to/microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar> \
    </path/to/Tpch executable> \
    </path/to/dataset> \
    <number of iterations> \
    <true for sql tests, false for functional tests>
    

Python

  1. Upload run_python_benchmark.sh and all Python TPC-H benchmark files to the cluster.
  2. Run the benchmark by invoking:
    run_python_benchmark.sh \
    <number of cold iterations> \
    <num_executors> \
    <driver_memory> \
    <executor_memory> \
    <executor_cores> \
    </path/to/tpch.py> \
    </path/to/dataset> \
    <number of iterations> \
    <true for sql tests, false for functional tests>
    

Scala

  1. Run mvn package to build the Scala TPC-H benchmark application.
  2. Upload run_scala_benchmark.sh and the microsoft-spark-benchmark-<version>.jar to the cluster.
  3. Run the benchmark by invoking:
    run_scala_benchmark.sh \
    <number of cold iterations> \
    <num_executors> \
    <driver_memory> \
    <executor_memory> \
    <executor_cores> \
    </path/to/microsoft-spark-benchmark-<version>.jar> \
    </path/to/dataset> \
    <number of iterations> \
    <true for sql tests, false for functional tests>