
Benchmarking

Generate Data

Download the TPC-H benchmark tool. Follow the instructions for registration and download the tool to local disk with at least 300GB free space.
Build the dbgen tool.
- Decompress the zip file, then navigate to dbgen folder.
- For Linux, the TPC-H README contains instructions on how to build the tool.
- For Windows, generate the dbgen.exe using Visual Studio:
  - (1). In the dbgen folder, open tpch.sln using Visual Studio.
  - (2). Build the dbgen project (there is no need to build qgen). This should generate dbgen.exe in the Debug folder.
Generate the data.
- For Linux, the TPC-H README contains instructions on how to generate the database tables.
- For Windows,
  - (1). Copy dbgen.exe to the dbgen folder
  - (2). The following will generate a 300GB TPC-H dataset:
```
cd /d \path\to\dbgen
dbgen.exe -vf -s 300
```
  Note: Since there is no parallelization option for TPC-H dbgen, generating a 300GB dataset could take up to 40 hours to complete.
- After database population generation is completed, there should be 8 tables (customer, lineitem, nation, orders, part, partsupp, region, supplier) created with the .tbl extension.
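The check described above can be scripted. The following is a minimal sketch, assuming the .tbl files are written directly into the directory passed in; `missing_tables` is a hypothetical helper name, not part of the TPC-H tooling.

```python
from pathlib import Path

# The eight tables named above; each is expected as <name>.tbl.
EXPECTED_TABLES = [
    "customer", "lineitem", "nation", "orders",
    "part", "partsupp", "region", "supplier",
]


def missing_tables(dbgen_dir):
    """Return the expected tables whose .tbl file is absent or empty."""
    root = Path(dbgen_dir)
    missing = []
    for name in EXPECTED_TABLES:
        tbl = root / f"{name}.tbl"
        # An empty file usually means generation was interrupted.
        if not tbl.is_file() or tbl.stat().st_size == 0:
            missing.append(name)
    return missing
```

Calling `missing_tables("/path/to/dbgen")` returns an empty list when all eight tables are present and non-empty.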
Convert the TPC-H dataset to Parquet format.
- You can use a simple Spark application to convert the TPC-H dataset to Parquet format. Submit it with the following spark-submit command, adjusting the command to match how you submit applications in your environment.

        <spark-submit> --master local[*] --class com.microsoft.tpch.ConvertTpchCsvToParquetApp microsoft-spark-benchmark-<version>.jar <path-to-source-directory-with-TPCH-tables> <path-to-destination-directory-to-save-parquet-file>
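If you prefer to launch the conversion from a script, the command above can be assembled as an argument list. This is a sketch only: `build_convert_command` and `run_convert` are hypothetical helper names, and every path is a placeholder you must supply. Passing a list to `subprocess.run` avoids a shell, so `local[*]` is not subject to glob expansion.

```python
import subprocess


def build_convert_command(spark_submit, jar, source_dir, dest_dir):
    """Build the argument list for the conversion job shown above."""
    return [
        spark_submit,
        "--master", "local[*]",
        "--class", "com.microsoft.tpch.ConvertTpchCsvToParquetApp",
        jar,          # microsoft-spark-benchmark-<version>.jar
        source_dir,   # directory containing the .tbl files
        dest_dir,     # directory where the Parquet output is written
    ]


def run_convert(spark_submit, jar, source_dir, dest_dir):
    """Run the conversion; raises CalledProcessError on a non-zero exit."""
    cmd = build_convert_command(spark_submit, jar, source_dir, dest_dir)
    subprocess.run(cmd, check=True)
```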

Cluster Run

TPCH timing results are written to stdout in the following form: TPCH_Result,<language>,<test type>,<query number>,<iteration>,<total time taken for iteration in milliseconds>,<time taken to run query in milliseconds>
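Because the result lines are mixed with other Spark output on stdout, it can help to filter and parse them. The sketch below assumes the query number, iteration, and both timings are printed as integers; the `TpchResult` type and its field names are illustrative, not defined by the benchmark.

```python
from typing import NamedTuple


class TpchResult(NamedTuple):
    language: str
    test_type: str
    query: int
    iteration: int
    total_ms: int   # total time taken for the iteration
    query_ms: int   # time taken to run the query


def parse_result_line(line):
    """Parse one stdout line; return None if it is not a result line."""
    parts = line.strip().split(",")
    if len(parts) != 7 or parts[0] != "TPCH_Result":
        return None
    try:
        # Assumption: the four trailing fields are integers.
        query, iteration, total_ms, query_ms = (int(p) for p in parts[3:])
    except ValueError:
        return None
    return TpchResult(parts[1], parts[2], query, iteration, total_ms, query_ms)
```

For example, given a made-up line such as `TPCH_Result,CSharp,sql,1,0,5000,4200` (illustrative values, not a measured result), the function returns a `TpchResult` with `query=1` and `query_ms=4200`; any other log line yields `None`.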

Cold Run
- Each <query + iteration> uses a new spark-submit
Warm Run
- Each query uses a new spark-submit
- Each iteration reuses the Spark session after the DataFrame has been created, and therefore skips the load phase that enumerates files
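The two modes differ in how many times spark-submit is launched, which affects total wall-clock time. A small sketch of that arithmetic follows; `spark_submit_count` is a hypothetical helper, and the number of queries is left as a parameter because it depends on which queries you run.

```python
def spark_submit_count(num_queries, iterations, cold):
    """Number of spark-submit launches implied by the run modes above."""
    if num_queries < 0 or iterations < 1:
        raise ValueError("num_queries must be >= 0 and iterations >= 1")
    if cold:
        # Cold run: every <query + iteration> pair gets its own spark-submit.
        return num_queries * iterations
    # Warm run: one spark-submit per query; iterations share the session.
    return num_queries
```

With 22 queries and 5 iterations, for instance, a cold run launches spark-submit 110 times and a warm run 22 times.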

CSharp

Ensure that the Microsoft.Spark.Worker is properly installed in your cluster.
Build microsoft-spark-<version>.jar and the CSharp Tpch benchmark application by following the build instructions.
Upload run_csharp_benchmark.sh, the Tpch benchmark application, and microsoft-spark-<version>.jar to the cluster.

Run the benchmark by invoking:

run_csharp_benchmark.sh \
<number of cold iterations> \
<num_executors> \
<driver_memory> \
<executor_memory> \
<executor_cores> \
</path/to/Tpch.dll> \
</path/to/microsoft-spark-<version>.jar> \
</path/to/Tpch executable> \
</path/to/dataset> \
<number of iterations> \
<true for sql tests, false for functional tests>

Note: Ensure that you build the worker and application with .NET 6 in order to run hardware acceleration queries.

Python

Upload run_python_benchmark.sh and all python tpch benchmark files to the cluster.
Install pyarrow and pandas on all nodes in the cluster. For example, if you are using Conda, you can use the following commands to install them.
```
sudo /path/to/conda update --all
sudo /path/to/conda install pandas
sudo /path/to/conda install pyarrow
```
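To confirm that the packages are visible to the interpreter on a given node, a quick check like the following can be run there. This is a sketch: `missing_modules` is a hypothetical helper, and it only inspects the node and Python environment it runs in, so it would need to be repeated on each node.

```python
import importlib.util


def missing_modules(names=("pandas", "pyarrow")):
    """Return the top-level module names that cannot be found."""
    # find_spec returns None for a top-level module that is not installed,
    # and does not import the module itself.
    return [n for n in names if importlib.util.find_spec(n) is None]
```

An empty list from `missing_modules()` means both pandas and pyarrow are importable in that environment.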

Run the benchmark by invoking:

run_python_benchmark.sh \
<number of cold iterations> \
<num_executors> \
<driver_memory> \
<executor_memory> \
<executor_cores> \
</path/to/tpch.py> \
</path/to/dataset> \
<number of iterations> \
<true for sql tests, false for functional tests>

In order to run with Python 3.x (the default is 2.7) you will need to do the following on the driver node.

Activate the Python 3.x environment by changing the value of the PYSPARK_PYTHON environment variable to point to the Python 3 binary. This change can be made in the spark-env.sh conf file.
```
export PYSPARK_PYTHON=${PYSPARK_PYTHON:-/path/to/python3}
```
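The `${PYSPARK_PYTHON:-/path/to/python3}` form keeps an existing value and falls back to the given path only when the variable is unset or empty. The Python sketch below mirrors that rule for illustration; `resolve_pyspark_python` is a hypothetical helper, not something Spark provides.

```python
import os


def resolve_pyspark_python(default, env=None):
    """Mirror the shell expansion ${PYSPARK_PYTHON:-default}."""
    env = os.environ if env is None else env
    # The ":-" form falls back when the variable is unset *or* empty,
    # which is why `or` is used rather than a plain .get() default.
    return env.get("PYSPARK_PYTHON") or default
```

This means a value already exported in the environment takes precedence over the path written in spark-env.sh.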

Scala

Run mvn package to build the Scala TPC-H benchmark application.
Upload run_scala_benchmark.sh and the microsoft-spark-benchmark-<version>.jar to the cluster.

Run the benchmark by invoking:

run_scala_benchmark.sh \
<number of cold iterations> \
<num_executors> \
<driver_memory> \
<executor_memory> \
<executor_cores> \
</path/to/microsoft-spark-benchmark-<version>.jar> \
</path/to/dataset> \
<number of iterations> \
<true for sql tests, false for functional tests>