Spark on PAI
This example demonstrates how to run a Spark job on PAI.
1. Off-the-shelf example
1. Submit your Spark application
Below is a job config that runs the SparkPi Java example on PAI.
Note: Replace YOUR_PAI_MASTER_IP with your own master IP before submitting the job on PAI. If you want the job to exit once the Spark job has finished, change minSucceededTaskCount to 1.
{
  "jobName": "spark-example",
  "image": "openpai/spark-example",
  "virtualCluster": "default",
  "retryCount": 0,
  "taskRoles": [
    {
      "name": "submitter",
      "taskNumber": 1,
      "cpuNumber": 1,
      "memoryMB": 2048,
      "shmMB": 64,
      "gpuNumber": 0,
      "minFailedTaskCount": 1,
      "minSucceededTaskCount": null,
      "command": "spark-submit --conf spark.eventLog.enabled=true --conf spark.history.fs.logDirectory=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs --conf spark.eventLog.dir=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 1g --executor-memory 2g --executor-cores 1 --queue default ${SPARK_HOME}/examples/jars/spark-examples*.jar 10",
      "portList": []
    },
    {
      "name": "spark_history_server",
      "taskNumber": 1,
      "cpuNumber": 1,
      "memoryMB": 1024,
      "shmMB": 64,
      "gpuNumber": 0,
      "minFailedTaskCount": 1,
      "minSucceededTaskCount": null,
      "command": "URL=http://${PAI_CURRENT_CONTAINER_IP}:${PAI_CONTAINER_HOST_history_server_PORT_LIST}/ && echo Please visit spark history server: ${URL} && SPARK_DAEMON_JAVA_OPTS=\"-Dspark.history.ui.port=${PAI_CONTAINER_HOST_history_server_PORT_LIST} -Dspark.history.fs.logDirectory=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs\" spark-class org.apache.spark.deploy.history.HistoryServer",
      "portList": [
        {
          "label": "history_server",
          "beginAt": 0,
          "portNumber": 1
        }
      ]
    }
  ]
}
2. Visit Spark history server
Your job will look like the image below; key information is marked in red.
As the previous image indicated, you can visit the Spark history server on
2. Run your Python application
For a Python application, you need to manage dependencies carefully. In the example below, we ship the dependencies as a zipped conda environment using --archives.
1. Prepare your data and code
Upload sample_libsvm_data.txt and gradient_boosted_tree_classifier_example.py to HDFS:
hdfs dfs -mkdir -p hdfs://YOUR_PAI_MASTER_IP:9000/user/core/data/mllib/
hdfs dfs -put sample_libsvm_data.txt hdfs://YOUR_PAI_MASTER_IP:9000/user/core/data/mllib/
hdfs dfs -mkdir -p hdfs://YOUR_PAI_MASTER_IP:9000/user/core/code
hdfs dfs -put gradient_boosted_tree_classifier_example.py hdfs://YOUR_PAI_MASTER_IP:9000/user/core/code/
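The same upload steps can be generated for any master IP. The following is a minimal Python sketch: the helper name `hdfs_upload_commands` and the sample IP `10.0.0.1` are hypothetical, and port 9000 is taken from the commands above. It only builds the command lines and does not run them.

```python
import shlex


def hdfs_upload_commands(master_ip, uploads, port=9000):
    """Build `hdfs dfs` commands that create each target directory and upload a file.

    uploads maps a local file name to an absolute HDFS directory.
    """
    commands = []
    for local_file, hdfs_dir in uploads.items():
        target = f"hdfs://{master_ip}:{port}{hdfs_dir}"
        commands.append(["hdfs", "dfs", "-mkdir", "-p", target])
        commands.append(["hdfs", "dfs", "-put", local_file, target])
    return commands


# Hypothetical master IP; replace it with your own.
uploads = {
    "sample_libsvm_data.txt": "/user/core/data/mllib/",
    "gradient_boosted_tree_classifier_example.py": "/user/core/code/",
}
for argv in hdfs_upload_commands("10.0.0.1", uploads):
    print(" ".join(shlex.quote(part) for part in argv))
```

To execute the commands instead of printing them, each list can be passed to `subprocess.run(argv, check=True)` on a machine that has the `hdfs` client installed.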
2. Generate your dependencies with conda env
First, install conda.
sudo apt update --yes
sudo apt upgrade --yes
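# Get Miniconda and make it the main Python interpreter
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p ~/miniconda
rm ~/miniconda.sh
# Put conda and its Python first on PATH for the current shell
export PATH="$HOME/miniconda/bin:$PATH"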
Then create a spark-python env with Python 3 and numpy:
conda create -n spark-python --copy -y -q python=3 numpy
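Finally, zip the environment and upload it to HDFS. Run zip from the directory that contains the spark-python env (~/miniconda/envs with the install prefix used above).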
zip -r spark-python.zip spark-python
hdfs dfs -put spark-python.zip hdfs://YOUR_PAI_MASTER_IP:9000/user/core/
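The submit command in the next step points PYSPARK_PYTHON at MY_CONDA/spark-python/bin/python, so the archive needs the entry spark-python/bin/python at its top level. The following is a small Python sketch for checking that before uploading; the helper name `archive_has_interpreter` is hypothetical, and the demo builds a tiny stand-in zip rather than a real conda env.

```python
import os
import tempfile
import zipfile


def archive_has_interpreter(zip_path, env_name="spark-python"):
    """Return True if the zip holds <env_name>/bin/python at its top level."""
    expected = f"{env_name}/bin/python"
    with zipfile.ZipFile(zip_path) as archive:
        return expected in archive.namelist()


# Demo with a stand-in archive that contains only an empty interpreter entry.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "spark-python.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("spark-python/bin/python", "")
    print(archive_has_interpreter(demo_zip))  # True
```

If the check returns False for your real archive, the zip was most likely created from inside the env directory, which drops the leading spark-python/ component.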
3. Submit job on PAI
Note: Replace YOUR_PAI_MASTER_IP with your own master IP before submitting the job on PAI. If you want the job to exit once the Spark job has finished, change minSucceededTaskCount to 1.
{
  "jobName": "spark-python-example",
  "image": "openpai/spark-example",
  "dataDir": "hdfs://YOUR_PAI_MASTER_IP:9000/user/core/data/mllib/",
  "codeDir": "hdfs://YOUR_PAI_MASTER_IP:9000/user/core/code",
  "virtualCluster": "default",
  "retryCount": 0,
  "taskRoles": [
    {
      "name": "submitter",
      "taskNumber": 1,
      "cpuNumber": 1,
      "memoryMB": 2048,
      "shmMB": 64,
      "gpuNumber": 0,
      "minFailedTaskCount": 1,
      "minSucceededTaskCount": null,
      "command": "spark-submit --conf spark.eventLog.enabled=true --conf spark.history.fs.logDirectory=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs --conf spark.eventLog.dir=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=MY_CONDA/spark-python/bin/python --master yarn --deploy-mode cluster --archives hdfs://YOUR_PAI_MASTER_IP:9000/user/core/spark-python.zip#MY_CONDA --queue default hdfs://YOUR_PAI_MASTER_IP:9000/user/core/code/gradient_boosted_tree_classifier_example.py hdfs://YOUR_PAI_MASTER_IP:9000/user/core/data/mllib/sample_libsvm_data.txt",
      "portList": []
    },
    {
      "name": "spark_history_server",
      "taskNumber": 1,
      "cpuNumber": 1,
      "memoryMB": 1024,
      "shmMB": 64,
      "gpuNumber": 0,
      "minFailedTaskCount": 1,
      "minSucceededTaskCount": null,
      "command": "URL=http://${PAI_CURRENT_CONTAINER_IP}:${PAI_CONTAINER_HOST_history_server_PORT_LIST}/ && echo Please visit spark history server: ${URL} && SPARK_DAEMON_JAVA_OPTS=\"-Dspark.history.ui.port=${PAI_CONTAINER_HOST_history_server_PORT_LIST} -Dspark.history.fs.logDirectory=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs\" spark-class org.apache.spark.deploy.history.HistoryServer",
      "portList": [
        {
          "label": "history_server",
          "beginAt": 0,
          "portNumber": 1
        }
      ]
    }
  ]
}
4. Visit Spark history server
Same as in the previous section.
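Both job configs above need the same two edits before submission: replacing YOUR_PAI_MASTER_IP and, optionally, setting minSucceededTaskCount to 1. The following is a minimal Python sketch of scripting those edits, not part of the PAI tooling. The function name `customize_job_config`, the sample IP, and the trimmed demo config are hypothetical. Applying minSucceededTaskCount to the submitter role is an assumption, since the notes do not say which task role to change.

```python
import json

PLACEHOLDER = "YOUR_PAI_MASTER_IP"


def customize_job_config(config, master_ip, quit_when_done=False,
                         role_name="submitter"):
    """Return a copy of a PAI job config with the placeholder IP filled in."""
    # Round-trip through JSON text so the placeholder is replaced in every
    # string value (commands, dataDir, codeDir) and the input stays untouched.
    result = json.loads(json.dumps(config).replace(PLACEHOLDER, master_ip))
    if quit_when_done:
        for role in result.get("taskRoles", []):
            # Assumption: the job should end once this role has succeeded.
            if role.get("name") == role_name:
                role["minSucceededTaskCount"] = 1
    return result


# Trimmed, hypothetical config with the same keys as the job configs above.
demo = {
    "jobName": "spark-example",
    "taskRoles": [
        {
            "name": "submitter",
            "minSucceededTaskCount": None,
            "command": "spark-submit --conf spark.eventLog.dir="
                       "hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs",
        },
        {"name": "spark_history_server", "minSucceededTaskCount": None},
    ],
}
print(json.dumps(customize_job_config(demo, "10.0.0.1", quit_when_done=True),
                 indent=2))
```

In practice the input would come from `json.load` on your saved job config file, and the result would be written back with `json.dump` before submission.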
Note: If you write your own pyspark code, you must set the master with setMaster("yarn-master") instead of "local[*]". You must also upload your code to HDFS rather than build it into your Docker image, and then spark-submit the code from HDFS. Refer to the simple pyspark example and read its comments.
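For your own script, the spark-submit command can keep the shape of the submitter command in step 3. The sketch below assembles it from its parts so the relationship between the pieces is visible: the `#MY_CONDA` suffix on `--archives` names the directory into which YARN unpacks the archive in each container, which is why PYSPARK_PYTHON starts with the same alias. The function name `build_pyspark_submit`, its parameters, and the `my_job.py` path are hypothetical; the flags are copied from the command in step 3.

```python
def build_pyspark_submit(master_ip, script, data_paths,
                         archive="/user/core/spark-python.zip",
                         alias="MY_CONDA", env_name="spark-python", port=9000):
    """Assemble a spark-submit command for a script and data stored on HDFS."""
    hdfs = f"hdfs://{master_ip}:{port}"
    log_dir = f"{hdfs}/shared/spark-logs"
    parts = [
        "spark-submit",
        "--conf", "spark.eventLog.enabled=true",
        "--conf", f"spark.history.fs.logDirectory={log_dir}",
        "--conf", f"spark.eventLog.dir={log_dir}",
        # The interpreter path is relative to the unpacked archive alias.
        "--conf",
        f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={alias}/{env_name}/bin/python",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--archives", f"{hdfs}{archive}#{alias}",
        "--queue", "default",
        f"{hdfs}{script}",
    ]
    # Remaining arguments are passed to the script as HDFS paths.
    parts.extend(f"{hdfs}{path}" for path in data_paths)
    return " ".join(parts)


# Hypothetical script path; the data file is the one uploaded in step 1.
print(build_pyspark_submit(
    "YOUR_PAI_MASTER_IP",
    "/user/core/code/my_job.py",
    ["/user/core/data/mllib/sample_libsvm_data.txt"],
))
```

The printed string can be pasted into the "command" field of the submitter task role in place of the example command.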