Spark on PAI
This example demonstrates how to run Spark jobs on PAI.
1. Off-the-shelf example
1. Submit your Spark application
Below is a job config that runs the SparkPi Java example on PAI.
Note: Replace YOUR_PAI_MASTER_IP with your own master IP before submitting the job on PAI. If you want the job to exit after the Spark application finishes, change minSucceededTaskCount to 1.
{
"jobName": "spark-example",
"image": "openpai/spark-example",
"virtualCluster": "default",
"retryCount": 0,
"taskRoles": [
{
"name": "submitter",
"taskNumber": 1,
"cpuNumber": 1,
"memoryMB": 2048,
"shmMB": 64,
"gpuNumber": 0,
"minFailedTaskCount": 1,
"minSucceededTaskCount": null,
"command": "spark-submit --conf spark.eventLog.enabled=true --conf spark.history.fs.logDirectory=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs --conf spark.eventLog.dir=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 1g --executor-memory 2g --executor-cores 1 --queue default ${SPARK_HOME}/examples/jars/spark-examples*.jar 10",
"portList": []
},
{
"name": "spark_history_server",
"taskNumber": 1,
"cpuNumber": 1,
"memoryMB": 1024,
"shmMB": 64,
"gpuNumber": 0,
"minFailedTaskCount": 1,
"minSucceededTaskCount": null,
"command": "URL=http://${PAI_CURRENT_CONTAINER_IP}:${PAI_CONTAINER_HOST_history_server_PORT_LIST}/ && echo Please visit spark histroy server: ${URL} && SPARK_DAEMON_JAVA_OPTS=\"-Dspark.history.ui.port=${PAI_CONTAINER_HOST_history_server_PORT_LIST} -Dspark.history.fs.logDirectory=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs\" spark-class org.apache.spark.deploy.history.HistoryServer",
"portList": [
{
"label": "history_server",
"beginAt": 0,
"portNumber": 1
}
]
}
]
}
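For reference, the submitter's command can also be run by hand from any machine with Spark configured against the cluster. Below is a minimal sketch, assuming SPARK_HOME is set; PAI_MASTER_IP is a placeholder you must fill in yourself.
# Same spark-submit invocation as the submitter task, parameterized by PAI_MASTER_IP
export PAI_MASTER_IP=YOUR_PAI_MASTER_IP
spark-submit \
  --conf spark.eventLog.enabled=true \
  --conf spark.history.fs.logDirectory=hdfs://${PAI_MASTER_IP}:9000/shared/spark-logs \
  --conf spark.eventLog.dir=hdfs://${PAI_MASTER_IP}:9000/shared/spark-logs \
  --class org.apache.spark.examples.SparkPi \
  --master yarn --deploy-mode cluster \
  --driver-memory 1g --executor-memory 2g --executor-cores 1 \
  --queue default \
  ${SPARK_HOME}/examples/jars/spark-examples*.jar 10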
2. Visit Spark history server
On the job's detail page you can find the history server URL in the spark_history_server task's stdout log. In this example it is http://10.151.40.228:15692/, which you can open in a browser to visit the Spark history server.
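Besides the web UI, the Spark history server also exposes Spark's REST monitoring API, which is handy for scripted checks. For example, to list the completed applications (replace the host and port with your own history server URL):
curl http://10.151.40.228:15692/api/v1/applications
This returns a JSON array describing each application the history server knows about, including the SparkPi run above.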
2. Run your Python application
For a Python application, you will need to manage dependencies carefully. In the example below, we ship the dependencies to the cluster with the --archives parameter.
1. Prepare your data and code
Upload sample_libsvm_data.txt and gradient_boosted_tree_classifier_example.py to HDFS:
hdfs dfs -mkdir -p hdfs://YOUR_PAI_MASTER_IP:9000/user/core/data/mllib/
hdfs dfs -put sample_libsvm_data.txt hdfs://YOUR_PAI_MASTER_IP:9000/user/core/data/mllib/
hdfs dfs -mkdir -p hdfs://YOUR_PAI_MASTER_IP:9000/user/core/code
hdfs dfs -put gradient_boosted_tree_classifier_example.py hdfs://YOUR_PAI_MASTER_IP:9000/user/core/code/
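For reference, here is a minimal sketch of what a gradient-boosted-tree classifier script such as gradient_boosted_tree_classifier_example.py typically looks like. This is an illustrative sketch, not the repo's actual file; it assumes the libsvm data path is passed as the first command-line argument, matching the job config later in this section.
import sys
from pyspark.sql import SparkSession
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("gbt-classifier-example").getOrCreate()

# Load the libsvm-formatted data from the path given on the command line.
data = spark.read.format("libsvm").load(sys.argv[1])
train, test = data.randomSplit([0.7, 0.3], seed=42)

# Train a binary gradient-boosted-tree classifier and report test accuracy.
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxIter=10)
model = gbt.fit(train)
predictions = model.transform(test)
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
print("Test accuracy: %g" % evaluator.evaluate(predictions))

spark.stop()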
2. Generate your dependencies with a conda env
First, install conda.
sudo apt update --yes
sudo apt upgrade --yes
# Get Miniconda and make it the main Python interpreter
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p ~/miniconda
rm ~/miniconda.sh
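To use the fresh install in your current shell, add it to your PATH and verify it works (assuming the ~/miniconda prefix chosen above):
export PATH=~/miniconda/bin:$PATH
conda --version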
Then create a spark-python env with python3 and numpy installed.
conda create -n spark-python --copy -y -q python=3 numpy
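You can sanity-check the env before shipping it (again assuming the ~/miniconda prefix):
~/miniconda/envs/spark-python/bin/python -c "import numpy; print(numpy.__version__)"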
Finally, zip your environment and ship it to HDFS.
cd /YOUR_CONDA_HOME/envs
zip -r spark-python.zip spark-python
hdfs dfs -put spark-python.zip hdfs://YOUR_PAI_MASTER_IP:9000/user/core/
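Optionally, confirm the archive landed on HDFS:
hdfs dfs -ls hdfs://YOUR_PAI_MASTER_IP:9000/user/core/spark-python.zip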
3. Submit job on PAI
Note: Replace YOUR_PAI_MASTER_IP with your own master IP before submitting the job on PAI. If you want the job to exit after the Spark application finishes, change minSucceededTaskCount to 1.
{
"jobName": "spark-python-example",
"image": "openpai/spark-example",
"dataDir": "hdfs://YOUR_PAI_MASTER_IP:9000/user/core/data/mllib/",
"codeDir": "hdfs://YOUR_PAI_MASTER_IP:9000/user/core/code",
"virtualCluster": "default",
"retryCount": 0,
"taskRoles": [
{
"name": "submitter",
"taskNumber": 1,
"cpuNumber": 1,
"memoryMB": 2048,
"shmMB": 64,
"gpuNumber": 0,
"minFailedTaskCount": 1,
"minSucceededTaskCount": null,
"command": "spark-submit --conf spark.eventLog.enabled=true --conf spark.history.fs.logDirectory=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs --conf spark.eventLog.dir=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=MY_CONDA/spark-python/bin/python --master yarn --deploy-mode cluster --archives hdfs://YOUR_PAI_MASTER_IP:9000/user/core/spark-python.zip#MY_CONDA --queue default hdfs://YOUR_PAI_MASTER_IP:9000/user/core/code/gradient_boosted_tree_classifier_example.py hdfs://YOUR_PAI_MASTER_IP:9000/user/core/data/mllib/sample_libsvm_data.txt",
"portList": []
},
{
"name": "spark_history_server",
"taskNumber": 1,
"cpuNumber": 1,
"memoryMB": 1024,
"shmMB": 64,
"gpuNumber": 0,
"minFailedTaskCount": 1,
"minSucceededTaskCount": null,
"command": "URL=http://${PAI_CURRENT_CONTAINER_IP}:${PAI_CONTAINER_HOST_history_server_PORT_LIST}/ && echo Please visit spark histroy server: ${URL} && SPARK_DAEMON_JAVA_OPTS=\"-Dspark.history.ui.port=${PAI_CONTAINER_HOST_history_server_PORT_LIST} -Dspark.history.fs.logDirectory=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=hdfs://YOUR_PAI_MASTER_IP:9000/shared/spark-logs\" spark-class org.apache.spark.deploy.history.HistoryServer",
"portList": [
{
"label": "history_server",
"beginAt": 0,
"portNumber": 1
}
]
}
]
}
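A note on how the pieces fit together: --archives ships spark-python.zip to every YARN container, and the #MY_CONDA suffix tells YARN to unpack it under a link named MY_CONDA in each container's working directory. That is why spark.yarn.appMasterEnv.PYSPARK_PYTHON can point at MY_CONDA/spark-python/bin/python. Inside a running container the layout looks roughly like this (illustrative):
MY_CONDA/                          # spark-python.zip, unpacked under the '#' alias
MY_CONDA/spark-python/bin/python   # the interpreter PYSPARK_PYTHON resolves to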
4. Visit Spark history server
Same as in the previous section.
Note: If you want to write your own PySpark code, you must set yarn as the master, e.g. setMaster("yarn"), instead of "local[*]". Then, you must upload your code to HDFS rather than building it into your Docker image. Finally, you should spark-submit the code from its HDFS path. You can refer to the simple pyspark example and read the comments there.
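A minimal sketch of such a PySpark program is shown below; the app name and input path are illustrative placeholders. Upload the file to HDFS and pass its hdfs:// path to spark-submit, as in the config above.
from pyspark.sql import SparkSession

# Use the yarn master rather than local[*]; when launched via
# spark-submit --master yarn this setting is redundant but harmless.
spark = (SparkSession.builder
         .master("yarn")
         .appName("my-pyspark-job")
         .getOrCreate())

# Read input from HDFS: files baked into the Docker image are not
# visible to the YARN executors.
df = spark.read.text("hdfs://YOUR_PAI_MASTER_IP:9000/user/core/data/mllib/sample_libsvm_data.txt")
print("line count:", df.count())

spark.stop()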