layout | title |
---|---|
global | Launching Spark on YARN |
Spark 0.6 adds experimental support for running over a YARN (Hadoop
NextGen) cluster.
Because YARN depends on version 2.0 of the Hadoop libraries, this currently requires checking out a
separate branch of Spark, called yarn, which you can do as follows:
git clone git://github.com/mesos/spark
cd spark
git checkout -b yarn --track origin/yarn
Preparations
- In order to distribute Spark within the cluster, it must be packaged into a single JAR file. This can be done by running
sbt/sbt assembly
- Your application code must be packaged into a separate JAR file.
If you want to test out the YARN deployment mode, you can use the current Spark examples. A spark-examples_2.9.2-0.6.0-SNAPSHOT.jar file can be generated by running sbt/sbt package.
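Before launching, it can help to confirm that both JAR files were actually built. The following is a minimal shell sketch, assuming the build output paths used in the example later in this guide; the require_jar helper is a hypothetical name, not part of Spark.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark): succeeds only if the
# given file exists and is non-empty.
require_jar() {
  if [ -s "$1" ]; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Assumed build outputs; adjust them to match your checkout and versions.
SPARK_JAR=${SPARK_JAR:-./core/target/spark-core-assembly-0.6.0-SNAPSHOT.jar}
APP_JAR=${APP_JAR:-examples/target/scala-2.9.2/spark-examples_2.9.2-0.6.0-SNAPSHOT.jar}

# Point at the build step that produces each JAR if it is absent.
require_jar "$SPARK_JAR" || echo "run sbt/sbt assembly first" >&2
require_jar "$APP_JAR" || echo "run sbt/sbt package first" >&2
```

Both paths can be overridden through the environment, so the same check works for your own application JAR.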
Launching Spark on YARN
The command to launch the YARN Client is as follows:
SPARK_JAR=<SPARK_JAR_FILE> ./run spark.deploy.yarn.Client \
--jar <YOUR_APP_JAR_FILE> \
--class <APP_MAIN_CLASS> \
--args <APP_MAIN_ARGUMENTS> \
--num-workers <NUMBER_OF_WORKER_MACHINES> \
--worker-memory <MEMORY_PER_WORKER> \
--worker-cores <CORES_PER_WORKER>
For example:
SPARK_JAR=./core/target/spark-core-assembly-0.6.0-SNAPSHOT.jar ./run spark.deploy.yarn.Client \
--jar examples/target/scala-2.9.2/spark-examples_2.9.2-0.6.0-SNAPSHOT.jar \
--class spark.examples.SparkPi \
--args standalone \
--num-workers 3 \
--worker-memory 2g \
--worker-cores 2
The above starts a YARN client program that periodically polls the Application Master for status updates and displays them in the console. The client exits once your application has finished running.
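If you launch jobs often, the invocation can be wrapped in a small script so that every option is set in one place. This is only a sketch, not a tool shipped with Spark: the build_yarn_cmd function and the DRY_RUN variable are hypothetical names, and the default values are copied from the example in this guide.

```shell
#!/bin/sh
# Hypothetical wrapper. Only ./run and the spark.deploy.yarn.Client
# options come from this guide; everything else is illustrative.
SPARK_JAR=${SPARK_JAR:-./core/target/spark-core-assembly-0.6.0-SNAPSHOT.jar}
APP_JAR=${APP_JAR:-examples/target/scala-2.9.2/spark-examples_2.9.2-0.6.0-SNAPSHOT.jar}
APP_CLASS=${APP_CLASS:-spark.examples.SparkPi}
APP_ARGS=${APP_ARGS:-standalone}
NUM_WORKERS=${NUM_WORKERS:-3}
WORKER_MEMORY=${WORKER_MEMORY:-2g}
WORKER_CORES=${WORKER_CORES:-2}

# Print the full client command on a single line.
build_yarn_cmd() {
  echo "SPARK_JAR=$SPARK_JAR ./run spark.deploy.yarn.Client" \
    "--jar $APP_JAR --class $APP_CLASS --args $APP_ARGS" \
    "--num-workers $NUM_WORKERS --worker-memory $WORKER_MEMORY" \
    "--worker-cores $WORKER_CORES"
}

# DRY_RUN defaults to 1, so the sketch only prints the command.
# Set DRY_RUN=0 to actually start the YARN client.
if [ "${DRY_RUN:-1}" = "1" ]; then
  build_yarn_cmd
else
  SPARK_JAR="$SPARK_JAR" ./run spark.deploy.yarn.Client \
    --jar "$APP_JAR" --class "$APP_CLASS" --args "$APP_ARGS" \
    --num-workers "$NUM_WORKERS" --worker-memory "$WORKER_MEMORY" \
    --worker-cores "$WORKER_CORES"
fi
```

Running the script with, for instance, NUM_WORKERS=5 in the environment prints the same command with only that option changed, which makes it easy to review the command before launching.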
Important Notes
- When your application instantiates a Spark context, it must use the special "standalone" master URL. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "standalone" as an argument to your program, as shown in the example above.
- YARN does not support requesting container resources based on the number of cores. As a result, the number of cores requested via command-line arguments cannot be guaranteed.