Merge pull request #202 from andyk/doc

Updates to docs (including nav structure)
Matei Zaharia 2012-09-16 17:18:38 -07:00
Parent bcd7b40578 a92748d3ba
Commit 098ae55db1
10 changed files with 85 additions and 98 deletions

.gitignore vendored
View file

@ -14,6 +14,7 @@ conf/java-opts
conf/spark-env.sh
conf/log4j.properties
docs/_site
docs/api
target/
reports/
.project

View file

@ -36,21 +36,51 @@
<a class="brand" href="{{HOME_PATH}}index.html"></a>
<ul class="nav">
<!--TODO(andyk): Add class="active" attribute to li somehow.-->
<li><a href="{{HOME_PATH}}index.html">Home</a></li>
<li><a href="{{HOME_PATH}}programming-guide.html">Programming Guide</a></li>
<li><a href="{{HOME_PATH}}api.html">API (Scaladoc)</a></li>
<!--
<li><a href="{{HOME_PATH}}index.html">Getting Started</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Versions ({{ page.spark-version }})<b class="caret"></b></a>
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="#">Something else here</a></li>
<li><a href="{{HOME_PATH}}java-programming-guide.html">Java Programming Guide</a></li>
<li><a href="{{HOME_PATH}}scala-programming-guide.html">Scala Programming Guide</a></li>
<li><a href="{{HOME_PATH}}bagel-programming-guide.html">Bagel Programming Guide</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Deploying<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="{{HOME_PATH}}ec2-scripts.html">On Amazon EC2</a></li>
<li><a href="{{HOME_PATH}}running-on-mesos.html">On Mesos</a></li>
<li><a href="{{HOME_PATH}}running-on-yarn.html">On YARN</a></li>
<li><a href="{{HOME_PATH}}spark-standalone.html">Standalone Mode</a></li>
</ul>
</li>
<li class="dropdown">
<a href="{{HOME_PATH}}api.html" class="dropdown-toggle" data-toggle="dropdown">API (Scaladoc)<b class="caret"></b></a>
<ul class="dropdown-menu">
<!--<li><a href="#">Something else here</a></li>
<li class="divider"></li>
<li class="nav-header">Nav header</li>
<li><a href="#">Separated link</a></li>
<li><a href="#">One more separated link</a></li>
<li><a href="#">One more separated link</a></li>-->
<li><a href="api/core/index.html">Core</a></li>
<li><a href="api/examples/index.html">Examples</a></li>
<li><a href="api/repl/index.html">REPL</a></li>
<li><a href="api/bagel/index.html">Bagel</a></li>
</ul>
</li>
<li class="dropdown">
<a href="{{HOME_PATH}}api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="configuration.html">Configuration</a></li>
<li><a href="spark-debugger.html">Debugger</a></li>
<li><a href="contributing-to-spark.html">Contributing</a></li>
</ul>
</li>
-->
</ul>
</div>
</div>

View file

@ -2,8 +2,6 @@
layout: global
title: Spark Configuration
---
# Spark Configuration
Spark is configured primarily through the `conf/spark-env.sh` script. This script doesn't exist in the Git repository, but you can create it by copying `conf/spark-env.sh.template`. Make sure the script is executable.
Inside this script, you can set several environment variables:
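For illustration, a minimal `conf/spark-env.sh` might look like the sketch below. The variable names shown here (`SCALA_HOME`, `MESOS_HOME`, `SPARK_MEM`, `SPARK_JAVA_OPTS`) are examples rather than an authoritative list, so check the variable descriptions in this guide before relying on them.

{% highlight bash %}
#!/usr/bin/env bash
# Illustrative conf/spark-env.sh -- variable names are examples, not an exhaustive list.
export SCALA_HOME=/usr/local/scala-2.9.2        # where Scala is installed
export MESOS_HOME=/usr/local/mesos              # only needed when running on Mesos
export SPARK_MEM=4g                             # memory to allocate per node
export SPARK_JAVA_OPTS="-Dspark.local.dir=/mnt/spark"   # extra JVM options
{% endhighlight %}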

View file

@ -2,7 +2,9 @@
layout: global
title: Using the Spark EC2 Scripts
---
The `spark-ec2` script located in the Spark's `ec2` directory allows you
This guide describes how to get Spark running on an EC2 cluster, including how to launch clusters, how to run jobs on them, and how to shut them down. It assumes you have already signed up for an Amazon EC2 account on the [Amazon Web Services site](http://aws.amazon.com/).
The `spark-ec2` script, located in Spark's `ec2` directory, allows you
to launch, manage and shut down Spark clusters on Amazon EC2. It builds
on the [Mesos EC2 script](https://github.com/mesos/mesos/wiki/EC2-Scripts)
in Apache Mesos.
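For orientation, the basic lifecycle looks roughly like the sketch below; the keypair, key file, and cluster names are placeholders, and the `launch`, `login`, and `destroy` actions are described in the sections that follow.

{% highlight bash %}
cd ec2
# Launch a cluster (keypair name, key file, and cluster name are placeholders)
./spark-ec2 -k my-keypair -i ~/.ssh/my-keypair.pem launch my-spark-cluster
# Log into the cluster's master node
./spark-ec2 -k my-keypair -i ~/.ssh/my-keypair.pem login my-spark-cluster
# Shut the cluster down when finished (data on the nodes is not recoverable!)
./spark-ec2 destroy my-spark-cluster
{% endhighlight %}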
@ -19,11 +21,8 @@ for you based on the cluster name you request. You can also use them to
identify machines belonging to each cluster in the EC2 Console or
ElasticFox.
This guide describes how to get set up to run clusters, how to launch
clusters, how to run jobs on them, and how to shut them down.
Before You Start
================
# Before You Start
- Create an Amazon EC2 key pair for yourself. This can be done by
logging into your Amazon Web Services account through the [AWS
@ -37,8 +36,7 @@ Before You Start
obtained from the [AWS homepage](http://aws.amazon.com/) by clicking
Account \> Security Credentials \> Access Credentials.
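One common way to supply these credentials is through environment variables; a sketch is below, though the exact variable names (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) are an assumption here rather than something this page specifies.

{% highlight bash %}
# Assumed variable names -- confirm against the spark-ec2 documentation before use.
export AWS_ACCESS_KEY_ID=AKIAEXAMPLEKEYID
export AWS_SECRET_ACCESS_KEY=ExampleSecretAccessKey123
{% endhighlight %}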
Launching a Cluster
===================
# Launching a Cluster
- Go into the `ec2` directory in the release of Spark you downloaded.
- Run
@ -75,8 +73,7 @@ available.
permissions on your private key file, you can run `launch` with the
`--resume` option to restart the setup process on an existing cluster.
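For example, resuming a partially set-up cluster could look like the line below (names are placeholders, and the exact placement of `--resume` relative to the `launch` action may differ):

{% highlight bash %}
# Retry setup on an existing cluster instead of launching new instances.
./spark-ec2 -k my-keypair -i ~/.ssh/my-keypair.pem --resume launch my-spark-cluster
{% endhighlight %}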
Running Jobs
============
# Running Jobs
- Go into the `ec2` directory in the release of Spark you downloaded.
- Run `./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>` to
@ -102,8 +99,7 @@ Running Jobs
- Finally, if you get errors while running your jobs, look at the slave's logs
for that job using the Mesos web UI (`http://<master-hostname>:8080`).
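Putting these steps together, a session might look roughly like the sketch below. The `~/mesos-ec2/cluster-url` file and the `/root/spark` path are assumptions based on the Mesos EC2 setup these scripts build on.

{% highlight bash %}
# From your local machine: log into the cluster's master node.
./spark-ec2 -k my-keypair -i ~/.ssh/my-keypair.pem login my-spark-cluster

# On the master node: look up the Mesos master URL and start a Spark shell.
cat ~/mesos-ec2/cluster-url   # assumed location of the cluster's master URL
cd /root/spark                # assumed Spark location on the EC2 machine images
./spark-shell
{% endhighlight %}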
Terminating a Cluster
=====================
# Terminating a Cluster
***Note that there is no way to recover data on EC2 nodes after shutting
them down! Make sure you have copied everything important off the nodes
@ -112,8 +108,7 @@ before stopping them.***
- Go into the `ec2` directory in the release of Spark you downloaded.
- Run `./spark-ec2 destroy <cluster-name>`.
Pausing and Restarting Clusters
===============================
# Pausing and Restarting Clusters
The `spark-ec2` script also supports pausing a cluster. In this case,
the VMs are stopped but not terminated, so they
@ -130,8 +125,7 @@ storage.
`./spark-ec2 destroy <cluster-name>` as described in the previous
section.
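The corresponding commands are roughly as sketched below; the `stop` and `start` action names are assumptions here, so check the script's usage output for the exact actions it supports.

{% highlight bash %}
# Assumed action names -- verify against the script's usage output.
./spark-ec2 stop my-spark-cluster
./spark-ec2 -i ~/.ssh/my-keypair.pem start my-spark-cluster
{% endhighlight %}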
Limitations
===========
# Limitations
- `spark-ec2` currently only launches machines in the US-East region of EC2.
It should not be hard to make it launch VMs in other zones, but you will need
@ -144,3 +138,13 @@ Limitations
If you have a patch or suggestion for one of these limitations, feel free to
[contribute]({{HOME_PATH}}contributing-to-spark.html) it!
# Using a Newer Spark Version
The Spark EC2 machine images may not come with the latest version of Spark. To use a newer version, run `git pull` in `/root/spark` to pull in the latest version of Spark from `git`, and then build it using `sbt/sbt compile`. You will also need to copy it to all the other nodes in the cluster using `~/mesos-ec2/copy-dir /root/spark`.
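Concretely, the update described above amounts to something like the following, run on the cluster's master node:

{% highlight bash %}
cd /root/spark
git pull                             # pull in the latest version of Spark
sbt/sbt compile                      # rebuild it
~/mesos-ec2/copy-dir /root/spark     # copy the updated build to all other nodes
{% endhighlight %}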
# Accessing Data in S3
Spark's file interface allows it to process data in Amazon S3 using the same URI formats that are supported for Hadoop. You can specify a path in S3 as input through a URI of the form `s3n://<id>:<secret>@<bucket>/path`, where `<id>` is your Amazon access key ID and `<secret>` is your Amazon secret access key. Note that you should escape any `/` characters in the secret key as `%2F`. Full instructions can be found on the [Hadoop S3 page](http://wiki.apache.org/hadoop/AmazonS3).
In addition to using a single input file, you can also use a directory of files as input by simply giving the path to the directory.

View file

@ -3,7 +3,11 @@ layout: global
title: Spark Overview
---
Spark is a MapReduce-like cluster computing framework designed to support low-latency iterative jobs and interactive use from an interpreter. It is written in [Scala](http://www.scala-lang.org), a high-level language for the JVM, and exposes a clean language-integrated syntax that makes it easy to write parallel jobs. Spark runs on top of the [Apache Mesos](http://incubator.apache.org/mesos/) cluster manager.
{% comment %}
TODO(andyk): Rewrite to make the Java API a first class part of the story.
{% endcomment %}
Spark is a MapReduce-like cluster computing framework designed to support low-latency iterative jobs and interactive use from an interpreter. It is written in [Scala](http://www.scala-lang.org), a high-level language for the JVM, and exposes a clean language-integrated syntax that makes it easy to write parallel jobs. Spark runs on top of the [Apache Mesos](http://incubator.apache.org/mesos/) cluster manager, on Hadoop YARN, or without an external resource manager (i.e., in "standalone mode").
# Downloading
@ -51,11 +55,11 @@ of `project/SparkBuild.scala`, then rebuilding Spark (`sbt/sbt clean compile`).
# Where to Go from Here
* [Spark Programming Guide]({{HOME_PATH}}programming-guide.html): how to get started using Spark, and details on the API
* [Running Spark on Amazon EC2]({{HOME_PATH}}running-on-amazon-ec2.html): scripts that let you launch a cluster on EC2 in about 5 minutes
* [Running Spark on Amazon EC2]({{HOME_PATH}}ec2-scripts.html): scripts that let you launch a cluster on EC2 in about 5 minutes
* [Running Spark on Mesos]({{HOME_PATH}}running-on-mesos.html): instructions on how to deploy to a private cluster
* [Running Spark on YARN]({{HOME_PATH}}running-on-yarn.html): instructions on how to run Spark on top of a YARN cluster
* [Spark Standalone Mode]({{HOME_PATH}}spark-standalone.html): instructions on running Spark without Mesos
* [Configuration]({{HOME_PATH}}configuration.html)
* [Configuration]({{HOME_PATH}}configuration.html): how to set up and customize Spark via its configuration system
* [Bagel Programming Guide]({{HOME_PATH}}bagel-programming-guide.html): implementation of Google's Pregel on Spark
* [Spark Debugger]({{HOME_PATH}}spark-debugger.html): experimental work on a debugger for Spark jobs
* [Contributing to Spark](contributing-to-spark.html)
@ -63,7 +67,7 @@ of `project/SparkBuild.scala`, then rebuilding Spark (`sbt/sbt clean compile`).
# Other Resources
* [Spark Homepage](http://www.spark-project.org)
* [AMPCamp](http://ampcamp.berkeley.edu/): All AMPCamp presentation videos are available online. Going through the videos and exercises is a great way to sharpen your Spark skills.
* [AMP Camp](http://ampcamp.berkeley.edu/) - In 2012, the AMP Lab hosted the first AMP Camp which featured talks and hands-on exercises about Spark, Shark, Mesos, and more. [Videos, slides](http://ampcamp.berkeley.edu/agenda) and the [exercises](http://ampcamp.berkeley.edu/exercises) are all available online now. Going through the videos and exercises is a great way to sharpen your Spark skills.
* [Paper describing the programming model](http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf)
* [Code Examples](http://spark-project.org/examples.html) (more also available in the [examples subfolder](https://github.com/mesos/spark/tree/master/examples/src/main/scala/spark/examples) of the Spark codebase)
* [Mailing List](http://groups.google.com/group/spark-users)

View file

@ -0,0 +1,5 @@
---
layout: global
title: Java Programming Guide
---
TODO: Write Java programming guide!

View file

@ -1,29 +0,0 @@
---
layout: global
title: Running Spark on Amazon EC2
---
This guide describes how to get Spark running on an EC2 cluster. It assumes you have already signed up for an Amazon EC2 account on the [Amazon Web Services site](http://aws.amazon.com/).
# For Spark 0.5
Spark now includes some [EC2 Scripts]({{HOME_PATH}}ec2-scripts.html) for launching and managing clusters on EC2. You can typically launch a cluster in about five minutes. Follow the instructions at this link for details.
# For older versions of Spark
Older versions of Spark use the EC2 launch scripts included in Mesos. You can use them as follows:
- Download Mesos using the instructions on the [Mesos wiki](http://github.com/mesos/mesos/wiki). There's no need to compile it.
- Launch a Mesos EC2 cluster following the [EC2 guide on the Mesos wiki](http://github.com/mesos/mesos/wiki/EC2-Scripts). (Essentially, this involves setting some environment variables and running a Python script.)
- Log into your EC2 cluster's master node using `mesos-ec2 -k <keypair> -i <key-file> login cluster-name`.
- Go into the `spark` directory in `root`'s home directory.
- Run either `spark-shell` or another Spark program, setting the Mesos master to use to `master@<ec2-master-node>:5050`. You can also find this master URL in the file `~/mesos-ec2/cluster-url` in newer versions of Mesos.
- Use the Mesos web UI at `http://<ec2-master-node>:8080` to view the status of your job.
# Using a Newer Spark Version
The Spark EC2 machines may not come with the latest version of Spark. To use a newer version, run `git pull` in `/root/spark` to pull in the latest version of Spark from `git`, and build it using `sbt/sbt compile`. You will also need to copy it to all the other nodes in the cluster using `~/mesos-ec2/copy-dir /root/spark`.
# Accessing Data in S3
Spark's file interface allows it to process data in Amazon S3 using the same URI formats that are supported for Hadoop. You can specify a path in S3 as input through a URI of the form `s3n://<id>:<secret>@<bucket>/path`, where `<id>` is your Amazon access key ID and `<secret>` is your Amazon secret access key. Note that you should escape any `/` characters in the secret key as `%2F`. Full instructions can be found on the [Hadoop S3 page](http://wiki.apache.org/hadoop/AmazonS3).
In addition to using a single input file, you can also use a directory of files as input by simply giving the path to the directory.

View file

@ -5,8 +5,6 @@ title: Running Spark on Mesos
To run on a cluster, Spark uses the [Apache Mesos](http://incubator.apache.org/mesos/) resource manager. Follow the steps below to install Mesos and Spark:
### For Spark 0.5:
1. Download and build Spark using the instructions [here]({{ HOME_DIR }}Home).
2. Download Mesos 0.9.0 from a [mirror](http://www.apache.org/dyn/closer.cgi/incubator/mesos/mesos-0.9.0-incubating/).
3. Configure Mesos using the `configure` script, passing the location of your `JAVA_HOME` using `--with-java-home`. Mesos comes with "template" configure scripts for different platforms, such as `configure.macosx`, that you can run. See the README file in Mesos for other options. **Note:** If you want to run Mesos without installing it into the default paths on your system (e.g. if you don't have administrative privileges to install it), you should also pass the `--prefix` option to `configure` to tell it where to install. For example, pass `--prefix=/home/user/mesos`. By default the prefix is `/usr/local`.
@ -26,39 +24,9 @@ To run on a cluster, Spark uses the [Apache Mesos](http://incubator.apache.org/m
new SparkContext("HOST:5050", "My Job Name", "/home/user/spark", List("my-job.jar"))
{% endhighlight %}
### For Spark versions before 0.5:
1. Download and build Spark using the instructions [here]({{ HOME_DIR }}Home).
2. Download either revision 1205738 of Mesos if you're using the master branch of Spark, or the pre-protobuf branch of Mesos if you're using Spark 0.3 or earlier (note that for new users, _we recommend the master branch instead of 0.3_). For revision 1205738 of Mesos, use:
{% highlight bash %}
svn checkout -r 1205738 http://svn.apache.org/repos/asf/incubator/mesos/trunk mesos
{% endhighlight %}
For the pre-protobuf branch (for Spark 0.3 and earlier), use:
{% highlight bash %}
git clone git://github.com/mesos/mesos
cd mesos
git checkout --track origin/pre-protobuf
{% endhighlight %}
3. Configure Mesos using the `configure` script, passing the location of your `JAVA_HOME` using `--with-java-home`. Mesos comes with "template" configure scripts for different platforms, such as `configure.template.macosx`, so you can just run the one on your platform if it exists. See the [Mesos wiki](https://github.com/mesos/mesos/wiki) for other configuration options.
4. Build Mesos using `make`.
5. In Spark's `conf/spark-env.sh` file, add `export MESOS_HOME=<path to Mesos directory>`. If you don't have a `spark-env.sh`, copy `conf/spark-env.sh.template`. You should also set `SCALA_HOME` there if it's not on your system's default path.
6. Copy Spark and Mesos to the _same_ path on all the nodes in the cluster.
7. Configure Mesos for deployment:
* On your master node, edit `MESOS_HOME/conf/masters` to list your master and `MESOS_HOME/conf/slaves` to list the slaves. Also, edit `MESOS_HOME/conf/mesos.conf` and add the line `failover_timeout=1` to change a timeout parameter that is too high by default.
* Run `MESOS_HOME/deploy/start-mesos` to start it up. If all goes well, you should see Mesos's web UI on port 8080 of the master machine.
* See Mesos's [deploy instructions](https://github.com/mesos/mesos/wiki/Deploy-Scripts) for more information on deploying it.
8. To run a Spark job against the cluster, when you create your `SparkContext`, pass the string `master@HOST:5050` as the first parameter, where `HOST` is the machine running your Mesos master. In addition, pass the location of Spark on your nodes as the third parameter, and a list of JAR files containing your JAR's code as the fourth (these will automatically get copied to the workers). For example:
{% highlight scala %}
new SparkContext("master@HOST:5050", "My Job Name", "/home/user/spark", List("my-job.jar"))
{% endhighlight %}
## Running on Amazon EC2
If you want to run Spark on Amazon EC2, there's an easy way to launch a cluster with Mesos, Spark, and HDFS pre-configured: the [EC2 launch scripts]({{HOME_PATH}}running-on-amazon-ec2.html). This will get you a cluster in about five minutes without any configuration on your part.
If you want to run Spark on Amazon EC2, you can use the Spark [EC2 launch scripts]({{HOME_PATH}}ec2-scripts.html), which provide an easy way to launch a cluster with Mesos, Spark, and HDFS pre-configured. This will get you a cluster in about five minutes without any configuration on your part.
## Running Alongside Hadoop

View file

View file

@ -3,11 +3,18 @@ layout: global
title: Spark Standalone Mode
---
In addition to running on top of [Mesos](https://github.com/mesos/mesos), Spark also supports a standalone mode, consisting of one Spark master and several Spark worker processes. You can run the Spark standalone mode either locally or on a cluster. If you wish to run a Spark Amazon EC2 cluster using standalone mode, we have provided a set of scripts that make it easy to do so.
{% comment %}
TODO(andyk):
- Add a table of contents
- Move configuration towards the end so that it doesn't come first
- Say the scripts will guess the resource amounts (i.e. # cores) automatically
{% endcomment %}
In addition to running on top of [Mesos](https://github.com/mesos/mesos), Spark also supports a standalone mode, consisting of one Spark master and several Spark worker processes. You can run the Spark standalone mode either locally or on a cluster. If you wish to run a Spark Amazon EC2 cluster using standalone mode, we have provided [a set of scripts](ec2-scripts.html) that make it easy to do so.
## Getting Started
Download and compile Spark as described in the [README](https://github.com/mesos/spark/wiki). You do not need to install mesos on your machine if you are using the standalone mode.
Download and compile Spark as described in the [Getting Started Guide](index.html). You do not need to install Mesos on your machine if you are using the standalone mode.
## Standalone Mode Configuration
@ -53,13 +60,13 @@ The following options can be passed to the worker:
Spark offers a web-based user interface in standalone mode. The master and each worker have their own web UI that shows cluster and job statistics. By default, you can access the web UI for the master at port 8080. The port can be changed either in the configuration file or via command-line options.
Detailed log output for the jobs is by default written to the `work/` by default.
Detailed log output for the jobs is written to the `work` directory by default.
## Running on a Cluster
In order to run a Spark standalone cluster, there are two main points of configuration: the `conf/spark-env.sh` file (described above) and the `conf/slaves` file. The `conf/spark-env.sh` file lets you specify global settings for the master and slave instances, such as memory or port numbers to bind to. We assume that all your machines share the same configuration parameters.
The `conf/slaves` file contains a list of all machines where you would like to start a Spark slave (worker) instance when using then scripts below. The master machine must be able to access each of the slave machines via ssh. For testing purposes, you can have a single `localhost` entry in the slaves file.
The `conf/slaves` file contains a list of all machines where you would like to start a Spark slave (worker) instance when using the scripts below. The master machine must be able to access each of the slave machines via ssh. For testing purposes, you can have a single `localhost` entry in the slaves file.
In order to make starting master and slave instances easier, we have provided Hadoop-style shell scripts. The scripts can be found in the `bin` directory. A quick overview:
@ -72,9 +79,8 @@ In order to make starting master and slave instances easier, we have provided Ha
Note that the scripts must be executed on the machine you want to start the Spark master on, not your local machine.
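As a rough sketch, bringing up a small standalone cluster from the master machine might look like the commands below. The script names (`start-master.sh`, `start-slaves.sh`, `stop-all.sh`) are assumptions here; the overview above lists the actual scripts provided in `bin`.

{% highlight bash %}
# conf/slaves: one worker hostname per line (a single "localhost" entry works for testing).
echo "localhost" > conf/slaves

# Assumed Hadoop-style script names -- verify against the overview of scripts in bin/.
bin/start-master.sh     # start a master instance on this machine
bin/start-slaves.sh     # start a worker on every machine listed in conf/slaves
bin/stop-all.sh         # stop both the master and the workers
{% endhighlight %}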
{% comment %}
## EC2 Scripts
To save you from needing to set up a cluster of Spark machines yourself, we provide a set of scripts that launch Amazon EC2 instances with a preinstalled Spark distribution. These scripts are identical to the [EC2 Mesos Scripts](https://github.com/mesos/spark/wiki/EC2-Scripts), except that you need to execute `ec2/spark-ec2` with the following additional parameters: `--cluster-type standalone -a standalone`. Note that the Spark version on these machines may not reflect the latest changes, so it may be a good idea to ssh into the machines and merge the latest version from github.
{% endcomment %}