Mirror of https://github.com/microsoft/spark.git

Add `pyspark` script to replace the other scripts.
Expand the PySpark programming guide.

Parent: b58340dbd9
Commit: ce9f1bbe20
@@ -24,6 +24,35 @@ There are a few key differences between the Python and Scala APIs:
 - `sample`
 - `sort`
 
+In PySpark, RDDs support the same methods as their Scala counterparts but take Python functions and return Python collection types.
+Short functions can be passed to RDD methods using Python's [`lambda`](http://www.diveintopython.net/power_of_introspection/lambda_functions.html) syntax:
+
+{% highlight python %}
+logData = sc.textFile(logFile).cache()
+errors = logData.filter(lambda s: 'ERROR' in s.split())
+{% endhighlight %}
+
+You can also pass functions that are defined using the `def` keyword; this is useful for more complicated functions that cannot be expressed using `lambda`:
+
+{% highlight python %}
+def is_error(line):
+    return 'ERROR' in line.split()
+errors = logData.filter(is_error)
+{% endhighlight %}
+
+Functions can access objects in enclosing scopes, although modifications to those objects within RDD methods will not be propagated to other tasks:
+
+{% highlight python %}
+error_keywords = ["Exception", "Error"]
+def is_error(line):
+    words = line.split()
+    return any(keyword in words for keyword in error_keywords)
+errors = logData.filter(is_error)
+{% endhighlight %}
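To make that caveat concrete, here is a sketch that extends the example above (the `matched_lines` list is illustrative and not part of the change shown): appending to a list from the enclosing scope inside the filter function only changes each task's private copy.

{% highlight python %}
matched_lines = []

def is_error(line):
    words = line.split()
    if any(keyword in words for keyword in error_keywords):
        matched_lines.append(line)  # appends to a per-task copy of the list
        return True
    return False

errors = logData.filter(is_error)
errors.count()
print len(matched_lines)  # still 0 in the driver program
{% endhighlight %}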
+
+PySpark will automatically ship these functions to workers, along with any objects that they reference.
+Instances of classes will be serialized and shipped to workers by PySpark, but classes themselves cannot be automatically distributed to workers.
+The [Standalone Use](#standalone-use) section describes how to ship code dependencies to workers.
+
 # Installing and Configuring PySpark
 
@@ -34,13 +63,14 @@ By default, PySpark's scripts will run programs using `python`; an alternate Pyt
 
 All of PySpark's library dependencies, including [Py4J](http://py4j.sourceforge.net/), are bundled with PySpark and automatically imported.
 
-Standalone PySpark jobs should be run using the `run-pyspark` script, which automatically configures the Java and Python environmnt using the settings in `conf/spark-env.sh`.
+Standalone PySpark jobs should be run using the `pyspark` script, which automatically configures the Java and Python environment using the settings in `conf/spark-env.sh`.
 The script automatically adds the `pyspark` package to the `PYTHONPATH`.
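For example, assuming a standalone job saved as `my_script.py` (a placeholder name, not a file from this change), it would be launched with:

{% highlight shell %}
$ ./pyspark my_script.py
{% endhighlight %}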
 
 
 # Interactive Use
 
-PySpark's `pyspark-shell` script provides a simple way to learn the API:
+The `pyspark` script launches a Python interpreter that is configured to run PySpark jobs.
+When run without any input files, `pyspark` launches a shell that can be used to explore data interactively, which is a simple way to learn the API:
 
 {% highlight python %}
 >>> words = sc.textFile("/usr/share/dict/words")
@@ -48,9 +78,18 @@ PySpark's `pyspark-shell` script provides a simple way to learn the API:
 [u'spar', u'sparable', u'sparada', u'sparadrap', u'sparagrass']
 {% endhighlight %}
 
+By default, the `pyspark` shell creates a SparkContext that runs jobs locally.
+To connect to a non-local cluster, set the `MASTER` environment variable.
+For example, to use the `pyspark` shell with a [standalone Spark cluster](spark-standalone.html):
+
+{% highlight shell %}
+$ MASTER=spark://IP:PORT ./pyspark
+{% endhighlight %}
+
+
 # Standalone Use
 
-PySpark can also be used from standalone Python scripts by creating a SparkContext in the script and running the script using the `run-pyspark` script in the `pyspark` directory.
+PySpark can also be used from standalone Python scripts by creating a SparkContext in your script and running the script using `pyspark`.
 The Quick Start guide includes a [complete example](quick-start.html#a-standalone-job-in-python) of a standalone Python job.
 
 Code dependencies can be deployed by listing them in the `pyFiles` option in the SparkContext constructor:
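The code block that follows this line falls outside the diff context shown here; as an illustrative sketch (the job name and file name are placeholders), such a constructor call could look like:

{% highlight python %}
from pyspark.context import SparkContext

# Ship helpers.py to the workers along with the job:
sc = SparkContext("local", "My Job", pyFiles=["helpers.py"])
{% endhighlight %}

As the next hunk's context line notes, a dependency can also be added to an existing SparkContext with its `addPyFile()` method.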
@@ -65,8 +104,8 @@ Code dependencies can be added to an existing SparkContext using its `addPyFile(
 
 # Where to Go from Here
 
-PySpark includes several sample programs using the Python API in `pyspark/examples`.
-You can run them by passing the files to the `pyspark-run` script included in PySpark -- for example `./pyspark-run examples/wordcount.py`.
+PySpark includes several sample programs using the Python API in `python/examples`.
+You can run them by passing the files to the `pyspark` script -- for example `./pyspark python/examples/wordcount.py`.
 Each example program prints usage help when run without any arguments.
 
 We currently provide [API documentation](api/pyspark/index.html) for the Python API as Epydoc.
@@ -258,11 +258,11 @@ We can pass Python functions to Spark, which are automatically serialized along
 For jobs that use custom classes or third-party libraries, we can add those code dependencies to SparkContext to ensure that they will be available on remote machines; this is described in more detail in the [Python programming guide](python-programming-guide).
 `SimpleJob` is simple enough that we do not need to specify any code dependencies.
 
-We can run this job using the `run-pyspark` script in `$SPARK_HOME/pyspark`:
+We can run this job using the `pyspark` script:
 
 {% highlight python %}
 $ cd $SPARK_HOME
-$ ./pyspark/run-pyspark SimpleJob.py
+$ ./pyspark SimpleJob.py
 ...
 Lines with a: 8422, Lines with b: 1836
 {% endhighlight python %}
@@ -20,6 +20,10 @@ export PYSPARK_PYTHON
 # Add the PySpark classes to the Python path:
 export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
 
+# Load the PySpark shell.py script when ./pyspark is used interactively:
+export OLD_PYTHONSTARTUP=$PYTHONSTARTUP
+export PYTHONSTARTUP=$FWDIR/python/pyspark/shell.py
+
 # Launch with `scala` by default:
 if [[ "$SPARK_LAUNCH_WITH_SCALA" != "0" ]] ; then
   export SPARK_LAUNCH_WITH_SCALA=1
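The `PYSPARK_PYTHON` variable referenced in the hunk header above is how an alternate Python executable is selected for the PySpark scripts; as a sketch (the interpreter path is a placeholder), `conf/spark-env.sh` could contain:

{% highlight shell %}
# Run PySpark with a specific interpreter instead of the default `python`:
export PYSPARK_PYTHON=/usr/bin/python2.7
{% endhighlight %}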
@@ -1,3 +0,0 @@
-#!/usr/bin/env bash
-FWDIR="`dirname $0`"
-exec $FWDIR/run-pyspark $FWDIR/python/pyspark/shell.py "$@"
@@ -1,33 +1,17 @@
 """
 An interactive shell.
-"""
-import optparse  # I prefer argparse, but it's not included with Python < 2.7
-import code
-import sys
+
+This file is designed to be launched as a PYTHONSTARTUP script.
+"""
+import os
 from pyspark.context import SparkContext
 
 
-def main(master='local', ipython=False):
-    sc = SparkContext(master, 'PySparkShell')
-    user_ns = {'sc' : sc}
-    banner = "Spark context avaiable as sc."
-    if ipython:
-        import IPython
-        IPython.embed(user_ns=user_ns, banner2=banner)
-    else:
-        print banner
-        code.interact(local=user_ns)
+sc = SparkContext(os.environ.get("MASTER", "local"), "PySparkShell")
+print "Spark context avaiable as sc."
 
 
-if __name__ == '__main__':
-    usage = "usage: %prog [options] master"
-    parser = optparse.OptionParser(usage=usage)
-    parser.add_option("-i", "--ipython", help="Run IPython shell",
-                      action="store_true")
-    (options, args) = parser.parse_args()
-    if len(sys.argv) > 1:
-        master = args[0]
-    else:
-        master = 'local'
-    main(master, options.ipython)
+# The ./pyspark script stores the old PYTHONSTARTUP value in OLD_PYTHONSTARTUP,
+# which allows us to execute the user's PYTHONSTARTUP file:
+_pythonstartup = os.environ.get('OLD_PYTHONSTARTUP')
+if _pythonstartup and os.path.isfile(_pythonstartup):
+    execfile(_pythonstartup)
@@ -0,0 +1,9 @@
+#!/usr/bin/env bash
+
+# Figure out where the Scala framework is installed
+FWDIR="$(cd `dirname $0`; cd ../; pwd)"
+
+$FWDIR/pyspark pyspark/rdd.py
+$FWDIR/pyspark -m doctest pyspark/broadcast.py
+
+# TODO: in the long-run, it would be nice to use a test runner like `nose`.
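The two commands above exercise doctests: `pyspark/rdd.py` presumably runs its own tests when executed directly, while `-m doctest` collects the interactive examples in `pyspark/broadcast.py`. As a sketch of the pattern being checked (this function is hypothetical, not taken from either module):

{% highlight python %}
def double_values(values):
    """
    Return a new list with every value doubled.

    >>> double_values([1, 2, 3])
    [2, 4, 6]
    """
    return [2 * v for v in values]

if __name__ == "__main__":
    import doctest
    doctest.testmod()  # verifies the interactive examples in the docstrings
{% endhighlight %}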