Matei Zaharia
72ff62a37c
Two fixes to IPython support:
...
- Don't attempt to run worker processes with ipython (that can cause
some crashes as ipython prints things to standard out)
- Allow passing some IPYTHON_OPTS to launch things like the notebook
2013-07-28 22:23:13 -04:00
Matei Zaharia
af3c9d5042
Add Apache license headers and LICENSE and NOTICE files
2013-07-16 17:21:33 -07:00
Nick Pentreath
21d3946d17
Adding IPYTHON environment variable support for launching pyspark using ipython shell
2013-02-07 16:54:31 +02:00
Matei Zaharia
892c32a14b
Warn users if they run pyspark or spark-shell without compiling Spark
2013-01-17 11:14:47 -08:00
Josh Rosen
ce9f1bbe20
Add `pyspark` script to replace the other scripts.
...
Expand the PySpark programming guide.
2013-01-01 21:25:49 -08:00
Josh Rosen
b58340dbd9
Rename top-level 'pyspark' directory to 'python'
2013-01-01 15:05:00 -08:00
Josh Rosen
170e451fbd
Minor documentation and style fixes for PySpark.
2013-01-01 13:52:14 -08:00
Josh Rosen
6f6a6b79c4
Launch with `scala` by default in run-pyspark
2012-12-31 14:57:18 -08:00
Josh Rosen
099898b439
Port LR example to PySpark using numpy.
...
This version of the example crashes after the first iteration with
"OverflowError: math range error" because Python's math.exp()
behaves differently than Scala's; see SPARK-646.
2012-12-29 18:00:28 -08:00
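The overflow described in the commit above can be reproduced outside Spark. A minimal sketch using only the standard library: `overflows` and `safe_exp` are hypothetical helpers written for illustration, not code from the commit or from the SPARK-646 fix. Python's `math.exp()` raises `OverflowError` when the result does not fit in a float, whereas the JVM's `Math.exp` returns `Infinity` for the same input.

```python
import math


def overflows(x):
    """Return True if math.exp(x) raises OverflowError."""
    try:
        math.exp(x)
        return False
    except OverflowError:
        return True


def safe_exp(x):
    """Hypothetical helper: saturate to +inf instead of raising,
    which mirrors what the JVM's Math.exp does for large inputs."""
    try:
        return math.exp(x)
    except OverflowError:
        return float("inf")
```

Whether saturating is the right behaviour for the logistic regression example is an assumption; it is shown here only to make the difference between the two runtimes concrete.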
Josh Rosen
39dd953fd8
Add test for pyspark.RDD.saveAsTextFile().
2012-12-29 17:06:50 -08:00
Josh Rosen
59195c68ec
Update PySpark for compatibility with TaskContext.
2012-12-29 16:01:03 -08:00
Josh Rosen
26186e2d25
Use batching in pyspark parallelize(); fix cartesian()
2012-12-29 15:34:57 -08:00
Josh Rosen
6ee1ff2663
Fix bug in pyspark.serializers.batch; add .gitignore.
2012-12-29 22:25:34 +00:00
Josh Rosen
c2b105af34
Add documentation for Python API.
2012-12-28 22:51:28 -08:00
Josh Rosen
7ec3595de2
Fix bug (introduced by batching) in PySpark take()
2012-12-28 22:21:16 -08:00
Josh Rosen
fbadb1cda5
Mark api.python classes as private; echo Java output to stderr.
2012-12-28 09:06:11 -08:00
Josh Rosen
665466dfff
Simplify PySpark installation.
...
- Bundle Py4J binaries, since it's hard to install
- Uses Spark's `run` script to launch the Py4J
gateway, inheriting the settings in spark-env.sh
With these changes, (hopefully) nothing more than
running `sbt/sbt package` will be necessary to run
PySpark.
2012-12-27 22:47:37 -08:00
Josh Rosen
ac32447cd3
Use addFile() to ship code to cluster in PySpark.
...
Add options to pyspark.SparkContext constructor.
2012-12-27 19:59:04 -08:00
Josh Rosen
85b8f2c64f
Add epydoc API documentation for PySpark.
2012-12-27 18:04:10 -08:00
Josh Rosen
2d98fff065
Add IPython support to pyspark-shell.
...
Suggested by / based on code from @MLnick
2012-12-27 10:17:36 -08:00
Josh Rosen
e2dad15621
Add support for batched serialization of Python objects in PySpark.
2012-12-26 18:16:09 -08:00
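The idea of batched serialization in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the PySpark implementation: `batched_dumps` and `batched_loads` are hypothetical names, and Python 3's `pickle` stands in for the serializer. Grouping several objects into one pickled list reduces the per-object framing overhead.

```python
import pickle
from itertools import islice


def batched_dumps(items, batch_size):
    """Yield pickled lists of up to batch_size items each."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield pickle.dumps(batch)


def batched_loads(blobs):
    """Flatten pickled batches back into a stream of items."""
    for blob in blobs:
        for item in pickle.loads(blob):
            yield item
```

Any consumer that expects one object per record has to unbatch first, which is the kind of mismatch the later `take()` fix in this log refers to.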
Josh Rosen
4608902fb8
Use filesystem to collect RDDs in PySpark.
...
Passing large volumes of data through Py4J seems
to be slow. It appears to be faster to write the
data to the local filesystem and read it back from
Python.
2012-12-24 17:20:10 -08:00
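The file-based hand-off described in the commit above might look roughly like this. It is a sketch only: in the commit the JVM side writes the data and Python reads it, whereas here both halves are in Python for the sake of a runnable example, and `dump_to_tempfile` and `load_from_file` are hypothetical names.

```python
import os
import pickle
import tempfile


def dump_to_tempfile(items):
    """Write each item as a separate pickle record to a temp file."""
    fd, path = tempfile.mkstemp(suffix=".pkl")
    with os.fdopen(fd, "wb") as f:
        for item in items:
            pickle.dump(item, f)
    return path


def load_from_file(path):
    """Read pickle records back until the end of the file."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return
```

The caller is responsible for deleting the temporary file once the data has been read.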
Josh Rosen
ccd075cf96
Reduce object overhead in Pyspark shuffle and collect

2012-12-24 15:01:13 -08:00
Josh Rosen
2ccf3b6652
Fix PySpark hash partitioning bug.
...
A Java array's hashCode is based on its object
identity, not its elements, so this was causing
serialized keys to be hashed incorrectly.
This commit adds a PySpark-specific workaround
and adds more tests.
2012-10-28 22:30:28 -07:00
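The bug described in the commit above has a close analogue in Python, which makes it easy to demonstrate. This is an analogy, not the workaround the commit adds: `IdentityKey` and `content_partition` are hypothetical names. A class that does not define `__hash__` is hashed by object identity, much like a Java array, so two keys with equal contents generally hash differently.

```python
import zlib


class IdentityKey:
    """Wraps serialized bytes but keeps the default identity-based
    hash, similar to how a Java byte[] behaves."""

    def __init__(self, data):
        self.data = data


def content_partition(data, num_partitions):
    """Pick a partition from the bytes themselves, so equal
    serialized keys always land in the same partition."""
    return zlib.crc32(data) % num_partitions


a = IdentityKey(b"key")
b = IdentityKey(b"key")
# hash(a) and hash(b) come from object identity, so partitioning on
# them can send equal keys to different partitions.
same_partition = (
    content_partition(a.data, 4) == content_partition(b.data, 4)
)
```

Hashing the contents is deterministic across processes as well, which identity-based hashing is not.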
Josh Rosen
7859879aaa
Bump required Py4J version and add test for large broadcast variables.
2012-10-28 16:48:25 -07:00
Josh Rosen
d4f2e5b0ef
Remove PYTHONPATH from SparkContext's executorEnvs.
...
It makes more sense to pass it in the dictionary
of environment variables that is used to construct
PythonRDD.
2012-10-22 10:28:59 -07:00
Josh Rosen
c23bf1aff4
Add PySpark README and run scripts.
2012-10-20 00:22:27 +00:00
Josh Rosen
52989c8a2c
Update Python API for v0.6.0 compatibility.
2012-10-19 10:24:49 -07:00
Josh Rosen
9abdfa6633
Fix Python 2.6 compatibility in Python API.
2012-09-17 00:09:16 -07:00
Josh Rosen
4143678509
Fix minor bugs in Python API examples.
2012-08-27 00:24:47 -07:00
Josh Rosen
bff6a46359
Add pipe(), saveAsTextFile(), sc.union() to Python API.
2012-08-27 00:24:47 -07:00
Josh Rosen
200d248dcc
Simplify Python worker; pipeline the map step of partitionBy().
2012-08-27 00:24:39 -07:00
Josh Rosen
6904cb77d4
Use local combiners in Python API combineByKey().
2012-08-27 00:19:26 -07:00
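Local combining, as named in the commit above, means merging values per key within each partition before anything is shuffled. A minimal sketch under stated assumptions: the function names below are hypothetical, and plain lists stand in for partitions.

```python
def combine_locally(pairs, create_combiner, merge_value):
    """Combine the values of one partition per key."""
    combiners = {}
    for key, value in pairs:
        if key in combiners:
            combiners[key] = merge_value(combiners[key], value)
        else:
            combiners[key] = create_combiner(value)
    return combiners


def merge_partitions(partials, merge_combiners):
    """Merge the per-partition combiners into a final result."""
    result = {}
    for partial in partials:
        for key, combiner in partial.items():
            if key in result:
                result[key] = merge_combiners(result[key], combiner)
            else:
                result[key] = combiner
    return result


partitions = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]
partials = [
    combine_locally(p, lambda v: v, lambda c, v: c + v)
    for p in partitions
]
totals = merge_partitions(partials, lambda x, y: x + y)
```

With local combining, each partition sends at most one record per key to the merge step instead of one record per value.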
Josh Rosen
8b64b7ecd8
Add countByKey(), reduceByKeyLocally() to Python API
2012-08-27 00:19:22 -07:00
Josh Rosen
08b201d810
Add mapPartitions(), glom(), countByValue() to Python API.
2012-08-27 00:19:14 -07:00
Josh Rosen
f79a1e4d2a
Add broadcast variables to Python API.
2012-08-27 00:16:47 -07:00
Josh Rosen
65e8406029
Implement fold() in Python API.
2012-08-27 00:16:47 -07:00
Josh Rosen
f3b852ce66
Refactor Python MappedRDD to use iterator pipelines.
2012-08-24 19:44:14 -07:00
Josh Rosen
4b52300487
Fix options parsing in Python pi example.
2012-08-24 19:42:47 -07:00
Josh Rosen
607b53abfc
Use numpy in Python k-means example.
2012-08-22 00:43:55 -07:00
Josh Rosen
fd94e5443c
Use only cPickle for serialization in Python API.
...
Objects serialized with JSON can be compared for equality, but JSON can be slow
to serialize and only supports a limited range of data types.
2012-08-21 14:01:27 -07:00
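The limited type support mentioned in the commit above is easy to see with the standard library. A small sketch, assuming Python 3, where the `pickle` module takes the place of Python 2's `cPickle`; `json_supports` is a hypothetical helper.

```python
import json
import pickle

record = {"key": (1, 2), "tags": {"a", "b"}}

# pickle round-trips tuples and sets unchanged.
restored = pickle.loads(pickle.dumps(record))


def json_supports(value):
    """Return True if json.dumps can encode the value."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


# JSON has no tuple type, so a tuple comes back as a list.
tuple_via_json = json.loads(json.dumps((1, 2)))
```

The type change on the JSON round trip matters for keys in particular, since a list cannot be used as a dictionary key in Python.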
Josh Rosen
13b9514966
Bundle cloudpickle with pyspark.
2012-08-19 17:17:42 -07:00
Josh Rosen
886b39de55
Add Python API.
2012-08-18 22:33:51 -07:00