spark

Commit graph

Author	SHA1	Message	Date
Matei Zaharia	72ff62a37c	Two fixes to IPython support: - Don't attempt to run worker processes with ipython (that can cause some crashes as ipython prints things to standard out) - Allow passing some IPYTHON_OPTS to launch things like the notebook	2013-07-28 22:23:13 -04:00
Matei Zaharia	af3c9d5042	Add Apache license headers and LICENSE and NOTICE files	2013-07-16 17:21:33 -07:00
Nick Pentreath	21d3946d17	Adding IPYTHON environment variable support for launching pyspark using ipython shell	2013-02-07 16:54:31 +02:00
Matei Zaharia	892c32a14b	Warn users if they run pyspark or spark-shell without compiling Spark	2013-01-17 11:14:47 -08:00
Josh Rosen	ce9f1bbe20	Add `pyspark` script to replace the other scripts. Expand the PySpark programming guide.	2013-01-01 21:25:49 -08:00
Josh Rosen	b58340dbd9	Rename top-level 'pyspark' directory to 'python'	2013-01-01 15:05:00 -08:00
Josh Rosen	170e451fbd	Minor documentation and style fixes for PySpark.	2013-01-01 13:52:14 -08:00
Josh Rosen	6f6a6b79c4	Launch with `scala` by default in run-pyspark	2012-12-31 14:57:18 -08:00
Josh Rosen	099898b439	Port LR example to PySpark using numpy. This version of the example crashes after the first iteration with "OverflowError: math range error" because Python's math.exp() behaves differently than Scala's; see SPARK-646.	2012-12-29 18:00:28 -08:00
Josh Rosen	39dd953fd8	Add test for pyspark.RDD.saveAsTextFile().	2012-12-29 17:06:50 -08:00
Josh Rosen	59195c68ec	Update PySpark for compatibility with TaskContext.	2012-12-29 16:01:03 -08:00
Josh Rosen	26186e2d25	Use batching in pyspark parallelize(); fix cartesian()	2012-12-29 15:34:57 -08:00
Josh Rosen	6ee1ff2663	Fix bug in pyspark.serializers.batch; add .gitignore.	2012-12-29 22:25:34 +00:00
Josh Rosen	c2b105af34	Add documentation for Python API.	2012-12-28 22:51:28 -08:00
Josh Rosen	7ec3595de2	Fix bug (introduced by batching) in PySpark take()	2012-12-28 22:21:16 -08:00
Josh Rosen	fbadb1cda5	Mark api.python classes as private; echo Java output to stderr.	2012-12-28 09:06:11 -08:00
Josh Rosen	665466dfff	Simplify PySpark installation. - Bundle Py4J binaries, since it's hard to install - Uses Spark's `run` script to launch the Py4J gateway, inheriting the settings in spark-env.sh With these changes, (hopefully) nothing more than running `sbt/sbt package` will be necessary to run PySpark.	2012-12-27 22:47:37 -08:00
Josh Rosen	ac32447cd3	Use addFile() to ship code to cluster in PySpark. Add options to pyspark.SparkContext constructor.	2012-12-27 19:59:04 -08:00
Josh Rosen	85b8f2c64f	Add epydoc API documentation for PySpark.	2012-12-27 18:04:10 -08:00
Josh Rosen	2d98fff065	Add IPython support to pyspark-shell. Suggested by / based on code from @MLnick	2012-12-27 10:17:36 -08:00
Josh Rosen	e2dad15621	Add support for batched serialization of Python objects in PySpark.	2012-12-26 18:16:09 -08:00
Josh Rosen	4608902fb8	Use filesystem to collect RDDs in PySpark. Passing large volumes of data through Py4J seems to be slow. It appears to be faster to write the data to the local filesystem and read it back from Python.	2012-12-24 17:20:10 -08:00
Josh Rosen	ccd075cf96	Reduce object overhead in Pyspark shuffle and collect	2012-12-24 15:01:13 -08:00
Josh Rosen	2ccf3b6652	Fix PySpark hash partitioning bug. A Java array's hashCode is based on its object identify, not its elements, so this was causing serialized keys to be hashed incorrectly. This commit adds a PySpark-specific workaround and adds more tests.	2012-10-28 22:30:28 -07:00
Josh Rosen	7859879aaa	Bump required Py4J version and add test for large broadcast variables.	2012-10-28 16:48:25 -07:00
Josh Rosen	d4f2e5b0ef	Remove PYTHONPATH from SparkContext's executorEnvs. It makes more sense to pass it in the dictionary of environment variables that is used to construct PythonRDD.	2012-10-22 10:28:59 -07:00
Josh Rosen	c23bf1aff4	Add PySpark README and run scripts.	2012-10-20 00:22:27 +00:00
Josh Rosen	52989c8a2c	Update Python API for v0.6.0 compatibility.	2012-10-19 10:24:49 -07:00
Josh Rosen	9abdfa6633	Fix Python 2.6 compatibility in Python API.	2012-09-17 00:09:16 -07:00
Josh Rosen	4143678509	Fix minor bugs in Python API examples.	2012-08-27 00:24:47 -07:00
Josh Rosen	bff6a46359	Add pipe(), saveAsTextFile(), sc.union() to Python API.	2012-08-27 00:24:47 -07:00
Josh Rosen	200d248dcc	Simplify Python worker; pipeline the map step of partitionBy().	2012-08-27 00:24:39 -07:00
Josh Rosen	6904cb77d4	Use local combiners in Python API combineByKey().	2012-08-27 00:19:26 -07:00
Josh Rosen	8b64b7ecd8	Add countByKey(), reduceByKeyLocally() to Python API	2012-08-27 00:19:22 -07:00
Josh Rosen	08b201d810	Add mapPartitions(), glom(), countByValue() to Python API.	2012-08-27 00:19:14 -07:00
Josh Rosen	f79a1e4d2a	Add broadcast variables to Python API.	2012-08-27 00:16:47 -07:00
Josh Rosen	65e8406029	Implement fold() in Python API.	2012-08-27 00:16:47 -07:00
Josh Rosen	f3b852ce66	Refactor Python MappedRDD to use iterator pipelines.	2012-08-24 19:44:14 -07:00
Josh Rosen	4b52300487	Fix options parsing in Python pi example.	2012-08-24 19:42:47 -07:00
Josh Rosen	607b53abfc	Use numpy in Python k-means example.	2012-08-22 00:43:55 -07:00
Josh Rosen	fd94e5443c	Use only cPickle for serialization in Python API. Objects serialized with JSON can be compared for equality, but JSON can be slow to serialize and only supports a limited range of data types.	2012-08-21 14:01:27 -07:00
Josh Rosen	13b9514966	Bundle cloudpickle with pyspark.	2012-08-19 17:17:42 -07:00
Josh Rosen	886b39de55	Add Python API.	2012-08-18 22:33:51 -07:00

43 commits