Efficient Hadoop Map-Reduce in Python
Taras Glek 0e5e31bb16 pass-through map-only jobs 2013-03-18 08:06:20 -07:00

README

The Python script gets wrapped into driver.jar, which also contains HDFSDriver and HBaseDriver. Choose one depending on whether you are mapping over HDFS files or an HBase table.

To process files in HDFS:
make hadoop ARGS="input output" TASK=HDFSDriver SCRIPT=mypythonfile.py

To process an HBase table:
make hadoop ARGS="telemetry output 201302281 201302282 yyyyMMddk" SCRIPT=mypythonfile.py

The Python script has to define a map function and, optionally, a reduce function. If no reduce function is present, Hadoop skips the reduce phase entirely, which can save a lot of time for simple data dumps.
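The README does not spell out the exact signatures the Java wrapper expects, so the sketch below is an assumption: it supposes map and reduce receive a context object with a write(key, value) method, a common convention for such wrappers. Check PythonWrapper.java for the real contract. The Context class here is a hypothetical local stand-in so the functions can be exercised without Hadoop.

```python
import json


class Context:
    """Hypothetical stand-in for the emit object the Java wrapper would
    pass in; collects (key, value) pairs so map/reduce can be tested
    locally."""

    def __init__(self):
        self.pairs = []

    def write(self, key, value):
        self.pairs.append((key, value))


def map(key, value, context):
    # Assumed shape: value is one line of a newline-separated JSON dump.
    # Emit each top-level field name with a count of 1.
    record = json.loads(value)
    for field in record:
        context.write(field, 1)


def reduce(key, values, context):
    # Sum the counts emitted for each field name. If this function were
    # omitted, the job would run as a pass-through map-only job.
    context.write(key, sum(values))
```

Omitting reduce turns this into a map-only dump of field names, which is where the time savings mentioned above come from.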

The idea is to keep the boilerplate in Java and do the important things in Python.

To test a Python map/reduce without Hadoop, one can use FileDriver.py (see its __main__ section).
For example:
python CallJava.py log > log.out
where log is a newline-separated JSON dump.
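A newline-separated JSON dump is just a text file with one JSON object per line. A minimal sketch of generating one for local testing (the record fields here are made up for illustration):

```python
import json

# Write a tiny newline-separated JSON dump named "log": one JSON object
# per line, the input format the local test driver reads.
# The field names below are illustrative, not part of any real schema.
records = [{"event": "anr"}, {"event": "crash"}]
with open("log", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```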