Efficient Hadoop Map-Reduce in Python
Taras Glek 0e5e31bb16 pass-through map-only jobs 2013-03-18 08:06:20 -07:00

README

The Python script gets wrapped into driver.jar, which also contains HDFSDriver and HBaseDriver. Choose one depending on whether you are mapping over HDFS files or an HBase table.

To process files in HDFS:
make hadoop ARGS="input output" TASK=HDFSDriver SCRIPT=mypythonfile.py

To process an HBase table:
make hadoop ARGS="telemetry output 201302281 201302282 yyyyMMddk" SCRIPT=mypythonfile.py

The Python script has to define a map function and, optionally, a reduce function. If no reduce function is present, Hadoop skips the reduce phase entirely, which can save a lot of time for simple data dumps.
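The README does not spell out the exact signatures the Java wrapper expects, so the sketch below is an assumption: it supposes map and reduce receive a context object with a write(key, value) method, a common convention for such wrappers. Check PythonWrapper.java for the real contract. The Context class here is a hypothetical local stand-in so the functions can be exercised without Hadoop.

```python
import json


class Context:
    """Hypothetical stand-in for the emit object the Java wrapper would
    pass in; collects (key, value) pairs so map/reduce can be tested
    locally."""

    def __init__(self):
        self.pairs = []

    def write(self, key, value):
        self.pairs.append((key, value))


def map(key, value, context):
    # Assumed shape: value is one line of a newline-separated JSON dump.
    # Emit each top-level field name with a count of 1.
    record = json.loads(value)
    for field in record:
        context.write(field, 1)


def reduce(key, values, context):
    # Sum the counts emitted for each field name. If this function were
    # omitted, the job would run as a pass-through map-only job.
    context.write(key, sum(values))
```

Omitting reduce turns this into a map-only dump of field names, which is where the time savings mentioned above come from.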

The idea is to keep the boilerplate in Java and do the important things in Python.

To test a Python map/reduce without Hadoop, one can use FileDriver.py (see its __main__ section).
For example:
python CallJava.py log > log.out
where log is a newline-separated JSON dump.
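A newline-separated JSON dump is just a text file with one JSON object per line. A minimal sketch of generating one for local testing (the record fields here are made up for illustration):

```python
import json

# Write a tiny newline-separated JSON dump named "log": one JSON object
# per line, the input format the local test driver reads.
# The field names below are illustrative, not part of any real schema.
records = [{"event": "anr"}, {"event": "crash"}]
with open("log", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```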