Testing on ATMO
===============

By default, Python files executed on the cluster run with a Jupyter driver, so
the following environment variables need to be set (or unset) to run standalone:

  export PYSPARK_DRIVER_PYTHON=/mnt/anaconda2/bin/python
  unset PYSPARK_DRIVER_PYTHON_OPTS

Secure copy ('scp') the Python file to the cluster's host machine, as shown in
the example below.
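
For example (the key path, user, and hostname below are placeholders; substitute
the values for your own ATMO cluster):

  scp -i ~/.ssh/your-key.pem aggregate-and-import.py hadoop@<cluster-hostname>:~/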

Next, submit the Python file to Spark with the following arguments, which more
closely match how Airflow will execute jobs:

  spark-submit --executor-cores 8 --master yarn --deploy-mode client "./aggregate-and-import.py"