Firefox Telemetry Python ETL


This repository is a collection of ETL jobs for Firefox Telemetry.

Benefits

Jobs committed to python_mozetl can be scheduled via airflow or ATMO. We provide a testing suite and code review, which makes your job more maintainable. Centralizing our jobs in one repository allows for code reuse and easier collaboration.

There are a host of benefits to moving your analysis out of a Jupyter notebook and into a python package. For more on this see the writeup at cookiecutter-python-etl.

Tests

Dependencies

First install the necessary runtime dependencies: snappy and a Java runtime environment. These are required by the pyspark package. On Ubuntu:

$ sudo apt-get install libsnappy-dev default-jre

Calling the test runner

Run tests by calling tox in the root directory.

Arguments to pytest can be passed through tox using --.

tox -- -k test_main # runs only the tests in the test_main module

Tests are configured in tox.ini.
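As a sketch of what a job and its test look like (the function and names here are hypothetical, not part of mozetl — real jobs typically transform Spark DataFrames, but plain Python keeps the sketch runnable):

```python
# Hypothetical mozetl-style transform and its pytest-style test.
# Real jobs operate on Spark DataFrames; plain dicts keep this self-contained.

def summarize_clients(pings):
    """Aggregate total session hours per client_id."""
    totals = {}
    for ping in pings:
        client = ping["client_id"]
        totals[client] = totals.get(client, 0) + ping["session_hours"]
    return totals


def test_summarize_clients():
    pings = [
        {"client_id": "a", "session_hours": 1},
        {"client_id": "a", "session_hours": 2},
        {"client_id": "b", "session_hours": 5},
    ]
    assert summarize_clients(pings) == {"a": 3, "b": 5}
```

Tests like this live under tests/ and are picked up automatically by pytest when tox runs.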

Scheduling

You can schedule your job on either ATMO or airflow.

Scheduling a job on ATMO is easy and does not require review, but is less maintainable. Use ATMO to schedule jobs you are still prototyping or jobs that have a limited lifespan.

Jobs scheduled on Airflow will be more robust.

  • Airflow will automatically retry your job in the event of a failure.
  • You can also alert other members of your team when jobs fail, while ATMO will only send an email to the job owner.
  • If your job depends on other datasets, you can identify these dependencies in Airflow. This is useful if an upstream job fails.

ATMO

To schedule a job on ATMO, take a look at the load_and_run notebook. This notebook clones and installs the python_mozetl package. You can then run your job from the notebook.
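In the spirit of that notebook, the cells boil down to something like the following sketch (the job module and date argument are hypothetical; the real notebook handles installation details for the cluster):

```shell
# Hypothetical notebook cells: install python_mozetl from GitHub,
# then call a job's entry point.
pip install git+https://github.com/mozilla/python_mozetl.git

python -c "from mozetl.example_job import main; main('20170901')"  # hypothetical job module
```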

Airflow

To schedule a job on Airflow, you'll need to add a new Operator to the DAGs and provide a shell script for running your job. Take a look at this example shell script and this example Operator for templates.
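As a rough sketch of the shape of such an Operator (the DAG name, owner, schedule, and script path are all hypothetical, and this uses Airflow's stock BashOperator rather than the custom operator classes in Mozilla's Airflow repository):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "you@example.com",       # hypothetical job owner
    "email_on_failure": True,         # alert on failure, not just the owner
    "start_date": datetime(2017, 9, 1),
    "retries": 2,                     # Airflow retries the job on failure
}

dag = DAG("example_mozetl_job", default_args=default_args,
          schedule_interval="@daily")

run_job = BashOperator(
    task_id="run_example_job",
    # hypothetical submission script checked into scheduling/
    bash_command="bash scheduling/example_job.sh",
    dag=dag,
)
```

The retry and alerting settings in default_args are what give Airflow-scheduled jobs the robustness described above.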

Early Stage ETL Jobs

We usually require tests before accepting new ETL jobs. If you're still prototyping your job, but you'd like to move your code out of a Jupyter notebook take a look at cookiecutter-python-etl.

This tool will initialize a new repository with all of the necessary boilerplate for testing and packaging. In fact, this project was created with cookiecutter-python-etl.