data-docs/concepts/review.md

3.9 KiB

Getting Analysis Review

The current process for getting analysis review has not been optimized. If you run into problems, feel free to ask #datapipeline

Table of Contents

Finding a Reviewer

There are two major types of review for most analyses: Data review and Methodology review.

A data reviewer has expert understanding of the dataset your analysis is based upon. During data review, your reviewer will try to identify any issues with your analysis that come from misapplying the data. For example, sampling issues or data collection oddities.

Methodology review primarily covers statistical concepts. For example, if you're training regressions, forecasting data, or doing complex statistical tests, you should probably get Methodology review. Many (most?) analyses at Mozilla use simple aggregations and little to no statistical inference. These analyses only require data review.

Accordingly, we suggest you first find a data reviewer for your analysis. Your data reviewer will tell you if you should get methodology review and will help you find a reviewer.

Data Reviewers

To get review for your analysis, contact one of the data reviewers listed for the dataset your analysis is based upon:

Dataset Reviewers
Longitudinal
Cross Sectional
Main Summary
Crash Summary
Crash Aggregate
Events
Sync Summary
Addons
Client Count

Jupyter Notebooks

It's difficult to review Jupyter notebooks on Github. The notebook is stored as a JSON object, with python code stored in strings. This makes commenting on particular lines very difficult. Because the output is stored next to the code, it's also very difficult to review the diff for a given notebook.

RTMO

For simple notebooks, the best way to get review is through the knowledge repo. The knowledge repo renders your Jupyter notebooks to markdown while keeping the original source code next to the document. This gives your reviewer an easy-to-review markdown file. Your analysis will also be available at RTMO. This will aid discoverability and help others find your analysis. Note that RTMO is public so your analysis will also be public.

Start a Repository

Notebooks are great for presenting information, but are not a good medium for storing code. If your notebook contains complicated logic or a significant amount of custom code, you should consider moving most of the logic to a python package. Your reviewer may ask you for tests, which also requires moving the code out of a notebook.

Where to store the package

The data platform team maintains a python repository of analyses and ETL called python_mozetl. Feel free to file a pull request against that repository. Your reviewer will provide analysis review during the code review. You'll need review for each commit to python_mozetl's master branch.

If you are still prototyping your job but want to move out of a Jupyter notebook, take a look at cookiecutter-python-etl. This tool will help you configure a new python repository so you can hack quickly without getting review.

nbconvert

The easiest way to get your code out of a Jupyter notebook is to use nbconvert command from Jupyter. If you have a notebook called analysis.ipynb you can dump your analysis to a python script using the following command:

jupyter nbconvert --to python analysis.ipynb

You'll need to clean up some formatting in the resulting analysis.py file, but most of the work has been done for you.