<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Firefox Telemetry Python ETL — python_mozetl 0.1 documentation</title>
<link rel="stylesheet" href="_static/pygments.css" type="text/css" />
<link rel="stylesheet" href="_static/alabaster.css" type="text/css" />
<script id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
<script src="_static/jquery.js"></script>
<script src="_static/underscore.js"></script>
<script src="_static/doctools.js"></script>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="prev" title="Indices and tables" href="index.html" />
<link rel="stylesheet" href="_static/custom.css" type="text/css" />
<meta name="viewport" content="width=device-width, initial-scale=0.9, maximum-scale=0.9" />
</head><body>
<div class="document">
<div class="documentwrapper">
<div class="bodywrapper">
<div class="body" role="main">
<div class="section" id="firefox-telemetry-python-etl">
<h1>Firefox Telemetry Python ETL<a class="headerlink" href="#firefox-telemetry-python-etl" title="Permalink to this headline">¶</a></h1>
<a class="reference external image-reference" href="https://circleci.com/gh/mozilla/python_mozetl"><img alt="CircleCI" src="https://circleci.com/gh/mozilla/python_mozetl.svg?style=svg" /></a>
<a class="reference external image-reference" href="https://codecov.io/gh/mozilla/python_mozetl"><img alt="codecov" src="https://codecov.io/gh/mozilla/python_mozetl/branch/main/graph/badge.svg" /></a>
<p>This repository is a collection of ETL jobs for Firefox Telemetry.</p>
</div>
<div class="section" id="benefits">
<h1>Benefits<a class="headerlink" href="#benefits" title="Permalink to this headline">¶</a></h1>
<p>Jobs committed to python_mozetl can be <strong>scheduled</strong> via
<a class="reference external" href="https://github.com/mozilla/telemetry-airflow">Airflow</a>
or
<a class="reference external" href="https://analysis.telemetry.mozilla.org/">ATMO</a>.
We provide a <strong>testing suite</strong> and <strong>code review</strong>, which makes your job more maintainable.
Centralizing our jobs in one repository allows for
<strong>code reuse</strong> and <strong>easier collaboration</strong>.</p>
<p>There are a host of benefits to moving your analysis out of a Jupyter notebook
and into a Python package.
For more on this see the writeup at
<a class="reference external" href="https://github.com/harterrt/cookiecutter-python-etl/blob/master/README.md#benefits">cookiecutter-python-etl</a>.</p>
</div>
<div class="section" id="tests">
<h1>Tests<a class="headerlink" href="#tests" title="Permalink to this headline">¶</a></h1>
<div class="section" id="dependencies">
<h2>Dependencies<a class="headerlink" href="#dependencies" title="Permalink to this headline">¶</a></h2>
<p>First install the necessary runtime dependencies: snappy and the Java runtime
environment. These are used for the <code class="docutils literal notranslate"><span class="pre">pyspark</span></code> package. On Ubuntu:</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span>sudo<span class="w"> </span>apt-get<span class="w"> </span>install<span class="w"> </span>libsnappy-dev<span class="w"> </span>openjdk-8-jre-headless
</pre></div>
</div>
</div>
<div class="section" id="calling-the-test-runner">
<h2>Calling the test runner<a class="headerlink" href="#calling-the-test-runner" title="Permalink to this headline">¶</a></h2>
<p>Run tests by calling <code class="docutils literal notranslate"><span class="pre">tox</span></code> in the root directory.</p>
<p>Arguments to <code class="docutils literal notranslate"><span class="pre">pytest</span></code> can be passed through tox using <code class="docutils literal notranslate"><span class="pre">--</span></code>.</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span><span class="n">tox</span> <span class="o">--</span> <span class="o">-</span><span class="n">k</span> <span class="n">test_main</span><span class="o">.</span><span class="n">py</span> <span class="c1"># runs tests only in the test_main module</span>
</pre></div>
</div>
<p>Tests are configured in <a class="reference external" href="tox.ini">tox.ini</a>.</p>
</div>
</div>
<div class="section" id="manual-execution">
<h1>Manual Execution<a class="headerlink" href="#manual-execution" title="Permalink to this headline">¶</a></h1>
<div class="section" id="atmo">
<h2>ATMO<a class="headerlink" href="#atmo" title="Permalink to this headline">¶</a></h2>
<p>The first method of manual execution is the <code class="docutils literal notranslate"><span class="pre">mozetl-submit.sh</span></code> script located in <code class="docutils literal notranslate"><span class="pre">bin</span></code>.
This script is used with the <code class="docutils literal notranslate"><span class="pre">EMRSparkOperator</span></code> in <code class="docutils literal notranslate"><span class="pre">telemetry-airflow</span></code> to schedule execution of <code class="docutils literal notranslate"><span class="pre">mozetl</span></code> jobs.
It may be used with <a class="reference external" href="https://analysis.telemetry.mozilla.org/">ATMO</a> to manually test jobs.</p>
<p>In an SSH session with an ATMO cluster, grab a copy of the script:</p>
<div class="highlight-default notranslate"><div class="highlight"><pre><span></span>$ wget https://raw.githubusercontent.com/mozilla/python_mozetl/main/bin/mozetl-submit.sh
</pre></div>
</div>
<p>Push your code to your own fork, where the job has been added to <code class="docutils literal notranslate"><span class="pre">mozetl.cli</span></code>. Then run it.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>$<span class="w"> </span>./mozetl-submit.sh<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-p<span class="w"> </span>https://github.com/&lt;USERNAME&gt;/python_mozetl.git<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>-b<span class="w"> </span>&lt;BRANCHNAME&gt;<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>&lt;COMMAND&gt;<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--first-argument<span class="w"> </span>foo<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--second-argument<span class="w"> </span>bar
</pre></div>
</div>
<p>See comments in <code class="docutils literal notranslate"><span class="pre">bin/mozetl-submit.sh</span></code> for more details.</p>
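<p>The command above passes a job name followed by that job's own options. As a rough illustration of that shape, the sketch below dispatches a <code class="docutils literal notranslate"><span class="pre">COMMAND</span></code> and its <code class="docutils literal notranslate"><span class="pre">--first-argument</span></code> and <code class="docutils literal notranslate"><span class="pre">--second-argument</span></code> options using only the standard library. It is not the contents of <code class="docutils literal notranslate"><span class="pre">mozetl.cli</span></code>, which may be organized differently; <code class="docutils literal notranslate"><span class="pre">example_job</span></code>, <code class="docutils literal notranslate"><span class="pre">build_parser</span></code>, and <code class="docutils literal notranslate"><span class="pre">main</span></code> are hypothetical names chosen for this example.</p>

```python
import argparse


def example_job(first_argument, second_argument):
    """Hypothetical job body; a real job would run its ETL logic here."""
    return "example_job ran with {} and {}".format(first_argument, second_argument)


def build_parser():
    """Build a parser with one subcommand per job."""
    parser = argparse.ArgumentParser(prog="mozetl")
    subcommands = parser.add_subparsers(dest="command")
    subcommands.required = True

    # Register the hypothetical job under the name passed as COMMAND.
    job = subcommands.add_parser("example_job")
    job.add_argument("--first-argument", required=True)
    job.add_argument("--second-argument", required=True)
    job.set_defaults(func=example_job)
    return parser


def main(argv=None):
    """Parse argv (or sys.argv when argv is None) and run the selected job."""
    args = build_parser().parse_args(argv)
    # argparse exposes --first-argument as the attribute first_argument.
    return args.func(args.first_argument, args.second_argument)
```

<p>Under these assumptions, calling <code class="docutils literal notranslate"><span class="pre">main(["example_job",</span> <span class="pre">"--first-argument",</span> <span class="pre">"foo",</span> <span class="pre">"--second-argument",</span> <span class="pre">"bar"])</span></code> runs the hypothetical job with the same arguments shown in the shell example.</p>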
</div>
<div class="section" id="databricks">
<h2>Databricks<a class="headerlink" href="#databricks" title="Permalink to this headline">¶</a></h2>
<p>Jobs may also be executed on <a class="reference external" href="https://dbc-caf9527b-e073.cloud.databricks.com/">Databricks</a>.
They are scheduled via the <code class="docutils literal notranslate"><span class="pre">MozDatabricksSubmitRunOperator</span></code> in <code class="docutils literal notranslate"><span class="pre">telemetry-airflow</span></code>.</p>
<p>The <code class="docutils literal notranslate"><span class="pre">bin/mozetl-databricks.py</span></code> script runs on your local machine and submits the job to a remote Spark executor.
First, generate an API token in the User Settings page in Databricks.
Then run the script.</p>
<div class="highlight-bash notranslate"><div class="highlight"><pre><span></span>python<span class="w"> </span>bin/mozetl-databricks.py<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--git-path<span class="w"> </span>https://github.com/&lt;USERNAME&gt;/python_mozetl.git<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--git-branch<span class="w"> </span>&lt;BRANCHNAME&gt;<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--token<span class="w"> </span>&lt;TOKEN&gt;<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>&lt;COMMAND&gt;<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--first-argument<span class="w"> </span>foo<span class="w"> </span><span class="se">\</span>
<span class="w"> </span>--second-argument<span class="w"> </span>bar
</pre></div>
</div>
<p>Run <code class="docutils literal notranslate"><span class="pre">python</span> <span class="pre">bin/mozetl-databricks.py</span> <span class="pre">--help</span></code> for more options, including increasing the number of workers and using Python 3.
Refer to this <a class="reference external" href="https://github.com/mozilla/python_mozetl/pull/296">pull request</a> for more examples.</p>
<p>It is also possible to use this script for external mozetl-compatible modules by setting the <code class="docutils literal notranslate"><span class="pre">--git-path</span></code> and <code class="docutils literal notranslate"><span class="pre">--module-name</span></code> options appropriately.
See this <a class="reference external" href="https://github.com/mozilla/python_mozetl/pull/316">pull request</a> for more information about building a mozetl-compatible repository that can be scheduled on Databricks.</p>
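<p>If you submit jobs often, it may be convenient to assemble the command in Python and read the API token from the environment instead of typing it each time. The sketch below builds the same argument list as the shell example; <code class="docutils literal notranslate"><span class="pre">build_databricks_command</span></code> and the <code class="docutils literal notranslate"><span class="pre">DATABRICKS_TOKEN</span></code> variable name are assumptions made for this example, not part of the repository.</p>

```python
import os
import sys


def build_databricks_command(username, branch, command, job_args, token=None):
    """Return the argument list for bin/mozetl-databricks.py."""
    if token is None:
        # DATABRICKS_TOKEN is a hypothetical variable name chosen for this sketch.
        token = os.environ["DATABRICKS_TOKEN"]
    git_path = "https://github.com/{}/python_mozetl.git".format(username)
    return [
        sys.executable,
        "bin/mozetl-databricks.py",
        "--git-path", git_path,
        "--git-branch", branch,
        "--token", token,
        command,
    ] + list(job_args)
```

<p>The returned list can be passed to <code class="docutils literal notranslate"><span class="pre">subprocess.run(cmd,</span> <span class="pre">check=True)</span></code> from the root of your checkout.</p>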
</div>
</div>
<div class="section" id="scheduling">
<h1>Scheduling<a class="headerlink" href="#scheduling" title="Permalink to this headline">¶</a></h1>
<p>You can schedule your job on either
<a class="reference external" href="https://analysis.telemetry.mozilla.org/">ATMO</a>
or
<a class="reference external" href="https://github.com/mozilla/telemetry-airflow">Airflow</a>.</p>
<p>Scheduling a job on ATMO is easy and does not require review,
but is less maintainable.
Use ATMO to schedule jobs you are still prototyping
or jobs that have a limited lifespan.</p>
<p>Jobs scheduled on Airflow will be more robust.</p>
<ul class="simple">
<li><p>Airflow will automatically retry your job in the event of a failure.</p></li>
<li><p>You can also alert other members of your team when jobs fail,
while ATMO will only send an email to the job owner.</p></li>
<li><p>If your job depends on other datasets,
you can identify these dependencies in Airflow.
This is useful if an upstream job fails.</p></li>
</ul>
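<p>In Airflow, retries and failure emails are typically configured through the <code class="docutils literal notranslate"><span class="pre">default_args</span></code> dictionary passed to a DAG. The fragment below is a minimal sketch of such a dictionary; the addresses and retry policy are placeholder values, not settings taken from <code class="docutils literal notranslate"><span class="pre">telemetry-airflow</span></code>.</p>

```python
from datetime import timedelta

# Placeholder values; adjust the addresses and retry policy for your team.
default_args = {
    "owner": "you@example.com",
    # Everyone listed here is emailed, not only the job owner.
    "email": ["you@example.com", "your-team@example.com"],
    "email_on_failure": True,
    # Rerun a failed task up to two more times, waiting 30 minutes in between.
    "retries": 2,
    "retry_delay": timedelta(minutes=30),
}

# Dependencies on upstream jobs are declared on the operators themselves,
# not in this dictionary.
```
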
<div class="section" id="id5">
<h2>ATMO<a class="headerlink" href="#id5" title="Permalink to this headline">¶</a></h2>
<p>To schedule a job on ATMO, take a look at the
<a class="reference external" href="scheduling/load_and_run.ipynb">load_and_run notebook</a>.
This notebook clones and installs the python_mozetl package.
You can then run your job from the notebook.</p>
</div>
<div class="section" id="id6">
<h2>Airflow<a class="headerlink" href="#id6" title="Permalink to this headline">¶</a></h2>
<p>To schedule a job on Airflow,
you’ll need to add a new Operator to the DAGs and provide a shell script for running your job.
Take a look at
<a class="reference external" href="https://github.com/mozilla/telemetry-airflow/blob/master/jobs/topline_dashboard.sh">this example shell script</a>
and
<a class="reference external" href="https://github.com/mozilla/telemetry-airflow/blob/master/dags/topline.py#L31">this example Operator</a>
for templates.</p>
</div>
</div>
<div class="section" id="early-stage-etl-jobs">
<h1>Early Stage ETL Jobs<a class="headerlink" href="#early-stage-etl-jobs" title="Permalink to this headline">¶</a></h1>
<p>We usually require tests before accepting new ETL jobs.
If you’re still prototyping your job,
but you’d like to move your code out of a Jupyter notebook,
take a look at
<a class="reference external" href="https://github.com/harterrt/cookiecutter-python-etl">cookiecutter-python-etl</a>.</p>
<p>This tool will initialize a new repository
with all of the necessary boilerplate for testing and packaging.
In fact, this project was created with
<a class="reference external" href="https://github.com/harterrt/cookiecutter-python-etl">cookiecutter-python-etl</a>.</p>
</div>
</div>
</div>
</div>
<div class="sphinxsidebar" role="navigation" aria-label="main navigation">
<div class="sphinxsidebarwrapper">
<h1 class="logo"><a href="index.html">python_mozetl</a></h1>
<h3>Navigation</h3>
<ul class="current">
<li class="toctree-l1 current"><a class="current reference internal" href="#">Firefox Telemetry Python ETL</a></li>
<li class="toctree-l1"><a class="reference internal" href="#benefits">Benefits</a></li>
<li class="toctree-l1"><a class="reference internal" href="#tests">Tests</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#dependencies">Dependencies</a></li>
<li class="toctree-l2"><a class="reference internal" href="#calling-the-test-runner">Calling the test runner</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="#manual-execution">Manual Execution</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#atmo">ATMO</a></li>
<li class="toctree-l2"><a class="reference internal" href="#databricks">Databricks</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="#scheduling">Scheduling</a><ul>
<li class="toctree-l2"><a class="reference internal" href="#id5">ATMO</a></li>
<li class="toctree-l2"><a class="reference internal" href="#id6">Airflow</a></li>
</ul>
</li>
<li class="toctree-l1"><a class="reference internal" href="#early-stage-etl-jobs">Early Stage ETL Jobs</a></li>
</ul>
<div class="relations">
<h3>Related Topics</h3>
<ul>
<li><a href="index.html">Documentation overview</a><ul>
<li>Previous: <a href="index.html" title="previous chapter">Indices and tables</a></li>
</ul></li>
</ul>
</div>
<div id="searchbox" style="display: none" role="search">
<h3 id="searchlabel">Quick search</h3>
<div class="searchformwrapper">
<form class="search" action="search.html" method="get">
<input type="text" name="q" aria-labelledby="searchlabel" />
<input type="submit" value="Go" />
</form>
</div>
</div>
<script>$('#searchbox').show(0);</script>
</div>
</div>
<div class="clearer"></div>
</div>
<div class="footer">
©2018, Ryan Harter.
|
Powered by <a href="http://sphinx-doc.org/">Sphinx 3.5.4</a>
&amp; <a href="https://github.com/bitprophet/alabaster">Alabaster 0.7.13</a>
|
<a href="_sources/readme.rst.txt"
rel="nofollow">Page source</a>
</div>
</body>
</html>