Mirror of https://github.com/mozilla/taar.git

Add better documentation (#168)

* Update documentation to reflect prod setup
* dirty commit
* Add stubs for GCP resources
* Add instructions for deletion of user data
* Add link to production YAML configuration
* merged missing docs
* Fill in GCP and Airflow variable information

Parent: e1da916205
Commit: 0f57bd5072

README.md | 155
@@ -21,17 +21,63 @@ This is the ordered list of the currently supported models:

| Order | Model | Description | Conditions | Generator job |
|-------|-------|-------------|------------|---------------|
| 1 | [Collaborative](taar/recommenders/collaborative_recommender.py) | recommends add-ons based on add-ons installed by other users (i.e. [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering))|Telemetry data is available for the user and the user has at least one enabled add-on|[source](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/ml/AddonRecommender.scala)|
| 2 | [Similarity](taar/recommenders/similarity_recommender.py) *| recommends add-ons based on add-ons installed by similar representative users|Telemetry data is available for the user and a suitable representative donor can be found|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_similarity.py)|
| 3 | [Locale](taar/recommenders/locale_recommender.py) |recommends add-ons based on the top add-ons for the user's locale|Telemetry data is available for the user and the locale has enough users|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_locale.py)|
| 4 | [Ensemble](taar/recommenders/ensemble_recommender.py) *|recommends add-ons based on the combined (by [stacked generalization](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking)) recommendations of other available recommender modules.|More than one of the other models is available to provide recommendations.|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_ensemble.py)|
* All jobs are scheduled in Mozilla's instance of [Airflow](https://github.com/mozilla/telemetry-airflow). The Collaborative, Similarity and Locale jobs are executed on a [daily](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py) schedule, while the ensemble job is scheduled on a [weekly](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_weekly.py) schedule.
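
For orientation, here is a minimal sketch of how those two schedules are typically declared in Airflow. The owner, start date, and retry settings are illustrative assumptions, not the production values; the real definitions live in the linked `taar_daily.py` and `taar_weekly.py` DAG files.

```python
from datetime import datetime, timedelta

from airflow import DAG

# Illustrative defaults; the real values live in telemetry-airflow's
# dags/taar_daily.py and dags/taar_weekly.py.
default_args = {
    "owner": "taar",                     # assumed owner
    "start_date": datetime(2020, 1, 1),  # hypothetical start date
    "retries": 2,
    "retry_delay": timedelta(minutes=30),
}

# Collaborative, Similarity and Locale jobs hang off the daily DAG...
taar_daily = DAG("taar_daily", default_args=default_args, schedule_interval="@daily")

# ...while the ensemble job runs on the weekly DAG.
taar_weekly = DAG("taar_weekly", default_args=default_args, schedule_interval="@weekly")
```
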
## Instructions for releasing updates

New releases can be shipped by using the normal [GitHub workflow](https://help.github.com/articles/creating-releases/). Once a new release is created, it will be automatically uploaded to PyPI.

Releases for TAAR are split across ETL jobs for Airflow and the webservice that handles traffic coming from addons.mozilla.org.

ETL releases are subdivided further into 3 categories:
1. Scala code that requires deployment by Java JAR file to a Dataproc environment
2. PySpark code that requires deployment by a single monolithic script in the Dataproc environment. These are stored in [telemetry-airflow/jobs](https://github.com/mozilla/telemetry-airflow/tree/master/jobs) and are autodeployed to gs://moz-fx-data-prod-airflow-dataproc-artifacts/jobs
3. Python code that executes in a Google Kubernetes Engine (GKE) environment using a Docker container image.

GKEPodOperator jobs (see the task sketch after this list):
* [taar_etl.taar_amodump](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_amodump.py)
* [taar_etl.taar_amowhitelist](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_amowhitelist.py)
* [taar_etl.taar_update_whitelist](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_update_whitelist.py)
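
As a rough illustration, one such task definition in an Airflow 1.10 DAG might look like the sketch below. The project, zone, cluster, and image names are placeholder assumptions, not the production configuration.

```python
from airflow.contrib.operators.gcp_container_operator import GKEPodOperator

# Sketch of one containerized ETL task; every identifier below is illustrative.
taar_amodump = GKEPodOperator(
    task_id="taar_amodump",
    project_id="my-gcp-project",    # placeholder; prod reads taar_gcp_project_id
    location="us-central1-a",       # assumed zone
    cluster_name="my-gke-cluster",  # assumed GKE cluster
    name="taar-amodump",
    namespace="default",
    # Container built from the taar_gcp_etl repository (tag is hypothetical).
    image="gcr.io/my-gcp-project/taar_gcp_etl:latest",
    arguments=["-m", "taar_etl.taar_amodump", "--date", "{{ ds_nodash }}"],
    dag=taar_daily,  # assumes the daily DAG object from the scheduling sketch above
)
```
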

PySpark jobs for Dataproc (see the submission sketch after this list):
* [telemetry-airflow/jobs/taar_locale.py](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_locale.py)
* [telemetry-airflow/jobs/taar_similarity.py](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_similarity.py)
* [telemetry-airflow/jobs/taar_lite_guidguid.py](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_lite_guidguid.py)
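
A comparable sketch for submitting one of these autodeployed scripts to Dataproc with Airflow 1.10's contrib operator; the cluster name and job arguments are placeholder assumptions.

```python
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

# Runs the script that autodeploy copied into the Dataproc artifacts bucket.
taar_similarity = DataProcPySparkOperator(
    task_id="taar_similarity",
    main="gs://moz-fx-data-prod-airflow-dataproc-artifacts/jobs/taar_similarity.py",
    cluster_name="my-dataproc-cluster",       # placeholder cluster name
    arguments=["--date", "{{ ds_nodash }}"],  # hypothetical job arguments
    dag=taar_daily,  # assumes the daily DAG object from the scheduling sketch above
)
```
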

Scala jobs for Dataproc:
* [com.mozilla.telemetry.ml.AddonRecommender](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/ml/AddonRecommender.scala) from telemetry-batch-view.jar

Jobs are scheduled in two separate DAGs in Airflow.
* [taar_daily](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py)
* [taar_weekly](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_weekly.py)

GKEPodOperator jobs must have code packaged up as containers for execution in GKE. Code can be found in the taar_gcp_etl repository. Details on pushing containers into the GCP cloud repositories can be found in the [README.md](https://github.com/mozilla/taar_gcp_etl/blob/master/README.md) in that repository.

PySpark jobs are maintained in the telemetry-airflow repository. You must take care to update the code in that repository and have it merged to master for code to autodeploy. Airflow execution will always copy jobs out of the [jobs](https://github.com/mozilla/telemetry-airflow/tree/master/jobs) directory into `gs://moz-fx-data-prod-airflow-dataproc-artifacts/`.

The sole Scala job remaining is part of the telemetry-batch-view repository. Airflow will automatically use the latest code in the master branch of `telemetry-batch-view`.

## A note on cdist optimization.
cdist can speed up distance computation by a factor of 10 for the computations we're doing.

@@ -50,30 +96,22 @@ so we can just apply the function `distance.hamming` to our array manually and g
performance.
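
For context, here is a sketch (with illustrative shapes and random data) showing that `cdist` with the Hamming metric agrees with applying `distance.hamming` row by row:

```python
import numpy as np
from scipy.spatial import distance

donors = np.random.randint(0, 2, size=(1000, 64))  # illustrative donor matrix
user = np.random.randint(0, 2, size=(1, 64))       # illustrative user vector

# Generic pairwise API: one row of user against all donor rows.
d1 = distance.cdist(user, donors, metric="hamming")[0]

# Applying distance.hamming manually to each row yields the same numbers.
d2 = np.array([distance.hamming(user[0], row) for row in donors])

assert np.allclose(d1, d2)
```
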
## Build and run tests
You should be able to build TAAR using Python 3.5 or 3.7. To run the test suite, execute ::

```bash
$ python setup.py develop
$ python setup.py test
```

There are additional integration tests and a microbenchmark available in `tests/test_integration.py`. See the source code for more information.

Alternatively, if you've got GNU Make installed, you can just run `make build; make tests`, which will build a complete Docker container and run the test suite inside the container.
## Pinning dependencies
TAAR uses Miniconda and an environment.yml file to manage versioning.

To update versions, edit the environment.yml with the new dependency you need.

## Required S3 dependencies

@@ -96,6 +134,42 @@ EnsembleRecommender:
* s3://telemetry-parquet/taar/ensemble/ensemble_weight.json
## Google Cloud Platform resources
Google Cloud BigQuery ::

Cloud BigQuery uses the GCP project defined in Airflow in the variable `taar_gcp_project_id`.

* Dataset: `taar_tmp`
* Table ID: `taar_tmp_profile`

Note that this table only exists for the duration of the taar_weekly job, so there should be no need to manually manage this table.
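
For illustration only, the fully qualified table can be addressed like this with the google-cloud-bigquery client; the project ID below is a placeholder for the `taar_gcp_project_id` value.

```python
from google.cloud import bigquery

project_id = "my-gcp-project"  # placeholder for the taar_gcp_project_id Airflow variable
client = bigquery.Client(project=project_id)

# Fully qualified name: <project>.<dataset>.<table>
table_id = f"{project_id}.taar_tmp.taar_tmp_profile"
rows = client.query(f"SELECT COUNT(*) AS n FROM `{table_id}`").result()
print(f"{table_id} holds {list(rows)[0].n} rows")
```
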
Google Cloud Storage ::

The TAAR user profile extraction puts Avro format files into a GCS bucket defined by the following two variables in Airflow:

`taar_gcp_project_id`.`taar_etl_storage_bucket`

The bucket is automatically cleared at the *start* and *end* of the TAAR weekly ETL job.
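
A sketch of what that clearing step could look like with the google-cloud-storage client; the project and bucket names are placeholders for the two Airflow variables above.

```python
from google.cloud import storage

client = storage.Client(project="my-gcp-project")  # taar_gcp_project_id placeholder
bucket = client.bucket("my-taar-etl-bucket")       # taar_etl_storage_bucket placeholder

# Remove every Avro blob left over from a previous run.
for blob in bucket.list_blobs():
    blob.delete()
```
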
Google Cloud BigTable ::

The final TAAR user profile data is stored in a Cloud BigTable instance defined by the following two variables in Airflow:

* `taar_gcp_project_id`
* `taar_bigtable_instance_id`

The table ID for user profile information is `taar_profile`.
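
A sketch of reading one user profile back out of BigTable with the google-cloud-bigtable client; the identifiers and the row-key scheme are assumptions for illustration.

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-gcp-project")  # taar_gcp_project_id placeholder
instance = client.instance("my-bigtable-instance")  # taar_bigtable_instance_id placeholder
table = instance.table("taar_profile")

# Assumes the row key is the hashed telemetry client id (illustrative).
row = table.read_row(b"<hashed_telemetry_id>")
if row is not None:
    for family, columns in row.cells.items():
        print(family, list(columns))  # column qualifiers present in each family
```
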
----
TAAR breaks out all S3 data load configuration into environment variables. This ensures that running under test has no chance of

@@ -127,3 +201,42 @@ Similarity Recommender ::
TAAR_SIMILARITY_BUCKET = "telemetry-parquet"
TAAR_SIMILARITY_DONOR_KEY = "taar/similarity/donors.json"
TAAR_SIMILARITY_LRCURVES_KEY = "taar/similarity/lr_curves.json"
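
A minimal sketch of how such environment-driven configuration is typically consumed, with the documented values above used as fallbacks:

```python
import os

# Fall back to the documented defaults when the variables are unset.
TAAR_SIMILARITY_BUCKET = os.environ.get(
    "TAAR_SIMILARITY_BUCKET", "telemetry-parquet")
TAAR_SIMILARITY_DONOR_KEY = os.environ.get(
    "TAAR_SIMILARITY_DONOR_KEY", "taar/similarity/donors.json")
TAAR_SIMILARITY_LRCURVES_KEY = os.environ.get(
    "TAAR_SIMILARITY_LRCURVES_KEY", "taar/similarity/lr_curves.json")
```
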
------

Production Configuration Settings
---------------------------------

Production environment settings are stored in a [private repository](https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/data/puppet/yaml/type/data.api.prod.taar.yaml).

------
Deleting individual user data from all TAAR resources
-----------------------------------------------------

Deletion of records in TAAR is fairly straightforward. Once a user disables telemetry in Firefox, all that is required is to delete their records from TAAR.

Deletion of records from the TAAR BigTable instance will remove the client's list of add-ons from TAAR. No further work is required.
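
A sketch of that row deletion with the google-cloud-bigtable client; identifiers and the row-key scheme are placeholders, as above.

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-gcp-project")  # taar_gcp_project_id placeholder
instance = client.instance("my-bigtable-instance")  # taar_bigtable_instance_id placeholder
table = instance.table("taar_profile")

# Delete the client's entire row (row key scheme is illustrative).
row = table.row(b"<hashed_telemetry_id>")
row.delete()
row.commit()
```
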
Removal of the records from BigTable will cause JSON model updates to no longer take the deleted record into account. JSON models are updated on a daily basis via the [`taar_daily`](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py) DAG in Airflow.

Google Cloud Platform stage environment
---------------------------------------

To query the stage instance of the TAAR service:

```bash
curl https://stage:fancyfork38@stage.taar.nonprod.dataops.mozgcp.net/v1/api/recommendations/<hashed_telemetry_id>
```

Airflow variables for BigTable and GCS Avro storage (a sketch showing how to read them follows this list):

* `taar_bigtable_instance_id`
* `taar_etl_storage_bucket`
* `taar_gcp_project_id`
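
A sketch of how an ETL task can resolve these at run time via the standard Airflow Variable API:

```python
from airflow.models import Variable

# Resolve the GCP resources configured for TAAR in Airflow.
project_id = Variable.get("taar_gcp_project_id")
instance_id = Variable.get("taar_bigtable_instance_id")
bucket_name = Variable.get("taar_etl_storage_bucket")
```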