From 0f57bd5072163fad746ca04f37684e1128ae07eb Mon Sep 17 00:00:00 2001
From: Victor Ng
Date: Wed, 24 Jun 2020 16:51:34 -0400
Subject: [PATCH] Add better documentation (#168)

* Update documentation to reflect prod setup
* dirty commit
* Add stubs for GCP resources
* Add instructions for deletion of user data
* Add link to production YAML configuration
* merged missing docs
* Fill in GCP and Airflow variable information
---
 README.md | 155 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 134 insertions(+), 21 deletions(-)

diff --git a/README.md b/README.md
index e430f54..2415c4c 100644
--- a/README.md
+++ b/README.md
@@ -21,17 +21,63 @@ This is the ordered list of the currently supported models:

| Order | Model | Description | Conditions | Generator job |
|-------|-------|-------------|------------|---------------|
-| 1 | [Legacy](taar/recommenders/legacy_recommender.py) | recommends WebExtensions based on the reported and disabled legacy add-ons | Telemetry data is available for the user and the user has at least one disabled add-on|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_legacy.py)|
-| 2 | [Collaborative](taar/recommenders/collaborative_recommender.py) | recommends add-ons based on add-ons installed by other users (i.e. [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering))|Telemetry data is available for the user and the user has at least one enabled add-on|[source](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/ml/AddonRecommender.scala)|
-| 3 | [Similarity](taar/recommenders/similarity_recommender.py) *| recommends add-ons based on add-ons installed by similar representative users|Telemetry data is available for the user and a suitable representative donor can be found|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_similarity.py)|
-| 4 | [Locale](taar/recommenders/locale_recommender.py) |recommends add-ons based on the top addons for the user's locale|Telemetry data is available for the user and the locale has enough users|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_locale.py)|
-| 5 | [Ensemble](taar/recommenders/ensemble_recommender.py) *|recommends add-ons based on the combined (by [stacked generalization](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking)) recomendations of other available recommender modules.|More than one of the other Models are available to provide recommendations.|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_ensemble.py)|
+| 1 | [Collaborative](taar/recommenders/collaborative_recommender.py) | recommends add-ons based on add-ons installed by other users (i.e. [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering))|Telemetry data is available for the user and the user has at least one enabled add-on|[source](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/ml/AddonRecommender.scala)|
+| 2 | [Similarity](taar/recommenders/similarity_recommender.py) *| recommends add-ons based on add-ons installed by similar representative users|Telemetry data is available for the user and a suitable representative donor can be found|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_similarity.py)|
+| 3 | [Locale](taar/recommenders/locale_recommender.py) |recommends add-ons based on the top add-ons for the user's locale|Telemetry data is available for the user and the locale has enough users|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_locale.py)|
+| 4 | [Ensemble](taar/recommenders/ensemble_recommender.py) *|recommends add-ons based on the combined (by [stacked generalization](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking)) recommendations of other available recommender modules|More than one of the other models is available to provide recommendations|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_ensemble.py)|

-* In order to ensure stable/repeatable testing and prevent unecessary computation, these jobs are not scheduled on [Airflow](https://github.com/mozilla/telemetry-airflow), rather run manually when fresh models are desired.
+* All jobs are scheduled in Mozilla's instance of [Airflow](https://github.com/mozilla/telemetry-airflow). The Collaborative, Similarity and Locale jobs are executed on a [daily](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py) schedule, while the Ensemble job is executed on a [weekly](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_weekly.py) schedule.

## Instructions for releasing updates

-New releases can be shipped by using the normal [github workflow](https://help.github.com/articles/creating-releases/). Once a new release is created, it will be automatically uploaded to `pypi`.
+Releases for TAAR are split across the ETL jobs for Airflow and the
+webservice that handles traffic coming from addons.mozilla.org.
+ETL releases are further subdivided into three categories:
+
+ 1. Scala code that is deployed as a Java JAR file to a Dataproc
+    environment.
+ 2. PySpark code that is deployed as a single monolithic script to the
+    Dataproc environment. These are stored in
+    [telemetry-airflow/jobs](https://github.com/mozilla/telemetry-airflow/tree/master/jobs)
+    and are autodeployed to gs://moz-fx-data-prod-airflow-dataproc-artifacts/jobs.
+ 3. Python code that executes in a Google Kubernetes Engine (GKE)
+    environment using a Docker container image (a sketch of one such
+    task follows the job list below).
+
+GKEPodOperator jobs:
+
+ * [taar_etl.taar_amodump](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_amodump.py)
+ * [taar_etl.taar_amowhitelist](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_amowhitelist.py)
+ * [taar_etl.taar_update_whitelist](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_update_whitelist.py)
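+
+As an illustration, a GKEPodOperator task has roughly the shape
+sketched below. This is a hedged approximation only: the operator
+import path, cluster parameters, image tag and arguments are all
+placeholder assumptions, not the production DAG code (see
+[taar_daily](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py)
+for the real definitions).
+
+```python
+# Hypothetical sketch of wiring a taar_gcp_etl container into an
+# Airflow DAG via GKEPodOperator. Every parameter value here is an
+# illustrative placeholder, not production configuration.
+from datetime import datetime
+
+from airflow import DAG
+from airflow.contrib.operators.gcp_container_operator import GKEPodOperator
+
+dag = DAG(
+    "taar_daily_sketch",
+    start_date=datetime(2020, 6, 1),
+    schedule_interval="@daily",
+)
+
+taar_amodump = GKEPodOperator(
+    task_id="taar_amodump",
+    project_id="example-gcp-project",    # placeholder GCP project
+    location="us-central1-a",            # placeholder cluster zone
+    cluster_name="example-gke-cluster",  # placeholder GKE cluster
+    name="taar-amodump",
+    namespace="default",
+    # Image built from the taar_gcp_etl repository (tag is made up).
+    image="gcr.io/example-gcp-project/taar_gcp_etl:latest",
+    # Assumes the container entrypoint is `python`.
+    arguments=["-m", "taar_etl.taar_amodump", "--date", "{{ ds_nodash }}"],
+    dag=dag,
+)
+```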
+
+PySpark jobs for Dataproc:
+
+ * [telemetry-airflow/jobs/taar_locale.py](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_locale.py)
+ * [telemetry-airflow/jobs/taar_similarity.py](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_similarity.py)
+ * [telemetry-airflow/jobs/taar_lite_guidguid.py](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_lite_guidguid.py)
+
+Scala jobs for Dataproc:
+
+ * [com.mozilla.telemetry.ml.AddonRecommender](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/ml/AddonRecommender.scala) from telemetry-batch-view.jar
+
+Jobs are scheduled in two separate DAGs in Airflow:
+
+* [taar_daily](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py)
+* [taar_weekly](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_weekly.py)
+
+GKEPodOperator jobs must have their code packaged as containers for
+execution in GKE. The code can be found in the taar_gcp_etl
+repository; details on pushing containers into the GCP container
+registry can be found in that repository's
+[README.md](https://github.com/mozilla/taar_gcp_etl/blob/master/README.md).
+
+PySpark jobs are maintained in the telemetry-airflow repository. You
+must take care to update the code in that repository and have it
+merged to master for the code to autodeploy. On execution, Airflow
+always copies jobs out of the
+[jobs](https://github.com/mozilla/telemetry-airflow/tree/master/jobs)
+directory into `gs://moz-fx-data-prod-airflow-dataproc-artifacts/`.
+
+The sole remaining Scala job is part of the telemetry-batch-view
+repository. Airflow will automatically use the latest code in the
+master branch of `telemetry-batch-view`.

## A note on cdist optimization.
cdist can speed up distance computation by a factor of 10 for the computations we're doing.
@@ -50,30 +96,22 @@ so we can just apply the function `distance.hamming` to our array manually and get the same
performance.
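+
+The sketch below illustrates the comparison this note makes between
+cdist's optimized built-in metrics and a manual application of
+`distance.hamming`. The array shapes and data are invented for the
+example; this is not the TAAR production code.
+
+```python
+# Two ways to compute Hamming distances with scipy. The named-metric
+# form of cdist uses the optimized C implementation; handing cdist a
+# Python callable falls back to a slow per-pair loop, so applying
+# distance.hamming manually costs about the same as that fallback.
+import numpy as np
+from scipy.spatial import distance
+
+donors = np.random.randint(0, 2, size=(5000, 64))
+client = np.random.randint(0, 2, size=(1, 64))
+
+# Fast path: optimized built-in metric.
+fast = distance.cdist(client, donors, "hamming")
+
+# Manual application of distance.hamming, one donor row at a time.
+manual = np.array([distance.hamming(client[0], row) for row in donors])
+
+assert np.allclose(fast[0], manual)
+```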

## Build and run tests
-You should be able to build taar using Python 2.7 or Python 3.5. To
-run the testsuite, execute ::
+You should be able to build TAAR using Python 3.5 or 3.7.
+To run the test suite, execute ::

```python
$ python setup.py develop
$ python setup.py test
```

-Alternately, if you've got GNUMake installed, you can just run `make test` which will do all of that for you and run flake8 on the codebase.
-
-
-There are additional integration tests and a microbenchmark available
-in `tests/test_integration.py`. See the source code for more
-information.
-
+Alternatively, if you have GNU Make installed, you can just run
+`make build; make tests`, which will build a complete Docker container
+and run the test suite inside that container.

## Pinning dependencies

-TAAR uses hashin (https://pypi.org/project/hashin/) to pin SHA256
-hashes for each dependency. To update the hashes, you will need to
-remove the run `make freeze` which forces all packages in the current
-virtualenv to be written out to requirement.txt with versions and SHA
-hashes.
+TAAR uses miniconda and an environment.yml file to manage versioning.
+To update versions, edit the environment.yml file with the new
+dependency you need.

## Required S3 dependencies

@@ -96,6 +134,42 @@ EnsembleRecommender:

* s3://telemetry-parquet/taar/ensemble/ensemble_weight.json

+## Google Cloud Platform resources
+
+Google Cloud BigQuery ::
+
+    Cloud BigQuery uses the GCP project defined in Airflow in the
+    variable `taar_gcp_project_id`.
+
+    Dataset : `taar_tmp`
+    Table ID : `taar_tmp_profile`
+
+    Note that this table only exists for the duration of the
+    taar_weekly job, so there should be no need to manage this
+    table manually.
+
+Google Cloud Storage ::
+
+    The TAAR user profile extraction puts Avro-format files into
+    a GCS bucket defined by the following two variables in Airflow:
+
+    `taar_gcp_project_id`.`taar_etl_storage_bucket`
+
+    The bucket is automatically cleared at the *start* and *end* of
+    the TAAR weekly ETL job.
+
+Google Cloud BigTable ::
+
+    The final TAAR user profile data is stored in a Cloud BigTable
+    instance defined by the following two variables in Airflow:
+
+    * `taar_gcp_project_id`
+    * `taar_bigtable_instance_id`
+
+    The table ID for user profile information is `taar_profile`.
+
+----

TAAR breaks out all S3 data load configuration into enviroment
variables.  This ensures that running under test has no chance of
@@ -127,3 +201,42 @@ Similarity Recommender ::
 TAAR_SIMILARITY_BUCKET = "telemetry-parquet"
 TAAR_SIMILARITY_DONOR_KEY = "taar/similarity/donors.json"
 TAAR_SIMILARITY_LRCURVES_KEY = "taar/similarity/lr_curves.json"
+
+------
+
+Production Configuration Settings
+---------------------------------
+
+Production environment settings are stored in a
+[private repository](https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/data/puppet/yaml/type/data.api.prod.taar.yaml).
+
+------
+
+Google Cloud Platform stage environment
+---------------------------------------
+
+A sample request against the stage environment of the TAAR webservice:
+
+    curl https://stage:fancyfork38@stage.taar.nonprod.dataops.mozgcp.net/v1/api/recommendations/
+
+Airflow variables for BigTable and GCS Avro storage:
+
+ * `taar_bigtable_instance_id`
+ * `taar_etl_storage_bucket`
+ * `taar_gcp_project_id`
+
+------
+
+Deleting individual user data from all TAAR resources
+-----------------------------------------------------
+
+Deletion of records in TAAR is fairly straightforward. Once a user
+disables telemetry in Firefox, all that is required is to delete
+that user's records from TAAR.
+
+Deleting the records from the TAAR BigTable instance will remove the
+client's list of add-ons from TAAR. No further work is required.
+
+Removal of the records from BigTable will cause JSON model updates to
+no longer take the deleted record into account. JSON models are
+updated on a daily basis via the
+[`taar_daily`](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py)
+DAG in Airflow.
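+
+A hedged sketch of that deletion path is shown below. The project and
+instance IDs stand in for the Airflow variables described earlier, and
+the row-key scheme is an assumption; consult the ETL jobs for the
+authoritative key format.
+
+```python
+# Hypothetical sketch of removing a single client's row from the TAAR
+# BigTable instance. All identifiers here are placeholders.
+from google.cloud import bigtable
+
+client = bigtable.Client(project="example-gcp-project", admin=True)
+instance = client.instance("example-bigtable-instance")
+table = instance.table("taar_profile")
+
+# Assume the row key is the client's (hashed) telemetry client_id.
+row = table.row(b"<hashed_client_id>")
+row.delete()   # mark the whole row for deletion
+row.commit()   # apply the mutation to BigTable
+```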