* Update documentation to reflect prod setup

* dirty commit

* Add stubs for GCP resources

* Add instructions for deletion of user data

* Add link to production YAML configuration

* merged missing docs

* Fill in GCP and Airflow variable information
Victor Ng 2020-06-24 16:51:34 -04:00, committed by GitHub
Parent: e1da916205
Commit: 0f57bd5072
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
1 changed file: 134 additions and 21 deletions

README.md (155 lines changed)

@@ -21,17 +21,63 @@ This is the ordered list of the currently supported models:
| Order | Model | Description | Conditions | Generator job |
|-------|-------|-------------|------------|---------------|
| 1 | [Legacy](taar/recommenders/legacy_recommender.py) | recommends WebExtensions based on the reported and disabled legacy add-ons | Telemetry data is available for the user and the user has at least one disabled add-on|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_legacy.py)|
| 2 | [Collaborative](taar/recommenders/collaborative_recommender.py) | recommends add-ons based on add-ons installed by other users (i.e. [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering))|Telemetry data is available for the user and the user has at least one enabled add-on|[source](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/ml/AddonRecommender.scala)|
| 3 | [Similarity](taar/recommenders/similarity_recommender.py) *| recommends add-ons based on add-ons installed by similar representative users|Telemetry data is available for the user and a suitable representative donor can be found|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_similarity.py)|
| 4 | [Locale](taar/recommenders/locale_recommender.py) |recommends add-ons based on the top add-ons for the user's locale|Telemetry data is available for the user and the locale has enough users|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_locale.py)|
| 5 | [Ensemble](taar/recommenders/ensemble_recommender.py) *|recommends add-ons based on the combined (by [stacked generalization](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking)) recommendations of other available recommender modules.|More than one of the other models is available to provide recommendations.|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_ensemble.py)|
| 1 | [Collaborative](taar/recommenders/collaborative_recommender.py) | recommends add-ons based on add-ons installed by other users (i.e. [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering))|Telemetry data is available for the user and the user has at least one enabled add-on|[source](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/ml/AddonRecommender.scala)|
| 2 | [Similarity](taar/recommenders/similarity_recommender.py) *| recommends add-ons based on add-ons installed by similar representative users|Telemetry data is available for the user and a suitable representative donor can be found|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_similarity.py)|
| 3 | [Locale](taar/recommenders/locale_recommender.py) |recommends add-ons based on the top add-ons for the user's locale|Telemetry data is available for the user and the locale has enough users|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_locale.py)|
| 4 | [Ensemble](taar/recommenders/ensemble_recommender.py) *|recommends add-ons based on the combined (by [stacked generalization](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking)) recommendations of other available recommender modules.|More than one of the other models is available to provide recommendations.|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_ensemble.py)|
* In order to ensure stable/repeatable testing and prevent unnecessary computation, these jobs are not scheduled on [Airflow](https://github.com/mozilla/telemetry-airflow); rather, they are run manually when fresh models are desired.
* All jobs are scheduled in Mozilla's instance of [Airflow](https://github.com/mozilla/telemetry-airflow). The Collaborative, Similarity, and Locale jobs are executed on a [daily](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py) schedule, while the Ensemble job is scheduled on a [weekly](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_weekly.py) schedule.
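The Ensemble row above combines the other recommenders by stacked generalization. As a purely illustrative sketch, a linear meta-learner amounts to a weighted sum of per-model scores; the `{model_name: weight}` layout is an assumed reading of the `ensemble_weight.json` artifact mentioned later in this README, not TAAR's actual implementation:
```python
# Illustrative only: linear combination of per-recommender scores.
# The {model_name: weight} layout is an assumption, not TAAR's real code.
def ensemble_scores(suggestions_per_model, weights):
    """suggestions_per_model: {model_name: {addon_guid: score}}"""
    combined = {}
    for model_name, suggestions in suggestions_per_model.items():
        weight = weights.get(model_name, 0.0)
        for guid, score in suggestions.items():
            combined[guid] = combined.get(guid, 0.0) + weight * score
    # Highest combined score first.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


print(ensemble_scores(
    {"collaborative": {"guid-a": 0.9, "guid-b": 0.4},
     "locale": {"guid-b": 0.8}},
    {"collaborative": 0.7, "locale": 0.3},
))
```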
## Instructions for releasing updates
New releases can be shipped by using the normal [GitHub workflow](https://help.github.com/articles/creating-releases/). Once a new release is created, it will be automatically uploaded to PyPI.
Releases for TAAR are split across the ETL jobs for Airflow and the
webservice that handles traffic coming from addons.mozilla.org.
ETL releases are further subdivided into three categories:
1. Scala code that is deployed as a Java JAR file to a Dataproc environment
2. PySpark code that is deployed as a single monolithic script to the
Dataproc environment. These scripts are stored in [telemetry-airflow/jobs](https://github.com/mozilla/telemetry-airflow/tree/master/jobs)
and are autodeployed to gs://moz-fx-data-prod-airflow-dataproc-artifacts/jobs
3. Python code that executes in a Google Kubernetes Engine (GKE)
environment using a Docker container image.
GKEPodOperator jobs:
* [taar_etl.taar_amodump](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_amodump.py)
* [taar_etl.taar_amowhitelist](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_amowhitelist.py)
* [taar_etl.taar_update_whitelist](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_update_whitelist.py)
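For orientation, one of the modules above might be wired into a DAG roughly as follows. This is a hedged sketch, not the actual telemetry-airflow code: the owner, schedule, cluster name, location, and image path are all assumptions.
```python
# Hedged sketch of a GKEPodOperator task (Airflow 1.10-era imports).
# Owner, schedule, cluster, and image below are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.gcp_container_operator import GKEPodOperator

default_args = {
    "owner": "someone@example.com",  # assumed
    "start_date": datetime(2020, 6, 1),
    "retries": 2,
    "retry_delay": timedelta(minutes=30),
}

with DAG("taar_amodump_sketch", default_args=default_args,
         schedule_interval="@daily") as dag:
    amodump = GKEPodOperator(
        task_id="taar_amodump",
        project_id="example-gcp-project",  # in practice: taar_gcp_project_id
        location="us-central1-a",          # assumed GKE location
        cluster_name="example-gke",        # assumed cluster name
        name="taar-amodump",
        namespace="default",
        image="gcr.io/example/taar_gcp_etl:latest",  # assumed image path
        arguments=["-m", "taar_etl.taar_amodump",
                   "--date", "{{ ds_nodash }}"],
    )
```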
PySpark jobs for Dataproc:
* [telemetry-airflow/jobs/taar_locale.py](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_locale.py)
* [telemetry-airflow/jobs/taar_similarity.py](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_similarity.py)
* [telemetry-airflow/jobs/taar_lite_guidguid.py](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_lite_guidguid.py)
Scala jobs for Dataproc:
* [com.mozilla.telemetry.ml.AddonRecommender](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/ml/AddonRecommender.scala) from telemetry-batch-view.jar
Jobs are scheduled in two separate DAGs in Airflow.
* [taar_daily](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py)
* [taar_weekly](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_weekly.py)
GKEPodOperator jobs must have their code packaged as container images
for execution in GKE. The code can be found in the taar_gcp_etl
repository. Details on pushing containers into the GCP container
registries can be found in the
[README.md](https://github.com/mozilla/taar_gcp_etl/blob/master/README.md)
in that repository.
PySpark jobs are maintained in the telemetry-airflow repository. You
must take care to update the code in that repository and have it
merged to master for the code to autodeploy. Airflow execution will
always copy jobs out of the [jobs](https://github.com/mozilla/telemetry-airflow/tree/master/jobs)
directory into `gs://moz-fx-data-prod-airflow-dataproc-artifacts/`.
The sole remaining Scala job is part of the telemetry-batch-view
repository. Airflow will automatically use the latest code in the
master branch of `telemetry-batch-view`.
## A note on cdist optimization
cdist can speed up distance computation by a factor of 10 for the computations we're doing.
@@ -50,30 +96,22 @@ so we can just apply the function `distance.hamming` to our array manually and g
performance.
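As a self-contained sketch of the comparison being made here (the array shapes and data are made up), the vectorized `cdist` call and the manual per-row application of `distance.hamming` produce the same values:
```python
# Sketch only: compare a vectorized cdist call against manually applying
# distance.hamming row by row. Shapes and data are illustrative.
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(42)
donors = rng.integers(0, 2, size=(1000, 64))  # hypothetical donor matrix
user = rng.integers(0, 2, size=(1, 64))       # hypothetical user vector

# One call computes the Hamming distance from the user to every donor row.
vectorized = distance.cdist(user, donors, metric="hamming")[0]

# Equivalent manual application of distance.hamming, one row at a time.
manual = np.array([distance.hamming(user[0], row) for row in donors])

assert np.allclose(vectorized, manual)
```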
## Build and run tests
You should be able to build taar using Python 2.7 or Python 3.5. To
run the testsuite, execute ::
You should be able to build taar using Python 3.5 or 3.7.
To run the testsuite, execute ::
```bash
$ python setup.py develop
$ python setup.py test
```
Alternatively, if you've got GNU Make installed, you can just run `make test`, which will do all of that for you and run flake8 on the codebase.
There are additional integration tests and a microbenchmark available
in `tests/test_integration.py`. See the source code for more
information.
Alternatively, if you've got GNU Make installed, you can just run `make build; make tests`, which will build a complete Docker container and run the test suite inside the container.
## Pinning dependencies
TAAR uses hashin (https://pypi.org/project/hashin/) to pin SHA256
hashes for each dependency. To update the hashes, run `make freeze`,
which forces all packages in the current virtualenv to be written out
to requirements.txt with versions and SHA hashes.
TAAR uses Miniconda and an `environment.yml` file to manage versioning.
To update versions, edit the `environment.yml` file with the new
dependency you need.
## Required S3 dependencies
@@ -96,6 +134,42 @@ EnsembleRecommender:
* s3://telemetry-parquet/taar/ensemble/ensemble_weight.json
## Google Cloud Platform resources
Google Cloud BigQuery ::
Cloud BigQuery uses the GCP project defined in the Airflow
variable `taar_gcp_project_id`.
Dataset : `taar_tmp`
Table ID : `taar_tmp_profile`
Note that this table only exists for the duration of the
taar_weekly job, so there should be no need to manually manage this
table.
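For illustration, the fully-qualified name of that temporary table follows from the IDs above. A minimal sketch with the google-cloud-bigquery client (this is not part of the ETL code; the project placeholder is an assumption):
```python
# Sketch only: address the temporary table using the IDs documented above.
from google.cloud import bigquery

client = bigquery.Client(project="<taar_gcp_project_id>")  # placeholder
table = client.get_table("<taar_gcp_project_id>.taar_tmp.taar_tmp_profile")
# Outside the taar_weekly run this raises google.api_core.exceptions.NotFound,
# since the table only exists for the duration of the job.
print(table.num_rows)
```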
Google Cloud Storage ::
The TAAR user profile extraction puts Avro-format files into
a GCS bucket defined by the following two variables in Airflow:
`taar_gcp_project_id`.`taar_etl_storage_bucket`
The bucket is automatically cleared at the *start* and *end* of
the TAAR weekly ETL job.
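A minimal sketch of what clearing the bucket amounts to, assuming the google-cloud-storage client (the ETL job's actual mechanism may differ; the placeholders mirror the Airflow variables above):
```python
# Sketch only: delete every object in the ETL bucket.
from google.cloud import storage

client = storage.Client(project="<taar_gcp_project_id>")  # placeholder
bucket = client.bucket("<taar_etl_storage_bucket>")       # placeholder
for blob in bucket.list_blobs():
    blob.delete()
```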
Google Cloud BigTable ::
The final TAAR user profile data is stored in a Cloud BigTable
instance defined by the following two variables in Airflow:
* `taar_gcp_project_id`
* `taar_bigtable_instance_id`
The table ID for user profile information is `taar_profile`.
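As a hedged sketch, a single profile row can be fetched with the google-cloud-bigtable client along these lines; the placeholders and the hashed-client-id row key scheme are assumptions:
```python
# Sketch only: look up one user profile row in Cloud BigTable.
from google.cloud import bigtable

client = bigtable.Client(project="<taar_gcp_project_id>", admin=False)
instance = client.instance("<taar_bigtable_instance_id>")
table = instance.table("taar_profile")

row = table.read_row(b"<hashed_client_id>")  # assumed row key scheme
if row is not None:
    for family, columns in row.cells.items():
        print(family, list(columns))
```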
----
TAAR breaks out all S3 data load configuration into environment
variables. This ensures that running under test has no chance of
@@ -127,3 +201,42 @@ Similarity Recommender ::
TAAR_SIMILARITY_BUCKET = "telemetry-parquet"
TAAR_SIMILARITY_DONOR_KEY = "taar/similarity/donors.json"
TAAR_SIMILARITY_LRCURVES_KEY = "taar/similarity/lr_curves.json"
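For example, a test run could point the Similarity recommender at non-production data by overriding these variables before the recommender is constructed (the bucket and key values here are illustrative):
```python
# Sketch only: override the similarity settings for a local test run.
import os

os.environ["TAAR_SIMILARITY_BUCKET"] = "my-test-bucket"
os.environ["TAAR_SIMILARITY_DONOR_KEY"] = "test/donors.json"
os.environ["TAAR_SIMILARITY_LRCURVES_KEY"] = "test/lr_curves.json"
```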
------
Production Configuration Settings
---------------------------------
Production environment settings are stored in a [private repository](https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/data/puppet/yaml/type/data.api.prod.taar.yaml).
------
Deleting individual user data from all TAAR resources
-----------------------------------------------------
Deletion of records in TAAR is fairly straightforward. Once a user
disables telemetry in Firefox, all that is required is to delete that
user's records from TAAR.
Deletion of records from the TAAR BigTable instance will remove the
client's list of add-ons from TAAR. No further work is required.
Removal of the records from BigTable will cause JSON model updates to
no longer take the deleted record into account. JSON models are
updated on a daily basis via the
[`taar_daily`](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py)
DAG in Airflow.
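A hedged sketch of what deleting one client's row could look like with the google-cloud-bigtable client; the placeholders are assumptions, and the production deletion tooling may differ:
```python
# Sketch only: remove a single client's profile row from BigTable.
from google.cloud import bigtable

client = bigtable.Client(project="<taar_gcp_project_id>", admin=False)
instance = client.instance("<taar_bigtable_instance_id>")
table = instance.table("taar_profile")

row = table.direct_row(b"<hashed_client_id>")
row.delete()   # mark the entire row for deletion
row.commit()   # apply the mutation
```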
------
Google Cloud Platform
---------------------
Stage environment ::
curl https://stage:fancyfork38@stage.taar.nonprod.dataops.mozgcp.net/v1/api/recommendations/<hashed_telemetry_id>
Airflow variables for BigTable and GCS Avro storage ::
* `taar_bigtable_instance_id`
* `taar_etl_storage_bucket`
* `taar_gcp_project_id`
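For reference, ETL code running under Airflow can read these variables through the standard `Variable` API; a minimal sketch:
```python
# Sketch only: read the TAAR settings from Airflow's variable store.
from airflow.models import Variable

project_id = Variable.get("taar_gcp_project_id")
bigtable_instance_id = Variable.get("taar_bigtable_instance_id")
avro_bucket = Variable.get("taar_etl_storage_bucket")
print(project_id, bigtable_instance_id, avro_bucket)
```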