Mirror of https://github.com/mozilla/taar.git

Add better documentation (#168)

* Update documentation to reflect prod setup
* dirty commit
* Add stubs for GCP resources
* Add instructions for deletion of user data
* Add link to production YAML configuration
* merged missing docs
* Fill in GCP and Airflow variable information

Parent: e1da916205
Commit: 0f57bd5072

README.md | 155
@@ -21,17 +21,63 @@ This is the ordered list of the currently supported models:

| Order | Model | Description | Conditions | Generator job |
|-------|-------|-------------|------------|---------------|
| 1 | [Collaborative](taar/recommenders/collaborative_recommender.py) | recommends add-ons based on add-ons installed by other users (i.e. [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering))|Telemetry data is available for the user and the user has at least one enabled add-on|[source](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/ml/AddonRecommender.scala)|
| 2 | [Similarity](taar/recommenders/similarity_recommender.py) *| recommends add-ons based on add-ons installed by similar representative users|Telemetry data is available for the user and a suitable representative donor can be found|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_similarity.py)|
| 3 | [Locale](taar/recommenders/locale_recommender.py) |recommends add-ons based on the top add-ons for the user's locale|Telemetry data is available for the user and the locale has enough users|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_locale.py)|
| 4 | [Ensemble](taar/recommenders/ensemble_recommender.py) *|recommends add-ons based on the combined (by [stacked generalization](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking)) recommendations of other available recommender modules.|More than one of the other models is available to provide recommendations.|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_ensemble.py)|
* All jobs are scheduled in Mozilla's instance of [Airflow](https://github.com/mozilla/telemetry-airflow). The Collaborative, Similarity and Locale jobs are executed on a [daily](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py) schedule, while the ensemble job is scheduled on a [weekly](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_weekly.py) schedule.
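
For orientation, here is a minimal sketch of how those two schedules are typically declared in Airflow. The owner, start date, and retry settings are illustrative assumptions, not the production values; the real definitions live in the linked `taar_daily.py` and `taar_weekly.py` DAG files.

```python
from datetime import datetime, timedelta

from airflow import DAG

# Illustrative defaults; the real values live in telemetry-airflow's
# dags/taar_daily.py and dags/taar_weekly.py.
default_args = {
    "owner": "taar",                     # assumed owner
    "start_date": datetime(2020, 1, 1),  # hypothetical start date
    "retries": 2,
    "retry_delay": timedelta(minutes=30),
}

# Collaborative, Similarity and Locale jobs hang off the daily DAG...
taar_daily = DAG("taar_daily", default_args=default_args, schedule_interval="@daily")

# ...while the ensemble job runs on the weekly DAG.
taar_weekly = DAG("taar_weekly", default_args=default_args, schedule_interval="@weekly")
```
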
## Instructions for releasing updates

New releases can be shipped by using the normal [GitHub workflow](https://help.github.com/articles/creating-releases/). Once a new release is created, it will be automatically uploaded to PyPI.

Releases for TAAR are split across ETL jobs for Airflow and the webservice that handles traffic coming from addons.mozilla.org.

ETL releases are subdivided further into 3 categories:
1. Scala code that requires deployment by Java JAR file to a Dataproc environment
2. PySpark code that requires deployment by a single monolithic script in the Dataproc environment. These are stored in [telemetry-airflow/jobs](https://github.com/mozilla/telemetry-airflow/tree/master/jobs) and are autodeployed to gs://moz-fx-data-prod-airflow-dataproc-artifacts/jobs
3. Python code that executes in a Google Kubernetes Engine (GKE) environment using a Docker container image.

GKEPodOperator jobs (see the task sketch after this list):
* [taar_etl.taar_amodump](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_amodump.py)
* [taar_etl.taar_amowhitelist](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_amowhitelist.py)
* [taar_etl.taar_update_whitelist](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_update_whitelist.py)
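
As a rough illustration, one such task definition in an Airflow 1.10 DAG might look like the sketch below. The project, zone, cluster, and image names are placeholder assumptions, not the production configuration.

```python
from airflow.contrib.operators.gcp_container_operator import GKEPodOperator

# Sketch of one containerized ETL task; every identifier below is illustrative.
taar_amodump = GKEPodOperator(
    task_id="taar_amodump",
    project_id="my-gcp-project",    # placeholder; prod reads taar_gcp_project_id
    location="us-central1-a",       # assumed zone
    cluster_name="my-gke-cluster",  # assumed GKE cluster
    name="taar-amodump",
    namespace="default",
    # Container built from the taar_gcp_etl repository (tag is hypothetical).
    image="gcr.io/my-gcp-project/taar_gcp_etl:latest",
    arguments=["-m", "taar_etl.taar_amodump", "--date", "{{ ds_nodash }}"],
    dag=taar_daily,  # assumes the daily DAG object from the scheduling sketch above
)
```
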

PySpark jobs for Dataproc (see the submission sketch after this list):
* [telemetry-airflow/jobs/taar_locale.py](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_locale.py)
* [telemetry-airflow/jobs/taar_similarity.py](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_similarity.py)
* [telemetry-airflow/jobs/taar_lite_guidguid.py](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_lite_guidguid.py)
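
A comparable sketch for submitting one of these autodeployed scripts to Dataproc with Airflow 1.10's contrib operator; the cluster name and job arguments are placeholder assumptions.

```python
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator

# Runs the script that autodeploy copied into the Dataproc artifacts bucket.
taar_similarity = DataProcPySparkOperator(
    task_id="taar_similarity",
    main="gs://moz-fx-data-prod-airflow-dataproc-artifacts/jobs/taar_similarity.py",
    cluster_name="my-dataproc-cluster",       # placeholder cluster name
    arguments=["--date", "{{ ds_nodash }}"],  # hypothetical job arguments
    dag=taar_daily,  # assumes the daily DAG object from the scheduling sketch above
)
```
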

Scala jobs for Dataproc:
* [com.mozilla.telemetry.ml.AddonRecommender](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/ml/AddonRecommender.scala) from telemetry-batch-view.jar

Jobs are scheduled in two separate DAGs in Airflow.
* [taar_daily](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py)
* [taar_weekly](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_weekly.py)

GKEPodOperator jobs must have code packaged up as containers for execution in GKE. Code can be found in the taar_gcp_etl repository. Details on pushing containers into the GCP cloud repositories can be found in the [README.md](https://github.com/mozilla/taar_gcp_etl/blob/master/README.md) in that repository.

PySpark jobs are maintained in the telemetry-airflow repository. You must take care to update the code in that repository and have it merged to master for code to autodeploy. Airflow execution will always copy jobs out of the [jobs](https://github.com/mozilla/telemetry-airflow/tree/master/jobs) directory into `gs://moz-fx-data-prod-airflow-dataproc-artifacts/`.

The sole Scala job remaining is part of the telemetry-batch-view repository. Airflow will automatically use the latest code in the master branch of `telemetry-batch-view`.

## A note on cdist optimization.
cdist can speed up distance computation by a factor of 10 for the computations we're doing.

@@ -50,30 +96,22 @@ so we can just apply the function `distance.hamming` to our array manually and g
performance.
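
For context, here is a sketch (with illustrative shapes and random data) showing that `cdist` with the Hamming metric agrees with applying `distance.hamming` row by row:

```python
import numpy as np
from scipy.spatial import distance

donors = np.random.randint(0, 2, size=(1000, 64))  # illustrative donor matrix
user = np.random.randint(0, 2, size=(1, 64))       # illustrative user vector

# Generic pairwise API: one row of user against all donor rows.
d1 = distance.cdist(user, donors, metric="hamming")[0]

# Applying distance.hamming manually to each row yields the same numbers.
d2 = np.array([distance.hamming(user[0], row) for row in donors])

assert np.allclose(d1, d2)
```
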
## Build and run tests
You should be able to build TAAR using Python 3.5 or 3.7. To run the test suite, execute ::

```bash
$ python setup.py develop
$ python setup.py test
```

There are additional integration tests and a microbenchmark available in `tests/test_integration.py`. See the source code for more information.

Alternatively, if you've got GNU Make installed, you can just run `make build; make tests`, which will build a complete Docker container and run the test suite inside the container.
## Pinning dependencies
TAAR uses Miniconda and an environment.yml file to manage versioning.

To update versions, edit the environment.yml with the new dependency you need.

## Required S3 dependencies

@@ -96,6 +134,42 @@ EnsembleRecommender:
* s3://telemetry-parquet/taar/ensemble/ensemble_weight.json
## Google Cloud Platform resources
Google Cloud BigQuery ::

Cloud BigQuery uses the GCP project defined in Airflow in the variable `taar_gcp_project_id`.

* Dataset: `taar_tmp`
* Table ID: `taar_tmp_profile`

Note that this table only exists for the duration of the taar_weekly job, so there should be no need to manually manage this table.
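
For illustration only, the fully qualified table can be addressed like this with the google-cloud-bigquery client; the project ID below is a placeholder for the `taar_gcp_project_id` value.

```python
from google.cloud import bigquery

project_id = "my-gcp-project"  # placeholder for the taar_gcp_project_id Airflow variable
client = bigquery.Client(project=project_id)

# Fully qualified name: <project>.<dataset>.<table>
table_id = f"{project_id}.taar_tmp.taar_tmp_profile"
rows = client.query(f"SELECT COUNT(*) AS n FROM `{table_id}`").result()
print(f"{table_id} holds {list(rows)[0].n} rows")
```
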
Google Cloud Storage ::

The TAAR user profile extraction puts Avro format files into a GCS bucket defined by the following two variables in Airflow:

`taar_gcp_project_id`.`taar_etl_storage_bucket`

The bucket is automatically cleared at the *start* and *end* of the TAAR weekly ETL job.
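
A sketch of what that clearing step could look like with the google-cloud-storage client; the project and bucket names are placeholders for the two Airflow variables above.

```python
from google.cloud import storage

client = storage.Client(project="my-gcp-project")  # taar_gcp_project_id placeholder
bucket = client.bucket("my-taar-etl-bucket")       # taar_etl_storage_bucket placeholder

# Remove every Avro blob left over from a previous run.
for blob in bucket.list_blobs():
    blob.delete()
```
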
Google Cloud BigTable ::

The final TAAR user profile data is stored in a Cloud BigTable instance defined by the following two variables in Airflow:

* `taar_gcp_project_id`
* `taar_bigtable_instance_id`

The table ID for user profile information is `taar_profile`.
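
A sketch of reading one user profile back out of BigTable with the google-cloud-bigtable client; the identifiers and the row-key scheme are assumptions for illustration.

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-gcp-project")  # taar_gcp_project_id placeholder
instance = client.instance("my-bigtable-instance")  # taar_bigtable_instance_id placeholder
table = instance.table("taar_profile")

# Assumes the row key is the hashed telemetry client id (illustrative).
row = table.read_row(b"<hashed_telemetry_id>")
if row is not None:
    for family, columns in row.cells.items():
        print(family, list(columns))  # column qualifiers present in each family
```
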
----
TAAR breaks out all S3 data load configuration into environment variables. This ensures that running under test has no chance of

@@ -127,3 +201,42 @@ Similarity Recommender ::
TAAR_SIMILARITY_BUCKET = "telemetry-parquet"
TAAR_SIMILARITY_DONOR_KEY = "taar/similarity/donors.json"
TAAR_SIMILARITY_LRCURVES_KEY = "taar/similarity/lr_curves.json"
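
A minimal sketch of how such environment-driven configuration is typically consumed, with the documented values above used as fallbacks:

```python
import os

# Fall back to the documented defaults when the variables are unset.
TAAR_SIMILARITY_BUCKET = os.environ.get(
    "TAAR_SIMILARITY_BUCKET", "telemetry-parquet")
TAAR_SIMILARITY_DONOR_KEY = os.environ.get(
    "TAAR_SIMILARITY_DONOR_KEY", "taar/similarity/donors.json")
TAAR_SIMILARITY_LRCURVES_KEY = os.environ.get(
    "TAAR_SIMILARITY_LRCURVES_KEY", "taar/similarity/lr_curves.json")
```
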
------

Production Configuration Settings
---------------------------------

Production environment settings are stored in a [private repository](https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/data/puppet/yaml/type/data.api.prod.taar.yaml).

------
Deleting individual user data from all TAAR resources
-----------------------------------------------------

Deletion of records in TAAR is fairly straightforward. Once a user disables telemetry in Firefox, all that is required is to delete their records from TAAR.

Deletion of records from the TAAR BigTable instance will remove the client's list of add-ons from TAAR. No further work is required.
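
A sketch of that row deletion with the google-cloud-bigtable client; identifiers and the row-key scheme are placeholders, as above.

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-gcp-project")  # taar_gcp_project_id placeholder
instance = client.instance("my-bigtable-instance")  # taar_bigtable_instance_id placeholder
table = instance.table("taar_profile")

# Delete the client's entire row (row key scheme is illustrative).
row = table.row(b"<hashed_telemetry_id>")
row.delete()
row.commit()
```
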
Removal of the records from BigTable will cause JSON model updates to no longer take the deleted record into account. JSON models are updated on a daily basis via the [`taar_daily`](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py) DAG in Airflow.

Google Cloud Platform stage environment
---------------------------------------

To query the stage instance of the TAAR service:

```bash
curl https://stage:fancyfork38@stage.taar.nonprod.dataops.mozgcp.net/v1/api/recommendations/<hashed_telemetry_id>
```

Airflow variables for BigTable and GCS Avro storage (a sketch showing how to read them follows this list):

* `taar_bigtable_instance_id`
* `taar_etl_storage_bucket`
* `taar_gcp_project_id`
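
A sketch of how an ETL task can resolve these at run time via the standard Airflow Variable API:

```python
from airflow.models import Variable

# Resolve the GCP resources configured for TAAR in Airflow.
project_id = Variable.get("taar_gcp_project_id")
instance_id = Variable.get("taar_bigtable_instance_id")
bucket_name = Variable.get("taar_etl_storage_bucket")
```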