Mirror of https://github.com/mozilla/taar.git

Update documentation to reflect modern deployment (#170)

Parent: fef44f8368
Commit: 316aee7c4f

Changed file: README.md (433 changed lines)

Telemetry-Aware Addon Recommender

[![CircleCI](https://circleci.com/gh/mozilla/taar.svg?style=svg)](https://circleci.com/gh/mozilla/taar)

Table of Contents
=================

* [Taar](#taar)
* [How does it work?](#how-does-it-work)
* [Supported models](#supported-models)
* [Build and run tests](#build-and-run-tests)
* [Pinning dependencies](#pinning-dependencies)
* [Instructions for releasing updates to production](#instructions-for-releasing-updates-to-production)
* [Dependencies](#dependencies)
* [AWS resources](#aws-resources)
* [AWS environment configuration](#aws-environment-configuration)
* [Collaborative Recommender](#collaborative-recommender)
* [Ensemble Recommender](#ensemble-recommender)
* [Locale Recommender](#locale-recommender)
* [Similarity Recommender](#similarity-recommender)
* [Google Cloud Platform resources](#google-cloud-platform-resources)
* [Google Cloud BigQuery](#google-cloud-bigquery)
* [Google Cloud Storage](#google-cloud-storage)
* [Google Cloud BigTable](#google-cloud-bigtable)
* [Production Configuration Settings](#production-configuration-settings)
* [Deleting individual user data from all TAAR resources](#deleting-individual-user-data-from-all-taar-resources)
* [Airflow environment configuration](#airflow-environment-configuration)
* [Staging Environment](#staging-environment)
* [A note on cdist optimization](#a-note-on-cdist-optimization)

## How does it work?

The recommendation strategy is implemented through the
[RecommendationManager](taar/recommenders/recommendation_manager.py).
Once a recommendation is requested for a specific [client
id](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/common-ping.html),
the recommender iterates through all the registered models (e.g.
[CollaborativeRecommender](taar/recommenders/collaborative_recommender.py))
linearly in their registered order. Results are returned from the
first module that can perform a recommendation.

Each module specifies its own set of rules and requirements and thus
can decide independently of the other modules whether it can perform a
recommendation.
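
As a rough sketch of that fall-through behaviour (class and method names
here are illustrative assumptions, not the actual API in
`taar/recommenders/recommendation_manager.py`):

```python
# Illustrative sketch of the linear fall-through strategy described above.
# Class and method names are assumptions, not the actual TAAR API.
class RecommendationManagerSketch:
    def __init__(self, recommenders):
        # Recommenders are tried in their registered order.
        self._recommenders = list(recommenders)

    def recommend(self, client_id, limit=10):
        for recommender in self._recommenders:
            # Each module applies its own rules (telemetry availability,
            # donor match, locale coverage, ...) to decide if it can answer.
            if recommender.can_recommend(client_id):
                return recommender.recommend(client_id, limit)
        # No registered module could produce a recommendation.
        return []
```
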
### Supported models

This is the ordered list of the currently supported models:

| Order | Model | Description | Conditions | Generator job |
|-------|-------|-------------|------------|---------------|
| 1 | [Collaborative](taar/recommenders/collaborative_recommender.py) | recommends add-ons based on add-ons installed by other users (i.e. [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering))|Telemetry data is available for the user and the user has at least one enabled add-on|[source](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/ml/AddonRecommender.scala)|
| 2 | [Similarity](taar/recommenders/similarity_recommender.py) | recommends add-ons based on add-ons installed by similar representative users|Telemetry data is available for the user and a suitable representative donor can be found|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_similarity.py)|
| 3 | [Locale](taar/recommenders/locale_recommender.py) |recommends add-ons based on the top add-ons for the user's locale|Telemetry data is available for the user and the locale has enough users|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_locale.py)|
| 4 | [Ensemble](taar/recommenders/ensemble_recommender.py) |recommends add-ons based on the combined (by [stacked generalization](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking)) recommendations of other available recommender modules.|More than one of the other models is available to provide recommendations.|[source](https://github.com/mozilla/python_mozetl/blob/master/mozetl/taar/taar_ensemble.py)|
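
To illustrate the stacking step used by the Ensemble model, here is a
minimal sketch of combining per-recommender scores with learned weights
(the weight schema and helper below are assumptions, not the code in
`taar/recommenders/ensemble_recommender.py`):

```python
# Minimal sketch of linear stacking: weight each sub-recommender's scores
# and sum them per add-on GUID. In production the weights come from
# taar/ensemble/ensemble_weight.json; the exact schema is an assumption.
from collections import defaultdict

def combine_scores(per_recommender_scores, weights):
    """per_recommender_scores: {"collaborative": {guid: score, ...}, ...}
    weights: {"collaborative": 0.7, "locale": 0.2, "similarity": 0.1}"""
    combined = defaultdict(float)
    for name, scores in per_recommender_scores.items():
        for guid, score in scores.items():
            combined[guid] += weights.get(name, 0.0) * score
    # Highest combined score first.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```
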
All jobs are scheduled in Mozilla's instance of
[Airflow](https://github.com/mozilla/telemetry-airflow). The
Collaborative, Similarity and Locale jobs are executed on a
[daily](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py)
schedule, while the ensemble job is scheduled on a
[weekly](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_weekly.py)
schedule.

## Build and run tests

You should be able to build taar using Python 3.5 or 3.7.
To run the testsuite, execute:

```
$ python setup.py develop
$ python setup.py test
```

Alternately, if you've got GNU Make installed, a Makefile is included
with
[`build`](https://github.com/mozilla/taar/blob/more_docs/Makefile#L20)
and
[`test-container`](https://github.com/mozilla/taar/blob/more_docs/Makefile#L55)
targets. You can just run `make build; make test-container`, which
will build a complete Docker container and run the test suite inside
the container.

## Pinning dependencies

TAAR uses miniconda and an environment.yml file to manage versioning.

To update versions, edit the environment.yml with the new dependency
you need. If you are unfamiliar with using conda, see the [official
documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)
for reference.

## Instructions for releasing updates to production

Building a new release of TAAR is fairly involved. Documentation to
create a new release has been split out into separate
[instructions](https://github.com/mozilla/taar/blob/master/docs/release_instructions.md).

## Dependencies

### AWS resources

Recommendation engines load models from Amazon S3.

The following table is a complete list of all resources per
recommendation engine.

Recommendation Engine | S3 Resource
--- | ---
RecommendationManager Whitelist | s3://telemetry-parquet/telemetry-ml/addon_recommender/top_200_whitelist.json
Similarity Recommender | s3://telemetry-parquet/taar/similarity/donors.json <br> s3://telemetry-parquet/taar/similarity/lr_curves.json
CollaborativeRecommender | s3://telemetry-parquet/telemetry-ml/addon_recommender/item_matrix.json <br> s3://telemetry-parquet/telemetry-ml/addon_recommender/addon_mapping.json
LocaleRecommender | s3://telemetry-parquet/taar/locale/top10_dict.json
EnsembleRecommender | s3://telemetry-parquet/taar/ensemble/ensemble_weight.json
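
For illustration only (TAAR has its own loaders with caching; this is
not the project's code), fetching one of these JSON models with boto3
could look like:

```python
# Hedged example: download and parse one of the JSON models listed above.
# TAAR's real loaders add caching and error handling; this is a sketch.
import json
import boto3

def load_s3_json(bucket, key):
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body)

top_whitelist = load_s3_json(
    "telemetry-parquet",
    "telemetry-ml/addon_recommender/top_200_whitelist.json",
)
```
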
### AWS environment configuration

TAAR breaks out all S3 data load configuration into environment
variables. This ensures that running under test has no chance of
clobbering the production data in the event that a developer has AWS
configuration keys installed locally in `~/.aws/`.

Production environment variables required for TAAR:

## Collaborative Recommender

Env Variable | Value
--- | ---
TAAR_ITEM_MATRIX_BUCKET | "telemetry-parquet"
TAAR_ITEM_MATRIX_KEY | "telemetry-ml/addon_recommender/item_matrix.json"
TAAR_ADDON_MAPPING_BUCKET | "telemetry-parquet"
TAAR_ADDON_MAPPING_KEY | "telemetry-ml/addon_recommender/addon_mapping.json"

## Ensemble Recommender

Env Variable | Value
--- | ---
TAAR_ENSEMBLE_BUCKET | "telemetry-parquet"
TAAR_ENSEMBLE_KEY | "taar/ensemble/ensemble_weight.json"

## Locale Recommender

Env Variable | Value
--- | ---
TAAR_LOCALE_BUCKET | "telemetry-parquet"
TAAR_LOCALE_KEY | "taar/locale/top10_dict.json"

## Similarity Recommender

Env Variable | Value
--- | ---
TAAR_SIMILARITY_BUCKET | "telemetry-parquet"
TAAR_SIMILARITY_DONOR_KEY | "taar/similarity/donors.json"
TAAR_SIMILARITY_LRCURVES_KEY | "taar/similarity/lr_curves.json"
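
A minimal sketch of how these settings might be resolved at runtime
(the defaulting behaviour shown is an assumption, not TAAR's actual
configuration code):

```python
# Sketch: resolve the S3 locations above from environment variables so
# that tests can point at non-production buckets. The fallback values
# are the production settings from the tables above.
import os

TAAR_ITEM_MATRIX_BUCKET = os.environ.get("TAAR_ITEM_MATRIX_BUCKET", "telemetry-parquet")
TAAR_ITEM_MATRIX_KEY = os.environ.get(
    "TAAR_ITEM_MATRIX_KEY", "telemetry-ml/addon_recommender/item_matrix.json"
)
TAAR_ENSEMBLE_BUCKET = os.environ.get("TAAR_ENSEMBLE_BUCKET", "telemetry-parquet")
TAAR_ENSEMBLE_KEY = os.environ.get("TAAR_ENSEMBLE_KEY", "taar/ensemble/ensemble_weight.json")
```
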
## Google Cloud Platform resources

### Google Cloud BigQuery

Cloud BigQuery uses the GCP project defined in Airflow in the
variable `taar_gcp_project_id`.

Dataset

* `taar_tmp`

Table ID

* `taar_tmp_profile`

Note that this table only exists for the duration of the taar_weekly
job, so there should be no need to manually manage this table.

### Google Cloud Storage

The taar user profile extraction puts Avro format files into
a GCS bucket defined by the following two variables in Airflow:

* `taar_gcp_project_id`
* `taar_etl_storage_bucket`

The bucket is automatically cleared at the *start* and *end* of
the TAAR weekly ETL job.

### Google Cloud BigTable

The final TAAR user profile data is stored in a Cloud BigTable
instance defined by the following two variables in Airflow:

* `taar_gcp_project_id`
* `taar_bigtable_instance_id`

The table ID for user profile information is `taar_profile`.

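As an illustration only (not TAAR's actual data-access code), reading a
single profile row with the google-cloud-bigtable client might look
like this; the row-key scheme (a hashed telemetry client id) is an
assumption:

```python
# Hedged sketch: look up one client's profile row in the taar_profile
# table. Project and instance IDs come from the Airflow variables above.
from google.cloud import bigtable

def read_taar_profile(project_id, instance_id, hashed_client_id):
    client = bigtable.Client(project=project_id, admin=False)
    table = client.instance(instance_id).table("taar_profile")
    return table.read_row(hashed_client_id.encode("utf-8"))
```
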
------

## Production Configuration Settings

Production environment settings are stored in a [private repository](https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/data/puppet/yaml/type/data.api.prod.taar.yaml).

## Deleting individual user data from all TAAR resources

Deletion of records in TAAR is fairly straightforward. Once a user
disables telemetry in Firefox, all that is required is to delete
records from TAAR.

Deletion of records from the TAAR BigTable instance will remove the
client's list of addons from TAAR. No further work is required.

Removal of the records from BigTable will cause JSON model updates to
no longer take the deleted record into account. JSON models are
updated on a daily basis via the
[`taar_daily`](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py)
DAG in Airflow.

Updates in the weekly Airflow job in
[`taar_weekly`](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_weekly.py)
only update the ensemble weights and the user profile information.

If the user profile information in `clients_last_seen` continues to
have data for the user's telemetry-id, TAAR will repopulate the user
profile data.

Users who wish to remove their data from TAAR need to:

1. Disable telemetry in Firefox
2. Have user telemetry data removed from all telemetry storage systems
   in GCP. Primarily this means the `clients_last_seen` table in
   BigQuery.
3. Have user data removed from BigTable.
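
For the BigTable step, a minimal sketch using the google-cloud-bigtable
client (the row-key scheme, a hashed telemetry client id, is an
assumption, as are the project and instance ID arguments):

```python
# Hedged sketch: remove one client's profile row from Cloud BigTable.
# This mirrors the read sketch above but queues a full-row deletion.
from google.cloud import bigtable

def delete_taar_profile(project_id, instance_id, hashed_client_id):
    client = bigtable.Client(project=project_id, admin=False)
    table = client.instance(instance_id).table("taar_profile")
    row = table.row(hashed_client_id.encode("utf-8"))
    row.delete()   # queue a full-row deletion
    row.commit()   # apply the mutation
```
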
## Airflow environment configuration

TAAR requires some configuration to be stored in Airflow variables for
the ETL jobs to run to completion correctly.

Airflow Variable | Value
--- | ---
taar_gcp_project_id | The Google Cloud Platform project where BigQuery temporary tables, Cloud Storage buckets for Avro files and BigTable reside for TAAR.
taar_etl_storage_bucket | The Cloud Storage bucket name where temporary Avro files will reside when transferring data from BigQuery to BigTable.
taar_bigtable_instance_id | The BigTable instance ID for TAAR user profile information.
taar_dataflow_subnetwork | The subnetwork required for communication between Cloud Dataflow workers.
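
Within a DAG these values are read through Airflow's `Variable` API; a
minimal sketch (illustrative, not the actual taar_daily or taar_weekly
code):

```python
# Sketch: how an Airflow DAG task might read the variables above.
from airflow.models import Variable

taar_gcp_project_id = Variable.get("taar_gcp_project_id")
taar_etl_storage_bucket = Variable.get("taar_etl_storage_bucket")
taar_bigtable_instance_id = Variable.get("taar_bigtable_instance_id")
taar_dataflow_subnetwork = Variable.get("taar_dataflow_subnetwork")
```
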
## Staging Environment

The staging environment of the TAAR service in GCP can be reached
using curl:

```
curl https://user:pass@stage.taar.nonprod.dataops.mozgcp.net/v1/api/recommendations/<hashed_telemetry_id>
```

## A note on cdist optimization.

cdist can speed up distance computation by a factor of 10 for the
computations we're doing.

However, when you manually provide a callable to cdist, cdist cannot
do its basic optimizations
(https://github.com/scipy/scipy/blob/v1.0.0/scipy/spatial/distance.py#L2408),
so we can just apply the function `distance.hamming` to our array
manually and get the same performance.
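
A small sketch of the comparison described above (array shapes are
arbitrary and chosen only for illustration):

```python
# With a Python callable, cdist loses its C-level fast path, so applying
# distance.hamming manually per donor row performs comparably.
import numpy as np
from scipy.spatial import distance

donors = np.random.randint(0, 2, size=(1000, 16))
client = np.random.randint(0, 2, size=(1, 16))

# cdist with a Python callable cannot use its optimized core.
d_callable = distance.cdist(client, donors, lambda u, v: distance.hamming(u, v))

# Equivalent result: apply distance.hamming manually to each donor row.
d_manual = np.array([distance.hamming(client[0], row) for row in donors])

assert np.allclose(d_callable[0], d_manual)
```
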

------

New file: docs/release_instructions.md

# Instructions for releasing updates

## Overview

Releases for TAAR are split across ETL jobs for Airflow and the
webservice that handles traffic coming from addons.mozilla.org.

You may or may not need to upgrade all parts at once.

### ETL release instructions

ETL releases are subdivided further into 4 categories:

1. Scala code that requires deployment by Java JAR file to a Dataproc environment
2. PySpark code that requires deployment by a single monolithic script in the
   Dataproc environment. These are stored in
   [telemetry-airflow/jobs](https://github.com/mozilla/telemetry-airflow/tree/master/jobs)
   and are autodeployed to gs://moz-fx-data-prod-airflow-dataproc-artifacts/jobs
3. Python code that executes in a Google Kubernetes Engine (GKE)
   environment using a Docker container image.
4. TAAR User profile information

#### 1. Scala jobs for Dataproc

* [com.mozilla.telemetry.ml.AddonRecommender](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/ml/AddonRecommender.scala) from telemetry-batch-view.jar

#### 2. PySpark jobs for Dataproc

* [telemetry-airflow/jobs/taar_locale.py](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_locale.py)
* [telemetry-airflow/jobs/taar_similarity.py](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_similarity.py)
* [telemetry-airflow/jobs/taar_lite_guidguid.py](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_lite_guidguid.py)

#### 3. GKEPodOperator jobs

* [taar_etl.taar_amodump](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_amodump.py)
* [taar_etl.taar_amowhitelist](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_amowhitelist.py)
* [taar_etl.taar_update_whitelist](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_update_whitelist.py)

#### 4. TAAR User profile information

The TAAR User profile information is stored in Cloud BigTable. The
job is run as a list of idempotent steps. All tasks are contained in
a single file at:

* [taar_etl.taar_profile_bigtable](https://github.com/mozilla/taar_gcp_etl/blob/master/taar_etl/taar_profile_bigtable.py)

## Jobs are scheduled in two separate DAGs in Airflow.

* [taar_daily](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py)
* [taar_weekly](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_weekly.py)

### Updating code for GKEPodOperator jobs

GKEPodOperator jobs must have code packaged up as containers for
execution in GKE. Code can be found in the taar_gcp_etl repository.
Detailed build instructions can be found in the
[README.md](https://github.com/mozilla/taar_gcp_etl/blob/master/README.md)
in that repository.

Generally, if you tag a revision in `taar_gcp_etl`, CircleCI will
build the production container for you automatically. You will also
need to update the container tag in the `taar_daily` or `taar_weekly`
DAGs.

### Updating code for PySpark jobs

PySpark jobs are maintained in the telemetry-airflow repository. You
must take care to update the code in that repository and have it
merged to master for code to autodeploy into the production Airflow
instance.

Airflow execution will always copy jobs out of the
[jobs](https://github.com/mozilla/telemetry-airflow/tree/master/jobs)
directory into `gs://moz-fx-data-prod-airflow-dataproc-artifacts/`.

### Updating code for the Scala ETL job

The sole Scala job remaining is part of the telemetry-batch-view
repository. Airflow will automatically use the latest code in the
master branch of `telemetry-batch-view`.

## Deploying the TAAR webservice

The TAAR webservice is set up as a single container with no dependent
containers. If you are familiar with earlier versions of TAAR, you
may be expecting Redis servers to also be required; this is no longer
the case. Models are sufficiently small that they can be held in
memory.

Tagging a version in git will trigger CircleCI to build a container
image for production.

You must inform operations to push the tag to staging and to
production environments.

No autopush on tag is currently enabled.

## A note about logging

tl;dr - Do **NOT** use Python's logging module for any logging in the
TAAR repository. TAAR's recommendation code is used by the ETL jobs,
some of which execute inside a PySpark environment, and logging is
incompatible with PySpark.

PySpark distributes executable objects across the Spark worker nodes
by pickling live objects. Unfortunately, Python uses non-serializable
mutexes in the logging module, which was not fixed until Python 3.8.

See https://bugs.python.org/issue30520 for details.

You cannot upgrade TAAR to use Python 3.8 either, as the full
numerical computation stack of PySpark, numpy, scipy and sklearn does
not properly support Python 3.8.

So again: just **don't use Python logging**.
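
A tiny, illustrative demonstration of the pickling problem (not part of
TAAR; behaviour depends on the Python version and the handlers
attached):

```python
# Illustrative only: pickling objects that hold a logging.Logger (as
# PySpark does when shipping closures to workers) typically fails on
# Python < 3.8 because handlers carry unpicklable thread locks.
import logging
import pickle

class UsesLogger:
    def __init__(self):
        self.log = logging.getLogger("taar.example")
        self.log.addHandler(logging.StreamHandler())

try:
    pickle.dumps(UsesLogger())
except TypeError as exc:
    print("cannot pickle:", exc)
```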