# Taar
Telemetry-Aware Addon Recommender

[![CircleCI](https://circleci.com/gh/mozilla/taar.svg?style=svg)](https://circleci.com/gh/mozilla/taar)

Table of Contents
=================

* [Taar](#taar)
* [How does it work?](#how-does-it-work)
* [Supported models](#supported-models)
* [Build and run tests](#build-and-run-tests)
* [Pinning dependencies](#pinning-dependencies)
* [Instructions for releasing updates to production](#instructions-for-releasing-updates-to-production)
* [Collaborative Recommender](#collaborative-recommender)
* [Ensemble Recommender](#ensemble-recommender)
* [Locale Recommender](#locale-recommender)
* [Similarity Recommender](#similarity-recommender)
* [Google Cloud Platform resources](#google-cloud-platform-resources)
* [Google Cloud BigQuery](#google-cloud-bigquery)
* [Google Cloud Storage](#google-cloud-storage)
* [Google Cloud BigTable](#google-cloud-bigtable)
* [Production Configuration Settings](#production-configuration-settings)
* [Deleting individual user data from all TAAR resources](#deleting-individual-user-data-from-all-taar-resources)
* [Airflow environment configuration](#airflow-environment-configuration)
* [Staging Environment](#staging-environment)
* [A note on cdist optimization\.](#a-note-on-cdist-optimization)

## How does it work?

The recommendation strategy is implemented through the
[RecommendationManager](taar/recommenders/recommendation_manager.py).
Once a recommendation is requested for a specific [client
id](https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/common-ping.html),
the recommender iterates through all the registered models (e.g.
[CollaborativeRecommender](taar/recommenders/collaborative_recommender.py))
linearly in their registered order. Results are returned from the
first module that can perform a recommendation.

Each module specifies its own set of rules and requirements and thus
can decide if it can perform a recommendation independently from the
other modules.
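
The first-match flow can be sketched compactly. The class below is a
minimal illustration of the strategy described above, not the actual
`RecommendationManager` implementation; the class and method names are
illustrative.

```python
class FirstMatchRecommendationManager:
    """Illustrative sketch of TAAR's linear, first-match strategy."""

    def __init__(self, recommenders):
        # Recommenders are consulted in their registered order.
        self._recommenders = recommenders

    def recommend(self, client_id, limit=4):
        for recommender in self._recommenders:
            # Each module independently decides whether its own rules and
            # requirements are satisfied for this client.
            if recommender.can_recommend(client_id):
                return recommender.recommend(client_id, limit)
        # No module could serve this client.
        return []
```
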
### Supported models

This is the ordered list of the currently supported models:

| Order | Model | Description | Conditions | Generator job |
|-------|-------|-------------|------------|---------------|
| 1 | [Collaborative](taar/recommenders/collaborative_recommender.py) | Recommends add-ons based on add-ons installed by other users (i.e. [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering)) | Telemetry data is available for the user and the user has at least one enabled add-on | [source](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/ml/AddonRecommender.scala) |
| 2 | [Similarity](taar/recommenders/similarity_recommender.py) | Recommends add-ons based on add-ons installed by similar representative users | Telemetry data is available for the user and a suitable representative donor can be found | [source](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_similarity.py) |
| 3 | [Locale](taar/recommenders/locale_recommender.py) | Recommends add-ons based on the top add-ons for the user's locale | Telemetry data is available for the user and the locale has enough users | [source](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_locale.py) |
| 4 | [Ensemble](taar/recommenders/ensemble_recommender.py)* | Recommends add-ons based on the combined (by [stacked generalization](https://en.wikipedia.org/wiki/Ensemble_learning#Stacking)) recommendations of the other available recommender modules | More than one of the other models is available to provide recommendations | [source](https://github.com/mozilla/telemetry-airflow/blob/master/jobs/taar_ensemble.py) |

All jobs are scheduled in Mozilla's instance of
[Airflow](https://github.com/mozilla/telemetry-airflow). The
Collaborative, Similarity and Locale jobs are executed on a
[daily](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py)
schedule, while the ensemble job is scheduled on a
[weekly](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_weekly.py)
schedule.

## Build and run tests

You should be able to build TAAR using Python 3.5 or 3.7.
To run the test suite, execute:

```bash
$ python setup.py develop
$ python setup.py test
```

Alternately, if you have GNU Make installed, a Makefile is included
with
[`build`](https://github.com/mozilla/taar/blob/more_docs/Makefile#L20)
and
[`test-container`](https://github.com/mozilla/taar/blob/more_docs/Makefile#L55)
targets.

You can just run `make build; make test-container`, which will build a
complete Docker container and run the test suite inside the container.

## Pinning dependencies

TAAR uses Miniconda and an `environment.yml` file to manage versioning.

To update versions, edit `environment.yml` with the new dependency
you need, then run `make conda_update`.

If you are unfamiliar with using conda, see the [official
documentation](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)
for reference.

## Instructions for releasing updates to production

Building a new release of TAAR is fairly involved. Documentation to
create a new release has been split out into separate
[instructions](https://github.com/mozilla/taar/blob/master/docs/release_instructions.md).

## Dependencies

### Google Cloud Storage resources

The final TAAR models are stored in:

```gs://moz-fx-data-taar-pr-prod-e0f7-prod-models```

The TAAR production model bucket is defined in Airflow under the
variable `taar_etl_model_storage_bucket`.

Temporary models that the Airflow ETL jobs require are stored in a
temporary bucket defined in the Airflow variable `taar_etl_storage_bucket`.

Recommendation engines load models from GCS.

The following table is a complete list of all resources per
recommendation engine.

Recommendation Engine | GCS Resource
--- | ---
RecommendationManager Whitelist | gs://moz-fx-data-taar-pr-prod-e0f7-prod-models/addon_recommender/only_guids_top_200.json.bz2
Similarity Recommender | gs://moz-fx-data-taar-pr-prod-e0f7-prod-models/taar/similarity/donors.json.bz2 <br> gs://moz-fx-data-taar-pr-prod-e0f7-prod-models/taar/similarity/lr_curves.json.bz2
CollaborativeRecommender | gs://moz-fx-data-taar-pr-prod-e0f7-prod-models/addon_recommender/item_matrix.json.bz2 <br> gs://moz-fx-data-taar-pr-prod-e0f7-prod-models/addon_recommender/addon_mapping.json.bz2
LocaleRecommender | gs://moz-fx-data-taar-pr-prod-e0f7-prod-models/taar/locale/top10_dict.json.bz2
EnsembleRecommender | gs://moz-fx-data-taar-pr-prod-e0f7-prod-models/taar/ensemble/ensemble_weight.json.bz2
TAAR lite | gs://moz-fx-data-taar-pr-prod-e0f7-prod-models/taar/lite/guid_install_ranking.json.bz2 <br> gs://moz-fx-data-taar-pr-prod-e0f7-prod-models/taar/lite/guid_coinstallation.json.bz2
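
Since every model in the table is a bz2-compressed JSON blob, loading one
is mechanical. The following is a minimal sketch, assuming the
`google-cloud-storage` client library; the `load_gcs_json_model` helper is
illustrative, not TAAR's actual loader.

```python
import bz2
import json

from google.cloud import storage


def load_gcs_json_model(bucket_name: str, key: str) -> dict:
    """Download a .json.bz2 model object from GCS and decode it."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(key)
    return json.loads(bz2.decompress(blob.download_as_bytes()))


# Example: the LocaleRecommender model from the table above.
top10_dict = load_gcs_json_model(
    "moz-fx-data-taar-pr-prod-e0f7-prod-models",
    "taar/locale/top10_dict.json.bz2",
)
```
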
# Production environment variables required for TAAR

## Collaborative Recommender

Env Variable | Value
--- | ---
TAAR_ITEM_MATRIX_BUCKET | "moz-fx-data-taar-pr-prod-e0f7-prod-models"
TAAR_ITEM_MATRIX_KEY | "addon_recommender/item_matrix.json.bz2"
TAAR_ADDON_MAPPING_BUCKET | "moz-fx-data-taar-pr-prod-e0f7-prod-models"
TAAR_ADDON_MAPPING_KEY | "addon_recommender/addon_mapping.json.bz2"

## Ensemble Recommender

Env Variable | Value
--- | ---
TAAR_ENSEMBLE_BUCKET | "moz-fx-data-taar-pr-prod-e0f7-prod-models"
TAAR_ENSEMBLE_KEY | "taar/ensemble/ensemble_weight.json.bz2"

## Locale Recommender

Env Variable | Value
--- | ---
TAAR_LOCALE_BUCKET | "moz-fx-data-taar-pr-prod-e0f7-prod-models"
TAAR_LOCALE_KEY | "taar/locale/top10_dict.json.bz2"

## Similarity Recommender

Env Variable | Value
--- | ---
TAAR_SIMILARITY_BUCKET | "moz-fx-data-taar-pr-prod-e0f7-prod-models"
TAAR_SIMILARITY_DONOR_KEY | "taar/similarity/donors.json.bz2"
TAAR_SIMILARITY_LRCURVES_KEY | "taar/similarity/lr_curves.json.bz2"

## TAAR Lite

Env Variable | Value
--- | ---
TAARLITE_GUID_COINSTALL_BUCKET | "moz-fx-data-taar-pr-prod-e0f7-prod-models"
TAARLITE_GUID_COINSTALL_KEY | "taar/lite/guid_coinstallation.json.bz2"
TAARLITE_GUID_RANKING_KEY | "taar/lite/guid_install_ranking.json.bz2"

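These variables are plain strings read from the service environment. A
minimal sketch of how a recommender might resolve its model location from
them, reusing the illustrative `load_gcs_json_model` helper sketched
earlier (the fallback defaults here simply mirror the production values in
the tables above):

```python
import os

bucket = os.environ.get(
    "TAARLITE_GUID_COINSTALL_BUCKET",
    "moz-fx-data-taar-pr-prod-e0f7-prod-models",
)
key = os.environ.get(
    "TAARLITE_GUID_COINSTALL_KEY",
    "taar/lite/guid_coinstallation.json.bz2",
)

coinstall_model = load_gcs_json_model(bucket, key)
```
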
## Google Cloud Platform resources

### Google Cloud BigQuery

Cloud BigQuery uses the GCP project defined in Airflow in the
variable `taar_gcp_project_id`.

Dataset:

* `taar_tmp`

Table ID:

* `taar_tmp_profile`

Note that this table only exists for the duration of the `taar_weekly`
job, so there should be no need to manually manage this table.

### Google Cloud Storage

The TAAR user profile extraction puts Avro format files into
a GCS bucket defined by the following two variables in Airflow:

* `taar_gcp_project_id`
* `taar_etl_storage_bucket`

The bucket is automatically cleared at the *start* and *end* of
the TAAR weekly ETL job.

### Google Cloud BigTable

The final TAAR user profile data is stored in a Cloud BigTable
instance defined by the following two variables in Airflow:

* `taar_gcp_project_id`
* `taar_bigtable_instance_id`

The table ID for user profile information is `taar_profile`.
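
As a concrete illustration, reading one profile row with the
`google-cloud-bigtable` client library might look like the sketch below.
The project and instance IDs come from the Airflow variables above; the
row-key scheme shown here is an assumption, not the documented schema.

```python
from google.cloud import bigtable

# The placeholder values come from the Airflow variables above.
client = bigtable.Client(project="<taar_gcp_project_id>")
instance = client.instance("<taar_bigtable_instance_id>")
table = instance.table("taar_profile")

# Assumed row-key scheme: one row per (hashed) telemetry client id.
row = table.read_row(b"<hashed_telemetry_id>")
if row is not None:
    for column_family, cells in row.cells.items():
        print(column_family, cells)
```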

------

## Production Configuration Settings

Production environment settings are stored in a [private repository](https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/data/puppet/yaml/type/data.api.prod.taar.yaml).

## Deleting individual user data from all TAAR resources

Deletion of records in TAAR is fairly straightforward. Once a user
disables telemetry in Firefox, all that is required is to delete
records from TAAR.

Deletion of records from the TAAR BigTable instance will remove the
client's list of add-ons from TAAR. No further work is required.

Removal of the records from BigTable will cause JSON model updates to
no longer take the deleted record into account. JSON models are
updated on a daily basis via the
[`taar_daily`](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_daily.py)
DAG.

The weekly Airflow job in
[`taar_weekly`](https://github.com/mozilla/telemetry-airflow/blob/master/dags/taar_weekly.py)
only updates the ensemble weights and the user profile information.

If the user profile information in `clients_last_seen` continues to
have data for the user's telemetry ID, TAAR will repopulate the user
profile data.

Users who wish to remove their data from TAAR need to:

1. Disable telemetry in Firefox.
2. Have user telemetry data removed from all telemetry storage systems
   in GCP. Primarily this means the `clients_last_seen` table in
   BigQuery.
3. Have user data removed from BigTable.

## Airflow environment configuration

TAAR requires some configuration to be stored in Airflow variables for
the ETL jobs to run to completion correctly.

Airflow Variable | Value
--- | ---
taar_gcp_project_id | The Google Cloud Platform project where BigQuery temporary tables, Cloud Storage buckets for Avro files, and BigTable reside for TAAR.
taar_etl_storage_bucket | The Cloud Storage bucket name where temporary Avro files reside when transferring data from BigQuery to BigTable.
taar_etl_model_storage_bucket | The main GCS bucket where the models are stored.
taar_bigtable_instance_id | The BigTable instance ID for TAAR user profile information.
taar_dataflow_subnetwork | The subnetwork required for Cloud Dataflow workers to communicate.
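
Inside a DAG these values are read with Airflow's standard `Variable` API;
a minimal sketch:

```python
from airflow.models import Variable

gcp_project_id = Variable.get("taar_gcp_project_id")
etl_bucket = Variable.get("taar_etl_storage_bucket")
model_bucket = Variable.get("taar_etl_model_storage_bucket")
bigtable_instance_id = Variable.get("taar_bigtable_instance_id")
dataflow_subnetwork = Variable.get("taar_dataflow_subnetwork")
```
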
## Staging Environment

The staging environment of the TAAR service in GCP can be reached using
curl:

```
curl https://user:pass@stage.taar.nonprod.dataops.mozgcp.net/v1/api/recommendations/<hashed_telemetry_id>
```

Requests for a TAAR-lite recommendation can be made using curl as
well:

```
curl https://stage.taar.nonprod.dataops.mozgcp.net/taarlite/api/v1/addon_recommendations/<addon_guid>/
```

## TAARlite cache tools

There is a taarlite-redis tool to manage the TAARlite redis cache.

The cache needs to be populated using the `--load` command or TAARlite
will return no results.

It is safe to reload new data while TAARlite is running - no
performance degradation is expected.

The cache contains a 'hot' buffer for reads and a 'cold' buffer to
write updated data to.

Subsequent invocations of `--load` will update the cache in the cold
buffer. After data is successfully loaded, the hot and cold buffers
are swapped.
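
The swap is what makes reloads safe: readers keep hitting the hot buffer
until the cold one is fully written, then a single pointer flip makes the
new data live. A minimal sketch of that double-buffer pattern, assuming
the `redis` Python client (the key names here are illustrative, not the
actual TAARlite schema):

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379)


def load_into_cold_buffer(data: dict) -> None:
    # Whichever buffer is not active is "cold" and safe to overwrite.
    hot = (r.get("active_buffer") or b"0").decode()
    cold = "1" if hot == "0" else "0"
    for key, value in data.items():
        r.set(f"buffer:{cold}:{key}", json.dumps(value))
    # Flip the pointer; readers switch to the new data on their next read.
    r.set("active_buffer", cold)
```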

Running the taarlite-redis tool inside the container:

```
$ docker run -it taar:latest bin/run python /opt/conda/bin/taarlite-redis.py --help

Usage: taarlite-redis.py [OPTIONS]

  Manage the TAARLite redis cache.

  This expects that the following environment variables are set:

  REDIS_HOST REDIS_PORT

Options:
  --reset  Reset the redis cache to an empty state
  --load   Load data into redis
  --info   Display information about the cache state
  --help   Show this message and exit.
```

## Testing

TAARLite will respond with suggestions given an addon GUID.

A sample URL path may look like this:

`/taarlite/api/v1/addon_recommendations/uBlock0%40raymondhill.net/`

TAAR will treat any client ID with only repeating digits (e.g. 0000) as
a test client ID and will return a dummy response.

A URL with the path `/v1/api/recommendations/0000000000/` will
return a valid JSON result.
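
The check for such test IDs can be expressed compactly. This is a
hypothetical sketch of the documented behavior, not TAAR's actual code:

```python
def is_test_client_id(client_id: str) -> bool:
    """True for IDs made of one repeated digit, e.g. '0000000000'."""
    digits = client_id.strip("/")
    return digits.isdigit() and len(set(digits)) == 1
```
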
## A note on cdist optimization.

cdist can speed up distance computation by a factor of 10 for the
computations we're doing. We can use it without problems on the Canberra
distance calculation.

Unfortunately there are multiple problems with it accepting a string
array. There are different problems in 0.18.1 (which is what is available
on EMR) and on later versions. In both cases cdist attempts to convert a
string to a double, which fails. For versions of scipy later than 0.18.1
this could be worked around with:

    distance.cdist(v1, v2, lambda x, y: distance.hamming(x, y))

However, when you manually provide a callable to cdist, cdist cannot apply
its baked-in optimizations
(https://github.com/scipy/scipy/blob/v1.0.0/scipy/spatial/distance.py#L2408),
so we can just apply the function `distance.hamming` to our array manually
and get the same performance.
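
A short sketch contrasting the two approaches on a small string array; the
arrays here are illustrative:

```python
import numpy as np
from scipy.spatial import distance

v1 = np.array([["a", "b", "c"], ["a", "a", "a"]], dtype=object)
v2 = np.array([["a", "x", "c"], ["b", "b", "b"]], dtype=object)

# Works on scipy newer than 0.18.1 (per the note above), but the
# Python-level callable bypasses cdist's optimized code paths:
slow = distance.cdist(v1, v2, lambda x, y: distance.hamming(x, y))

# Same result by applying distance.hamming to the rows directly:
fast = np.array([[distance.hamming(a, b) for b in v2] for a in v1])

assert np.allclose(slow, fast)
```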