# Taar

Telemetry-Aware Addon Recommender


## How does it work?

The recommendation strategy is implemented through the RecommendationManager. Once a recommendation is requested for a specific client id, the recommender iterates through all the registered models (e.g. CollaborativeRecommender) linearly in their registered order. Results are returned from the first module that can perform a recommendation.

Each module specifies its own sets of rules and requirements and thus can decide if it can perform a recommendation independently from the other modules.
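This chain-of-responsibility flow can be summarized in a short sketch (class and method names are illustrative, not TAAR's exact API):

```python
# Minimal sketch of the linear recommendation strategy described above.
class RecommendationManager:
    def __init__(self, recommenders):
        # Recommenders are tried in their registered order, e.g.
        # [collaborative, similarity, locale, ensemble].
        self.recommenders = recommenders

    def recommend(self, client_id, limit=10):
        for recommender in self.recommenders:
            # Each module decides independently whether it has enough
            # data (telemetry, donors, locale volume, ...) to answer.
            if recommender.can_recommend(client_id):
                return recommender.recommend(client_id, limit)
        # No module could serve this client ID.
        return []
```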

## Supported models

This is the ordered list of the currently supported models:

| Order | Model | Description | Conditions | Generator job |
|-------|-------|-------------|------------|---------------|
| 1 | Collaborative | recommends add-ons based on add-ons installed by other users (i.e. collaborative filtering) | Telemetry data is available for the user and the user has at least one enabled add-on | source |
| 2 | Similarity | recommends add-ons based on add-ons installed by similar representative users | Telemetry data is available for the user and a suitable representative donor can be found | source |
| 3 | Locale | recommends add-ons based on the top add-ons for the user's locale | Telemetry data is available for the user and the locale has enough users | source |
| 4 | Ensemble * | recommends add-ons based on the combined (by stacked generalization) recommendations of the other available recommender modules | More than one of the other models is available to provide recommendations | source |

All jobs are scheduled in Mozilla's instance of Airflow. The Collaborative, Similarity and Locale jobs run on a daily schedule, while the Ensemble job runs weekly.

## Build and run tests

You should be able to build taar using Python 3.5 or 3.7. To run the test suite, execute:

```
$ python setup.py develop
$ python setup.py test
```

Alternately, if you have GNU Make installed, a Makefile is included with `build` and `test-container` targets.

You can simply run `make build; make test-container`, which will build a complete Docker container and run the test suite inside it.

## Pinning dependencies

TAAR uses miniconda and an `enviroment.yml` file to manage versioning.

To update versions, edit `enviroment.yml` with the new dependency you need. If you are unfamiliar with conda, see the official documentation for reference.

## Instructions for releasing updates to production

Building a new release of TAAR is fairly involved. Documentation to create a new release has been split out into separate instructions.

## Dependencies

### Google Cloud Storage resources

| Storage | Airflow variable | Location |
|---------|------------------|----------|
| Final TAAR production models | `taar_etl_model_storage_bucket` | gs://moz-fx-data-taar-pr-prod-e0f7-prod-models |
| Temporary models required by the Airflow ETL jobs | `taar_etl_storage_bucket` | temporary bucket, name set by the Airflow variable |

### AWS resources

Recommendation engines load models from Amazon S3.

The following table is a complete list of all resources per recommendation engine.

| Recommendation Engine | S3 Resource |
|-----------------------|-------------|
| RecommendationManager Whitelist | s3://telemetry-parquet/telemetry-ml/addon_recommender/top_200_whitelist.json |
| Similarity Recommender | s3://telemetry-parquet/taar/similarity/donors.json<br>s3://telemetry-parquet/taar/similarity/lr_curves.json |
| CollaborativeRecommender | s3://telemetry-parquet/telemetry-ml/addon_recommender/item_matrix.json<br>s3://telemetry-parquet/telemetry-ml/addon_recommender/addon_mapping.json |
| LocaleRecommender | s3://telemetry-parquet/taar/locale/top10_dict.json |
| EnsembleRecommender | s3://telemetry-parquet/taar/ensemble/ensemble_weight.json |

### AWS environment configuration

TAAR breaks out all S3 data load configuration into environment variables. This ensures that running under test has no chance of clobbering production data in the event that a developer has AWS configuration keys installed locally in `~/.aws/`.
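For illustration, a model loader might resolve its S3 coordinates like this (a minimal sketch using `os.environ`; the default values are placeholders, not TAAR's actual code):

```python
import os

# Sketch: resolve S3 coordinates from the environment so a test run can
# point at fixture buckets instead of production data.
TAAR_ITEM_MATRIX_BUCKET = os.environ.get("TAAR_ITEM_MATRIX_BUCKET", "test-bucket")
TAAR_ITEM_MATRIX_KEY = os.environ.get("TAAR_ITEM_MATRIX_KEY", "test/item_matrix.json")
```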

### Production environment variables required for TAAR

#### Collaborative Recommender

| Env Variable | Value |
|--------------|-------|
| TAAR_ITEM_MATRIX_BUCKET | "telemetry-parquet" |
| TAAR_ITEM_MATRIX_KEY | "telemetry-ml/addon_recommender/item_matrix.json" |
| TAAR_ADDON_MAPPING_BUCKET | "telemetry-parquet" |
| TAAR_ADDON_MAPPING_KEY | "telemetry-ml/addon_recommender/addon_mapping.json" |

#### Ensemble Recommender

| Env Variable | Value |
|--------------|-------|
| TAAR_ENSEMBLE_BUCKET | "telemetry-parquet" |
| TAAR_ENSEMBLE_KEY | "taar/ensemble/ensemble_weight.json" |

#### Locale Recommender

| Env Variable | Value |
|--------------|-------|
| TAAR_LOCALE_BUCKET | "telemetry-parquet" |
| TAAR_LOCALE_KEY | "taar/locale/top10_dict.json" |

#### Similarity Recommender

| Env Variable | Value |
|--------------|-------|
| TAAR_SIMILARITY_BUCKET | "telemetry-parquet" |
| TAAR_SIMILARITY_DONOR_KEY | "taar/similarity/donors.json" |
| TAAR_SIMILARITY_LRCURVES_KEY | "taar/similarity/lr_curves.json" |

### Google Cloud Platform resources

#### Google Cloud BigQuery

Cloud BigQuery uses the GCP project defined in Airflow in the variable `taar_gcp_project_id`.

Dataset:

* `taar_tmp`

Table ID:

* `taar_tmp_profile`

Note that this table only exists for the duration of the `taar_weekly` job, so there should be no need to manually manage this table.

#### Google Cloud Storage

The TAAR user profile extraction writes Avro-format files into a GCS bucket defined by the following two variables in Airflow:

* `taar_gcp_project_id`
* `taar_etl_storage_bucket`

The bucket is automatically cleared at the start and end of the TAAR weekly ETL job.

#### Google Cloud BigTable

The final TAAR user profile data is stored in a Cloud BigTable instance defined by the following two variables in Airflow:

* `taar_gcp_project_id`
* `taar_bigtable_instance_id`

The table ID for user profile information is `taar_profile`.


## Production Configuration Settings

Production environment settings are stored in a private repository.

## Deleting individual user data from all TAAR resources

Deletion of records in TAAR is fairly straightforward. Once a user disables telemetry in Firefox, all that is required is to delete their records from TAAR.

Deleting records from the TAAR BigTable instance will remove the client's list of add-ons from TAAR. No further work is required.

Removal of the records from BigTable will cause JSON model updates to no longer take the deleted record into account. JSON models are updated on a daily basis via the `taar_daily` job.

The weekly Airflow job `taar_weekly` only updates the ensemble weights and the user profile information.

If the user profile information in `clients_last_seen` still contains data for the user's telemetry ID, TAAR will repopulate the user profile data.

Users who wish to remove their data from TAAR need to:

1. Disable telemetry in Firefox.
2. Have their telemetry data removed from all telemetry storage systems in GCP; primarily this means the `clients_last_seen` table in BigQuery.
3. Have their data removed from BigTable (a sketch of this step follows the list).
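A hedged sketch of what step 3 could look like with the google-cloud-bigtable client; the function and row-key handling here are illustrative, not TAAR's actual deletion job:

```python
from google.cloud import bigtable

def delete_client_profile(project_id, instance_id, row_key):
    """Delete one client's profile row from the taar_profile table."""
    client = bigtable.Client(project=project_id, admin=True)
    table = client.instance(instance_id).table("taar_profile")
    row = table.row(row_key)
    row.delete()   # mark the entire row for deletion
    row.commit()   # apply the mutation to BigTable
```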

## Airflow environment configuration

TAAR requires some configuration to be stored in Airflow variables for the ETL jobs to run correctly to completion.

| Airflow Variable | Value |
|------------------|-------|
| taar_gcp_project_id | The Google Cloud Platform project where BigQuery temporary tables, Cloud Storage buckets for Avro files, and BigTable reside for TAAR. |
| taar_etl_storage_bucket | The Cloud Storage bucket name where temporary Avro files reside when transferring data from BigQuery to BigTable. |
| taar_bigtable_instance_id | The BigTable instance ID for TAAR user profile information. |
| taar_dataflow_subnetwork | The subnetwork required for Cloud Dataflow workers to communicate. |
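Inside the ETL code, these would typically be read with Airflow's Variable API (a sketch; the variable names match the table above, and the surrounding task code is assumed):

```python
from airflow.models import Variable

# Sketch: how an ETL task can read the TAAR configuration variables.
gcp_project_id = Variable.get("taar_gcp_project_id")
etl_bucket = Variable.get("taar_etl_storage_bucket")
bigtable_instance_id = Variable.get("taar_bigtable_instance_id")
```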

## Staging Environment

The staging environment of the TAAR service in GCP can be reached using curl:

```
curl https://user:pass@stage.taar.nonprod.dataops.mozgcp.net/v1/api/recommendations/<hashed_telemetry_id>
```

Requests for a TAAR-lite recommendation can be made using curl as well:

```
curl https://stage.taar.nonprod.dataops.mozgcp.net/taarlite/api/v1/addon_recommendations/<addon_guid>/
```

## TAARlite cache tools

There is a `taarlite-redis` tool to manage the TAARlite redis cache.

The cache must be populated using the `--load` command or TAARlite will return no results.

It is safe to reload new data while TAARlite is running; no performance degradation is expected.

The cache contains a 'hot' buffer that serves reads and a 'cold' buffer that receives updated data.

Subsequent invocations of `--load` write into the cold buffer. After the data is successfully loaded, the hot and cold buffers are swapped, as sketched below.
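A minimal sketch of this hot/cold double-buffering using redis-py; the key names and data layout are illustrative, not the tool's actual schema:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def load(data):
    # Which buffer is currently serving reads? Default to buffer 0.
    hot = int(r.get("taarlite:hot_buffer") or 0)
    cold = 1 - hot
    # Write refreshed data into the cold buffer while reads keep
    # hitting the hot one, so there is no performance degradation.
    for key, value in data.items():
        r.set(f"taarlite:{cold}:{key}", value)
    # Swap: subsequent reads use the freshly loaded buffer.
    r.set("taarlite:hot_buffer", cold)
```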

Running the `taarlite-redis` tool inside the container:

```
$ docker run -it taar:latest bin/run python /opt/conda/bin/taarlite-redis.py --help

Usage: taarlite-redis.py [OPTIONS]

  Manage the TAARLite redis cache.

  This expects that the following environment variables are set:

  REDIS_HOST REDIS_PORT

Options:
  --reset  Reset the redis cache to an empty state
  --load   Load data into redis
  --info   Display information about the cache state
  --help   Show this message and exit.
```

## Testing

TAARLite will respond with suggestions given an addon GUID.

A sample URL path may look like this:

```
/taarlite/api/v1/addon_recommendations/uBlock0%40raymondhill.net/
```

TAAR will treat any client ID with only repeating digits (e.g. 0000) as a test client ID and will return a dummy response.

A URL with the path `/v1/api/recommendations/0000000000/` will return a valid JSON result.
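A sketch of how such a test ID could be detected (illustrative; TAAR's actual check may differ):

```python
def is_test_client_id(client_id):
    # A single repeated digit (e.g. "0000", "1111111") marks a test
    # client that should receive a canned dummy response.
    return client_id.isdigit() and len(set(client_id)) == 1
```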

## A note on cdist optimization

cdist can speed up distance computation by a factor of 10 for the computations we're doing. We can use it without problems on the Canberra distance calculation.

Unfortunately, there are multiple problems with it accepting a string array. The problems differ between 0.18.1 (which is what is available on EMR) and later versions, but in both cases cdist attempts to convert a string to a double, which fails. For versions of scipy later than 0.18.1 this could be worked around with:

```python
distance.cdist(v1, v2, lambda x, y: distance.hamming(x, y))
```

However, when you manually provide a callable to cdist, cdist cannot apply its baked-in optimizations (https://github.com/scipy/scipy/blob/v1.0.0/scipy/spatial/distance.py#L2408), so we can just apply distance.hamming to our array manually and get the same performance.
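A sketch of that manual application (variable names and shapes are assumed: one profile's string-valued features against a matrix of donor features):

```python
import numpy as np
from scipy.spatial import distance

def hamming_to_donors(profile_features, donor_matrix):
    # Compute the hamming distance from one profile's categorical
    # features to each donor's features, bypassing cdist, which
    # chokes on string arrays.
    return np.array(
        [distance.hamming(profile_features, donor) for donor in donor_matrix]
    )
```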