# Taar

Telemetry-Aware Addon Recommender


## How does it work?

The recommendation strategy is implemented through the RecommendationManager. Once a recommendation is requested for a specific client id, the recommender iterates through all the registered models (e.g. CollaborativeRecommender) linearly in their registered order. Results are returned from the first module that can perform a recommendation.

Each module specifies its own set of rules and requirements, and can therefore decide independently of the other modules whether it is able to perform a recommendation.
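The fall-through strategy above can be sketched as follows. This is a minimal illustration, not the actual TAAR implementation: the class and method names (`can_recommend`, `recommend`) are assumptions chosen to mirror the description, and the real code involves S3-backed model loading.

```python
class RecommendationManager:
    """Illustrative sketch: try each registered recommender in order,
    returning results from the first one able to serve this client."""

    def __init__(self, recommenders):
        # Ordered list of recommender modules (e.g. legacy, collaborative, ...)
        self.recommenders = recommenders

    def recommend(self, client_data, limit):
        for recommender in self.recommenders:
            # Each module decides for itself whether its requirements are met.
            if recommender.can_recommend(client_data):
                return recommender.recommend(client_data, limit)
        return []  # no module could produce a recommendation
```

The key property is that the iteration order is the registration order, so earlier models always take precedence when their conditions are satisfied.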

## Supported models

This is the ordered list of the currently supported models:

| Order | Model | Description | Conditions | Generator job |
|-------|-------|-------------|------------|---------------|
| 1 | Legacy | recommends WebExtensions based on the reported and disabled legacy add-ons | Telemetry data is available for the user and the user has at least one disabled add-on | source |
| 2 | Collaborative | recommends add-ons based on add-ons installed by other users (i.e. collaborative filtering) | Telemetry data is available for the user and the user has at least one enabled add-on | source |
| 3 | Similarity \* | recommends add-ons based on add-ons installed by similar representative users | Telemetry data is available for the user and a suitable representative donor can be found | source |
| 4 | Locale | recommends add-ons based on the top add-ons for the user's locale | Telemetry data is available for the user and the locale has enough users | source |
| 5 | Ensemble \* | recommends add-ons based on the combined (by stacked generalization) recommendations of the other available recommender modules | More than one of the other models is available to provide recommendations | source |

\* In order to ensure stable/repeatable testing and to prevent unnecessary computation, these jobs are not scheduled on Airflow; they are instead run manually when fresh models are desired.
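The ensemble's combination step can be sketched as a weighted merge of the sub-recommenders' scored suggestions. This is a simplified illustration only: the per-model weight dictionary is inspired by the `ensemble_weight.json` artifact listed below, but the exact schema, score scales, and function names are assumptions.

```python
def ensemble_recommend(suggestions_by_model, weights, limit):
    """Combine per-model lists of (guid, score) pairs into one ranking
    by summing each guid's scores scaled by its model's learned weight."""
    combined = {}
    for model, suggestions in suggestions_by_model.items():
        weight = weights.get(model, 0.0)
        for guid, score in suggestions:
            combined[guid] = combined.get(guid, 0.0) + weight * score
    # Highest combined score first, truncated to the requested limit.
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]
```

An add-on suggested by several models accumulates score from each of them, which is what lets the ensemble outperform any single model in isolation.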

## Instructions for releasing updates

New releases can be shipped using the normal GitHub workflow. Once a new release is created, it is automatically uploaded to PyPI.

## A note on cdist optimization

cdist can speed up distance computation by a factor of 10 for the computations we're doing, and we can use it without problems for the Canberra distance calculation.

Unfortunately, there are multiple problems with cdist accepting a string array. The problems differ between 0.18.1 (which is what is available on EMR) and later versions, but in both cases cdist attempts to convert a string to a double, which fails. For versions of scipy later than 0.18.1, this could be worked around with:

```python
distance.cdist(v1, v2, lambda x, y: distance.hamming(x, y))
```

However, when you manually provide a callable to cdist, cdist cannot apply its baked-in optimizations (https://github.com/scipy/scipy/blob/v1.0.0/scipy/spatial/distance.py#L2408), so we can just apply the function distance.hamming to our array manually and get the same performance.
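The manual approach can be sketched in pure Python (assuming string-valued feature rows; `distance.hamming` in scipy returns the *fraction* of positions that disagree, which the helper below mirrors — the function names and sample data here are illustrative):

```python
def hamming(u, v):
    """Fraction of positions where u and v disagree
    (mirrors scipy's distance.hamming for equal-length sequences)."""
    return sum(a != b for a, b in zip(u, v)) / len(u)

def pairwise_hamming(rows_a, rows_b):
    """Manual stand-in for cdist(rows_a, rows_b, hamming),
    which works fine on string-valued rows."""
    return [[hamming(u, v) for v in rows_b] for u in rows_a]

# Hypothetical categorical features for one client vs. two donors.
donors = [["en-US", "Windows"], ["de", "Linux"]]
client = [["en-US", "Linux"]]
print(pairwise_hamming(client, donors))  # [[0.5, 0.5]]
```

Because each element comparison is a plain equality test, no string-to-double conversion is ever attempted.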

## Build and run tests

You should be able to build taar using Python 2.7 or Python 3.5. To run the test suite, execute:

```
$ python setup.py develop
$ python setup.py test
```

Alternatively, if you have GNU Make installed, you can just run `make test`, which will do all of that for you and also run flake8 on the codebase.

There are additional integration tests and a microbenchmark available in tests/test_integration.py. See the source code for more information.

## Pinning dependencies

TAAR uses hashin (https://pypi.org/project/hashin/) to pin SHA256 hashes for each dependency. To update the hashes, run `make freeze`, which forces all packages in the current virtualenv to be written out to requirements.txt with pinned versions and SHA hashes.

## Required S3 dependencies

RecommendationManager:

- s3://telemetry-parquet/telemetry-ml/addon_recommender/top_200_whitelist.json

Hybrid Recommender:

- s3://telemetry-parquet/taar/ensemble/ensemble_weight.json
- s3://telemetry-parquet/telemetry-ml/addon_recommender/top_200_whitelist.json

Similarity Recommender:

- s3://telemetry-parquet/taar/similarity/donors.json
- s3://telemetry-parquet/taar/similarity/lr_curves.json

CollaborativeRecommender:

- s3://telemetry-public-analysis-2/telemetry-ml/addon_recommender/item_matrix.json
- s3://telemetry-public-analysis-2/telemetry-ml/addon_recommender/addon_mapping.json

LocaleRecommender:

- s3://telemetry-parquet/taar/locale/top10_dict.json

EnsembleRecommender:

- s3://telemetry-parquet/taar/ensemble/ensemble_weight.json

TAAR breaks out all S3 data load configuration into environment variables. This ensures that running under test has no chance of clobbering production data in the event that a developer has AWS configuration keys installed locally in ~/.aws/
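One way such configuration might be read is sketched below. The environment variable names come from the listings in this README, but `s3_location` and its test-safe defaults are a hypothetical helper, not TAAR's actual loader:

```python
import os

def s3_location(bucket_var, key_var,
                default_bucket="test-bucket", default_key="test-key"):
    """Read an S3 bucket/key pair from environment variables, falling
    back to harmless test defaults so an unconfigured run (e.g. under
    test) never points at production data."""
    return (os.environ.get(bucket_var, default_bucket),
            os.environ.get(key_var, default_key))

bucket, key = s3_location("TAAR_LOCALE_BUCKET", "TAAR_LOCALE_KEY")
```

Keeping the defaults distinct from production bucket names is what gives the isolation described above.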

## Production environment variables required for TAAR

Collaborative Recommender:

```
TAAR_ITEM_MATRIX_BUCKET = "telemetry-public-analysis-2"
TAAR_ITEM_MATRIX_KEY = "telemetry-ml/addon_recommender/item_matrix.json"
TAAR_ADDON_MAPPING_BUCKET = "telemetry-public-analysis-2"
TAAR_ADDON_MAPPING_KEY = "telemetry-ml/addon_recommender/addon_mapping.json"
```

Ensemble Recommender:

```
TAAR_ENSEMBLE_BUCKET = "telemetry-parquet"
TAAR_ENSEMBLE_KEY = "taar/ensemble/ensemble_weight.json"
```

Hybrid Recommender:

```
TAAR_WHITELIST_BUCKET = "telemetry-parquet"
TAAR_WHITELIST_KEY = "telemetry-ml/addon_recommender/only_guids_top_200.json"
```

Locale Recommender:

```
TAAR_LOCALE_BUCKET = "telemetry-parquet"
TAAR_LOCALE_KEY = "taar/locale/top10_dict.json"
```

Similarity Recommender:

```
TAAR_SIMILARITY_BUCKET = "telemetry-parquet"
TAAR_SIMILARITY_DONOR_KEY = "taar/similarity/donors.json"
TAAR_SIMILARITY_LRCURVES_KEY = "taar/similarity/lr_curves.json"
```