This commit is contained in:
Victor Ng 2020-02-04 14:18:28 -05:00
Родитель cc0bea2bc6
Коммит 0db175bdb1
1 изменённых файлов: 122 добавлений и 18 удалений

140
README.md
Просмотреть файл

@ -5,26 +5,130 @@ TAARlite and TAAR ETL jobs for GCP
This repo contains scripts which are used in ETL jobs for the TAAR and
TAARlite services.
TODO document each major module here:
taar_amodump.py
taar_amowhitelist.py
taar_lite_guidguid.py
taar_update_whitelist.py
taar_utils.py
taar_utils.pyc
-----
Put all your code into your own repository and package it up as a
container. This makes it much easier to deploy your code into both
GKEPodOperators which run containerized code within a Kubernetes Pod,
as well as giving you the ability to deploy code into a dataproc
cluster using a git checkout.
S3 locations
============
Top level bucket:
S3_PARQUET_BUCKET=telemetry-parquet
New GCS storage locations:
==========================
Top level bucket:
GCS_BUCKET=moz-fx-data-derived-datasets-parquet
S3 Buckets
==========
Jobs
====
This is a list of buckets that we are writing to in S3
taar_etl.taar_amodump
Bucket Path
====================== =================================
telemetry-parquet telemetry-ml/addon_recommender/
telemetry-parquet taar/ensemble/
telemetry-parquet taar/lite/
telemetry-private-analysis-2 taar/locale/
telemetry-parquet taar/similarity/
srg-team-bucket *
This job extracts the complete AMO listing and emits a JSON blob.
Depends On:
https://addons.mozilla.org/api/v3/addons/search/
Output file:
S3_PARQUET_BUCKET
Path: telemetry-ml/addon_recommender/extended_addons_database.json
taar_etl.taar_amowhitelist
This job filters the AMO whitelist from taar_amodump into 3 filtered lists.
Depends On:
taar_etl.taar_amodump
Output file:
S3_PARQUET_BUCKET
Path: telemetry-ml/addon_recommender/whitelist_addons_database.json
Path: telemetry-ml/addon_recommender/featured_addons_database.json
Path: telemetry-ml/addon_recommender/featured_whitelist_addons.json
taar_etl.taar_update_whitelist
This job extracts the editorial approved addons from AMO
Depends On:
https://addons.mozilla.org/api/v3/addons/search/
Output file:
S3_PARQUET_BUCKET
Path: telemetry-ml/addon_recommender/only_guids_top_200.json
Task Sensors
------------
wait_for_main_summary_export
wait_for_clients_daily_export
PySpark Jobs
------------
taar_dynamo_job
Depends On:
gcs: gs://moz-fx-data-derived-datasets-parquet/main_summary/v4
TODO: we need to upgrade to v6 as the similarity job has been
updated
Output file:
AWS DynamoDB: us-west-2/taar_addon_data_20180206
taar_similarity
Depends On:
gcs: gs://moz-fx-data-derived-datasets-parquet/main_summary/v4
Output file:
S3_PARQUET_BUCKET
Path: taar/similarity/donors.json
Path: taar/similarity/lr_curves.json
taar_locale
Depends On:
gcs: gs://moz-fx-data-derived-datasets-parquet/clients_daily/v6
S3_PARQUET_BUCKET
Path: telemetry-ml/addon_recommender/only_guids_top_200.json
Output file:
S3_BUCKET: telemetry-private-analysis-2
Path: taar/locale/top10_dict.json
taar_collaborative_recommender
Computes addons using a collaborative filter.
Depends On:
clients_daily over apache hive tables
S3_BUCKET: telemetry-ml/addon_recommender/only_guids_top_200.json
Output files:
s3://telemetry-ml/addon_recommender/addon_mapping.json
s3://telemetry-ml/addon_recommender/item_matrix.json
s3://telemetry-ml/addon_recommender/{rundate}/addon_mapping.json
s3://telemetry-ml/addon_recommender/{rundate}/item_matrix.json
taar_lite
Compute addon coinstallation rates for TAARlite
Depends On:
s3a://telemetry-parquet/clients_daily/v6/submission_date_s3={dateformat}
Output file:
S3_BUCKET: telemetry-parquet
Path: taar/lite/guid_coinstallation.json