[![CircleCI](https://circleci.com/gh/mozilla/taar_gcp_etl.svg?style=svg)](https://circleci.com/gh/mozilla/taar_gcp_etl)
TAARlite and TAAR ETL jobs for GCP
==================================
This repo contains scripts which are used in ETL jobs for the TAAR and
TAARlite services.
-----
Put all your code into your own repository and package it up as a
container. This makes it much easier to deploy your code into
GKEPodOperators, which run containerized code within a Kubernetes
Pod, and it also gives you the ability to deploy code into a Dataproc
cluster using a git checkout.
## New GCS storage locations
Prod buckets:

    moz-fx-data-taar-pr-prod-e0f7-prod-etl
    moz-fx-data-taar-pr-prod-e0f7-prod-models

Test bucket:

    taar_models
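
For a quick sanity check of what a job wrote, you can list a bucket
with the `google-cloud-storage` client. A minimal sketch against the
test bucket above (the `addon_recommender/` prefix is just an
example):

```
from google.cloud import storage

# List artifacts under a prefix in the taar_models test bucket.
client = storage.Client()
for blob in client.list_blobs("taar_models", prefix="addon_recommender/"):
    print(blob.name, blob.updated)
```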
## Jobs
taar_etl.taar_amodump

    This job extracts the complete AMO listing and emits a JSON blob.

    Depends On:
        https://addons.mozilla.org/api/v4/addons/search/

    Output file:
        Path: gs://taar_models/addon_recommender/extended_addons_database.json
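
As a rough sketch of the extraction, assuming the v4 search endpoint
returns paginated JSON with `results` and a `next` URL (the real
field handling lives in `taar_etl.taar_amodump`):

```
import json
import requests

AMO_SEARCH = "https://addons.mozilla.org/api/v4/addons/search/"

def fetch_all_addons():
    # Walk the paginated search endpoint until the last page.
    addons, url = [], AMO_SEARCH + "?app=firefox&page_size=50"
    while url:
        page = requests.get(url, timeout=30).json()
        addons.extend(page["results"])
        url = page.get("next")  # None on the last page
    return addons

if __name__ == "__main__":
    # Key the blob on addon GUID; the real output shape may differ.
    blob = {addon["guid"]: addon for addon in fetch_all_addons()}
    with open("extended_addons_database.json", "w") as fout:
        json.dump(blob, fout)
```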
taar_etl.taar_amowhitelist

    This job filters the AMO whitelist from taar_amodump into 3 filtered lists.

    Depends On:
        taar_etl.taar_amodump

    Output file:
        Path: gs://taar_models/addon_recommender/whitelist_addons_database.json
        Path: gs://taar_models/addon_recommender/featured_addons_database.json
        Path: gs://taar_models/addon_recommender/featured_whitelist_addons.json
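
The filtering itself is straightforward; the sketch below shows the
shape of it with purely illustrative criteria (the real rules are
encoded in `taar_etl.taar_amowhitelist`):

```
import json

with open("extended_addons_database.json") as fin:
    addons = json.load(fin)

# Illustrative filters only; the real criteria live in taar_amowhitelist.
whitelist = {g: a for g, a in addons.items()
             if a.get("ratings", {}).get("average", 0) >= 3.0}
featured = {g: a for g, a in addons.items() if a.get("is_featured")}
featured_whitelist = {g: a for g, a in whitelist.items() if g in featured}

for name, data in [("whitelist_addons_database.json", whitelist),
                   ("featured_addons_database.json", featured),
                   ("featured_whitelist_addons.json", featured_whitelist)]:
    with open(name, "w") as fout:
        json.dump(data, fout)
```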
taar_lite_guid_ranking

    This job loads installation counts by addon from the BigQuery
    `telemetry.addons` table and saves them to GCS.

    Output file:
        Path: gs://taar_models/taar/lite/guid_install_ranking.json
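
The query side of this job can be sketched with the BigQuery and GCS
clients; the SQL below is an assumption about the general shape, not
the production query:

```
import json
from google.cloud import bigquery, storage

# Illustrative query: count installs per addon GUID.
SQL = """
SELECT addon_id, COUNT(*) AS install_count
FROM `moz-fx-data-shared-prod.telemetry.addons`
GROUP BY addon_id
"""

ranking = {row["addon_id"]: row["install_count"]
           for row in bigquery.Client().query(SQL).result()}

blob = storage.Client().bucket("taar_models").blob(
    "taar/lite/guid_install_ranking.json")
blob.upload_from_string(json.dumps(ranking), content_type="application/json")
```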
taar_etl.taar_update_whitelist

    This job extracts the editorially approved addons from AMO.

    Depends On:
        https://addons.mozilla.org/api/v4/addons/search/

    Output file:
        Path: gs://taar_models/addon_recommender/only_guids_top_200.json
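
Conceptually this is a filtered variant of the amodump fetch: query
the same search endpoint for the editorial picks and keep only the
GUIDs. In the sketch below, the `promoted=recommended` filter and the
200 cutoff are assumptions inferred from the output file name:

```
import json
import requests

# Assumption: editorial picks are exposed via the promoted=recommended filter.
url = ("https://addons.mozilla.org/api/v4/addons/search/"
       "?app=firefox&promoted=recommended&page_size=50")
guids = []
while url and len(guids) < 200:
    page = requests.get(url, timeout=30).json()
    guids.extend(addon["guid"] for addon in page["results"])
    url = page.get("next")

with open("only_guids_top_200.json", "w") as fout:
    json.dump(guids[:200], fout)
```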
taar_etl.taar_profile_bigtable

    This task extracts data from the BigQuery telemetry table
    `clients_last_seen` and exports temporary files in Avro format
    to a bucket in Google Cloud Storage.

    The Avro files are then loaded into Cloud BigTable.

    Each record is keyed on a SHA256 hash of the telemetry client-id.
    While this job runs, several intermediate data files are created.
    Any intermediate files are destroyed at the end of the job
    execution.

    The only artifact of this job is the set of records residing in
    BigTable, as defined by the `--bigtable-instance-id` and
    `--bigtable-table-id` options to the job.
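
For example, deriving the BigTable row key from a raw telemetry
client-id looks roughly like this (a sketch; the production job may
normalize the id before hashing):

```
import hashlib

def profile_row_key(client_id: str) -> str:
    # Row key: hex SHA256 digest of the telemetry client-id.
    return hashlib.sha256(client_id.encode("utf8")).hexdigest()

print(profile_row_key("00000000-0000-0000-0000-000000000000"))
```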
## PySpark Jobs
taar_similarity

    Output file:
        Path: gs://taar_models/similarity/donors.json
        Path: gs://taar_models/similarity/lr_curves.json
taar_locale

    Output file:
        Path: gs://taar_models/locale/top10_dict.json
taar_lite

    Compute addon coinstallation rates for TAARlite.

    Output file:
        Path: gs://taar_models/taar/lite/guid_coinstallation.json
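
A minimal PySpark sketch of the coinstallation computation: self-join
each client's addon list and count pairwise co-occurrence. The input
table, column names, and output path are illustrative:

```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("taar_lite_sketch").getOrCreate()

# One row per (client_id, addon_guid); table and column names are illustrative.
addons_df = spark.table("client_addons")

a, b = addons_df.alias("a"), addons_df.alias("b")
coinstalls = (
    a.join(b, F.col("a.client_id") == F.col("b.client_id"))
     .where(F.col("a.addon_guid") != F.col("b.addon_guid"))
     .groupBy(F.col("a.addon_guid").alias("guid"),
              F.col("b.addon_guid").alias("coinstalled_guid"))
     .count()
)
coinstalls.write.json("gs://<your-scratch-bucket>/guid_coinstallation_sketch")
```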
## Google Cloud Platform jobs
taar_etl.taar_profile_bigtable

    This job extracts user profile data from `clients_last_seen` to
    build a user profile table in BigTable. This job is split into 4
    parts (a sketch of the first two parts follows the list):

    1. Filling a BigQuery table with all pertinent data so that we
       can export to Avro on Google Cloud Storage. The fill is
       completed using a `CREATE OR REPLACE TABLE` operation in
       BigQuery.

    2. Exporting the newly populated BigQuery table into Google
       Cloud Storage in Apache Avro format.

    3. Importing the Avro files from Google Cloud Storage into
       Cloud BigTable.

    4. Deleting users that opt out of telemetry collection.
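
A sketch of the first two parts using the BigQuery client (the
project, dataset, table, column, and bucket names below are
placeholders):

```
import datetime
from google.cloud import bigquery

client = bigquery.Client()

# Part 1: fill a temporary table with the pertinent profile data.
client.query(
    """
    CREATE OR REPLACE TABLE `my-project.tmp_dataset.taar_profile_tmp` AS
    SELECT client_id, days_since_seen  -- illustrative column list
    FROM `moz-fx-data-shared-prod.telemetry.clients_last_seen`
    WHERE submission_date = @iso_date
    """,
    job_config=bigquery.QueryJobConfig(query_parameters=[
        bigquery.ScalarQueryParameter(
            "iso_date", "DATE", datetime.date(2021, 4, 26)),
    ]),
).result()

# Part 2: export the filled table to GCS as Avro shards.
client.extract_table(
    "my-project.tmp_dataset.taar_profile_tmp",
    "gs://my-avro-bucket/taar-profile/part-*.avro",
    job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
).result()
```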
    When this set of tasks is scheduled in Airflow, it is expected
    that the Google Cloud Storage bucket will be cleared at the start
    of the DAG, and cleared again at the end of the DAG, to prevent
    unnecessary storage costs.
## Uploading images to gcr.io
CircleCI will automatically build a Docker image and push it to
gcr.io for production using the latest tag.
You can use images from the gcr.io image repository using a path like:
```
gcr.io/moz-fx-data-airflow-prod-88e0/taar_gcp_etl:<latest_tag>
```
## Running a job from within a container
Sample command for the impatient:
```
# ~/.gcp_creds is the directory where your service_account JSON file resides
docker run \
  -v ~/.gcp_creds:/app/creds \
  -v ~/.config:/app/.config \
  -e GOOGLE_APPLICATION_CREDENTIALS=/app/creds/<YOUR_SERVICE_ACCOUNT_JSON_FILE_HERE.json> \
  -e GCLOUD_PROJECT=<YOUR_TEST_GCP_PROJECT_HERE> \
  -it app:build \
  -m taar_etl.taar_profile_bigtable \
  --iso-date=<YESTERDAY_ISODATE_NO_DASHES> \
  --gcp-project=<YOUR_TEST_GCP_PROJECT_HERE> \
  --avro-gcs-bucket=<YOUR_GCS_BUCKET_FOR_AVRO_HERE> \
  --bigquery-dataset-id=<BIG_QUERY_DATASET_ID_HERE> \
  --bigquery-table-id=<BIG_QUERY_TABLE_ID_HERE> \
  --bigtable-instance-id=<BIG_TABLE_INSTANCE_ID> \
  --wipe-bigquery-tmp-table
```
The container defines an entry point which pre-configures the conda
environment and starts up the Python interpreter. You need to pass in
arguments to run your module as a task.
Note that to test on your local machine, you need to volume mount two
locations so that your credentials load: mount `.config` for your
Google authentication tokens, and mount the directory containing your
GCP service account JSON file. You will also need to specify your GCP
project via the `GCLOUD_PROJECT` environment variable.
### More examples
**amodump**
```
docker run \
  -v ~/.config:/app/.config \
  -v ~/.gcp_creds:/app/creds \
  -e GOOGLE_APPLICATION_CREDENTIALS=/app/creds/<YOUR_SERVICE_ACCOUNT_JSON_FILE_HERE.json> \
  -e GCLOUD_PROJECT=<YOUR_TEST_GCP_PROJECT_HERE> \
  -it app:build \
  -m taar_etl.taar_amodump \
  --date=20220620
```
**amowhitelist**
```
docker run \
  -v ~/.config:/app/.config \
  -v ~/.gcp_creds:/app/creds \
  -e GOOGLE_APPLICATION_CREDENTIALS=/app/creds/<YOUR_SERVICE_ACCOUNT_JSON_FILE_HERE.json> \
  -e GCLOUD_PROJECT=<YOUR_TEST_GCP_PROJECT_HERE> \
  -it app:build \
  -m taar_etl.taar_amowhitelist
```
**update_whitelist**
```
docker run \
  -v ~/.config:/app/.config \
  -v ~/.gcp_creds:/app/creds \
  -e GOOGLE_APPLICATION_CREDENTIALS=/app/creds/<YOUR_SERVICE_ACCOUNT_JSON_FILE_HERE.json> \
  -e GCLOUD_PROJECT=<YOUR_TEST_GCP_PROJECT_HERE> \
  -it app:build \
  -m taar_etl.taar_update_whitelist \
  --date=20220620
```
**profile_bigtable.delete**
You might need to replace the GCP-project-specific arguments:
```
docker run \
  -v ~/.config:/app/.config \
  -e GCLOUD_PROJECT=moz-fx-data-taar-nonprod-48b6 \
  -it app:build \
  -m taar_etl.taar_profile_bigtable \
  --iso-date=20210426 \
  --gcp-project=moz-fx-data-taar-nonprod-48b6 \
  --bigtable-table-id=taar_profile \
  --bigtable-instance-id=taar-stage-202006 \
  --delete-opt-out-days 28 \
  --avro-gcs-bucket moz-fx-data-taar-nonprod-48b6-stage-etl \
  --subnetwork regions/us-west1/subnetworks/gke-taar-nonprod-v1 \
  --dataflow-workers=2 \
  --dataflow-service-account taar-stage-dataflow@moz-fx-data-taar-nonprod-48b6.iam.gserviceaccount.com \
  --sample-rate=1.0 \
  --bigtable-delete-opt-out
```