TAARlite and TAAR ETL jobs for GCP
This repo contains scripts which are used in ETL jobs for the TAAR and TAARlite services.
Put all your code into your own repository and package it as a container. This makes it much easier to deploy the code via GKEPodOperators, which run containerized code within a Kubernetes Pod, and it also lets you deploy code into a Dataproc cluster using a git checkout.
New GCS storage locations
Prod buckets:
moz-fx-data-taar-pr-prod-e0f7-prod-etl
moz-fx-data-taar-pr-prod-e0f7-prod-models
Test bucket:
taar_models
Jobs
taar_etl.taar_amodump
This job extracts the complete AMO listing and emits a JSON blob; a sketch of the API paging follows the output path below.
Depends On:
https://addons.mozilla.org/api/v4/addons/search/
Output file:
Path: gs://taar_models/addon_recommender/extended_addons_database.json
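For illustration, a minimal sketch of paging through the AMO search API listed above, assuming the requests library; the query parameters and field handling are assumptions, not the job's actual ones (those live in the taar_etl.taar_amodump module).

import requests

SEARCH_URL = "https://addons.mozilla.org/api/v4/addons/search/"

def fetch_all_addons(page_size=50):
    # Follow the paginated "next" links until the listing is exhausted.
    addons, url, params = [], SEARCH_URL, {"app": "firefox", "page_size": page_size}
    while url:
        resp = requests.get(url, params=params)
        resp.raise_for_status()
        payload = resp.json()
        addons.extend(payload["results"])
        url, params = payload.get("next"), None  # "next" is already a full URL
    return addons

print(f"fetched {len(fetch_all_addons())} addons")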
taar_etl.taar_amowhitelist
This job filters the AMO addons database produced by taar_amodump into 3 filtered lists.
Depends On:
taar_etl.taar_amodump
Output files:
Path: gs://taar_models/addon_recommender/whitelist_addons_database.json
Path: gs://taar_models/addon_recommender/featured_addons_database.json
Path: gs://taar_models/addon_recommender/featured_whitelist_addons.json
taar_lite_guid_ranking
This job loads installation counts per addon from the BigQuery
`telemetry.addons` table and saves them to GCS; a sketch of this flow
follows the output path below.
Output file:
Path: gs://taar_models/taar/lite/guid_install_ranking.json
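A minimal sketch of this kind of flow, assuming the google-cloud-bigquery and google-cloud-storage client libraries; the SQL and column names are illustrative placeholders, not the job's actual query.

import json
from google.cloud import bigquery, storage

# Hypothetical query: aggregate install counts per addon GUID.
# The dataset-qualified table name is used here; the hosting project is omitted.
QUERY = """
SELECT addon_id AS guid, COUNT(*) AS install_count
FROM `telemetry.addons`
GROUP BY addon_id
"""

bq = bigquery.Client()
ranking = {row["guid"]: row["install_count"] for row in bq.query(QUERY).result()}

# Upload the ranking as a single JSON blob to the test bucket listed above.
bucket = storage.Client().bucket("taar_models")
bucket.blob("taar/lite/guid_install_ranking.json").upload_from_string(json.dumps(ranking))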
taar_etl.taar_update_whitelist
This job extracts the editorially approved addons from AMO.
Depends On:
https://addons.mozilla.org/api/v4/addons/search/
Output file:
Path: gs://taar_models/addon_recommender/only_guids_top_200.json
taar_etl.taar_profile_bigtable
This task extracts data from the BigQuery telemetry table
`clients_last_seen` and exports temporary files in Avro format to a
bucket in Google Cloud Storage.
The Avro files are then loaded into Cloud BigTable.
Each record is keyed on a SHA256 hash of the telemetry client-id
(see the sketch below).
While this job runs, several intermediate data files are created.
All intermediate files are destroyed at the end of the job
execution.
The only artifacts of this job are the records residing in BigTable,
as defined by the `--bigtable-instance-id` and `--bigtable-table-id`
options to the job.
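A minimal sketch of the row-key derivation described above; the hex encoding is an assumption, and the exact scheme is defined in the taar_etl.taar_profile_bigtable module.

import hashlib

def bigtable_row_key(client_id: str) -> str:
    # Assumed scheme: hex-encoded SHA256 digest of the raw telemetry client-id.
    return hashlib.sha256(client_id.encode("utf-8")).hexdigest()

# Example: derive the BigTable row key for one (fake) client-id.
print(bigtable_row_key("9d5a1b2c-0000-4000-8000-000000000000"))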
PySpark Jobs
taar_similarity
Output files:
Path: gs://taar_models/similarity/donors.json
Path: gs://taar_models/similarity/lr_curves.json
taar_locale
Output file:
Path: gs://taar_models/locale/top10_dict.json
taar_lite
Compute addon coinstallation rates for TAARlite; a sketch of the pair counting follows the output path below.
Output file:
Path: gs://taar_models/taar/lite/guid_coinstallation.json
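A minimal PySpark sketch of a coinstallation count, assuming an input DataFrame of (client_id, addon_id) pairs; the real job's input source, filtering, and normalization into rates may differ.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("taar_lite_coinstall_sketch").getOrCreate()

# Hypothetical input: one row per (client_id, addon_id) installation.
installs = spark.createDataFrame(
    [("c1", "addon-a"), ("c1", "addon-b"),
     ("c2", "addon-a"), ("c2", "addon-b"), ("c2", "addon-c")],
    ["client_id", "addon_id"],
)

# Self-join on client_id and count how often each unordered pair of
# distinct addons appears together across clients.
a, b = installs.alias("a"), installs.alias("b")
coinstalls = (
    a.join(b, on="client_id")
    .where(F.col("a.addon_id") < F.col("b.addon_id"))
    .groupBy(F.col("a.addon_id").alias("addon"), F.col("b.addon_id").alias("coinstalled"))
    .count()
)
coinstalls.show()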
Google Cloud Platform jobs
taar_etl.taar_profile_bigtable
This job extracts user profile data from `clients_last_seen` to
build a user profile table in Cloud BigTable. It is split into 4
parts:
1. Filling a BigQuery table with all pertinent data so that we
can export to Avro on Google Cloud Storage. The fill is
completed using a `CREATE OR REPLACE TABLE` operation in
BigQuery.
2. Exporting the newly populated BigQuery table into Google Cloud
Storage in Apache Avro format (see the sketch below).
3. Import of Avro files from Google Cloud Storage into
Cloud BigTable.
4. Delete users that have opted out of telemetry collection.
When this set of tasks is scheduled in Airflow, it is expected
that the Google Cloud Storage bucket will be cleared at the start of
the DAG and cleared again at the end of the DAG to prevent
unnecessary storage use.
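A minimal sketch of step 2, assuming the google-cloud-bigquery client; the project, dataset, table, and bucket names are placeholders.

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # placeholder project

# Export the table filled in step 1 to GCS as Avro shards; the wildcard
# in the destination URI produces numbered output files.
job_config = bigquery.ExtractJobConfig(destination_format=bigquery.DestinationFormat.AVRO)
extract_job = client.extract_table(
    "my-gcp-project.my_dataset.taar_profile_tmp",        # placeholder table name
    "gs://my-avro-bucket/taar-profile/profile-*.avro",   # placeholder bucket
    job_config=job_config,
)
extract_job.result()  # block until the export completes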
Uploading images to gcr.io
CircleCI will automatically build a Docker image and push it to gcr.io for production using the latest tag.
You can use images from the gcr.io image repository using a path like:
gcr.io/moz-fx-data-airflow-prod-88e0/taar_gcp_etl:<latest_tag>
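For reference, a hedged sketch of running this image from Airflow with a GKE pod operator, as mentioned above; the operator import path, cluster details, and arguments are assumptions that depend on the Airflow version and environment.

from airflow.providers.google.cloud.operators.kubernetes_engine import GKEStartPodOperator

# Older Airflow deployments expose a similar GKEPodOperator instead.
taar_amodump = GKEStartPodOperator(
    task_id="taar_amodump",
    project_id="my-airflow-project",   # placeholder
    location="us-west1",               # placeholder
    cluster_name="my-gke-cluster",     # placeholder
    name="taar-amodump",
    namespace="default",
    image="gcr.io/moz-fx-data-airflow-prod-88e0/taar_gcp_etl:latest",  # pin a real tag in practice
    arguments=["-m", "taar_etl.taar_amodump", "--date", "{{ ds_nodash }}"],
)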
Running a job from within a container
Sample command for the impatient (mount the directory where your service account JSON file resides as /app/creds):
docker run \
-v ~/.gcp_creds:/app/creds \
-v ~/.config:/app/.config \
-e GOOGLE_APPLICATION_CREDENTIALS=/app/creds/<YOUR_SERVICE_ACCOUNT_JSON_FILE_HERE.json> \
-e GCLOUD_PROJECT=<YOUR_TEST_GCP_PROJECT_HERE> \
-it app:build \
-m taar_etl.taar_profile_bigtable \
--iso-date=<YESTERDAY_ISODATE_NO_DASHES> \
--gcp-project=<YOUR_TEST_GCP_PROJECT_HERE> \
--avro-gcs-bucket=<YOUR_GCS_BUCKET_FOR_AVRO_HERE> \
--bigquery-dataset-id=<BIG_QUERY_DATASET_ID_HERE> \
--bigquery-table-id=<BIG_QUERY_TABLE_ID_HERE> \
--bigtable-instance-id=<BIG_TABLE_INSTANCE_ID> \
--wipe-bigquery-tmp-table
The container defines an entry point that pre-configures the conda environment and starts the Python interpreter. You need to pass in arguments to run your module as a task.
Note that to test on your local machine, you need to volume mount two
locations so your credentials load: mount ~/.config to provide your
Google authentication tokens, and mount your GCP service account JSON
file. You will also need to set GCLOUD_PROJECT.
More examples
amodump
docker run \
-v ~/.config:/app/.config \
-v ~/.gcp_creds:/app/creds \
-e GOOGLE_APPLICATION_CREDENTIALS=/app/creds/<YOUR_SERVICE_ACCOUNT_JSON_FILE_HERE.json> \
-e GCLOUD_PROJECT=<YOUR_TEST_GCP_PROJECT_HERE> \
-it app:build \
-m taar_etl.taar_amodump \
--date=20220620
amowhitelist
docker run \
-v ~/.config:/app/.config \
-v ~/.gcp_creds:/app/creds \
-e GOOGLE_APPLICATION_CREDENTIALS=/app/creds/<YOUR_SERVICE_ACCOUNT_JSON_FILE_HERE.json> \
-e GCLOUD_PROJECT=<YOUR_TEST_GCP_PROJECT_HERE> \
-it app:build \
-m taar_etl.taar_amowhitelist
update_whitelist
docker run \
-v ~/.config:/app/.config \
-v ~/.gcp_creds:/app/creds \
-e GOOGLE_APPLICATION_CREDENTIALS=/app/creds/<YOUR_SERVICE_ACCOUNT_JSON_FILE_HERE.json> \
-e GCLOUD_PROJECT=<YOUR_TEST_GCP_PROJECT_HERE> \
-it app:build \
-m taar_etl.taar_update_whitelist \
--date=20220620
profile_bigtable.delete
You might need to replace the GCP project-specific arguments with values for your own project.
docker run \
-v ~/.config:/app/.config \
-e GCLOUD_PROJECT=moz-fx-data-taar-nonprod-48b6 \
-it app:build \
-m taar_etl.taar_profile_bigtable \
--iso-date=20210426 \
--gcp-project=moz-fx-data-taar-nonprod-48b6 \
--bigtable-table-id=taar_profile \
--bigtable-instance-id=taar-stage-202006 \
--delete-opt-out-days 28 \
--avro-gcs-bucket moz-fx-data-taar-nonprod-48b6-stage-etl \
--subnetwork regions/us-west1/subnetworks/gke-taar-nonprod-v1 \
--dataflow-workers=2 \
--dataflow-service-account taar-stage-dataflow@moz-fx-data-taar-nonprod-48b6.iam.gserviceaccount.com \
--sample-rate=1.0 \
--bigtable-delete-opt-out