[![CircleCI](https://circleci.com/gh/mozilla/taar_gcp_etl.svg?style=svg)](https://circleci.com/gh/mozilla/taar_gcp_etl)

TAARlite and TAAR ETL jobs for GCP
==================================
This repo contains scripts which are used in ETL jobs for the TAAR and
TAARlite services.

-----

Put all your code into your own repository and package it up as a
container. This makes it much easier to deploy your code both to
GKEPodOperators, which run containerized code within a Kubernetes pod,
and to a Dataproc cluster using a git checkout.
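
For example, a containerized job can be launched from Airflow with a GKE
pod operator. The following is only a minimal sketch: the import path and
operator name match recent `apache-airflow-providers-google` releases
(older Airflow deployments used `GKEPodOperator` from `airflow.contrib`),
and all `<...>` values, the region, and the namespace are placeholders.

```
# Hypothetical Airflow DAG fragment: run a taar_gcp_etl module in a
# Kubernetes pod on a GKE cluster. All <...> values are placeholders.
from airflow.providers.google.cloud.operators.kubernetes_engine import (
    GKEStartPodOperator,
)

taar_amodump = GKEStartPodOperator(
    task_id="taar_amodump",
    project_id="<YOUR_GCP_PROJECT>",
    location="us-west1",
    cluster_name="<YOUR_GKE_CLUSTER>",
    name="taar-amodump",
    namespace="default",
    image="gcr.io/moz-fx-data-airflow-prod-88e0/taar_gcp_etl:<latest_tag>",
    # The container's entry point starts Python, so these arguments select
    # the module to run, mirroring the docker run examples below.
    arguments=["-m", "taar_etl.taar_amodump", "--date", "{{ ds_nodash }}"],
)
```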
## New GCS storage locations

Prod buckets:

    moz-fx-data-taar-pr-prod-e0f7-prod-etl
    moz-fx-data-taar-pr-prod-e0f7-prod-models

Test bucket:

    taar_models

## Jobs
### taar_etl.taar_amodump

This job extracts the complete AMO listing and emits a JSON blob.

Depends On:

* https://addons.mozilla.org/api/v4/addons/search/

Output file:

* Path: gs://taar_models/addon_recommender/extended_addons_database.json
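
A minimal sketch of the extraction idea, assuming the `requests` library;
the query parameters shown here and the exact shape of the emitted blob
are illustrative only, not necessarily what `taar_etl.taar_amodump` uses.

```
import json

import requests

def fetch_amo_listing():
    """Page through the AMO v4 search API and collect add-ons keyed by GUID."""
    url = "https://addons.mozilla.org/api/v4/addons/search/"
    params = {"app": "firefox", "type": "extension", "page_size": 50}
    addons = {}
    while url:
        resp = requests.get(url, params=params, timeout=60)
        resp.raise_for_status()
        payload = resp.json()
        for record in payload["results"]:
            addons[record["guid"]] = record
        url = payload.get("next")  # full URL of the next page, or None
        params = None              # the "next" URL already embeds the params
    return addons

if __name__ == "__main__":
    with open("extended_addons_database.json", "w") as fout:
        json.dump(fetch_amo_listing(), fout)
```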
### taar_etl.taar_amowhitelist

This job filters the AMO whitelist from taar_amodump into 3 filtered lists.

Depends On:

* taar_etl.taar_amodump

Output file:

* Path: gs://taar_models/addon_recommender/whitelist_addons_database.json
* Path: gs://taar_models/addon_recommender/featured_addons_database.json
* Path: gs://taar_models/addon_recommender/featured_whitelist_addons.json
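
The actual filtering criteria live in `taar_etl.taar_amowhitelist`; the
sketch below only illustrates the fan-out from one input blob to three
output blobs, with entirely hypothetical predicate fields
(`average_daily_users`, `is_featured`).

```
import json

# Load the output of taar_etl.taar_amodump (assumed here to be a dict
# keyed by add-on GUID).
with open("extended_addons_database.json") as fin:
    addons = json.load(fin)

# Hypothetical predicates -- stand-ins for the real filtering criteria.
whitelist = {g: a for g, a in addons.items() if a.get("average_daily_users", 0) > 0}
featured = {g: a for g, a in addons.items() if a.get("is_featured")}
featured_whitelist = {g: a for g, a in addons.items()
                      if g in whitelist and g in featured}

for name, blob in [
    ("whitelist_addons_database.json", whitelist),
    ("featured_addons_database.json", featured),
    ("featured_whitelist_addons.json", featured_whitelist),
]:
    with open(name, "w") as fout:
        json.dump(blob, fout)
```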
### taar_lite_guid_ranking

This job loads installation counts by addon from the BigQuery
`telemetry.addons` table and saves them to GCS.

Output file:

* Path: gs://taar_models/taar/lite/guid_install_ranking.json
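
A minimal sketch of the idea, assuming the `google-cloud-bigquery` and
`google-cloud-storage` client libraries; the query, the column names, and
the fully qualified table path are simplified stand-ins for what the real
job runs.

```
import json

from google.cloud import bigquery, storage

# Simplified stand-in for the real query; column names are assumptions.
QUERY = """
SELECT addon_id AS guid, COUNT(DISTINCT client_id) AS install_count
FROM `moz-fx-data-shared-prod.telemetry.addons`
WHERE submission_date = @submission_date
GROUP BY addon_id
"""

def build_guid_ranking(submission_date, bucket_name):
    bq = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("submission_date", "DATE", submission_date)
        ]
    )
    ranking = {row.guid: row.install_count
               for row in bq.query(QUERY, job_config=job_config).result()}

    blob = (storage.Client()
            .bucket(bucket_name)
            .blob("taar/lite/guid_install_ranking.json"))
    blob.upload_from_string(json.dumps(ranking), content_type="application/json")
```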
### taar_etl.taar_update_whitelist

This job extracts the editorially approved addons from AMO.

Depends On:

* https://addons.mozilla.org/api/v4/addons/search/

Output file:

* Path: gs://taar_models/addon_recommender/only_guids_top_200.json
### taar_etl.taar_profile_bigtable

This task extracts data from the BigQuery telemetry table
`clients_last_seen` and exports temporary files in Avro format to a
bucket in Google Cloud Storage. The Avro files are then loaded into
Cloud Bigtable. Each record is keyed on a SHA256 hash of the telemetry
client-id.

While this job runs, several intermediate data files are created; all
of them are destroyed at the end of the job execution. The only
artifact of this job is the set of records residing in Bigtable, as
defined by the `--bigtable-instance-id` and `--bigtable-table-id`
options to the job.
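
Since each Bigtable row is keyed on a SHA256 hash of the client-id, the
keying boils down to something like the sketch below; whether the real
job uses the hex digest or raw digest bytes is an assumption here.

```
import hashlib

def profile_row_key(client_id: str) -> str:
    """Derive the Bigtable row key from a telemetry client-id."""
    return hashlib.sha256(client_id.encode("utf-8")).hexdigest()

# e.g. profile_row_key("aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee")
```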
## PySpark Jobs
### taar_similarity

Output file:

* Path: gs://taar_models/similarity/donors.json
* Path: gs://taar_models/similarity/lr_curves.json
### taar_locale

Output file:

* Path: gs://taar_models/locale/top10_dict.json
### taar_lite

Compute addon coinstallation rates for TAARlite.

Output file:

* Path: gs://taar_models/taar/lite/guid_coinstallation.json
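
A minimal PySpark sketch of the coinstallation idea: starting from one
row per client with its list of installed add-ons, count how often each
pair of add-ons is installed together. The input location, the column
names, and turning counts into the rates stored in
guid_coinstallation.json are assumptions not shown here.

```
from itertools import combinations

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("taar_lite_sketch").getOrCreate()

# Assumed input shape: one row per client with the list of installed
# add-on GUIDs, e.g. DataFrame[client_id: string, addon_ids: array<string>].
clients = spark.read.parquet("gs://<SOME_BUCKET>/clients_addons.parquet")

pair_counts = (
    clients.rdd
    .flatMap(lambda row: [(pair, 1)
                          for pair in combinations(sorted(set(row.addon_ids)), 2)])
    .reduceByKey(lambda a, b: a + b)
)
# Each element is ((guid_a, guid_b), count); converting these counts into
# the rates written to guid_coinstallation.json is left out of this sketch.
```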
## Google Cloud Platform jobs
### taar_etl.taar_profile_bigtable
This job extracts user profile data from `clients_last_seen` to
build a user profile table in Bigtable. The job is split into 4
parts; a sketch of the first two parts appears after this list:

1. Filling a BigQuery table with all pertinent data so that we
   can export to Avro on Google Cloud Storage. The fill is
   completed using a `CREATE OR REPLACE TABLE` operation in
   BigQuery.
2. Exporting the newly populated BigQuery table into Google Cloud
   Storage in Apache Avro format.
3. Importing the Avro files from Google Cloud Storage into
   Cloud Bigtable.
4. Deleting users that have opted out of telemetry collection.

When this set of tasks is scheduled in Airflow, it is expected that
the Google Cloud Storage bucket will be cleared at the start of the
DAG, and cleared again at the end of the DAG to prevent unnecessary
storage.
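
A minimal sketch of parts 1 and 2, assuming the `google-cloud-bigquery`
client library; the real SELECT over `clients_last_seen`, and the
project, dataset, table, and bucket names, come from the job's
command-line options and are placeholders here.

```
# Sketch of parts 1 and 2: materialize a temporary BigQuery table with
# CREATE OR REPLACE TABLE, then export it to GCS in Avro format.
# Project, dataset, table and bucket names are placeholders, and the
# SELECT is a simplified stand-in for the real query.
from google.cloud import bigquery

bq = bigquery.Client(project="<YOUR_GCP_PROJECT>")

# Part 1: fill the temporary table from clients_last_seen.
fill_sql = """
CREATE OR REPLACE TABLE `<YOUR_GCP_PROJECT>.<DATASET_ID>.<TABLE_ID>` AS
SELECT client_id, city, locale, os, active_addons
FROM `moz-fx-data-shared-prod.telemetry.clients_last_seen`
WHERE submission_date = DATE "2021-04-26"
"""
bq.query(fill_sql).result()

# Part 2: export the populated table to GCS as Avro files.
extract_config = bigquery.ExtractJobConfig(destination_format="AVRO")
bq.extract_table(
    "<YOUR_GCP_PROJECT>.<DATASET_ID>.<TABLE_ID>",
    "gs://<YOUR_AVRO_BUCKET>/taar-profile/profile-*.avro",
    job_config=extract_config,
).result()
```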
## Uploading images to gcr.io
CircleCI will automatically build a Docker image and push it to
gcr.io for production using the latest tag.

You can pull images from the gcr.io image repository using a path like:
```
gcr.io/moz-fx-data-airflow-prod-88e0/taar_gcp_etl:<latest_tag>
```
## Running a job from within a container
Sample command for the impatient:
```
# ~/.gcp_creds is the directory where your service account JSON file resides
docker run \
-v ~/.gcp_creds:/app/creds \
-v ~/.config:/app/.config \
-e GOOGLE_APPLICATION_CREDENTIALS=/app/creds/<YOUR_SERVICE_ACCOUNT_JSON_FILE_HERE.json> \
-e GCLOUD_PROJECT=<YOUR_TEST_GCP_PROJECT_HERE> \
-it app:build \
-m taar_etl.taar_profile_bigtable \
--iso-date=<YESTERDAY_ISODATE_NO_DASHES> \
--gcp-project=<YOUR_TEST_GCP_PROJECT_HERE> \
--avro-gcs-bucket=<YOUR_GCS_BUCKET_FOR_AVRO_HERE> \
--bigquery-dataset-id=<BIG_QUERY_DATASET_ID_HERE> \
--bigquery-table-id=<BIG_QUERY_TABLE_ID_HERE> \
--bigtable-instance-id=<BIG_TABLE_INSTANCE_ID> \
--wipe-bigquery-tmp-table
```
The container defines an entry point which pre-configures the conda
environment and starts up the Python interpreter. You need to pass in
arguments to run your module as a task.

Note that to test on your local machine, you need to volume mount two
locations so that your credentials load: mount `~/.config` to provide
your Google authentication tokens, and mount the directory containing
your GCP service account JSON file. You will also need to specify your
GCP project via the `GCLOUD_PROJECT` environment variable.
### More examples
**amodump**
```
docker run \
-v ~/.config:/app/.config \
-v ~/.gcp_creds:/app/creds \
-e GOOGLE_APPLICATION_CREDENTIALS=/app/creds/<YOUR_SERVICE_ACCOUNT_JSON_FILE_HERE.json> \
-e GCLOUD_PROJECT=<YOUR_TEST_GCP_PROJECT_HERE> \
-it app:build \
-m taar_etl.taar_amodump \
--date=20220620
```
**amowhitelist**
```
docker run \
-v ~/.config:/app/.config \
-v ~/.gcp_creds:/app/creds \
-e GOOGLE_APPLICATION_CREDENTIALS=/app/creds/<YOUR_SERVICE_ACCOUNT_JSON_FILE_HERE.json> \
-e GCLOUD_PROJECT=<YOUR_TEST_GCP_PROJECT_HERE> \
-it app:build \
-m taar_etl.taar_amowhitelist
```
**update_whitelist**
```
docker run \
-v ~/.config:/app/.config \
-v ~/.gcp_creds:/app/creds \
-e GOOGLE_APPLICATION_CREDENTIALS=/app/creds/<YOUR_SERVICE_ACCOUNT_JSON_FILE_HERE.json> \
-e GCLOUD_PROJECT=<YOUR_TEST_GCP_PROJECT_HERE> \
-it app:build \
-m taar_etl.taar_update_whitelist \
--date=20220620
```
**profile_bigtable.delete**
You might need to replace the GCP-project-specific arguments with values for your own project.
```
docker run \
-v ~/.config:/app/.config \
-e GCLOUD_PROJECT=moz-fx-data-taar-nonprod-48b6 \
-it app:build \
-m taar_etl.taar_profile_bigtable \
--iso-date=20210426 \
--gcp-project=moz-fx-data-taar-nonprod-48b6 \
--bigtable-table-id=taar_profile \
--bigtable-instance-id=taar-stage-202006 \
--delete-opt-out-days 28 \
--avro-gcs-bucket moz-fx-data-taar-nonprod-48b6-stage-etl \
--subnetwork regions/us-west1/subnetworks/gke-taar-nonprod-v1 \
--dataflow-workers=2 \
--dataflow-service-account taar-stage-dataflow@moz-fx-data-taar-nonprod-48b6.iam.gserviceaccount.com \
--sample-rate=1.0 \
--bigtable-delete-opt-out
```