README.md

Scripts for GLAM ETL

This directory contains an assortment of scripts for managing the state of GLAM queries. Also refer to the bqetl glam CLI tool.

The GLAM glean queries are situated within the glam_etl dataset under the glam-fenix-dev project. Each query is prefixed with the namespace of the glean application.

Setup

Make sure you have installed all the dependencies for the project.

# optional: in the project root, create a new virtual environment
python3 -m venv venv
source venv/bin/activate

# install dependencies
pip install --no-deps -r requirements.txt
# install bqetl command-line tool
pip install -e .

Ensure that you're logged into GCP services.

gcloud auth login
gcloud auth application-default login

The former authorizes the command-line tools in the google-cloud-sdk (e.g. gcloud and bq), while the latter sets up Application Default Credentials so that client libraries (e.g. the google-cloud-bigquery python package) can authenticate locally.

Running the main integration test

The test_glean_org_mozilla_fenix_glam_nightly script is the main testing script for this set of GLAM ETL queries. It runs against a typically low volume of data while still joining across several different glean datasets. Read the script to determine how to run it. For reference, here is how to generate the queries and the schemas for check-in.

GENERATE_ONLY=true script/glam/test/test_glean_org_mozilla_fenix_glam_nightly

The test_glean_all_fenix script runs the same test against the full set of Firefox for Android document types. diff_glean_all_fenix_incremental generates the SQL and produces a diff between two revisions of the repository.

After generating and updating the schemas for the tables, it is helpful to commit the changes into source control to make it easier to distinguish changes between revisions of the codebase. Only commit the subset of queries that are representative, since there would be too much duplication otherwise.

Adding a new glean application

Adding a new Glean application currently requires a few steps:

  1. In the configuration (currently located in bigquery_etl.glam.generate), add a new entry for the application id.
  2. Define a build_date_udf that accepts a build id and returns a datetime. This is required as part of visualizing data for GLAM. See mozfun.glam.build_hour_to_datetime for an example.
  3. Test the SQL using run_glam_sql, either directly or by writing a script for automating parts in the test/ directory.
  4. Add the new application to the dags/glam_fenix DAG in telemetry-airflow.

Logical app ids

See this document for a specification on logical application ids. This was introduced in PR#1221. A logical app id is only necessary if a single application is split across multiple app names, e.g. Fenix Nightly being spread across org-mozilla-fenix, org-mozilla-fenix-aurora, and org-mozilla-fenix-nightly depending on the date. The views for the logical app can be found under bigquery_etl/glam/templates/logical_app_id.

Dropping tables

Use the list_tables script to enumerate all of the tables for the GLAM dataset. For example, we may want to prune all of the tables that do not belong to the logical glam app ids for Fenix (e.g. org_mozilla_fenix_glam_nightly). We can use grep to limit the set of tables that we want to delete.

script/glam/list_tables glam_etl | \
    grep -v clients_daily | grep -v fenix_glam | \
    xargs -I {} echo "bq rm -f {}"
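Note that the pipeline only echoes the bq rm commands rather than running them, so the list can be reviewed first. As a local sketch of the filtering (the table names below are hypothetical, for illustration only):

```shell
# Hypothetical table names; anything matching clients_daily or fenix_glam
# is excluded, and the survivors are turned into echoed bq rm commands.
printf '%s\n' \
    glam_etl.org_mozilla_fenix__clients_daily_histogram_aggregates_v1 \
    glam_etl.org_mozilla_fenix_glam_nightly__view_probe_counts_v1 \
    glam_etl.org_mozilla_fenix__histogram_bucket_counts_v1 | \
    grep -v clients_daily | grep -v fenix_glam | \
    xargs -I {} echo "bq rm -f {}"
# → bq rm -f glam_etl.org_mozilla_fenix__histogram_bucket_counts_v1
```

After reviewing the echoed commands, pipe the output to sh to actually drop the tables.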

We generate the commands to drop incremental tables using the following:

script/glam/list_tables glam_etl | \
    grep -v clients_daily | grep aggregates_v1 | \
    xargs -I {} echo "bq rm -f {}"

To copy tables from the glam_etl dataset to the glam_etl_dev dataset, run the following:

script/glam/list_tables glam_etl | \
    grep clients_daily | grep -v org_mozilla_firefox | grep -v view | \
    xargs -I {} echo bq cp -f {} {} | sed 's/glam_etl/glam_etl_dev/2'
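The trailing sed expression uses the occurrence flag 2 to rewrite only the second glam_etl match on each line, i.e. the destination argument of bq cp, leaving the source table untouched. A local sketch with a hypothetical table name:

```shell
# sed 's/old/new/2' replaces only the 2nd match per line, so the copy
# destination is moved into glam_etl_dev while the source stays put.
printf 'glam-fenix-dev:glam_etl.clients_daily_v1\n' | \
    xargs -I {} echo bq cp -f {} {} | \
    sed 's/glam_etl/glam_etl_dev/2'
# → bq cp -f glam-fenix-dev:glam_etl.clients_daily_v1 glam-fenix-dev:glam_etl_dev.clients_daily_v1
```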