Scripts for GLAM ETL
This directory contains an assortment of scripts for managing the state of GLAM
queries. Also refer to the bqetl glam
CLI tool.
The GLAM glean queries are situated within the glam_etl
dataset under the
glam-fenix-dev
project. Each query is prefixed with the namespace of the
Glean application.
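As a hypothetical illustration of that naming convention, the app namespace (with dashes mapped to underscores) prefixes each generated table name; the exact separator and table suffix shown here are assumptions for illustration, not taken from the source.

```shell
# hypothetical example: derive a prefixed table name from an app id
# (the "__" separator and suffix are assumptions for illustration)
app_id="org-mozilla-fenix"
table="$(echo "$app_id" | tr - _)__clients_daily_histogram_aggregates_v1"
echo "$table"
# → org_mozilla_fenix__clients_daily_histogram_aggregates_v1
```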
Setup
Make sure you have installed all the dependencies for the project.
# optional: in the project root, create a new virtual environment
python3 -m venv venv
source venv/bin/activate
# install dependencies
pip install --no-deps -r requirements.txt
# install bqetl command-line tool
pip install -e .
Ensure that you're logged into GCP services.
gcloud auth login
gcloud auth application-default login
The former allows command-line tools from the google-cloud-sdk to run, while the
latter allows the SDK (e.g. the google-cloud-bigquery
Python package) to
run locally.
Running the main integration test
The test_glean_org_mozilla_fenix_glam_nightly
script is the main testing
script for this set of GLAM ETL queries. The volume of data is typically low,
and the test requires joining across several different Glean datasets. Read the
script to determine how to run it. For reference, here is how to generate the
queries and the schemas for check-in.
GENERATE_ONLY=true script/glam/test/test_glean_org_mozilla_fenix_glam_nightly
The script test_glean_all_fenix
is the same test, but runs against the full set
of Firefox for Android document types. diff_glean_all_fenix_incremental
generates the SQL and produces a diff between two revisions of the
repository.
After generating and updating the schemas for the tables, it is helpful to commit the changes into source control to make it easier to distinguish changes between revisions of the codebase. Only commit the subset of queries that are representative, since there would be too much duplication otherwise.
Adding a new glean application
Adding a new Glean application currently requires a few dependencies:
- In the configuration (currently located in bigquery_etl.glam.generate), add a
  new entry for the application id.
- Define a build_date_udf that accepts a build id and returns a datetime. This
  is required as part of visualizing data for GLAM. See
  mozfun.glam.build_hour_to_datetime for an example.
- Test the SQL using run_glam_sql, either directly or by writing a script for
  automating parts in the test/ directory.
- Add the new application to the dags/glam_fenix DAG in telemetry-airflow.
Logical app ids
See this
document
for a specification of logical application ids. This was introduced in
PR#1221. A logical app id
is only necessary if a single application is split across multiple app names,
e.g. Fenix Nightly being spread across org-mozilla-fenix
,
org-mozilla-fenix-aurora
, and org-mozilla-fenix-nightly
depending on the
date. The views for the logical app can be found under
bigquery_etl/glam/templates/logical_app_id
.
Dropping tables
Use the list_tables
script to enumerate all of the tables for the GLAM
dataset. For example, we may want to prune all of the tables that do not belong
to the logical glam app ids for Fenix (e.g. org_mozilla_fenix_glam_nightly
).
We can use grep to limit the set of tables that we want to delete.
script/glam/list_tables glam_etl | \
grep -v clients_daily | grep -v fenix_glam | \
xargs -I {} echo "bq rm -f {}"
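Note that the pipeline only echoes the bq rm commands so they can be reviewed before anything is deleted. The filtering itself can be sanity-checked locally; the table names below are made up for illustration.

```shell
# local demo of the filter on hypothetical table names: clients_daily tables
# and the fenix_glam logical-app tables are excluded from deletion
printf '%s\n' \
  glam_etl.org_mozilla_fenix_clients_daily_histogram_aggregates_v1 \
  glam_etl.org_mozilla_fenix_glam_nightly__view_probe_counts_v1 \
  glam_etl.org_mozilla_fenix_histogram_percentiles_v1 |
  grep -v clients_daily | grep -v fenix_glam |
  xargs -I {} echo "bq rm -f {}"
# → bq rm -f glam_etl.org_mozilla_fenix_histogram_percentiles_v1
```

Once the printed commands look right, drop the echo (or pipe the output to bash) to perform the deletions.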
We generate the commands to drop incremental tables using the following:
script/glam/list_tables glam_etl | \
grep -v clients_daily | grep aggregates_v1 | \
xargs -I {} echo "bq rm -f {}"
To copy tables from the glam_etl dataset to the glam_etl_dev dataset, run the
following:
script/glam/list_tables glam_etl | \
grep clients_daily | grep -v org_mozilla_firefox | grep -v view | \
xargs -I {} echo bq cp -f {} {} | sed 's/glam_etl/glam_etl_dev/2'
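The sed expression uses the numeric flag 2, which rewrites only the second occurrence of glam_etl on each line, so the copy destination changes while the source stays intact. A quick local demonstration (the table name foo is hypothetical):

```shell
# 's/glam_etl/glam_etl_dev/2' only touches the second match on the line,
# i.e. the destination of the bq cp command
echo "bq cp -f glam_etl.foo glam_etl.foo" | sed 's/glam_etl/glam_etl_dev/2'
# → bq cp -f glam_etl.foo glam_etl_dev.foo
```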