BigQuery ETL
BigQuery UDFs and SQL queries for building derived datasets.
Formatting SQL
We enforce consistent SQL formatting as part of CI. After adding or changing a query, use `script/format_sql` to apply formatting rules.
Directories and files passed as arguments to `script/format_sql` will be formatted in place, with directories recursively searched for files with a `.sql` extension, e.g.:
$ echo 'SELECT 1,2,3' > test.sql
$ script/format_sql test.sql
modified test.sql
1 file(s) modified
$ cat test.sql
SELECT
1,
2,
3
If no arguments are specified the script will read from stdin and write to stdout, e.g.:
$ echo 'SELECT 1,2,3' | script/format_sql
SELECT
1,
2,
3
To turn off SQL formatting for a block of SQL, wrap it in `format:off` and `format:on` comments, like this:
SELECT
-- format:off
submission_date, sample_id, client_id
-- format:on
Recommended practices
Queries
- Should be defined in files named as `templates/<dataset>/<table>_<version>/query.sql`, e.g. `sql/telemetry_derived/clients_daily_v7/query.sql`
- May be generated using a Python script that prints the query to stdout
  - Should save output as `templates/<dataset>/<table>_<version>/query.sql` as above
  - Should be named as `sql/query_type.sql.py`, e.g. `sql/clients_daily.sql.py`
  - May use options to generate queries for different destination tables, e.g. using `--source telemetry_core_parquet_v3` to generate `sql/telemetry/core_clients_daily_v1/query.sql` and using `--source main_summary_v4` to generate `sql/telemetry/clients_daily_v7/query.sql`
  - Should output a header indicating options used, e.g. `-- Query generated by: sql/clients_daily.sql.py --source telemetry_core_parquet`
- Should not specify a project or dataset in table names to simplify testing
- Should be incremental
- Should filter input tables on partition and clustering columns
- Should use `_` prefix in generated column names not meant for output
- Should use `_bits` suffix for any integer column that represents a bit pattern
- Should not use `DATETIME` type, due to incompatibility with spark-bigquery-connector
- Should read from `*_stable` tables instead of including custom deduplication
  - Should use the earliest row for each `document_id` by `submission_timestamp` where filtering duplicates is necessary
- Should escape identifiers that match keywords, even if they aren't reserved keywords
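The generator-script guidance above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual `sql/clients_daily.sql.py`: the query body, the `render_query` and `main` names, and the default value of `--source` are assumptions made for the example. Only the `--source` option and the header format come from the text above.

```python
"""Hypothetical query generator in the style of sql/clients_daily.sql.py."""
import argparse
import shlex
import sys

# Illustrative query body only (an assumption); the real query is not shown here.
# The table name carries no project or dataset, and the query filters on the
# partition column using the @submission_date parameter.
QUERY_TEMPLATE = """\
SELECT
  submission_date,
  client_id,
  COUNT(*) AS n_pings
FROM
  {source}
WHERE
  submission_date = @submission_date
GROUP BY
  submission_date,
  client_id
"""


def render_query(argv):
    """Return the query text, preceded by a header recording the options used."""
    parser = argparse.ArgumentParser(prog="sql/clients_daily.sql.py")
    parser.add_argument("--source", default="main_summary_v4")
    args = parser.parse_args(argv)
    options = " ".join(shlex.quote(arg) for arg in argv)
    header = f"-- Query generated by: sql/clients_daily.sql.py {options}".rstrip()
    return header + "\n" + QUERY_TEMPLATE.format(source=args.source)


def main():
    """Print the query to stdout so it can be redirected into query.sql."""
    sys.stdout.write(render_query(sys.argv[1:]))
```

Calling `main()` from the usual `if __name__ == "__main__":` guard and redirecting stdout to the destination `query.sql` would save the output as described above.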
Views
- Should be defined in files named as `sql/dataset/table_version/view.sql`, e.g. `sql/telemetry/telemetry_core_parquet_v3/view.sql`
- Must specify project and dataset in all table names
  - Should default to using the `moz-fx-data-shared-prod` project
UDFs
- Should limit the number of expression subqueries to avoid:
BigQuery error in query operation: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.
- Should be used to avoid code duplication
- Must be named in files with lower snake case names ending in `.sql`, e.g. `mode_last.sql`
  - Each file must only define effectively private helper functions and one public function which must be defined last
    - Helper functions must not conflict with function names in other files
  - SQL UDFs must be defined in the `udf/` directory and JS UDFs must be defined in the `udf_js` directory
    - The `udf_legacy/` directory is an exception which must only contain compatibility functions for queries migrated from Athena/Presto.
  - Functions must be named with a prefix of `<dir_name>_` so all functions in `udf/*.sql` must start with `udf_`
    - The final function in a file must be named as `<dir_name>_<file_name_without_suffix>` so `udf/mode_last.sql` must define a function `udf_mode_last`
- Functions must be defined as temporary using `CREATE TEMP FUNCTION` syntax
  - We provide tooling in `script/publish_persistent_udfs` for converting these definitions to persistent UDFs (temporary UDF `udf_mode_last` is published as persistent UDF `udf.mode_last`)
- Should use `SQL` over `js` for performance
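The naming rules above can be expressed as a small check. This is a sketch under stated assumptions: `temp_udf_name` and `persistent_udf_name` are hypothetical helpers, not part of the repository's tooling, and the persistent name for directories other than `udf/` is extrapolated from the single `udf_mode_last` to `udf.mode_last` example.

```python
import re
from pathlib import PurePosixPath

# Lower snake case: lowercase letters and digits, separated by single underscores.
SNAKE_CASE = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")


def _split(path):
    """Return (dir_name, file_name_without_suffix) after validating the file name."""
    p = PurePosixPath(path)
    if p.suffix != ".sql" or not SNAKE_CASE.fullmatch(p.stem):
        raise ValueError(f"expected a lower snake case .sql file, got {path!r}")
    return p.parent.name, p.stem


def temp_udf_name(path):
    """Name the final (public) temporary function in the file must have."""
    dir_name, stem = _split(path)
    return f"{dir_name}_{stem}"


def persistent_udf_name(path):
    """Assumed persistent name, following the udf.mode_last example."""
    dir_name, stem = _split(path)
    return f"{dir_name}.{stem}"
```

For `udf/mode_last.sql` these return `udf_mode_last` and `udf.mode_last`, matching the example in the list above.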
Backfills
- Should be avoided on large tables
  - Backfills may double storage cost for a table for 90 days by moving data from long-term storage to short-term storage
    - For example regenerating `clients_last_seen_v1` from scratch would cost about $1600 for the query and about $6800 for data moved to short-term storage
  - Should combine multiple backfills happening around the same time
  - Should delay column deletes until the next other backfill
    - Should use `NULL` for new data and `EXCEPT` to exclude from views until dropped
- Should use copy operations in append mode to change column order
- Copy operations do not allow changing partitioning, changing clustering, or column deletes
- Should split backfilling into queries that finish in minutes not hours
- May use script/generate_incremental_table to automate backfilling incremental queries
- May be performed in a single query for smaller tables that do not depend on history
  - A useful pattern is to have the only reference to `@submission_date` be a clause `WHERE (@submission_date IS NULL OR @submission_date = submission_date)` which allows recreating all dates by passing `--parameter=submission_date:DATE:NULL`
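The nullable-parameter pattern in the last item can be tried locally. The sketch below uses Python's built-in `sqlite3` as a stand-in for BigQuery, so the named parameter is written `:submission_date` instead of `@submission_date`. The `events` table and its rows are invented for illustration.

```python
import sqlite3

# Same shape as the WHERE clause above, with sqlite's :name parameter syntax.
QUERY = """
SELECT submission_date, COUNT(*) AS n
FROM events
WHERE (:submission_date IS NULL OR :submission_date = submission_date)
GROUP BY submission_date
ORDER BY submission_date
"""


def run(submission_date):
    """Run the query for one date, or for all dates when submission_date is None."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (submission_date TEXT, client_id TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?)",
        [("2019-01-01", "a"), ("2019-01-01", "b"), ("2019-01-02", "a")],
    )
    rows = conn.execute(QUERY, {"submission_date": submission_date}).fetchall()
    conn.close()
    return rows
```

Passing a date restricts the result to that day, while passing `None` (the analogue of `--parameter=submission_date:DATE:NULL`) recreates every date in a single query.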
Incremental Queries
Benefits
- BigQuery billing discounts for destination table partitions not modified in the last 90 days
- May use dags.utils.gcp.bigquery_etl_query to simplify Airflow configuration, e.g. see dags.main_summary.exact_mau28_by_dimensions
- May use script/generate_incremental_table to automate backfilling
- Should use `WRITE_TRUNCATE` mode or `bq query --replace` to replace partitions atomically to prevent duplicate data
- Will have tooling to generate an optimized mostly materialized view that only calculates the most recent partition
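As a rough illustration of the `bq query --replace` item above, the helper below builds the argument list for replacing a single date partition. The `bq_replace_partition_argv` name and the dataset and table values are hypothetical. The sketch assumes the destination is a day-partitioned table addressed with a `$YYYYMMDD` partition decorator and that the query text is supplied on stdin.

```python
from datetime import date


def bq_replace_partition_argv(dataset, table, day):
    """Build a bq command that atomically overwrites one date partition."""
    return [
        "bq",
        "query",
        "--use_legacy_sql=false",
        # Replace the destination instead of appending, so reruns cannot
        # leave duplicate rows in the partition.
        "--replace",
        # Partition decorator: only this day's partition is overwritten.
        f"--destination_table={dataset}.{table}${day:%Y%m%d}",
        f"--parameter=submission_date:DATE:{day.isoformat()}",
    ]
```

The resulting list could be passed to `subprocess.run(argv, stdin=open(query_path))`, where `query_path` points at the generated `query.sql`.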
Properties
- Must accept a date via `@submission_date` query parameter
  - Must output a column named `submission_date` matching the query parameter
- Must produce similar results when run multiple times
  - Should produce identical results when run multiple times
- May depend on the previous partition
  - If using previous partition, must include an `init.sql` query to initialize the table, e.g. `templates/telemetry_derived/clients_last_seen_v1/init.sql`
  - Should be impacted by values from a finite number of preceding partitions
    - This allows for backfilling in chunks instead of serially for all time and limiting backfills to a certain number of days following updated data
    - For example `sql/clients_last_seen_v1.sql` can be run serially on any 28 day period and the last day will be the same whether or not the partition preceding the first day was missing because values are only impacted by 27 preceding days
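The finite-history property in the last item can be checked with a small simulation. It assumes, purely for illustration, that each client row carries a 28-bit activity pattern (a `_bits` style column) that is shifted by one position per day. The README does not spell out how `clients_last_seen_v1` is computed, so `next_bits` and `run_serially` are hypothetical.

```python
WINDOW_DAYS = 28
MASK = (1 << WINDOW_DAYS) - 1  # keep only the 28 most recent days


def next_bits(previous_bits, seen_today):
    """Shift yesterday's pattern by one day and record today's activity in bit 0."""
    return ((previous_bits << 1) & MASK) | int(seen_today)


def run_serially(initial_bits, activity):
    """Apply the daily update once per day, starting from the given previous partition."""
    bits = initial_bits
    for seen in activity:
        bits = next_bits(bits, seen)
    return bits
```

After 28 serial runs every bit inherited from the partition preceding the first day has been shifted out of the window, so the final value is the same whether that partition was empty, missing, or arbitrary. After only 27 runs one inherited bit is still inside the window and the results can differ.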
Scheduling Queries in Airflow
Instructions for scheduling queries in Airflow can be found in this cookbook.
Contributing
When adding or modifying a query in this repository, make your changes in the `templates/` directory. Each time you run tests locally (see Tests below), the `sql/` directory will be regenerated, inserting definitions of any UDFs referenced by the query. To force recreation of the `sql/` directory without running tests, invoke:
./script/generate_sql
You are expected to commit the generated content in `sql/` along with your changes to the source in `templates/`, otherwise CI will fail. This matches the strategy used by mozilla-pipeline-schemas and ensures that the final queries being run by Airflow are directly available to reference via URL and to view via the GitHub UI.