[![CircleCI](https://circleci.com/gh/mozilla/bigquery-etl.svg?style=shield&circle-token=742fb1108f7e6e5a28c11d43b21f62605037f5a4)](https://circleci.com/gh/mozilla/bigquery-etl)

BigQuery ETL
===

BigQuery UDFs and SQL queries for building derived datasets.

Formatting SQL
---

We enforce consistent SQL formatting as part of CI. After adding or changing a
query, use `script/format_sql` to apply formatting rules.

Directories and files passed as arguments to `script/format_sql` will be
formatted in place, with directories recursively searched for files with a
`.sql` extension, e.g.:

```bash
$ echo 'SELECT 1,2,3' > test.sql
$ script/format_sql test.sql
modified test.sql
1 file(s) modified
$ cat test.sql
SELECT
  1,
  2,
  3
```

If no arguments are specified the script will read from stdin and write to
stdout, e.g.:

```bash
$ echo 'SELECT 1,2,3' | script/format_sql
SELECT
  1,
  2,
  3
```

To turn off SQL formatting for a block of SQL, wrap it in `format:off` and
`format:on` comments, like this:

```sql
SELECT
  -- format:off
  submission_date, sample_id, client_id
  -- format:on
```

Recommended practices
---

### Queries

- Should be defined in files named as `sql/<dataset>/<table>_<version>/query.sql` e.g.
  `sql/telemetry_derived/clients_daily_v7/query.sql`
  - Queries that populate tables should always be named with a version suffix;
    we assume that future optimizations to the data representation may require
    schema-incompatible changes such as dropping columns
- May be generated using a python script that prints the query to stdout
  - Should save output as `sql/<dataset>/<table>_<version>/query.sql` as above
  - Should be named as `sql/query_type.sql.py` e.g. `sql/clients_daily.sql.py`
  - May use options to generate queries for different destination tables e.g.
    using `--source telemetry_core_parquet_v3` to generate
    `sql/telemetry/core_clients_daily_v1/query.sql` and using `--source main_summary_v4` to
    generate `sql/telemetry/clients_daily_v7/query.sql`
  - Should output a header indicating options used e.g.
    ```sql
    -- Query generated by: sql/clients_daily.sql.py --source telemetry_core_parquet
    ```
- Should not specify a project or dataset in table names to simplify testing
- Should be [incremental]
- Should filter input tables on partition and clustering columns
- Should use `_` prefix in generated column names not meant for output
- Should use `_bits` suffix for any integer column that represents a bit pattern
- Should not use `DATETIME` type, due to incompatibility with
  [spark-bigquery-connector]
- Should read from `*_stable` tables instead of including custom deduplication
  - Should use the earliest row for each `document_id` by `submission_timestamp`
    where filtering duplicates is necessary
- Should escape identifiers that match keywords, even if they aren't [reserved keywords]
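
For illustration, a minimal `query.sql` following several of these conventions might
look like the sketch below; the file path, source table, and column names are
hypothetical and not part of this repository:

```sql
-- Hypothetical sql/telemetry_derived/example_counts_v1/query.sql
SELECT
  DATE(submission_timestamp) AS submission_date,
  sample_id,
  COUNT(*) AS ping_count
FROM
  main_v4  -- no project or dataset specified, per the guideline above
WHERE
  DATE(submission_timestamp) = @submission_date  -- filter on the partition column
GROUP BY
  submission_date,
  sample_id
```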

### Views

- Should be defined in files named as `sql/<dataset>/<table>/view.sql` e.g.
  `sql/telemetry/core/view.sql`
- Views should generally _not_ be named with a version suffix; a view represents a
  stable interface for users and whenever possible should maintain compatibility
  with existing queries; if the view logic cannot be adapted to changes in underlying
  tables, breaking changes must be communicated to `fx-data-dev@mozilla.org`
- Must specify project and dataset in all table names
- Should default to using the `moz-fx-data-shared-prod` project;
  the `scripts/publish_views` tooling can handle parsing the definitions to publish
  to other projects such as `derived-datasets`
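
For illustration, a minimal `view.sql` following these conventions might look like the
sketch below; the underlying versioned table name is hypothetical:

```sql
-- Hypothetical sql/telemetry/core/view.sql
CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.telemetry.core`
AS
SELECT
  *
FROM
  `moz-fx-data-shared-prod.telemetry_derived.core_v1`  -- hypothetical versioned table
```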

### UDFs

- Should limit the number of [expression subqueries] to avoid: `BigQuery error
  in query operation: Resources exceeded during query execution: Not enough
  resources for query planning - too many subqueries or query is too complex.`
- Should be used to avoid code duplication
- Must be named in files with lower snake case names ending in `.sql`
  e.g. `mode_last.sql`
  - Each file must only define effectively private helper functions and one
    public function which must be defined last
  - Helper functions must not conflict with function names in other files
- SQL UDFs must be defined in the `udf/` directory and JS UDFs must be defined
  in the `udf_js` directory
  - The `udf_legacy/` directory is an exception which must only contain
    compatibility functions for queries migrated from Athena/Presto.
- Functions must be defined as [persistent UDFs](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions#temporary-udf-syntax)
  using `CREATE OR REPLACE FUNCTION` syntax
  - Function names must be prefixed with a dataset of `<dir_name>.` so, for example,
    all functions in `udf/*.sql` are part of the `udf` dataset
  - The final syntax for creating a function in a file will look like
    `CREATE OR REPLACE FUNCTION <dir_name>.<file_name>`
  - We provide tooling in `scripts/publish_persistent_udfs` for
    publishing these UDFs to BigQuery
    - Changes made to UDFs need to be published manually in order for the
      dry run CI task to pass
- Should use `SQL` over `js` for performance
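
As an illustration, a persistent SQL UDF following these conventions might be defined
as in the sketch below; the function name and logic are made up for this example:

```sql
-- Hypothetical udf/example_add_one.sql: the `udf.` prefix matches the directory name
CREATE OR REPLACE FUNCTION udf.example_add_one(x INT64) AS (
  x + 1
);
```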

### Backfills

- Should be avoided on large tables
  - Backfills may double storage cost for a table for 90 days by moving
    data from long-term storage to short-term storage
    - For example, regenerating `clients_last_seen_v1` from scratch would cost
      about $1600 for the query and about $6800 for data moved to short-term
      storage
  - Should combine multiple backfills happening around the same time
  - Should delay column deletes until the next backfill that is happening for
    other reasons
    - Should use `NULL` for new data and `EXCEPT` to exclude from views until
      dropped
- Should use copy operations in append mode to change column order
  - Copy operations do not allow changing partitioning, changing clustering, or
    column deletes
- Should split backfilling into queries that finish in minutes, not hours
  - May use [script/generate_incremental_table] to automate backfilling incremental
    queries
  - May be performed in a single query for smaller tables that do not depend on history
    - A useful pattern is to have the only reference to `@submission_date` be a
      clause `WHERE (@submission_date IS NULL OR @submission_date = submission_date)`
      which allows recreating all dates by passing `--parameter=submission_date:DATE:NULL`
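
A minimal sketch of that pattern, using a hypothetical source table, might look like:

```sql
-- A NULL @submission_date recreates all dates; a concrete date backfills one partition
SELECT
  submission_date,
  client_id
FROM
  telemetry_derived.example_daily_v1  -- hypothetical source table
WHERE
  (@submission_date IS NULL OR @submission_date = submission_date)
```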

Incremental Queries
---

### Benefits

- BigQuery billing discounts for destination table partitions not modified in
  the last 90 days
- May use [dags.utils.gcp.bigquery_etl_query] to simplify Airflow configuration
  e.g. see [dags.main_summary.exact_mau28_by_dimensions]
- May use [script/generate_incremental_table] to automate backfilling
- Should use `WRITE_TRUNCATE` mode or `bq query --replace` to replace
  partitions atomically to prevent duplicate data
- Will have tooling to generate an optimized _mostly materialized view_ that
  only calculates the most recent partition

### Properties

- Must accept a date via `@submission_date` query parameter
  - Must output a column named `submission_date` matching the query parameter
- Must produce similar results when run multiple times
  - Should produce identical results when run multiple times
- May depend on the previous partition
  - If using previous partition, must include an `init.sql` query to initialize the
    table, e.g. `sql/telemetry_derived/clients_last_seen_v1/init.sql`
  - Should be impacted by values from a finite number of preceding partitions
    - This allows for backfilling in chunks instead of serially for all time
      and limiting backfills to a certain number of days following updated data
    - For example, `sql/clients_last_seen_v1.sql` can be run serially on any 28-day
      period and the last day will be the same whether or not the partition
      preceding the first day was missing because values are only impacted by
      27 preceding days
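
For illustration, a simplified, hypothetical sketch of an incremental query that
accepts `@submission_date`, outputs a matching `submission_date` column, and depends
on the previous partition (so it would need an `init.sql`) might look like:

```sql
-- Hypothetical query depending on the previous partition of its own destination table
WITH _current AS (
  SELECT
    client_id,
    0 AS days_since_seen
  FROM
    example_daily_v1  -- hypothetical per-day table
  WHERE
    submission_date = @submission_date
),
_previous AS (
  SELECT
    client_id,
    days_since_seen
  FROM
    example_last_seen_v1  -- hypothetical destination table, previous day's partition
  WHERE
    submission_date = DATE_SUB(@submission_date, INTERVAL 1 DAY)
    -- only a finite window of history can affect the result
    AND days_since_seen < 27
)
SELECT
  @submission_date AS submission_date,
  client_id,
  COALESCE(_current.days_since_seen, _previous.days_since_seen + 1) AS days_since_seen
FROM
  _current
FULL JOIN
  _previous
USING
  (client_id)
```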

Query Metadata
---

- For each query, a `metadata.yaml` file should be created in the same directory
- This file contains a description, owners and labels. As an example:

```yaml
friendly_name: SSL Ratios
description: >
  Percentages of page loads Firefox users have performed that were
  conducted over SSL broken down by country.
owners:
  - example@mozilla.com
labels:
  application: firefox
  incremental: true # incremental queries add data to existing tables
  schedule: daily # scheduled in Airflow to run daily
  public_json: true
  public_bigquery: true
  review_bug: 1414839 # Bugzilla bug ID of data review
  incremental_export: false # non-incremental JSON export writes all data to a single location
```

### Publishing Datasets

- To make query results publicly available, the `public_bigquery` flag must be set in
  `metadata.yaml`
  - Tables will be published in the `mozilla-public-data` GCP project, which is accessible
    to everyone, including external users
- To make query results publicly available as JSON, the `public_json` flag must be set in
  `metadata.yaml`
  - Data will be accessible under https://public-data.telemetry.mozilla.org
    - A list of all available datasets is published under https://public-data.telemetry.mozilla.org/all-datasets.json
  - For example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/000000000000.json
  - Output JSON files have a maximum size of 1 GB; data can be split up into multiple files (`000000000000.json`, `000000000001.json`, ...)
  - `incremental_export` controls how data should be exported as JSON:
    - `false`: all data of the source table gets exported to a single location
      - https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/000000000000.json
    - `true`: only data that matches the `submission_date` parameter is exported as JSON to a separate directory for this date
      - https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/2020-03-15/000000000000.json
  - For each dataset, a `metadata.json` is published listing all available files, for example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/metadata.json
  - The timestamp when the dataset was last updated is recorded in `last_updated`, e.g.: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/last_updated

Scheduling Queries in Airflow
---

Instructions for scheduling queries in Airflow can be found in this
[cookbook](https://docs.telemetry.mozilla.org/cookbooks/bigquery-airflow.html).

Contributing
---

When adding or modifying a query in this repository, make your changes in the
`sql/` directory.

When adding a new library to the Python requirements, first add the library to
the requirements and then add any meta-dependencies into constraints.
Constraints are discovered by installing requirements into a fresh virtual
environment. A dependency should be added to either `requirements.txt` or
`constraints.txt`, but not both.

```bash
# Create and activate a python virtual environment.
python3 -m venv venv/
source venv/bin/activate

# If not installed:
pip install pip-tools

# Add the dependency to requirements.in e.g. Jinja2.
echo Jinja2==2.11.1 >> requirements.in

# Compile hashes for new dependencies.
pip-compile --generate-hashes requirements.in

# Deactivate the python virtual environment.
deactivate
```

Tests
---

[See the documentation in tests/](tests/README.md)

[script/generate_incremental_table]: https://github.com/mozilla/bigquery-etl/blob/master/script/generate_incremental_table
[expression subqueries]: https://cloud.google.com/bigquery/docs/reference/standard-sql/expression_subqueries
[dags.utils.gcp.bigquery_etl_query]: https://github.com/mozilla/telemetry-airflow/blob/89a6dc3/dags/utils/gcp.py#L364
[dags.main_summary.exact_mau28_by_dimensions]: https://github.com/mozilla/telemetry-airflow/blob/89a6dc3/dags/main_summary.py#L385-L390
[incremental]: #incremental-queries
[spark-bigquery-connector]: https://github.com/GoogleCloudPlatform/spark-bigquery-connector/issues/5
[reserved keywords]: https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#reserved-keywords
[mozilla-pipeline-schemas]: https://github.com/mozilla-services/mozilla-pipeline-schemas