[![CircleCI](https://circleci.com/gh/mozilla/bigquery-etl.svg?style=shield&circle-token=742fb1108f7e6e5a28c11d43b21f62605037f5a4)](https://circleci.com/gh/mozilla/bigquery-etl)

BigQuery ETL
===
BigQuery UDFs and SQL queries for building derived datasets.

Formatting SQL
---
We enforce consistent SQL formatting as part of CI. After adding or changing a
query, use `script/format_sql` to apply formatting rules.
Directories and files passed as arguments to `script/format_sql` will be
formatted in place, with directories recursively searched for files with a
`.sql` extension, e.g.:
```bash
$ echo 'SELECT 1,2,3' > test.sql
$ script/format_sql test.sql
modified test.sql
1 file(s) modified
$ cat test.sql
SELECT
  1,
  2,
  3
```
If no arguments are specified the script will read from stdin and write to
stdout, e.g.:
```bash
$ echo 'SELECT 1,2,3' | script/format_sql
SELECT
  1,
  2,
  3
```
To turn off sql formatting for a block of SQL, wrap it in `format:off` and
`format:on` comments, like this:
```sql
SELECT
  -- format:off
  submission_date, sample_id, client_id
  -- format:on
```
Recommended practices
---
### Queries
- Should be defined in files named as `sql/<dataset>/<table>_<version>/query.sql` e.g.
  `sql/telemetry_derived/clients_daily_v7/query.sql`
  - Queries that populate tables should always be named with a version suffix;
    we assume that future optimizations to the data representation may require
    schema-incompatible changes such as dropping columns
- May be generated using a python script that prints the query to stdout
  - Should save output as `sql/<dataset>/<table>_<version>/query.sql` as above
  - Should be named as `sql/query_type.sql.py` e.g. `sql/clients_daily.sql.py`
  - May use options to generate queries for different destination tables e.g.
    using `--source telemetry_core_parquet_v3` to generate
    `sql/telemetry/core_clients_daily_v1/query.sql` and using `--source main_summary_v4` to
    generate `sql/telemetry/clients_daily_v7/query.sql`
  - Should output a header indicating options used e.g.
    ```sql
    -- Query generated by: sql/clients_daily.sql.py --source telemetry_core_parquet
    ```
- Should not specify a project or dataset in table names to simplify testing
- Should be [incremental]
- Should filter input tables on partition and clustering columns
- Should use `_` prefix in generated column names not meant for output
- Should use `_bits` suffix for any integer column that represents a bit pattern
- Should not use `DATETIME` type, due to incompatibility with
[spark-bigquery-connector]
- Should read from `*_stable` tables instead of including custom deduplication
  - Should use the earliest row for each `document_id` by `submission_timestamp`
    where filtering duplicates is necessary (see the sketch after this list)
- Should escape identifiers that match keywords, even if they aren't [reserved keywords]
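
For illustration, here is a minimal sketch of how several of these guidelines fit
together in a single `query.sql`; the table, dataset, and column names are hypothetical:

```sql
-- Hypothetical sql/telemetry_derived/example_counts_v1/query.sql
WITH _deduplicated AS (
  SELECT
    * EXCEPT (_n)
  FROM
    (
      SELECT
        *,
        -- `_` prefix marks a generated column that is not meant for output
        ROW_NUMBER() OVER (
          PARTITION BY
            document_id
          ORDER BY
            submission_timestamp
        ) AS _n
      FROM
        telemetry_stable.example_v1
      WHERE
        -- filter on the partition column
        DATE(submission_timestamp) = @submission_date
    )
  WHERE
    _n = 1
)
SELECT
  @submission_date AS submission_date,
  client_id,
  COUNT(*) AS `rows`  -- escape identifiers that collide with keywords
FROM
  _deduplicated
GROUP BY
  client_id
```
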
### Views
- Should be defined in files named as `sql/<dataset>/<table>/view.sql` e.g.
`sql/telemetry/core/view.sql`
- Views should generally _not_ be named with a version suffix; a view represents a
stable interface for users and whenever possible should maintain compatibility
with existing queries; if the view logic cannot be adapted to changes in underlying
tables, breaking changes must be communicated to `fx-data-dev@mozilla.org`
- Must specify project and dataset in all table names
  - Should default to using the `moz-fx-data-shared-prod` project;
    the `scripts/publish_views` tooling can handle parsing the definitions to publish
    to other projects such as `derived-datasets`
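
As a sketch, a `view.sql` that follows these conventions might look like this (the view
and source table names are hypothetical); note the fully qualified project and dataset,
and that the view itself carries no version suffix while the underlying table does:

```sql
-- Hypothetical sql/telemetry/example/view.sql
CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.telemetry.example`
AS
SELECT
  *
FROM
  `moz-fx-data-shared-prod.telemetry_derived.example_v1`
```
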
### UDFs
- Should limit the number of [expression subqueries] to avoid: `BigQuery error
in query operation: Resources exceeded during query execution: Not enough
resources for query planning - too many subqueries or query is too complex.`
- Should be used to avoid code duplication
- Must be named in files with lower snake case names ending in `.sql`
e.g. `mode_last.sql`
- Each file must only define effectively private helper functions and one
public function which must be defined last
- Helper functions must not conflict with function names in other files
- SQL UDFs must be defined in the `udf/` directory and JS UDFs must be defined
  in the `udf_js` directory
  - The `udf_legacy/` directory is an exception which must only contain
    compatibility functions for queries migrated from Athena/Presto.
- Functions must be defined as [persistent UDFs](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions#temporary-udf-syntax)
  using `CREATE OR REPLACE FUNCTION` syntax
  - Function names must be prefixed with a dataset of `<dir_name>.` so, for example,
    all functions in `udf/*.sql` are part of the `udf` dataset
    - The final syntax for creating a function in a file will look like
      `CREATE OR REPLACE FUNCTION <dir_name>.<file_name>`
  - We provide tooling in `scripts/publish_persistent_udfs` for
    publishing these UDFs to BigQuery
    - Changes made to UDFs need to be published manually in order for the
      dry run CI task to pass
- Should use `SQL` over `js` for performance
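
The expected file layout for a SQL UDF is sketched below with a hypothetical
`udf/example_pos_of_max.sql`: an effectively private helper first, then the single
public function defined last, both created with `CREATE OR REPLACE FUNCTION` in the
`udf` dataset:

```sql
-- Hypothetical udf/example_pos_of_max.sql

-- Effectively private helper; named so it cannot collide with helpers in other files.
CREATE OR REPLACE FUNCTION udf.example_pos_of_max_zero_indexed(list ARRAY<INT64>) AS (
  (
    SELECT
      off
    FROM
      UNNEST(list) AS value
      WITH OFFSET AS off
    ORDER BY
      value DESC
    LIMIT
      1
  )
);

-- The one public function, defined last; published as `udf.example_pos_of_max`.
CREATE OR REPLACE FUNCTION udf.example_pos_of_max(list ARRAY<INT64>) AS (
  udf.example_pos_of_max_zero_indexed(list) + 1
);
```
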
### Backfills
- Should be avoided on large tables
  - Backfills may double storage cost for a table for 90 days by moving
    data from long-term storage to short-term storage
    - For example regenerating `clients_last_seen_v1` from scratch would cost
      about $1600 for the query and about $6800 for data moved to short-term
      storage
  - Should combine multiple backfills happening around the same time
  - Should delay column deletes until the next other backfill
    - Should use `NULL` for new data and `EXCEPT` to exclude from views until
      dropped
- Should use copy operations in append mode to change column order
  - Copy operations do not allow changing partitioning, changing clustering, or
    column deletes
- Should split backfilling into queries that finish in minutes not hours
- May use [script/generate_incremental_table] to automate backfilling incremental
queries
- May be performed in a single query for smaller tables that do not depend on history
  - A useful pattern is to have the only reference to `@submission_date` be a
    clause `WHERE (@submission_date IS NULL OR @submission_date = submission_date)`,
    which allows recreating all dates by passing `--parameter=submission_date:DATE:NULL`;
    see the sketch after this list
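
For the single-query pattern above, the `@submission_date` guard might look like the
following sketch (the source table name is hypothetical):

```sql
SELECT
  submission_date,
  client_id,
  COUNT(*) AS n_rows
FROM
  telemetry_derived.example_daily_v1
WHERE
  -- Passing --parameter=submission_date:DATE:NULL recreates all dates,
  -- while passing a concrete date backfills a single partition.
  (@submission_date IS NULL OR @submission_date = submission_date)
GROUP BY
  submission_date,
  client_id
```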

Incremental Queries
---
### Benefits
- BigQuery billing discounts for destination table partitions not modified in
the last 90 days
- May use [dags.utils.gcp.bigquery_etl_query] to simplify Airflow configuration,
  e.g. see [dags.main_summary.exact_mau28_by_dimensions]
- May use [script/generate_incremental_table] to automate backfilling
- Should use `WRITE_TRUNCATE` mode or `bq query --replace` to replace
partitions atomically to prevent duplicate data
- Will have tooling to generate an optimized _mostly materialized view_ that
only calculates the most recent partition
### Properties
- Must accept a date via `@submission_date` query parameter
- Must output a column named `submission_date` matching the query parameter
- Must produce similar results when run multiple times
- Should produce identical results when run multiple times
- May depend on the previous partition (see the sketch after this list)
  - If using the previous partition, must include an `init.sql` query to initialize the
    table, e.g. `sql/telemetry_derived/clients_last_seen_v1/init.sql`
- Should be impacted by values from a finite number of preceding partitions
  - This allows for backfilling in chunks instead of serially for all time
    and limiting backfills to a certain number of days following updated data
  - For example `sql/clients_last_seen_v1.sql` can be run serially on any 28 day
    period and the last day will be the same whether or not the partition
    preceding the first day was missing because values are only impacted by
    27 preceding days
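
The following sketch shows the shape of an incremental query that depends on its previous
partition, in the spirit of `clients_last_seen` but with hypothetical, simplified table and
column names:

```sql
-- Hypothetical sql/telemetry_derived/example_last_seen_v1/query.sql;
-- an accompanying init.sql would create and seed the table.
WITH _current AS (
  SELECT
    client_id,
    -- The lowest bit records that the client was seen on @submission_date.
    1 AS days_seen_bits
  FROM
    telemetry_derived.example_daily_v1
  WHERE
    submission_date = @submission_date
),
_previous AS (
  SELECT
    client_id,
    days_seen_bits
  FROM
    telemetry_derived.example_last_seen_v1
  WHERE
    submission_date = DATE_SUB(@submission_date, INTERVAL 1 DAY)
)
SELECT
  @submission_date AS submission_date,
  client_id,
  -- Shift yesterday's bits up one day, drop anything older than 28 days,
  -- then set today's bit; only the 27 preceding partitions can affect the result.
  (IFNULL(_previous.days_seen_bits, 0) << 1 & 0x0FFFFFFF)
    | IFNULL(_current.days_seen_bits, 0) AS days_seen_bits
FROM
  _current
FULL JOIN
  _previous
USING
  (client_id)
```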

Query Metadata
---
- For each query, a `metadata.yaml` file should be created in the same directory
- This file contains a description, owners and labels. As an example:
```yaml
friendly_name: SSL Ratios
description: >
  Percentages of page loads Firefox users have performed that were
  conducted over SSL broken down by country.
owners:
  - example@mozilla.com
labels:
  application: firefox
  incremental: true # incremental queries add data to existing tables
  schedule: daily # scheduled in Airflow to run daily
  public_json: true
  public_bigquery: true
  review_bug: 1414839 # Bugzilla bug ID of data review
  incremental_export: false # non-incremental JSON export writes all data to a single location
```
### Publishing Datasets
- To make query results publicly available, the `public_bigquery` flag must be set in
  `metadata.yaml`
  - Tables will get published in the `mozilla-public-data` GCP project, which is accessible
    to everyone, including external users
- To make query results publicly available as JSON, the `public_json` flag must be set in
  `metadata.yaml`
  - Data will be accessible under https://public-data.telemetry.mozilla.org
    - A list of all available datasets is published under https://public-data.telemetry.mozilla.org/all-datasets.json
    - For example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/000000000000.json
  - Output JSON files have a maximum size of 1 GB; data can be split up into multiple files (`000000000000.json`, `000000000001.json`, ...)
  - `incremental_export` controls how data should be exported as JSON:
    - `false`: all data of the source table gets exported to a single location
      - https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/000000000000.json
    - `true`: only data that matches the `submission_date` parameter is exported as JSON to a separate directory for this date
      - https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/2020-03-15/000000000000.json
  - For each dataset, a `metadata.json` gets published listing all available files, for example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/metadata.json
  - The timestamp when the dataset was last updated is recorded in `last_updated`, e.g.: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/last_updated

Scheduling Queries in Airflow
---
Instructions for scheduling queries in Airflow can be found in this
[cookbook](https://docs.telemetry.mozilla.org/cookbooks/bigquery-airflow.html).

Contributing
---
When adding or modifying a query in this repository, make your changes in the
`sql/` directory.
When adding a new library to the Python requirements, first add the library to
the requirements and then add any meta-dependencies into constraints.
Constraints are discovered by installing requirements into a fresh virtual
environment. A dependency should be added to either `requirements.txt` or
`constraints.txt`, but not both.
```bash
# Create and activate a python virtual environment.
python3 -m venv venv/
source venv/bin/activate
# If not installed:
pip install pip-tools
# Add the dependency to requirements.in e.g. Jinja2.
echo Jinja2==2.11.1 >> requirements.in
# Compile hashes for new dependencies.
pip-compile --generate-hashes requirements.in
# Deactivate the python virtual environment.
deactivate
```
Tests
---
[See the documentation in tests/](tests/README.md)

[script/generate_incremental_table]: https://github.com/mozilla/bigquery-etl/blob/master/script/generate_incremental_table
[expression subqueries]: https://cloud.google.com/bigquery/docs/reference/standard-sql/expression_subqueries
[dags.utils.gcp.bigquery_etl_query]: https://github.com/mozilla/telemetry-airflow/blob/89a6dc3/dags/utils/gcp.py#L364
[dags.main_summary.exact_mau28_by_dimensions]: https://github.com/mozilla/telemetry-airflow/blob/89a6dc3/dags/main_summary.py#L385-L390
[incremental]: #incremental-queries
[spark-bigquery-connector]: https://github.com/GoogleCloudPlatform/spark-bigquery-connector/issues/5
[reserved keywords]: https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#reserved-keywords
[mozilla-pipeline-schemas]: https://github.com/mozilla-services/mozilla-pipeline-schemas