Big docs/README refactoring (#2527)
* Move most reference and cookbook documentation in README.md into the "reference" and "cookbook" sections of the generated documentation, respectively.
* Try to steer people to the generated docs inside the README.md (since it is now basically just a set of quickstart instructions).
* Provide a bit of guidance that this repository isn't great for 3rd party contributors in a new CONTRIBUTING.md.

Co-authored-by: Jeff Klukas <jklukas@mozilla.com>
This commit is contained in:
Parent
c434e75494
Commit
1d6fea8f2d
|
@@ -0,0 +1,10 @@
|
|||
# Contributing to bigquery-etl
|
||||
|
||||
Thank you for your interest in contributing to bigquery-etl! Although the code in this repository is licensed under the MPL, working on this repository effectively requires access to Mozilla's BigQuery data infrastructure, which is reserved for Mozilla employees and designated contributors. For more information, see the sections on [gaining access] and [BigQuery Access Request] on [docs.telemetry.mozilla.org].
|
||||
|
||||
More information on working with this repository can be found in the README.md file (at the root of this repository) and in the [repository documentation].
|
||||
|
||||
[gaining access]: https://docs.telemetry.mozilla.org/concepts/gaining_access.html
|
||||
[BigQuery Access Request]: https://docs.telemetry.mozilla.org/cookbooks/bigquery/access.html#bigquery-access-request
|
||||
[docs.telemetry.mozilla.org]: https://docs.telemetry.mozilla.org
|
||||
[repository documentation]: https://mozilla.github.io/bigquery-etl/
|
420
README.md
|
@@ -2,12 +2,14 @@
|
|||
|
||||
# BigQuery ETL
|
||||
|
||||
This repository contains Mozilla Data Team's
|
||||
This repository contains Mozilla Data Team's:
|
||||
|
||||
- Derived ETL jobs that do not require a custom container
|
||||
- User-defined functions (UDFs)
|
||||
- Airflow DAGs for scheduled bigquery-etl queries
|
||||
- Tools for query & UDF deployment, management and scheduling
|
||||
|
||||
For more information, see [https://mozilla.github.io/bigquery-etl/](https://mozilla.github.io/bigquery-etl/)
|
||||
|
||||
## Quick Start
|
||||
|
||||
|
@@ -26,7 +28,7 @@ This repository contains Mozilla Data Team's
|
|||
- **For Mozilla Employees or Contributors (not in Data Engineering)** - Set up GCP command line tools, [as described on docs.telemetry.mozilla.org](https://docs.telemetry.mozilla.org/cookbooks/bigquery/access.html#using-the-bq-command-line-tool). Note that some functionality (e.g. writing UDFs or backfilling queries) may not be allowed.
|
||||
- **For Data Engineering** - In addition to setting up the command line tools, you will want to log in to `shared-prod` if making changes to production systems. Run `gcloud auth login --update-adc --project=moz-fx-data-shared-prod` (if you have not run it previously).
|
||||
|
||||
### Installing bqetl library
|
||||
### Installing bqetl
|
||||
|
||||
1. Clone the repository
|
||||
```bash
|
||||
|
@@ -55,416 +57,4 @@ Finally, if you are using Visual Studio Code, you may also wish to use our recom
|
|||
cp .vscode/settings.json.default .vscode/settings.json
|
||||
```
|
||||
|
||||
And you should now be set up to start working in the repo! For many tasks, the easiest way to do this is to use `bqetl`, which is described below.
|
||||
|
||||
---
|
||||
|
||||
|
||||
# The `bqetl` CLI
|
||||
|
||||
The `bqetl` command-line tool aims to simplify working with the bigquery-etl repository by supporting
|
||||
common workflows, such as creating, validating and scheduling queries or adding new UDFs.
|
||||
|
||||
|
||||
## Usage
|
||||
|
||||
The CLI organizes commands into groups:
|
||||
|
||||
```
|
||||
$ ./bqetl --help
|
||||
Commands:
|
||||
dag Commands for managing DAGs.
|
||||
dryrun Dry run SQL.
|
||||
format Format SQL.
|
||||
mozfun Commands for managing mozfun UDFs.
|
||||
query Commands for managing queries.
|
||||
udf Commands for managing UDFs.
|
||||
...
|
||||
```
|
||||
|
||||
To get information about commands and available options, simply append the `--help` flag:
|
||||
|
||||
```
|
||||
$ ./bqetl query create --help
|
||||
Usage: bqetl query create [OPTIONS] NAME
|
||||
|
||||
Create a new query with name <dataset>.<query_name>, for example:
|
||||
telemetry_derived.asn_aggregates
|
||||
|
||||
Options:
|
||||
-p, --path DIRECTORY Path to directory in which query should be created
|
||||
-o, --owner TEXT Owner of the query (email address)
|
||||
-i, --init Create an init.sql file to initialize the table
|
||||
--help Show this message and exit.
|
||||
```
|
||||
|
||||
Documentation of all `bqetl` commands including usage examples can be found in the [bigquery-etl docs](https://github.com/mozilla/bigquery-etl#the-bqetl-cli).
|
||||
|
||||
Running some commands, for example to create or query tables, will [require access to Mozilla's GCP Account](https://docs.telemetry.mozilla.org/cookbooks/bigquery/access.html#bigquery-access-request).
|
||||
|
||||
## Formatting SQL
|
||||
|
||||
We enforce consistent SQL formatting as part of CI. After adding or changing a
|
||||
query, use `./bqetl format` to apply formatting rules.
|
||||
|
||||
Directories and files passed as arguments to `./bqetl format` will be
|
||||
formatted in place, with directories recursively searched for files with a
|
||||
`.sql` extension, e.g.:
|
||||
|
||||
```bash
|
||||
$ echo 'SELECT 1,2,3' > test.sql
|
||||
$ ./bqetl format test.sql
|
||||
modified test.sql
|
||||
1 file(s) modified
|
||||
$ cat test.sql
|
||||
SELECT
|
||||
1,
|
||||
2,
|
||||
3
|
||||
```
|
||||
|
||||
If no arguments are specified the script will read from stdin and write to
|
||||
stdout, e.g.:
|
||||
|
||||
```bash
|
||||
$ echo 'SELECT 1,2,3' | ./bqetl format
|
||||
SELECT
|
||||
1,
|
||||
2,
|
||||
3
|
||||
```
|
||||
|
||||
To turn off sql formatting for a block of SQL, wrap it in `format:off` and
|
||||
`format:on` comments, like this:
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
-- format:off
|
||||
submission_date, sample_id, client_id
|
||||
-- format:on
|
||||
```
|
||||
|
||||
Recommended practices
|
||||
---
|
||||
|
||||
### Queries
|
||||
|
||||
- Should be defined in files named as `sql/<project>/<dataset>/<table>_<version>/query.sql` e.g.
|
||||
`sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v7/query.sql`
|
||||
- `<project>` defines both where the destination table resides and in which project the query job runs
|
||||
- Queries that populate tables should always be named with a version suffix;
|
||||
we assume that future optimizations to the data representation may require
|
||||
schema-incompatible changes such as dropping columns
|
||||
- May be generated using a python script that prints the query to stdout
|
||||
- Should save output as `sql/<project>/<dataset>/<table>_<version>/query.sql` as above
|
||||
- Should be named as `sql/<project>/query_type.sql.py` e.g. `sql/moz-fx-data-shared-prod/clients_daily.sql.py`
|
||||
- May use options to generate queries for different destination tables e.g.
|
||||
using `--source telemetry_core_parquet_v3` to generate
|
||||
`sql/moz-fx-data-shared-prod/telemetry/core_clients_daily_v1/query.sql` and using `--source main_summary_v4` to
|
||||
generate `sql/moz-fx-data-shared-prod/telemetry/clients_daily_v7/query.sql`
|
||||
- Should output a header indicating options used e.g.
|
||||
```sql
|
||||
-- Query generated by: sql/moz-fx-data-shared-prod/clients_daily.sql.py --source telemetry_core_parquet
|
||||
```
|
||||
- Should not specify a project or dataset in table names to simplify testing
|
||||
- Should be [incremental]
|
||||
- Should filter input tables on partition and clustering columns
|
||||
- Should use `_` prefix in generated column names not meant for output
|
||||
- Should use `_bits` suffix for any integer column that represents a bit pattern
|
||||
- Should not use `DATETIME` type, due to incompatibility with
|
||||
[spark-bigquery-connector]
|
||||
- Should read from `*_stable` tables instead of including custom deduplication
|
||||
- Should use the earliest row for each `document_id` by `submission_timestamp`
|
||||
where filtering duplicates is necessary
|
||||
- Should escape identifiers that match keywords, even if they aren't [reserved keywords]
|
||||
|
||||
### Views
|
||||
|
||||
- Should be defined in files named as `sql/<project>/<dataset>/<table>/view.sql` e.g.
|
||||
`sql/moz-fx-data-shared-prod/telemetry/core/view.sql`
|
||||
- Views should generally _not_ be named with a version suffix; a view represents a
|
||||
stable interface for users and whenever possible should maintain compatibility
|
||||
with existing queries; if the view logic cannot be adapted to changes in underlying
|
||||
tables, breaking changes must be communicated to `fx-data-dev@mozilla.org`
|
||||
- Must specify project and dataset in all table names
|
||||
- Should default to using the `moz-fx-data-shared-prod` project;
|
||||
the `scripts/publish_views` tooling can handle parsing the definitions to publish
|
||||
to other projects such as `derived-datasets`
|
||||
|
||||
### UDFs
|
||||
|
||||
- Should limit the number of [expression subqueries] to avoid: `BigQuery error
|
||||
in query operation: Resources exceeded during query execution: Not enough
|
||||
resources for query planning - too many subqueries or query is too complex.`
|
||||
- Should be used to avoid code duplication
|
||||
- Must be named in files with lower snake case names ending in `.sql`
|
||||
e.g. `mode_last.sql`
|
||||
- Each file must only define effectively private helper functions and one
|
||||
public function which must be defined last
|
||||
- Helper functions must not conflict with function names in other files
|
||||
- SQL UDFs must be defined in the `udf/` directory and JS UDFs must be defined
|
||||
in the `udf_js` directory
|
||||
- The `udf_legacy/` directory is an exception which must only contain
|
||||
compatibility functions for queries migrated from Athena/Presto.
|
||||
- Functions must be defined as [persistent UDFs](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions#temporary-udf-syntax)
|
||||
using `CREATE OR REPLACE FUNCTION` syntax
|
||||
- Function names must be prefixed with a dataset of `<dir_name>.` so, for example,
|
||||
all functions in `udf/*.sql` are part of the `udf` dataset
|
||||
- The final syntax for creating a function in a file will look like
|
||||
`CREATE OR REPLACE FUNCTION <dir_name>.<file_name>`
|
||||
- We provide tooling in `scripts/publish_persistent_udfs` for
|
||||
publishing these UDFs to BigQuery
|
||||
- Changes made to UDFs need to be published manually in order for the
|
||||
dry run CI task to pass
|
||||
- Should use `SQL` over `js` for performance
|
||||
|
||||
### Backfills
|
||||
|
||||
- Should be avoided on large tables
|
||||
- Backfills may double storage cost for a table for 90 days by moving
|
||||
data from long-term storage to short-term storage
|
||||
- For example regenerating `clients_last_seen_v1` from scratch would cost
|
||||
about $1600 for the query and about $6800 for data moved to short-term
|
||||
storage
|
||||
- Should combine multiple backfills happening around the same time
|
||||
- Should delay column deletes until the next other backfill
|
||||
- Should use `NULL` for new data and `EXCEPT` to exclude from views until
|
||||
dropped
|
||||
- Should use copy operations in append mode to change column order
|
||||
- Copy operations do not allow changing partitioning, changing clustering, or
|
||||
column deletes
|
||||
- Should split backfilling into queries that finish in minutes not hours
|
||||
- May use [script/generate_incremental_table] to automate backfilling incremental
|
||||
queries
|
||||
- May be performed in a single query for smaller tables that do not depend on history
|
||||
- A useful pattern is to have the only reference to `@submission_date` be a
|
||||
clause `WHERE (@submission_date IS NULL OR @submission_date = submission_date)`
|
||||
which allows recreating all dates by passing `--parameter=submission_date:DATE:NULL`
|
||||
|
||||
Incremental Queries
|
||||
---
|
||||
|
||||
### Benefits
|
||||
|
||||
- BigQuery billing discounts for destination table partitions not modified in
|
||||
the last 90 days
|
||||
- May use [dags.utils.gcp.bigquery_etl_query] to simplify airflow configuration
|
||||
e.g. see [dags.main_summary.exact_mau28_by_dimensions]
|
||||
- May use [script/generate_incremental_table] to automate backfilling
|
||||
- Should use `WRITE_TRUNCATE` mode or `bq query --replace` to replace
|
||||
partitions atomically to prevent duplicate data
|
||||
- Will have tooling to generate an optimized _mostly materialized view_ that
|
||||
only calculates the most recent partition
|
||||
|
||||
### Properties
|
||||
|
||||
- Must accept a date via `@submission_date` query parameter
|
||||
- Must output a column named `submission_date` matching the query parameter
|
||||
- Must produce similar results when run multiple times
|
||||
- Should produce identical results when run multiple times
|
||||
- May depend on the previous partition
|
||||
- If using previous partition, must include an `init.sql` query to initialize the
|
||||
table, e.g. `sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_v1/init.sql`
|
||||
- Should be impacted by values from a finite number of preceding partitions
|
||||
- This allows for backfilling in chunks instead of serially for all time
|
||||
and limiting backfills to a certain number of days following updated data
|
||||
- For example `sql/moz-fx-data-shared-prod/clients_last_seen_v1.sql` can be run serially on any 28 day
|
||||
period and the last day will be the same whether or not the partition
|
||||
preceding the first day was missing because values are only impacted by
|
||||
27 preceding days
|
||||
|
||||
Query Metadata
|
||||
---
|
||||
|
||||
- For each query, a `metadata.yaml` file should be created in the same directory
|
||||
- This file contains a description, owners and labels. As an example:
|
||||
|
||||
```yaml
|
||||
friendly_name: SSL Ratios
|
||||
description: >
|
||||
Percentages of page loads Firefox users have performed that were
|
||||
conducted over SSL broken down by country.
|
||||
owners:
|
||||
- example@mozilla.com
|
||||
labels:
|
||||
application: firefox
|
||||
incremental: true # incremental queries add data to existing tables
|
||||
schedule: daily # scheduled in Airflow to run daily
|
||||
public_json: true
|
||||
public_bigquery: true
|
||||
review_bugs:
|
||||
- 1414839 # Bugzilla bug ID of data review
|
||||
incremental_export: false # non-incremental JSON export writes all data to a single location
|
||||
```
|
||||
|
||||
### Publishing a Table Publicly
|
||||
|
||||
For background, see [Accessing Public Data](https://docs.telemetry.mozilla.org/cookbooks/public_data.html)
|
||||
on `docs.telemetry.mozilla.org`.
|
||||
|
||||
- To make query results publicly available, the `public_bigquery` flag must be set in
|
||||
`metadata.yaml`
|
||||
- Tables will get published in the `mozilla-public-data` GCP project which is accessible
|
||||
to everyone, including external users
|
||||
- To make query results publicly available as JSON, the `public_json` flag must be set in
|
||||
`metadata.yaml`
|
||||
- Data will be accessible under https://public-data.telemetry.mozilla.org
|
||||
- A list of all available datasets is published under https://public-data.telemetry.mozilla.org/all-datasets.json
|
||||
- For example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/000000000000.json
|
||||
- Output JSON files have a maximum size of 1GB; data can be split up into multiple files (`000000000000.json`, `000000000001.json`, ...)
|
||||
- `incremental_export` controls how data should be exported as JSON:
|
||||
- `false`: all data of the source table gets exported to a single location
|
||||
- https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/000000000000.json
|
||||
- `true`: only data that matches the `submission_date` parameter is exported as JSON to a separate directory for this date
|
||||
- https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/2020-03-15/000000000000.json
|
||||
- For each dataset, a `metadata.json` gets published listing all available files, for example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/metadata.json
|
||||
- The timestamp when the dataset was last updated is recorded in `last_updated`, e.g.: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/last_updated
|
||||
|
||||
Dataset Metadata
|
||||
---
|
||||
|
||||
To provision a new BigQuery dataset for holding tables, you'll need to
|
||||
create a `dataset_metadata.yaml` which will cause the dataset to be
|
||||
automatically deployed a few hours after merging. Changes to existing
|
||||
datasets (such as changing access policies) may trigger manual operator approval.
|
||||
|
||||
The `bqetl query create` command will automatically generate a skeleton
|
||||
`dataset_metadata.yaml` file if the query name contains a dataset that
|
||||
is not yet defined.
|
||||
|
||||
See example with commentary for `telemetry_derived`:
|
||||
|
||||
```yaml
|
||||
friendly_name: Telemetry Derived
|
||||
description: |-
|
||||
Derived data based on pings from legacy Firefox telemetry, plus many other
|
||||
general-purpose derived tables
|
||||
labels: {}
|
||||
|
||||
# Base ACL can be:
|
||||
# "derived" for `_derived` datasets that contain concrete tables
|
||||
# "view" for user-facing datasets containing virtual views
|
||||
dataset_base_acl: derived
|
||||
|
||||
# Datasets with user-facing set to true will be created both in shared-prod
|
||||
# and in mozdata; this should be false for all `_derived` datasets
|
||||
user_facing: false
|
||||
|
||||
# Most datasets can have mozilla-confidential access like below,
|
||||
# but some datasets will be defined with more restricted access
|
||||
# or with additional access for services.
|
||||
workgroup_access:
|
||||
- role: roles/bigquery.dataViewer
|
||||
members:
|
||||
- workgroup:mozilla-confidential
|
||||
```
|
||||
|
||||
Scheduling Queries in Airflow
|
||||
---
|
||||
|
||||
- bigquery-etl has tooling to automatically generate Airflow DAGs for scheduling queries
|
||||
- To be scheduled, a query must be assigned to a DAG that is specified in `dags.yaml`
|
||||
- New DAGs can be configured in `dags.yaml`, e.g., by adding the following:
|
||||
```yaml
|
||||
bqetl_ssl_ratios: # name of the DAG; must start with bqetl_
|
||||
schedule_interval: 0 2 * * * # query schedule
|
||||
description: The DAG schedules SSL ratios queries.
|
||||
default_args:
|
||||
owner: example@mozilla.com
|
||||
start_date: '2020-04-05' # YYYY-MM-DD
|
||||
email: ['example@mozilla.com']
|
||||
retries: 2 # number of retries if the query execution fails
|
||||
retry_delay: 30m
|
||||
```
|
||||
- All DAG names need to have `bqetl_` as a prefix.
|
||||
- `schedule_interval` is either defined as a [CRON expression](https://en.wikipedia.org/wiki/Cron) or alternatively as one of the following [CRON presets](https://airflow.readthedocs.io/en/latest/dag-run.html): `once`, `hourly`, `daily`, `weekly`, `monthly`
|
||||
- `start_date` defines the first date for which the query should be executed
|
||||
- Airflow will not automatically backfill older dates if `start_date` is set in the past; backfilling can be done via the Airflow web interface
|
||||
- `email` lists the email addresses that alerts should be sent to if the query fails
|
||||
- Alternatively, new DAGs can also be created via the `bqetl` CLI by running `bqetl dag create bqetl_ssl_ratios --schedule_interval='0 2 * * *' --owner="example@mozilla.com" --start_date="2020-04-05" --description="This DAG generates SSL ratios."`
|
||||
- To schedule a specific query, add a `metadata.yaml` file that includes a `scheduling` section, for example:
|
||||
```yaml
|
||||
friendly_name: SSL ratios
|
||||
# ... more metadata, see Query Metadata section above
|
||||
scheduling:
|
||||
dag_name: bqetl_ssl_ratios
|
||||
```
|
||||
- Additional scheduling options:
|
||||
- `depends_on_past` keeps the query from getting executed if the previous schedule for the query hasn't succeeded
|
||||
- `date_partition_parameter` - by default set to `submission_date`; can be set to `null` if query doesn't write to a partitioned table
|
||||
- `parameters` specifies a list of query parameters, e.g. `["n_clients:INT64:500"]`
|
||||
- `arguments` - a list of arguments passed when running the query, for example: `["--append_table"]`
|
||||
- `referenced_tables` - manually curated list of tables the query depends on; used to speed up the DAG generation process or to specify tables that the dry run doesn't have permissions to access, e.g. `[['telemetry_stable', 'main_v4']]`
|
||||
- `multipart` indicates whether a query is split over multiple files `part1.sql`, `part2.sql`, ...
|
||||
- `depends_on` defines external dependencies in telemetry-airflow that are not detected automatically:
|
||||
```yaml
|
||||
depends_on:
|
||||
- task_id: external_task
|
||||
dag_name: external_dag
|
||||
execution_delta: 1h
|
||||
```
|
||||
- `task_id`: name of task query depends on
|
||||
- `dag_name`: name of the DAG the external task is part of
|
||||
- `execution_delta`: time difference between the `schedule_intervals` of the external DAG and the DAG the query is part of
|
||||
- `destination_table`: The table to write to. If unspecified, defaults to the query destination; if None, no destination table is used (the query is simply run as-is). Note that if no destination table is specified, you will need to specify the `submission_date` parameter manually.
|
||||
- Queries can also be scheduled using the `bqetl` CLI: `./bqetl query schedule path/to/query_v1 --dag bqetl_ssl_ratios`
|
||||
- To generate all Airflow DAGs run `./script/generate_airflow_dags` or `./bqetl dag generate`
|
||||
- Generated DAGs are located in the `dags/` directory
|
||||
- Dependencies between queries scheduled in bigquery-etl and dependencies to stable tables are detected automatically
|
||||
- Specific DAGs can be generated by running `./bqetl dag generate bqetl_ssl_ratios`
|
||||
- Generated DAGs will be automatically detected and scheduled by Airflow
|
||||
- It might take up to 10 minutes for new DAGs and updates to show up in the Airflow UI
|
||||
|
||||
Contributing
|
||||
---
|
||||
|
||||
When adding or modifying a query in this repository, make your changes in the `sql/` directory.
|
||||
|
||||
When adding a new library to the Python requirements, first add the library to
|
||||
the requirements and then add any meta-dependencies into constraints.
|
||||
Constraints are discovered by installing requirements into a fresh virtual
|
||||
environment. A dependency should be added to either `requirements.txt` or
|
||||
`constraints.txt`, but not both.
|
||||
|
||||
```bash
|
||||
# Create a python virtual environment (not necessary if you have already
|
||||
# run `./bqetl bootstrap`)
|
||||
python3 -m venv venv/
|
||||
|
||||
# Activate the virtual environment
|
||||
source venv/bin/activate
|
||||
|
||||
# If not installed:
|
||||
pip install pip-tools
|
||||
|
||||
# Add the dependency to requirements.in e.g. Jinja2.
|
||||
echo Jinja2==2.11.1 >> requirements.in
|
||||
|
||||
# Compile hashes for new dependencies.
|
||||
pip-compile --generate-hashes requirements.in
|
||||
|
||||
# Deactivate the python virtual environment.
|
||||
deactivate
|
||||
```
|
||||
|
||||
When opening a pull request to merge a fork, the `manual-trigger-required-for-fork` CI task will
|
||||
fail and some integration test tasks will be skipped. A user with repository write permissions
|
||||
will have to run the [Push to upstream workflow](https://github.com/mozilla/bigquery-etl/actions/workflows/push-to-upstream.yml)
|
||||
and provide the `<username>:<branch>` of the fork as a parameter. The parameter will also show up
|
||||
in the logs of the `manual-trigger-required-for-fork` CI task together with more detailed instructions.
|
||||
Once the workflow has been executed, the PR's CI tasks, including the integration tests, will be
|
||||
executed.
|
||||
|
||||
Tests
|
||||
---
|
||||
|
||||
[See the documentation in tests/](tests/README.md)
|
||||
|
||||
[script/generate_incremental_table]: https://github.com/mozilla/bigquery-etl/blob/main/script/generate_incremental_table
|
||||
[expression subqueries]: https://cloud.google.com/bigquery/docs/reference/standard-sql/expression_subqueries
|
||||
[dags.utils.gcp.bigquery_etl_query]: https://github.com/mozilla/telemetry-airflow/blob/89a6dc3/dags/utils/gcp.py#L364
|
||||
[dags.main_summary.exact_mau28_by_dimensions]: https://github.com/mozilla/telemetry-airflow/blob/89a6dc3/dags/main_summary.py#L385-L390
|
||||
[incremental]: #incremental-queries
|
||||
[spark-bigquery-connector]: https://github.com/GoogleCloudPlatform/spark-bigquery-connector/issues/5
|
||||
[reserved keywords]: https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#reserved-keywords
|
||||
[mozilla-pipeline-schemas]: https://github.com/mozilla-services/mozilla-pipeline-schemas
|
||||
And you should now be set up to start working in the repo! For many tasks, the easiest way to do this is to use [`bqetl`](https://mozilla.github.io/bigquery-etl/bqetl/). You may also want to read up on [common workflows](https://mozilla.github.io/bigquery-etl/cookbooks/common_workflows/).
|
||||
|
|
|
@@ -2,6 +2,8 @@
|
|||
|
||||
The `bqetl` command-line tool aims to simplify working with the bigquery-etl repository by supporting common workflows, such as creating, validating and scheduling queries or adding new UDFs.
|
||||
|
||||
Running some commands, for example to create or query tables, will [require Mozilla GCP access](https://docs.telemetry.mozilla.org/cookbooks/bigquery/access.html#bigquery-access-request).
|
||||
|
||||
## Installation
|
||||
|
||||
Follow the [Quick Start](https://github.com/mozilla/bigquery-etl#quick-start) to set up bigquery-etl and the bqetl CLI.
|
||||
|
|
|
@@ -44,6 +44,48 @@ The [Creating derived datasets tutorial](https://mozilla.github.io/bigquery-etl/
|
|||
1. Deploy schema changes by running `./bqetl query schema deploy <dataset>.<table>_<version>`
|
||||
1. Merge the pull request
|
||||
|
||||
## Formatting SQL
|
||||
|
||||
We enforce consistent SQL formatting as part of CI. After adding or changing a
|
||||
query, use `./bqetl format` to apply formatting rules.
|
||||
|
||||
Directories and files passed as arguments to `./bqetl format` will be
|
||||
formatted in place, with directories recursively searched for files with a
|
||||
`.sql` extension, e.g.:
|
||||
|
||||
```bash
|
||||
$ echo 'SELECT 1,2,3' > test.sql
|
||||
$ ./bqetl format test.sql
|
||||
modified test.sql
|
||||
1 file(s) modified
|
||||
$ cat test.sql
|
||||
SELECT
|
||||
1,
|
||||
2,
|
||||
3
|
||||
```
|
||||
|
||||
If no arguments are specified the script will read from stdin and write to
|
||||
stdout, e.g.:
|
||||
|
||||
```bash
|
||||
$ echo 'SELECT 1,2,3' | ./bqetl format
|
||||
SELECT
|
||||
1,
|
||||
2,
|
||||
3
|
||||
```
|
||||
|
||||
To turn off sql formatting for a block of SQL, wrap it in `format:off` and
|
||||
`format:on` comments, like this:
|
||||
|
||||
```sql
|
||||
SELECT
|
||||
-- format:off
|
||||
submission_date, sample_id, client_id
|
||||
-- format:on
|
||||
```
|
||||
|
||||
## Add a new field to clients_daily
|
||||
|
||||
Adding a new field to `clients_daily` also means that field has to propagate to several
|
||||
|
@@ -112,8 +154,48 @@ The same steps as creating a new UDF apply for creating stored procedures, excep
|
|||
1. Open a PR
|
||||
1. PR gets reviewed, approved and merged
|
||||
|
||||
## Creating a new BigQuery Dataset
|
||||
|
||||
To provision a new BigQuery dataset for holding tables, you'll need to
|
||||
create a `dataset_metadata.yaml` which will cause the dataset to be
|
||||
automatically deployed a few hours after merging. Changes to existing
|
||||
datasets (such as changing access policies) may trigger manual operator approval.
|
||||
|
||||
The `bqetl query create` command will automatically generate a skeleton
|
||||
`dataset_metadata.yaml` file if the query name contains a dataset that
|
||||
is not yet defined.
|
||||
|
||||
See example with commentary for `telemetry_derived`:
|
||||
|
||||
```yaml
|
||||
friendly_name: Telemetry Derived
|
||||
description: |-
|
||||
Derived data based on pings from legacy Firefox telemetry, plus many other
|
||||
general-purpose derived tables
|
||||
labels: {}
|
||||
|
||||
# Base ACL can be:
|
||||
# "derived" for `_derived` datasets that contain concrete tables
|
||||
# "view" for user-facing datasets containing virtual views
|
||||
dataset_base_acl: derived
|
||||
|
||||
# Datasets with user-facing set to true will be created both in shared-prod
|
||||
# and in mozdata; this should be false for all `_derived` datasets
|
||||
user_facing: false
|
||||
|
||||
# Most datasets can have mozilla-confidential access like below,
|
||||
# but some datasets will be defined with more restricted access
|
||||
# or with additional access for services.
|
||||
workgroup_access:
|
||||
- role: roles/bigquery.dataViewer
|
||||
members:
|
||||
- workgroup:mozilla-confidential
|
||||
```
|
||||
|
||||
## Publishing data
|
||||
|
||||
See also the reference for [Public Data](../reference/public_data.md).
|
||||
|
||||
1. Get a data review by following the [data publishing process](https://wiki.mozilla.org/Data_Publishing#Dataset_Publishing_Process_2)
|
||||
1. Update the `metadata.yaml` file of the query to be published
|
||||
* Set `public_bigquery: true` and optionally `public_json: true`
|
||||
|
@@ -124,3 +206,42 @@ The same steps as creating a new UDF apply for creating stored procedures, excep
|
|||
1. Open a PR
|
||||
1. PR gets reviewed, approved and merged
|
||||
* Once the ETL is running, a view will get automatically published to `moz-fx-data-shared-prod` referencing the public dataset
|
||||
|
||||
## Adding new Python requirements
|
||||
|
||||
When adding a new library to the Python requirements, first add the library to
|
||||
the requirements and then add any meta-dependencies into constraints.
|
||||
Constraints are discovered by installing requirements into a fresh virtual
|
||||
environment. A dependency should be added to either `requirements.txt` or
|
||||
`constraints.txt`, but not both.
|
||||
|
||||
```bash
|
||||
# Create a python virtual environment (not necessary if you have already
|
||||
# run `./bqetl bootstrap`)
|
||||
python3 -m venv venv/
|
||||
|
||||
# Activate the virtual environment
|
||||
source venv/bin/activate
|
||||
|
||||
# If not installed:
|
||||
pip install pip-tools
|
||||
|
||||
# Add the dependency to requirements.in e.g. Jinja2.
|
||||
echo Jinja2==2.11.1 >> requirements.in
|
||||
|
||||
# Compile hashes for new dependencies.
|
||||
pip-compile --generate-hashes requirements.in
|
||||
|
||||
# Deactivate the python virtual environment.
|
||||
deactivate
|
||||
```
|
||||
|
||||
## Making a pull request from a fork
|
||||
|
||||
When opening a pull request to merge a fork, the `manual-trigger-required-for-fork` CI task will
|
||||
fail and some integration test tasks will be skipped. A user with repository write permissions
|
||||
will have to run the [Push to upstream workflow](https://github.com/mozilla/bigquery-etl/actions/workflows/push-to-upstream.yml)
|
||||
and provide the `<username>:<branch>` of the fork as a parameter. The parameter will also show up
|
||||
in the logs of the `manual-trigger-required-for-fork` CI task together with more detailed instructions.
|
||||
Once the workflow has been executed, the PR's CI tasks, including the integration tests, will be
|
||||
executed.
|
||||
|
|
|
@@ -0,0 +1,166 @@
|
|||
# How to Run Tests
|
||||
|
||||
This repository uses `pytest`:
|
||||
|
||||
```
|
||||
# create a venv
|
||||
python3.8 -m venv venv/
|
||||
|
||||
# install pip-tools for managing dependencies
|
||||
./venv/bin/pip install pip-tools -c requirements.in
|
||||
|
||||
# install python dependencies with pip-sync (provided by pip-tools)
|
||||
./venv/bin/pip-sync
|
||||
|
||||
# install java dependencies with maven
|
||||
mvn dependency:copy-dependencies
|
||||
|
||||
# run pytest with all linters and 4 workers in parallel
|
||||
./venv/bin/pytest --black --pydocstyle --flake8 --mypy-ignore-missing-imports -n 4
|
||||
|
||||
# use -k to selectively run a set of tests that matches the expression `udf`
|
||||
./venv/bin/pytest -k udf
|
||||
|
||||
# run integration tests with 4 workers in parallel
|
||||
gcloud auth application-default login # or set GOOGLE_APPLICATION_CREDENTIALS
|
||||
export GOOGLE_PROJECT_ID=bigquery-etl-integration-test
|
||||
gcloud config set project $GOOGLE_PROJECT_ID
|
||||
./venv/bin/pytest -m integration -n 4
|
||||
```
|
||||
|
||||
To provide [authentication credentials for the Google Cloud API](https://cloud.google.com/docs/authentication/getting-started) the `GOOGLE_APPLICATION_CREDENTIALS` environment variable must be set to the file path of the JSON file that contains the service account key.
|
||||
See [Mozilla BigQuery API Access instructions](https://docs.telemetry.mozilla.org/cookbooks/bigquery.html#gcp-bigquery-api-access) to request credentials if you don't already have them.
|
||||
|
||||
## How to Configure a UDF Test
|
||||
|
||||
Include a comment like `-- Tests` followed by one or more query statements
|
||||
after the UDF in the SQL file where it is defined. In a SQL file that defines a UDF,
|
||||
each statement that does not itself define a temporary function is collected as a
|
||||
test and executed independently of other tests in the file.
|
||||
|
||||
Each test must use the UDF and throw an error to fail. Assert functions defined
|
||||
in `tests/assert/` may be used to evaluate outputs. Tests must not use any
|
||||
query parameters and should not reference any tables. Each test that is
|
||||
expected to fail must be preceded by a comment like `#xfail`, similar to a [SQL
|
||||
dialect prefix] in the BigQuery Cloud Console.
|
||||
|
||||
For example:
|
||||
|
||||
```sql
|
||||
CREATE TEMP FUNCTION udf_example(option INT64) AS (
|
||||
CASE
|
||||
WHEN option > 0 then TRUE
|
||||
WHEN option = 0 then FALSE
|
||||
ELSE ERROR("invalid option")
|
||||
END
|
||||
);
|
||||
-- Tests
|
||||
SELECT
|
||||
assert_true(udf_example(1)),
|
||||
assert_false(udf_example(0));
|
||||
#xfail
|
||||
SELECT
|
||||
udf_example(-1);
|
||||
#xfail
|
||||
SELECT
|
||||
udf_example(NULL);
|
||||
```
|
||||
|
||||
[sql dialect prefix]: https://cloud.google.com/bigquery/docs/reference/standard-sql/enabling-standard-sql#sql-prefix
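The assert helpers referenced above live in `tests/assert/`. As a rough sketch only (the actual definitions in the repository may differ), such a helper can be a templated SQL function that raises an error when the expectation is not met:

```sql
-- Illustrative only: a minimal assert helper in the spirit of tests/assert/.
-- The real helpers in this repository may be defined differently.
CREATE TEMP FUNCTION assert_true(actual ANY TYPE) AS (
  IF(actual, TRUE, ERROR(CONCAT('Expected true, got ', TO_JSON_STRING(actual))))
);
```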
|
||||
|
||||
## How to Configure a Generated Test
|
||||
|
||||
1. Make a directory for test resources named `tests/{dataset}/{table}/{test_name}/`,
|
||||
e.g. `tests/telemetry_derived/clients_last_seen_raw_v1/test_single_day`
|
||||
- `table` must match a directory named like `{dataset}/{table}`, e.g.
|
||||
`telemetry_derived/clients_last_seen_v1`
|
||||
- `test_name` should start with `test_`, e.g. `test_single_day`
|
||||
- If `test_name` is `test_init` or `test_script`, then the query will run `init.sql`
|
||||
or `script.sql` respectively; otherwise, the test will run `query.sql`
|
||||
1. Add `.yaml` files for input tables, e.g. `clients_daily_v6.yaml`
|
||||
- Include the dataset prefix if it's set in the tested query,
|
||||
e.g. `analysis.clients_last_seen_v1.yaml`
|
||||
- This will result in the dataset prefix being removed from the query,
|
||||
e.g. `query = query.replace("analysis.clients_last_seen_v1", "clients_last_seen_v1")`
|
||||
1. Add `.sql` files for input view queries, e.g. `main_summary_v4.sql`
|
||||
- **_Don't_** include a `CREATE ... AS` clause
|
||||
- Fully qualify table names as `` `{project}.{dataset}.table` ``
|
||||
- Include the dataset prefix if it's set in the tested query,
|
||||
e.g. `telemetry.main_summary_v4.sql`
|
||||
- This will result in the dataset prefix being removed from the query,
|
||||
e.g. `query = query.replace("telemetry.main_summary_v4", "main_summary_v4")`
|
||||
1. Add `expect.yaml` to validate the result
|
||||
- `DATE` and `DATETIME` type columns in the result are coerced to strings
|
||||
using `.isoformat()`
|
||||
- Columns named `generated_time` are removed from the result before
|
||||
comparing to `expect` because they should not be static
|
||||
1. Optionally add `.schema.json` files for input table schemas, e.g.
|
||||
`clients_daily_v6.schema.json`
|
||||
1. Optionally add `query_params.yaml` to define query parameters
|
||||
- `query_params` must be a list
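As a hedged illustration of the input view queries step above (file name and values are hypothetical), a `telemetry.main_summary_v4.sql` file could simply select literal rows; it has no `CREATE ... AS` clause, and any real tables it referenced would be fully qualified:

```sql
-- Hypothetical tests/telemetry_derived/clients_last_seen_v1/test_single_day/telemetry.main_summary_v4.sql
-- No CREATE ... AS clause; this statement stands in for the view during the test.
SELECT
  DATE '2020-03-15' AS submission_date,
  'client-a' AS client_id,
  42 AS sample_id
```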
|
||||
|
||||
## Init Tests
|
||||
|
||||
Tests of `init.sql` statements are supported, similarly to other generated tests.
|
||||
Simply name the test `test_init`. The other guidelines still apply.
|
||||
|
||||
_Note_: Init SQL statements must contain a create statement with the dataset
|
||||
and table name, like so:
|
||||
|
||||
```
|
||||
CREATE OR REPLACE TABLE
|
||||
dataset.table_v1
|
||||
AS
|
||||
...
|
||||
```
|
||||
|
||||
## Additional Guidelines and Options
|
||||
|
||||
- If the destination table is also an input table then `generated_time` should
|
||||
be a required `DATETIME` field to ensure minimal validation
|
||||
- Input table files
|
||||
- All of the formats supported by `bq load` are supported
|
||||
- `yaml` and `json` formats are supported and must contain an array of rows
|
||||
which are converted in memory to `ndjson` before loading
|
||||
- Preferred formats are `yaml` for readability or `ndjson` for compatibility
|
||||
with `bq load`
|
||||
- `expect.yaml`
|
||||
- File extensions `yaml`, `json` and `ndjson` are supported
|
||||
- Preferred formats are `yaml` for readability or `ndjson` for compatibility
|
||||
with `bq load`
|
||||
- Schema files
|
||||
- Setting the description of a top level field to `time_partitioning_field`
|
||||
will cause the table to use it for time partitioning
|
||||
- File extensions `yaml`, `json` and `ndjson` are supported
|
||||
- Preferred formats are `yaml` for readability or `json` for compatibility
|
||||
with `bq load`
|
||||
- Query parameters
|
||||
- Scalar query params should be defined as a dict with keys `name`, `type` or
|
||||
`type_`, and `value`
|
||||
- `query_parameters.yaml` may be used instead of `query_params.yaml`, but
|
||||
they are mutually exclusive
|
||||
- File extensions `yaml`, `json` and `ndjson` are supported
|
||||
- Preferred format is `yaml` for readability
|
||||
|
||||
## How to Run CircleCI Locally
|
||||
|
||||
- Install the [CircleCI local CLI](https://circleci.com/docs/2.0/local-cli/)
|
||||
- Download GCP [service account](https://cloud.google.com/iam/docs/service-accounts) keys
|
||||
- Integration tests will only successfully run with service account keys
|
||||
that belong to the `circleci` service account in the `bigquery-etl-integration-test` project
|
||||
- Run `circleci build` and set required environment variables `GOOGLE_PROJECT_ID` and
|
||||
`GCLOUD_SERVICE_KEY`:
|
||||
|
||||
```
|
||||
gcloud_service_key=`cat /path/to/key_file.json`
|
||||
|
||||
# to run a specific job, e.g. integration:
|
||||
circleci build --job integration \
|
||||
--env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \
|
||||
--env GCLOUD_SERVICE_KEY=$gcloud_service_key
|
||||
|
||||
# to run all jobs
|
||||
circleci build \
|
||||
--env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \
|
||||
--env GCLOUD_SERVICE_KEY=$gcloud_service_key
|
||||
```
|
|
@@ -27,12 +27,17 @@ plugins:
|
|||
- awesome-pages
|
||||
nav:
|
||||
- index.md
|
||||
- Cookbooks:
|
||||
- Common workflows: cookbooks/common_workflows.md
|
||||
- Creating a derived dataset: cookbooks/creating_a_derived_dataset.md
|
||||
- Datasets:
|
||||
- ... | mozdata/**.md
|
||||
- UDFs:
|
||||
- ... | mozfun/**.md
|
||||
- Cookbooks:
|
||||
- Common workflows: cookbooks/common_workflows.md
|
||||
- Creating a derived dataset: cookbooks/creating_a_derived_dataset.md
|
||||
- Testing: cookbooks/testing.md
|
||||
- Reference:
|
||||
- bqetl CLI: bqetl.md
|
||||
- Recommended practices: reference/recommended_practices.md
|
||||
- Incremental queries: reference/incremental.md
|
||||
- Scheduling: reference/scheduling.md
|
||||
- Public data: reference/public_data.md
|
||||
|
|
|
@@ -0,0 +1,35 @@
|
|||
# Incremental Queries
|
||||
|
||||
## Benefits
|
||||
|
||||
- BigQuery billing discounts for destination table partitions not modified in
|
||||
the last 90 days
|
||||
- May use [dags.utils.gcp.bigquery_etl_query] to simplify airflow configuration
|
||||
e.g. see [dags.main_summary.exact_mau28_by_dimensions]
|
||||
- May use [script/generate_incremental_table] to automate backfilling
|
||||
- Should use `WRITE_TRUNCATE` mode or `bq query --replace` to replace
|
||||
partitions atomically to prevent duplicate data
|
||||
- Will have tooling to generate an optimized _mostly materialized view_ that
|
||||
only calculates the most recent partition
|
||||
|
||||
[script/generate_incremental_table]: https://github.com/mozilla/bigquery-etl/blob/main/script/generate_incremental_table
|
||||
[dags.utils.gcp.bigquery_etl_query]: https://github.com/mozilla/telemetry-airflow/blob/89a6dc3/dags/utils/gcp.py#L364
|
||||
|
||||
## Properties
|
||||
|
||||
- Must accept a date via `@submission_date` query parameter
|
||||
- Must output a column named `submission_date` matching the query parameter
|
||||
- Must produce similar results when run multiple times
|
||||
- Should produce identical results when run multiple times
|
||||
- May depend on the previous partition
|
||||
- If using previous partition, must include an `init.sql` query to initialize the
|
||||
table, e.g. `sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_v1/init.sql`
|
||||
- Should be impacted by values from a finite number of preceding partitions
|
||||
- This allows for backfilling in chunks instead of serially for all time
|
||||
and limiting backfills to a certain number of days following updated data
|
||||
- For example `sql/moz-fx-data-shared-prod/clients_last_seen_v1.sql` can be run serially on any 28 day
|
||||
period and the last day will be the same whether or not the partition
|
||||
preceding the first day was missing because values are only impacted by
|
||||
27 preceding days
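A minimal sketch of a query satisfying these properties (the table, columns, and destination are hypothetical): it accepts `@submission_date`, outputs a matching `submission_date` column, and touches only one input partition per run.

```sql
-- Hypothetical telemetry_derived/ping_counts_v1/query.sql
SELECT
  @submission_date AS submission_date,
  normalized_channel,
  COUNT(*) AS ping_count
FROM
  telemetry_stable.main_v4
WHERE
  DATE(submission_timestamp) = @submission_date
GROUP BY
  normalized_channel
```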
|
||||
|
||||
[dags.main_summary.exact_mau28_by_dimensions]: https://github.com/mozilla/telemetry-airflow/blob/89a6dc3/dags/main_summary.py#L385-L390
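If the query instead depended on the previous partition (as `clients_last_seen_v1` does), an accompanying `init.sql` would be required; a hedged sketch of such an initialization, with made-up table and column names, might look like:

```sql
-- Hypothetical init.sql: create the empty, date-partitioned table that the
-- incremental query then fills one partition at a time.
CREATE TABLE IF NOT EXISTS
  telemetry_derived.example_last_seen_v1(
    submission_date DATE,
    client_id STRING,
    days_seen_bits INT64
  )
PARTITION BY
  submission_date
```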
|
|
@@ -0,0 +1,22 @@
|
|||
# Public Data
|
||||
|
||||
For background, see [Accessing Public Data](https://docs.telemetry.mozilla.org/cookbooks/public_data.html)
|
||||
on `docs.telemetry.mozilla.org`.
|
||||
|
||||
- To make query results publicly available, the `public_bigquery` flag must be set in
|
||||
`metadata.yaml`
|
||||
- Tables will get published in the `mozilla-public-data` GCP project which is accessible
|
||||
to everyone, including external users
|
||||
- To make query results publicly available as JSON, the `public_json` flag must be set in
|
||||
`metadata.yaml`
|
||||
- Data will be accessible under https://public-data.telemetry.mozilla.org
|
||||
- A list of all available datasets is published under https://public-data.telemetry.mozilla.org/all-datasets.json
|
||||
- For example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/000000000000.json
|
||||
- Output JSON files have a maximum size of 1GB; data can be split up into multiple files (`000000000000.json`, `000000000001.json`, ...)
|
||||
- `incremental_export` controls how data should be exported as JSON:
|
||||
- `false`: all data of the source table gets exported to a single location
|
||||
- https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/000000000000.json
|
||||
- `true`: only data that matches the `submission_date` parameter is exported as JSON to a separate directory for this date
|
||||
- https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/2020-03-15/000000000000.json
|
||||
- For each dataset, a `metadata.json` gets published listing all available files, for example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/metadata.json
|
||||
- The timestamp when the dataset was last updated is recorded in `last_updated`, e.g.: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/last_updated
|
|
@@ -0,0 +1,121 @@
|
|||
# Recommended practices
|
||||
|
||||
## Queries
|
||||
|
||||
- Should be defined in files named as `sql/<project>/<dataset>/<table>_<version>/query.sql` e.g.
|
||||
`sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v7/query.sql`
|
||||
- `<project>` defines both where the destination table resides and in which project the query job runs
|
||||
- Queries that populate tables should always be named with a version suffix;
|
||||
we assume that future optimizations to the data representation may require
|
||||
schema-incompatible changes such as dropping columns
|
||||
- May be generated using a python script that prints the query to stdout
|
||||
- Should save output as `sql/<project>/<dataset>/<table>_<version>/query.sql` as above
|
||||
- Should be named as `sql/<project>/query_type.sql.py` e.g. `sql/moz-fx-data-shared-prod/clients_daily.sql.py`
|
||||
- May use options to generate queries for different destination tables e.g.
|
||||
using `--source telemetry_core_parquet_v3` to generate
|
||||
`sql/moz-fx-data-shared-prod/telemetry/core_clients_daily_v1/query.sql` and using `--source main_summary_v4` to
|
||||
generate `sql/moz-fx-data-shared-prod/telemetry/clients_daily_v7/query.sql`
|
||||
- Should output a header indicating options used e.g.
|
||||
```sql
|
||||
-- Query generated by: sql/moz-fx-data-shared-prod/clients_daily.sql.py --source telemetry_core_parquet
|
||||
```
|
||||
- Should not specify a project or dataset in table names to simplify testing
|
||||
- Should be [incremental](./incremental.md)
|
||||
- Should filter input tables on partition and clustering columns
|
||||
- Should use `_` prefix in generated column names not meant for output
|
||||
- Should use `_bits` suffix for any integer column that represents a bit pattern
|
||||
- Should not use `DATETIME` type, due to incompatibility with
|
||||
[spark-bigquery-connector]
|
||||
- Should read from `*_stable` tables instead of including custom deduplication
|
||||
- Should use the earliest row for each `document_id` by `submission_timestamp`
|
||||
where filtering duplicates is necessary
|
||||
- Should escape identifiers that match keywords, even if they aren't [reserved keywords]
|
||||
|
||||
[spark-bigquery-connector]: https://github.com/GoogleCloudPlatform/spark-bigquery-connector/issues/5
|
||||
[reserved keywords]: https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#reserved-keywords
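Where deduplication is unavoidable (i.e. the query cannot read from a `*_stable` table), a hedged sketch of the earliest-row-per-`document_id` pattern described above looks like the following; the source table is illustrative, and the `_n` column follows the `_` prefix convention for generated columns not meant for output.

```sql
-- Keep only the earliest row per document_id, ordered by submission_timestamp,
-- while filtering the input table on its partition column.
SELECT
  * EXCEPT (_n)
FROM
  (
    SELECT
      *,
      ROW_NUMBER() OVER (
        PARTITION BY document_id
        ORDER BY submission_timestamp
      ) AS _n
    FROM
      telemetry_live.main_v4
    WHERE
      DATE(submission_timestamp) = @submission_date
  )
WHERE
  _n = 1
```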
|
||||
|
||||
## Query Metadata
|
||||
|
||||
- For each query, a `metadata.yaml` file should be created in the same directory
|
||||
- This file contains a description, owners and labels. As an example:
|
||||
|
||||
```yaml
|
||||
friendly_name: SSL Ratios
|
||||
description: >
|
||||
Percentages of page loads Firefox users have performed that were
|
||||
conducted over SSL broken down by country.
|
||||
owners:
|
||||
- example@mozilla.com
|
||||
labels:
|
||||
application: firefox
|
||||
incremental: true # incremental queries add data to existing tables
|
||||
schedule: daily # scheduled in Airflow to run daily
|
||||
public_json: true
|
||||
public_bigquery: true
|
||||
review_bugs:
|
||||
- 1414839 # Bugzilla bug ID of data review
|
||||
incremental_export: false # non-incremental JSON export writes all data to a single location
|
||||
```
|
||||
|
||||
## Views
|
||||
|
||||
- Should be defined in files named as `sql/<project>/<dataset>/<table>/view.sql` e.g.
|
||||
`sql/moz-fx-data-shared-prod/telemetry/core/view.sql`
|
||||
- Views should generally _not_ be named with a version suffix; a view represents a
|
||||
stable interface for users and whenever possible should maintain compatibility
|
||||
with existing queries; if the view logic cannot be adapted to changes in underlying
|
||||
tables, breaking changes must be communicated to `fx-data-dev@mozilla.org`
|
||||
- Must specify project and dataset in all table names
|
||||
- Should default to using the `moz-fx-data-shared-prod` project;
|
||||
the `scripts/publish_views` tooling can handle parsing the definitions to publish
|
||||
to other projects such as `derived-datasets`
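A hedged sketch of a view definition following these rules (the underlying table is illustrative); note that project and dataset are spelled out in every table name:

```sql
-- Hypothetical sql/moz-fx-data-shared-prod/telemetry/core/view.sql
CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.telemetry.core`
AS
SELECT
  *
FROM
  `moz-fx-data-shared-prod.telemetry_stable.core_v10`
```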
|
||||
|
||||
## UDFs
|
||||
|
||||
- Should limit the number of [expression subqueries] to avoid: `BigQuery error in query operation: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.`
|
||||
- Should be used to avoid code duplication
|
||||
- Must be named in files with lower snake case names ending in `.sql`
|
||||
e.g. `mode_last.sql`
|
||||
- Each file must only define effectively private helper functions and one
|
||||
public function which must be defined last
|
||||
- Helper functions must not conflict with function names in other files
|
||||
- SQL UDFs must be defined in the `udf/` directory and JS UDFs must be defined
|
||||
in the `udf_js` directory
|
||||
- The `udf_legacy/` directory is an exception which must only contain
|
||||
compatibility functions for queries migrated from Athena/Presto.
|
||||
- Functions must be defined as [persistent UDFs](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions#temporary-udf-syntax)
|
||||
using `CREATE OR REPLACE FUNCTION` syntax
|
||||
- Function names must be prefixed with a dataset of `<dir_name>.` so, for example,
|
||||
all functions in `udf/*.sql` are part of the `udf` dataset
|
||||
- The final syntax for creating a function in a file will look like
|
||||
`CREATE OR REPLACE FUNCTION <dir_name>.<file_name>`
|
||||
- We provide tooling in `scripts/publish_persistent_udfs` for
|
||||
publishing these UDFs to BigQuery
|
||||
- Changes made to UDFs need to be published manually in order for the
|
||||
dry run CI task to pass
|
||||
- Should use `SQL` over `js` for performance
|
||||
|
||||
[expression subqueries]: https://cloud.google.com/bigquery/docs/reference/standard-sql/expression_subqueries
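A hedged sketch of what a UDF file following these rules might contain (the function itself is hypothetical); if the file lived at `udf/bitcount_lowest_7.sql`, the public function would be prefixed with the `udf` dataset and named after the file:

```sql
-- Hypothetical udf/bitcount_lowest_7.sql: any private helpers would come first,
-- and the single public function, named after the file, is defined last.
CREATE OR REPLACE FUNCTION udf.bitcount_lowest_7(x INT64) AS (
  BIT_COUNT(x & 0x7F)
);
```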
|
||||
|
||||
## Backfills
|
||||
|
||||
- Should be avoided on large tables
|
||||
- Backfills may double storage cost for a table for 90 days by moving
|
||||
data from long-term storage to short-term storage
|
||||
- For example regenerating `clients_last_seen_v1` from scratch would cost
|
||||
about $1600 for the query and about $6800 for data moved to short-term
|
||||
storage
|
||||
- Should combine multiple backfills happening around the same time
|
||||
- Should delay column deletes until the next other backfill
|
||||
- Should use `NULL` for new data and `EXCEPT` to exclude from views until
|
||||
dropped
|
||||
- Should use copy operations in append mode to change column order
|
||||
- Copy operations do not allow changing partitioning, changing clustering, or
|
||||
column deletes
|
||||
- Should split backfilling into queries that finish in minutes not hours
|
||||
- May use [script/generate_incremental_table] to automate backfilling incremental
|
||||
queries
|
||||
- May be performed in a single query for smaller tables that do not depend on history
|
||||
- A useful pattern is to have the only reference to `@submission_date` be a
|
||||
clause `WHERE (@submission_date IS NULL OR @submission_date = submission_date)`
|
||||
which allows recreating all dates by passing `--parameter=submission_date:DATE:NULL`
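A hedged sketch of that optional-parameter pattern (table and columns are illustrative): with `--parameter=submission_date:DATE:NULL` the filter passes for every date, while a concrete date backfills a single partition.

```sql
-- The only reference to @submission_date is in this WHERE clause, so passing
-- --parameter=submission_date:DATE:NULL recreates all dates in one run.
SELECT
  submission_date,
  COUNT(*) AS row_count
FROM
  telemetry_derived.clients_daily_v6
WHERE
  (@submission_date IS NULL OR @submission_date = submission_date)
GROUP BY
  submission_date
```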
|
|
@@ -0,0 +1,54 @@
|
|||
# Scheduling Queries in Airflow
|
||||
|
||||
- bigquery-etl has tooling to automatically generate Airflow DAGs for scheduling queries
|
||||
- To be scheduled, a query must be assigned to a DAG that is specified in `dags.yaml`
|
||||
- New DAGs can be configured in `dags.yaml`, e.g., by adding the following:
|
||||
```yaml
|
||||
bqetl_ssl_ratios: # name of the DAG; must start with bqetl_
|
||||
schedule_interval: 0 2 * * * # query schedule
|
||||
description: The DAG schedules SSL ratios queries.
|
||||
default_args:
|
||||
owner: example@mozilla.com
|
||||
start_date: "2020-04-05" # YYYY-MM-DD
|
||||
email: ["example@mozilla.com"]
|
||||
retries: 2 # number of retries if the query execution fails
|
||||
retry_delay: 30m
|
||||
```
|
||||
- All DAG names need to have `bqetl_` as a prefix.
|
||||
- `schedule_interval` is either defined as a [CRON expression](https://en.wikipedia.org/wiki/Cron) or alternatively as one of the following [CRON presets](https://airflow.readthedocs.io/en/latest/dag-run.html): `once`, `hourly`, `daily`, `weekly`, `monthly`
|
||||
- `start_date` defines the first date for which the query should be executed
|
||||
- Airflow will not automatically backfill older dates if `start_date` is set in the past; backfilling can be done via the Airflow web interface
|
||||
- `email` lists the email addresses that alerts should be sent to if the query fails
|
||||
- Alternatively, new DAGs can also be created via the `bqetl` CLI by running `bqetl dag create bqetl_ssl_ratios --schedule_interval='0 2 * * *' --owner="example@mozilla.com" --start_date="2020-04-05" --description="This DAG generates SSL ratios."`
|
||||
- To schedule a specific query, add a `metadata.yaml` file that includes a `scheduling` section, for example:
|
||||
```yaml
|
||||
friendly_name: SSL ratios
|
||||
# ... more metadata, see Query Metadata section above
|
||||
scheduling:
|
||||
dag_name: bqetl_ssl_ratios
|
||||
```
|
||||
- Additional scheduling options:
|
||||
- `depends_on_past` keeps the query from getting executed if the previous schedule for the query hasn't succeeded
|
||||
- `date_partition_parameter` - by default set to `submission_date`; can be set to `null` if query doesn't write to a partitioned table
|
||||
- `parameters` specifies a list of query parameters, e.g. `["n_clients:INT64:500"]`
|
||||
- `arguments` - a list of arguments passed when running the query, for example: `["--append_table"]`
|
||||
- `referenced_tables` - manually curated list of tables the query depends on; used to speed up the DAG generation process or to specify tables that the dry run doesn't have permissions to access, e.g. `[['telemetry_stable', 'main_v4']]`
|
||||
- `multipart` indicates whether a query is split over multiple files `part1.sql`, `part2.sql`, ...
|
||||
- `depends_on` defines external dependencies in telemetry-airflow that are not detected automatically:
|
||||
```yaml
|
||||
depends_on:
|
||||
- task_id: external_task
|
||||
dag_name: external_dag
|
||||
execution_delta: 1h
|
||||
```
|
||||
- `task_id`: name of task query depends on
|
||||
- `dag_name`: name of the DAG the external task is part of
|
||||
- `execution_delta`: time difference between the `schedule_intervals` of the external DAG and the DAG the query is part of
|
||||
- `destination_table`: The table to write to. If unspecified, defaults to the query destination; if None, no destination table is used (the query is simply run as-is). Note that if no destination table is specified, you will need to specify the `submission_date` parameter manually.
|
||||
- Queries can also be scheduled using the `bqetl` CLI: `./bqetl query schedule path/to/query_v1 --dag bqetl_ssl_ratios`
|
||||
- To generate all Airflow DAGs run `./script/generate_airflow_dags` or `./bqetl dag generate`
|
||||
- Generated DAGs are located in the `dags/` directory
|
||||
- Dependencies between queries scheduled in bigquery-etl and dependencies to stable tables are detected automatically
|
||||
- Specific DAGs can be generated by running `./bqetl dag generate bqetl_ssl_ratios`
|
||||
- Generated DAGs will be automatically detected and scheduled by Airflow
|
||||
- It might take up to 10 minutes for new DAGs and updates to show up in the Airflow UI
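To illustrate the `parameters` option listed above, a query scheduled with `parameters: ["n_clients:INT64:500"]` could reference the value as a named parameter; the query body below is hypothetical.

```sql
-- Hypothetical query using both the date partition parameter and the extra
-- n_clients parameter supplied via the scheduling metadata.
SELECT
  country,
  COUNT(DISTINCT client_id) AS n_clients
FROM
  telemetry_derived.clients_daily_v6
WHERE
  submission_date = @submission_date
GROUP BY
  country
HAVING
  COUNT(DISTINCT client_id) >= @n_clients
```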
|
174
tests/README.md
|
@@ -1,173 +1 @@
|
|||
How to Run Tests
|
||||
===
|
||||
|
||||
This repository uses `pytest`:
|
||||
|
||||
```bash
# create a venv
python3.8 -m venv venv/

# install pip-tools for managing dependencies
./venv/bin/pip install pip-tools -c requirements.in

# install python dependencies with pip-sync (provided by pip-tools)
./venv/bin/pip-sync

# install java dependencies with maven
mvn dependency:copy-dependencies

# run pytest with all linters and 4 workers in parallel
./venv/bin/pytest --black --pydocstyle --flake8 --mypy-ignore-missing-imports -n 4

# use -k to selectively run a set of tests that matches the expression `udf`
./venv/bin/pytest -k udf

# run integration tests with 4 workers in parallel
gcloud auth application-default login # or set GOOGLE_APPLICATION_CREDENTIALS
export GOOGLE_PROJECT_ID=bigquery-etl-integration-test
gcloud config set project $GOOGLE_PROJECT_ID
./venv/bin/pytest -m integration -n 4
```

To provide [authentication credentials for the Google Cloud API](https://cloud.google.com/docs/authentication/getting-started) the `GOOGLE_APPLICATION_CREDENTIALS` environment variable must be set to the file path of the JSON file that contains the service account key.
See [Mozilla BigQuery API Access instructions](https://docs.telemetry.mozilla.org/cookbooks/bigquery.html#gcp-bigquery-api-access) to request credentials if you don't already have them.

How to Configure a UDF Test
===

Include a comment like `-- Tests` followed by one or more query statements
after the UDF in the SQL file where it is defined. Each statement in a SQL file
that defines a UDF, except for the temporary function definitions themselves,
is collected as a test and executed independently of the other tests in the file.

Each test must use the UDF and throw an error to fail. Assert functions defined
in `tests/assert/` may be used to evaluate outputs. Tests must not use any
query parameters and should not reference any tables. Each test that is
expected to fail must be preceded by a comment like `#xfail`, similar to a [SQL
dialect prefix] in the BigQuery Cloud Console.

For example:

```sql
CREATE TEMP FUNCTION udf_example(option INT64) AS (
  CASE
    WHEN option > 0 THEN TRUE
    WHEN option = 0 THEN FALSE
    ELSE ERROR("invalid option")
  END
);
-- Tests
SELECT
  assert_true(udf_example(1)),
  assert_false(udf_example(0));
#xfail
SELECT
  udf_example(-1);
#xfail
SELECT
  udf_example(NULL);
```

[SQL dialect prefix]: https://cloud.google.com/bigquery/docs/reference/standard-sql/enabling-standard-sql#sql-prefix

How to Configure a Generated Test
===

1. Make a directory for test resources named `tests/{dataset}/{table}/{test_name}/`,
   e.g. `tests/telemetry_derived/clients_last_seen_raw_v1/test_single_day`
   (a full example layout is sketched after this list)
   - `table` must match a directory named like `{dataset}/{table}`, e.g.
     `telemetry_derived/clients_last_seen_v1`
   - `test_name` should start with `test_`, e.g. `test_single_day`
   - If `test_name` is `test_init` or `test_script`, then the test will run `init.sql`
     or `script.sql` respectively; otherwise, the test will run `query.sql`
1. Add `.yaml` files for input tables, e.g. `clients_daily_v6.yaml`
   - Include the dataset prefix if it's set in the tested query,
     e.g. `analysis.clients_last_seen_v1.yaml`
   - This will result in the dataset prefix being removed from the query,
     e.g. `query = query.replace("analysis.clients_last_seen_v1", "clients_last_seen_v1")`
1. Add `.sql` files for input view queries, e.g. `main_summary_v4.sql`
   - ***Don't*** include a `CREATE ... AS` clause
   - Fully qualify table names as ``` `{project}.{dataset}.table` ```
   - Include the dataset prefix if it's set in the tested query,
     e.g. `telemetry.main_summary_v4.sql`
   - This will result in the dataset prefix being removed from the query,
     e.g. `query = query.replace("telemetry.main_summary_v4", "main_summary_v4")`
1. Add `expect.yaml` to validate the result
   - `DATE` and `DATETIME` type columns in the result are coerced to strings
     using `.isoformat()`
   - Columns named `generated_time` are removed from the result before
     comparing to `expect` because they should not be static
1. Optionally add `.schema.json` files for input table schemas, e.g.
   `clients_daily_v6.schema.json`
1. Optionally add `query_params.yaml` to define query parameters
   - `query_params` must be a list
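As a rough sketch, a `test_single_day` test for a hypothetical `telemetry_derived/clients_last_seen_v1` query could consist of an input file and an expectation file along these lines. The column names and row values here are invented for illustration, not taken from the real tables:

```yaml
# tests/telemetry_derived/clients_last_seen_v1/test_single_day/clients_daily_v6.yaml
# input rows for clients_daily_v6, as an array of rows
- submission_date: "2020-04-05"
  client_id: client_a
  country: DE
- submission_date: "2020-04-05"
  client_id: client_b
  country: US
```

```yaml
# tests/telemetry_derived/clients_last_seen_v1/test_single_day/expect.yaml
# expected rows of the query result
- submission_date: "2020-04-05"
  client_id: client_a
  country: DE
- submission_date: "2020-04-05"
  client_id: client_b
  country: US
```

The test loads the input rows, runs `query.sql` against them (with any parameters supplied via `query_params.yaml`), and compares the result to `expect.yaml`.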
Init Tests
===

Tests of `init.sql` statements are supported, similarly to other generated tests.
Simply name the test `test_init`. The other guidelines still apply.

*Note*: Init SQL statements must contain a create statement with the dataset
and table name, like so:

```sql
CREATE OR REPLACE TABLE
  dataset.table_v1
AS
...
```

Additional Guidelines and Options
---

- If the destination table is also an input table then `generated_time` should
  be a required `DATETIME` field to ensure minimal validation
- Input table files
  - All of the formats supported by `bq load` are supported
  - `yaml` and `json` formats are supported and must contain an array of rows,
    which are converted in memory to `ndjson` before loading
  - Preferred formats are `yaml` for readability or `ndjson` for compatibility
    with `bq load`
- `expect.yaml`
  - File extensions `yaml`, `json` and `ndjson` are supported
  - Preferred formats are `yaml` for readability or `ndjson` for compatibility
    with `bq load`
- Schema files
  - Setting the description of a top level field to `time_partitioning_field`
    will cause the table to use it for time partitioning
  - File extensions `yaml`, `json` and `ndjson` are supported
  - Preferred formats are `yaml` for readability or `json` for compatibility
    with `bq load`
- Query parameters
  - Scalar query params should be defined as a dict with keys `name`, `type` or
    `type_`, and `value` (see the sketch after this list)
  - `query_parameters.yaml` may be used instead of `query_params.yaml`, but
    they are mutually exclusive
  - File extensions `yaml`, `json` and `ndjson` are supported
  - Preferred format is `yaml` for readability
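For example, a minimal `query_params.yaml` supplying a `submission_date` parameter could look like this (the parameter name and value are only illustrative):

```yaml
# tests/{dataset}/{table}/{test_name}/query_params.yaml
- name: submission_date
  type: DATE
  value: "2020-04-05"
```

A schema file that opts into time partitioning might, in its `yaml` form, be sketched roughly as follows; the file name, extension, and field list are assumptions for illustration:

```yaml
# clients_daily_v6.schema.yaml (hypothetical input table schema)
- name: submission_date
  type: DATE
  mode: NULLABLE
  # marks this field as the time partitioning column
  description: time_partitioning_field
- name: client_id
  type: STRING
  mode: NULLABLE
```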
How to Run CircleCI Locally
===

- Install the [CircleCI local CLI](https://circleci.com/docs/2.0/local-cli/)
- Download GCP [service account](https://cloud.google.com/iam/docs/service-accounts) keys
  - Integration tests will only run successfully with service account keys
    that belong to the `circleci` service account in the `bigquery-etl-integration-test` project
- Run `circleci build` and set the required environment variables `GOOGLE_PROJECT_ID` and
  `GCLOUD_SERVICE_KEY`:

```bash
gcloud_service_key=`cat /path/to/key_file.json`

# to run a specific job, e.g. integration:
circleci build --job integration \
  --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \
  --env GCLOUD_SERVICE_KEY=$gcloud_service_key

# to run all jobs
circleci build \
  --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \
  --env GCLOUD_SERVICE_KEY=$gcloud_service_key
```

For information on how to run tests, see https://mozilla.github.io/bigquery-etl/cookbooks/testing/