* Move most reference and cookbook documentation in README.md into
  the "reference" and "cookbook" sections of the generated documentation,
  respectively.
* Try to steer people to the generated docs inside the README.md (since
  it is now basically just a set of quickstart instructions)
* Provide a bit of guidance that this repository isn't great for 3rd
  party contributors in a new CONTRIBUTING.md.

Co-authored-by: Jeff Klukas <jklukas@mozilla.com>
This commit is contained in:
Will Lachance 2021-11-29 16:28:07 -05:00 committed by GitHub
Parent c434e75494
Commit 1d6fea8f2d
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
11 changed files with 545 additions and 591 deletions

CONTRIBUTING.md (new file, +10)

@@ -0,0 +1,10 @@
# Contributing to bigquery-etl
Thank you for your interest in contributing to bigquery-etl! Although the code in this repository is licensed under the MPL, working on this repository effectively requires access to Mozilla's BigQuery data infrastructure which is reserved for Mozilla employees and designated contributors. For more information, see the sections on [gaining access] and [BigQuery Access Request] on [docs.telemetry.mozilla.org].
More information on working with this repository can be found in the README.md file (at the root of this repository) and in the [repository documentation].
[gaining access]: https://docs.telemetry.mozilla.org/concepts/gaining_access.html
[BigQuery Access Request]: https://docs.telemetry.mozilla.org/cookbooks/bigquery/access.html#bigquery-access-request
[docs.telemetry.mozilla.org]: https://docs.telemetry.mozilla.org
[repository documentation]: https://mozilla.github.io/bigquery-etl/

README.md (420 lines changed)

@@ -2,12 +2,14 @@
# BigQuery ETL
This repository contains Mozilla Data Team's
This repository contains Mozilla Data Team's:
- Derived ETL jobs that do not require a custom container
- User-defined functions (UDFs)
- Airflow DAGs for scheduled bigquery-etl queries
- Tools for query & UDF deployment, management and scheduling
For more information, see [https://mozilla.github.io/bigquery-etl/](https://mozilla.github.io/bigquery-etl/)
## Quick Start
@@ -26,7 +28,7 @@ This repository contains Mozilla Data Team's
- **For Mozilla Employees or Contributors (not in Data Engineering)** - Set up GCP command line tools, [as described on docs.telemetry.mozilla.org](https://docs.telemetry.mozilla.org/cookbooks/bigquery/access.html#using-the-bq-command-line-tool). Note that some functionality (e.g. writing UDFs or backfilling queries) may not be allowed.
- **For Data Engineering** - In addition to setting up the command line tools, you will want to log in to `shared-prod` if making changes to production systems. Run `gcloud auth login --update-adc --project=moz-fx-data-shared-prod` (if you have not run it previously).
### Installing bqetl library
### Installing bqetl
1. Clone the repository
```bash
@@ -55,416 +57,4 @@ Finally, if you are using Visual Studio Code, you may also wish to use our recom
cp .vscode/settings.json.default .vscode/settings.json
```
And you should now be set up to start working in the repo! The easiest way to do this for many tasks is to use `bqetl`, which is described below.
---
&nbsp;
# The `bqetl` CLI
The `bqetl` command-line tool aims to simplify working with the bigquery-etl repository by supporting
common workflows, such as creating, validating and scheduling queries or adding new UDFs.
## Usage
The CLI groups commands into different groups:
```
$ ./bqetl --help
Commands:
dag Commands for managing DAGs.
dryrun Dry run SQL.
format Format SQL.
mozfun Commands for managing mozfun UDFs.
query Commands for managing queries.
udf Commands for managing UDFs.
...
```
To get information about commands and available options, simply append the `--help` flag:
```
$ ./bqetl query create --help
Usage: bqetl query create [OPTIONS] NAME
Create a new query with name <dataset>.<query_name>, for example:
telemetry_derived.asn_aggregates
Options:
-p, --path DIRECTORY Path to directory in which query should be created
-o, --owner TEXT Owner of the query (email address)
-i, --init Create an init.sql file to initialize the table
--help Show this message and exit.
```
Documentation of all `bqetl` commands including usage examples can be found in the [bigquery-etl docs](https://github.com/mozilla/bigquery-etl#the-bqetl-cli).
Running some commands, for example to create or query tables, will [require access to Mozilla's GCP Account](https://docs.telemetry.mozilla.org/cookbooks/bigquery/access.html#bigquery-access-request).
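For illustration, a typical flow for adding a new derived table might chain several of the subcommands listed above. This is only a sketch: the table name, paths and options below are illustrative, so consult each command's `--help` for the exact interface.
```bash
# Sketch of a common workflow -- names, paths and options are illustrative
./bqetl query create telemetry_derived.ssl_ratios_v1 --owner=example@mozilla.com
./bqetl format sql/moz-fx-data-shared-prod/telemetry_derived/ssl_ratios_v1/query.sql
./bqetl dryrun sql/moz-fx-data-shared-prod/telemetry_derived/ssl_ratios_v1/query.sql
./bqetl query schedule sql/moz-fx-data-shared-prod/telemetry_derived/ssl_ratios_v1 --dag bqetl_ssl_ratios
```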
## Formatting SQL
We enforce consistent SQL formatting as part of CI. After adding or changing a
query, use `./bqetl format` to apply formatting rules.
Directories and files passed as arguments to `./bqetl format` will be
formatted in place, with directories recursively searched for files with a
`.sql` extension, e.g.:
```bash
$ echo 'SELECT 1,2,3' > test.sql
$ ./bqetl format test.sql
modified test.sql
1 file(s) modified
$ cat test.sql
SELECT
1,
2,
3
```
If no arguments are specified, the script will read from stdin and write to
stdout, e.g.:
```bash
$ echo 'SELECT 1,2,3' | ./bqetl format
SELECT
1,
2,
3
```
To turn off SQL formatting for a block of SQL, wrap it in `format:off` and
`format:on` comments, like this:
```sql
SELECT
-- format:off
submission_date, sample_id, client_id
-- format:on
```
Recommended practices
---
### Queries
- Should be defined in files named as `sql/<project>/<dataset>/<table>_<version>/query.sql` e.g.
  `sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v7/query.sql`
  - `<project>` defines both where the destination table resides and in which project the query job runs
- Queries that populate tables should always be named with a version suffix;
we assume that future optimizations to the data representation may require
schema-incompatible changes such as dropping columns
- May be generated using a python script that prints the query to stdout
- Should save output as `sql/<project>/<dataset>/<table>_<version>/query.sql` as above
- Should be named as `sql/<project>/query_type.sql.py` e.g. `sql/moz-fx-data-shared-prod/clients_daily.sql.py`
- May use options to generate queries for different destination tables e.g.
using `--source telemetry_core_parquet_v3` to generate
`sql/moz-fx-data-shared-prod/telemetry/core_clients_daily_v1/query.sql` and using `--source main_summary_v4` to
generate `sql/moz-fx-data-shared-prod/telemetry/clients_daily_v7/query.sql`
- Should output a header indicating options used e.g.
```sql
-- Query generated by: sql/moz-fx-data-shared-prod/clients_daily.sql.py --source telemetry_core_parquet
```
- Should not specify a project or dataset in table names to simplify testing
- Should be [incremental]
- Should filter input tables on partition and clustering columns
- Should use `_` prefix in generated column names not meant for output
- Should use `_bits` suffix for any integer column that represents a bit pattern
- Should not use `DATETIME` type, due to incompatibility with
[spark-bigquery-connector]
- Should read from `*_stable` tables instead of including custom deduplication
- Should use the earliest row for each `document_id` by `submission_timestamp`
where filtering duplicates is necessary
- Should escape identifiers that match keywords, even if they aren't [reserved keywords]
### Views
- Should be defined in files named as `sql/<project>/<dataset>/<table>/view.sql` e.g.
`sql/moz-fx-data-shared-prod/telemetry/core/view.sql`
- Views should generally _not_ be named with a version suffix; a view represents a
stable interface for users and whenever possible should maintain compatibility
with existing queries; if the view logic cannot be adapted to changes in underlying
tables, breaking changes must be communicated to `fx-data-dev@mozilla.org`
- Must specify project and dataset in all table names
- Should default to using the `moz-fx-data-shared-prod` project;
the `scripts/publish_views` tooling can handle parsing the definitions to publish
to other projects such as `derived-datasets`
### UDFs
- Should limit the number of [expression subqueries] to avoid: `BigQuery error
in query operation: Resources exceeded during query execution: Not enough
resources for query planning - too many subqueries or query is too complex.`
- Should be used to avoid code duplication
- Must be named in files with lower snake case names ending in `.sql`
e.g. `mode_last.sql`
- Each file must only define effectively private helper functions and one
public function which must be defined last
- Helper functions must not conflict with function names in other files
- SQL UDFs must be defined in the `udf/` directory and JS UDFs must be defined
in the `udf_js` directory
- The `udf_legacy/` directory is an exception which must only contain
compatibility functions for queries migrated from Athena/Presto.
- Functions must be defined as [persistent UDFs](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions#temporary-udf-syntax)
using `CREATE OR REPLACE FUNCTION` syntax
- Function names must be prefixed with a dataset of `<dir_name>.` so, for example,
all functions in `udf/*.sql` are part of the `udf` dataset
- The final syntax for creating a function in a file will look like
`CREATE OR REPLACE FUNCTION <dir_name>.<file_name>`
- We provide tooling in `scripts/publish_persistent_udfs` for
publishing these UDFs to BigQuery
- Changes made to UDFs need to be published manually in order for the
dry run CI task to pass
- Should use `SQL` over `js` for performance
### Backfills
- Should be avoided on large tables
- Backfills may double storage cost for a table for 90 days by moving
data from long-term storage to short-term storage
- For example regenerating `clients_last_seen_v1` from scratch would cost
about $1600 for the query and about $6800 for data moved to short-term
storage
- Should combine multiple backfills happening around the same time
- Should delay column deletes until the next other backfill
- Should use `NULL` for new data and `EXCEPT` to exclude from views until
dropped
- Should use copy operations in append mode to change column order
- Copy operations do not allow changing partitioning, changing clustering, or
column deletes
- Should split backfilling into queries that finish in minutes not hours
- May use [script/generate_incremental_table] to automate backfilling incremental
queries
- May be performed in a single query for smaller tables that do not depend on history
- A useful pattern is to have the only reference to `@submission_date` be a
clause `WHERE (@submission_date IS NULL OR @submission_date = submission_date)`
which allows recreating all dates by passing `--parameter=submission_date:DATE:NULL`
Incremental Queries
---
### Benefits
- BigQuery billing discounts for destination table partitions not modified in
the last 90 days
- May use [dags.utils.gcp.bigquery_etl_query] to simplify airflow configuration
e.g. see [dags.main_summary.exact_mau28_by_dimensions]
- May use [script/generate_incremental_table] to automate backfilling
- Should use `WRITE_TRUNCATE` mode or `bq query --replace` to replace
partitions atomically to prevent duplicate data
- Will have tooling to generate an optimized _mostly materialized view_ that
only calculates the most recent partition
### Properties
- Must accept a date via `@submission_date` query parameter
- Must output a column named `submission_date` matching the query parameter
- Must produce similar results when run multiple times
- Should produce identical results when run multiple times
- May depend on the previous partition
- If using previous partition, must include an `init.sql` query to initialize the
table, e.g. `sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_v1/init.sql`
- Should be impacted by values from a finite number of preceding partitions
- This allows for backfilling in chunks instead of serially for all time
and limiting backfills to a certain number of days following updated data
- For example `sql/moz-fx-data-shared-prod/clients_last_seen_v1.sql` can be run serially on any 28 day
period and the last day will be the same whether or not the partition
preceding the first day was missing because values are only impacted by
27 preceding days
Query Metadata
---
- For each query, a `metadata.yaml` file should be created in the same directory
- This file contains a description, owners and labels. As an example:
```yaml
friendly_name: SSL Ratios
description: >
Percentages of page loads Firefox users have performed that were
conducted over SSL broken down by country.
owners:
- example@mozilla.com
labels:
application: firefox
incremental: true # incremental queries add data to existing tables
schedule: daily # scheduled in Airflow to run daily
public_json: true
public_bigquery: true
review_bugs:
- 1414839 # Bugzilla bug ID of data review
incremental_export: false # non-incremental JSON export writes all data to a single location
```
### Publishing a Table Publicly
For background, see [Accessing Public Data](https://docs.telemetry.mozilla.org/cookbooks/public_data.html)
on `docs.telemetry.mozilla.org`.
- To make query results publicly available, the `public_bigquery` flag must be set in
`metadata.yaml`
- Tables will get published in the `mozilla-public-data` GCP project, which is accessible
to everyone, including external users
- To make query results publicly available as JSON, `public_json` flag must be set in
`metadata.yaml`
- Data will be accessible under https://public-data.telemetry.mozilla.org
- A list of all available datasets is published under https://public-data.telemetry.mozilla.org/all-datasets.json
- For example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/000000000000.json
- Output JSON files have a maximum size of 1 GB; data may be split across multiple files (`000000000000.json`, `000000000001.json`, ...)
- `incremental_export` controls how data should be exported as JSON:
- `false`: all data of the source table gets exported to a single location
- https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/000000000000.json
- `true`: only data that matches the `submission_date` parameter is exported as JSON to a separate directory for this date
- https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/2020-03-15/000000000000.json
- For each dataset, a `metadata.json` gets published listing all available files, for example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/metadata.json
- The timestamp when the dataset was last updated is recorded in `last_updated`, e.g.: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/last_updated
Dataset Metadata
---
To provision a new BigQuery dataset for holding tables, you'll need to
create a `dataset_metadata.yaml` file, which will cause the dataset to be
deployed automatically a few hours after merging. Changes to existing
datasets (such as changing access policies) may require manual operator approval.
The `bqetl query create` command will automatically generate a skeleton
`dataset_metadata.yaml` file if the query name contains a dataset that
is not yet defined.
See example with commentary for `telemetry_derived`:
```yaml
friendly_name: Telemetry Derived
description: |-
Derived data based on pings from legacy Firefox telemetry, plus many other
general-purpose derived tables
labels: {}
# Base ACL can be:
# "derived" for `_derived` datasets that contain concrete tables
# "view" for user-facing datasets containing virtual views
dataset_base_acl: derived
# Datasets with user-facing set to true will be created both in shared-prod
# and in mozdata; this should be false for all `_derived` datasets
user_facing: false
# Most datasets can have mozilla-confidential access like below,
# but some datasets will be defined with more restricted access
# or with additional access for services.
workgroup_access:
- role: roles/bigquery.dataViewer
members:
- workgroup:mozilla-confidential
```
Scheduling Queries in Airflow
---
- bigquery-etl has tooling to automatically generate Airflow DAGs for scheduling queries
- To be scheduled, a query must be assigned to a DAG that is specified in `dags.yaml`
- New DAGs can be configured in `dags.yaml`, e.g., by adding the following:
```yaml
bqetl_ssl_ratios: # name of the DAG; must start with bqetl_
schedule_interval: 0 2 * * * # query schedule
description: The DAG schedules SSL ratios queries.
default_args:
owner: example@mozilla.com
start_date: '2020-04-05' # YYYY-MM-DD
email: ['example@mozilla.com']
retries: 2 # number of retries if the query execution fails
retry_delay: 30m
```
- All DAG names need to have `bqetl_` as prefix.
- `schedule_interval` is either defined as a [CRON expression](https://en.wikipedia.org/wiki/Cron) or alternatively as one of the following [CRON presets](https://airflow.readthedocs.io/en/latest/dag-run.html): `once`, `hourly`, `daily`, `weekly`, `monthly`
- `start_date` defines the first date for which the query should be executed
- Airflow will not automatically backfill older dates if `start_date` is set in the past; backfilling can be done via the Airflow web interface
- `email` lists email addresses alerts should be sent to in case of failures when running the query
- Alternatively, new DAGs can also be created via the `bqetl` CLI by running `bqetl dag create bqetl_ssl_ratios --schedule_interval='0 2 * * *' --owner="example@mozilla.com" --start_date="2020-04-05" --description="This DAG generates SSL ratios."`
- To schedule a specific query, add a `metadata.yaml` file that includes a `scheduling` section, for example:
```yaml
friendly_name: SSL ratios
# ... more metadata, see Query Metadata section above
scheduling:
dag_name: bqetl_ssl_ratios
```
- Additional scheduling options:
- `depends_on_past` keeps query from getting executed if the previous schedule for the query hasn't succeeded
- `date_partition_parameter` - by default set to `submission_date`; can be set to `null` if query doesn't write to a partitioned table
- `parameters` specifies a list of query parameters, e.g. `["n_clients:INT64:500"]`
- `arguments` - a list of arguments passed when running the query, for example: `["--append_table"]`
- `referenced_tables` - manually curated list of tables the query depends on; used to speed up the DAG generation process or to specify tables that the dry run doesn't have permissions to access, e.g. `[['telemetry_stable', 'main_v4']]`
- `multipart` indicates whether a query is split over multiple files `part1.sql`, `part2.sql`, ...
- `depends_on` defines external dependencies in telemetry-airflow that are not detected automatically:
```yaml
depends_on:
- task_id: external_task
dag_name: external_dag
execution_delta: 1h
```
- `task_id`: name of task query depends on
- `dag_name`: name of the DAG the external task is part of
- `execution_delta`: time difference between the `schedule_intervals` of the external DAG and the DAG the query is part of
- `destination_table`: The table to write to. If unspecified, defaults to the query destination; if None, no destination table is used (the query is simply run as-is). Note that if no destination table is specified, you will need to specify the `submission_date` parameter manually
- Queries can also be scheduled using the `bqetl` CLI: `./bqetl query schedule path/to/query_v1 --dag bqetl_ssl_ratios `
- To generate all Airflow DAGs run `./script/generate_airflow_dags` or `./bqetl dag generate`
- Generated DAGs are located in the `dags/` directory
- Dependencies between queries scheduled in bigquery-etl and dependencies to stable tables are detected automatically
- Specific DAGs can be generated by running `./bqetl dag generate bqetl_ssl_ratios`
- Generated DAGs will be automatically detected and scheduled by Airflow
- It might take up to 10 minutes for new DAGs and updates to show up in the Airflow UI
Contributing
---
When adding or modifying a query in this repository, make your changes in the `sql/` directory.
When adding a new library to the Python requirements, first add the library to
the requirements and then add any meta-dependencies into constraints.
Constraints are discovered by installing requirements into a fresh virtual
environment. A dependency should be added to either `requirements.txt` or
`constraints.txt`, but not both.
```bash
# Create a python virtual environment (not necessary if you have already
# run `./bqetl bootstrap`)
python3 -m venv venv/
# Activate the virtual environment
source venv/bin/activate
# If not installed:
pip install pip-tools
# Add the dependency to requirements.in e.g. Jinja2.
echo Jinja2==2.11.1 >> requirements.in
# Compile hashes for new dependencies.
pip-compile --generate-hashes requirements.in
# Deactivate the python virtual environment.
deactivate
```
When opening a pull request from a fork, the `manual-trigger-required-for-fork` CI task will
fail and some integration test tasks will be skipped. A user with repository write permissions
will have to run the [Push to upstream workflow](https://github.com/mozilla/bigquery-etl/actions/workflows/push-to-upstream.yml)
and provide the `<username>:<branch>` of the fork as a parameter. The parameter will also show up
in the logs of the `manual-trigger-required-for-fork` CI task, together with more detailed instructions.
Once the workflow has been executed, the PR's CI tasks, including the integration tests, will run.
Tests
---
[See the documentation in tests/](tests/README.md)
[script/generate_incremental_table]: https://github.com/mozilla/bigquery-etl/blob/main/script/generate_incremental_table
[expression subqueries]: https://cloud.google.com/bigquery/docs/reference/standard-sql/expression_subqueries
[dags.utils.gcp.bigquery_etl_query]: https://github.com/mozilla/telemetry-airflow/blob/89a6dc3/dags/utils/gcp.py#L364
[dags.main_summary.exact_mau28_by_dimensions]: https://github.com/mozilla/telemetry-airflow/blob/89a6dc3/dags/main_summary.py#L385-L390
[incremental]: #incremental-queries
[spark-bigquery-connector]: https://github.com/GoogleCloudPlatform/spark-bigquery-connector/issues/5
[reserved keywords]: https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#reserved-keywords
[mozilla-pipeline-schemas]: https://github.com/mozilla-services/mozilla-pipeline-schemas
And you should now be set up to start working in the repo! The easiest way to do this for many tasks is to use [`bqetl`](https://mozilla.github.io/bigquery-etl/bqetl/). You may also want to read up on [common workflows](https://mozilla.github.io/bigquery-etl/cookbooks/common_workflows/).

docs/bqetl.md

@@ -2,6 +2,8 @@
The `bqetl` command-line tool aims to simplify working with the bigquery-etl repository by supporting common workflows, such as creating, validating and scheduling queries or adding new UDFs.
Running some commands, for example to create or query tables, will [require Mozilla GCP access](https://docs.telemetry.mozilla.org/cookbooks/bigquery/access.html#bigquery-access-request).
## Installation
Follow the [Quick Start](https://github.com/mozilla/bigquery-etl#quick-start) to set up bigquery-etl and the bqetl CLI.

docs/cookbooks/common_workflows.md

@@ -44,6 +44,48 @@ The [Creating derived datasets tutorial](https://mozilla.github.io/bigquery-etl/
1. Deploy schema changes by running `./bqetl query schema deploy <dataset>.<table>_<version>`
1. Merge pull-request
## Formatting SQL
We enforce consistent SQL formatting as part of CI. After adding or changing a
query, use `./bqetl format` to apply formatting rules.
Directories and files passed as arguments to `./bqetl format` will be
formatted in place, with directories recursively searched for files with a
`.sql` extension, e.g.:
```bash
$ echo 'SELECT 1,2,3' > test.sql
$ ./bqetl format test.sql
modified test.sql
1 file(s) modified
$ cat test.sql
SELECT
1,
2,
3
```
If no arguments are specified, the script will read from stdin and write to
stdout, e.g.:
```bash
$ echo 'SELECT 1,2,3' | ./bqetl format
SELECT
1,
2,
3
```
To turn off SQL formatting for a block of SQL, wrap it in `format:off` and
`format:on` comments, like this:
```sql
SELECT
-- format:off
submission_date, sample_id, client_id
-- format:on
```
## Add a new field to clients_daily
Adding a new field to `clients_daily` also means that field has to propagate to several
@@ -112,8 +154,48 @@ The same steps as creating a new UDF apply for creating stored procedures, excep
1. Open a PR
1. PR gets reviewed, approved and merged
## Creating a new BigQuery Dataset
To provision a new BigQuery dataset for holding tables, you'll need to
create a `dataset_metadata.yaml` file, which will cause the dataset to be
deployed automatically a few hours after merging. Changes to existing
datasets (such as changing access policies) may require manual operator approval.
The `bqetl query create` command will automatically generate a skeleton
`dataset_metadata.yaml` file if the query name contains a dataset that
is not yet defined.
See example with commentary for `telemetry_derived`:
```yaml
friendly_name: Telemetry Derived
description: |-
Derived data based on pings from legacy Firefox telemetry, plus many other
general-purpose derived tables
labels: {}
# Base ACL can be:
# "derived" for `_derived` datasets that contain concrete tables
# "view" for user-facing datasets containing virtual views
dataset_base_acl: derived
# Datasets with user-facing set to true will be created both in shared-prod
# and in mozdata; this should be false for all `_derived` datasets
user_facing: false
# Most datasets can have mozilla-confidential access like below,
# but some datasets will be defined with more restricted access
# or with additional access for services.
workgroup_access:
- role: roles/bigquery.dataViewer
members:
- workgroup:mozilla-confidential
```
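As a sketch of the behaviour described above, creating a query in a dataset that is not yet defined also scaffolds the skeleton metadata file; the dataset and owner below are hypothetical.
```bash
# Hypothetical dataset and owner; because the dataset is not yet defined,
# a skeleton dataset_metadata.yaml is generated alongside the new query directory
./bqetl query create my_team_derived.my_table_v1 --owner=example@mozilla.com
```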
## Publishing data
See also the reference for [Public Data](../reference/public_data.md).
1. Get a data review by following the [data publishing process](https://wiki.mozilla.org/Data_Publishing#Dataset_Publishing_Process_2)
1. Update the `metadata.yaml` file of the query to be published
* Set `public_bigquery: true` and optionally `public_json: true`
@@ -124,3 +206,42 @@ The same steps as creating a new UDF apply for creating stored procedures, excep
1. Open a PR
1. PR gets reviewed, approved and merged
* Once the ETL is running, a view referencing the public dataset will be published automatically to `moz-fx-data-shared-prod`
## Adding new Python requirements
When adding a new library to the Python requirements, first add the library to
the requirements and then add any meta-dependencies into constraints.
Constraints are discovered by installing requirements into a fresh virtual
environment. A dependency should be added to either `requirements.txt` or
`constraints.txt`, but not both.
```bash
# Create a python virtual environment (not necessary if you have already
# run `./bqetl bootstrap`)
python3 -m venv venv/
# Activate the virtual environment
source venv/bin/activate
# If not installed:
pip install pip-tools
# Add the dependency to requirements.in e.g. Jinja2.
echo Jinja2==2.11.1 >> requirements.in
# Compile hashes for new dependencies.
pip-compile --generate-hashes requirements.in
# Deactivate the python virtual environment.
deactivate
```
## Making a pull request from a fork
When opening a pull request from a fork, the `manual-trigger-required-for-fork` CI task will
fail and some integration test tasks will be skipped. A user with repository write permissions
will have to run the [Push to upstream workflow](https://github.com/mozilla/bigquery-etl/actions/workflows/push-to-upstream.yml)
and provide the `<username>:<branch>` of the fork as a parameter. The parameter will also show up
in the logs of the `manual-trigger-required-for-fork` CI task, together with more detailed instructions.
Once the workflow has been executed, the PR's CI tasks, including the integration tests, will run.

docs/cookbooks/testing.md (new file, +166)

@@ -0,0 +1,166 @@
# How to Run Tests
This repository uses `pytest`:
```
# create a venv
python3.8 -m venv venv/
# install pip-tools for managing dependencies
./venv/bin/pip install pip-tools -c requirements.in
# install python dependencies with pip-sync (provided by pip-tools)
./venv/bin/pip-sync
# install java dependencies with maven
mvn dependency:copy-dependencies
# run pytest with all linters and 4 workers in parallel
./venv/bin/pytest --black --pydocstyle --flake8 --mypy-ignore-missing-imports -n 4
# use -k to selectively run a set of tests that matches the expression `udf`
./venv/bin/pytest -k udf
# run integration tests with 4 workers in parallel
gcloud auth application-default login # or set GOOGLE_APPLICATION_CREDENTIALS
export GOOGLE_PROJECT_ID=bigquery-etl-integration-test
gcloud config set project $GOOGLE_PROJECT_ID
./venv/bin/pytest -m integration -n 4
```
To provide [authentication credentials for the Google Cloud API](https://cloud.google.com/docs/authentication/getting-started) the `GOOGLE_APPLICATION_CREDENTIALS` environment variable must be set to the file path of the JSON file that contains the service account key.
See [Mozilla BigQuery API Access instructions](https://docs.telemetry.mozilla.org/cookbooks/bigquery.html#gcp-bigquery-api-access) to request credentials if you don't already have them.
## How to Configure a UDF Test
Include a comment like `-- Tests` followed by one or more query statements
after the UDF in the SQL file where it is defined. In a SQL file that defines a
UDF, each statement that does not define a temporary function is collected as a
test and executed independently of the other tests in the file.
Each test must use the UDF and throw an error to fail. Assert functions defined
in `tests/assert/` may be used to evaluate outputs. Tests must not use any
query parameters and should not reference any tables. Each test that is
expected to fail must be preceded by a comment like `#xfail`, similar to a [SQL
dialect prefix] in the BigQuery Cloud Console.
For example:
```sql
CREATE TEMP FUNCTION udf_example(option INT64) AS (
CASE
WHEN option > 0 then TRUE
WHEN option = 0 then FALSE
ELSE ERROR("invalid option")
END
);
-- Tests
SELECT
assert_true(udf_example(1)),
assert_false(udf_example(0));
#xfail
SELECT
udf_example(-1);
#xfail
SELECT
udf_example(NULL);
```
[sql dialect prefix]: https://cloud.google.com/bigquery/docs/reference/standard-sql/enabling-standard-sql#sql-prefix
## How to Configure a Generated Test
1. Make a directory for test resources named `tests/{dataset}/{table}/{test_name}/`,
e.g. `tests/telemetry_derived/clients_last_seen_raw_v1/test_single_day`
- `table` must match a directory named like `{dataset}/{table}`, e.g.
`telemetry_derived/clients_last_seen_v1`
- `test_name` should start with `test_`, e.g. `test_single_day`
- If `test_name` is `test_init` or `test_script`, then the query will run `init.sql`
or `script.sql` respectively; otherwise, the test will run `query.sql`
1. Add `.yaml` files for input tables, e.g. `clients_daily_v6.yaml`
- Include the dataset prefix if it's set in the tested query,
e.g. `analysis.clients_last_seen_v1.yaml`
- This will result in the dataset prefix being removed from the query,
e.g. `query = query.replace("analysis.clients_last_seen_v1", "clients_last_seen_v1")`
1. Add `.sql` files for input view queries, e.g. `main_summary_v4.sql`
- **_Don't_** include a `CREATE ... AS` clause
- Fully qualify table names as `` `{project}.{dataset}.table` ``
- Include the dataset prefix if it's set in the tested query,
e.g. `telemetry.main_summary_v4.sql`
- This will result in the dataset prefix being removed from the query,
e.g. `query = query.replace("telemetry.main_summary_v4", "main_summary_v4")`
1. Add `expect.yaml` to validate the result
- `DATE` and `DATETIME` type columns in the result are coerced to strings
using `.isoformat()`
- Columns named `generated_time` are removed from the result before
comparing to `expect` because they should not be static
1. Optionally add `.schema.json` files for input table schemas, e.g.
`clients_daily_v6.schema.json`
1. Optionally add `query_params.yaml` to define query parameters
- `query_params` must be a list
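Putting the steps above together, a hypothetical test directory might look like the following; the pairing of these particular input files with this table is illustrative only.
```
tests/telemetry_derived/clients_last_seen_raw_v1/test_single_day/
  clients_daily_v6.yaml        # input table rows (array of row objects)
  clients_daily_v6.schema.json # optional input table schema
  query_params.yaml            # optional query parameters
  expect.yaml                  # expected rows produced by query.sql
```
An input file is simply a list of rows, for example:
```yaml
# clients_daily_v6.yaml -- rows and column values are illustrative
- submission_date: "2020-03-15"
  client_id: client-a
- submission_date: "2020-03-15"
  client_id: client-b
```
`expect.yaml` uses the same row-list format for the expected output.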
## Init Tests
Tests of `init.sql` statements are supported, similarly to other generated tests.
Simply name the test `test_init`. The other guidelines still apply.
_Note_: Init SQL statements must contain a create statement with the dataset
and table name, like so:
```
CREATE OR REPLACE TABLE
dataset.table_v1
AS
...
```
## Additional Guidelines and Options
- If the destination table is also an input table then `generated_time` should
be a required `DATETIME` field to ensure minimal validation
- Input table files
- All of the formats supported by `bq load` are supported
- `yaml` and `json` format are supported and must contain an array of rows
which are converted in memory to `ndjson` before loading
- Preferred formats are `yaml` for readability or `ndjson` for compatibility
with `bq load`
- `expect.yaml`
- File extensions `yaml`, `json` and `ndjson` are supported
- Preferred formats are `yaml` for readability or `ndjson` for compatibility
with `bq load`
- Schema files
- Setting the description of a top level field to `time_partitioning_field`
will cause the table to use it for time partitioning
- File extensions `yaml`, `json` and `ndjson` are supported
- Preferred formats are `yaml` for readability or `json` for compatibility
with `bq load`
- Query parameters
- Scalar query params should be defined as a dict with keys `name`, `type` or
`type_`, and `value`
- `query_parameters.yaml` may be used instead of `query_params.yaml`, but
they are mutually exclusive
- File extensions `yaml`, `json` and `ndjson` are supported
- Preferred format is `yaml` for readability
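For example, a minimal `query_params.yaml` following the scalar-parameter convention above might look like this (the parameter names and values are illustrative):
```yaml
- name: submission_date
  type: DATE
  value: "2020-03-15"
- name: n_clients
  type_: INT64
  value: 500
```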
## How to Run CircleCI Locally
- Install the [CircleCI Local CLI](https://circleci.com/docs/2.0/local-cli/)
- Download GCP [service account](https://cloud.google.com/iam/docs/service-accounts) keys
- Integration tests will only successfully run with service account keys
that belong to the `circleci` service account in the `bigquery-etl-integration-test` project
- Run `circleci build` and set required environment variables `GOOGLE_PROJECT_ID` and
`GCLOUD_SERVICE_KEY`:
```
gcloud_service_key=`cat /path/to/key_file.json`
# to run a specific job, e.g. integration:
circleci build --job integration \
--env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \
--env GCLOUD_SERVICE_KEY=$gcloud_service_key
# to run all jobs
circleci build \
--env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \
--env GCLOUD_SERVICE_KEY=$gcloud_service_key
```

mkdocs.yml

@@ -27,12 +27,17 @@ plugins:
- awesome-pages
nav:
- index.md
- Cookbooks:
- Common workflows: cookbooks/common_workflows.md
- Creating a derived dataset: cookbooks/creating_a_derived_dataset.md
- Datasets:
- ... | mozdata/**.md
- UDFs:
- ... | mozfun/**.md
- Cookbooks:
- Common workflows: cookbooks/common_workflows.md
- Creating a derived dataset: cookbooks/creating_a_derived_dataset.md
- Testing: cookbooks/testing.md
- Reference:
- bqetl CLI: bqetl.md
- Recommended practices: reference/recommended_practices.md
- Incremental queries: reference/incremental.md
- Scheduling: reference/scheduling.md
- Public data: reference/public_data.md

docs/reference/incremental.md (new file, +35)

@@ -0,0 +1,35 @@
# Incremental Queries
## Benefits
- BigQuery billing discounts for destination table partitions not modified in
the last 90 days
- May use [dags.utils.gcp.bigquery_etl_query] to simplify airflow configuration
e.g. see [dags.main_summary.exact_mau28_by_dimensions]
- May use [script/generate_incremental_table] to automate backfilling
- Should use `WRITE_TRUNCATE` mode or `bq query --replace` to replace
partitions atomically to prevent duplicate data
- Will have tooling to generate an optimized _mostly materialized view_ that
only calculates the most recent partition
[script/generate_incremental_table]: https://github.com/mozilla/bigquery-etl/blob/main/script/generate_incremental_table
[dags.utils.gcp.bigquery_etl_query]: https://github.com/mozilla/telemetry-airflow/blob/89a6dc3/dags/utils/gcp.py#L364
## Properties
- Must accept a date via `@submission_date` query parameter
- Must output a column named `submission_date` matching the query parameter
- Must produce similar results when run multiple times
- Should produce identical results when run multiple times
- May depend on the previous partition
- If using previous partition, must include an `init.sql` query to initialize the
table, e.g. `sql/moz-fx-data-shared-prod/telemetry_derived/clients_last_seen_v1/init.sql`
- Should be impacted by values from a finite number of preceding partitions
- This allows for backfilling in chunks instead of serially for all time
and limiting backfills to a certain number of days following updated data
- For example `sql/moz-fx-data-shared-prod/clients_last_seen_v1.sql` can be run serially on any 28 day
period and the last day will be the same whether or not the partition
preceding the first day was missing because values are only impacted by
27 preceding days
[dags.main_summary.exact_mau28_by_dimensions]: https://github.com/mozilla/telemetry-airflow/blob/89a6dc3/dags/main_summary.py#L385-L390
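As a minimal sketch of the properties above, an incremental query might look like the following; the table and column names are illustrative.
```sql
-- Accepts @submission_date and outputs a matching submission_date column;
-- filters the input table on its partition column so only one day is scanned.
SELECT
  DATE(submission_timestamp) AS submission_date,
  normalized_channel,
  COUNT(*) AS n_pings
FROM
  telemetry_stable.main_v4
WHERE
  DATE(submission_timestamp) = @submission_date
GROUP BY
  submission_date,
  normalized_channel
```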

docs/reference/public_data.md (new file, +22)

@@ -0,0 +1,22 @@
# Public Data
For background, see [Accessing Public Data](https://docs.telemetry.mozilla.org/cookbooks/public_data.html)
on `docs.telemetry.mozilla.org`.
- To make query results publicly available, the `public_bigquery` flag must be set in
`metadata.yaml`
- Tables will get published in the `mozilla-public-data` GCP project, which is accessible
to everyone, including external users
- To make query results publicly available as JSON, `public_json` flag must be set in
`metadata.yaml`
- Data will be accessible under https://public-data.telemetry.mozilla.org
- A list of all available datasets is published under https://public-data.telemetry.mozilla.org/all-datasets.json
- For example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/000000000000.json
- Output JSON files have a maximum size of 1 GB; data may be split across multiple files (`000000000000.json`, `000000000001.json`, ...)
- `incremental_export` controls how data should be exported as JSON:
- `false`: all data of the source table gets exported to a single location
- https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/000000000000.json
- `true`: only data that matches the `submission_date` parameter is exported as JSON to a separate directory for this date
- https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/2020-03-15/000000000000.json
- For each dataset, a `metadata.json` gets published listing all available files, for example: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/files/metadata.json
- The timestamp when the dataset was last updated is recorded in `last_updated`, e.g.: https://public-data.telemetry.mozilla.org/api/v1/tables/telemetry_derived/ssl_ratios/v1/last_updated
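For reference, the relevant fragment of a query's `metadata.yaml` might look like the sketch below; the values are illustrative, and the full file format is shown in the Query Metadata example in the recommended practices reference.
```yaml
public_bigquery: true
public_json: true
incremental_export: false  # export all data to a single location
review_bugs:
  - 1414839                # Bugzilla bug ID of the data review
```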

docs/reference/recommended_practices.md (new file, +121)

@@ -0,0 +1,121 @@
# Recommended practices
## Queries
- Should be defined in files named as `sql/<project>/<dataset>/<table>_<version>/query.sql` e.g.
  `sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v7/query.sql`
  - `<project>` defines both where the destination table resides and in which project the query job runs
- Queries that populate tables should always be named with a version suffix;
we assume that future optimizations to the data representation may require
schema-incompatible changes such as dropping columns
- May be generated using a python script that prints the query to stdout
- Should save output as `sql/<project>/<dataset>/<table>_<version>/query.sql` as above
- Should be named as `sql/<project>/query_type.sql.py` e.g. `sql/moz-fx-data-shared-prod/clients_daily.sql.py`
- May use options to generate queries for different destination tables e.g.
using `--source telemetry_core_parquet_v3` to generate
`sql/moz-fx-data-shared-prod/telemetry/core_clients_daily_v1/query.sql` and using `--source main_summary_v4` to
generate `sql/moz-fx-data-shared-prod/telemetry/clients_daily_v7/query.sql`
- Should output a header indicating options used e.g.
```sql
-- Query generated by: sql/moz-fx-data-shared-prod/clients_daily.sql.py --source telemetry_core_parquet
```
- Should not specify a project or dataset in table names to simplify testing
- Should be [incremental](./incremental.md)
- Should filter input tables on partition and clustering columns
- Should use `_` prefix in generated column names not meant for output
- Should use `_bits` suffix for any integer column that represents a bit pattern
- Should not use `DATETIME` type, due to incompatibility with
[spark-bigquery-connector]
- Should read from `*_stable` tables instead of including custom deduplication
- Should use the earliest row for each `document_id` by `submission_timestamp`
where filtering duplicates is necessary
- Should escape identifiers that match keywords, even if they aren't [reserved keywords]
[spark-bigquery-connector]: https://github.com/GoogleCloudPlatform/spark-bigquery-connector/issues/5
[reserved keywords]: https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#reserved-keywords
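Where deduplication cannot be avoided, a minimal sketch of the "earliest row per `document_id`" pattern referenced above looks like this; the source table name is illustrative.
```sql
WITH numbered AS (
  SELECT
    *,
    -- `_` prefix marks a generated column that is not meant for output
    ROW_NUMBER() OVER (
      PARTITION BY document_id
      ORDER BY submission_timestamp
    ) AS _n
  FROM
    telemetry_live.main_v4  -- illustrative; prefer reading from *_stable tables
  WHERE
    DATE(submission_timestamp) = @submission_date
)
SELECT
  * EXCEPT (_n)
FROM
  numbered
WHERE
  _n = 1
```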
## Query Metadata
- For each query, a `metadata.yaml` file should be created in the same directory
- This file contains a description, owners and labels. As an example:
```yaml
friendly_name: SSL Ratios
description: >
Percentages of page loads Firefox users have performed that were
conducted over SSL broken down by country.
owners:
- example@mozilla.com
labels:
application: firefox
incremental: true # incremental queries add data to existing tables
schedule: daily # scheduled in Airflow to run daily
public_json: true
public_bigquery: true
review_bugs:
- 1414839 # Bugzilla bug ID of data review
incremental_export: false # non-incremental JSON export writes all data to a single location
```
## Views
- Should be defined in files named as `sql/<project>/<dataset>/<table>/view.sql` e.g.
`sql/moz-fx-data-shared-prod/telemetry/core/view.sql`
- Views should generally _not_ be named with a version suffix; a view represents a
stable interface for users and whenever possible should maintain compatibility
with existing queries; if the view logic cannot be adapted to changes in underlying
tables, breaking changes must be communicated to `fx-data-dev@mozilla.org`
- Must specify project and dataset in all table names
- Should default to using the `moz-fx-data-shared-prod` project;
the `scripts/publish_views` tooling can handle parsing the definitions to publish
to other projects such as `derived-datasets`
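A minimal `view.sql` following these conventions might look like the sketch below; the underlying stable table is illustrative.
```sql
CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.telemetry.core`
AS
SELECT
  *
FROM
  `moz-fx-data-shared-prod.telemetry_stable.core_v10`
```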
## UDFs
- Should limit the number of [expression subqueries] to avoid: `BigQuery error in query operation: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.`
- Should be used to avoid code duplication
- Must be named in files with lower snake case names ending in `.sql`
e.g. `mode_last.sql`
- Each file must only define effectively private helper functions and one
public function which must be defined last
- Helper functions must not conflict with function names in other files
- SQL UDFs must be defined in the `udf/` directory and JS UDFs must be defined
in the `udf_js` directory
- The `udf_legacy/` directory is an exception which must only contain
compatibility functions for queries migrated from Athena/Presto.
- Functions must be defined as [persistent UDFs](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions#temporary-udf-syntax)
using `CREATE OR REPLACE FUNCTION` syntax
- Function names must be prefixed with a dataset of `<dir_name>.` so, for example,
all functions in `udf/*.sql` are part of the `udf` dataset
- The final syntax for creating a function in a file will look like
`CREATE OR REPLACE FUNCTION <dir_name>.<file_name>`
- We provide tooling in `scripts/publish_persistent_udfs` for
publishing these UDFs to BigQuery
- Changes made to UDFs need to be published manually in order for the
dry run CI task to pass
- Should use `SQL` over `js` for performance
[expression subqueries]: https://cloud.google.com/bigquery/docs/reference/standard-sql/expression_subqueries
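For example, a file `udf/example_is_positive.sql` (a hypothetical function) defining a single persistent public function would look roughly like:
```sql
-- Function name is prefixed with the directory name as its dataset
CREATE OR REPLACE FUNCTION udf.example_is_positive(x INT64) AS (
  x IS NOT NULL
  AND x > 0
);
```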
## Backfills
- Should be avoided on large tables
- Backfills may double storage cost for a table for 90 days by moving
data from long-term storage to short-term storage
- For example regenerating `clients_last_seen_v1` from scratch would cost
about $1600 for the query and about $6800 for data moved to short-term
storage
- Should combine multiple backfills happening around the same time
- Should delay column deletes until the next other backfill
- Should use `NULL` for new data and `EXCEPT` to exclude from views until
dropped
- Should use copy operations in append mode to change column order
- Copy operations do not allow changing partitioning, changing clustering, or
column deletes
- Should split backfilling into queries that finish in minutes not hours
- May use [script/generate_incremental_table] to automate backfilling incremental
queries
- May be performed in a single query for smaller tables that do not depend on history
- A useful pattern is to have the only reference to `@submission_date` be a
clause `WHERE (@submission_date IS NULL OR @submission_date = submission_date)`
which allows recreating all dates by passing `--parameter=submission_date:DATE:NULL`
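Two illustrative `bq` invocations for the patterns above; the table, paths and dates are hypothetical and should be adapted to the query being backfilled.
```bash
# Backfill a single day by replacing its partition atomically
bq query --use_legacy_sql=false --replace \
  --destination_table='moz-fx-data-shared-prod:telemetry_derived.ssl_ratios_v1$20200315' \
  --parameter=submission_date:DATE:2020-03-15 \
  < sql/moz-fx-data-shared-prod/telemetry_derived/ssl_ratios_v1/query.sql

# Recreate all dates at once for a small table whose query guards on
# (@submission_date IS NULL OR @submission_date = submission_date)
bq query --use_legacy_sql=false --replace \
  --destination_table=moz-fx-data-shared-prod:telemetry_derived.ssl_ratios_v1 \
  --parameter=submission_date:DATE:NULL \
  < sql/moz-fx-data-shared-prod/telemetry_derived/ssl_ratios_v1/query.sql
```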

docs/reference/scheduling.md (new file, +54)

@@ -0,0 +1,54 @@
# Scheduling Queries in Airflow
- bigquery-etl has tooling to automatically generate Airflow DAGs for scheduling queries
- To be scheduled, a query must be assigned to a DAG that is specified in `dags.yaml`
- New DAGs can be configured in `dags.yaml`, e.g., by adding the following:
```yaml
bqetl_ssl_ratios: # name of the DAG; must start with bqetl_
schedule_interval: 0 2 * * * # query schedule
description: The DAG schedules SSL ratios queries.
default_args:
owner: example@mozilla.com
start_date: "2020-04-05" # YYYY-MM-DD
email: ["example@mozilla.com"]
retries: 2 # number of retries if the query execution fails
retry_delay: 30m
```
- All DAG names need to have `bqetl_` as prefix.
- `schedule_interval` is either defined as a [CRON expression](https://en.wikipedia.org/wiki/Cron) or alternatively as one of the following [CRON presets](https://airflow.readthedocs.io/en/latest/dag-run.html): `once`, `hourly`, `daily`, `weekly`, `monthly`
- `start_date` defines the first date for which the query should be executed
- Airflow will not automatically backfill older dates if `start_date` is set in the past; backfilling can be done via the Airflow web interface
- `email` lists email addresses alerts should be sent to in case of failures when running the query
- Alternatively, new DAGs can also be created via the `bqetl` CLI by running `bqetl dag create bqetl_ssl_ratios --schedule_interval='0 2 * * *' --owner="example@mozilla.com" --start_date="2020-04-05" --description="This DAG generates SSL ratios."`
- To schedule a specific query, add a `metadata.yaml` file that includes a `scheduling` section, for example:
```yaml
friendly_name: SSL ratios
# ... more metadata, see Query Metadata section above
scheduling:
dag_name: bqetl_ssl_ratios
```
- Additional scheduling options:
- `depends_on_past` keeps query from getting executed if the previous schedule for the query hasn't succeeded
- `date_partition_parameter` - by default set to `submission_date`; can be set to `null` if query doesn't write to a partitioned table
- `parameters` specifies a list of query parameters, e.g. `["n_clients:INT64:500"]`
- `arguments` - a list of arguments passed when running the query, for example: `["--append_table"]`
- `referenced_tables` - manually curated list of tables the query depends on; used to speed up the DAG generation process or to specify tables that the dry run doesn't have permissions to access, e.g. `[['telemetry_stable', 'main_v4']]`
- `multipart` indicates whether a query is split over multiple files `part1.sql`, `part2.sql`, ...
- `depends_on` defines external dependencies in telemetry-airflow that are not detected automatically:
```yaml
depends_on:
- task_id: external_task
dag_name: external_dag
execution_delta: 1h
```
- `task_id`: name of task query depends on
- `dag_name`: name of the DAG the external task is part of
- `execution_delta`: time difference between the `schedule_intervals` of the external DAG and the DAG the query is part of
- `destination_table`: The table to write to. If unspecified, defaults to the query destination; if None, no destination table is used (the query is simply run as-is). Note that if no destination table is specified, you will need to specify the `submission_date` parameter manually
- Queries can also be scheduled using the `bqetl` CLI: `./bqetl query schedule path/to/query_v1 --dag bqetl_ssl_ratios `
- To generate all Airflow DAGs run `./script/generate_airflow_dags` or `./bqetl dag generate`
- Generated DAGs are located in the `dags/` directory
- Dependencies between queries scheduled in bigquery-etl and dependencies to stable tables are detected automatically
- Specific DAGs can be generated by running `./bqetl dag generate bqetl_ssl_ratios`
- Generated DAGs will be automatically detected and scheduled by Airflow
- It might take up to 10 minutes for new DAGs and updates to show up in the Airflow UI

tests/README.md

@@ -1,173 +1 @@
How to Run Tests
===
This repository uses `pytest`:
```
# create a venv
python3.8 -m venv venv/
# install pip-tools for managing dependencies
./venv/bin/pip install pip-tools -c requirements.in
# install python dependencies with pip-sync (provided by pip-tools)
./venv/bin/pip-sync
# install java dependencies with maven
mvn dependency:copy-dependencies
# run pytest with all linters and 4 workers in parallel
./venv/bin/pytest --black --pydocstyle --flake8 --mypy-ignore-missing-imports -n 4
# use -k to selectively run a set of tests that matches the expression `udf`
./venv/bin/pytest -k udf
# run integration tests with 4 workers in parallel
gcloud auth application-default login # or set GOOGLE_APPLICATION_CREDENTIALS
export GOOGLE_PROJECT_ID=bigquery-etl-integration-test
gcloud config set project $GOOGLE_PROJECT_ID
./venv/bin/pytest -m integration -n 4
```
To provide [authentication credentials for the Google Cloud API](https://cloud.google.com/docs/authentication/getting-started) the `GOOGLE_APPLICATION_CREDENTIALS` environment variable must be set to the file path of the JSON file that contains the service account key.
See [Mozilla BigQuery API Access instructions](https://docs.telemetry.mozilla.org/cookbooks/bigquery.html#gcp-bigquery-api-access) to request credentials if you don't already have them.
How to Configure a UDF Test
===
Include a comment like `-- Tests` followed by one or more query statements
after the UDF in the SQL file where it is defined. In a SQL file that defines a
UDF, each statement that does not define a temporary function is collected as a
test and executed independently of the other tests in the file.
Each test must use the UDF and throw an error to fail. Assert functions defined
in `tests/assert/` may be used to evaluate outputs. Tests must not use any
query parameters and should not reference any tables. Each test that is
expected to fail must be preceded by a comment like `#xfail`, similar to a [SQL
dialect prefix] in the BigQuery Cloud Console.
For example:
```sql
CREATE TEMP FUNCTION udf_example(option INT64) AS (
CASE
WHEN option > 0 then TRUE
WHEN option = 0 then FALSE
ELSE ERROR("invalid option")
END
);
-- Tests
SELECT
assert_true(udf_example(1)),
assert_false(udf_example(0));
#xfail
SELECT
udf_example(-1);
#xfail
SELECT
udf_example(NULL);
```
[SQL dialect prefix]: https://cloud.google.com/bigquery/docs/reference/standard-sql/enabling-standard-sql#sql-prefix
How to Configure a Generated Test
===
1. Make a directory for test resources named `tests/{dataset}/{table}/{test_name}/`,
e.g. `tests/telemetry_derived/clients_last_seen_raw_v1/test_single_day`
- `table` must match a directory named like `{dataset}/{table}`, e.g.
`telemetry_derived/clients_last_seen_v1`
- `test_name` should start with `test_`, e.g. `test_single_day`
- If `test_name` is `test_init` or `test_script`, then the query will run `init.sql`
or `script.sql` respectively; otherwise, the test will run `query.sql`
1. Add `.yaml` files for input tables, e.g. `clients_daily_v6.yaml`
- Include the dataset prefix if it's set in the tested query,
e.g. `analysis.clients_last_seen_v1.yaml`
- This will result in the dataset prefix being removed from the query,
e.g. `query = query.replace("analysis.clients_last_seen_v1",
"clients_last_seen_v1")`
1. Add `.sql` files for input view queries, e.g. `main_summary_v4.sql`
- ***Don't*** include a `CREATE ... AS` clause
- Fully qualify table names as ``` `{project}.{dataset}.table` ```
- Include the dataset prefix if it's set in the tested query,
e.g. `telemetry.main_summary_v4.sql`
- This will result in the dataset prefix being removed from the query,
e.g. `query = query.replace("telemetry.main_summary_v4",
"main_summary_v4")`
1. Add `expect.yaml` to validate the result
- `DATE` and `DATETIME` type columns in the result are coerced to strings
using `.isoformat()`
- Columns named `generated_time` are removed from the result before
comparing to `expect` because they should not be static
1. Optionally add `.schema.json` files for input table schemas, e.g.
`clients_daily_v6.schema.json`
1. Optionally add `query_params.yaml` to define query parameters
- `query_params` must be a list
Init Tests
===
Tests of `init.sql` statements are supported, similarly to other generated tests.
Simply name the test `test_init`. The other guidelines still apply.
*Note*: Init SQL statements must contain a create statement with the dataset
and table name, like so:
```
CREATE OR REPLACE TABLE
dataset.table_v1
AS
...
```
Additional Guidelines and Options
---
- If the destination table is also an input table then `generated_time` should
be a required `DATETIME` field to ensure minimal validation
- Input table files
- All of the formats supported by `bq load` are supported
- `yaml` and `json` format are supported and must contain an array of rows
which are converted in memory to `ndjson` before loading
- Preferred formats are `yaml` for readability or `ndjson` for compatiblity
with `bq load`
- `expect.yaml`
- File extensions `yaml`, `json` and `ndjson` are supported
- Preferred formats are `yaml` for readability or `ndjson` for compatibility
with `bq load`
- Schema files
- Setting the description of a top level field to `time_partitioning_field`
will cause the table to use it for time partitioning
- File extensions `yaml`, `json` and `ndjson` are supported
- Preferred formats are `yaml` for readability or `json` for compatibility
with `bq load`
- Query parameters
- Scalar query params should be defined as a dict with keys `name`, `type` or
`type_`, and `value`
- `query_parameters.yaml` may be used instead of `query_params.yaml`, but
they are mutually exclusive
- File extensions `yaml`, `json` and `ndjson` are supported
- Preferred format is `yaml` for readability
How to Run CircleCI Locally
===
- Install the [CircleCI Local CLI](https://circleci.com/docs/2.0/local-cli/)
- Download GCP [service account](https://cloud.google.com/iam/docs/service-accounts) keys
- Integration tests will only successfully run with service account keys
that belong to the `circleci` service account in the `bigquery-etl-integration-test` project
- Run `circleci build` and set required environment variables `GOOGLE_PROJECT_ID` and
`GCLOUD_SERVICE_KEY`:
```
gcloud_service_key=`cat /path/to/key_file.json`
# to run a specific job, e.g. integration:
circleci build --job integration \
--env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \
--env GCLOUD_SERVICE_KEY=$gcloud_service_key
# to run all jobs
circleci build \
--env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \
--env GCLOUD_SERVICE_KEY=$gcloud_service_key
```
For information on how to run tests, see https://mozilla.github.io/bigquery-etl/cookbooks/testing/