BigQuery ETL

BigQuery UDFs and SQL queries for building derived datasets.

Formatting SQL

We enforce consistent SQL formatting as part of CI. After adding or changing a query, use script/format_sql to apply formatting rules.

Directories and files passed as arguments to script/format_sql will be formatted in place, with directories recursively searched for files with a .sql extension, e.g.:

$ echo 'SELECT 1,2,3' > test.sql
$ script/format_sql test.sql
modified test.sql
1 file(s) modified
$ cat test.sql
SELECT
  1,
  2,
  3

If no arguments are specified the script will read from stdin and write to stdout, e.g.:

$ echo 'SELECT 1,2,3' | script/format_sql
SELECT
  1,
  2,
  3

To turn off SQL formatting for a block of SQL, wrap it in format:off and format:on comments, like this:

SELECT
  -- format:off
  submission_date, sample_id, client_id
  -- format:on

Queries

  • Should be defined in files named as sql/<dataset>/<table>_<version>/query.sql e.g. sql/telemetry_derived/clients_daily_v7/query.sql
    • Queries that populate tables should always be named with a version suffix; we assume that future optimizations to the data representation may require schema-incompatible changes such as dropping columns
  • May be generated using a python script that prints the query to stdout
    • Should save output as sql/<dataset>/<table>_<version>/query.sql as above
    • Should be named as sql/query_type.sql.py e.g. sql/clients_daily.sql.py
    • May use options to generate queries for different destination tables e.g. using --source telemetry_core_parquet_v3 to generate sql/telemetry/core_clients_daily_v1/query.sql and using --source main_summary_v4 to generate sql/telemetry/clients_daily_v7/query.sql
    • Should output a header indicating options used e.g.
      -- Query generated by: sql/clients_daily.sql.py --source telemetry_core_parquet
      
  • Should not specify a project or dataset in table names to simplify testing
  • Should be incremental
  • Should filter input tables on partition and clustering columns
  • Should use _ prefix in generated column names not meant for output
  • Should use _bits suffix for any integer column that represents a bit pattern
  • Should not use DATETIME type, due to incompatibility with spark-bigquery-connector
  • Should read from *_stable tables instead of including custom deduplication
    • Should use the earliest row for each document_id by submission_timestamp where filtering duplicates is necessary
  • Should escape identifiers that match keywords, even if they aren't reserved keywords
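The deduplication guidance above can be sketched as follows; table and column names other than document_id and submission_timestamp are hypothetical, and in practice reading from a *_stable table is preferred to writing this by hand:

SELECT
  * EXCEPT (_n)
FROM
  (
    SELECT
      *,
      -- _ prefix marks a generated column not meant for output
      ROW_NUMBER() OVER (
        PARTITION BY document_id
        ORDER BY submission_timestamp
      ) AS _n
    FROM
      example_v1  -- hypothetical table name
    WHERE
      DATE(submission_timestamp) = @submission_date
  )
WHERE
  _n = 1

This keeps the earliest row for each document_id by submission_timestamp while also filtering the input on the partition column.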

Views

  • Should be defined in files named as sql/<dataset>/<table>/view.sql e.g. sql/telemetry/core/view.sql
    • Views should generally not be named with a version suffix; a view represents a stable interface for users and whenever possible should maintain compatibility with existing queries; if the view logic cannot be adapted to changes in underlying tables, breaking changes must be communicated to fx-data-dev@mozilla.org
  • Must specify project and dataset in all table names
    • Should default to using the moz-fx-data-shared-prod project; the scripts/publish_views tooling can handle parsing the definitions to publish to other projects such as derived-datasets
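A minimal sketch of a view definition following these rules, with a hypothetical underlying table name, might look like:

CREATE OR REPLACE VIEW
  `moz-fx-data-shared-prod.telemetry.core`
AS
SELECT
  *
FROM
  -- Fully qualified project and dataset, as required above;
  -- the underlying table name here is hypothetical.
  `moz-fx-data-shared-prod.telemetry_derived.core_example_v1`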

UDFs

  • Should limit the number of expression subqueries, which can otherwise cause: BigQuery error in query operation: Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.
  • Should be used to avoid code duplication
  • Must be named in files with lower snake case names ending in .sql e.g. mode_last.sql
    • Each file must only define effectively private helper functions and one public function which must be defined last
      • Helper functions must not conflict with function names in other files
    • SQL UDFs must be defined in the udf/ directory and JS UDFs must be defined in the udf_js/ directory
      • The udf_legacy/ directory is an exception which must only contain compatibility functions for queries migrated from Athena/Presto.
  • Functions must be defined as persistent UDFs using CREATE OR REPLACE FUNCTION syntax
    • Function names must be prefixed with the dataset <dir_name>, so, for example, all functions in udf/*.sql are part of the udf dataset
      • The final syntax for creating a function in a file will look like CREATE OR REPLACE FUNCTION <dir_name>.<file_name>
    • We provide tooling in scripts/publish_persistent_udfs for publishing these UDFs to BigQuery
      • Changes made to UDFs need to be published manually in order for the dry run CI task to pass
  • Should prefer SQL over JavaScript for performance
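Putting these rules together, a hypothetical file udf/example_bitcount.sql defining one public persistent function might look like:

-- Hypothetical sketch: the function name is <dir_name>.<file_name>,
-- so a file in udf/ named example_bitcount.sql defines:
CREATE OR REPLACE FUNCTION
  udf.example_bitcount(bits INT64) AS (
    BIT_COUNT(bits)
  );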

Backfills

  • Should be avoided on large tables
    • Backfills may double storage cost for a table for 90 days by moving data from long-term storage to short-term storage
      • For example regenerating clients_last_seen_v1 from scratch would cost about $1600 for the query and about $6800 for data moved to short-term storage
    • Should combine multiple backfills happening around the same time
    • Should delay column deletes until the next backfill that is needed for other reasons
      • Should use NULL for new data and EXCEPT to exclude from views until dropped
  • Should use copy operations in append mode to change column order
    • Copy operations do not allow changing partitioning, changing clustering, or column deletes
  • Should split backfilling into queries that finish in minutes not hours
  • May use script/generate_incremental_table to automate backfilling incremental queries
  • May be performed in a single query for smaller tables that do not depend on history
    • A useful pattern is to have the only reference to @submission_date be a clause WHERE (@submission_date IS NULL OR @submission_date = submission_date) which allows recreating all dates by passing --parameter=submission_date:DATE:NULL
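The single-query backfill pattern described above can be sketched as follows, with a hypothetical table name; passing --parameter=submission_date:DATE:NULL then recreates all dates at once:

SELECT
  *
FROM
  example_v1  -- hypothetical table name
WHERE
  -- The only reference to @submission_date: NULL selects all dates,
  -- a concrete date restricts the query to one partition.
  (@submission_date IS NULL OR @submission_date = submission_date)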

Incremental Queries

Benefits

Properties

  • Must accept a date via @submission_date query parameter
    • Must output a column named submission_date matching the query parameter
  • Must produce similar results when run multiple times
    • Should produce identical results when run multiple times
  • May depend on the previous partition
    • If using previous partition, must include an init.sql query to initialize the table, e.g. sql/telemetry_derived/clients_last_seen_v1/init.sql
    • Should be impacted by values from a finite number of preceding partitions
      • This allows for backfilling in chunks instead of serially for all time and limiting backfills to a certain number of days following updated data
      • For example sql/clients_last_seen_v1.sql can be run serially on any 28 day period and the last day will be the same whether or not the partition preceding the first day was missing because values are only impacted by 27 preceding days
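A minimal skeleton satisfying these properties, with hypothetical table and column names, might look like:

SELECT
  -- Output a submission_date column matching the query parameter.
  @submission_date AS submission_date,
  client_id,
  COUNT(*) AS n_pings
FROM
  example_v1  -- hypothetical table name
WHERE
  -- Accept the date via the @submission_date parameter and
  -- filter on the partition column.
  DATE(submission_timestamp) = @submission_date
GROUP BY
  client_id

Because each run only reads and writes one date's partition, the query produces identical results when run multiple times for the same date.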

Query Metadata

  • For each query, a metadata.yaml file should be created in the same directory
  • This file contains a description, owners, and labels. As an example:
friendly_name: SSL Ratios
description: >
  Percentages of page loads Firefox users have performed that were 
  conducted over SSL broken down by country.  
owners:
  - example@mozilla.com
labels:
  application: firefox
  incremental: true     # incremental queries add data to existing tables
  schedule: daily       # scheduled in Airflow to run daily
  public_json: true
  public_bigquery: true
  review_bug: 1414839   # Bugzilla bug ID of data review
  incremental_export: false  # non-incremental JSON export writes all data to a single location

Publishing Datasets

Scheduling Queries in Airflow

Instructions for scheduling queries in Airflow can be found in this cookbook.

Contributing

When adding or modifying a query in this repository, make your changes in the sql/ directory.

When adding a new library to the Python requirements, first add the library to the requirements and then add any meta-dependencies into constraints. Constraints are discovered by installing requirements into a fresh virtual environment. A dependency should be added to either requirements.txt or constraints.txt, but not both.

# Create and activate a python virtual environment.
python3 -m venv venv/
source venv/bin/activate

# If not installed:
pip install pip-tools

# Add the dependency to requirements.in e.g. Jinja2.
echo Jinja2==2.11.1 >> requirements.in

# Compile hashes for new dependencies.
pip-compile --generate-hashes requirements.in

# Deactivate the python virtual environment.
deactivate

Tests

See the documentation in tests/