Bigquery ETL
Daniel Thorn 51378e8049
Use days_since_* and new table name (#41)
2019-03-28 10:39:14 -07:00
.circleci Add first test (#9) 2019-03-07 12:43:21 -08:00
sql Use days_since_* and new table name (#41) 2019-03-28 10:39:14 -07:00
tests Use days_since_* and new table name (#41) 2019-03-28 10:39:14 -07:00
udf Rename get_key to udf_get_key (#39) 2019-03-26 10:44:49 -07:00
.flake8 Add first test (#9) 2019-03-07 12:43:21 -08:00
.gitignore Add first test (#9) 2019-03-07 12:43:21 -08:00
README.md Use days_since_* and new table name (#41) 2019-03-28 10:39:14 -07:00
constraints.txt Bump more-itertools from 6.0.0 to 7.0.0 (#44) 2019-03-28 10:29:22 -07:00
pytest.ini Add first test (#9) 2019-03-07 12:43:21 -08:00
requirements.txt Bump pytest-xdist from 1.26.1 to 1.27.0 (#31) 2019-03-21 09:35:41 -07:00

README.md

BigQuery ETL

BigQuery UDFs and SQL queries for building derived datasets.

Recommended practices

  • Queries
    • Should be defined in files named sql/table_version.sql, e.g. sql/clients_daily_v6.sql
    • Should not specify a project or dataset in table names, which simplifies testing
    • Should be incremental
    • Should filter input tables on partition and clustering columns
    • Should use a _ prefix for generated column names not meant for output
    • Should not use Jinja templating on the query file in Airflow
  • UDFs
    • Should be used to avoid code duplication
    • Should use lower snake case names with a udf_ prefix, e.g. udf_mode_last
    • Should be defined in files named udfs/function.{sql,js}, e.g. udfs/udf_mode_last.sql
    • Should prefer SQL over JavaScript for performance
    • Must not be used for incremental queries with a mostly materialized view (defined below)
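
The naming practices above can be checked mechanically. The following is a minimal sketch, not part of this repository: the regular expressions and the function name `follows_naming_practice` are assumptions that encode one reading of "sql/table_version.sql" and "udfs/function.{sql,js}" as written in the list.

```python
import re

# Hypothetical patterns: one interpretation of the naming practices listed above.
# Queries: sql/<table>_v<version>.sql, e.g. sql/clients_daily_v6.sql
QUERY_FILE = re.compile(r"^sql/[a-z][a-z0-9_]*_v[0-9]+\.sql$")
# UDFs: udfs/udf_<name>.sql or udfs/udf_<name>.js, e.g. udfs/udf_mode_last.sql
UDF_FILE = re.compile(r"^udfs/udf_[a-z][a-z0-9_]*\.(sql|js)$")


def follows_naming_practice(path: str) -> bool:
    """Return True if path matches the query or UDF file naming pattern."""
    return bool(QUERY_FILE.match(path) or UDF_FILE.match(path))
```

For instance, `follows_naming_practice("sql/clients_daily_v6.sql")` is true, while a query file without a version suffix such as `sql/clients_daily.sql` is rejected.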

Incremental Queries

Incremental queries have these benefits:

  • BigQuery billing discounts for destination table partitions not modified in the last 90 days
  • Require less Airflow configuration
  • Will have tooling to automate backfilling
  • Will have tooling to replace partitions atomically to prevent duplicate data
  • Will have tooling to generate an optimized mostly materialized view that only calculates the most recent partition
    • Note: incompatible with UDFs, which are not allowed in views
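
The tooling for replacing partitions is described above as future work, so its design is not known. One common approach in BigQuery, assumed here purely for illustration, is to write query results to a partition decorator (`table$YYYYMMDD`) with a truncating write, so that a rerun replaces a single day rather than appending to it. The helper name `partition_destination` is hypothetical.

```python
from datetime import date


def partition_destination(table: str, submission_date: date) -> str:
    """Return the table name with a BigQuery partition decorator for one day.

    Assumption: the destination table is partitioned by day, so the decorator
    format is <table>$YYYYMMDD.
    """
    return f"{table}${submission_date:%Y%m%d}"
```

Writing to `partition_destination("clients_daily_v6", date(2019, 3, 28))` would target only the 2019-03-28 partition of that table.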

Incremental queries have these properties:

  • Must accept a date via @submission_date query parameter
    • Must output a column named submission_date matching the query parameter
  • Must produce similar results when run multiple times
    • Should produce identical results when run multiple times
  • May depend on the previous partition
    • If using the previous partition, must include a .init.sql query to initialize the table
    • Should be impacted by values from a finite number of preceding partitions
      • This allows backfilling in chunks instead of serially for all time, and limits backfills to a certain number of days following updated data
      • For example, sql/clients_last_seen_v1.sql can be run serially on any 28-day period, and the last day will be the same whether or not the partition preceding the first day was missing, because values are only impacted by the 27 preceding days
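
The finite-window property in the last example can be illustrated with a toy model. This sketch does not reproduce sql/clients_last_seen_v1.sql; it assumes a simplified table that maps each client to a `days_since_seen` counter and drops clients not seen within 28 days. The names `next_partition` and `run_serially` are hypothetical.

```python
from datetime import timedelta

WINDOW_DAYS = 28  # the current day plus the 27 preceding days


def next_partition(previous, active_today):
    """Build one day's partition from the previous partition and today's activity.

    previous: dict of client_id -> days_since_seen (empty if the partition is missing)
    active_today: set of client ids seen on the current day
    """
    current = {}
    for client_id, days in previous.items():
        if days + 1 < WINDOW_DAYS:  # clients outside the window are dropped
            current[client_id] = days + 1
    for client_id in active_today:
        current[client_id] = 0  # seen today, so the counter resets
    return current


def run_serially(activity, start, end, initial=None):
    """Run the toy query for each day from start to end; return the last partition."""
    partition = dict(initial or {})
    day = start
    while day <= end:
        partition = next_partition(partition, activity.get(day, set()))
        day += timedelta(days=1)
    return partition
```

In this model, running any 28 consecutive days serially gives the same final partition whether `initial` holds the true preceding partition or is empty, because every counter inherited from `initial` ages out of the window by the last day. Over a shorter period the two runs can differ, which is why chunked backfills need chunks that cover the full window.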

Tests

See the documentation in tests/