Bigquery ETL

Daniel Thorn 51378e8049 Use days_since_* and new table name (#41)		2019-03-28 10:39:14 -07:00
.circleci	Add first test (#9)	2019-03-07 12:43:21 -08:00
sql	Use days_since_* and new table name (#41)	2019-03-28 10:39:14 -07:00
tests	Use days_since_* and new table name (#41)	2019-03-28 10:39:14 -07:00
udf	Rename get_key to udf_get_key (#39)	2019-03-26 10:44:49 -07:00
.flake8	Add first test (#9)	2019-03-07 12:43:21 -08:00
.gitignore	Add first test (#9)	2019-03-07 12:43:21 -08:00
README.md	Use days_since_* and new table name (#41)	2019-03-28 10:39:14 -07:00
constraints.txt	Bump more-itertools from 6.0.0 to 7.0.0 (#44)	2019-03-28 10:29:22 -07:00
pytest.ini	Add first test (#9)	2019-03-07 12:43:21 -08:00
requirements.txt	Bump pytest-xdist from 1.26.1 to 1.27.0 (#31)	2019-03-21 09:35:41 -07:00

BigQuery ETL

BigQuery UDFs and SQL queries for building derived datasets.

Recommended practices

Incremental queries have these benefits:

- BigQuery billing discounts for destination table partitions not modified in the last 90 days
- Require less Airflow configuration
- Will have tooling to automate backfilling
- Will have tooling to replace partitions atomically to prevent duplicate data
- Will have tooling to generate an optimized, mostly materialized view that only calculates the most recent partition
  - Note: incompatible with UDFs, which are not allowed in views
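
As a rough illustration of what replacing a single partition could look like, the sketch below builds a `bq query` command line that overwrites one day of a date-partitioned table using a `$YYYYMMDD` partition decorator. It is not the tooling this list describes as planned. The function name `bq_replace_partition_args`, the dataset name `analysis`, and the table name `clients_last_seen_v1` are assumptions made for the example.

```python
from datetime import date


def bq_replace_partition_args(dataset, table, submission_date):
    """Build bq CLI arguments that overwrite exactly one date partition.

    Hypothetical helper for illustration only; it is not part of this
    repository.
    """
    # A partition decorator ($YYYYMMDD) targets a single day of the table.
    partition = submission_date.strftime("%Y%m%d")
    return [
        "bq",
        "query",
        "--use_legacy_sql=false",
        # --replace overwrites the destination instead of appending, so
        # rerunning the same day does not duplicate rows.
        "--replace",
        f"--destination_table={dataset}.{table}${partition}",
        # Pass the same date to the query as a named DATE parameter.
        f"--parameter=submission_date:DATE:{submission_date.isoformat()}",
    ]


# Dataset and table names are assumed for the example.
args = bq_replace_partition_args(
    "analysis", "clients_last_seen_v1", date(2019, 3, 28)
)
```

The query text would still need to be supplied, for example as a final argument. When running the command through a shell, the destination table should be quoted so that `$` is not expanded.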

Incremental queries have these properties:

- Must accept a date via the @submission_date query parameter
  - Must output a column named submission_date matching the query parameter
- Must produce similar results when run multiple times
  - Should produce identical results when run multiple times
- May depend on the previous partition
  - If using the previous partition, must include a .init.sql query to initialize the table
  - Should only be impacted by values from a finite number of preceding partitions
    - This allows backfilling in chunks instead of serially for all time, and limits backfills to a certain number of days following updated data
    - For example, sql/clients_last_seen_v1.sql can be run serially on any 28-day period, and the last day will be the same whether or not the partition preceding the first day was missing, because values are only impacted by the 27 preceding days
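
The finite-lookback property can be illustrated with a toy model. The sketch below is not the logic of sql/clients_last_seen_v1.sql. It assumes a simplified partition that maps each client to the number of days since it was last seen, and drops a client once that count would reach 28. The names `next_partition` and `run_serially` and the sample activity are hypothetical.

```python
from datetime import date, timedelta

WINDOW_DAYS = 28  # assumed cutoff: the current day plus 27 preceding days


def next_partition(previous, seen_today):
    """Toy incremental step: build today's partition from yesterday's."""
    current = {client: 0 for client in seen_today}
    for client, days_since_seen in previous.items():
        # Carry a client forward only while it stays inside the window.
        if client not in current and days_since_seen + 1 < WINDOW_DAYS:
            current[client] = days_since_seen + 1
    return current


def run_serially(activity, start, end):
    """Run the toy query day by day, starting from a missing (empty) partition."""
    partition = {}
    day = start
    while day <= end:
        partition = next_partition(partition, activity.get(day, set()))
        day += timedelta(days=1)
    return partition


# Made-up activity: which clients were seen on which day.
activity = {
    date(2019, 1, 1): {"a"},
    date(2019, 2, 10): {"b"},
    date(2019, 3, 1): {"c"},
    date(2019, 3, 20): {"d"},
}
end = date(2019, 3, 28)

# Backfill over all history versus only the last 28 days.
full_history = run_serially(activity, date(2019, 1, 1), end)
last_28_days = run_serially(activity, end - timedelta(days=WINDOW_DAYS - 1), end)

# The final partition is identical either way.
assert full_history == last_28_days
```

Because the last day of any 28-day run matches the full-history result in this model, a long backfill could be split into chunks that each start 27 days before the first partition they need to get right, and the chunks could then run independently of one another.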