# How to Run Tests

This repository uses `pytest`:

```
# create a venv
python3.8 -m venv venv/

# install pip-tools for managing dependencies
./venv/bin/pip install pip-tools -c requirements.in

# install python dependencies with pip-sync (provided by pip-tools)
./venv/bin/pip-sync

# install java dependencies with maven
mvn dependency:copy-dependencies

# run pytest with all linters and 4 workers in parallel
./venv/bin/pytest --black --pydocstyle --flake8 --mypy-ignore-missing-imports -n 4

# use -k to selectively run a set of tests that matches the expression `udf`
./venv/bin/pytest -k udf

# run integration tests with 4 workers in parallel
gcloud auth application-default login # or set GOOGLE_APPLICATION_CREDENTIALS
export GOOGLE_PROJECT_ID="bigquery-etl-integration-test"
./venv/bin/pytest -m integration -n 4
```
To provide authentication credentials for the Google Cloud API, the
`GOOGLE_APPLICATION_CREDENTIALS` environment variable must be set to the file
path of the JSON file that contains the service account key. See Mozilla
BigQuery API Access instructions to request credentials if you don't already
have them.
# How to Configure a UDF Test

Include a comment like `-- Tests` followed by one or more query statements
after the UDF in the SQL file where it is defined. Each statement in a SQL file
that defines a UDF that does not define a temporary function is collected as a
test and executed independently of other tests in the file.

Each test must use the UDF and throw an error to fail. Assert functions defined
in `tests/assert/` may be used to evaluate outputs. Tests must not use any
query parameters and should not reference any tables. Each test that is
expected to fail must be preceded by a comment like `#xfail`, similar to a SQL
dialect prefix in the BigQuery Cloud Console.

For example:

```
CREATE TEMP FUNCTION udf_example(option INT64) AS (
  CASE
    WHEN option > 0 THEN TRUE
    WHEN option = 0 THEN FALSE
    ELSE ERROR("invalid option")
  END
);
-- Tests
SELECT
  assert_true(udf_example(1)),
  assert_false(udf_example(0));
#xfail
SELECT
  udf_example(-1);
#xfail
SELECT
  udf_example(NULL);
```
# How to Configure a Generated Test

- Make a directory for test resources named `tests/{dataset}/{table}/{test_name}/`,
  e.g. `tests/telemetry_derived/clients_last_seen_raw_v1/test_single_day`
  - `table` must match a directory named like `{dataset}/{table}`,
    e.g. `telemetry_derived/clients_last_seen_v1`
  - `test_name` should start with `test_`, e.g. `test_single_day`
  - If `test_name` is `test_init` or `test_script`, then the test will run
    `init.sql` or `script.sql` respectively; otherwise, the test will run
    `query.sql`
- Add `.yaml` files for input tables, e.g. `clients_daily_v6.yaml`
  - Include the dataset prefix if it's set in the tested query,
    e.g. `analysis.clients_last_seen_v1.yaml`
    - This will result in the dataset prefix being removed from the query,
      e.g. `query = query.replace("analysis.clients_last_seen_v1", "clients_last_seen_v1")`
- Add `.sql` files for input view queries, e.g. `main_summary_v4.sql`
  - Don't include a `CREATE ... AS` clause
  - Fully qualify table names as `` `{project}.{dataset}.table` ``
  - Include the dataset prefix if it's set in the tested query,
    e.g. `telemetry.main_summary_v4.sql`
    - This will result in the dataset prefix being removed from the query,
      e.g. `query = query.replace("telemetry.main_summary_v4", "main_summary_v4")`
- Add `expect.yaml` to validate the result
  - `DATE` and `DATETIME` type columns in the result are coerced to strings
    using `.isoformat()`
  - Columns named `generated_time` are removed from the result before
    comparing to `expect` because they should not be static
- Optionally add `.schema.json` files for input table schemas,
  e.g. `clients_daily_v6.schema.json`
- Optionally add `query_params.yaml` to define query parameters
  - `query_params` must be a list
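The dataset-prefix removal described above can be sketched as a one-line helper. This is illustrative only, mirroring the `query.replace(...)` calls quoted in the list; the function name is ours, not part of this repository:

```python
def strip_dataset_prefix(query: str, dataset: str, table: str) -> str:
    """Replace `dataset.table` references with bare `table` references,
    so the query reads from the test's loaded input table instead."""
    return query.replace(f"{dataset}.{table}", table)
```

For example, `strip_dataset_prefix("SELECT * FROM analysis.clients_last_seen_v1", "analysis", "clients_last_seen_v1")` yields `"SELECT * FROM clients_last_seen_v1"`.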
# Init Tests

Tests of `init.sql` statements are supported, similarly to other generated
tests. Simply name the test `test_init`. The other guidelines still apply.

Note: Init SQL statements must contain a create statement with the dataset
and table name, like so:

```
CREATE OR REPLACE TABLE
  dataset.table_v1
AS
...
```
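One way to see why the note above matters: the destination `dataset.table` has to be recoverable from the statement itself. A hypothetical sketch of such extraction (the regex and function are ours, not this repository's actual parser):

```python
import re

# Matches CREATE TABLE / CREATE OR REPLACE TABLE followed by `dataset.table`,
# with optional backtick quoting around each name part.
CREATE_TABLE_RE = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+`?(\w+)`?\.`?(\w+)`?",
    re.IGNORECASE,
)


def destination_table(init_sql: str):
    """Return (dataset, table) from the first CREATE TABLE statement, or None."""
    match = CREATE_TABLE_RE.search(init_sql)
    return match.groups() if match else None
```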
# Additional Guidelines and Options

- If the destination table is also an input table then `generated_time` should
  be a required `DATETIME` field to ensure minimal validation
- Input table files
  - All of the formats supported by `bq load` are supported
  - `yaml` and `json` formats are supported and must contain an array of rows,
    which are converted in memory to `ndjson` before loading
  - Preferred formats are `yaml` for readability or `ndjson` for compatibility
    with `bq load`
- `expect.yaml`
  - File extensions `yaml`, `json` and `ndjson` are supported
  - Preferred formats are `yaml` for readability or `ndjson` for compatibility
    with `bq load`
- Schema files
  - Setting the description of a top level field to `time_partitioning_field`
    will cause the table to use it for time partitioning
  - File extensions `yaml`, `json` and `ndjson` are supported
  - Preferred formats are `yaml` for readability or `json` for compatibility
    with `bq load`
- Query parameters
  - Scalar query params should be defined as a dict with keys `name`,
    `type` or `type_`, and `value`
  - `query_parameters.yaml` may be used instead of `query_params.yaml`, but
    they are mutually exclusive
  - File extensions `yaml`, `json` and `ndjson` are supported
  - Preferred format is `yaml` for readability
# How to Run CircleCI Locally

- Install the CircleCI Local CLI
- Download GCP service account keys
  - Integration tests will only successfully run with service account keys
    that belong to the `circleci` service account in the
    `bigquery-etl-integration-test` project
- Run `circleci build` and set the required environment variables
  `GOOGLE_PROJECT_ID` and `GCLOUD_SERVICE_KEY`:

```
gcloud_service_key=`cat /path/to/key_file.json`

# to run a specific job, e.g. integration:
circleci build --job integration \
  --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \
  --env GCLOUD_SERVICE_KEY=$gcloud_service_key

# to run all jobs
circleci build \
  --env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \
  --env GCLOUD_SERVICE_KEY=$gcloud_service_key
```