bigquery-etl

Commit graph

Author	SHA1	Message	Date
Eduardo Filho	b994884098	GLAM purge percentile calculations and prep downstream (#5966) * Remove percentiles * Remove tests that test percentiles * Refresh scripts insert null to new percentiles * Remove percentile columns from queries and schemas * Delete more percentile tables * Formatting * histogram_cast_struct's keys are strings * Re-add test after fixing failure cause	2024-07-25 10:44:43 -04:00
Anna Scholtz	57bd939905	Fully qualified identifiers in SQL queries (#5764) * Add fully-qualified identifiers when formatting queries * Fully-qualified identifiers for queries in sql/ * Check in only formatted SQL to generated-sql branch * Add comment * Fully qualify more tables * Fully qualify test files * Formatting improvements around CTEs and unit tests * Option to skip auto qualifying queries	2024-06-27 09:53:33 -07:00
Lucia	84ee88e2b9	Dependabot/pip/black 24.1.1 fix (#5027) * Bump black from 23.10.1 to 24.1.1 Bumps [black](https://github.com/psf/black) from 23.10.1 to 24.1.1. - [Release notes](https://github.com/psf/black/releases) - [Changelog](https://github.com/psf/black/blob/main/CHANGES.md) - [Commits](https://github.com/psf/black/compare/23.10.1...24.1.1) --- updated-dependencies: - dependency-name: black dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> * Reformat files with black to fix dependabot update. * Reformat with black 24.1.1. Update test dag with required space. * Update test dags. --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2024-02-19 15:27:34 +01:00
Sean Rose	a70b2aa689	Support symlinks (#4881) * Avoid using `Path.glob()` or `Path.rglob()` for recursive file searches. Because they don't currently support following symlinks (they will in Python 3.13). * Specify `followlinks=True` as necessary when calling `os.walk()`.	2024-01-24 13:02:43 -08:00
Alexander	588d468dc8	Hoist schemas in SQL tests up to table dir (#3145)	2022-08-17 13:11:24 -04:00
Anna Scholtz	2f5c6ac41a	Generate ExternalTaskMarkers for Airflow downstream dependencies	2022-06-22 11:05:25 -07:00
akkomar	ceda6dd35f	Use approximate client count in GLAM scalar_percentiles_v1 (#3039) This is a follow-up to https://github.com/mozilla/bigquery-etl/pull/3037 which unblocked `scalar_bucket_counts_v1`. `scalar_percentiles_v1` uses the same source table (`clients_scalar_aggregates_v1`) and started failing today with the same error (disk/memory limits exceeded for shuffle operations). `APPROX_COUNT_DISTINCT` used here runs HLL under the hood. The reason for using it here is that we can't split the aggregation here into two stages as in the aforementioned PR due to quantiles calculation. I have run this query locally and confirmed that it works.	2022-06-21 10:55:08 -04:00
Alekhya	a76cd01efa	add minimum client count for fenix (#2642) add minimum client count for fenix add minimum client count for fenix add minimum client count for fenix add minimum client count for fenix	2022-01-12 11:49:59 -05:00
Alekhya	2f1413fee1	Revert "correcting minimum client count - desktop and fenix (#2544)" (#2566) This reverts commit `5b743090b4`.	2021-12-10 10:15:52 -05:00
Alekhya	5b743090b4	correcting minimum client count - desktop and fenix (#2544) * correcting minimum client count - desktop and fenix * corrected test cases for desktop * corrected the join for desktop	2021-12-06 10:14:42 -05:00
Daniel Thorn	a190e18264	Automatically sort python imports (#1840)	2021-02-24 17:11:52 -05:00
Anthony Miyaguchi	ce9fe86ed2	Fix #1587 - fix inconsistent range_min and range_max in bucket counts (#1591) * Fix egregious double counting in scalar bucket counts * Update for newer version of black * Update scalar bucket count test to account for combinations * Update minimal test for histogram bucket counts * Add test for multiple clients in histogram aggregates * Remove deduplicated cte in histogram bucket counts * Use count distinct for client counts to be explicit	2020-12-04 14:47:45 -08:00
Anthony Miyaguchi	4234c40040	Add minimal set of tests for GLAM Fenix queries (#1488) * Add script to determine query dependencies * Add schemas and folders for minimal test * Add schema for geckoview_versions * Add query params to each query * Update schema for new queries * Remove main from bootstrap file * Add dataset prefix to schemas * Add failing test for clients_histogram_aggregates It turns out that the dependency resolution I'm using for autogenerate the schemas is ignoring the views. I actually want to keep the views around. The tables also all need to be prefixed with the dataset name or they won't be inserted into the sql query correctly. * Add successful test for clients histogram aggregates * Add minimal tests for clients_scalar_aggregates * Remove skeleton files for views (no test support for views) * Add tests for latest versions * Add tests for scalar bucket counts that passes * Add scalar bucket counts * Add test for scalar percentiles * Add test for histogram bucket counts * Add passing test for probe counts * Add test for histogram percentiles * Add tests for extract counts * Update readme * Add data for scalar percentiles test * Fix linting errors * Fix mypy issues with tests module * Name it data instead of tests..data Ignore mypy on tests directory * Remove mypy section * Remove extra line in pytest * Try pytest invocation of mypy-scripts-are-modules * Run mypy outside of pytest * Use exec on pytest instead of mypy * Update tests/sql/glam-fenix-dev/glam_etl/bootstrap.py Co-authored-by: Ben Wu <benjaminwu124@gmail.com> * Update tests/sql/glam-fenix-dev/glam_etl/README.md Co-authored-by: Ben Wu <benjaminwu124@gmail.com> * Document bootstrap in documentation * Use artificial range for histogram_percentiles * Simplify parameters for scalar probe counts * Simplify tests for histogram probe counts * Add test for incremental histogram aggregates * Update scalar percentile counts to count distinct client ids * Update readme for creating a new test * Use unorded list for sublist * Use --ignore-glob for pytest to avoid data files Co-authored-by: Ben Wu <benjaminwu124@gmail.com>	2020-12-01 17:11:45 -08:00

13 commits
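
The "Support symlinks" commit (a70b2aa689) describes replacing `Path.glob()` / `Path.rglob()` with `os.walk()` called with `followlinks=True`. The snippet below is only a minimal sketch of that idea, not the repository's actual code: the function name `find_files` and the `.sql` default suffix are hypothetical and chosen for illustration.

```python
import os
from pathlib import Path
from typing import Iterator


def find_files(root: str, suffix: str = ".sql") -> Iterator[Path]:
    """Yield files under root whose names end with suffix.

    Hypothetical helper: name and default suffix are illustrative only.
    """
    # followlinks=True makes os.walk descend into symlinked directories,
    # which it skips by default. A symlink that points back to a parent
    # directory would make the walk recurse without end, so this assumes
    # the tree contains no such cycles.
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=True):
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield Path(dirpath) / name
```

Calling `list(find_files("sql"))` would then also return matching files that are reachable only through a symlinked directory.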