* Remove percentiles
* Remove tests that test percentiles
* Refresh scripts insert null to new percentiles
* Remove percentile columns from queries and schemas
* Delete more percentile tables
* Formatting
* histogram_cast_struct's keys are strings
* Re-add test after fixing failure cause
* Add fully-qualified identifiers when formatting queries
* Fully-qualified identifiers for queries in sql/
* Check in only formatted SQL to generated-sql branch
* Add comment
* Fully qualify more tables
* Fully qualify test files
* Formatting improvements around CTEs and unit tests
* Option to skip auto qualifying queries
* Added metadata.yaml files to glam-fenix-dev project
* Add owner to metadata.yaml files for glam-fenix-dev
---------
Co-authored-by: Eduardo Filho <edugomfilho@gmail.com>
* Indent `WHEN` and `ELSE` clauses one level more than `CASE`.
* Indent `THEN` clauses one level more than the corresponding `WHEN` clause.
* Have the content of `WHEN`, `THEN`, and `ELSE` clauses start on the same line as the clause keyword.
* Allow an alias, comma, or dot right after a `CASE` statement's `END`.
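Under those rules, a formatted `CASE` expression would look roughly like this (an illustrative sketch, not output copied from the formatter; table and column names are made up):

```sql
SELECT
  CASE
    WHEN normalized_channel = 'nightly'
      THEN 1
    WHEN normalized_channel = 'beta'
      THEN 2
    ELSE 3
  END AS channel_rank,
FROM
  some_dataset.some_table
```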
This is a follow-up to https://github.com/mozilla/bigquery-etl/pull/3037 which unblocked `scalar_bucket_counts_v1`.
`scalar_percentiles_v1` uses the same source table (`clients_scalar_aggregates_v1`) and started failing today with the same error (disk/memory limits exceeded for shuffle operations).
`APPROX_COUNT_DISTINCT` used here runs HLL under the hood. We use it because the aggregation can't be split into two stages as in the aforementioned PR, due to the quantiles calculation.
I have run this query locally and confirmed that it works.
`scalar_bucket_counts_v1` queries started failing in 2022-06 for FOG and Fenix.
They started exceeding the BQ disk and memory limits available for shuffle operations, most likely because of the growing number of clients, metrics, and values.
This commit introduces HyperLogLog for estimating the number of clients in `scalar_bucket_counts`. In my tests, after this change, the query finished in 30-70 minutes, while the current production query fails after 4-6 hours.
I have run this query before and after the change on a 50% sample of the input table.
In terms of BQ slots used, the new query uses ~700 vs. ~2100 before; although not exact, this can be treated as an approximation of the run time.
In terms of errors, this notebook compares both outputs and user counts: https://colab.research.google.com/drive/1uilzQcFn1ppFMTTpi-RXj2EFLjCxX_mN#scrollTo=VfRZpBCjxgQM. It shows that 99.8% of counts estimated with HLL have an error smaller than 0.1%, and 81% have an error below 0.01%. This makes this approach a good alternative to the sampling used in the legacy pipeline, as it does not introduce the risk of sampling out some populations.
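For intuition about why the estimates above are so accurate, HyperLogLog-style counting can be sketched in a few lines of Python. This is a toy sketch for illustration only; BigQuery's `APPROX_COUNT_DISTINCT` is the real implementation used in the query.

```python
import hashlib
import math

def hll_estimate(items, b=10):
    """Toy HyperLogLog: estimate the number of distinct items
    using m = 2**b registers instead of storing every item."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        # 64-bit hash of the item
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (m - 1)        # low b bits select a register
        w = h >> b               # remaining (64 - b) bits feed the rank
        rank = (64 - b) - w.bit_length() + 1  # position of leftmost 1-bit
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    est = alpha * m * m / sum(2.0 ** -r for r in registers)
    # small-range correction: fall back to linear counting
    zeros = registers.count(0)
    if est <= 2.5 * m and zeros:
        est = m * math.log(m / zeros)
    return est
```

The sketch uses constant memory (1 KiB of registers here) regardless of input size, which is exactly the trade that lets the production query fit within BQ's shuffle limits.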
added sample counts for desktop and glean products (both scalars and histograms)
add sample counts for scalars and histograms
* added sample counts for glam fenix
* formatted for black format check
* Revert "formatted for black format check"
This reverts commit cf71fed487.
* formatted for black format check
* added the sample count scripts
* Fix egregious double counting in scalar bucket counts
* Update for newer version of black
* Update scalar bucket count test to account for combinations
* Update minimal test for histogram bucket counts
* Add test for multiple clients in histogram aggregates
* Remove deduplicated cte in histogram bucket counts
* Use count distinct for client counts to be explicit
* Add script to determine query dependencies
* Add schemas and folders for minimal test
* Add schema for geckoview_versions
* Add query params to each query
* Update schema for new queries
* Remove main from bootstrap file
* Add dataset prefix to schemas
* Add failing test for clients_histogram_aggregates
It turns out that the dependency resolution I'm using to autogenerate
the schemas is ignoring the views. I actually want to keep the views
around. The tables also all need to be prefixed with the dataset name,
or they won't be inserted into the SQL query correctly.
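The dataset-prefixing step can be sketched roughly like this. The helper name and regex are hypothetical, for illustration only, and are not the actual bigquery-etl implementation.

```python
import re

def qualify_tables(sql: str, dataset: str, tables: list) -> str:
    """Prefix bare table references with the dataset name so they
    substitute correctly into the generated SQL query."""
    for table in tables:
        # Skip references that are already qualified (preceded by a dot)
        # or that are part of a longer identifier.
        pattern = rf"(?<![\w.`]){re.escape(table)}(?![\w.])"
        sql = re.sub(pattern, f"{dataset}.{table}", sql)
    return sql
```

An already-qualified reference such as `glam_etl.clients_histogram_aggregates_v1` is left untouched because the lookbehind rejects a preceding dot.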
* Add successful test for clients histogram aggregates
* Add minimal tests for clients_scalar_aggregates
* Remove skeleton files for views (no test support for views)
* Add tests for latest versions
* Add passing tests for scalar bucket counts
* Add scalar bucket counts
* Add test for scalar percentiles
* Add test for histogram bucket counts
* Add passing test for probe counts
* Add test for histogram percentiles
* Add tests for extract counts
* Update readme
* Add data for scalar percentiles test
* Fix linting errors
* Fix mypy issues with tests module
* Name it data instead of tests.*.data
* Ignore mypy on tests directory
* Remove mypy section
* Remove extra line in pytest
* Try pytest invocation of mypy-scripts-are-modules
* Run mypy outside of pytest
* Use exec on pytest instead of mypy
* Update tests/sql/glam-fenix-dev/glam_etl/bootstrap.py
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Update tests/sql/glam-fenix-dev/glam_etl/README.md
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Document bootstrap in documentation
* Use artificial range for histogram_percentiles
* Simplify parameters for scalar probe counts
* Simplify tests for histogram probe counts
* Add test for incremental histogram aggregates
* Update scalar percentile counts to count distinct client ids
* Update readme for creating a new test
* Use unordered list for sublist
* Use --ignore-glob for pytest to avoid data files
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Replace GLAM temp functions with persistent functions
* Add generated sql
* Fix typo in udf name
* Add missing files and fully qualify udfs
* Add missing namespace
* Namespace even more things
* format sql
* Use join against geckoview_version view to get app_version
* Update logic to keep last 3 major versions
* Update template with proper syntax
* Keep 3 major versions instead of 4
* Resolve generated sql to glam-fenix-dev and change output in sql/ dir
* Add new script for testing glam-fenix queries
* Add generated sql for version control
* Use variables correctly in bash
* Remove latest versions from UDF
* Update test to generate minimum set of tables for nightly
* Commit generated queries for testing
* Cast only if not glob
* Ignore dryrun and publish view for glam-fenix-dev
* Fix linting error
* Update comments
* Use DST_PROJECT consistently in scripts
* Update comments
* Update script/glam/test/test_glean_org_mozilla_fenix_glam_nightly
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Update script/glam/generate_and_run_desktop_sql
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>