* Indent `WHEN` and `ELSE` clauses one level more than `CASE`.
* Indent `THEN` clauses one level more than the corresponding `WHEN` clause.
* Have the content of `WHEN`, `THEN`, and `ELSE` clauses start on the same line as the clause keyword.
* Allow an alias, comma, or dot right after a `CASE` expression's `END`, as in the sketch below.
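A minimal sketch of these rules applied together (column and table names are made up for illustration):

```sql
SELECT
  CASE
    WHEN metric_value < 0
      THEN 0
    ELSE metric_value
  END AS clamped_value,
  client_id
FROM
  my_dataset.my_table
```

Here `WHEN` and `ELSE` sit one level inside `CASE`, `THEN` sits one level inside its `WHEN`, each clause's content starts on the keyword's line, and the alias follows `END` directly.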
This should fix the `extract_probe_counts_v1` query failing due to integer overflows when casting.
We got some big negative numbers in a few histogram probes around 2022-08-04, which appear to be random bit flips. I'm adding a lower bound of 0 on the value in a filter, as we do not expect negative values in any case.
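A hedged sketch of the kind of filter this adds (the table and column names are assumptions, not a verbatim excerpt from `extract_probe_counts_v1`):

```sql
SELECT
  metric,
  value
FROM
  glam_etl.clients_histogram_aggregates_v1
WHERE
  -- Bit-flipped values show up as large negative numbers; histogram
  -- values are never expected to be negative, so bound them at 0.
  value >= 0
```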
This is a follow-up to https://github.com/mozilla/bigquery-etl/pull/3037 which unblocked `scalar_bucket_counts_v1`.
`scalar_percentiles_v1` uses the same source table (`clients_scalar_aggregates_v1`) and started failing today with the same error (disk/memory limits exceeded for shuffle operations).
`APPROX_COUNT_DISTINCT` used here runs HLL under the hood. The reason for using it is that, unlike in the aforementioned PR, the aggregation can't be split into two stages because of the quantiles calculation.
I have run this query locally and confirmed that it works.
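As a hedged illustration (the dataset qualifier and grouping column are simplified; the real query also computes percentiles per metric), the HLL-backed approximation replaces an exact distinct count like this:

```sql
SELECT
  metric,
  -- APPROX_COUNT_DISTINCT keeps a small fixed-size HLL sketch per group
  -- instead of shuffling the full set of client_ids, which keeps the
  -- query within disk/memory limits for shuffle operations.
  APPROX_COUNT_DISTINCT(client_id) AS client_count
FROM
  glam_etl.clients_scalar_aggregates_v1
GROUP BY
  metric
```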
`scalar_bucket_counts_v1` queries started failing in 2022-06 for FOG and Fenix.
They started exceeding the BQ disk and memory limits available for shuffle operations, most likely because of the growing number of clients, metrics, and values.
This commit introduces HyperLogLog for estimating the number of clients in `scalar_bucket_counts`. In my tests, after this change, the query finished in 30-70 minutes, while the current production query fails after 4-6 hours.
I have run this query before and after the change on a 50% sample of the input table.
In terms of BQ slots used, this uses ~700 vs. ~2100 before; although not exact, this can be treated as an approximation of the run time.
In terms of errors, this notebook compares both outputs and user counts: https://colab.research.google.com/drive/1uilzQcFn1ppFMTTpi-RXj2EFLjCxX_mN#scrollTo=VfRZpBCjxgQM. It shows that 99.8% of the counts estimated with HLL have an error smaller than 0.1%, and 81% have an error below 0.01%. This makes the approach a good alternative to the sampling used in the legacy pipeline, as it does not introduce the risk of sampling out some populations.
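For cases where the aggregation *can* be split into stages (as in the PR referenced above), BigQuery's explicit sketch functions let the distinct count be built incrementally. A hedged sketch with illustrative grouping columns:

```sql
-- Stage 1: build a compact HLL sketch per (metric, app_version) group
-- instead of shuffling the full set of client_ids.
WITH sketches AS (
  SELECT
    metric,
    app_version,
    HLL_COUNT.INIT(client_id, 15) AS client_sketch  -- 15 is the default precision
  FROM
    glam_etl.clients_scalar_aggregates_v1
  GROUP BY
    metric,
    app_version
)
-- Stage 2: merge the partial sketches into one approximate count per metric.
SELECT
  metric,
  HLL_COUNT.MERGE(client_sketch) AS client_count
FROM
  sketches
GROUP BY
  metric
```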
`clients_scalar_aggregates_v1` tables are partitioned by version into 100 buckets starting from 0. This causes data from Firefox versions >= 100 to flow into the `__UNPARTITIONED__` partition, effectively rendering these tables unpartitioned.
This bumps the partitioning range to 200. Per the [Firefox Release Calendar](https://wiki.mozilla.org/Release_Management/Calendar), this is enough for ~7 years.
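A hedged sketch of the updated partition spec (the DDL is simplified and `app_version` is an assumed column name):

```sql
CREATE OR REPLACE TABLE glam_etl.clients_scalar_aggregates_v1 (
  app_version INT64,
  client_id STRING
)
-- Upper bound bumped from 100 to 200 so Firefox versions >= 100 get
-- real partitions instead of falling into __UNPARTITIONED__.
PARTITION BY RANGE_BUCKET(app_version, GENERATE_ARRAY(0, 200, 1))
```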
* Stop calculating probes that cause errors on GLAM from the ETL (#2792)
* #2792 Glam: applying same probe blocklist mechanism to glean queries
* #2792 Add 'product' parameter to the excluded-probes function
* added sample counts for desktop and glean products (both scalars and histograms)
* add minimum client count for fenix
* FOG date check
* added sample counts for glam fenix
* formatted for black format check
* Revert "formatted for black format check"
This reverts commit cf71fed487.
* formatted for black format check
* added the sample count scripts
* Fix egregious double counting in scalar bucket counts
* Update for newer version of black
* Update scalar bucket count test to account for combinations
* Update minimal test for histogram bucket counts
* Add test for multiple clients in histogram aggregates
* Remove deduplicated cte in histogram bucket counts
* Use count distinct for client counts to be explicit
* Add script to determine query dependencies
* Add schemas and folders for minimal test
* Add schema for geckoview_versions
* Add query params to each query
* Update schema for new queries
* Remove main from bootstrap file
* Add dataset prefix to schemas
* Add failing test for clients_histogram_aggregates
It turns out that the dependency resolution I'm using to autogenerate
the schemas ignores the views. I actually want to keep the views
around. The tables also all need to be prefixed with the dataset name, or
they won't be inserted into the SQL query correctly.
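A minimal illustration of the prefixing point (the unprefixed form is the one that broke; `glam_etl` is the dataset used in these tests):

```sql
-- Without the dataset prefix, the table is not substituted correctly
-- into the generated test query:
SELECT * FROM clients_histogram_aggregates_v1;
-- Prefixed with the dataset name, as the tests require:
SELECT * FROM glam_etl.clients_histogram_aggregates_v1;
```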
* Add successful test for clients histogram aggregates
* Add minimal tests for clients_scalar_aggregates
* Remove skeleton files for views (no test support for views)
* Add tests for latest versions
* Add tests for scalar bucket counts that pass
* Add scalar bucket counts
* Add test for scalar percentiles
* Add test for histogram bucket counts
* Add passing test for probe counts
* Add test for histogram percentiles
* Add tests for extract counts
* Update readme
* Add data for scalar percentiles test
* Fix linting errors
* Fix mypy issues with tests module
* Name it data instead of tests.*.data
* Ignore mypy on tests directory
* Remove mypy section
* Remove extra line in pytest
* Try pytest invocation of mypy-scripts-are-modules
* Run mypy outside of pytest
* Use exec on pytest instead of mypy
* Update tests/sql/glam-fenix-dev/glam_etl/bootstrap.py
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Update tests/sql/glam-fenix-dev/glam_etl/README.md
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Document bootstrap in documentation
* Use artificial range for histogram_percentiles
* Simplify parameters for scalar probe counts
* Simplify tests for histogram probe counts
* Add test for incremental histogram aggregates
* Update scalar percentile counts to count distinct client ids
* Update readme for creating a new test
* Use unordered list for sublist
* Use --ignore-glob for pytest to avoid data files
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Replace GLAM temp functions with persistent functions
* Add generated sql
* Fix typo in udf name
* Add missing files and fully qualify udfs
* Add missing namespace
* Namespace even more things
* format sql
* Use join against geckoview_version view to get app_version
* Update logic to keep last 3 major versions
* Update template with proper syntax
* Keep 3 major versions instead of 4