Граф коммитов

35 Коммитов

Автор SHA1 Сообщение Дата
Eduardo Filho b994884098
GLAM purge percentile calculations and prep downstream (#5966)
* Remove percentiles

* Remove tests that test percentiles

* Refresh scripts insert null to new percentiles

* Remove percentile columns from queries and schemas

* Delete more percentile tables

* Formatting

* histogram_cast_struct's keys are strings

* Re-add test after fixing failure cause
2024-07-25 10:44:43 -04:00
Anna Scholtz 57bd939905
Fully qualified identifiers in SQL queries (#5764)
* Add fully-qualified identifiers when formatting queries

* Fully-qualified identifiers for queries in sql/

* Check in only formatted SQL to generated-sql branch

* Add comment

* Fully qualify more tables

* Fully qualify test files

* Formatting improvements around CTEs and unit tests

* Option to skip auto qualifying queries
2024-06-27 09:53:33 -07:00
Anna Scholtz c2b77ba3bd
Support range partitioning and remove GLAM init.sql (#5346)
* Add support for range partitioning

* Remove init.sql for GLAM tables
2024-04-08 15:03:43 -07:00
Sean Rose 1d1addb86c
Indent join conditions (#4223)
* Indent join conditions.

* Put parentheses around some `BETWEEN ... AND ...` join conditions.
2024-01-11 15:50:26 -08:00
kik-kik ce300779ef
# issues(3416): added missing metadata.yaml files inside glam-fenix-dev (#3655)
* Added metadata.yaml files to glam-fenix-dev project

* Add owner to metadata.yaml files for glam-fenix-dev

---------

Co-authored-by: Eduardo Filho <edugomfilho@gmail.com>
2023-03-16 10:51:40 +00:00
Sean Rose 2d84a1d3b7
Change `bqetl format` to improve readability of `CASE` statements (#3546)
* Indent `WHEN` and `ELSE` clauses one level more than `CASE`.
* Indent `THEN` clauses one level more than the corresponding `WHEN` clause.
* Have the content of `WHEN`, `THEN`, and `ELSE` clauses start on the same line as the clause keyword.
* Allow an alias, comma, or dot right after a `CASE` statement's `END`.
2023-02-03 14:35:59 -08:00
Sean Rose a46678ed96
Change `bqetl format` to uppercase built-in function names (#3536) 2023-01-26 16:07:10 -08:00
Eduardo Filho 26413d5ecd
[Bug 1803344] GLAM - use firefox-desktop Glean metadata to build fog queries (#3386) 2022-12-01 13:39:08 +01:00
akkomar ceda6dd35f
Use approximate client count in GLAM scalar_percentiles_v1 (#3039)
This is a follow-up to https://github.com/mozilla/bigquery-etl/pull/3037 which unblocked `scalar_bucket_counts_v1`.
`scalar_percentiles_v1` uses the same source table (`clients_scalar_aggregates_v1`) and started failing today with the same error (disk/memory limits exceeded for shuffle operations).

`APPROX_COUNT_DISTINCT` used here runs HLL under the hood. The reason for using it here is that we can't split the aggregation here into two stages as in the aforementioned PR due to quantiles calculation.

I have run this query locally and confirmed that it works.
2022-06-21 10:55:08 -04:00
Arkadiusz Komarzewski 98549e3cb8 Bug 1772532 - Use HLL for user counts in GLAM scalar_bucker_counts_v1
`scalar_bucket_counts_v1` queries started failing in 2022-06 for FOG and Fenix.
They started exceeding BQ disk and memory limits available for shuffle operations, most likely because of growing number of clients, metrics, and values.

This commit introduces HyperLogLog for estimating number of clients in `scalar_bucket_counts`. In my tests, after this change, query was finishing in 30-70 minutes while the current production query fails after 4-6 hours.

I have run this query before and after the change on a 50% sample of input table.
In terms of BQ slots used, this uses ~700 vs. ~2100 before - although not exact, this can be treated as an approximation of the run time.
In terms of errors, this notebook compares both outputs and user counts: https://colab.research.google.com/drive/1uilzQcFn1ppFMTTpi-RXj2EFLjCxX_mN#scrollTo=VfRZpBCjxgQM. It shows that 99.8% of counts estimated with HLL have an error smaller than 0.1%, 81% are below 0.01%. This makes this approach pretty good compared to sampling that we use in the legacy pipeline as it does not introduce risk of sampling-out some populations.
2022-06-21 08:13:18 +02:00
Alekhya 8e0b1e6818
adjusting the fenix counts (#2837) 2022-03-29 16:20:45 -04:00
Alekhya a436a574ff
Add agg_type to the sample counts for desktop and glean, remove the extract sample counts query (#2772)
* added agg_type for sample counts table

* removed the extract sample counts for both dekstop and glean products

* corrected sql formatting

corrected sql formatting
2022-03-02 15:51:20 -05:00
Alekhya 14ba76eacc
added client_agg_type column (#2754)
added client_agg_type column

added client_agg_type column
2022-02-25 17:40:53 -05:00
Alekhya 3b77cfa001
add process to sample counts glam tables (#2749)
add process to sample counts glam tables
2022-02-23 18:30:00 -05:00
Alekhya ceb5eeac61
added sample counts for desktop and glean products (both scalars and histograms) (#2743)
added sample counts for desktop and glean products (both scalars and histograms)

 added sample counts for desktop and glean products (both scalars and histograms)

 added sample counts for desktop and glean products (both scalars and histograms)

add sample counts for scalars and histogram
2022-02-22 20:15:44 -05:00
Alekhya d186da3cea
reduced the minimum count values for firefox desktop (#2651)
reduced the minimum count values for firefox desktop
2022-01-12 16:52:12 -05:00
Alekhya a76cd01efa
add minimum client count for fenix (#2642)
add minimum client count for fenix

add minimum client count for fenix

add minimum client count for fenix

add minimum client count for fenix
2022-01-12 11:49:59 -05:00
Alekhya 2f1413fee1
Revert "correcting minimum client count - desktop and fenix (#2544)" (#2566)
This reverts commit 5b743090b4.
2021-12-10 10:15:52 -05:00
Alekhya 5b743090b4
correcting minimum client count - desktop and fenix (#2544)
* correcting minimum client count - desktop and fenix

* corrected test cases for desktop

* corrected the join for desktop
2021-12-06 10:14:42 -05:00
Alekhya 80af7b96df
firefox_desktop_to_glam (#2485)
* firefox_desktop_to_glam

added sql files part 1

* update with the lastest mozfun function
2021-11-22 13:45:19 -05:00
Alekhya 99b5a5f06f
add 99 and 99.9 for desktop and fenix data (#2412)
* add 99 and 99.9 for desktop data

* added fenix 99 and 99.9 percentiles
2021-10-07 15:00:54 -04:00
Alekhya df5eb5e77e
Added sample counts for glam fenix (#2355)
* added sample counts for glam fenix

* formatted for black format check

* Revert "formatted for black format check"

This reverts commit cf71fed487.

* formated for black format check

* added the sample coubt scripts
2021-09-21 16:18:41 -04:00
Anthony Miyaguchi 08c406c384
Fix #2274 - Nest schemas under field object in glam queries (#2275)
* Update schema to be nested under field object

* Update schemas for glam-fenix-dev
2021-08-20 19:49:55 +00:00
Anthony Miyaguchi e2820e1255
Add filter for large scalars in aggregates (#2176) 2021-07-13 17:19:06 +00:00
Anthony Miyaguchi 7a7cd3cae6
Bug 1719188 - Fix overflows by filtering out large histogram values (#2173)
* Add filter on max histogram bucket value for glean glam queries

* Add update generated SQL

* Update comment
2021-07-08 11:11:08 -07:00
Anthony Miyaguchi 3b3294b159
Fix #1729 - Add timespan to metrics for GLAM ETL queries for Glean (#1738)
* Add updated metrics since last update

* Add timespans to query

* Update generated code to select nested value for timespans

* Add generated queries
2021-02-03 09:40:52 -08:00
Anthony Miyaguchi ce9fe86ed2
Fix #1587 - fix inconsistent range_min and range_max in bucket counts (#1591)
* Fix egregious double counting in scalar bucket counts

* Update for newer version of black

* Update scalar bucket count test to account for combinations

* Update minimal test for histogram bucket counts

* Add test for multiple clients in histogram aggregates

* Remove deduplicated cte in histogram bucket counts

* Use count distinct for client counts to be explicit
2020-12-04 14:47:45 -08:00
Anthony Miyaguchi 4234c40040
Add minimal set of tests for GLAM Fenix queries (#1488)
* Add script to determine query dependencies

* Add schemas and folders for minimal test

* Add schema for geckoview_versions

* Add query params to each query

* Update schema for new queries

* Remove main from bootstrap file

* Add dataset prefix to schemas

* Add failing test for clients_histogram_aggregates

It turns out that the dependency resolution I'm using for autogenerate
the schemas is ignoring the views. I actually want to keep the views
around. The tables also all need to be prefixed with the dataset name or
they won't be inserted into the sql query correctly.

* Add successful test for clients histogram aggregates

* Add minimal tests for clients_scalar_aggregates

* Remove skeleton files for views (no test support for views)

* Add tests for latest versions

* Add tests for scalar bucket counts that passes

* Add scalar bucket counts

* Add test for scalar percentiles

* Add test for histogram bucket counts

* Add passing test for probe counts

* Add test for histogram percentiles

* Add tests for extract counts

* Update readme

* Add data for scalar percentiles test

* Fix linting errors

* Fix mypy issues with tests module

* Name it data instead of tests.*.data

* Ignore mypy on tests directory

* Remove mypy section

* Remove extra line in pytest

* Try pytest invocation of mypy-scripts-are-modules

* Run mypy outside of pytest

* Use exec on pytest instead of mypy

* Update tests/sql/glam-fenix-dev/glam_etl/bootstrap.py

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>

* Update tests/sql/glam-fenix-dev/glam_etl/README.md

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>

* Document bootstrap in documentation

* Use artificial range for histogram_percentiles

* Simplify parameters for scalar probe counts

* Simplify tests for histogram probe counts

* Add test for incremental histogram aggregates

* Update scalar percentile counts to count distinct client ids

* Update readme for creating a new test

* Use unorded list for sublist

* Use --ignore-glob for pytest to avoid data files

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
2020-12-01 17:11:45 -08:00
Anthony Miyaguchi 44cc882c6e
Use SAFE_CAST with bucket values (#1533)
* Safe cast bucket values

* Add formatted sql
2020-11-09 10:01:35 -08:00
Anthony Miyaguchi 03544c1d69
Add total_user filter on glam exports (#1526) 2020-11-05 15:57:37 -08:00
Anthony Miyaguchi b77b542743
Replace GLAM temp functions with persistent functions (#1523)
* Replace GLAM temp functions with persistent functions

* Add generated sql

* Fix typo in udf name

* Add missing files and fully qualify udfs

* Add missing namespace

* Namespace even more things

* format sql
2020-11-05 13:42:09 -08:00
Anthony Miyaguchi 0c244613fb
Update glam fenix etl with updated scalar bucketing (#1493)
* Add initial udf replacements

* Update scalar bucketing scheme

* Update schemas in script

* Revert change to query

* Remove comma before CROSS JOIN

* Add functional query

* Add option to skip steps

* Add ordering for keys

* Update bigquery_etl/glam/templates/scalar_bucket_counts_v1.sql

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>

* Add instructions for copying tables and modify bucket location

* Generate schemas when GENERATE_ONLY specified

* Set build date to NULL instead of "*"

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
2020-11-03 16:07:00 -08:00
Anthony Miyaguchi fc594d9753
Check in schemas for GLAM Fenix queries (#1487)
* Add schemas for GLAM Fenix queries

* Add command for updating schemas for checked-in queries
2020-10-27 09:28:09 -07:00
Anthony Miyaguchi aab6fffdb9
Fix #1402 - Use org_mozilla_fenix.geckoview_version view for app_version in GLAM (#1456)
* Use join against geckoview_version view to get app_version

* Update logic to keep last 3 major versions

* Update template with proper syntax

* Keep 3 major versions instead of 4
2020-10-22 15:19:41 -07:00
Anthony Miyaguchi b7695049c6
Fix #1457 - Generate and run Fenix ETL for GLAM in glam-fenix-dev (#1458)
* Resolve generated sql to glam-fenix-dev and change output in sql/ dir

* Add new script for testing glam-fenix queries

* Add generated sql for version control

* Use variables correctly in bash

* Remove latest versions from UDF

* Update test to generate minimum set of tables for nightly

* Commit generated queries for testing

* Cast only if not glob

* Ignore dryrun and publish view for glam-fenix-dev

* Fix linting error

* Update comments

* Use DST_PROJECT consistently in scripts

* Update comments

* Update script/glam/test/test_glean_org_mozilla_fenix_glam_nightly

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>

* Update script/glam/generate_and_run_desktop_sql

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
2020-10-22 11:40:52 -07:00