* Indent `WHEN` and `ELSE` clauses one level more than `CASE`.
* Indent `THEN` clauses one level more than the corresponding `WHEN` clause.
* Have the content of `WHEN`, `THEN`, and `ELSE` clauses start on the same line as the clause keyword.
* Allow an alias, comma, or dot right after a `CASE` expression's `END`, as in the sketch below.
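A minimal sketch of these rules applied together (column and table names are made up for illustration):

```sql
SELECT
  CASE
    WHEN metric_value < 0
      THEN 0
    ELSE metric_value
  END AS clamped_value,
  client_id
FROM
  my_dataset.my_table
```

Here `WHEN` and `ELSE` sit one level inside `CASE`, `THEN` sits one level inside its `WHEN`, each clause's content starts on the keyword's line, and the alias follows `END` directly.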
This should fix the `extract_probe_counts_v1` query failing due to integer overflows when casting.
We got some big negative numbers in a few histogram probes around 2022-08-04, which appear to be random bit flips. I'm adding a lower bound of 0 on the value in a filter, as we do not expect negative values in any case.
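A hedged sketch of the kind of filter this adds (the table and column names are assumptions, not a verbatim excerpt from `extract_probe_counts_v1`):

```sql
SELECT
  metric,
  value
FROM
  glam_etl.clients_histogram_aggregates_v1
WHERE
  -- Bit-flipped values show up as large negative numbers; histogram
  -- values are never expected to be negative, so bound them at 0.
  value >= 0
```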
This is a follow-up to https://github.com/mozilla/bigquery-etl/pull/3037 which unblocked `scalar_bucket_counts_v1`.
`scalar_percentiles_v1` uses the same source table (`clients_scalar_aggregates_v1`) and started failing today with the same error (disk/memory limits exceeded for shuffle operations).
`APPROX_COUNT_DISTINCT` used here runs HLL under the hood. The reason for using it is that, unlike in the aforementioned PR, the aggregation can't be split into two stages because of the quantiles calculation.
I have run this query locally and confirmed that it works.
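As a hedged illustration (the dataset qualifier and grouping column are simplified; the real query also computes percentiles per metric), the HLL-backed approximation replaces an exact distinct count like this:

```sql
SELECT
  metric,
  -- APPROX_COUNT_DISTINCT keeps a small fixed-size HLL sketch per group
  -- instead of shuffling the full set of client_ids, which keeps the
  -- query within disk/memory limits for shuffle operations.
  APPROX_COUNT_DISTINCT(client_id) AS client_count
FROM
  glam_etl.clients_scalar_aggregates_v1
GROUP BY
  metric
```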
`scalar_bucket_counts_v1` queries started failing in 2022-06 for FOG and Fenix.
They started exceeding the BQ disk and memory limits available for shuffle operations, most likely because of the growing number of clients, metrics, and values.
This commit introduces HyperLogLog for estimating the number of clients in `scalar_bucket_counts`. In my tests, after this change, the query finished in 30-70 minutes, while the current production query fails after 4-6 hours.
I have run this query before and after the change on a 50% sample of the input table.
In terms of BQ slots used, this uses ~700 vs. ~2100 before; although not exact, this can be treated as an approximation of the run time.
In terms of errors, this notebook compares both outputs and user counts: https://colab.research.google.com/drive/1uilzQcFn1ppFMTTpi-RXj2EFLjCxX_mN#scrollTo=VfRZpBCjxgQM. It shows that 99.8% of the counts estimated with HLL have an error smaller than 0.1%, and 81% have an error below 0.01%. This makes the approach a good alternative to the sampling used in the legacy pipeline, as it does not introduce the risk of sampling out some populations.
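For cases where the aggregation *can* be split into stages (as in the PR referenced above), BigQuery's explicit sketch functions let the distinct count be built incrementally. A hedged sketch with illustrative grouping columns:

```sql
-- Stage 1: build a compact HLL sketch per (metric, app_version) group
-- instead of shuffling the full set of client_ids.
WITH sketches AS (
  SELECT
    metric,
    app_version,
    HLL_COUNT.INIT(client_id, 15) AS client_sketch  -- 15 is the default precision
  FROM
    glam_etl.clients_scalar_aggregates_v1
  GROUP BY
    metric,
    app_version
)
-- Stage 2: merge the partial sketches into one approximate count per metric.
SELECT
  metric,
  HLL_COUNT.MERGE(client_sketch) AS client_count
FROM
  sketches
GROUP BY
  metric
```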
`clients_scalar_aggregates_v1` tables are partitioned by version into 100 buckets starting from 0. This causes data from Firefox versions >= 100 to flow into the `__UNPARTITIONED__` partition, effectively rendering these tables unpartitioned.
This bumps the partitioning range to 200. Per the [Firefox Release Calendar](https://wiki.mozilla.org/Release_Management/Calendar), this is enough for ~7 years.
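A hedged sketch of the updated partition spec (the DDL is simplified and `app_version` is an assumed column name):

```sql
CREATE OR REPLACE TABLE glam_etl.clients_scalar_aggregates_v1 (
  app_version INT64,
  client_id STRING
)
-- Upper bound bumped from 100 to 200 so Firefox versions >= 100 get
-- real partitions instead of falling into __UNPARTITIONED__.
PARTITION BY RANGE_BUCKET(app_version, GENERATE_ARRAY(0, 200, 1))
```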
* Stop calculating probes that cause errors on GLAM from the ETL (#2792)
* #2792 Glam: applying same probe blocklist mechanism to glean queries
* #2792 Add 'product' parameter to the excluded-probes function
* added sample counts for desktop and glean products (both scalars and histograms)
* add minimum client count for fenix
* FOG date check
* added sample counts for glam fenix
* formatted for black format check
* Revert "formatted for black format check"
This reverts commit cf71fed487.
* formatted for black format check
* added the sample count scripts
* Fix egregious double counting in scalar bucket counts
* Update for newer version of black
* Update scalar bucket count test to account for combinations
* Update minimal test for histogram bucket counts
* Add test for multiple clients in histogram aggregates
* Remove deduplicated cte in histogram bucket counts
* Use count distinct for client counts to be explicit
* Add script to determine query dependencies
* Add schemas and folders for minimal test
* Add schema for geckoview_versions
* Add query params to each query
* Update schema for new queries
* Remove main from bootstrap file
* Add dataset prefix to schemas
* Add failing test for clients_histogram_aggregates
It turns out that the dependency resolution I'm using to autogenerate
the schemas ignores the views. I actually want to keep the views
around. The tables also all need to be prefixed with the dataset name, or
they won't be inserted into the SQL query correctly.
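A minimal illustration of the prefixing point (the unprefixed form is the one that broke; `glam_etl` is the dataset used in these tests):

```sql
-- Without the dataset prefix, the table is not substituted correctly
-- into the generated test query:
SELECT * FROM clients_histogram_aggregates_v1;
-- Prefixed with the dataset name, as the tests require:
SELECT * FROM glam_etl.clients_histogram_aggregates_v1;
```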
* Add successful test for clients histogram aggregates
* Add minimal tests for clients_scalar_aggregates
* Remove skeleton files for views (no test support for views)
* Add tests for latest versions
* Add tests for scalar bucket counts that pass
* Add scalar bucket counts
* Add test for scalar percentiles
* Add test for histogram bucket counts
* Add passing test for probe counts
* Add test for histogram percentiles
* Add tests for extract counts
* Update readme
* Add data for scalar percentiles test
* Fix linting errors
* Fix mypy issues with tests module
* Name it data instead of tests.*.data
* Ignore mypy on tests directory
* Remove mypy section
* Remove extra line in pytest
* Try pytest invocation of mypy-scripts-are-modules
* Run mypy outside of pytest
* Use exec on pytest instead of mypy
* Update tests/sql/glam-fenix-dev/glam_etl/bootstrap.py
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Update tests/sql/glam-fenix-dev/glam_etl/README.md
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Document bootstrap in documentation
* Use artificial range for histogram_percentiles
* Simplify parameters for scalar probe counts
* Simplify tests for histogram probe counts
* Add test for incremental histogram aggregates
* Update scalar percentile counts to count distinct client ids
* Update readme for creating a new test
* Use unordered list for sublist
* Use --ignore-glob for pytest to avoid data files
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Replace GLAM temp functions with persistent functions
* Add generated sql
* Fix typo in udf name
* Add missing files and fully qualify udfs
* Add missing namespace
* Namespace even more things
* format sql
* Use join against geckoview_version view to get app_version
* Update logic to keep last 3 major versions
* Update template with proper syntax
* Keep 3 major versions instead of 4