* Remove percentiles
* Remove tests that test percentiles
* Refresh scripts insert null to new percentiles
* Remove percentile columns from queries and schemas
* Delete more percentile tables
* Formatting
* histogram_cast_struct's keys are strings
* Re-add test after fixing failure cause
* Add fully-qualified identifiers when formatting queries
* Fully-qualified identifiers for queries in sql/
* Check in only formatted SQL to generated-sql branch
* Add comment
* Fully qualify more tables
* Fully qualify test files
* Formatting improvements around CTEs and unit tests
* Option to skip auto qualifying queries
* Added metadata.yaml files to glam-fenix-dev project
* Add owner to metadata.yaml files for glam-fenix-dev
---------
Co-authored-by: Eduardo Filho <edugomfilho@gmail.com>
* Indent `WHEN` and `ELSE` clauses one level more than `CASE`.
* Indent `THEN` clauses one level more than the corresponding `WHEN` clause.
* Have the content of `WHEN`, `THEN`, and `ELSE` clauses start on the same line as the clause keyword.
* Allow an alias, comma, or dot right after a `CASE` statement's `END`.
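Under those rules, a formatted `CASE` expression would look roughly like this (an illustrative sketch, not output copied from the formatter; table and column names are made up):

```sql
SELECT
  CASE
    WHEN normalized_channel = 'nightly'
      THEN 1
    WHEN normalized_channel = 'beta'
      THEN 2
    ELSE 3
  END AS channel_rank,
FROM
  some_dataset.some_table
```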
This is a follow-up to https://github.com/mozilla/bigquery-etl/pull/3037 which unblocked `scalar_bucket_counts_v1`.
`scalar_percentiles_v1` uses the same source table (`clients_scalar_aggregates_v1`) and started failing today with the same error (disk/memory limits exceeded for shuffle operations).
`APPROX_COUNT_DISTINCT` used here runs HLL under the hood. We use it because the aggregation can't be split into two stages as in the aforementioned PR, due to the quantiles calculation.
I have run this query locally and confirmed that it works.
`scalar_bucket_counts_v1` queries started failing in 2022-06 for FOG and Fenix.
They started exceeding the BQ disk and memory limits available for shuffle operations, most likely because of the growing number of clients, metrics, and values.
This commit introduces HyperLogLog for estimating the number of clients in `scalar_bucket_counts`. In my tests, after this change, the query finished in 30-70 minutes, while the current production query fails after 4-6 hours.
I have run this query before and after the change on a 50% sample of the input table.
In terms of BQ slots used, the new query uses ~700 vs. ~2100 before; although not exact, this can be treated as an approximation of the run time.
In terms of errors, this notebook compares both outputs and user counts: https://colab.research.google.com/drive/1uilzQcFn1ppFMTTpi-RXj2EFLjCxX_mN#scrollTo=VfRZpBCjxgQM. It shows that 99.8% of counts estimated with HLL have an error smaller than 0.1%, and 81% have an error below 0.01%. This makes this approach a good alternative to the sampling used in the legacy pipeline, as it does not introduce the risk of sampling out some populations.
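For intuition about why the estimates above are so accurate, HyperLogLog-style counting can be sketched in a few lines of Python. This is a toy sketch for illustration only; BigQuery's `APPROX_COUNT_DISTINCT` is the real implementation used in the query.

```python
import hashlib
import math

def hll_estimate(items, b=10):
    """Toy HyperLogLog: estimate the number of distinct items
    using m = 2**b registers instead of storing every item."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        # 64-bit hash of the item
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (m - 1)        # low b bits select a register
        w = h >> b               # remaining (64 - b) bits feed the rank
        rank = (64 - b) - w.bit_length() + 1  # position of leftmost 1-bit
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    est = alpha * m * m / sum(2.0 ** -r for r in registers)
    # small-range correction: fall back to linear counting
    zeros = registers.count(0)
    if est <= 2.5 * m and zeros:
        est = m * math.log(m / zeros)
    return est
```

The sketch uses constant memory (1 KiB of registers here) regardless of input size, which is exactly the trade that lets the production query fit within BQ's shuffle limits.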
added sample counts for desktop and glean products (both scalars and histograms)
add sample counts for scalars and histograms
* added sample counts for glam fenix
* formatted for black format check
* Revert "formatted for black format check"
This reverts commit cf71fed487.
* formatted for black format check
* added the sample count scripts
* Fix egregious double counting in scalar bucket counts
* Update for newer version of black
* Update scalar bucket count test to account for combinations
* Update minimal test for histogram bucket counts
* Add test for multiple clients in histogram aggregates
* Remove deduplicated cte in histogram bucket counts
* Use count distinct for client counts to be explicit
* Add script to determine query dependencies
* Add schemas and folders for minimal test
* Add schema for geckoview_versions
* Add query params to each query
* Update schema for new queries
* Remove main from bootstrap file
* Add dataset prefix to schemas
* Add failing test for clients_histogram_aggregates
It turns out that the dependency resolution I'm using to autogenerate
the schemas is ignoring the views. I actually want to keep the views
around. The tables also all need to be prefixed with the dataset name,
or they won't be inserted into the SQL query correctly.
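The dataset-prefixing step can be sketched roughly like this. The helper name and regex are hypothetical, for illustration only, and are not the actual bigquery-etl implementation.

```python
import re

def qualify_tables(sql: str, dataset: str, tables: list) -> str:
    """Prefix bare table references with the dataset name so they
    substitute correctly into the generated SQL query."""
    for table in tables:
        # Skip references that are already qualified (preceded by a dot)
        # or that are part of a longer identifier.
        pattern = rf"(?<![\w.`]){re.escape(table)}(?![\w.])"
        sql = re.sub(pattern, f"{dataset}.{table}", sql)
    return sql
```

An already-qualified reference such as `glam_etl.clients_histogram_aggregates_v1` is left untouched because the lookbehind rejects a preceding dot.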
* Add successful test for clients histogram aggregates
* Add minimal tests for clients_scalar_aggregates
* Remove skeleton files for views (no test support for views)
* Add tests for latest versions
* Add passing tests for scalar bucket counts
* Add scalar bucket counts
* Add test for scalar percentiles
* Add test for histogram bucket counts
* Add passing test for probe counts
* Add test for histogram percentiles
* Add tests for extract counts
* Update readme
* Add data for scalar percentiles test
* Fix linting errors
* Fix mypy issues with tests module
* Name it data instead of tests.*.data
* Ignore mypy on tests directory
* Remove mypy section
* Remove extra line in pytest
* Try pytest invocation of mypy-scripts-are-modules
* Run mypy outside of pytest
* Use exec on pytest instead of mypy
* Update tests/sql/glam-fenix-dev/glam_etl/bootstrap.py
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Update tests/sql/glam-fenix-dev/glam_etl/README.md
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Document bootstrap in documentation
* Use artificial range for histogram_percentiles
* Simplify parameters for scalar probe counts
* Simplify tests for histogram probe counts
* Add test for incremental histogram aggregates
* Update scalar percentile counts to count distinct client ids
* Update readme for creating a new test
* Use unordered list for sublist
* Use --ignore-glob for pytest to avoid data files
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Replace GLAM temp functions with persistent functions
* Add generated sql
* Fix typo in udf name
* Add missing files and fully qualify udfs
* Add missing namespace
* Namespace even more things
* format sql
* Use join against geckoview_version view to get app_version
* Update logic to keep last 3 major versions
* Update template with proper syntax
* Keep 3 major versions instead of 4
* Resolve generated sql to glam-fenix-dev and change output in sql/ dir
* Add new script for testing glam-fenix queries
* Add generated sql for version control
* Use variables correctly in bash
* Remove latest versions from UDF
* Update test to generate minimum set of tables for nightly
* Commit generated queries for testing
* Cast only if not glob
* Ignore dryrun and publish view for glam-fenix-dev
* Fix linting error
* Update comments
* Use DST_PROJECT consistently in scripts
* Update comments
* Update script/glam/test/test_glean_org_mozilla_fenix_glam_nightly
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Update script/glam/generate_and_run_desktop_sql
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>