* glam: Partition clients_histogram_aggregates by sample_id (has been running like this since April 3 from a different branch)
* glam: Non normalized aggregations to legacy histograms
* glam: add non-normalized aggs to probe counts extract
* glam: add init.sql to relevant tbls for non-norm aggs
* glam: ignore dryrun histogram_percentiles
* glam: add description and eol to init
* fix schema files
* fix clients_histogram_probe_counts schema
* remove another init.sql
* fix dryrun ignore order
* fix table name
* change dryrun ignore order to try to avoid fenix being on the path
* another change in dryrun
* Move glam queries from dryrun to bqetl_project.yaml to ignore
* add tbl deps on tests
* Include new fields for SubPlat in `fxa_content_auth_stdout_events`.
* Include new fields for SubPlat in `nonprod_fxa_content_auth_stdout_events`.
* Include new fields for SubPlat in `fxa_all_events`.
* Move new `time` column to be by the other timestamp columns.
* Keep `subscribed_plan_ids` as a string so it's accessible in Looker.
* Add `schema.yaml` files for FxA events ETLs.
This lets the tables be staged successfully in CI so that downstream ETLs/views pass.
* Fully qualify view in `fxa_users_daily_v1` to try to get test to pass.
* Rename `time` column to `event_time`.
* Include new fields for SubPlat in `nonprod_fxa_all_events`.
---------
Co-authored-by: Daniel Thorn <dthorn@mozilla.com>
* DENG-850 Retrieve FALSE instead of NULL in the metadata when there isn't a first_session or metrics ping.
* DENG-850 Unit test for no first session ping.
* DENG-850 syntax fix
* DENG-850 Tests for first session ping, and no baseline ping.
* DENG-850 Tests suite.
* DENG-850 YAML fixes.
* DENG-850 Adjustment to the case of reported first_session and metrics ping. The unit tests are adjusted to get the value for reported pings.
* DENG-850 Add sample id to test.
---------
Co-authored-by: Lucia Vargas <lvargas@mozilla.com>
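The DENG-850 change above reports FALSE rather than NULL when a client has no first_session or metrics ping. A minimal Python sketch of that defaulting follows; the field names `reported_first_session_ping` and `reported_metrics_ping` are hypothetical, since the log does not name the columns, and `None` stands in for SQL NULL.

```python
from typing import Optional


def reported_flag(value: Optional[bool]) -> bool:
    """Mirror COALESCE(value, FALSE): a missing ping is reported as False."""
    return False if value is None else value


# Hypothetical metadata rows; None stands in for SQL NULL.
rows = [
    {"client_id": "a", "reported_first_session_ping": True, "reported_metrics_ping": None},
    {"client_id": "b", "reported_first_session_ping": None, "reported_metrics_ping": None},
]

# Apply the default to both flags so downstream consumers never see NULL.
metadata = [
    {
        "client_id": row["client_id"],
        "reported_first_session_ping": reported_flag(row["reported_first_session_ping"]),
        "reported_metrics_ping": reported_flag(row["reported_metrics_ping"]),
    }
    for row in rows
]
```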
* DENG-775 Added session_id to JOIN between GA data and stub_attr.stdout. Also expanded date range on GA session data to [download_date - 2 days, download_date + 1 day]
* Updated query to handle missing GA download_session_id. It effectively applies V1 logic to the MISSING_GA_CLIENT dl_tokens.
* Initial table definitions for dl_token processing. Includes update to sql pytest_plugin to account for tablenames with date suffixes.
* Removed cluster reference and shortened description
* Added sql/moz-fx-data-marketing-prod/ga_derived/downloads_with_attribution_v1/query.sql to dryrun skip
* Added time_on_site
* Moved country_names sample test data file.
* Update bigquery_etl/pytest_plugin/sql.py
Co-authored-by: Daniel Thorn <dthorn@mozilla.com>
* Update sql/moz-fx-data-marketing-prod/ga_derived/downloads_with_attribution_v1/query.sql
Co-authored-by: Frank Bertsch <frank.bertsch@gmail.com>
* Update sql/moz-fx-data-marketing-prod/ga_derived/downloads_with_attribution_v1/query.sql
Co-authored-by: Frank Bertsch <frank.bertsch@gmail.com>
* Updated based on PR feedback. Added LEFT JOIN to ensure sessions without pageviews are not dropped.
* Set has_ga_download_event = null if exception=GA_UNRESOLVABLE
* Standardized logic for time_on_site
* Added test for multiple downloads for 1 session.
Also added a detailed description of the table.
* Updated to use mode_last_retain_nulls instead of ANY_VALUE
* Set pageviews, unique_pageviews = 0 if null.
* Added boolean additional_download_occurred to indicate if another download occurred in the same session.
---------
Co-authored-by: Daniel Thorn <dthorn@mozilla.com>
Co-authored-by: Frank Bertsch <frank.bertsch@gmail.com>
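Two of the DENG-775 changes above can be sketched in Python: the widened GA session date range `[download_date - 2 days, download_date + 1 day]` and the rule that null `pageviews` become 0. The helper names below are hypothetical and only illustrate the logic; the actual change lives in `query.sql`.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def ga_session_window(download_date: date) -> Tuple[date, date]:
    """Inclusive GA session date range considered for a given download date."""
    return download_date - timedelta(days=2), download_date + timedelta(days=1)


def in_window(session_date: date, download_date: date) -> bool:
    """True if a GA session date falls inside the window for the download."""
    start, end = ga_session_window(download_date)
    return start <= session_date <= end


def fill_pageviews(pageviews: Optional[int]) -> int:
    """A session with no pageview rows (NULL after a LEFT JOIN) counts as 0."""
    return 0 if pageviews is None else pageviews
```

For example, a download on 2023-03-10 would be matched against GA sessions dated 2023-03-08 through 2023-03-11.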
* feat: new field in search clients daily - is_sap_monetizable
* Added column to tests
---------
Co-authored-by: Alexander Nicholson <anicholson@mozilla.com>
* Revert "CI fixes for supporting private UDFs in bigquery-etl - DENG-735 (#3631)"
This reverts commit edcfe758f7.
* Added stub UDF for monetized_search
* Add docs for using a private internal UDF
* Minimize stub normalize_search_engine UDF and usage in search_clients_last_seen tests
* Move sql tests downstream of private-generate-sql and copy UDFs into sql-dir for tests
* Add sidebar search probes to clients_daily tables
* update downstream schemas
* Format fix
* Add new field to test
* Update schemas with hist fields
* Remove duplicated field from schema
---------
Co-authored-by: Glenda Leonard <75265513+gleonard-m@users.noreply.github.com>
* Add urlbar_persisted to query for daily search client table
* Add column to schema (not backward breaking)
* Address test expectations
* Update schemata and queries for companion tables
* Make adjustments that Alex identified in the PR to make sure the new fields get ingested properly
* Run schema update for clients_daily_v6
* Update the suggest_impression_sanitized_v3 query
* exclude region and country when preparing for the join
* filter `impressions.request_id` to non null to drop queries without a
corresponding impression.
* Add tests? Haven't figured out bootstrapping issues on my M1 yet so not
sure how well these will work. TO CI!
* Swap the left and right and remove the conditional on the final join
* Align expectations
* Will this fix the tests? Tune in to find out.
* Fix expectations AGAIN
* Update based on review comments and formatter changes
This is a follow-up to https://github.com/mozilla/bigquery-etl/pull/3037 which unblocked `scalar_bucket_counts_v1`.
`scalar_percentiles_v1` uses the same source table (`clients_scalar_aggregates_v1`) and started failing today with the same error (disk/memory limits exceeded for shuffle operations).
`APPROX_COUNT_DISTINCT`, used here, runs HyperLogLog (HLL) under the hood. We use it because, unlike in the aforementioned PR, this aggregation can't be split into two stages due to the quantiles calculation.
I have run this query locally and confirmed that it works.
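To illustrate why a quantile calculation resists the two-stage split mentioned above, here is a small Python sketch with made-up numbers. Sums computed per shard can be combined in a second stage, but a median of per-shard medians generally differs from the median over all values. This is an illustration of the general property only, not of the query itself.

```python
from statistics import median

# Two hypothetical shards of values for one metric.
shard_a = [1, 2, 3]
shard_b = [10, 20, 30, 40, 50, 60, 70]

# Sums compose: per-shard partial sums can be combined in a second stage.
two_stage_sum = sum([sum(shard_a), sum(shard_b)])
one_stage_sum = sum(shard_a + shard_b)

# Medians do not compose: combining per-shard medians gives a different answer
# than computing the median over all values at once.
two_stage_median = median([median(shard_a), median(shard_b)])
one_stage_median = median(shard_a + shard_b)
```

Here both sums are 286, while the two-stage median is 21 and the single-stage median is 25.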
* CONSVC-1681 Add mobile data to contextual services event_aggregates
See https://mozilla-hub.atlassian.net/browse/CONSVC-1681
* Use 'phone' instead of 'mobile'
* Update init.sql
* Commentary on filter
* Aggregation test update
* Update overactive filter test
* Dry run exemptions
* Update sql/moz-fx-data-shared-prod/contextual_services_derived/event_aggregates_v1/query.sql
* format
This better matches the current client behavior for matching.
We're currently getting `<disallowed>` in results and results with uppercase
letters. I don't think preserving these differences has analytical value,
and it makes the results harder to work with.
* ROAD-85 Simple sanitization job for Merino logs
See https://mozilla-hub.atlassian.net/browse/ROAD-85
This uses the adM allowlist of queries for sanitization, so can be expressed
entirely in a single query. Future iterations will involve python logic and
will likely need to be held elsewhere.
* Separate external query and query to copy data into shared-prod
* add provider for quicksuggest tables
* update test
* update test
* update test
* update test
* update test
* update test
Co-authored-by: Jeff Klukas <jklukas@mozilla.com>
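The allowlist-based sanitization described for ROAD-85 above can be sketched in Python as a plain membership check. The function name and sample data are hypothetical, and the log does not say whether non-matching queries are dropped or blanked; this sketch blanks them.

```python
from typing import Iterable, List, Optional, Set


def sanitize_queries(queries: Iterable[str], allowlist: Set[str]) -> List[Optional[str]]:
    """Keep a query only if it appears in the allowlist; otherwise blank it."""
    return [query if query in allowlist else None for query in queries]


# Hypothetical allowlist and logged queries.
allowlist = {"shoes", "laptops"}
logged = ["shoes", "my home address", "laptops"]
sanitized = sanitize_queries(logged, allowlist)
```

Because the check is a set-membership test against a fixed list, it maps naturally onto a single SQL join, which matches the note above that the job can be expressed in one query.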
Added the replace_outlier_values_with_zero UDF and used it to
replace the values in the keyed scalar metrics with 0 if
they pass a threshold value.
Also renamed some function params, added a test, and fixed
an off-by-1 error in the index->position transform.
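A minimal Python stand-in for the behavior described above, assuming keyed scalars are represented as `(key, value)` pairs and that "pass a threshold" means strictly greater than it. The real UDF is SQL, and its signature is not shown in this log, so the parameters here are hypothetical.

```python
from typing import List, Tuple

KeyedScalar = List[Tuple[str, int]]


def replace_outlier_values_with_zero(entries: KeyedScalar, threshold: int) -> KeyedScalar:
    """Return entries with any value above the threshold replaced by 0.

    Assumption: the comparison is strict (value > threshold).
    """
    return [(key, 0 if value > threshold else value) for key, value in entries]


# Hypothetical keyed scalar with one implausibly large value.
cleaned = replace_outlier_values_with_zero(
    [("google", 3), ("bing", 10_000_000), ("ddg", 0)], threshold=10_000
)
```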
Motivated in particular by https://github.com/mozilla/bigquery-etl/pull/2115
where new changes ended up making the test case too complex to run.
The difficulty of updating this test case is outweighing the safety benefit
at this point, so we are removing it, but leaving a pointer in case we want to
reestablish the test in the future.
* Add tests for core_clients_first_seen_init
* Add failing test for core clients first seen
* Fix issues with core clients first seen
* Keep left join
* Update sql/moz-fx-data-shared-prod/telemetry_derived/core_clients_first_seen_v1/query.sql
Co-authored-by: Jeff Klukas <jklukas@mozilla.com>
* Add first_seen_date to core_clients_daily and last seen
Supports KPI work for iOS and Focus apps.
See https://docs.google.com/document/d/1-sifTuu3lWd5umvaUmncFrdBIK6eKVTPzmGDLv6GDak/edit?ts=6078667e#
* Update tests
* Add new_profiles to mobile_usage
* Make sure is_new_profile reflects only current day
* Remove is_new_profile from core_clients_last_seen
This field could be confusing.
If we do `COUNTIF(is_new_profile)`,
we'll overcount since a client that appears on a single day will continue
to appear in clients_last_seen with is_new_profile=True carried over from the
original day of observation.
* Remove is_new_profile from core_clients_last_seen query
* bugfixes
* DAG change
* Add first_seen_date and related test fixtures
* Use is_new_profile instead of baseline_first_seen
* Update view for baseline_clients_first_seen
* Fix yamllint issues
* Set is_new_profile when submission matches first seen
* Include AS in table alias
* Nit: capitalize AS
* Update bigquery_etl/glean_usage/templates/baseline_clients_daily_v1.sql
Co-authored-by: Jeff Klukas <jklukas@mozilla.com>
* Update bigquery_etl/glean_usage/templates/baseline_clients_daily_v1.sql
Co-authored-by: Jeff Klukas <jklukas@mozilla.com>
* Update clustering specification
Co-authored-by: Jeff Klukas <jklukas@mozilla.com>
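The overcounting concern raised above for `is_new_profile` in `core_clients_last_seen` can be reproduced with a small Python sketch. The rows are made up: one client is observed on a single day, and a last-seen style table carries its row forward to later dates with the flag copied over. Comparing `first_seen_date` with `submission_date` avoids the overcount.

```python
from datetime import date

first_seen = date(2021, 4, 1)

# Hypothetical last-seen style rows: the client was active only on April 1,
# but its row is carried forward to April 2 and 3 with the flag copied over.
last_seen = [
    {
        "submission_date": date(2021, 4, day),
        "client_id": "a",
        "first_seen_date": first_seen,
        "is_new_profile": True,  # carried over from the original day
    }
    for day in (1, 2, 3)
]

# Equivalent of COUNTIF(is_new_profile): counts the same client three times.
carried_flag_count = sum(1 for row in last_seen if row["is_new_profile"])

# Counting only rows where the submission date matches the first seen date
# counts the client once, on the day it was actually new.
matched_date_count = sum(
    1 for row in last_seen if row["first_seen_date"] == row["submission_date"]
)
```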
* Add initial boilerplate for clients_first_seen
* Remove submission_timestamp as a field
* [wip] Join data against legacy fennec id if applicable
* Remove user facing view
* Revert "Remove user facing view"
This reverts commit a728a7882170eadad5413c7a7046c0f38297bb87.
* Add flag for fennec_id
* Update logic to limit rows in partitions to submission_date
* Add all sql in glean_usage to format ignores
* Separate init and query
* Add default encoders for testing sql
* Add test for initialization of baseline clients first seen in fenix
* Update query to update over previous history
* Add test for aggregation
* Add generated sql and tests for simple baseline clients first seen
* Add dry-run exceptions for clients first seen tables
* Add clients first seen to generated sql
* Update bigquery_etl/glean_usage/templates/baseline_clients_first_seen.metadata.yaml
Co-authored-by: Jeff Klukas <jklukas@mozilla.com>
* Update bigquery_etl/glean_usage/templates/baseline_clients_first_seen.metadata.yaml
Co-authored-by: Jeff Klukas <jklukas@mozilla.com>
* Group by sample id instead of min
* Add submission_date as baseline first seen date
Co-authored-by: Jeff Klukas <jklukas@mozilla.com>
* Convert test data to yaml and add ad_click/search_with_ads scalar data
* Convert expected data to yaml
* Fix expected test results
* Add new columns in test_experiments
* Update sql/moz-fx-data-shared-prod/search_derived/search_clients_daily_v8/query.sql
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Fix egregious double counting in scalar bucket counts
* Update for newer version of black
* Update scalar bucket count test to account for combinations
* Update minimal test for histogram bucket counts
* Add test for multiple clients in histogram aggregates
* Remove deduplicated cte in histogram bucket counts
* Use count distinct for client counts to be explicit
* Add script to determine query dependencies
* Add schemas and folders for minimal test
* Add schema for geckoview_versions
* Add query params to each query
* Update schema for new queries
* Remove main from bootstrap file
* Add dataset prefix to schemas
* Add failing test for clients_histogram_aggregates
It turns out that the dependency resolution I'm using to autogenerate
the schemas is ignoring the views. I actually want to keep the views
around. The tables also all need to be prefixed with the dataset name or
they won't be inserted into the sql query correctly.
* Add successful test for clients histogram aggregates
* Add minimal tests for clients_scalar_aggregates
* Remove skeleton files for views (no test support for views)
* Add tests for latest versions
* Add tests for scalar bucket counts that passes
* Add scalar bucket counts
* Add test for scalar percentiles
* Add test for histogram bucket counts
* Add passing test for probe counts
* Add test for histogram percentiles
* Add tests for extract counts
* Update readme
* Add data for scalar percentiles test
* Fix linting errors
* Fix mypy issues with tests module
* Name it data instead of tests.*.data
* Ignore mypy on tests directory
* Remove mypy section
* Remove extra line in pytest
* Try pytest invocation of mypy-scripts-are-modules
* Run mypy outside of pytest
* Use exec on pytest instead of mypy
* Update tests/sql/glam-fenix-dev/glam_etl/bootstrap.py
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Update tests/sql/glam-fenix-dev/glam_etl/README.md
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Document bootstrap in documentation
* Use artificial range for histogram_percentiles
* Simplify parameters for scalar probe counts
* Simplify tests for histogram probe counts
* Add test for incremental histogram aggregates
* Update scalar percentile counts to count distinct client ids
* Update readme for creating a new test
* Use unordered list for sublist
* Use --ignore-glob for pytest to avoid data files
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
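The difference between counting rows and counting distinct clients, which the bucket-count fixes in this group touch on, can be sketched in Python. The log does not state the exact cause of the double counting, so the duplicated row below is only an assumed example of how a client can be counted twice within one group.

```python
from collections import defaultdict

# Hypothetical aggregate rows; client "a" appears twice in the same group.
rows = [
    {"client_id": "a", "metric": "m", "bucket": "1"},
    {"client_id": "a", "metric": "m", "bucket": "1"},
    {"client_id": "b", "metric": "m", "bucket": "1"},
]

row_counts = defaultdict(int)   # analogous to COUNT(*)
client_sets = defaultdict(set)  # analogous to COUNT(DISTINCT client_id)

for row in rows:
    key = (row["metric"], row["bucket"])
    row_counts[key] += 1
    client_sets[key].add(row["client_id"])

distinct_counts = {key: len(clients) for key, clients in client_sets.items()}
```

In this example the row count for the group is 3, while the distinct client count is 2.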
* Bug 1677609 Join clients_first_seen into clients_last_seen
Several folks on DS report that they have been getting great value from
clients_first_seen, as the first_seen_date there is a much more stable way
to define new profiles compared to using profile_created_date from pings.
Currently, using first_seen_date requires doing a join between these two tables.
This PR adds that join to the clients_last_seen query itself to make this
workflow more efficient. I'd like to get this merged before we proceed with
the backfill discussed in https://bugzilla.mozilla.org/show_bug.cgi?id=1677609
This change has a few operational implications. Most importantly, it makes
clients_last_seen dependent on clients_first_seen, so those queries can no
longer proceed in parallel. `clients_first_seen` takes on average 10 minutes
to run, so we'll be delaying all ETL downstream of `clients_last_seen` by
about 10 minutes, which seems acceptable. It also adds some mental complexity
to the model.
The extra join does not appear to significantly slow down the
`clients_last_seen` query itself; it scans about 15% more data and consumes
about 15% more slot time.
I expect the performance is dominated by the existing join between
clients_daily and the previous day of clients_last_seen.
* Bug 1677609 Add core active fields to clients_last_seen
See https://bugzilla.mozilla.org/show_bug.cgi?id=1677609
This adds just the new underlying bit pattern fields that will need to be
backfilled, and these will be hidden from users initially.
After the backfill is complete, we will update the view to include these
fields along with the various fields derived from them.
We include days_visited_10_uri_bits which was not explicitly requested in
the context of this bug, but was proposed as part of the prototype feature_usage
table (https://github.com/mozilla/bigquery-etl/pull/1193); it may be useful
for future comparisons.
* Update tests to match new logic
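For readers unfamiliar with bit pattern fields such as `days_visited_10_uri_bits`, here is a minimal Python sketch of how such a field can be rolled forward each day. It assumes a 28-day window and a convention where bit 0 represents the current day; neither detail is stated in this log, and the function names are hypothetical.

```python
from typing import Optional

# Assumption: the pattern tracks a 28-day window.
MASK_28 = (1 << 28) - 1


def update_bits(previous_bits: int, active_today: bool) -> int:
    """Shift yesterday's pattern by one day and set bit 0 if active today."""
    return ((previous_bits << 1) & MASK_28) | int(active_today)


def days_since_seen(bits: int) -> Optional[int]:
    """Days since the most recent active day, or None if none in the window."""
    if bits == 0:
        return None
    # Isolate the lowest set bit and convert it to a zero-based index.
    return (bits & -bits).bit_length() - 1
```

A field stored this way has no values for dates before it was introduced, which is consistent with the note above that the new fields need a backfill before they are exposed in the view.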
* Add initial incremental query for geckoview build dates
* Add initial tests for incremental query (WIP)
* Add files for initial tests
* Rework query so it doesn't fail during tests
* Fix schema so queries run
* Add passing test for init
* Add test for query aggregation
* Add metadata file for scheduling the query
* Move scripts from fenix_nightly to fenix
* Remove scheduling
* Add document strings.
* Change dataset reference and indent comments correctly
* Remove init and address feedback
* remove init file
* make query idempotent by appending window to each submission_date
* rename n_builds to n_pings
* reduce window size from 30 days to 14 days
* avoid use of subqueries
* Update tests for query
* Fix tests
* Add failing test for 100
* Fix query so it works across fx100 boundary
* Add linting fixes
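The log does not say what broke at the fx100 boundary. One common way a 99 to 100 version change causes trouble is comparing version strings lexicographically, so the Python sketch below shows that failure mode and a numeric alternative. It is an illustration under that assumption, not a description of the actual fix, and `major_version` is a hypothetical helper.

```python
def major_version(version: str) -> int:
    """Parse the leading major version number from a string like '100.0'."""
    return int(version.split(".")[0])


versions = ["98.0", "99.0", "100.0"]

# String comparison puts '100.0' first because '1' sorts before '9'.
lexicographic = sorted(versions)

# Comparing parsed integers keeps the expected order across the boundary.
numeric = sorted(versions, key=major_version)

latest_by_string = max(versions)
latest_by_number = max(versions, key=major_version)
```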