* updating products and adding changed products model
* updating user profiles to join on external id
* updating tests to match new schema
* reverting naming
* updating test
* updating products model to include additional attributes and updating tests
* updating order and tests
* updating format
* updating source table
* updating schema
* updating user profiles table schema to match new products array
* adding models for changed and deleted users
* adding filter to changed waitlists model
* updating tests to include has_fxa field
* updating tests to include an unchanged user
* adding new line
* removing unnecessary cross join
* updating create statement
* joining on users table to filter for active and ensuring there is at least one subscription
* joining on users to filter for active
* adding dev subscription group
* removing fxa_id in favor of has_fxa
* bringing in update timestamp for downstream use
* updating formatting and adding filter for active users
* adding filter for one active newsletter
* updating tests
* adding fxa id back to users table to join to products
* updating query
* updating values
* updating tests
* fix test for subscriptions
* changing schema to array
* updating format
* updating to pull in all subscriptions with statuses
* removing create statement
* updating subscriptions query to make it an array and updating associated tests
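The array change described above can be sketched roughly as follows; the table and column names here are illustrative assumptions, not the actual schema:

```sql
-- Hypothetical sketch: collapse one row per subscription into a single
-- repeated column per user, keeping each subscription's status.
SELECT
  user_id,
  ARRAY_AGG(
    STRUCT(subscription_name, status, update_timestamp)
    ORDER BY update_timestamp DESC
  ) AS subscriptions
FROM
  subscriptions_v1
GROUP BY
  user_id
```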
* updating formatting and comment
---------
Co-authored-by: Leli Schiestl <lschiestl@mozilla.com>
* Remove init.sql files for fenix and use is_init() instead
* Remove init.sql files for search_derived datasets
* Simplify is_init() for acer_cohort_v1
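Replacing an `init.sql` with `is_init()` typically uses a Jinja conditional in the query itself; this is a sketch only, with an assumed table name and date filter:

```sql
SELECT
  client_id,
  submission_date
FROM
  `moz-fx-data-shared-prod.telemetry_derived.some_table_v1`
WHERE
  {% if is_init() %}
    -- initial (backfill) run: process the full history
    submission_date >= '2020-01-01'
  {% else %}
    -- scheduled incremental run: process a single partition
    submission_date = @submission_date
  {% endif %}
```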
* DENG-2979 - add logic to use GA4 client ID after it starts coming through instead of old GA3 one, & start updating tests
* DENG-2979 fix SQL formatting
* DENG-2979 - add temp table to do a final distinct
* DENG-2979 add a data check to make sure the code errors out if someone tries to run prior to 8-25-2023
* DENG-2979 update the date of the tests to be more recent and not fall in the data check fail period
* DENG-2979 update test cases
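A data check like the one described can be built on BigQuery's `ERROR()` function; this is a minimal sketch, assuming the `@submission_date` parameter used by bqetl queries:

```sql
-- Fail the job outright if it is run for a date before the GA4 cutover.
SELECT
  IF(
    @submission_date < DATE '2023-08-25',
    ERROR('GA4 client IDs are not available before 2023-08-25; refusing to run'),
    TRUE
  )
```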
* Bump black from 23.10.1 to 24.1.1
Bumps [black](https://github.com/psf/black) from 23.10.1 to 24.1.1.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](https://github.com/psf/black/compare/23.10.1...24.1.1)
---
updated-dependencies:
- dependency-name: black
dependency-type: direct:production
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
* Reformat files with black to fix dependabot update.
* Reformat with black 24.1.1. Update test dag with required space.
* Update test dags.
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Tweaking firefox_android_clients_v1 to also include play_store attribution fields
* removed additional logic used for testing found within _previous CTE
* removed firefox_android_clients_v1 init.sql in favour of templating via is_init() inside the query
* Made changes as suggested by fbertsch in PR#4940
* Fixing sql tests
* Avoid using `Path.glob()` or `Path.rglob()` for recursive file searches.
Because they don't currently support following symlinks (they will in Python 3.13).
* Specify `followlinks=True` as necessary when calling `os.walk()`.
* Adding topsite dismissals to newtab_* jobs.
* Update test to include new dismissal fields
---------
Co-authored-by: Curtis Morales <cmorales@mozilla.com>
Previously, NULL values in the join keys didn't join, resulting
in duplicate rows. This change will coalesce those to empty
strings and NULLIFY them in the view.
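The coalesce-and-NULLIF approach reads roughly like this (table and key names are assumptions):

```sql
-- Derived table: NULL keys never satisfy an equality join, so coalesce
-- them to empty strings before joining.
SELECT
  COALESCE(a.client_id, '') AS client_id,
  a.sessions,
  b.downloads
FROM
  table_a AS a
JOIN
  table_b AS b
  ON COALESCE(a.client_id, '') = COALESCE(b.client_id, '')
-- User-facing view: restore NULL semantics for consumers, e.g.
-- SELECT NULLIF(client_id, '') AS client_id, ... FROM derived_table
```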
* Add ga_clients_v1 table & view
- Query from ga_sessions
- Fix tests
* Use correct scheduling parameters
Co-authored-by: Alexander <anicholson@mozilla.com>
* Move HAVING clause to WHERE
Co-authored-by: Alexander <anicholson@mozilla.com>
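Moving a non-aggregate predicate from `HAVING` to `WHERE` filters rows before grouping instead of after, for example:

```sql
-- Before: the filter runs after aggregation.
SELECT country, COUNT(*) AS n
FROM sessions
GROUP BY country
HAVING country != 'ZZ';

-- After: the predicate references no aggregate, so it can run first
-- and fewer rows reach the GROUP BY.
SELECT country, COUNT(*) AS n
FROM sessions
WHERE country != 'ZZ'
GROUP BY country;
```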
* Change CTE name
Co-authored-by: Alexander <anicholson@mozilla.com>
---------
Co-authored-by: Alexander <anicholson@mozilla.com>
* Add derived stub attribution logs
This table keeps triplets from the stub attribution logs.
The triplet of (dl_token, ga_client_id, stub_session_id)
will only ever appear once here.
See the associated decision brief:
https://docs.google.com/document/d/1L4vOR0nCGawwSRPA9xiR8Hmu_8ozCGUecXAtBWmGGA0/edit
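The uniqueness guarantee amounts to a distinct projection over the three keys; the source table name here is an assumption:

```sql
-- One row per (dl_token, ga_client_id, stub_session_id) triplet.
SELECT DISTINCT
  dl_token,
  ga_client_id,
  stub_session_id
FROM
  stub_attribution_logs
```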
* Move stub attribution table to new dataset
In order to ensure limited access to the stub attribution service
data without significantly decreasing developer velocity, we
move these tables to a new dataset. That dataset has the defaults
we want for all stub attribution log data:
- Defaults to just read access to data-science/DUET workgroup
- No read/write access for DE
We will backfill via the bqetl_backfill DAG.
* Rename view
* Use correct dataset name in view
* Skip dryrun; no access
* DENG-850 Add test setup.
* DS-2947. Create new dataset and tests for Firefox Desktop Clients.
* DS-2947. Update dataset name to clients_first_seen_v2.
* DS-2947. Dataset name to clients_first_seen_v2.
* DS-2947. Updating tests.
* DS-2947. Schema for clients_first_seen_v2.
* DS-2947. Tests update.
* Tests update
* Restore test files
* DS-2947. Get data from main and new profile ping. Get first dltoken and dlsource available. Update tests.
* DS-2947. Use main ping's submission_timestamp_min to find the earliest ping.
* Remove app_display_version as it is normalised in app_version. Update fields on a 7-day window. Retrieve data from the ping with the earliest non-NULL value to remove NULLs when the main ping is not available.
* Update schemas, remove duplicated columns from query and init. Adapt the existing unit test and add a unit test for 7-day window updates. Include the scheduler in DAG bqetl_analytics_tables.
* Update to enable initialize from query.sql. Remove init.sql.
* Update DAGs dependencies.
* DAG bqetl_main_summary updated.
* Query and tests update to join with sample_id.
* Refactor metadata fields in query and tests.
* Schema and descriptions updated. Remove filter to query the existing table. Remove the DATETIME; first_seen_date is equivalent.
* Column required to be explicit in the query to match the schema.
* Test fix.
* Tests tmp changes.
* Remove 7-day window update and update tests.
* Add second_seen_date to the query
* DS-3037 Add second_seen_date and tests.
* DS-3037 Add is_init to calculate second_seen_date.
* remove files in analysis dataset
* DS-3037 Add is_init to calculate second_seen_date. Formatting.
* DS-2986 Add initialize script. Change submission_timestamp_min to submission_date due to NULL values in that field.
* DENG-1314. Update metadata reported pings and tests in the query.
* DS-3054. Update bqetl initialize command and query to support parallel run.
* DS-3054. Update query to use submission_timestamp_min from the main ping where available, for precision in the source ping for first_seen_date; add the source ping of second_seen_date; get first_seen_date only from the new_profile and shutdown pings, since 16% of clients have more than one new_profile ping. Add capability to run in parallel in bqetl. Update tests.
* DS-3054. Remove initialize.py.
* Reset unrelated formatting changes from this branch to match the main branch.
* Correct jira template.
* DS-2947. Update naming for attribution dltoken and dlsource.
* DS-2947. Update column names and tests and clarity for the initialization command.
* DS-2986. Create table with schema and metadata in command initialize.
* Document the expected result of each subquery.
* Documentation update.
* DS-3145. Include user agent and the source ping. Add query documentation.
* DS-3145. Update tests.
* DS-3146 Update logic to get attributes only from the ping that reports the first_seen_date, include locale, update the source for app_build_id and collect second_seen_date only from main ping.
* DS-3054. Updates and save initialization.
* DS-3054. Table name required for DAG generation.
* DS-2947_implement_bigquery_changes_in_another_PR.
* DS-2947 Naming
* Add clients_first_seen_v2 to skip dry-run.
---------
Co-authored-by: Lucia Vargas <lvargas@mozilla.com>
* added render subcommand to the bqetl check command
* added a dry_run flag to the bqetl check run command
* added a test to make sure the run command exits with status code 0
* added test for check render subcommand
* fixing linter checks
* attempting using an alternative way of testing the render command
* fixing render test by testing the _render() directly rather than the render cli wrapper
* removed dead test
* Apply suggestions from code review by ascholtz
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* fixed black and mypy errors
* fixed app_store_funnel_v1 check formatting
* reformatted tests checks
---------
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* glam: Partition clients_histogram_aggregates by sample_id (has been running like this since April 3 from a different branch)
* glam: Non normalized aggregations to legacy histograms
* glam: add non-normalized aggs to probe counts extract
* glam: add init.sql to relevant tbls for non-norm aggs
* glam: ignore dryrun histogram_percentiles
* glam: add description and eol to init
* fix schema files
* fix clients_histogram_probe_counts schema
* remove another init.sql
* fix dryrun ignore order
* fix table name
* change dryrun ignore order to try to avoid fenix being matched because of its path
* another change in dryrun
* Move glam queries from dryrun to bqetl_project.yaml to ignore
* add tbl deps on tests
* Include new fields for SubPlat in `fxa_content_auth_stdout_events`.
* Include new fields for SubPlat in `nonprod_fxa_content_auth_stdout_events`.
* Include new fields for SubPlat in `fxa_all_events`.
* Move new `time` column to be by the other timestamp columns.
* Keep `subscribed_plan_ids` as a string so it's accessible in Looker.
* Add `schema.yaml` files for FxA events ETLs.
So the tables can be successfully staged for CI for downstream ETLs/views to pass.
* Fully qualify view in `fxa_users_daily_v1` to try to get test to pass.
* Rename `time` column `event_time`.
* Include new fields for SubPlat in `nonprod_fxa_all_events`.
---------
Co-authored-by: Daniel Thorn <dthorn@mozilla.com>
* DENG-850 Retrieve FALSE instead of NULL in the metadata when there is no first_session or metrics ping.
* DENG-850 Unit test for no first session ping.
* DENG-850 syntax fix
* DENG-850 Tests for first session ping, and no baseline ping.
* DENG-850 Tests suite.
* DENG-850 YAML fixes.
* DENG-850 Adjustment to the case of reported first_session and metrics pings. The unit tests are adjusted to get the value for reported pings.
* DENG-850 Add sample id to test.
---------
Co-authored-by: Lucia Vargas <lvargas@mozilla.com>
* DENG-775 Added session_id to JOIN between GA data and stub_attr.stdout. Also expanded date range on GA session data to [download_date - 2 days, download_date + 1 day]
* Updated query to handle missing GA download_session_id. It effectively applies V1 logic to the MISSING_GA_CLIENT dl_tokens.
* Initial table definitions for dl_token processing. Includes update to sql pytest_plugin to account for tablenames with date suffixes.
* Removed cluster reference and shortened description
* Added sql/moz-fx-data-marketing-prod/ga_derived/downloads_with_attribution_v1/query.sql to dryrun skip
* Added time_on_site
* Moved country_names sample test data file.
* Update bigquery_etl/pytest_plugin/sql.py
Co-authored-by: Daniel Thorn <dthorn@mozilla.com>
* Update sql/moz-fx-data-marketing-prod/ga_derived/downloads_with_attribution_v1/query.sql
Co-authored-by: Frank Bertsch <frank.bertsch@gmail.com>
* Update sql/moz-fx-data-marketing-prod/ga_derived/downloads_with_attribution_v1/query.sql
Co-authored-by: Frank Bertsch <frank.bertsch@gmail.com>
* Updated based on PR feedback. Added LEFT JOIN to ensure sessions without pageviews are not dropped.
* Set has_ga_download_event = null if exception=GA_UNRESOLVABLE
* Standardized logic for time_on_site
* - Added test for multiple downloads for 1 session
- Added detailed description of table.
* Updated to use mode_last_retain_nulls instead of ANY_VALUE
* Set pageviews, unique_pageviews = 0 if null.
* Added boolean additional_download_occurred to indicate if another download occurred in the same session.
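The last two changes can be sketched together; the table and column names are assumptions based on the commit messages:

```sql
SELECT
  dl_token,
  COALESCE(pageviews, 0) AS pageviews,
  COALESCE(unique_pageviews, 0) AS unique_pageviews,
  -- TRUE when the same session produced more than one download.
  COUNT(*) OVER (PARTITION BY download_session_id) > 1
    AS additional_download_occurred
FROM
  downloads_with_attribution
```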
---------
Co-authored-by: Daniel Thorn <dthorn@mozilla.com>
Co-authored-by: Frank Bertsch <frank.bertsch@gmail.com>
* feat: new field in search clients daily - is_sap_monetizable
* Added column to tests
---------
Co-authored-by: Alexander Nicholson <anicholson@mozilla.com>
* Revert "CI fixes for supporting private UDFs in bigquery-etl - DENG-735 (#3631)"
This reverts commit edcfe758f7.
* Added stub UDF for monetized_search
* Add docs for using a private internal UDF
* Minimize stub normalize_search_engine UDF and usage in search_clients_last_seen tests
* Move sql tests downstream of private-generate-sql and copy UDFs into sql-dir for tests
* Add sidebar search probes to clients_daily tables
* update downstream schemas
* Format fix
* Add new field to test
* Update schemas with hist fields
* Remove duplicated field from schema
---------
Co-authored-by: Glenda Leonard <75265513+gleonard-m@users.noreply.github.com>
* Add urlbar_persisted to query for daily search client table
* Add column to schema (not backward breaking)
* Address test expectations
* Update schemata and queries for companion tables
* Make adjustments that Alex identified in the PR to make sure the new fields get ingested properly
* Run schema update for clients_daily_v6
* Update the suggest_impression_sanitized_v3 query
* exclude region and country when preparing for the join
* filter `impressions.request_id` to non null to drop queries without a
corresponding impression.
* Add tests? Haven't figured out bootstrapping issues on my M1 yet so not
sure how well these will work. TO CI!
* Swap the left and right and remove the conditional on the final join
* Align expectations
* Will this fix the tests? Tune in to find out.
* Fix expectations AGAIN
* Update based on review comments and formatter changes
This is a follow-up to https://github.com/mozilla/bigquery-etl/pull/3037 which unblocked `scalar_bucket_counts_v1`.
`scalar_percentiles_v1` uses the same source table (`clients_scalar_aggregates_v1`) and started failing today with the same error (disk/memory limits exceeded for shuffle operations).
`APPROX_COUNT_DISTINCT` used here runs HLL under the hood. The reason for using it here is that we can't split the aggregation here into two stages as in the aforementioned PR due to quantiles calculation.
I have run this query locally and confirmed that it works.
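In outline, the change swaps the exact distinct count for the approximate, HLL-backed one while keeping the quantiles in the same aggregation stage (column names assumed):

```sql
SELECT
  metric,
  -- HLL under the hood: a fixed-size sketch per group instead of
  -- shuffling every distinct value.
  APPROX_COUNT_DISTINCT(client_id) AS client_count,
  APPROX_QUANTILES(value, 100) AS percentiles
FROM
  clients_scalar_aggregates_v1
GROUP BY
  metric
```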
* CONSVC-1681 Add mobile data to contextual services event_aggregates
See https://mozilla-hub.atlassian.net/browse/CONSVC-1681
* Use 'phone' instead of 'mobile'
* Update init.sql
* Commentary on filter
* Aggregation test update
* Update overactive filter test
* Dry run exemptions
* Update sql/moz-fx-data-shared-prod/contextual_services_derived/event_aggregates_v1/query.sql
* format
This better matches the current client behavior for matching.
We're currently getting `<disallowed>` in results and results with uppercase
letters. I don't think preserving these differences has analytical value,
and it makes the results harder to work with.
* ROAD-85 Simple sanitization job for Merino logs
See https://mozilla-hub.atlassian.net/browse/ROAD-85
This uses the adM allowlist of queries for sanitization, so it can be expressed
entirely in a single query. Future iterations will involve python logic and
will likely need to be held elsewhere.
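An allowlist-based sanitization in a single query could look like this semi-join; the table names are assumptions, and lowercasing matches the client behavior noted above:

```sql
-- Keep only queries that appear on the allowlist; drop everything else
-- (including '<disallowed>' placeholders, which never match).
SELECT
  logs.* EXCEPT (query),
  LOWER(logs.query) AS query
FROM
  merino_log_source AS logs
WHERE
  LOWER(logs.query) IN (SELECT query FROM adm_query_allowlist)
```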
* Separate external query and query to copy data into shared-prod