* feat(managed-backfills): add date range iterator to standardize backfill date ranges
* refactor: query backfill to use new date range module
* feat(managed-backfills): DENG-1285 - Support completing backfills for unpartitioned tables and align partitions for offset table backfills
* Added dry run in processing
* Update bigquery_etl/cli/backfill.py
Co-authored-by: Alexander <anicholson@mozilla.com>
* Removed deploy when dry run
---------
Co-authored-by: Alexander <anicholson@mozilla.com>
* Require authentication for dry run function and run gcloud auth when not logged in
* authenticate step in CI, remove interactive gcloud auth
* Skip dryrun for ltv_state_values_v2
* Refactor skip_fork in CI, clarify login requirements
So that if any of the ETLs use `depends_on` to manually depend on one of those external dependencies the external task variable will be guaranteed to exist before it's referenced in the `set_upstream()` call.
* Bump black from 23.10.1 to 24.1.1
Bumps [black](https://github.com/psf/black) from 23.10.1 to 24.1.1.
- [Release notes](https://github.com/psf/black/releases)
- [Changelog](https://github.com/psf/black/blob/main/CHANGES.md)
- [Commits](https://github.com/psf/black/compare/23.10.1...24.1.1)
---
updated-dependencies:
- dependency-name: black
dependency-type: direct:production
update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
* Reformat files with black to fix dependabot update.
* Reformat with black 24.1.1. Update test dag with required space.
* Update test dags.
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add BigQuery schema conversion util
* Update bigquery_etl/schema/__init__.py
Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
---------
Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
* Add ETLs for historical Google Search Console data synced by Fivetran.
* Fix formatting of `CASE` subclauses like `WHEN` inside Jinja blocks.
* Add ETLs for current Google Search Console data exported directly to BigQuery.
* Add views for Google Search Console data.
* pyproject.toml for bqetl
* Correctly resolve SQL generators from package
* CircleCI config to publish tagged versions to PyPI
* Get version from git tags
* Update `bqetl format` to not have an extra blank line after Jinja expressions.
* Update `bqetl format` to add a blank line before `#fail` and `#warn` comments.
* Update dataset workgroup_access when deprecated: true
* Update deprecation metadata tests
* Add metadata.yaml files in telemetry_derived for tables that are managed through other tooling
* Deprecate telemetry_derived datasets
* Don't try to write existing view files
* Use state_values_v2 for client ad click predictions
* Normalize countries in client_ltv
* Don't get view if unavailable
* Add test for new version of existing table
* Fully qualify tables in view defn
* Avoid using `Path.glob()` or `Path.rglob()` for recursive file searches.
Because they don't currently support following symlinks (they will in Python 3.13).
* Specify `followlinks=True` as necessary when calling `os.walk()`.
* Remove referenced_tables usages
* Resolve check dependencies when generating DAGs
* Add tests for checks automatically resolving dependencies
* Incorporate feedback for removing referenced_tables
* Use depends_on for empty_checks
* More depends_on and fixes
* Verify the format of the original SQL, not the generated SQL.
The generated SQL gets reformatted by `bqetl query render`.
* Format all SQL.
* Quote column names containing Jinja expressions to prevent `bqetl format` causing invalid SQL.
* Adjust indentation of some comments to align with the formatted SQL.
* Refactor final `SELECT` in `telemetry_derived.clients_first_seen_v2` to work better with `bqetl format` SQL formatting.
* Fix trailing line comments breaking inline block formatting.
* Fix leading whitespace before Jinja comments not being preserved.
* Add `schema.yaml` for `firefox_ios_derived.baseline_clients_yearly_v1`.
So the `deploy-changes-to-stage` CI can work for the downstream `firefox_ios.baseline_clients_yearly` view.
* Add `schema.yaml` for `firefox_accounts_derived/fxa_users_services_daily_v1`.
So the `dry-run-sql` CI can work for the downstream `firefox_accounts_derived.fxa_users_services_last_seen_v1` ETL.
* Correct `schema.yaml` and `init.sql` for `firefox_accounts_derived.fxa_users_last_seen_v1`.
So the `dry-run-sql` CI can work for the downstream `firefox_accounts_derived.fxa_users_last_seen_v1` ETL.
* Fully qualify table reference in `init.sql` for `firefox_accounts_derived.fxa_users_last_seen_v1`.
So the table dependency will get detected by the `deploy-changes-to-stage` CI to deploy it so the `dry-run-sql` CI can work for the `init.sql` file.
* Improve `JinjaComment` inheritance and docstring.
* Implement `Line.ends_with_line_comment` property and refactor `inline_block_format()`.
* added existing fxa tables to shredder config
* Apply suggestions from code review
Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
* removing some of the fxa tables from the config as suggested by srose
---------
Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
* Add skip-existing option to ./bqetl query initialize
* Handle initialization exceptions and refactor skip-existing check
* Refactoring of ./bqetl initialization
* Add --force option to ./bqetl initialize
* Update bigquery_etl/cli/query.py
Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
* Update bigquery_etl/cli/query.py
Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
* Update bigquery_etl/cli/query.py
Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
* Update bigquery_etl/cli/query.py
Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
* Update bigquery_etl/cli/query.py
Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
---------
Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
* Handle Jinja whitespace control characters in `bqetl format`.
* Use default formatting for Jinja in `bigquery_usage_v2` ETL.
* Reformat `sql_generators/active_users/templates/mobile_checks.sql`.
* DAG docs - fix broken links and add tags to docs
* change tests
* remove empty line
* fix typo
* fix second test template
* add if case for private-bigquery-etl
* Define `event_monitoring_live_v1` views in `view.sql` files.
So they get automatically deployed by the `bqetl_artifact_deployment.publish_views` Airflow task.
* Support materialized views in view naming validation.
* Handle `IF NOT EXISTS` in view naming validation.
* Use regular expression to extract view ID in view naming validation.
This simplifies the logic and avoids a sqlparse bug where it doesn't recognize the `MATERIALIZED` keyword.
* Update other view regular expressions to allow for materialized views.
* Skip backfills for queries without metadata.yaml
* Support date_partition_offset
* Fixed exclude, modified exception
* Add test for offset backfill
* Apply suggestions from code review
Co-authored-by: Frank Bertsch <frank.bertsch@gmail.com>
* Formatting
---------
Co-authored-by: Frank Bertsch <frank.bertsch@gmail.com>
* added firefox_ios_derived.clients_activation_v1 and corresponding view
* fixing a missing seperator in firefox_ios_derived.clients_activation_v1 checks
* adding firefox_ios_derived.clients_activation_v1 to shredder configuration
* removed is_suspicious_device_client as it should not be there, thanks bani for pointing this out
* fixed black formatting error inside shredder/config.py
* applied bqetl formatting
* minor styling tweak as suggested by bani in PR#4631
* deleting fenix_derived/firefox_android_clients_v2, v1 will remain the active model
* removed fenix_derived.firefox_android_clients_v2 from shredder config
* Add support for assigning Airflow tasks to task groups
* Generate separate Airflow tasks for glean_usage
* Remove Airflow dependencies from old glean_usage tasks
* Glam - fix legacy windows & release probes' sample count going fwd
* Glam FOG accounts for sampling when calculating total_sample for windows & release probes
* fog - fix client count and sample count
* Add channel filtering for fog
* Fix checks to filter on partitions
* Don't print "missing checks file" on success
Previously, the statement that checks.sql files
were missing was printed on any execution of the for
statement. ("else" clauses after "for"s execute after
completion of the "for" clause).
Instead, we want to print only when there are no files.
* Fill empty description
* Assign a friendly name if the table doesn't have one
* Update metadata tests
* Update bigquery_etl/metadata/parse_metadata.py
Co-authored-by: Alexander <anicholson@mozilla.com>
* update test again
---------
Co-authored-by: Alexander <anicholson@mozilla.com>
* Generate normal task dependencies from `depends_on` if the task is in the same DAG.
* Update `metadata.yaml` files to use `depends_on` rather than `upstream_dependencies`.
* DS-3054. Create functions to support running an initialization query for all sample_ids in parallel.
* DS-3054. Update _run_query function.
* DS-3054. Use _run_query and mapped values for initialization in parallel.
* DS-3054. Unify initialization to run in parallel and get sample_id range from metadata.
* DS-3054. Minimize formatting of query template and remove need to modify existing initialization queries. Validate if a query should use parallelized or regular update.
* DS-3054. Adding link to caveats.
* DS-3054. Update sample_id range for initialization.
* DS-3054. Use current implementation of run_query.
* DS-3054. Update using a parameter instead of initialization in metadata.
* DS-3054. DAG update with new parameter.
* Pass parameters before calling _run_query().
* Use --append_tablein favour of INSERT INTO.
* DS-3054 Separate parallel and non parallel init, plus some improvements.
---------
Co-authored-by: Lucia Vargas <lvargas@mozilla.com>
* Put assert UDFs in `mozfun` project.
* Tweak syntax in `assert.array_equals()` to avoid SQLGlot parsing error.
https://github.com/tobymao/sqlglot/issues/2348
* Fix SQL syntax error in `assert.struct_equals()` tests.
* Fix UDF dependency file path logic when deploying to stage.
* Change regular expressions in `parse_routine` module to allow quotes around routines' dataset and name.
* Respect sql_dir in dryrun skip
* Update bigquery_etl/dryrun.py
Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
* Update bigquery_etl/dryrun.py
Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
* Set sql_dir when using Schema.from_query_file()
---------
Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
* Fix publishing udfs that use backticks in identifiers
* Update bigquery_etl/routine/parse_routine.py
Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
---------
Co-authored-by: Sean Rose <1994030+sean-rose@users.noreply.github.com>
* DENG-1314 Implement changes to bqetl and create default DAG.
* DENG-1314. Update Documentation.
* DENG-1314. Dummy query to enable generating DAG and run tests.
* DENG-1314. Update tests.
* Update bigquery_etl/cli/query.py
Raise exception when scheduling information is missing.
Co-authored-by: Daniel Thorn <dthorn@mozilla.com>
* DENG-1314. Update tests.
* DS-3054. Update query creation to set bqetl_default as default value for --dag. Update tests.
* Default task and tests update.
* Default task and tests update.
* 3650 - Remove default DAG option, update DAG template comment & tests.
* 3650 - Condition for DAG warning.
* 3650 - Update docs.
* Clarification on sql/moz-fx-data-shared-prod/analysis/bqetl_default_task_v1/metadata.yaml
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* Update docs/cookbooks/creating_a_derived_dataset.md
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
---------
Co-authored-by: Lucia Vargas <lvargas@mozilla.com>
Co-authored-by: Daniel Thorn <dthorn@mozilla.com>
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* Small tweaks made to the cli cmds comments / help display for data checks
* added usage docs to data_checks reference docs
* Apply suggestions from code review provided by scholtzan
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
---------
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* added render subcommand to the bqetl check command
* added a dry_run flag to bqqetl check run command
* added a test to make sure run command exists with status code 0
* added test for check render subcommand
* fixing linter checks
* attempting using an alternative way of testing the render command
* fixing render test by testing the _render() directly rather than the render cli wrapper
* removed dead test
* Apply suggestions from code review by ascholtz
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* fixed black and mypy errors
* fixed app_store_funnel_v1 check formatting
* reformatted tests checks
---------
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* added history and bookmarks fields
* adding automated corrections
* some auto schema updates but perhaps not all
* Update schemas
---------
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* Bug 1741487 - Rename url2 and related fields in stable views
This removes the following unpopulated fields from Glean views: `metrics.url`, `metrics.text`, `metrics.jwe`, and `metrics.labeled_rate`. If any of these metrics exist in the source table under `2`-suffixed name, it is also aliased to its original name (`url2` to `url` and so on).
Suffixed fields are still preserved until view consumers migrate.
* Remove redundant comma from generated sql
* Ignore missing fields in views if any of them were removed
* added a todo comment
* Added additional context around why we are excluding some of the non-suffixed fields and why alising to remove suffix 2 from some fields
---------
Co-authored-by: Arkadiusz Komarzewski <akomarzewski@mozilla.com>
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* Add `synced_at` column to `stripe_subscriptions_changelog_v1`.
* Tweak `stripe_subscriptions_changelog_v1` tax rate and discount joins to only include those that existed when the change happened.
* Parse subscription metadata in `stripe_subscriptions_changelog_v1`.
* Add `stripe_external.invoice_line_item_v1` ETL.
* Add `stripe_subscriptions_revised_changelog_v1` ETL.
* Add `stripe_subscriptions_history_v2` ETL.
* CAccomodate dq checks in dag generation
* Modify the tests to include dq check
* Generate dags to include bigquery_dq_check
* rename destination to source for dq check
* Add DQ check to download attribution dag
* Update bigquery_etl/query_scheduling/templates/airflow_dag.j2
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* Update bigquery_etl/query_scheduling/generate_airflow_dags.py
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* Set upstream check dependencies using upstream_dependencies
* Change bigquery_dq_check as per gcp.py utils
* remove sql_file_path in airflow jinja
* Fix download attribution dag
---------
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* added a new table for the new nonprod fxa backend events and a fxa_all_events_nonprod view to simulate the process we will need to follow for prod
* added date filters to the nonprod_fxa_all_events view as requested by akkomar and updated the metadata
* added the new nonprod_fxa_server_events_v1 table to dry run skip due to permissions
* improved the comment about deleting a view as requested by akkomar
* tweaked date filtering as requested by srose
* pulled nonprod_fxa schema from DENG-1006-fxa-log-fields
* added schema.yaml for nonprod_fxa_server_events_v1
* deleted init.sql and added clustering config to metadata.yaml instead
* added AS as requested by srose
* fixed yaml lint errors
* added the ability to pass end_date param into Airflow task
* updated nonprod_fxa queries and schema for fxa_server_events_v1 as requested by srose, this query also pulls data for stout which now has end date
* regenerated bqetl_fxa_events DAG
* renamed fxa_log to fxa_server as agreed on with srose
* reverted merging of the stdout and server event etls due to incompatible schemas
* removed changes related to task level end_date
* removed date filter for stdout events
* undoing test changes
* added country to fxa_server_events_v1 schema
* tweaked selected ordering as requested by srose and updated comments and metadata.yaml
* Speed up schema update
* Speed up schema update
* Sort and update schemas in parallel
* Update sql/moz-fx-data-shared-prod/telemetry_derived/clients_daily_v6/metadata.yaml
Co-authored-by: Daniel Thorn <dthorn@mozilla.com>
---------
Co-authored-by: Daniel Thorn <dthorn@mozilla.com>
* Create new Fenix attributable_clients table
Further updates to attributable clients
- Handle clients who were only _activated_ on that day
- Separate facts/dimensions
- Rename some things
- Add metadata about why a client is present
- Limit new_activations to just activated clients
- Rename client_count field
- Include submission_date in activation join
- Move to v2
* Add DAG
* Add schema file
* Move some joins to view; add initialization
1. Move attribution & activation joins to the view. This lets
us immediately access updates to those tables, rather than
re-materializing this table on changes there.
2. Add the capability to init from a query file. This uses
an `is_init` jinja function, which is only set to True
when run from `bqetl query initialize`.
* Use dict for default template vars
* Add default for addl_templates
* Reformat files
* Update view
* Regenerate DAG
* Keep metadata field in view
* glam: Partition clients_histogram_aggregates by sample_id (has been running like this since April 3 from a different branch)
* glam: add description and eol to init
* glam: Partition clients_histogram_aggregates by sample_id (has been running like this since April 3 from a different branch)
* glam: add description and eol to init
* add init.sql to missing tbls
* Add schema.yaml
* increase ci output timeout to 30m
* remove init.sql to prevent ci from trying to derive schema from it and break
* Fix schema.yaml files
* Revert output timeout to default
* initial impl
* Updated based on PR feedback
* Moved check from query to separate command
* Expanded from --partition option to generic --parameter option
* Removed `query check` command (check moved to new command)
* Update bigquery_etl/cli/check.py
remove date param format check
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* Removed 'parameter' parameter, everything is passed through ctx.args and then converted to a dict for Jinja rendering. There are no restrictions on ctx.args values.
* Merge error
---------
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
When using `bqetl query schema update` to create a new `schema.yaml` file, BigQuery returns the column schema properties in a sensible order (`name`, `type`, `mode`, `fields`), but our `schema.yaml` output has been sorting those properties alphabetically which makes it much less readable.
Also, when using `bqetl query schema update` to update an existing `schema.yaml` file, this will now preserve whatever order the column schema properties were in.
* added support for --log-level to bqetl query command and updated print statements to be log statements
* now --log-level flag is a bqetl global flag
* fixing linter errors
* Update bigquery_etl/cli/__init__.py
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* Update bigquery_etl/cli/__init__.py
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* fixed indentation of --log-level option
---------
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* Fix `bqetl stage` ID quoting.
Quoting the entire table ID breaks cases where an unaliased table name is used to qualify a column reference.
* Have `bqetl stage` preserve fully quoted references.
* Simplify regular expressions for fully quoted references.
* Compile all reference replacement regular expressions for performance.
* Save current SubPlat ETL views logic in versioned ETLs (DENG-973).
* Add `incremental` labels to the new tables.
* List all CJMS ETLs to dryrun-skip rather than using `glob`.
The `glob` approach doesn't currently work well with the CI staging process.
* DENG-970 Only Glean in Focus Android view.
* DENG-970 Only Glean in Focus Android view.
* DENG-970 Only Glean in Focus Android view.
* DENG-970 Only Glean in Focus Android view.
* DENG-970 Only Glean in Focus Android view.
* DENG-970 Only Glean in Focus Android view.
* DENG-970 CI fix
* DENG-970 CI failure fix. Related to issue 3889.
* Fix UDF dependencies deploy on stage
* DENG-970 Revert specific calling to dataset for UDF.
---------
Co-authored-by: Lucia Vargas <lvargas@mozilla.com>
Co-authored-by: Brad Ochocki <brad.ochocki@gmail.com>
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* Bug 1823627 - Normalize the channel based on probeinfo data in UNIONized views
* Handle fenix channel normalization for app pings
* Parallelize stage schema deploys
* Fix schema field order
---------
Co-authored-by: Jan-Erik Rediger <jrediger@mozilla.com>
* added apple_ads_derived for copying over apple_ad data from the fivetran dataset, and apple_ads views now read from it
* added bqetl_fivetran_apple_ads.py DAG responsible for copying apple_ads data from the fivetran project over to moz-fx-data-shared-prod
* now dryrun skips apple_ads_derived instead of apple_ads as the query now accesses restricted dataset
* added schema files for apple_ads_derived datasets
* added descriptions to schema.yaml files for apple_ads_derived namespace
* added dataset_metadata for apple_ads_derived to include a link to the dbt transformations
* fixed apple_ads view definitions
* removed application label and referenced_tables section inside metadata.yaml for apple_ads as requsted by srose in PR#3847
* corrected source project for apple_ads views
* renamed apple_ads_derived to apple_ads_external
* added * to apple_ads_external namespace name to skip in the dryrun due to integration test deployment
* made tweaks to apple_ads and apple_ads_external datasets/namespaces as requested by whd
* updated apple_ads_external skip rule to the way it is meant to be defined, this will work once a fix is rolled out for dryrun
* fixed dag bqetl_fivetran_apple_ads description and updated the schedule to run once a day
* RS-722 Remove task_name from dag generation when it is not available.
* RS-722 Reformat files.
---------
Co-authored-by: Lucia Vargas <lvargas@mozilla.com>