* Add CI task to push content to generated-sql branch
Fixes #1742
The
[`generated-sql`](https://github.com/mozilla/bigquery-etl/tree/generated-sql)
branch now exists and you can browse the contents. See, for example,
[telemetry.main](https://github.com/mozilla/bigquery-etl/tree/generated-sql/sql/moz-fx-data-shared-prod/telemetry/main).
Follow-ups for which I'll file issues:
- This doesn't currently publish the generated Glean baseline ETL queries
and views; we'll need to update that logic to use probe-scraper metadata
rather than listing tables in BigQuery (which requires credentials) before it
can be integrated.
- Docs publishing should reference this generated content rather than
generating it separately.
These were found as part of investigating
https://bugzilla.mozilla.org/show_bug.cgi?id=1671933
They are defined in derived-datasets even though they weren't in bigquery-etl,
so there may be user queries that depend on them. These may have been
erroneously deleted in a previous refactor of this repo.
Restoring them here so that they'll be published to the new `mozdata` project.
* Add views to survey tables in the external dataset
* Add mozilla_vpn survey as an exception to dry run
* Point the view to the correct location
* Move surveys from external to derived
* Make views point to derived instead of external
* Remove survey from dryrun exceptions
This query failed on the 2021-01-18 run with:
> jsonPayload.fields.id has changed type from STRING to FLOAT
Attempts to run this query in the BQ console show errors because the wildcard
query picks up only the most recent schema, which is incompatible with
historical days. We replace the field with null for now and can investigate
with FxA folks what might have caused this schema change.
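For illustration only, the stopgap is roughly of this shape (the field alias
and source table below are placeholders, not the actual query):
```
-- Hypothetical sketch (placeholder table): surface NULL in place of
-- jsonPayload.fields.id, whose type changed from STRING to FLOAT, so the
-- wildcard query no longer trips over incompatible historical schemas.
SELECT
  CAST(NULL AS STRING) AS id
FROM
  `project.dataset.docker_fxa_events_2021*`
```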
* Experiment enrollment aggregates hourly
* Experiment enrollments recents query
* Add execution_delay support for tasks
* Experiment enrollment aggregates base query
* Schedule experiment enrollment cumulative population estimate and active population
* Experiment enrollment monitoring queries as views
* Script for exporting experiment monitoring data to GCS
* Aggregate data of longer-running experiments in the experiment monitoring export script
* Parallelize experiment monitoring data export
* init.sql for experiment enrollment monitoring queries
* Use Airflow ds_format macro for hourly destination table
* Use Airflow macros for experiments monitoring hourly execution delay
* experiment_enrollment_cumulative_population_estimate as query instead of view
* Fix referenced tables in enrollment_aggregates_hourly metadata and add comment
* Simplify cumulative population estimate query
ISO 8601 years are week-numbering years. Specifically, an ISO
year begins on the Monday of week 01; so if the Gregorian
calendar year starts on a Friday, that Friday, Saturday, and
Sunday all fall in the previous ISO year.
We run into this problem here. %g uses the ISO year, so for
2021-01-01 and 2021-01-02, that is year '20'. This has no match
in the underlying data (no 20-01-01 in FxA logs). Switching to
%y gives us year '21' for this data, and a match in the FxA logs.
https://en.wikipedia.org/wiki/ISO_8601#Week_dates
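A quick illustration of the difference in BigQuery (2021-01-01 is a Friday,
so it sits in ISO week 53 of 2020; column aliases here are just for the
example):
```
SELECT
  FORMAT_DATE('%g-%m-%d', DATE '2021-01-01') AS iso_week_year,  -- '20-01-01'
  FORMAT_DATE('%y-%m-%d', DATE '2021-01-01') AS calendar_year   -- '21-01-01'
```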
* Add initial code from telemetry_derived.surveygizmo_daily_attitudes
* Update module to successfully insert documents into BigQuery
* Add a schema instead of inferring the schema
* Add recommend survey
* Add the rest of the surveys in the bug
* Add generated dag for survey imports
* Fix linting issues
This query is failing as of 2020-11-18 due to unexpected input.
We are descheduling it until the logic is updated to handle this situation,
since new DAG runs are staying in the running state, waiting on past
runs that will never complete.
To be truly useful for quick investigations, the standard approach to views
doesn't work here.
The following query runs in 4 seconds:
```
SELECT
  DATE(submission_timestamp) AS dt,
  COUNT(*)
FROM
  `moz-fx-data-shared-prod.telemetry_derived.main_1pct_v1`
WHERE
  DATE(submission_timestamp) >= '2020-11-24'
  AND subsample_id = 0
GROUP BY
  1
ORDER BY
  1 DESC
```
But the equivalent on top of the existing view takes more than 30 seconds,
bottlenecked on query planning.
* Add script to determine query dependencies
* Add schemas and folders for minimal test
* Add schema for geckoview_versions
* Add query params to each query
* Update schema for new queries
* Remove main from bootstrap file
* Add dataset prefix to schemas
* Add failing test for clients_histogram_aggregates
It turns out that the dependency resolution I'm using to autogenerate
the schemas is ignoring the views. I actually want to keep the views
around. The tables also all need to be prefixed with the dataset name or
they won't be inserted into the SQL query correctly.
* Add successful test for clients histogram aggregates
* Add minimal tests for clients_scalar_aggregates
* Remove skeleton files for views (no test support for views)
* Add tests for latest versions
* Add tests for scalar bucket counts that passes
* Add scalar bucket counts
* Add test for scalar percentiles
* Add test for histogram bucket counts
* Add passing test for probe counts
* Add test for histogram percentiles
* Add tests for extract counts
* Update readme
* Add data for scalar percentiles test
* Fix linting errors
* Fix mypy issues with tests module
* Name it data instead of tests.*.data
* Ignore mypy on tests directory
* Remove mypy section
* Remove extra line in pytest
* Try pytest invocation of mypy-scripts-are-modules
* Run mypy outside of pytest
* Use exec on pytest instead of mypy
* Update tests/sql/glam-fenix-dev/glam_etl/bootstrap.py
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Update tests/sql/glam-fenix-dev/glam_etl/README.md
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Document bootstrap in documentation
* Use artificial range for histogram_percentiles
* Simplify parameters for scalar probe counts
* Simplify tests for histogram probe counts
* Add test for incremental histogram aggregates
* Update scalar percentile counts to count distinct client ids
* Update readme for creating a new test
* Use unordered list for sublist
* Use --ignore-glob for pytest to avoid data files
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Bug 1677609 Join clients_first_seen into clients_last_seen
Several folks on DS report that they have been getting great value from
clients_first_seen, as the first_seen_date there is a much more stable way
to define new profiles compared to using profile_created_date from pings.
Currently, using first_seen_date requires doing a join between these two tables.
This PR adds that join to the clients_last_seen query itself to make this
workflow more efficient. I'd like to get this merged before we proceed with
the backfill discussed in https://bugzilla.mozilla.org/show_bug.cgi?id=1677609
This change has a few operational implications. Most importantly, it makes
clients_last_seen dependent on clients_first_seen, so those queries can no
longer proceed in parallel. `clients_first_seen` takes on average 10 minutes
to run, so we'll be delaying all ETL downstream of `clients_last_seen` by
about 10 minutes, which seems acceptable. It also adds some mental complexity
to the model.
The extra join does not appear to significantly slow down the
`clients_last_seen` query itself; it scans about 15% more data and consumes
about 15% more slot time.
I expect the performance is dominated by the existing join between
clients_daily and the previous day of clients_last_seen.
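In rough terms, the added join is of this shape (a simplified sketch; the
table versions are assumptions and the real query carries many more fields):
```
-- Simplified sketch: carry first_seen_date through clients_last_seen so
-- downstream users no longer need their own join against clients_first_seen.
SELECT
  cd.*,
  cfs.first_seen_date
FROM
  `moz-fx-data-shared-prod.telemetry_derived.clients_daily_v6` AS cd
LEFT JOIN
  `moz-fx-data-shared-prod.telemetry_derived.clients_first_seen_v1` AS cfs
USING
  (client_id)
WHERE
  cd.submission_date = @submission_date
```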
* Bug 1677609 Add core active fields to clients_last_seen
See https://bugzilla.mozilla.org/show_bug.cgi?id=1677609
This adds just the new underlying bit pattern fields that will need to be
backfilled, and these will be hidden from users initially.
After the backfill is complete, we will update the view to include these
fields along with the various fields derived from them.
We include days_visited_10_uri_bits which was not explicitly requested in
the context of this bug, but was proposed as part of the prototype feature_usage
table (https://github.com/mozilla/bigquery-etl/pull/1193); it may be useful
for future comparisons.
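For context, the new `days_*_bits` fields presumably follow the same 28-day
bit-pattern convention as the existing fields, along these lines (a sketch,
not the exact field list; `_current` and `_previous` stand in for today's
clients_daily and yesterday's clients_last_seen):
```
-- Sketch of the usual pattern: shift yesterday's 28-bit history left by one
-- day and OR in today's activity bit for the new field.
SELECT
  client_id,
  udf.combine_adjacent_days_28_bits(
    prev.days_visited_10_uri_bits,
    cur.days_visited_10_uri_bits
  ) AS days_visited_10_uri_bits
FROM
  _current AS cur
FULL JOIN
  _previous AS prev
USING
  (client_id)
```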
* Update tests to match new logic
* Add materialized table for missing columns in telemetry dataset
* Ignore dryrun failures when fetching references
* Add generated dag
* Add manual reference to main ping copy deduplicate
* Update bigquery_etl/dryrun.py
Co-authored-by: Frank Bertsch <fbertsch@mozilla.com>
* Remove email from all monitoring queries
* Change order of logic
* Remove copy_deduplicate reference due to bug
Co-authored-by: Frank Bertsch <fbertsch@mozilla.com>
* Replace GLAM temp functions with persistent functions
* Add generated sql
* Fix typo in udf name
* Add missing files and fully qualify udfs
* Add missing namespace
* Namespace even more things
* format sql
* Add bits_from_offsets UDF
This is relevant to the emerging "clients_all_time" work drafted in
https://github.com/mozilla/bigquery-etl/pull/1480
I'm proposing we add this under `udf` rather than in `mozfun` because I'm not
yet certain about the naming. If we want to have additional functionality to
support all-time bit patterns, I would like to have those organized under a
single `mozfun` namespace, and it's not clear yet what the interface should
look like.
This function on its own should be enough to empower a new DS workflow for
experimenting with new usage definitions before committing them to clients_daily
and clients_last_seen (to be documented).
* Resolve generated sql to glam-fenix-dev and change output in sql/ dir
* Add new script for testing glam-fenix queries
* Add generated sql for version control
* Use variables correctly in bash
* Remove latest versions from UDF
* Update test to generate minimum set of tables for nightly
* Commit generated queries for testing
* Cast only if not glob
* Ignore dryrun and publish view for glam-fenix-dev
* Fix linting error
* Update comments
* Use DST_PROJECT consistently in scripts
* Update comments
* Update script/glam/test/test_glean_org_mozilla_fenix_glam_nightly
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Update script/glam/generate_and_run_desktop_sql
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
* Exempt a few files from dry run due to new table-level ACLs
The dry run service can no longer perform queries with wildcard table
specifications or access raw AET data. See https://github.com/mozilla-services/cloudops-infra/pull/2599
* Verbose referenced_tables for AET logging clients daily
The `nondesktop_clients_last_seen_v1` view was developed mostly as an
internal implementation detail for downstream tables, but it has become
useful in its own right. This PR formalizes the view by providing an alias
without a version modifier, and adds a `product` field with application
names that are short but more meaningful than the `app_name` field.
See discussion in https://jira.mozilla.com/browse/DO-330 about confusion that
has resulted from the name "Fennec iOS" used in dashboards, etc. This is a
step toward reducing that kind of confusion.
This PR also adds `contributes_to_2019_kpi` and `contributes_to_2020_kpi` fields
as the source of truth for how we count KPI metrics. That logic is currently
copied and pasted in several places, which could lead to errors.
This will need a fair amount of review from data users before moving forward.
It will also require backfilling several downstream tables and communicating
the change.
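For illustration, the kind of logic being centralized looks roughly like this
(the product labels and the KPI rule below are illustrative, not the
authoritative mapping):
```
-- Illustrative only: give app_name a friendlier product label and encode the
-- KPI-inclusion rule once instead of copy-pasting it into each dashboard.
SELECT
  *,
  CASE
    WHEN app_name = 'Fennec' AND os = 'Android' THEN 'Firefox Android'
    WHEN app_name = 'Fennec' AND os = 'iOS' THEN 'Firefox iOS'
    ELSE app_name
  END AS product,
  app_name IN ('Fennec', 'Fenix', 'Focus') AS contributes_to_2020_kpi
FROM
  `moz-fx-data-shared-prod.telemetry.nondesktop_clients_last_seen_v1`
```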
* Add initial incremental query for geckoview build dates
* Add initial tests for incremental query (WIP)
* Add files for initial tests
* Rework query so it doesn't fail during tests
* Fix schema so queries run
* Add passing test for init
* Add test for query aggregation
* Add metadata file for scheduling the query
* Move scripts from fenix_nightly to fenix
* Remove scheduling
* Add document strings.
* Change dataset reference and indent comments correctly
* Remove init and address feedback
* remove init file
* make query idempotent by appending window to each submission_date
* rename n_builds to n_pings
* reduce window size from 30 days to 14 days
* avoid use of subqueries
* Update tests for query
* Fix tests
* Add failing test for 100
* Fix query so it works across fx100 boundary
* Add linting fixes
While looking at Shredder logs, I noticed that all entries for FxA-related
derived tables show 0 deletions, like:
> 712215424909 bytes and deleted 0 rows from moz-fx-data-shared-prod.firefox_accounts_derived.fxa_users_daily_v1
It appears that `account.deleted` events populate a field named `uid` rather
than `user_id`. I was able to verify this by choosing a recent `uid` value
from a deletion event and counting events where that same value appears as
`user_id` in FxA logs. There were matching messages.
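The check was roughly of this shape (a sketch; the table, field path, and uid
value below are placeholders, not the actual log schema):
```
-- Sketch: take a uid from a recent account.deleted event and count FxA log
-- events where that same value appears as user_id.
SELECT
  COUNT(*) AS matching_events
FROM
  `project.fxa_logs.events_20210115`
WHERE
  jsonPayload.fields.user_id = 'uid-from-a-recent-account.deleted-event'
```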
- Scrape all projects for routine definitions when generating tests
- Create UDFs as non-temp for stored procedure tests
- Make assert functions default to non-temp (to support the above)
* Bug 1669516 Use `app_display_version` for Fenix AMO stats
* Use geckoview_version for fenix nightly
* Remove deprecated installs_v1
* Remove dev installs_v1