* Add views for extracting to glam
* wip: Add export script
* Rename extract queries and don't run them
* Add user counts
* Add generated sql
* Update extract queries to rely on current project
* Fix optional day partitioning
* Fix extraction to glam-fenix-dev project
* Add globs for ignoring dryrun and format
* Reorder columns in probe count extract
* Filter on null channel and version
* Do not print header
* Refactor datacube groupings and fix scalar_percentiles
* Rename extract tables
* Convert user counts into a view to avoid needless materialization
* Rename client probe counts to probe counts
* Update publish_views so job does not fail
The logic for incline exports is getting too complex to handle via Airflow
operations (see https://github.com/mozilla/telemetry-airflow/pull/935).
Instead, we put the logic into this script and will invoke it via Docker.
Addresses https://github.com/mozilla/bigquery-etl/issues/848
We commit a single application's generated queries here to allow CI to verify
them. I have run these scripts for the target app so that the tables exist
and CI should pass.
* Add incline exec dash submission_date based query
* Move to derived dataset
* Create migrated_clients_v1 as materialized
Previously we had this as a view. Now it will
be updated every day, taking the latest value seen from migration
pings as its value.
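A minimal sketch of that daily update, assuming hypothetical table and column names (migration_ping_stable, migration_value) rather than the actual schema:

```sql
-- Hypothetical sketch: combine yesterday's table with today's migration pings
-- and keep the most recent value per client.
WITH combined AS (
  SELECT client_id, migration_value, updated_at
  FROM migrated_clients_v1
  UNION ALL
  SELECT client_id, migration_value, submission_timestamp AS updated_at
  FROM migration_ping_stable
  WHERE DATE(submission_timestamp) = @submission_date
)
SELECT
  client_id,
  ARRAY_AGG(migration_value ORDER BY updated_at DESC LIMIT 1)[OFFSET(0)] AS migration_value,
  MAX(updated_at) AS updated_at
FROM combined
GROUP BY client_id
```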
* Update view for materialized query
* Use clients_last_seen view and migrated_clients
Previously, we were using the materialized clients_last_seen
and the unmaterialized migrated_clients. We are now materializing
the latter, and using the view of the former.
- Added documentation
- Removed unnecessary arrays
* Run SQL formatter
* Prefix keyword identifier with an underscore
* Ignore missing table errors in dryrun
* Run migrated_clients init only
* Add docline for negative churn
* Remove histogram_percentiles from format_sql ignore
* Move udf for histogram percentiles into persistent udf
* Add an approximate equals for testing
* Add some initial tests for linear histograms
* Add tests and minor formatting of comments
This fixes a particular bug within one of the conditionals. However, the
results for linear histograms still look incorrect.
* Update tests/assert/approx_equals.sql
Co-Authored-By: Daniel Thorn <dthorn@mozilla.com>
* Format approx_equals
* Remove a few TODOs
Co-authored-by: Daniel Thorn <dthorn@mozilla.com>
* Add initial template for histogram aggregates
* Factor out common functions and get all distributions
* Add viable query for histogram aggregates
* Add more efficient aggregation
* Update header and update comment
* Add code to generate clients daily histograms
* Add queries for generated sql
* Return non-zero exit code when histograms not found
* Delete empty queries to reduce data scanned
* Add non-zero exit code for scalars if probes are not found
* Sort histograms for stable output
* Add view for histogram aggregates
* Add initial sql for histogram aggregates
* Format template scripts
* Add mostly reformatted sql for aggregates
* Update histogram aggregates before adding statements
* Fix up details for daily aggregation
* Add completed histograms template
* Add code to generate clients histogram aggregates
* Add init for clients histogram aggregates
* Remove sample_id from set of attributes
* Add sections to run generated sql
* Add generated sql
* Remove extra latest_version columns
* Fix many small issues during first draft of sql
* Fix clients histogram aggregates
* Add initial modification to probe counts
* Add histogram bucket counts
* Add option to generate histogram probe counts
* Update generated_fenix_sql for histograms
* Add generated sql
* Update run_fenix_sql
* Fix bucket counts
* Update source table for probe counts
* Add missing ping_type to histograms
* Add first, last, and num buckets
* Update probe counts so it succeeds
* Add mozilla_schema_generator to dependencies
* Add metadata from probe-info for custom distributions
* Update probe counts with metadata for custom distributions
* Add UDF for generating functional buckets
* Add proper bucketing by including range_max of measures
* Format histogram udfs
* Add updated templates to skip
* Add new queries to dryrun ignore
* Add view to the publish ignore list
* Fix python linting
* Remove old comments from probe counts
* Do not count metadata from custom distributions twice
* Remove sum from histogram aggregates
* Add generated SQL
* Add sample_id to histograms earlier in pipeline
* Add generated SQL
* Add comments to functional bucketing for metrics
Glean pings now contain a client_info.locale field; baseline pings
previously reported locale as a string metric. This change provides
some historical continuity so that views on baseline pings show
client_info.locale values even if only the old metric is populated.
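A minimal sketch of that continuity logic; the view itself is generated, so the old metric's field path (metrics.string.glean_baseline_locale) and the source table name are assumptions:

```sql
-- Hypothetical sketch: prefer the new client_info.locale field and fall back
-- to the old string metric when only the old metric is populated.
SELECT
  * REPLACE (
    (SELECT AS STRUCT client_info.* REPLACE (
      COALESCE(client_info.locale, metrics.string.glean_baseline_locale) AS locale
    )) AS client_info
  )
FROM
  baseline_v1  -- assumed source table name
```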
* Update daily aggregates to run all scalars in a single query
* Update generate and run script for new scalar aggregates
* Update generated sql (and view)
* Fix linting
* Update SKIP for format
* Update run script to be more generic
* Update run script with parameters and avoid reusing destination
* Add view for clients_daily_scalar_aggregates
* Add new view to dryrun
* Remove old comment
* Refactor render into a separate function
* Add variables for source and destination tables
* Add support for aggregating glean pings
* Add render_init along with --init option
* Add partition clause and add proper init file
* Add attributes_type to the template
* Update clients_scalar_aggregates_v1 with dataset.table
* Add command for generating init for fenix scalars aggregates
* Add queries for fenix_clients_scalar_aggregates_v1
* Update partitioning in init
* Update glam scripts for scalar aggregates
* Update version to only include valid versions
* Add generated sql
* Add --quiet flag
* Add notes
* Fix linting and CI errors
* Ignore glam_etl in dryrun
* Add initial template files that have been formatted
* Update generated queries
* Add metric counts for histogram and scalars
* Update metric_counts_v1 for scalars only
* Add formatted version of telemetry_derived/clients_scalar_bucket_counts_v1
* Add module for generating metric bucketing
* Refactor generate_fenix_sql for skipping stages
* Add templates to format SKIP
* Fix trailing whitespace
* Add option to generate fenix bucket/probe counts
* Add initial bucket/probe counts sql for fenix
* Sort attributes for stable query generation
* Refactor bucketing logic
* Add scalar_metric_types variable
* Add argument parser and glean variables
* Update scalar bucket counts for glean
* Update run_fenix_sql with bucket counts
* Fix invalid syntax
* Do not aggregate booleans as a scalar
* Add scalar_metric_types to metric_counts_v1
* Add argparser and change source tablename to scalar
* Update fenix_clients_scalar_probe_counts_v1
* Remove first_bucket
* Add scalar_probe_counts to run script
* Removing first_bucket requires changing where clause conditional
* Get grouping attributes correct
* Give columns stable ordering
* Add correct query (that is too complex)
* Reduce number of combinations
* Simplify logic for null values
* Cast booleans instead of when clause
* Format
* Rename files to avoid confusion
* Add initial scalar_percentiles
* Add initial files for scalar_percentiles
* Add scalar_percentiles for fenix
* Add scalar_percentiles to run script
* Add problematic files to SKIP in format and dryrun
* Add installation ping
* Fix missing merge item
* Add missing newlines
* Reduce set of grouped attributes
* Factor out boolean_metric_types
* Add latest_versions template
* Add generated code for latest versions
* Update header
* Add latest versions to run script
* Update version filter using fenix_latest_versions_v1
* Add initial moustache files for scalar_aggregates
* Add Jinja2 dependency
* Update templates with more parameters
* Add format clauses and add query to be formatted
* Add formatted sql
* Add generated sql
python -m bigquery_etl.glam.scalar_aggregates_incremental > sql/telemetry_derived/clients_scalar_aggregates_v1/query.sql
* Generalize clients_scalar_aggregates
* Refactor into attributes and attributes_list
* Add generated sql for generalized query
* Add glam templates to format_sql SKIP
* Fix dryrun by using AS and USING properly in sql
* Add generated sql
* Add instructions on adding new Python library
* Fix linting issues
* Use r""" for backslash in docstring
* Add Jinja2 dependencies to constraints.txt
* Document process for adding new Python dependencies
* Add copy of clients_daily_scalar_aggregates for fenix
* Change table to Fenix metrics ping and modify columns
* Modify get_scalar_probes to fetch the relevant list of metrics
* Remove logic for keyed booleans
* Add valid generated SQL for scalars
* Generate valid keyed_scalars
* Factor out attributes into reusable string
* Use the bigquery-etl formatter
* Add `--no-parameterize` flag for debugging in console
* Add option for table_id
* Add comma conditionally
* Add script to run against all Glean pings in dataset
* Move scripts into appropriate locations
* Use stable tables as source for generate script
* Report glean metric types instead of scalar/keyed-scalar
* Fix linting
* Add script to generate sql for each table in org_mozilla_fenix
* Add generated sql
* Rename script for running etl in testing environment
* Update run script to use generated sql
* Fix missing --table-id parameter
* Update header comment in script
* Update generated sql
* Add ping_type to list of attributes
* Update generated schemas
* Fail on NULL in assert_false udf
* Update tests/README.md
Co-Authored-By: Anna Scholtz <anna@scholtzan.net>
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
Closes #600
We improve the logic for parsing view definition files, and we now error out
when we encounter a view.sql file that doesn't match our parsing, rather than
silently skipping it.
* Improve error message for ndjson parsing
* Make JSON error messages nicer
* Cast BYTES fields to/from string
BYTES types are not JSON-serializable. To deal with that, we do
two things:
1. Assume the input values are hex strings, and decode them
to get the BYTES field values (on input)
2. Encode BYTES fields as hex strings (on output)
This means that any data files must use hex strings for BYTES fields.
Note: this only works on top-level fields.
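A minimal sketch of the round trip using BigQuery's hex helpers (note that review feedback further down switches the in/out representation to base64):

```sql
-- Hypothetical sketch: hex strings in data files are decoded to BYTES on the
-- way in, and BYTES values are encoded back to hex strings on the way out.
SELECT
  FROM_HEX('00ff') AS decoded_bytes,   -- input: hex string -> BYTES
  TO_HEX(b'\x00\xff') AS encoded_hex   -- output: BYTES -> hex string
```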
* Add better discrepancy reporting for test assertions
When JSON blobs differ, it can be hard to tell what is wrong.
These functions clearly show what's different and automatically
print the discrepancies when tests fail.
* Add search_clients_last_seen for 1-year of history
This new dataset, search_clients_last_seen, contains a year
of history for each client. It is split into 3 main parts:
1. Recent info that is contained in search_clients_daily,
similar to how we store that in clients_last_seen
2. A year of history, represented as a BYTES field,
indicating which days each client was active for different
types of activity
3. For the major search providers, arrays of totals of
different metrics, split into 12 parts to account for
each month's total
This dataset will power LTV.
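A minimal sketch of how the BYTES activity pattern in part 2 can be consumed, assuming a hypothetical days_searched_bytes column and table name:

```sql
-- Hypothetical sketch: each bit in the BYTES field marks one day of activity,
-- so BIT_COUNT yields the number of active days across the year of history.
SELECT
  client_id,
  BIT_COUNT(days_searched_bytes) AS days_searched_in_past_year
FROM
  search_clients_last_seen_v1
WHERE
  submission_date = @submission_date
```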
* Fix linting issues
* Enforce sampling on search_clients_daily
* Address review feedback
- Change all bits/bytes functions to include no. of bits
- Use fileobj for tests
- Rename some vars
- Use base64 for bytes in/out
* Generate sql
* Add missing comma
* Move search_clients_ls to search_derived
* Generate moar sql
* Use clients_daily_v8
* Fix query
* Move tests to search_derived
* Fix tests for search_clients_daily_v8
* Don't dryrun with search_clients_last_seen
* Update udf/new_monthly_engine_searches_struct.sql
Co-Authored-By: Jeff Klukas <jeff@klukas.net>
* sample_id is now an int
* Add documentation
* Update schemas
* Make tests use int sample-id
* Add tables to replace experiment enrollment aggregates spark streaming job
* Switch to Python to fill in the date in the enrollment aggregates live view, since parameters are not allowed in view definitions
* Direct output of arbitrary commands in the entrypoint script to the Airflow XCom location
This makes it explicit that we are no longer using imported Parquet data.
It also moves several of these tables to the shared-prod project where
we want them to live long-term.
* Add query for last month of schema errors
* Add generated sql for schema error counts
* Move schemas into correct location
* Add document_version and named groups
* Skip schema error counts in dryrun
* Support many billing projects and dates in copy_deduplicate
* Fix docs for --project_id
* Explain default --billing_projects behavior
* Fix return value bug
I got tired of running generate_sql, then checking git status while it was
running and seeing a jumble of deleted files. This PR changes the behavior to
build the files in a temp dir and then copy them into place only at the end.
* Views for monitoring structured ingestion errors
* Add UDF for extracting missing columns
* Add docs for json_extract_missing_cols
* Fix missing WITH clause
* Add generated sql
* Fix test function
* Update views
* Update tests
* Use shared-prod
* Update docstring for missing cols udf
* Move schemas to structured_ingestion dataset
* Don't dryrun queries on payload_bytes
* Change format of payload_bytes views
- Use a with_ratio to reduce duplication
- Change the view name to match the filename
* Fix missing test function
* Move to monitoring dataset
* Move to new test structure
* Remove spaces from js udfs
* Use persistent gunzip UDF
* UDF for decompressing gzip data
* Update script for publishing UDFs to upload UDF dependency files
* Address review feedback for gunzip UDF
* Set default GCS bucket to moz-fx-data-prod-bigquery-etl
* Add function to upload UDF dependencies to GCS
* Set data-eng-circleci-tests context in CircleCI config
* Add approval step in CircleCI config
I believe Airflow may need to issue the jobs from derived-datasets
for the time being, so we make sure to fully qualify all table
references with the project_id that's passed as a parameter.
We implemented a view-based solution for creating clients_last_seen
from clients_last_seen_raw in Athena and Presto, but this made the
table unavailable from Spark.
By adding these options, we can materialize the view logic at the
time of writing to Parquet, so that it will be available to all
Parquet consumers.
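A minimal sketch of what materializing the view logic at write time could look like; the UDF and bit-field names below are stand-ins for whatever the clients_last_seen view actually applies, and the fully qualified table path is an assumption:

```sql
-- Hypothetical sketch: apply the view's derived-column logic while writing, so
-- the days_since_* columns exist for Spark and other Parquet consumers.
SELECT
  *,
  udf.bits_to_days_since_seen(days_seen_bits) AS days_since_seen
FROM
  `moz-fx-data-derived-datasets.telemetry.clients_last_seen_raw_v1`  -- qualified with the project_id parameter
```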