* Add views for extracting to glam
* wip: Add export script
* Rename extract queries and don't run them
* Add user counts
* Add generated sql
* Update extract queries to rely on current project
* Fix optional day partitioning
* Fix extraction to glam-fenix-dev project
* Add globs for ignoring dryrun and format
* Reorder columns in probe count extract
* Filter on null channel and version
* Do not print header
* Refactor datacube groupings and fix scalar_percentiles
* Rename extract tables
* Convert user counts into a view to avoid needless materialization
* Rename client probe counts to probe counts
* Update publish_views so job does not fail
Addresses https://github.com/mozilla/bigquery-etl/issues/848
We commit a single application's generated queries here to allow CI to verify
them. I have run these scripts for the target app so that the tables exist
and CI should pass.
* Add initial template for histogram aggregates
* Factor out common functions and get all distributions
* Add viable query for histogram aggregates
* Add more efficient aggregation
* Update header and update comment
* Add code to generate clients daily histograms
* Add queries for generated sql
* Return non-zero exit code when histograms not found
* Delete empty queries to reduce data scanned
* Add non-zero exit code for scalars if probes are not found
* Sort histograms for stable output
* Add view for histogram aggregates
* Add initial sql for histogram aggregates
* Format template scripts
* Add mostly reformatted sql for aggregates
* Update histogram aggregates before adding statements
* Fix up details for daily aggregation
* Add completed histograms template
* Add code to generate clients histogram aggregates
* Add init for clients histogram aggregates
* Remove sample_id from set of attributes
* Add sections to run generated sql
* Add generated sql
* Remove extra latest_version columns
* Fix many small issues during first draft of sql
* Fix clients histogram aggregates
* Add initial modification to probe counts
* Add histogram bucket counts
* Add option to generate histogram probe counts
* Update generated_fenix_sql for histograms
* Add generated sql
* Update run_fenix_sql
* Fix bucket counts
* Update source table for probe counts
* Add missing ping_type to histograms
* Add first, last, and num buckets
* Update probe counts so it succeeds
* Add mozilla_schema_generator to dependencies
* Add metadata from probe-info for custom distributions
* Update probe counts with metadata for custom distributions
* Add UDF for generating functional buckets
* Add proper bucketing by including range_max of measures
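For intuition, a minimal Python sketch of log-scale ("functional") bucketing capped at a measure's range_max. The function name, parameters, and formula below are an assumption modeled on Glean's functional histograms, not necessarily the UDF's actual signature:

    import math

    def functional_buckets(log_base, buckets_per_magnitude, range_max):
        """Generate log-scale bucket boundaries from 0 up to range_max.

        Bucket k starts near log_base ** (k / buckets_per_magnitude);
        flooring and deduplicating yields integer boundaries that grow
        exponentially, which is why range_max is needed to cap the set.
        """
        buckets = {0}
        k = 0
        while True:
            boundary = int(math.floor(log_base ** (k / buckets_per_magnitude)))
            if boundary > range_max:
                break
            buckets.add(boundary)
            k += 1
        return sorted(buckets)

    # functional_buckets(2, 8, 100) yields ~log-spaced boundaries in [0, 100]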
* Format histogram udfs
* Add updated templates to skip
* Add new queries to dryrun ignore
* Add view to the publish ignore list
* Fix python linting
* Remove old comments from probe counts
* Do not count metadata from custom distributions twice
* Remove sum from histogram aggregates
* Add generated SQL
* Add sample_id to histograms earlier in pipeline
* Add generated SQL
* Add comments to functional bucketing for metrics
* Update daily aggregates to run all scalars in a single query
* Update generate and run script for new scalar aggregates
* Update generated sql (and view)
* Fix linting
* Update SKIP for format
* Refactor render into a separate function
* Add variables for source and destination tables
* Add support for aggregating glean pings
* Add render_init along with --init option
* Add partition clause and add proper init file
* Add attributes_type to the template
* Update clients_scalar_aggregates_v1 with dataset.table
* Add command for generating init for fenix scalars aggregates
* Add queries for fenix_clients_scalar_aggregates_v1
* Update partitioning in init
* Update glam scripts for scalar aggregates
* Update version to only include valid versions
* Add generated sql
* Add --quiet flag
* Add notes
* Fix linting and CI errors
* Ignore glam_etl in dryrun
* Add initial template files that have been formatted
* Update generated queries
* Add metric counts for histograms and scalars
* Update metric_counts_v1 for scalars only
* Add formatted version of telemetry_derived/clients_scalar_bucket_counts_v1
* Add module for generating metric bucketing
* Refactor generate_fenix_sql for skipping stages
* Add templates to format SKIP
* Fix trailing whitespace
* Add option to generate fenix bucket/probe counts
* Add initial bucket/probe counts sql for fenix
* Sort attributes for stable query generation
* Refactor bucketing logic
* Add scalar_metric_types variable
* Add argument parser and glean variables
* Update scalar bucket counts for glean
* Update run_fenix_sql with bucket counts
* Fix invalid syntax
* Do not aggregate booleans as a scalar
* Add scalar_metric_types to metric_counts_v1
* Add argparser and change source tablename to scalar
* Update fenix_clients_scalar_probe_counts_v1
* Remove first_bucket
* Add scalar_probe_counts to run script
* Removing first_bucket requires changing where clause conditional
* Get grouping attributes correct
* Give columns stable ordering
* Add correct query (that is too complex)
* Reduce number of combinations
* Simplify logic for null values
* Cast booleans instead of when clause
* Format
* Rename files to avoid confusion
* Add initial scalar_percentiles
* Add initial files for scalar_percentiles
* Add scalar_percentiles for fenix
* Add scalar_percentiles to run script
* Add problematic files to SKIP in format and dryrun
* Add installation ping
* Fix missing merge item
* Add missing newlines
* Reduce set of grouped attributes
* Factor out boolean_metric_types
* Add latest_versions template
* Add generated code for latest versions
* Update header
* Add latest versions to run script
* Update version filter using fenix_latest_versions_v1
* Add initial moustache files for scalar_aggregates
* Add Jinja2 dependency
* Update templates with more parameters
* Add format clauses and add query to be formatted
* Add formatted sql
* Add generated sql
python -m bigquery_etl.glam.scalar_aggregates_incremental > sql/telemetry_derived/clients_scalar_aggregates_v1/query.sql
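For reference, a minimal sketch of the Jinja2 rendering flow these template commits describe; the template directory, filenames, and variable names are illustrative assumptions, not the module's actual interface:

    from jinja2 import Environment, PackageLoader

    def render(sql_filename, **kwargs):
        """Render a Jinja2 (moustache-style) SQL template with query variables."""
        env = Environment(loader=PackageLoader("bigquery_etl", "glam/templates"))
        return env.get_template(sql_filename).render(**kwargs)

    # Hypothetical usage: point one template at different source/destination tables
    query = render(
        "scalar_aggregates.sql",
        source_table="glam_etl.fenix_clients_daily_scalar_aggregates_v1",
        destination_table="glam_etl.fenix_clients_scalar_aggregates_v1",
    )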
* Generalize clients_scalar_aggregates
* Refactor into attributes and attributes_list
* Add generated sql for generalized query
* Add glam templates to format_sql SKIP
* Fix dryrun by using AS and USING properly in sql
* Add generated sql
* Add instructions on adding new Python library
* Fix linting issues
* Use r""" for backslash in docstring
* Add Jinja2 dependencies to constraints.txt
* Document process for adding new Python dependencies
* Add copy of clients_daily_scalar_aggregates for fenix
* Change table to Fenix metrics ping and modify columns
* Modify get_scalar_probes to fetch the relevant list of metrics
* Remove logic for keyed booleans
* Add valid generated SQL for scalars
* Generate valid keyed_scalars
* Factor out attributes into reusable string
* Use the bigquery-etl formatter
* Add `--no-parameterize` flag for debugging in console
* Add option for table_id
* Add comma conditionally
* Add script to run against all Glean pings in dataset
* Move scripts into appropriate locations
* Use stable tables as source for generate script
* Report glean metric types instead of scalar/keyed-scalar
* Fix linting
* Add script to generate sql for each table in org_mozilla_fenix
* Add generated sql
* Rename script for running etl in testing environment
* Update run script to use generated sql
* Fix missing --table-id parameter
* Update header comment in script
* Update generated sql
* Add ping_type to list of attributes
* Update generated schemas
* Fail on NULL in assert_false udf
* Update tests/README.md
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
* Improve error message for ndjson parsing
* Make JSON error messages nicer
* Cast BYTES fields to/from string
BYTES types are not JSON-serializable. To deal with that, we do
two things:
1. Assume the input values are hex strings, and decode them
to get the BYTES field values (on input)
2. Encode BYTES fields as hex strings (on output)
This means that any data files must use hex strings for BYTES fields.
Note: this only works on top-level fields.
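A minimal sketch of that hex round-trip using only the standard library (the helper names are made up for illustration; a later commit in this list switches the encoding to base64):

    def decode_bytes_field(value: str) -> bytes:
        """On input: test data files store BYTES fields as hex strings."""
        return bytes.fromhex(value)

    def encode_bytes_field(value: bytes) -> str:
        """On output: BYTES field values are re-encoded as hex for JSON."""
        return value.hex()

    assert encode_bytes_field(decode_bytes_field("deadbeef")) == "deadbeef"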
* Add better discrepancy reporting for test assertions
When JSON blobs differ, it can be hard to tell what is wrong.
These functions make it easy to see what's different, and
automatically print the discrepancies so they are available
when tests fail.
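As a sketch of the idea (the helper below is hypothetical, not the repository's actual test utility), a dict-level diff that prints before asserting:

    def assert_equal_with_report(expected: dict, result: dict):
        """Hypothetical helper: print which keys differ, then assert equality."""
        if expected != result:
            print("only in expected:", sorted(set(expected) - set(result)))
            print("only in result:", sorted(set(result) - set(expected)))
            shared = set(expected) & set(result)
            print("values differ:",
                  sorted(k for k in shared if expected[k] != result[k]))
        assert expected == result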
* Add search_clients_last_seen for 1-year of history
This new dataset, search_clients_last_seen, contains a year
of history for each client. It is split into 3 main parts:
1. Recent info that is contained in search_clients_daily,
similar to how we store that in clients_last_seen
2. A year of history, represented as a BYTES field,
indicating which days the client was active for different
types of activity
3. For the major search providers, arrays of totals for
different metrics, split into 12 parts to account for
each month's total
This dataset will power LTV.
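To make the BYTES history concrete, a toy sketch of the bit-pattern bookkeeping, where bit i from the least-significant end means "active i days ago" (the names and 365-bit layout here are illustrative; the real logic lives in the bits/bytes UDFs):

    from typing import Optional

    DAYS = 365

    def shift_one_day(history: bytes, active_today: bool) -> bytes:
        """Shift the activity bitmap by one day and record today's bit."""
        n = (int.from_bytes(history, "big") << 1) | int(active_today)
        n &= (1 << DAYS) - 1  # keep only the most recent DAYS bits
        return n.to_bytes((DAYS + 7) // 8, "big")

    def days_since_seen(history: bytes) -> Optional[int]:
        """Index of the lowest set bit, i.e. days since last activity."""
        n = int.from_bytes(history, "big")
        return None if n == 0 else (n & -n).bit_length() - 1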
* Fix linting issues
* Enforce sampling on search_clients_daily
* Address review feedback
- Change all bits/bytes functions to include the number of bits
- Use fileobj for tests
- Rename some vars
- Use base64 for bytes in/out
* Generate sql
* Add missing comma
* Move search_clients_ls to search_derived
* Generate more sql
* Use clients_daily_v8
* Fix query
* Move tests to search_derived
* Fix tests for search_clients_daily_v8
* Don't dryrun with search_clients_last_seen
* Update udf/new_monthly_engine_searches_struct.sql
Co-Authored-By: Jeff Klukas <jeff@klukas.net>
* sample_id is now an int
* Add documentation
* Update schemas
* Make tests use int sample-id
* UDF for decompressing gzip data
* Update script for publishing UDFs to upload UDF dependency files
* Address review feedback for gunzip UDF
* Set default GCS bucket to moz-fx-data-prod-bigquery-etl
* Add function to upload UDF dependencies to GCS
* Set data-eng-circleci-tests context in CircleCI config
* Add approval step in CircleCI config
* Ignore hidden files in udf testing
* Add fenix event table generation
* Use fenix prod table
* Make fenix_events a view
* Fix last nit
* Remove date filter from view