Commit graph

148 commits

Author SHA1 Message Date
Anna Scholtz 79feea7b8f Write last_updated to file when data gets published 2020-04-16 17:08:55 -07:00
Anna Scholtz e2e749f6bb Add last_updated to GCS datasets metadata 2020-04-16 17:08:55 -07:00
Anna Scholtz b25c271933 Simplify metadata label regex and disallow international characters 2020-04-16 11:38:59 -07:00
Anna Scholtz 760b45b02a Metadata lowercase check 2020-04-16 11:38:59 -07:00
Anna Scholtz 1b9af73379 Refactor public_data tests 2020-04-16 11:38:59 -07:00
Anna Scholtz b67d430ba1 Add tests for parsing metadata 2020-04-16 11:38:59 -07:00
Jeff Klukas 3cf06aeadb Remove redundant project prefix 2020-04-15 15:03:07 -04:00
Jeff Klukas 46e5ee8fb0 Add script for reporting on broken views
Addresses #903
2020-04-15 15:03:07 -04:00
Anna Scholtz 9b6aa5fb46 Refactor publish_json script 2020-04-15 08:49:24 -07:00
Anthony Miyaguchi 82a6a5f687
Fenix exports for GLAM (#870)
* Add views for extracting to glam

* wip: Add export script

* Rename extract queries and don't run them

* Add user counts

* Add generated sql

* Update extract queries to rely on current project

* Fix optional day partitioning

* Fix extraction to glam-fenix-dev project

* Add globs for ignoring dryrun and format

* Reorder columns in probe count extract

* Filter on null channel and version

* Do not print header

* Refactor datacube groupings and fix scalar_percentiles

* Rename extract tables

* Convert user counts into a view to avoid needless materialization

* Rename client probe counts to probe counts

* Update publish_views so job does not fail
2020-04-14 11:45:59 -07:00
Anna Scholtz bc5c116c80 generate_views version matches to list 2020-04-09 16:36:28 -07:00
Anna Scholtz adde137cce Fix generate_views version parsing 2020-04-09 16:36:28 -07:00
Anna Scholtz 8af044d844 Pretty print dataset metadata 2020-04-09 14:38:34 -07:00
Anna Scholtz bffcf3fc45 Fix lint complaints 2020-04-09 14:38:34 -07:00
Anna Scholtz f3a6d7c490 Add option to publish metadata to GCS to entrypoint script 2020-04-09 14:38:34 -07:00
Anna Scholtz 750738898d Tests for publishing GCS metadata for public data 2020-04-09 14:38:34 -07:00
Anna Scholtz a8d161e47f Refactor and add comments to gcs metadata publish 2020-04-09 14:38:34 -07:00
Anna Scholtz 9a7af8b1b5 Write metadata for datasets and tables to GCS 2020-04-09 14:38:34 -07:00
Jeff Klukas 966a0a7b11 Add days_created_profile_bits 2020-04-08 11:11:46 -04:00
Jeff Klukas 1ab2eb4f33 Check baseline ping reason field 2020-04-08 11:11:46 -04:00
Jeff Klukas e36abdd616 Generic Glean ETL fixups 2020-04-08 11:11:46 -04:00
Jeff Klukas 86350db0d4 Generated clients_daily and last_seen tables for all Glean apps
Addresses https://github.com/mozilla/bigquery-etl/issues/848

We commit a single application's generated queries here to allow CI to verify
them. I have run these scripts for the target app so that the tables exist
and CI should pass.
2020-04-06 16:47:51 -04:00
Anthony Miyaguchi 6e3351272e
Cleanup GLAM etl queries for Fenix (#851)
* Rename scalar_aggregates_incremental and remove telemetry

* Remove ping-type and telemetry variables

* Add initial files for final views

* Add glam templates to format ignore

* Add blacked scalar_percentiles

* Generate view for scalar aggregates

* Add generated view for daily histogram aggregate view

* Add probe counts to generated views

* Generalize writing out queries

* Add a latest_versions to generate

* Add histogram_percentiles to generate

* Add scalar bucket counts to generate

* Add histogram bucket counts to generate

* Add clients scalar aggregates to generate

* Move probe counts to generate

* Add scalar percentiles to generate

* Add clients histogram aggregates to generate

* Use generate within generate_fenix_sql script

* Fix probe_counts view and bucket counts

* Rename template for scalar bucket counts

* Fix client probe counts view

* Remove irrelevant telemetry where clause

* Add docstrings and shorten lines

* Add probe counts view to dryrun and publish_views ignores

* Rename udf for merged user data

* Use python3
2020-04-03 14:17:32 -07:00
Anna Scholtz 23b6e24f7e Add incremental_export option 2020-04-03 09:02:30 -07:00
Anna Scholtz ae2acd34af Update JSON export limits 2020-04-02 12:30:38 -07:00
Anna Scholtz 9ab5babcdb Refactor json export 2020-04-02 12:30:38 -07:00
Anna Scholtz 0b00e222bf Check for max. output file for json export 2020-04-02 12:30:38 -07:00
Anna Scholtz a298caa595 Updates JSON export tests 2020-04-02 12:30:38 -07:00
Anna Scholtz a8da84995c Limit number of files created when exporting to JSON 2020-04-02 12:30:38 -07:00
Anna Scholtz 9eaae84788 Write data to non-partitioned tmp table before exporting as JSON 2020-04-02 12:30:38 -07:00
Anna Scholtz 2102eb2a38 Update comments 2020-03-31 14:01:43 -07:00
Anna Scholtz c087e929a7 Refactor publish_metadata script 2020-03-31 14:01:43 -07:00
Anna Scholtz 1bad91ab33 Refactor generate_views script 2020-03-31 14:01:43 -07:00
Anna Scholtz 979ca4e8df Refactor publish_public_data_views script 2020-03-31 14:01:43 -07:00
Anna Scholtz a70275307b Forward query parameters when running queries in publish_json 2020-03-31 13:28:45 -07:00
Anna Scholtz 87ff585ccc Add logging to public_json script 2020-03-31 13:28:45 -07:00
Anna Scholtz 69b2a22971 Update json tests 2020-03-31 13:28:45 -07:00
Anna Scholtz 5f4b0dd8b7 Staging of json files 2020-03-31 13:28:45 -07:00
Anna Scholtz 9cd6d69754 Test JSON streaming 2020-03-31 13:28:45 -07:00
Anna Scholtz 74cf1a4cbd Use streaming for converting ndjson to json 2020-03-31 13:28:45 -07:00
Anna Scholtz 4825fb0664 Add tests for publishing json with mocking 2020-03-31 13:28:45 -07:00
Anna Scholtz 1f32c6010f Prevent publish_json script tests from running by default 2020-03-31 13:28:45 -07:00
Anna Scholtz a703989bed Refactor publish_json and factor out into separate class 2020-03-31 13:28:45 -07:00
Anna Scholtz 05301925cb Publish JSON data of query result 2020-03-31 13:28:45 -07:00
Anna Scholtz c2a5a62b1e [WIP] Separate script to publish data as json 2020-03-31 13:28:45 -07:00
Daniel Thorn b5f37518dd
Fix null date partition condition in shredder (#858) 2020-03-26 13:58:52 -07:00
Anthony Miyaguchi 33160eea96
Add histogram percentiles for Fenix data into GLAM (#829)
* Add initial histogram_percentiles

* Update metric_type with histogram_type suffix

* Add generated SQL (backwards incompatible)

* Add body for copy of histogram_percentiles_v1

* Update histogram_percentiles with Glean specific metrics

* Add histogram_percentiles module

* Uncomment histogram percentiles

* Add generated SQL

* Add template to ignore section of format_sql

* Add histogram percentiles to ignore of dryrun

* Move udf into persistent_udf directory

* Rewrite udf_js.glean_percentile

* Add generated SQL
2020-03-25 10:17:21 -07:00
Daniel Thorn c3127baac4
Standardize common script arguments (#828) 2020-03-25 09:40:36 -07:00
Anthony Miyaguchi 4f0080559a
Add histograms to fenix glam etl (#766)
* Add initial template for histogram aggregates

* Factor out common functions and get all distributions

* Add viable query for histogram aggregates

* Add more efficient aggregation

* Update header and update comment

* Add code to generate clients daily histograms

* Add queries for generated sql

* Return non-zero exit code when histograms not found

* Delete empty queries to reduce data scanned

* Add non-zero exit code for scalars if probes are not found

* Sort histograms for stable output

* Add view for histogram aggregates

* Add initial sql for histogram aggregates

* Format template scripts

* Add mostly reformatted sql for aggregates

* Update histogram aggregates before adding statements

* Fix up details for daily aggregation

* Add completed histograms template

* Add code to generate clients histogram aggregates

* Add init for clients histogram aggregates

* Remove sample_id from set of attributes

* Add sections to run generated sql

* Add generated sql

* Remove extra latest_version columns

* Fix many small issues during first draft of sql

* Fix clients histogram aggregates

* Add initial modification to probe counts

* Add histogram bucket counts

* Add option to generate histogram probe counts

* Update generated_fenix_sql for histograms

* Add generated sql

* Update run_fenix_sql

* Fix bucket counts

* Update source table for probe counts

* Add missing ping_type to histograms

* Add first,last,num buckets

* Update probe counts so it succeeds

* Add mozilla_schema_generator to dependencies

* Add metadata from probe-info for custom distributions

* Update probe counts with metadata for custom distributions

* Add UDF for generating functional buckets

* Add proper bucketing by including range_max of measures

* Format histogram udfs

* Add updated templates to skip

* Add new queries to dryrun ignore

* Add view to the publish ignore list

* Fix python linting

* Remove old comments from probe counts

* Do not count metadata from custom distributions twice

* Remove sum from histogram aggregates

* Add generated SQL

* Add sample_id to histograms earlier in pipeline

* Add generated SQL

* Add comments to functional bucketing for metrics
2020-03-18 13:53:28 -07:00
Daniel Thorn 90d266c708
Use streaming inserts for shredder state (#826) 2020-03-18 13:21:12 -07:00
Daniel Thorn 78b2337465
Sort shredder jobs by partition_id then table_id (#804) 2020-03-10 16:02:34 -07:00
Daniel Thorn e7e3da7b86
Use job.created for shredder state (#796) 2020-03-10 14:44:36 -07:00
Anna Scholtz fed6e7b297 Check metadata label length 2020-03-10 10:52:53 -07:00
Anna Scholtz 75530d58b2 Update entrypoint comment and metadata parsing 2020-03-10 10:52:53 -07:00
Anna Scholtz c2b56907fb PyYaml dryrun 2020-03-10 10:52:53 -07:00
Anna Scholtz ac6344045a Convenience function to get metadata of associated SQL file 2020-03-10 10:52:53 -07:00
Anna Scholtz 26ad69af95 Refactor run_query and add doc comments to Metadata 2020-03-10 10:52:53 -07:00
Anna Scholtz d4574bd0fd Public datasets in dryrun 2020-03-10 10:52:53 -07:00
Anna Scholtz ed3b2853ed Refactor scripts to use Metadata class 2020-03-10 10:52:53 -07:00
Anna Scholtz 103f37f3d0 Refactor parsing metadata 2020-03-10 10:52:53 -07:00
Daniel Thorn f6ebc9e1a8
Add shredder support for integer range partitioning (#788) 2020-03-09 12:37:30 -07:00
Daniel Thorn 8c5c6bdee9
Support using --only and --except together in shredder (#789) 2020-03-09 09:09:11 -07:00
Anthony Miyaguchi 4e773ba6eb
Simplify scalar aggregates for glam-fenix etl (#767)
* Update daily aggregates to run all scalars in a single query

* Update generate and run script for new scalar aggregates

* Update generated sql (and view)

* Fix linting

* Update SKIP for format
2020-02-26 11:22:20 -08:00
Anthony Miyaguchi d0b71bcefd
End-to-end Fenix scalar aggregates (#743)
* Refactor render into a separate function

* Add variables for source and destination tables

* Add support for aggregating glean pings

* Add render_init along with --init option

* Add partition clause and add proper init file

* Add attributes_type to the template

* Update clients_scalar_aggregates_v1 with dataset.table

* Add command for generating init for fenix scalars aggregates

* Add queries for fenix_clients_scalar_aggregates_v1

* Update partitioning in init

* Update glam scripts for scalar aggregates

* Update version to only include valid versions

* Add generated sql

* Add --quiet flag

* Add notes

* Fix linting and CI errors

* Ignore glam_etl in dryrun

* Add initial template files that have been formatted

* Update generated queries

* Add metric counts for histogram and scalars

* Update metric_counts_v1 for scalars only

* Add formatted version of telemetry_derived/clients_scalar_bucket_counts_v1

* Add module for generating metric bucketing

* Refactor generate_fenix_sql for skipping stages

* Add templates to format SKIP

* Fix trailing whitespace

* Add option to generate fenix bucket/probe counts

* Add initial bucket/probe counts sql for fenix

* Sort attributes for stable query generation

* Refactor bucketing logic

* Add scalar_metric_types variable

* Add argument parser and glean variables

* Update scalar bucket counts for glean

* Update run_fenix_sql with bucket counts

* Fix invalid syntax

* Do not aggregate booleans as a scalar

* Add scalar_metric_types to metric_counts_v1

* Add argparser and change source tablename to scalar

* Update fenix_clients_scalar_probe_counts_v1

* Remove first_bucket

* Add scalar_probe_counts to run script

* Removing first_bucket requires changing where clause conditional

* Get grouping attributes correct

* Give columns stable ordering

* Add correct query (that is too complex)

* Reduce number of combinations

* Simplify logic for null values

* Cast booleans instead of when clause

* Format

* Rename files to avoid confusion

* Add initial scalar_percentiles

* Add initial files for scalar_percentiles

* Add scalar_percentiles for fenix

* Add scalar_percentiles to run script

* Add problematic files to SKIP in format and dryrun

* Add installation ping

* Fix missing merge item

* Add missing newlines

* Reduce set of grouped attributes

* Factor out boolean_metric_types
2020-02-19 13:43:53 -08:00
Anthony Miyaguchi 0d892cba4e
Add scalar aggregates from clients daily scalar aggregates for Fenix (#735)
* Refactor render into a separate function

* Add variables for source and destination tables

* Add support for aggregating glean pings

* Add render_init along with --init option

* Add partition clause and add proper init file

* Add attributes_type to the template

* Update clients_scalar_aggregates_v1 with dataset.table

* Add command for generating init for fenix scalars aggregates

* Add queries for fenix_clients_scalar_aggregates_v1

* Update partitioning in init

* Update glam scripts for scalar aggregates

* Update version to only include valid versions

* Add generated sql

* Add --quiet flag

* Add notes

* Fix linting and CI errors

* Ignore glam_etl in dryrun

* Add latest_versions template

* Add generated code for latest versions

* Update header

* Add latest versions to run script

* Update version filter using fenix_latest_versions_v1
2020-02-19 10:51:22 -08:00
Daniel Thorn 7864154807
Remove support for cluster handling in shredder (#733) 2020-02-18 21:07:40 +01:00
Anthony Miyaguchi cf511d8cc2
Add template for clients_scalar_aggregates (#727)
* Add initial moustache files for scalar_aggregates

* Add Jinja2 dependency

* Update templates with more parameters

* Add format clauses and add query to be formatted

* Add formatted sql

* Add generated sql

python -m bigquery_etl.glam.scalar_aggregates_incremental > sql/telemetry_derived/clients_scalar_aggregates_v1/query.sql

* Generalize clients_scalar_aggregates

* Refactor into attributes and attributes_list

* Add generated sql for generalized query

* Add glam templates to format_sql SKIP

* Fix dryrun by using AS and USING properly in sql

* Add generated sql

* Add instructions on adding new Python library

* Fix linting issues

* Use r""" for backslash in docstring

* Add Jinja2 dependencies to constraints.txt

* Document process for adding new Python dependencies
2020-02-12 09:43:37 -08:00
Anna Scholtz 3f1cb398fa Undo formatting for old SQL files 2020-02-07 09:48:23 -08:00
Anna Scholtz 2c25b3a34e Review feedback 2020-02-07 09:48:23 -08:00
Anna Scholtz adb79cb2a5 Ignore newlines when parsing UDFs 2020-02-07 09:48:23 -08:00
Anna Scholtz 88f188a93a Transform UDFs to temporary UDFs in tests 2020-02-07 09:48:23 -08:00
Anna Scholtz ab9135f951 Publish persistent UDFs step in Circle CI 2020-02-07 09:48:23 -08:00
Anna Scholtz 97b5386b41 Change UDFs to persistent UDFs and remove sql generations script 2020-02-07 09:48:23 -08:00
Anthony Miyaguchi f32f866129
Bug 1610983 - Add clients daily scalar aggregates for GLAM in Fenix (#724)
* Add copy of clients_daily_scalar_aggregates for fenix

* Change table to Fenix metrics ping and modify columns

* Modify get_scalar_probes to fetch the relevant list of metrics

* Remove logic for keyed booleans

* Add valid generated SQL for scalars

* Generate valid keyed_scalars

* Factor out attributes into reusable string

* Use the bigquery-etl formatter

* Add `--no-parameterize` flag for debugging in console

* Add option for table_id

* Add comma conditionally

* Add script to run against all Glean pings in dataset

* Move scripts into appropriate locations

* Use stable tables as source for generate script

* Report glean metric types instead of scalar/keyed-scalar

* Fix linting

* Add script to generate sql for each table in org_mozilla_fenix

* Add generated sql

* Rename script for running etl in testing environment

* Update run script to use generated sql

* Fix missing --table-id parameter

* Update header comment in script

* Update generated sql

* Add ping_type to list of attributes

* Update generated schemas
2020-02-06 14:01:24 -08:00
Daniel Thorn 0e0567285d
Fix unholy WITH OFFSET format (#709) 2020-01-28 15:34:08 +01:00
Daniel Thorn 1efbe0344a
Add script for self serve deletion (#635) 2020-01-23 14:52:08 -08:00
Daniel Thorn 5fa7e4e61e
Correctly format scripting keywords (#693) 2020-01-21 20:05:47 -08:00
Daniel Thorn 7c134d5617
Enforce format_sql on more files (#659) 2020-01-10 17:07:21 -08:00
Daniel Thorn e871c70e09
Fail on NULL in assert_false udf (#657)
* Fail on NULL in assert_false udf

* Update tests/README.md

Co-Authored-By: Anna Scholtz <anna@scholtzan.net>

Co-authored-by: Anna Scholtz <anna@scholtzan.net>
2020-01-09 16:15:42 -08:00
Daniel Thorn 2f7de8683d
Enforce script/format_sql for all new sql files (#656) 2020-01-09 13:55:46 -08:00
Daniel Thorn 8ca73c2b60
Rewrite script/format_sql in python (#640) 2020-01-06 16:17:41 -08:00
Frank Bertsch 719f607a0a
Update UDF names with prefixed numbers (#593)
* Error on improper UDF names

* Rename udfs with prefixed numbers
2019-12-12 15:09:57 -05:00
Frank Bertsch 6c825425b3
Search clients last seen (#451)
* Improve error message for ndjson parsing

* Make JSON error messages nicer

* Cast BYTES fields to/from string

BYTES types are not JSON-serializable. To deal with that, we do
two things:
1. Assume the input tables are hex strings, and decode them
   to get the BYTES fields values (on input)
2. Encode BYTES fields as hex strings (on output)

This means that any data files use hex strings for BYTES fields.

Note: This only works on top-level fields

* Add better discrepancy reporting for test assertions

When JSON blobs differ, it can be hard to tell what is wrong.
These functions easily show what's different, and automatically
prints them to be available when tests fail.

* Add search_clients_last_seen for 1-year of history

This new dataset, search_clients_last_seen, contains a year
of history for each client. It is split into 3 main parts:

1. Recent info that is contained in search_clients_daily,
   similar to how we store that in clients_last_seen
2. A year of history, represented as a BYTES field,
   indicating which days they were active for different
   types of activity
3. Among the major search providers, arrays of totals of
   different metrics, split into 12 parts, to account for
   each months total

This dataset will power LTV.

* Fix linting issues

* Enforce sampling on search_clients_daily

* Address review feedback

- Change all bits/bytes functions to include no. of bits
- Use fileobj for tests
- Rename some vars
- Use base64 for bytes in/out

* Generate sql

* Add missing comma

* Move search_clients_ls to search_derived

* Generate moar sql

* Use clients_daily_v8

* Fix query

* Move tests to search_derived

* Fix tests for search_clients_daily_v8

* Don't dryrun with search_clients_last_seen

* Update udf/new_monthly_engine_searches_struct.sql

Co-Authored-By: Jeff Klukas <jeff@klukas.net>

* sample_id is now an int

* Add documentation

* Update schemas

* Make tests use int sample-id
2019-12-12 12:43:09 -05:00
Ben Wu 7727bbff31
Combine project and dataset names with table name in test runner (#520) 2019-11-19 14:59:41 -05:00
Daniel Thorn e872a76860
Add pytest plugins to lint python scripts (#410)
* Add pytest plugins to lint python scripts

* Fix lint errors
2019-10-08 14:00:11 -07:00
Daniel Thorn ce69d308dd
Improve pytest failure output for udf and sql tests (#391)
* Improve pytest failure output for udf and sql tests

* address review
2019-10-01 12:48:22 -07:00
Jeff Klukas 4867be2af6 Remove redundant `list` call
Co-Authored-By: Daniel Thorn <dthorn@mozilla.com>
2019-09-13 10:00:33 -04:00
Jeff Klukas f4c5ea8e7c Run black 2019-09-13 10:00:33 -04:00
Jeff Klukas 57a15e67c6 Make generate_sql more deterministic 2019-09-13 10:00:33 -04:00
Anna Scholtz 9580029e20 UDF for unzipping gzipped bytes (#272)
* UDF for decompressing gzip data

* Update script for publishing UDFs to upload UDF dependency files

* Address review feedback for gunzip UDF

* Set default GCS bucket to moz-fx-data-prod-bigquery-etl

* Add function to upload UDF dependencies to GCS

* Set data-eng-circleci-tests context in CircleCI config

* Add approval step in CircleCI config
2019-08-26 10:53:06 -07:00
Daniel Thorn 22520e31f6
Use prepend_udf_usage_definitions in generate_sql (#287) 2019-08-05 16:10:31 -07:00
Daniel Thorn e1bf990b9a
Add support for testing queries with persistent UDFs (#285) 2019-08-05 14:14:19 -07:00
Jeff Klukas 01cb6e1074 Refactor naming of UDFs 2019-07-24 09:01:13 -04:00
Frank Bertsch ca2ab0e7b2
Fenix events (#210)
* Ignore hidden files in udf testing

* Add fenix event table generation

* Use fenix prod table

* Make fenix_events a view

* Fix last nit

* Remove date filter from view
2019-07-05 13:59:51 -04:00
Anna Scholtz b06acd389d Fix generated SQL 2019-06-25 08:07:26 -07:00
Anna Scholtz aa637154c5 Ensure that UDFs are added only once and in order when generating SQL files 2019-06-25 08:07:26 -07:00
Anna Scholtz eabba390af Fix formatting issues 2019-06-25 08:07:26 -07:00
Anna Scholtz b62970f3a9 Makefile for generating sql and add newline breaks to new files 2019-06-25 08:07:26 -07:00