Commit Graph

438 Commits

Author SHA1 Message Date
Anthony Miyaguchi c6cabd4391
Add statements to generate glam queries for fenix (#2208)
* Add statements to generate glam queries for fenix

* Use newlines in single string for multiple products

* Move glam generation into generate_sql script

* Add documentation on ignoring target project
2021-07-22 15:31:25 -04:00
Anna Scholtz b60d4f4be2 CircleCI build check for fork 2021-07-12 14:10:20 -07:00
Anna Scholtz aafa54c346 Add separate CI step for SQL and routine tests 2021-07-12 14:10:20 -07:00
Daniel Thorn 3c8894fdf1
Make schema validation part of dryrun (#2069) 2021-05-25 14:53:09 -04:00
Jeff Klukas c6f0c3ce81
Allow generate_sql to twice without raising error (#2067)
Fixes https://github.com/mozilla/bigquery-etl/issues/2066
2021-05-24 16:36:11 -04:00
Jeff Klukas 7486920237
Fix inconsistent invocation of bqetl in script (#2037)
This is causing view deploys to fail with:

> Please run ./bqetl bootstrap
2021-05-18 12:41:59 -07:00
Anna Scholtz 4443a6e463 Specify output_dir in generate-sql script
Co-authored-by: Jeff Klukas <jklukas@mozilla.com>
2021-05-18 11:24:27 -07:00
Anna Scholtz 5eb0ada329 Review feedback 2021-05-18 11:24:27 -07:00
Anna Scholtz 7a3b4f499f Remove old glean generation scripts 2021-05-18 11:24:27 -07:00
Anna Scholtz cee749c4ba Backfill with init option 2021-05-18 11:24:27 -07:00
Anna Scholtz bc14ec8877 Generate Glean table when creating generated-sql branch 2021-05-18 11:24:27 -07:00
Anthony Miyaguchi f58f0bfd3b
Revert "Add migration script for joining against first seen table (#1947)" (#1950)
This reverts commit e4dfedd285.
2021-04-12 15:47:16 -04:00
Anthony Miyaguchi e4dfedd285
Add migration script for joining against first seen table (#1947)
* Add migration script for joining against first seen table

* Update logic for is_new_profile

* Update templates to use DDL with partitioning/clustering

* Fix output of migrate tables to backfill-8

* Add instructions for backfilling

* Fix linting errors
2021-04-12 12:41:52 -07:00
Anthony Miyaguchi 871270f2c4
[DS-1424] Join baseline clients daily with first seen table (#1946)
* Add first_seen_date and related test fixtures

* Use is_new_profile instead of baseline_first_seen

* Update view for baseline_clients_first_seen

* Fix yamllint issues

* Set is_new_profile when submission matches first seen

* Include AS in table alias

* Nit: capitalize AS

* Update bigquery_etl/glean_usage/templates/baseline_clients_daily_v1.sql

Co-authored-by: Jeff Klukas <jklukas@mozilla.com>

* Update bigquery_etl/glean_usage/templates/baseline_clients_daily_v1.sql

Co-authored-by: Jeff Klukas <jklukas@mozilla.com>

* Update clustering specification

Co-authored-by: Jeff Klukas <jklukas@mozilla.com>
2021-04-12 12:29:57 -07:00
whd 7c1b03934b
Default branch (#1939)
* Rename default branch

* Rename branch

* Update circleci for default branch name
2021-04-06 21:15:21 +00:00
Anthony Miyaguchi 0aacbe5c22
Pull out main logic and generate example queries for glean usage (#1937)
* Move argument parser into shared function

* Move shared main entrypoint into common

* Update example script to include other usage queries

* Commit generated queries for example usage queries

* Parallelize generation of example queries

* Add docstring

* Remove ios example queries for daily and last seen

* Fix pydocstyle linting

* Add update_example_glean_usage to CI
2021-04-06 11:38:30 -07:00
Anthony Miyaguchi 1503a7fa89
[DS-1424] Implementation of mobile clients first seen (#1934)
* Add initial boilerplate for clients_first_seen

* Remove submission_timestamp as a field

* [wip] Join data against legacy fennec id if applicable

* Remove user facing view

* Revert "Remove user facing view"

This reverts commit a728a7882170eadad5413c7a7046c0f38297bb87.

* Add flag for fennec_id

* Update logic to limit rows in partitions to submission_date

* Add all sql in glean_usage to format ignores

* Separate init and query

* Add default encoders for testing sql

* Add test for initialization of baseline clients first seen in fenix

* Update query to update over previous history

* Add test for aggregation

* Add generated sql and tests for simple baseline clients first seen

* Add dry-run exceptions for clients first seen tables

* Add clients first seen to generated sql

* Update bigquery_etl/glean_usage/templates/baseline_clients_first_seen.metadata.yaml

Co-authored-by: Jeff Klukas <jklukas@mozilla.com>

* Update bigquery_etl/glean_usage/templates/baseline_clients_first_seen.metadata.yaml

Co-authored-by: Jeff Klukas <jklukas@mozilla.com>

* Group by sample id instead of min

* Add submission_date as baseline first seen date

Co-authored-by: Jeff Klukas <jklukas@mozilla.com>
2021-04-05 11:36:39 -07:00
Anna Scholtz 9fc546fd87 Rewrite experiment search aggregates query 2021-03-09 10:11:27 -08:00
Daniel Thorn 024e993c44
Record table references in metadata.yaml (#1875) 2021-03-09 12:29:05 -05:00
Jeff Klukas dd6ddee6b9
Use dataset labels to speed up stable view generation (#1863)
* Use dataset labels to speed up stable view generation

Builds on new dry run affordance from
https://github.com/mozilla/bigquery-etl/pull/1858

We also remove the `--no-dry-run` option now since only the single dry run
is now needed, and stable view generation completes in less than 2 seconds.
2021-03-02 15:05:39 -05:00
Daniel Thorn a190e18264
Automatically sort python imports (#1840) 2021-02-24 17:11:52 -05:00
Daniel Thorn 5d07beaca7
Use zetasql to get dependencies for dag generation (#1817) 2021-02-18 17:49:46 -05:00
Daniel Thorn 2ce8084dd9
Add option to generate stable views without dry run (#1814) 2021-02-18 12:02:21 -05:00
Ben Wu 3a62ba7490
Allow setting project for glam clustered query temp tables (#1821) 2021-02-17 12:22:22 -08:00
Linh Nguyen 28f15e16e5
Use generated SQL content as source for docs (#1811) 2021-02-16 13:33:53 -08:00
Jeff Klukas 0637808f95
Use probeinfo rather than BQ calls for glean_usage sql generation (#1786) 2021-02-16 13:26:11 -05:00
Frank Bertsch f675ccf533 Add query generation capability for events_daily
This is a straightforward way to share queries between datasets.
2021-02-10 17:03:02 -05:00
Jeff Klukas 3512fb6ff7
Publish generated views and queries to a generated-sql branch (#1775)
* Add CI task to push content to generated-sql branch

Fixes #1742

The
[`generated-sql`](https://github.com/mozilla/bigquery-etl/tree/generated-sql)
branch now exists and you can browse the contents. See, for example,
[telemetry.main](https://github.com/mozilla/bigquery-etl/tree/generated-sql/sql/moz-fx-data-shared-prod/telemetry/main)

Follow-ups for which I'll file issues:

- This doesn't currently publish the generated Glean baseline ETL queries
  and views; we'll need to update that logic to use probe-scraper metadata
  rather than listing tables in BigQuery (due to creds) to integrate it.
- Docs publishing should reference this generated content rather
2021-02-10 09:42:58 -05:00
Anna Scholtz 9eba25ac0f Monitoring data export fixes 2021-01-29 11:02:03 -08:00
Anna Scholtz 56c846dd07
CI validate views (#1711)
* Script for validating view definitions

* Add SKIP list for view validation

* Add view validation step to CI

* Regex for validating referenced tables in view definitions
2021-01-25 11:03:31 -08:00
Jeff Klukas b6ae2765c0
Add glean_usage ETL generation to generate_all_views (#1709)
* Add glean_usage ETL generation to generate_all_views

The new `generate_all_views` script is intended to replace `generate_views`
as the entrypoint for Jenkins. Its usage is demonstrated in the
`generate_and_publish_views` script.

This supports the move to user queries in the `mozdata` project.

* Add --user-facing-only

Co-authored-by: whd <whd@users.noreply.github.com>
2021-01-22 20:51:01 +00:00
Anthony Miyaguchi ef9d0efc78
Add metric and channel as clustering fields for GLAM (#1695) 2021-01-20 13:54:38 -08:00
Anna Scholtz 89bb53824c Remove date parameter 2021-01-20 12:36:17 -08:00
Anna Scholtz a80f4a2d3c Add remaining experiment enrollment Grafana queries 2021-01-19 13:10:19 -08:00
Anna Scholtz e868fd4e97 Experiments export data aggregated and by branch 2021-01-19 13:10:19 -08:00
Anna Scholtz 35b3e84ce2 Remove datasets required from experiment export script 2021-01-19 10:12:13 -08:00
Anna Scholtz 843903d6bd
Experiment enrollment monitoring queries (#1656)
* Experiment enrollment aggregates hourly

* Experiment enrollments recents query

* Add execution_delay support for tasks

* Experiment enrollment aggregates base query

* Schedule experiment enrollment cumulative population estimate and active population

* Experiment enrollment monitoring queries as views

* Script for exporting experiment monitoring data to GCS

* Export experiment monitoring data script aggregating data of longer running experiments

* Parallelize experiment monitoring data export

* init.sql for experiment enrollment monitoring queries

* Use Airflow ds_format macro for hourly destination table

* Use Airflow macros for experiments monitoring hourly execution delay

* experiment_enrollment_cumulative_population_estimate as query instead of view

* Fix referenced tables in enrollment_aggregates_hourly metadata and add comment

* Simplify cumulative population estimate query
2021-01-13 13:53:32 -08:00
Anthony Miyaguchi 7b28856491
Ensure the sql directory for glam-fenix exists (#1607) 2020-12-09 10:37:20 -08:00
Anthony Miyaguchi 3632c52815
Specify project when generating glam_etl sql (#1604) 2020-12-08 14:28:28 -08:00
Ben Wu b50a95944d
Separate queries on clients_scalar_aggregates by app_version (#1594) 2020-12-03 14:26:35 -05:00
Anthony Miyaguchi 4234c40040
Add minimal set of tests for GLAM Fenix queries (#1488)
* Add script to determine query dependencies

* Add schemas and folders for minimal test

* Add schema for geckoview_versions

* Add query params to each query

* Update schema for new queries

* Remove main from bootstrap file

* Add dataset prefix to schemas

* Add failing test for clients_histogram_aggregates

It turns out that the dependency resolution I'm using for autogenerate
the schemas is ignoring the views. I actually want to keep the views
around. The tables also all need to be prefixed with the dataset name or
they won't be inserted into the sql query correctly.

* Add successful test for clients histogram aggregates

* Add minimal tests for clients_scalar_aggregates

* Remove skeleton files for views (no test support for views)

* Add tests for latest versions

* Add tests for scalar bucket counts that passes

* Add scalar bucket counts

* Add test for scalar percentiles

* Add test for histogram bucket counts

* Add passing test for probe counts

* Add test for histogram percentiles

* Add tests for extract counts

* Update readme

* Add data for scalar percentiles test

* Fix linting errors

* Fix mypy issues with tests module

* Name it data instead of tests.*.data

* Ignore mypy on tests directory

* Remove mypy section

* Remove extra line in pytest

* Try pytest invocation of mypy-scripts-are-modules

* Run mypy outside of pytest

* Use exec on pytest instead of mypy

* Update tests/sql/glam-fenix-dev/glam_etl/bootstrap.py

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>

* Update tests/sql/glam-fenix-dev/glam_etl/README.md

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>

* Document bootstrap in documentation

* Use artificial range for histogram_percentiles

* Simplify parameters for scalar probe counts

* Simplify tests for histogram probe counts

* Add test for incremental histogram aggregates

* Update scalar percentile counts to count distinct client ids

* Update readme for creating a new test

* Use unorded list for sublist

* Use --ignore-glob for pytest to avoid data files

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
2020-12-01 17:11:45 -08:00
Ben Wu df0508841f
Move apple_app_store to marketing project dir (#1570) 2020-11-20 13:17:33 -05:00
Ben Wu 2692ebf1d7
Create script to copy ga_sessions tables between projects (#1565) 2020-11-20 11:58:58 -05:00
Anthony Miyaguchi 0c244613fb
Update glam fenix etl with updated scalar bucketing (#1493)
* Add initial udf replacements

* Update scalar bucketing scheme

* Update schemas in script

* Revert change to query

* Remove comma before CROSS JOIN

* Add functional query

* Add option to skip steps

* Add ordering for keys

* Update bigquery_etl/glam/templates/scalar_bucket_counts_v1.sql

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>

* Add instructions for copying tables and modify bucket location

* Generate schemas when GENERATE_ONLY specified

* Set build date to NULL instead of "*"

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
2020-11-03 16:07:00 -08:00
Anthony Miyaguchi c6e3f210b9
Fix #1470 - Wait on process ids from backgrounded tasks (#1475) 2020-10-26 10:10:22 -07:00
Anthony Miyaguchi 2aa055b178
Add script to list tables in glam_etl datasets (#1478) 2020-10-23 10:33:50 -07:00
Anthony Miyaguchi b7695049c6
Fix #1457 - Generate and run Fenix ETL for GLAM in glam-fenix-dev (#1458)
* Resolve generated sql to glam-fenix-dev and change output in sql/ dir

* Add new script for testing glam-fenix queries

* Add generated sql for version control

* Use variables correctly in bash

* Remove latest versions from UDF

* Update test to generate minimum set of tables for nightly

* Commit generated queries for testing

* Cast only if not glob

* Ignore dryrun and publish view for glam-fenix-dev

* Fix linting error

* Update comments

* Use DST_PROJECT consistently in scripts

* Update comments

* Update script/glam/test/test_glean_org_mozilla_fenix_glam_nightly

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>

* Update script/glam/generate_and_run_desktop_sql

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
2020-10-22 11:40:52 -07:00
Daniel Thorn 9e441fac96
Add script/bqetl for run cli without install (#1448) 2020-10-16 15:44:15 -07:00
Daniel Thorn 824ef5f6d5
quote query arguments (#1433) 2020-10-13 14:43:39 -07:00
Anna Scholtz 0d51459bd1 Move dependencies to udf_js_lib 2020-10-08 10:30:22 -07:00
Anna Scholtz 67c5265b6f Rename udf module to routine 2020-10-08 10:30:22 -07:00
Anna Scholtz 8cdc12b70f Add alternate project support for publishing UDFs 2020-10-08 10:30:22 -07:00
Anna Scholtz a1fadf293f Update path for publishing public UDFs 2020-10-05 13:55:07 -07:00
Anna Scholtz d1c67dab53 Move projects into high-level sql/ folder 2020-10-05 12:59:58 -07:00
Anna Scholtz 06233819ab Remove sql/ directory 2020-10-05 12:59:58 -07:00
Anthony Miyaguchi 0ed408e7dd
Fix #1329 - Use app_build_id as app_version in GLAM fenix nightly (#1354)
* Make versions to keep configurable

* Replace app_version with app_build_id in nightly

* Add jsonschema as a requirement=

* Filter based on build date instead of version for nightly

* Add script for comparing the output of two branches

* Add option for specifying the bucket in export

* Cast build_id to integer

* Remove latest versions from histogram aggregates

* Format logical_app_id

* Use @submission_date parameter in latest versions
2020-10-01 14:28:42 -07:00
Anna Scholtz 2e56471644 Move run_multipart_query logic to bigquery_etl 2020-09-24 08:55:35 -07:00
Anna Scholtz a604268c7e Move publish_static to bigquery_etl 2020-09-24 08:55:35 -07:00
Anna Scholtz 00a36c3553 Move json_to_table_ddl to bigquery_etl 2020-09-24 08:55:35 -07:00
Anna Scholtz 08be8da2a1 Move generate_incremental_table logic to bigquery_etl 2020-09-24 08:55:35 -07:00
Anna Scholtz f6bf253144 Move copy_deduplicate logic to bigquery_etl 2020-09-24 08:55:35 -07:00
Anna Scholtz 6f31338ecd Move view related scripts to view module 2020-09-24 08:55:35 -07:00
Anthony Miyaguchi dd283c264f
Add glam cli for incremental backfill (#1313)
* Add glam cli for listing processed app ids

* Make backfill scripts more consistent

* Add export to glam glean cli

* Add pandas dependency

* Add black format of glam-cli

* Commit hashes based on bigquery-etl container

* Fix various linting issues

* Be stricter with is_logical matching

* Fix more linting issues
2020-09-23 14:45:44 -07:00
Daniel Thorn 26c67c7ee8
Upgrade to pytest 6.0.1 (#1281)
Also upgrade and fix pytest plugins
2020-09-02 11:30:14 -07:00
Anna Scholtz 437cf67aa2 Refactor parse_udf 2020-09-02 10:24:38 -07:00
Anna Scholtz 0080ff8867 Refactor publish_udfs script 2020-09-02 10:24:38 -07:00
Anna Scholtz 2b29d24f59 Migrate UDFs to new format 2020-09-02 10:24:38 -07:00
Anna Scholtz debd57c662 Fix entrypoint run_query call 2020-08-27 21:08:14 -07:00
Anna Scholtz ffaaa2ab26 Call bigquery_etl.run_query from script/run_query 2020-08-27 14:48:32 -07:00
Anna Scholtz 04cbf80eab Add format command to CLI 2020-08-27 14:48:32 -07:00
Anthony Miyaguchi d8f782dc62
Add scripts for backfilling and exporting all fenix aggregates (#1255)
* Add scripts for backfilling and exporting all fenix aggregates

* Update script/glam/export_glean_all_fenix

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
2020-08-26 11:08:32 -07:00
Anna Scholtz cbf560a1fa Add CLI tests for creating queries 2020-08-21 11:10:58 -07:00
Anthony Miyaguchi 96b85854d2
Update Glam ETL for Fenix (#1240)
* Use UNION ALL instead of UNION

* Move tests into separate directory and add test for all fenix products

* Replace channel with *
2020-08-18 16:15:56 -07:00
Anthony Miyaguchi 222e04b081
Fix #1232 - Ignore glam_etl directory when publishing views (#1234) 2020-08-18 11:33:50 -07:00
Anthony Miyaguchi ca2204625d
Add views for logical Fenix app ids in GLAM ETL (#1221)
* Add views for logical app ids

* Add new generated sql

* Update generate_glean_sql script to handle logical apps

* Update logical app view for partitiontime

* Make sure to generate view for all of the app ids

* Update last versions to be logical app id agnostic

* Add formatting for black

* Fix linting error

* Update bigquery_etl/glam/generate.py

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>

* Add "all" option to STAGE

* Add new metrics added since last PR

Co-authored-by: Ben Wu <benjaminwu124@gmail.com>
2020-08-17 15:05:15 -07:00
Anthony Miyaguchi 36b7c184e6
Add script to backfill glam tables for a glean product (#1108)
* Add backfill script for glean products

* Specify product correctly and add target dataset

* Add product to example

* Use datetime.fromisoformat
2020-08-06 15:48:40 -07:00
Jeff Klukas d5d64359f6 Bug 1657360 Exclude pings with "automation" tag from stable
We will also need to update monitoring queries to account for this when
counting unique document_ids in decoded and live tables.
2020-08-06 12:56:15 -04:00
asiOvOtus 2acb30c9b0
Rewrite duplicated map udfs to mozfun shims (#1211)
* Rewrite duplicated map udfs to mozfun shims

* Format get_key_with_null.sql
2020-08-04 13:26:13 -07:00
Ben Wu 019666b51b
Add queries for exported app store data (#1207) 2020-07-29 18:02:16 -04:00
asiOvOtus 306c667b2d
Add unit tests and documentations for udfs (#1197)
* Add unit tests and documentations for udfs

* Auto format SQL files

* fix and format

Co-authored-by: Frank Bertsch <fbertsch@mozilla.com>
2020-07-28 11:54:44 -07:00
Ben Wu ab50e40fc6
Generalize fenix glam generate and run code (#1183) 2020-07-20 11:25:27 -04:00
Ben Wu c42aa317c4
Replace jq in generate_glean_sql (#1174) 2020-07-15 18:26:51 -04:00
Anna Scholtz cfc80e3da4 Fix mozfun comments 2020-07-15 11:24:17 -07:00
Anna Scholtz 2f7a07a578 Migrate some UDFs to mozfun 2020-07-15 11:24:17 -07:00
Anna Scholtz 0852f90125 Fix UDF signatures 2020-07-15 11:24:17 -07:00
Anna Scholtz 6d90a47bc5 Migrate some UDFs to mozfun 2020-07-15 11:24:17 -07:00
Anna Scholtz 4ddb5d4b58 Fix metadata migration 2020-07-15 11:24:17 -07:00
Anna Scholtz 312e0ed21a Refactor migrate_to_mozfun script 2020-07-15 11:24:17 -07:00
Anna Scholtz e24c7bdf41 Add script for migrating UDFs to mozfun 2020-07-15 11:24:17 -07:00
Anna Scholtz ee4a3ee0ce More task re-scheduling 2020-07-10 13:30:24 -07:00
Ben Wu 7653acdb6f
Split daily_histogram_aggregates query by process type (#1124) 2020-07-07 13:12:06 -04:00
Anna Scholtz dfd52b5647 Dry run example SQL files 2020-07-06 13:55:19 -07:00
Anna Scholtz 94768fc14e Generate SQL of doc examples for dry run 2020-07-06 13:55:19 -07:00
Anna Scholtz 88ecf499cd Generate docs 2020-07-02 13:37:38 -07:00
Jeff Klukas 790fec1c52 mozfun.histogram.extract cleanup
Follow-up to #1000 now that the function is published and I've had some
chance to use it.
2020-07-02 12:31:31 -04:00
Jeff Klukas ff2e30da32 Revert "Reference shared-prod views when republishing to other projects (#1105)"
This reverts commit 4654bbda31.
2020-07-01 13:38:21 -04:00
Jeff Klukas baeff74751 Bug 1649754 Remove reference to derived-datasets dry run url
We no longer have any destination tables in derived-datasets.
2020-07-01 09:52:12 -04:00
Jeff Klukas 4654bbda31
Reference shared-prod views when republishing to other projects (#1105)
Fixes https://github.com/mozilla/bigquery-etl/issues/1075
2020-06-30 12:46:23 -04:00
Jeff Klukas 9422909cfd
Retry once on "invalid snapshot time" when publishing views (#1095)
Fixes #1001
2020-06-25 13:44:54 -04:00
Daniel Thorn 19a00ebce3
Add shredder script to forward deletion requests to amplitude (#1082) 2020-06-24 11:20:17 -07:00