Граф коммитов

454 Коммитов

Автор SHA1 Сообщение Дата
Marina Samuel 45e882099d Update dryrun script. 2020-05-05 12:25:22 -04:00
Marina Samuel 08e553907e Update script for running desktop glam. 2020-05-05 12:25:22 -04:00
Ben Wu 6450c0036f
Bug 1632245 - Add new fenix apps to mobile search (#946) 2020-05-05 12:03:14 -04:00
Daniel Thorn 17310d06f0
Add script for reporting shredder cost (#936) 2020-04-30 10:23:07 -07:00
Anna Scholtz 74b119a826 Metadata validation refactoring 2020-04-29 14:18:53 -07:00
Anna Scholtz de36a0ddee Update integration tests 2020-04-29 14:18:53 -07:00
Anna Scholtz e7bd2fb50c Move all metadata related scripts to a metadata directory 2020-04-29 14:18:53 -07:00
Anna Scholtz f49b7920d2 Script for validating metadata files 2020-04-29 14:18:53 -07:00
Frank Bertsch 0543762c5f
Add views for search_clients_last_seen (#926)
* Add views for search_clients_last_seen

* Ignore failed search dryrun
2020-04-22 15:22:25 -04:00
Daniel Thorn 6dd5893444
Use CREATE TABLE to set expiration on copy_deduplicate temp tables (#911) 2020-04-16 12:39:48 -07:00
Anna Scholtz 011e1c2111 Run publish_public_data_json as bash script in entrypoint 2020-04-15 13:00:05 -07:00
Jeff Klukas 46e5ee8fb0 Add script for reporting on broken views
Addresses #903
2020-04-15 15:03:07 -04:00
Anna Scholtz 9b6aa5fb46 Refactory publish_json script 2020-04-15 08:49:24 -07:00
Anthony Miyaguchi 82a6a5f687
Fenix exports for GLAM (#870)
* Add views for extracting to glam

* wip: Add export script

* Rename extract queries and don't run them

* Add user counts

* Add generated sql

* Update extract queries to rely on current project

* Fix optional day partitioning

* Fix extraction to glam-fenix-dev project

* Add globs for ignoring dryrun and format

* Reorder columns in probe count extract

* Filter on null channel and version

* Do not print header

* Refactor datacube groupings and fix scalar_percentiles

* Rename extract tables

* Convert user counts into a view to avoid needless materialization

* Rename client probe counts to probe counts

* Update publish_views so job does not fail
2020-04-14 11:45:59 -07:00
Jason Thomas 4b17f1e22b
Add Pocket MAU view Bug 1629151 (#905) 2020-04-14 10:31:40 -07:00
Anna Scholtz 8af044d844 Pretty print dataset metadata 2020-04-09 14:38:34 -07:00
Anna Scholtz f3a6d7c490 Add option to publish metadata to GCS to entrypoint script 2020-04-09 14:38:34 -07:00
Anna Scholtz 9a7af8b1b5 Write metadata for datasets and tables to GCS 2020-04-09 14:38:34 -07:00
Jeff Klukas 03fc8911ac Add additional Glean apps to GUD tables 2020-04-09 14:28:00 -04:00
Jeff Klukas 7ff705f5ec Exempt broken activity_stream view 2020-04-08 11:11:46 -04:00
Jeff Klukas 9ff1d3666f Dryrun fixup 2020-04-08 11:11:46 -04:00
Jeff Klukas aeccb5c503 Add Overall 2020-04-07 09:44:18 -04:00
Jeff Klukas 98a8b83b0c Add script/export_incline_dash
The logic for incline exports in getting too complex to handle via Airflow
operations (see https://github.com/mozilla/telemetry-airflow/pull/935).

Instead, we put the logic into this script and will invoke via docker.
2020-04-07 09:44:18 -04:00
Jeff Klukas 86350db0d4 Generated clients_daily and last_seen tables for all Glean apps
Addresses https://github.com/mozilla/bigquery-etl/issues/848

We commit a single application's generated queries here to allow CI to verify
them. I have run these scripts for the target app so that the tables exist
and CI should pass.
2020-04-06 16:47:51 -04:00
Anna Scholtz 2c70919da7 Ignore asn_aggregates dryruns 2020-04-06 12:50:27 -07:00
Anthony Miyaguchi 6e3351272e
Cleanup GLAM etl queries for Fenix (#851)
* Rename scalar_aggregates_incremental and remove telemetry

* Remove ping-type and telemetry variables

* Add initial files for final views

* Add glam templates to format ignore

* Add blacked scalar_percentiles

* Generate view for scalar aggregates

* Add generated view for daily histogram aggregate view

* Add probe counts to generated views

* Generalize writing out queries

* Add a latest_versions to generate

* ADd histogram_percentiles to generate

* Add scalar bucket counts to generate

* Add histogram bucket counts to generate

* Add clients scalar aggregates to generate

* Move probe counts to generate

* Add scalar percentiles to generate

* Add clients histogram aggregates to generate

* Use generate within generate_fenix_sql script

* Fix probe_counts view and bucket counts

* Rename template for scalar bucket counts

* Fix client probe counts view

* Remove irrelevant telemetry where clause

* Add docstrings and shorten lines

* Add probe counts view to dryrun and publish_views ignores

* Rename udf for merged user data

* Use python3
2020-04-03 14:17:32 -07:00
Marina Samuel 9187175543 Convert run_fenix_sql to run_glam_sql. 2020-04-03 16:46:05 -04:00
Anthony Miyaguchi c9e0c8a564
Bug 1607229 - Add authorized view to buildhub2 table (#863)
* Bug 1607229 - Add authorized view to buildhub2 table

* Add buildhub2 to dryrun and publish_view ignores
2020-04-02 09:59:05 -07:00
Anna Scholtz 29295f0507 Add comments to refactored scripts 2020-03-31 14:01:43 -07:00
Anna Scholtz c087e929a7 Refactor publish_metadata script 2020-03-31 14:01:43 -07:00
Anna Scholtz 1bad91ab33 Refactor generate_views script 2020-03-31 14:01:43 -07:00
Anna Scholtz 979ca4e8df Refactor publish_public_data_views script 2020-03-31 14:01:43 -07:00
Anna Scholtz a70275307b Forward query parameters when running queries in publish_json 2020-03-31 13:28:45 -07:00
Anna Scholtz 4f40d15eed Configure CircleCI to run integration tests 2020-03-31 13:28:45 -07:00
Anna Scholtz 24e6c797d3 Update entrypoint to run query and publish json if flag is set 2020-03-31 13:28:45 -07:00
Anna Scholtz dade17d22e Set public GCS bucket as default for json publishing 2020-03-31 13:28:45 -07:00
Anna Scholtz 74cf1a4cbd Use streaming for converting ndjson to json 2020-03-31 13:28:45 -07:00
Anna Scholtz 4825fb0664 Add tests for publishing json with mocking 2020-03-31 13:28:45 -07:00
Anna Scholtz a703989bed Refactor publish_json and factor out into separate class 2020-03-31 13:28:45 -07:00
Anna Scholtz 438b87a6da Add tests for publishing json 2020-03-31 13:28:45 -07:00
Anna Scholtz 825b6ccf0e Update entrypoint and format ndjson to json 2020-03-31 13:28:45 -07:00
Anna Scholtz 9a1443b639 Add comments for making JSON data public 2020-03-31 13:28:45 -07:00
Anna Scholtz b9bcab6b51 Set upload JSON format 2020-03-31 13:28:45 -07:00
Anna Scholtz 05301925cb Publish JSON data of query result 2020-03-31 13:28:45 -07:00
Anna Scholtz c2a5a62b1e [WIP] Separate script to publish data as json 2020-03-31 13:28:45 -07:00
Anna Scholtz f207c7b838 Skip dryrun for usage anomaly query 2020-03-31 10:56:01 -07:00
benmiroglio bb03dca054 Add addons_daily_v1 query to dryrun skip list 2020-03-31 08:42:10 -04:00
Daniel Thorn d7ea8520fa
Add OR REPLACE to migrated_clients_v1 (#854) 2020-03-25 10:55:37 -07:00
Daniel Thorn caa7a64b50
Remove new init.sql from dry_run (#853) 2020-03-25 10:30:59 -07:00
Anthony Miyaguchi 33160eea96
Add histogram percentiles for Fenix data into GLAM (#829)
* Add initial histogram_percentiles

* Update metric_type with histogram_type suffix

* Add generated SQL (backwards incompatible)

* Add body for copy of histogram_percentiles_v1`

* Update histogram_percentiles with Glean specific metrics

* Add histogram_percentiles module

* Uncomment histogram percentiles

* Add generated SQL

* Add template to ignore section of format_sql

* Add histogram percentiles to ignore of dryrun

* Move udf into persistent_udf directory

* Rewrite udf_js.glean_percentile

* Add generated SQL
2020-03-25 10:17:21 -07:00
Daniel Thorn d51ef367a9
Add legacy scripts used to validate sink for *_live tables (#845) 2020-03-25 10:03:39 -07:00
Daniel Thorn c3127baac4
Standardize common script arguments (#828) 2020-03-25 09:40:36 -07:00
Frank Bertsch e64a8e0e87
Add incline exec dash submission_date based query (#843)
* Add incline exec dash submission_date based query

* Move to derived dataset

* Create migrated_clients_v1 as materialized

Previously we had this as a view. Now it will
update every day, taking the latest value seen from migration
pings as its value.

* Update view for materialized query

* Use clients_last_seen view and migrated_clients

Previously, we were using the materialized clients_last_seen
and the unmaterialized migrated_clients. We are now materializing
the latter, and using the view of the former.

- Added documentation
- Removed unecessary arrays

* Run SQL formatter

* Change keyword identifier with prefixed underscore

* Ignore missing table errors in dryrun

* Run migrated_clients init only

* Add docline for negative churn
2020-03-25 11:32:36 -04:00
Anthony Miyaguchi 34fa6d7e54
Refactor udf_percentile in telemetry histogram percentiles as a persistent udf (#832)
* Remove histogram_percentiles from format_sql ignore

* Move udf for histogram percentiles into persistent udf

* Add an approximate equals for testing

* Add some initial tests for linear histograms

* Add tests and minor formatting of comments

This fixes a particular bug within one of the conditionals. However, the
results for linear histograms still look incorrect.

* Update tests/assert/approx_equals.sql

Co-Authored-By: Daniel Thorn <dthorn@mozilla.com>

* Format approx_equals

* Remove a few TODOs

Co-authored-by: Daniel Thorn <dthorn@mozilla.com>
2020-03-23 12:30:23 -07:00
Anthony Miyaguchi 4f0080559a
Add histograms to fenix glam etl (#766)
* Add initial template for histogram aggregates

* Factor out common functions and get all distributions

* Add viable query for histogram aggregates

* Add more efficient aggregation

* Update header and update comment

* Add code to generate clients daily histograms

* Add queries for generated sql

* Return non-zero exit code when histograms not found

* Delete empty queries to reduce data scanned

* Add non-zero exit code for scalars if probes are not found

* Sort histograms for stable output

* Add view for histogram aggregates

* Add initial sql for histogram aggregates

* Format template scripts

* Add mostly reformatted sql for aggregates

* Update histogram aggregates before adding statements

* Fix up details for daily aggregation

* Add completed histograms template

* Add code to generate clients histogram aggregates

* Add init for clients histogram aggregates

* Remove sample_id from set of attributes

* Add sections to run generated sql

* Add generated sql

* Remove extra latest_version columns

* Fix many small issues during first draft of sql

* Fix clients histogram aggregates

* Add initial modification to probe counts

* Add histogram bucket counts

* Add option to generate histogram probe counts

* Update generated_fenix_sql for histograms

* Add generated sql

* Update run_fenix_sql

* Fix bucket counts

* Update source table for probe counts

* Add missing ping_type to histograms

* Add first,last,num buckets

* Update probe counts so it succeeds

* Add mozilla_schema_generator to dependencies

* Add metadata from probe-info for custom distributions

* Update probe counts with metadata for custom distributions

* Add UDF for generating functional buckets

* Add proper bucketing by including range_max of measures

* Format histogram udfs

* Add updated templates to skip

* Add new queries to dryrun ignore

* Add view to the publish ignore list

* Fix python linting

* Remove old comments from probe counts

* Do not count metadata from custom distributions twice

* Remove sum from histogram aggregates

* Add generated SQL

* Add sample_id to histograms earlier in pipeline

* Add generated SQL

* Add comments to functional bucketing for metrics
2020-03-18 13:53:28 -07:00
Jeff Klukas a1af581961 Filter out temporary tables created by the Beam BigQuery sink job
Closes https://github.com/mozilla/bigquery-etl/issues/834
2020-03-18 09:57:28 -04:00
Anna Scholtz 80d77951ad Retrieve tables from BigQuery and update metadata 2020-03-13 09:20:55 -07:00
Anna Scholtz 5c18d8ef84 Update publish_metadata to write metadata to public tables 2020-03-13 09:20:55 -07:00
Rob Hudson ff2d826ef1
Update glam extract query to use parameters (#809) 2020-03-12 10:27:19 -07:00
Nan Jiang 76a4bf797f Bug 1620663 - Part 1. Add active users_last_seen tables for CFR 2020-03-11 15:45:43 -04:00
Anna Scholtz a0232213b2 Allow partitions for destination_table names in run_query 2020-03-11 10:40:42 -07:00
Anna Scholtz 4f2a52289f Use subprocess.check_call when running queries 2020-03-11 10:33:14 -07:00
Jeff Klukas 02ffaf2aa4 Remove views on top of sync derived data
We will be deleting the underlying tables.
See https://bugzilla.mozilla.org/show_bug.cgi?id=1612541
2020-03-11 08:56:04 -04:00
Anna Scholtz 98dce31f7f Generate views for public tables that exist in BigQuery 2020-03-10 10:52:53 -07:00
Anna Scholtz 75530d58b2 Update entrypoint comment and metadata parsing 2020-03-10 10:52:53 -07:00
Anna Scholtz f2c458a8e4 More validation of destination_table name for running query 2020-03-10 10:52:53 -07:00
Anna Scholtz 1e688cd063 Change destination table for public data 2020-03-10 10:52:53 -07:00
Anna Scholtz 2c998a9252 Reset DryRun 2020-03-10 10:52:53 -07:00
Anna Scholtz c2b56907fb PyYaml dryrun 2020-03-10 10:52:53 -07:00
Anna Scholtz ac6344045a Convenience function to get metadata of associated SQL file 2020-03-10 10:52:53 -07:00
Anna Scholtz 0746465e11 Dryrun in public data project 2020-03-10 10:52:53 -07:00
Anna Scholtz 26ad69af95 Refactor run_query and add doc comments to Metadata 2020-03-10 10:52:53 -07:00
Anna Scholtz d4574bd0fd Public datasets in dryrun 2020-03-10 10:52:53 -07:00
Anna Scholtz ed3b2853ed Refactor scripts to use Metadata class 2020-03-10 10:52:53 -07:00
Anna Scholtz 103f37f3d0 Refactor parsing metadata 2020-03-10 10:52:53 -07:00
Anna Scholtz c91eac813b Add script for generating and publishing views for public data queries 2020-03-10 10:52:53 -07:00
Anna Scholtz e39eb5c10b Run Bigquery query using bq shell command 2020-03-10 10:52:53 -07:00
Anna Scholtz 37d22b9b68 WIP run queries and check if they should be made public 2020-03-10 10:52:53 -07:00
Anthony Miyaguchi ecaaf8ae15
Fix #779 - Ignore view deploys to glam_etl (#780) 2020-03-04 15:05:02 -08:00
Jeff Klukas e3d875cd93 Coalesce the client_info.locale value in baseline ping views
Glean pings now contain a client_info.locale field; baseline pings
previously reported locale as a string metric. This change provides
some historical continuity so that views on baseline pings show
client_info.locale values even if only the old metric is populated.
2020-02-28 12:52:21 -05:00
Anthony Miyaguchi 4e773ba6eb
Simplify scalar aggregates for glam-fenix etl (#767)
* Update daily aggregates to run all scalars in a single query

* Update generate and run script for new scalar aggregates

* Update generated sql (and view)

* Fix linting

* Update SKIP for format
2020-02-26 11:22:20 -08:00
Marina Samuel 88e184525c Use sample_id partitioning along more of the histogram_aggregates pipeline. 2020-02-24 15:43:22 -05:00
Anthony Miyaguchi 4f2d5dd51d
Add script for intitial scheduling (#757)
* Update run script to be more generic

* Update run script with parameters and avoid reusing destination

* Add view for clients_daily_scalar_aggregates

* Add new view to dryrun

* Remove old comment
2020-02-21 11:13:20 -08:00
Jeff Klukas 8bbcd8892d Update combine_adject_days_bits 2020-02-20 15:33:41 -05:00
Jeff Klukas dc0bbed352 More first seen ignores 2020-02-20 15:33:41 -05:00
Jeff Klukas 45ae248e5a first_seen fixes 2020-02-20 15:33:41 -05:00
Anthony Miyaguchi d0b71bcefd
End-to-end Fenix scalar aggregates (#743)
* Refactor render into a separate function

* Add variables for source and destination tables

* Add support for aggregating glean pings

* Add render_init along with --init option

* Add partition clause and add proper init file

* Add attributes_type to the template

* Update clients_scalar_aggregates_v1 with dataset.table

* Add command for generating init for fenix scalars aggregates

* Add queries for fenix_clients_scalar_aggregates_v1

* Update partititioning in init

* Update glam scripts for scalar aggregates

* Update version to only include valid versions

* Add generated sql

* Add --quiet flag

* Add notes

* Fix linting and CI errors

* Ignore glam_etl in dryrun

* Add initial template files that have been formatted

* Update generated queries

* Add metric counts for histogram and scalars

* Update metric_counts_v1 for scalars only

* Add formatted version of telemetry_derived/clients_scalar_bucket_counts_v1

* Add module for generating metric bucketing

* Refactor generate_fenix_sql for skipping stages

* Add templates to format SKIP

* Fix trailing whitespace

* Add option to generate fenix bucket/probe counts

* Add initial bucket/probe counts sql for fenix

* Sort attributes for stable query generation

* Refactor bucketing logic

* Add scalar_metric_types variable

* Add argument parser and glean variables

* Update scalar bucket counts for glean

* Update run_fenix_sql with bucket counts

* Fix invalid syntax

* Do not aggregate booleans as a scalar

* Add scalar_metric_types to metric_counts_v1

* Add argparser and change source tablename to scalar

* Update fenix_clients_scalar_probe_counts_v1

* Remove first_bucket

* Add scalar_probe_counts to run script

* Removing first_bucket requires changing where clause conditional

* Get grouping attributes correct

* Give columns stable ordering

* Add correct query (that is too complex)

* Reduce number of combinations

* Simplify logic for null values

* Cast booleans instead of when clause

* Format

* Rename files to avoid confusion

* Add initial scalar_percentiles

* Add initial files for scalar_percentiles

* Add scalar_percentiles for fenix

* Add scalar_percentiles to run script

* Add problematic files to SKIP in format and dryrun

* Add installation ping

* Fix missing merge item

* Add missing newlines

* Reduce set of grouped attributes

* Factor out boolean_metric_types
2020-02-19 13:43:53 -08:00
Anthony Miyaguchi 0d892cba4e
Add scalar aggregates from clients daily scalar aggregates for Fenix (#735)
* Refactor render into a separate function

* Add variables for source and destination tables

* Add support for aggregating glean pings

* Add render_init along with --init option

* Add partition clause and add proper init file

* Add attributes_type to the template

* Update clients_scalar_aggregates_v1 with dataset.table

* Add command for generating init for fenix scalars aggregates

* Add queries for fenix_clients_scalar_aggregates_v1

* Update partititioning in init

* Update glam scripts for scalar aggregates

* Update version to only include valid versions

* Add generated sql

* Add --quiet flag

* Add notes

* Fix linting and CI errors

* Ignore glam_etl in dryrun

* Add latest_versions template

* Add generated code for latest versions

* Update header

* Add latest versions to run script

* Update version filter using fenix_latest_versions_v1
2020-02-19 10:51:22 -08:00
Anna Scholtz 2590b1bd78 Add support for tags when publishing metadata 2020-02-18 09:32:53 -08:00
Daniel Thorn f3e11b4255
Use pip-compile instead of constraints.txt (#736)
* Fix docstyle check

* Use pip-compile instead of constraints.txt
2020-02-12 20:26:24 +01:00
Anthony Miyaguchi cf511d8cc2
Add template for clients_scalar_aggregates (#727)
* Add initial moustache files for scalar_aggregates

* Add Jinja2 dependency

* Update templates with more parameters

* Add format clauses and add query to be formatted

* Add formatted sql

* Add generated sql

python -m bigquery_etl.glam.scalar_aggregates_incremental > sql/telemetry_derived/clients_scalar_aggregates_v1/query.sql

* Generalize clients_scalar_aggregates

* Refactor into attributes and attributes_list

* Add generated sql for generalized query

* Add glam templates to format_sql SKIP

* Fix dryrun by using AS and USING properly in sql

* Add generated sql

* Add instructions on adding new Python library

* Fix linting issues

* Use r""" for backslash in docstring

* Add Jinja2 dependencies to constraints.txt

* Document process for adding new Python dependencies
2020-02-12 09:43:37 -08:00
Marina Samuel 4da81311a1 Break up clients_histogram_aggregates in Glam into smaller chunks. 2020-02-11 19:49:48 -05:00
Anna Scholtz 3f1cb398fa Undo formatting for old SQL files 2020-02-07 09:48:23 -08:00
Anna Scholtz e4ad3a60ed Remove template references from formatting script 2020-02-07 09:48:23 -08:00
Anna Scholtz e4babb120b Update --udf-dirs help string 2020-02-07 09:48:23 -08:00
Anna Scholtz 88f188a93a Transform UDFs to temporary UDFs in tests 2020-02-07 09:48:23 -08:00
Anna Scholtz 77c7f73edb Bugfix: add project id when creating persistend UDFs 2020-02-07 09:48:23 -08:00
Anna Scholtz ab9135f951 Publish persistent UDFs step in Circle CI 2020-02-07 09:48:23 -08:00
Anna Scholtz 97b5386b41 Change UDFs to persistent UDFs and remove sql generations script 2020-02-07 09:48:23 -08:00
Anthony Miyaguchi f32f866129
Bug 1610983 - Add clients daily scalar aggregates for GLAM in Fenix (#724)
* Add copy of clients_daily_scalar_aggregates for fenix

* Change table to Fenix metrics ping and modify columns

* Modify get_scalar_probes to fetch the relevant list of metrics

* Remove logic for keyed booleans

* Add valid generated SQL for scalars

* Generate valid keyed_scalars

* Factor out attributes into reusable string

* Use the bigquery-etl formatter

* Add `--no-parameterize` flag for debugging in console

* Add option for table_id

* Add comma conditionally

* Add script to run against all Glean pings in dataset

* Move scripts into appropriate locations

* Use stable tables as source for generate script

* Report glean metric types instead of scalar/keyed-scalar

* Fix linting

* Add script to generate sql for each table in org_mozilla_fenix

* Add generated sql

* Rename script for running etl in testing environment

* Update run script to use generated sql

* Fix missing --table-id parameter

* Update header comment in script

* Update generated sql

* Add ping_type to list of attributes

* Update generated schemas
2020-02-06 14:01:24 -08:00
Jeff Klukas 7f5079b26e sqlfiles -> sql_files 2020-02-05 09:46:49 -05:00
Jeff Klukas 7905d7fa21 Parallelize scripts/publish_views
Per request of @whd
2020-02-05 09:46:49 -05:00
Anna Scholtz 041376b487 Add init.sql for VR browser clients_daily to dryrun skip list 2020-01-24 12:33:20 -08:00
Daniel Thorn 1f343af30d format_sql on devtools_panel_usage_v1 2020-01-24 09:12:09 -05:00
Daniel Thorn 1efbe0344a
Add script for self serve deletion (#635) 2020-01-23 14:52:08 -08:00
Daniel Thorn 5fa7e4e61e
Correctly format scripting keywords (#693) 2020-01-21 20:05:47 -08:00
Anna Scholtz efcba3286d Improvements for CRC32 stored procedure 2020-01-21 14:58:24 -08:00
Daniel Thorn 58bb0183b8
Allow generate_incremental_table to backfill days in reverse (#696)
by specifying --start as a date after --end
2020-01-21 14:01:00 -08:00
Jeff Klukas c092f7479c Syntax error in publish_views for fenix 2020-01-21 16:12:45 -05:00
Jeff Klukas 4f201da964 Filter out histograms in fenix metrics ping from Glean SDK<19 2020-01-21 14:16:51 -05:00
Anna Scholtz b31fbe3497 Metadata publish improvements and update clients_daily_v6 metadata 2020-01-17 16:03:59 -08:00
Anna Scholtz 165fe50cc8 Script for updating metadata of table 2020-01-17 16:03:59 -08:00
Anna Scholtz 47f77b7c62 Copy metadata.yaml when generating SQL 2020-01-17 16:03:59 -08:00
Jeff Klukas 0ae8143af7 Bug 1609666 Use SAFE_CAST in udf.json_extract_int_map (#681) 2020-01-16 09:06:14 -08:00
Daniel Thorn 7c134d5617
Enforce format_sql on more files (#659) 2020-01-10 17:07:21 -08:00
Jeff Klukas 19a4353c97 Bug: make skipping authorized views more robust
This was not working in the production context where the target sql dir is
under /tmp as discussed in https://github.com/mozilla/bigquery-etl/pull/655#issuecomment-572789847
2020-01-10 13:43:30 -05:00
Daniel Thorn e871c70e09
Fail on NULL in assert_false udf (#657)
* Fail on NULL in assert_false udf

* Update tests/README.md

Co-Authored-By: Anna Scholtz <anna@scholtzan.net>

Co-authored-by: Anna Scholtz <anna@scholtzan.net>
2020-01-09 16:15:42 -08:00
Daniel Thorn 2f7de8683d
Enforce script/format_sql for all new sql files (#656) 2020-01-09 13:55:46 -08:00
Jeff Klukas ac4a17c33f Add list of authorized views to exempt from publishing
Related to CI failures addressed in https://github.com/mozilla/bigquery-etl/issues/653
2020-01-09 11:04:50 -05:00
Jeff Klukas a5b5da6220 Inject null for negative session_length and event timestamps
As of https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/474
we will be allowing negative values for session lengths and event timestamps.
We provide some safety by modifying the user-facing view for main pings to null
out these negative values (since large negative values may cause significant
skew in aggregations).

We don't modify the user-facing view for event pings, but rather assume users
will be applying the deanonymize_events UDF and handle nulling negative values
there.

See bugs
https://bugzilla.mozilla.org/show_bug.cgi?id=1592012
and
https://bugzilla.mozilla.org/show_bug.cgi?id=1602521
2020-01-09 09:28:55 -05:00
Anthony Miyaguchi 8ed78e2a18
Fix #653 - CI failing on activity_stream/tile_id_types/views.sql (#654) 2020-01-08 15:32:12 -08:00
Jeff Klukas 78619180d0 Use labels to determine glean pings 2020-01-08 10:11:11 -05:00
Jeff Klukas be92fd1a4e Add parsed start and end times to views on top of Glean schemas
Closes https://github.com/mozilla/gcp-ingestion/issues/633
2020-01-08 10:11:11 -05:00
Daniel Thorn 8ca73c2b60
Rewrite script/format_sql in python (#640) 2020-01-06 16:17:41 -08:00
Frank Bertsch de80cfd652
RFM View for LTV (#611)
* Add new UDFs for BYTE column/day_seen

Rename bitpos to align with the new convention.

* Add search_rfm dataset for LTV

* Move RFM calculations to UDF

* Address review feedback

* Fully escape UDFs

* Fix _actual_ missing UDF

* Don't dryrun; access denied
2019-12-19 18:01:44 -05:00
Jeff Klukas d66f64d4ab Bug: publish_views ignored some view.sql files
Closes #600

We improve the logic for parsing view definition files, and we also now
error out when we encounter a view.sql file that doesn't match our parsing
rather than silently skipping.
2019-12-18 13:10:51 -05:00
Sunah Suh ddfaf63f83
Add athena query migration script for posterity (#592)
* Add athena query migration script for posterity

* Add legacy athena migration script to pytest exclusions
2019-12-12 17:55:02 -05:00
Frank Bertsch 6c825425b3
Search clients last seen (#451)
* Improve error message for ndjson parsing

* Make JSON error messages nicer

* Cast BYTES fields to/from string

BYTES types are not JSON-serializable. To deal with that, we do
two things:
1. Assume the input tables are hex strings, and decode them
   to get the BYTES fields values (on input)
2. Encode BYTES fields as hex strings (on output)

This means that any data files use hex strings for BYTES fields.

Note: This only works on top-level fields

* Add better discrepancy reporting for test assertions

When JSON blobs differ, it can be hard to tell what is wrong.
These functions easily show what's different, and automatically
prints them to be available when tests fail.

* Add search_clients_last_seen for 1-year of history

This new dataset, search_clients_last_seen, contains a year
of history for each client. It is split into 3 main parts:

1. Recent info that is contained in search_clients_daily,
   similar to how we store that in clients_last_seen
2. A year of history, represented as a BYTES field,
   indicating which days they were active for different
   types of activity
3. Among the major search providers, arrays of totals of
   different metrics, split into 12 parts, to account for
   each months total

This dataset will power LTV.

* Fix linting issues

* Enforce sampling on search_clients_daily

* Address review feedback

- Change all bits/bytes functions to include no. of bits
- Use fileobj for tests
- Rename some vars
- Use base64 for bytes in/out

* Generate sql

* Add missing comma

* Move search_clients_ls to search_derived

* Generate moar sql

* Use clients_daily_v8

* Fix query

* Move tests to search_derived

* Fix tests for search_clients_daily_v8

* Don't dryrun with search_clients_last_seen

* Update udf/new_monthly_engine_searches_struct.sql

Co-Authored-By: Jeff Klukas <jeff@klukas.net>

* sample_id is now an int

* Add documentation

* Update schemas

* Make tests use int sample-id
2019-12-12 12:43:09 -05:00
Ben Wu 50932354ce
Add script for publishing csv's as static tables (#582) 2019-12-11 14:25:33 -05:00
Jeff Klukas d1948fa445 Revert "Add authorized views for payload_bytes raw and error"
This reverts commit 1d2fa74f9e.
2019-12-10 14:36:14 -05:00
Jeff Klukas b77a2b9ac1 Revert "Add all relevant views to the authorization list"
This reverts commit 4f064f2e6d.
2019-12-10 14:36:14 -05:00
Jeff Klukas 4f064f2e6d Add all relevant views to the authorization list 2019-12-10 13:58:56 -05:00
Jeff Klukas 1d2fa74f9e Add authorized views for payload_bytes raw and error
Supports https://github.com/mozilla/bigquery-etl/issues/360
2019-12-10 13:58:56 -05:00
Ben Wu 7d9782b1ba
Bug 1543434 - Create search datasets for mobile (#559) 2019-12-06 13:28:53 -05:00
Anthony Miyaguchi b938356d48
Bug 1601139 - Add query to sample documents per doctype (#570)
* Bug 1601139 - Add query to sample documents per doctype

* Add docstring, fix formatting, and update column name
2019-12-04 13:46:00 -08:00
Daniel Thorn e11d009aac
Fix case expected for XCOM_PUSH (#575)
and add spaces to avoid issues with empty variables
2019-12-04 12:35:26 -08:00
Sunah Suh f45591a834
Fix #550: create airflow xcom output file in dockerfile and make writ… (#553) 2019-12-04 14:32:08 -05:00
Daniel Thorn c70d2e179d
Reimplement experiments_v1 as SQL (#565)
* Reimplement experiments_v1 as SQL

* Apply suggestions from code review

Co-Authored-By: Sunah Suh <github@sunahsuh.com>

* Update templates/telemetry_derived/experiments_v1/get_experiment_list.py

* fix generate_sql
2019-12-03 15:08:57 -08:00
Jeff Klukas 0094f4ba7d Normalize metadata in generated views on historical ping tables
Merging this change will cause the changes to be used on the next
deploy of the schema generation pipeline

See https://github.com/mozilla-services/cloudops-infra/blob/master/projects/data-shared/Jenkinsfile.bigquery.prod#L105
2019-12-02 15:21:32 -05:00
Sunah Suh 25b702d082
Add tables to replace experiment enrollment aggregates spark streamin… (#524)
* Add tables to replace experiment enrollment aggregates spark streaming job

* Switch to python to fill in date in enrollment aggregates live view since parameters are not allowed in view defs

* Direct output of arbitrary commands in entrypoint script to airflow xcom location
2019-11-27 16:55:21 -05:00
Marina Samuel 4465965f14 Code cleanup. 2019-11-25 15:30:33 -05:00
Daniel Thorn ce28624b74
Improve format-sql for views and timestamp functions (#528) 2019-11-25 14:15:32 -05:00
Daniel Thorn 9468f997ab
Make addons and addon_aggregates exactly replace spark versions (#532) 2019-11-25 13:14:52 -05:00
Jeff Klukas 52e0a1acab Move nondesktop KPI queries to stable table DAG
This makes it explicit that we no longer are using imported Parquet data.
It also moves several of these tables to the shared-prod project where
we want them to live long-term.
2019-11-20 10:22:29 -05:00
Ben Wu c0160496d4
Add views for search dataset (#513) 2019-11-18 19:26:15 -05:00
Daniel Thorn 8822b522aa
Promote sql clients_daily_v6 (#501) 2019-11-14 18:08:49 -05:00
Daniel Thorn 4b80ee2c23
Support restoring int columns in export_to_parquet (#506) 2019-11-14 10:41:03 -05:00
Ben Wu 73dc724086
Switch search to read from flattened main summary (#500) 2019-11-13 13:20:14 -08:00
Daniel Thorn 7b1c9d96ad
Support bigquery export to parquet via avro (#492) 2019-11-07 13:56:33 -05:00
Daniel Thorn 9176dc940e
Unnest clients_daily (#481) 2019-11-06 17:14:02 -05:00
Marina Samuel 384a017935 Update dryrun script. 2019-11-06 16:13:26 -05:00
Daniel Thorn f239525363
Fix skip list for publishing udfs (#489) 2019-11-06 15:28:34 -05:00
Daniel Thorn fe4ec05c93
Add --dataset_id and --project_id to script/run_multipart_query (#488) 2019-11-06 13:54:25 -05:00
Daniel Thorn 1d871edc0b
Add list of udfs to skip when publishing (#487) 2019-11-06 13:37:47 -05:00
Daniel Thorn dfb54323bf
Fix detection of maps when table includes dataset (#485) 2019-11-06 12:37:56 -05:00
Daniel Thorn eba9db159d
Add support for replacing columns in export_to_parquet (#443) 2019-11-05 14:17:06 -05:00
Daniel Thorn 890141c140
Reimplement main_summary_v4 as SQL (#258) 2019-11-05 13:32:25 -05:00
Anthony Miyaguchi 4c44359310 Fix #465 - Add regex strings to STRING_REGEX in format-sql (#470) 2019-10-31 17:04:26 -07:00
Anthony Miyaguchi 084b960602
Use `#!/usr/bin/env python3` consistently (#461) 2019-10-30 14:06:22 -07:00
Daniel Thorn b65fbbbd93
Fix automatic formatting for nested types (#457)
without this `<` and `>` are getting formatted as operators instead of parens
2019-10-29 22:17:06 -07:00
Daniel Thorn cad76a9b6f Fix CI by skipping slow-to-validate queries (#448) 2019-10-25 09:15:20 -05:00
Sunah Suh 387536cbee Fixes to json -> table ddl generator script (#444) 2019-10-24 14:56:19 -07:00
Anthony Miyaguchi d60c0fd842
Add dataset for monitoring schema errors over time (#442)
* Add query for last month of schema errors

* Add generated sql for schema error counts

* Move schemas into correct location

* Add document_version and named groups

* Skip schema error counts in dryrun
2019-10-23 15:37:05 -07:00
Daniel Thorn 0f433f6a91
Support many billing projects and dates in copy_deduplicate (#426)
* Support many billing projects and dates in copy_deduplicate

* fix docs for --project_id

* explain default --billing_projects behavior

* Fix return value bug
2019-10-21 11:16:24 -07:00
Marina Samuel a42d97af2d Add new queries to dryrun. 2019-10-17 15:25:19 -04:00
Daniel Thorn 9bf053de74
Add --preceding-days option to copy_deduplicate (#413) 2019-10-14 08:54:40 -07:00
Jeff Klukas 096a209ced Fix bugs in monitoring views
Also cleans up a bug in the script for publishing views to get udf_js/gunzip
working, and removes accidental print statements in generate_sql.
2019-10-10 11:48:28 -04:00
Jeff Klukas 68c4d79228 Replace sql dir all at once in generate_sql
I got tired of running generate_sql, then checking git status while it was
running and seeing a jumble of deleted files. This PR changes the behavior to
build the files in a temp dir and then copy into place only at the end.
2019-10-10 09:21:30 -04:00
Frank Bertsch 239fab252a
Pipeline sql (#388)
* Views for monitoring structured ingestion errors

* Add UDF for extracting missing columns

* Add docs for json_extract_missing_cols

* Fix missing WITH clause

* Add generated sql

* Fix test function

* Update views

* Update tests

* Use shared-prod

* Update docstring for missing cols udf

* Move schemas to structured_ingestion dataset

* Don't dryrun queries on payload_bytes

* Change format of payload_bytes views

- Use a with_ratio to reduce duplication
- Change the view name to match the filename

* Fix missing test function

* Move to monitoring dataset

* Move to new test structure

* Remove spaces from js udfs

* Use persistent gunzip UDF
2019-10-09 10:03:58 -04:00
Jeff Klukas 2e821fbdc1 Use the shared-prod URL for the dryrun script 2019-10-08 20:21:48 -04:00
Daniel Thorn e872a76860
Add pytest plugins to lint python scripts (#410)
* Add pytest plugins to lint python scripts

* Fix lint errors
2019-10-08 14:00:11 -07:00
Jeff Klukas fff2ff3275
fxa_users_services tables (#396) 2019-10-08 13:31:01 -04:00
Daniel Thorn 8ccce8702c
Use chunksize=1 for consistent ordering in script/generate_incremental_table (#403) 2019-10-07 10:39:47 -07:00
Ben Wu 484d5079a7
Add additional fields to search datasets (#381) 2019-09-30 10:26:28 -04:00
Daniel Thorn 6158817ea3
Improve list_tables speed for script/copy_deduplicate (#382) 2019-09-26 11:28:22 -07:00
Daniel Thorn 91ba7297d5
Fix incorrect parameter name in copy_deduplicate (#383) 2019-09-25 13:30:29 -07:00
Daniel Thorn 16a1491821
Add --slices option to copy_deduplicate (#380) 2019-09-25 11:27:36 -07:00
Daniel Thorn 6c05a96847 dont create temp tables for dry_run 2019-09-24 13:52:36 -04:00
Jeff Klukas 29f44ebbf5 Bug: datetime logic in copy_deduplicate
Closes https://github.com/mozilla/bigquery-etl/issues/376

Addition to datetime.date considers only the days part of a timedelta,
so we have to convert to a timestamp first.
2019-09-24 13:52:36 -04:00
Daniel Thorn 54ae019ff3
Set temp table expiration in copy_deduplicate (#374) 2019-09-23 13:03:19 -07:00
Daniel Thorn 4f48ae21ef
Add --hourly option to copy_deduplicate (#370) 2019-09-23 10:54:21 -07:00
Jeff Klukas f0f5ec99a6 Import events from FxA oauth server
I've already created the new table and backfilled existing dates.

Addresses #348
2019-09-19 14:01:58 -04:00
Daniel Thorn 469c03ec10
Add script to format sql (#173) 2019-09-18 17:48:53 -07:00
Jeff Klukas 0ea63c7775 dryrun script updates 2019-09-17 15:30:59 -04:00
Jeff Klukas f4c5ea8e7c Run black 2019-09-13 10:00:33 -04:00
Sunah Suh f9c611a906
Fix UDF publisher script (#330) 2019-09-04 10:54:48 -05:00
Jeff Klukas 71ad6652f5 Update publish_views script for new directory structure 2019-08-28 20:58:43 -04:00
Jeff Klukas c2269a69af Update generate_view script for new directory structure
Closes #317
2019-08-28 20:58:43 -04:00
Sunah Suh 030ca5872a
Add script to recreate table creation DDL SQL from json description (#309)
Add script to recreate table creation DDL SQL from json description
2019-08-28 11:49:42 -05:00
Daniel Thorn 99fe0dfd9e
Move queries into destination-table directories (#286)
* Move queries into destination-table directories

* Apply suggestions from code review

Co-Authored-By: Jeff Klukas <jeff@klukas.net>
2019-08-26 12:52:49 -07:00
Anna Scholtz 9580029e20 UDF for unzipping gzipped bytes (#272)
* UDF for decompressing gzip data

* Update script for publishing UDFs to upload UDF dependency files

* Address review feedback for gunzip UDF

* Set default GCS bucket to moz-fx-data-prod-bigquery-etl

* Add function to upload UDF dependencies to GCS

* Set data-eng-circleci-tests context in CircleCI config

* Add approval step in CircleCI config
2019-08-26 10:53:06 -07:00
Jeff Klukas 5b005570db Make copy_deduplicate query more efficient for large tables
Closes #307
2019-08-23 14:43:17 -04:00
Anna Scholtz 7a6f7aacf8 View generation fixes 2019-08-21 15:03:50 -07:00
Anna Scholtz 52061238f5 Add wrapper script for generating and publishing views 2019-08-21 15:03:50 -07:00
Anna Scholtz 1d87797bd6 Fix rebase conflicts 2019-08-21 15:03:50 -07:00
Anna Scholtz 7520de5092 Script for auto-generating views 2019-08-21 15:03:50 -07:00
Jeff Klukas 7e533b4b24 Apply suggestions from code review
Co-Authored-By: Daniel Thorn <dthorn@mozilla.com>
2019-08-20 16:47:33 -04:00
Jeff Klukas a407f50c14 Explain the case of ignoring tables with trailing underscore 2019-08-20 16:47:33 -04:00
Jeff Klukas 5c06876590 Support altering target project in publish_views script 2019-08-20 16:47:33 -04:00
Jeff Klukas 12215f88fa Generate latest-version views for derived tables 2019-08-20 16:47:33 -04:00
Daniel Thorn 91e1be5394
Use mode last in clients_daily_v7 (#86) 2019-08-14 14:49:38 -07:00
Jeff Klukas 55cdf93c5f instantiate client after parsing args 2019-08-07 13:18:31 -04:00
Jeff Klukas a929285ca0 Apply suggestions from code review
Co-Authored-By: Daniel Thorn <dthorn@mozilla.com>
2019-08-07 13:18:31 -04:00
Jeff Klukas 46232da366 Refactor dryrun script for concurrency and permissions 2019-08-07 13:18:31 -04:00
Jeff Klukas fbc1b9f564 Fix up imports 2019-08-07 13:18:31 -04:00
Jeff Klukas 937d294d65 Add publish_views script 2019-08-07 13:18:31 -04:00
Jeff Klukas f2ecc96ce8 Adapt view update script for generating definitions instead 2019-08-07 13:18:31 -04:00
Daniel Thorn 28ce0d1f11 Add script for updating latest-version views 2019-08-07 13:18:31 -04:00
Daniel Thorn 22520e31f6
Use prepend_udf_usage_definitions in generate_sql (#287) 2019-08-05 16:10:31 -07:00
Daniel Thorn e1bf990b9a
Add support for testing queries with persistent UDFs (#285) 2019-08-05 14:14:19 -07:00
Daniel Thorn a241017c15
Reuse bigquery client and set default project_id (#282) 2019-08-02 13:36:10 -07:00
Daniel Thorn 5308e79570
detect errors when publishing udfs (#281) 2019-08-02 13:16:43 -07:00
Jeff Klukas 15640b831f Support options with underscores and fix incorrect variable 2019-08-01 14:00:57 -04:00
Jeff Klukas 0bc42132a2 Add --parallelism option 2019-08-01 10:15:33 -04:00
Jeff Klukas 4242d95777 Remove new entrypoint clause 2019-08-01 10:15:33 -04:00
Jeff Klukas 00cef9d7e9 Run black and refactor --only and --except args 2019-08-01 10:15:33 -04:00
Jeff Klukas ccb65d6d18 Apply suggestions from code review
Co-Authored-By: Daniel Thorn <dthorn@mozilla.com>
2019-08-01 10:15:33 -04:00
Jeff Klukas c9f65a7af8 Refactor to allow jobs to run in a different project
I believe Airflow may need to issue the jobs from derived-datasets
for the time being, so we make sure to fully qualify all table
references with the project_id that's passed as a parameter.
2019-08-01 10:15:33 -04:00
Jeff Klukas 0a3d4ea59d Fix typo
Co-Authored-By: Anna Scholtz <anna@scholtzan.net>
2019-08-01 10:15:33 -04:00
Jeff Klukas a275cfb9e5 Add copy_deduplicate script
Closes #220

A PR to add schedule this script in Airflow to follow.
2019-08-01 10:15:33 -04:00
Allen Short 351b42e84a Dry-run each query in CircleCI against prod datasets (#261)
* Dry-run each query in CircleCI against prod datasets

* Apply suggestions from code review

* Update script/dryrun
2019-07-30 10:26:25 -07:00
Daniel Thorn f79d075448
Add dataset names to paths in sql/ (#265)
* Add dataset names to paths in sql/

* rename clients_last_seen_raw_v1 to clients_last_seen_v1

* rename telemetry_raw to telemetry_derived

* address review
2019-07-30 09:39:22 -07:00
Jeff Klukas 01cb6e1074 Refactor naming of UDFs 2019-07-24 09:01:13 -04:00
Jeff Klukas 680c26ac41 Efficiency tweak: avoid double-publishing UDFs 2019-07-24 09:01:13 -04:00
Anna Scholtz ff356466f6 Bugfix: missing UDFs without dependencies 2019-07-22 13:44:28 -07:00
Anna Scholtz 4f897edd8a Add project ID when creating UDFs 2019-07-22 13:44:28 -07:00
Anna Scholtz c3d06f94d2 Script to publish persistent UDFs 2019-07-22 13:44:28 -07:00
Daniel Thorn d6e35295ec
Fix help page for script/generate_incremental_table (#244) 2019-07-22 12:58:35 -07:00
Anna Scholtz 7207a4e52f Move SQL templates to templates/ and add generated SQL 2019-06-25 08:07:26 -07:00
Anna Scholtz fe7325dcb4 Run SQL generation script in when creating docker image 2019-06-25 08:07:26 -07:00
Anna Scholtz aa637154c5 Ensure that UDFs are added only once and in order when generating SQL files 2019-06-25 08:07:26 -07:00
Anna Scholtz a6661c5896 Trigger SQL query generation in pytest and update CircleCI config 2019-06-25 08:07:26 -07:00
Anna Scholtz b62970f3a9 Makefile for generating sql and add newline breaks to new files 2019-06-25 08:07:26 -07:00
Anna Scholtz f2efcc0432 Adopt CircleCI script to generate SQL queries 2019-06-25 08:07:26 -07:00
Anna Scholtz fd21ba88c2 Add Python script to generate SQL files with UDF declarations 2019-06-25 08:07:26 -07:00
Jeff Klukas 5eb134ca86 fixups found while running the deletions 2019-05-23 16:42:08 -04:00
Jeff Klukas 120153dabe respond to review comments 2019-05-23 16:42:08 -04:00
Jeff Klukas 845fa792c3 Codify archiving of exact mau table 2019-05-23 16:42:08 -04:00
Jeff Klukas 76a8a23e54 Use generate_incremental_table for clients_last_seen backfill 2019-05-23 16:42:08 -04:00
Jeff Klukas d9669e325a Add comments on tables that do not exist in BQ 2019-05-23 16:42:08 -04:00
Jeff Klukas d481d93861 delete from experiments and search_clients_daily datasets 2019-05-23 16:42:08 -04:00
Jeff Klukas 090fce87cb Better handling for bq tables 2019-05-23 16:42:08 -04:00
Jeff Klukas b421eeeb11 Correct time range for gs 2019-05-23 16:42:08 -04:00
Jeff Klukas 9970aebd4e Add delete-from-bq.sh 2019-05-23 16:42:08 -04:00
Jeff Klukas 7c8cbbc0f4 Bug 1550814 Remove data collected during hotfix rollout
See https://bugzilla.mozilla.org/show_bug.cgi?id=1550814
2019-05-23 16:42:08 -04:00
Jeff Klukas e782c5f6ff Add --destination-table and selectExprs options to export_to_parquet
We implemented a view-based solution for creating clients_last_seen
from clients_last_seen_raw in Athena and Presto, but this made the
table unavailable from Spark.

By adding in these options, we can materialize view logic at the
time of writing to Parquet, so that it will be available to all
Parquet consumers.
2019-05-22 14:01:40 -04:00
Jeff Klukas e0dcbdaf36 Usability improvements for generate_incremental_table 2019-05-16 09:41:14 -04:00
Daniel Thorn 1dad4e14b4
Rewrite generate_incremental_table in python (#126)
fixes #115
2019-05-14 09:22:31 -07:00
Daniel Thorn c2e416cefd
Create ~/.bigqueryrc without GCLOUD_SERVICE_KEY (#112) 2019-05-01 13:38:31 -07:00
Daniel Thorn 606fec9c04
Set sane defaults for bq use in airflow (#110) 2019-05-01 08:24:57 -07:00