* Add views for extracting to glam
* wip: Add export script
* Rename extract queries and don't run them
* Add user counts
* Add generated sql
* Update extract queries to rely on current project
* Fix optional day partitioning
* Fix extraction to glam-fenix-dev project
* Add globs for ignoring dryrun and format
* Reorder columns in probe count extract
* Filter on null channel and version
* Do not print header
* Refactor datacube groupings and fix scalar_percentiles
* Rename extract tables
* Convert user counts into a view to avoid needless materialization
* Rename client probe counts to probe counts
* Update publish_views so job does not fail
The logic for incline exports is getting too complex to handle via Airflow
operations (see https://github.com/mozilla/telemetry-airflow/pull/935).
Instead, we put the logic into this script and will invoke it via Docker.
Addresses https://github.com/mozilla/bigquery-etl/issues/848
We commit a single application's generated queries here to allow CI to verify
them. I have run these scripts for the target app so that the tables exist
and CI should pass.
* Add incline exec dash submission_date based query
* Move to derived dataset
* Create migrated_clients_v1 as materialized
Previously we had this as a view. Now it will
be updated every day, taking the latest value seen from migration
pings as its value.
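A minimal sketch of that daily update, assuming hypothetical table and column names (migration_ping_stable, migration_value) rather than the actual schema:

```sql
-- Hypothetical sketch: combine yesterday's table with today's migration pings
-- and keep the most recent value per client.
WITH combined AS (
  SELECT client_id, migration_value, updated_at
  FROM migrated_clients_v1
  UNION ALL
  SELECT client_id, migration_value, submission_timestamp AS updated_at
  FROM migration_ping_stable
  WHERE DATE(submission_timestamp) = @submission_date
)
SELECT
  client_id,
  ARRAY_AGG(migration_value ORDER BY updated_at DESC LIMIT 1)[OFFSET(0)] AS migration_value,
  MAX(updated_at) AS updated_at
FROM combined
GROUP BY client_id
```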
* Update view for materialized query
* Use clients_last_seen view and migrated_clients
Previously, we were using the materialized clients_last_seen
and the unmaterialized migrated_clients. We are now materializing
the latter, and using the view of the former.
- Added documentation
- Removed unnecessary arrays
* Run SQL formatter
* Prefix keyword identifier with an underscore
* Ignore missing table errors in dryrun
* Run migrated_clients init only
* Add docline for negative churn
* Remove histogram_percentiles from format_sql ignore
* Move udf for histogram percentiles into persistent udf
* Add an approximate equals for testing
* Add some initial tests for linear histograms
* Add tests and minor formatting of comments
This fixes a particular bug within one of the conditionals. However, the
results for linear histograms still look incorrect.
* Update tests/assert/approx_equals.sql
Co-Authored-By: Daniel Thorn <dthorn@mozilla.com>
* Format approx_equals
* Remove a few TODOs
Co-authored-by: Daniel Thorn <dthorn@mozilla.com>
* Add initial template for histogram aggregates
* Factor out common functions and get all distributions
* Add viable query for histogram aggregates
* Add more efficient aggregation
* Update header and update comment
* Add code to generate clients daily histograms
* Add queries for generated sql
* Return non-zero exit code when histograms not found
* Delete empty queries to reduce data scanned
* Add non-zero exit code for scalars if probes are not found
* Sort histograms for stable output
* Add view for histogram aggregates
* Add initial sql for histogram aggregates
* Format template scripts
* Add mostly reformatted sql for aggregates
* Update histogram aggregates before adding statements
* Fix up details for daily aggregation
* Add completed histograms template
* Add code to generate clients histogram aggregates
* Add init for clients histogram aggregates
* Remove sample_id from set of attributes
* Add sections to run generated sql
* Add generated sql
* Remove extra latest_version columns
* Fix many small issues during first draft of sql
* Fix clients histogram aggregates
* Add initial modification to probe counts
* Add histogram bucket counts
* Add option to generate histogram probe counts
* Update generated_fenix_sql for histograms
* Add generated sql
* Update run_fenix_sql
* Fix bucket counts
* Update source table for probe counts
* Add missing ping_type to histograms
* Add first, last, and num buckets
* Update probe counts so it succeeds
* Add mozilla_schema_generator to dependencies
* Add metadata from probe-info for custom distributions
* Update probe counts with metadata for custom distributions
* Add UDF for generating functional buckets
* Add proper bucketing by including range_max of measures
* Format histogram udfs
* Add updated templates to skip
* Add new queries to dryrun ignore
* Add view to the publish ignore list
* Fix python linting
* Remove old comments from probe counts
* Do not count metadata from custom distributions twice
* Remove sum from histogram aggregates
* Add generated SQL
* Add sample_id to histograms earlier in pipeline
* Add generated SQL
* Add comments to functional bucketing for metrics
Glean pings now contain a client_info.locale field; baseline pings
previously reported locale as a string metric. This change provides
some historical continuity so that views on baseline pings show
client_info.locale values even if only the old metric is populated.
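A minimal sketch of that continuity logic; the view itself is generated, so the old metric's field path (metrics.string.glean_baseline_locale) and the source table name are assumptions:

```sql
-- Hypothetical sketch: prefer the new client_info.locale field and fall back
-- to the old string metric when only the old metric is populated.
SELECT
  * REPLACE (
    (SELECT AS STRUCT client_info.* REPLACE (
      COALESCE(client_info.locale, metrics.string.glean_baseline_locale) AS locale
    )) AS client_info
  )
FROM
  baseline_v1  -- assumed source table name
```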
* Update daily aggregates to run all scalars in a single query
* Update generate and run script for new scalar aggregates
* Update generated sql (and view)
* Fix linting
* Update SKIP for format
* Update run script to be more generic
* Update run script with parameters and avoid reusing destination
* Add view for clients_daily_scalar_aggregates
* Add new view to dryrun
* Remove old comment
* Refactor render into a separate function
* Add variables for source and destination tables
* Add support for aggregating glean pings
* Add render_init along with --init option
* Add partition clause and add proper init file
* Add attributes_type to the template
* Update clients_scalar_aggregates_v1 with dataset.table
* Add command for generating init for fenix scalars aggregates
* Add queries for fenix_clients_scalar_aggregates_v1
* Update partitioning in init
* Update glam scripts for scalar aggregates
* Update version to only include valid versions
* Add generated sql
* Add --quiet flag
* Add notes
* Fix linting and CI errors
* Ignore glam_etl in dryrun
* Add initial template files that have been formatted
* Update generated queries
* Add metric counts for histogram and scalars
* Update metric_counts_v1 for scalars only
* Add formatted version of telemetry_derived/clients_scalar_bucket_counts_v1
* Add module for generating metric bucketing
* Refactor generate_fenix_sql for skipping stages
* Add templates to format SKIP
* Fix trailing whitespace
* Add option to generate fenix bucket/probe counts
* Add initial bucket/probe counts sql for fenix
* Sort attributes for stable query generation
* Refactor bucketing logic
* Add scalar_metric_types variable
* Add argument parser and glean variables
* Update scalar bucket counts for glean
* Update run_fenix_sql with bucket counts
* Fix invalid syntax
* Do not aggregate booleans as a scalar
* Add scalar_metric_types to metric_counts_v1
* Add argparser and change source tablename to scalar
* Update fenix_clients_scalar_probe_counts_v1
* Remove first_bucket
* Add scalar_probe_counts to run script
* Removing first_bucket requires changing where clause conditional
* Get grouping attributes correct
* Give columns stable ordering
* Add correct query (that is too complex)
* Reduce number of combinations
* Simplify logic for null values
* Cast booleans instead of when clause
* Format
* Rename files to avoid confusion
* Add initial scalar_percentiles
* Add initial files for scalar_percentiles
* Add scalar_percentiles for fenix
* Add scalar_percentiles to run script
* Add problematic files to SKIP in format and dryrun
* Add installation ping
* Fix missing merge item
* Add missing newlines
* Reduce set of grouped attributes
* Factor out boolean_metric_types
* Add latest_versions template
* Add generated code for latest versions
* Update header
* Add latest versions to run script
* Update version filter using fenix_latest_versions_v1
* Add initial moustache files for scalar_aggregates
* Add Jinja2 dependency
* Update templates with more parameters
* Add format clauses and add query to be formatted
* Add formatted sql
* Add generated sql
python -m bigquery_etl.glam.scalar_aggregates_incremental > sql/telemetry_derived/clients_scalar_aggregates_v1/query.sql
* Generalize clients_scalar_aggregates
* Refactor into attributes and attributes_list
* Add generated sql for generalized query
* Add glam templates to format_sql SKIP
* Fix dryrun by using AS and USING properly in sql
* Add generated sql
* Add instructions on adding new Python library
* Fix linting issues
* Use r""" for backslash in docstring
* Add Jinja2 dependencies to constraints.txt
* Document process for adding new Python dependencies
* Add copy of clients_daily_scalar_aggregates for fenix
* Change table to Fenix metrics ping and modify columns
* Modify get_scalar_probes to fetch the relevant list of metrics
* Remove logic for keyed booleans
* Add valid generated SQL for scalars
* Generate valid keyed_scalars
* Factor out attributes into reusable string
* Use the bigquery-etl formatter
* Add `--no-parameterize` flag for debugging in console
* Add option for table_id
* Add comma conditionally
* Add script to run against all Glean pings in dataset
* Move scripts into appropriate locations
* Use stable tables as source for generate script
* Report glean metric types instead of scalar/keyed-scalar
* Fix linting
* Add script to generate sql for each table in org_mozilla_fenix
* Add generated sql
* Rename script for running etl in testing environment
* Update run script to use generated sql
* Fix missing --table-id parameter
* Update header comment in script
* Update generated sql
* Add ping_type to list of attributes
* Update generated schemas
* Fail on NULL in assert_false udf
* Update tests/README.md
Co-Authored-By: Anna Scholtz <anna@scholtzan.net>
Co-authored-by: Anna Scholtz <anna@scholtzan.net>
Closes #600
We improve the logic for parsing view definition files, and we now error out
when we encounter a view.sql file that doesn't match our parsing, rather than
silently skipping it.
* Improve error message for ndjson parsing
* Make JSON error messages nicer
* Cast BYTES fields to/from string
BYTES types are not JSON-serializable. To deal with that, we do
two things:
1. Assume the input values are hex strings, and decode them
to get the BYTES field values (on input)
2. Encode BYTES fields as hex strings (on output)
This means that any data files must use hex strings for BYTES fields.
Note: this only works on top-level fields.
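A minimal sketch of the round trip using BigQuery's hex helpers (note that review feedback further down switches the in/out representation to base64):

```sql
-- Hypothetical sketch: hex strings in data files are decoded to BYTES on the
-- way in, and BYTES values are encoded back to hex strings on the way out.
SELECT
  FROM_HEX('00ff') AS decoded_bytes,   -- input: hex string -> BYTES
  TO_HEX(b'\x00\xff') AS encoded_hex   -- output: BYTES -> hex string
```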
* Add better discrepancy reporting for test assertions
When JSON blobs differ, it can be hard to tell what is wrong.
These functions clearly show what's different and automatically
print the discrepancies when tests fail.
* Add search_clients_last_seen for 1-year of history
This new dataset, search_clients_last_seen, contains a year
of history for each client. It is split into 3 main parts:
1. Recent info that is contained in search_clients_daily,
similar to how we store that in clients_last_seen
2. A year of history, represented as a BYTES field,
indicating which days each client was active for different
types of activity
3. For the major search providers, arrays of totals of
different metrics, split into 12 parts to account for
each month's total
This dataset will power LTV.
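A minimal sketch of how the BYTES activity pattern in part 2 can be consumed, assuming a hypothetical days_searched_bytes column and table name:

```sql
-- Hypothetical sketch: each bit in the BYTES field marks one day of activity,
-- so BIT_COUNT yields the number of active days across the year of history.
SELECT
  client_id,
  BIT_COUNT(days_searched_bytes) AS days_searched_in_past_year
FROM
  search_clients_last_seen_v1
WHERE
  submission_date = @submission_date
```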
* Fix linting issues
* Enforce sampling on search_clients_daily
* Address review feedback
- Change all bits/bytes functions to include no. of bits
- Use fileobj for tests
- Rename some vars
- Use base64 for bytes in/out
* Generate sql
* Add missing comma
* Move search_clients_ls to search_derived
* Generate moar sql
* Use clients_daily_v8
* Fix query
* Move tests to search_derived
* Fix tests for search_clients_daily_v8
* Don't dryrun with search_clients_last_seen
* Update udf/new_monthly_engine_searches_struct.sql
Co-Authored-By: Jeff Klukas <jeff@klukas.net>
* sample_id is now an int
* Add documentation
* Update schemas
* Make tests use int sample-id
* Add tables to replace experiment enrollment aggregates spark streaming job
* Switch to Python to fill in the date in the enrollment aggregates live view, since parameters are not allowed in view definitions
* Direct output of arbitrary commands in the entrypoint script to the Airflow XCom location
This makes it explicit that we are no longer using imported Parquet data.
It also moves several of these tables to the shared-prod project where
we want them to live long-term.
* Add query for last month of schema errors
* Add generated sql for schema error counts
* Move schemas into correct location
* Add document_version and named groups
* Skip schema error counts in dryrun
* Support many billing projects and dates in copy_deduplicate
* Fix docs for --project_id
* Explain default --billing_projects behavior
* Fix return value bug
I got tired of running generate_sql, then checking git status while it was
running and seeing a jumble of deleted files. This PR changes the behavior to
build the files in a temp dir and then copy them into place only at the end.
* Views for monitoring structured ingestion errors
* Add UDF for extracting missing columns
* Add docs for json_extract_missing_cols
* Fix missing WITH clause
* Add generated sql
* Fix test function
* Update views
* Update tests
* Use shared-prod
* Update docstring for missing cols udf
* Move schemas to structured_ingestion dataset
* Don't dryrun queries on payload_bytes
* Change format of payload_bytes views
- Use a with_ratio to reduce duplication
- Change the view name to match the filename
* Fix missing test function
* Move to monitoring dataset
* Move to new test structure
* Remove spaces from js udfs
* Use persistent gunzip UDF
* UDF for decompressing gzip data
* Update script for publishing UDFs to upload UDF dependency files
* Address review feedback for gunzip UDF
* Set default GCS bucket to moz-fx-data-prod-bigquery-etl
* Add function to upload UDF dependencies to GCS
* Set data-eng-circleci-tests context in CircleCI config
* Add approval step in CircleCI config
I believe Airflow may need to issue the jobs from derived-datasets
for the time being, so we make sure to fully qualify all table
references with the project_id that's passed as a parameter.
We implemented a view-based solution for creating clients_last_seen
from clients_last_seen_raw in Athena and Presto, but this made the
table unavailable from Spark.
By adding these options, we can materialize the view logic at the
time of writing to Parquet, so that it will be available to all
Parquet consumers.
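A minimal sketch of what materializing the view logic at write time could look like; the UDF and bit-field names below are stand-ins for whatever the clients_last_seen view actually applies, and the fully qualified table path is an assumption:

```sql
-- Hypothetical sketch: apply the view's derived-column logic while writing, so
-- the days_since_* columns exist for Spark and other Parquet consumers.
SELECT
  *,
  udf.bits_to_days_since_seen(days_seen_bits) AS days_since_seen
FROM
  `moz-fx-data-derived-datasets.telemetry.clients_last_seen_raw_v1`  -- qualified with the project_id parameter
```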