bigquery-etl

Commit graph

Author	SHA1	Message	Date
Anna Scholtz	79feea7b8f	Write last_updated to file when data gets published	2020-04-16 17:08:55 -07:00
Anna Scholtz	e2e749f6bb	Add last_updated to GCS datasets metadata	2020-04-16 17:08:55 -07:00
Anna Scholtz	b25c271933	Simplify metadata label regex and disallow international characters	2020-04-16 11:38:59 -07:00
Anna Scholtz	760b45b02a	Metadata lowercase check	2020-04-16 11:38:59 -07:00
Anna Scholtz	1b9af73379	Refactor public_data tests	2020-04-16 11:38:59 -07:00
Anna Scholtz	b67d430ba1	Add tests for parsing metadata	2020-04-16 11:38:59 -07:00
Jeff Klukas	3cf06aeadb	Remove redundant project prefix	2020-04-15 15:03:07 -04:00
Jeff Klukas	46e5ee8fb0	Add script for reporting on broken views Addresses #903	2020-04-15 15:03:07 -04:00
Anna Scholtz	9b6aa5fb46	Refactory publish_json script	2020-04-15 08:49:24 -07:00
Anthony Miyaguchi	82a6a5f687	Fenix exports for GLAM (#870 ) * Add views for extracting to glam * wip: Add export script * Rename extract queries and don't run them * Add user counts * Add generated sql * Update extract queries to rely on current project * Fix optional day partitioning * Fix extraction to glam-fenix-dev project * Add globs for ignoring dryrun and format * Reorder columns in probe count extract * Filter on null channel and version * Do not print header * Refactor datacube groupings and fix scalar_percentiles * Rename extract tables * Convert user counts into a view to avoid needless materialization * Rename client probe counts to probe counts * Update publish_views so job does not fail	2020-04-14 11:45:59 -07:00
Anna Scholtz	bc5c116c80	generate_views version matches to list	2020-04-09 16:36:28 -07:00
Anna Scholtz	adde137cce	Fix generate_views version parsing	2020-04-09 16:36:28 -07:00
Anna Scholtz	8af044d844	Pretty print dataset metadata	2020-04-09 14:38:34 -07:00
Anna Scholtz	bffcf3fc45	Fix lint complaints	2020-04-09 14:38:34 -07:00
Anna Scholtz	f3a6d7c490	Add option to publish metadata to GCS to entrypoint script	2020-04-09 14:38:34 -07:00
Anna Scholtz	750738898d	Tests for publishing GCS metadata for public data	2020-04-09 14:38:34 -07:00
Anna Scholtz	a8d161e47f	Refactor and add comments to gcs metadata publish	2020-04-09 14:38:34 -07:00
Anna Scholtz	9a7af8b1b5	Write metadata for datasets and tables to GCS	2020-04-09 14:38:34 -07:00
Jeff Klukas	966a0a7b11	Add days_created_profile_bits	2020-04-08 11:11:46 -04:00
Jeff Klukas	1ab2eb4f33	Check baseline ping reason field	2020-04-08 11:11:46 -04:00
Jeff Klukas	e36abdd616	Generic Glean ETL fixups	2020-04-08 11:11:46 -04:00
Jeff Klukas	86350db0d4	Generated clients_daily and last_seen tables for all Glean apps Addresses https://github.com/mozilla/bigquery-etl/issues/848 We commit a single application's generated queries here to allow CI to verify them. I have run these scripts for the target app so that the tables exist and CI should pass.	2020-04-06 16:47:51 -04:00
Anthony Miyaguchi	6e3351272e	Cleanup GLAM etl queries for Fenix (#851 ) * Rename scalar_aggregates_incremental and remove telemetry * Remove ping-type and telemetry variables * Add initial files for final views * Add glam templates to format ignore * Add blacked scalar_percentiles * Generate view for scalar aggregates * Add generated view for daily histogram aggregate view * Add probe counts to generated views * Generalize writing out queries * Add a latest_versions to generate * ADd histogram_percentiles to generate * Add scalar bucket counts to generate * Add histogram bucket counts to generate * Add clients scalar aggregates to generate * Move probe counts to generate * Add scalar percentiles to generate * Add clients histogram aggregates to generate * Use generate within generate_fenix_sql script * Fix probe_counts view and bucket counts * Rename template for scalar bucket counts * Fix client probe counts view * Remove irrelevant telemetry where clause * Add docstrings and shorten lines * Add probe counts view to dryrun and publish_views ignores * Rename udf for merged user data * Use python3	2020-04-03 14:17:32 -07:00
Anna Scholtz	23b6e24f7e	Add incremental_export option	2020-04-03 09:02:30 -07:00
Anna Scholtz	ae2acd34af	Update JSON export limits	2020-04-02 12:30:38 -07:00
Anna Scholtz	9ab5babcdb	Refactor json export	2020-04-02 12:30:38 -07:00
Anna Scholtz	0b00e222bf	Check for max. output file for json export	2020-04-02 12:30:38 -07:00
Anna Scholtz	a298caa595	Updates JSON export tests	2020-04-02 12:30:38 -07:00
Anna Scholtz	a8da84995c	Limit number of files created when exporting to JSON	2020-04-02 12:30:38 -07:00
Anna Scholtz	9eaae84788	Write data to non-partitioned tmp table before exporting as JSON	2020-04-02 12:30:38 -07:00
Anna Scholtz	2102eb2a38	Update comments	2020-03-31 14:01:43 -07:00
Anna Scholtz	c087e929a7	Refactor publish_metadata script	2020-03-31 14:01:43 -07:00
Anna Scholtz	1bad91ab33	Refactor generate_views script	2020-03-31 14:01:43 -07:00
Anna Scholtz	979ca4e8df	Refactor publish_public_data_views script	2020-03-31 14:01:43 -07:00
Anna Scholtz	a70275307b	Forward query parameters when running queries in publish_json	2020-03-31 13:28:45 -07:00
Anna Scholtz	87ff585ccc	Add logging to public_json script	2020-03-31 13:28:45 -07:00
Anna Scholtz	69b2a22971	Update json tests	2020-03-31 13:28:45 -07:00
Anna Scholtz	5f4b0dd8b7	Staging of json files	2020-03-31 13:28:45 -07:00
Anna Scholtz	9cd6d69754	Test JSON streaming	2020-03-31 13:28:45 -07:00
Anna Scholtz	74cf1a4cbd	Use streaming for converting ndjson to json	2020-03-31 13:28:45 -07:00
Anna Scholtz	4825fb0664	Add tests for publishing json with mocking	2020-03-31 13:28:45 -07:00
Anna Scholtz	1f32c6010f	Prevent publish_json script tests from running by default	2020-03-31 13:28:45 -07:00
Anna Scholtz	a703989bed	Refactor publish_json and factor out into separate class	2020-03-31 13:28:45 -07:00
Anna Scholtz	05301925cb	Publish JSON data of query result	2020-03-31 13:28:45 -07:00
Anna Scholtz	c2a5a62b1e	[WIP] Separate script to publish data as json	2020-03-31 13:28:45 -07:00
Daniel Thorn	b5f37518dd	Fix null date partition condition in shredder (#858 )	2020-03-26 13:58:52 -07:00
Anthony Miyaguchi	33160eea96	Add histogram percentiles for Fenix data into GLAM (#829 ) * Add initial histogram_percentiles * Update metric_type with histogram_type suffix * Add generated SQL (backwards incompatible) * Add body for copy of histogram_percentiles_v1` * Update histogram_percentiles with Glean specific metrics * Add histogram_percentiles module * Uncomment histogram percentiles * Add generated SQL * Add template to ignore section of format_sql * Add histogram percentiles to ignore of dryrun * Move udf into persistent_udf directory * Rewrite udf_js.glean_percentile * Add generated SQL	2020-03-25 10:17:21 -07:00
Daniel Thorn	c3127baac4	Standardize common script arguments (#828 )	2020-03-25 09:40:36 -07:00
Anthony Miyaguchi	4f0080559a	Add histograms to fenix glam etl (#766 ) * Add initial template for histogram aggregates * Factor out common functions and get all distributions * Add viable query for histogram aggregates * Add more efficient aggregation * Update header and update comment * Add code to generate clients daily histograms * Add queries for generated sql * Return non-zero exit code when histograms not found * Delete empty queries to reduce data scanned * Add non-zero exit code for scalars if probes are not found * Sort histograms for stable output * Add view for histogram aggregates * Add initial sql for histogram aggregates * Format template scripts * Add mostly reformatted sql for aggregates * Update histogram aggregates before adding statements * Fix up details for daily aggregation * Add completed histograms template * Add code to generate clients histogram aggregates * Add init for clients histogram aggregates * Remove sample_id from set of attributes * Add sections to run generated sql * Add generated sql * Remove extra latest_version columns * Fix many small issues during first draft of sql * Fix clients histogram aggregates * Add initial modification to probe counts * Add histogram bucket counts * Add option to generate histogram probe counts * Update generated_fenix_sql for histograms * Add generated sql * Update run_fenix_sql * Fix bucket counts * Update source table for probe counts * Add missing ping_type to histograms * Add first,last,num buckets * Update probe counts so it succeeds * Add mozilla_schema_generator to dependencies * Add metadata from probe-info for custom distributions * Update probe counts with metadata for custom distributions * Add UDF for generating functional buckets * Add proper bucketing by including range_max of measures * Format histogram udfs * Add updated templates to skip * Add new queries to dryrun ignore * Add view to the publish ignore list * Fix python linting * Remove old comments from probe counts * Do not count metadata from custom distributions twice * Remove sum from histogram aggregates * Add generated SQL * Add sample_id to histograms earlier in pipeline * Add generated SQL * Add comments to functional bucketing for metrics	2020-03-18 13:53:28 -07:00
Daniel Thorn	90d266c708	Use streaming inserts for shredder state (#826 )	2020-03-18 13:21:12 -07:00
Daniel Thorn	78b2337465	Sort shredder jobs by partition_id then table_id (#804 )	2020-03-10 16:02:34 -07:00
Daniel Thorn	e7e3da7b86	Use job.created for shredder state (#796 )	2020-03-10 14:44:36 -07:00
Anna Scholtz	fed6e7b297	Check metadata label length	2020-03-10 10:52:53 -07:00
Anna Scholtz	75530d58b2	Update entrypoint comment and metadata parsing	2020-03-10 10:52:53 -07:00
Anna Scholtz	c2b56907fb	PyYaml dryrun	2020-03-10 10:52:53 -07:00
Anna Scholtz	ac6344045a	Convenience function to get metadata of associated SQL file	2020-03-10 10:52:53 -07:00
Anna Scholtz	26ad69af95	Refactor run_query and add doc comments to Metadata	2020-03-10 10:52:53 -07:00
Anna Scholtz	d4574bd0fd	Public datasets in dryrun	2020-03-10 10:52:53 -07:00
Anna Scholtz	ed3b2853ed	Refactor scripts to use Metadata class	2020-03-10 10:52:53 -07:00
Anna Scholtz	103f37f3d0	Refactor parsing metadata	2020-03-10 10:52:53 -07:00
Daniel Thorn	f6ebc9e1a8	Add shredder support for integer range partitioning (#788 )	2020-03-09 12:37:30 -07:00
Daniel Thorn	8c5c6bdee9	Support using --only and --except together in shredder (#789 )	2020-03-09 09:09:11 -07:00
Anthony Miyaguchi	4e773ba6eb	Simplify scalar aggregates for glam-fenix etl (#767 ) * Update daily aggregates to run all scalars in a single query * Update generate and run script for new scalar aggregates * Update generated sql (and view) * Fix linting * Update SKIP for format	2020-02-26 11:22:20 -08:00
Anthony Miyaguchi	d0b71bcefd	End-to-end Fenix scalar aggregates (#743 ) * Refactor render into a separate function * Add variables for source and destination tables * Add support for aggregating glean pings * Add render_init along with --init option * Add partition clause and add proper init file * Add attributes_type to the template * Update clients_scalar_aggregates_v1 with dataset.table * Add command for generating init for fenix scalars aggregates * Add queries for fenix_clients_scalar_aggregates_v1 * Update partititioning in init * Update glam scripts for scalar aggregates * Update version to only include valid versions * Add generated sql * Add --quiet flag * Add notes * Fix linting and CI errors * Ignore glam_etl in dryrun * Add initial template files that have been formatted * Update generated queries * Add metric counts for histogram and scalars * Update metric_counts_v1 for scalars only * Add formatted version of telemetry_derived/clients_scalar_bucket_counts_v1 * Add module for generating metric bucketing * Refactor generate_fenix_sql for skipping stages * Add templates to format SKIP * Fix trailing whitespace * Add option to generate fenix bucket/probe counts * Add initial bucket/probe counts sql for fenix * Sort attributes for stable query generation * Refactor bucketing logic * Add scalar_metric_types variable * Add argument parser and glean variables * Update scalar bucket counts for glean * Update run_fenix_sql with bucket counts * Fix invalid syntax * Do not aggregate booleans as a scalar * Add scalar_metric_types to metric_counts_v1 * Add argparser and change source tablename to scalar * Update fenix_clients_scalar_probe_counts_v1 * Remove first_bucket * Add scalar_probe_counts to run script * Removing first_bucket requires changing where clause conditional * Get grouping attributes correct * Give columns stable ordering * Add correct query (that is too complex) * Reduce number of combinations * Simplify logic for null values * Cast booleans instead of when clause * Format * Rename files to avoid confusion * Add initial scalar_percentiles * Add initial files for scalar_percentiles * Add scalar_percentiles for fenix * Add scalar_percentiles to run script * Add problematic files to SKIP in format and dryrun * Add installation ping * Fix missing merge item * Add missing newlines * Reduce set of grouped attributes * Factor out boolean_metric_types	2020-02-19 13:43:53 -08:00
Anthony Miyaguchi	0d892cba4e	Add scalar aggregates from clients daily scalar aggregates for Fenix (#735 ) * Refactor render into a separate function * Add variables for source and destination tables * Add support for aggregating glean pings * Add render_init along with --init option * Add partition clause and add proper init file * Add attributes_type to the template * Update clients_scalar_aggregates_v1 with dataset.table * Add command for generating init for fenix scalars aggregates * Add queries for fenix_clients_scalar_aggregates_v1 * Update partititioning in init * Update glam scripts for scalar aggregates * Update version to only include valid versions * Add generated sql * Add --quiet flag * Add notes * Fix linting and CI errors * Ignore glam_etl in dryrun * Add latest_versions template * Add generated code for latest versions * Update header * Add latest versions to run script * Update version filter using fenix_latest_versions_v1	2020-02-19 10:51:22 -08:00
Daniel Thorn	7864154807	Remove support for cluster handling in shredder (#733 )	2020-02-18 21:07:40 +01:00
Anthony Miyaguchi	cf511d8cc2	Add template for clients_scalar_aggregates (#727 ) * Add initial moustache files for scalar_aggregates * Add Jinja2 dependency * Update templates with more parameters * Add format clauses and add query to be formatted * Add formatted sql * Add generated sql python -m bigquery_etl.glam.scalar_aggregates_incremental > sql/telemetry_derived/clients_scalar_aggregates_v1/query.sql * Generalize clients_scalar_aggregates * Refactor into attributes and attributes_list * Add generated sql for generalized query * Add glam templates to format_sql SKIP * Fix dryrun by using AS and USING properly in sql * Add generated sql * Add instructions on adding new Python library * Fix linting issues * Use r""" for backslash in docstring * Add Jinja2 dependencies to constraints.txt * Document process for adding new Python dependencies	2020-02-12 09:43:37 -08:00
Anna Scholtz	3f1cb398fa	Undo formatting for old SQL files	2020-02-07 09:48:23 -08:00
Anna Scholtz	2c25b3a34e	Review feedback	2020-02-07 09:48:23 -08:00
Anna Scholtz	adb79cb2a5	Ignore newlines when parsing UDFs	2020-02-07 09:48:23 -08:00
Anna Scholtz	88f188a93a	Transform UDFs to temporary UDFs in tests	2020-02-07 09:48:23 -08:00
Anna Scholtz	ab9135f951	Publish persistent UDFs step in Circle CI	2020-02-07 09:48:23 -08:00
Anna Scholtz	97b5386b41	Change UDFs to persistent UDFs and remove sql generations script	2020-02-07 09:48:23 -08:00
Anthony Miyaguchi	f32f866129	Bug 1610983 - Add clients daily scalar aggregates for GLAM in Fenix (#724 ) * Add copy of clients_daily_scalar_aggregates for fenix * Change table to Fenix metrics ping and modify columns * Modify get_scalar_probes to fetch the relevant list of metrics * Remove logic for keyed booleans * Add valid generated SQL for scalars * Generate valid keyed_scalars * Factor out attributes into reusable string * Use the bigquery-etl formatter * Add `--no-parameterize` flag for debugging in console * Add option for table_id * Add comma conditionally * Add script to run against all Glean pings in dataset * Move scripts into appropriate locations * Use stable tables as source for generate script * Report glean metric types instead of scalar/keyed-scalar * Fix linting * Add script to generate sql for each table in org_mozilla_fenix * Add generated sql * Rename script for running etl in testing environment * Update run script to use generated sql * Fix missing --table-id parameter * Update header comment in script * Update generated sql * Add ping_type to list of attributes * Update generated schemas	2020-02-06 14:01:24 -08:00
Daniel Thorn	0e0567285d	Fix unholy WITH OFFSET format (#709 )	2020-01-28 15:34:08 +01:00
Daniel Thorn	1efbe0344a	Add script for self serve deletion (#635 )	2020-01-23 14:52:08 -08:00
Daniel Thorn	5fa7e4e61e	Correctly format scripting keywords (#693 )	2020-01-21 20:05:47 -08:00
Daniel Thorn	7c134d5617	Enforce format_sql on more files (#659 )	2020-01-10 17:07:21 -08:00
Daniel Thorn	e871c70e09	Fail on NULL in assert_false udf (#657 ) * Fail on NULL in assert_false udf * Update tests/README.md Co-Authored-By: Anna Scholtz <anna@scholtzan.net> Co-authored-by: Anna Scholtz <anna@scholtzan.net>	2020-01-09 16:15:42 -08:00
Daniel Thorn	2f7de8683d	Enforce script/format_sql for all new sql files (#656 )	2020-01-09 13:55:46 -08:00
Daniel Thorn	8ca73c2b60	Rewrite script/format_sql in python (#640 )	2020-01-06 16:17:41 -08:00
Frank Bertsch	719f607a0a	Update UDF names with prefixed numbers (#593 ) * Error on improper UDF names * Rename udfs with prefixed numbers	2019-12-12 15:09:57 -05:00
Frank Bertsch	6c825425b3	Search clients last seen (#451 ) * Improve error message for ndjson parsing * Make JSON error messages nicer * Cast BYTES fields to/from string BYTES types are not JSON-serializable. To deal with that, we do two things: 1. Assume the input tables are hex strings, and decode them to get the BYTES fields values (on input) 2. Encode BYTES fields as hex strings (on output) This means that any data files use hex strings for BYTES fields. Note: This only works on top-level fields * Add better discrepancy reporting for test assertions When JSON blobs differ, it can be hard to tell what is wrong. These functions easily show what's different, and automatically prints them to be available when tests fail. * Add search_clients_last_seen for 1-year of history This new dataset, search_clients_last_seen, contains a year of history for each client. It is split into 3 main parts: 1. Recent info that is contained in search_clients_daily, similar to how we store that in clients_last_seen 2. A year of history, represented as a BYTES field, indicating which days they were active for different types of activity 3. Among the major search providers, arrays of totals of different metrics, split into 12 parts, to account for each months total This dataset will power LTV. * Fix linting issues * Enforce sampling on search_clients_daily * Address review feedback - Change all bits/bytes functions to include no. of bits - Use fileobj for tests - Rename some vars - Use base64 for bytes in/out * Generate sql * Add missing comma * Move search_clients_ls to search_derived * Generate moar sql * Use clients_daily_v8 * Fix query * Move tests to search_derived * Fix tests for search_clients_daily_v8 * Don't dryrun with search_clients_last_seen * Update udf/new_monthly_engine_searches_struct.sql Co-Authored-By: Jeff Klukas <jeff@klukas.net> * sample_id is now an int * Add documentation * Update schemas * Make tests use int sample-id	2019-12-12 12:43:09 -05:00
Ben Wu	7727bbff31	Combine project and dataset names with table name in test runner (#520 )	2019-11-19 14:59:41 -05:00
Daniel Thorn	e872a76860	Add pytest plugins to lint python scripts (#410 ) * Add pytest plugins to lint python scripts * Fix lint errors	2019-10-08 14:00:11 -07:00
Daniel Thorn	ce69d308dd	Improve pytest failure output for udf and sql tests (#391 ) * Improve pytest failure output for udf and sql tests * address review	2019-10-01 12:48:22 -07:00
Jeff Klukas	4867be2af6	Remove redundant `list` call Co-Authored-By: Daniel Thorn <dthorn@mozilla.com>	2019-09-13 10:00:33 -04:00
Jeff Klukas	f4c5ea8e7c	Run black	2019-09-13 10:00:33 -04:00
Jeff Klukas	57a15e67c6	Make generate_sql more deterministic	2019-09-13 10:00:33 -04:00
Anna Scholtz	9580029e20	UDF for unzipping gzipped bytes (#272 ) * UDF for decompressing gzip data * Update script for publishing UDFs to upload UDF dependency files * Address review feedback for gunzip UDF * Set default GCS bucket to moz-fx-data-prod-bigquery-etl * Add function to upload UDF dependencies to GCS * Set data-eng-circleci-tests context in CircleCI config * Add approval step in CircleCI config	2019-08-26 10:53:06 -07:00
Daniel Thorn	22520e31f6	Use prepend_udf_usage_definitions in generate_sql (#287 )	2019-08-05 16:10:31 -07:00
Daniel Thorn	e1bf990b9a	Add support for testing queries with persistent UDFs (#285 )	2019-08-05 14:14:19 -07:00
Jeff Klukas	01cb6e1074	Refactor naming of UDFs	2019-07-24 09:01:13 -04:00
Frank Bertsch	ca2ab0e7b2	Fenix events (#210 ) * Ignore hidden files in udf testing * Add fenix event table generation * Use fenix prod table * Make fenix_events a view * Fix last nit * Remove date filter from view	2019-07-05 13:59:51 -04:00
Anna Scholtz	b06acd389d	Fix generated SQL	2019-06-25 08:07:26 -07:00
Anna Scholtz	aa637154c5	Ensure that UDFs are added only once and in order when generating SQL files	2019-06-25 08:07:26 -07:00
Anna Scholtz	eabba390af	Fix formatting issues	2019-06-25 08:07:26 -07:00
Anna Scholtz	b62970f3a9	Makefile for generating sql and add newline breaks to new files	2019-06-25 08:07:26 -07:00


148 commits