bigquery-etl

Commit graph

Author	SHA1	Message	Date
Anthony Miyaguchi	c6cabd4391	Add statements to generate glam queries for fenix (#2208 ) * Add statements to generate glam queries for fenix * Use newlines in single string for multiple products * Move glam generation into generate_sql script * Add documentation on ignoring target project	2021-07-22 15:31:25 -04:00
Anna Scholtz	b60d4f4be2	CircleCI build check for fork	2021-07-12 14:10:20 -07:00
Anna Scholtz	aafa54c346	Add separate CI step for SQL and routine tests	2021-07-12 14:10:20 -07:00
Daniel Thorn	3c8894fdf1	Make schema validation part of dryrun (#2069 )	2021-05-25 14:53:09 -04:00
Jeff Klukas	c6f0c3ce81	Allow generate_sql to twice without raising error (#2067 ) Fixes https://github.com/mozilla/bigquery-etl/issues/2066	2021-05-24 16:36:11 -04:00
Jeff Klukas	7486920237	Fix inconsistent invocation of bqetl in script (#2037 ) This is causing view deploys to fail with: > Please run ./bqetl bootstrap	2021-05-18 12:41:59 -07:00
Anna Scholtz	4443a6e463	Specify output_dir in generate-sql script Co-authored-by: Jeff Klukas <jklukas@mozilla.com>	2021-05-18 11:24:27 -07:00
Anna Scholtz	5eb0ada329	Review feedback	2021-05-18 11:24:27 -07:00
Anna Scholtz	7a3b4f499f	Remove old glean generation scripts	2021-05-18 11:24:27 -07:00
Anna Scholtz	cee749c4ba	Backfill with init option	2021-05-18 11:24:27 -07:00
Anna Scholtz	bc14ec8877	Generate Glean table when creating generated-sql branch	2021-05-18 11:24:27 -07:00
Anthony Miyaguchi	f58f0bfd3b	Revert "Add migration script for joining against first seen table (#1947 )" (#1950 ) This reverts commit `e4dfedd285`.	2021-04-12 15:47:16 -04:00
Anthony Miyaguchi	e4dfedd285	Add migration script for joining against first seen table (#1947 ) * Add migration script for joining against first seen table * Update logic for is_new_profile * Update templates to use DDL with partitioning/clustering * Fix output of migrate tables to backfill-8 * Add instructions for backfilling * Fix linting errors	2021-04-12 12:41:52 -07:00
Anthony Miyaguchi	871270f2c4	[DS-1424] Join baseline clients daily with first seen table (#1946 ) * Add first_seen_date and related test fixtures * Use is_new_profile instead of baseline_first_seen * Update view for baseline_clients_first_seen * Fix yamllint issues * Set is_new_profile when submission matches first seen * Include AS in table alias * Nit: capitalize AS * Update bigquery_etl/glean_usage/templates/baseline_clients_daily_v1.sql Co-authored-by: Jeff Klukas <jklukas@mozilla.com> * Update bigquery_etl/glean_usage/templates/baseline_clients_daily_v1.sql Co-authored-by: Jeff Klukas <jklukas@mozilla.com> * Update clustering specification Co-authored-by: Jeff Klukas <jklukas@mozilla.com>	2021-04-12 12:29:57 -07:00
whd	7c1b03934b	Default branch (#1939 ) * Rename default branch * Rename branch * Update circleci for default branch name	2021-04-06 21:15:21 +00:00
Anthony Miyaguchi	0aacbe5c22	Pull out main logic and generate example queries for glean usage (#1937 ) * Move argument parser into shared function * Move shared main entrypoint into common * Update example script to include other usage queries * Commit generated queries for example usage queries * Parallelize generation of example queries * Add docstring * Remove ios example queries for daily and last seen * Fix pydocstyle linting * Add update_example_glean_usage to CI	2021-04-06 11:38:30 -07:00
Anthony Miyaguchi	1503a7fa89	[DS-1424] Implementation of mobile clients first seen (#1934 ) * Add initial boilerplate for clients_first_seen * Remove submission_timestamp as a field * [wip] Join data against legacy fennec id if applicable * Remove user facing view * Revert "Remove user facing view" This reverts commit a728a7882170eadad5413c7a7046c0f38297bb87. * Add flag for fennec_id * Update logic to limit rows in partitions to submission_date * Add all sql in glean_usage to format ignores * Separate init and query * Add default encoders for testing sql * Add test for initialization of baseline clients first seen in fenix * Update query to update over previous history * Add test for aggregation * Add generated sql and tests for simple baseline clients first seen * Add dry-run exceptions for clients first seen tables * Add clients first seen to generated sql * Update bigquery_etl/glean_usage/templates/baseline_clients_first_seen.metadata.yaml Co-authored-by: Jeff Klukas <jklukas@mozilla.com> * Update bigquery_etl/glean_usage/templates/baseline_clients_first_seen.metadata.yaml Co-authored-by: Jeff Klukas <jklukas@mozilla.com> * Group by sample id instead of min * Add submission_date as baseline first seen date Co-authored-by: Jeff Klukas <jklukas@mozilla.com>	2021-04-05 11:36:39 -07:00
Anna Scholtz	9fc546fd87	Rewrite experiment search aggregates query	2021-03-09 10:11:27 -08:00
Daniel Thorn	024e993c44	Record table references in metadata.yaml (#1875 )	2021-03-09 12:29:05 -05:00
Jeff Klukas	dd6ddee6b9	Use dataset labels to speed up stable view generation (#1863 ) * Use dataset labels to speed up stable view generation Builds on new dry run affordance from https://github.com/mozilla/bigquery-etl/pull/1858 We also remove the `--no-dry-run` option now since only the single dry run is now needed, and stable view generation completes in less than 2 seconds.	2021-03-02 15:05:39 -05:00
Daniel Thorn	a190e18264	Automatically sort python imports (#1840 )	2021-02-24 17:11:52 -05:00
Daniel Thorn	5d07beaca7	Use zetasql to get dependencies for dag generation (#1817 )	2021-02-18 17:49:46 -05:00
Daniel Thorn	2ce8084dd9	Add option to generate stable views without dry run (#1814 )	2021-02-18 12:02:21 -05:00
Ben Wu	3a62ba7490	Allow setting project for glam clustered query temp tables (#1821 )	2021-02-17 12:22:22 -08:00
Linh Nguyen	28f15e16e5	Use generated SQL content as source for docs (#1811 )	2021-02-16 13:33:53 -08:00
Jeff Klukas	0637808f95	Use probeinfo rather than BQ calls for glean_usage sql generation (#1786 )	2021-02-16 13:26:11 -05:00
Frank Bertsch	f675ccf533	Add query generation capability for events_daily This is a straightforward way to share queries between datasets.	2021-02-10 17:03:02 -05:00
Jeff Klukas	3512fb6ff7	Publish generated views and queries to a generated-sql branch (#1775 ) * Add CI task to push content to generated-sql branch Fixes #1742 The [`generated-sql`](https://github.com/mozilla/bigquery-etl/tree/generated-sql) branch now exists and you can browse the contents. See, for example, [telemetry.main](https://github.com/mozilla/bigquery-etl/tree/generated-sql/sql/moz-fx-data-shared-prod/telemetry/main) Follow-ups for which I'll file issues: - This doesn't currently publish the generated Glean baseline ETL queries and views; we'll need to update that logic to use probe-scraper metadata rather than listing tables in BigQuery (due to creds) to integrate it. - Docs publishing should reference this generated content rather	2021-02-10 09:42:58 -05:00
Anna Scholtz	9eba25ac0f	Monitoring data export fixes	2021-01-29 11:02:03 -08:00
Anna Scholtz	56c846dd07	CI validate views (#1711 ) * Script for validating view definitions * Add SKIP list for view validation * Add view validation step to CI * Regex for validating referenced tables in view definitions	2021-01-25 11:03:31 -08:00
Jeff Klukas	b6ae2765c0	Add glean_usage ETL generation to generate_all_views (#1709 ) * Add glean_usage ETL generation to generate_all_views The new `generate_all_views` script is intended to replace `generate_views` as the entrypoint for Jenkins. Its usage is demonstrated in the `generate_and_publish_views` script. This supports the move to user queries in the `mozdata` project. * Add --user-facing-only Co-authored-by: whd <whd@users.noreply.github.com>	2021-01-22 20:51:01 +00:00
Anthony Miyaguchi	ef9d0efc78	Add metric and channel as clustering fields for GLAM (#1695 )	2021-01-20 13:54:38 -08:00
Anna Scholtz	89bb53824c	Remove date parameter	2021-01-20 12:36:17 -08:00
Anna Scholtz	a80f4a2d3c	Add remaining experiment enrollment Grafana queries	2021-01-19 13:10:19 -08:00
Anna Scholtz	e868fd4e97	Experiments export data aggregated and by branch	2021-01-19 13:10:19 -08:00
Anna Scholtz	35b3e84ce2	Remove datasets required from experiment export script	2021-01-19 10:12:13 -08:00
Anna Scholtz	843903d6bd	Experiment enrollment monitoring queries (#1656 ) * Experiment enrollment aggregates hourly * Experiment enrollments recents query * Add execution_delay support for tasks * Experiment enrollment aggregates base query * Schedule experiment enrollment cumulative population estimate and active population * Experiment enrollment monitoring queries as views * Script for exporting experiment monitoring data to GCS * Export experiment monitoring data script aggregating data of longer running experiments * Parallelize experiment monitoring data export * init.sql for experiment enrollment monitoring queries * Use Airflow ds_format macro for hourly destination table * Use Airflow macros for experiments monitoring hourly execution delay * experiment_enrollment_cumulative_population_estimate as query instead of view * Fix referenced tables in enrollment_aggregates_hourly metadata and add comment * Simplify cumulative population estimate query	2021-01-13 13:53:32 -08:00
Anthony Miyaguchi	7b28856491	Ensure the sql directory for glam-fenix exists (#1607 )	2020-12-09 10:37:20 -08:00
Anthony Miyaguchi	3632c52815	Specify project when generating glam_etl sql (#1604 )	2020-12-08 14:28:28 -08:00
Ben Wu	b50a95944d	Separate queries on clients_scalar_aggregates by app_version (#1594 )	2020-12-03 14:26:35 -05:00
Anthony Miyaguchi	4234c40040	Add minimal set of tests for GLAM Fenix queries (#1488 ) * Add script to determine query dependencies * Add schemas and folders for minimal test * Add schema for geckoview_versions * Add query params to each query * Update schema for new queries * Remove main from bootstrap file * Add dataset prefix to schemas * Add failing test for clients_histogram_aggregates It turns out that the dependency resolution I'm using for autogenerate the schemas is ignoring the views. I actually want to keep the views around. The tables also all need to be prefixed with the dataset name or they won't be inserted into the sql query correctly. * Add successful test for clients histogram aggregates * Add minimal tests for clients_scalar_aggregates * Remove skeleton files for views (no test support for views) * Add tests for latest versions * Add tests for scalar bucket counts that passes * Add scalar bucket counts * Add test for scalar percentiles * Add test for histogram bucket counts * Add passing test for probe counts * Add test for histogram percentiles * Add tests for extract counts * Update readme * Add data for scalar percentiles test * Fix linting errors * Fix mypy issues with tests module * Name it data instead of tests..data Ignore mypy on tests directory * Remove mypy section * Remove extra line in pytest * Try pytest invocation of mypy-scripts-are-modules * Run mypy outside of pytest * Use exec on pytest instead of mypy * Update tests/sql/glam-fenix-dev/glam_etl/bootstrap.py Co-authored-by: Ben Wu <benjaminwu124@gmail.com> * Update tests/sql/glam-fenix-dev/glam_etl/README.md Co-authored-by: Ben Wu <benjaminwu124@gmail.com> * Document bootstrap in documentation * Use artificial range for histogram_percentiles * Simplify parameters for scalar probe counts * Simplify tests for histogram probe counts * Add test for incremental histogram aggregates * Update scalar percentile counts to count distinct client ids * Update readme for creating a new test * Use unorded list for sublist * Use --ignore-glob for pytest to avoid data files Co-authored-by: Ben Wu <benjaminwu124@gmail.com>	2020-12-01 17:11:45 -08:00
Ben Wu	df0508841f	Move apple_app_store to marketing project dir (#1570 )	2020-11-20 13:17:33 -05:00
Ben Wu	2692ebf1d7	Create script to copy ga_sessions tables between projects (#1565 )	2020-11-20 11:58:58 -05:00
Anthony Miyaguchi	0c244613fb	Update glam fenix etl with updated scalar bucketing (#1493 ) * Add initial udf replacements * Update scalar bucketing scheme * Update schemas in script * Revert change to query * Remove comma before CROSS JOIN * Add functional query * Add option to skip steps * Add ordering for keys * Update bigquery_etl/glam/templates/scalar_bucket_counts_v1.sql Co-authored-by: Ben Wu <benjaminwu124@gmail.com> * Add instructions for copying tables and modify bucket location * Generate schemas when GENERATE_ONLY specified * Set build date to NULL instead of "*" Co-authored-by: Ben Wu <benjaminwu124@gmail.com>	2020-11-03 16:07:00 -08:00
Anthony Miyaguchi	c6e3f210b9	Fix #1470 - Wait on process ids from backgrounded tasks (#1475 )	2020-10-26 10:10:22 -07:00
Anthony Miyaguchi	2aa055b178	Add script to list tables in glam_etl datasets (#1478 )	2020-10-23 10:33:50 -07:00
Anthony Miyaguchi	b7695049c6	Fix #1457 - Generate and run Fenix ETL for GLAM in glam-fenix-dev (#1458 ) * Resolve generated sql to glam-fenix-dev and change output in sql/ dir * Add new script for testing glam-fenix queries * Add generated sql for version control * Use variables correctly in bash * Remove latest versions from UDF * Update test to generate minimum set of tables for nightly * Commit generated queries for testing * Cast only if not glob * Ignore dryrun and publish view for glam-fenix-dev * Fix linting error * Update comments * Use DST_PROJECT consistently in scripts * Update comments * Update script/glam/test/test_glean_org_mozilla_fenix_glam_nightly Co-authored-by: Ben Wu <benjaminwu124@gmail.com> * Update script/glam/generate_and_run_desktop_sql Co-authored-by: Ben Wu <benjaminwu124@gmail.com> Co-authored-by: Ben Wu <benjaminwu124@gmail.com>	2020-10-22 11:40:52 -07:00
Daniel Thorn	9e441fac96	Add script/bqetl for run cli without install (#1448 )	2020-10-16 15:44:15 -07:00
Daniel Thorn	824ef5f6d5	quote query arguments (#1433 )	2020-10-13 14:43:39 -07:00
Anna Scholtz	0d51459bd1	Move dependencies to udf_js_lib	2020-10-08 10:30:22 -07:00
Anna Scholtz	67c5265b6f	Rename udf module to routine	2020-10-08 10:30:22 -07:00
Anna Scholtz	8cdc12b70f	Add alternate project support for publishing UDFs	2020-10-08 10:30:22 -07:00
Anna Scholtz	a1fadf293f	Update path for publishing public UDFs	2020-10-05 13:55:07 -07:00
Anna Scholtz	d1c67dab53	Move projects into high-level sql/ folder	2020-10-05 12:59:58 -07:00
Anna Scholtz	06233819ab	Remove sql/ directory	2020-10-05 12:59:58 -07:00
Anthony Miyaguchi	0ed408e7dd	Fix #1329 - Use app_build_id as app_version in GLAM fenix nightly (#1354 ) * Make versions to keep configurable * Replace app_version with app_build_id in nightly * Add jsonschema as a requirement= * Filter based on build date instead of version for nightly * Add script for comparing the output of two branches * Add option for specifying the bucket in export * Cast build_id to integer * Remove latest versions from histogram aggregates * Format logical_app_id * Use @submission_date parameter in latest versions	2020-10-01 14:28:42 -07:00
Anna Scholtz	2e56471644	Move run_multipart_query logic to bigquery_etl	2020-09-24 08:55:35 -07:00
Anna Scholtz	a604268c7e	Move publish_static to bigquery_etl	2020-09-24 08:55:35 -07:00
Anna Scholtz	00a36c3553	Move json_to_table_ddl to bigquery_etl	2020-09-24 08:55:35 -07:00
Anna Scholtz	08be8da2a1	Move generate_incremental_table logic to bigquery_etl	2020-09-24 08:55:35 -07:00
Anna Scholtz	f6bf253144	Move copy_deduplicate logic to bigquery_etl	2020-09-24 08:55:35 -07:00
Anna Scholtz	6f31338ecd	Move view related scripts to view module	2020-09-24 08:55:35 -07:00
Anthony Miyaguchi	dd283c264f	Add glam cli for incremental backfill (#1313 ) * Add glam cli for listing processed app ids * Make backfill scripts more consistent * Add export to glam glean cli * Add pandas dependency * Add black format of glam-cli * Commit hashes based on bigquery-etl container * Fix various linting issues * Be stricter with is_logical matching * Fix more linting issues	2020-09-23 14:45:44 -07:00
Daniel Thorn	26c67c7ee8	Upgrade to pytest 6.0.1 (#1281 ) Also upgrade and fix pytest plugins	2020-09-02 11:30:14 -07:00
Anna Scholtz	437cf67aa2	Refactor parse_udf	2020-09-02 10:24:38 -07:00
Anna Scholtz	0080ff8867	Refactor publish_udfs script	2020-09-02 10:24:38 -07:00
Anna Scholtz	2b29d24f59	Migrate UDFs to new format	2020-09-02 10:24:38 -07:00
Anna Scholtz	debd57c662	Fix entrypoint run_query call	2020-08-27 21:08:14 -07:00
Anna Scholtz	ffaaa2ab26	Call bigquery_etl.run_query from script/run_query	2020-08-27 14:48:32 -07:00
Anna Scholtz	04cbf80eab	Add format command to CLI	2020-08-27 14:48:32 -07:00
Anthony Miyaguchi	d8f782dc62	Add scripts for backfilling and exporting all fenix aggregates (#1255 ) * Add scripts for backfilling and exporting all fenix aggregates * Update script/glam/export_glean_all_fenix Co-authored-by: Ben Wu <benjaminwu124@gmail.com> Co-authored-by: Ben Wu <benjaminwu124@gmail.com>	2020-08-26 11:08:32 -07:00
Anna Scholtz	cbf560a1fa	Add CLI tests for creating queries	2020-08-21 11:10:58 -07:00
Anthony Miyaguchi	96b85854d2	Update Glam ETL for Fenix (#1240 ) * Use UNION ALL instead of UNION * Move tests into separate directory and add test for all fenix products * Replace channel with *	2020-08-18 16:15:56 -07:00
Anthony Miyaguchi	222e04b081	Fix #1232 - Ignore glam_etl directory when publishing views (#1234 )	2020-08-18 11:33:50 -07:00
Anthony Miyaguchi	ca2204625d	Add views for logical Fenix app ids in GLAM ETL (#1221 ) * Add views for logical app ids * Add new generated sql * Update generate_glean_sql script to handle logical apps * Update logical app view for partitiontime * Make sure to generate view for all of the app ids * Update last versions to be logical app id agnostic * Add formatting for black * Fix linting error * Update bigquery_etl/glam/generate.py Co-authored-by: Ben Wu <benjaminwu124@gmail.com> * Add "all" option to STAGE * Add new metrics added since last PR Co-authored-by: Ben Wu <benjaminwu124@gmail.com>	2020-08-17 15:05:15 -07:00
Anthony Miyaguchi	36b7c184e6	Add script to backfill glam tables for a glean product (#1108 ) * Add backfill script for glean products * Specify product correctly and add target dataset * Add product to example * Use datetime.fromisoformat	2020-08-06 15:48:40 -07:00
Jeff Klukas	d5d64359f6	Bug 1657360 Exclude pings with "automation" tag from stable We will also need to update monitoring queries to account for this when counting unique document_ids in decoded and live tables.	2020-08-06 12:56:15 -04:00
asiOvOtus	2acb30c9b0	Rewrite duplicated map udfs to mozfun shims (#1211 ) * Rewrite duplicated map udfs to mozfun shims * Format get_key_with_null.sql	2020-08-04 13:26:13 -07:00
Ben Wu	019666b51b	Add queries for exported app store data (#1207 )	2020-07-29 18:02:16 -04:00
asiOvOtus	306c667b2d	Add unit tests and documentations for udfs (#1197 ) * Add unit tests and documentations for udfs * Auto format SQL files * fix and format Co-authored-by: Frank Bertsch <fbertsch@mozilla.com>	2020-07-28 11:54:44 -07:00
Ben Wu	ab50e40fc6	Generalize fenix glam generate and run code (#1183 )	2020-07-20 11:25:27 -04:00
Ben Wu	c42aa317c4	Replace jq in generate_glean_sql (#1174 )	2020-07-15 18:26:51 -04:00
Anna Scholtz	cfc80e3da4	Fix mozfun comments	2020-07-15 11:24:17 -07:00
Anna Scholtz	2f7a07a578	Migrate some UDFs to mozfun	2020-07-15 11:24:17 -07:00
Anna Scholtz	0852f90125	Fix UDF signatures	2020-07-15 11:24:17 -07:00
Anna Scholtz	6d90a47bc5	Migrate some UDFs to mozfun	2020-07-15 11:24:17 -07:00
Anna Scholtz	4ddb5d4b58	Fix metadata migration	2020-07-15 11:24:17 -07:00
Anna Scholtz	312e0ed21a	Refactor migrate_to_mozfun script	2020-07-15 11:24:17 -07:00
Anna Scholtz	e24c7bdf41	Add script for migrating UDFs to mozfun	2020-07-15 11:24:17 -07:00
Anna Scholtz	ee4a3ee0ce	More task re-scheduling	2020-07-10 13:30:24 -07:00
Ben Wu	7653acdb6f	Split daily_histogram_aggregates query by process type (#1124 )	2020-07-07 13:12:06 -04:00
Anna Scholtz	dfd52b5647	Dry run example SQL files	2020-07-06 13:55:19 -07:00
Anna Scholtz	94768fc14e	Generate SQL of doc examples for dry run	2020-07-06 13:55:19 -07:00
Anna Scholtz	88ecf499cd	Generate docs	2020-07-02 13:37:38 -07:00
Jeff Klukas	790fec1c52	mozfun.histogram.extract cleanup Follow-up to #1000 now that the function is published and I've had some chance to use it.	2020-07-02 12:31:31 -04:00
Jeff Klukas	ff2e30da32	Revert "Reference shared-prod views when republishing to other projects (#1105 )" This reverts commit `4654bbda31`.	2020-07-01 13:38:21 -04:00
Jeff Klukas	baeff74751	Bug 1649754 Remove reference to derived-datasets dry run url We no longer have any destination tables in derived-datasets.	2020-07-01 09:52:12 -04:00
Jeff Klukas	4654bbda31	Reference shared-prod views when republishing to other projects (#1105 ) Fixes https://github.com/mozilla/bigquery-etl/issues/1075	2020-06-30 12:46:23 -04:00
Jeff Klukas	9422909cfd	Retry once on "invalid snapshot time" when publishing views (#1095 ) Fixes #1001	2020-06-25 13:44:54 -04:00
Daniel Thorn	19a00ebce3	Add shredder script to forward deletion requests to amplitude (#1082 )	2020-06-24 11:20:17 -07:00

438 commits