Jeff Klukas
7f5079b26e
sqlfiles -> sql_files
2020-02-05 09:46:49 -05:00
Jeff Klukas
7905d7fa21
Parallelize scripts/publish_views
...
Per request of @whd
2020-02-05 09:46:49 -05:00
Anna Scholtz
041376b487
Add init.sql for VR browser clients_daily to dryrun skip list
2020-01-24 12:33:20 -08:00
Daniel Thorn
1f343af30d
format_sql on devtools_panel_usage_v1
2020-01-24 09:12:09 -05:00
Daniel Thorn
1efbe0344a
Add script for self serve deletion ( #635 )
2020-01-23 14:52:08 -08:00
Daniel Thorn
5fa7e4e61e
Correctly format scripting keywords ( #693 )
2020-01-21 20:05:47 -08:00
Anna Scholtz
efcba3286d
Improvements for CRC32 stored procedure
2020-01-21 14:58:24 -08:00
Daniel Thorn
58bb0183b8
Allow generate_incremental_table to backfill days in reverse ( #696 )
...
by specifying --start as a date after --end
2020-01-21 14:01:00 -08:00
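The reverse backfill described in commit 58bb0183b8 can be pictured with a small sketch. This is not the script's actual code: `iter_days` is a hypothetical helper, and it assumes both bounds are inclusive and that the only change needed is to step backwards when the start date is later than the end date.

```python
from datetime import date, timedelta
from typing import Iterator


def iter_days(start: date, end: date) -> Iterator[date]:
    """Yield each day from start to end inclusive, in reverse if start > end."""
    # Hypothetical helper: the step direction is inferred from the argument order.
    step = timedelta(days=1 if start <= end else -1)
    current = start
    while True:
        yield current
        if current == end:
            break
        current += step
```

Calling `iter_days(date(2020, 1, 3), date(2020, 1, 1))` would visit January 3, 2, and 1 in that order, so the most recent partitions are filled first.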
Jeff Klukas
c092f7479c
Syntax error in publish_views for fenix
2020-01-21 16:12:45 -05:00
Jeff Klukas
4f201da964
Filter out histograms in fenix metrics ping from Glean SDK<19
2020-01-21 14:16:51 -05:00
Anna Scholtz
b31fbe3497
Metadata publish improvements and update clients_daily_v6 metadata
2020-01-17 16:03:59 -08:00
Anna Scholtz
165fe50cc8
Script for updating metadata of table
2020-01-17 16:03:59 -08:00
Anna Scholtz
47f77b7c62
Copy metadata.yaml when generating SQL
2020-01-17 16:03:59 -08:00
Jeff Klukas
0ae8143af7
Bug 1609666 Use SAFE_CAST in udf.json_extract_int_map ( #681 )
2020-01-16 09:06:14 -08:00
Daniel Thorn
7c134d5617
Enforce format_sql on more files ( #659 )
2020-01-10 17:07:21 -08:00
Jeff Klukas
19a4353c97
Bug: make skipping authorized views more robust
...
This was not working in the production context where the target sql dir is
under /tmp as discussed in https://github.com/mozilla/bigquery-etl/pull/655#issuecomment-572789847
2020-01-10 13:43:30 -05:00
Daniel Thorn
e871c70e09
Fail on NULL in assert_false udf ( #657 )
...
* Fail on NULL in assert_false udf
* Update tests/README.md
Co-Authored-By: Anna Scholtz <anna@scholtzan.net>
2020-01-09 16:15:42 -08:00
Daniel Thorn
2f7de8683d
Enforce script/format_sql for all new sql files ( #656 )
2020-01-09 13:55:46 -08:00
Jeff Klukas
ac4a17c33f
Add list of authorized views to exempt from publishing
...
Related to CI failures addressed in https://github.com/mozilla/bigquery-etl/issues/653
2020-01-09 11:04:50 -05:00
Jeff Klukas
a5b5da6220
Inject null for negative session_length and event timestamps
...
As of https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/474
we will be allowing negative values for session lengths and event timestamps.
We provide some safety by modifying the user-facing view for main pings to null
out these negative values (since large negative values may cause significant
skew in aggregations).
We don't modify the user-facing view for event pings, but rather assume users
will be applying the deanonymize_events UDF and handle nulling negative values
there.
See bugs
https://bugzilla.mozilla.org/show_bug.cgi?id=1592012
and
https://bugzilla.mozilla.org/show_bug.cgi?id=1602521
2020-01-09 09:28:55 -05:00
Anthony Miyaguchi
8ed78e2a18
Fix #653 - CI failing on activity_stream/tile_id_types/views.sql ( #654 )
2020-01-08 15:32:12 -08:00
Jeff Klukas
78619180d0
Use labels to determine glean pings
2020-01-08 10:11:11 -05:00
Jeff Klukas
be92fd1a4e
Add parsed start and end times to views on top of Glean schemas
...
Closes https://github.com/mozilla/gcp-ingestion/issues/633
2020-01-08 10:11:11 -05:00
Daniel Thorn
8ca73c2b60
Rewrite script/format_sql in python ( #640 )
2020-01-06 16:17:41 -08:00
Frank Bertsch
de80cfd652
RFM View for LTV ( #611 )
...
* Add new UDFs for BYTE column/day_seen
Rename bitpos to align with the new convention.
* Add search_rfm dataset for LTV
* Move RFM calculations to UDF
* Address review feedback
* Fully escape UDFs
* Fix _actual_ missing UDF
* Don't dryrun; access denied
2019-12-19 18:01:44 -05:00
Jeff Klukas
d66f64d4ab
Bug: publish_views ignored some view.sql files
...
Closes #600
We improve the logic for parsing view definition files, and we also now
error out when we encounter a view.sql file that doesn't match our parsing
rather than silently skipping.
2019-12-18 13:10:51 -05:00
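The "error out rather than silently skip" behavior from commit d66f64d4ab might look roughly like the following. The regular expression and the `parse_view_target` name are assumptions made for illustration; the real parsing logic in publish_views is not shown in this log.

```python
import re

# Hypothetical pattern: assumes each view.sql contains a
# CREATE OR REPLACE VIEW statement that names the target view.
VIEW_RE = re.compile(r"CREATE\s+OR\s+REPLACE\s+VIEW\s+`?([\w.-]+)`?", re.IGNORECASE)


def parse_view_target(sql: str, path: str = "<unknown>") -> str:
    """Return the view name, or raise if the file does not match the pattern."""
    match = VIEW_RE.search(sql)
    if match is None:
        # Failing loudly keeps an unparseable file from being skipped unnoticed.
        raise ValueError(f"{path}: could not find a view definition")
    return match.group(1)
```

Raising an exception here means a CI run fails visibly on an unexpected file layout instead of publishing an incomplete set of views.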
Sunah Suh
ddfaf63f83
Add athena query migration script for posterity ( #592 )
...
* Add athena query migration script for posterity
* Add legacy athena migration script to pytest exclusions
2019-12-12 17:55:02 -05:00
Frank Bertsch
6c825425b3
Search clients last seen ( #451 )
...
* Improve error message for ndjson parsing
* Make JSON error messages nicer
* Cast BYTES fields to/from string
BYTES types are not JSON-serializable. To deal with that, we do
two things:
1. Assume the input tables are hex strings, and decode them
to get the BYTES fields values (on input)
2. Encode BYTES fields as hex strings (on output)
This means that any data files use hex strings for BYTES fields.
Note: This only works on top-level fields
* Add better discrepancy reporting for test assertions
When JSON blobs differ, it can be hard to tell what is wrong.
These functions make it easy to see what differs, and automatically
print the differences so they are available when tests fail.
* Add search_clients_last_seen for 1-year of history
This new dataset, search_clients_last_seen, contains a year
of history for each client. It is split into 3 main parts:
1. Recent info that is contained in search_clients_daily,
similar to how we store that in clients_last_seen
2. A year of history, represented as a BYTES field,
indicating which days they were active for different
types of activity
3. Among the major search providers, arrays of totals of
different metrics, split into 12 parts, to account for
each month's total
This dataset will power LTV.
* Fix linting issues
* Enforce sampling on search_clients_daily
* Address review feedback
- Change all bits/bytes functions to include no. of bits
- Use fileobj for tests
- Rename some vars
- Use base64 for bytes in/out
* Generate sql
* Add missing comma
* Move search_clients_ls to search_derived
* Generate moar sql
* Use clients_daily_v8
* Fix query
* Move tests to search_derived
* Fix tests for search_clients_daily_v8
* Don't dryrun with search_clients_last_seen
* Update udf/new_monthly_engine_searches_struct.sql
Co-Authored-By: Jeff Klukas <jeff@klukas.net>
* sample_id is now an int
* Add documentation
* Update schemas
* Make tests use int sample-id
2019-12-12 12:43:09 -05:00
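Commit 6c825425b3 notes that BYTES values are not JSON-serializable and must be encoded on output and decoded on input. The message first mentions hex strings and later switches to base64, so the sketch below uses base64. The function names and the `days_seen_bytes` field are hypothetical, and, like the commit, this handles top-level fields only.

```python
import base64
import json
from typing import Any, Dict, Iterable


def dump_row(row: Dict[str, Any]) -> str:
    """Serialize a row to JSON, encoding top-level bytes values as base64 text."""
    encoded = {
        key: base64.b64encode(value).decode("ascii") if isinstance(value, bytes) else value
        for key, value in row.items()
    }
    return json.dumps(encoded)


def load_row(line: str, bytes_fields: Iterable[str]) -> Dict[str, Any]:
    """Parse a JSON row, decoding the named top-level fields back to bytes."""
    row = json.loads(line)
    for key in bytes_fields:
        if row.get(key) is not None:
            row[key] = base64.b64decode(row[key])
    return row


# Hypothetical example row; the field name is made up for illustration.
original = {"client_id": "a", "days_seen_bytes": b"\x00\x01\xff"}
restored = load_row(dump_row(original), ["days_seen_bytes"])
```

The caller has to say which fields are BYTES when loading, because a base64 string is indistinguishable from an ordinary string once it is in JSON.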
Ben Wu
50932354ce
Add script for publishing csv's as static tables ( #582 )
2019-12-11 14:25:33 -05:00
Jeff Klukas
d1948fa445
Revert "Add authorized views for payload_bytes raw and error"
...
This reverts commit 1d2fa74f9e.
2019-12-10 14:36:14 -05:00
Jeff Klukas
b77a2b9ac1
Revert "Add all relevant views to the authorization list"
...
This reverts commit 4f064f2e6d.
2019-12-10 14:36:14 -05:00
Jeff Klukas
4f064f2e6d
Add all relevant views to the authorization list
2019-12-10 13:58:56 -05:00
Jeff Klukas
1d2fa74f9e
Add authorized views for payload_bytes raw and error
...
Supports https://github.com/mozilla/bigquery-etl/issues/360
2019-12-10 13:58:56 -05:00
Ben Wu
7d9782b1ba
Bug 1543434 - Create search datasets for mobile ( #559 )
2019-12-06 13:28:53 -05:00
Anthony Miyaguchi
b938356d48
Bug 1601139 - Add query to sample documents per doctype ( #570 )
...
* Bug 1601139 - Add query to sample documents per doctype
* Add docstring, fix formatting, and update column name
2019-12-04 13:46:00 -08:00
Daniel Thorn
e11d009aac
Fix case expected for XCOM_PUSH ( #575 )
...
and add spaces to avoid issues with empty variables
2019-12-04 12:35:26 -08:00
Sunah Suh
f45591a834
Fix #550 : create airflow xcom output file in dockerfile and make writ… ( #553 )
2019-12-04 14:32:08 -05:00
Daniel Thorn
c70d2e179d
Reimplement experiments_v1 as SQL ( #565 )
...
* Reimplement experiments_v1 as SQL
* Apply suggestions from code review
Co-Authored-By: Sunah Suh <github@sunahsuh.com>
* Update templates/telemetry_derived/experiments_v1/get_experiment_list.py
* fix generate_sql
2019-12-03 15:08:57 -08:00
Jeff Klukas
0094f4ba7d
Normalize metadata in generated views on historical ping tables
...
Merging this change will cause the changes to be used on the next
deploy of the schema generation pipeline
See https://github.com/mozilla-services/cloudops-infra/blob/master/projects/data-shared/Jenkinsfile.bigquery.prod#L105
2019-12-02 15:21:32 -05:00
Sunah Suh
25b702d082
Add tables to replace experiment enrollment aggregates spark streamin… ( #524 )
...
* Add tables to replace experiment enrollment aggregates spark streaming job
* Switch to python to fill in date in enrollment aggregates live view since parameters are not allowed in view defs
* Direct output of arbitrary commands in entrypoint script to airflow xcom location
2019-11-27 16:55:21 -05:00
Marina Samuel
4465965f14
Code cleanup.
2019-11-25 15:30:33 -05:00
Daniel Thorn
ce28624b74
Improve format-sql for views and timestamp functions ( #528 )
2019-11-25 14:15:32 -05:00
Daniel Thorn
9468f997ab
Make addons and addon_aggregates exactly replace spark versions ( #532 )
2019-11-25 13:14:52 -05:00
Jeff Klukas
52e0a1acab
Move nondesktop KPI queries to stable table DAG
...
This makes it explicit that we no longer are using imported Parquet data.
It also moves several of these tables to the shared-prod project where
we want them to live long-term.
2019-11-20 10:22:29 -05:00
Ben Wu
c0160496d4
Add views for search dataset ( #513 )
2019-11-18 19:26:15 -05:00
Daniel Thorn
8822b522aa
Promote sql clients_daily_v6 ( #501 )
2019-11-14 18:08:49 -05:00
Daniel Thorn
4b80ee2c23
Support restoring int columns in export_to_parquet ( #506 )
2019-11-14 10:41:03 -05:00
Ben Wu
73dc724086
Switch search to read from flattened main summary ( #500 )
2019-11-13 13:20:14 -08:00
Daniel Thorn
7b1c9d96ad
Support bigquery export to parquet via avro ( #492 )
2019-11-07 13:56:33 -05:00
Daniel Thorn
9176dc940e
Unnest clients_daily ( #481 )
2019-11-06 17:14:02 -05:00
Marina Samuel
384a017935
Update dryrun script.
2019-11-06 16:13:26 -05:00
Daniel Thorn
f239525363
Fix skip list for publishing udfs ( #489 )
2019-11-06 15:28:34 -05:00
Daniel Thorn
fe4ec05c93
Add --dataset_id and --project_id to script/run_multipart_query ( #488 )
2019-11-06 13:54:25 -05:00
Daniel Thorn
1d871edc0b
Add list of udfs to skip when publishing ( #487 )
2019-11-06 13:37:47 -05:00
Daniel Thorn
dfb54323bf
Fix detection of maps when table includes dataset ( #485 )
2019-11-06 12:37:56 -05:00
Daniel Thorn
eba9db159d
Add support for replacing columns in export_to_parquet ( #443 )
2019-11-05 14:17:06 -05:00
Daniel Thorn
890141c140
Reimplement main_summary_v4 as SQL ( #258 )
2019-11-05 13:32:25 -05:00
Anthony Miyaguchi
4c44359310
Fix #465 - Add regex strings to STRING_REGEX in format-sql ( #470 )
2019-10-31 17:04:26 -07:00
Anthony Miyaguchi
084b960602
Use `#!/usr/bin/env python3` consistently ( #461 )
2019-10-30 14:06:22 -07:00
Daniel Thorn
b65fbbbd93
Fix automatic formatting for nested types ( #457 )
...
Without this, `<` and `>` get formatted as operators instead of parens
2019-10-29 22:17:06 -07:00
Daniel Thorn
cad76a9b6f
Fix CI by skipping slow-to-validate queries ( #448 )
2019-10-25 09:15:20 -05:00
Sunah Suh
387536cbee
Fixes to json -> table ddl generator script ( #444 )
2019-10-24 14:56:19 -07:00
Anthony Miyaguchi
d60c0fd842
Add dataset for monitoring schema errors over time ( #442 )
...
* Add query for last month of schema errors
* Add generated sql for schema error counts
* Move schemas into correct location
* Add document_version and named groups
* Skip schema error counts in dryrun
2019-10-23 15:37:05 -07:00
Daniel Thorn
0f433f6a91
Support many billing projects and dates in copy_deduplicate ( #426 )
...
* Support many billing projects and dates in copy_deduplicate
* fix docs for --project_id
* explain default --billing_projects behavior
* Fix return value bug
2019-10-21 11:16:24 -07:00
Marina Samuel
a42d97af2d
Add new queries to dryrun.
2019-10-17 15:25:19 -04:00
Daniel Thorn
9bf053de74
Add --preceding-days option to copy_deduplicate ( #413 )
2019-10-14 08:54:40 -07:00
Jeff Klukas
096a209ced
Fix bugs in monitoring views
...
Also cleans up a bug in the script for publishing views to get udf_js/gunzip
working, and removes accidental print statements in generate_sql.
2019-10-10 11:48:28 -04:00
Jeff Klukas
68c4d79228
Replace sql dir all at once in generate_sql
...
I got tired of running generate_sql, then checking git status while it was
running and seeing a jumble of deleted files. This PR changes the behavior to
build the files in a temp dir and then copy into place only at the end.
2019-10-10 09:21:30 -04:00
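The build-then-swap approach described in commit 68c4d79228 can be sketched as follows. `rebuild_dir` and its `build` callback are hypothetical names, not functions from generate_sql; the only idea taken from the commit is that files are produced in a temporary directory and copied into place at the end.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def rebuild_dir(target: Path, build: Callable[[Path], None]) -> None:
    """Generate output in a temp dir, then replace target in one final step."""
    with tempfile.TemporaryDirectory() as tmp:
        staging = Path(tmp) / "out"
        staging.mkdir()
        build(staging)  # the slow part; target stays untouched meanwhile
        if target.exists():
            shutil.rmtree(target)
        shutil.copytree(staging, target)
```

With this shape, running `git status` while generation is in progress still shows the previous contents of the target directory, and the window in which files appear deleted shrinks to the final remove-and-copy step.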
Frank Bertsch
239fab252a
Pipeline sql ( #388 )
...
* Views for monitoring structured ingestion errors
* Add UDF for extracting missing columns
* Add docs for json_extract_missing_cols
* Fix missing WITH clause
* Add generated sql
* Fix test function
* Update views
* Update tests
* Use shared-prod
* Update docstring for missing cols udf
* Move schemas to structured_ingestion dataset
* Don't dryrun queries on payload_bytes
* Change format of payload_bytes views
- Use a with_ratio to reduce duplication
- Change the view name to match the filename
* Fix missing test function
* Move to monitoring dataset
* Move to new test structure
* Remove spaces from js udfs
* Use persistent gunzip UDF
2019-10-09 10:03:58 -04:00
Jeff Klukas
2e821fbdc1
Use the shared-prod URL for the dryrun script
2019-10-08 20:21:48 -04:00
Daniel Thorn
e872a76860
Add pytest plugins to lint python scripts ( #410 )
...
* Add pytest plugins to lint python scripts
* Fix lint errors
2019-10-08 14:00:11 -07:00
Jeff Klukas
fff2ff3275
fxa_users_services tables ( #396 )
2019-10-08 13:31:01 -04:00
Daniel Thorn
8ccce8702c
Use chunksize=1 for consistent ordering in script/generate_incremental_table ( #403 )
2019-10-07 10:39:47 -07:00
Ben Wu
484d5079a7
Add additional fields to search datasets ( #381 )
2019-09-30 10:26:28 -04:00
Daniel Thorn
6158817ea3
Improve list_tables speed for script/copy_deduplicate ( #382 )
2019-09-26 11:28:22 -07:00
Daniel Thorn
91ba7297d5
Fix incorrect parameter name in copy_deduplicate ( #383 )
2019-09-25 13:30:29 -07:00
Daniel Thorn
16a1491821
Add --slices option to copy_deduplicate ( #380 )
2019-09-25 11:27:36 -07:00
Daniel Thorn
6c05a96847
Don't create temp tables for dry_run
2019-09-24 13:52:36 -04:00
Jeff Klukas
29f44ebbf5
Bug: datetime logic in copy_deduplicate
...
Closes https://github.com/mozilla/bigquery-etl/issues/376
Addition to datetime.date considers only the days part of a timedelta,
so we have to convert to a timestamp first.
2019-09-24 13:52:36 -04:00
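The pitfall behind commit 29f44ebbf5 is easy to reproduce in isolation. The snippet below only illustrates the standard-library behavior the message describes; it is not taken from copy_deduplicate, and converting via `datetime.combine` is one possible reading of "convert to a timestamp first".

```python
from datetime import date, datetime, time, timedelta

day = date(2019, 9, 24)

# date + timedelta uses only timedelta.days, so sub-day parts are dropped.
unchanged = day + timedelta(hours=23)

# Converting to a datetime first preserves the hours.
shifted = datetime.combine(day, time.min) + timedelta(hours=23)
```

Here `unchanged` is still the same calendar date as `day`, while `shifted` carries the 23-hour offset, which matters when iterating over hourly slices.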
Daniel Thorn
54ae019ff3
Set temp table expiration in copy_deduplicate ( #374 )
2019-09-23 13:03:19 -07:00
Daniel Thorn
4f48ae21ef
Add --hourly option to copy_deduplicate ( #370 )
2019-09-23 10:54:21 -07:00
Jeff Klukas
f0f5ec99a6
Import events from FxA oauth server
...
I've already created the new table and backfilled existing dates.
Addresses #348
2019-09-19 14:01:58 -04:00
Daniel Thorn
469c03ec10
Add script to format sql ( #173 )
2019-09-18 17:48:53 -07:00
Jeff Klukas
0ea63c7775
dryrun script updates
2019-09-17 15:30:59 -04:00
Jeff Klukas
f4c5ea8e7c
Run black
2019-09-13 10:00:33 -04:00
Sunah Suh
f9c611a906
Fix UDF publisher script ( #330 )
2019-09-04 10:54:48 -05:00
Jeff Klukas
71ad6652f5
Update publish_views script for new directory structure
2019-08-28 20:58:43 -04:00
Jeff Klukas
c2269a69af
Update generate_view script for new directory structure
...
Closes #317
2019-08-28 20:58:43 -04:00
Sunah Suh
030ca5872a
Add script to recreate table creation DDL SQL from json description ( #309 )
2019-08-28 11:49:42 -05:00
Daniel Thorn
99fe0dfd9e
Move queries into destination-table directories ( #286 )
...
* Move queries into destination-table directories
* Apply suggestions from code review
Co-Authored-By: Jeff Klukas <jeff@klukas.net>
2019-08-26 12:52:49 -07:00
Anna Scholtz
9580029e20
UDF for unzipping gzipped bytes ( #272 )
...
* UDF for decompressing gzip data
* Update script for publishing UDFs to upload UDF dependency files
* Address review feedback for gunzip UDF
* Set default GCS bucket to moz-fx-data-prod-bigquery-etl
* Add function to upload UDF dependencies to GCS
* Set data-eng-circleci-tests context in CircleCI config
* Add approval step in CircleCI config
2019-08-26 10:53:06 -07:00
Jeff Klukas
5b005570db
Make copy_deduplicate query more efficient for large tables
...
Closes #307
2019-08-23 14:43:17 -04:00
Anna Scholtz
7a6f7aacf8
View generation fixes
2019-08-21 15:03:50 -07:00
Anna Scholtz
52061238f5
Add wrapper script for generating and publishing views
2019-08-21 15:03:50 -07:00
Anna Scholtz
1d87797bd6
Fix rebase conflicts
2019-08-21 15:03:50 -07:00
Anna Scholtz
7520de5092
Script for auto-generating views
2019-08-21 15:03:50 -07:00
Jeff Klukas
7e533b4b24
Apply suggestions from code review
...
Co-Authored-By: Daniel Thorn <dthorn@mozilla.com>
2019-08-20 16:47:33 -04:00
Jeff Klukas
a407f50c14
Explain the case of ignoring tables with trailing underscore
2019-08-20 16:47:33 -04:00
Jeff Klukas
5c06876590
Support altering target project in publish_views script
2019-08-20 16:47:33 -04:00
Jeff Klukas
12215f88fa
Generate latest-version views for derived tables
2019-08-20 16:47:33 -04:00
Daniel Thorn
91e1be5394
Use mode last in clients_daily_v7 ( #86 )
2019-08-14 14:49:38 -07:00
Jeff Klukas
55cdf93c5f
instantiate client after parsing args
2019-08-07 13:18:31 -04:00
Jeff Klukas
a929285ca0
Apply suggestions from code review
...
Co-Authored-By: Daniel Thorn <dthorn@mozilla.com>
2019-08-07 13:18:31 -04:00
Jeff Klukas
46232da366
Refactor dryrun script for concurrency and permissions
2019-08-07 13:18:31 -04:00
Jeff Klukas
fbc1b9f564
Fix up imports
2019-08-07 13:18:31 -04:00
Jeff Klukas
937d294d65
Add publish_views script
2019-08-07 13:18:31 -04:00
Jeff Klukas
f2ecc96ce8
Adapt view update script for generating definitions instead
2019-08-07 13:18:31 -04:00
Daniel Thorn
28ce0d1f11
Add script for updating latest-version views
2019-08-07 13:18:31 -04:00
Daniel Thorn
22520e31f6
Use prepend_udf_usage_definitions in generate_sql ( #287 )
2019-08-05 16:10:31 -07:00
Daniel Thorn
e1bf990b9a
Add support for testing queries with persistent UDFs ( #285 )
2019-08-05 14:14:19 -07:00
Daniel Thorn
a241017c15
Reuse bigquery client and set default project_id ( #282 )
2019-08-02 13:36:10 -07:00
Daniel Thorn
5308e79570
detect errors when publishing udfs ( #281 )
2019-08-02 13:16:43 -07:00
Jeff Klukas
15640b831f
Support options with underscores and fix incorrect variable
2019-08-01 14:00:57 -04:00
Jeff Klukas
0bc42132a2
Add --parallelism option
2019-08-01 10:15:33 -04:00
Jeff Klukas
4242d95777
Remove new entrypoint clause
2019-08-01 10:15:33 -04:00
Jeff Klukas
00cef9d7e9
Run black and refactor --only and --except args
2019-08-01 10:15:33 -04:00
Jeff Klukas
ccb65d6d18
Apply suggestions from code review
...
Co-Authored-By: Daniel Thorn <dthorn@mozilla.com>
2019-08-01 10:15:33 -04:00
Jeff Klukas
c9f65a7af8
Refactor to allow jobs to run in a different project
...
I believe Airflow may need to issue the jobs from derived-datasets
for the time being, so we make sure to fully qualify all table
references with the project_id that's passed as a parameter.
2019-08-01 10:15:33 -04:00
Jeff Klukas
0a3d4ea59d
Fix typo
...
Co-Authored-By: Anna Scholtz <anna@scholtzan.net>
2019-08-01 10:15:33 -04:00
Jeff Klukas
a275cfb9e5
Add copy_deduplicate script
...
Closes #220
A PR to schedule this script in Airflow will follow.
2019-08-01 10:15:33 -04:00
Allen Short
351b42e84a
Dry-run each query in CircleCI against prod datasets ( #261 )
...
* Dry-run each query in CircleCI against prod datasets
* Apply suggestions from code review
* Update script/dryrun
2019-07-30 10:26:25 -07:00
Daniel Thorn
f79d075448
Add dataset names to paths in sql/ ( #265 )
...
* Add dataset names to paths in sql/
* rename clients_last_seen_raw_v1 to clients_last_seen_v1
* rename telemetry_raw to telemetry_derived
* address review
2019-07-30 09:39:22 -07:00
Jeff Klukas
01cb6e1074
Refactor naming of UDFs
2019-07-24 09:01:13 -04:00
Jeff Klukas
680c26ac41
Efficiency tweak: avoid double-publishing UDFs
2019-07-24 09:01:13 -04:00
Anna Scholtz
ff356466f6
Bugfix: missing UDFs without dependencies
2019-07-22 13:44:28 -07:00
Anna Scholtz
4f897edd8a
Add project ID when creating UDFs
2019-07-22 13:44:28 -07:00
Anna Scholtz
c3d06f94d2
Script to publish persistent UDFs
2019-07-22 13:44:28 -07:00
Daniel Thorn
d6e35295ec
Fix help page for script/generate_incremental_table ( #244 )
2019-07-22 12:58:35 -07:00
Anna Scholtz
7207a4e52f
Move SQL templates to templates/ and add generated SQL
2019-06-25 08:07:26 -07:00
Anna Scholtz
fe7325dcb4
Run SQL generation script when creating docker image
2019-06-25 08:07:26 -07:00
Anna Scholtz
aa637154c5
Ensure that UDFs are added only once and in order when generating SQL files
2019-06-25 08:07:26 -07:00
Anna Scholtz
a6661c5896
Trigger SQL query generation in pytest and update CircleCI config
2019-06-25 08:07:26 -07:00
Anna Scholtz
b62970f3a9
Makefile for generating sql and add newline breaks to new files
2019-06-25 08:07:26 -07:00
Anna Scholtz
f2efcc0432
Adapt CircleCI script to generate SQL queries
2019-06-25 08:07:26 -07:00
Anna Scholtz
fd21ba88c2
Add Python script to generate SQL files with UDF declarations
2019-06-25 08:07:26 -07:00
Jeff Klukas
5eb134ca86
fixups found while running the deletions
2019-05-23 16:42:08 -04:00
Jeff Klukas
120153dabe
respond to review comments
2019-05-23 16:42:08 -04:00
Jeff Klukas
845fa792c3
Codify archiving of exact mau table
2019-05-23 16:42:08 -04:00
Jeff Klukas
76a8a23e54
Use generate_incremental_table for clients_last_seen backfill
2019-05-23 16:42:08 -04:00
Jeff Klukas
d9669e325a
Add comments on tables that do not exist in BQ
2019-05-23 16:42:08 -04:00
Jeff Klukas
d481d93861
delete from experiments and search_clients_daily datasets
2019-05-23 16:42:08 -04:00
Jeff Klukas
090fce87cb
Better handling for bq tables
2019-05-23 16:42:08 -04:00
Jeff Klukas
b421eeeb11
Correct time range for gs
2019-05-23 16:42:08 -04:00
Jeff Klukas
9970aebd4e
Add delete-from-bq.sh
2019-05-23 16:42:08 -04:00
Jeff Klukas
7c8cbbc0f4
Bug 1550814 Remove data collected during hotfix rollout
...
See https://bugzilla.mozilla.org/show_bug.cgi?id=1550814
2019-05-23 16:42:08 -04:00
Jeff Klukas
e782c5f6ff
Add --destination-table and selectExprs options to export_to_parquet
...
We implemented a view-based solution for creating clients_last_seen
from clients_last_seen_raw in Athena and Presto, but this made the
table unavailable from Spark.
By adding in these options, we can materialize view logic at the
time of writing to Parquet, so that it will be available to all
Parquet consumers.
2019-05-22 14:01:40 -04:00
Jeff Klukas
e0dcbdaf36
Usability improvements for generate_incremental_table
2019-05-16 09:41:14 -04:00
Daniel Thorn
1dad4e14b4
Rewrite generate_incremental_table in python ( #126 )
...
fixes #115
2019-05-14 09:22:31 -07:00
Daniel Thorn
c2e416cefd
Create ~/.bigqueryrc without GCLOUD_SERVICE_KEY ( #112 )
2019-05-01 13:38:31 -07:00
Daniel Thorn
606fec9c04
Set sane defaults for bq use in airflow ( #110 )
2019-05-01 08:24:57 -07:00
Daniel Thorn
e38d3d6f14
Fix --submission-date for export_to_parquet.py ( #105 )
2019-04-25 13:32:52 -07:00
Daniel Thorn
8abd41397d
Add script to automate generating incremental tables ( #103 )
2019-04-25 09:10:29 -07:00
Daniel Thorn
52d10541d3
Create and publish docker image ( #95 )
...
* Create and publish docker image
* Update config.yml
2019-04-19 09:19:36 -07:00
Daniel Thorn
099ff31aae
Add pyspark script for exporting to parquet ( #89 )
...
* Add pyspark script for exporting to parquet
* address review
2019-04-17 12:07:41 -07:00