Commit graph

454 commits

Author SHA1 Message Date
Jeff Klukas 7f5079b26e sqlfiles -> sql_files 2020-02-05 09:46:49 -05:00
Jeff Klukas 7905d7fa21 Parallelize scripts/publish_views
Per request of @whd
2020-02-05 09:46:49 -05:00
Anna Scholtz 041376b487 Add init.sql for VR browser clients_daily to dryrun skip list 2020-01-24 12:33:20 -08:00
Daniel Thorn 1f343af30d format_sql on devtools_panel_usage_v1 2020-01-24 09:12:09 -05:00
Daniel Thorn 1efbe0344a
Add script for self serve deletion (#635) 2020-01-23 14:52:08 -08:00
Daniel Thorn 5fa7e4e61e
Correctly format scripting keywords (#693) 2020-01-21 20:05:47 -08:00
Anna Scholtz efcba3286d Improvements for CRC32 stored procedure 2020-01-21 14:58:24 -08:00
Daniel Thorn 58bb0183b8
Allow generate_incremental_table to backfill days in reverse (#696)
by specifying --start as a date after --end
2020-01-21 14:01:00 -08:00
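A minimal sketch of the reverse backfill described in commit 58bb0183b8. It is not the code from generate_incremental_table: the helper name `backfill_dates` is hypothetical, and the sketch assumes the script visits one day at a time from --start to --end inclusive, stepping backwards when --start is later than --end.

```python
from datetime import date, timedelta
from typing import List


def backfill_dates(start: date, end: date) -> List[date]:
    """Return every day from start to end inclusive.

    Hypothetical helper: when start is later than end, the days are
    returned newest first, which is the "reverse" backfill order.
    """
    step = 1 if start <= end else -1
    span = abs((end - start).days)
    return [start + timedelta(days=step * i) for i in range(span + 1)]


# Passing a start date after the end date yields a descending sequence.
reverse_days = backfill_dates(date(2020, 1, 3), date(2020, 1, 1))
```

Walking backwards is useful when the most recent days matter most and should be filled before older history.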
Jeff Klukas c092f7479c Syntax error in publish_views for fenix 2020-01-21 16:12:45 -05:00
Jeff Klukas 4f201da964 Filter out histograms in fenix metrics ping from Glean SDK<19 2020-01-21 14:16:51 -05:00
Anna Scholtz b31fbe3497 Metadata publish improvements and update clients_daily_v6 metadata 2020-01-17 16:03:59 -08:00
Anna Scholtz 165fe50cc8 Script for updating metadata of table 2020-01-17 16:03:59 -08:00
Anna Scholtz 47f77b7c62 Copy metadata.yaml when generating SQL 2020-01-17 16:03:59 -08:00
Jeff Klukas 0ae8143af7 Bug 1609666 Use SAFE_CAST in udf.json_extract_int_map (#681) 2020-01-16 09:06:14 -08:00
Daniel Thorn 7c134d5617
Enforce format_sql on more files (#659) 2020-01-10 17:07:21 -08:00
Jeff Klukas 19a4353c97 Bug: make skipping authorized views more robust
This was not working in the production context where the target sql dir is
under /tmp as discussed in https://github.com/mozilla/bigquery-etl/pull/655#issuecomment-572789847
2020-01-10 13:43:30 -05:00
Daniel Thorn e871c70e09
Fail on NULL in assert_false udf (#657)
* Fail on NULL in assert_false udf

* Update tests/README.md

Co-Authored-By: Anna Scholtz <anna@scholtzan.net>

Co-authored-by: Anna Scholtz <anna@scholtzan.net>
2020-01-09 16:15:42 -08:00
Daniel Thorn 2f7de8683d
Enforce script/format_sql for all new sql files (#656) 2020-01-09 13:55:46 -08:00
Jeff Klukas ac4a17c33f Add list of authorized views to exempt from publishing
Related to CI failures addressed in https://github.com/mozilla/bigquery-etl/issues/653
2020-01-09 11:04:50 -05:00
Jeff Klukas a5b5da6220 Inject null for negative session_length and event timestamps
As of https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/474
we will be allowing negative values for session lengths and event timestamps.
We provide some safety by modifying the user-facing view for main pings to null
out these negative values (since large negative values may cause significant
skew in aggregations).

We don't modify the user-facing view for event pings, but rather assume users
will be applying the deanonymize_events UDF and handle nulling negative values
there.

See bugs
https://bugzilla.mozilla.org/show_bug.cgi?id=1592012
and
https://bugzilla.mozilla.org/show_bug.cgi?id=1602521
2020-01-09 09:28:55 -05:00
Anthony Miyaguchi 8ed78e2a18
Fix #653 - CI failing on activity_stream/tile_id_types/views.sql (#654) 2020-01-08 15:32:12 -08:00
Jeff Klukas 78619180d0 Use labels to determine glean pings 2020-01-08 10:11:11 -05:00
Jeff Klukas be92fd1a4e Add parsed start and end times to views on top of Glean schemas
Closes https://github.com/mozilla/gcp-ingestion/issues/633
2020-01-08 10:11:11 -05:00
Daniel Thorn 8ca73c2b60
Rewrite script/format_sql in python (#640) 2020-01-06 16:17:41 -08:00
Frank Bertsch de80cfd652
RFM View for LTV (#611)
* Add new UDFs for BYTE column/day_seen

Rename bitpos to align with the new convention.

* Add search_rfm dataset for LTV

* Move RFM calculations to UDF

* Address review feedback

* Fully escape UDFs

* Fix _actual_ missing UDF

* Don't dryrun; access denied
2019-12-19 18:01:44 -05:00
Jeff Klukas d66f64d4ab Bug: publish_views ignored some view.sql files
Closes #600

We improve the logic for parsing view definition files, and we also now
error out when we encounter a view.sql file that doesn't match our parsing
rather than silently skipping.
2019-12-18 13:10:51 -05:00
Sunah Suh ddfaf63f83
Add athena query migration script for posterity (#592)
* Add athena query migration script for posterity

* Add legacy athena migration script to pytest exclusions
2019-12-12 17:55:02 -05:00
Frank Bertsch 6c825425b3
Search clients last seen (#451)
* Improve error message for ndjson parsing

* Make JSON error messages nicer

* Cast BYTES fields to/from string

BYTES types are not JSON-serializable. To deal with that, we do
two things:
1. Assume the input tables are hex strings, and decode them
   to get the BYTES fields values (on input)
2. Encode BYTES fields as hex strings (on output)

This means that any data files use hex strings for BYTES fields.

Note: This only works on top-level fields

* Add better discrepancy reporting for test assertions

When JSON blobs differ, it can be hard to tell what is wrong.
These functions easily show what's different, and automatically
print them to be available when tests fail.

* Add search_clients_last_seen for 1-year of history

This new dataset, search_clients_last_seen, contains a year
of history for each client. It is split into 3 main parts:

1. Recent info that is contained in search_clients_daily,
   similar to how we store that in clients_last_seen
2. A year of history, represented as a BYTES field,
   indicating which days they were active for different
   types of activity
3. Among the major search providers, arrays of totals of
   different metrics, split into 12 parts, to account for
   each month's total

This dataset will power LTV.

* Fix linting issues

* Enforce sampling on search_clients_daily

* Address review feedback

- Change all bits/bytes functions to include no. of bits
- Use fileobj for tests
- Rename some vars
- Use base64 for bytes in/out

* Generate sql

* Add missing comma

* Move search_clients_ls to search_derived

* Generate moar sql

* Use clients_daily_v8

* Fix query

* Move tests to search_derived

* Fix tests for search_clients_daily_v8

* Don't dryrun with search_clients_last_seen

* Update udf/new_monthly_engine_searches_struct.sql

Co-Authored-By: Jeff Klukas <jeff@klukas.net>

* sample_id is now an int

* Add documentation

* Update schemas

* Make tests use int sample-id
2019-12-12 12:43:09 -05:00
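Commit 6c825425b3 notes that BYTES values are not JSON-serializable and, after review, switches to base64 for bytes in and out of test data files. The sketch below illustrates that round trip under stated assumptions: the function names `encode_row` and `decode_row` and the field name `days_seen_bytes` are hypothetical and not taken from the repository, and, as in the commit message, only top-level fields are handled.

```python
import base64
import json


def encode_row(row: dict) -> str:
    """Serialize a row to JSON, writing top-level bytes values as base64 text."""
    encoded = {
        key: base64.b64encode(value).decode("ascii")
        if isinstance(value, bytes)
        else value
        for key, value in row.items()
    }
    return json.dumps(encoded)


def decode_row(line: str, bytes_fields: set) -> dict:
    """Parse a JSON row and decode the named top-level fields back to bytes."""
    row = json.loads(line)
    for key in bytes_fields:
        if row.get(key) is not None:
            row[key] = base64.b64decode(row[key])
    return row


# Hypothetical row: one string field and one BYTES field.
original = {"client_id": "a", "days_seen_bytes": b"\x00\x01"}
restored = decode_row(encode_row(original), {"days_seen_bytes"})
```

The decoder has to be told which fields are BYTES, because once encoded they are indistinguishable from ordinary strings in the JSON file.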
Ben Wu 50932354ce
Add script for publishing csv's as static tables (#582) 2019-12-11 14:25:33 -05:00
Jeff Klukas d1948fa445 Revert "Add authorized views for payload_bytes raw and error"
This reverts commit 1d2fa74f9e.
2019-12-10 14:36:14 -05:00
Jeff Klukas b77a2b9ac1 Revert "Add all relevant views to the authorization list"
This reverts commit 4f064f2e6d.
2019-12-10 14:36:14 -05:00
Jeff Klukas 4f064f2e6d Add all relevant views to the authorization list 2019-12-10 13:58:56 -05:00
Jeff Klukas 1d2fa74f9e Add authorized views for payload_bytes raw and error
Supports https://github.com/mozilla/bigquery-etl/issues/360
2019-12-10 13:58:56 -05:00
Ben Wu 7d9782b1ba
Bug 1543434 - Create search datasets for mobile (#559) 2019-12-06 13:28:53 -05:00
Anthony Miyaguchi b938356d48
Bug 1601139 - Add query to sample documents per doctype (#570)
* Bug 1601139 - Add query to sample documents per doctype

* Add docstring, fix formatting, and update column name
2019-12-04 13:46:00 -08:00
Daniel Thorn e11d009aac
Fix case expected for XCOM_PUSH (#575)
and add spaces to avoid issues with empty variables
2019-12-04 12:35:26 -08:00
Sunah Suh f45591a834
Fix #550: create airflow xcom output file in dockerfile and make writ… (#553) 2019-12-04 14:32:08 -05:00
Daniel Thorn c70d2e179d
Reimplement experiments_v1 as SQL (#565)
* Reimplement experiments_v1 as SQL

* Apply suggestions from code review

Co-Authored-By: Sunah Suh <github@sunahsuh.com>

* Update templates/telemetry_derived/experiments_v1/get_experiment_list.py

* fix generate_sql
2019-12-03 15:08:57 -08:00
Jeff Klukas 0094f4ba7d Normalize metadata in generated views on historical ping tables
Merging this change will cause the changes to be used on the next
deploy of the schema generation pipeline

See https://github.com/mozilla-services/cloudops-infra/blob/master/projects/data-shared/Jenkinsfile.bigquery.prod#L105
2019-12-02 15:21:32 -05:00
Sunah Suh 25b702d082
Add tables to replace experiment enrollment aggregates spark streamin… (#524)
* Add tables to replace experiment enrollment aggregates spark streaming job

* Switch to python to fill in date in enrollment aggregates live view since parameters are not allowed in view defs

* Direct output of arbitrary commands in entrypoint script to airflow xcom location
2019-11-27 16:55:21 -05:00
Marina Samuel 4465965f14 Code cleanup. 2019-11-25 15:30:33 -05:00
Daniel Thorn ce28624b74
Improve format-sql for views and timestamp functions (#528) 2019-11-25 14:15:32 -05:00
Daniel Thorn 9468f997ab
Make addons and addon_aggregates exactly replace spark versions (#532) 2019-11-25 13:14:52 -05:00
Jeff Klukas 52e0a1acab Move nondesktop KPI queries to stable table DAG
This makes it explicit that we no longer are using imported Parquet data.
It also moves several of these tables to the shared-prod project where
we want them to live long-term.
2019-11-20 10:22:29 -05:00
Ben Wu c0160496d4
Add views for search dataset (#513) 2019-11-18 19:26:15 -05:00
Daniel Thorn 8822b522aa
Promote sql clients_daily_v6 (#501) 2019-11-14 18:08:49 -05:00
Daniel Thorn 4b80ee2c23
Support restoring int columns in export_to_parquet (#506) 2019-11-14 10:41:03 -05:00
Ben Wu 73dc724086
Switch search to read from flattened main summary (#500) 2019-11-13 13:20:14 -08:00
Daniel Thorn 7b1c9d96ad
Support bigquery export to parquet via avro (#492) 2019-11-07 13:56:33 -05:00
Daniel Thorn 9176dc940e
Unnest clients_daily (#481) 2019-11-06 17:14:02 -05:00
Marina Samuel 384a017935 Update dryrun script. 2019-11-06 16:13:26 -05:00
Daniel Thorn f239525363
Fix skip list for publishing udfs (#489) 2019-11-06 15:28:34 -05:00
Daniel Thorn fe4ec05c93
Add --dataset_id and --project_id to script/run_multipart_query (#488) 2019-11-06 13:54:25 -05:00
Daniel Thorn 1d871edc0b
Add list of udfs to skip when publishing (#487) 2019-11-06 13:37:47 -05:00
Daniel Thorn dfb54323bf
Fix detection of maps when table includes dataset (#485) 2019-11-06 12:37:56 -05:00
Daniel Thorn eba9db159d
Add support for replacing columns in export_to_parquet (#443) 2019-11-05 14:17:06 -05:00
Daniel Thorn 890141c140
Reimplement main_summary_v4 as SQL (#258) 2019-11-05 13:32:25 -05:00
Anthony Miyaguchi 4c44359310 Fix #465 - Add regex strings to STRING_REGEX in format-sql (#470) 2019-10-31 17:04:26 -07:00
Anthony Miyaguchi 084b960602
Use `#!/usr/bin/env python3` consistently (#461) 2019-10-30 14:06:22 -07:00
Daniel Thorn b65fbbbd93
Fix automatic formatting for nested types (#457)
without this `<` and `>` are getting formatted as operators instead of parens
2019-10-29 22:17:06 -07:00
Daniel Thorn cad76a9b6f Fix CI by skipping slow-to-validate queries (#448) 2019-10-25 09:15:20 -05:00
Sunah Suh 387536cbee Fixes to json -> table ddl generator script (#444) 2019-10-24 14:56:19 -07:00
Anthony Miyaguchi d60c0fd842
Add dataset for monitoring schema errors over time (#442)
* Add query for last month of schema errors

* Add generated sql for schema error counts

* Move schemas into correct location

* Add document_version and named groups

* Skip schema error counts in dryrun
2019-10-23 15:37:05 -07:00
Daniel Thorn 0f433f6a91
Support many billing projects and dates in copy_deduplicate (#426)
* Support many billing projects and dates in copy_deduplicate

* fix docs for --project_id

* explain default --billing_projects behavior

* Fix return value bug
2019-10-21 11:16:24 -07:00
Marina Samuel a42d97af2d Add new queries to dryrun. 2019-10-17 15:25:19 -04:00
Daniel Thorn 9bf053de74
Add --preceding-days option to copy_deduplicate (#413) 2019-10-14 08:54:40 -07:00
Jeff Klukas 096a209ced Fix bugs in monitoring views
Also cleans up a bug in the script for publishing views to get udf_js/gunzip
working, and removes accidental print statements in generate_sql.
2019-10-10 11:48:28 -04:00
Jeff Klukas 68c4d79228 Replace sql dir all at once in generate_sql
I got tired of running generate_sql, then checking git status while it was
running and seeing a jumble of deleted files. This PR changes the behavior to
build the files in a temp dir and then copy into place only at the end.
2019-10-10 09:21:30 -04:00
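The approach described in commit 68c4d79228, building into a temp dir and copying into place only at the end, can be sketched as follows. This is an illustration rather than the code from generate_sql: the function `regenerate` and its `render` callback are hypothetical names.

```python
import shutil
import tempfile
from pathlib import Path


def regenerate(target: Path, render) -> None:
    """Build output in a temp dir, then replace target in one step at the end.

    `render` is a hypothetical callback that writes all generated files
    into the directory it is given.
    """
    with tempfile.TemporaryDirectory() as tmp:
        staging = Path(tmp) / "sql"
        staging.mkdir()
        render(staging)  # target is untouched while this runs
        # Only after rendering succeeds is the old directory replaced.
        if target.exists():
            shutil.rmtree(target)
        shutil.copytree(staging, target)
```

If `render` raises, the existing directory is left as it was, so `git status` never shows a half-deleted tree while generation is in progress.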
Frank Bertsch 239fab252a
Pipeline sql (#388)
* Views for monitoring structured ingestion errors

* Add UDF for extracting missing columns

* Add docs for json_extract_missing_cols

* Fix missing WITH clause

* Add generated sql

* Fix test function

* Update views

* Update tests

* Use shared-prod

* Update docstring for missing cols udf

* Move schemas to structured_ingestion dataset

* Don't dryrun queries on payload_bytes

* Change format of payload_bytes views

- Use a with_ratio to reduce duplication
- Change the view name to match the filename

* Fix missing test function

* Move to monitoring dataset

* Move to new test structure

* Remove spaces from js udfs

* Use persistent gunzip UDF
2019-10-09 10:03:58 -04:00
Jeff Klukas 2e821fbdc1 Use the shared-prod URL for the dryrun script 2019-10-08 20:21:48 -04:00
Daniel Thorn e872a76860
Add pytest plugins to lint python scripts (#410)
* Add pytest plugins to lint python scripts

* Fix lint errors
2019-10-08 14:00:11 -07:00
Jeff Klukas fff2ff3275
fxa_users_services tables (#396) 2019-10-08 13:31:01 -04:00
Daniel Thorn 8ccce8702c
Use chunksize=1 for consistent ordering in script/generate_incremental_table (#403) 2019-10-07 10:39:47 -07:00
Ben Wu 484d5079a7
Add additional fields to search datasets (#381) 2019-09-30 10:26:28 -04:00
Daniel Thorn 6158817ea3
Improve list_tables speed for script/copy_deduplicate (#382) 2019-09-26 11:28:22 -07:00
Daniel Thorn 91ba7297d5
Fix incorrect parameter name in copy_deduplicate (#383) 2019-09-25 13:30:29 -07:00
Daniel Thorn 16a1491821
Add --slices option to copy_deduplicate (#380) 2019-09-25 11:27:36 -07:00
Daniel Thorn 6c05a96847 don't create temp tables for dry_run 2019-09-24 13:52:36 -04:00
Jeff Klukas 29f44ebbf5 Bug: datetime logic in copy_deduplicate
Closes https://github.com/mozilla/bigquery-etl/issues/376

Addition to datetime.date considers only the days part of a timedelta,
so we have to convert to a timestamp first.
2019-09-24 13:52:36 -04:00
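The behavior behind commit 29f44ebbf5 can be reproduced with the standard library alone. The values below are illustrative and not taken from copy_deduplicate, and `datetime.combine` is one assumed way of converting to a timestamp first; the actual fix may do it differently.

```python
from datetime import date, datetime, time, timedelta

day = date(2019, 9, 24)
offset = timedelta(hours=36)  # stored as 1 day plus 43200 seconds

# date + timedelta uses only offset.days; the remaining 12 hours are dropped.
as_date = day + offset

# Promoting the date to a datetime at midnight first keeps the hours.
as_datetime = datetime.combine(day, time.min) + offset
```

Here `as_date` is 2019-09-25, while `as_datetime` is 2019-09-25 12:00:00, so any sub-day offset is silently lost unless the conversion happens before the addition.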
Daniel Thorn 54ae019ff3
Set temp table expiration in copy_deduplicate (#374) 2019-09-23 13:03:19 -07:00
Daniel Thorn 4f48ae21ef
Add --hourly option to copy_deduplicate (#370) 2019-09-23 10:54:21 -07:00
Jeff Klukas f0f5ec99a6 Import events from FxA oauth server
I've already created the new table and backfilled existing dates.

Addresses #348
2019-09-19 14:01:58 -04:00
Daniel Thorn 469c03ec10
Add script to format sql (#173) 2019-09-18 17:48:53 -07:00
Jeff Klukas 0ea63c7775 dryrun script updates 2019-09-17 15:30:59 -04:00
Jeff Klukas f4c5ea8e7c Run black 2019-09-13 10:00:33 -04:00
Sunah Suh f9c611a906
Fix UDF publisher script (#330) 2019-09-04 10:54:48 -05:00
Jeff Klukas 71ad6652f5 Update publish_views script for new directory structure 2019-08-28 20:58:43 -04:00
Jeff Klukas c2269a69af Update generate_view script for new directory structure
Closes #317
2019-08-28 20:58:43 -04:00
Sunah Suh 030ca5872a
Add script to recreate table creation DDL SQL from json description (#309)
Add script to recreate table creation DDL SQL from json description
2019-08-28 11:49:42 -05:00
Daniel Thorn 99fe0dfd9e
Move queries into destination-table directories (#286)
* Move queries into destination-table directories

* Apply suggestions from code review

Co-Authored-By: Jeff Klukas <jeff@klukas.net>
2019-08-26 12:52:49 -07:00
Anna Scholtz 9580029e20 UDF for unzipping gzipped bytes (#272)
* UDF for decompressing gzip data

* Update script for publishing UDFs to upload UDF dependency files

* Address review feedback for gunzip UDF

* Set default GCS bucket to moz-fx-data-prod-bigquery-etl

* Add function to upload UDF dependencies to GCS

* Set data-eng-circleci-tests context in CircleCI config

* Add approval step in CircleCI config
2019-08-26 10:53:06 -07:00
Jeff Klukas 5b005570db Make copy_deduplicate query more efficient for large tables
Closes #307
2019-08-23 14:43:17 -04:00
Anna Scholtz 7a6f7aacf8 View generation fixes 2019-08-21 15:03:50 -07:00
Anna Scholtz 52061238f5 Add wrapper script for generating and publishing views 2019-08-21 15:03:50 -07:00
Anna Scholtz 1d87797bd6 Fix rebase conflicts 2019-08-21 15:03:50 -07:00
Anna Scholtz 7520de5092 Script for auto-generating views 2019-08-21 15:03:50 -07:00
Jeff Klukas 7e533b4b24 Apply suggestions from code review
Co-Authored-By: Daniel Thorn <dthorn@mozilla.com>
2019-08-20 16:47:33 -04:00
Jeff Klukas a407f50c14 Explain the case of ignoring tables with trailing underscore 2019-08-20 16:47:33 -04:00
Jeff Klukas 5c06876590 Support altering target project in publish_views script 2019-08-20 16:47:33 -04:00
Jeff Klukas 12215f88fa Generate latest-version views for derived tables 2019-08-20 16:47:33 -04:00
Daniel Thorn 91e1be5394
Use mode last in clients_daily_v7 (#86) 2019-08-14 14:49:38 -07:00
Jeff Klukas 55cdf93c5f instantiate client after parsing args 2019-08-07 13:18:31 -04:00
Jeff Klukas a929285ca0 Apply suggestions from code review
Co-Authored-By: Daniel Thorn <dthorn@mozilla.com>
2019-08-07 13:18:31 -04:00
Jeff Klukas 46232da366 Refactor dryrun script for concurrency and permissions 2019-08-07 13:18:31 -04:00
Jeff Klukas fbc1b9f564 Fix up imports 2019-08-07 13:18:31 -04:00
Jeff Klukas 937d294d65 Add publish_views script 2019-08-07 13:18:31 -04:00
Jeff Klukas f2ecc96ce8 Adapt view update script for generating definitions instead 2019-08-07 13:18:31 -04:00
Daniel Thorn 28ce0d1f11 Add script for updating latest-version views 2019-08-07 13:18:31 -04:00
Daniel Thorn 22520e31f6
Use prepend_udf_usage_definitions in generate_sql (#287) 2019-08-05 16:10:31 -07:00
Daniel Thorn e1bf990b9a
Add support for testing queries with persistent UDFs (#285) 2019-08-05 14:14:19 -07:00
Daniel Thorn a241017c15
Reuse bigquery client and set default project_id (#282) 2019-08-02 13:36:10 -07:00
Daniel Thorn 5308e79570
detect errors when publishing udfs (#281) 2019-08-02 13:16:43 -07:00
Jeff Klukas 15640b831f Support options with underscores and fix incorrect variable 2019-08-01 14:00:57 -04:00
Jeff Klukas 0bc42132a2 Add --parallelism option 2019-08-01 10:15:33 -04:00
Jeff Klukas 4242d95777 Remove new entrypoint clause 2019-08-01 10:15:33 -04:00
Jeff Klukas 00cef9d7e9 Run black and refactor --only and --except args 2019-08-01 10:15:33 -04:00
Jeff Klukas ccb65d6d18 Apply suggestions from code review
Co-Authored-By: Daniel Thorn <dthorn@mozilla.com>
2019-08-01 10:15:33 -04:00
Jeff Klukas c9f65a7af8 Refactor to allow jobs to run in a different project
I believe Airflow may need to issue the jobs from derived-datasets
for the time being, so we make sure to fully qualify all table
references with the project_id that's passed as a parameter.
2019-08-01 10:15:33 -04:00
Jeff Klukas 0a3d4ea59d Fix typo
Co-Authored-By: Anna Scholtz <anna@scholtzan.net>
2019-08-01 10:15:33 -04:00
Jeff Klukas a275cfb9e5 Add copy_deduplicate script
Closes #220

A PR to schedule this script in Airflow is to follow.
2019-08-01 10:15:33 -04:00
Allen Short 351b42e84a Dry-run each query in CircleCI against prod datasets (#261)
* Dry-run each query in CircleCI against prod datasets

* Apply suggestions from code review

* Update script/dryrun
2019-07-30 10:26:25 -07:00
Daniel Thorn f79d075448
Add dataset names to paths in sql/ (#265)
* Add dataset names to paths in sql/

* rename clients_last_seen_raw_v1 to clients_last_seen_v1

* rename telemetry_raw to telemetry_derived

* address review
2019-07-30 09:39:22 -07:00
Jeff Klukas 01cb6e1074 Refactor naming of UDFs 2019-07-24 09:01:13 -04:00
Jeff Klukas 680c26ac41 Efficiency tweak: avoid double-publishing UDFs 2019-07-24 09:01:13 -04:00
Anna Scholtz ff356466f6 Bugfix: missing UDFs without dependencies 2019-07-22 13:44:28 -07:00
Anna Scholtz 4f897edd8a Add project ID when creating UDFs 2019-07-22 13:44:28 -07:00
Anna Scholtz c3d06f94d2 Script to publish persistent UDFs 2019-07-22 13:44:28 -07:00
Daniel Thorn d6e35295ec
Fix help page for script/generate_incremental_table (#244) 2019-07-22 12:58:35 -07:00
Anna Scholtz 7207a4e52f Move SQL templates to templates/ and add generated SQL 2019-06-25 08:07:26 -07:00
Anna Scholtz fe7325dcb4 Run SQL generation script when creating docker image 2019-06-25 08:07:26 -07:00
Anna Scholtz aa637154c5 Ensure that UDFs are added only once and in order when generating SQL files 2019-06-25 08:07:26 -07:00
Anna Scholtz a6661c5896 Trigger SQL query generation in pytest and update CircleCI config 2019-06-25 08:07:26 -07:00
Anna Scholtz b62970f3a9 Makefile for generating sql and add newline breaks to new files 2019-06-25 08:07:26 -07:00
Anna Scholtz f2efcc0432 Adopt CircleCI script to generate SQL queries 2019-06-25 08:07:26 -07:00
Anna Scholtz fd21ba88c2 Add Python script to generate SQL files with UDF declarations 2019-06-25 08:07:26 -07:00
Jeff Klukas 5eb134ca86 fixups found while running the deletions 2019-05-23 16:42:08 -04:00
Jeff Klukas 120153dabe respond to review comments 2019-05-23 16:42:08 -04:00
Jeff Klukas 845fa792c3 Codify archiving of exact mau table 2019-05-23 16:42:08 -04:00
Jeff Klukas 76a8a23e54 Use generate_incremental_table for clients_last_seen backfill 2019-05-23 16:42:08 -04:00
Jeff Klukas d9669e325a Add comments on tables that do not exist in BQ 2019-05-23 16:42:08 -04:00
Jeff Klukas d481d93861 delete from experiments and search_clients_daily datasets 2019-05-23 16:42:08 -04:00
Jeff Klukas 090fce87cb Better handling for bq tables 2019-05-23 16:42:08 -04:00
Jeff Klukas b421eeeb11 Correct time range for gs 2019-05-23 16:42:08 -04:00
Jeff Klukas 9970aebd4e Add delete-from-bq.sh 2019-05-23 16:42:08 -04:00
Jeff Klukas 7c8cbbc0f4 Bug 1550814 Remove data collected during hotfix rollout
See https://bugzilla.mozilla.org/show_bug.cgi?id=1550814
2019-05-23 16:42:08 -04:00
Jeff Klukas e782c5f6ff Add --destination-table and selectExprs options to export_to_parquet
We implemented a view-based solution for creating clients_last_seen
from clients_last_seen_raw in Athena and Presto, but this made the
table unavailable from Spark.

By adding in these options, we can materialize view logic at the
time of writing to Parquet, so that it will be available to all
Parquet consumers.
2019-05-22 14:01:40 -04:00
Jeff Klukas e0dcbdaf36 Usability improvements for generate_incremental_table 2019-05-16 09:41:14 -04:00
Daniel Thorn 1dad4e14b4
Rewrite generate_incremental_table in python (#126)
fixes #115
2019-05-14 09:22:31 -07:00
Daniel Thorn c2e416cefd
Create ~/.bigqueryrc without GCLOUD_SERVICE_KEY (#112) 2019-05-01 13:38:31 -07:00
Daniel Thorn 606fec9c04
Set sane defaults for bq use in airflow (#110) 2019-05-01 08:24:57 -07:00
Daniel Thorn e38d3d6f14
Fix --submission-date for export_to_parquet.py (#105) 2019-04-25 13:32:52 -07:00
Daniel Thorn 8abd41397d
Add script to automate generating incremental tables (#103) 2019-04-25 09:10:29 -07:00
Daniel Thorn 52d10541d3
Create and publish docker image (#95)
* Create and publish docker image

* Update config.yml
2019-04-19 09:19:36 -07:00
Daniel Thorn 099ff31aae
Add pyspark script for exporting to parquet (#89)
* Add pyspark script for exporting to parquet

* address review
2019-04-17 12:07:41 -07:00