Add initial crash ping backfill

Jeff Klukas 2020-01-16 16:47:04 -05:00
Parent d02bea7837
Commit d17e566aef
No key found matching this signature
GPG key ID: DDCB8ACB3942E362
6 changed files with 254 additions and 0 deletions


@@ -1,2 +1,19 @@
# bigquery-backfill

Scripts and historical records related to backfills in Mozilla's telemetry pipeline

## Layout

There is a `script` directory containing relatively pristine reference scripts
that you can copy and paste into a new backfill scenario and modify for your
particular needs.

There is a `backfills` directory where each subdirectory should be a dated
backfill event, containing all the scripts used and a description of the
overall scenario.

## Setup

Most of these backfill scenarios will assume that you have
[`gcp-ingestion`](https://github.com/mozilla/gcp-ingestion) checked out
locally. `mvn` invocations in scripts likely assume that you're in the
`ingestion-beam` directory of that repo.


@@ -0,0 +1,103 @@
# Backfilling crash pings due to the minidumpSha256Hash field

In late December 2019, we merged a change to the crash ping schema that
added a required `minidumpSha256Hash` field which turned out not to be
present in all pings. We dropped the `required` constraint and then needed
to backfill the rejected pings.

## Steps

We ran the backfill in the `moz-fx-data-backfill-2` project.

First, we determine the backfill range by querying the relevant error table:
```
SELECT
  DATE(submission_timestamp) AS dt,
  COUNT(*)
FROM
  `moz-fx-data-shared-prod.payload_bytes_error.telemetry`
WHERE
  DATE(submission_timestamp) >= "2019-12-01"
  AND document_type = 'crash'
  AND error_message LIKE 'org.everit.json.schema.ValidationException: #/payload/minidumpSha256Hash%'
GROUP BY 1
ORDER BY 1
```
That showed the affected range as `2019-12-20` through `2020-01-08`.

Next, we create destination tables via the `mirror-prod-tables` script.

Then we construct a suitable Dataflow job configuration in
`launch-dataflow-minidump` and run the script.

We visit the GCP console, choose the `moz-fx-data-backfill-2` project,
and go to the Dataflow section to watch the progress of the job.
It took about 45 minutes to run to completion.

We validate the results by checking counts per day (a sketch of that query
follows the duplicate check below) and by checking whether any document IDs
overlap between prod and the backfilled table:
```
WITH
  ids AS (
    SELECT
      DATE(submission_timestamp) AS dt,
      document_id
    FROM
      `moz-fx-data-shared-prod.telemetry_stable.crash_v4`
    WHERE
      DATE(submission_timestamp) BETWEEN '2019-12-04'
      AND '2020-01-09'
    UNION ALL
    SELECT
      DATE(submission_timestamp) AS dt,
      document_id
    FROM
      `moz-fx-data-backfill-2.telemetry_stable.crash_v4`
    WHERE
      DATE(submission_timestamp) BETWEEN '2019-12-04'
      AND '2020-01-09' ),
  dupes AS (
    SELECT
      dt,
      document_id,
      COUNT(*) AS n
    FROM
      ids
    GROUP BY
      1,
      2
    HAVING
      n > 1)
SELECT
  dt,
  COUNT(*)
FROM
  dupes
GROUP BY
  1
ORDER BY
  1
```
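
For the per-day counts, here is a sketch of the kind of query used over the
backfilled table; the daily totals can be compared against the error counts
from the first query above (the exact query we ran may have differed):

```
SELECT
  DATE(submission_timestamp) AS dt,
  COUNT(*)
FROM
  `moz-fx-data-backfill-2.telemetry_stable.crash_v4`
GROUP BY 1
ORDER BY 1
```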
The duplicate check shows duplicates only on the first and last day of the backfill.

LEARNING: We might want to bake logic into the original query in the Dataflow
job to skip any document_ids that already exist in the prod table. Then we'd be
guaranteed a disjoint backfill table that we could blindly append into prod with
`bq cp --append_table`.
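
As a rough sketch only (not something we ran here), such a filter could be added
to the final `SELECT` of the `--input` query in `launch-dataflow-minidump`:

```
SELECT
  *
FROM
  numbered_duplicates
WHERE
  _n = 1
  AND document_id NOT IN (
    SELECT document_id
    FROM `moz-fx-data-shared-prod.telemetry_stable.crash_v4`
    WHERE DATE(submission_timestamp) BETWEEN '2019-12-20' AND '2020-01-08')
```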
Instead, we take advantage of the low data volume here and craft a query
(`append_to_prod.sql`) that selects only the non-duplicated rows from the
backfill table. We do the final append into prod by running:
```
bq query -n 0 --nouse_legacy_sql \
--project_id=moz-fx-data-shared-prod \
--dataset_id=telemetry_stable \
--destination_table=crash_v4 \
--append_table \
< append_to_prod.sql
```
And we're done!


@@ -0,0 +1,40 @@
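-- Exclude any document_id that appears more than once per day across prod and
-- the backfill table in the affected window, then select the remaining backfilled
-- rows so they can be appended to prod without introducing duplicates.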
WITH ids AS (
  SELECT
    DATE(submission_timestamp) AS dt,
    document_id
  FROM
    `moz-fx-data-shared-prod.telemetry_stable.crash_v4`
  WHERE
    DATE(submission_timestamp)
    BETWEEN '2019-12-04'
    AND '2020-01-09'
  UNION ALL
  SELECT
    DATE(submission_timestamp) AS dt,
    document_id
  FROM
    `moz-fx-data-backfill-2.telemetry_stable.crash_v4`
  WHERE
    DATE(submission_timestamp)
    BETWEEN '2019-12-04'
    AND '2020-01-09'
),
dupes AS (
  SELECT
    dt,
    document_id,
    COUNT(*) AS n
  FROM
    ids
  GROUP BY
    1,
    2
  HAVING
    n > 1
)
SELECT
  *
FROM
  `moz-fx-data-backfill-2.telemetry_stable.crash_v4`
WHERE
  document_id NOT IN (SELECT document_id FROM dupes)


@@ -0,0 +1,31 @@
#!/bin/bash
set -exo pipefail
PROJECT="moz-fx-data-backfill-2"
JOB_NAME="minidump-backfill"
## this script assumes it's being run from the ingestion-beam directory
## of the gcp-ingestion repo.
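## The inline --input query below selects crash pings from the error table for the
## affected date range whose validation error mentions minidumpSha256Hash, extracts
## document_id from the submission uri, and keeps one row per document_id (preferring
## the earliest submission_timestamp) so the backfill itself is deduplicated.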
mvn compile exec:java -Dexec.mainClass=com.mozilla.telemetry.Decoder -Dexec.args="\
--runner=Dataflow \
--jobName=$JOB_NAME \
--project=$PROJECT \
--geoCityDatabase=gs://backfill-test-public1/GeoIP2-City.mmdb \
--geoCityFilter=gs://backfill-test-public1/cities15000.txt \
--schemasLocation=gs://backfill-test-public1/202001101955_c5b19a4.tar.gz \
--inputType=bigquery_query \
--input=\"with pings as (SELECT \`moz-fx-data-shared-prod\`.udf.parse_desktop_telemetry_uri(uri).document_id, * FROM \`moz-fx-data-shared-prod.payload_bytes_error.telemetry\` WHERE DATE(submission_timestamp) between '2019-12-20' and '2020-01-08' AND document_type = 'crash' AND payload is not null and (error_message LIKE 'org.everit.json.schema.ValidationException: #/payload/minidumpSha256Hash%')), distinct_document_ids AS (SELECT document_id, MIN(submission_timestamp) AS submission_timestamp FROM pings GROUP BY document_id), base AS (SELECT * FROM pings JOIN distinct_document_ids USING (document_id, submission_timestamp)), numbered_duplicates AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY document_id) AS _n FROM base) SELECT * FROM numbered_duplicates WHERE _n = 1\" \
--bqReadMethod=export \
--outputType=bigquery \
--bqWriteMethod=file_loads \
--bqClusteringFields=normalized_channel,sample_id \
--output=${PROJECT}:\${document_namespace}_stable.\${document_type}_v\${document_version} \
--errorOutputType=bigquery \
--errorOutput=${PROJECT}:payload_bytes_error.telemetry \
--experiments=shuffle_mode=service \
--region=us-central1 \
--usePublicIps=false \
--gcsUploadBufferSizeBytes=16777216 \
"


@@ -0,0 +1,27 @@
#!/bin/bash
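# Mirrors prod table schemas into the backfill project: telemetry_stable.crash_v4
# plus all payload_bytes_error tables. Any existing tables there are deleted and
# recreated empty with matching partitioning and clustering.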
PROJECT=moz-fx-data-backfill-2
for dataset in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod | grep 'telemetry_stable'); do
  bq mk $PROJECT:$dataset
  for table in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod $dataset | tail -n+3 | awk '{print $1}' | grep -E 'crash_v4'); do
    bq rm -f $PROJECT:$dataset.$table
    bq mk -t \
      --time_partitioning_field=submission_timestamp \
      --clustering_fields=normalized_channel,sample_id \
      --schema <(bq show --format=json moz-fx-data-shared-prod:$dataset.$table | jq '.schema.fields') \
      $PROJECT:$dataset.$table
  done
done

for dataset in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod | grep 'payload_bytes_error'); do
  bq mk $PROJECT:$dataset
  for table in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod $dataset | tail -n+3 | awk '{print $1}'); do
    bq rm -f $PROJECT:$dataset.$table
    bq mk -t \
      --time_partitioning_field=submission_timestamp \
      --clustering_fields=submission_timestamp \
      --schema <(bq show --format=json moz-fx-data-shared-prod:$dataset.$table | jq '.schema.fields') \
      $PROJECT:$dataset.$table
  done
done

script/mirror-prod-tables.sh Executable file

@@ -0,0 +1,36 @@
#!/bin/bash
# Copies the structure of prod tables into a project where you'll be backfilling.
# BEWARE: This script by default deletes existing tables!
# You likely want to make a copy of this and modify to suit, deleting some of the blocks,
# adding `grep` invocations to limit which tables you create, etc.
# Set this to a desired destination project that you'll be backfilling into
PROJECT=moz-fx-data-backfill-?
# Copy stable table schemas
for dataset in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod | grep 'telemetry_stable'); do
  bq mk $PROJECT:$dataset
  for table in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod $dataset | tail -n+3 | awk '{print $1}' | grep -E 'crash_v4'); do
    bq rm -f $PROJECT:$dataset.$table
    bq mk -t \
      --time_partitioning_field=submission_timestamp \
      --clustering_fields=normalized_channel,sample_id \
      --schema <(bq show --format=json moz-fx-data-shared-prod:$dataset.$table | jq '.schema.fields') \
      $PROJECT:$dataset.$table
  done
done

# Copy payload_bytes_error table schemas
for dataset in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod | grep 'payload_bytes_error'); do
  bq mk $PROJECT:$dataset
  for table in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod $dataset | tail -n+3 | awk '{print $1}'); do
    bq rm -f $PROJECT:$dataset.$table
    bq mk -t \
      --time_partitioning_field=submission_timestamp \
      --clustering_fields=submission_timestamp \
      --schema <(bq show --format=json moz-fx-data-shared-prod:$dataset.$table | jq '.schema.fields') \
      $PROJECT:$dataset.$table
  done
done