Add initial crash ping backfill

Jeff Klukas 2020-01-16 16:47:04 -05:00
Parent d02bea7837
Commit d17e566aef
No key found matching this signature
GPG key ID: DDCB8ACB3942E362
6 changed files with 254 additions and 0 deletions


@@ -1,2 +1,19 @@
# bigquery-backfill

Scripts and historical records related to backfills in Mozilla's telemetry pipeline

## Layout

There is a `script` directory containing relatively pristine reference scripts
that you can copy and paste into a new backfill scenario and modify for your
particular needs.

There is a `backfills` directory where each subdirectory should be a dated
backfill event, containing all the scripts used and a description of the
overall scenario.

## Setup

Most of these backfill scenarios will assume that you have
[`gcp-ingestion`](https://github.com/mozilla/gcp-ingestion) checked out
locally. `mvn` invocations in scripts likely assume that you're in the
`ingestion-beam` directory of that repo.


@@ -0,0 +1,103 @@
# Backfilling crash pings due to the minidumpSha256Hash field

In late December 2019, we merged a change to the crash ping schema that
added a required `minidumpSha256Hash` field which turned out not to be
present in all pings. We dropped the `required` constraint and then needed
to backfill the rejected pings.

## Steps

We ran the backfill in the `moz-fx-data-backfill-2` project.

First, we determine the backfill range by querying the relevant error table:
```
SELECT
  DATE(submission_timestamp) AS dt,
  COUNT(*)
FROM
  `moz-fx-data-shared-prod.payload_bytes_error.telemetry`
WHERE
  DATE(submission_timestamp) >= "2019-12-01"
  AND document_type = 'crash'
  AND error_message LIKE 'org.everit.json.schema.ValidationException: #/payload/minidumpSha256Hash%'
GROUP BY 1
ORDER BY 1
```
That showed the affected range as `2019-12-20` through `2020-01-08`.

Next, we create destination tables via the `mirror-prod-tables` script.

Then we construct a suitable Dataflow job configuration in
`launch-dataflow-minidump` and run the script.

We visit the GCP console, choose the `moz-fx-data-backfill-2` project,
and go to the Dataflow section to watch the progress of the job.
It took about 45 minutes to run to completion.

We validate the results by checking counts per day (a sketch of that query
follows the duplicate check below) and by checking whether any document IDs
overlap between prod and the backfilled table:
```
WITH
  ids AS (
    SELECT
      DATE(submission_timestamp) AS dt,
      document_id
    FROM
      `moz-fx-data-shared-prod.telemetry_stable.crash_v4`
    WHERE
      DATE(submission_timestamp) BETWEEN '2019-12-04'
      AND '2020-01-09'
    UNION ALL
    SELECT
      DATE(submission_timestamp) AS dt,
      document_id
    FROM
      `moz-fx-data-backfill-2.telemetry_stable.crash_v4`
    WHERE
      DATE(submission_timestamp) BETWEEN '2019-12-04'
      AND '2020-01-09' ),
  dupes AS (
    SELECT
      dt,
      document_id,
      COUNT(*) AS n
    FROM
      ids
    GROUP BY
      1,
      2
    HAVING
      n > 1)
SELECT
  dt,
  COUNT(*)
FROM
  dupes
GROUP BY
  1
ORDER BY
  1
```
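
For the per-day counts, here is a sketch of the kind of query used over the
backfilled table; the daily totals can be compared against the error counts
from the first query above (the exact query we ran may have differed):

```
SELECT
  DATE(submission_timestamp) AS dt,
  COUNT(*)
FROM
  `moz-fx-data-backfill-2.telemetry_stable.crash_v4`
GROUP BY 1
ORDER BY 1
```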
The duplicate check shows duplicates only on the first and last day of the backfill.

LEARNING: We might want to bake logic into the original query in the Dataflow
job to skip any document_ids that already exist in the prod table. Then we'd be
guaranteed a disjoint backfill table that we could blindly append into prod with
`bq cp --append_table`.
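
As a rough sketch only (not something we ran here), such a filter could be added
to the final `SELECT` of the `--input` query in `launch-dataflow-minidump`:

```
SELECT
  *
FROM
  numbered_duplicates
WHERE
  _n = 1
  AND document_id NOT IN (
    SELECT document_id
    FROM `moz-fx-data-shared-prod.telemetry_stable.crash_v4`
    WHERE DATE(submission_timestamp) BETWEEN '2019-12-20' AND '2020-01-08')
```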
Instead, we take advantage of the low data volume here and craft a query
(`append_to_prod.sql`) that selects only the non-duplicated rows from the
backfill table. We do the final append into prod by running:
```
bq query -n 0 --nouse_legacy_sql \
--project_id=moz-fx-data-shared-prod \
--dataset_id=telemetry_stable \
--destination_table=crash_v4 \
--append_table \
< append_to_prod.sql
```
And we're done!


@@ -0,0 +1,40 @@
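-- Exclude any document_id that appears more than once per day across prod and
-- the backfill table in the affected window, then select the remaining backfilled
-- rows so they can be appended to prod without introducing duplicates.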
WITH ids AS (
  SELECT
    DATE(submission_timestamp) AS dt,
    document_id
  FROM
    `moz-fx-data-shared-prod.telemetry_stable.crash_v4`
  WHERE
    DATE(submission_timestamp)
    BETWEEN '2019-12-04'
    AND '2020-01-09'
  UNION ALL
  SELECT
    DATE(submission_timestamp) AS dt,
    document_id
  FROM
    `moz-fx-data-backfill-2.telemetry_stable.crash_v4`
  WHERE
    DATE(submission_timestamp)
    BETWEEN '2019-12-04'
    AND '2020-01-09'
),
dupes AS (
  SELECT
    dt,
    document_id,
    COUNT(*) AS n
  FROM
    ids
  GROUP BY
    1,
    2
  HAVING
    n > 1
)
SELECT
  *
FROM
  `moz-fx-data-backfill-2.telemetry_stable.crash_v4`
WHERE
  document_id NOT IN (SELECT document_id FROM dupes)


@@ -0,0 +1,31 @@
#!/bin/bash
set -exo pipefail
PROJECT="moz-fx-data-backfill-2"
JOB_NAME="minidump-backfill"
## this script assumes it's being run from the ingestion-beam directory
## of the gcp-ingestion repo.
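## The inline --input query below selects crash pings from the error table for the
## affected date range whose validation error mentions minidumpSha256Hash, extracts
## document_id from the submission uri, and keeps one row per document_id (preferring
## the earliest submission_timestamp) so the backfill itself is deduplicated.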
mvn compile exec:java -Dexec.mainClass=com.mozilla.telemetry.Decoder -Dexec.args="\
--runner=Dataflow \
--jobName=$JOB_NAME \
--project=$PROJECT \
--geoCityDatabase=gs://backfill-test-public1/GeoIP2-City.mmdb \
--geoCityFilter=gs://backfill-test-public1/cities15000.txt \
--schemasLocation=gs://backfill-test-public1/202001101955_c5b19a4.tar.gz \
--inputType=bigquery_query \
--input=\"with pings as (SELECT \`moz-fx-data-shared-prod\`.udf.parse_desktop_telemetry_uri(uri).document_id, * FROM \`moz-fx-data-shared-prod.payload_bytes_error.telemetry\` WHERE DATE(submission_timestamp) between '2019-12-20' and '2020-01-08' AND document_type = 'crash' AND payload is not null and (error_message LIKE 'org.everit.json.schema.ValidationException: #/payload/minidumpSha256Hash%')), distinct_document_ids AS (SELECT document_id, MIN(submission_timestamp) AS submission_timestamp FROM pings GROUP BY document_id), base AS (SELECT * FROM pings JOIN distinct_document_ids USING (document_id, submission_timestamp)), numbered_duplicates AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY document_id) AS _n FROM base) SELECT * FROM numbered_duplicates WHERE _n = 1\" \
--bqReadMethod=export \
--outputType=bigquery \
--bqWriteMethod=file_loads \
--bqClusteringFields=normalized_channel,sample_id \
--output=${PROJECT}:\${document_namespace}_stable.\${document_type}_v\${document_version} \
--errorOutputType=bigquery \
--errorOutput=${PROJECT}:payload_bytes_error.telemetry \
--experiments=shuffle_mode=service \
--region=us-central1 \
--usePublicIps=false \
--gcsUploadBufferSizeBytes=16777216 \
"


@@ -0,0 +1,27 @@
#!/bin/bash
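# Mirrors prod table schemas into the backfill project: telemetry_stable.crash_v4
# plus all payload_bytes_error tables. Any existing tables there are deleted and
# recreated empty with matching partitioning and clustering.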
PROJECT=moz-fx-data-backfill-2
for dataset in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod | grep 'telemetry_stable'); do
  bq mk $PROJECT:$dataset
  for table in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod $dataset | tail -n+3 | awk '{print $1}' | grep -E 'crash_v4'); do
    bq rm -f $PROJECT:$dataset.$table
    bq mk -t \
      --time_partitioning_field=submission_timestamp \
      --clustering_fields=normalized_channel,sample_id \
      --schema <(bq show --format=json moz-fx-data-shared-prod:$dataset.$table | jq '.schema.fields') \
      $PROJECT:$dataset.$table
  done
done

for dataset in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod | grep 'payload_bytes_error'); do
  bq mk $PROJECT:$dataset
  for table in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod $dataset | tail -n+3 | awk '{print $1}'); do
    bq rm -f $PROJECT:$dataset.$table
    bq mk -t \
      --time_partitioning_field=submission_timestamp \
      --clustering_fields=submission_timestamp \
      --schema <(bq show --format=json moz-fx-data-shared-prod:$dataset.$table | jq '.schema.fields') \
      $PROJECT:$dataset.$table
  done
done

script/mirror-prod-tables.sh Executable file

@@ -0,0 +1,36 @@
#!/bin/bash
# Copies the structure of prod tables into a project where you'll be backfilling.
# BEWARE: This script by default deletes existing tables!
# You likely want to make a copy of this and modify to suit, deleting some of the blocks,
# adding `grep` invocations to limit which tables you create, etc.
# Set this to a desired destination project that you'll be backfilling into
PROJECT=moz-fx-data-backfill-?
# Copy stable table schemas
for dataset in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod | grep 'telemetry_stable'); do
  bq mk $PROJECT:$dataset
  for table in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod $dataset | tail -n+3 | awk '{print $1}' | grep -E 'crash_v4'); do
    bq rm -f $PROJECT:$dataset.$table
    bq mk -t \
      --time_partitioning_field=submission_timestamp \
      --clustering_fields=normalized_channel,sample_id \
      --schema <(bq show --format=json moz-fx-data-shared-prod:$dataset.$table | jq '.schema.fields') \
      $PROJECT:$dataset.$table
  done
done

# Copy payload_bytes_error table schemas
for dataset in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod | grep 'payload_bytes_error'); do
  bq mk $PROJECT:$dataset
  for table in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod $dataset | tail -n+3 | awk '{print $1}'); do
    bq rm -f $PROJECT:$dataset.$table
    bq mk -t \
      --time_partitioning_field=submission_timestamp \
      --clustering_fields=submission_timestamp \
      --schema <(bq show --format=json moz-fx-data-shared-prod:$dataset.$table | jq '.schema.fields') \
      $PROJECT:$dataset.$table
  done
done