Add initial crash ping backfill
Parent: d02bea7837
Commit: d17e566aef

README.md
@@ -1,2 +1,19 @@
# bigquery-backfill

Scripts and historical records related to backfills in Mozilla's telemetry pipeline

## Layout

There is a `script` directory containing relatively pristine reference scripts
that you can copy and paste into a new backfill scenario and modify for your
particular needs.

There is a `backfills` directory where each subdirectory should be a dated
backfill event, containing all the scripts used and a description of the
overall scenario.
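
For example, the tree might look roughly like this (the dated directory name
below is a placeholder, not an actual entry):

```
script/
    mirror-prod-tables
backfills/
    <YYYY-MM-DD>-<short-description>/
        writeup, scripts, SQL, etc.
```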

## Setup

Most of these backfill scenarios will assume that you have
[`gcp-ingestion`](https://github.com/mozilla/gcp-ingestion) checked out
locally. `mvn` invocations in scripts likely assume that you're in the
`ingestion-beam` directory of that repo.
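
A minimal setup sketch (the checkout location is up to you; `~/src` below is
just an assumption):

```
git clone https://github.com/mozilla/gcp-ingestion.git ~/src/gcp-ingestion
cd ~/src/gcp-ingestion/ingestion-beam
mvn clean compile   # sanity-check that the Beam project builds before backfilling
```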

@@ -0,0 +1,103 @@
# Backfilling crash pings due to minidumpSha256Hash field

In late December 2019, we merged a change to the crash ping schema that
added a required `minidumpSha256Hash` field which turned out not to always
be present. We dropped the `required` constraint and then needed to backfill
the rejected pings.

## Steps

We ran the backfill in the `moz-fx-data-backfill-2` project.

First, we determine the backfill range by querying the relevant error table:

```
SELECT
  DATE(submission_timestamp) AS dt,
  COUNT(*)
FROM
  `moz-fx-data-shared-prod.payload_bytes_error.telemetry`
WHERE
  DATE(submission_timestamp) >= "2019-12-01"
  AND document_type = 'crash'
  AND error_message LIKE 'org.everit.json.schema.ValidationException: #/payload/minidumpSha256Hash%'
GROUP BY 1
ORDER BY 1
```

That showed the affected range as `2019-12-20` through `2020-01-08`.

Next, we create destination tables via the `mirror-prod-tables` script.
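
Invoking the adapted copy included in this commit looks roughly like this
(assuming the `bq` CLI is installed and authenticated, and that the script's
`PROJECT` variable is already set to `moz-fx-data-backfill-2`):

```
# run from the dated backfill directory
bash ./mirror-prod-tables
```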

Next, we construct a suitable Dataflow job configuration in
`launch-dataflow-minidump` and run the script.
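
Since its `mvn` invocation assumes the working directory is `ingestion-beam`,
launching it looks roughly like this (the checkout and backfill paths below
are assumptions):

```
cd ~/src/gcp-ingestion/ingestion-beam
bash /path/to/bigquery-backfill/backfills/<this-backfill>/launch-dataflow-minidump
```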

We visit the GCP console, choose the `moz-fx-data-backfill-2` project,
and go to the Dataflow section to watch the progress of the job.
It took about 45 minutes to run to completion.
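
The same progress check can be done from the command line; a sketch, assuming
the `gcloud` SDK is installed and authenticated:

```
gcloud dataflow jobs list \
  --project=moz-fx-data-backfill-2 \
  --region=us-central1 \
  --status=active
```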

We validate the results by checking counts per day and also by checking whether
we have any overlapping IDs between prod and the backfilled table:

```
WITH
  ids AS (
    SELECT
      DATE(submission_timestamp) AS dt,
      document_id
    FROM
      `moz-fx-data-shared-prod.telemetry_stable.crash_v4`
    WHERE
      DATE(submission_timestamp) BETWEEN '2019-12-04'
      AND '2020-01-09'
    UNION ALL
    SELECT
      DATE(submission_timestamp) AS dt,
      document_id
    FROM
      `moz-fx-data-backfill-2.telemetry_stable.crash_v4`
    WHERE
      DATE(submission_timestamp) BETWEEN '2019-12-04'
      AND '2020-01-09' ),
  dupes AS (
    SELECT
      dt,
      document_id,
      COUNT(*) AS n
    FROM
      ids
    GROUP BY
      1,
      2
    HAVING
      n > 1)
SELECT
  dt,
  COUNT(*)
FROM
  dupes
GROUP BY
  1
ORDER BY
  1
```

The results show duplicates only on the first and last day of the backfill.
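
The per-day count check isn't reproduced here; a rough sketch of what it could
look like against the backfilled table (not the exact query we ran):

```
bq query --nouse_legacy_sql --project_id=moz-fx-data-backfill-2 '
  SELECT DATE(submission_timestamp) AS dt, COUNT(*) AS n
  FROM `moz-fx-data-backfill-2.telemetry_stable.crash_v4`
  WHERE DATE(submission_timestamp) BETWEEN "2019-12-20" AND "2020-01-08"
  GROUP BY 1 ORDER BY 1'
```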

LEARNING: We might want to bake logic into the original query in the Dataflow
job for skipping any document_ids that already exist in the prod table. Then
we'd be guaranteed a disjoint backfill table that we could blindly append into
prod with `bq cp --append_table`.
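
In that disjoint case, the final step would be a single table copy, roughly:

```
# only safe if the backfill table is guaranteed disjoint from prod
bq cp --append_table \
  moz-fx-data-backfill-2:telemetry_stable.crash_v4 \
  moz-fx-data-shared-prod:telemetry_stable.crash_v4
```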

Instead, we'll take advantage of the low data volume here and craft a
query to append this data into the prod table. We do the final append
into prod by running:

```
bq query -n 0 --nouse_legacy_sql \
  --project_id=moz-fx-data-shared-prod \
  --dataset_id=telemetry_stable \
  --destination_table=crash_v4 \
  --append_table \
  < append_to_prod.sql
```

And we're done!

append_to_prod.sql
@@ -0,0 +1,40 @@
-- ids: document_ids per day from both prod and the backfilled table over the
-- affected range; dupes: (dt, document_id) pairs that appear more than once
-- across that union, i.e. rows already present in prod. The final SELECT keeps
-- only backfilled rows whose document_id is not among the dupes.
WITH ids AS (
  SELECT
    DATE(submission_timestamp) AS dt,
    document_id
  FROM
    `moz-fx-data-shared-prod.telemetry_stable.crash_v4`
  WHERE
    DATE(submission_timestamp)
    BETWEEN '2019-12-04'
    AND '2020-01-09'
  UNION ALL
  SELECT
    DATE(submission_timestamp) AS dt,
    document_id
  FROM
    `moz-fx-data-backfill-2.telemetry_stable.crash_v4`
  WHERE
    DATE(submission_timestamp)
    BETWEEN '2019-12-04'
    AND '2020-01-09'
),
dupes AS (
  SELECT
    dt,
    document_id,
    COUNT(*) AS n
  FROM
    ids
  GROUP BY
    1,
    2
  HAVING
    n > 1
)
SELECT
  *
FROM
  `moz-fx-data-backfill-2.telemetry_stable.crash_v4`
WHERE
  document_id NOT IN (SELECT document_id FROM dupes)

launch-dataflow-minidump
@@ -0,0 +1,31 @@
#!/bin/bash

set -exo pipefail

PROJECT="moz-fx-data-backfill-2"
JOB_NAME="minidump-backfill"

## this script assumes it's being run from the ingestion-beam directory
## of the gcp-ingestion repo.

## The --input query below pulls the rejected crash pings for the affected
## range out of the error table and deduplicates them on document_id before
## the Decoder job reprocesses them into the backfill project's stable tables.

mvn compile exec:java -Dexec.mainClass=com.mozilla.telemetry.Decoder -Dexec.args="\
--runner=Dataflow \
--jobName=$JOB_NAME \
--project=$PROJECT \
--geoCityDatabase=gs://backfill-test-public1/GeoIP2-City.mmdb \
--geoCityFilter=gs://backfill-test-public1/cities15000.txt \
--schemasLocation=gs://backfill-test-public1/202001101955_c5b19a4.tar.gz \
--inputType=bigquery_query \
--input=\"with pings as (SELECT \`moz-fx-data-shared-prod\`.udf.parse_desktop_telemetry_uri(uri).document_id, * FROM \`moz-fx-data-shared-prod.payload_bytes_error.telemetry\` WHERE DATE(submission_timestamp) between '2019-12-20' and '2020-01-08' AND document_type = 'crash' AND payload is not null and (error_message LIKE 'org.everit.json.schema.ValidationException: #/payload/minidumpSha256Hash%')), distinct_document_ids AS (SELECT document_id, MIN(submission_timestamp) AS submission_timestamp FROM pings GROUP BY document_id), base AS (SELECT * FROM pings JOIN distinct_document_ids USING (document_id, submission_timestamp)), numbered_duplicates AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY document_id) AS _n FROM base) SELECT * FROM numbered_duplicates WHERE _n = 1\" \
--bqReadMethod=export \
--outputType=bigquery \
--bqWriteMethod=file_loads \
--bqClusteringFields=normalized_channel,sample_id \
--output=${PROJECT}:\${document_namespace}_stable.\${document_type}_v\${document_version} \
--errorOutputType=bigquery \
--errorOutput=${PROJECT}:payload_bytes_error.telemetry \
--experiments=shuffle_mode=service \
--region=us-central1 \
--usePublicIps=false \
--gcsUploadBufferSizeBytes=16777216 \
"

mirror-prod-tables
@@ -0,0 +1,27 @@
#!/bin/bash

# Adapted from script/mirror-prod-tables: creates empty destination tables in
# the backfill project, mirroring the prod crash_v4 stable table and the
# payload_bytes_error tables.

PROJECT=moz-fx-data-backfill-2

for dataset in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod | grep 'telemetry_stable'); do
  bq mk $PROJECT:$dataset
  for table in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod $dataset | tail -n+3 | awk '{print $1}' | grep -E 'crash_v4'); do
    bq rm -f $PROJECT:$dataset.$table
    bq mk -t \
      --time_partitioning_field=submission_timestamp \
      --clustering_fields=normalized_channel,sample_id \
      --schema <(bq show --format=json moz-fx-data-shared-prod:$dataset.$table | jq '.schema.fields') \
      $PROJECT:$dataset.$table
  done
done

for dataset in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod | grep 'payload_bytes_error'); do
  bq mk $PROJECT:$dataset
  for table in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod $dataset | tail -n+3 | awk '{print $1}'); do
    bq rm -f $PROJECT:$dataset.$table
    bq mk -t \
      --time_partitioning_field=submission_timestamp \
      --clustering_fields=submission_timestamp \
      --schema <(bq show --format=json moz-fx-data-shared-prod:$dataset.$table | jq '.schema.fields') \
      $PROJECT:$dataset.$table
  done
done

script/mirror-prod-tables
@@ -0,0 +1,36 @@
#!/bin/bash

# Copies the structure of prod tables into a project where you'll be backfilling.
# BEWARE: This script by default deletes existing tables!

# You likely want to make a copy of this and modify it to suit, deleting some of
# the blocks, adding `grep` invocations to limit which tables you create, etc.

# Set this to the destination project that you'll be backfilling into
PROJECT=moz-fx-data-backfill-?

# Copy stable table schemas
for dataset in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod | grep 'telemetry_stable'); do
  bq mk $PROJECT:$dataset
  for table in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod $dataset | tail -n+3 | awk '{print $1}' | grep -E 'crash_v4'); do
    bq rm -f $PROJECT:$dataset.$table
    bq mk -t \
      --time_partitioning_field=submission_timestamp \
      --clustering_fields=normalized_channel,sample_id \
      --schema <(bq show --format=json moz-fx-data-shared-prod:$dataset.$table | jq '.schema.fields') \
      $PROJECT:$dataset.$table
  done
done

# Copy payload_bytes_error table schemas
for dataset in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod | grep 'payload_bytes_error'); do
  bq mk $PROJECT:$dataset
  for table in $(bq ls -n 1000 --project_id=moz-fx-data-shared-prod $dataset | tail -n+3 | awk '{print $1}'); do
    bq rm -f $PROJECT:$dataset.$table
    bq mk -t \
      --time_partitioning_field=submission_timestamp \
      --clustering_fields=submission_timestamp \
      --schema <(bq show --format=json moz-fx-data-shared-prod:$dataset.$table | jq '.schema.fields') \
      $PROJECT:$dataset.$table
  done
done