DEPRECATED - Scripts to dump mozaggregator to BigQuery

mozaggregator2bq

A set of scripts for loading Firefox Telemetry aggregates into BigQuery. These aggregates power the Telemetry Dashboard and Evolution Viewer.

Overview

Build the container and launch it:

docker-compose build
docker-compose run --rm app bash

Interacting with the database

To start a psql session against the read-only replica of the production Postgres instance, run the following commands. Ensure that you have the appropriate AWS credentials.

source bin/export_postgres_credentials_s3

PGPASSWORD=$POSTGRES_PASS psql \
    --host="$POSTGRES_HOST" \
    --username="$POSTGRES_USER" \
    --dbname="$POSTGRES_DB"

Example queries:

-- list all aggregates by build_id
select tablename
from pg_catalog.pg_tables
where schemaname='public' and tablename like 'build_id%';

--  build_id_aurora_0_20130414
--  build_id_aurora_0_20150128
--  build_id_aurora_0_20150329
--  build_id_aurora_1_20130203
--  build_id_aurora_1_20150604
-- ...

-- list all aggregates by submission_date
select tablename
from pg_catalog.pg_tables
where schemaname='public' and tablename like 'submission_date%';

--  submission_date_beta_1_20151027
--  submission_date_nightly_40_20151029
--  submission_date_beta_39_20151027
--  submission_date_nightly_1_20151025
--  submission_date_nightly_39_20151031
-- ...
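
The same listing can be produced from Python inside the container. Below is a minimal sketch, assuming psycopg2 is available and the POSTGRES_* variables have been exported as above; it is not part of the repository's scripts.

# A sketch only: run the build_id listing from Python, assuming psycopg2
# is installed and the POSTGRES_* environment variables are exported.
import os
import psycopg2

conn = psycopg2.connect(
    host=os.environ["POSTGRES_HOST"],
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASS"],
    dbname=os.environ["POSTGRES_DB"],
)
with conn, conn.cursor() as cur:
    cur.execute(
        "select tablename from pg_catalog.pg_tables "
        "where schemaname='public' and tablename like 'build_id%'"
    )
    for (tablename,) in cur:
        print(tablename)
conn.close()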

Database dumps by aggregate type and date

To start dumping data, run the following commands.

source bin/export_postgres_credentials_s3

time DATA_DIR=data AGGREGATE_TYPE=submission DS_NODASH=20191201 bin/pg_dump_by_day
# 23.92s user 1.97s system 39% cpu 1:05.48 total

time DATA_DIR=data AGGREGATE_TYPE=build_id DS_NODASH=20191201 bin/pg_dump_by_day
# 3.47s user 0.49s system 24% cpu 16.188 total

This should result in gzipped files in the following hierarchy.

data
├── [  96]  build_id
│   └── [ 128]  20191201
│       ├── [8.4M]  474306.dat.gz
│       └── [1.6K]  toc.dat
└── [  96]  submission
    └── [3.2K]  20191201
        ├── [ 74K]  474405.dat.gz
        ├── [ 48K]  474406.dat.gz
        ....
        ├── [1.8M]  474504.dat.gz
        └── [ 93K]  toc.dat

4 directories, 103 files

See the pg_dump documentation for details on the file format.

$ gzip -cd data/submission/20191201/474405.dat.gz | head -n3
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_INSTANTIATED_FLAG", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"}    {0,2,0,2,2}
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_CONSUMERS", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"}    {0,0,0,0,0,0,0,0,0,0,2,0,20,2}
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_ISIMPLEDOM_USAGE_FLAG", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"}        {2,0,0,0,2}

Running a notebook

Ensure that the data directory at the top level of the repository matches the one referenced in the notebook, then run the following script.

bin/start-jupyter

This script can be modified to pass additional configuration parameters to Spark, such as the default parallelism and the executor memory.
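
For reference, the same knobs can also be set when constructing a session directly. The values below are placeholders, not the defaults used by bin/start-jupyter.

# Placeholder values only; bin/start-jupyter sets its own defaults.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("mozaggregator2bq")
    .config("spark.default.parallelism", 16)
    .config("spark.executor.memory", "4g")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)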

Processing pg_dump output into Parquet

Run the following command to transform the data dumps into Parquet, with the JSON dimension fields expanded into columns and the aggregates into arrays.

bin/submit-local bin/pg_dump_to_parquet.py \
    --input-dir data/submission/20191201 \
    --output-dir data/parquet/submission_date/20191201
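
The transformation itself is implemented in bin/pg_dump_to_parquet.py; the sketch below only illustrates its general shape (JSON dimensions expanded into columns, the aggregate parsed into an array) and is not the actual script.

# Illustrative only: the general shape of the pg_dump-to-Parquet step,
# not the actual implementation in bin/pg_dump_to_parquet.py.
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()

dimensions = T.StructType([
    T.StructField(name, T.StringType())
    for name in ["os", "child", "label", "metric",
                 "osVersion", "application", "architecture"]
])

df = (
    spark.read.text("data/submission/20191201/*.dat.gz")
    .filter(F.col("value").contains("\t"))  # keep only data rows
    .select(F.split("value", "\t").alias("parts"))
    .select(
        F.from_json(F.col("parts")[0], dimensions).alias("dims"),
        F.split(F.regexp_replace(F.col("parts")[1], "[{}]", ""), ",")
        .cast(T.ArrayType(T.LongType()))
        .alias("aggregate"),
    )
    .select("dims.*", "aggregate")
)
df.write.parquet("data/parquet/submission_date/20191201", mode="overwrite")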

Running backfill

The bin/backfill script dumps data from the Postgres database, transforms it into Parquet, and loads it into a BigQuery table. The current schema for the table is as follows:

Field name      Type    Mode
ingest_date     DATE    REQUIRED
aggregate_type  STRING  NULLABLE
ds_nodash       STRING  NULLABLE
channel         STRING  NULLABLE
version         STRING  NULLABLE
os              STRING  NULLABLE
child           STRING  NULLABLE
label           STRING  NULLABLE
metric          STRING  NULLABLE
osVersion       STRING  NULLABLE
application     STRING  NULLABLE
architecture    STRING  NULLABLE
aggregate       STRING  NULLABLE

There is one table for the build_id aggregates and one for the submission_date aggregates. Build IDs are truncated to the nearest date.
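
The load into BigQuery is handled by bin/backfill. As a rough illustration of that step only, a Parquet directory that has first been copied to a GCS bucket can be loaded with the google-cloud-bigquery client; the bucket, project, dataset, and table names below are placeholders.

# A sketch only; the bucket, project, dataset, and table names are
# placeholders, and bin/backfill performs the real load.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
job = client.load_table_from_uri(
    "gs://my-bucket/parquet/submission_date/20191201/*.parquet",
    "my-project.my_dataset.submission_date",
    job_config=job_config,
)
job.result()  # wait for the load to finish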

It may be useful to drive the backfill with a small entry-point script, for example:

#!/bin/bash
# Backfill one month at a time between the start and end dates.
set -x
start=2015-06-01
end=2020-04-01
while ! [[ $start > $end ]]; do
    # clear intermediate dumps from the previous iteration
    rm -r data
    up_to=$(date -d "$start + 1 month" +%F)
    START_DS=$start END_DS=$up_to bash -x bin/backfill
    start=$up_to
done