mozaggregator2bq
A set of scripts for loading Firefox Telemetry aggregates into BigQuery. These aggregates power the Telemetry Dashboard and Evolution Viewer.
Overview
Build the container and launch it:
docker-compose build
docker-compose run --rm app bash
Interacting with the database
To start a psql session connected to the read-only replica of the production Postgres instance, run the following commands. Ensure that you have the appropriate AWS credentials.
source bin/export_postgres_credentials_s3
PGPASSWORD=$POSTGRES_PASS psql \
--host="$POSTGRES_HOST" \
--username="$POSTGRES_USER" \
--dbname="$POSTGRES_DB"
An example query:
-- list all aggregates by build_id
select tablename
from pg_catalog.pg_tables
where schemaname='public' and tablename like 'build_id%';
-- build_id_aurora_0_20130414
-- build_id_aurora_0_20150128
-- build_id_aurora_0_20150329
-- build_id_aurora_1_20130203
-- build_id_aurora_1_20150604
-- ...
-- list all aggregates by submission_date
select tablename
from pg_catalog.pg_tables
where schemaname='public' and tablename like 'submission_date%';
-- submission_date_beta_1_20151027
-- submission_date_nightly_40_20151029
-- submission_date_beta_39_20151027
-- submission_date_nightly_1_20151025
-- submission_date_nightly_39_20151031
-- ...
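The same connection settings can be reused programmatically. Below is a minimal sketch that runs the build_id listing from Python, assuming the psycopg2 driver is available in the container (an assumption; it is not guaranteed by this repository) and that bin/export_postgres_credentials_s3 has already been sourced.

import os
import psycopg2

# The connection parameters come from the same environment variables psql uses.
conn = psycopg2.connect(
    host=os.environ["POSTGRES_HOST"],
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASS"],
    dbname=os.environ["POSTGRES_DB"],
)
with conn, conn.cursor() as cur:
    # Mirror the first query above: list all aggregates by build_id.
    cur.execute(
        "select tablename from pg_catalog.pg_tables "
        "where schemaname='public' and tablename like 'build_id%'"
    )
    for (tablename,) in cur.fetchall():
        print(tablename)
conn.close()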
Database dumps by aggregate type and date
To start dumping data, run the following commands.
source bin/export_postgres_credentials_s3
time DATA_DIR=data AGGREGATE_TYPE=submission DS_NODASH=20191201 bin/pg_dump_by_day
# 23.92s user 1.97s system 39% cpu 1:05.48 total
time DATA_DIR=data AGGREGATE_TYPE=build_id DS_NODASH=20191201 bin/pg_dump_by_day
# 3.47s user 0.49s system 24% cpu 16.188 total
This should result in gzipped files in the following hierarchy.
data
├── [ 96] build_id
│ └── [ 128] 20191201
│ ├── [8.4M] 474306.dat.gz
│ └── [1.6K] toc.dat
└── [ 96] submission
└── [3.2K] 20191201
├── [ 74K] 474405.dat.gz
├── [ 48K] 474406.dat.gz
....
├── [1.8M] 474504.dat.gz
└── [ 93K] toc.dat
4 directories, 103 files
See the pg_dump documentation for details on the file format.
$ gzip -cd data/submission/20191201/474405.dat.gz | head -n3
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_INSTANTIATED_FLAG", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"} {0,2,0,2,2}
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_CONSUMERS", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"} {0,0,0,0,0,0,0,0,0,0,2,0,20,2}
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_ISIMPLEDOM_USAGE_FLAG", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"} {2,0,0,0,2}
Running a notebook
Ensure that the data directory at the top level of the repository matches the one referenced in the notebook. Then run the following script.
bin/start-jupyter
This script can be modified to set various Spark configuration parameters, including the default parallelism and the amount of executor memory.
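Alternatively, a notebook cell can build its own session with explicit settings. The option names below are standard Spark properties, but the values are placeholders; this is a sketch rather than the defaults used by bin/start-jupyter.

from pyspark.sql import SparkSession

# Placeholder values; tune these for the local machine.
spark = (
    SparkSession.builder
    .appName("mozaggregator2bq")
    .config("spark.default.parallelism", "16")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)

Note that getOrCreate returns an existing session if one is already running, in which case these settings will not take effect.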
Processing pg_dump into parquet
Run the following scripts to transform the data dumps into Parquet, with the JSON fields expanded into appropriate columns and arrays.
bin/submit-local bin/pg_dump_to_parquet.py \
--input-dir data/submission_date/20191201 \
--output-dir data/parquet/submission_date/20191201
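To spot-check the output, the resulting Parquet directory can be read back with PySpark; the exact columns depend on pg_dump_to_parquet.py, so this is only a sketch.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Inspect the columns derived from the JSON dimensions and the aggregate array.
df = spark.read.parquet("data/parquet/submission_date/20191201")
df.printSchema()
df.show(5, truncate=False)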
Running backfill
The bin/backfill script will dump data from the Postgres database, transform the data into Parquet, and load the data into a BigQuery table. The current schema for the table is as follows:
Field name | Type | Mode |
---|---|---|
ingest_date | DATE | REQUIRED |
aggregate_type | STRING | NULLABLE |
ds_nodash | STRING | NULLABLE |
channel | STRING | NULLABLE |
version | STRING | NULLABLE |
os | STRING | NULLABLE |
child | STRING | NULLABLE |
label | STRING | NULLABLE |
metric | STRING | NULLABLE |
osVersion | STRING | NULLABLE |
application | STRING | NULLABLE |
architecture | STRING | NULLABLE |
aggregate | STRING | NULLABLE |
There is one table for the build id aggregates and another for the submission date aggregates. The build ids are truncated to the nearest date.
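Once loaded, the table can be queried with the BigQuery client library. The table name below is hypothetical; substitute the destination table configured in bin/backfill.

from google.cloud import bigquery

client = bigquery.Client()

# `project.dataset.submission_date_aggregates` is a hypothetical name;
# replace it with the actual destination table used by bin/backfill.
query = """
select ds_nodash, channel, metric, aggregate
from `project.dataset.submission_date_aggregates`
where ingest_date = date '2019-12-01'
limit 10
"""
for row in client.query(query).result():
    print(dict(row))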
It may be useful to use a small entry-point script.
#!/bin/bash
# Backfill one month at a time between the start and end dates.
set -x
start=2015-06-01
end=2020-04-01
while ! [[ $start > $end ]]; do
# Clear intermediate dumps from the previous iteration.
rm -r data
up_to=$(date -d "$start + 1 month" +%F)
START_DS=$start END_DS=$up_to bash -x bin/backfill
start=$up_to
done