DEPRECATED - Scripts to dump mozaggregator to BigQuery

mozaggregator2bq

A set of scripts for loading Firefox Telemetry aggregates into BigQuery. These aggregates power the Telemetry Dashboard and Evolution Viewer.

Overview

Install the required dependencies. Here, we use an n1-standard-4 Google Cloud VM running a CentOS 8 image.

sudo dnf install \
    git \
    postgresql \
    java-1.8.0-openjdk \
    jq \
    tmux

Start a new tmux session, which can be detached and later reattached from a new SSH session.

# pg_dump is included in postgresql or postgresql-client
pg_dump --version
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# if you are running jupyter notebooks
pip install -r requirements-dev.txt

Interacting with the database

To start a psql session against the read-only replica of the production Postgres instance, run the following commands. Ensure that you have the appropriate AWS credentials.

source scripts/export_credentials_s3

PGPASSWORD=$POSTGRES_PASS psql \
    --host="$POSTGRES_HOST" \
    --username="$POSTGRES_USER" \
    --dbname="$POSTGRES_DB"

An example query:

-- list all aggregates by build_id
select tablename
from pg_catalog.pg_tables
where schemaname='public' and tablename like 'build_id%';

--  build_id_aurora_0_20130414
--  build_id_aurora_0_20150128
--  build_id_aurora_0_20150329
--  build_id_aurora_1_20130203
--  build_id_aurora_1_20150604
-- ...

-- list all aggregates by submission_date
select tablename
from pg_catalog.pg_tables
where schemaname='public' and tablename like 'submission_date%';

--  submission_date_beta_1_20151027
--  submission_date_nightly_40_20151029
--  submission_date_beta_39_20151027
--  submission_date_nightly_1_20151025
--  submission_date_nightly_39_20151031
-- ...
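
The same queries can be issued from Python, for example inside a notebook; a minimal sketch using psycopg2 (assuming it is installed) and the same POSTGRES_* environment variables:

import os
import psycopg2

# Reuses the POSTGRES_* variables exported by scripts/export_credentials_s3.
conn = psycopg2.connect(
    host=os.environ["POSTGRES_HOST"],
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASS"],
    dbname=os.environ["POSTGRES_DB"],
)
with conn, conn.cursor() as cur:
    cur.execute(
        "select tablename from pg_catalog.pg_tables "
        "where schemaname = 'public' and tablename like 'build_id%' limit 5"
    )
    for (tablename,) in cur.fetchall():
        print(tablename)
conn.close()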

Database dumps by aggregate type and date

To start dumping data, run the following commands.

source scripts/export_credentials_s3

time DATA_DIR=data AGGREGATE_TYPE=submission DS_NODASH=20191201 scripts/pg_dump_by_day
# 23.92s user 1.97s system 39% cpu 1:05.48 total

time DATA_DIR=data AGGREGATE_TYPE=build_id DS_NODASH=20191201 scripts/pg_dump_by_day
# 3.47s user 0.49s system 24% cpu 16.188 total
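
To dump a range of days, the same script can be driven in a loop; a minimal sketch from Python (bin/backfill, described below, does this as part of the full pipeline):

import os
import subprocess
from datetime import date, timedelta

# Dump a few consecutive days of submission aggregates by re-running the script
# with a different DS_NODASH each time. Assumes credentials are already exported.
day, end = date(2019, 12, 1), date(2019, 12, 3)
while day <= end:
    env = dict(
        os.environ,
        DATA_DIR="data",
        AGGREGATE_TYPE="submission",
        DS_NODASH=day.strftime("%Y%m%d"),
    )
    subprocess.run(["scripts/pg_dump_by_day"], env=env, check=True)
    day += timedelta(days=1)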

This should result in gzipped files in the following hierarchy.

data
├── [  96]  build_id
│   └── [ 128]  20191201
│       ├── [8.4M]  474306.dat.gz
│       └── [1.6K]  toc.dat
└── [  96]  submission
    └── [3.2K]  20191201
        ├── [ 74K]  474405.dat.gz
        ├── [ 48K]  474406.dat.gz
        ....
        ├── [1.8M]  474504.dat.gz
        └── [ 93K]  toc.dat

4 directories, 103 files

See the pg_dump documentation for details on the file format.

$ gzip -cd data/submission/20191201/474405.dat.gz | head -n3
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_INSTANTIATED_FLAG", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"}    {0,2,0,2,2}
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_CONSUMERS", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"}    {0,0,0,0,0,0,0,0,0,0,2,0,20,2}
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_ISIMPLEDOM_USAGE_FLAG", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"}        {2,0,0,0,2}

Running a notebook

Ensure that the data directory in the top-level directory matches the one referenced in the notebook, then run the following script.

bin/start-jupyter

This script can be modified to set various Spark configuration parameters, including the default parallelism and the amount of executor memory.
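
For reference, these map onto the spark.default.parallelism and spark.executor.memory properties; a minimal sketch of the same settings through SparkSession (the values, and the exact mechanism bin/start-jupyter uses, are assumptions):

from pyspark.sql import SparkSession

# Equivalent settings expressed in code; adjust the values to the VM size.
spark = (
    SparkSession.builder.appName("mozaggregator2bq")
    .config("spark.default.parallelism", "16")
    .config("spark.executor.memory", "8g")
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)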

Processing pg_dump into parquet

Run the following script to transform the data dumps into Parquet, with the JSON fields converted into appropriate columns and arrays.

bin/submit-local scripts/pg_dump_to_parquet.py \
    --input-dir data/submission_date/20191201 \
    --output-dir data/parquet/submission_date/20191201
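
To sanity-check the output, the Parquet can be read back with PySpark; a minimal sketch, assuming the --output-dir used above:

from pyspark.sql import SparkSession

# Read the converted output back and inspect its schema and a few rows.
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("data/parquet/submission_date/20191201")
df.printSchema()
print(df.count())
df.show(5, truncate=False)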

Running backfill

The bin/backfill script will dump data from the Postgres database, transform the data into Parquet, and load the data into a BigQuery table. The current schema for the table is as follows:

Field name      Type    Mode
ingest_date     DATE    REQUIRED
aggregate_type  STRING  NULLABLE
ds_nodash       STRING  NULLABLE
channel         STRING  NULLABLE
version         STRING  NULLABLE
os              STRING  NULLABLE
child           STRING  NULLABLE
label           STRING  NULLABLE
metric          STRING  NULLABLE
osVersion       STRING  NULLABLE
application     STRING  NULLABLE
architecture    STRING  NULLABLE
aggregate       STRING  NULLABLE

There is one table for the build_id aggregates and another for the submission_date aggregates. Build IDs are truncated to the nearest date.
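
Once loaded, the tables can be queried with the BigQuery client; a minimal sketch, assuming google-cloud-bigquery is installed, with placeholder project, dataset, and table names:

from google.cloud import bigquery

client = bigquery.Client()
# Placeholder table reference; substitute the real project, dataset, and table.
query = """
    select channel, metric, count(*) as n
    from `my-project.my_dataset.submission_date_aggregates`
    where ds_nodash = '20191201'
    group by channel, metric
    order by n desc
    limit 10
"""
for row in client.query(query).result():
    print(row.channel, row.metric, row.n)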

It may be useful to drive the backfill with a small entry-point script.

#!/bin/bash
# Backfill one month at a time from start to end, removing the local data
# directory between runs to keep disk usage bounded.
set -x
start=2015-06-01
end=2020-04-01
while ! [[ $start > $end ]]; do
    rm -rf data
    up_to=$(date -d "$start + 1 month" +%F)
    START_DS=$start END_DS=$up_to bash -x bin/backfill
    start=$up_to
done