DEPRECATED - Scripts to dump mozaggregator to BigQuery

mozaggregator2bq

A set of scripts for loading Firefox Telemetry aggregates into BigQuery. These aggregates power the Telemetry Dashboard and Evolution Viewer.

Overview

Install the required dependencies. Here, we use an n1-standard-4 Google Cloud VM running a CentOS 8 image.

sudo dnf install \
    git \
    postgresql \
    java-1.8.0-openjdk \
    jq \
    tmux

Start a new tmux session, which can be detached and later reattached from a new SSH session.

# pg_dump is included in postgresql or postgresql-client
pg_dump --version
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# if you are running jupyter notebooks
pip install -r requirements-dev.txt

Interacting with the database

To start a psql session against the read-only replica of the production Postgres instance, run the following commands. Ensure that you have the appropriate AWS credentials.

source scripts/export_credentials_s3

PGPASSWORD=$POSTGRES_PASS psql \
    --host="$POSTGRES_HOST" \
    --username="$POSTGRES_USER" \
    --dbname="$POSTGRES_DB"

An example query:

-- list all aggregates by build_id
select tablename
from pg_catalog.pg_tables
where schemaname='public' and tablename like 'build_id%';

--  build_id_aurora_0_20130414
--  build_id_aurora_0_20150128
--  build_id_aurora_0_20150329
--  build_id_aurora_1_20130203
--  build_id_aurora_1_20150604
-- ...

-- list all aggregates by submission_date
select tablename
from pg_catalog.pg_tables
where schemaname='public' and tablename like 'submission_date%';

--  submission_date_beta_1_20151027
--  submission_date_nightly_40_20151029
--  submission_date_beta_39_20151027
--  submission_date_nightly_1_20151025
--  submission_date_nightly_39_20151031
-- ...
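
The same queries can be issued from Python, for example inside a notebook; a minimal sketch using psycopg2 (assuming it is installed) and the same POSTGRES_* environment variables:

import os
import psycopg2

# Reuses the POSTGRES_* variables exported by scripts/export_credentials_s3.
conn = psycopg2.connect(
    host=os.environ["POSTGRES_HOST"],
    user=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASS"],
    dbname=os.environ["POSTGRES_DB"],
)
with conn, conn.cursor() as cur:
    cur.execute(
        "select tablename from pg_catalog.pg_tables "
        "where schemaname = 'public' and tablename like 'build_id%' limit 5"
    )
    for (tablename,) in cur.fetchall():
        print(tablename)
conn.close()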

Database dumps by aggregate type and date

To start dumping data, run the following commands.

source scripts/export_credentials_s3

time DATA_DIR=data AGGREGATE_TYPE=submission DS_NODASH=20191201 scripts/pg_dump_by_day
# 23.92s user 1.97s system 39% cpu 1:05.48 total

time DATA_DIR=data AGGREGATE_TYPE=build_id DS_NODASH=20191201 scripts/pg_dump_by_day
# 3.47s user 0.49s system 24% cpu 16.188 total
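
To dump a range of days, the same script can be driven in a loop; a minimal sketch from Python (bin/backfill, described below, does this as part of the full pipeline):

import os
import subprocess
from datetime import date, timedelta

# Dump a few consecutive days of submission aggregates by re-running the script
# with a different DS_NODASH each time. Assumes credentials are already exported.
day, end = date(2019, 12, 1), date(2019, 12, 3)
while day <= end:
    env = dict(
        os.environ,
        DATA_DIR="data",
        AGGREGATE_TYPE="submission",
        DS_NODASH=day.strftime("%Y%m%d"),
    )
    subprocess.run(["scripts/pg_dump_by_day"], env=env, check=True)
    day += timedelta(days=1)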

This should result in gzipped files in the following hierarchy.

data
├── [  96]  build_id
│   └── [ 128]  20191201
│       ├── [8.4M]  474306.dat.gz
│       └── [1.6K]  toc.dat
└── [  96]  submission
    └── [3.2K]  20191201
        ├── [ 74K]  474405.dat.gz
        ├── [ 48K]  474406.dat.gz
        ....
        ├── [1.8M]  474504.dat.gz
        └── [ 93K]  toc.dat

4 directories, 103 files

See the pg_dump documentation for details on the file format.

$ gzip -cd data/submission/20191201/474405.dat.gz | head -n3
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_INSTANTIATED_FLAG", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"}    {0,2,0,2,2}
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_CONSUMERS", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"}    {0,0,0,0,0,0,0,0,0,0,2,0,20,2}
{"os": "Windows_NT", "child": "false", "label": "", "metric": "A11Y_ISIMPLEDOM_USAGE_FLAG", "osVersion": "6.3", "application": "Firefox", "architecture": "x86"}        {2,0,0,0,2}

Running a notebook

Ensure that the data directory in the top-level directory matches the one referenced in the notebook, then run the following script.

bin/start-jupyter

This script can be modified to set various Spark configuration parameters, including the default parallelism and the amount of executor memory.
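
For reference, these map onto the spark.default.parallelism and spark.executor.memory properties; a minimal sketch of the same settings through SparkSession (the values, and the exact mechanism bin/start-jupyter uses, are assumptions):

from pyspark.sql import SparkSession

# Equivalent settings expressed in code; adjust the values to the VM size.
spark = (
    SparkSession.builder.appName("mozaggregator2bq")
    .config("spark.default.parallelism", "16")
    .config("spark.executor.memory", "8g")
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)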

Processing pg_dump into parquet

Run the following script to transform the data dumps into Parquet, with the JSON fields converted into appropriate columns and arrays.

bin/submit-local scripts/pg_dump_to_parquet.py \
    --input-dir data/submission_date/20191201 \
    --output-dir data/parquet/submission_date/20191201
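
To sanity-check the output, the Parquet can be read back with PySpark; a minimal sketch, assuming the --output-dir used above:

from pyspark.sql import SparkSession

# Read the converted output back and inspect its schema and a few rows.
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("data/parquet/submission_date/20191201")
df.printSchema()
print(df.count())
df.show(5, truncate=False)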

Running backfill

The bin/backfill script will dump data from the Postgres database, transform the data into Parquet, and load the data into a BigQuery table. The current schema for the table is as follows:

Field name      Type    Mode
ingest_date     DATE    REQUIRED
aggregate_type  STRING  NULLABLE
ds_nodash       STRING  NULLABLE
channel         STRING  NULLABLE
version         STRING  NULLABLE
os              STRING  NULLABLE
child           STRING  NULLABLE
label           STRING  NULLABLE
metric          STRING  NULLABLE
osVersion       STRING  NULLABLE
application     STRING  NULLABLE
architecture    STRING  NULLABLE
aggregate       STRING  NULLABLE

There is one table for the build_id aggregates and another for the submission_date aggregates. Build IDs are truncated to the nearest date.
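
Once loaded, the tables can be queried with the BigQuery client; a minimal sketch, assuming google-cloud-bigquery is installed, with placeholder project, dataset, and table names:

from google.cloud import bigquery

client = bigquery.Client()
# Placeholder table reference; substitute the real project, dataset, and table.
query = """
    select channel, metric, count(*) as n
    from `my-project.my_dataset.submission_date_aggregates`
    where ds_nodash = '20191201'
    group by channel, metric
    order by n desc
    limit 10
"""
for row in client.query(query).result():
    print(row.channel, row.metric, row.n)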

It may be useful to drive the backfill with a small entry-point script.

#!/bin/bash
# Backfill one month at a time from start to end, removing the local data
# directory between runs to keep disk usage bounded.
set -x
start=2015-06-01
end=2020-04-01
while ! [[ $start > $end ]]; do
    rm -rf data
    up_to=$(date -d "$start + 1 month" +%F)
    START_DS=$start END_DS=$up_to bash -x bin/backfill
    start=$up_to
done