Operational Monitoring (OpMon) πŸ“ˆ

To understand how to use Operational Monitoring, see the documentation at dtmo.

Overview

The diagram above shows the relationship between different parts of the Operational Monitoring system. At a high level, data flows through the system in the following way:

  1. Users create project definition files as described on dtmo (a simplified sketch follows this list)
  2. Two daily jobs run in Airflow to process the config files: one to generate and run the ETL, and one to generate LookML for views/explores/dashboards
  3. Updated LookML dashboards and explores are available once per day; loading them computes aggregates on the fly by referencing the relevant BigQuery tables.
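
For orientation, here is a minimal sketch of what a parsed project definition might contain. The field names are hypothetical and simplified; the authoritative schema is described in the dtmo documentation.

```python
# Hypothetical, simplified view of a parsed project definition.
# Field names are illustrative; see the dtmo documentation for the real schema.
project_definition = {
    "name": "Fission Release Rollout",    # dashboard title
    "xaxis": "build_id",                  # or "submission_date"
    "branches": ["enabled", "disabled"],  # rollout/experiment branches to compare
    "dimensions": ["os"],                 # one dashboard dropdown per dimension
    "data_sources": ["main"],             # tables the probes are pulled from
    "probes": [
        {"name": "checkerboard_severity", "type": "histogram"},
        {"name": "subsession_length", "type": "scalar"},
    ],
}
```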

Below we will dive deeper into what's happening under the hood.

ETL Generator

The ETL generator takes the project definition files as input and uses Jinja templates to generate different SQL queries to process the relevant data. At a high level, it performs the following steps for each project file:

  1. Check if the project file was updated since the last time its SQL queries were generated - if so, regenerate.

  2. For each data source in the analysis list in the project definition:

    a. For each data type (e.g. scalars, histograms), generate a separate query (a minimal sketch of this loop follows the list)

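A minimal sketch of that loop, assuming hypothetical helper and template names (the real generator lives in the opmon package):

```python
from pathlib import Path
from jinja2 import Environment, FileSystemLoader

# Hypothetical template location and names; the real ones live in the opmon package.
env = Environment(loader=FileSystemLoader("opmon/templates"))
DATA_TYPES = ["scalar", "histogram"]

def generate_queries(project_definition, config_path: Path, last_generated: float):
    """Render one SQL query per (data source, data type) for a single project."""
    # 1. Skip projects whose definition file has not changed since the last run.
    if config_path.stat().st_mtime <= last_generated:
        return []

    queries = []
    # 2. One query per data source listed in the project definition...
    for data_source in project_definition["data_sources"]:
        # 2a. ...and per data type, each rendered from its own Jinja template.
        for data_type in DATA_TYPES:
            template = env.get_template(f"{data_type}_query.sql")
            queries.append(
                template.render(project=project_definition, data_source=data_source)
            )
    return queries
```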
The aggregations done in this ETL are at the client level. The queries are grouped by branch, x-axis value (build ID or submission date), and the dimensions listed in the project config. Normalization is applied so that clients with many submissions count only once (a small illustration follows the list below).

  • For histograms, this is done by summing up values for each bucket then dividing each bucket by the total number of submissions. For more detail, see histogram_normalized_sum()
  • For scalars, this is done by computing an aggregate function for each client (e.g. sum, avg, etc)

Although the ETL uses a single set of SQL templates, the data is stored and represented slightly differently depending on whether build IDs or submission dates are on the x-axis (see the sketch after the list below).

  β€’ For build IDs on the x-axis, submission date is used as the partition, but previous submission dates serve only as backups. The most recent submission date is the only one of interest, since it includes all relevant builds to be graphed.
  β€’ For submission dates on the x-axis, submission date is also used as the partition, but previous partitions are not backups: they contain the dates that need to be graphed.

ETL Runner

The Operational Monitoring DAG runs once per day in Airflow.

A separate table is generated for each Operational Monitoring project and data type. For example, a given project will have one table for scalars, which might consist of scalars pulled in from a variety of different tables in BigQuery.
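
For example, under a hypothetical naming scheme the runner could write one destination table per project and data type, with each table unioning data from the relevant source tables:

```python
# Hypothetical destination-table naming: one table per project and data type.
project_slug = "fission_release_rollout"
destination_tables = {
    data_type: f"operational_monitoring.{project_slug}_{data_type}"
    for data_type in ["scalar", "histogram"]
}
# {'scalar': 'operational_monitoring.fission_release_rollout_scalar',
#  'histogram': 'operational_monitoring.fission_release_rollout_histogram'}
```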

LookML Generator

Specific LookML is generated for Operational Monitoring. The code for this lives in the lookml-generator repo and runs daily as part of the probe_scraper DAG. Each run performs the following steps:

  1. A view is generated for each table output by the ETL runner. The view contains the dimensions (e.g. metric, branch, build_id) and a measure that computes percentiles (a rough sketch follows this list)
  2. Explores are generated for each view; these include Looker aggregate tables for each graph shown in the default view of a dashboard
  3. Dashboards are generated for each project
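
As a rough sketch, a generated view can be thought of as the following Python structure, which a generator could serialize to LookML (for example with the lkml library). The names and fields are illustrative, not the exact output of lookml-generator:

```python
# Illustrative only: the shape of a generated view, not lookml-generator's exact output.
view = {
    "view": {
        "name": "fission_release_rollout_scalar",
        "sql_table_name": "operational_monitoring.fission_release_rollout_scalar",
        "dimensions": [
            {"name": "metric", "type": "string", "sql": "${TABLE}.metric"},
            {"name": "branch", "type": "string", "sql": "${TABLE}.branch"},
            {"name": "build_id", "type": "string", "sql": "${TABLE}.build_id"},
        ],
        "measures": [
            # The percentile measure computed over the aggregated client values.
            {"name": "percentile", "type": "percentile", "sql": "${TABLE}.value"},
        ],
    }
}
```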

Output

Below is an example of a dashboard generated by Operational Monitoring for the Fission Release Rollout.

Note that the dropdowns shown, aside from Percentile, are generated based on the project definition. There is one dropdown for each dimension specified, and it is populated by querying unique values for that dimension. The default value for each dropdown is the most common value found in the table (a sketch of such a query follows).
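
A sketch of the kind of query that could populate one such dropdown and choose its default, with hypothetical table and column names:

```python
# Hypothetical sketch: list a dimension's unique values and find the most common one.
dimension = "os"
dropdown_query = f"""
SELECT {dimension} AS value, COUNT(*) AS n
FROM operational_monitoring.fission_release_rollout_scalar
GROUP BY {dimension}
ORDER BY n DESC
"""
# Every returned value becomes a dropdown option; the first row (the most
# common value) is used as the dropdown's default.
```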