
Architecture

The diagram above shows the relationship between different parts of the Operational Monitoring system. At a high level, data flows through the system in the following way:

  1. Users create project config files, as described in the opmon-config documentation on dtmo.
  2. Two daily Airflow jobs process the config files: one generates and runs the ETL, the other generates the LookML for views, explores, and dashboards.
  3. Updated LookML dashboards and explores become available once per day; loading them computes aggregates on the fly by referencing the relevant BigQuery tables.

Below we will dive deeper into what's happening under the hood.

Generating the queries

OpMon pulls in the project configuration files and uses Jinja templates to generate the SQL queries that process the relevant data. At a high level, it performs the same steps for each project file, generating a separate query for each data type (e.g. scalars, histograms). A simplified sketch of such a template is shown below.
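
For illustration only, a Jinja-templated scalar query might be shaped roughly like this; `source_table`, `dimensions`, `metrics`, and the column names are stand-ins for values read from the project config, not the actual template:

```sql
-- Hypothetical, simplified sketch of a Jinja-templated scalar query.
-- {{ source_table }}, {{ dimensions }}, and {{ metrics }} would be
-- filled in from the project config file.
SELECT
  client_id,
  branch,
  {% for dimension in dimensions %}
  {{ dimension }},
  {% endfor %}
  {% for metric in metrics %}
  SUM({{ metric }}) AS {{ metric }}{{ "," if not loop.last }}
  {% endfor %}
FROM `{{ source_table }}`
WHERE submission_date = @submission_date
GROUP BY
  client_id,
  branch
  {% for dimension in dimensions %}
  , {{ dimension }}
  {% endfor %}
```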

The aggregations done in this ETL are at the client level. The queries are grouped by branch, x-axis value (build ID or submission date), and the dimensions listed in the project config. Normalization is applied so that clients with many submissions count only once; a sketch of the scalar case follows the list below.

  • For histograms, this is done by summing the values in each bucket and then dividing each bucket by the total number of submissions. For more detail, see histogram_normalized_sum()
  • For scalars, this is done by computing an aggregate function (e.g. sum, avg) for each client
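
To make the scalar case concrete, here is a minimal sketch of the shape of such a query; the table and column names (`scalar_metrics`, `scalar_value`, and so on) are hypothetical, not the ones OpMon actually generates:

```sql
-- Sketch of client-level scalar aggregation (hypothetical names).
-- Each client contributes a single aggregated value per grouping,
-- so clients with many submissions count only once.
WITH per_client AS (
  SELECT
    client_id,
    branch,
    build_id,
    SUM(scalar_value) AS value   -- per-client aggregate (sum here)
  FROM `project.dataset.scalar_metrics`
  GROUP BY client_id, branch, build_id
)
SELECT
  branch,
  build_id,
  AVG(value) AS avg_value        -- aggregate across clients
FROM per_client
GROUP BY branch, build_id
```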

Although a single set of SQL templates is used, the data is stored and represented slightly differently depending on whether build IDs or submission dates are on the x-axis (see the sketch after this list).

  • For build IDs on the x-axis, the tables are partitioned by submission date, but older partitions exist only as backups. Only the most recent submission date is of interest, since it includes all relevant builds to be graphed.
  • For submission dates on the x-axis, the tables are also partitioned by submission date, but older partitions are not backups: each previous submission date holds a date that needs to be graphed.
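
The two cases therefore lead to different partition filters. A rough sketch of each, using hypothetical table and column names:

```sql
-- Build IDs on the x-axis: read only the newest partition;
-- it already contains every build to be graphed.
SELECT build_id, branch, value
FROM `project.dataset.opmon_scalar`
WHERE submission_date = (
  SELECT MAX(submission_date) FROM `project.dataset.opmon_scalar`
);

-- Submission dates on the x-axis: read a range of partitions,
-- one per date to be graphed.
SELECT submission_date, branch, value
FROM `project.dataset.opmon_scalar`
WHERE submission_date BETWEEN @start_date AND @end_date;
```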

Percentile results are computed in the publicly-facing views on top of each result table. Computing the percentiles with a jackknife UDF requires the result values to be bucketed.
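
As a sketch of what such a view could look like: `udf.jackknife_percentile_ci` below is a stand-in for the real jackknife UDF, whose actual name and signature live in the OpMon codebase, and the table and column names are likewise hypothetical:

```sql
-- Rough sketch, not the actual view definition. The bucketed values
-- are aggregated into an array that the (hypothetical) jackknife UDF
-- consumes to produce percentile estimates.
CREATE OR REPLACE VIEW `project.dataset.opmon_scalar_view` AS
SELECT
  branch,
  build_id,
  `udf.jackknife_percentile_ci`(
    [25.0, 50.0, 75.0, 95.0],                -- percentiles to compute
    ARRAY_AGG(STRUCT(bucket, value_count))   -- bucketed result values
  ) AS percentiles
FROM `project.dataset.opmon_scalar`
GROUP BY branch, build_id
```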

The Operational Monitoring DAG runs once per day in Airflow.

A separate table is generated for each operational monitoring project and data type. For example, a given project will have one table for scalars, which might combine scalars pulled from a variety of different BigQuery tables.
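
For illustration, a scalar table for a hypothetical project could be assembled along these lines (all project, table, and metric names here are made up):

```sql
-- Sketch: one scalar table per project, combining scalars pulled
-- from several different source tables.
CREATE OR REPLACE TABLE `project.operational_monitoring.my_project_scalar` AS
SELECT client_id, branch, build_id, 'gc_ms' AS metric, gc_ms AS value
FROM `project.dataset.source_table_a`
UNION ALL
SELECT client_id, branch, build_id, 'crash_count' AS metric, crash_count AS value
FROM `project.dataset.source_table_b`
```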

LookML Generator

Looker dashboards, views, and explores for Operational Monitoring are generated via lookml-generator, which runs daily as part of the probe_scraper DAG. Each run performs the following steps:

  1. A view is generated for each result table. The view contains the dimensions (e.g. metric, branch, build_id) and a measure that computes percentiles.
  2. Explores are generated for each view; these include Looker aggregate tables for each graph shown in the default view of a dashboard.
  3. Dashboards are generated for each project.

Output

Below is an example of a dashboard generated by Operational Monitoring for the Fission Release Rollout.

Note that the dropdowns shown, aside from Percentile, are generated based on the project definition. There is one dropdown for each dimension specified, and it is populated by querying the unique values of that dimension. The default value for each dropdown is the most common value found in the table; a sketch of such a query is shown below.
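
A minimal sketch of how a dimension dropdown could be populated, assuming a hypothetical dimension `os` and result table:

```sql
-- The distinct values become the dropdown options, and the most
-- common value becomes the default.
SELECT
  os,              -- the dimension backing the dropdown
  COUNT(*) AS n
FROM `project.dataset.opmon_scalar`
GROUP BY os
ORDER BY n DESC    -- first row is the default value
```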