opmon

История

Anna Scholtz 64eaf95515 Load configs from metric-hub		2023-01-23 15:44:34 -08:00
..
adr	Generate alert view	2022-05-16 14:56:12 -07:00
images	Update README	2022-03-31 14:08:02 -07:00
README.md	Load configs from metric-hub	2023-01-23 15:44:34 -08:00
troubleshooting.md	Load configs from metric-hub	2023-01-23 15:44:34 -08:00

README.md

Architecture

The diagram above shows the relationship between different parts of the Operational Monitoring system. At a high level, data flows through the system in the following way:

Users create project config files as described on dtmo in opmon/ in metric-hub
Two daily jobs run in Airflow to process the config files: one to generate + run the ETL and one to generate LookML for views/explores/dashboards
Updated LookML dashboards and explores are available once per day and loading them runs aggregates on the fly by referencing relevant BigQuery tables.

Below we will dive deeper into what's happening under the hood.

Generating the queries

OpMon pulls in the project configuration files and uses Jinja templates to generate different SQL queries to process the relevant data. At a high level, it works by doing the following steps for each project file:

First metric aggregations are computed at the client-level. The results are grouped by the branch, x-axis value (build id or submission date), and dimensions listed in the project config. Normalization is done so that clients with many submissions would only count once.

The metric queries first determine the clients that should be part of the analysis (=population). The clients vary based on whether the OpMon project is monitoring rollouts, experiments or the overall population of a product. The query for determining clients which are part of a specific experiment can look as follows:

WITH population AS (
    SELECT
        DATE(submission_date) AS submission_date,
        client_id AS client_id,
        NULL AS build_id,
        CAST(normalized_os AS STRING) AS os, -- dimension clients will be segmented by
        mozfun.map.get_key(
          environment.experiments,
          "tcp-rollout-phase-2-demo"  -- slug of OpMon project
        ).branch AS branch,
    FROM (
        --------------- custom data source as defined in config ----------------
        WITH clients_per_branch AS (
            SELECT 
                client_id,
                sample_id,
                DATE(submission_timestamp) AS submission_date,
                application.build_id AS build_id,
                normalized_channel,
                normalized_os,
                normalized_country_code,
                CASE payload.processes.parent.scalars.privacy_dfpi_rollout_tcp_by_default_feature 
                    WHEN TRUE THEN "enabled"
                    ELSE "disabled"
                END AS branch
            FROM `moz-fx-data-shared-prod.telemetry.main`,
                UNNEST(environment.experiments) AS experiment
            WHERE 
                (normalized_channel = "beta") AND
                experiment.key IN (
                    "total-cookie-protection-beta-roll-out-phase-ii-switch-on-to-new-users", 
                    "total-cookie-protection-roll-out-to-users-phase-ii-switch-on-by-default"
                )
        )
        SELECT 
        client_id,
        submission_date,
        build_id,
        normalized_channel,
        normalized_os,
        normalized_country_code,
        STRUCT (    -- this is the structure opmon expects
            [
            STRUCT (
                "tcp-rollout-phase-2-demo" AS key,   -- dummy experiment/rollout slug to make opmon happy
                STRUCT(branch AS branch) AS value
            )
            ] AS experiments
        ) AS environment
        FROM clients_per_branch 
        WHERE branch IS NOT NULL
        -------------- end of custom data source config ---------------------
    )
    WHERE
        DATE(submission_date) = DATE('2022-06-08 00:00:00+00:00')
    GROUP BY
        submission_date,
        client_id,
        build_id,
        os,
        branch
)
SELECT * FROM population

For each client, metrics are subsequently getting computed like:

WITH merged_metrics_main AS (
    SELECT
        DATE(DATE(submission_timestamp)) AS submission_date,
        client_id AS client_id,
        p.population_build_id AS build_id,
        ARRAY_AGG(mozfun.hist.extract(payload.processes.content.histograms.perf_first_contentful_paint_ms) IGNORE NULLS) AS perf_first_contentful_paint_ms,
    FROM
        mozdata.telemetry.main_1pct
    RIGHT JOIN
        ( 
            SELECT
                client_id AS population_client_id,
                submission_date AS population_submission_date,
                build_id AS population_build_id
            FROM
              population
        ) AS p
    ON
        DATE(submission_timestamp) = p.population_submission_date AND
        client_id = p.population_client_id
    WHERE
        DATE(DATE(submission_timestamp)) = DATE('2022-06-08 00:00:00+00:00')
    GROUP BY
        submission_date,
        build_id,
        client_id
),
merged_metrics_search_clients_engines_sources_daily AS (
    SELECT
        DATE(submission_date) AS submission_date,
        client_id AS client_id,
        p.population_build_id AS build_id,
        SUM(ad_click) AS ad_click,
    FROM
        mozdata.search.search_clients_engines_sources_daily
    RIGHT JOIN
        ( 
            SELECT
                client_id AS population_client_id,
                submission_date AS population_submission_date,
                build_id AS population_build_id
            FROM
              population
        ) AS p
    ON
        submission_date = p.population_submission_date AND
        client_id = p.population_client_id
    WHERE
        DATE(submission_date) = DATE('2022-06-08 00:00:00+00:00')
    GROUP BY
        submission_date,
        build_id,
        client_id
),
-- a CTE is generated for each data source
-- ... merged_metrics_[...] AS ()

-- combine the metrics from all the data sources
joined_metrics AS (
  SELECT
    population.submission_date AS submission_date,
    population.client_id AS client_id,
    population.build_id,
    population.os AS os,
    population.branch AS branch,
    perf_first_contentful_paint_ms,
  FROM population
  LEFT JOIN merged_metrics_main
  ON 
    merged_metrics_main.submission_date = population.submission_date AND
    merged_metrics_main.client_id = population.client_id AND
    (population.build_id IS NULL OR merged_metrics_main.build_id = population.build_id)
  LEFT JOIN merged_metrics_search_clients_engines_sources_daily
  ON 
    merged_metrics_search_clients_engines_sources_daily.submission_date = population.submission_date AND
    merged_metrics_search_clients_engines_sources_daily.client_id = population.client_id AND
    (population.build_id IS NULL OR merged_metrics_search_clients_engines_sources_daily.build_id = population.build_id)
)

SELECT * FROM joined_metrics

The per-client metric results are written to BigQuery tables in the operational_monitoring_derived dataset. The tables are named as follows: <slug_with_underscores>_v1.

The per-client metric results are used in the next step of the ETL to compute statistics. Statistics reduce observations of many clients to one or many rows describing the population. The SQL executed to compute statistics depends on what statistics have been configured for each metric. But looks similar to the following:

SELECT
    submission_date,
    build_id,
    branch,
    ARRAY<STRUCT<
            metric STRING,
            statistic STRING,
            point FLOAT64,
            lower FLOAT64,
            upper FLOAT64,
            parameter STRING
        >>[
        STRUCT(
            "install_volume_win8_1" AS metric,
            "sum" AS statistic,
            SUM(install_volume_win8_1) AS point,
            NULL AS lower,
            NULL AS upper,
            NULL AS parameter
        )
    ] AS install_volume_win8_1_sum
FROM `operational_monitoring.<slug_with_underscores>_v1`

Computing percentile statistics is a little more complicated than other supported statistics and involves bucketing results values and using a Jackknife UDF. The ETL that is executed looks similar to the following:

-- CREATE TEMP FUNCTION jackknife_percentile_ci 
-- see https://github.com/mozilla/opmon/blob/main/opmon/templates/jackknife_percentile_ci_udf.sql

-- CREATE TEMPORARY FUNCTION merge_histogram_values
-- see https://github.com/mozilla/opmon/blob/main/opmon/templates/merge_histogram_values_udf.sql

WITH filtered_metrics AS (
    SELECT *
    FROM `operational_monitoring.<slug_with_underscores>_v1`
    WHERE DATE(submission_date) >= DATE("2022-06-08 00:00:00+00:00")
),

-- bucket scalar metrics that use percentile
-- histogram metrics are already bucketed so this step is not necessary there
log_min_max AS (
  SELECT
    NULL AS dummy,
    LOG(IF(MIN(ad_click) <= 0, 1, MIN(ad_click)), 2) ad_click_log_min,
    LOG(IF(MAX(ad_click) <= 0, 1, MAX(ad_click)), 2) ad_click_log_max,
  FROM
    filtered_metrics
  LIMIT 1
),

buckets_by_metric AS (
  SELECT 
    NULL AS dummy,
    ARRAY(SELECT FORMAT("%.*f", 2, bucket) FROM UNNEST(
        mozfun.glam.histogram_generate_scalar_buckets(ad_click_log_min, ad_click_log_max, 100)
    ) AS bucket ORDER BY bucket) AS ad_click_buckets,
  FROM log_min_max
)

-- compute statistics
SELECT
    submission_date,
    build_id,
    os,
    branch,

    -- computing percentile for a scalar statistic
    -- the scalar values are transformed into a histogram data structure
    jackknife_percentile_ci(
        [25, 50, 75, 95],
        merge_histogram_values(
            ARRAY_CONCAT_AGG(
                histogram_normalized_sum(
                    [IF(False AND ad_click IS NULL,
                        NULL,
                        STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>(
                        [STRUCT<key FLOAT64, value FLOAT64>(
                            COALESCE(
                                mozfun.glam.histogram_bucket_from_value(
                                    ad_click_buckets,
                                    SAFE_CAST(ad_click AS FLOAT64)
                                ), 0.0
                            ), 1.0
                        )]
                    ))], 1.0
                )
            )
        ),
        "ad_click"
    ) AS ad_click_percentile,

    -- computing percentile for a histogram metric
    jackknife_percentile_ci(
        [25, 50, 75, 95],
        merge_histogram_values(
            ARRAY_CONCAT_AGG(
                histogram_normalized_sum(perf_first_contentful_paint_ms, 1.0)
            )
        ),
        "perf_first_contentful_paint_ms"
    )
    AS perf_first_contentful_paint_ms_percentile
FROM `operational_monitoring.<slug_with_underscores>_v1`
CROSS JOIN buckets_by_metric
WHERE submission_date = DATE("2022-06-08 00:00:00+00:00")
GROUP BY
    submission_date,
    build_id,
    os,
    branch

Although a single set of SQL templates is used, in order to support both build IDs and submission dates on the x-axis, the data is aggregated in slightly different ways for each case.

For submission dates on the x-axis, submission date is also used as the partition.
For build IDs on the x-axis, submission date is used as the partition but for each build ID up to 30 days of previous data will be aggregated. Each partition/submission date will include data of the builds that were released in the last 30 days.

The Operational Monitoring DAG runs once per day in Airflow.

A separate table is generated for each operational monitoring project + metrics, statistics and alerts.

LookML Generator

Looker dashboards, views and explores for Operational Monitoring are generated via lookml-generator which runs daily as part of the probe_scraper DAG. Each run performs the following steps:

A view is generated for each result table. The view contains the dimensions (e.g. metric, branch, build_id) and a measure that computes percentiles
Explores are generated for each view, these include Looker aggregate tables for each graph shown in the default view of a dashboard
Dashboards are generated for each project

Output

Below is an example of a dashboard generated by Operational Monitoring for the Fission Release Rollout

Note that the dropdowns shown, aside from Percentile are generated based on the project definition. There is one dropdown for each dimension specified and it is populated by querying unique values for that dimension. The default value for each dropdown is the most common value found in the table.