Below we will dive deeper into what's happening under the hood.
## Generating the queries
OpMon pulls in the project configuration files and uses [Jinja](https://jinja.palletsprojects.com/en/3.0.x/) templates to generate different SQL queries to process the relevant data. At a high level, it performs the following steps for each project file:
First, metric aggregations are computed at the client level. The results are grouped by branch, x-axis value (build ID or submission date), and the dimensions listed in the project config.
Normalization is done so that clients with many submissions only count once:
* For histograms, this is done by summing the values for each bucket and then dividing each bucket by the total number of submissions. For more detail, see [`histogram_normalized_sum()`](https://github.com/mozilla/bigquery-etl/blob/main/sql/mozfun/glam/histogram_normalized_sum/udf.sql).
* For scalars, this is done by computing an aggregate function for each client (e.g. sum, avg); see the sketch below.
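As an illustration of the scalar case, the following minimal sketch aggregates the `ad_click` metric per client (the table and metric are the same ones used in the generated queries further down; the exact SQL OpMon generates differs):
```sql
-- Sketch only: aggregate a scalar metric per client so that a client
-- submitting many rows on a given day contributes a single value.
SELECT
client_id,
submission_date,
SUM(ad_click) AS ad_click -- aggregate function configured for the metric (sum, avg, ...)
FROM
mozdata.search.search_clients_engines_sources_daily
GROUP BY
client_id,
submission_date
```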
The metric queries first determine the clients that should be part of the analysis (= the population). The set of clients varies based on whether the OpMon project is monitoring rollouts, experiments, or the overall population of a product. The query for determining the clients that are part of a specific experiment can look as follows:
```sql
WITH population AS (
SELECT
DATE(submission_date) AS submission_date,
client_id AS client_id,
NULL AS build_id,
CAST(normalized_os AS STRING) AS os, -- dimension clients will be segmented by
mozfun.map.get_key(
environment.experiments,
"tcp-rollout-phase-2-demo" -- slug of OpMon project
).branch AS branch,
FROM (
--------------- custom data source as defined in config ----------------
WITH clients_per_branch AS (
SELECT
client_id,
sample_id,
DATE(submission_timestamp) AS submission_date,
application.build_id AS build_id,
normalized_channel,
normalized_os,
normalized_country_code,
CASE payload.processes.parent.scalars.privacy_dfpi_rollout_tcp_by_default_feature
WHEN TRUE THEN "enabled"
ELSE "disabled"
END AS branch
FROM `moz-fx-data-shared-prod.telemetry.main`,
UNNEST(environment.experiments) AS experiment
WHERE
(normalized_channel = "beta") AND
experiment.key IN (
"total-cookie-protection-beta-roll-out-phase-ii-switch-on-to-new-users",
"total-cookie-protection-roll-out-to-users-phase-ii-switch-on-by-default"
)
)
SELECT
client_id,
submission_date,
build_id,
normalized_channel,
normalized_os,
normalized_country_code,
STRUCT ( -- this is the structure opmon expects
[
STRUCT (
"tcp-rollout-phase-2-demo" AS key, -- dummy experiment/rollout slug to make opmon happy
STRUCT(branch AS branch) AS value
)
] AS experiments
) AS environment
FROM clients_per_branch
WHERE branch IS NOT NULL
-------------- end of custom data source config ---------------------
)
WHERE
DATE(submission_date) = DATE('2022-06-08 00:00:00+00:00')
GROUP BY
submission_date,
client_id,
build_id,
os,
branch
)
SELECT * FROM population
```
For each client, metrics are subsequently computed with a query like the following:
```sql
WITH merged_metrics_main AS (
SELECT
DATE(submission_timestamp) AS submission_date,
client_id AS client_id,
p.population_build_id AS build_id,
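-- collect every histogram value this client submitted; these get merged and normalized later in the statistics step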
ARRAY_AGG(mozfun.hist.extract(payload.processes.content.histograms.perf_first_contentful_paint_ms) IGNORE NULLS) AS perf_first_contentful_paint_ms,
FROM
mozdata.telemetry.main_1pct
RIGHT JOIN
(
SELECT
client_id AS population_client_id,
submission_date AS population_submission_date,
build_id AS population_build_id
FROM
population
) AS p
ON
DATE(submission_timestamp) = p.population_submission_date AND
client_id = p.population_client_id
WHERE
DATE(submission_timestamp) = DATE('2022-06-08 00:00:00+00:00')
GROUP BY
submission_date,
build_id,
client_id
),
merged_metrics_search_clients_engines_sources_daily AS (
SELECT
DATE(submission_date) AS submission_date,
client_id AS client_id,
p.population_build_id AS build_id,
SUM(ad_click) AS ad_click,
FROM
mozdata.search.search_clients_engines_sources_daily
RIGHT JOIN
(
SELECT
client_id AS population_client_id,
submission_date AS population_submission_date,
build_id AS population_build_id
FROM
population
) AS p
ON
submission_date = p.population_submission_date AND
client_id = p.population_client_id
WHERE
DATE(submission_date) = DATE('2022-06-08 00:00:00+00:00')
GROUP BY
submission_date,
build_id,
client_id
),
-- a CTE is generated for each data source
-- ... merged_metrics_[...] AS ()
-- combine the metrics from all the data sources
joined_metrics AS (
SELECT
population.submission_date AS submission_date,
population.client_id AS client_id,
population.build_id,
population.os AS os,
population.branch AS branch,
perf_first_contentful_paint_ms,
FROM population
LEFT JOIN merged_metrics_main
ON
merged_metrics_main.submission_date = population.submission_date AND
merged_metrics_main.client_id = population.client_id AND
(population.build_id IS NULL OR merged_metrics_main.build_id = population.build_id)
LEFT JOIN merged_metrics_search_clients_engines_sources_daily
ON
merged_metrics_search_clients_engines_sources_daily.submission_date = population.submission_date AND
merged_metrics_search_clients_engines_sources_daily.client_id = population.client_id AND
(population.build_id IS NULL OR merged_metrics_search_clients_engines_sources_daily.build_id = population.build_id)
)
SELECT * FROM joined_metrics
```
The per-client metric results are written to BigQuery tables in the `operational_monitoring_derived` dataset. The tables are named as follows: `<slug_with_underscores>_v1`.
The per-client metric results are used in the next step of the ETL to compute statistics. Statistics reduce the observations of many clients to one or more rows describing the population.
The SQL executed to compute statistics depends on which statistics have been configured for each metric, but it looks similar to the following:
```sql
SELECT
submission_date,
build_id,
branch,
ARRAY<STRUCT<
metric STRING,
statistic STRING,
point FLOAT64,
lower FLOAT64,
upper FLOAT64,
parameter STRING
>>[
STRUCT(
"install_volume_win8_1" AS metric,
"sum" AS statistic,
SUM(install_volume_win8_1) AS point,
NULL AS lower,
NULL AS upper,
NULL AS parameter
)
] AS install_volume_win8_1_sum
FROM `operational_monitoring.<slug_with_underscores>_v1`
GROUP BY
submission_date,
build_id,
branch
```
Computing percentile statistics is a little more complicated than computing the other supported statistics: it involves bucketing result values and using a [Jackknife UDF](https://github.com/mozilla/opmon/blob/main/opmon/templates/jackknife_percentile_ci_udf.sql).
The ETL that is executed looks similar to the following:
```sql
-- CREATE TEMP FUNCTION jackknife_percentile_ci
-- see https://github.com/mozilla/opmon/blob/main/opmon/templates/jackknife_percentile_ci_udf.sql
-- CREATE TEMPORARY FUNCTION merge_histogram_values
-- see https://github.com/mozilla/opmon/blob/main/opmon/templates/merge_histogram_values_udf.sql
WITH filtered_metrics AS (
SELECT *
FROM `operational_monitoring.<slug_with_underscores>_v1`
WHERE DATE(submission_date) >= DATE("2022-06-08 00:00:00+00:00")
),
-- bucket scalar metrics that use percentile
-- histogram metrics are already bucketed so this step is not necessary there
log_min_max AS (
SELECT
NULL AS dummy,
LOG(IF(MIN(ad_click) <= 0, 1, MIN(ad_click)), 2) ad_click_log_min,
LOG(IF(MAX(ad_click) <= 0, 1, MAX(ad_click)), 2) ad_click_log_max,
FROM
filtered_metrics
LIMIT 1
),
buckets_by_metric AS (
SELECT
NULL AS dummy,
ARRAY(SELECT FORMAT("%.*f", 2, bucket) FROM UNNEST(
mozfun.glam.histogram_generate_scalar_buckets(ad_click_log_min, ad_click_log_max, 100)
) AS bucket ORDER BY bucket) AS ad_click_buckets,
FROM log_min_max
)
-- compute statistics
SELECT
submission_date,
build_id,
os,
branch,
-- computing percentile for a scalar statistic
-- the scalar values are transformed into a histogram data structure
jackknife_percentile_ci(
[25, 50, 75, 95], -- percentiles to compute
merge_histogram_values(
ARRAY_CONCAT_AGG(
histogram_normalized_sum(
[IF(False AND ad_click IS NULL,
NULL,
STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>(
[STRUCT<key FLOAT64, value FLOAT64>(
COALESCE(
mozfun.glam.histogram_bucket_from_value(
ad_click_buckets,
SAFE_CAST(ad_click AS FLOAT64)
), 0.0
), 1.0
)]
))], 1.0
)
)
),
"ad_click"
) AS ad_click_percentile,
-- computing percentile for a histogram metric
jackknife_percentile_ci(
[25, 50, 75, 95],
merge_histogram_values(
ARRAY_CONCAT_AGG(
histogram_normalized_sum(perf_first_contentful_paint_ms, 1.0)
)
),
"perf_first_contentful_paint_ms"
)
AS perf_first_contentful_paint_ms_percentile
FROM `operational_monitoring.<slug_with_underscores>_v1`
CROSS JOIN buckets_by_metric
WHERE submission_date = DATE("2022-06-08 00:00:00+00:00")
GROUP BY
submission_date,
build_id,
os,
branch
```
Although a single set of SQL templates is used, the data is aggregated in slightly different ways depending on whether build IDs or submission dates are on the x-axis:
* For submission dates on the x-axis, the submission date is used as the partition.
* For build IDs on the x-axis, the submission date is also used as the partition, but for each build ID up to 30 days of previous data is aggregated: each partition/submission date includes data for the builds released within the last 30 days (see the sketch below).
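As a rough sketch of the build ID case, assuming Firefox-style `YYYYMMDDhhmmss` build IDs (this is an illustration, not the exact filter the templates generate):
```sql
-- Sketch only: with build IDs on the x-axis, a partition (submission date)
-- keeps only the builds released within the preceding 30 days.
SELECT
submission_date,
build_id,
branch
FROM
`operational_monitoring.<slug_with_underscores>_v1`
WHERE
submission_date = DATE("2022-06-08")
AND PARSE_DATE('%Y%m%d', SUBSTR(build_id, 1, 8))
BETWEEN DATE_SUB(DATE("2022-06-08"), INTERVAL 30 DAY) AND DATE("2022-06-08")
```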
The [Operational Monitoring DAG](https://workflow.telemetry.mozilla.org/tree?dag_id=operational_monitoring) runs once per day in Airflow.
For each operational monitoring project, separate tables are generated for metrics, statistics, and alerts.
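For example, for the `tcp-rollout-phase-2-demo` project used above this could result in tables along the lines of the following (the statistics and alerts suffixes are assumptions for illustration, not taken from the generator):
```sql
-- Hypothetical table names for a single project; the suffixes are assumptions
SELECT * FROM `operational_monitoring_derived.tcp_rollout_phase_2_demo_v1`;            -- per-client metrics
SELECT * FROM `operational_monitoring_derived.tcp_rollout_phase_2_demo_statistics_v1`; -- statistics
SELECT * FROM `operational_monitoring_derived.tcp_rollout_phase_2_demo_alerts_v1`;     -- alerts
```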
## LookML Generator