Below we will dive deeper into what's happening under the hood.
## Generating the queries
OpMon pulls in the project configuration files and uses [Jinja](https://jinja.palletsprojects.com/en/3.0.x/) templates to generate different SQL queries to process the relevant data. At a high level, it performs the following steps for each project file:
First, metric aggregations are computed at the client level. The results are grouped by branch, x-axis value (build ID or submission date), and the dimensions listed in the project config.
Normalization is done so that clients with many submissions only count once:
* For histograms, this is done by summing the values for each bucket and then dividing each bucket by the total number of submissions. For more detail, see [`histogram_normalized_sum()`](https://github.com/mozilla/bigquery-etl/blob/main/sql/mozfun/glam/histogram_normalized_sum/udf.sql).
* For scalars, this is done by computing an aggregate function for each client (e.g. sum, avg); see the sketch below.
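As an illustration of the scalar case, the following minimal sketch aggregates the `ad_click` metric per client (the table and metric are the same ones used in the generated queries further down; the exact SQL OpMon generates differs):
```sql
-- Sketch only: aggregate a scalar metric per client so that a client
-- submitting many rows on a given day contributes a single value.
SELECT
client_id,
submission_date,
SUM(ad_click) AS ad_click -- aggregate function configured for the metric (sum, avg, ...)
FROM
mozdata.search.search_clients_engines_sources_daily
GROUP BY
client_id,
submission_date
```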
The metric queries first determine the clients that should be part of the analysis (= the population). The set of clients varies based on whether the OpMon project is monitoring rollouts, experiments, or the overall population of a product. The query for determining the clients that are part of a specific experiment can look as follows:
```sql
WITH population AS (
SELECT
DATE(submission_date) AS submission_date,
client_id AS client_id,
NULL AS build_id,
CAST(normalized_os AS STRING) AS os, -- dimension clients will be segmented by
mozfun.map.get_key(
environment.experiments,
"tcp-rollout-phase-2-demo" -- slug of OpMon project
).branch AS branch,
FROM (
--------------- custom data source as defined in config ----------------
WITH clients_per_branch AS (
SELECT
client_id,
sample_id,
DATE(submission_timestamp) AS submission_date,
application.build_id AS build_id,
normalized_channel,
normalized_os,
normalized_country_code,
CASE payload.processes.parent.scalars.privacy_dfpi_rollout_tcp_by_default_feature
WHEN TRUE THEN "enabled"
ELSE "disabled"
END AS branch
FROM `moz-fx-data-shared-prod.telemetry.main`,
UNNEST(environment.experiments) AS experiment
WHERE
(normalized_channel = "beta") AND
experiment.key IN (
"total-cookie-protection-beta-roll-out-phase-ii-switch-on-to-new-users",
"total-cookie-protection-roll-out-to-users-phase-ii-switch-on-by-default"
)
)
SELECT
client_id,
submission_date,
build_id,
normalized_channel,
normalized_os,
normalized_country_code,
STRUCT ( -- this is the structure opmon expects
[
STRUCT (
"tcp-rollout-phase-2-demo" AS key, -- dummy experiment/rollout slug to make opmon happy
STRUCT(branch AS branch) AS value
)
] AS experiments
) AS environment
FROM clients_per_branch
WHERE branch IS NOT NULL
-------------- end of custom data source config ---------------------
)
WHERE
DATE(submission_date) = DATE('2022-06-08 00:00:00+00:00')
GROUP BY
submission_date,
client_id,
build_id,
os,
branch
)
SELECT * FROM population
```
For each client, metrics are subsequently computed with a query like the following:
```sql
WITH merged_metrics_main AS (
SELECT
DATE(submission_timestamp) AS submission_date,
client_id AS client_id,
p.population_build_id AS build_id,
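-- collect every histogram value this client submitted; these get merged and normalized later in the statistics step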
ARRAY_AGG(mozfun.hist.extract(payload.processes.content.histograms.perf_first_contentful_paint_ms) IGNORE NULLS) AS perf_first_contentful_paint_ms,
FROM
mozdata.telemetry.main_1pct
RIGHT JOIN
(
SELECT
client_id AS population_client_id,
submission_date AS population_submission_date,
build_id AS population_build_id
FROM
population
) AS p
ON
DATE(submission_timestamp) = p.population_submission_date AND
client_id = p.population_client_id
WHERE
DATE(submission_timestamp) = DATE('2022-06-08 00:00:00+00:00')
GROUP BY
submission_date,
build_id,
client_id
),
merged_metrics_search_clients_engines_sources_daily AS (
SELECT
DATE(submission_date) AS submission_date,
client_id AS client_id,
p.population_build_id AS build_id,
SUM(ad_click) AS ad_click,
FROM
mozdata.search.search_clients_engines_sources_daily
RIGHT JOIN
(
SELECT
client_id AS population_client_id,
submission_date AS population_submission_date,
build_id AS population_build_id
FROM
population
) AS p
ON
submission_date = p.population_submission_date AND
client_id = p.population_client_id
WHERE
DATE(submission_date) = DATE('2022-06-08 00:00:00+00:00')
GROUP BY
submission_date,
build_id,
client_id
),
-- a CTE is generated for each data source
-- ... merged_metrics_[...] AS ()
-- combine the metrics from all the data sources
joined_metrics AS (
SELECT
population.submission_date AS submission_date,
population.client_id AS client_id,
population.build_id,
population.os AS os,
population.branch AS branch,
perf_first_contentful_paint_ms,
FROM population
LEFT JOIN merged_metrics_main
ON
merged_metrics_main.submission_date = population.submission_date AND
merged_metrics_main.client_id = population.client_id AND
(population.build_id IS NULL OR merged_metrics_main.build_id = population.build_id)
LEFT JOIN merged_metrics_search_clients_engines_sources_daily
ON
merged_metrics_search_clients_engines_sources_daily.submission_date = population.submission_date AND
merged_metrics_search_clients_engines_sources_daily.client_id = population.client_id AND
(population.build_id IS NULL OR merged_metrics_search_clients_engines_sources_daily.build_id = population.build_id)
)
SELECT * FROM joined_metrics
```
The per-client metric results are written to BigQuery tables in the `operational_monitoring_derived` dataset. The tables are named as follows: `<slug_with_underscores>_v1`.
The per-client metric results are used in the next step of the ETL to compute statistics. Statistics reduce the observations of many clients to one or more rows describing the population.
The SQL executed to compute statistics depends on which statistics have been configured for each metric, but it looks similar to the following:
```sql
SELECT
submission_date,
build_id,
branch,
ARRAY<STRUCT<
metric STRING,
statistic STRING,
point FLOAT64,
lower FLOAT64,
upper FLOAT64,
parameter STRING
>>[
STRUCT(
"install_volume_win8_1" AS metric,
"sum" AS statistic,
SUM(install_volume_win8_1) AS point,
NULL AS lower,
NULL AS upper,
NULL AS parameter
)
] AS install_volume_win8_1_sum
FROM `operational_monitoring.<slug_with_underscores>_v1`
GROUP BY
submission_date,
build_id,
branch
```
Computing percentile statistics is a little more complicated than computing the other supported statistics: it involves bucketing result values and using a [Jackknife UDF](https://github.com/mozilla/opmon/blob/main/opmon/templates/jackknife_percentile_ci_udf.sql).
The ETL that is executed looks similar to the following:
```sql
-- CREATE TEMP FUNCTION jackknife_percentile_ci
-- see https://github.com/mozilla/opmon/blob/main/opmon/templates/jackknife_percentile_ci_udf.sql
-- CREATE TEMPORARY FUNCTION merge_histogram_values
-- see https://github.com/mozilla/opmon/blob/main/opmon/templates/merge_histogram_values_udf.sql
WITH filtered_metrics AS (
SELECT *
FROM `operational_monitoring.<slug_with_underscores>_v1`
WHERE DATE(submission_date) >= DATE("2022-06-08 00:00:00+00:00")
),
-- bucket scalar metrics that use percentile
-- histogram metrics are already bucketed so this step is not necessary there
log_min_max AS (
SELECT
NULL AS dummy,
LOG(IF(MIN(ad_click) <= 0, 1, MIN(ad_click)), 2) ad_click_log_min,
LOG(IF(MAX(ad_click) <= 0, 1, MAX(ad_click)), 2) ad_click_log_max,
FROM
filtered_metrics
LIMIT 1
),
buckets_by_metric AS (
SELECT
NULL AS dummy,
ARRAY(SELECT FORMAT("%.*f", 2, bucket) FROM UNNEST(
mozfun.glam.histogram_generate_scalar_buckets(ad_click_log_min, ad_click_log_max, 100)
) AS bucket ORDER BY bucket) AS ad_click_buckets,
FROM log_min_max
)
-- compute statistics
SELECT
submission_date,
build_id,
os,
branch,
-- computing percentile for a scalar statistic
-- the scalar values are transformed into a histogram data structure
jackknife_percentile_ci(
[25, 50, 75, 95], -- percentiles to compute
merge_histogram_values(
ARRAY_CONCAT_AGG(
histogram_normalized_sum(
[IF(False AND ad_click IS NULL,
NULL,
STRUCT<values ARRAY<STRUCT<key FLOAT64, value FLOAT64>>>(
[STRUCT<key FLOAT64, value FLOAT64>(
COALESCE(
mozfun.glam.histogram_bucket_from_value(
ad_click_buckets,
SAFE_CAST(ad_click AS FLOAT64)
), 0.0
), 1.0
)]
))], 1.0
)
)
),
"ad_click"
) AS ad_click_percentile,
-- computing percentile for a histogram metric
jackknife_percentile_ci(
[25, 50, 75, 95],
merge_histogram_values(
ARRAY_CONCAT_AGG(
histogram_normalized_sum(perf_first_contentful_paint_ms, 1.0)
)
),
"perf_first_contentful_paint_ms"
)
AS perf_first_contentful_paint_ms_percentile
FROM `operational_monitoring.<slug_with_underscores>_v1`
CROSS JOIN buckets_by_metric
WHERE submission_date = DATE("2022-06-08 00:00:00+00:00")
GROUP BY
submission_date,
build_id,
os,
branch
```
Although a single set of SQL templates is used, the data is aggregated in slightly different ways depending on whether build IDs or submission dates are on the x-axis:
* For submission dates on the x-axis, the submission date is used as the partition.
* For build IDs on the x-axis, the submission date is also used as the partition, but for each build ID up to 30 days of previous data is aggregated: each partition/submission date includes data for the builds released within the last 30 days (see the sketch below).
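As a rough sketch of the build ID case, assuming Firefox-style `YYYYMMDDhhmmss` build IDs (this is an illustration, not the exact filter the templates generate):
```sql
-- Sketch only: with build IDs on the x-axis, a partition (submission date)
-- keeps only the builds released within the preceding 30 days.
SELECT
submission_date,
build_id,
branch
FROM
`operational_monitoring.<slug_with_underscores>_v1`
WHERE
submission_date = DATE("2022-06-08")
AND PARSE_DATE('%Y%m%d', SUBSTR(build_id, 1, 8))
BETWEEN DATE_SUB(DATE("2022-06-08"), INTERVAL 30 DAY) AND DATE("2022-06-08")
```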
The [Operational Monitoring DAG](https://workflow.telemetry.mozilla.org/tree?dag_id=operational_monitoring) runs once per day in Airflow.
For each operational monitoring project, separate tables are generated for metrics, statistics, and alerts.
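For example, for the `tcp-rollout-phase-2-demo` project used above this could result in tables along the lines of the following (the statistics and alerts suffixes are assumptions for illustration, not taken from the generator):
```sql
-- Hypothetical table names for a single project; the suffixes are assumptions
SELECT * FROM `operational_monitoring_derived.tcp_rollout_phase_2_demo_v1`;            -- per-client metrics
SELECT * FROM `operational_monitoring_derived.tcp_rollout_phase_2_demo_statistics_v1`; -- statistics
SELECT * FROM `operational_monitoring_derived.tcp_rollout_phase_2_demo_alerts_v1`;     -- alerts
```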
## LookML Generator