Create an ETL job for the Internet Outages (#1058)

* Add aggregation by country

* Copy the initial Italy focus query

This initial commit provides a baseline for the
next commits to ease review, since this code was
already reviewed.

* Cleanup the country list and replace FULL OUTER with LEFT joins

* Aggregate by city for cities with more than 15k inhabitants

The 15k population threshold is enforced at
ingestion time. We additionally limit the resulting
cities to ones with at least 1000 active daily users.

* Produce hourly aggregates

* Move the query to the `internet_outage` dataset

* Provide automatic daily scheduling through Airflow

* Tweak the SQL addressing review comments

This additionally changes the `CAST` to
`SAFE_CAST` to account for oddities in
the data.
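
A minimal illustration of the difference (values are made up): `CAST` aborts the whole query on malformed input, while `SAFE_CAST` yields `NULL`:

```sql
SELECT
  SAFE_CAST('42' AS INT64) AS parsed,   -- 42
  SAFE_CAST('n/a' AS INT64) AS broken;  -- NULL; CAST('n/a' AS INT64) would error
```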

* Add ssl_error_prop

* Add missing_dns_success

* Add missing_dns_failure

* Lower the minimum reported bucket size to 50

This allows us to match the EDA by Saptarshi and
gives us a more comparable baseline.

* Document the oddities around `submission_timestamp_min`
Alessio Placitelli, 2020-07-01 06:44:40 +02:00, committed by GitHub
Parent: 4654bbda31
Commit: 76bac7a98e
3 changed files: 578 additions and 0 deletions


@@ -144,3 +144,13 @@ bqetl_public_data_json:
email: ['telemetry-alerts@mozilla.com', 'ascholtz@mozilla.com']
retries: 2
retry_delay: 30m
# DAG for building the internet outages datasets. See bug 1640204.
bqetl_internet_outages:
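  # Cron syntax: this runs once a day, at 01:00 UTC.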
schedule_interval: 0 1 * * *
default_args:
owner: aplacitelli@mozilla.com
start_date: '2020-01-01'
email: ['aplacitelli@mozilla.com', 'sguha@mozilla.com']
retries: 2
retry_delay: 30m


@@ -0,0 +1,39 @@
friendly_name: Internet Outages
description: |-
  This contains a set of aggregated metrics that correlate with internet outages for different countries in the world.
The dataset contains the following fields:
- `country`: the Country code of the client.
- `city`: the City name (only for cities with a population >= 15000, 'unknown' otherwise).
- `datetime`: the date and the time (truncated to hour) the data was submitted by the client.
- `proportion_undefined`: the proportion of users who failed to send telemetry for a reason that was not listed in the other cases.
- `proportion_timeout`: the proportion of users that had their connection timeout while uploading telemetry ([after 90s, in Firefox Desktop](https://searchfox.org/mozilla-central/rev/fa2df28a49883612bd7af4dacd80cdfedcccd2f6/toolkit/components/telemetry/app/TelemetrySend.jsm#81)).
- `proportion_abort`: the proportion of users that had their connection terminated by the client (for example, terminating open connections before shutting down).
  - `proportion_unreachable`: the proportion of users that failed to upload telemetry because the server was not reachable (e.g. the host was unreachable, there were proxy problems, or the OS was waking up after a suspension).
- `proportion_terminated`: the proportion of users that had their connection terminated internally by the networking code.
- `proportion_channel_open`: the proportion of users for which the upload request was terminated immediately, by the client, because of a Necko internal error.
- `avg_dns_success_time`: the average time it takes for a successful DNS resolution, in milliseconds.
- `missing_dns_success`: counts how many sessions did not report the `DNS_LOOKUP_TIME` histogram.
- `avg_dns_failure_time`: the average time it takes for an unsuccessful DNS resolution, in milliseconds.
- `missing_dns_failure`: counts how many sessions did not report the `DNS_FAILED_LOOKUP_TIME` histogram.
- `count_dns_failure`: the average count of unsuccessful DNS resolutions reported.
- `ssl_error_prop`: the proportion of users that reported an error through the `SSL_CERT_VERIFICATION_ERRORS` histogram.
- `avg_tls_handshake_time`: the average time after the TCP SYN to ready for HTTP, in milliseconds.
Caveats with the data:
As with any observational data, there are many caveats and interpretation must be done carefully. Below is a list of issues we have considered, but it is not exhaustive.
- Firefox users are not representative of the general population in their region.
  - Users can experience multiple types of failures, so the proportions are not summable. For example, if 2.4% of clients had a timeout and 2.6% of clients had an eUnreachable failure, that does not necessarily mean that 5.0% of clients had a timeout or an eUnreachable failure.
- Geographical data is based on IPGeo databases. These databases are imperfect, so some activity may be attributed to the wrong location. Further, proxy and VPN usage can create geo-attribution errors.
owners:
- aplacitelli@mozilla.com
- sguha@mozilla.com
labels:
incremental: true
public_json: false
public_bigquery: false
review_bug: 1640204
incremental_export: false
scheduling:
dag_name: bqetl_internet_outages


@@ -0,0 +1,529 @@
-- Note: udf.udf_json_extract_int_map doesn't work in this case, as it expects
-- an INT -> INT map, while we have a STRING -> INT map.
CREATE TEMP FUNCTION udf_json_extract_string_to_int_map(input STRING) AS (
ARRAY(
SELECT
STRUCT(
CAST(SPLIT(entry, ':')[OFFSET(0)] AS STRING) AS key,
CAST(SPLIT(entry, ':')[OFFSET(1)] AS INT64) AS value
)
FROM
UNNEST(SPLIT(REPLACE(TRIM(input, '{}'), '"', ''), ',')) AS entry
WHERE
LENGTH(entry) > 0
)
);
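-- For example (illustrative input), '{"a":1,"b":2}' yields
-- [STRUCT('a' AS key, 1 AS value), STRUCT('b' AS key, 2 AS value)].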
-- This sums the values reported by a histogram.
CREATE TEMP FUNCTION sum_values(x ARRAY<STRUCT<key INT64, value INT64>>) AS (
(
WITH a AS (
SELECT
IF(array_length(x) > 0, 1, 0) AS isPres1
),
b AS (
SELECT
sum(value) AS t
FROM
UNNEST(x)
WHERE
key > 0
)
SELECT
coalesce(isPres1 * t, 0)
FROM
a,
b
)
);
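-- For example, sum_values([(0, 5), (1, 3), (2, 2)]) = 5: only buckets with
-- key > 0 are counted, and an empty input array yields 0.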
-- This counts how many times a histogram is not present.
-- A histogram counts as missing when it is absent entirely or when it
-- recorded no non-zero value.
CREATE TEMP FUNCTION empty(x ARRAY<STRUCT<key INT64, value INT64>>) AS (
(
WITH a AS (
SELECT
IF(array_length(x) = 0, 1, 0) AS isEmpty1
),
b AS (
SELECT
IF(max(value) = 0, 1, 0) isEmpty2
FROM
UNNEST(x)
),
c AS (
SELECT
IF(isEmpty2 = 1 OR isEmpty1 = 1, 1, 0) AS Empty
FROM
a,
b
)
SELECT
Empty
FROM
    c
)
);
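-- For example, empty([]) = 1, empty([(1, 0)]) = 1, and empty([(1, 2)]) = 0.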
-- Get a stable source for DAUs.
WITH DAUs AS (
SELECT
-- Given the `telemetry.clients_daily` implementation we don't expect
-- ?? to be in the data (https://github.com/mozilla/bigquery-etl/blob/3f1cb398fa3eb162c232480d8cfa97b8952ee658/sql/telemetry_derived/clients_daily_v6/query.sql#L127).
-- But reality defies expectations.
NULLIF(country, '??') AS country,
    -- If the city is '??' or NULL, it comes from a city we either don't
    -- know about or whose population is below 15k. Just rename it to 'unknown'.
IF(city = '??' OR city IS NULL, 'unknown', city) AS city,
    -- Truncate the submission timestamp to the hour. Note that this field was
    -- introduced on December 16th, 2019, so it will be `null` for queries
    -- before that day. See https://github.com/mozilla/bigquery-etl/pull/603 .
TIMESTAMP_TRUNC(submission_timestamp_min, HOUR) AS datetime,
COUNT(*) AS client_count
FROM
telemetry.clients_daily
WHERE
submission_date = @submission_date
-- Country can be null if geoip lookup failed.
-- There's no point in adding these to the analyses.
    -- Due to a bug in `telemetry.clients_daily` we need to
    -- check for '??' in addition to null.
AND country IS NOT NULL
AND country != '??'
GROUP BY
1,
2,
3
  -- Filter out cities for which we have 50 or fewer hourly active
  -- users. This makes sure such data won't end up in the final table.
HAVING
client_count > 50
),
-- Compute aggregates for the health data.
health_data_sample AS (
SELECT
-- `city` is processed in `health_data_aggregates`.
udf.geo_struct(metadata.geo.country, metadata.geo.city, NULL, NULL).* EXCEPT (
geo_subdivision1,
geo_subdivision2
),
TIMESTAMP_TRUNC(submission_timestamp, HOUR) AS datetime,
client_id,
SUM(
coalesce(
SAFE_CAST(JSON_EXTRACT(additional_properties, '$.payload.sendFailure.undefined') AS INT64),
0
)
) AS e_undefined,
SUM(
coalesce(
SAFE_CAST(JSON_EXTRACT(additional_properties, '$.payload.sendFailure.timeout') AS INT64),
0
)
) AS e_timeout,
SUM(
coalesce(
SAFE_CAST(JSON_EXTRACT(additional_properties, '$.payload.sendFailure.abort') AS INT64),
0
)
) AS e_abort,
SUM(
coalesce(
SAFE_CAST(
JSON_EXTRACT(additional_properties, '$.payload.sendFailure.eUnreachable') AS INT64
),
0
)
) AS e_unreachable,
SUM(
coalesce(
SAFE_CAST(
JSON_EXTRACT(additional_properties, '$.payload.sendFailure.eTerminated') AS INT64
),
0
)
) AS e_terminated,
SUM(
coalesce(
SAFE_CAST(
JSON_EXTRACT(additional_properties, '$.payload.sendFailure.eChannelOpen') AS INT64
),
0
)
) AS e_channel_open,
FROM
telemetry.health
WHERE
date(submission_timestamp) = @submission_date
GROUP BY
1,
2,
3,
4
),
health_data_aggregates AS (
SELECT
country,
    -- If the city is '??' or NULL, it comes from a city we either don't
    -- know about or whose population is below 15k. Just rename it to 'unknown'.
IF(city = '??' OR city IS NULL, 'unknown', city) AS city,
datetime,
COUNTIF(e_undefined > 0) AS num_clients_e_undefined,
COUNTIF(e_timeout > 0) AS num_clients_e_timeout,
COUNTIF(e_abort > 0) AS num_clients_e_abort,
COUNTIF(e_unreachable > 0) AS num_clients_e_unreachable,
COUNTIF(e_terminated > 0) AS num_clients_e_terminated,
COUNTIF(e_channel_open > 0) AS num_clients_e_channel_open,
FROM
health_data_sample
WHERE
-- Country can be null if geoip lookup failed.
-- There's no point in adding these to the analyses.
country IS NOT NULL
GROUP BY
country,
city,
datetime
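    -- Keep only (country, city, hour) groups with more than 50 clients.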
HAVING
COUNT(*) > 50
),
final_health_data AS (
SELECT
h.country,
h.city,
h.datetime,
(num_clients_e_undefined / DAUs.client_count) AS proportion_undefined,
(num_clients_e_timeout / DAUs.client_count) AS proportion_timeout,
(num_clients_e_abort / DAUs.client_count) AS proportion_abort,
(num_clients_e_unreachable / DAUs.client_count) AS proportion_unreachable,
(num_clients_e_terminated / DAUs.client_count) AS proportion_terminated,
(num_clients_e_channel_open / DAUs.client_count) AS proportion_channel_open,
FROM
health_data_aggregates AS h
INNER JOIN
DAUs
USING
(datetime, country, city)
),
-- Compute aggregates for histograms coming from the health ping.
histogram_data_sample AS (
SELECT
-- We don't need to use udf.geo_struct here since `telemetry.main` won't
-- have '??' values. It only has nulls, which we can handle.
metadata.geo.country AS country,
    -- If the city is NULL, it comes from a city we either don't
    -- know about or whose population is below 15k. Just rename it to 'unknown'.
IFNULL(metadata.geo.city, 'unknown') AS city,
client_id,
document_id,
TIMESTAMP_TRUNC(submission_timestamp, HOUR) AS time_slot,
payload.info.subsession_length AS subsession_length,
udf.json_extract_int_map(
JSON_EXTRACT(payload.histograms.dns_failed_lookup_time, '$.values')
) AS dns_fail,
udf.json_extract_int_map(
JSON_EXTRACT(payload.histograms.dns_lookup_time, '$.values')
) AS dns_success,
udf.json_extract_int_map(
JSON_EXTRACT(payload.histograms.ssl_cert_verification_errors, '$.values')
) AS ssl_cert_errors,
udf.json_extract_int_map(
JSON_EXTRACT(payload.processes.content.histograms.http_page_tls_handshake, '$.values')
) AS tls_handshake,
FROM
telemetry.main
WHERE
DATE(submission_timestamp) = @submission_date
-- Restrict to Firefox.
AND normalized_app_name = 'Firefox'
    -- Keep only pings that seem to represent an active session.
AND payload.info.subsession_length >= 0
-- Country can be null if geoip lookup failed.
-- There's no point in adding these to the analyses.
AND metadata.geo.country IS NOT NULL
),
-- DNS_SUCCESS histogram
dns_success_time AS (
SELECT
country,
city,
time_slot AS datetime,
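    -- Weighted geometric mean of the bucket keys (times in ms): the exp of
    -- the count-weighted average of log(key).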
exp(sum(log(key) * count) / sum(count)) AS value
FROM
(
SELECT
country,
city,
client_id,
time_slot,
key,
sum(value) AS count
FROM
histogram_data_sample
CROSS JOIN
UNNEST(histogram_data_sample.dns_success)
GROUP BY
country,
city,
time_slot,
client_id,
key
)
WHERE
key > 0
GROUP BY
1,
2,
3
HAVING
COUNT(*) > 50
),
-- Oddness: active sessions without DNS_LOOKUP_TIME
dns_no_dns_lookup_time AS (
SELECT
country,
city,
time_slot AS datetime,
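    -- Proportion of active sessions (subsession_length > 0) that recorded no
    -- DNS_LOOKUP_TIME data; the +1 in the denominator avoids division by zero.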
SUM(IF(subsession_length > 0 AND is_empty = 1, 1, 0)) / (
1 + SUM(IF(subsession_length > 0, 1, 0))
) AS value
FROM
(
SELECT
country,
city,
client_id,
time_slot,
subsession_length,
empty(dns_success) AS is_empty
FROM
histogram_data_sample
)
GROUP BY
1,
2,
3
HAVING
COUNT(*) > 50
),
-- A shared source for the DNS_FAIL histogram
dns_failure_src AS (
SELECT
country,
city,
client_id,
time_slot,
key,
sum(value) AS count
FROM
histogram_data_sample
CROSS JOIN
UNNEST(histogram_data_sample.dns_fail)
GROUP BY
country,
city,
time_slot,
client_id,
key
),
-- DNS_FAIL histogram
dns_failure_time AS (
SELECT
country,
city,
time_slot AS datetime,
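    -- Weighted geometric mean of the bucket keys, as in dns_success_time.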
exp(sum(log(key) * count) / sum(count)) AS value
FROM
dns_failure_src
WHERE
key > 0
GROUP BY
1,
2,
3
HAVING
COUNT(*) > 50
),
-- DNS_FAIL counts
dns_failure_counts AS (
SELECT
country,
city,
time_slot AS datetime,
avg(count) AS value
FROM
(
SELECT
country,
city,
client_id,
time_slot,
sum(count) AS count
FROM
dns_failure_src
GROUP BY
country,
city,
time_slot,
client_id
)
GROUP BY
country,
city,
time_slot
HAVING
COUNT(*) > 50
),
-- Oddness: active sessions without DNS_FAILED_LOOKUP_TIME
dns_no_dns_failure_time AS (
SELECT
country,
city,
time_slot AS datetime,
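    -- Same shape as dns_no_dns_lookup_time: proportion of active sessions
    -- missing DNS_FAILED_LOOKUP_TIME; the +1 avoids division by zero.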
SUM(IF(subsession_length > 0 AND is_empty = 1, 1, 0)) / (
1 + SUM(IF(subsession_length > 0, 1, 0))
) AS value
FROM
(
SELECT
country,
city,
client_id,
time_slot,
subsession_length,
empty(dns_fail) AS is_empty
FROM
histogram_data_sample
)
GROUP BY
1,
2,
3
HAVING
COUNT(*) > 50
),
-- SSL_CERT_VERIFICATION_ERRORS histograms
ssl_error_prop_src AS (
SELECT
country,
city,
time_slot,
client_id,
document_id,
subsession_length,
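    -- Sum of SSL_CERT_VERIFICATION_ERRORS counts across buckets with key > 0
    -- (0 if the histogram is missing or empty).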
sum_values(ssl_cert_errors) AS ssl_sum_vals
FROM
histogram_data_sample
),
ssl_error_prop AS (
SELECT
country,
city,
time_slot AS datetime,
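    -- Proportion of active sessions reporting at least one SSL certificate
    -- verification error; the +1 avoids division by zero.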
SUM(IF(subsession_length > 0 AND ssl_sum_vals > 0, 1, 0)) / (
1 + SUM(IF(subsession_length > 0, 1, 0))
) AS value
FROM
ssl_error_prop_src
  GROUP BY
    country,
    city,
    time_slot
HAVING
COUNT(*) > 50
),
-- TLS_HANDSHAKE histogram
tls_handshake_time AS (
SELECT
country,
city,
time_slot AS datetime,
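    -- Weighted geometric mean of the bucket keys, as in dns_success_time.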
exp(sum(log(key) * count) / sum(count)) AS value
FROM
(
SELECT
country,
city,
client_id,
time_slot,
key,
sum(value) AS count
FROM
histogram_data_sample
CROSS JOIN
UNNEST(histogram_data_sample.tls_handshake)
GROUP BY
country,
city,
time_slot,
client_id,
key
)
WHERE
key > 0
GROUP BY
1,
2,
3
HAVING
COUNT(*) > 50
)
SELECT
DAUs.country AS country,
DAUs.city AS city,
DAUs.datetime AS datetime,
hd.* EXCEPT (datetime, country, city),
ds.value AS avg_dns_success_time,
ds_missing.value AS missing_dns_success,
df.value AS avg_dns_failure_time,
df_missing.value AS missing_dns_failure,
dfc.value AS count_dns_failure,
ssl.value AS ssl_error_prop,
tls.value AS avg_tls_handshake_time
FROM
final_health_data AS hd
  -- We use LEFT JOINs here and below instead of FULL OUTER JOINs.
  -- DAUs should contain all the countries and all the hours, so it
  -- should always have matches for whatever it is joined against.
  -- With a FULL OUTER JOIN we sometimes ended up with nulls, because
  -- a few samples coming from telemetry.main are not accounted for
  -- in telemetry.clients_daily.
LEFT JOIN
DAUs
USING
(datetime, country, city)
LEFT JOIN
dns_success_time AS ds
USING
(datetime, country, city)
LEFT JOIN
dns_no_dns_lookup_time AS ds_missing
USING
(datetime, country, city)
LEFT JOIN
dns_failure_time AS df
USING
(datetime, country, city)
LEFT JOIN
dns_failure_counts AS dfc
USING
(datetime, country, city)
LEFT JOIN
dns_no_dns_failure_time AS df_missing
USING
(datetime, country, city)
LEFT JOIN
tls_handshake_time AS tls
USING
(datetime, country, city)
LEFT JOIN
ssl_error_prop AS ssl
USING
(datetime, country, city)
ORDER BY
1,
2