Bug 1677609 Join clients_first_seen into clients_last_seen (#1561)

Several folks on DS report that they have been getting great value from
clients_first_seen, as the first_seen_date there is a much more stable way
to define new profiles than profile_creation_date from pings.

Currently, using first_seen_date requires joining clients_last_seen to
clients_first_seen. This PR adds that join to the clients_last_seen query
itself to make this workflow more efficient. I'd like to get this merged
before we proceed with the backfill discussed in
https://bugzilla.mozilla.org/show_bug.cgi?id=1677609
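
As a rough illustration of what the added join does, the Python sketch below mimics a LEFT JOIN on client_id and blanks out any date later than the submission date being processed, mirroring the IF(...) expressions in the query change. The function name `attach_first_seen` and the sample rows are hypothetical; the dates are borrowed from the test fixtures in this commit.

```python
from datetime import date


def attach_first_seen(rows, first_seen, submission_date):
    """Mimic the LEFT JOIN: add first/second seen dates to each client row."""

    def visible(d):
        # A missing match (None) or a date after submission_date yields None,
        # like IF(cfs.x > @submission_date, NULL, cfs.x) in the query.
        return d if d is not None and d <= submission_date else None

    joined = []
    for row in rows:
        cfs = first_seen.get(row["client_id"], {})
        joined.append(
            {
                **row,
                "submission_date": submission_date,
                "first_seen_date": visible(cfs.get("first_seen_date")),
                "second_seen_date": visible(cfs.get("second_seen_date")),
            }
        )
    return joined


first_seen = {
    "a": {"first_seen_date": date(2018, 12, 30), "second_seen_date": date(2019, 1, 1)},
}
rows = [{"client_id": "a"}, {"client_id": "z"}]  # "z" has no first-seen record

on_time = attach_first_seen(rows, first_seen, date(2019, 1, 2))
backfill = attach_first_seen(rows, first_seen, date(2018, 12, 31))
```

When an earlier date such as 2018-12-31 is processed, second_seen_date comes back empty because the client's second appearance had not yet happened as of that date, and a client with no match in clients_first_seen gets empty values for both columns.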

This change has a few operational implications. Most importantly, it makes
clients_last_seen dependent on clients_first_seen, so those queries can no
longer proceed in parallel. `clients_first_seen` takes on average 10 minutes
to run, so we'll be delaying all ETL downstream of `clients_last_seen` by
about 10 minutes, which seems acceptable. It also adds some mental complexity
to the model.
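
To make the scheduling impact concrete, here is a minimal Python sketch of the before/after dependency graph. It assumes both tasks become runnable at the same moment (time 0); the 10-minute figure comes from the estimate above, while the 30-minute duration for `clients_last_seen` and the helper `finish_times` are purely illustrative.

```python
from graphlib import TopologicalSorter


def finish_times(durations, upstream):
    """Earliest finish time (minutes) of each task, given its upstream tasks."""
    finish = {}
    # static_order() yields every task after all of its upstream tasks.
    for task in TopologicalSorter(upstream).static_order():
        start = max((finish[u] for u in upstream[task]), default=0)
        finish[task] = start + durations[task]
    return finish


# 10 minutes is the estimate from the commit message; 30 is a made-up value.
durations = {"clients_first_seen": 10, "clients_last_seen": 30}

before = finish_times(
    durations,
    {"clients_first_seen": set(), "clients_last_seen": set()},
)
after = finish_times(
    durations,
    {"clients_first_seen": set(), "clients_last_seen": {"clients_first_seen"}},
)

delay = after["clients_last_seen"] - before["clients_last_seen"]
print(f"clients_last_seen finishes {delay} minutes later")
```

Under these assumptions the delay equals the runtime of `clients_first_seen`, whatever duration is chosen for `clients_last_seen`.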

The extra join does not appear to significantly slow down the
`clients_last_seen` query itself; it scans about 15% more data and consumes
about 15% more slot time. I expect the performance is dominated by the
existing join between clients_daily and the previous day of clients_last_seen.
Jeff Klukas 2020-11-30 09:28:53 -05:00, committed by GitHub
Parent df0508841f
Commit 603fec3850
No known key found for this signature
GPG Key ID: 4AEE18F83AFDEB23
9 changed files with 61 additions and 9 deletions


@@ -156,6 +156,10 @@ with DAG(
  telemetry_derived__clients_daily__v6
  )
+ telemetry_derived__clients_last_seen__v1.set_upstream(
+ telemetry_derived__clients_first_seen__v1
+ )
  wait_for_copy_deduplicate_main_ping = ExternalTaskSensor(
  task_id="wait_for_copy_deduplicate_main_ping",
  external_dag_id="copy_deduplicate",


@@ -9,6 +9,8 @@ OPTIONS
  (require_partition_filter = TRUE)
  AS
  SELECT
+ CAST(NULL AS DATE) AS first_seen_date,
+ CAST(NULL AS DATE) AS second_seen_date,
  CAST(NULL AS INT64) AS days_seen_bits,
  CAST(NULL AS INT64) AS days_visited_1_uri_bits,
  CAST(NULL AS INT64) AS days_visited_5_uri_bits,


@@ -3,6 +3,13 @@ friendly_name: Clients Last Seen
  description: >
  Captures history of activity of each client in 28 day
  windows for each submission date.
+ Generally, this is direct product of clients_daily and serves to make
+ certain query patterns more efficient by eliminating the need for
+ self-joins that would otherwise be needed to consider windows of activity.
+ As an exception, it pulls in first_seen_date and second_seen_date over all
+ time from clients_first_seen since first_seen_date is highly valuable for
+ providing a stable definition of when a profile was created.
  owners:
  - dthorn@mozilla.com
  labels:


@@ -43,7 +43,7 @@ WITH _current AS (
  WHERE
  submission_date = @submission_date
  ),
- --
+ --
  _previous AS (
  SELECT
  days_seen_bits,
@@ -65,7 +65,9 @@ _previous AS (
  days_interacted_bits,
  days_created_profile_bits,
  days_seen_in_experiment,
- submission_date
+ submission_date,
+ first_seen_date,
+ second_seen_date
  )
  FROM
  clients_last_seen_v1
@@ -74,9 +76,11 @@ _previous AS (
  -- Filter out rows from yesterday that have now fallen outside the 28-day window.
  AND udf.shift_28_bits_one_day(days_seen_bits) > 0
  )
- --
+ --
  SELECT
  @submission_date AS submission_date,
+ IF(cfs.first_seen_date > @submission_date, NULL, cfs.first_seen_date) AS first_seen_date,
+ IF(cfs.second_seen_date > @submission_date, NULL, cfs.second_seen_date) AS second_seen_date,
  IF(_current.client_id IS NOT NULL, _current, _previous).* REPLACE (
  udf.combine_adjacent_days_28_bits(
  _previous.days_seen_bits,
@@ -121,3 +125,7 @@ FULL JOIN
  _previous
  USING
  (client_id)
+ LEFT JOIN
+ clients_first_seen_v1 AS cfs
+ USING
+ (client_id)
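
The 28-day window filter and roll-forward used by this query can be sketched in Python. This assumes `udf.shift_28_bits_one_day` shifts the bit pattern left by one and keeps only the lowest 28 bits, and that `udf.combine_adjacent_days_28_bits` adds the current day's bits to the shifted previous value. Neither UDF body appears in this diff, so both definitions are assumptions inferred from the function names and the test fixtures.

```python
MASK_28 = (1 << 28) - 1  # lowest 28 bits, one bit per day in the window


def shift_28_bits_one_day(bits):
    """Age a bit pattern by one day; None is treated as 0 (assumed behavior)."""
    return ((bits or 0) << 1) & MASK_28


def combine_adjacent_days_28_bits(previous, current):
    """Roll yesterday's pattern forward and add today's bits (assumed behavior)."""
    return shift_28_bits_one_day(previous) + (current or 0)


# Fixture values: client "a" has days_seen_bits 3 (0b11) on the previous day
# and apparently no activity on the current day.
rolled_a = combine_adjacent_days_28_bits(3, None)
# Client "d" has days_seen_bits 0 on the previous day, so it fails "> 0".
kept_d = shift_28_bits_one_day(0) > 0
# A client whose only activity was 28 days ago drops out of the window.
expired = shift_28_bits_one_day(1 << 27)
```

Under these assumptions `rolled_a` is 6, which matches the expected days_seen_bits for client "a" in the 2019-01-02 fixture, and client "d" is filtered out, which matches its absence from the expected output.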


@@ -0,0 +1,4 @@
+ {"first_seen_date":"2018-12-30","second_seen_date":"2019-01-01","client_id":"a"}
+ {"first_seen_date":"2018-12-30","second_seen_date":"2019-01-01","client_id":"b"}
+ {"first_seen_date":"2018-12-30","second_seen_date":"2019-01-01","client_id":"c"}
+ {"first_seen_date":"2018-12-30","second_seen_date":"2019-01-01","client_id":"d"}


@@ -0,0 +1,17 @@
+ [
+ {
+ "name": "first_seen_date",
+ "type": "DATE",
+ "mode": "REQUIRED"
+ },
+ {
+ "name": "second_seen_date",
+ "type": "DATE",
+ "mode": "REQUIRED"
+ },
+ {
+ "name": "client_id",
+ "type": "STRING",
+ "mode": "REQUIRED"
+ }
+ ]


@@ -1,3 +1,3 @@
- {"submission_date":"2019-01-01","active_hours_sum":0.0,"devtools_toolbox_opened_count_sum":2.0,"attribution":{"source":"prev"},"client_id":"a","sample_id":"0","days_seen_bits":3,"days_opened_dev_tools_bits":1,"days_created_profile_bits":1}
- {"submission_date":"2019-01-01","active_hours_sum":0.0,"devtools_toolbox_opened_count_sum":0.0,"attribution":{"source":"prev"},"client_id":"b","sample_id":"0","days_seen_bits":0,"days_created_profile_bits":0}
- {"submission_date":"2019-01-01","active_hours_sum":0.0,"devtools_toolbox_opened_count_sum":2.0,"attribution":{"source":"prev"},"client_id":"d","sample_id":"0","days_seen_bits":0,"days_created_profile_bits":64}
+ {"first_seen_date":"2018-12-30","second_seen_date":"2019-01-01","submission_date":"2019-01-01","active_hours_sum":0.0,"devtools_toolbox_opened_count_sum":2.0,"attribution":{"source":"prev"},"client_id":"a","sample_id":"0","days_seen_bits":3,"days_opened_dev_tools_bits":1,"days_created_profile_bits":1}
+ {"first_seen_date":"2018-12-30","second_seen_date":"2019-01-01","submission_date":"2019-01-01","active_hours_sum":0.0,"devtools_toolbox_opened_count_sum":0.0,"attribution":{"source":"prev"},"client_id":"b","sample_id":"0","days_seen_bits":0,"days_created_profile_bits":0}
+ {"first_seen_date":"2018-12-30","second_seen_date":"2019-01-01","submission_date":"2019-01-01","active_hours_sum":0.0,"devtools_toolbox_opened_count_sum":2.0,"attribution":{"source":"prev"},"client_id":"d","sample_id":"0","days_seen_bits":0,"days_created_profile_bits":64}


@@ -1,4 +1,14 @@
  [
+ {
+ "name": "first_seen_date",
+ "type": "DATE",
+ "mode": "REQUIRED"
+ },
+ {
+ "name": "second_seen_date",
+ "type": "DATE",
+ "mode": "REQUIRED"
+ },
  {
  "name": "days_seen_bits",
  "type": "INT64",


@@ -1,3 +1,3 @@
- {"submission_date": "2019-01-02", "days_seen_bits": 6, "days_visited_1_uri_bits": 0, "days_visited_5_uri_bits": 0, "days_visited_10_uri_bits": 0, "days_had_8_active_ticks_bits": 0, "days_opened_dev_tools_bits": 2, "days_created_profile_bits": 2, "days_interacted_bits": 0, "days_seen_in_experiment": [], "client_id": "a", "sample_id": 0, "active_hours_sum": 0.0, "devtools_toolbox_opened_count_sum": 2.0, "attribution": {"source": "prev"}, "experiments": []}
- {"submission_date": "2019-01-02", "days_seen_bits": 1, "days_visited_1_uri_bits": 0, "days_visited_5_uri_bits": 0, "days_visited_10_uri_bits": 0, "days_had_8_active_ticks_bits": 1, "days_opened_dev_tools_bits": 1, "days_seen_in_experiment": [], "days_created_profile_bits": 64, "days_interacted_bits": 1, "client_id": "b", "sample_id": 0, "active_hours_sum": 1.0, "devtools_toolbox_opened_count_sum": 2.0, "profile_creation_date": "2018-12-27 00:00:00", "attribution": {"source": "test"}, "experiments": []}
- {"submission_date": "2019-01-02", "days_seen_bits": 1, "days_visited_1_uri_bits": 0, "days_visited_5_uri_bits": 0, "days_visited_10_uri_bits": 0, "days_had_8_active_ticks_bits": 1, "days_opened_dev_tools_bits": 0, "days_created_profile_bits": 0, "days_interacted_bits": 1, "days_seen_in_experiment": [{"bits": 1, "branch": "a", "experiment": "exp1"}], "client_id": "c", "sample_id": 0, "active_hours_sum": 1.0, "devtools_toolbox_opened_count_sum": 0.0, "profile_creation_date": "2018-09-01 00:00:00", "attribution": {"source": "test"},"experiments":[{"key":"exp1","value":"a"}]}
+ {"first_seen_date":"2018-12-30", "second_seen_date":"2019-01-01","submission_date": "2019-01-02", "days_seen_bits": 6, "days_visited_1_uri_bits": 0, "days_visited_5_uri_bits": 0, "days_visited_10_uri_bits": 0, "days_had_8_active_ticks_bits": 0, "days_opened_dev_tools_bits": 2, "days_created_profile_bits": 2, "days_interacted_bits": 0, "days_seen_in_experiment": [], "client_id": "a", "sample_id": 0, "active_hours_sum": 0.0, "devtools_toolbox_opened_count_sum": 2.0, "attribution": {"source": "prev"}, "experiments": []}
+ {"first_seen_date":"2018-12-30", "second_seen_date":"2019-01-01", "submission_date": "2019-01-02", "days_seen_bits": 1, "days_visited_1_uri_bits": 0, "days_visited_5_uri_bits": 0, "days_visited_10_uri_bits": 0, "days_had_8_active_ticks_bits": 1, "days_opened_dev_tools_bits": 1, "days_seen_in_experiment": [], "days_created_profile_bits": 64, "days_interacted_bits": 1, "client_id": "b", "sample_id": 0, "active_hours_sum": 1.0, "devtools_toolbox_opened_count_sum": 2.0, "profile_creation_date": "2018-12-27 00:00:00", "attribution": {"source": "test"}, "experiments": []}
+ {"first_seen_date":"2018-12-30", "second_seen_date":"2019-01-01", "submission_date": "2019-01-02", "days_seen_bits": 1, "days_visited_1_uri_bits": 0, "days_visited_5_uri_bits": 0, "days_visited_10_uri_bits": 0, "days_had_8_active_ticks_bits": 1, "days_opened_dev_tools_bits": 0, "days_created_profile_bits": 0, "days_interacted_bits": 1, "days_seen_in_experiment": [{"bits": 1, "branch": "a", "experiment": "exp1"}], "client_id": "c", "sample_id": 0, "active_hours_sum": 1.0, "devtools_toolbox_opened_count_sum": 0.0, "profile_creation_date": "2018-09-01 00:00:00", "attribution": {"source": "test"},"experiments":[{"key":"exp1","value":"a"}]}