Bug 1677609 Join clients_first_seen into clients_last_seen (#1561)
* Bug 1677609 Join clients_first_seen into clients_last_seen

Several folks on DS report that they have been getting great value from clients_first_seen, as the first_seen_date there is a much more stable way to define new profiles compared to using profile_created_date from pings. Currently, using first_seen_date requires doing a join between these two tables. This PR adds that join to the clients_last_seen query itself to make this workflow more efficient.

I'd like to get this merged before we proceed with the backfill discussed in https://bugzilla.mozilla.org/show_bug.cgi?id=1677609

This change has a few operational implications. Most importantly, it makes clients_last_seen dependent on clients_first_seen, so those queries can no longer proceed in parallel. `clients_first_seen` takes on average 10 minutes to run, so we'll be delaying all ETL downstream of `clients_last_seen` by about 10 minutes, which seems acceptable. It also adds some mental complexity to the model.

The extra join does not appear to significantly slow down the `clients_last_seen` query itself; it scans about 15% more data and consumes about 15% more slot time. I expect the performance is dominated by the existing join between clients_daily and the previous day of clients_last_seen.
Parent: df0508841f
Commit: 603fec3850
@@ -156,6 +156,10 @@ with DAG(
  telemetry_derived__clients_daily__v6
  )

+ telemetry_derived__clients_last_seen__v1.set_upstream(
+ telemetry_derived__clients_first_seen__v1
+ )
+
  wait_for_copy_deduplicate_main_ping = ExternalTaskSensor(
  task_id="wait_for_copy_deduplicate_main_ping",
  external_dag_id="copy_deduplicate",
@@ -9,6 +9,8 @@ OPTIONS
  (require_partition_filter = TRUE)
  AS
  SELECT
+ CAST(NULL AS DATE) AS first_seen_date,
+ CAST(NULL AS DATE) AS second_seen_date,
  CAST(NULL AS INT64) AS days_seen_bits,
  CAST(NULL AS INT64) AS days_visited_1_uri_bits,
  CAST(NULL AS INT64) AS days_visited_5_uri_bits,
@@ -3,6 +3,13 @@ friendly_name: Clients Last Seen
  description: >
  Captures history of activity of each client in 28 day
  windows for each submission date.
+
+ Generally, this is direct product of clients_daily and serves to make
+ certain query patterns more efficient by eliminating the need for
+ self-joins that would otherwise be needed to consider windows of activity.
+ As an exception, it pulls in first_seen_date and second_seen_date over all
+ time from clients_first_seen since first_seen_date is highly valuable for
+ providing a stable definition of when a profile was created.
  owners:
  - dthorn@mozilla.com
  labels:
@@ -43,7 +43,7 @@ WITH _current AS (
  WHERE
  submission_date = @submission_date
  ),
- --
+ --
  _previous AS (
  SELECT
  days_seen_bits,
@@ -65,7 +65,9 @@ _previous AS (
  days_interacted_bits,
  days_created_profile_bits,
  days_seen_in_experiment,
- submission_date
+ submission_date,
+ first_seen_date,
+ second_seen_date
  )
  FROM
  clients_last_seen_v1
@@ -74,9 +76,11 @@ _previous AS (
  -- Filter out rows from yesterday that have now fallen outside the 28-day window.
  AND udf.shift_28_bits_one_day(days_seen_bits) > 0
  )
- --
+ --
  SELECT
  @submission_date AS submission_date,
+ IF(cfs.first_seen_date > @submission_date, NULL, cfs.first_seen_date) AS first_seen_date,
+ IF(cfs.second_seen_date > @submission_date, NULL, cfs.second_seen_date) AS second_seen_date,
  IF(_current.client_id IS NOT NULL, _current, _previous).* REPLACE (
  udf.combine_adjacent_days_28_bits(
  _previous.days_seen_bits,
@@ -121,3 +125,7 @@ FULL JOIN
  _previous
  USING
  (client_id)
+ LEFT JOIN
+ clients_first_seen_v1 AS cfs
+ USING
+ (client_id)
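The query hunks above attach `first_seen_date` and `second_seen_date` through a `LEFT JOIN` on `client_id` and null out any date later than `@submission_date`. A minimal Python sketch of that row-level behavior follows. The function names and the dict-based row shape are hypothetical, chosen only for illustration; SQL `NULL` is modeled as `None`.

```python
from datetime import date
from typing import Dict, List, Optional


def mask_future_date(seen: Optional[date], submission_date: date) -> Optional[date]:
    """Mirror IF(seen > @submission_date, NULL, seen), with NULL modeled as None."""
    if seen is None or seen > submission_date:
        return None
    return seen


def attach_first_seen(
    rows: List[dict],
    first_seen: Dict[str, dict],
    submission_date: date,
) -> List[dict]:
    """Emulate the LEFT JOIN on client_id: unmatched clients get None for both dates."""
    out = []
    for row in rows:
        cfs = first_seen.get(row["client_id"], {})
        out.append({
            **row,
            "first_seen_date": mask_future_date(cfs.get("first_seen_date"), submission_date),
            "second_seen_date": mask_future_date(cfs.get("second_seen_date"), submission_date),
        })
    return out


# Illustrative values taken from the test fixtures in this commit.
cfs = {"a": {"first_seen_date": date(2018, 12, 30), "second_seen_date": date(2019, 1, 1)}}
print(attach_first_seen([{"client_id": "a"}], cfs, date(2018, 12, 31)))
```

With a submission date of 2018-12-31, client `a` keeps its `first_seen_date` but `second_seen_date` is masked, since 2019-01-01 lies after that date.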
@@ -0,0 +1,4 @@
+ {"first_seen_date":"2018-12-30","second_seen_date":"2019-01-01","client_id":"a"}
+ {"first_seen_date":"2018-12-30","second_seen_date":"2019-01-01","client_id":"b"}
+ {"first_seen_date":"2018-12-30","second_seen_date":"2019-01-01","client_id":"c"}
+ {"first_seen_date":"2018-12-30","second_seen_date":"2019-01-01","client_id":"d"}
@@ -0,0 +1,17 @@
+ [
+ {
+ "name": "first_seen_date",
+ "type": "DATE",
+ "mode": "REQUIRED"
+ },
+ {
+ "name": "second_seen_date",
+ "type": "DATE",
+ "mode": "REQUIRED"
+ },
+ {
+ "name": "client_id",
+ "type": "STRING",
+ "mode": "REQUIRED"
+ }
+ ]
@@ -1,3 +1,3 @@
- {"submission_date":"2019-01-01","active_hours_sum":0.0,"devtools_toolbox_opened_count_sum":2.0,"attribution":{"source":"prev"},"client_id":"a","sample_id":"0","days_seen_bits":3,"days_opened_dev_tools_bits":1,"days_created_profile_bits":1}
- {"submission_date":"2019-01-01","active_hours_sum":0.0,"devtools_toolbox_opened_count_sum":0.0,"attribution":{"source":"prev"},"client_id":"b","sample_id":"0","days_seen_bits":0,"days_created_profile_bits":0}
- {"submission_date":"2019-01-01","active_hours_sum":0.0,"devtools_toolbox_opened_count_sum":2.0,"attribution":{"source":"prev"},"client_id":"d","sample_id":"0","days_seen_bits":0,"days_created_profile_bits":64}
+ {"first_seen_date":"2018-12-30","second_seen_date":"2019-01-01","submission_date":"2019-01-01","active_hours_sum":0.0,"devtools_toolbox_opened_count_sum":2.0,"attribution":{"source":"prev"},"client_id":"a","sample_id":"0","days_seen_bits":3,"days_opened_dev_tools_bits":1,"days_created_profile_bits":1}
+ {"first_seen_date":"2018-12-30","second_seen_date":"2019-01-01","submission_date":"2019-01-01","active_hours_sum":0.0,"devtools_toolbox_opened_count_sum":0.0,"attribution":{"source":"prev"},"client_id":"b","sample_id":"0","days_seen_bits":0,"days_created_profile_bits":0}
+ {"first_seen_date":"2018-12-30","second_seen_date":"2019-01-01","submission_date":"2019-01-01","active_hours_sum":0.0,"devtools_toolbox_opened_count_sum":2.0,"attribution":{"source":"prev"},"client_id":"d","sample_id":"0","days_seen_bits":0,"days_created_profile_bits":64}
@@ -1,4 +1,14 @@
  [
  {
+ "name": "first_seen_date",
+ "type": "DATE",
+ "mode": "REQUIRED"
+ },
+ {
+ "name": "second_seen_date",
+ "type": "DATE",
+ "mode": "REQUIRED"
+ },
+ {
  "name": "days_seen_bits",
  "type": "INT64",
@@ -1,3 +1,3 @@
- {"submission_date": "2019-01-02", "days_seen_bits": 6, "days_visited_1_uri_bits": 0, "days_visited_5_uri_bits": 0, "days_visited_10_uri_bits": 0, "days_had_8_active_ticks_bits": 0, "days_opened_dev_tools_bits": 2, "days_created_profile_bits": 2, "days_interacted_bits": 0, "days_seen_in_experiment": [], "client_id": "a", "sample_id": 0, "active_hours_sum": 0.0, "devtools_toolbox_opened_count_sum": 2.0, "attribution": {"source": "prev"}, "experiments": []}
- {"submission_date": "2019-01-02", "days_seen_bits": 1, "days_visited_1_uri_bits": 0, "days_visited_5_uri_bits": 0, "days_visited_10_uri_bits": 0, "days_had_8_active_ticks_bits": 1, "days_opened_dev_tools_bits": 1, "days_seen_in_experiment": [], "days_created_profile_bits": 64, "days_interacted_bits": 1, "client_id": "b", "sample_id": 0, "active_hours_sum": 1.0, "devtools_toolbox_opened_count_sum": 2.0, "profile_creation_date": "2018-12-27 00:00:00", "attribution": {"source": "test"}, "experiments": []}
- {"submission_date": "2019-01-02", "days_seen_bits": 1, "days_visited_1_uri_bits": 0, "days_visited_5_uri_bits": 0, "days_visited_10_uri_bits": 0, "days_had_8_active_ticks_bits": 1, "days_opened_dev_tools_bits": 0, "days_created_profile_bits": 0, "days_interacted_bits": 1, "days_seen_in_experiment": [{"bits": 1, "branch": "a", "experiment": "exp1"}], "client_id": "c", "sample_id": 0, "active_hours_sum": 1.0, "devtools_toolbox_opened_count_sum": 0.0, "profile_creation_date": "2018-09-01 00:00:00", "attribution": {"source": "test"},"experiments":[{"key":"exp1","value":"a"}]}
+ {"first_seen_date":"2018-12-30", "second_seen_date":"2019-01-01","submission_date": "2019-01-02", "days_seen_bits": 6, "days_visited_1_uri_bits": 0, "days_visited_5_uri_bits": 0, "days_visited_10_uri_bits": 0, "days_had_8_active_ticks_bits": 0, "days_opened_dev_tools_bits": 2, "days_created_profile_bits": 2, "days_interacted_bits": 0, "days_seen_in_experiment": [], "client_id": "a", "sample_id": 0, "active_hours_sum": 0.0, "devtools_toolbox_opened_count_sum": 2.0, "attribution": {"source": "prev"}, "experiments": []}
+ {"first_seen_date":"2018-12-30", "second_seen_date":"2019-01-01", "submission_date": "2019-01-02", "days_seen_bits": 1, "days_visited_1_uri_bits": 0, "days_visited_5_uri_bits": 0, "days_visited_10_uri_bits": 0, "days_had_8_active_ticks_bits": 1, "days_opened_dev_tools_bits": 1, "days_seen_in_experiment": [], "days_created_profile_bits": 64, "days_interacted_bits": 1, "client_id": "b", "sample_id": 0, "active_hours_sum": 1.0, "devtools_toolbox_opened_count_sum": 2.0, "profile_creation_date": "2018-12-27 00:00:00", "attribution": {"source": "test"}, "experiments": []}
+ {"first_seen_date":"2018-12-30", "second_seen_date":"2019-01-01", "submission_date": "2019-01-02", "days_seen_bits": 1, "days_visited_1_uri_bits": 0, "days_visited_5_uri_bits": 0, "days_visited_10_uri_bits": 0, "days_had_8_active_ticks_bits": 1, "days_opened_dev_tools_bits": 0, "days_created_profile_bits": 0, "days_interacted_bits": 1, "days_seen_in_experiment": [{"bits": 1, "branch": "a", "experiment": "exp1"}], "client_id": "c", "sample_id": 0, "active_hours_sum": 1.0, "devtools_toolbox_opened_count_sum": 0.0, "profile_creation_date": "2018-09-01 00:00:00", "attribution": {"source": "test"},"experiments":[{"key":"exp1","value":"a"}]}
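The fixtures above exercise the 28-day bit patterns that the query carries forward from the previous day. As a rough sketch of that update in Python: the two helpers below reuse the UDF names from the diff, but their bodies are an assumption inferred from those names and from the fixture values (client `a` goes from `days_seen_bits` 3 to 6), not a copy of the UDF source.

```python
MASK_28 = (1 << 28) - 1  # keep only the lowest 28 bits, one bit per day


def shift_28_bits_one_day(bits):
    """Age a bit pattern by one day; activity older than 28 days falls off the top."""
    return ((bits or 0) << 1) & MASK_28


def combine_adjacent_days_28_bits(prev, curr):
    """Shift yesterday's pattern by one day and add today's bit (None is treated as 0)."""
    return shift_28_bits_one_day(prev) + (curr or 0)


# Client "a": seen on the previous two days (0b11), no ping today.
print(combine_adjacent_days_28_bits(3, None))  # 6
# Client "b": no carried-over activity, seen today.
print(combine_adjacent_days_28_bits(0, 1))     # 1
# Client "d": previous days_seen_bits of 0 shifts to 0, so the
# `shift_28_bits_one_day(days_seen_bits) > 0` filter drops the row.
print(shift_28_bits_one_day(0) > 0)            # False
```

Under these assumptions the results agree with the expected output: `a` ends with `days_seen_bits` 6, `b` with 1, and `d` does not appear.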