ETL code that produces the addons_daily derived dataset.
Перейти к файлу
Ben Miroglio 94d9dd1892
Merge pull request #68 from benmiroglio/use-get-for-dict-access
Use get for dict access
2019-07-30 11:35:44 -07:00
.circleci Pin docker image to use debian stretch 2019-07-30 10:52:36 -07:00
addons_daily Use get for dict access 2019-07-16 07:35:09 -07:00
tests Create fixture for main dependency mocks and test invalid prefixes 2019-06-24 16:02:48 -07:00
.gitignore Change search metrics, do not use collect_list 2019-06-04 14:46:18 -07:00
METRICS.md Update METRICS.md 2019-03-25 14:03:23 -07:00
README.md clean SQL code a bit 2019-06-10 10:09:04 -07:00
setup.py Format python with black 2019-05-09 16:14:30 -07:00
tox.ini Adjust filters for production 2019-06-03 11:37:05 -07:00

README.md

CircleCI

addons_daily Derived Dataset

Contributers: Sarah Melancon, Ben Miroglio, Brian Wright, Daniel Thorn

This ETL code produces daily aggregates of Firefox extensions. It supports the broader "Extention Data for Developers" Project.

Introduction:

The addons_daily dataset serves as the central hub for all Firefox extension related questions. This includes questions regarding browser performance, user engagement, click through rates, etc. Each row in the table represents a unique addon, and each column is a unique metric.

Contents

Prior to construction of this dataset, extension related data lived in several different sources. Addons_daily has combined metrics aggregated from several sources, including raw pings, telemetry data, and google analytics data.

Accessing the Data

The data is stored as a parquet table in S3 at the following address:

s3://net-mozaws-prod-us-west-2-pipeline-analysis/bmiroglio/addons_daily_test/

The addons_daily table is accessible through re:dash using the Athena data source. It is also available via the Presto data source, though Athena should be preferred for performance and stability reasons.

Data Reference

Example Queries

Query 1

SELECT addon_id,
       FIRST(name) as name,
       avg(dau) as `Average DAU`,
       avg(wau) as `Average WAU`,
       avg(mau) as `Average MAU`,
       avg(dau_prop) as `Average % of Total DAU`
FROM addons_daily
WHERE
  is_system = false
  and addon_id not like '%mozilla%'
GROUP BY 1
ORDER BY 3 DESC

Query 2

Get daily active users for all days in dataset

SELECT submission_date_s3 as 'Date',
       avg(dau) as 'Average DAU'
FROM addons_daily_tt
GROUP BY 1

Scheduling

This dataset is updated daily via the telemetry-airflow infrastructure. The job runs as part of the main_summary DAG.

Schema

The data is partitioned by submission_date_s3 which is formatted as %Y%m%d, like 20180130. As of 2019-06-05, the current version of the addons_daily dataset is v1, and has a schema as follows:

root
|-- addon_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- os_pct: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- country_pct: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- avg_time_total: double (nullable = true)
 |-- active_hours: double (nullable = true)
 |-- disabled: long (nullable = true)
 |-- avg_tabs: double (nullable = true)
 |-- avg_bookmarks: double (nullable = true)
 |-- avg_toolbox_opened_count: double (nullable = true)
 |-- avg_uri: double (nullable = true)
 |-- pct_w_tracking_prot_enabled: double (nullable = true)
 |-- mau: long (nullable = true)
 |-- wau: long (nullable = true)
 |-- dau: long (nullable = true)
 |-- dau_prop: double (nullable = true)
 |-- search_with_ads: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- ad_click: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- organic_searches: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- sap_searches: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- tagged_sap_searches: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- installs: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- download_times: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- uninstalls: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- is_system: boolean (nullable = true)
 |-- avg_webext_storage_local_get_ms_: double (nullable = true)
 |-- avg_webext_storage_local_set_ms_: double (nullable = true)
 |-- avg_webext_extension_startup_ms_: double (nullable = true)
 |-- top_10_coinstalls: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- avg_webext_background_page_load_ms_: double (nullable = true)
 |-- avg_webext_browseraction_popup_open_ms_: double (nullable = true)
 |-- avg_webext_pageaction_popup_open_ms_: double (nullable = true)
 |-- avg_webext_content_script_injection_ms_: double (nullable = true)