ETL code that produces the addons_daily derived dataset.
Перейти к файлу
Ben Miroglio 4203b2bdd8
Merge branch 'master' into remove-click-vars
2019-06-05 16:25:57 -07:00
.circleci Remove remaining py27 job 2019-05-30 16:00:00 -07:00
addons_daily Remove unused click vars 2019-06-05 16:23:53 -07:00
tests formatting black 2019-06-05 15:26:08 -07:00
.gitignore Change search metrics, do not use collect_list 2019-06-04 14:46:18 -07:00
METRICS.md Update METRICS.md 2019-03-25 14:03:23 -07:00
README.md Update README.md 2019-06-05 16:08:43 -07:00
setup.py Format python with black 2019-05-09 16:14:30 -07:00
tox.ini Adjust filters for production 2019-06-03 11:37:05 -07:00

README.md

CircleCI

addons_daily Derived Dataset

Contributers: Sarah Melancon, Ben Miroglio, Brian Wright, Daniel Thorn

This ETL code produces daily aggregates of Firefox extensions. It supports the broader "Extention Data for Developers" Project.

Introduction:

The addons_daily dataset serves as the central hub for all Firefox extension related questions. This includes questions regarding browser performance, user engagement, click through rates, etc. Each row in the table represents a unique addon, and each column is a unique metric.

Contents

Prior to construction of this dataset, extension related data lived in several different sources. Addons_daily has combined metrics aggregated from several sources, including raw pings, telemetry data, and google analytics data.

Accessing the Data

The data is stored as a parquet table in S3 at the following address:

s3://net-mozaws-prod-us-west-2-pipeline-analysis/bmiroglio/addons_daily_test/

The addons_daily table is accessible through re:dash using the Athena data source. It is also available via the Presto data source, though Athena should be preferred for performance and stability reasons.

Data Reference

Example Queries

Query 1

Get Top ten addons by average DAU

  SELECT addon_id, avg(dau) as avg_dau
  FROM addons_daily_tt
  GROUP BY addon_id
  ORDER BY avg_dau desc
  LIMIT 10

Query 2

Get daily active users for all days in dataset

  SELECT submission_date_s3, avg(dau) as avg_dau
  FROM addons_daily_tt
  GROUP BY submission_date_s3

Scheduling

This dataset is updated daily via the telemetry-airflow infrastructure. The job runs as part of the main_summary DAG.

Schema

The data is partitioned by submission_date_s3 which is formatted as %Y%m%d, like 20180130. As of 2019-06-05, the current version of the addons_daily dataset is v1, and has a schema as follows:

root
|-- addon_id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- os_pct: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- country_pct: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- avg_time_total: double (nullable = true)
 |-- active_hours: double (nullable = true)
 |-- disabled: long (nullable = true)
 |-- avg_tabs: double (nullable = true)
 |-- avg_bookmarks: double (nullable = true)
 |-- avg_toolbox_opened_count: double (nullable = true)
 |-- avg_uri: double (nullable = true)
 |-- pct_w_tracking_prot_enabled: double (nullable = true)
 |-- mau: long (nullable = true)
 |-- wau: long (nullable = true)
 |-- dau: long (nullable = true)
 |-- dau_prop: double (nullable = true)
 |-- search_with_ads: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- ad_click: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- organic_searches: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- sap_searches: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- tagged_sap_searches: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- installs: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- download_times: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = false)
 |-- uninstalls: map (nullable = true)
 |    |-- key: string
 |    |-- value: long (valueContainsNull = true)
 |-- is_system: boolean (nullable = true)
 |-- avg_webext_storage_local_get_ms_: double (nullable = true)
 |-- avg_webext_storage_local_set_ms_: double (nullable = true)
 |-- avg_webext_extension_startup_ms_: double (nullable = true)
 |-- top_10_coinstalls: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
 |-- avg_webext_background_page_load_ms_: double (nullable = true)
 |-- avg_webext_browseraction_popup_open_ms_: double (nullable = true)
 |-- avg_webext_pageaction_popup_open_ms_: double (nullable = true)
 |-- avg_webext_content_script_injection_ms_: double (nullable = true)