Deprecated - please see https://github.com/mozilla/firefox-public-data-report-etl for the current ETL powering the Firefox Public Data Report

Firefox Public Data

The Firefox Public Data (FxPD) project is a public-facing website which tracks various metrics over time and helps the general public understand what kind of data is being tracked by Mozilla and how it is used. It is modeled after, and evolved out of, the Firefox Hardware Report, which is now included as part of FxPD.

This repository contains the code used to pull and process the data for the User Activity and Usage Behavior subsections of the Desktop section of the report.

The website itself is generated by the Ensemble and Ensemble Transposer repos.

Data

The data is pulled from Firefox desktop telemetry, specifically the main summary view of the data.

The data has a weekly resolution (one datapoint per week) and includes the metrics below. The metrics are estimated from a 10% sample of the Release, Beta, ESR, and Other channels, and are broken down by the top 10 countries plus a worldwide aggregate. The historical data is kept in an S3 bucket as a JSON file.
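
As a rough illustration of the sampling (a sketch only: it assumes telemetry's sample_id column, which buckets clients into 100 groups, and the helper name is invented):

from pyspark.sql import DataFrame, functions as F

def ten_percent_sample(df: DataFrame) -> DataFrame:
    # Hypothetical helper: telemetry assigns each client a sample_id
    # in [0, 99], so keeping ids 0-9 yields a stable 10% sample of clients.
    return df.where(F.col("sample_id").cast("int") < 10)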

This job (this repo) is designed to be run once a week and produces the data for a single week. It then updates the historical data in the S3 bucket.
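
Purely for illustration (the country key and metric names below are invented; see DATAFORMAT.md for the real structure), a single weekly datapoint in the historical JSON might look like:

{
  "US": [
    {
      "date": "2018-02-01",
      "metrics": {
        "MAU": 25000000,
        "pct_new_users": 2.1
      }
    }
  ]
}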

For backfills, this job needs to be run for each week of the backfill.
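
For example, a four-week backfill could be scripted from the shell as follows (a sketch; the dates are placeholders, and the usage_report.py invocation is described under Developing below):

$ for week in 20180201 20180208 20180215 20180222; do python usage_report/usage_report.py --date "$week"; done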

Metrics

For the list of metrics, see METRICS.md.

Data Structure

For a description of the structure of the data output, see DATAFORMAT.md.

Developing

Run the Job

To initiate a test run of this job, you can clone this repo onto an ATMO cluster. First run

$ pip install py4j --upgrade

from your cluster console to get the latest version of py4j.

Next, clone the repo, and from the repo's top-level directory, run:

$ python usage_report/usage_report.py --date [some date, e.g. 20180201] --no-output

which will aggregate usage statistics from the last 7 days by default. When testing, it is recommended to set the --lag-days flag to 1 for quicker iteration, e.g.:

$ python usage_report/usage_report.py --date 20180201 --lag-days 1 --no-output

Note: there is currently no output to S3, so testing like this is not a problem. However, when running tests in this way, always make sure to include the --no-output flag.

Testing

Each metric has its own set of unit tests. The code to extract a particular metric is found in the .py files in usage_report/utils/, which are integrated in usage_report/usage_report.py.
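
As a rough sketch of what such a test looks like (the data and metric logic below are invented for illustration; the real tests live in tests/):

import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # A small local SparkSession is enough for unit tests; no cluster required.
    return (SparkSession.builder
            .master("local[1]")
            .appName("usage-report-tests")
            .getOrCreate())


def test_weekly_active_users(spark):
    # Hypothetical miniature of the telemetry data the job consumes.
    df = spark.createDataFrame(
        [("a", "20180201"), ("b", "20180201"), ("a", "20180202")],
        ["client_id", "submission_date_s3"],
    )
    # A metric helper would aggregate this frame; here the logic is inlined:
    # weekly active users = distinct clients seen during the week.
    wau = df.select("client_id").distinct().count()
    assert wau == 2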

To run these tests, first ensure you have Docker installed. Then build the container using

$ make build

then run the tests with

$ make test

finally,

$ make lint

runs the linter.