A Scala framework to build derived datasets, aka batch views, of Telemetry data.

telemetry-batch-view

This is a Scala application to build derived datasets, also known as batch views, of Telemetry data.

Raw JSON pings are stored on S3 in files containing framed Heka records. Reading the raw data directly, for example through Spark, can be slow: a given analysis typically uses only a few fields, yet every JSON blob still has to be parsed in full. Furthermore, under certain circumstances a Heka file might contain only a handful of records.

Defining a derived Parquet dataset, which uses a columnar layout optimized for analytics workloads, can drastically improve the performance of analysis jobs while reducing space requirements. A derived dataset can, and should, also perform the heavy-duty operations common to all analyses that will read from it (e.g., parsing dates into normalized timestamps).
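
As a rough illustration of the date normalization mentioned above, the sketch below converts a compact yyyyMMdd date string into epoch milliseconds at UTC midnight. The `DateNormalizer` object, its method name, and the assumed input format are hypothetical and are not taken from this repository.

```scala
import java.time.{LocalDate, ZoneOffset}
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper, for illustration only.
object DateNormalizer {
  // Assumption: input dates are compact yyyyMMdd strings such as "20160101".
  private val compact = DateTimeFormatter.BASIC_ISO_DATE

  // Returns epoch milliseconds at UTC midnight, or None if the input cannot be parsed.
  def toEpochMillis(raw: String): Option[Long] =
    Try(LocalDate.parse(raw, compact))
      .toOption
      .map(_.atStartOfDay(ZoneOffset.UTC).toInstant.toEpochMilli)
}
```

Doing this once when the dataset is built means downstream analyses can filter and join on a numeric timestamp instead of re-parsing strings.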

Adding a new derived dataset

See the views folder for examples of jobs that create derived datasets.

See the Firefox Data Documentation for more information about the individual derived datasets. For help finding the right dataset for your analysis, see Choosing a Dataset.

Development and Deployment

The general workflow for telemetry-batch-view is:

Make some local changes on your branch.
Test locally in Airflow, running only the jobs that your code change touches.
Open a PR and tag someone to review it. Merge when approved; merging deploys the jar to production.

Note that Airflow deployments depend on cluster bootstrap scripts governed by emr-bootstrap-spark. Changes in job behavior that don't seem to correspond to changes in this repository's code could be related to changes in those other projects.

Local Development

There are two possible workflows for hacking on telemetry-batch-view: you can either create a Docker container for building the package and running tests, or import the project into IntelliJ IDEA.

To run sbt tests inside Docker, run:

# This will take 30+ minutes to run.
./run-sbt.sh test

For more efficient iteration, invoke ./run-sbt.sh without arguments to open an sbt shell, then test only the class you're working on. This avoids paying the sbt startup cost on each iteration:

sbt> testOnly *AddonsViewTest

You may need to increase the amount of memory allocated to Docker for this to work, as some of the tests are very memory hungry at present. At least 4 gigabytes is recommended.

If you wish to import the project into IntelliJ IDEA, apply the following changes to Preferences -> Languages & Frameworks -> Scala Compile Server:

JVM maximum heap size, MB: 2048
JVM parameters: -server -Xmx2G -Xss4M

Note that the first time the project is opened it takes some time to download all the dependencies.

Scala style checker

Scalastyle is used in CI to enforce style rules. To run it locally, use:

sbt scalastyle test:scalastyle

Generating Datasets

See the documentation for specific views for details about running/generating them.

For example, to create a longitudinal view locally:

sbt "runMain com.mozilla.telemetry.views.LongitudinalView --from 20160101 --to 20160701 --bucket telemetry-test-bucket"
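
The --from and --to arguments in this example are yyyyMMdd dates. As a minimal sketch, assuming a job wants to process such a range one day at a time, an inclusive range could be expanded as follows. `DateRange.days` is a hypothetical helper for illustration, not the code LongitudinalView uses.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper, for illustration only.
object DateRange {
  // Assumption: dates are compact yyyyMMdd strings, as in the command above.
  private val fmt = DateTimeFormatter.BASIC_ISO_DATE

  // Inclusive list of days between `from` and `to`; empty when `from` is after `to`.
  def days(from: String, to: String): List[String] = {
    val start = LocalDate.parse(from, fmt)
    val end = LocalDate.parse(to, fmt)
    Iterator.iterate(start)(_.plusDays(1))
      .takeWhile(!_.isAfter(end))
      .map(_.format(fmt))
      .toList
  }
}
```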

For distributed execution we pack all of the classes together into a single JAR and submit it to the cluster:

sbt assembly
spark-submit --master yarn --deploy-mode client --class com.mozilla.telemetry.views.LongitudinalView target/scala-2.11/telemetry-batch-view-*.jar --from 20160101 --to 20160701 --bucket telemetry-test-bucket

Caveats

If you run into memory issues during compilation or while running the test suite, issue the following command before running sbt:

export _JAVA_OPTIONS="-Xms4G -Xmx4G -Xss4M -XX:MaxMetaspaceSize=256M"
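
To check that the heap settings were picked up, one option is to print the maximum heap the JVM reports, for example from the sbt console. `HeapCheck` is a hypothetical name used only for this sketch; `Runtime.getRuntime.maxMemory` is the standard JVM call, and the value it reports may be somewhat lower than the -Xmx setting.

```scala
// Hypothetical helper, for illustration only.
object HeapCheck {
  // Maximum heap the JVM will attempt to use, in MiB.
  def maxHeapMiB: Long = Runtime.getRuntime.maxMemory / (1024L * 1024L)

  def main(args: Array[String]): Unit =
    println(s"Max heap: $maxHeapMiB MiB")
}
```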

Running on Windows

Executing Scala/Spark jobs can be particularly problematic on this platform. Here's a list of common issues and their solutions:

Issue: I see a weird reflection error or an odd exception when trying to run my code.

This is probably because winutils is missing or cannot be found. winutils is required by Hadoop and can be downloaded from here.

Issue: java.net.URISyntaxException: Relative path in absolute URI: ...

This means that winutils cannot be found or that Spark cannot find a valid warehouse directory. Add the following lines at the beginning of your entry function to make it work:

System.setProperty("hadoop.home.dir", "C:\\path\\to\\winutils")
System.setProperty("spark.sql.warehouse.dir", "file:///C:/somereal-dir/spark-warehouse")
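
One possible way to keep these settings from affecting other platforms is to apply them only when running on Windows. This is a sketch under that assumption: `WindowsSetup` and its parameters are hypothetical names, and the paths are the same placeholders as above.

```scala
// Hypothetical helper, for illustration only.
object WindowsSetup {
  // True when the given os.name value looks like a Windows platform.
  def isWindows(osName: String): Boolean =
    osName.toLowerCase.startsWith("windows")

  // Sets the two properties only on Windows; does nothing elsewhere.
  def configure(winutilsHome: String, warehouseUri: String): Unit =
    if (isWindows(System.getProperty("os.name", ""))) {
      System.setProperty("hadoop.home.dir", winutilsHome)
      System.setProperty("spark.sql.warehouse.dir", warehouseUri)
    }
}
```

Calling `WindowsSetup.configure("C:\\path\\to\\winutils", "file:///C:/somereal-dir/spark-warehouse")` at the start of the entry function would then have no effect on Linux or macOS.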

Issue: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: ---------

See SPARK-10528. Run "winutils chmod 777 /tmp/hive" from a privileged prompt to make it work.

Any commit to main should also trigger a CircleCI build that runs the sbt publishing step for you, pushing the artifact to our local Maven repo in S3.