This repository is no longer in use at Mozilla! It was designed to be run on our AWS-based telemetry infrastructure.

telemetry-streaming

Spark Streaming ETL jobs for Mozilla Telemetry

This service currently contains jobs that aggregate error data over five-minute intervals. It is responsible for generating the (internal-only) error_aggregates and experiment_error_aggregates Parquet tables at Mozilla.
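The core idea behind these jobs can be sketched without Spark: truncate each error event's timestamp to the start of its five-minute window, then count events per window and error type. The `ErrorEvent` case class and its field names below are purely illustrative (not the actual telemetry schema), and the real jobs perform this aggregation with Spark Streaming rather than plain collections:

```scala
// Simplified, non-Spark sketch of five-minute error aggregation.
// ErrorEvent and its fields are illustrative, not the real schema.
case class ErrorEvent(timestampMs: Long, errorType: String)

object ErrorWindows {
  val WindowMs: Long = 5 * 60 * 1000L

  // Truncate a timestamp to the start of its five-minute window.
  def windowStart(timestampMs: Long): Long =
    timestampMs - (timestampMs % WindowMs)

  // Count events per (window start, error type) pair.
  def aggregate(events: Seq[ErrorEvent]): Map[(Long, String), Int] =
    events
      .groupBy(e => (windowStart(e.timestampMs), e.errorType))
      .map { case (key, group) => key -> group.size }
}
```

The production jobs express the same grouping as a windowed aggregation on a streaming DataFrame, which additionally handles late-arriving data and incremental output.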

Issue Tracking

Please file bugs related to the error aggregates streaming job in the Datasets: Error Aggregates component.

Deployment

The jobs defined in this repository are generally deployed as streaming jobs within our hosted Databricks account, but some are deployed as periodic batch jobs via Airflow using wrappers codified in telemetry-airflow that spin up EMR clusters whose configuration is governed by emr-bootstrap-spark. Changes in production behavior that don't seem to correspond to changes in this repository's code could be related to changes in those other projects.

Amplitude Event Configuration

Some of the jobs defined in telemetry-streaming exist to transform telemetry events and republish them to Amplitude for further analysis. Filtering and transforming events is accomplished via JSON configuration files; see the examples in the configs directory and the documentation under docs/amplitude if you're creating or updating such a schema.
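As a purely hypothetical illustration of the general shape of such a configuration (the field names below are invented for this sketch; consult the real files in the configs directory for the actual schema), a config pairs filters on telemetry event fields with the Amplitude event name to publish:

```json
{
  "events": [
    {
      "filters": { "category": "normandy", "method": "enroll" },
      "amplitudeName": "experiment enrollment"
    }
  ]
}
```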

Development

The recommended workflow for running tests is to use your favorite editor for editing the source code and running the tests via sbt. Some common invocations for sbt:

  • sbt test # run the basic set of tests (good enough for most purposes)
  • sbt "testOnly *ErrorAgg*" # run only the test suites whose names match ErrorAgg
  • sbt "testOnly *ErrorAgg* -- -z version" # run only the test suites whose names match ErrorAgg, limited to test cases whose names contain "version"
  • sbt dockerComposeTest # run the docker compose tests (slow)
  • sbt "dockerComposeTest -tags:DockerComposeTag" # run only tests with DockerComposeTag (while using docker)
  • sbt scalastyle test:scalastyle # run the linter on main and test sources
  • sbt ci # run the full set of continuous integration tests

Some tests need Kafka to run. If you prefer to run them from an IDE, you must first start the test cluster:

sbt dockerComposeUp

or via plain docker-compose:

export DOCKER_KAFKA_HOST=$(./docker_setup.sh)
docker-compose -f docker/docker-compose.yml up

Remember to shut down the cluster afterwards:

sbt dockerComposeStop