telemetry-streaming

Spark Streaming ETL jobs for Mozilla Telemetry

This service currently contains jobs that aggregate error data on 5-minute intervals. It is responsible for generating the (internal only) error_aggregates and experiment_error_aggregates parquet tables at Mozilla.
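
For orientation, here is a minimal sketch (not the actual implementation; see src/ for the real jobs) of the general shape such a job takes with Spark Structured Streaming: count records per 5-minute window and write the completed windows out as parquet. The Kafka topic name, output path, and checkpoint location are illustrative assumptions only:

// Sketch only: a 5-minute windowed streaming aggregation in the spirit of the
// error_aggregates job. Topic, columns, and paths are hypothetical placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object ErrorAggregatesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("error-aggregates-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical source: raw pings arriving on a Kafka topic (requires spark-sql-kafka).
    val pings = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers",
        sys.env.getOrElse("DOCKER_KAFKA_HOST", "localhost") + ":9092")
      .option("subscribe", "telemetry") // hypothetical topic name
      .load()

    // Count records per 5-minute window of the Kafka message timestamp.
    val aggregates = pings
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"))
      .count()

    // Write each completed window out as parquet, as the error_aggregates tables are.
    aggregates.writeStream
      .format("parquet")
      .option("path", "error_aggregates")                // hypothetical output location
      .option("checkpointLocation", "/tmp/checkpoints")  // hypothetical checkpoint dir
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}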

Issue Tracking

Please file bugs in the Datasets: Error Aggregates component.

Amplitude Event Configuration

Some of the jobs defined in telemetry-streaming exist to transform telemetry events and republish them to Amplitude for further analysis. Filtering and transforming events is driven by JSON configuration files (see the configs directory). If you're creating or updating such a configuration, see the documentation under docs/amplitude.
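
As a rough illustration of config-driven filtering, the sketch below reads a list of allowed event names from JSON (using json4s, which ships with Spark) and filters incoming events against it. The allowedEvents field and the event names are invented for illustration; they are not the project's actual configuration schema:

// Hypothetical sketch of config-driven event filtering; the JSON layout and event
// names are invented for illustration and are not the real Amplitude config schema.
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

object AmplitudeConfigSketch {
  implicit val formats: Formats = DefaultFormats

  // Made-up config listing which event names should be republished.
  val config = parse("""{ "allowedEvents": ["session_start", "session_split"] }""")
  val allowedEvents = (config \ "allowedEvents").extract[List[String]].toSet

  def main(args: Array[String]): Unit = {
    val incoming = Seq("session_start", "uri_count", "session_split")
    val republished = incoming.filter(allowedEvents.contains)
    println(republished) // List(session_start, session_split)
  }
}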

Development

The recommended workflow is to edit the source code in your favorite editor and run the tests via sbt. Some common sbt invocations:

  • sbt test # run the basic set of tests (good enough for most purposes)
  • sbt "testOnly *ErrorAgg*" # run the tests only for packages matching ErrorAgg
  • sbt "testOnly *ErrorAgg* -- -z version" # run the tests only for packages matching ErrorAgg, limited to test cases with "version" in them
  • sbt dockerComposeTest # run the docker compose tests (slow)
  • sbt "dockerComposeTest -tags:DockerComposeTag" # run only tests with DockerComposeTag (while using docker)
  • sbt scalastyle test:scalastyle # run linter
  • sbt ci # run the full set of continuous integration tests

Some tests require a running Kafka cluster. If you prefer to run them from an IDE, you first need to start the test cluster:

sbt dockerComposeUp

or via plain docker-compose:

export DOCKER_KAFKA_HOST=$(./docker_setup.sh)
docker-compose -f docker/docker-compose.yml up

Remember to shut down the cluster afterwards:

sbt dockerComposeStop