[![Build Status](https://circleci.com/gh/mozilla/telemetry-streaming/tree/master.svg?style=svg)](https://circleci.com/gh/mozilla/telemetry-streaming/tree/master)
[![codecov.io](https://codecov.io/github/mozilla/telemetry-streaming/coverage.svg?branch=master)](https://codecov.io/github/mozilla/telemetry-streaming?branch=master)

*This repository is no longer in use at Mozilla! It was designed to be run on our AWS-based telemetry infrastructure.*

# telemetry-streaming
Spark Streaming ETL jobs for Mozilla Telemetry

This service currently contains jobs that aggregate error data
at 5-minute intervals. It generates the (internal-only)
`error_aggregates` and `experiment_error_aggregates` Parquet tables at
Mozilla.
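
To make "aggregate at 5-minute intervals" concrete, here is a minimal sketch of windowed counting. It is not the job's actual Spark logic: the input format (one `epoch_seconds error_type` pair per line) and the function name `aggregate_errors` are assumptions made for illustration only.

```bash
# Hypothetical input on stdin: one "epoch_seconds error_type" pair per line.
# Floors each timestamp to the start of its 5-minute (300 s) window and
# counts rows per (window_start, error_type).
aggregate_errors() {
  awk '{ w = $1 - ($1 % 300); n[w " " $2]++ }
       END { for (k in n) print k, n[k] }' | sort
}
```

Each output line has the form `window_start error_type count`, so events at seconds 10 and 290 of a window land in the same row, while an event at second 300 starts the next one.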

## Issue Tracking

Please file bugs related to the error aggregates streaming job in the
[Datasets: Error Aggregates](https://bugzilla.mozilla.org/enter_bug.cgi?product=Data%20Platform%20and%20Tools&component=Datasets%3A%20Error%20Aggregates) component.

## Deployment

The jobs defined in this repository are generally deployed as streaming jobs within
[our hosted Databricks account](https://docs.telemetry.mozilla.org/concepts/pipeline/data_pipeline_detail.html?highlight=databricks#databricks-managed-spark-analysis),
but some are deployed as periodic batch jobs via Airflow
using wrappers codified in
[telemetry-airflow](https://github.com/mozilla/telemetry-airflow)
that spin up EMR clusters whose configuration is governed by
[emr-bootstrap-spark](https://github.com/mozilla/emr-bootstrap-spark/).
Changes in production behavior that don't seem to correspond to changes
in this repository's code could be related to changes in those other projects.

## Amplitude Event Configuration

Some of the jobs defined in `telemetry-streaming` transform telemetry events
and republish them to [Amplitude](https://amplitude.com/) for further analysis.
Events are filtered and transformed according to JSON configurations.
If you're creating or updating such a configuration, see:

- [Amplitude event configuration docs](docs/amplitude)

## Development

The recommended workflow is to edit the source code in your favorite editor
and run the tests via sbt. Some common sbt invocations:

* `sbt test  # run the basic set of tests (good enough for most purposes)`
* `sbt "testOnly *ErrorAgg*"  # run only the test classes whose names match *ErrorAgg*`
* `sbt "testOnly *ErrorAgg* -- -z version"  # same, limited to test cases with "version" in their names`
* `sbt dockerComposeTest  # run the docker compose tests (slow)`
* `sbt "dockerComposeTest -tags:DockerComposeTag" # run only tests with DockerComposeTag (while using docker)`
* `sbt scalastyle test:scalastyle  # run linter`
* `sbt ci  # run the full set of continuous integration tests`

Some tests need Kafka to run. To run them from an IDE, start the test cluster first:
```bash
sbt dockerComposeUp
```
or via plain docker-compose:
```bash
export DOCKER_KAFKA_HOST=$(./docker_setup.sh)
docker-compose -f docker/docker-compose.yml up
```
Shut down the cluster when you are done:
```bash
sbt dockerComposeStop
```
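
The start and stop steps above can be combined so the cluster is always shut down, even when the tests fail. The following is a minimal sketch: `with_test_cluster` is a hypothetical helper name that this repository does not provide, and it assumes `sbt dockerComposeUp` returns once the cluster has started.

```bash
# Sketch: run any command while the Kafka test cluster is up, then stop the
# cluster even if the command fails. `with_test_cluster` is a hypothetical
# helper, not part of this repository.
with_test_cluster() {
  sbt dockerComposeUp || return 1   # start the cluster; give up if that fails
  local status=0
  "$@" || status=$?                 # run the wrapped command, remember its exit code
  sbt dockerComposeStop             # always tear the cluster down
  return "$status"                  # report the wrapped command's result
}
```

After defining the function in your shell, something like `with_test_cluster sbt test` would run the wrapped command against the live cluster and return that command's exit code.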