Apache Beam Jobs for Ingestion
This ingestion-beam Java module contains our Apache Beam jobs for use in ingestion. Google Cloud Dataflow is a Google Cloud Platform service that natively runs Apache Beam jobs.
The source code lives in the ingestion-beam subdirectory of the gcp-ingestion repository.
The following are the main Beam classes; see their respective sections in the documentation:
- Decoder job: A job for normalizing ingestion messages
- Republisher job: A job for republishing subsets of decoded messages to new destinations
There are a few additional jobs for special cases listed in the index for this section.
Building
Move to the ingestion-beam subdirectory of your gcp-ingestion checkout and run:
./bin/mvn clean compile
See the section on each job below for how to run what you've produced.
Testing
Before anything else, be sure to download the test data:
./bin/download-cities15000
./bin/download-geolite2
./bin/download-schemas
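The three downloads above can also be wrapped so that a missing script or a failed download stops the sequence early. The following is a minimal sketch, not something shipped in the repository: the function name `download_test_data` and its optional directory argument are assumptions made for illustration, and it only calls the three scripts named above.

```shell
# Hypothetical wrapper (not part of gcp-ingestion): run the three download
# scripts in order from the given bin directory (default ./bin), stopping
# at the first script that is missing or exits with a non-zero status.
download_test_data() {
  bin_dir="${1:-./bin}"
  for script in download-cities15000 download-geolite2 download-schemas; do
    if [ ! -x "$bin_dir/$script" ]; then
      echo "missing or not executable: $bin_dir/$script" >&2
      return 1
    fi
    "$bin_dir/$script" || return 1
  done
}
```

From the ingestion-beam subdirectory, `download_test_data && ./bin/mvn clean test` would then run the tests only if all three downloads succeeded.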
Run tests locally with the CircleCI Local CLI:
(cd .. && circleci build --job ingestion-beam)
To make more targeted test invocations, you can install Java and Maven locally or use the bin/mvn executable to run Maven in Docker:
./bin/mvn clean test
If you wish to just run a single test class or a single test case, try something like this:
# Run all tests in a single class
./bin/mvn test -Dtest=com.mozilla.telemetry.util.SnakeCaseTest
# Run only a single test case
./bin/mvn test -Dtest='com.mozilla.telemetry.util.SnakeCaseTest#testSnakeCaseFormat'
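If you switch between class-level and method-level runs often, a small shell function can assemble the `-Dtest` selector in the `Class#method` form shown above. This is only a sketch: the name `beam_test_selector` is hypothetical and not part of the repository, and the function prints the argument rather than invoking Maven itself.

```shell
# Hypothetical helper (not part of gcp-ingestion): print a -Dtest selector
# for a test class and, optionally, a single test method in that class.
beam_test_selector() {
  class="${1:-}"
  method="${2:-}"
  if [ -z "$class" ]; then
    echo "usage: beam_test_selector CLASS [METHOD]" >&2
    return 1
  fi
  if [ -n "$method" ]; then
    # Class#method selects one test case.
    printf '%s\n' "-Dtest=${class}#${method}"
  else
    # A bare class name selects every test in the class.
    printf '%s\n' "-Dtest=${class}"
  fi
}
```

For example, `./bin/mvn test "$(beam_test_selector com.mozilla.telemetry.util.SnakeCaseTest testSnakeCaseFormat)"` should be equivalent to the single-test-case command above; quoting the substitution keeps the `#` from being treated as a comment.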
To run the project in a sandbox against production data, see this document on configuring an integration testing workflow.
Code Formatting
Use spotless to automatically reformat code:
mvn spotless:apply
or just check what changes it requires:
mvn spotless:check