Mozilla Pipeline Schemas

This repository contains schemas for Mozilla's data ingestion pipeline and data lake outputs.

The JSON schemas are used to validate incoming submissions at ingestion time. The jsonschema (Python) and everit-org/json-schema (Java) libraries, both targeting draft 4, are used for JSON Schema validation in this repository's tests. This has implications for which kinds of string patterns are supported; see the Conformance section in the linked document for further details. Note that as of 2019, the data pipeline uses the everit-org/json-schema library for validation in production (see #302).
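
For example, validating a document against a draft-4 schema with the Python jsonschema library looks roughly like the following sketch (the schema and document are invented for illustration, not taken from this repository):

import jsonschema  # pip install jsonschema

# A hypothetical draft-4 schema; real schemas live under schemas/<namespace>/.
schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "clientId": {
            "type": "string",
            # UUID-shaped pattern; validation engines differ in regex dialect,
            # which is why the Conformance notes above matter.
            "pattern": "^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$",
        }
    },
    "required": ["clientId"],
}

document = {"clientId": "5a8c4d0e-27c7-4af9-9e0c-3e1f1d0ab222"}

# Raises jsonschema.exceptions.ValidationError if the document does not conform.
jsonschema.Draft4Validator(schema).validate(document)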

To learn more about writing JSON Schemas, Understanding JSON Schema is a great resource.

Adding a new schema

  • Create the JSON Schema in the templates directory first. Make use of common schema components from the templates/include directory where possible, such as the telemetry environment, clientId, the application block, or UUID patterns. The filename should be templates/<namespace>/<doctype>/<doctype>.<version>.schema.json (see the example sketched below).
  • Build the rendered schemas using the instructions below, and check those artifacts (in the schemas directory) into the git repo as well. See the rationale for this in the "Notes" section below.
  • Add one or more example JSON documents to the validation directory.
  • Run the tests (either via Docker or directly) using the instructions below.
  • Once all tests pass, submit a PR to the GitHub repository against the master branch. See also the notes on contributions.

Note that Pioneer studies have a slightly amended process.
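
For illustration, a brand-new template for a hypothetical namespace and doctype might start out like the sketch below. The namespace, doctype, and fields are invented, and a real template would also pull shared components in from templates/include:

import json
from pathlib import Path

# Hypothetical starting point for
# templates/my_namespace/my_doctype/my_doctype.1.schema.json;
# the namespace, doctype, and fields are invented for illustration.
template = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "payload": {
            "type": "object",
            "properties": {
                "count": {"type": "integer", "minimum": 0}
            },
        }
    },
    "required": ["payload"],
}

path = Path("templates/my_namespace/my_doctype/my_doctype.1.schema.json")
path.parent.mkdir(parents=True, exist_ok=True)
path.write_text(json.dumps(template, indent=2) + "\n")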

Build

Prerequisites

On macOS, these prerequisites can be installed using Homebrew:

brew install cmake
brew install jq
brew install python
brew cask install docker

CMake Build Instructions

git clone https://github.com/mozilla-services/mozilla-pipeline-schemas.git
cd mozilla-pipeline-schemas
mkdir release
cd release

cmake ..  # this builds the schemas (they are rendered from the cmake templates)

Running Tests via Docker

The tests expect example pings to be in the validation/<namespace>/ subdirectory, with files named in the form <ping type>.<version>.<test name>.pass.json for documents expected to be valid, or <ping type>.<version>.<test name>.fail.json for documents expected to fail validation. The test name should match the pattern [0-9a-zA-Z_]+; for example, a passing test document for telemetry/main.4 might be named validation/telemetry/main.4.minimal_doc.pass.json.
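
As an illustrative sketch (not the repository's actual test-collection code), the naming convention can be decoded like this:

import re
from pathlib import Path

# Simplified sketch of how the file name encodes the expected outcome;
# not the repository's actual test-collection logic.
NAME_RE = re.compile(
    r"^(?P<doctype>[^.]+)\.(?P<version>\d+)\.(?P<test>[0-9a-zA-Z_]+)\.(?P<expect>pass|fail)\.json$"
)

def classify(path: Path):
    """Return (namespace, doctype, version, test_name, should_pass) for a validation file."""
    m = NAME_RE.match(path.name)
    if m is None:
        raise ValueError(f"unrecognized validation file name: {path.name}")
    namespace = path.parent.name  # validation/<namespace>/<file>
    return namespace, m["doctype"], int(m["version"]), m["test"], m["expect"] == "pass"

print(classify(Path("validation/telemetry/main.4.minimal_doc.pass.json")))
# -> ('telemetry', 'main', 4, 'minimal_doc', True)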

To run the tests:

# build the container with the pipeline schemas
docker build -t mps .

# run the tests
docker run --rm mps

Packaging and integration tests (optional)

Follow the CMake Build Instructions above to update the schemas directory. To run the unit tests, run the following commands:

# optional: activate a virtual environment with python3.6+
python3 -m venv venv
source venv/bin/activate

# install python dependencies, if they haven't been installed already
pip install -r requirements.txt

# run the tests, with 8 parallel processes
pytest -n 8

# run tests for a specific namespace and doctype
pytest -k telemetry/main.4

# run java tests only (if Java is configured)
pytest -k java

If you would like to run validation against the everit-org/json-schema library used in mozilla/ingestion-beam, either run the Docker container or install the Java dependencies:

export JAVA_HOME=...

# resolves and copies jars into `target/dependency`
mvn dependency:copy-dependencies

# check that tests are not skipped
pytest -k java -n 8

The following Docker command will generate a report against a sample of data from the ingestion system, given proper credentials. Running this is recommended when modifying many schemas or during review.

docker run \
    -e AWS_ACCESS_KEY_ID \
    -e AWS_SECRET_ACCESS_KEY \
    -v "$(pwd)":/app/mozilla-pipeline-schemas \
    -it mozilla/edge-validator:latest \
        make report

Pushes to the main repo will trigger integration tests in CircleCI that directly compare the revision to the master branch. These tests do not run for forked PRs in order to protect data and credentials, but reviewers can trigger tests to run by pushing the PR's revisions to a branch of the main repo. We provide a script for this:

# Before running, double check that the PR doesn't make any changes to
# .circleci/config.yml that could spill sensitive environment variables
# or data contents to the public CircleCI logs.
./.github/push-to-trigger-integration <username>:<branchname>

For details on how to compare two arbitrary revisions, refer to the integration job in .circleci/config.yml. For more documentation, see mozilla-services/edge-validator.

Releases

There is a daily series of tasks run by Airflow (see the probe_scraper DAG) that uses the master branch of this repository as input and ends up pushing final JSONSchema and BigQuery schema files to the generated-schemas branch. As of January 2020, deploying schema changes still requires manual intervention by a member of the Data Ops team, but you can generally expect schemas to be deployed to production BigQuery tables several times a week.

Contributions

  • All non-trivial contributions should start with a bug or issue being filed. (If it is a new feature, please propose your design/approach before doing any work, as not all feature requests are accepted.)
  • If updating the Glean schemas, be sure to update the changelog in include/glean/CHANGELOG.md.
  • This repository is configured to auto-assign a reviewer on PR submission. If you do not receive a response within a few business days (or your request is urgent), please follow up in the #fx-metrics Slack channel.
  • If your PR is associated with a Bugzilla bug, please title it Bug XXX - Description of change; that way the Bugzilla PR Linker will automatically add an attachment with your PR to Bugzilla for future reference.

Notes

All schemas are generated from the 'templates' directory and written into the 'schemas' directory (i.e., the generated artifacts are saved back into the repository), and they are validated against the draft 4 meta-schema, a copy of which resides in the 'tests' directory. The reason for this is twofold:

  1. It lets us easily see and refer to complete schemas as they are actually used. This means that the schemas can be referenced directly in bugs and such, as well as being fetched directly from the repo for testing other schema consumers (test being important here, as any production use should be using the installable packages).
  2. It gives us a changelog for each schema, rather than having to reason about changes to templated external pieces and when/how that impacted a given doctype's schema over time. This means that it should be easy to look back in time for the provenance of different parts of the schema for each doctype.
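
To make point 2 concrete, template rendering conceptually amounts to splicing shared files into each doctype's template before committing the fully resolved result. The sketch below is illustrative only: the real rendering is done by the CMake templates, and the "$include" marker here is invented for this example.

import json
from pathlib import Path

def render(template_path: Path, include_dir: Path) -> dict:
    """Resolve invented {"$include": "<file>"} markers against templates/include/."""
    def resolve(node):
        if isinstance(node, dict):
            if "$include" in node:
                shared = json.loads((include_dir / node["$include"]).read_text())
                return resolve(shared)
            return {key: resolve(value) for key, value in node.items()}
        if isinstance(node, list):
            return [resolve(item) for item in node]
        return node
    return resolve(json.loads(template_path.read_text()))

# Because the resolved output is committed under schemas/, editing one shared
# include produces a visible diff in every rendered schema that uses it.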