A library for creating full representations of Mozilla telemetry pings.
akkomar 7199bb64ed
CI: update Docker image (#283)
Hopefully fixes https://github.com/mozilla/mozilla-schema-generator/pull/282.

This is the same image that's used in gcp-ingestion for publishing to Dockerhub: 50eb31f6dc/.circleci/config.yml (L135).
2024-11-20 17:38:12 +01:00


Mozilla Schema Generator

A library for generating full representations of Mozilla telemetry pings.

See Mozilla Pipeline Schemas for the more generic structure of pings. This library takes those generic structures and fills in all of the probes we expect to see in the appropriate places.

Telemetry Integration

There are two generic ping types we're targeting for this library:

  1. The Common Ping Format is used for many legacy pings from Firefox Desktop, including the "main" ping.
  2. The Glean Ping Format is the common structure used for all newly instrumented products at Mozilla, including mobile browsers.

This library takes the information for what should be in those pings from the Probe Info Service.

Data Store Integration

The primary use of the schemas is for integration with the Schema Transpiler. The schemas that this repository generates can be transpiled into Avro and BigQuery schemas. They define the schema of the Avro and BigQuery tables that the BQ Sink writes to.

Validation

When we validate pings against a schema in the data pipeline, we use the generic versions rather than the versions generated by this repository's machinery. While the schemas produced here are guaranteed to be more correct since they include explicit definitions of every metric and probe, we find in practice there are too many edge cases where a probe is sent with the incorrect type and we need to coerce it to the correct type when loading to BigQuery. We also purposely represent some complex types as JSON strings in schemas, relying on the BQ loader to coerce objects to string. We could still consider using the generated schemas for validation in the future, but additional work would be required to ensure it does not lead to mass rejection of pings.
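To illustrate the kind of coercion described above, here is a minimal sketch of turning a complex value into a JSON string when the target column is a string. This is not the BQ loader's code; `coerce_to_string` is a hypothetical helper, and the real loader may serialize differently.

```python
import json
from typing import Any


def coerce_to_string(value: Any) -> str:
    """Return `value` unchanged if it is already a string, else its JSON encoding.

    Hypothetical helper: it only illustrates representing a complex type
    as a JSON string, as the paragraph above describes.
    """
    if isinstance(value, str):
        return value
    # sort_keys gives a stable encoding for objects
    return json.dumps(value, sort_keys=True)
```

A probe that arrives as an object would then be stored as its JSON text rather than rejected for having the wrong type.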

Usage

Main Ping

Generate the Full Main Ping schema:

mozilla-schema-generator generate-main-ping

If you pass the --out-dir parameter, its value is used as the namespace for the pings.

To see a full list of options, run mozilla-schema-generator generate-main-ping --help.

Glean

Generate all Glean ping schemas - one for each application, for each ping that application sends:

mozilla-schema-generator generate-glean-pings

Write schemas to a directory:

mozilla-schema-generator generate-glean-pings --out-dir glean-ping

To see a full list of options, run mozilla-schema-generator generate-glean-pings --help.

Configuration Files

Configuration files are by default found in /config. You can also specify your own when running the generator.

Configuration files match certain parts of a ping to certain types of probes or metrics. The nesting of the config file matches the ping it is filling in. For example, Glean stores probe types under the metrics key, so the nesting looks like this:

{
    "metrics": {
        "string": {
            <METRIC_ID>: {...}
        }
    }
}

While the generic schema doesn't include information about the specific <METRIC_ID>s being included, the schema-generator does. To include the correct metrics that we would find in that section of the ping, we would organize the config.yaml file like this:

metrics:
    string:
        match:
            type: string

The match key indicates that we should fill in this section of the ping schema with metrics, and the type: string makes sure we only put string metrics in there. You can do an exact match on any field available in the ping info from the probe-info-service, which also contains the Desktop probes.
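As a rough illustration of how the nesting and the match key relate, the sketch below walks a config that has already been loaded into a Python dict and lists each section that carries a match key, together with its path in the ping. The function name `match_sections` is hypothetical and is not taken from the library.

```python
from typing import Any, Dict, Iterator, Tuple

Path = Tuple[str, ...]


def match_sections(config: Dict[str, Any], path: Path = ()) -> Iterator[Tuple[Path, Dict[str, Any]]]:
    """Yield (path, match_rules) for every nested section that has a `match` key."""
    for key, value in config.items():
        if not isinstance(value, dict):
            continue
        if "match" in value:
            # This section should be filled in with matching probes or metrics.
            yield path + (key,), value["match"]
        else:
            # Keep descending: the config nesting mirrors the ping nesting.
            yield from match_sections(value, path + (key,))
```

For the config shown above, this yields a single entry: the path `("metrics", "string")` with the rules `{"type": "string"}`.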

There are a few additional keywords allowable under any field:

  • contains - e.g. process: contains: main, indicates that the process field is an array and it should only match those that include the entry main.
  • not - e.g. send_in_pings: not: glean_ping_info, indicates that we should match any field for send_in_pings except glean_ping_info.
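The matching rules above (exact match, contains, and not) can be sketched as a small predicate. This is only an illustration under stated assumptions, not the library's matcher: `field_matches` and `probe_matches` are hypothetical names, a field missing from the probe is treated as a non-match, and not is treated as plain inequality against the field's value.

```python
from typing import Any, Mapping


def field_matches(value: Any, rule: Any) -> bool:
    """Check one probe field against one rule from a `match` section."""
    if isinstance(rule, Mapping):
        ok = True
        if "contains" in rule:
            # The field is expected to be an array that includes the entry.
            ok = ok and isinstance(value, (list, tuple)) and rule["contains"] in value
        if "not" in rule:
            # Assumption: `not` means "anything except this value".
            ok = ok and value != rule["not"]
        return ok
    # A plain value is an exact match.
    return value == rule


def probe_matches(probe: Mapping[str, Any], match: Mapping[str, Any]) -> bool:
    """A probe matches when every field named in `match` is present and satisfies its rule."""
    return all(
        field in probe and field_matches(probe[field], rule)
        for field, rule in match.items()
    )
```

With a probe such as `{"type": "string", "process": ["main", "content"]}`, the rules `{"type": "string"}` and `{"process": {"contains": "main"}}` both match, while `{"process": {"contains": "gpu"}}` does not.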

table_group Key

This specific field is for indicating which table group that section of the ping should be included in when splitting the schema. Currently we do not split any pings. See the section on BigQuery Limitations and Splitting for more info.

Allowing schema incompatible changes

On every run of the schema generator, there is a check for incompatible changes between the previous revision and current generated revision. A schema incompatible change includes a removal of a schema or a column, or a change in the type definition of a column.
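A minimal sketch of such a check is shown below, assuming each schema revision has been flattened into a mapping from column path to type name. That flattened representation and the function `incompatible_changes` are assumptions made for illustration; the generator's actual check may work differently.

```python
from typing import Dict, List

# Hypothetical flattened view of a table schema: column path -> type name.
Columns = Dict[str, str]


def incompatible_changes(previous: Columns, current: Columns) -> List[str]:
    """List column removals and type changes between two schema revisions.

    Newly added columns are not reported, since only removals and type
    changes are described as incompatible.
    """
    problems = []
    for column, old_type in sorted(previous.items()):
        if column not in current:
            problems.append(f"removed column: {column}")
        elif current[column] != old_type:
            problems.append(f"type changed: {column} ({old_type} -> {current[column]})")
    return problems
```

An empty result means the new revision only adds columns or is unchanged. Removing a whole schema would show up as every one of its columns being removed.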

There are two methods to get around these restrictions. If you are actively developing the schema generator and need to introduce a schema incompatible change, set MPS_VALIDATE_BQ=false.

If a schema incompatible change needs to be introduced in production (i.e. generated-schemas), then modify the incompatibility-allowlist at the root of the repository. Add documents in the form of {namespace}.{doctype}.{docversion}. Globs are allowed. For example, add the following line to allow removing schemas under the my_glean_app namespace:

my_glean_app.*

Once the commit has gone through successfully, this line should be removed from the document.
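To show how a glob entry in the allowlist relates to document names of the form {namespace}.{doctype}.{docversion}, here is a small sketch that uses Python's standard fnmatch module as a stand-in. The function `is_allowed` is hypothetical, and the glob semantics of the real check may differ from fnmatch's.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def is_allowed(document: str, allowlist: Iterable[str]) -> bool:
    """Return True if `document` matches any non-empty pattern in the allowlist.

    `document` is a name like "namespace.doctype.docversion"; `allowlist`
    is an iterable of lines, e.g. the lines of the allowlist file.
    """
    patterns = (line.strip() for line in allowlist)
    return any(fnmatchcase(document, pattern) for pattern in patterns if pattern)
```

With the entry `my_glean_app.*`, a document named `my_glean_app.baseline.1` is allowed, while `other_app.baseline.1` is not.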

Development and Testing

Install requirements:

make install-requirements

Ensure that the mozilla-pipeline-schemas submodule has been checked out:

git submodule init
git submodule update --remote

Run tests:

make test

To publish generated schemas to mozilla-generated-schemas/test-generated-schemas, run:

git fetch origin

git checkout <branch-to-test>

export MPS_SSH_KEY_BASE64=$(cat ~/.ssh/id_rsa | base64)

# generate all schemas for current main
git checkout main && git pull
make build && make run

# generate all schemas with changes and compare with main
git checkout <branch-to-test>
make build && make run