This substantially reorganizes the documentation as an mkdocs site. Main
changes:

* All documentation is now browseable and searchable in a single site, with a
  handy table of contents on the side of each section
* Top-level README significantly slimmed down (just pointing to the docs site)
* READMEs inside individual components removed (content moved to subdirectories
  inside the docs/ folder, accessible from the top level of the docs site)
William Lachance 2019-08-12 17:06:19 -04:00 committed by GitHub
Parent 63510edf26
Commit 85863eef1e
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
22 changed files: 124 additions and 337 deletions

View file

@ -9,14 +9,23 @@ jobs:
name: Spell Check
command: mdspell --ignore-numbers --en-us --report '**/*.md'
doctoc:
docs:
docker:
- image: node:8.10.0
- image: circleci/python:3.7
steps:
- checkout
- run:
name: Ensure markdown tables of contents are up to date
command: ./.circleci/doctoc-check.sh
- checkout
- run:
name: Install dependencies
command: sudo pip install mkdocs markdown-include
- add_ssh_keys:
fingerprints:
"84:b0:66:dd:ec:68:b1:45:9d:5d:66:fd:4a:4f:1b:57"
- run:
name: Build and deploy docs
command: |
if [ $CIRCLE_BRANCH == "master" ]; then
mkdocs gh-deploy
fi
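
The docs job above installs mkdocs and only deploys from master. As a rough local
equivalent for previewing changes before merge (a sketch, not part of this commit;
it simply mirrors the dependencies installed in the CI step):

```bash
# Mirror the CI dependencies, then build or serve the site locally.
pip install mkdocs markdown-include

# Build into ./site; --strict turns warnings (e.g. nav entries pointing at
# missing files) into errors.
mkdocs build --strict

# Or serve with live reload at http://127.0.0.1:8000
mkdocs serve
```
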
ingestion-edge: &ingestion-edge
working_directory: /root/project/ingestion-edge
@ -173,7 +182,10 @@ workflows:
build:
jobs:
- spelling
- doctoc
- docs:
filters:
tags:
only: /.*/
- ingestion-edge
- ingestion-edge-release:
filters:

View file

@ -1,13 +0,0 @@
#!/bin/bash
set -e
bash "$(dirname $0)/doctoc-run.sh"
# Exit with success code if doctoc modified no files.
git diff --name-only | grep '.md$' || exit 0
# Print instructions and fail this test.
echo "Some markdown files have outdated Table of Contents!"
echo "To fix, run ./bin/update-toc"
exit 1

View file

@ -1,9 +0,0 @@
#!/bin/bash
# Run doctoc to update tables of contents in markdown files.
# https://www.npmjs.com/package/doctoc
set -e
npm install -g --silent doctoc
doctoc . --notitle

View file

@ -16,6 +16,7 @@ CircleCI
CLI
cron
Dataflow
datapipeline
dataset
deduplicate
deduplication
@ -28,6 +29,8 @@ encodings
failsafe
featureful
filesystem
fx-metrics
gcp-ingestion
GCP
GCS
GeoIP
@ -41,6 +44,7 @@ HTTPS
hyperloglog
IAM
IPs
irc.mozilla.org
Javadoc
JSON
JVM

View file

@ -1,12 +1,3 @@
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
- [Community Participation Guidelines](#community-participation-guidelines)
- [How to Report](#how-to-report)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
# Community Participation Guidelines
This repository is governed by Mozilla's code of conduct and etiquette guidelines.

View file

@ -1,19 +1,16 @@
# Telemetry Ingestion on Google Cloud Platform
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
[![CircleCI](https://circleci.com/gh/mozilla/gcp-ingestion.svg?style=svg&circle-token=d98a470269580907d5c6d74d0e67612834a21be7)](https://circleci.com/gh/mozilla/gcp-ingestion)
A monorepo for documentation and implementation of the Mozilla telemetry
ingestion system deployed to Google Cloud Platform (GCP).
The overall architecture is described in [docs/architecture](docs/architecture)
along with commentary on design decisions.
Individual components are specified under [docs](docs) and implemented
under the various `ingestion-*` service directories:
There are currently two components:
- [ingestion-edge](ingestion-edge): a simple Python service for accepting HTTP
messages and delivering to Google Cloud Pub/Sub
- [ingestion-beam](ingestion-beam): a Java module defining
[Apache Beam](https://beam.apache.org/) jobs for streaming and batch
transformations of ingested messages
For more information, see [the documentation](https://mozilla.github.io/gcp-ingestion).

View file

@ -1,12 +0,0 @@
#!/bin/bash
# Updates the Table of Contents in README.md; see
set -e
cd "$(dirname "$0")/.."
IMAGE=node:8.12.0
docker run -it --rm \
--volume $PWD:/root/project \
--workdir /root/project \
$IMAGE \
/bin/bash .circleci/doctoc-run.sh

View file

@ -3,22 +3,8 @@
This document specifies the behavior of the service that delivers decoded
messages into BigQuery.
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
- [Data Flow](#data-flow)
- [Implementation](#implementation)
- [Configuration](#configuration)
- [Coerce Types](#coerce-types)
- [Accumulate Unknown Values As `additional_properties`](#accumulate-unknown-values-as-additional_properties)
- [Errors](#errors)
- [Error Message Schema](#error-message-schema)
- [Other Considerations](#other-considerations)
- [Message Acks](#message-acks)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
## Data Flow
Consume messages from a PubSub topic or Cloud Storage location and insert them
@ -86,7 +72,7 @@ retries are handled automatically and all errors returned are non-transient.
#### Error Message Schema
Always include the error attributes specified in the [Decoded Error Message
Schema](decoder.md#error-message-schema).
Schema](decoder_service_specification.md#error-message-schema).
Encode errors received as type `TableRow` as JSON in the payload of a
`PubsubMessage`, and add error attributes.

View file

@ -3,22 +3,6 @@
This document specifies the behavior of the service that decodes messages
in the Structured Ingestion pipeline.
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
- [Data Flow](#data-flow)
- [Implementation](#implementation)
- [Decoding Errors](#decoding-errors)
- [Error message schema](#error-message-schema)
- [Raw message schema](#raw-message-schema)
- [Decoded message metadata schema](#decoded-message-metadata-schema)
- [Other Considerations](#other-considerations)
- [Message Acks](#message-acks)
- [Deduplication](#deduplication)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
## Data Flow
1. Consume messages from Google Cloud PubSub raw topic
@ -69,7 +53,7 @@ required group attributes {
### Raw message schema
See [Edge Server PubSub Message Schema](edge.md#edge-server-pubsub-message-schema).
See [Edge Service PubSub Message Schema](edge_service_specification.md#pubsub-message-schema).
### Decoded message metadata schema

View file

@ -1,23 +1,8 @@
# Differences from AWS Architecture
# Differences from AWS
This document explains how GCP Ingestion differs from the [AWS Data Platform
Architecture](https://mana.mozilla.org/wiki/display/SVCOPS/Telemetry+-+Data+Pipeline+Architecture).
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
- [Replace Heka Framed Protobuf with newline delimited JSON](#replace-heka-framed-protobuf-with-newline-delimited-json)
- [Replace EC2 Edge with Kubernetes Edge](#replace-ec2-edge-with-kubernetes-edge)
- [Replace Kafka with PubSub](#replace-kafka-with-pubsub)
- [Replace Hindsight Data Warehouse Loaders with Dataflow](#replace-hindsight-data-warehouse-loaders-with-dataflow)
- [Replace S3 with Cloud Storage](#replace-s3-with-cloud-storage)
- [Messages Always Delivered to Message Queue](#messages-always-delivered-to-message-queue)
- [Landfill is Downstream from Message Queue](#landfill-is-downstream-from-message-queue)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
## Replace Heka Framed Protobuf with newline delimited JSON
Heka framed protobuf requires special code to read and write. Newline delimited

View file

@ -2,19 +2,8 @@
This document outlines plans to migrate edge traffic from AWS to GCP using the code in this repository.
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
- [Current state](#current-state)
- [Phase 1](#phase-1)
- [Phase 2](#phase-2)
- [Phase 3](#phase-3)
- [Phase 3 (alternative)](#phase-3-alternative)
- [Phase 4](#phase-4)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
## Current state
Today, data producers send data to the ingestion stack on AWS as described [here](https://github.com/mozilla/firefox-data-docs/blob/042fddcbf27aa5993ee5578224200a3ef65fd7c7/src/concepts/pipeline/data_pipeline_detail.md#ingestion).

View file

@ -3,31 +3,6 @@
This document specifies the behavior of the server that accepts submissions
from HTTP clients e.g. Firefox telemetry.
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
- [General Data Flow](#general-data-flow)
- [Namespaces](#namespaces)
- [Forwarding to the pipeline](#forwarding-to-the-pipeline)
- [Edge Server PubSub Message Schema](#edge-server-pubsub-message-schema)
- [Server Request/Response](#server-requestresponse)
- [GET Request](#get-request)
- [GET Response codes](#get-response-codes)
- [POST/PUT Request](#postput-request)
- [Legacy Systems](#legacy-systems)
- [POST/PUT Response codes](#postput-response-codes)
- [Other Response codes](#other-response-codes)
- [Other Considerations](#other-considerations)
- [Compression](#compression)
- [Bad Messages](#bad-messages)
- [PubSub Topics](#pubsub-topics)
- [GeoIP Lookups](#geoip-lookups)
- [Data Retention](#data-retention)
- [Submission Timestamp Format](#submission-timestamp-format)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
## General Data Flow
HTTP submissions come in from the wild, hit a load balancer, then optionally an
@ -54,7 +29,7 @@ configuration options.
The message is written to PubSub. If the message cannot be written to PubSub it
is written to a disk queue that will periodically retry writing to PubSub.
### Edge Server PubSub Message Schema
### PubSub Message Schema
```
required string data // base64 encoded body

View file

@ -3,18 +3,8 @@
This document specifies the behavior of the service that batches raw messages
into long term storage.
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
- [Data Flow](#data-flow)
- [Implementation](#implementation)
- [Latency](#latency)
- [Other Considerations](#other-considerations)
- [Message Acks](#message-acks)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
## Data Flow
Consume messages from a Google Cloud PubSub topic and write in batches to

View file

@ -2,31 +2,6 @@
This document specifies the architecture for GCP Ingestion as a whole.
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
- [Architecture Diagram](#architecture-diagram)
- [Architecture Components](#architecture-components)
- [Ingestion Edge](#ingestion-edge)
- [Landfill Sink](#landfill-sink)
- [Decoder](#decoder)
- [Republisher](#republisher)
- [BigQuery Sink](#bigquery-sink)
- [Dataset Sink](#dataset-sink)
- [Notes](#notes)
- [Design Decisions](#design-decisions)
- [Kubernetes Engine and PubSub](#kubernetes-engine-and-pubsub)
- [Different topics for "raw" and "validated" data](#different-topics-for-raw-and-validated-data)
- [BigQuery](#bigquery)
- [Save messages as newline delimited JSON](#save-messages-as-newline-delimited-json)
- [Use destination tables](#use-destination-tables)
- [Use views for user-facing data](#use-views-for-user-facing-data)
- [Known Issues](#known-issues)
- [Further Reading](#further-reading)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
## Architecture Diagram
![diagram.mmd](diagram.svg "Architecture Diagram")

View file

@ -1,21 +1,7 @@
# Overview
# Pain points
A running list of things that are suboptimal in GCP.
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
- [App Engine](#app-engine)
- [Dataflow](#dataflow)
- [`BigQueryIO.Write`](#bigqueryiowrite)
- [`FileIO.Write`](#fileiowrite)
- [`PubsubIO.Write`](#pubsubiowrite)
- [Templates](#templates)
- [PubSub](#pubsub)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
# App Engine
For network-bound applications it can be prohibitively expensive. A PubSub push

View file

@ -5,18 +5,6 @@ Percentage determined by the Reliability Target below. If a component does
not meet that then a Stability Work Period should be assigned
to each software engineer supporting the component.
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
- [Disclaimer and Purpose](#disclaimer-and-purpose)
- [Reliability Target](#reliability-target)
- [Definitions](#definitions)
- [Exclusions](#exclusions)
- [Additional Information](#additional-information)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
## Disclaimer and Purpose
**This document is intended solely for those directly running, writing, and

View file

@ -2,20 +2,8 @@
This document specifies the testing required for GCP Ingestion components.
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
- [Exceptions](#exceptions)
- [Test Phases](#test-phases)
- [Test Categories](#test-categories)
- [Unit Tests](#unit-tests)
- [Integration Tests](#integration-tests)
- [Load Tests](#load-tests)
- [Slow Load Tests](#slow-load-tests)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
## Exceptions
Code that does not comply with this standard before it is deployed to

docs/index.md (new file)
View file

@ -0,0 +1,21 @@
# GCP Ingestion
[GCP Ingestion](https://github.com/mozilla/gcp-ingestion/) is a monorepo for
documentation and implementation of the Mozilla telemetry ingestion system
deployed to Google Cloud Platform (GCP).
There are currently two components:
- [ingestion-edge](./ingestion-edge/index.md): a simple Python service for accepting HTTP
messages and delivering them to Google Cloud Pub/Sub
- [ingestion-beam](./ingestion-beam/index.md): a Java module defining
[Apache Beam](https://beam.apache.org/) jobs for streaming and batch
transformations of ingested messages
The design behind the system, along with various trade-offs, is documented in
the architecture section. Note that as of this writing (August 2019)
GCP Ingestion is changing quickly, so some parts of this documentation may be out
of date.
Feel free to ask us in #datapipeline on irc.mozilla.org or in #fx-metrics
on Slack if you have specific questions.

View file

@ -1,67 +1,17 @@
[![CircleCI](https://circleci.com/gh/mozilla/gcp-ingestion.svg?style=svg&circle-token=d98a470269580907d5c6d74d0e67612834a21be7)](https://circleci.com/gh/mozilla/gcp-ingestion)
# Apache Beam Jobs for Ingestion
This java module contains our Apache Beam jobs for use in Ingestion.
This ingestion-beam Java module contains our [Apache Beam](https://beam.apache.org/) jobs for use in ingestion.
Google Cloud Dataflow is a Google Cloud Platform service that natively runs
Apache Beam jobs.
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
The source code lives in the [ingestion-beam](https://github.com/mozilla/gcp-ingestion/tree/master/ingestion-beam)
subdirectory of the gcp-ingestion repository.
- [Code Formatting](#code-formatting)
- [Sink Job](#sink-job)
- [Supported Input and Outputs](#supported-input-and-outputs)
- [Encoding](#encoding)
- [Output Path Specification](#output-path-specification)
- [BigQuery](#bigquery)
- [Protocol](#protocol)
- [Attribute placeholders](#attribute-placeholders)
- [File prefix](#file-prefix)
- [Executing Jobs](#executing-jobs)
- [Locally](#locally)
- [On Dataflow](#on-dataflow)
- [On Dataflow with templates](#on-dataflow-with-templates)
- [In streaming mode](#in-streaming-mode)
- [Decoder Job](#decoder-job)
- [Transforms](#transforms)
- [Parse URI](#parse-uri)
- [Decompress](#decompress)
- [GeoIP Lookup](#geoip-lookup)
- [Parse User Agent](#parse-user-agent)
- [Executing Decoder Jobs](#executing-decoder-jobs)
- [Republisher Job](#republisher-job)
- [Capabilities](#capabilities)
- [Marking Messages As Seen](#marking-messages-as-seen)
- [Debug Republishing](#debug-republishing)
- [Per-`docType` Republishing](#per-doctype-republishing)
- [Per-Channel Sampled Republishing](#per-channel-sampled-republishing)
- [Executing Republisher Jobs](#executing-republisher-jobs)
- [Testing](#testing)
- [License](#license)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
# Code Formatting
Use spotless to automatically reformat code:
```bash
mvn spotless:apply
```
or use just check what changes it requires:
```bash
mvn spotless:check
```
# Sink Job
## Sink Job
A job for delivering messages between Google Cloud services.
## Supported Input and Outputs
### Supported Input and Outputs
Supported inputs:
@ -83,7 +33,7 @@ Supported error outputs, must include attributes and must not validate messages:
* stdout with JSON encoding
* stderr with JSON encoding
## Encoding
### Encoding
Internally messages are stored and transported as
[PubsubMessage](https://beam.apache.org/documentation/sdks/javadoc/2.6.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubMessage.html).
@ -120,12 +70,12 @@ The above file when stored in the `text` format:
Note that the newline embedded at the end of the second JSON message results in
two text messages, one of which is blank.
## Output Path Specification
### Output Path Specification
Depending on the specified output type, the `--output` path that you provide controls
several aspects of the behavior.
### BigQuery
#### BigQuery
When `--outputType=bigquery`, `--output` is a `tableSpec` of form `dataset.tablename`
or the more verbose `projectId:dataset.tablename`. The values can contain
@ -146,7 +96,7 @@ payloads.
Instead, records missing an attribute required by a placeholder
will be redirected to error output if no default is provided.
### Protocol
#### Protocol
When `--outputType=file`, `--output` may be prefixed by a protocol specifier
to determine the
@ -156,7 +106,7 @@ Cloud Storage, use a `gs://` path like:
--output=gs://mybucket/somdir/myfileprefix
### Attribute placeholders
#### Attribute placeholders
We support `FileIO`'s "Dynamic destinations" feature (`FileIO.writeDynamic`) where
it's possible to route individual messages to different output locations based
@ -204,7 +154,7 @@ on attribute names and default values used in placeholders:
- attribute names may not contain curly braces (`{` or `}`)
- default values may not contain curly braces (`{` or `}`)
### File prefix
#### File prefix
Individual files are named by replacing `:` with `-` in the default format discussed in
the "File naming" section of Beam's
@ -226,12 +176,12 @@ An output file might be:
/tmp/output/out--290308-12-21T20-00-00.000Z--290308-12-21T20-10-00.000Z-00000-of-00001.ndjson
## Executing Jobs
### Executing Jobs
Note: `-Dexec.args` does not handle newlines gracefully, but bash will remove
`\` escaped newlines in `"`s.
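
As an illustration of that note (a sketch, not taken from this diff; the flags are
placeholders based on the examples elsewhere in this README), the backslash-escaped
newlines inside the double quotes are removed by bash, so `-Dexec.args` receives a
single line:

```bash
# Bash strips each backslash-newline pair inside the double quotes, so
# -Dexec.args sees one long argument string (flag values are illustrative).
./bin/mvn compile exec:java -Dexec.args="\
    --inputType=file \
    --input=tmp/input.ndjson \
    --outputType=stdout \
"
```
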
### Locally
#### Locally
If you install Java and maven, you can invoke `mvn` directly in the following commands;
be aware, though, that Java 8 is the target JVM and some reflection warnings may be thrown on
@ -277,7 +227,7 @@ cat tmp/output/*
./bin/mvn compile exec:java -Dexec.args=--help=SinkOptions
```
### On Dataflow
#### On Dataflow
```bash
# Pick a bucket to store files in
@ -309,7 +259,7 @@ gcloud dataflow jobs list
gsutil cat $BUCKET/output/*
```
### On Dataflow with templates
#### On Dataflow with templates
Dataflow templates make a distinction between
[runtime parameters that implement the `ValueProvider` interface](https://cloud.google.com/dataflow/docs/guides/templates/creating-templates#runtime-parameters-and-the-valueprovider-interface)
@ -358,7 +308,7 @@ gcloud dataflow jobs show "$JOB_ID"
gsutil cat $BUCKET/output/*
```
### In streaming mode
#### In streaming mode
If `--inputType=pubsub`, Beam will execute in streaming mode, requiring some
extra configuration for file-based outputs. You will need to specify sharding like:
@ -378,25 +328,25 @@ As codified in [apache/beam/pull/1952](https://github.com/apache/beam/pull/1952)
the Dataflow runner suggests a reasonable starting point `numShards` is `2 * maxWorkers`
or 10 if `--maxWorkers` is unspecified.
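
The sharding flags themselves fall outside this hunk; purely as a sketch (flag names
assumed, not confirmed by this diff), a streaming run with file output might look like:

```bash
# Assumed flag names for illustration only: read from a PubSub subscription,
# write files to Cloud Storage with a fixed shard count.
./bin/mvn compile exec:java -Dexec.args="\
    --inputType=pubsub \
    --input=projects/$PROJECT/subscriptions/$SUBSCRIPTION \
    --outputType=file \
    --output=$BUCKET/output/ \
    --numShards=10 \
"
```
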
# Decoder Job
## Decoder Job
A job for normalizing ingestion messages.
## Transforms
### Transforms
These transforms are currently executed against each message in order.
### Parse URI
#### Parse URI
Attempt to extract attributes from `uri`, on failure send messages to the
configured error output.
### Decompress
#### Decompress
Attempt to decompress payload with gzip, on failure pass the message through
unmodified.
### GeoIP Lookup
#### GeoIP Lookup
1. Extract `ip` from the `x_forwarded_for` attribute
* when the `x_pipeline_proxy` attribute is not present, use the
@ -416,12 +366,12 @@ unmodified.
1. Remove the `x_forwarded_for` and `remote_addr` attributes
1. Remove any `null` values added to attributes
### Parse User Agent
#### Parse User Agent
Attempt to extract browser, browser version, and os from the `user_agent`
attribute, drop any nulls, and remove `user_agent` from attributes.
## Executing Decoder Jobs
### Executing Decoder Jobs
Decoder jobs are executed the same way as [executing sink jobs](#executing-jobs)
but with a few extra flags:
@ -458,7 +408,7 @@ echo '{"payload":"dGVzdA==","attributeMap":{"remote_addr":"63.245.208.195"}}' >
"
```
# Republisher Job
## Republisher Job
A job for republishing subsets of decoded messages to new destinations.
@ -471,28 +421,28 @@ in `Cloud MemoryStore` for deduplication purposes. That functionality exists
here to avoid the expense of an additional separate consumer of the full
decoded topic.
## Capabilities
### Capabilities
### Marking Messages As Seen
#### Marking Messages As Seen
The job needs to connect to Redis in order to mark `document_id`s of consumed
messages as seen. The Decoder is able to use that information to drop duplicate
messages flowing through the pipeline.
### Debug Republishing
#### Debug Republishing
If `--enableDebugDestination` is set, messages containing an `x_debug_id`
attribute will be republished to a destination that's configurable at runtime.
This is currently expected to be a feature specific to structured ingestion,
so should not be set for `telemetry-decoded` input.
### Per-`docType` Republishing
#### Per-`docType` Republishing
If `--perDocTypeEnabledList` is provided, a separate producer will be created
for each `docType` specified in the given comma-separated list.
See the `--help` output for details on format.
### Per-Channel Sampled Republishing
#### Per-Channel Sampled Republishing
If `--perChannelSampleRatios` is provided, a separate producer will be created
for each specified release channel. The messages will be randomly sampled
@ -501,7 +451,7 @@ This is currently intended as a feature only for telemetry data, so should
not be set for `structured-decoded` input.
See the `--help` output for details on format.
## Executing Republisher Jobs
### Executing Republisher Jobs
Republisher jobs are executed the same way as [executing sink jobs](#executing-jobs)
but with a few differences in flags. You'll need to set the `mainClass`:
@ -551,7 +501,7 @@ echo '{"payload":"dGVzdA==","attributeMap":{"x_debug_id":"mysession"}}' > tmp/in
"
```
# Testing
## Testing
Before anything else, be sure to download the test data:
@ -575,10 +525,18 @@ use the `bin/mvn` executable to run maven in docker:
```
To run the project in a sandbox against production data, see this document on
![configuring an integration testing workflow](../docs/ingestion_testing_workflow.md).
[configuring an integration testing workflow](./ingestion_testing_workflow.md).
# License
## Code Formatting
This Source Code Form is subject to the terms of the Mozilla Public
License, v. 2.0. If a copy of the MPL was not distributed with this
file, You can obtain one at http://mozilla.org/MPL/2.0/.
Use spotless to automatically reformat code:
```bash
mvn spotless:apply
```
or just check what changes it requires:
```bash
mvn spotless:check
```

View file

@ -1,21 +1,10 @@
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
- [Ingestion Testing Workflow](#ingestion-testing-workflow)
- [Setting up the GCS project](#setting-up-the-gcs-project)
- [Bootstrapping schemas from `mozilla-pipeline-schemas`](#bootstrapping-schemas-from-mozilla-pipeline-schemas)
- [Building the project](#building-the-project)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
# Ingestion Testing Workflow
The ingestion-beam handles data flow of documents from the edge into various
sinks. You may be interested in standing up a small testing instance to validate
the integration of the various components.
![diagrams/workflow.mmd](diagrams/workflow.svg)
![diagrams/workflow.mmd](../diagrams/workflow.svg)
__Figure__: _An overview of the various components necessary to query BigQuery
against data from a PubSub subscription._

View file

@ -1,28 +1,16 @@
[![CircleCI](https://circleci.com/gh/mozilla/gcp-ingestion.svg?style=svg&circle-token=d98a470269580907d5c6d74d0e67612834a21be7)](https://circleci.com/gh/mozilla/gcp-ingestion)
# Ingestion Edge Server
A simple service for delivering HTTP messages to Google Cloud PubSub
<!-- START doctoc generated TOC please keep comment here to allow auto update -->
<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
- [Building](#building)
- [Running](#running)
- [Configuration](#configuration)
- [Testing](#testing)
- [Style Checks](#style-checks)
- [Unit Tests](#unit-tests)
- [Integration Tests](#integration-tests)
- [Load Tests](#load-tests)
- [License](#license)
<!-- END doctoc generated TOC please keep comment here to allow auto update -->
The source code lives in the [ingestion-edge](https://github.com/mozilla/gcp-ingestion/tree/master/ingestion-edge)
subdirectory of the gcp-ingestion repository.
## Building
Install and update dependencies as-needed
We assume that you have [docker-compose](https://docs.docker.com/compose/)
installed.
From inside the `ingestion-edge` subdirectory:
```bash
# docker-compose
@ -230,8 +218,3 @@ Load test options (from `./bin/test -h`)
when --no-generator is specified
```
# License
This Source Code Form is subject to the terms of the Mozilla Public
License, v. 2.0. If a copy of the MPL was not distributed with this
file, You can obtain one at http://mozilla.org/MPL/2.0/.

mkdocs.yml (new file)
View file

@ -0,0 +1,20 @@
site_name: GCP Ingestion
site_description: Mozilla Telemetry ingestion on Google Cloud Platform
site_author: Mozilla Data Platform Team
nav:
- Home: index.md
- ingestion-edge: ingestion-edge/index.md
- ingestion-beam:
- Overview: ingestion-beam/index.md
- Ingestion testing workflow: ingestion-beam/ingestion_testing_workflow.md
- Architecture:
- Overview: architecture/overview.md
- Differences from AWS: architecture/differences_from_aws.md
- Pain Points: architecture/pain_points.md
- Edge Migration Plan: architecture/edge_migration_plan.md
- Reliability: architecture/reliability.md
- Test requirements: architecture/test_requirements.md
- Landfill Service Specification: architecture/landfill_service_specification.md
- Edge Server Specification: architecture/edge_service_specification.md
- BigQuery Sink Specification: architecture/bigquery_sink_specification.md
- Decoder Service Specification: architecture/decoder_service_specification.md
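
Since the nav above refers to files under `docs/`, a quick sanity check (a sketch, not
part of this commit) is to confirm that every referenced page exists:

```bash
# List every .md path mentioned in mkdocs.yml and report any that are
# missing from the docs/ directory.
grep -Eo '[A-Za-z0-9_/.-]+\.md' mkdocs.yml | sort -u | while read -r page; do
  [ -f "docs/$page" ] || echo "missing: docs/$page"
done
```
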