* Update docs about deduplication
To match the reality of https://bugzilla.mozilla.org/show_bug.cgi?id=1694764
and allow for clear communication to data users about the change.
A separate follow-up will be to remove code from the repository that
handles interaction with Redis.
* Apply suggestions from code review
Co-authored-by: Daniel Thorn <dthorn@mozilla.com>
This corrects info on how AET support is deployed, updates some outdated info
(sinks are now all Kube jobs rather than Dataflow), and now mentions the
concept of pipeline families.
* Add a repo url
This will let people jump directly to editing the docs from the UI,
in case further tweaks are required.
* Fix omitted renaming of "edge server" to "edge service"
* Hide the prev/next buttons by default
They aren't very useful and cause the header to collapse vertically
in a very awkward way.
This substantially reorganizes the documentation as an mkdocs site. Main
changes:
* All documentation is now browseable and searchable in a single site, with
a handy table of contents on the side of each section
* Top-level README significantly slimmed down (just pointing to docs site)
* READMEs inside individual components removed (moved to subdirectories inside
docs/ folder, accessible via top-level in docs site)
Pocket has a need to ingest all doctypes associated with the `activity-stream`
namespace; it seems efficient to be able to deliver one topic rather than
multiple topics per docType.
We also take this chance to refactor the per-channel and per-doctype logic
out of Republisher proper into their own dedicated transforms.
As part of the refactor, we make the execution graph cleaner and a bit more
efficient. We now partition per channel or doctype first, rather than having
each configured channel and doctype branch directly off the initial input.
We also give the output transforms names specific to the channel or doctype.
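
The partition-first idea above can be sketched outside of Beam. This is only a
minimal illustration in plain Python, not the Republisher code itself: the
message shape (a dict with an `attributes` dict), the function name
`partition_by_attribute`, and the sample attribute values are all assumptions
made for the example.

```python
from collections import defaultdict


def partition_by_attribute(messages, key, configured_values):
    """Group messages by one attribute in a single pass.

    Hypothetical helper: each message is assumed to be a dict with an
    "attributes" dict. Only values listed in configured_values get a bucket;
    everything else is dropped, as if no output were configured for it.
    """
    buckets = defaultdict(list)
    for message in messages:
        value = message.get("attributes", {}).get(key)
        if value in configured_values:
            buckets[value].append(message)
    return dict(buckets)


# Example input; the attribute names and values are illustrative only.
messages = [
    {"attributes": {"channel": "beta", "doctype": "main"}},
    {"attributes": {"channel": "nightly", "doctype": "crash"}},
    {"attributes": {"channel": "release", "doctype": "main"}},
]
per_channel = partition_by_attribute(messages, "channel", {"beta", "nightly"})
```

Each message is inspected once and lands in at most one bucket, instead of
every configured channel or doctype filtering the full input on its own.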
* Create documentation on testing BigQuery from ingestion-beam.
* Add a mermaid config file to the docs directory for overflow
* Fix spelling mistakes
* Move mermaid config to top-level .mermaid
Fixes #479 and #396
Adds the Republisher job as previously spec'd in `docs/architecture`.
Before this is deployed, we will need to create appropriate output topics
for the intended configuration.
Once this is deployed and stable, we can remove the decoded topic consumer
from Decoder. There will be a short overlap period where both the Decoder
and Republisher are marking messages as seen in Redis, but this shouldn't
cause any problems (other than the expense of two consumers).
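
To illustrate why the overlap should be harmless, here is a small sketch that
uses an in-memory set as a stand-in for Redis. The class and method names are
hypothetical and do not come from the repository; the point is only that
marking the same ID twice leaves the store in the same state as marking it
once.

```python
class SeenStore:
    """In-memory stand-in for the Redis store (assumption for this sketch)."""

    def __init__(self):
        self._seen = set()

    def mark_as_seen(self, document_id):
        # Adding an ID that is already present is a no-op, so two consumers
        # marking the same message do not conflict.
        self._seen.add(document_id)

    def is_seen(self, document_id):
        return document_id in self._seen

    def size(self):
        return len(self._seen)


store = SeenStore()
store.mark_as_seen("doc-1")  # e.g. written by the Decoder
store.mark_as_seen("doc-1")  # e.g. written again by the Republisher
```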
* Documentation for refactor with new Republisher job
The Republisher job proposed in this PR would factor MarkAsSeen out of
Decoder, which would lead to a more logical flow of data.
It also allows us to share the expense of the MarkAsSeen read with
the read needed to inspect message contents and republish to
smaller topics.
Specifically, the Republisher for structured ingestion would check for
the new debug header and publish messages containing that header to
a debug topic per the Glean request in #458.
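
A rough sketch of that check, in plain Python rather than the actual Beam
transform. The attribute name `x_debug_id` and the message shape are
assumptions for illustration; the real header name is whatever #458 settles on.

```python
# Assumed attribute name for the debug header; not confirmed by this PR.
DEBUG_ATTRIBUTE = "x_debug_id"


def select_debug_messages(messages, attribute=DEBUG_ATTRIBUTE):
    """Return only the messages that carry a non-empty debug attribute.

    Hypothetical helper: each message is assumed to be a dict with an
    "attributes" dict. The returned list is what would go to the debug topic.
    """
    return [
        message
        for message in messages
        if message.get("attributes", {}).get(attribute)
    ]


debug_messages = select_debug_messages(
    [
        {"attributes": {"x_debug_id": "my-test-run"}},
        {"attributes": {"doctype": "baseline"}},
        {"attributes": {"x_debug_id": ""}},
    ]
)
```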
The Republisher for telemetry data would randomly sample messages to
produce the monitoring topics discussed in #396.
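
The random sampling could look roughly like the following. This is a sketch
under assumptions: the function name, the `ratio` parameter, and the injectable
random generator are made up for the example and are not the job's actual
options.

```python
import random


def sample_messages(messages, ratio, rng=None):
    """Keep each message independently with probability `ratio`.

    Hypothetical helper. `rng` can be a seeded random.Random instance to make
    the sample reproducible in tests.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    if rng is None:
        rng = random.Random()
    # rng.random() is in [0.0, 1.0), so ratio 0.0 keeps nothing and
    # ratio 1.0 keeps everything.
    return [message for message in messages if rng.random() < ratio]


all_messages = [{"id": i} for i in range(1000)]
sampled = sample_messages(all_messages, 0.01, rng=random.Random(0))
```

Sampling per message keeps the decision stateless, so it does not matter which
worker handles which message.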
* Update diagram and include docker wrapper