gcp-ingestion/docs/architecture/reliability.md

3.3 KiB

Reliability

In production the ingestion product aims to provide a Biweekly Uptime Percentage determined by the Reliability Target below. If a component does not meet that then a Stability Work Period should be assigned to each software engineer supporting the component.

Disclaimer and Purpose

This document is intended solely for those directly running, writing, and managing GCP Ingestion. It is not an agreement, implicit or otherwise, with any other parties. This document is a prototype that should be treated as a goal that will require adjustments.

The purpose of this document is first and foremost to encourage behavior that reduces Downtime. The secondary purpose is to establish clear expectations for how software engineers respond to Downtime.

Reliability Target

Component Biweekly Uptime Percentage
ingestion-edge 99.99%
ingestion-beam 99.5%

Definitions

"Downtime" on ingestion-edge means for at least 0.1% of requests a status code lower than 500 was not successfully returned. On ingestion-beam it means oldest_unacked_message_age on a PubSub input exceeds 1 hour for a batch sink job or 1 minute for a decoder or streaming sink job.

"Downtime Period" means a period of 60 consecutive seconds of Downtime. Intermittent Downtime for a period of less than 60 consecutive seconds will not be counted towards any Downtime Periods.

"Biweekly Uptime Percentage" means total number of minutes in a rolling two week window, minus the number of minutes of Downtime suffered from all Downtime Periods in the window, divided by the total number of minutes in the window.

"Stability Work" means work on Downtime prevention and reduction. Only code changes resulting from that work may be deployed to production. All other features and work on the component are suspended.

"Stability Work Period" means a continuous period of time after recovery and postmortem, where each engineer is assigned Stability Work. The length of the Stability Work Period is determined below.

Component Biweekly Uptime Percentage Length of Stability Work Period
ingestion-edge 99.9% to < 99.99% 2 weeks
ingestion-edge 99% to < 99.9% 4 weeks
ingestion-edge < 99% 12 weeks
ingestion-beam 99% to < 99.5% 2 weeks
ingestion-beam 95% to < 99% 4 weeks
ingestion-beam < 95% 12 weeks

Exclusions

The Reliability Target does not apply to any Downtime in ingestion caused by Downtime on a Google Cloud Platform Service, other than Downtime on ingestion-edge caused by PubSub.

Additional Information

Biweekly Uptime Percentage Downtime per Two Weeks
99.99% 2 minutes
99.9% 20 minutes
99.5% 1 hour 40 minutes
99% 3 hours 21 minutes
95% 16 hours 48 minutes