# SUMO Kubernetes Support Guide

## Links

High level:

-   [SUMO Infra home](https://github.com/mozilla-it/sumo-infra)
-   [SUMO K8s deployment](https://github.com/mozilla/kitsune/tree/7ff9934d185ce58153c652928298b5f62d37f8d2/k8s#deploying-sumo) (obsolete)
-   [MozMEAO escalation path](https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=50267455)
-   [Architecture diagram](https://raw.githubusercontent.com/mozilla/kitsune/main/docs/SUMO%20architecture%202019.pdf)
    -   [Source](https://www.lucidchart.com/documents/view/3687b2eb-57c7-4488-a8b5-4ddcf54e47b3)
-   [SLA](https://docs.google.com/document/d/1SYtkEioKl6uvdZZA06YtVigWYJY0Nb9hGfvE0UwEPXA/edit)
-   [Incident Reports](https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=52265112)

Tech details:

-   [SUMO K8s deployments/services/secrets templates](https://github.com/mozilla/kitsune/tree/main/k8s/)
-   [SUMO AWS resource definitions](https://github.com/mozilla-it/sumo-infra/tree/main/k8s/tf)

## K8s commands

> Most of the examples use `sumo-prod` as an example namespace. SUMO dev/stage/prod run in the `sumo-dev`/`sumo-stage`/`sumo-prod` namespaces respectively.

### General

Most examples use the `kubectl get ...` subcommand. If you'd prefer more readable output, you can replace the `get` subcommand with `describe`:

```
kubectl -n sumo-prod describe pod sumo-prod-web-76b74db69-dvxbh
```

> Listing resources is easier with the `get` subcommand.

To see all SUMO pods currently running:

```
kubectl -n sumo-prod get pods
```

To see all pods running and the K8s nodes they are assigned to:

```
kubectl -n sumo-prod get pods -o wide
```

To show yaml for a single pod:

```
kubectl -n sumo-prod get pod sumo-prod-web-76b74db69-dvxbh -o yaml
```

To show all deployments:

```
kubectl -n sumo-prod get deployments

NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
sumo-prod-celery   3         3         3            3           330d
sumo-prod-cron     0         0         0            0           330d
sumo-prod-web      50        50        50           50          331d
```

To show yaml for a single deployment:

```
kubectl -n sumo-prod get deployment sumo-prod-web -o yaml
```

Run a bash shell on a SUMO pod:

```
kubectl -n sumo-prod exec -it sumo-prod-web-76b74db69-xbfgj -- bash
```

Scaling a deployment:

```
kubectl -n sumo-prod scale --replicas=60 deployment/sumo-prod-web
```

Check rolling update status:

```
kubectl -n sumo-prod rollout status deployment/sumo-prod-web
```

#### Working with K8s command output

Filtering pods based on a label:

```
kubectl -n sumo-prod -l type=web get pods
```

Getting a list of pods:

```
kubectl -n sumo-prod -l type=web get pods | tail -n +2 | cut -d" " -f 1
```
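
Splitting on single spaces with `cut` only works reliably for the first column. As a hedged alternative, the sketch below uses `awk` to skip the header row and print the `NAME` column. `pod_names` is a hypothetical helper name, not something provided by kubectl or kitsune.

```shell
# Hypothetical helper: print the first column (NAME) of `kubectl get` table
# output read from stdin, skipping the header row.
pod_names() {
    awk 'NR > 1 { print $1 }'
}

# Intended usage (requires cluster access):
#   kubectl -n sumo-prod get pods -l type=web | pod_names
```

Because `awk` splits on runs of whitespace, changing `$1` to another field number selects a different column without counting spaces.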

Structured output:

See the jsonpath guide [here](https://kubernetes.io/docs/reference/kubectl/jsonpath/)

```
kubectl -n sumo-prod get pods -o=jsonpath='{.items[0].metadata.name}'
```

Processing K8s command json output with jq:

> jsonpath may be more portable

```
kubectl -n sumo-prod get pods -o json | jq -r '.items[].metadata.name'
```

### K8s Services

List SUMO services:

```
kubectl -n sumo-prod get services
NAME            TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
sumo-nodeport   NodePort   100.71.222.28   <none>        443:30139/TCP   341d
```

### Secrets

[K8s secrets docs](https://kubernetes.io/docs/concepts/configuration/secret/)

Secret values are base64 encoded when viewed in K8s output. Once set up as an environment variable or mounted file in a pod, the values are base64 decoded automatically.

Kitsune uses secrets specified as environment variables in a deployment spec:

-   [example](https://github.com/mozilla/kitsune/blob/7ff9934d185ce58153c652928298b5f62d37f8d2/k8s/templates/sumo-app.yaml.j2#L43-L46)

To list secrets:

```
kubectl -n sumo-prod get secrets
```

To view a secret with base64-encoded values:

```
kubectl -n sumo-prod get secret sumo-secrets-prod -o yaml
```

To view a secret with decoded values (aka "human readable"):

> This example uses the [ksv](https://github.com/metadave/ksv) utility

```
kubectl -n sumo-prod get secret sumo-secrets-prod -o yaml | ksv
```

To encode a secret value:

```
echo -n "somevalue" | base64
```

> The `-n` flag stops `echo` from appending a trailing newline, which would otherwise be encoded into the value.
> Values must be specified without newlines. The `base64` command on Linux accepts a `-w 0` parameter that disables line wrapping in its output. The `base64` command in macOS Sierra seems to output encoded values without newlines.
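
The effect of the trailing newline is easy to check with a round trip. The sketch below is a minimal example, assuming a `base64` that accepts `--decode` (GNU coreutils and macOS both do). `encode_secret` and `decode_secret` are hypothetical helper names.

```shell
# Encode a value without a trailing newline and without line wrapping.
# printf avoids the newline that plain `echo` appends; tr removes any
# wrapping that some base64 implementations add to long output.
encode_secret() {
    printf '%s' "$1" | base64 | tr -d '\n'
}

# Decode a value copied from `kubectl get secret ... -o yaml`.
decode_secret() {
    printf '%s' "$1" | base64 --decode
}

encode_secret "somevalue"; echo      # prints c29tZXZhbHVl
decode_secret "c29tZXZhbHVl"; echo   # prints somevalue
echo "somevalue" | base64            # prints c29tZXZhbHVlCg== (the newline was encoded)
```

The trailing `Cg==` in the last line is the encoded newline, which is what the `-n` flag avoids.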

Updating secrets:

```
kubectl -n sumo-prod apply -f ./some-secret.yaml
```
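
For reference, a file such as `./some-secret.yaml` follows the standard K8s `Secret` layout, with base64-encoded values under `data`. The sketch below prints a minimal manifest; `secret_manifest` and the key `SOME_KEY` are hypothetical, and the real SUMO secret holds many more keys than this.

```shell
# Hypothetical helper: print a minimal Opaque Secret manifest with one key.
# Arguments: secret name, namespace, key, plaintext value.
secret_manifest() {
    encoded="$(printf '%s' "$4" | base64 | tr -d '\n')"
    cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: $1
  namespace: $2
type: Opaque
data:
  $3: $encoded
EOF
}

# SOME_KEY is a placeholder, not a real kitsune setting.
secret_manifest sumo-secrets-prod sumo-prod SOME_KEY "somevalue"
```

Applying a manifest that omits existing keys may remove them from the secret, so start from the output of `kubectl get secret ... -o yaml` when editing a real secret.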

## Monitoring

### New Relic

-   [Primary region](https://onenr.io/0MRNqKbP8wn)

    -   `sumo-prod-oregon`

-   [Failover region](https://onenr.io/0qwyem31Gwn)
    -   `sumo-prod-frankfurt`

### Papertrail

All pod output is logged to Papertrail.

-   [Oregon](https://my.papertrailapp.com/groups/13629141/events)
-   [Frankfurt](https://papertrailapp.com/groups/5458941/events)

### elastic.co

Our hosted Elasticsearch cluster is in the `us-west-2` region of AWS. Elastic.co hosting status can be found on [this](https://cloud-status.elastic.co/) page.

## Operations

### Cronjobs

The `sumo-prod-cron` deployment is a self-contained Python cron system that runs in both Primary and Failover clusters.

```
# Oregon
kubectl -n sumo-prod get deployments
NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
sumo-prod-celery   3         3         3            3           330d
sumo-prod-cron     1         1         1            1           330d
sumo-prod-web      25        25        25           25          331d
```

### Manually adding/removing K8s Oregon/Frankfurt cluster nodes

> If you are modifying the Frankfurt cluster, replace instances of `oregon-*` below with `frankfurt`.

1. log in to the AWS console
2. ensure you are in the `Oregon` region
3. search for and select the `EC2` service in the AWS console
4. select `Auto Scaling Groups` from the navigation on the left side of the page
5. click on the `nodes.k8s.us-west-2a.sumo.mozit.cloud` or `nodes.k8s.us-west-2b.sumo.mozit.cloud` row to select it
6. from the `Actions` menu (close to the top of the page), click `Edit`
7. the `Details` tab for the ASG should appear, set the appropriate `Min`, `Desired` and `Max` values.
    1. it's probably a good idea to set `Min` and `Desired` to the same value so that the cluster autoscaler cannot scale the cluster below the size you want.
8. click `Save`
9. if you click on `Instances` from the navigation on the left side of the page, you can see the new instances that are starting/stopping.
10. you can see when the nodes join the K8s cluster with the following command:

```
watch 'kubectl get nodes | tail -n +2 | grep -Ev "master|control-plane" | wc -l'
```

> The number that is displayed should eventually match your ASG `Desired` value. Note this value only includes K8s workers.
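
The same worker count can be taken by filtering on the `ROLES` column instead of matching anywhere on the line. This is a sketch under two assumptions: `ROLES` is the third column of `kubectl get nodes` output, and control-plane nodes show `master` or `control-plane` there. `count_workers` is a hypothetical helper name.

```shell
# Hypothetical helper: count worker nodes in `kubectl get nodes` table output
# read from stdin. Skips the header row and any node whose ROLES column
# (assumed to be column 3) contains "master" or "control-plane".
count_workers() {
    awk 'NR > 1 && $3 !~ /master|control-plane/ { n++ } END { print n + 0 }'
}

# Intended usage (requires cluster access):
#   kubectl get nodes | count_workers
```

Matching only the third column avoids dropping a worker whose hostname happens to contain one of those words.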

### Manually Blocking an IP address

1. log in to the AWS console
2. ensure you are in the `Oregon` region
3. search for and select the `VPC` service in the AWS console
4. select `Network ACLs` from the navigation on the left side of the page
5. select the row containing the `Oregon` VPC
6. click on the `Inbound Rules` tab
7. click `Edit`
8. click `Add another rule`
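9. for `Rule#`, choose a value between 1 and 99. Network ACL rules are evaluated in ascending order, so a low number places the deny rule ahead of higher-numbered allow rules.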
10. for `Type`, select `All Traffic`
11. for `Source`, enter the IP address in CIDR format. To block a single IP, append `/32` to the IP address.
    1. example: `196.52.2.54/32`
12. for `Allow / Deny`, select `DENY`
13. click `Save`

Limits that apply to VPC network ACLs are documented [here](http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Appendix_Limits.html#vpc-limits-nacls).
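
The console steps above roughly correspond to a single `aws ec2 create-network-acl-entry` call. The sketch below only prints that command so it can be reviewed before anyone runs it. `to_cidr` and `nacl_deny_command` are hypothetical helpers, `acl-EXAMPLE` is a placeholder for the real network ACL ID, and the command has not been tested against the SUMO account.

```shell
# Hypothetical helper: append /32 to a bare IP address, leave CIDR input as-is.
to_cidr() {
    case "$1" in
        */*) printf '%s\n' "$1" ;;
        *)   printf '%s/32\n' "$1" ;;
    esac
}

# Print (do not run) an AWS CLI command that adds an inbound DENY rule for
# all traffic. Arguments: network ACL ID, rule number, IP address or CIDR.
nacl_deny_command() {
    printf 'aws ec2 create-network-acl-entry --network-acl-id %s --ingress --rule-number %s --protocol=-1 --cidr-block %s --rule-action deny\n' \
        "$1" "$2" "$(to_cidr "$3")"
}

# acl-EXAMPLE is a placeholder; look up the real ID in the VPC console.
nacl_deny_command acl-EXAMPLE 50 196.52.2.54
```

Printing rather than executing keeps the change reviewable, which matters because a mistyped CIDR block in a DENY rule can block legitimate traffic.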

### Manually Initiating Cluster failover

> Note: Route 53 provides automated cluster failover. These docs cover things to consider if there is a catastrophic failure in Oregon and Frankfurt must be promoted to primary rather than serving as a read-only failover.

-   **verify the Frankfurt read replica**
    -   `eu-central-1` (Frankfurt) has a read-replica of the SUMO production database
    -   the replica is currently a `db.m4.xlarge`, while the prod DB is `db.m4.4xlarge`
        -   this may be ok in maintenance mode, but if you are going to enable write traffic, the instance type must be scaled up.
            -   SREs performed a manual instance type change on the Frankfurt read-replica, and it took ~10 minutes to change from a `db.t2.medium` to a `db.m4.xlarge`.
    -   although we have alerting in place to notify the SRE team in the event of a replication error, it's a good idea to check the replication status on the RDS details page for the `sumo` MySQL instance.
        -   specifically, check the `DB Instance Status`, `Read Replica Source`, `Replication State`, and `Replication Error` values.
    -   decide if [promoting the read-replica](http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ReadRepl.html#USER_ReadRepl.Promote) to a standalone primary instance is appropriate.
        -   it's preferable to have a multi-AZ RDS instance, as we can take snapshots against the failover instance (RDS does this by default in a multi-AZ setup).
        -   if data is written to a promoted instance, and failover back to the us-west-2 clusters is desirable, a full DB backup and restore in us-west-2 is required.
        -   the replica is automatically rebooted before being promoted to a full instance.
-   **ensure image versions are up to date**
    -   most MySQL changes should already be replicated to the read-replica; however, if you're reading this, chances are things are broken. Ensure that the DB schema is correct for the images you're deploying.
-   **scale cluster and pods**

    -   the [prod deployments yaml](https://github.com/mozilla/kitsune/blob/99c4c2bf5c102f38910485b29fc87c2299daa18b/k8s/regions/oregon/prod.yaml#L24-L48) contains the correct number of replicas, but here are some safe values to use in an emergency:

-   **DNS**
    -   point the `prod-tp.sumo.mozit.cloud` traffic policy at the Frankfurt ELB