# SUMO Kubernetes Support Guide

## Links

High level:

- [SUMO Infra home](https://github.com/mozilla-it/sumo-infra)
- [SUMO K8s deployment](https://github.com/mozilla/kitsune/tree/7ff9934d185ce58153c652928298b5f62d37f8d2/k8s#deploying-sumo) (obsolete)
- [MozMEAO escalation path](https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=50267455)
- [Architecture diagram](https://raw.githubusercontent.com/mozilla/kitsune/main/docs/SUMO%20architecture%202019.pdf)
    - [Source](https://www.lucidchart.com/documents/view/3687b2eb-57c7-4488-a8b5-4ddcf54e47b3)
- [SLA](https://docs.google.com/document/d/1SYtkEioKl6uvdZZA06YtVigWYJY0Nb9hGfvE0UwEPXA/edit)
- [Incident Reports](https://mana.mozilla.org/wiki/pages/viewpage.action?pageId=52265112)

Tech details:

- [SUMO K8s deployments/services/secrets templates](https://github.com/mozilla/kitsune/tree/main/k8s/)
- [SUMO AWS resource definitions](https://github.com/mozilla-it/sumo-infra/tree/main/k8s/tf)

## K8s commands

> Most of the examples use `sumo-prod` as an example namespace. SUMO dev/stage/prod run in the `sumo-dev`/`sumo-stage`/`sumo-prod` namespaces respectively.

### General

Most examples use the `kubectl get ...` subcommand. If you'd prefer more readable output, substitute `describe` for `get`:

```
kubectl -n sumo-prod describe pod sumo-prod-web-76b74db69-dvxbh
```

> Listing resources is easier with the `get` subcommand.


To see all SUMO pods currently running:

```
kubectl -n sumo-prod get pods
```

To see all pods running and the K8s nodes they are assigned to:

```
kubectl -n sumo-prod get pods -o wide
```

To show yaml for a single pod:

```
kubectl -n sumo-prod get pod sumo-prod-web-76b74db69-dvxbh -o yaml
```

To show all deployments:

```
kubectl -n sumo-prod get deployments

NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
sumo-prod-celery   3         3         3            3           330d
sumo-prod-cron     0         0         0            0           330d
sumo-prod-web      50        50        50           50          331d
```

To show yaml for a single deployment:

```
kubectl -n sumo-prod get deployment sumo-prod-web -o yaml
```

Run a bash shell on a SUMO pod:

```
kubectl -n sumo-prod exec -it sumo-prod-web-76b74db69-xbfgj -- bash
```

Scaling a deployment:

```
kubectl -n sumo-prod scale --replicas=60 deployment/sumo-prod-web
```

Check rolling update status:

```
kubectl -n sumo-prod rollout status deployment/sumo-prod-web
```

#### Working with K8s command output

Filtering pods based on a label:

```
kubectl -n sumo-prod get pods -l type=web
```

Getting a list of pod names:

```
kubectl -n sumo-prod get pods -l type=web | tail -n +2 | cut -d" " -f 1
```

Structured output:

See the jsonpath guide [here](https://kubernetes.io/docs/reference/kubectl/jsonpath/)

```
kubectl -n sumo-prod get pods -o=jsonpath='{.items[0].metadata.name}'
```

Processing K8s command json output with jq:

> jsonpath may be more portable

```
kubectl -n sumo-prod get pods -o json | jq -r '.items[].metadata.name'
```

### K8s Services

List SUMO services:

```
kubectl -n sumo-prod get services

NAME            TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
sumo-nodeport   NodePort   100.71.222.28   <none>        443:30139/TCP   341d
```

### Secrets

[K8s secrets docs](https://kubernetes.io/docs/concepts/configuration/secret/)

Secret values are base64 encoded when viewed in K8s output. Once set up as an environment variable or mounted file in a pod, the values are base64 decoded automatically.

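The encoding described above can be checked locally without touching a cluster. The following is a minimal sketch, assuming a POSIX shell and a `base64` binary that accepts `--decode`; the helper names `encode_secret` and `decode_secret` are hypothetical and not part of Kitsune or kubectl.

```shell
# Hypothetical helpers for checking secret values by hand.

# Encode a value the way it appears in a Secret manifest.
# printf '%s' avoids the trailing newline that a plain echo would add,
# and tr removes any line wrapping that some base64 builds insert.
encode_secret() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Decode a value copied from `kubectl get secret ... -o yaml` output.
decode_secret() {
  printf '%s' "$1" | base64 --decode
}

encoded="$(encode_secret 'somevalue')"
echo "$encoded"                 # prints c29tZXZhbHVl
decode_secret "$encoded"; echo  # prints somevalue
```

A quick round trip like this is one way to confirm that a value pasted into a secret yaml file has no stray newline before it is applied.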

Kitsune uses secrets specified as environment variables in a deployment spec:

- [example](https://github.com/mozilla/kitsune/blob/7ff9934d185ce58153c652928298b5f62d37f8d2/k8s/templates/sumo-app.yaml.j2#L43-L46)

To list secrets:

```
kubectl -n sumo-prod get secrets
```

To view a secret with base64-encoded values:

```
kubectl -n sumo-prod get secret sumo-secrets-prod -o yaml
```

To view a secret with decoded values (aka "human readable"):

> This example uses the [ksv](https://github.com/metadave/ksv) utility

```
kubectl -n sumo-prod get secret sumo-secrets-prod -o yaml | ksv
```

To encode a secret value:

```
echo -n "somevalue" | base64
```

> The `-n` flag stops `echo` from appending a trailing newline before base64 encoding.

> Values must be specified without newlines. The `base64` command on Linux accepts a `-w 0` parameter that outputs without newlines. The `base64` command in macOS Sierra seems to output encoded values without newlines.

Updating secrets:

```
kubectl -n sumo-prod apply -f ./some-secret.yaml
```

## Monitoring

### New Relic

- [Primary region](https://onenr.io/0MRNqKbP8wn)
    - `sumo-prod-oregon`
- [Failover region](https://onenr.io/0qwyem31Gwn)
    - `sumo-prod-frankfurt`

### Papertrail

All pod output is logged to Papertrail.

- [Oregon](https://my.papertrailapp.com/groups/13629141/events)
- [Frankfurt](https://papertrailapp.com/groups/5458941/events)

### elastic.co

Our hosted Elasticsearch cluster is in the `us-west-2` region of AWS. Elastic.co hosting status can be found on [this](https://cloud-status.elastic.co/) page.

## Operations

### Cronjobs

The `sumo-prod-cron` deployment is a self-contained Python cron system that runs in both Primary and Failover clusters.


```
# Oregon
kubectl -n sumo-prod get deployments

NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
sumo-prod-celery   3         3         3            3           330d
sumo-prod-cron     1         1         1            1           330d
sumo-prod-web      25        25        25           25          331d
```

### Manually adding/removing K8s Oregon/Frankfurt cluster nodes

> If you are modifying the Frankfurt cluster, replace instances of `oregon-*` below with `frankfurt`.

1. log in to the AWS console
2. ensure you are in the `Oregon` region
3. search for and select the `EC2` service in the AWS console
4. select `Auto Scaling Groups` from the navigation on the left side of the page
5. click on the `nodes.k8s.us-west-2a.sumo.mozit.cloud` or `nodes.k8s.us-west-2b.sumo.mozit.cloud` row to select it
6. from the `Actions` menu (close to the top of the page), click `Edit`
7. the `Details` tab for the ASG should appear; set the appropriate `Min`, `Desired` and `Max` values.
    1. it's probably good to set `Min` and `Desired` to the same value in case the cluster autoscaler decides to scale down the cluster smaller than the `Min`.
8. click `Save`
9. if you click on `Instances` from the navigation on the left side of the page, you can see the new instances that are starting/stopping.
10. you can see when the nodes join the K8s cluster with the following command:

    ```
    watch 'kubectl get nodes | tail -n +2 | grep -v main | wc -l'
    ```

    > The number that is displayed should eventually match your ASG `Desired` value. Note this value only includes K8s workers.

### Manually Blocking an IP address

1. log in to the AWS console
2. ensure you are in the `Oregon` region
3. search for and select the `VPC` service in the AWS console
4. select `Network ACLs` from the navigation on the left side of the page
5. select the row containing the `Oregon` VPC
6. click on the `Inbound Rules` tab
7. click `Edit`
8. click `Add another rule`
9. for `Rule#`, select a value < 100 and > 0
10. for `Type`, select `All Traffic`
11. for `Source`, enter the IP address in CIDR format.
    To block a single IP, append `/32` to the IP address.
    1. example: `196.52.2.54/32`
12. for `Allow / Deny`, select `DENY`
13. click `Save`

There are limits that apply to using VPC ACLs documented [here](http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Appendix_Limits.html#vpc-limits-nacls).

### Manually Initiating Cluster failover

> Note: Route 53 will provide automated cluster failover. These docs cover things to consider if there is a catastrophic failure in Oregon and Frankfurt must be promoted to primary rather than a read-only failover.

- **verify the Frankfurt read replica**
    - `eu-central-1` (Frankfurt) has a read-replica of the SUMO production database
        - the replica is currently a `db.m4.xlarge`, while the prod DB is `db.m4.4xlarge`
            - this may be ok in maintenance mode, but if you are going to enable write traffic, the instance type must be scaled up.
                - SREs performed a manual instance type change on the Frankfurt read-replica, and it took ~10 minutes to change from a `db.t2.medium` to a `db.m4.xlarge`.
    - although we have alerting in place to notify the SRE team in the event of a replication error, it's a good idea to check the replication status on the RDS details page for the `sumo` MySQL instance.
        - specifically, check the `DB Instance Status`, `Read Replica Source`, `Replication State`, and `Replication Error` values.
    - decide if [promoting the read-replica](http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_ReadRepl.html#USER_ReadRepl.Promote) to a main is appropriate.
        - it's preferable to have a multi-AZ RDS instance, as we can take snapshots against the failover instance (RDS does this by default in a multi-AZ setup).
        - if data is written to a promoted instance, and failover back to the us-west-2 clusters is desirable, a full DB backup and restore in us-west-2 is required.
        - the replica is automatically rebooted before being promoted to a full instance.
- **ensure image versions are up to date**
    - Most MySQL changes should already be replicated to the read-replica. However, if you're reading this, chances are things are broken. Ensure that the DB schema is correct for the images you're deploying.
- **scale cluster and pods**
    - the [prod deployments yaml](https://github.com/mozilla/kitsune/blob/99c4c2bf5c102f38910485b29fc87c2299daa18b/k8s/regions/oregon/prod.yaml#L24-L48) contains the correct number of replicas, but here are some safe values to use in an emergency:
- **DNS**
    - point the `prod-tp.sumo.mozit.cloud` traffic policy at the Frankfurt ELB
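
For the "scale cluster and pods" step above, it can help to print the exact commands for review before running anything during an incident. This is a minimal sketch, not an official procedure: `print_scale_cmds` is a hypothetical helper, it assumes the Frankfurt cluster uses the same `sumo-prod` namespace and deployment names, and the replica count of 60 is only the number from the scaling example earlier in this guide, not a recommended value. Take the real counts from the prod deployments yaml.

```shell
# Hypothetical helper: print (do not run) the kubectl commands that scale
# one deployment and then watch its rollout.
# Usage: print_scale_cmds <namespace> <deployment> <replicas>
print_scale_cmds() {
  ns="$1"; deployment="$2"; replicas="$3"
  # Reject empty or non-numeric replica counts before printing anything.
  case "$replicas" in
    ''|*[!0-9]*)
      echo "replicas must be a non-negative integer" >&2
      return 1
      ;;
  esac
  echo "kubectl -n ${ns} scale --replicas=${replicas} deployment/${deployment}"
  echo "kubectl -n ${ns} rollout status deployment/${deployment}"
}

# Illustrative values only; 60 is not a vetted emergency setting.
print_scale_cmds sumo-prod sumo-prod-web 60
```

Printing rather than executing keeps a second person in the loop: the output can be pasted into the incident channel and run once the numbers are agreed.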