SUMO Kubernetes Support Guide
Links
High level:
- SUMO Infra home
- SUMO K8s deployment (obsolete)
- MozMEAO escalation path
- Architecture diagram
- SLA
- Incident Reports
Tech details:
K8s commands
Most of the examples use `sumo-prod` as an example namespace. SUMO dev/stage/prod run in the `sumo-dev`/`sumo-stage`/`sumo-prod` namespaces respectively.
General
Most examples are using the `kubectl get ...` subcommand. If you'd prefer output that's more readable, you can substitute the `get` subcommand with `describe`:
kubectl -n sumo-prod describe pod sumo-prod-web-76b74db69-dvxbh
Listing resources is easier with the `get` subcommand.
To see all SUMO pods currently running:
kubectl -n sumo-prod get pods
To see all pods running and the K8s nodes they are assigned to:
kubectl -n sumo-prod get pods -o wide
To show yaml for a single pod:
kubectl -n sumo-prod get pod sumo-prod-web-76b74db69-dvxbh -o yaml
To show all deployments:
```
kubectl -n sumo-prod get deployments

NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
sumo-prod-celery   3         3         3            3           330d
sumo-prod-cron     0         0         0            0           330d
sumo-prod-web      50        50        50           50          331d
```
To show yaml for a single deployment:
kubectl -n sumo-prod get deployment sumo-prod-web -o yaml
Run a bash shell on a SUMO pod:
kubectl -n sumo-prod exec -it sumo-prod-web-76b74db69-xbfgj -- bash
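You can also run a one-off, non-interactive command instead of a full shell. A minimal sketch (the pod name is the example one from above, and `env` is just an illustrative command):

```bash
# Print the pod's environment without opening an interactive shell.
# Everything after `--` is executed inside the container.
kubectl -n sumo-prod exec sumo-prod-web-76b74db69-xbfgj -- env
```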
Scaling a deployment:
kubectl -n sumo-prod scale --replicas=60 deployment/sumo-prod-web
Check rolling update status:
kubectl -n sumo-prod rollout status deployment/sumo-prod-web
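The two commands above combine naturally into a single scripted step that scales and then blocks until the deployment reports all replicas available. A sketch (the replica count is just an example):

```bash
# Scale up, then wait until the rollout completes.
kubectl -n sumo-prod scale --replicas=60 deployment/sumo-prod-web
kubectl -n sumo-prod rollout status deployment/sumo-prod-web
```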
Working with K8s command output
Filtering pods based on a label:
kubectl -n sumo-prod -l type=web get pods
Getting a list of pods:
kubectl -n sumo-prod -l type=web get pods | tail -n +2 | cut -d" " -f 1
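Once you have a plain list of pod names, you can feed it into a shell loop. A hedged sketch that just echoes each name (substitute whatever per-pod command you actually need):

```bash
# Iterate over web pod names; replace `echo` with a real per-pod action.
for pod in $(kubectl -n sumo-prod -l type=web get pods | tail -n +2 | cut -d" " -f 1); do
  echo "$pod"
done
```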
Structured output:
See the jsonpath guide here
kubectl -n sumo-prod get pods -o=jsonpath='{.items[0].metadata.name}'
Processing K8s command JSON output with jq (jsonpath may be more portable):
kubectl -n sumo-prod get pods -o json | jq -r .items[].metadata.name
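jq is also handy for pulling multiple fields at once. A sketch that prints each pod's name and assigned node (both are standard pod metadata/spec fields):

```bash
# Emit "<pod-name> <node-name>" for every pod in the namespace.
kubectl -n sumo-prod get pods -o json | \
  jq -r '.items[] | "\(.metadata.name) \(.spec.nodeName)"'
```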
K8s Services
List SUMO services:
```
kubectl -n sumo-prod get services

NAME            TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
sumo-nodeport   NodePort   100.71.222.28   <none>        443:30139/TCP   341d
```
Secrets
Secret values are base64 encoded when viewed in K8s output. Once set up as an environment variable or mounted file in a pod, the values are base64 decoded automatically.
Kitsune uses secrets specified as environment variables in its deployment specs.
To list secrets:
kubectl -n sumo-prod get secrets
To view a secret w/ base64-encoded values:
kubectl -n sumo-prod get secret sumo-secrets-prod -o yaml
To view a secret with decoded values (aka "human readable"):
This example uses the ksv utility
kubectl -n sumo-prod get secret sumo-secrets-prod -o yaml | ksv
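If you don't have `ksv` available, a single value can be decoded with jsonpath and `base64`. A sketch, assuming a hypothetical key name `SECRET_KEY`:

```bash
# Extract one base64-encoded value from the secret and decode it.
# (Older macOS base64 may need `-D` instead of `--decode`.)
kubectl -n sumo-prod get secret sumo-secrets-prod \
  -o jsonpath='{.data.SECRET_KEY}' | base64 --decode
```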
To encode a secret value:
echo -n "somevalue" | base64
The `-n` flag keeps `echo` from appending a trailing newline before base64 encoding. Values must be specified without newlines; the `base64` command on Linux can take a `-w 0` parameter that outputs without line wrapping. The `base64` command on macOS Sierra seems to output encoded values without newlines.
Updating secrets:
kubectl -n sumo-prod apply -f ./some-secret.yaml
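If you need to build `some-secret.yaml` in the first place, `kubectl create secret` can do the base64 encoding for you. A sketch with placeholder names (the secret and key names here are hypothetical, not the real SUMO ones):

```bash
# Generate a secret manifest locally without touching the cluster,
# then review it and apply as above.
# (Newer kubectl versions want `--dry-run=client`.)
kubectl -n sumo-prod create secret generic example-secret \
  --from-literal=SOME_KEY=somevalue \
  --dry-run -o yaml > ./some-secret.yaml
```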
Monitoring
New Relic
- `sumo-prod-oregon`
- `sumo-prod-frankfurt`
Papertrail
All pod output is logged to Papertrail.
elastic.co
Our hosted Elasticsearch cluster is in the `us-west-2` region of AWS. Elastic.co hosting status can be found on this page.
Operations
Cronjobs
The `sumo-prod-cron` deployment is a self-contained Python cron system that runs in both the Primary and Failover clusters.
```
# Oregon
kubectl -n sumo-prod get deployments

NAME               DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
sumo-prod-celery   3         3         3            3           330d
sumo-prod-cron     1         1         1            1           330d
sumo-prod-web      25        25        25           25          331d
```
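Because cron runs as a regular deployment, it can be paused or resumed with the same `scale` subcommand shown earlier. A sketch, e.g. if you need to ensure only one cluster is executing cron jobs:

```bash
# Stop cron in this cluster...
kubectl -n sumo-prod scale --replicas=0 deployment/sumo-prod-cron
# ...and re-enable it later.
kubectl -n sumo-prod scale --replicas=1 deployment/sumo-prod-cron
```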
Manually adding/removing K8s Oregon/Frankfurt cluster nodes
If you are modifying the Frankfurt cluster, replace instances of `oregon-*` below with `frankfurt`.
- login to the AWS console
- ensure you are in the `Oregon` region
- search for and select the `EC2` service in the AWS console
- select `Auto Scaling Groups` from the navigation on the left side of the page
- click on the `nodes.k8s.us-west-2a.sumo.mozit.cloud` or `nodes.k8s.us-west-2b.sumo.mozit.cloud` row to select it
- from the `Actions` menu (close to the top of the page), click `Edit`
- the `Details` tab for the ASG should appear; set the appropriate `Min`, `Desired` and `Max` values.
  - it's probably good to set `Min` and `Desired` to the same value in case the cluster autoscaler decides to scale down the cluster smaller than the `Min`.
- click `Save`
- if you click on `Instances` from the navigation on the left side of the page, you can see the new instances that are starting/stopping.
- you can see when the nodes join the K8s cluster with the following command:
watch 'kubectl get nodes | tail -n +2 | grep -v main | wc -l'
The number that is displayed should eventually match your ASG `Desired` value. Note that this value only includes K8s workers.
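The same change can be scripted with the AWS CLI instead of the console. A hedged sketch (the ASG name matches the console steps above, but the sizes are only examples; check `aws autoscaling describe-auto-scaling-groups` for the real values first):

```bash
# Set Min/Desired/Max on one of the worker ASGs.
aws --region us-west-2 autoscaling update-auto-scaling-group \
  --auto-scaling-group-name nodes.k8s.us-west-2a.sumo.mozit.cloud \
  --min-size 5 --desired-capacity 5 --max-size 10
```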
Manually Blocking an IP address
- login to the AWS console
- ensure you are in the `Oregon` region
- search for and select the `VPC` service in the AWS console
- select `Network ACLs` from the navigation on the left side of the page
- select the row containing the `Oregon` VPC
- click on the `Inbound Rules` tab
- click `Edit`
- click `Add another rule`
- for `Rule #`, select a value < 100 and > 0
- for `Type`, select `All Traffic`
- for `Source`, enter the IP address in CIDR format. To block a single IP, append `/32` to the IP address.
  - example: `196.52.2.54/32`
- for `Allow / Deny`, select `DENY`
- click `Save`
There are limits that apply to using VPC ACLs documented here.
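If you'd rather script the block, the equivalent AWS CLI call looks roughly like this (the ACL ID and rule number are placeholders; find the real ID with `aws ec2 describe-network-acls`):

```bash
# Add an inbound DENY rule for a single IP to the VPC's network ACL.
aws --region us-west-2 ec2 create-network-acl-entry \
  --network-acl-id acl-0123456789abcdef0 \
  --rule-number 50 \
  --protocol -1 \
  --rule-action deny \
  --ingress \
  --cidr-block 196.52.2.54/32
```

Note that `--protocol -1` means all protocols, matching the `All Traffic` type above.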
Manually Initiating Cluster failover
Note: Route 53 will provide automated cluster failover; these docs cover things to consider if there is a catastrophic failure in Oregon and Frankfurt must be promoted to primary rather than remaining a read-only failover.
- verify the Frankfurt read replica
  - `eu-central-1` (Frankfurt) has a read-replica of the SUMO production database
    - the replica is currently a `db.m4.xlarge`, while the prod DB is a `db.m4.4xlarge`
    - this may be ok in maintenance mode, but if you are going to enable write traffic, the instance type must be scaled up.
      - SREs performed a manual instance type change on the Frankfurt read-replica, and it took ~10 minutes to change from a `db.t2.medium` to a `db.m4.xlarge`.
  - although we have alerting in place to notify the SRE team in the event of a replication error, it's a good idea to check the replication status on the RDS details page for the `sumo` MySQL instance.
    - specifically, check the `DB Instance Status`, `Read Replica Source`, `Replication State`, and `Replication Error` values.
  - decide if promoting the read-replica to a main is appropriate.
    - it's preferable to have a multi-AZ RDS instance, as we can take snapshots against the failover instance (RDS does this by default in a multi-AZ setup).
    - if data is written to a promoted instance, and failover back to the us-west-2 clusters is desirable, a full DB backup and restore in us-west-2 is required.
    - the replica is automatically rebooted before being promoted to a full instance.
- ensure image versions are up to date
- Most MySQL changes should already be replicated to the read-replica; however, if you're reading this, chances are things are broken. Ensure that the DB schema is correct for the images you're deploying.
- scale cluster and pods
  - the prod deployments yaml contains the correct number of replicas, but here are some safe values to use in an emergency:
- DNS
  - point the `prod-tp.sumo.mozit.cloud` traffic policy at the Frankfurt ELB