rename: controller -> operator
Parent: b3d79718cf
Commit: eef3d2058d
@@ -10,9 +10,9 @@ To initialize a cluster, we first need to create a seed etcd member. Then we can

 ## Failure recovery

-If the controller fails before creating the first member, we will end up with a cluster with no running pods.
+If the operator fails before creating the first member, we will end up with a cluster with no running pods.

 Recovery can be challenging if this happens. It is impossible for us to differentiate a dead cluster from an uninitialized cluster.

-We choose the simplest solution here. We always consider a cluster with no running pods a failed cluster. The controller will
+We choose the simplest solution here. We always consider a cluster with no running pods a failed cluster. The operator will
 try to recover it from existing backups. If there is no backup, we mark the cluster dead.
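
Since a dead cluster cannot be told apart from an uninitialized one, the rule above reduces to one small decision. A minimal sketch, assuming a hypothetical running-pod count and backup check rather than the operator's real types:

```
package recovery

// handleNoPods sketches the "no running pods" rule: treat the cluster as
// failed, restore from a backup if one exists, otherwise mark it dead.
// runningPods and hasBackup are hypothetical inputs.
func handleNoPods(runningPods int, hasBackup bool) string {
	if runningPods > 0 {
		return "reconcile" // not this failure mode; normal handling applies
	}
	if hasBackup {
		return "restore-from-backup"
	}
	return "dead"
}
```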
@@ -1,13 +1,13 @@
-# etcd cluster lifecycle in etcd controller
+# etcd cluster lifecycle in etcd operator

 Let's talk about the lifecycle of etcd cluster "A".

-- Initially, "A" doesn't exist. The controller considers this cluster to have 0 members.
+- Initially, "A" doesn't exist. The operator considers this cluster to have 0 members.
   Any cluster with 0 members is considered nonexistent.
 - At some point in time, a user creates an object for "A".
-  The controller would receive an "ADDED" event and create this cluster.
+  The operator would receive an "ADDED" event and create this cluster.
   For the entire lifecycle, an etcd cluster can be created only once.
 - Then the user might update the spec of "A" zero or more times.
-  The controller would receive "MODIFIED" events and gradually reconcile the actual state to the desired state of the given spec.
-- Finally, a user deletes the object of "A". The controller will delete and recycle all resources of "A".
+  The operator would receive "MODIFIED" events and gradually reconcile the actual state to the desired state of the given spec.
+- Finally, a user deletes the object of "A". The operator will delete and recycle all resources of "A".
   For the entire lifecycle, an etcd cluster can be deleted only once.
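
The lifecycle above maps onto a simple dispatch over watch events. A minimal sketch, assuming a hypothetical `Cluster` interface with `Create`, `Reconcile`, and `Delete` methods; the real operator's types differ:

```
package lifecycle

// ClusterSpec is a hypothetical stand-in for the user-supplied spec of "A".
type ClusterSpec struct {
	Size    int
	Version string
}

// Cluster is a hypothetical per-cluster handle held by the operator.
type Cluster interface {
	Create(spec ClusterSpec) error    // happens exactly once per lifecycle
	Reconcile(spec ClusterSpec) error // zero or more times, toward the desired spec
	Delete() error                    // happens exactly once per lifecycle
}

// handleEvent dispatches a single watch event for cluster "A".
func handleEvent(c Cluster, eventType string, spec ClusterSpec) error {
	switch eventType {
	case "ADDED":
		return c.Create(spec)
	case "MODIFIED":
		return c.Reconcile(spec)
	case "DELETED":
		return c.Delete()
	}
	return nil
}
```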
@@ -5,7 +5,7 @@
 The primary goals of etcd-operator cluster TLS:
 * Encrypt etcd client/peer communication
 * Cryptographically attestable identities for the following components:
-  * etcd controller
+  * etcd operator
   * etcd cluster TPR objects
   * backup tool pods
   * etcd pods
@@ -18,7 +18,7 @@ Here is the overview for etcd-operator TLS flow, which should set us up well for
 ### Trust delegation diagram:

    -----------------
-   | external PKI  |   (something that can sign the controller's CSRs)
+   | external PKI  |   (something that can sign the operator's CSRs)
    -----------------
    |             |
    |             |   /|\ CERT
@@ -28,7 +28,7 @@ Here is the overview for etcd-operator TLS flow, which should set us up well for
    |             |
    |             |
    |             |
-   |             |---> [ controller client-interface CA ]
+   |             |---> [ operator client-interface CA ]
    |             |
    |             |        --------------> [ etcd-cluster-A client-interface CLIENT CERT ]
    |             |        |
@@ -62,7 +62,7 @@ Here is the overview for etcd-operator TLS flow, which should set us up well for
    |             |        ...
    |
    |
-   |------> [ controller peer-interface CA ]
+   |------> [ operator peer-interface CA ]
    |
    |
    |--------> [ etcd-cluster-A peer interface CERTIFICATE AUTHORITY ]
@@ -90,20 +90,20 @@ Here is the overview for etcd-operator TLS flow, which should set us up well for
 ### Certificate signing procedure

 1. etcd-operator pod startup:
-   * generate `controller CA` private key
-   * generate `controller CA` certificate (peer and client) (select one of the following)
+   * generate `operator CA` private key
+   * generate `operator CA` certificate (peer and client) (select one of the following)
     * generate self-signed cert (default for now, useful for development mode)
     * generate CSR, wait for an external entity to sign it and return the cert via the Kubernetes API (production mode, allows integration with arbitrary external PKI systems)

-2. etcd cluster creation (in controller pod):
+2. etcd cluster creation (in operator pod):
    * generate private key
-   * generate the `cluster CA` as a subordinate CA of the `controller CA`, using parameters from the cluster spec
+   * generate the `cluster CA` as a subordinate CA of the `operator CA`, using parameters from the cluster spec

 3. etcd node pod startup (in the etcd pod, prior to the etcd application starting):
    * generate private key
    * generate a CSR for `CN=etcd-cluster-xxxx`, submit it for signing via annotation (peer and client)

-4. etcd node enrollment (in controller pod) (peer and client)
+4. etcd node enrollment (in operator pod) (peer and client)
    * observe new CSR annotation on `pod/etcd-cluster-xxxx`
    * sign the CSR with the `cluster CA` for `pod/etcd-cluster-xxxx`
    * --> return the certificate via annotation to `pod/etcd-cluster-xxxx`
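
Steps 3 and 4 describe a CSR handshake carried over pod annotations. A minimal sketch of the operator's side of step 4, using hypothetical annotation keys and a hypothetical signing callback rather than the real operator code:

```
package enrollment

// Hypothetical annotation keys; the real key names may differ.
const (
	csrAnnotation  = "etcd.example.com/csr"
	certAnnotation = "etcd.example.com/cert"
)

// enroll performs step 4 for one pod: read the CSR annotation written by the
// etcd pod at startup (step 3), sign it with the per-cluster CA, and write
// the resulting certificate back as another annotation.
func enroll(annotations map[string]string, signWithClusterCA func(csrPEM string) (string, error)) error {
	csrPEM, ok := annotations[csrAnnotation]
	if !ok || csrPEM == "" {
		return nil // the pod has not submitted a CSR yet
	}
	if _, done := annotations[certAnnotation]; done {
		return nil // already enrolled
	}
	certPEM, err := signWithClusterCA(csrPEM)
	if err != nil {
		return err
	}
	annotations[certAnnotation] = certPEM // pushed back to the pod via the API server
	return nil
}
```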
@@ -168,23 +168,23 @@ _note: steps 2-9 can be repeated to implement a primitive cert refresh mechanism
 ------


-Here's a table showing how this process is currently used in the kube etcd controller TLS infrastructure:
+Here's a table showing how this process is currently used in the etcd operator TLS infrastructure:

 | _signer_ | signing CA | _signee_ | signed identities | identity type |
 | ------------- | -------------- | ---------- | ----------------- | ------------- |
-| external pki | external CA | controller | <ul><li>controller peer ca</li><li>controller client ca</li></ul> | CERTIFICATE AUTHORITY |
-| controller | clusterA peer ca | etcd-xxxx | etcd-xxxx-peer-interface | SERVER CERTIFICATE |
-| controller | clusterA client ca | etcd-xxxx | etcd-xxxx, client-interface | SERVER CERTIFICATE |
-| controller | clusterA client ca | clusterA-backup-tool | clusterA-backup-tool, client cert | CLIENT CERTIFICATE |
+| external pki | external CA | operator | <ul><li>operator peer ca</li><li>operator client ca</li></ul> | CERTIFICATE AUTHORITY |
+| operator | clusterA peer ca | etcd-xxxx | etcd-xxxx-peer-interface | SERVER CERTIFICATE |
+| operator | clusterA client ca | etcd-xxxx | etcd-xxxx, client-interface | SERVER CERTIFICATE |
+| operator | clusterA client ca | clusterA-backup-tool | clusterA-backup-tool, client cert | CLIENT CERTIFICATE |

 ## Things to note:
 * **Private Keys Stay Put:** If a private key is needed, it is generated on, and never leaves, the component that uses it. Shuffling private key material across networks is dangerous business.

-  Most importantly, the external PKI component must be allowed to sign the controller's CSR _without_ divulging its CA private key to the cluster.
+  Most importantly, the external PKI component must be allowed to sign the operator's CSR _without_ divulging its CA private key to the cluster.

 * **Separate peer and client cert chains:** The motivation is to provide the ability to isolate the etcd peer (data) plane from the etcd client (control) plane.

-  The client interface CA will be expected to sign CSRs for any new entity that wants to "talk to" the cluster - this includes entirely external components like backup controllers, load-balancers, etc.
+  The client interface CA will be expected to sign CSRs for any new entity that wants to "talk to" the cluster - this includes entirely external components like backup operators, load-balancers, etc.

   The peer interface CA, on the other hand, will sign CSRs only for new entities that want to join the cluster.
@@ -3,7 +3,7 @@
 ## etcd upgrade story

 - A user runs “kubectl apply” with a new version in the EtcdCluster object
-- The etcd controller detects the new version and
+- The etcd operator detects the new version and
   - if the targeted version is allowed, does a rolling upgrade
   - otherwise, rejects it in admission control; we will write an admission plug-in to verify the EtcdCluster object.
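
A minimal sketch of the version gate, assuming plain `major.minor[.patch]` version strings and the one-minor-version rule from the Support notes below; the real admission plug-in will differ:

```
package upgrade

import (
	"strconv"
	"strings"
)

// minorOf parses "3.1" or "v3.1.9" into major and minor numbers.
func minorOf(v string) (major, minor int, ok bool) {
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) < 2 {
		return 0, 0, false
	}
	major, err1 := strconv.Atoi(parts[0])
	minor, err2 := strconv.Atoi(parts[1])
	return major, minor, err1 == nil && err2 == nil
}

// upgradeAllowed enforces the documented policy: etcd v3.x only, and at most
// one minor-version step (3.0 -> 3.1 is fine, 3.0 -> 3.2 is rejected).
func upgradeAllowed(current, target string) bool {
	cMaj, cMin, ok1 := minorOf(current)
	tMaj, tMin, ok2 := minorOf(target)
	if !ok1 || !ok2 || cMaj != 3 || tMaj != 3 {
		return false
	}
	return tMin == cMin || tMin == cMin+1
}
```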
@@ -20,7 +20,7 @@
 ## Support notes

 - Upgrade path: We only support single minor-version upgrades, e.g. 3.0 -> 3.1, not 3.0 -> 3.2. Only v3.0+ is supported.
-- Rollback: We rely on the etcd controller to do periodic backups.
+- Rollback: We rely on the etcd operator to do periodic backups.
   For the alpha release, we will provide features for manual rollback.
   In the future, we might consider supporting automatic rollback.
@@ -2,11 +2,11 @@

 ## Overview

-If a cluster has fewer than a majority of its members alive, the controller considers it a disastrous failure. There might be other disastrous failures. The controller will do disaster recovery in such cases and try to recover the entire cluster from a snapshot.
+If a cluster has fewer than a majority of its members alive, the operator considers it a disastrous failure. There might be other disastrous failures. The operator will do disaster recovery in such cases and try to recover the entire cluster from a snapshot.

 We have a backup pod to save checkpoints of the cluster.

-If a disastrous failure happens but no checkpoint is found, the controller considers the cluster dead.
+If a disastrous failure happens but no checkpoint is found, the operator considers the cluster dead.

 ## Technical details
@@ -1,7 +1,7 @@
-# Controller recovery
+# Operator recovery

 - Create TPR
-- If the creation succeeds, then the controller is a new one and does not require recovery. END.
+- If the creation succeeds, then the operator is a new one and does not require recovery. END.
 - Find all existing clusters
   - loop over the third party resource items to get all created clusters
 - Reconstruct clusters
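
A minimal sketch of this flow, with hypothetical `createTPR`, `listClusterItems`, and `reconstruct` callbacks standing in for the Kubernetes API calls the operator actually makes:

```
package startup

import "errors"

// errAlreadyExists stands in for the Kubernetes "already exists" error that
// signals the TPR was created by a previous operator instance.
var errAlreadyExists = errors.New("third party resource already exists")

// recoverOperator creates the TPR; if that succeeds this operator is new and
// needs no recovery. Otherwise it lists the existing cluster items and
// reconstructs in-memory state for each of them.
func recoverOperator(
	createTPR func() error,
	listClusterItems func() ([]string, error),
	reconstruct func(name string) error,
) error {
	if err := createTPR(); err == nil {
		return nil // brand new operator: nothing to recover
	} else if !errors.Is(err, errAlreadyExists) {
		return err
	}
	names, err := listClusterItems()
	if err != nil {
		return err
	}
	for _, name := range names {
		if err := reconstruct(name); err != nil {
			return err
		}
	}
	return nil
}
```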
@@ -4,7 +4,7 @@

 Given a desired size S, we have two membership states:
 - running pods P in the k8s cluster
-- membership M in the controller's knowledge
+- membership M in the operator's knowledge

 Reconciliation is the process of making these two states consistent with the desired size S.
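
A minimal sketch of comparing the two states against S, using sets of names instead of the operator's real pod and member types; what to do with each difference is left to the reconciliation loop:

```
package reconcile

// Diff summarizes how P and M relate to the desired size S.
type Diff struct {
	PodsWithoutMember []string // running pods unknown to the membership
	MembersWithoutPod []string // known members with no running pod
	SizeDelta         int      // S - len(M); positive means the cluster must grow
}

// computeDiff compares running pods P with known membership M for desired size S.
func computeDiff(P, M map[string]bool, S int) Diff {
	var d Diff
	for p := range P {
		if !M[p] {
			d.PodsWithoutMember = append(d.PodsWithoutMember, p)
		}
	}
	for m := range M {
		if !P[m] {
			d.MembersWithoutPod = append(d.MembersWithoutPod, m)
		}
	}
	d.SizeDelta = S - len(M)
	return d
}
```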
@@ -23,7 +23,7 @@ You should see "vendor/".

 Build the docker image:
 ```
-$ docker build --tag quay.io/coreos/etcd-operator:${TAG} -f hack/build/controller/Dockerfile .
+$ docker build --tag quay.io/coreos/etcd-operator:${TAG} -f hack/build/operator/Dockerfile .
 ```
 `${TAG}` is the release tag. For example, "v0.0.1", "latest".
 We also need to create a corresponding release on GitHub with a release note.
@@ -22,7 +22,7 @@ import (

 const (
 	id       = "UA-42684979-8"
-	category = "etcd-controller"
+	category = "etcd-operator"
 )

 var (
@@ -7,7 +7,7 @@ End-to-end (e2e) testing is automated testing for real user scenarios.
 Prerequisites:
 - a running k8s cluster and kubeconfig. We will need to pass the kubeconfig as an argument.
 - Have the kubeconfig file ready.
-- Have the etcd controller image ready.
+- Have the etcd operator image ready.

 e2e tests are written as go tests. All go test techniques apply, e.g. picking what to run, timeout length.
 Let's say I want to run all tests in "test/e2e/":
@@ -140,13 +140,13 @@ func TestOneMemberRecovery(t *testing.T) {
 }

 // TestDisasterRecovery2Members tests disaster recovery in which
-// the controller will make a backup from the one pod left.
+// the operator will make a backup from the one pod left.
 func TestDisasterRecovery2Members(t *testing.T) {
 	testDisasterRecovery(t, 2)
 }

 // TestDisasterRecoveryAll tests disaster recovery in which
-// we should make a backup ahead and the controller will recover the cluster from it.
+// we should make a backup ahead and the operator will recover the cluster from it.
 func TestDisasterRecoveryAll(t *testing.T) {
 	testDisasterRecovery(t, 3)
 }
@@ -182,7 +182,7 @@ func testDisasterRecovery(t *testing.T, numToKill int) {
 		t.Fatalf("failed to create backup pod: %v", err)
 	}
 	// No pod left to make a backup from. We need to back up ahead.
-	// If there is any pod left, the controller should be able to make a backup from it.
+	// If there is any pod left, the operator should be able to make a backup from it.
 	if numToKill == len(names) {
 		if err := makeBackup(f, testEtcd.Name); err != nil {
 			t.Fatalf("fail to make a latest backup: %v", err)
@@ -192,7 +192,7 @@ func testDisasterRecovery(t *testing.T, numToKill int) {
 	for i := 0; i < numToKill; i++ {
 		toKill[i] = names[i]
 	}
-	// TODO: There might be a race in which the controller recovers members while
+	// TODO: There might be a race in which the operator recovers members while
 	// these members are deleted individually.
 	if err := killMembers(f, toKill...); err != nil {
 		t.Fatal(err)