ARO-RP

backupandfixetcd.sh	Add etcdRecovery maintenance type for admin update - ARO-1534	2023-07-19 13:36:41 -04:00

Steven Fairchild 6c945b07bf Add etcdRecovery maintenance type for admin update - ARO-1534

In the event that a master node changes IP addresses (or NICs), the etcd
quorum will become degraded. The node with the change will then have
its etcd pod in a crash loop. This is due to the hardcoded etcd spec.

This PR adds the EtcdRecovery maintenance task type to remediate this
issue.

How it works:
1. Verify this is the issue by comparing etcd's env variables to the
node's IP address. A degradedEtcd object is returned with relevant
information.
1. Create a batch job to back up etcd's data directory and move the
etcd manifest to stop the pod from crash looping.
1. Create a batch job to run a pod that SSHes into the peer etcd
containers to remove the failing node from the member list.
1. Delete the secrets for the failing pod.
1. Patch etcd.
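
The check in step 1 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `envVar` type, the `findIPMismatch` helper, the fields on `degradedEtcd`, and the assumed env variable naming (node name with dashes replaced by underscores, suffixed with `_IP`) are all hypothetical.

```go
package main

import "strings"

// degradedEtcd is a hypothetical record of a node whose etcd env
// no longer matches the node's current IP address.
type degradedEtcd struct {
	Node  string
	OldIP string
	NewIP string
}

// envVar mirrors a name/value pair from a container's environment.
type envVar struct{ Name, Value string }

// findIPMismatch compares the IP recorded in the etcd container's env
// against the node's current IP.
// Assumption: the relevant variable contains the node name (dashes
// replaced by underscores) and ends with "_IP".
func findIPMismatch(node, nodeIP string, env []envVar) (degradedEtcd, bool) {
	key := strings.ReplaceAll(node, "-", "_")
	for _, e := range env {
		if strings.Contains(e.Name, key) && strings.HasSuffix(e.Name, "_IP") && e.Value != nodeIP {
			return degradedEtcd{Node: node, OldIP: e.Value, NewIP: nodeIP}, true
		}
	}
	return degradedEtcd{}, false
}
```

A caller would run this once per master node and treat any returned `degradedEtcd` as a candidate for the remaining steps.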

There is currently no endpoint to access this recovery task. An
endpoint will be added in a later PR.

Additional scenarios handled:

- Sometimes the etcd deployment can remediate itself after an IP address change, but data from the previous IP address's member is still present. This results in 4/5 containers running in the pod with the etcd container failing, but no IP address conflicts to use for remediation. Added code to find the failing member based on the conditions if no conflict is found
- Check for multiple etcd pods with IP mismatches
- Wait for jobs to reach a succeeded state, which happens when the shell
script exits with code 0. If this never happens, the context is cancelled.
- Return container log files to user from jobs

2023-07-19 13:36:41 -04:00
