ARO-RP/pkg/frontend/scripts
Steven Fairchild 6c945b07bf Add etcdRecovery maintenance type for admin update - ARO-1534
If a master node changes IP addresses (or NICs), the etcd quorum
becomes degraded. The etcd pod on the changed node then goes into a
crashloop. This is due to the hardcoded etcd spec.

This PR adds the EtcdRecovery maintenance task to remediate this
issue.

How it works:
  1. Verify this is the issue by comparing etcd's env variables to the
     node's IP address. A degradedEtcd object is returned with relevant
     information.
  1. Create a batch job to back up etcd's data directory and move the
     etcd manifest to stop the pod from crash looping.
  1. Create a batch job to run a pod that SSHes into the peer etcd
     containers to remove the failing node from the member list.
  1. Delete the secrets for the failing pod.
  1. Patch etcd.
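
The verification in step 1 can be sketched roughly as follows. This is a
minimal illustration, not the code from this PR: the `NODE_<name>_IP` env
variable naming, the `EnvVar` type, the fields on `degradedEtcd`, and the
helper names are all assumptions made for the example.

```go
package main

import (
	"sort"
	"strings"
)

// EnvVar mirrors the name/value shape of a container env variable.
// It is a hypothetical stand-in so the sketch needs no Kubernetes imports.
type EnvVar struct {
	Name  string
	Value string
}

// degradedEtcd is a hypothetical shape for the object returned by step 1.
type degradedEtcd struct {
	Node  string // node whose etcd env no longer matches its IP
	OldIP string // IP recorded in the etcd env variables
	NewIP string // IP the node currently reports
}

// envIPName builds the ASSUMED env variable name for a node,
// e.g. "master-0" becomes "NODE_master_0_IP".
func envIPName(node string) string {
	r := strings.NewReplacer("-", "_", ".", "_")
	return "NODE_" + r.Replace(node) + "_IP"
}

// findIPMismatches compares the IPs recorded in etcd's env variables with
// the IPs the nodes currently report and returns one entry per mismatch.
// Nodes without a matching env variable are skipped.
func findIPMismatches(env []EnvVar, nodeIPs map[string]string) []degradedEtcd {
	recorded := make(map[string]string, len(env))
	for _, e := range env {
		recorded[e.Name] = e.Value
	}

	// Sort node names so the result order is deterministic.
	nodes := make([]string, 0, len(nodeIPs))
	for n := range nodeIPs {
		nodes = append(nodes, n)
	}
	sort.Strings(nodes)

	var out []degradedEtcd
	for _, n := range nodes {
		old, ok := recorded[envIPName(n)]
		if ok && old != nodeIPs[n] {
			out = append(out, degradedEtcd{Node: n, OldIP: old, NewIP: nodeIPs[n]})
		}
	}
	return out
}
```

Returning a slice rather than a single value lets the caller notice when
more than one etcd pod has an IP mismatch.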

There is currently no endpoint to access this recovery task. An
endpoint will be added in a later PR.

Additional scenarios handled:

  - Sometimes the etcd deployment can remediate itself after an IP
    address change, but data from the previous IP address's member is
    still present. This results in 4/5 containers running in the pod
    with the etcd container failing, but no IP address conflict to use
    for remediation. Added code to find the failing member based on
    pod conditions if no conflict is found.
  - Check for multiple etcd pods with IP mismatches
  - Wait for jobs to reach a succeeded state, which happens when the
    shell script exits with code 0. If this never happens, the context
    is cancelled.
  - Return container logs from the jobs to the user
2023-07-19 13:36:41 -04:00
backupandfixetcd.sh Add etcdRecovery maintenance type for admin update - ARO-1534 2023-07-19 13:36:41 -04:00