proposal: redesign of CNS IPAM (#2013)


Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
Evan Baker 2023-09-22 17:07:27 -07:00 committed by GitHub
Parent 22bc8c7024
Commit 3c6bb62ff3
3 changed files: 290 additions and 0 deletions


@@ -0,0 +1,70 @@
## CNS IPAM redesign
### Background
In SWIFT, IPs are allocated to Nodes in batches $B$ according to the request for Pod IPs on that Node. CNS runs on the Node and handles the IPAM for that Node. As Pods are scheduled, the CNI requests IPs from CNS. CNS assigns IPs from its allocated IP Pool, and dynamically scales the pool according to utilization as follows:
- If the number of unassigned IPs in the Pool falls below a threshold ( $m$ , the minimum free IPs), CNS requests an additional batch of IPs from DNC-RC.
- If the number of unassigned IPs in the Pool exceeds a threshold ( $M$ , the maximum free IPs), CNS releases a batch of IPs back to the subnet.
The minimum and maximum free IPs are calculated as fractions of the Batch size: the minimum free IP quantity is the minimum free fraction ( $mf$ ) of the Batch size, and the maximum free IP quantity is the maximum free fraction ( $Mf$ ) of the Batch size. For convergent scaling behavior, the maximum free fraction must exceed the minimum free fraction by at least 1; in practice, $Mf = mf + 1$.
Therefore the scaling thresholds $m$ and $M$ can be described by:
$$
m = mf \times B \text{ , } M = Mf \times B \text{ , and } Mf = mf + 1
$$
Typically, in current deployments, the Batch size is $B = 16$ and the minimum free fraction is $mf = 0.5$, so the minimum free IPs $m = 8$; the maximum free fraction is $Mf = 1.5$, so the maximum free IPs $M = 24$.
### Scaling
The current Pod IP allocation flow is as follows:
- CNS is allocated a Batch of IPs via the NNC and records them internally as "Available"
- As Pods are scheduled on the Node:
  - The CNI makes an IP assignment request to CNS.
  - If there is an Available IP:
    - CNS assigns an Available IP out of the Pool to that Pod.
  - If there is not an Available IP:
    - CNS returns an error.
    - CRI tears down the Pod Sandbox.
- In parallel, CNS monitors the IP Pool as described in the [Background](#background) section above.
  - If the number of Free IPs crosses $M$ or $m$ , CNS requests or releases a Batch of IPs via the `NodeNetworkConfig` CRD.
$$m = mf \times B \quad \text{the Minimum Free IPs}$$
$$\text{if Available IPs} \lt m \text{, request an additional Batch } B$$
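For illustration, the reactive rule can be sketched as follows. This is a minimal sketch with hypothetical names, not the actual CNS pool monitor, which acts on the `NodeNetworkConfig`:
```go
// poolCheck is an illustrative sketch of the reactive scaling rule above.
// It returns the new Requested IP Count given the current Pool state.
func poolCheck(available, assigned, batch int64, mf, Mf float64) int64 {
	requested := available + assigned // current Pool size
	m := int64(mf * float64(batch))   // minimum free IPs
	M := int64(Mf * float64(batch))   // maximum free IPs
	switch {
	case available < m:
		return requested + batch // request an additional Batch via the NNC
	case available > M:
		return requested - batch // release a Batch back to the subnet
	default:
		return requested // within thresholds: do nothing
	}
}
```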
---
```mermaid
sequenceDiagram
participant CRI
participant CNI
participant CNS
participant Network Controller
loop Monitor IP Pool
alt M > Available IPs > m
CNS->CNS: Do nothing
else Resize pool
CNS->>+Network Controller: Request/Release B IPs
Network Controller->>-CNS: Provide IPs in NNC
end
end
CRI->>+CNI: Create Pod
CNI->>+CNS: Request IP
alt IP is Available
CNS->>CNI: Assign IP
CNI->>CRI: Start Pod
else No IP Available
CNS->>-CNI: Error
CNI->>-CRI: Destroy Pod
end
```
### Issues
The existing IP Pool scaling behavior in CNS is reactive and serial.
CNS will only request to increase or decrease its Pool size by a single Batch at a time. It reacts to IP usage, attempting to adjust the Pool size to stay between the minimum and maximum free IP thresholds, but because each adjustment is a single-Batch step, CNS is unable to proactively scale the pool to meet large swings in IP usage (any change in Pod count $\Delta N > B/2$) and will take several round-trips through the NNC to scale the pool to meet the new demand.
This design is also prone to error: because we scale up or down a Batch at a time, we have to recalculate IP usage using "future" expected Free IP counts whenever the Pool size has been updated but the new IP list has not yet propagated through the NNC. This has led to IP leaks, and to CNS getting stuck and being unable to scale up the pool because IPs are still in the process of being allocated or released.
Because the "next" request is based on the "current" request, it is also possible for the Pool to become misaligned to the Batch size if the Request is edited out of band.


@@ -0,0 +1,45 @@
## CNS IPAM Scaling v2
### Scaling Math
The Pool scaling process can be improved by directly calculating the target Pool size from the current IP usage on the Node. Using this idempotent algorithm, we calculate the correct target Pool size in a single step from the current IP usage, rather than reasoning about current and "future" Free IP counts.
The O(1) Pool scaling formula is:
$$
Request = B \times \lceil mf + \frac{U}{B} \rceil
$$
> Note: $\lceil ... \rceil$ is the ceiling function.
where $U$ is the number of Assigned (Used) IPs on the Node, $B$ is the Batch size, and $mf$ is the Minimum Free Fraction, as discussed in the [Background](0-background.md#background).
The resulting IP Count is forward looking without affecting the correctness of the Request: it represents the target quantity of IP addresses that CNS should have at any instant in time based on the current real IP demand, and does not in any way depend on the current or previous Requested IP count or on whether there are unsatisfied requests currently in-flight.
A concrete example:
$$
\displaylines{
\text{Given: }\quad B=16\quad mf=0.5 \quad U=25 \text{ scheduled Pods}\\
Request = 16 \times \lceil 0.5 + \frac{25}{16} \rceil\\
Request = 16 \times \lceil 0.5 + 1.5625 \rceil\\
Request = 16 \times \lceil 2.0625 \rceil\\
Request = 16 \times 3 \\
Request = 48
}
$$
As shown, if the demand is for $25$ IPs, the Batch is $16$, and the Min Free is $8$ (0.5 of the Batch), then the Request must be $48$: $32$ is too few, as $32-25=7 < 8$. The resulting request is also (and will always be) immediately aligned to a multiple of the Batch ( $3B$ ).
This algorithm will significantly improve the time-to-pod-ready for large changes in the quantity of scheduled Pods on a Node, due to eliminating all iterations required for CNS to converge on the final Requested IP Count.
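The formula translates directly to code. A minimal sketch (the function and parameter names are illustrative, not the CNS implementation):
```go
import "math"

// targetPoolSize implements Request = B × ⌈mf + U/B⌉.
func targetPoolSize(used, batch int64, minFreeFraction float64) int64 {
	return batch * int64(math.Ceil(minFreeFraction+float64(used)/float64(batch)))
}
```
For the worked example above, `targetPoolSize(25, 16, 0.5)` returns `48`.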
### Including PrimaryIPs
The IPAM Pool scaling operates only on NC SecondaryIPs. However, CNS is allocated an additional `PrimaryIP` for every NC as a prerequisite of that NC's existence. Therefore, to align the **real allocated** IP Count to the Batch size, CNS should deduct those PrimaryIPs from its Requested (Secondary) IP Count.
This makes the RequestedIPCount:
$$
RequestedIPCount = B \times \lceil mf + \frac{U}{B} \rceil - PrimaryIPCount
$$
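Under the same assumptions, the PrimaryIP deduction is a one-line extension of the sketch above:
```go
// requestedIPCount deducts the NC PrimaryIPs from the target Pool size so that
// the real allocated IP count stays aligned to the Batch size.
func requestedIPCount(used, batch, primaryIPs int64, minFreeFraction float64) int64 {
	return targetPoolSize(used, batch, minFreeFraction) - primaryIPs
}
```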


@@ -0,0 +1,175 @@
## CNS watches Pods to drive IPAM scaling
### Current state
The IPAM Pool Scaling is reactive: CNS assigns IPs out of the IPAM Pool as the CNI asks for them, while trying to maintain a buffer of Free IPs that is within the Scaler parameters. The CNI makes IP assignment requests serially, and as it requests that IPs be assigned or freed, CNS requests to scale the IPAM Pool up or down by adjusting the Requested IP Count in the NodeNetworkConfig. If CNS is unable to honor an IP assignment request because it has no free IPs, the CNI returns an error to the CRI, which causes the Pod sandbox to be cleaned up, and CNS then receives an IP Release request for that Pod.
In this reactive architecture, CNS is unable to track the number of incoming Pod IP assignment requests or predict how many IPs it may soon need. CNS can only reliably scale by a single Batch at a time when it runs out of Free IPs. For example:
| Time | State |
| ---- | -------------------------------------------------------------------- |
| $T_0$ | 1 Pod $P_0$ is scheduled: CNS has 1 Batch ( $16$ ) IPs |
| $T_1$ | 35 Pods are scheduled for a Total of 36 Pods |
| $T_2$ | CNI is sequentially requesting IP assignments for Pods, and for Pod $P_8$, CNS has less than $B\times mf$ unassigned IPs and requests an additional Batch of IPs |
| $T_3$ | CNI requests an IP for Pod $P_{16}$ but CNS is out of free IPs and returns an error |
| $T_3+$ | - CRI tears down $P_{16}$, and CNI requests that CNS frees the IP for $P_{16}$ <br> - $P_{17-36}$ are similarly stuck, pending available IPs |
| $T_4$ | CNS receives an additional Batch of IPs, and as the CNI retries they are assigned to $P_{16-31}$ |
| $T_5$ | CNS has too few unassigned IPs again and requests another Batch |
| $T_6$ | $P_{32}$ is stuck, pending available IPs |
| $T_7$ | CNS receives an additional Batch of IPs, and as the CNI retries they are assigned to $P_{32-35}$ |
| ... | ...|
By proactively watching Pods instead of waiting for the CNI requests, this process could be faster and simpler:
| Time | State |
| ---- | -------------------------------------------------------------------- |
| $T_0$ | 1 Pod $P_0$ is scheduled: CNS has 1 Batch ( $16$ ) IPs |
| $T_1$ | 35 Pods are scheduled for a Total of 36 Pods |
| $T_2$ | CNS sees 36 Pods have been scheduled and updates the Requested IP Count to $48$ according to the [Scaling Equation](1-ipam-math.md#scaling-math) |
| $T_3$ | CNS receives 48 total IPs, and as the CNI requests IP assignments they are assigned to $P_{1-35}$ |
### Performance Considerations
Migrating CNS IPAM from a reactive to a proactive architecture is a significant change to the CNS <-> Kubernetes interactions and has the potential to increase the load on the API Server. However, this use-case is a common one: the Kubelet, notably, also watches Pods on each Node, and that path is highly optimized.
By leveraging similar patterns and some Kubernetes-provided machinery, we can make this change efficiently.
#### SharedInformers and local Caches
Kubernetes `client-go` [provides machinery for local caching](https://github.com/kubernetes/sample-controller/blob/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/controller-client-go.md): Reflectors, (Shared)Informers, Indexer, and Stores
<p align="center">
<img src="https://raw.githubusercontent.com/kubernetes/sample-controller/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/images/client-go-controller-interaction.jpeg" height="600" width="700"/>
</p>

> [Image from kubernetes/sample-controller documentation](https://github.com/kubernetes/sample-controller/blob/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/controller-client-go.md).
By leveraging this machinery, CNS will set up a `Watch` on Pods which will open a single long-lived socket connection to the API Server and will let the API Server push incremental updates. This significantly decreases the data transferred and API Server load when compared to naively polling `List` to get Pods repeatedly.
Additionally, any read-only requests (`Get`, `List`, `Watch`) that CNS makes to Kubernetes using a cache-aware client will hit the local Cache instead of querying the remote API Server. This means that the only requests leaving CNS to the API Server for this Pod Watcher will be the Reflector's List and Watch.
#### Server-side filtering
To reduce API Server load and traffic, CNS can use an available [Field Selector](https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors/) for Pods: [`spec.nodeName=<node>`](https://github.com/kubernetes/kubernetes/blob/691d4c3989f18e0be22c4499d22eff95d516d32b/pkg/apis/core/v1/conversion.go#L40). Field selectors are, like Label Selectors, [applied on the server-side](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#list-and-watch-filtering) to List and Watch queries to reduce the dataset that is returned from the API Server to the Client.
By restricting the Watch to Pods on the current Node, the traffic generated by the Watch will be proportional to the number of Pods on that Node, and will *not* scale in relation to either the number of Nodes in the cluster or the total number of Pods in the cluster.
### Implementation
To make setting up the filters, SharedInformers, and cache-aware client easy, we will use [`controller-runtime`](https://github.com/kubernetes-sigs/controller-runtime) and create a Pod Reconciler. A controller already exists for managing the `NodeNetworkConfig` CRD lifecycle, so the necessary infrastructure (namely, a Manager) already exists in CNS.
To create a filtered Cache during the Manager instantiation, the existing `nodeScopedCache` will be expanded to include Pods:
```go
import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	//...
)

//...
nodeName := "the-node-name"
// the nodeScopedCache sets Selector options on the Manager cache which are used
// to perform *server-side* filtering of the cached objects. This is very important
// for high node/pod count clusters, as it keeps us from watching objects at the
// whole cluster scope when we are only interested in our Node's scope.
nodeScopedCache := cache.BuilderWithOptions(cache.Options{
	SelectorsByObject: cache.SelectorsByObject{
		// existing options
		//...,
		&v1.Pod{}: {
			Field: fields.SelectorFromSet(fields.Set{"spec.nodeName": nodeName}),
		},
	},
})
//...
manager, err := ctrl.NewManager(kubeConfig, ctrl.Options{
	// existing options
	//...,
	NewCache: nodeScopedCache,
})
```
After the local Cache and ListWatch have been set up correctly, the Reconciler should use the Manager-provided Kubernetes API Client within its event loop so that reads hit the cache instead of the real API Server.
```go
import (
	"context"

	v1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type Reconciler struct {
	client client.Client
}

func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	pods := v1.PodList{}
	// this List is served out of the local cache, not by the API Server.
	if err := r.client.List(ctx, &pods); err != nil {
		return reconcile.Result{}, err
	}
	// do things with the list of pods
	// ...
	return reconcile.Result{}, nil
}

func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
	r.client = mgr.GetClient()
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1.Pod{}).
		Complete(r)
}
```
This can be further optimized by ignoring "Status" Updates to any Pods in the controller setup func:
```go
import (
	// ...as above, plus:
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
	r.client = mgr.GetClient()
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1.Pod{}).
		WithEventFilter(predicate.Funcs{
			// check that the generation has changed - status changes don't update generation.
			UpdateFunc: func(ue event.UpdateEvent) bool {
				return ue.ObjectOld.GetGeneration() != ue.ObjectNew.GetGeneration()
			},
		}).
		Complete(r)
}
```
Note:
The CNS RBAC will need to be updated to include permission to access Pods:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  # ClusterRoles are cluster-scoped, so no namespace is set.
  name: pod-ro
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - watch
```
### The updated IPAM Pool Monitor
When CNS is watching Pods via the above mechanism, the number of Pods scheduled on the Node (after discarding `hostNetwork: true` Pods) is the instantaneous IP demand for the Node. This IP demand can be fed into the IPAM Pool scaler in place of the "Used" quantity described in the [idempotent Pool Scaling equation](1-ipam-math.md#scaling-math):
$$
Request = B \times \lceil mf + \frac{Demand}{B} \rceil
$$
to immediately calculate the target Requested IP Count for the current actual Pod load. At this point, CNS can proactively scale directly to the necessary number of IPs in a single operation, as soon as Pods are scheduled on the Node, without waiting for the CNI to request IPs serially.
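Tying the pieces together, a minimal sketch of that event loop (illustrative only: the `requestedIPCount` helper from the [scaling math](1-ipam-math.md#scaling-math) sketch and the Reconciler fields holding the Scaler parameters are assumptions, and writing the result to the `NodeNetworkConfig` is elided):
```go
func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// served from the node-scoped local cache configured above.
	pods := v1.PodList{}
	if err := r.client.List(ctx, &pods); err != nil {
		return reconcile.Result{}, err
	}
	// hostNetwork Pods do not consume Pod IPs, so they are not demand.
	demand := int64(0)
	for i := range pods.Items {
		if !pods.Items[i].Spec.HostNetwork {
			demand++
		}
	}
	// feed the instantaneous demand into the idempotent scaling formula;
	// r.batch, r.primaryIPs, and r.minFreeFraction are hypothetical fields
	// holding the Scaler parameters.
	target := requestedIPCount(demand, r.batch, r.primaryIPs, r.minFreeFraction)
	_ = target // update the NodeNetworkConfig Requested IP Count (not shown)
	return reconcile.Result{}, nil
}
```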