CNS Prometheus and Grafana examples (#1366)

* cns prometheus examples

Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>

* grafana samples

Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
2022-05-11 12:34:47 -05:00 · 2022-05-11 12:34:47 -05:00 · 8750b346ed
--- a/cns/doc/examples/metrics/README.md
+++ b/cns/doc/examples/metrics/README.md
@ -0,0 +1,66 @@
+# Azure CNS metrics
+azure-cns exposes metrics in Prometheus format at `:10092/metrics`.
+
+## Scraping 
+Prometheus can be configured using these examples: 
+- a [podMonitor](podMonitor.yaml), if using prometheus-operator or kube-prometheus
+- manually via this equivalent [scrape_config](scrape_config.yaml)
+
+## Monitoring
+To view all available CNS metrics once Prometheus is correctly configured to scrape:
+```promql
+count ({job="kube-system/azure-cns"}) by (__name__)
+```
+
+CNS exposes standard Go and Prometheus metrics such as `go_goroutines`, `go_gc*`, `up`, and more.
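+
+As a minimal sketch of using one of these standard metrics, the query below checks whether each CNS target is being scraped successfully. It assumes the same `job` label value used in the other examples in this document; Prometheus itself sets `up` to `1` after a successful scrape and `0` after a failed one.
+```promql
+# 1 = last scrape succeeded, 0 = last scrape failed
+up{job="kube-system/azure-cns"}
+```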
+
+Metrics designed to be customer-facing are generally prefixed with `cx_` and can be listed similarly:
+```promql
+count ({__name__=~"cx_.*",job="kube-system/azure-cns"}) by (__name__)
+```
+At the time of writing, the following cx metrics are exposed (key metrics in **bold**):
+- **cx_ipam_available_ips** (IPs reserved by the Node but not assigned to Pods yet)
+- cx_ipam_batch_size
+- cx_ipam_current_available_ips
+- cx_ipam_expect_available_ips
+- **cx_ipam_max_ips** (maximum IPs the Node can reserve from the Subnet)
+- cx_ipam_pending_programming_ips
+- cx_ipam_pending_release_ips
+- **cx_ipam_pod_allocated_ips** (IPs assigned to Pods on the Node)
+- cx_ipam_requested_ips
+- **cx_ipam_total_ips** (IPs reserved by the Node from the Subnet)
+
+These metrics may be used to gain insight into the current state of the cluster's IPAM.
+
+For example, to view the current IP count requested by each node:
+```promql
+sum (cx_ipam_requested_ips{job="kube-system/azure-cns"}) by (instance)
+```
+To view the current IP count allocated to each node:
+```promql
+sum (cx_ipam_total_ips{job="kube-system/azure-cns"}) by (instance)
+```
+> Note: if these two values aren't converging after some time, that indicates an IP provisioning error.
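+
+The convergence check described in the note above could be sketched as a single query. This is an illustration rather than a query shipped with CNS; it assumes both metrics carry the same `instance` label so that the two sides match one-to-one:
+```promql
+# Requested minus reserved IPs, per node
+  sum (cx_ipam_requested_ips{job="kube-system/azure-cns"}) by (instance)
+-
+  sum (cx_ipam_total_ips{job="kube-system/azure-cns"}) by (instance)
+```
+A result that stays non-zero for a node over an extended period would suggest that node's requested and reserved counts are not converging.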
+
+To view the current IP count assigned to pods, per node:
+```promql
+sum (cx_ipam_pod_allocated_ips{job="kube-system/azure-cns"}) by (instance)
+```
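+The per-node counts can also be combined. As a rough sketch, assuming `cx_ipam_total_ips` is non-zero on every node (division by zero yields `NaN` or `Inf` in PromQL), the following estimates the fraction of each node's reserved IPs that are assigned to pods:
+```promql
+  sum (cx_ipam_pod_allocated_ips{job="kube-system/azure-cns"}) by (instance)
+/
+  sum (cx_ipam_total_ips{job="kube-system/azure-cns"}) by (instance)
+```
+This ratio is built only from the metrics listed above and is not necessarily the query used by the sample dashboard.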
+
+## Visualizing
+A sample Grafana dashboard is included at [grafana.json](grafana.json).
+
+Visualizations included are: 
+- Per Node
+    - CNS Status (Up/Down)
+    - Requested IPs
+    - Reserved IPs
+    - Used IPs
+    - Requested/Reserved/Used vs Time
+- Per Cluster
+    - Total Reserved IPs vs Time
+    - Total Used IPs vs Time
+    - Reserved and Assigned vs Time
+    - Cluster Subnet Utilization Percentage vs Time
+    - Cluster Subnet Utilization Total vs Time
+    - Node Headroom (how many additional Nodes can be added to the Cluster based on the Subnet capacity)
--- a/cns/doc/examples/metrics/grafana.json
+++ b/cns/doc/examples/metrics/grafana.json
--- a/cns/doc/examples/metrics/podMonitor.yaml
+++ b/cns/doc/examples/metrics/podMonitor.yaml
@ -0,0 +1,14 @@
+## This example podMonitor config can be used with a Prometheus-Operator 
+## managed Prometheus to automatically discover and collect azure-cns metrics.
+---
+apiVersion: monitoring.coreos.com/v1
+kind: PodMonitor
+metadata:
+  name: azure-cns
+  namespace: kube-system
+spec:
+  podMetricsEndpoints:
+  - port: metrics
+  selector:
+    matchLabels:
+      k8s-app: azure-cns
--- a/cns/doc/examples/metrics/scrape_config.yaml
+++ b/cns/doc/examples/metrics/scrape_config.yaml
@ -0,0 +1,76 @@
+## This example Prometheus scrape-config can be used with a manually 
+## configured Prometheus to collect azure-cns metrics.
+- job_name: azure-cns
+  honor_timestamps: true
+  scrape_interval: 30s
+  scrape_timeout: 10s
+  metrics_path: /metrics
+  scheme: http
+  follow_redirects: true
+  enable_http2: true
+  relabel_configs:
+  - source_labels: [job]
+    separator: ;
+    regex: (.*)
+    target_label: __tmp_prometheus_job_name
+    replacement: $1
+    action: replace
+  - source_labels: [__meta_kubernetes_pod_label_k8s_app, __meta_kubernetes_pod_labelpresent_k8s_app]
+    separator: ;
+    regex: (azure-cns);true
+    replacement: $1
+    action: keep
+  - source_labels: [__meta_kubernetes_pod_container_port_name]
+    separator: ;
+    regex: metrics
+    replacement: $1
+    action: keep
+  - source_labels: [__meta_kubernetes_namespace]
+    separator: ;
+    regex: (.*)
+    target_label: namespace
+    replacement: $1
+    action: replace
+  - source_labels: [__meta_kubernetes_pod_container_name]
+    separator: ;
+    regex: (.*)
+    target_label: container
+    replacement: $1
+    action: replace
+  - source_labels: [__meta_kubernetes_pod_name]
+    separator: ;
+    regex: (.*)
+    target_label: pod
+    replacement: $1
+    action: replace
+  - separator: ;
+    regex: (.*)
+    target_label: job
+    replacement: kube-system/azure-cns
+    action: replace
+  - separator: ;
+    regex: (.*)
+    target_label: endpoint
+    replacement: metrics
+    action: replace
+  - source_labels: [__address__]
+    separator: ;
+    regex: (.*)
+    modulus: 1
+    target_label: __tmp_hash
+    replacement: $1
+    action: hashmod
+  - source_labels: [__tmp_hash]
+    separator: ;
+    regex: "0"
+    replacement: $1
+    action: keep
+  kubernetes_sd_configs:
+  - role: pod
+    kubeconfig_file: ""
+    follow_redirects: true
+    enable_http2: true
+    namespaces:
+      own_namespace: false
+      names:
+      - kube-system