Commit Graph

24 Commits

Author SHA1 Message Date
Hunter Gregory ebddca18bd
perf: [NPM] [LINUX] add NetPols in background (#1969)
* wip: apply dirty NetPols every 500ms in Linux

* only build npm linux image

* fix: check for empty cache

* feat: toggle for netpol interval. default 500 ms

* ci: remove stages "build binaries" and "run windows tests"

* wip: max batched netpols (toggle-specified)

* ci: remove manifest build/push for win npm

* wip: handle ipset deletion properly and max batch for delete too

* fix: correct remove policy

* fix: only remove policy if it was in kernel

* finalize toggles, allowing ability to turn off iptablesInBackground

* ci: conf + cyc use PR's configmaps

* fix: lints

* fix dp toggle: iptablesInBackground

* fix lock typo and config logging

* fix background thread. add comments. only add tmp ref when enabled

* copy pod selector list

* fix: removepolicy needs namespace too

* rename opInfo to event

* fix: fix references and prevent concurrent map read/write

* tmp: debug logging

* fix: missing set references by swap keys and values

* Revert "tmp: debug logging"

This reverts commit 70ed34c714ea4a6d009a1fe90a7168be4bedd5bf.

* fix: add podSelectorList to fake NetPol

* log: do not print error when failing to delete non-existent nft rule

* log: verbose iptables bootup

* log: use fmt.Errorf for clean logging

* log: never return error for iptables in background and fix some lints

* fix: activate/deactivate azure chain rules

* fix: correctly decrement netpols in kernel

* ci: run UTs again

* ci: update profiles. default to placefirst=false

* address comment: rename batch to pendingPolicy

* refactor: make dirty cache OS-specific

* test: UTs

* test: put UT cfg back to placefirst to not break things

* ci: update cyclonus workflows

* fmt: address comment & lint

* fmt: rename numInKernel to policiesInKernel

* log: switch to fmt.Errorf

* fmt: whitespace

* feat: resiliency to errors while reconciling dirty netpols

* log: temporarily print everything for ipset restore

* fix: remove nomatch from ipset -D for cidr blocks

* test: UTs for non-happy path

* test: fix hns fake

* fix: don't change windows. let it delete ipsets when removing policies

* fix windows lint

* fix: ignore chain doesn't exist errors for iptables -D

* feat: latency and failure metrics

* test: update exit code for UT

* metrics: new metrics should go in node-metrics path

* style: simplify nesting

* style: move identical windows & linux code to shared file

* ci: remove v1 conformance and cyclonus

* feat: add NetPols in background from the DP (revert background code in pMgr)

* style: remove "background" from iptables metrics

* revert changes in ipsetmanager, const.go, and dp.Remove/UpdatePolicy

* style: whitespace

* perf: use len() instead of creating slice from map

* remove verbosity for iptables bootup

* build: add return statement

* style: whitespace

* build: fix variable shadowing

* build: fix more import shadowing

* build: windows pointer issue and UT issue

* test: fix UT for iptables error code 2

* ci: enable linux scale test

* ci: revert to master pipeline.yaml

* revert changes to chain-management. do changes in PR #2012

* log: change wording

* test: UTs for netpol in background

* log: wording

* feat: apply ipsets for each netpol individually

* config: rearrange ConfigMap & update capz yaml

* fix: windows bootup phase logic for addpolicy

* feat: restrict netpol in background to linux + nftables

* test: skip nftables check for UT

* style: netpols[0] instead of loop

* log: address log comments

* style: lint for long line

---------

Co-authored-by: Vamsi Kalapala <vakr@microsoft.com>
2023-07-19 09:13:52 -07:00
Hunter Gregory 96243c325e
fix: [WIN-NPM] fix units of new latency metrics (#2018)
* fix: new latency metrics use seconds instead of milliseconds

* chore: lint

* fix: lower buckets to start at 8 milliseconds
2023-06-16 23:22:27 +00:00
Hunter Gregory e6eeb5014a
feat: [NPM] metric for total Pod IPs (#1999)
* feat: remove metric for max members in ipset and add metric for total pod IPs in cluster

* feat: update total pod IP metric in pod controller

* log: make sure we log apply dp errors

* test: UTs for pod count metric

* style: rename metric to customer_pods

* test: revert changes from previous PR to ipsetMgr windows UTs

* style: rename metric to pods_watched

* test: try longer wait

* debug: tmp log

* Revert "debug: tmp log"

This reverts commit 71529a8643.

* style: fix whitespace
2023-06-09 09:59:27 -07:00
Hunter Gregory 04f92857f2
feat: [WIN-NPM] metrics for latencies and failures (#1959)
* implement metrics

* add npm prefix

* rename windows files

* metrics pkg UTs

* allow reinitializing prometheus metrics

* fix: hns wrapper should not throw error for empty SetPolicy values

* test: metric UTs in dataplane

* fix: record list endpoint latency always

* remove flaky UT

* feat: metric for max ipset members

* fix lint

* fix lint 2

* fix build

* fix lint 3

* simplify conditionals and protect against maxMembers becoming negative

* remove bottom 4 histogram buckets. start at 16 ms

* reset metrics for ipset UTs

* style: don't check for windows dp in *_windows.go files

* build: remove unused import

* test: reset windows metrics in UT
2023-06-05 12:43:39 -07:00
Hunter Gregory 09cd371fb4
fix: [NPM] cleanup restarted pod stuck with no IP (#1503)
* print statements

* cleanup Running pod with empty IP

* add log line

* revert previous 3 commits

* enqueue updates with empty IPs and add prometheus metric

* fix lints

* handle pod assigned to wrong endpoint edge case

* log and update comment

* UTs and fixed named port + build

* reset entire endpoint regardless of cache

* remove comment in dp.go

* fix windows build issues

* skip refreshing endpoints and address comments

* only sync empty ip if pod running. add tmp log

* undo special pod delete logic

* reference GH issue

* fix Windows UTs

* remove prometheus metrics and a log

---------

Co-authored-by: Vamsi Kalapala <vakr@microsoft.com>
2023-02-15 13:38:47 -08:00
Hunter Gregory 8cc8e7f1ff
fix: [NPM-LINUX] resiliency for several non-retriable errors (#1566)
* adaptively modify linux max restore try count to prevent perpetual errors

* remove debug print

* log restore file and send ipsetmanager_linux errors

* send other appropriate errors

* fix handleLineError function

* fix printing restore lines and enhance a log

* fix lints and wrap chainLineNumber errors

* fix one off error for logging the try count

* revert exponential increase to try limit

* update try count to 5 and update UTs

* do not log lines for every restore call until perf is understood
2022-11-23 10:38:21 -08:00
Vamsi Kalapala 311eba6c3e
test: [NPM] Removing fail on AITelemetry error (#1288)
* Removing extra log lines and adding an option to print in sendLog

* removing fail on AI initialization error.

* fixing lint
2022-03-17 15:37:02 -07:00
Hunter Gregory 26a4b6571e
feat: [NPM] include NPM v1/v2 in telemetry and fix heartbeat log (#1266)
* include NPM v1/v2 in telemetry

* fix heartbeat
2022-03-08 10:39:16 -08:00
Hunter Gregory 1ea2f5a745
feat: [NPM] num ACL rules for v2 & update existing metrics (#1223)
* wip

* fix windows build err

* address comments

* fix lingering merge conflicts
2022-02-09 20:38:46 -08:00
Hunter Gregory c820189a2f
feat: [NPM] send more AI logs (#1230)
* send heartbeat log and send logs in v2

* address comment and add logs for ip validation and policy manager bootup
2022-02-09 15:28:05 -08:00
Hunter Gregory d5f134d597
feat: [NPM] perf metrics for pod/ns/policy CRUD (#1220)
* add dataplane health metrics

* change counters to countervecs

* wip

* uncomment metrics.ReinitializeAll()

* add comment about ReinitializeAll

* restructure prometheus-metrics.go, address comments, and finish UTs for v1

* properly record exec times and include error labels

* add error label to add_policy_exec_time

* add v2 UTs, test NoOp, and address comment

* resolve lints
2022-02-09 15:27:35 -08:00
Hunter Gregory 667743d79f
fix: [NPM] fix incorrect ipset create, fix 1-off prometheus logic, and add cache checking for UTs (#1202)
* wip

* dont touch v1 metrics code and fix lints

* add comment and comment code to resolve lint

* optimize looping and dirty cache updates

* address comments and change param type of modifyCacheForKernelMemberUpdate to reduce map lookups

* add exec time metrics

* UTs

* fix lints

* initialize metrics in policymanager tests

* fix bug in publishing npm logs
2022-02-01 14:47:20 -08:00
Hunter Gregory da4bd0d43b
feat: [NPM] Reset ipsets & update a Prometheus function ResetIPSetEntries (previously just used for UTs) (#1108)
* fix: for prometheus ResetIPSetEntries & feat: reset ipsets in NPM v2

* add note about difference in prometheus metrics in v1 vs v2, and strengthen a UT

* add comment to delegate prometheus metrics from generic ipsetmanager to OS-specific ones

* fix UT for dataplane_test.go and fix lint

* rename variables based on suggestions

* switch to unnamed return values (will throw a go lint error)
2021-11-18 10:00:54 -08:00
Hunter Gregory d00aa2e9b1
NPM Prometheus Unit Tests (#1016)
* fixed bug in NumIPSetsIsPositive()

* moved code for getting metric values to a new file

* renamed file

* unit tests for prometheus metrics

* fix go lints

* use fexec for TestDestroyNpmIpsets()
2021-09-21 10:02:58 -07:00
Hunter Gregory fe23878507
Remove test coverage (#1007)
* removed test/ and testutil/ from code coverage

* remove promutil from coverage

* removed tools/ from code coverage

* removed crd/ from code coverage and updated multitenantnetworkcontainer's manifest

* switch to !ignore_NAME syntax for test and cli tags

* add coverage back to crd (besides autogenerated files)

* rename ignore_test and ignore_cli tags to ignore_uncovered

* make cns/fakes/ uncovered

* mark go files in crd api folders as uncovered again

* add main.go back for nnsmock server
2021-09-17 15:29:40 -07:00
Hunter Gregory 0dd10e4e89
NPM Prometheus Update (#986)
* made prometheus exec time metrics for ipsets and iptables in line with those for network policies (exec time recorded even for failures). Also made prometheus timer variable names clearer.

* fixed faulty prometheus handler test looking for a node metric name when testing the cluster metric handler

* add clarity in comments related to the IPSetInventory metric

* Include prometheus metrics for lists and in DestroyNPMIpsets(). Only make metric updates when there's no error

* refactor prometheus testing and include metric tests for lists and NPMDestroyIpsets()

* better check for empty response to ipset list in DestroyNpmIpsets()

* remove unused clientset from controllers

* replace function for setting ipset inventory with function for removing ipset for better readability. updating comments too

* reset ipset inventory before each unit test

* added unit test for adding to set with pod cache

* remove unused cluster state function and clientset from np manager

* fix build problems: remove clientset from calls to npm.NewNetworkPolicyManager()

* fix logic for destroy ipsets for situation when destroy is called while num ipsets is 0

* delete commented out function

* encapsulated prometheus metrics, refactored prometheus testing for iptm and netpol controller, and removed clientset from controller creation in test files (fixing build error)

* update test for DestroyNpmIpsets() to always use a new Exec
2021-09-10 15:53:58 -07:00
Evan Baker 96bec09d41
chore: appease the linter (3/?), the big gofumpt (#987)
* gofumpt -w -s .

* small addtl cleanups after gofumpt

* rerun after rebase
2021-09-02 16:33:18 -05:00
Evan Baker 1087201b28
chore: appease the linter, pt 2 of ? (#925)
2021-09-01 18:28:17 -05:00
JungukCho d8169318f1
[NPM] support network policy controller and its unit tests (#849)
* first version of network policy controller and its unit tests

* update reconcile and deleteNetworkPolicy function to correctly install and uninstall default Azure NPM chain.

* To explicitly manage default Azure NPM chain in deleteNetworkPolicy function

* correct comments and delete unused variable

* fix missed returning of errors in code

* Correct to check DeletionTimestamp and DeletionGracePeriodSeconds variables

* removed placeholder functions in network policy controller and added more test cases (e.g., update and adding multiple network policies)

* - applied comments (use explicit names, locating lock in a better place)

* add two methods to save and restore iptables in unit test

* comment out unused function

* early filter in updateNetworkPolicy function if they are the same network policies. Update unit tests to test more network policies events

* - start using klog package instead of log package

* remove unneeded defer for lock

* Add and delete the network policy object from our network policy cache in the right place. Correct prometheus metric code.

* use cached network policy key instead of network policy object as method parameter in cleanUpNetworkPolicy

* remove redundant check

* Remove ns- prefix as key in RawNpMap. Update UT to check prometheus metrics. Applied better naming and removed redundancy codes.

* minor update for variable names

* remove dependency between UT by re-initializing metrics. Correct message.
2021-04-14 10:35:36 -07:00
Mathew Merrick d169929048
Npm debug tools (#817)
* add initial debug tools

* export member variables for debug api

* add dependencies

* update metrics and tests

* remove refactor artifacts
2021-03-11 11:47:34 -08:00
shchen 0835cae2d1
Change AI log and metrics sending function name in NPM. (#737)
2020-11-23 23:14:31 -08:00
shchen 1330e4aa3b
Add error log and metrics to AI telemetry. (#656)
* Accelerate metrics report from every 30 mins to every 5 mins.

* Add errCountTest metric.

* Refactor SendAiMetrics. AI initialization is in main routine while send metrics is in another go routine.

* Add aiMetadata config.

* Add SendErrorMetrics function in ai utils.

* Going to push error log to AI telemetry.

* Add error log to AI telemetry.

* Change error message format.

* Add error log and metrics to AI telemetry.

* Remove unnecessary const.

* Change heartbeat back to every 30 mins.

* Separate send log from SendErrorMetric function for better reuse.

* Change a unit test set name to avoid kernel conflict.

* Address comments. Make error log and metrics sending more generic.

* Fix typo.

* Fix indentation.

* Fix AI initialize issue.

* Remove unnecessary log.

* Use break in if condition.
2020-09-04 10:57:37 -07:00
Hunter Gregory 74c0521de4
Efficient prometheus (#629)
* made ipset inventory metric more efficient for container insights scraping. Added metric for total ipset entries

* updated comment for GetVecValue

* changed prometheus metrics port number from 8000 to 10091 to be next to the node port used in CNS

* added cluster service for NPM Prometheus metrics (lets a scraper only scrape this service for node redundant metrics)

* separated node and cluster metrics into separate registries and HTTP endpoints

* separated functionality for getting IPSetInventory labels and made public

* updated initialization of IPSetInventory to have hash set label and changed ipsm tests to mirror this

* added two yaml options for configuring a prometheus server to scrape NPM efficiently. Removed generic prometheus annotations on NPM pod to prevent default scraping of NPM for a helm prometheus server, and added a specific annotation for the alternative prometheus server config

Co-authored-by: Hunter Gregory <t-hugreg@microsoft.com>
2020-08-05 11:21:29 -04:00
Hunter Gregory 88ea3c2acd
Prometheus metrics (#590)
* prometheus additions to testmain (commented out right now)

* home of the npm prometheus metrics and tools for updating them, testing them

* add/remove policy metrics

* add/remove iptables rule metric measurements

* add/remove ipset metric measurements

* testing for gauges. want to soon remove the boolean for including prometheus in unit testing

* run http server that exposes prometheus from main

* cleaner test additions with less code

* removed incorrect instance of AddSet in the TestDeleteSet test

* added prometheus annotations to pod templates

* deleted unused file

* much more organized initialization of metrics now. now includes map from metric to metric name

* add ability to get summary count value. now getting gauge values and this new count value are done by passing the metric itself as a param instead of a string

* condenses prometheus testing code base by condensing all prometheus error messages into a function

* added testing for summary counts, condensed prometheus error handling code, and updated calls to use new form for getting metric values

* update based on variable spelling change in metrics package

* Added comments for functions and moved http handler code to the http file

* fixed problem of registering same metric name for different metrics, and passing in the wrong param type for testing

* made prometheus testing folder with interactive testing file. moved old random metric flux testing function over from ipsm_test

* moved testing around again

* fixed spelling mistake

* counting mistake in unit test

* handler variable was in wrong file. Changed stdout printing to logging

* fixed parameter errors and counting error in a test

* moved utilities for testing prometheus metrics to npm/util. Updated StartHTTP to have an additional parameter for waiting after starting the server

* updated uses of StartHTTP to have the extra parameter

* updated GetValue and GetCountValue uses to use the prometheus features of the util package, which is now moved to a promutil package within npm/metrics/

* removed unnecessary comments, removed print statement, and added quantiles to all summary metrics

* fixed problem of double registering metrics

* wait longer for http server to start

* moved tool in test-util.go to promutil/util.go

* fixed timer to be in milliseconds and updated metric descriptions to mention units

* removed unnecessary comments

* http server always started in a go routine now. Added comment justifying the use of an http server

* debugging http connection refused in pipeline

* fixed syntax error

* removed debugging wrapper around http service

* sleep so that the testing metrics endpoint can be pinged

* redesigned GetValue and GetCountValue so that they don't use http calls

* removed random but helpful testing file - will write about quick testing in a wiki page

* milliseconds were being truncated. now they have decimals

* use direct Prometheus metric commands instead of wrapping them

* removed code used when testing was done through http server. Moved registering to metric creation functions

* added createGaugeVec, updated comments, made all help strings constants

* added metric that counts number of entries in each ipset. still need to add tests

* fixed creation of GaugeVecs, and use explicit labeling instead of order-based labeling now

* updated GetVecValue method signature

* added set to metrics on creation and wrote unit tests for CreateSet, AddToSet, DeleteFromSet, DeleteSet

* use custom registry to limit content that Container Insights scrapes. Also log the start of http server

* wrote TODO item comments for Restore and Destroy (currently these functions are only used in testing)

* NPM won't crash if a Prometheus metric fails to register now (unlikely). Added logging for metric registration/creation, and explicit public function to initialize metrics so that we can finish log config first

* initialize metrics in unit tests

* renamed util.go to test-util.go

Co-authored-by: Hunter Gregory <t-hugreg@microsoft.com>
2020-07-14 19:41:02 -04:00