* wip: apply dirty NetPols every 500ms in Linux
* only build npm linux image
* fix: check for empty cache
* feat: toggle for netpol interval. default 500 ms
* ci: remove stages "build binaries" and "run windows tests"
* wip: max batched netpols (toggle-specified)
* ci: remove manifest build/push for win npm
* wip: handle ipset deletion properly and max batch for delete too
* fix: correct remove policy
* fix: only remove policy if it was in kernel
* finalize toggles, allowing ability to turn off iptablesInBackground
* ci: conf + cyc use PR's configmaps
* fix: lints
* fix dp toggle: iptablesInBackground
* fix lock typo and config logging
* fix background thread. add comments. only add tmp ref when enabled
* copy pod selector list
* fix: removepolicy needs namespace too
* rename opInfo to event
* fix: fix references and prevent concurrent map read/write
* tmp: debug logging
* fix: missing set references by swap keys and values
* Revert "tmp: debug logging"
This reverts commit 70ed34c714ea4a6d009a1fe90a7168be4bedd5bf.
* fix: add podSelectorList to fake NetPol
* log: do not print error when failing to delete non-existent nft rule
* log: verbose iptables bootup
* log: use fmt.Errorf for clean logging
* log: never return error for iptables in background and fix some lints
* fix: activate/deactivate azure chain rules
* fix: correctly decrement netpols in kernel
* ci: run UTs again
* ci: update profiles. default to placefirst=false
* address comment: rename batch to pendingPolicy
* refactor: make dirty cache OS-specific
* test: UTs
* test: put UT cfg back to placefirst to not break things
* ci: update cyclonus workflows
* fmt: address comment & lint
* fmt: rename numInKernel to policiesInKernel
* log: switch to fmt.Errorf
* fmt: whitespace
* feat: resiliency to errors while reconciling dirty netpols
* log: temporarily print everything for ipset restore
* fix: remove nomatch from ipset -D for cidr blocks
* test: UTs for non-happy path
* test: fix hns fake
* fix: don't change windows. let it delete ipsets when removing policies
* fix windows lint
* fix: ignore chain doesn't exist errors for iptables -D
* feat: latency and failure metrics
* test: update exit code for UT
* metrics: new metrics should go in node-metrics path
* style: simplify nesting
* style: move identical windows & linux code to shared file
* ci: remove v1 conformance and cyclonus
* feat: add NetPols in background from the DP (revert background code in pMgr)
* style: remove "background" from iptables metrics
* revert changes in ipsetmanager, const.go, and dp.Remove/UpdatePolicy
* style: whitespace
* perf: use len() instead of creating slice from map
* remove verbosity for iptables bootup
* build: add return statement
* style: whitespace
* build: fix variable shadowing
* build: fix more import shadowing
* build: windows pointer issue and UT issue
* test: fix UT for iptables error code 2
* ci: enable linux scale test
* ci: revert to master pipeline.yaml
* revert changes to chain-management. do changes in PR #2012
* log: change wording
* test: UTs for netpol in background
* log: wording
* feat: apply ipsets for each netpol individually
* config: rearrange ConfigMap & update capz yaml
* fix: windows bootup phase logic for addpolicy
* feat: restrict netpol in background to linux + nftables
* test: skip nftables check for UT
* style: netpols[0] instead of loop
* log: address log comments
* style: lint for long line
---------
Co-authored-by: Vamsi Kalapala <vakr@microsoft.com>
* feat: remove metric for max members in ipset and add metric for total pod IPs in cluster
* feat: update total pod IP metric in pod controller
* log: make sure we log apply dp errors
* test: UTs for pod count metric
* style: rename metric to customer_pods
* test: revert changes from previous PR to ipsetMgr windows UTs
* style: rename metric to pods_watched
* test: try longer wait
* debug: tmp log
* Revert "debug: tmp log"
This reverts commit 71529a8643.
* style: fix whitespace
* print statements
* cleanup Running pod with empty IP
* add log line
* revert previous 3 commits
* enqueue updates with empty IPs and add prometheus metric
* fix lints
* handle pod assigned to wrong endpoint edge case
* log and update comment
* UTs and fixed named port + build
* reset entire endpoint regardless of cache
* remove comment in dp.go
* fix windows build issues
* skip refreshing endpoints and address comments
* only sync empty ip if pod running. add tmp log
* undo special pod delete logic
* reference GH issue
* fix Windows UTs
* remove prometheus metrics and a log
---------
Co-authored-by: Vamsi Kalapala <vakr@microsoft.com>
* adaptively modify linux max restore try count to prevent perpetual errors
* remove debug print
* log restore file and send ipsetmanager_linux errors
* send other appropriate errors
* fix handleLineError function
* fix printing restore lines and enhance a log
* fix lints and wrap chainLineNumber errors
* fix off-by-one error for logging the try count
* revert exponential increase to try limit
* update try count to 5 and update UTs
* do not log lines for every restore call until perf is understood
* add dataplane health metrics
* change counters to countervecs
* wip
* uncomment metrics.ReinitializeAll()
* add comment about ReinitializeAll
* restructure prometheus-metrics.go, address comments, and finish UTs for v1
* properly record exec times and include error labels
* add error label to add_policy_exec_time
* add v2 UTs, test NoOp, and address comment
* resolve lints
* fix: for prometheus ResetIPSetEntries & feat: reset ipsets in NPM v2
* add note about difference in prometheus metrics in v1 vs v2, and strengthen a UT
* add comment to delegate prometheus metrics from generic ipsetmanager to OS-specific ones
* fix UT for dataplane_test.go and fix lint
* rename variables based on suggestions
* switch to unnamed return values (will throw a go lint error)
* fixed bug in NumIPSetsIsPositive()
* moved code for getting metric values to a new file
* renamed file
* unit tests for prometheus metrics
* fix go lints
* use fexec for TestDestroyNpmIpsets()
* removed test/ and testutil/ from code coverage
* remove promutil from coverage
* removed tools/ from code coverage
* removed crd/ from code coverage and updated multitenantnetworkcontainer's manifest
* switch to !ignore_NAME syntax for test and cli tags
* add coverage back to crd (besides autogenerated files)
* rename ignore_test and ignore_cli tags to ignore_uncovered
* make cns/fakes/ uncovered
* mark go files in crd api folders as uncovered again
* add main.go back for nnsmock server
* made prometheus exec time metrics for ipsets and iptables in line with those for network policies (exec time recorded even for failures). Also made prometheus timer variable names clearer.
* fixed faulty prometheus handler test looking for a node metric name when testing the cluster metric handler
* add clarity in comments related to the IPSetInventory metric
* Include prometheus metrics for lists and in DestroyNPMIpsets(). Only make metric updates when there's no error
* refactor prometheus testing and include metric tests for lists and NPMDestroyIpsets()
* better check for empty response to ipset list in DestroyNpmIpsets()
* remove unused clientset from controllers
* replace function for setting ipset inventory with function for removing ipset for better readability. updating comments too
* reset ipset inventory before each unit test
* added unit test for adding to set with pod cache
* remove unused cluster state function and clientset from np manager
* fix build problems: remove clientset from calls to npm.NewNetworkPolicyManager()
* fix logic for destroy ipsets for situation when destroy is called while num ipsets is 0
* delete commented out function
* encapsulated prometheus metrics, refactored prometheus testing for iptm and netpol controller, and removed clientset from controller creation in test files (fixing build error)
* update test for DestroyNpmIpsets() to always use a new Exec
* first version of network policy controller and its unit tests
* update reconcile and deleteNetworkPolicy function to correctly install and uninstall default Azure NPM chain.
* explicitly manage default Azure NPM chain in deleteNetworkPolicy function
* correct comments and delete unused variable
* fix missed error returns in code
* Correctly check DeletionTimestamp and DeletionGracePeriodSeconds variables
* removed placeholder functions in network policy controller and added more test cases (e.g., update and adding multiple network policies)
* applied comments (use explicit names, locate lock in a better place)
* add two methods to save and restore iptables in unit test
* comment out unused function
* return early in updateNetworkPolicy function if the network policies are the same. Update unit tests to test more network policy events
* start using klog package instead of log package
* remove unneeded defer for lock
* Add and delete the network policy object from our network policy cache in the right place. Correct prometheus metric code.
* use cached network policy key instead of network policy object as method parameter in cleanUpNetworkPolicy
* remove redundant check
* Remove ns- prefix as key in RawNpMap. Update UT to check prometheus metrics. Applied better naming and removed redundant code.
* minor update for variable names
* remove dependency between UTs by re-initializing metrics. Correct message.
* Accelerate metrics report from every 30 mins to every 5 mins.
* Add errCountTest metric.
* Refactor SendAiMetrics. AI initialization is in main routine while send metrics is in another go routine.
* Add aiMetadata config.
* Add SendErrorMetrics function in ai utils.
* Going to push error log to AI telemetry.
* Add error log to AI telemetry.
* Change error message format.
* Add error log and metrics to AI telemetry.
* Remove unnecessary const.
* Change heartbeat back to every 30 mins.
* Separate send log from SendErrorMetric function for better reuse.
* Change a unit test set name to avoid kernel conflict.
* Address comments. Make error log and metrics sending more generic.
* Fix typo.
* Fix indentation.
* Fix AI initialize issue.
* Remove unnecessary log.
* Use break in if condition.
* made ipset inventory metric more efficient for container insights scraping. Added metric for total ipset entries
* updated comment for GetVecValue
* changed prometheus metrics port number from 8000 to 10091 to be next to the node port used in CNS
* added cluster service for NPM Prometheus metrics (lets a scraper only scrape this service for node redundant metrics)
* separated node and cluster metrics into separate registries and HTTP endpoints
* separated functionality for getting IPSetInventory labels and made public
* updated initialization of IPSetInventory to have hash set label and changed ipsm tests to mirror this
* added two yaml options for configuring a prometheus server to scrape NPM efficiently. Removed generic prometheus annotations on NPM pod to prevent default scraping of NPM for a helm prometheus server, and added a specific annotation for the alternative prometheus server config
Co-authored-by: Hunter Gregory <t-hugreg@microsoft.com>
* prometheus additions to testmain (commented out right now)
* home of the npm prometheus metrics and tools for updating them, testing them
* add/remove policy metrics
* add/remove iptables rule metric measurements
* add/remove ipset metric measurements
* testing for gauges. want to soon remove the boolean for including prometheus in unit testing
* run http server that exposes prometheus from main
* cleaner test additions with less code
* removed incorrect instance of AddSet in the TestDeleteSet test
* added prometheus annotations to pod templates
* deleted unused file
* much more organized initialization of metrics now. now includes map from metric to metric name
* add ability to get summary count value. now getting gauge values and this new count value are done by passing the metric itself as a param instead of a string
* condenses prometheus testing code base by condensing all prometheus error messages into a function
* added testing for summary counts, condensed prometheus error handling code, and updated calls to use new form for getting metric values
* update based on variable spelling change in metrics package
* Added comments for functions and moved http handler code to the http file
* fixed problem of registering same metric name for different metrics, and passing in the wrong param type for testing
* made prometheus testing folder with interactive testing file. moved old random metric flux testing function over from ipsm_test
* moved testing around again
* fixed spelling mistake
* counting mistake in unit test
* handler variable was in wrong file. Changed stdout printing to logging
* fixed parameter errors and counting error in a test
* moved utilities for testing prometheus metrics to npm/util. Updated StartHTTP to have an additional parameter for waiting after starting the server
* updated uses of StartHTTP to have the extra parameter
* updated GetValue and GetCountValue uses to use the prometheus features of the util package, which is now moved to a promutil package within npm/metrics/
* removed unnecessary comments, removed print statement, and added quantiles to all summary metrics
* fixed problem of double registering metrics
* wait longer for http server to start
* moved tool in test-util.go to promutil/util.go
* fixed timer to be in milliseconds and updated metric descriptions to mention units
* removed unnecessary comments
* http server always started in a go routine now. Added comment justifying the use of an http server
* debugging http connection refused in pipeline
* fixed syntax error
* removed debugging wrapper around http service
* sleep so that the testing metrics endpoint can be pinged
* redesigned GetValue and GetCountValue so that they don't use http calls
* removed random but helpful testing file - will write about quick testing in a wiki page
* milliseconds were being truncated. now they have decimals
* use direct Prometheus metric commands instead of wrapping them
* removed code used when testing was done through http server. Moved registering to metric creation functions
* added createGaugeVec, updated comments, made all help strings constants
* added metric that counts number of entries in each ipset. still need to add tests
* fixed creation of GaugeVecs, and use explicit labeling instead of order-based labeling now
* updated GetVecValue method signature
* added set to metrics on creation and wrote unit tests for CreateSet, AddToSet, DeleteFromSet, DeleteSet
* use custom registry to limit content that Container Insights scrapes. Also log the start of http server
* wrote TODO item comments for Restore and Destroy (currently these functions are only used in testing)
* NPM won't crash if a Prometheus metric fails to register now (unlikely). Added logging for metric registration/creation, and explicit public function to initialize metrics so that we can finish log config first
* initialize metrics in unit tests
* renamed util.go to test-util.go
Co-authored-by: Hunter Gregory <t-hugreg@microsoft.com>