azure-container-networking

Commit graph

Author	SHA1	Message	Date
Hunter Gregory	ebddca18bd	perf: [NPM] [LINUX] add NetPols in background (#1969 ) * wip: apply dirty NetPols every 500ms in Linux * only build npm linux image * fix: check for empty cache * feat: toggle for netpol interval. default 500 ms * ci: remove stages "build binaries" and "run windows tests" * wip: max batched netpols (toggle-specified) * ci: remove manifest build/push for win npm * wip: handle ipset deletion properly and max batch for delete too * fix: correct remove policy * fix: only remove policy if it was in kernel * finalize toggles, allowing ability to turn off iptablesInBackground * ci: conf + cyc use PR's configmaps * fix: lints * fix dp toggle: iptablesInBackground * fix lock typo and config logging * fix background thread. add comments. only add tmp ref when enabled * copy pod selector list * fix: removepolicy needs namespace too * rename opInfo to event * fix: fix references and prevent concurrent map read/write * tmp: debug logging * fix: missing set references by swap keys and values * Revert "tmp: debug logging" This reverts commit 70ed34c714ea4a6d009a1fe90a7168be4bedd5bf. * fix: add podSelectorList to fake NetPol * log: do not print error when failing to delete non-existent nft rule * log: verbose iptables bootup * log: use fmt.Errorf for clean logging * log: never return error for iptables in background and fix some lints * fix: activate/deactivate azure chain rules * fix: correctly decrement netpols in kernel * ci: run UTs again * ci: update profiles. 
default to placefirst=false * address comment: rename batch to pendingPolicy * refactor: make dirty cache OS-specific * test: UTs * test: put UT cfg back to placefirst to not break things * ci: update cyclonus workflows * fmt: address comment & lint * fmt: rename numInKernel to policiesInKernel * log: switch to fmt.Errorf * fmt: whitespace * feat: resiliency to errors while reconciling dirty netpols * log: temporarily print everything for ipset restore * fix: remove nomatch from ipset -D for cidr blocks * test: UTs for non-happy path * test: fix hns fake * fix: don't change windows. let it delete ipsets when removing policies * fix windows lint * fix: ignore chain doesn't exist errors for iptables -D * feat: latency and failure metrics * test: update exit code for UT * metrics: new metrics should go in node-metrics path * style: simplify nesting * style: move identical windows & linux code to shared file * ci: remove v1 conformance and cyclonus * feat: add NetPols in background from the DP (revert background code in pMgr) * style: remove "background" from iptables metrics * revert changes in ipsetmanager, const.go, and dp.Remove/UpdatePolicy * style: whitespace * perf: use len() instead of creating slice from map * remove verbosity for iptables bootup * build: add return statement * style: whitespace * build: fix variable shadowing * build: fix more import shadowing * build: windows pointer issue and UT issue * test: fix UT for iptables error code 2 * ci: enable linux scale test * ci: revert to master pipeline.yaml * revert changes to chain-management. 
do changes in PR #2012 * log: change wording * test: UTs for netpol in background * log: wording * feat: apply ipsets for each netpol individually * config: rearrange ConfigMap & update capz yaml * fix: windows bootup phase logic for addpolicy * feat: restrict netpol in background to linux + nftables * test: skip nftables check for UT * style: netpols[0] instead of loop * log: address log comments * style: lint for long line --------- Co-authored-by: Vamsi Kalapala <vakr@microsoft.com>	2023-07-19 09:13:52 -07:00
Hunter Gregory	96243c325e	fix: [WIN-NPM] fix units of new latency metrics (#2018 ) * fix: new latency metrics use seconds instead of milliseconds * chore: lint * fix: lower buckets to start at 8 milliseconds	2023-06-16 23:22:27 +00:00
Hunter Gregory	e6eeb5014a	feat: [NPM] metric for total Pod IPs (#1999 ) * feat: remove metric for max members in ipset and add metric for total pod IPs in cluster * feat: update total pod IP metric in pod controller * log: make sure we log apply dp errors * test: UTs for pod count metric * style: rename metric to customer_pods * test: revert changes from previous PR to ipsetMgr windows UTs * style: rename metric to pods_watched * test: try longer wait * debug: tmp log * Revert "debug: tmp log" This reverts commit `71529a8643`. * style: fix whitespace	2023-06-09 09:59:27 -07:00
Hunter Gregory	04f92857f2	feat: [WIN-NPM] metrics for latencies and failures (#1959 ) * implement metrics * add npm prefix * rename windows files * metrics pkg UTs * allow reinitializing prometheus metrics * fix: hns wrapper should not throw error for empty SetPolicy values * test: metric UTs in dataplane * fix: record list endpoint latency always * remove flaky UT * feat: metric for max ipset members * fix lint * fix lint 2 * fix build * fix lint 3 * simplify conditionals and protect against maxMembers becoming negative * remove bottom 4 histogram buckets. start at 16 ms * reset metrics for ipset UTs * style: don't check for windows dp in _windows.go files build: remove unused import * test: reset windows metrics in UT	2023-06-05 12:43:39 -07:00
Hunter Gregory	09cd371fb4	fix: [NPM] cleanup restarted pod stuck with no IP (#1503 ) * print statements * cleanup Running pod with empty IP * add log line * revert previous 3 commits * enqueue updates with empty IPs and add prometheus metric * fix lints * handle pod assigned to wrong endpoint edge case * log and update comment * UTs and fixed named port + build * reset entire endpoint regardless of cache * remove comment in dp.go * fix windows build issues * skip refreshing endpoints and address comments * only sync empty ip if pod running. add tmp log * undo special pod delete logic * reference GH issue * fix Windows UTs * remove prometheus metrics and a log --------- Co-authored-by: Vamsi Kalapala <vakr@microsoft.com>	2023-02-15 13:38:47 -08:00
Hunter Gregory	8cc8e7f1ff	fix: [NPM-LINUX] resiliency for several non-retriable errors (#1566 ) * adaptively modify linux max restore try count to prevent perpetual errors * remove debug print * log restore file and send ipsetmanager_linux errors * send other appropriate errors * fix handleLineError function * fix printing restore lines and enhance a log * fix lints and wrap chainLineNumber errors * fix one off error for logging the try count * revert exponential increase to try limit * update try count to 5 and update UTs * do not log lines for every restore call until perf is understood	2022-11-23 10:38:21 -08:00
Vamsi Kalapala	311eba6c3e	test: [NPM] Removing fail on AITelemetry error (#1288 ) * Removing extra log lines and adding an option to print in sendLog * removing fail on AI initialization error. * fixing lint	2022-03-17 15:37:02 -07:00
Hunter Gregory	26a4b6571e	feat: [NPM] include NPM v1/v2 in telemetry and fix heartbeat log (#1266 ) * include NPM v1/v2 in telemetry * fix heartbeat	2022-03-08 10:39:16 -08:00
Hunter Gregory	1ea2f5a745	feat: [NPM] num ACL rules for v2 & update existing metrics (#1223 ) * wip * fix windows build err * address comments * fix lingering merge conflicts	2022-02-09 20:38:46 -08:00
Hunter Gregory	c820189a2f	feat: [NPM] send more AI logs (#1230 ) * send heartbeat log and send logs in v2 * address comment and add logs for ip validation and policy manager bootup	2022-02-09 15:28:05 -08:00
Hunter Gregory	d5f134d597	feat: [NPM] perf metrics for pod/ns/policy CRUD (#1220 ) * add dataplane health metrics * change counters to countervecs * wip * uncomment metrics.ReinitializeAll() * add comment about ReinitializeAll * restructure prometheus-metrics.go, address comments, and finish UTs for v1 * properly record exec times and include error labels * add error label to add_policy_exec_time * add v2 UTs, test NoOp, and address comment * resolve lints	2022-02-09 15:27:35 -08:00
Hunter Gregory	667743d79f	fix: [NPM] fix incorrect ipset create, fix 1-off prometheus logic, and add cache checking for UTs (#1202 ) * wip * dont touch v1 metrics code and fix lints * add comment and comment code to resolve lint * optimize looping and dirty cache updates * address comments and change param type of modifyCacheForKernelMemberUpdate to reduce map lookups * add exec time metrics * UTs * fix lints * initialize metrics in policymanager tests * fix bug in publishing npm logs	2022-02-01 14:47:20 -08:00
Hunter Gregory	da4bd0d43b	feat: [NPM] Reset ipsets & update a Prometheus function ResetIPSetEntries (previously just used for UTs) (#1108 ) * fix: for prometheus ResetIPSetEntries & feat: reset ipsets in NPM v2 * add note about difference in prometheus metrics in v1 vs v2, and strengthen a UT * add comment to delegate prometheus metrics from generic ipsetmanager to OS-specific ones * fix UT for dataplane_test.go and fix lint * rename variables based on suggestions * switch to unnamed return values (will throw a go lint error)	2021-11-18 10:00:54 -08:00
Hunter Gregory	d00aa2e9b1	NPM Prometheus Unit Tests (#1016 ) * fixed bug in NumIPSetsIsPositive() * moved code for getting metric values to a new file * renamed file * unit tests for prometheus metrics * fix go lints * use fexec for TestDestroyNpmIpsets()	2021-09-21 10:02:58 -07:00
Hunter Gregory	fe23878507	Remove test coverage (#1007 ) * removed test/ and testutil/ from code coverage * remove promutil from coverage * removed tools/ from code coverage * removed crd/ from code coverage and updated multitenantnetworkcontainer's manifest * switch to !ignore_NAME syntax for test and cli tags * add coverage back to crd (besides autogenerated files) * rename ignore_test and ignore_cli tags to ignore_uncovered * make cns/fakes/ uncovered * mark go files in crd api folders as uncovered again * add main.go back for nnsmock server	2021-09-17 15:29:40 -07:00
Hunter Gregory	0dd10e4e89	NPM Prometheus Update (#986 ) * made prometheus exec time metrics for ipsets and iptables in line with those for network policies (exec time recorded even for failures). Also made prometheus timer variable names clearer. * fixed faulty prometheus handler test looking for a node metric name when testing the cluster metric handler * add clarity in comments related to the IPSetInventory metric * Include prometheus metrics for lists and in DestroyNPMIpsets(). Only make metric updates when there's no error * refactor prometheus testing and include metric tests for lists and NPMDestroyIpsets() * better check for empty response to ipset list in DestroyNpmIpsets() * remove unused clientset from controllers * replace function for setting ipset inventory with function for removing ipset for better readability. updating comments too * reset ipset inventory before each unit test * added unit test for adding to set with pod cache * remove unused cluster state function and clientset from np manager * fix build problems: remove clientset from calls to npm.NewNetworkPolicyManager() * fix logic for destroy ipsets for situation when destroy is called while num ipsets is 0 * delete commented out function * encapsulated prometheus metrics, refactored prometheus testing for iptm and netpol controller, and removed clientset from controller creation in test files (fixing build error) * update test for DestroyNpmIpsets() to always use a new Exec	2021-09-10 15:53:58 -07:00
Evan Baker	96bec09d41	chore: appease the linter (3/?), the big gofumpt (#987 ) * gofumpt -w -s . * small addtl cleanups after gofumpt * rerun after rebase	2021-09-02 16:33:18 -05:00
Evan Baker	1087201b28	chore: appease the linter, pt 2 of ? (#925 )	2021-09-01 18:28:17 -05:00
JungukCho	d8169318f1	[NPM] support network policy controller and its unit tests (#849 ) * first version of network policy controller and its unit tests * update reconcile and deleteNetworkPolicy function to correctly install and uninstall default Azure NPM chain. * To explicitly manage default Azure NPM chain in deleteNetworkPolicy function * correct comments and delete unused variable * fix missed returing errors in codes * Correct to check DeletionTimestamp and DeletionGracePeriodSeconds variables * removed placeholder functions in network policy controoler and added more test cases (e.g., update and adding multiple network policies) * - applied comments (use explict names, locating lock in a better place) * add two methods to save and restore iptables in unit test * comment out unused function * early filter in updateNetworkPolicy function if they are the same network policies. Update unit tests to test more network policies events * - start using klog package instead of log package * remove unneeded defer for lock * Locate of adding and deleting network policy object from our network policy cache in a right place. Correct prometheus metric code. * use cached network policy key instead of network policy object as method parameter in cleanUpNetworkPolicy * remove redundant check * Remove ns- prefix as key in RawNpMap. Update UT to check prometheus metrics. Applied better naming and removed redundancy codes. * minor update for varialbe names * remove dependency between UT by re-initializing metrics. Correct message.	2021-04-14 10:35:36 -07:00
Mathew Merrick	d169929048	Npm debug tools (#817 ) * add inital debug tools * export member variables for debug api * add dependencies * update metrics and tests * remove refactor artifacts	2021-03-11 11:47:34 -08:00
shchen	0835cae2d1	Change AI log and metrics sending function name in NPM. (#737 )	2020-11-23 23:14:31 -08:00
shchen	1330e4aa3b	Add error log and metrics to AI telemetry. (#656 ) * Accelerate metrics report from every 30 mins to every 5 mins. * Add errCountTest metric. * Refactor SendAiMetrics. AI initialization is in main routine while send metrics is in another go routine. * Add aiMetadata config. * Add SendErrorMetrics function in ai utils. * Going to push error log to AI telemetry. * Add error log to AI telemetry. * Change error message format. * Add error log and metrics to AI telemetry. * Remove unnecessary const. * Change heartbeat back to every 30 mins. * Seperate send log from SendErrorMetric function for better reuse. * Change a unit test set name to avoid kernel conflict. * Address comments. Make error log and metrics sending more generic. * Fix typo. * Fix indentation. * Fix AI initialize issue. * Remove unnecessary log. * Use break in if condition.	2020-09-04 10:57:37 -07:00
Hunter Gregory	74c0521de4	Efficient prometheus (#629 ) * made ipset inventory metric more efficient for container insights scraping. Added metric for total ipset entries * updated comment for GetVecValue * changed prometheus metrics port number from 8000 to 10091 to be next to the node port used in CNS * added cluster service for NPM Prometheus metrics (lets a scraper only scrape this service for node redundant metrics) * separated node and cluster metrics into separate registries and HTTP endpoints * separated functionality for getting IPSetInventory labels and made public * updated initialization of IPSetInventory to have hash set label and changed ipsm tests to mirror this * added two yaml options for configuring a prometheus server to scrape NPM efficiently. Removed generic prometheus annotations on NPM pod to prevent default scraping of NPM for a helm prometheus server, and added a specific annotation for the alternative prometheus server config Co-authored-by: Hunter Gregory <t-hugreg@microsoft.com>	2020-08-05 11:21:29 -04:00
Hunter Gregory	88ea3c2acd	Prometheus metrics (#590 ) * prometheus additions to testmain (commented out right now) * home of the npm prometheus metrics and tools for updating them, testing them * add/remove policy metrics * add/remove iptables rule metric measurements * add/remove ipset metric measurements * testing for gauges. want to soon remove the boolean for including prometheus in unit testing * run http server that exposes prometheus from main * cleaner test additions with less code * removed incorrect instance of AddSet in the TestDeleteSet test * added prometheus annotations to pod templates * deleted unused file * much more organized initialization of metrics now. now includes map from metric to metric name * add ability to get summary count value. now getting gauge values and this new count value are done by passing the metric itself as a param instead of a string * condenses prometheus testing code base by condensing all prometheus error messages into a function * added testing for summary counts, condensed prometheus error handling code, and updated calls to use new form for getting metric values * update based on variable spelling change in metrics package * Added comments for functions and moved http handler code to the http file * fixed problem of registering same metric name for different metrics, and passing in the wrong param type for testing * made prometheus testing folder with interactive testing file. moved old random metric flux testing function over from ipsm_test * moved testing around again * fixed spelling mistake * counting mistake in unit test * handler variable ws in wrong file. Changed stdout printing to logging * fixed parameter errors and counting error in a test * moved utilities for testing prometheus metrics to npm/util. 
Updated StartHTTP to have an additional parameter for waiting after starting the server * updated uses of StartHTTP to have the extra parameter * updated GetValue and GetCountValue uses to use the prometheus features of the util package, which is now moved to a promutil package within npm/metrics/ * removed unnecessary comments, removed print statement, and added quantiles to all summary metrics * fixed problem of double registering metrics * wait longer for http server to start * moved tool in test-util.go to promutil/util.go * fixed timer to be in milliseconds and updated metric descriptions to mention units * removed unnecessary comments * http server always started in a go routine now. Added comment justifying the use of an http server * debugging http connection refused in pipeline * fixed syntax error * removed debugging wrapper around http service * sleep so that the testing metrics endpoint can be pinged * redesigned GetValue and GetCountValue so that they don't use http calls * removed random but helpful testing file - will write about quick testing in a wiki page * milliseconds were being truncated. now they have decimals * use direct Prometheus metric commands instead of wrapping them * removed code used when testing was done through http server. Moved registering to metric creation functions * added createGaugeVec, updated comments, made all help strings constants * added metric that counts number of entries in each ipset. still need to add tests * fixed creation of GaugeVecs, and use explicit labeling instead of order-based labeling now * updated GetVecValue method signature * added set to metrics on creation and wrote unit tests for CreateSet, AddToSet, DeleteFromSet, DeleteSet * use custom registry to limit content that Container Insights scrapes. 
Also log the start of http server * wrote TODO item comments for Restore and Destroy (currently these functions are only used in testing) * NPM won't crash if a Prometheus metric fails to register now (unlikely). Added logging for metric registration/creation, and explicit public function to initialize metrics so that we can finish log config first * initialize metrics in unit tests * renamed util.go to test-util.go Co-authored-by: Hunter Gregory <t-hugreg@microsoft.com>	2020-07-14 19:41:02 -04:00

24 commits