* change total IPs and add secondary IP metric
Updates the Total IPs metrics to include the NC
Primary IP in the total. Adds a Secondary IPs
metric which holds the value that the Total IPs
previously held: NC Secondary IPs known to CNS
which could be used by Pods.
Signed-off-by: GitHub <noreply@github.com>
* update help wording on IPAM metrics
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
* reword PrimaryIP metric help
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
---------
Signed-off-by: GitHub <noreply@github.com>
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
* modifying reconcile function to take multiple ncs instead of one. we need this because for reconciliation in dual stack cases both IPs for a pod which can come from distinct ncs must be added at the same time
* adding comments, and renaming functions and types, to make the intent clearer
* adding some dummy values to cns.NewPodInfo invocations in tests, instead of empty strings since we need distinct interface ids
* adding a basic test for dual stack reconciliation
* adding validation for multiple ips for the same ip family on the same pod, which is not allowed
* changing direct use of interface id to pod key, which is better for reconcile flows using information from kubernetes instead of cni
* fixing comments to use host network terminology instead of system pod
* improving error message on duplicate ip from cni found; improving readability of error handling on ip parsing
* only checking for pod ips that are actually set on a pod
add EnabledAZR flag to make home az goroutine as an opt-in feature
change PopulateHomeAzCacheRetryIntervalSecs default to 60
Co-authored-by: rjdenney <105380463+rjdenney@users.noreply.github.com>
* fix: reconcile initial state from CRD regardless of existing podInfo
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
* fix: add test for no state and pending release in NNC
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
* fix: add test for restoring state and pending release in NNC
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
---------
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
* InterfaceID should be set to containerID instead of interfaceName if CNS manage endpoint state is enabled
* add log in GetExistingIPConfig function to debug
* Fix UT (set containerid as pod interface id)
lint fix
* add retry to nnc update during scaledown
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
* test for panic in pool monitor
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
---------
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
1. Enables CNS to handle multiple NCs in NNC.
2. Adds new APIs that allows multiple IPs to be requested and released.
3. This change is needed for dualstack overlay
---------
Co-authored-by: Tim Raymond <traymond@microsoft.com>
* initial delete NC changes to include body
* initial delete NC changes to include body
* lint error
* initial delete NC changes to include body
* initial delete NC changes to include body
* lint error
* test change
* test change
* add comment
* adding wireserver proxy
* using proxy for publish
* injecting dependency. refactoring unpublish nc
* default wireserver ip config
* setting content type
* better printing of requests and responses
* fixing unit tests
In scenarios where the subnet token does not match (leading to a 401
from NMAgent), CNS returns a 200 for the PublishStatusCode. This is
incorrect, and a 401 should be returned instead. This leads clients of
CNS to take incorrect action on, what they believe to be, a successful
response.
* tmp commit for onbaording nma v2
* Remove test output file
* Remove unnecessary code when CNS onboard get nc version list without
token
* tmp commit to fix getnc version tests when onboarding nc version api v2
from nmagent.
* Fix the unit test for nmagent v2 api change.
* Fix unit test TestGetNetworkContainerVersionStatus
* Revert back to GetNCVersionF test.
* Roll back to get nc version api v1 for test.
* Continue revert back and store nc version url
* Onboard nmagent get nc version api v2.
* Address pr feedback of returning early and remove comment out code.
* Remove unnecessary ncVersionURLs and NCVersionRequest.
* Remove unnecessary variables.
* Update nmagent get nc version api v2 to v2 url
* Remove comment out code.
* tmp commit for onbaording nma v2
* Remove test output file
* Remove unnecessary code when CNS onboard get nc version list without
token
* tmp commit to fix getnc version tests when onboarding nc version api v2
from nmagent.
* Fix the unit test for nmagent v2 api change.
* Fix unit test TestGetNetworkContainerVersionStatus
* Revert back to GetNCVersionF test.
* Roll back to get nc version api v1 for test.
* Continue revert back and store nc version url
* Onboard nmagent get nc version api v2.
* Address pr feedback of returning early and remove comment out code.
* Remove unnecessary ncVersionURLs and NCVersionRequest.
* Remove unnecessary variables.
* Update nmagent get nc version api v2 to v2 url
* Remove comment out code.
* Update nmagent get nc version list.
* Address feedback and fix golint
* Fix lint issue.
* Fix the remaining 2 lint issues.
* Revert back test error generation to address feedback.
* Fix incorrect HTTP status from publish NC
CNS was responding with an HTTP status code of "0" from NMAgent.
Successes are supposed to be 200. The C-style var block at the beginning
of publishNetworkContainer was the reason for this. During refactoring,
the location where this status code was set to a successful value of 200
was accidentally removed. Because the var block declared the variable
and silently initialized it to 0, the compiler did not flag this bug as
it otherwise would have. The status code has been removed from this
block and explicitly defined and initialized to a correct value of 200.
Subsequent error handling will change this as necessary.
Also, despite consumers depending on this status, there were no tests to
verify that the status was set correctly. Tests have been added to
reflect this dependency.
* Ensure that NMAgent body is always set
DNC depends on the NMAgent body being set for its vestigial functions of
retrying failed requests. Since failed requests will now be retried
internally (to CNS) by the NMAgent client, this isn't really necessary
anymore. There are versions of DNC out there that depend on this body
though, so it needs to be present in order for NC publishing to actually
work.
* Fix missing NMAgent status for Unpublish
It was discovered that the Unpublish endpoints also omitted the status
codes and bodies expected by clients. This adds those and fixes the
associated tests to guarantee the expected behavior.
* Silence the linter
There were two instances where the linter was flagging dynamic errors,
but this is just in a test. It's perfectly fine to bend the rules there,
since we don't expect to re-use the errors (they really should be
t.Fatal / t.Error anyway, but due to legacy we're returning errors here
instead).
* Moving the lock from InitializeKeyValueStore() function to restore/save functions to improve cni performance on windows.
* fix: use defer function to unlock statefile.
* fix: fixing the IPAM lock and defer func
* fix: Optimizing cni file lock by moving SetSdnRemoteArpMacAddress() on startup for CRD and MultitenantCRD mode.
* adding store lock on telemetry service start to avoid race condition on windows.
There was insufficient coverage over cases involving different
permutations of the "WireserverIP" configuration option. Consequently,
there were instances where reasonable values for this option caused CNS
to fail to start.
This moves the logic for transforming the CNS configuration into
configuration suitable for the NMAgent client into a method off the
CNSConfig. It also permits adding coverage over different scenarios that
are likely to emerge.
* feat: add cni conflist generator for v4 overlay scenario
* feat: use atomic fs operations
* fix: use same directory as temp dir since /tmp is a tmpfs
* Fix inability to unmarshal nmagent request
During the previous switchover to using the client from the
`nmagent` client (as opposed to the `cns/nmagent` client), an assumption
was made that "proxied" requests to NMAgent were provided by clients as
nested JSON. This assumption was wrong--they are Base64-encoded strings
of JSON. Even though they're ultimately similar, they're very different
from the perspective of the JSON unmarshaler. Consequently, this
restores the nested request body back to a []byte and performs the
second-stage decoding manually (similarly to how it was previously
done).
Fixes#1694
* Fix swallowed error when body is not JSON
In one instance, an error was accidentally swallowed because the
existing code does not return errors (it sets variables instead). This
makes controlling the flow of execution difficult. To fix this, the
offending code has been moved to a separate function where returns can
be used effectively.
* Switch PublishNetworkContainer to nmagent package
This removes the PublishNetworkContainer method from the CNS client's
nmagent package in favor of using the one from the top-level nmagent
client package. This is to ensure that there's only one nmagent client
to maintain.
* Use Unpublish method from nmagent package
The existing unpublish endpoints within CNS used the older nmagent
client. This converts them to use the newer one.
* Use JoinNetwork from new nmagent client
The API in CNS was using the old nmagent client endpoints which we want
to phase out so that there is exactly one nmagent client across all
systems.
* Add SupportedAPIs method to nmagent and use it
The CNS client uses this SupportedAPIs method to proxy requests from DNC
and detect capabilities of the node.
* Add GetNCVersion to nmagent and use it in CNS
CNS previously used its own client method for accessing the GetNCVersion
endpoint of nmagent. In the interest of having a singular NMAgent
client, this adopts the behavior into the nmagent package and uses it
within CNS instead.
* Use test helpers for context and checking errs
Checking whether the error was present and should it be present was
annoying and repetitive, so there's a helper now to take care of that.
Similarly, contexts are necessary but it's nice to create them from the
test's deadline itself. There's a helper to do that now as well.
* Fix broken tests & improve the suite
There were some broken tests left over from implementing new NMAgent
interface methods. This makes those tests pass and also improves the
test suite by detangling mixed concerns of the utility functions within.
Many of them returned errors and made assertions, which made them
confusing to use. The utility functions now only do things you request,
and the tests themselves are the only ones making assertions (sometimes
through helpers that were added to make those assertions easier).
* Add GetNCVersionList endpoint to NMAgent client
This is the final endpoint that was being used in CNS without being
present in the "official" NMAgent client. This moves that
implementation, more-or-less, to the nmagent package and consumes it in
NMAgent through an interface.
* Remove incorrect error shadowing
There were a few places where errors were shadowed. Also removed the
C-ism of pre-declaring variables for the sake of pre-declaring
variables.
* Replace error type assertions with errors.As
In two instances, type assertions were used instead of errors.As. Apart
from being less obvious, there are apparently instances where this can
be incorrect with error wrapping.
Also, there was an instance where errors.As was mistakenly used in the
init clause of an if statement, instead of the predicate. This was
corrected to be a conjunctive predicate converting the error and then
making assertions using that error. This is safe because short-circuit
evaluation will prevent the second half of the predicate from being
evaluated if the error is not of that type.
* Use context for joinNetwork
The linter rightly pointed out that context could be trivially
propagated further than it was. This commit does exactly that.
* Add error wrapping to an otherwise-opaque error
The linter highlighted this error, showing that the error returned would
be confusing to a user of this function. This wraps the error indicating
that a join network request was issued at the point where the error was
produced, to aid in debugging.
* Remove error shadowing in the tests
This is really somewhat unnecessary because it's just a test. It
shouldn't impact correctness, since the errors were properly scoped to
their respective if statements. However, to prevent other people's
linters from complaining, this corrects the lint.
* Silence complaints about dynamic errors in tests
This is an unnecessary lint in a test, because it's useful to be able to
define errors on the fly when you need them in a test. However, the
linter demands it presently, so it must be silenced.
* Add missing return for fmt.Errorf
Surprisingly, a return was missing here. This was caught by the linter,
and was most certainly incorrect.
* Remove yet another shadowed err
There was nothing incorrect about this shadowed err variable, but
linters complain about it and fail CI.
* Finish wiring in the NMAgent Client
There were missing places where the nmagent client wasn't wired in
properly in main. This threads the client through completely and also
alters some tests to maintain compatibility.
* Add config for Wireserver to NMAgent Client
This was in reaction to a lint detecting a vestigial use of
"WireserverIP". This package variable is no longer in use, however, the
spirit of it still exists. The changes adapt the provided configuration
into an nmagent.Config for use with the NMAgent Client instead.
* Silence the linter
The linter complained about a shadowed err, which is fine since it's
scoped in the `if`. Also there was a duplicate import which resulted
from refactoring. This has been de-duped.
* Rename jnr -> req and nma -> nmagent
The "jnr" variable was confusing when "req" is far more revealing of
what it is. Also, the "nma" alias was left over from a prior iteration
when the legacy nmagent package and the new one co-existed.
* Rename NodeInquirer -> NodeInterrogator
"Interrogator" is more revealing about the set of behavior encapsulated
by the interface. Depending on that behavior allows a consumer to
interrogate nodes for various information (but not modify anything about
them).
* emit a metric for NC sync and an error if all NCs are not present
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
* rename variables in nc sync host version for clarity
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
* Added a metric to monitor the subnet exhaustion state within the Ipam Monitor Pool
* Fixed the PR comments
* Added a reconciler error metric
* Addressed code review comments
* Updating lint on code
* Addressed all code review comments and changed the reconciler metric to a counter metric and fixed linting issues
* Added a count metric for IPAM pool as well to count the number of switches between subnet exhaustion and reversal for each subnet
* Updated the makefile to be able to run linting with better garbage collection
* Updated the code with the PR review comments
* Updated the label values based on a discussion offline with Evan
Co-authored-by: asn <asn@microsoft.com>
This path was missed in a prior iteration, but is necessary for the
client to function properly. Without it, the route cannot be found and
thus the request cannot be created.
align pool to batch during scale up
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
* Add NumOfCPUCores method to CNS client
This is the last part of CNS's API surface that is used by DNC, but not
covered by a method from the CNS client.
* Remove unnecessary error wrapping lambda
The error handling here is simple enough that it doesn't really warrant
wrapping it with a lambda.
* Add NmAgentSupportedAPIs to the CNS Client
DNC uses this API to detect GRE key capabilities. Since it's not
supported by the client, it does this with direct HTTP requests
currently. In order to ensure that there's only one way of accessing
CNS, this implements the method so that the client can be used in DNC
instead.
* Add NMA supported APIs to client URL list
This was forgotten in a previous iteration.
* Handle non-200 status code by returning an error
The NMAgentSupportedAPIs method did not error when a non-2XX status code
was encountered. This leads to surprising behavior on the part of the
consumer.
* Use http.NoBody instead of nil
This was a linter suggestion, and a good one. http.NoBody is more
semantically meaningful than passing a nil.
* Create a FailedHTTPRequest error type
This is in response to a linter complaint that dynamic errors were being
used instead of a static error. Declaring an error type with var
wouldn't carry along necessary debugging information (namely the HTTP
status code), so it wasn't really appropriate. Rather than disable the
check entirely with a linter directive, this defines a reusable error
type so that HTTP errors can be communicated upward consistently.
* Use an alias for CNS client paths
These really should be a single set of paths, but there's this existing
block of constants to define a "contract with DNC." Whether or not
that's complete, the spirit of it is somewhat clear. However, it's not
great to be duplicating constants all over the place. This is somewhat
of a compromise by defining the newer constants in terms of the older
ones.
* Add missing context to outbound HTTP requests
This was a good suggestion from the linter.
move primary IP count to separate metric to not affect the ipam math
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
* rebase
* linting
* rebase
* missing if condition for releaseIPConfig
* update azure-cns.yaml and add UTs
* rebase
* update program iptables changes
* linting
* fix broken tests
* fix podinfoprovider returns error when key is not found
* log when no endpoint state exist when reconcilling
* not remove endpoint state file on failure to read in restserver.restoreState()
* addressed comments
* update acn tag
* go get on acn
* addressed comments
Co-authored-by: Evan Baker <rbtr@users.noreply.github.com>
* rebase
* rebase
* rebase
* adding snat iptables rules using coreos lib
* fix iptables cmd not running
* docs
* added conflist back
* change chain name from SWIFT to SWIFT-POSTROUTING
* update go.mod
* split internalapi into linux and windows
* add imports
* fix iptables programming login
* fix iptables programming logic
* change program iptables rules logic
accept an nnc with no nc as valid and consider monitor and reconciler started
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
These were forgotten during a previous effort to add endpoints to the
CNS client. Consequently, those endpoints don't actually work unless
these paths are present here.
* Add DeleteNetworkContainer method to CNS Client
DNC needs to delete network containers using CNS, and it currently does
so through raw HTTP requests. This instead makes this an official
operation within the CNS client so that it can be used there instead.
* Remove return value from DeleteNetworkContainer
The return value from DeleteNetworkContainer was basically unused in
DeleteNetworkContainer, since it only contains a CNSResponse type which
can only ever be a successful response. Given this, it's sufficient to
just return no error as a signifier of success.
* Add SetOrchestratorType endpoint to cns client
DNC uses SetOrchestratorType currently by making straight HTTP requests
to the CNS backend. Since we should have one client only, this moves
these endpoints into the CNS client so that it can be consumed in DNC.
* Fix two linter issues
In one instance the linter had a false positive on an error that doesn't
really need to be wrapped. The other was a good suggestion since it
helps readers of the test understand what is going on.
* Add CreateNetworkContainer endpoint to client
Turns out DNC uses a few more endpoints than it would seem. This adds a
CreateNetworkContainer method to encapsulate the
/network/createorupdatenetworkcontainer endpoint.
* Add a PublishNetworkContainer to CNS client
Publishing network containers via CNS is something that DNC does
directly through an HTTP client. Given how common this is, it makes
sense to adopt this into the CNS client so that DNC can use the client
instead.
* Fix placeholder error message
This was just used in testing and was forgotten.
* Add context to DeleteNetworkContainer
Everything involving the network should take a context parameter.
* Fix PublishNetworkContainer return type
The PublishNetworkContainer endpoint returns a wrapping type around the
cns.Response. This updates the tests and the endpoint to reflect that.
* Add UnpublishNC to CNS Client
Unpublishing NCs is accomplished by DNC currently by directly making
HTTP calls. This adds that functionality to the client so the client can
be used instead.
* Remove azure-vne-telemetry for windows multitenancy and telemetry service for windows multitenancy will be started from cns.
* start telemetry service from cns
* lint and log fix
* minor change
* addressed comment
* Service Account Mitigation for CNS on k8s Windows 2022
* pick up Neha's bug fix
* addressing comments
* add node selector back
Co-authored-by: Jaeryn <tsun.chu@microsoft.com>
* Fix CNS Reconciling and ASsignmentMode handling for AKS Swift
* Fixed the handling of default mode == Dynamic
* Incorporate feedback
* golint fixes
* Fix for whyNoLint: include an explanation for nolint directive (gocritic)
* add windows cns manifest to multi arch image
* try to use generic windows template w/ containerize stage in pipeline
* try and use buildah to pull images
* update manifest build and push for buildah
* create manifest by referencing images instead of pulling to avoid OS mismatch error
* remove unused windows-image.yaml
* remove REGISTRY var and use IMAGE_REGISTRY from makefile
Co-authored-by: Jaeryn <tsun.chu@microsoft.com>
Co-authored-by: Evan Baker <rbtr@users.noreply.github.com>
* combine metrics and healthz
* use root mux and bind pprof routes on debug config opt
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
Co-authored-by: Paul Miller <pmiller@microsoft.com>
* Added the Pod Net ARM ID as a Label.
Added Pod Net Arm Id as labels to the CNS Metrics.
* CNS Metrics: ARM ID as Label
Necessary changes to the read the ARM ID metadata coming from DNC.
* Subtle changes in NNC
Variable names had to be modified to meet lint standards.
* Subtle changes in NNC
Variable names had to be modified to meet lint standards.
* Made NNC arg a Pointer
NNC that was passed to create the ARM ID was too heavy. Converted that argument to a pointer and passed the reference of the NNC in that method.
* [CNS] Metrics Prefix
Added the "cx_" prefix to prometheus metrics of CNS.
* Minor Changes
Minor edits with the Metric Names and Documentation.
* ConstLabels in Metrics
Made changes to the ConstLabels in the CNS Metrics.
* ConstLabels in Metrics
Made changes to the ConstLabels in the CNS Metrics.
* wrap a histogram around syncHostNCVersion
* try and fix linter errros though dubious about one
* clean up
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
* kick pr?
* use success to be consistent with dnc
Co-authored-by: Evan Baker <rbtr@users.noreply.github.com>
* 12622609: Dimensions to metrics.
* Modified the constants declaration.
* Made a slight change to make the observeIPPoolState() method inline and remove a whitespace.
* Modified label values to stay consistent with other prometheus labels.
* remove early nnc send to pool monitor
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
* order service starts better and add logging in initialization
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
* wait for reconciler to start
Signed-off-by: Evan Baker <rbtr@users.noreply.github.com>
* update comments and docs
Signed-off-by: GitHub <noreply@github.com>
The io/ioutil package has been deprecated as of Go 1.16, see
https://golang.org/doc/go1.16#ioutil. This commit replaces the existing
io/ioutil functions with their new definitions in io and os packages.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* added lockedfileapi support for CNI
* fixed interface changes
* addressed comments
fixed ut
* addressed comments
* fixed copy to buffer part in writer api
* fixed copy to buffer part in writer api
* keeping old code not changing it.