Although the tests have been passing on the latest changes, there was a failure in testing last night.
When investigating, I found the cause of the problem. When you call cmd.Execute("systemd-run"), Go will (sometimes) replace the command name with its full path (in this case /usr/bin/systemd-run), so our check for systemd-run mode was not matching and execution went down the old code path of direct cgroup assignment.
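For illustration, a minimal standalone Go snippet (using `os/exec` directly rather than our own cmd wrapper) showing the path resolution that broke the string comparison:

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// exec.Command resolves a bare program name through $PATH (LookPath)
	// and stores the absolute path in cmd.Path.
	cmd := exec.Command("systemd-run", "--scope", "/bin/true")
	fmt.Println(cmd.Path) // typically "/usr/bin/systemd-run", not "systemd-run"

	// So a mode check comparing against the bare name silently fails:
	if cmd.Path == "systemd-run" {
		fmt.Println("systemd-run mode") // not reached once the path is resolved
	}
}
```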
Fixing by being explicit about it: after the process is launched, we return a boolean indicating whether resource governance is still required. This brings the code back to the shape it had in previous PR iterations, while avoiding the objections raised there about Linux-only concepts. When we converge the Windows code here, the implementation of applyResourceGovernance will use Job objects on Windows and the code flow will be the same.
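A rough sketch of the new shape (names here are illustrative, not the actual extension code):

```go
package main

import "os/exec"

// launch sketches the fix: instead of inferring the mode from the resolved
// command path, the launcher explicitly reports whether the caller still
// needs to apply resource governance after the process starts.
func launch(useSystemdRun bool, name string, args ...string) (cmd *exec.Cmd, governanceRequired bool, err error) {
	if useSystemdRun {
		scopeArgs := append([]string{"--scope", name}, args...)
		cmd = exec.Command("systemd-run", scopeArgs...)
		// systemd applies the resource limits itself; nothing more to do.
		return cmd, false, cmd.Start()
	}
	cmd = exec.Command(name, args...)
	// Direct launch: caller must apply governance afterwards (cgroup
	// assignment on Linux today; Job objects on Windows once converged).
	return cmd, true, cmd.Start()
}

func main() {
	if _, govern, err := launch(false, "/bin/true"); err == nil && govern {
		// applyResourceGovernance(...) would run here.
	}
}
```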
We found when testing on some distros that they had older versions of systemd installed.
Versions before 246 use `MemoryLimit`; 246 and later use `MemoryMax`, so we need to know which version is installed when constructing the command line.
Older versions also didn't support the `-E` flag for environment variables; the longer form `--setenv` is supported in both old and new versions, so we use that instead.
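A minimal sketch of the version detection and command-line construction (the property value and vmwatch path are illustrative; the extension's actual flag set may differ):

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// systemdVersion parses the leading number from `systemctl --version`,
// whose first line looks like "systemd 245 (245.4-4ubuntu3)".
func systemdVersion() (int, error) {
	out, err := exec.Command("systemctl", "--version").Output()
	if err != nil {
		return 0, err
	}
	fields := strings.Fields(string(out))
	if len(fields) < 2 {
		return 0, fmt.Errorf("unexpected systemctl output: %q", out)
	}
	return strconv.Atoi(fields[1])
}

func main() {
	ver, err := systemdVersion()
	if err != nil {
		panic(err)
	}
	// Versions before 246 take MemoryLimit=; 246 and later take MemoryMax=.
	memProp := "MemoryMax"
	if ver < 246 {
		memProp = "MemoryLimit"
	}
	args := []string{
		"--scope",
		"-p", fmt.Sprintf("%s=%d", memProp, 80*1024*1024), // limit is illustrative
		"--setenv", "EXAMPLE_VAR=value", // --setenv works on old and new; -E does not
		"/path/to/vmwatch",
	}
	fmt.Println("systemd-run", strings.Join(args, " "))
}
```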
* Removed noisy telemetry events and added more detail to the error log.
* - Created new CustomMetricsStatusType
- CustomMetrics will now be reported only when there is a change in the CustomMetric field.
- Added commitedCustomMetricsState variable to keep track of the last CustomMetric value (sketched below).
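A minimal sketch of the change-detection idea (types and names are assumptions, not the actual extension code):

```go
package main

import "fmt"

// statusReporter is a sketch; only the tracked-state variable name mirrors
// the description above.
type statusReporter struct {
	commitedCustomMetricsState string // last CustomMetric value we reported
	initialized                bool
}

// maybeReport emits CustomMetrics only when the value changed since the
// last committed report, suppressing the noisy unchanged events.
func (r *statusReporter) maybeReport(current string) {
	if r.initialized && current == r.commitedCustomMetricsState {
		return // no change: nothing to report
	}
	fmt.Println("reporting CustomMetrics:", current)
	r.commitedCustomMetricsState = current
	r.initialized = true
}

func main() {
	r := &statusReporter{}
	r.maybeReport(`{"metric":1}`) // reported
	r.maybeReport(`{"metric":1}`) // suppressed
	r.maybeReport(`{"metric":2}`) // reported
}
```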
* Adding internal/manifest package from Cross-Platform AppHealth Feature Branch
* Running go mod tidy and go mod vendor
* - Add manifest.xml to Extension folder
- Changed GitHub workflow Go version to Go 1.18
- Small refactor in setup function for bats tests.
* Update Go version to 1.18 in Dockerfile
* Add logging package with NopLogger implementation
* Add telemetry package for logging events
* - Add telemetry event Logging to main.go
* - Add new String() methods to vmWatchSignalFilters and vmWatchSettings structs
- Add telemetry event Logging to handlersettings.go
* - Add telemetry event Logging to reportstatus.go
* Add telemetry event Logging to health.go
* Refactor install handler in main/cmds.go to use telemetry event logging
* Refactor uninstall handler in main/cmds.go to use telemetry event logging
* Refactor enable handler function in main/cmds.go to use telemetry event logging
* Refactor vmWatch.go to use telemetry event logging
* Fix requestPath in extension-settings.json and updated 2 integration tests, one in 2_handler-commands.bats and another in 7_vmwatch.bats
* ran go mod tidy && go mod vendor
* Update ExtensionManifest version to 2.0.9 on UT
* Refactor telemetry event sender to use EventLevel constants in main/telemetry.go
* Refactor telemetry event sender to use EventTasks constants that match with existing Windows Telemetry
* Update logging messages in 7_vmwatch.bats
* Moved telemetry.go to its package in internal/telemetry
* Update Go version to 1.22 in Dockerfile, go.yml, go.mod, and go.sum
* Update ExtensionManifest version to 2.0.9 on UT
* Add NopLogger documentation to pkg/logging/logging.go
* Added Documentation to Telemetry Pkg
* - Added a wrapper to HandlerEnviroment to add additional functionality, like the String() func
- Added String() func to handlersettings struct, publicSettings struct, vmWatchSettings struct and vmWatchSignalFilters struct
- Added Telemetry Event for HandlerSettings, and for HandlerEnviroment
* - Updated HandlerEnviroment String() to use the MarshalIndent function.
- Updated HandlerSettings struct String() func to use MarshalIndent
- Fixed Failing UTs due to nil pointer in Embedded Struct inside HandlerEnviroment.
* - Updated vmWatchSettings String() func to use MarshalIndent
* Update ExtensionManifest version to 2.0.10 on Failing UT
* removed duplicated UT
* Removed String() func from VMWatchSignalFilters, publicSettings and protectedSettings
I don't know why this passed before; clearly we kill the process when we fail to assign a cgroup, so I don't know why it would ever return a different message. With this fix, tests pass locally.
Background:
Our tests had been running fine for a long time but suddenly started failing on specific OS versions. This was because the process, although initially associated with the correct cgroup we created, gets moved back to the parent cgroup, which removes the limits.
I did some research and reached out to various people, and found that this behavior has been seen before. When a process is started with systemd, you are not supposed to manage cgroups directly: systemd owns its own hierarchy and can manipulate things within it. The documentation says you should not modify the cgroups within that slice hierarchy directly; instead you should use `systemd-run` to launch processes.
The GuestAgent folks saw very similar behavior and switching to systemd-run resolved all their issues.
Changes:
Changed the code to launch the vmwatch process using `systemd-run`. The `--scope` parameter makes the call wait until the vmwatch process completes.
The process id returned from the call is the actual process id of vmwatch.
I have confirmed that killing vmwatch and killing app health extension still has the same behavior (the PDeathSig integration is working fine) and the aurora tests are working fine with these changes.
NOTE: Because systemd-run is not available in docker containers, the code falls back to running the process directly and continues to use the old code path in that case. This should also cover any Linux distros which don't use systemd, where direct cgroup assignment should work fine.
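A sketch of that launch strategy (function and binary names are illustrative, not the actual extension code):

```go
package main

import (
	"fmt"
	"os/exec"
)

// launchVMWatch prefers systemd-run --scope when available, otherwise
// falls back to the old direct-launch path (docker containers,
// non-systemd distros).
func launchVMWatch(binary string, args ...string) (*exec.Cmd, error) {
	if _, err := exec.LookPath("systemd-run"); err == nil {
		scopeArgs := append([]string{"--scope", binary}, args...)
		cmd := exec.Command("systemd-run", scopeArgs...)
		// With --scope, systemd-run stays in the foreground until the
		// launched process exits, so Start+Wait tracks vmwatch's lifetime.
		return cmd, cmd.Start()
	}
	// Fallback: direct launch; cgroup assignment happens manually here.
	cmd := exec.Command(binary, args...)
	return cmd, cmd.Start()
}

func main() {
	cmd, err := launchVMWatch("/bin/sleep", "1")
	if err != nil {
		panic(err)
	}
	fmt.Println("launched, waiting:", cmd.Wait())
}
```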
Changes by @dpoole73
- Fix bug where we were not using the global `vmWatchCommand` variable so the SIGTERM handler was not killing anything
- set the `Pdeathsig` property on the command so SIGTERM is sent to the sub-process on parent process termination (sketched after this list)
This fixes both issues:
Before the fix, if we killed the app health process, the vmwatch process was always leaked.
After the fix:
`kill <pid>` -> log message "Received shutdown request" and kill vmwatch.
`kill -9 <pid>` -> no log message, vmwatch is killed
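For reference, a minimal Linux-only snippet showing how `Pdeathsig` is wired up in Go (the command and signal here stand in for the real vmwatch setup):

```go
package main

import (
	"os/exec"
	"syscall"
)

func main() {
	// Pdeathsig (Linux-only) asks the kernel to signal the child when the
	// parent dies, so even `kill -9` on the parent takes the child down.
	cmd := exec.Command("/bin/sleep", "60")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Pdeathsig: syscall.SIGTERM,
	}
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	cmd.Wait()
}
```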
Changes by @klugorosado
- Added Integration tests to kill AppHealth Gracefully with SIGTERM and SIGINT, and validated VMWatch Shutdown.
- Added Integration tests to kill AppHealth Forcibly with SIGKILL, and validated VMWatch Shutdown.
- Added the capability to run integration tests inside dev containers.
* Added --apphealth-version flag to VMWatch with AppHealth version from manifest.xml
* - Validated Extension Version on existing VMWatch.
- Created a bash function to extract the version from manifest.xml.
- GetExtensionManifestVersion now first attempts to get the extension version passed at build time and uses the manifest.xml file as a fallback (sketched below).
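A sketch of that fallback (the XML element name and the ldflags variable are assumptions for illustration, not the actual extension code):

```go
package main

import (
	"encoding/xml"
	"fmt"
	"os"
)

// Version can be injected at build time, e.g.:
//   go build -ldflags "-X main.Version=2.0.10"
var Version string

// extensionManifest models only the field we need from manifest.xml.
type extensionManifest struct {
	Version string `xml:"Version"`
}

// getExtensionManifestVersion prefers the build-time version and falls
// back to parsing manifest.xml.
func getExtensionManifestVersion(manifestPath string) (string, error) {
	if Version != "" {
		return Version, nil
	}
	data, err := os.ReadFile(manifestPath)
	if err != nil {
		return "", err
	}
	var m extensionManifest
	if err := xml.Unmarshal(data, &m); err != nil {
		return "", err
	}
	return m.Version, nil
}

func main() {
	v, err := getExtensionManifestVersion("manifest.xml")
	fmt.Println(v, err)
}
```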