Commit graph

146 commits

Author SHA1 Message Date
Mingyang Zheng 3d9cf2d3d4
bump version to 2.0.14 (#92) 2024-09-23 13:38:10 -07:00
Mingyang Zheng bffd7d62a0 bump version to 2.0.14 2024-09-23 13:13:00 -07:00
Mingyang Zheng d532e74181
AppHealthLinux's VMWatch binary update to 1.2.0, AppHealthLinux manifest version bump to 2.0.13 (#91) 2024-09-17 10:19:51 -07:00
Mingyang Zheng 91496b5a84 remove binary 2024-09-16 19:04:53 -07:00
Mingyang Zheng 290225e1c3 update binary under integration folder 2024-09-16 19:04:14 -07:00
Mingyang Zheng 98d0dd8fb7 AppHealthLinux's VMWatch binary update to 1.2.0, AppHealthLinux manifest version bump to 2.0.13 2024-09-16 17:57:37 -07:00
Kevin Lugo 3ec3e5b906
Refactoring Telemetry Package and replacing go-kit Logger for Slog Logger (#88)
This Pull Request refactors the usage of our Telemetry package and
changes the logger from gokit logger. These changes close the gap on the
difference between the v2 in the master branch and the consolidated code
base in the feature/v2/WindowsMigration.
 
 
Important changes: 
- Created new Slog Handler to change the msg tag to event and moved it
to the end of the log message.
- The new Telemetry struct is now a singleton instead of a global
function.

### AI Updates:

------------------------------------------------------------------------------------

Updates to Telemetry:

*
[`internal/telemetry/telemetry.go`](diffhunk://#diff-03a209e90e40142dadb464e0a169c11dae5605db9eaf53106cdac0c56c235b38L4-R21):
The telemetry package was updated with several changes. The `EventLevel`
and `EventTask` constants were renamed for clarity, and the
`TelemetryEventSender` struct was replaced with a `Telemetry` struct.
Several methods were also changed, and new error variables were
introduced for better error handling.
[[1]](diffhunk://#diff-03a209e90e40142dadb464e0a169c11dae5605db9eaf53106cdac0c56c235b38L4-R21)
[[2]](diffhunk://#diff-03a209e90e40142dadb464e0a169c11dae5605db9eaf53106cdac0c56c235b38L34-R121)

*
[`main/cmds.go`](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373R5-R16):
The logging in the `main/cmds.go` file was updated to use the `slog`
package instead of the `log` package. This included changing the type of
the `lg` variable in several functions and updating the telemetry calls
to use the new `SendEvent` method from the updated `telemetry` package.
[[1]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373R5-R16)
[[2]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L44-R68)
[[3]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L80-R88)
[[4]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L108-R115)
[[5]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L137-R142)
[[6]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L173-R173)
[[7]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L182-R202)
[[8]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L225-R225)
[[9]](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L238-R238)
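
For illustration, a singleton `Telemetry` struct can be sketched with `sync.Once`. Only the names `Telemetry` and `SendEvent` come from the description above. The accessor `GetTelemetry`, the `Count` helper, the `SendEvent` parameters, and the in-memory event slice are assumptions made for this sketch and do not reflect the real package.

```go
package main

import (
	"fmt"
	"sync"
)

// Telemetry collects events in memory. The real struct presumably forwards
// events to a telemetry backend; that part is not shown here.
type Telemetry struct {
	mu     sync.Mutex
	events []string
}

var (
	instance *Telemetry
	once     sync.Once
)

// GetTelemetry (hypothetical name) returns the single shared instance,
// creating it on first use.
func GetTelemetry() *Telemetry {
	once.Do(func() { instance = &Telemetry{} })
	return instance
}

// SendEvent records one event. The parameter list is an assumption.
func (t *Telemetry) SendEvent(level, task, message string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.events = append(t.events, fmt.Sprintf("[%s] %s: %s", level, task, message))
}

// Count reports how many events have been recorded so far.
func (t *Telemetry) Count() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.events)
}

func main() {
	GetTelemetry().SendEvent("Info", "Startup", "extension started")
	fmt.Println(GetTelemetry() == GetTelemetry(), GetTelemetry().Count())
}
```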

Updates to Logging:

*
[`main/handlersettings.go`](diffhunk://#diff-f8ae33e4c69620dbc2523794f5240aa34ad618e11e155fec37c03a0c2e8b2b8cR6-L11):
The logging in the `handlersettings.go` file was updated to use the
`slog` package instead of the `log` package. This included updating the
telemetry calls to use the new `SendEvent` method from the updated
`telemetry` package.
[[1]](diffhunk://#diff-f8ae33e4c69620dbc2523794f5240aa34ad618e11e155fec37c03a0c2e8b2b8cR6-L11)
[[2]](diffhunk://#diff-f8ae33e4c69620dbc2523794f5240aa34ad618e11e155fec37c03a0c2e8b2b8cL139-R162)

*
[`main/health.go`](diffhunk://#diff-2422cb6e5f570a2a3eeb5388f7e0fcc644727dfd0d34911de35e83a268f1d2efR8):
The logging in the `health.go` file was updated to use the `slog`
package instead of the `log` package. This included updating the
`evaluate` method in the `HealthProbe` interface.
[[1]](diffhunk://#diff-2422cb6e5f570a2a3eeb5388f7e0fcc644727dfd0d34911de35e83a268f1d2efR8)
[[2]](diffhunk://#diff-2422cb6e5f570a2a3eeb5388f7e0fcc644727dfd0d34911de35e83a268f1d2efL16)
[[3]](diffhunk://#diff-2422cb6e5f570a2a3eeb5388f7e0fcc644727dfd0d34911de35e83a268f1d2efL63-R63)

Addition of a new devcontainer run configuration:

*
[`.vscode/launch.json`](diffhunk://#diff-bd5430ee7c51dc892a67b3f2829d1f5b6d223f0fd48b82322cfd45baf9f5e945R18-R29):
A new devcontainer run configuration named "devcontainer run -
uninstall" was added. This configuration is set up to run the
"uninstall" command in the `applicationhealth-extension` program.
2024-07-01 20:47:09 -07:00
Kevin Lugo cf66d11186
Implementing New Sequence Number Management and Fixing how we get the extension Sequence Number (#83)
This pull request includes changes to the sequence number management and
testing in the `main` and `internal/seqno` packages. The most important
changes include the creation of a new `SequenceNumberManager` interface
and `SeqNumManager` struct, the addition of a function to check if a
sequence number has already been processed before enabling it, and the
addition of tests for the new function.

New sequence number management:

*
[`internal/seqno/seqno.go`](diffhunk://#diff-f671e4abbca7ae7b738bc8ef287fcbf3995062b2cc5e54ad666e3fa6f1b674dcR1-R101):
Created a new `SequenceNumberManager` interface and `SeqNumManager`
struct to manage sequence numbers. The `SeqNumManager` struct includes
functions to get and set sequence numbers, and to find a sequence number
from either the environment variable or the most recently used file
under the config folder.

Changes to `main` package:

*
[`main/cmds.go`](diffhunk://#diff-ace417b47e816a44cf3b6f6248e72453a46d9e6043f19aea9d39212e852cc373L32-R32):
A new function `enablePre` has been added. This function, acting as the
PreFunc for the enable command, verifies that the sequence number is ready
for processing by comparing it with the last executed number from the
`mrseq` file. This ensures orderly processing of sequence numbers.
*
[`main/main.go`](diffhunk://#diff-327181d0a8c5e6b164561d7910f4eeffd41442d55b2a2788fda2aa2692f17ec0L64-R68):
Replaced the `FindSeqNum` function with `seqnoManager.FindSeqNum` to
find the sequence number.
*
[`main/seqnum.go`](diffhunk://#diff-171d8d31093fac5a89b9bbe034fe628faf47dd12fad91b3205433ca95c56be52L1-L32):
Removed the `FindSeqNum` function as it has been replaced by
`seqnoManager.FindSeqNum`.

New tests:

*
[`main/cmds_test.go`](diffhunk://#diff-bdb35e68cc43b04f7c8b572233a1472169052b84e0b471c6fe578fe049784223R36-R133):
Added tests for `enablePre`.

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-27 18:13:21 -07:00
dpoole73 7b478892b7
chore(logging): Logging the error code if kill fails (#87)
We investigated this at length and cannot figure out what could possibly
be happening based on the logs. Logging the fact that kill failed should
help us at least narrow it down further.

Also adding a sleep between killing app health and vmwatch to reduce the
chance of any races (vmwatch should die naturally when app health is
killed, but it may not happen immediately, so giving it a second to
respond should reduce the times it needs to be killed independently).
2024-06-20 12:27:33 -07:00
Dave Poole 212543df06 chore(logging): Logging the error code if kill fails
We investigated this at length and cannot figure out what could possibly be happening based on the logs.  Logging the fact that kill failed should help us at least narrow it down further.

Also adding a sleep between killing app health and vmwatch to reduce the chance of any races (vmwatch should die naturally when app health is killed, but it may not happen immediately, so giving it a second to respond should reduce the times it needs to be killed independently).
2024-06-20 11:22:26 -07:00
dpoole73 d2d9562981
chore(script): Update the script to upload the test binaries to use the new account (#86)
we migrated to a new account so updating the script
2024-06-17 10:13:58 -07:00
Dave Poole 78910a6b1b chore(script): Update the script to upload the test binaries to use the new account
we migrated to a new account so updating the script
2024-06-13 16:57:58 -07:00
dpoole73 784f7cce3d
fix(shim): Fix timing issue in shim script (#85)
We discovered that there is a timing issue in the script which can cause
it to fail, resulting in settings update timeouts.

Explanation:

1. `kill_existing_apphealth_processes` checks for app health running and
sees it is running
1. it kills the process using `pkill -f`, this succeeds.
1. killing app health causes vmwatch to be sent a kill signal
1. `kill_existing_vmwatch_processes` checks for vmwatch running and sees
it is there because it hasn't quite reacted to the kill signal yet
1. tries to kill it using `pkill -f` but it has already gone so it fails
and because the script is running with `set -e` it fails immediately

the fix:

add `|| true` to the command so that failures are ignored. If it
actually failed to kill the process for some reason the script will
still poll and fall back to `pkill -9` so there is no change in behavior
in the case of a real issue killing the process, just fixes a timing
issue
2024-06-13 14:38:18 -07:00
Dave Poole f9cd527079 fix(shim): Fix timing issue in shim script
We discovered that there is a timing issue in the script which can cause it to fail, resulting in settings update timeouts.

Explanation:

1. `kill_existing_apphealth_processes` checks for app health running and sees it is running
1. it kills the process using `pkill -f`, this succeeds.
1. killing app health causes vmwatch to be sent a kill signal
1. `kill_existing_vmwatch_processes` checks for vmwatch running and sees it is there because it hasn't quite reacted to the kill signal yet
1. tries to kill it using `pkill -f` but it has already gone so it fails and because the script is running with `set -e` it fails immediately

the fix:

add `|| true` to the command so that failures are ignored.  If it actually failed to kill the process for some reason the script will still poll and fall back to `pkill -9` so there is no change in behavior in the case of a real issue killing the process, just fixes a timing issue
2024-06-13 11:43:24 -07:00
dependabot[bot] 93f8e0fe5e Bump google.golang.org/protobuf from 1.27.1 to 1.33.0
Bumps google.golang.org/protobuf from 1.27.1 to 1.33.0.

---
updated-dependencies:
- dependency-name: google.golang.org/protobuf
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2024-06-13 17:35:04 +00:00
frank-pang-msft ab885dcecc
Bump to v2.0.12: VMWatch Integration (#60)
## Overview
This PR merges changes from multiple pull requests into a feature branch
that will support running VMWatch (amd64 and arm64) as an executable via
goroutines and channels. In addition, a number of dev/debugging tools
were included to improve developer productivity.

> VMWatch is a standardized, lightweight, and open-sourced testing
framework designed to enhance the monitoring and management of guest VMs
on the Azure platform, including both 1P and 3P instances. VMWatch is
engineered to collect vital health signals across multiple dimensions,
which will be seamlessly integrated into Azure's quality systems. By
leveraging these signals, VMWatch will enable Azure to swiftly detect
and prevent regressions induced by platform updates or configuration
changes, identify gaps in platform telemetry, and ultimately improve the
guest experience for all Azure customers.

## Behavior
VMWatch will run asynchronously as a separate process from
ApplicationHealth, so the probing of application health will not be
affected by the state of VMWatch. Depending on extension settings,
VMWatch can be enabled or disabled, and the settings can also specify the
test names and parameter overrides passed to the VMWatch binary. The
status of VMWatch will be displayed in the extension status file and also
in GET VM Instance View. The main process will manage the VMWatch process
and communicate VMWatch status via the extension status file.

## Process Leaks & Resource Governance
Main process ensures proper resource utilization limits for CPU and
Memory, along with avoiding process leaks by subscribing to
shutdown/termination signals in the main process.
2024-06-11 15:16:42 -07:00
Mingyang Zheng b2b0c04c96
AppHealthLinux manifest version bump to 2.0.12 (#81) 2024-05-23 14:14:52 -07:00
Mingyang Zheng 46fbe0214c update VMWatch binary 2024-05-23 13:05:08 -07:00
Mingyang Zheng f723cfd21c AppHealthLinux manifest version bump to 2.0.12 2024-05-23 12:36:03 -07:00
frank-pang-msft aefdba0144
Improve logging to kusto for better debugging (#77)
Some info that is present in the local logs is missing from the kusto
logs, which makes debugging difficult.

- Log specific command being executed at startup
(install/enable/update/etc)
- Include extension sequence number and pid at startup for debugging
from GuestAgent logs when extension logs are missing or seqNum.status
file is missing
- Log overall status file so we have better debugging when
VMExtensionProvisioning fails. This status is only sent when extension
transitions between Transitioning -> Success/Error or whenever extension
starts up.
- Update azure-extension-platform package to pull in change to increase
precision of event timestamp to include milliseconds/nanoseconds.
Previously it was RFC3339, in the format yyyy-mm-ddThh:mm:ssZ, whose
one-second resolution caused issues when sorting timestamps.
https://github.com/Azure/azure-extension-platform/pull/34
2024-05-21 15:02:52 -07:00
Mingyang Zheng 62444c87f9
bump version to 2.0.11 (#74) 2024-05-17 11:05:13 -07:00
Kevin Lugo Rosado af6cd720af fix: Update manifest.xml version to 2.0.11 2024-05-16 23:12:36 +00:00
Kevin Lugo e6980289bf
Adding CodeQL Code Scanning Workflow (#71) (#73)
* Adding codeql code scanning to repo

* Update .github/workflows/codeql.yml to use only ubuntu-latest for Go
language build mode

* chore: Update GOPATH on codeql.yml

* Attempt to fix GOPATH

* debug

* debug

* chore: Update GO111MODULE

* chore: Update GOPATH and repo root path in codeql.yml

* revert

* adding more codeql queries
2024-05-16 16:06:42 -07:00
dpoole73 30515987c5
fix(cgroups): Fixing the check for systemd-run (#70)
Although the tests have been passing on the latest changes, there was a
failure in testing last night.

When investigating I found the cause of the problem. When you call
cmd.Execute("systemd-run") golang will (sometimes) replace it with the
full path (in this case /usr/bin/systemd-run) and so our check for
systemd-run mode was not working and it was going down the old code path
of direct cgroup assignment.

Fixing by being explicit about it and returning a boolean indicating
whether resource governance is required after the process is launched.
This brings it back to the way it was in the previous PR iterations but
avoids the objections raised there due to linux only concepts. When we
converge the windows code here, the implementation of
applyResourceGovernance will use Job objects on windows and the code
flow will be the same.
2024-05-16 15:17:26 -07:00
dpoole73 b484db311d
Changes to support running devcontainer based tests on Mac Silicon (#72)
I have been unable to run the integration tests locally since upgrading my laptop. I worked with Kevin to figure out the issues and the tests are working now.

1. changing to build the test container using no-cache mode, since if you have an old bad version it would not get rebuilt.
1. changing the devcontainer config to force running amd64 rather than arm64
1. tweaking the scripts to handle the slightly different process names and ps output when running in this way.

Now the tests pass on Mac.
2024-05-16 13:41:21 -07:00
Kevin Lugo 4b71f92489
Adding CodeQL Code Scanning Workflow (#71)
* Adding codeql code scanning to repo

* Update .github/workflows/codeql.yml to use only ubuntu-latest for Go language build mode

* chore: Update GOPATH on codeql.yml

* Attempt to fix GOPATH

* debug

* debug

* chore: Update GO111MODULE

* chore: Update GOPATH and repo root path in codeql.yml

* revert

* adding more codeql queries
2024-05-16 12:57:10 -07:00
Dave Poole baa630dab7 corrected test to match the new way the error shows up based on previous feedback 2024-05-15 23:00:54 +00:00
Dave Poole 03446f89ac feedback 2024-05-15 22:05:44 +00:00
Dave Poole c079e29fa1 revert accidental new variable creation 2024-05-14 22:53:17 +00:00
Dave Poole 486672beba preserving the original logic 2024-05-14 21:34:36 +00:00
Dave Poole 89379ba69c fix accidental edit 2024-05-14 21:12:19 +00:00
Dave Poole c84eb7be8d fix error check 2024-05-14 13:22:48 -07:00
Dave Poole c7b28f08d5 fix(cgroups): Fixing the check for systemd-run
Although the tests have been passing on the latest changes, there was a failure in testing last night.

When investigating I found the cause of the problem.  When you call cmd.Execute("systemd-run") golang will (sometimes) replace it with the full path (in this case /usr/bin/systemd-run) and so our check for systemd-run mode was not working and it was going down the old code path of direct cgroup assignment.

Fixing by being explicit about it and returning a boolean indicating whether resource governance is required after the process is launched.  This brings it back to the way it was in the previous PR iterations but avoids the objections raised there due to linux only concepts.  When we converge the windows code here, the implementation of applyResourceGovernance will use Job objects on windows and the code flow will be the same.
2024-05-14 12:37:10 -07:00
dpoole73 6d2ff1e5f7
Merge pull request #69 from Azure/dpoole/systemd-run-commandline-fix
Change the commandline used for systemd-run depending on the installed version
2024-05-06 12:58:53 -07:00
Dave Poole 5b29a32e8f feedback 2024-05-06 10:44:02 -07:00
Dave Poole f9ff9c5cea Change the commandline used for systemd-run depending on the installed version
We found when testing on some distros that they had older versions of systemd installed.

Versions before 246 use `MemoryLimit` and after that use `MemoryMax` so we need to know which version we have when constructing the commandline.

Also older versions didn't support the `-E` flag for environment variables and instead use the longer form `--setenv`. The longer form is supported in both old and new versions
2024-05-06 09:51:49 -07:00
Kevin Lugo 62799315b6
Removing Unnecessary Telemetry Events and Log CustomMetrics Changes only (#68)
* Removed Noise Telemetry Events, and more details on error log.

* - Created new CustomMetricsStatusType
- CustomMetrics will now be reported only when there is a change in the CustomMetric field.
- Added commitedCustomMetricsState variable to keep track of the last CustomMetric value.
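
The report-only-on-change behavior can be sketched as follows. `metricsReporter` and `shouldReport` are hypothetical names, and representing the custom metrics value as a plain string is an assumption for illustration.

```go
package main

import "fmt"

// metricsReporter remembers the last value that was reported.
type metricsReporter struct {
	committed    string
	hasCommitted bool
}

// shouldReport returns true only when current differs from the last
// reported value (or nothing has been reported yet), and then records it.
func (r *metricsReporter) shouldReport(current string) bool {
	if r.hasCommitted && r.committed == current {
		return false
	}
	r.committed, r.hasCommitted = current, true
	return true
}

func main() {
	r := &metricsReporter{}
	for _, v := range []string{`{"rps":1}`, `{"rps":1}`, `{"rps":2}`} {
		fmt.Println(v, r.shouldReport(v))
	}
}
```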
2024-05-03 16:24:36 -07:00
dpoole73 bd1dbc02e8
Merge pull request #67 from Azure/dev/dpoole/update-vmwatch-5-2
chore: update the latest vmwatch binaries (1.1.1)
2024-05-02 22:29:51 -07:00
Dave Poole 30a2d4c04e update the latest vmwatch binaries (1.1.1) 2024-05-02 14:11:20 -07:00
Kevin Lugo b56f2ad074
Adding Kusto Telemetry to ApplicationHealthLinux v2 (#63)
* Adding internal/manifest package from Cross-Platform AppHealth Feature Branch

* Running go mod tidy and go mod vendor

* - Add manifest.xml to Extension folder
- Changed GitHub workflow Go version to Go 1.18
- Small refactor in setup function for bats tests.

* Update Go version to 1.18 in Dockerfile

* Add logging package with NopLogger implementation

* Add telemetry package for logging events

* - Add telemetry event Logging to main.go

* - Add new String() methods to vmWatchSignalFilters and vmWatchSettings structs
- Add telemetry event Logging to handlersettings.go

* - Add telemetry event Logging to reportstatus.go

* Add telemetry event Logging to health.go

* Refactor install handler in main/cmds.go to use telemetry event logging

* Refactor uninstall handler in main/cmds.go to use telemetry event logging

* Refactor enable handler function in main/cmds.go to use telemetry event logging

* Refactor vmWatch.go to use telemetry event logging

* Fix requestPath in extension-settings.json and updated 2 integration tests,  one in 2_handler-commands.bats and another in 7_vmwatch.bats

* ran go mod tidy && go mod vendor

* Update ExtensionManifest version to 2.0.9 on UT

* Refactor telemetry event sender to use EventLevel constants in main/telemetry.go

* Refactor telemetry event sender to use EventTasks constants that match with existing Windows Telemetry

* Update logging messages in 7_vmwatch.bats

* Moved telemetry.go to its package in internal/telemetry

* Update Go version to 1.22 in Dockerfile, go.yml, go.mod, and go.sum

* Update ExtensionManifest version to 2.0.9 on UT

* Add NopLogger documentation to pkg/logging/logging.go

* Added Documentation to Telemetry Pkg

* -Added a Wrapper to HandlerEnviroment to add Additional functionality like the String() func
- Added String() func to handlersettings struct, publicSettings struct, vmWatchSettings struct and
vmWatchSignalFilters struct
- Added Telemetry Event for HandlerSettings, and for HandlerEnviroment

* - Updated HandlerEnviroment String to use MarshalIndent function.
- Updated HandlerSettings struct String() func to use MarshalIndent
- Fixed Failing UTs due to nil pointer in Embedded Struct inside HandlerEnviroment.

* - Updated vmWatchSetting String func to use MarshalIndent

* Update ExtensionManifest version to 2.0.10 on Failing UT

* removed duplicated UT

* Removed String() func from VMWatchSignalFilters, publicSettings and protectedSettings
2024-05-01 23:46:04 -07:00
Mingyang Zheng e8e69f4a0b
Merge pull request #66 from Azure/release-2.0.10
bump version to 2.0.10
2024-04-30 14:47:58 -07:00
dpoole73 b552134a5c
Merge pull request #65 from Azure/dev/dpoole/update-vmwatch-4-30
Updating vmwatch binaries to 1.1.0 package
2024-04-30 14:47:00 -07:00
Mingyang Zheng 65961892b2 bump version to 2.0.10 2024-04-30 14:08:53 -07:00
Dave Poole 152b39ce93 Updating vmwatch binaries to 1.1.0 package 2024-04-30 11:01:11 -07:00
dpoole73 043d2a2773
Merge pull request #64 from Azure/dev/dpoole/cgroup-using-systemd-run
fix(systemd-run): Switch to use systemd-run instead of direct process and cgroup manipulation
2024-04-26 14:37:50 -07:00
Dave Poole 23cb651c36 correcting the search term
I don't know why this passed before. Clearly we kill the process when we fail to assign a cgroup, so I don't know why it would ever return a different message. With this fix, tests pass locally.
2024-04-24 21:44:35 +00:00
Dave Poole 3e94eb670f revert 2024-04-24 20:11:45 +00:00
Dave Poole fea2ebdb9f fix test issue. There seems to be a non-deterministic case where the message can get logged differently 2024-04-24 12:24:09 -07:00
Dave Poole b30ee9181a feedback 2024-04-24 11:37:57 -07:00
Dave Poole 0c9a693cc7 feedback 2024-04-24 11:04:11 -07:00