Charles Torre 2021-02-10 11:06:34 -08:00
Commit 27a3c8f8bf
86 changed files: 10454 additions and 0 deletions

7
.editorconfig Normal file

@ -0,0 +1,7 @@
[*.cs]
# CA1068: CancellationToken parameters must come last
dotnet_diagnostic.CA1068.severity = none
# IDE0055: Fix formatting
dotnet_diagnostic.IDE0055.severity = none

63
.gitattributes vendored Normal file

@ -0,0 +1,63 @@
###############################################################################
# Set default behavior to automatically normalize line endings.
###############################################################################
* text=auto
###############################################################################
# Set default behavior for command prompt diff.
#
# This is needed for earlier builds of msysgit that do not have it on by
# default for csharp files.
# Note: This is only used by command line
###############################################################################
#*.cs diff=csharp
###############################################################################
# Set the merge driver for project and solution files
#
# Merging from the command prompt will add diff markers to the files if there
# are conflicts (Merging from VS is not affected by the settings below, in VS
# the diff markers are never inserted). Diff markers may cause the following
# file extensions to fail to load in VS. An alternative would be to treat
# these files as binary and thus will always conflict and require user
# intervention with every merge. To do so, just uncomment the entries below
###############################################################################
#*.sln merge=binary
#*.csproj merge=binary
#*.vbproj merge=binary
#*.vcxproj merge=binary
#*.vcproj merge=binary
#*.dbproj merge=binary
#*.fsproj merge=binary
#*.lsproj merge=binary
#*.wixproj merge=binary
#*.modelproj merge=binary
#*.sqlproj merge=binary
#*.wwaproj merge=binary
###############################################################################
# behavior for image files
#
# image files are treated as binary by default.
###############################################################################
#*.jpg binary
#*.png binary
#*.gif binary
###############################################################################
# diff behavior for common document formats
#
# Convert binary document formats to text before diffing them. This feature
# is only available from the command line. Turn it on by uncommenting the
# entries below.
###############################################################################
#*.doc diff=astextplain
#*.DOC diff=astextplain
#*.docx diff=astextplain
#*.DOCX diff=astextplain
#*.dot diff=astextplain
#*.DOT diff=astextplain
#*.pdf diff=astextplain
#*.PDF diff=astextplain
#*.rtf diff=astextplain
#*.RTF diff=astextplain

342
.gitignore vendored Normal file

@ -0,0 +1,342 @@
## Ignore Visual Studio temporary files, build results, and
## files generated by popular Visual Studio add-ons.
##
## Get latest from https://github.com/github/gitignore/blob/master/VisualStudio.gitignore
# User-specific files
*.rsuser
*.suo
*.user
*.userosscache
*.sln.docstates
# User-specific files (MonoDevelop/Xamarin Studio)
*.userprefs
# Build results
[Dd]ebug/
[Dd]ebugPublic/
[Rr]elease/
[Rr]eleases/
x64/
x86/
[Aa][Rr][Mm]/
[Aa][Rr][Mm]64/
bld/
[Bb]in/
[Oo]bj/
[Ll]og/
# Visual Studio 2015/2017 cache/options directory
.vs/
# Uncomment if you have tasks that create the project's static files in wwwroot
#wwwroot/
# Visual Studio 2017 auto generated files
Generated\ Files/
# MSTest test Results
[Tt]est[Rr]esult*/
[Bb]uild[Ll]og.*
# NUNIT
*.VisualState.xml
TestResult.xml
# Build Results of an ATL Project
[Dd]ebugPS/
[Rr]eleasePS/
dlldata.c
# Benchmark Results
BenchmarkDotNet.Artifacts/
# .NET Core
project.lock.json
project.fragment.lock.json
artifacts/
# StyleCop
StyleCopReport.xml
# Files built by Visual Studio
*_i.c
*_p.c
*_h.h
*.ilk
*.meta
*.obj
*.iobj
*.pch
*.pdb
*.ipdb
*.pgc
*.pgd
*.rsp
*.sbr
*.tlb
*.tli
*.tlh
*.tmp
*.tmp_proj
*_wpftmp.csproj
*.log
*.vspscc
*.vssscc
.builds
*.pidb
*.svclog
*.scc
# Chutzpah Test files
_Chutzpah*
# Visual C++ cache files
ipch/
*.aps
*.ncb
*.opendb
*.opensdf
*.sdf
*.cachefile
*.VC.db
*.VC.VC.opendb
# Visual Studio profiler
*.psess
*.vsp
*.vspx
*.sap
# Visual Studio Trace Files
*.e2e
# TFS 2012 Local Workspace
$tf/
# Guidance Automation Toolkit
*.gpState
# ReSharper is a .NET coding add-in
_ReSharper*/
*.[Rr]e[Ss]harper
*.DotSettings.user
# JustCode is a .NET coding add-in
.JustCode
# TeamCity is a build add-in
_TeamCity*
# DotCover is a Code Coverage Tool
*.dotCover
# AxoCover is a Code Coverage Tool
.axoCover/*
!.axoCover/settings.json
# Visual Studio code coverage results
*.coverage
*.coveragexml
# NCrunch
_NCrunch_*
.*crunch*.local.xml
nCrunchTemp_*
# MightyMoose
*.mm.*
AutoTest.Net/
# Web workbench (sass)
.sass-cache/
# Installshield output folder
[Ee]xpress/
# DocProject is a documentation generator add-in
DocProject/buildhelp/
DocProject/Help/*.HxT
DocProject/Help/*.HxC
DocProject/Help/*.hhc
DocProject/Help/*.hhk
DocProject/Help/*.hhp
DocProject/Help/Html2
DocProject/Help/html
# Click-Once directory
publish/
# Publish Web Output
*.[Pp]ublish.xml
*.azurePubxml
# Note: Comment the next line if you want to checkin your web deploy settings,
# but database connection strings (with potential passwords) will be unencrypted
*.pubxml
*.publishproj
# Microsoft Azure Web App publish settings. Comment the next line if you want to
# checkin your Azure Web App publish settings, but sensitive information contained
# in these scripts will be unencrypted
PublishScripts/
# NuGet Packages
*.nupkg
# The packages folder can be ignored because of Package Restore
**/[Pp]ackages/*
# except build/, which is used as an MSBuild target.
!**/[Pp]ackages/build/
# Uncomment if necessary however generally it will be regenerated when needed
#!**/[Pp]ackages/repositories.config
# NuGet v3's project.json files produces more ignorable files
*.nuget.props
*.nuget.targets
# Microsoft Azure Build Output
csx/
*.build.csdef
# Microsoft Azure Emulator
ecf/
rcf/
# Windows Store app package directories and files
AppPackages/
BundleArtifacts/
Package.StoreAssociation.xml
_pkginfo.txt
*.appx
# Visual Studio cache files
# files ending in .cache can be ignored
*.[Cc]ache
# but keep track of directories ending in .cache
!?*.[Cc]ache/
# Others
ClientBin/
~$*
*~
*.dbmdl
*.dbproj.schemaview
*.jfm
*.pfx
*.publishsettings
orleans.codegen.cs
# Including strong name files can present a security risk
# (https://github.com/github/gitignore/pull/2483#issue-259490424)
#*.snk
# Since there are multiple workflows, uncomment next line to ignore bower_components
# (https://github.com/github/gitignore/pull/1529#issuecomment-104372622)
#bower_components/
# RIA/Silverlight projects
Generated_Code/
# Backup & report files from converting an old project file
# to a newer Visual Studio version. Backup files are not needed,
# because we have git ;-)
_UpgradeReport_Files/
Backup*/
UpgradeLog*.XML
UpgradeLog*.htm
ServiceFabricBackup/
*.rptproj.bak
# SQL Server files
*.mdf
*.ldf
*.ndf
# Business Intelligence projects
*.rdl.data
*.bim.layout
*.bim_*.settings
*.rptproj.rsuser
*- Backup*.rdl
# Microsoft Fakes
FakesAssemblies/
# GhostDoc plugin setting file
*.GhostDoc.xml
# Node.js Tools for Visual Studio
.ntvs_analysis.dat
node_modules/
# Visual Studio 6 build log
*.plg
# Visual Studio 6 workspace options file
*.opt
# Visual Studio 6 auto-generated workspace file (contains which files were open etc.)
*.vbw
# Visual Studio LightSwitch build output
**/*.HTMLClient/GeneratedArtifacts
**/*.DesktopClient/GeneratedArtifacts
**/*.DesktopClient/ModelManifest.xml
**/*.Server/GeneratedArtifacts
**/*.Server/ModelManifest.xml
_Pvt_Extensions
# Paket dependency manager
.paket/paket.exe
paket-files/
# FAKE - F# Make
.fake/
# JetBrains Rider
.idea/
*.sln.iml
# CodeRush personal settings
.cr/personal
# Python Tools for Visual Studio (PTVS)
__pycache__/
*.pyc
# Cake - Uncomment if you are using it
# tools/**
# !tools/packages.config
# Tabs Studio
*.tss
# Telerik's JustMock configuration file
*.jmconfig
# BizTalk build output
*.btp.cs
*.btm.cs
*.odx.cs
*.xsd.cs
# OpenCover UI analysis results
OpenCover/
# Azure Stream Analytics local run output
ASALocalRun/
# MSBuild Binary and Structured Log
*.binlog
# NVidia Nsight GPU debugger configuration file
*.nvuser
# MFractors (Xamarin productivity tool) working folder
.mfractor/
# Local History for Visual Studio
.localhistory/
# BeatPulse healthcheck temp database
healthchecksdb
**/PublishProfiles/Cloud.xml

29
Build-FabricHealer.ps1 Normal file

@ -0,0 +1,29 @@
$ErrorActionPreference = "Stop"
$Configuration="Release"
[string] $scriptPath = Split-Path -Parent $MyInvocation.MyCommand.Definition
try {
    Push-Location $scriptPath
    Remove-Item $scriptPath\bin\release\FabricHealer\ -Recurse -Force -EA SilentlyContinue
    dotnet publish FabricHealer\FabricHealer.csproj -o bin\release\FabricHealer\linux-x64\self-contained\FabricHealerType\FabricHealerPkg\Code -c $Configuration -r linux-x64 --self-contained true
    dotnet publish FabricHealer\FabricHealer.csproj -o bin\release\FabricHealer\linux-x64\framework-dependent\FabricHealerType\FabricHealerPkg\Code -c $Configuration -r linux-x64 --self-contained false
    dotnet publish FabricHealer\FabricHealer.csproj -o bin\release\FabricHealer\win-x64\self-contained\FabricHealerType\FabricHealerPkg\Code -c $Configuration -r win-x64 --self-contained true
    dotnet publish FabricHealer\FabricHealer.csproj -o bin\release\FabricHealer\win-x64\framework-dependent\FabricHealerType\FabricHealerPkg\Code -c $Configuration -r win-x64 --self-contained false
    Copy-Item FabricHealer\PackageRoot\* bin\release\FabricHealer\linux-x64\self-contained\FabricHealerType\FabricHealerPkg\ -Recurse
    Copy-Item FabricHealer\PackageRoot\* bin\release\FabricHealer\linux-x64\framework-dependent\FabricHealerType\FabricHealerPkg\ -Recurse
    Copy-Item FabricHealer\PackageRoot\* bin\release\FabricHealer\win-x64\self-contained\FabricHealerType\FabricHealerPkg\ -Recurse
    Copy-Item FabricHealer\PackageRoot\* bin\release\FabricHealer\win-x64\framework-dependent\FabricHealerType\FabricHealerPkg\ -Recurse
    Copy-Item FabricHealerApp\ApplicationPackageRoot\ApplicationManifest.xml bin\release\FabricHealer\linux-x64\self-contained\FabricHealerType\ApplicationManifest.xml
    Copy-Item FabricHealerApp\ApplicationPackageRoot\ApplicationManifest.xml bin\release\FabricHealer\linux-x64\framework-dependent\FabricHealerType\ApplicationManifest.xml
    Copy-Item FabricHealerApp\ApplicationPackageRoot\ApplicationManifest.xml bin\release\FabricHealer\win-x64\self-contained\FabricHealerType\ApplicationManifest.xml
    Copy-Item FabricHealerApp\ApplicationPackageRoot\ApplicationManifest.xml bin\release\FabricHealer\win-x64\framework-dependent\FabricHealerType\ApplicationManifest.xml
}
finally {
    Pop-Location
}

32
Build-NugetPackages.ps1 Normal file

@ -0,0 +1,32 @@
function Build-Nuget {
    param (
        [string]
        $packageId,
        [string]
        $basePath
    )
    [string] $nugetSpecTemplate = [System.IO.File]::ReadAllText([System.IO.Path]::Combine($scriptPath, "FabricHealer.nuspec.template"))
    [string] $nugetSpecPath = "$scriptPath\bin\release\FabricHealer\$($packageId).nuspec"
    [System.IO.File]::WriteAllText($nugetSpecPath, $nugetSpecTemplate.Replace("%PACKAGE_ID%", $packageId).Replace("%ROOT_PATH%", $scriptPath))
    .\nuget.exe pack $nugetSpecPath -basepath $basePath -OutputDirectory bin\release\FabricHealer\Nugets -properties NoWarn=NU5100
}
[string] $scriptPath = Split-Path -Parent $MyInvocation.MyCommand.Definition
try {
    Push-Location $scriptPath
    Build-Nuget "Microsoft.ServiceFabricApps.FabricHealer.Linux.SelfContained.Beta" "$scriptPath\bin\release\FabricHealer\linux-x64\self-contained\FabricHealerType"
    Build-Nuget "Microsoft.ServiceFabricApps.FabricHealer.Linux.FrameworkDependent.Beta" "$scriptPath\bin\release\FabricHealer\linux-x64\framework-dependent\FabricHealerType"
    Build-Nuget "Microsoft.ServiceFabricApps.FabricHealer.Windows.SelfContained.Beta" "$scriptPath\bin\release\FabricHealer\win-x64\self-contained\FabricHealerType"
    Build-Nuget "Microsoft.ServiceFabricApps.FabricHealer.Windows.FrameworkDependent.Beta" "$scriptPath\bin\release\FabricHealer\win-x64\framework-dependent\FabricHealerType"
}
finally {
    Pop-Location
}

34
Build-SFPKGs.ps1 Normal file

@ -0,0 +1,34 @@
[string] $scriptPath = Split-Path -Parent $MyInvocation.MyCommand.Definition
function Build-SFPkg {
    param (
        [string]
        $packageId,
        [string]
        $basePath
    )
    $ProgressPreference = "SilentlyContinue"
    [string] $outputDir = "$scriptPath\bin\release\FabricHealer\SFPkgs"
    [string] $zipPath = "$outputDir\$($packageId).zip"
    [System.IO.Directory]::CreateDirectory($outputDir) | Out-Null
    Compress-Archive "$basePath\*" $zipPath -Force
    Move-Item -Path $zipPath -Destination ($zipPath.Replace(".zip", ".sfpkg"))
}
try {
    Push-Location $scriptPath
    Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Linux.SelfContained.Beta.0.4.2" "$scriptPath\bin\release\FabricHealer\linux-x64\self-contained\FabricHealerType"
    Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Linux.FrameworkDependent.Beta.0.4.2" "$scriptPath\bin\release\FabricHealer\linux-x64\framework-dependent\FabricHealerType"
    Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Windows.SelfContained.Beta.0.4.2" "$scriptPath\bin\release\FabricHealer\win-x64\self-contained\FabricHealerType"
    Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Windows.FrameworkDependent.Beta.0.4.2" "$scriptPath\bin\release\FabricHealer\win-x64\framework-dependent\FabricHealerType"
}
finally {
    Pop-Location
}

9
CODE_OF_CONDUCT.md Normal file

@ -0,0 +1,9 @@
# Microsoft Open Source Code of Conduct
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
Resources:
- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns


@ -0,0 +1,264 @@
## Extending Repair Workflows with Logic Programming
FabricHealer employs configuration-as-logic by leveraging the expressive power of [Guan](https://github.com/microsoft/guan), a general-purpose logic programming system composed of a C# API and a logic interpreter/query executor. Guan supports Prolog-style syntax for writing logic rules and executing queries over them, and FabricHealer uses it to define execution workflows for automatic repairs in Service Fabric clusters (Windows and Linux).
**Why?**
Supporting formal logic-based repair workflows gives users more tools and options for expressing custom repairs. Formal logic lets users express concepts like if/else statements, boolean operators, and even recursion, so complex repair workflows can be written easily and concisely. FabricHealer uses GuanLogic for its underlying logic processing. GuanLogic is a general-purpose logic programming API written by Lu Xun (Microsoft) that enables [Prolog](https://en.wikipedia.org/wiki/Prolog)-like rule definition and query execution in C#.
While not necessary, reading Chapters 1-3 of the [learnprolognow](http://www.learnprolognow.org/lpnpage.php?pagetype=html&pageid=lpn-htmlch1) book can be quite useful. Note that the documentation here doesn't assume you have any experience with logic programming.
**Do I need experience with logic programming?**
No, using logic to express repair workflows is easy! You don't need deep knowledge of logic programming to write your own complex repair workflows. Let's start with an example to help inspire your own logic-based repair workflows.
***Problem***: I want to perform a code package restart if FabricObserver emits a memory usage warning for a *specific* application in my cluster (e.g. "fabric:/App1").
***Solution***: We can use Guan and its built-in equals operator to check the name of the application that triggered the warning against the name of the application for which we want to perform a code package restart. For application-level health events, the repair workflow is defined in the PackageRoot/Config/Rules/AppRules.config.txt file. Here is what we would enter:
```
Mitigate(AppName="fabric:/App1", MetricName="MemoryPercent") :- RestartCodePackage().
```
Don't be alarmed if you don't understand how to read that repair action! We will go more in-depth later about the syntax and semantics of Guan. The takeaway is that expressing a Guan repair workflow doesn't require a deep knowledge of Prolog programming to get started. Hopefully this also gives you a general idea about the kinds of repair workflows we can express with GuanLogic.
Each repair policy has its own corresponding configuration file:
| Repair Policy | Configuration File Name |
|---------------------------|------------------------------|
| AppRepairPolicy | AppRules.config.txt |
| DiskRepairPolicy | DiskRules.config.txt |
| FabricNodeRepairPolicy | FabricNodeRules.config.txt |
| ReplicaRepairPolicy | ReplicaRules.config.txt |
| SystemAppRepairPolicy | SystemAppRules.config.txt |
| VMRepairPolicy | VmRules.config.txt |
Now let's look at *how* to actually define a Guan logic repair workflow, so that you will have the knowledge necessary to express your own.
## Writing Logic Repair Workflows
This [site](https://www.metalevel.at/prolog/concepts) gives a good, fast overview of basic Prolog concepts which you may find useful.
The building blocks of Guan logic repair workflows are **Predicates**, which you use and compose to form rules. A predicate has a name and zero or more arguments. In FH, there are two different kinds of predicates: **Internal Predicates** and **External Predicates**. Internal predicates are equivalent to standard predicates in Prolog. An internal predicate defines relations between its arguments and other internal predicates. External predicates, on the other hand, behave more like functions. They are usually used to perform actions such as checking values, performing calculations, performing repairs, and binding values to variables.
Here is a list of currently implemented External Predicates:
**External Predicates**
```RestartCodePackage()```
Attempts to restart the code package for the service that emitted the health event, returns true if successful, else false.
```RestartFabricNode()```
Attempts to restart the node of the service that emitted the health event, returns true if successful, else false. Takes an optional Safe parameter: "safe" or "unsafe" which defines whether or not to perform a safe or unsafe node restart. A safe node restart will try to first deactivate the node before restarting, whereas an unsafe node restart will try restarting the node without first trying to deactivate it.
```RestartReplica()```
Attempts to restart the replica of the service that emitted the health event, returns true if successful, else false.
```RestartVM()```
Attempts to restart the underlying virtual machine of the service that emitted the health event, returns true if successful, else false.
**Forming a Logic Repair Workflow**
Now that we know what predicates are, let's learn how to form a logic repair workflow.
A GuanLogic program is expressed in terms of **rule(s)** and a program is executed by running a **query** over these rule(s). A rule is of the form:
```RuleHead(*args, ...*) :- PredicateA(), PredicateB(), ...```.
A simple way to understand what a rule is, is to just think of it as a function.
```
RuleHead(*args, ...*) -> Function signature (name + arguments)
PredicateA(), PredicateB(), ... -> Function body (everything to the right of the ":-" is part of the function body)
```
A query is simply a way to invoke a rule (function).
```RuleHead() -> invokes the rule of name "RuleHead" with the same number of arguments```
By default, for logic-based repair workflows, FH will execute a query which calls a rule named ```Mitigate()```. Think of ```Mitigate()``` as the root for executing the repair workflow, similar to the Main() function in most programming languages. FH supplies the following named arguments to ```Mitigate()```, and they can be passed on to the predicates that make up the repair workflow.
| Argument Name | Definition |
|---------------------------|----------------------------------------------------------------------------------------------|
| AppName | Name of the SF application, format is fabric:/SomeApp |
| ServiceName | Name of the SF service, format is fabric:/SomeApp/SomeService |
| NodeName | Name of the node |
| NodeType | Type of node |
| PartitionId | Id of the partition |
| ReplicaOrInstanceId | Id of the replica or instance |
| FOErrorCode | Error Code emitted by FO (e.g. "FO002") |
| MetricName | Name of the resource supplied by FO (e.g., CpuPercent or MemoryMB, etc.) |
| MetricValue | Corresponding Metric Value supplied by FO (e.g. "85" indicating 85% CPU usage) |
For example if you wanted to use AppName and ServiceName in your repair workflow you would specify them like so:
```
Mitigate(AppName=?x, ServiceName=?y) :- ..., ?x == "fabric:/App1", ...
## Now the variable ?x is bound to the application name and ?y is bound to the service name. You can name variables whatever you want.
```
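The binding described above can be pictured with a few lines of Python. This is only a sketch of the idea, not how Guan implements it: `bind_head` and the `event` dictionary are hypothetical names invented for this illustration, and the service name is made up.

```python
from typing import Dict

def bind_head(head: Dict[str, str], event: Dict[str, str]) -> Dict[str, str]:
    """Map each ?variable named in the rule head to the event's value for that argument."""
    return {variable: event[argument] for argument, variable in head.items()}

# Hypothetical health event data, keyed by the argument names from the table above.
event = {
    "AppName": "fabric:/App1",
    "ServiceName": "fabric:/App1/Service1",
    "MetricName": "MemoryPercent",
}

# Corresponds to the head Mitigate(AppName=?x, ServiceName=?y).
bindings = bind_head({"AppName": "?x", "ServiceName": "?y"}, event)
print(bindings)
```

Only the arguments named in the head are bound; the remaining event fields are simply ignored by this rule.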
Here's a simple example that we've seen before:
```
Mitigate() :- RestartCodePackage().
```
Essentially, this logic repair workflow (mitigation scenario) says that if FO emits a health event that falls under the AppServiceCpuMemoryPortAbuseRepairPolicy, and the repair policy is enabled, then FH will execute the repair action. FH automatically detects that it is a logic workflow, so it invokes the root rule ```Mitigate()```. Guan determines that the ```Mitigate()``` rule is defined inside the repair action and then tries to execute the body of the ```Mitigate()``` rule.
Users can define multiple rules (separated by a newline) as part of a repair workflow. Here is an example:
```
Mitigate() :- RestartCodePackage().
Mitigate() :- RestartFabricNode().
```
This may seem confusing, since we've defined ```Mitigate()``` twice. Here is the execution flow explained in words: Look for the *first* ```Mitigate()``` rule (read from top to bottom). The *first* ```Mitigate()``` rule is the one that calls ```RestartCodePackage()``` in its body, so we try to run that rule. If the first rule fails (i.e. ```RestartCodePackage()``` returns false), then we check to see if there is another rule named ```Mitigate()```, which there is. The next ```Mitigate()``` rule we find is the one that calls ```RestartFabricNode()```, so we try to run the second rule.
This concept of retrying rules is important to understand. Imagine your goal is that you want ```Mitigate()``` to return true, so you will try every rule named ```Mitigate()``` in order until one returns true, in which case the query stops. This concept of retrying rules is also how you can model conditional branches and boolean operators in GuanLogic repair workflows.
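The retry behaviour can be modelled outside of Guan. The following is a minimal Python sketch, not Guan's implementation: each rule body is a hypothetical function that returns True or False, and `run_query` is an illustrative helper that tries the rules from top to bottom and stops at the first success.

```python
from typing import Callable, List, Optional

Rule = Callable[[], bool]

def run_query(rules: List[Rule]) -> Optional[str]:
    """Try each rule in order and return the name of the first one that succeeds."""
    for rule in rules:
        if rule():
            return rule.__name__
    return None  # every rule failed, so the query fails

# Hypothetical stand-ins for the bodies of the two Mitigate() rules.
def restart_code_package() -> bool:
    return False  # pretend the code package restart failed

def restart_fabric_node() -> bool:
    return True   # pretend the node restart succeeded

winner = run_query([restart_code_package, restart_fabric_node])
print(winner)
```

Because the first rule fails, the second one is attempted. Had the first rule succeeded, the second would never have been tried.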
**Important Syntax Rules**: Each rule must end with a period. A single rule may be split up across multiple lines for readability:
```
Mitigate() :- PredicateA(), <-- The first predicate in the rule must be inline with the Head of the rule like so
PredicateB(),
PredicateC().
```
The following would be invalid:
```
Mitigate() :-
PredicateA(),
PredicateB(),
PredicateC().
```
**Modelling Boolean Operators**
Let's look at how we can create AND/OR/NOT statements in Guan logic repair workflows.
**NOT**
```
Mitigate() :- not((condition A T/F)), (true branch B).
Can be read as: if (!A) then goto B
```
NOT behaviour is achieved by wrapping any predicate inside ```not()``` which is a built-in GuanLogic predicate.
**AND**
```
Mitigate() :- (condition A T/F), (condition B T/F), (true branch C).
Can be read as: if (A and B) then goto C
```
AND behaviour is achieved by separating predicates with commas, similar to the ```&&``` operator in many programming languages.
**OR**
```
Mitigate() :- (condition A T/F), (true branch C).
Mitigate() :- (condition B T/F), (true branch C).
Can be read as: if (A or B) then goto C
```
OR behaviour is achieved by separating predicates by rule. Here is the execution flow for the above workflow: Go to the first ```Mitigate()``` rule -> does predicate A succeed? If so, continue with branch C; if it fails, look for the next ```Mitigate()``` rule, if one exists. We find the second ```Mitigate()``` rule -> does predicate B succeed? If so, continue with branch C; if it fails, look for the next ```Mitigate()``` rule, if one exists. There are no more ```Mitigate()``` rules, so the workflow is over.
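As a rough analogy, the NOT, AND and OR patterns above map onto ordinary boolean evaluation. The Python sketch below is an illustration under that assumption, not a description of Guan internals; `body_succeeds`, `query_succeeds` and `negate` are hypothetical helper names.

```python
from typing import Callable, List

Predicate = Callable[[], bool]
RuleBody = List[Predicate]

def body_succeeds(body: RuleBody) -> bool:
    # AND: every predicate in one rule body must succeed; all() stops at the first failure.
    return all(predicate() for predicate in body)

def query_succeeds(rules: List[RuleBody]) -> bool:
    # OR: the query succeeds as soon as one rule body succeeds.
    return any(body_succeeds(body) for body in rules)

def negate(predicate: Predicate) -> Predicate:
    # NOT: succeeds exactly when the wrapped predicate fails.
    return lambda: not predicate()

# Hypothetical conditions and branch.
condition_a: Predicate = lambda: False
condition_b: Predicate = lambda: True
branch_c: Predicate = lambda: True

and_result = query_succeeds([[condition_a, condition_b, branch_c]])
or_result = query_succeeds([[condition_a, branch_c], [condition_b, branch_c]])
not_result = query_succeeds([[negate(condition_a), branch_c]])
print(and_result, or_result, not_result)
```

With A false and B true, the AND workflow fails, while the OR and NOT workflows reach branch C.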
**Conditional Branches in Logic Programming**
An if/else conditional branch can be constructed with the following rule pattern:
```
Mitigate() :- (condition T/F), !, (true branch).
Mitigate() :- (false branch).
```
Notice the ```!``` symbol. This is called the cut operator in Prolog, and it prevents backtracking past the point where it appears. To see why it is needed, consider the same rules without the cut. If the conditional check succeeds, execution continues into the ```(true branch)```. However, if the ```(true branch)``` returns false, the first rule fails and Guan will "backtrack" and try to execute the second rule, so the execution flow ends up in the ```(false branch)``` of the second rule. Clearly this is not how traditional if/else conditionals work, which is why the cut operator is needed. As long as you understand this concept, you may remove the cut operator if that is the behaviour you want.
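The effect of the cut can be seen in a small Python model. This is a sketch under simplifying assumptions (each rule has one condition and one branch), not Guan's evaluation engine; the `run` helper and the branch functions are hypothetical.

```python
from typing import Callable, List, Tuple

Step = Callable[[], bool]
# (condition, use_cut, branch)
Rule = Tuple[Step, bool, Step]

def run(rules: List[Rule], trace: List[str]) -> bool:
    """Evaluate rules top to bottom, recording which branches were executed."""
    for condition, use_cut, branch in rules:
        if not condition():
            continue              # condition failed: try the next rule
        trace.append(branch.__name__)
        if branch():
            return True           # branch succeeded: the query is done
        if use_cut:
            return False          # cut: committed to this rule, no backtracking
    return False

def always_true() -> bool:
    return True

def true_branch() -> bool:
    return False  # pretend the repair in the true branch failed

def false_branch() -> bool:
    return True

with_cut: List[str] = []
run([(always_true, True, true_branch), (always_true, False, false_branch)], with_cut)

without_cut: List[str] = []
run([(always_true, False, true_branch), (always_true, False, false_branch)], without_cut)

print(with_cut, without_cut)
```

With the cut, only the true branch runs even though it fails. Without it, the false branch also runs after the true branch fails, which is the backtracking behaviour described above.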
This pattern can also be repeated to construct else if branches:
```
Mitigate() :- (condition T/F), !, (true branch).
Mitigate() :- (condition T/F), !, (true branch).
Mitigate() :- …
Mitigate() :- (false branch).
```
You can check for multiple conditions as well:
```
Mitigate() :- (condition check_1 T/F), (condition check_2 T/F), …, (condition check_N T/F), (true branch).
Mitigate() :- (false branch).
```
**Using internal predicates**
So far we've only looked at creating rules that are invoked from the root ```Mitigate()``` query, but users can also create their own rules like so:
```
MyInternalPredicate() :- RestartCodePackage().
Mitigate() :- MyInternalPredicate().
```
Here we've defined an internal predicate named ```MyInternalPredicate()``` and we can see that it is invoked in the body of the ```Mitigate()``` rule. In order to fulfill the ```Mitigate()``` rule, we will need to fulfill the ```MyInternalPredicate()``` predicate since it is part of the body of the ```Mitigate()``` rule. This repair workflow is identical in behaviour to one that directly calls ```RestartCodePackage()``` inside the body of ```Mitigate()```.
Using internal predicates like this is useful for improving readability and organizing complex repair workflows.
With internal predicates, you can conveniently configure the run interval for a repair (how often to run the repair).
The ```IntervalForRepairTarget``` predicate below lets us name a repair target and the run interval that applies to it, so a rule can determine where we are relative to that interval for the specific repair.
Like all repair configurations in FH, these run interval settings for the various repair targets are defined as part of the rules themselves. This is a key part (and advantage) of configuration as logic.
If we are inside the supplied RunInterval, then cut (!), which here effectively means stop processing rules. The related Mitigate rule (the one that employs the internal predicate) will run for each IntervalForRepairTarget predicate specification.
```
IntervalForRepairTarget(AppName="fabric:/CpuStress", RunInterval=00:15:00).
IntervalForRepairTarget(AppName="fabric:/ContainerFoo2", RunInterval=00:15:00).
IntervalForRepairTarget(MetricName="ActiveTcpPorts", RunInterval=00:15:00).
Mitigate() :- IntervalForRepairTarget(Target=?target, RunInterval=?timespan), CheckInsideRunInterval(RunInterval=?timespan), !.
```
IMPORTANT: The CheckInsideRunInterval predicate compares your specified RunInterval TimeSpan value against repair state held by our friendly neighborhood RepairManagerService (RM), a stateful Service Fabric System Service that orchestrates repairs
and manages repair state. ***FH requires the presence of RM in order to function***.
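To make the run interval check concrete, here is a minimal Python sketch. It assumes that "inside the run interval" means that less time than RunInterval has elapsed since the last repair for the target; that reading, the `inside_run_interval` helper and the timestamps are assumptions for illustration, since the real predicate obtains its data from RM.

```python
from datetime import datetime, timedelta

def parse_timespan(text: str) -> timedelta:
    """Parse an hh:mm:ss value such as 00:15:00."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

def inside_run_interval(last_repair: datetime, now: datetime, run_interval: str) -> bool:
    # Assumed meaning: the previous repair happened less than run_interval ago.
    return now - last_repair < parse_timespan(run_interval)

now = datetime(2021, 2, 10, 12, 0, 0)
recent = inside_run_interval(datetime(2021, 2, 10, 11, 50, 0), now, "00:15:00")
stale = inside_run_interval(datetime(2021, 2, 10, 11, 40, 0), now, "00:15:00")
print(recent, stale)
```

Under this assumption, a repair that ran 10 minutes ago is still inside a 15 minute interval, so rule processing would stop, while one that ran 20 minutes ago is not.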
Let's look at another example of an internal predicate that is used in FH's SystemAppRules rules file: TimeScopedRestartFabricNode, which is a simple convenience internal predicate used to check the number of times a repair has run to completion within a supplied time window.
If the completed repair count is less than the supplied value, then run the RestartFabricNode mitigation. Here, you can see it removes the need to write the same logic in multiple places.
```
TimeScopedRestartFabricNode(?count, ?time) :- GetRepairHistory(?repairCount, TimeWindow=?time), ?repairCount < ?count,
RestartFabricNode().
## CPU Time - Percent
Mitigate(AppName="fabric:/System", MetricName="CpuPercent", MetricValue=?MetricValue) :- ?MetricValue >= 80,
TimeScopedRestartFabricNode(5, 01:00:00).
## Memory Use - Megabytes in use
Mitigate(AppName="fabric:/System", MetricName="MemoryMB", MetricValue=?MetricValue) :- ?MetricValue >= 2048,
TimeScopedRestartFabricNode(5, 01:00:00).
## Memory Use - Percent in use
Mitigate(AppName="fabric:/System", MetricName="MemoryPercent", MetricValue=?MetricValue) :- ?MetricValue >= 40,
TimeScopedRestartFabricNode(5, 01:00:00).
## Ephemeral Ports in Use
Mitigate(AppName="fabric:/System", MetricName="EphemeralPorts", MetricValue=?MetricValue) :- ?MetricValue >= 800,
TimeScopedRestartFabricNode(5, 01:00:00).
```
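The shape of TimeScopedRestartFabricNode can be mirrored in Python to show the order of the checks. This is a sketch under assumptions, not FH code: the list of completion timestamps stands in for the repair history that FH obtains elsewhere, and `get_repair_history`, `time_scoped_restart` and `fake_restart` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Callable, List

def get_repair_history(completed: List[datetime], now: datetime, window: timedelta) -> int:
    """Count completed repairs whose timestamp falls inside the time window ending at now."""
    return sum(1 for finished in completed if now - window <= finished <= now)

def time_scoped_restart(completed: List[datetime], now: datetime, max_count: int,
                        window: timedelta, restart: Callable[[], bool]) -> bool:
    # Mirrors the rule body: get the count, compare it, then attempt the repair.
    repair_count = get_repair_history(completed, now, window)
    return repair_count < max_count and restart()

def fake_restart() -> bool:
    return True  # stand-in for RestartFabricNode() succeeding

now = datetime(2021, 2, 10, 12, 0, 0)
history = [now - timedelta(minutes=m) for m in (5, 10, 20, 30, 90)]

count = get_repair_history(history, now, timedelta(hours=1))
allowed = time_scoped_restart(history, now, 5, timedelta(hours=1), fake_restart)
blocked = time_scoped_restart(history, now, 4, timedelta(hours=1), fake_restart)
print(count, allowed, blocked)
```

Four of the five sample repairs fall inside the one hour window, so a limit of 5 allows another restart and a limit of 4 does not.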
**Filtering parameters from Mitigate()**
If you wish to do equals checks such as ```?AppName == ...``` you don't actually need to write this in the body of your rules, instead you can specify these values inside Mitigate() like so:
```
## This is the preferred way to do this. It is easier to read and employs less (unnecessary) basic logic.
Mitigate(AppName="fabric:/App1") :- ...
```
This means that the rule will only execute when the AppName is equal to "fabric:/App1". This is equivalent to the following:
```
Mitigate(AppName=?AppName) :- ?AppName == "fabric:/App1", ...
```
Obviously, the first way of doing it is more succinct and, again, preferred.
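The equivalence between the two forms can be illustrated in Python. This is only a sketch of the matching idea, with hypothetical helpers `head_matches` and `body_check` and made-up event data; it does not describe how Guan performs unification.

```python
from typing import Dict

def head_matches(head: Dict[str, str], event: Dict[str, str]) -> bool:
    """Return True when every constant in the rule head equals the event's value."""
    return all(event.get(name) == expected for name, expected in head.items())

def body_check(event: Dict[str, str]) -> bool:
    # Counterpart of writing ?AppName == "fabric:/App1" in the rule body.
    return event.get("AppName") == "fabric:/App1"

# Hypothetical health events.
app1_event = {"AppName": "fabric:/App1", "MetricName": "MemoryPercent"}
app2_event = {"AppName": "fabric:/App2", "MetricName": "MemoryPercent"}

# Corresponds to the head Mitigate(AppName="fabric:/App1").
head = {"AppName": "fabric:/App1"}

for event in (app1_event, app2_event):
    print(event["AppName"], head_matches(head, event), body_check(event))
```

Both checks agree for every event; the head form simply moves the comparison out of the rule body.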

94
Documentation/Using.md Normal file

@ -0,0 +1,94 @@
# Using FabricHealer - Scenarios
Please [download the Guan nupkg](https://github.com/microsoft/Guan/releases/download/nupkg1.0/Microsoft.ServiceFabricApps.Guan.1.0.0.nupkg) to your local dev machine and install it into your local FH project in order to build FH successfully. This will be unnecessary when FH ships in Public Preview as Guan will be shipping concurrently and the Guan nupkg will be available in the nuget.org package gallery, as will FH.
To learn how to create your own GuanLogic repair workflows, click [here](LogicWorkflows.md).
**Application Memory Usage Warning -> Trigger Code Package Restart**
***Problem***: I want to perform a code package restart if FabricObserver emits a memory usage warning (as a percentage of total memory) for any application in my cluster.
***Solution***: We can use the predefined "RestartCodePackage" repair action.
Navigate to the PackageRoot/Config/Rules/AppRules.config.txt file and copy and paste this repair workflow:
```
Mitigate(MetricName="MemoryPercent") :- RestartCodePackage().
```
**System Application CPU Usage Warning -> Trigger Fabric Node Restart**
***Problem***: I want to perform a fabric node restart if FabricObserver emits a CPU usage warning for any system application in my cluster.
***Solution***: We can use the predefined "RestartFabricNode" repair action.
Navigate to the PackageRoot/Config/Rules/SystemAppRules.config.txt file and copy and paste this repair workflow:
```
## CPU Time - Percent
Mitigate(AppName="fabric:/System", MetricName="CpuPercent", MetricValue=?MetricValue) :- ?MetricValue >= 90,
GetRepairHistory(?repairCount, TimeWindow=01:00:00),
?repairCount < 5,
RestartFabricNode().
```
**Please note that ## is how comment lines are specified in FabricHealer's logic rules. They are not block comments and apply to single lines only.**
```
## this is a comment on one line. I do not span
## lines. See? :)
```
**GetRepairHistory** is an *external* predicate. That is, it is neither a Guan system predicate (implemented in the Guan runtime) nor an internal predicate (which exists only within and as part of the rule and has no backing implementation). It is user-implemented;
look in the [FabricHealer/Repair/Guan](/FabricHealer/Repair/Guan) folder to see all external predicate implementations.
GetRepairHistory takes a time span formatted value as its only input, TimeWindow, and has one output variable, ?repairCount, which holds the value computed by the predicate call. TimeWindow is the span of time in which
completed repairs have occurred for the repair type (in this case, app-level repairs for an application named "fabric:/System"). ?repairCount can then be used in subsequent logic within the same rule (not in all rules in the file,
just the rule that it is a part of). You can see a more advanced approach in the [AppRules](/FabricHealer/PackageRoot/Config/Rules/AppRules.config.txt) and [SystemAppRules](/FabricHealer/PackageRoot/Config/Rules/SystemAppRules.config.txt) files where, rather than having each rule run the same check, a convenience internal predicate that takes arguments is used.
Repair type is implicitly or explicitly specified in the query. Implicitly, FH already knows the context internally when this rule is run, since it gets the related information from FabricObserver's
health report and passes each metric as a default argument available to the query (Mitigate, in this case). To be clear, in the above example, AppName is one of the default named arguments available to Mitigate, and its corresponding
value is passed from FabricObserver in health report data (held within a serialized instance of the TelemetryData type). Learn more [here](LogicWorkflows.md).
Here, we use the named argument expression AppName to say "when the app name is \"fabric:/System\"".
***IMPORTANT: Whenever a value contains characters that Guan treats as arithmetic operators but that are not mathematical in nature (a forward slash, for example), you must "quote" the value.
If you do not do this, then Guan will assume you want it to do some arithmetic operation with the value, which in the case of something like "fabric:/System"
or "fabric:/MyApp42" you certainly do not want.***
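The convenience internal predicate mentioned above might look like the following sketch. The predicate name TimeScopedRestartCodePackage, its argument order, and the threshold values are assumptions made for illustration; they are not necessarily what the shipped rules files define. Only GetRepairHistory and RestartCodePackage are taken from the examples in this document.

```
## Hypothetical internal predicate: it has no backing implementation and exists only in this rules file.
## ?count is the maximum number of repairs allowed within the time window ?time.
TimeScopedRestartCodePackage(?count, ?time) :- GetRepairHistory(?repairCount, TimeWindow=?time),
	?repairCount < ?count,
	RestartCodePackage().
## Each rule can now reuse the check instead of repeating it.
Mitigate(MetricName="CpuPercent", MetricValue=?MetricValue) :- ?MetricValue >= 90, TimeScopedRestartCodePackage(5, 01:00:00).
Mitigate(MetricName="MemoryPercent", MetricValue=?MetricValue) :- ?MetricValue >= 30, TimeScopedRestartCodePackage(5, 01:00:00).
```

The benefit of this shape is that the repair count limit and time window live in one place per call, so changing the policy for one metric does not require editing the repair logic itself.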
***Problem***: I want to specify different repair actions for different applications.
***Solution***:
```
Mitigate(AppName="fabric:/SampleApp1") :- RepairApp1().
Mitigate(AppName="fabric:/SampleApp2") :- RepairApp2().
RepairApp1() :- ...
RepairApp2() :- ...
```
Here, ```RepairApp1()``` and ```RepairApp2()``` are custom rules. The above workflow can be read as follows: if ```?AppName``` is equal to ```fabric:/SampleApp1```, then we invoke the rule named ```RepairApp1```. From there, ```RepairApp1``` is executed just like any other rule, such as ```Mitigate```.
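As a sketch of what such custom rules could contain, the bodies below are purely hypothetical and only reuse predicates that appear elsewhere in this repo (GetRepairHistory, RestartCodePackage, RestartReplica). The predicate name RestartReplica is assumed to match its constant in RepairConstants. Your own rules would hold whatever repair logic each application needs.

```
Mitigate(AppName="fabric:/SampleApp1") :- RepairApp1().
Mitigate(AppName="fabric:/SampleApp2") :- RepairApp2().
## Hypothetical body: restart the code package, but at most 5 times per hour.
RepairApp1() :- GetRepairHistory(?repairCount, TimeWindow=01:00:00),
	?repairCount < 5,
	RestartCodePackage().
## Hypothetical body: always restart the offending replica.
RepairApp2() :- RestartReplica().
```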
***Problem***: I want to check the observed value for the supplied resource metric (Cpu, Disk, Memory, etc.) and ensure that we are within the specified run interval before running the RestartCodePackage repair on any app service that FabricObserver is monitoring.
***Solution***:
```
## First, check if we are inside run interval. If so, then cut (!).
Mitigate() :- CheckInsideRunInterval(RunInterval=02:00:00), !.
## CPU Time - Percent
Mitigate(MetricName="CpuPercent", MetricValue=?MetricValue) :- ?MetricValue >= 20,
GetRepairHistory(?repairCount, TimeWindow=01:00:00),
?repairCount < 5,
RestartCodePackage().
```
***Problem***: I want to check the observed value for the supplied resource metric (Cpu, Disk, Memory, etc.) and ensure that we are within the specified run interval before running the RestartCodePackage repair on any service belonging to the specified Application that FabricObserver is monitoring.
***Solution***:
```
## CPU Time - Percent
Mitigate(AppName="fabric:/MyApp42", MetricName="CpuPercent", MetricValue=?MetricValue) :- ?MetricValue >= 20,
GetRepairHistory(?repairCount, TimeWindow=01:00:00),
?repairCount < 5,
RestartCodePackage().
```

Binary data
FHDT.png Normal file

Binary file not shown.

Size: 36 KiB

50
FHTest/FHTest.csproj Normal file

@ -0,0 +1,50 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<TargetFramework>netcoreapp3.1</TargetFramework>
<IsPackable>false</IsPackable>
<Platforms>AnyCPU;x64</Platforms>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|AnyCPU'">
<PlatformTarget>x64</PlatformTarget>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|x64'">
<PlatformTarget>x64</PlatformTarget>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|AnyCPU'">
<PlatformTarget>x64</PlatformTarget>
</PropertyGroup>
<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Release|x64'">
<PlatformTarget>x64</PlatformTarget>
</PropertyGroup>
<ItemGroup>
<PackageReference Include="Microsoft.NET.Test.Sdk" Version="16.8.3" />
<PackageReference Include="MSTest.TestAdapter" Version="2.1.2" />
<PackageReference Include="MSTest.TestFramework" Version="2.1.2" />
<PackageReference Include="coverlet.collector" Version="3.0.2">
<PrivateAssets>all</PrivateAssets>
<IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
</ItemGroup>
<ItemGroup>
<ProjectReference Include="..\FabricHealer\FabricHealer.csproj" />
</ItemGroup>
<ItemGroup>
<None Update="testrules_wellformed">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
<None Update="testrules_malformed">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
</ItemGroup>
</Project>

288
FHTest/FHUnitTests.cs Normal file

@ -0,0 +1,288 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using Microsoft.VisualStudio.TestTools.UnitTesting;
using FabricHealer.Repair;
using Guan.Logic;
using System.Collections.Generic;
using System.Threading.Tasks;
using System;
using System.Fabric;
using System.Threading;
using FabricHealer.Repair.Guan;
using System.Diagnostics;
using System.IO;
using Guan.Common;
using System.Linq;
using FabricHealer.Utilities.Telemetry;
using FabricHealer.Utilities;
namespace FHTest
{
[TestClass]
public class FHUnitTests
{
private static readonly Uri ServiceName = new Uri("fabric:/app/service");
private static readonly ICodePackageActivationContext CodePackageContext
= new MockCodePackageActivationContext(
ServiceName.AbsoluteUri,
"applicationType",
"Code",
"1.0.0.0",
Guid.NewGuid().ToString(),
@"C:\Log",
@"C:\Temp",
@"C:\Work",
"ServiceManifest",
"1.0.0.0");
private readonly StatelessServiceContext context
= new StatelessServiceContext(
new NodeContext("Node0", new NodeId(0, 1), 0, "NodeType1", "TEST.MACHINE"),
CodePackageContext,
"FabricHealer.FabricHealerType",
ServiceName,
null,
Guid.NewGuid(),
long.MaxValue);
private readonly CancellationToken token = CancellationToken.None;
// Set this to the full path to your Rules directory in the FabricHealer project's PackageRoot\Config directory.
// e.g., if on Windows, then something like @"C:\Users\[me]\source\repos\service-fabric-healer\FabricHealer\PackageRoot\Config\Rules\";
private const string FHRulesDirectory = @"C:\Users\ctorre\source\repos\service-fabric-healer\FabricHealer\PackageRoot\Config\Rules\";
public FHUnitTests()
{
}
/* GuanLogic Tests */
// TODO: More of them.
// This test ensures your actual rule files contain legitimate rules. This will catch bugs in your
// logic. Of course, you should have caught these flaws in your end-to-end tests. This is just an extra precaution.
[TestMethod]
public async Task TestGuanLogic_AllRules_FabricHealer_EnsureWellFormedRules_QueryInitialized()
{
TelemetryData foHealthData = new TelemetryData
{
ApplicationName = "fabric:/test0",
NodeName = "TEST_0",
RepairId = "Test42",
Code = FabricObserverErrorWarningCodes.AppErrorMemoryMB,
ServiceName = "fabric:/test0/service0",
};
RepairExecutorData executorData = new RepairExecutorData
{
RepairAction = RepairAction.RestartCodePackage,
};
foreach (var file in Directory.GetFiles(FHRulesDirectory))
{
List<string> rules = File.ReadAllLines(file).ToList();
List<string> repairAction = ParseRulesFile(rules);
// A GuanException thrown here propagates and fails the test.
Assert.IsTrue(await TestInitializeGuanAndRunQuery(foHealthData, repairAction, executorData).ConfigureAwait(false));
}
Assert.IsTrue(true);
}
// This test ensures a given rule can successfully be turned into a GL query.
// This means that the rule is well-formed logic and that the referenced predicates exist.
// So, if the rule is malformed or not a logic rule or no predicate exists as written, this test will fail.
[TestMethod]
public async Task TestGuanLogicRule_GoodRule_QueryInitialized()
{
string testRulesFilePath = Path.Combine(Environment.CurrentDirectory, "testrules_wellformed");
string[] rules = await File.ReadAllLinesAsync(testRulesFilePath).ConfigureAwait(false);
List<string> repairAction = ParseRulesFile(rules.ToList());
TelemetryData foHealthData = new TelemetryData
{
ApplicationName = "fabric:/test0",
NodeName = "TEST_0",
Metric = "Memory",
RepairId = "Test42",
Code = FabricObserverErrorWarningCodes.AppErrorMemoryMB,
ServiceName = "fabric:/test0/service0",
Value = 42,
ReplicaId = default(long).ToString(),
PartitionId = default(Guid).ToString(),
};
RepairExecutorData executorData = new RepairExecutorData
{
RepairAction = RepairAction.RestartCodePackage,
};
// A GuanException thrown here propagates and fails the test.
Assert.IsTrue(await TestInitializeGuanAndRunQuery(foHealthData, repairAction, executorData).ConfigureAwait(false));
Assert.IsTrue(true);
}
// All rules in target rules file are malformed. They should all lead to GuanExceptions.
// If they do not lead to a GuanException from TestInitializeGuanAndRunQuery, then this test will fail.
[TestMethod]
public async Task TestGuanLogicRule_BadRule_ShouldThrowGuanException()
{
string[] rules = await File.ReadAllLinesAsync(Path.Combine(Environment.CurrentDirectory, "testrules_malformed")).ConfigureAwait(false);
List<string> repairAction = ParseRulesFile(rules.ToList());
TelemetryData foHealthData = new TelemetryData
{
ApplicationName = "fabric:/test0",
NodeName = "TEST_0",
Metric = "Memory",
RepairId = "Test42",
Code = FabricObserverErrorWarningCodes.AppErrorMemoryMB,
ServiceName = "fabric:/test0/service0",
Value = 42,
ReplicaId = default(long).ToString(),
PartitionId = default(Guid).ToString(),
};
RepairExecutorData executorData = new RepairExecutorData
{
RepairAction = RepairAction.RestartCodePackage,
};
await Assert.ThrowsExceptionAsync<GuanException>(async () => { await TestInitializeGuanAndRunQuery(foHealthData, repairAction, executorData); });
}
/* FH Repair Scheduler Tests */
// TODO.
/* FH Repair Executor Tests */
// TODO.
[ClassCleanup]
public static void TestClassCleanup()
{
}
/* private Helpers */
private bool IsLocalSFRuntimePresent()
{
try
{
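var ps = Process.GetProcessesByName("Fabric");
// A null array must count as "not present"; the original null-conditional comparison evaluated to true for null.
return ps != null && ps.Length > 0;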
}
catch (InvalidOperationException)
{
return false;
}
}
private async Task<bool> TestInitializeGuanAndRunQuery(
TelemetryData foHealthData,
List<string> repairRules,
RepairExecutorData executorData)
{
var fabricClient = new FabricClient(FabricClientRole.Admin);
var repairTaskHelper = new RepairTaskManager(fabricClient, this.context, this.token);
var repairTaskEngine = new RepairTaskEngine(fabricClient);
// ----- Guan Processing Logic -----
// Add predicate types to functor table, note that all health information fields are automatically passed to all predicates.
// This enables access to values in queries. See Mitigate() in rules files, for examples.
FunctorTable functorTable = new FunctorTable();
// Add external helper predicates.
functorTable.Add(CheckFolderSizePredicateType.Singleton(RepairConstants.CheckFolderSize, repairTaskHelper, foHealthData));
functorTable.Add(GetRepairHistoryPredicateType.Singleton(RepairConstants.GetRepairHistory, repairTaskHelper, foHealthData));
functorTable.Add(CheckInsideRunIntervalPredicateType.Singleton(RepairConstants.CheckInsideRunInterval, repairTaskHelper, foHealthData));
// Add external repair predicates.
functorTable.Add(DeleteFilesPredicateType.Singleton(RepairConstants.DeleteFiles, repairTaskHelper, foHealthData));
functorTable.Add(RestartCodePackagePredicateType.Singleton(RepairConstants.RestartCodePackage, repairTaskHelper, foHealthData));
functorTable.Add(RestartFabricNodePredicateType.Singleton(RepairConstants.RestartFabricNode, repairTaskHelper, executorData, repairTaskEngine, foHealthData));
functorTable.Add(RestartReplicaPredicateType.Singleton(RepairConstants.RestartReplica, repairTaskHelper, foHealthData));
functorTable.Add(RestartVMPredicateType.Singleton(RepairConstants.RestartVM, repairTaskHelper, foHealthData));
// Parse rules
_ = Module.Parse("Module", repairRules, functorTable);
// Create guan query
List<CompoundTerm> terms = new List<CompoundTerm>();
CompoundTerm term = new CompoundTerm("Mitigate");
/* Pass default arguments in query */
term.AddArgument(new Constant(foHealthData.ApplicationName), RepairConstants.AppName);
term.AddArgument(new Constant(foHealthData.Code), RepairConstants.FOErrorCode);
term.AddArgument(new Constant(foHealthData.Metric), RepairConstants.MetricName);
term.AddArgument(new Constant(foHealthData.Value), RepairConstants.MetricValue);
term.AddArgument(new Constant(foHealthData.NodeName), RepairConstants.NodeName);
term.AddArgument(new Constant(foHealthData.NodeType), RepairConstants.NodeType);
term.AddArgument(new Constant(foHealthData.ServiceName), RepairConstants.ServiceName);
term.AddArgument(new Constant(foHealthData.PartitionId), RepairConstants.PartitionId);
term.AddArgument(new Constant(foHealthData.ReplicaId), RepairConstants.ReplicaOrInstanceId);
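// Note: the query term is built above but not executed here. Reaching this point means the rules parsed without a GuanException.
return await Task.FromResult(true);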
}
private List<string> ParseRulesFile(List<string> rules)
{
var repairRules = new List<string>();
int ptr1 = 0;
int ptr2 = 0;
rules = rules.Where(s => !string.IsNullOrWhiteSpace(s)).ToList();
while (ptr1 < rules.Count && ptr2 < rules.Count)
{
// Single line comments removal.
if (rules[ptr2].StartsWith("##"))
{
ptr1++;
ptr2++;
continue;
}
if (rules[ptr2].EndsWith("."))
{
if (ptr1 == ptr2)
{
repairRules.Add(rules[ptr2].Remove(rules[ptr2].Length - 1, 1));
}
else
{
string rule = rules[ptr1].TrimEnd(' ');
for (int i = ptr1 + 1; i <= ptr2; i++)
{
rule = rule + ' ' + rules[i].Replace('\t', ' ').TrimStart(' ');
}
repairRules.Add(rule.Remove(rule.Length - 1, 1));
}
ptr2++;
ptr1 = ptr2;
}
else
{
ptr2++;
}
}
return repairRules;
}
}
}


@ -0,0 +1,200 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Fabric;
using System.Fabric.Description;
using System.Fabric.Health;
namespace FHTest
{
public class MockCodePackageActivationContext : ICodePackageActivationContext
{
/// <summary>
/// Initializes a new instance of the <see cref="MockCodePackageActivationContext"/> class.
/// </summary>
/// <param name="applicationName">applicationName.</param>
/// <param name="applicationTypeName">applicationTypeName.</param>
/// <param name="codePackageName">codePackageName.</param>
/// <param name="codePackageVersion">codePackageVersion.</param>
/// <param name="context">context.</param>
/// <param name="logDirectory">logDirectory.</param>
/// <param name="tempDirectory">tempDirectory.</param>
/// <param name="workDirectory">workDirectory.</param>
/// <param name="serviceManifestName">serviceManifestName.</param>
/// <param name="serviceManifestVersion">serviceManifestVersion.</param>
public MockCodePackageActivationContext(
string applicationName,
string applicationTypeName,
string codePackageName,
string codePackageVersion,
string context,
string logDirectory,
string tempDirectory,
string workDirectory,
string serviceManifestName,
string serviceManifestVersion)
{
this.ApplicationName = applicationName;
this.ApplicationTypeName = applicationTypeName;
this.CodePackageName = codePackageName;
this.CodePackageVersion = codePackageVersion;
this.ContextId = context;
this.LogDirectory = logDirectory;
this.TempDirectory = tempDirectory;
this.WorkDirectory = workDirectory;
this.ServiceManifestName = serviceManifestName;
this.ServiceManifestVersion = serviceManifestVersion;
}
private string ServiceManifestName { get; set; }
private string ServiceManifestVersion { get; set; }
public string ApplicationName { get; private set; }
public string ApplicationTypeName { get; private set; }
public string CodePackageName { get; private set; }
public string CodePackageVersion { get; private set; }
public string ContextId { get; private set; }
public string LogDirectory { get; private set; }
public string TempDirectory { get; private set; }
public string WorkDirectory { get; private set; }
// Events required by the interface. They are never used, so warning CS0067 ("The event is never used") is suppressed for them.
#pragma warning disable CS0067
public event EventHandler<PackageAddedEventArgs<CodePackage>> CodePackageAddedEvent;
public event EventHandler<PackageModifiedEventArgs<CodePackage>> CodePackageModifiedEvent;
public event EventHandler<PackageRemovedEventArgs<CodePackage>> CodePackageRemovedEvent;
public event EventHandler<PackageAddedEventArgs<ConfigurationPackage>> ConfigurationPackageAddedEvent;
public event EventHandler<PackageModifiedEventArgs<ConfigurationPackage>> ConfigurationPackageModifiedEvent;
public event EventHandler<PackageRemovedEventArgs<ConfigurationPackage>> ConfigurationPackageRemovedEvent;
public event EventHandler<PackageAddedEventArgs<DataPackage>> DataPackageAddedEvent;
public event EventHandler<PackageModifiedEventArgs<DataPackage>> DataPackageModifiedEvent;
public event EventHandler<PackageRemovedEventArgs<DataPackage>> DataPackageRemovedEvent;
#pragma warning restore CS0067
public ApplicationPrincipalsDescription GetApplicationPrincipals()
{
return default(ApplicationPrincipalsDescription);
}
public IList<string> GetCodePackageNames()
{
return new List<string>() { this.CodePackageName };
}
public CodePackage GetCodePackageObject(string packageName)
{
return default(CodePackage);
}
public IList<string> GetConfigurationPackageNames()
{
return new List<string>() { string.Empty };
}
public ConfigurationPackage GetConfigurationPackageObject(string packageName)
{
return default(ConfigurationPackage);
}
public IList<string> GetDataPackageNames()
{
return new List<string>() { string.Empty };
}
public DataPackage GetDataPackageObject(string packageName)
{
return default(DataPackage);
}
public EndpointResourceDescription GetEndpoint(string endpointName)
{
return default(EndpointResourceDescription);
}
public KeyedCollection<string, EndpointResourceDescription> GetEndpoints()
{
return null;
}
public KeyedCollection<string, ServiceGroupTypeDescription> GetServiceGroupTypes()
{
return null;
}
public string GetServiceManifestName()
{
return this.ServiceManifestName;
}
public string GetServiceManifestVersion()
{
return this.ServiceManifestVersion;
}
public KeyedCollection<string, ServiceTypeDescription> GetServiceTypes()
{
return null;
}
public void ReportApplicationHealth(HealthInformation healthInformation)
{
}
public void ReportDeployedServicePackageHealth(HealthInformation healthInformation)
{
}
public void ReportDeployedApplicationHealth(HealthInformation healthInformation)
{
}
private bool disposedValue; // To detect redundant calls
protected virtual void Dispose(bool disposing)
{
if (this.disposedValue)
{
return;
}
if (disposing)
{
// TODO: dispose managed state (managed objects).
}
this.disposedValue = true;
}
public void Dispose()
{
// Do not change this code. Put cleanup code in Dispose(bool disposing) above.
this.Dispose(true);
}
public void ReportApplicationHealth(HealthInformation healthInfo, HealthReportSendOptions sendOptions)
{
}
public void ReportDeployedApplicationHealth(HealthInformation healthInfo, HealthReportSendOptions sendOptions)
{
}
public void ReportDeployedServicePackageHealth(HealthInformation healthInfo, HealthReportSendOptions sendOptions)
{
}
}
}


@ -0,0 +1,2 @@
Whatever() :- IDontExistPredicate()
Mitigation(AppName="fabric:/TestApp0") :- RestartCodePackage().


@ -0,0 +1,17 @@
Mitigate() :- interval(AppName=?source, RunInterval=?timespan), CheckInsideRunInterval(RunInterval=?timespan), !.
interval(AppName="fabric:/CpuStress", RunInterval=00:15:00).
interval(AppName="fabric:/ContainerFoo2", RunInterval=00:15:00).
interval(MetricName="ActiveTcpPorts", RunInterval=00:15:00).
## CPU - Percent In Use.
Mitigate(AppName="fabric:/CpuStress", MetricName="CpuPercent", MetricValue=?MetricValue) :- ?MetricValue >= 20,
GetRepairHistory(?repairCount, TimeWindow="04:00:00"),
?repairCount < 5,
RestartCodePackage().
## Memory - Percent In Use.
Mitigate(AppName="fabric:/CpuStress", MetricName="MemoryPercent", MetricValue=?MetricValue) :- ?MetricValue >= 30,
GetRepairHistory(?repairCount, TimeWindow="04:00:00"),
?repairCount < 5,
RestartCodePackage().


@ -0,0 +1,28 @@
<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
<metadata minClientVersion="3.3.0">
<id>%PACKAGE_ID%</id>
<version>0.4.2</version>
<authors>Microsoft</authors>
<license type="expression">MIT</license>
<requireLicenseAcceptance>false</requireLicenseAcceptance>
<title>Service Fabric FabricHealer Application</title>
<icon>icon.png</icon>
<language>en-US</language>
<description>FabricHealer is a stateless singleton Service Fabric service that runs on all nodes in a Linux or Windows cluster. It is implemented as a .NET Core 3.1 application and has been tested on Windows (2016/2019) and Ubuntu (18.04). Its primary purpose is to schedule and execute automatic repairs in Service Fabric clusters after inspecting unhealthy events created by FabricObserver (FO) instances running in the same cluster. It employs a novel Configuration-as-Logic model to express repair workflows using Prolog-like semantics/syntax in text-based configuration files.</description>
<contentFiles>
<files include="**" buildAction="None" copyToOutput="true" />
</contentFiles>
<dependencies>
<group targetFramework=".NETStandard2.1" />
</dependencies>
<projectUrl>https://aka.ms/sf/FabricObserver</projectUrl>
<tags>azure servicefabric fabrichealer utility auto-mitigation</tags>
<copyright>© Microsoft Corporation. All rights reserved.</copyright>
</metadata>
<files>
<file src="**" target="contentFiles\any\any" />
<file src="FabricHealerPkg\Code\FabricHealer.dll" target="lib\netstandard2.1" />
<file src="%ROOT_PATH%\icon.png" target="" />
</files>
</package>

70
FabricHealer.sln Normal file

@ -0,0 +1,70 @@

Microsoft Visual Studio Solution File, Format Version 12.00
# Visual Studio Version 16
VisualStudioVersion = 16.0.29411.108
MinimumVisualStudioVersion = 10.0.40219.1
Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "FabricHealer", "FabricHealer\FabricHealer.csproj", "{9A19103F-16F7-4668-BE54-9A1E7A4F7556}"
EndProject
Project("{2150E333-8FDC-42A3-9474-1A3956D46DE8}") = "Solution Items", "Solution Items", "{FE3D81E7-3ADD-4927-8A51-4FF709B5E8BF}"
ProjectSection(SolutionItems) = preProject
.editorconfig = .editorconfig
.gitignore = .gitignore
Build-FabricHealer.ps1 = Build-FabricHealer.ps1
Build-NugetPackages.ps1 = Build-NugetPackages.ps1
Build-SFPKGs.ps1 = Build-SFPKGs.ps1
FabricHealer.nuspec.template = FabricHealer.nuspec.template
icon.png = icon.png
Documentation\LogicWorkflows.md = Documentation\LogicWorkflows.md
nuget.exe = nuget.exe
README.md = README.md
Documentation\Using.md = Documentation\Using.md
EndProjectSection
EndProject
Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "FHTest", "FHTest\FHTest.csproj", "{8D9712BF-C026-4A36-B6D1-6345137D3B6F}"
EndProject
Project("{A07B5EB6-E848-4116-A8D0-A826331D98C6}") = "FabricHealerApp", "FabricHealerApp\FabricHealerApp.sfproj", "{A977C8E0-2183-4845-95EA-7F3C3E795310}"
EndProject
Global
GlobalSection(SolutionConfigurationPlatforms) = preSolution
Debug|Any CPU = Debug|Any CPU
Debug|x64 = Debug|x64
Release|Any CPU = Release|Any CPU
Release|x64 = Release|x64
EndGlobalSection
GlobalSection(ProjectConfigurationPlatforms) = postSolution
{9A19103F-16F7-4668-BE54-9A1E7A4F7556}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{9A19103F-16F7-4668-BE54-9A1E7A4F7556}.Debug|Any CPU.Build.0 = Debug|Any CPU
{9A19103F-16F7-4668-BE54-9A1E7A4F7556}.Debug|x64.ActiveCfg = Debug|x64
{9A19103F-16F7-4668-BE54-9A1E7A4F7556}.Debug|x64.Build.0 = Debug|x64
{9A19103F-16F7-4668-BE54-9A1E7A4F7556}.Release|Any CPU.ActiveCfg = Release|Any CPU
{9A19103F-16F7-4668-BE54-9A1E7A4F7556}.Release|Any CPU.Build.0 = Release|Any CPU
{9A19103F-16F7-4668-BE54-9A1E7A4F7556}.Release|x64.ActiveCfg = Release|Any CPU
{9A19103F-16F7-4668-BE54-9A1E7A4F7556}.Release|x64.Build.0 = Release|Any CPU
{8D9712BF-C026-4A36-B6D1-6345137D3B6F}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{8D9712BF-C026-4A36-B6D1-6345137D3B6F}.Debug|Any CPU.Build.0 = Debug|Any CPU
{8D9712BF-C026-4A36-B6D1-6345137D3B6F}.Debug|x64.ActiveCfg = Debug|x64
{8D9712BF-C026-4A36-B6D1-6345137D3B6F}.Debug|x64.Build.0 = Debug|x64
{8D9712BF-C026-4A36-B6D1-6345137D3B6F}.Release|Any CPU.ActiveCfg = Release|Any CPU
{8D9712BF-C026-4A36-B6D1-6345137D3B6F}.Release|Any CPU.Build.0 = Release|Any CPU
{8D9712BF-C026-4A36-B6D1-6345137D3B6F}.Release|x64.ActiveCfg = Release|Any CPU
{8D9712BF-C026-4A36-B6D1-6345137D3B6F}.Release|x64.Build.0 = Release|Any CPU
{A977C8E0-2183-4845-95EA-7F3C3E795310}.Debug|Any CPU.ActiveCfg = Debug|x64
{A977C8E0-2183-4845-95EA-7F3C3E795310}.Debug|Any CPU.Build.0 = Debug|x64
{A977C8E0-2183-4845-95EA-7F3C3E795310}.Debug|Any CPU.Deploy.0 = Debug|x64
{A977C8E0-2183-4845-95EA-7F3C3E795310}.Debug|x64.ActiveCfg = Debug|x64
{A977C8E0-2183-4845-95EA-7F3C3E795310}.Debug|x64.Build.0 = Debug|x64
{A977C8E0-2183-4845-95EA-7F3C3E795310}.Debug|x64.Deploy.0 = Debug|x64
{A977C8E0-2183-4845-95EA-7F3C3E795310}.Release|Any CPU.ActiveCfg = Release|x64
{A977C8E0-2183-4845-95EA-7F3C3E795310}.Release|Any CPU.Build.0 = Release|x64
{A977C8E0-2183-4845-95EA-7F3C3E795310}.Release|Any CPU.Deploy.0 = Release|x64
{A977C8E0-2183-4845-95EA-7F3C3E795310}.Release|x64.ActiveCfg = Release|x64
{A977C8E0-2183-4845-95EA-7F3C3E795310}.Release|x64.Build.0 = Release|x64
{A977C8E0-2183-4845-95EA-7F3C3E795310}.Release|x64.Deploy.0 = Release|x64
EndGlobalSection
GlobalSection(SolutionProperties) = preSolution
HideSolutionNode = FALSE
EndGlobalSection
GlobalSection(ExtensibilityGlobals) = postSolution
SolutionGuid = {05A35B4C-CB73-4FEE-8AEC-89E30A3FB512}
EndGlobalSection
EndGlobal


@ -0,0 +1,142 @@
<?xml version="1.0" encoding="utf-8"?>
<ApplicationInsights xmlns="http://schemas.microsoft.com/ApplicationInsights/2013/Settings">
<!-- Add your Application Insights instrumentation key here, in addition to enabling Telemetry and supplying this
key in Settings.xml. -->
<InstrumentationKey></InstrumentationKey>
<TelemetryInitializers>
<Add Type="Microsoft.ApplicationInsights.DependencyCollector.HttpDependenciesParsingTelemetryInitializer, Microsoft.AI.DependencyCollector"/>
<Add Type="Microsoft.ApplicationInsights.WindowsServer.AzureRoleEnvironmentTelemetryInitializer, Microsoft.AI.WindowsServer"/>
<Add Type="Microsoft.ApplicationInsights.WindowsServer.BuildInfoConfigComponentVersionTelemetryInitializer, Microsoft.AI.WindowsServer"/>
<Add Type="Microsoft.ApplicationInsights.Web.WebTestTelemetryInitializer, Microsoft.AI.Web"/>
<Add Type="Microsoft.ApplicationInsights.Web.SyntheticUserAgentTelemetryInitializer, Microsoft.AI.Web">
<!-- Extended list of bots:
search|spider|crawl|Bot|Monitor|BrowserMob|BingPreview|PagePeeker|WebThumb|URL2PNG|ZooShot|GomezA|Google SketchUp|Read Later|KTXN|KHTE|Keynote|Pingdom|AlwaysOn|zao|borg|oegp|silk|Xenu|zeal|NING|htdig|lycos|slurp|teoma|voila|yahoo|Sogou|CiBra|Nutch|Java|JNLP|Daumoa|Genieo|ichiro|larbin|pompos|Scrapy|snappy|speedy|vortex|favicon|indexer|Riddler|scooter|scraper|scrubby|WhatWeb|WinHTTP|voyager|archiver|Icarus6j|mogimogi|Netvibes|altavista|charlotte|findlinks|Retreiver|TLSProber|WordPress|wsr-agent|http client|Python-urllib|AppEngine-Google|semanticdiscovery|facebookexternalhit|web/snippet|Google-HTTP-Java-Client-->
<Filters>search|spider|crawl|Bot|Monitor|AlwaysOn</Filters>
</Add>
<Add Type="Microsoft.ApplicationInsights.Web.ClientIpHeaderTelemetryInitializer, Microsoft.AI.Web"/>
<Add Type="Microsoft.ApplicationInsights.Web.AzureAppServiceRoleNameFromHostNameHeaderInitializer, Microsoft.AI.Web"/>
<Add Type="Microsoft.ApplicationInsights.Web.OperationNameTelemetryInitializer, Microsoft.AI.Web"/>
<Add Type="Microsoft.ApplicationInsights.Web.OperationCorrelationTelemetryInitializer, Microsoft.AI.Web"/>
<Add Type="Microsoft.ApplicationInsights.Web.UserTelemetryInitializer, Microsoft.AI.Web"/>
<Add Type="Microsoft.ApplicationInsights.Web.AuthenticatedUserIdTelemetryInitializer, Microsoft.AI.Web"/>
<Add Type="Microsoft.ApplicationInsights.Web.AccountIdTelemetryInitializer, Microsoft.AI.Web"/>
<Add Type="Microsoft.ApplicationInsights.Web.SessionTelemetryInitializer, Microsoft.AI.Web"/>
</TelemetryInitializers>
<TelemetryModules>
<Add Type="Microsoft.ApplicationInsights.DependencyCollector.DependencyTrackingTelemetryModule, Microsoft.AI.DependencyCollector">
<ExcludeComponentCorrelationHttpHeadersOnDomains>
<!--
Requests to the following hostnames will not be modified by adding correlation headers.
Add entries here to exclude additional hostnames.
NOTE: this configuration will be lost upon NuGet upgrade.
-->
<Add>core.windows.net</Add>
<Add>core.chinacloudapi.cn</Add>
<Add>core.cloudapi.de</Add>
<Add>core.usgovcloudapi.net</Add>
</ExcludeComponentCorrelationHttpHeadersOnDomains>
<IncludeDiagnosticSourceActivities>
<Add>Microsoft.Azure.EventHubs</Add>
<Add>Microsoft.Azure.ServiceBus</Add>
</IncludeDiagnosticSourceActivities>
</Add>
<Add Type="Microsoft.ApplicationInsights.Extensibility.PerfCounterCollector.PerformanceCollectorModule, Microsoft.AI.PerfCounterCollector">
<!--
Use the following syntax here to collect additional performance counters:
<Counters>
<Add PerformanceCounter="\Process(??APP_WIN32_PROC??)\Handle Count" ReportAs="Process handle count" />
.
</Counters>
PerformanceCounter must be either \CategoryName(InstanceName)\CounterName or \CategoryName\CounterName
NOTE: performance counters configuration will be lost upon NuGet upgrade.
The following placeholders are supported as InstanceName:
??APP_WIN32_PROC?? - instance name of the application process for Win32 counters.
??APP_W3SVC_PROC?? - instance name of the application IIS worker process for IIS/ASP.NET counters.
??APP_CLR_PROC?? - instance name of the application CLR process for .NET counters.
-->
</Add>
<Add Type="Microsoft.ApplicationInsights.Extensibility.PerfCounterCollector.QuickPulse.QuickPulseTelemetryModule, Microsoft.AI.PerfCounterCollector"/>
<Add Type="Microsoft.ApplicationInsights.WindowsServer.AppServicesHeartbeatTelemetryModule, Microsoft.AI.WindowsServer"/>
<Add Type="Microsoft.ApplicationInsights.WindowsServer.AzureInstanceMetadataTelemetryModule, Microsoft.AI.WindowsServer">
<!--
Remove individual fields collected here by adding them to the ApplicationInsights.HeartbeatProvider
with the following syntax:
<Add Type="Microsoft.ApplicationInsights.Extensibility.Implementation.Tracing.DiagnosticsTelemetryModule, Microsoft.ApplicationInsights">
<ExcludedHeartbeatProperties>
<Add>osType</Add>
<Add>location</Add>
<Add>name</Add>
<Add>offer</Add>
<Add>platformFaultDomain</Add>
<Add>platformUpdateDomain</Add>
<Add>publisher</Add>
<Add>sku</Add>
<Add>version</Add>
<Add>vmId</Add>
<Add>vmSize</Add>
<Add>subscriptionId</Add>
<Add>resourceGroupName</Add>
<Add>placementGroupId</Add>
<Add>tags</Add>
<Add>vmScaleSetName</Add>
</ExcludedHeartbeatProperties>
</Add>
NOTE: exclusions will be lost upon upgrade.
-->
</Add>
<Add Type="Microsoft.ApplicationInsights.WindowsServer.DeveloperModeWithDebuggerAttachedTelemetryModule, Microsoft.AI.WindowsServer"/>
<Add Type="Microsoft.ApplicationInsights.WindowsServer.UnhandledExceptionTelemetryModule, Microsoft.AI.WindowsServer"/>
<Add Type="Microsoft.ApplicationInsights.WindowsServer.UnobservedExceptionTelemetryModule, Microsoft.AI.WindowsServer">
<!--</Add>
<Add Type="Microsoft.ApplicationInsights.WindowsServer.FirstChanceExceptionStatisticsTelemetryModule, Microsoft.AI.WindowsServer">-->
</Add>
<Add Type="Microsoft.ApplicationInsights.Web.RequestTrackingTelemetryModule, Microsoft.AI.Web">
<Handlers>
<!--
Add entries here to filter out additional handlers:
NOTE: handler configuration will be lost upon NuGet upgrade.
-->
<Add>Microsoft.VisualStudio.Web.PageInspector.Runtime.Tracing.RequestDataHttpHandler</Add>
<Add>System.Web.StaticFileHandler</Add>
<Add>System.Web.Handlers.AssemblyResourceLoader</Add>
<Add>System.Web.Optimization.BundleHandler</Add>
<Add>System.Web.Script.Services.ScriptHandlerFactory</Add>
<Add>System.Web.Handlers.TraceHandler</Add>
<Add>System.Web.Services.Discovery.DiscoveryRequestHandler</Add>
<Add>System.Web.HttpDebugHandler</Add>
</Handlers>
</Add>
<Add Type="Microsoft.ApplicationInsights.Web.ExceptionTrackingTelemetryModule, Microsoft.AI.Web"/>
<Add Type="Microsoft.ApplicationInsights.Web.AspNetDiagnosticTelemetryModule, Microsoft.AI.Web"/>
</TelemetryModules>
<ApplicationIdProvider Type="Microsoft.ApplicationInsights.Extensibility.Implementation.ApplicationId.ApplicationInsightsApplicationIdProvider, Microsoft.ApplicationInsights"/>
<TelemetrySinks>
<Add Name="default">
<TelemetryProcessors>
<Add Type="Microsoft.ApplicationInsights.Extensibility.PerfCounterCollector.QuickPulse.QuickPulseTelemetryProcessor, Microsoft.AI.PerfCounterCollector"/>
<Add Type="Microsoft.ApplicationInsights.Extensibility.AutocollectedMetricsExtractor, Microsoft.ApplicationInsights"/>
<Add Type="Microsoft.ApplicationInsights.WindowsServer.TelemetryChannel.AdaptiveSamplingTelemetryProcessor, Microsoft.AI.ServerTelemetryChannel">
<MaxTelemetryItemsPerSecond>5</MaxTelemetryItemsPerSecond>
<ExcludedTypes>Event</ExcludedTypes>
</Add>
<Add Type="Microsoft.ApplicationInsights.WindowsServer.TelemetryChannel.AdaptiveSamplingTelemetryProcessor, Microsoft.AI.ServerTelemetryChannel">
<MaxTelemetryItemsPerSecond>5</MaxTelemetryItemsPerSecond>
<IncludedTypes>Event</IncludedTypes>
</Add>
</TelemetryProcessors>
<TelemetryChannel Type="Microsoft.ApplicationInsights.WindowsServer.TelemetryChannel.ServerTelemetryChannel, Microsoft.AI.ServerTelemetryChannel"/>
</Add>
</TelemetrySinks>
<!--
Learn more about Application Insights configuration with ApplicationInsights.config here:
http://go.microsoft.com/fwlink/?LinkID=513840
Note: If not present, please add <InstrumentationKey>Your Key</InstrumentationKey> to the top of this file.
--></ApplicationInsights>

@ -0,0 +1,39 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Services.Runtime;
namespace FabricHealer
{
/// <summary>
/// An instance of this class is created for each service instance by the Service Fabric runtime.
/// </summary>
public sealed class FabricHealer : StatelessService
{
private FabricHealerManager healerManager;
public FabricHealer(StatelessServiceContext context)
: base(context)
{
}
/// <summary>
/// This is the main entry point for your service instance.
/// </summary>
/// <param name="cancellationToken">Canceled when Service Fabric needs to shut down this service instance.</param>
protected override async Task RunAsync(CancellationToken cancellationToken)
{
// FabricHealerManager will create an instance member cancellation token object (see Token) that is this cancellation token,
// which is threaded through all async operations throughout the program.
healerManager = FabricHealerManager.Singleton(Context, cancellationToken);
// Blocks until cancellationToken cancellation.
await healerManager.StartAsync().ConfigureAwait(true);
}
}
}

@ -0,0 +1,79 @@
<Project Sdk="Microsoft.NET.Sdk">
<PropertyGroup>
<ProjectGuid>{9A19103F-16F7-4668-BE54-9A1E7A4F7556}</ProjectGuid>
<TargetFramework>netcoreapp3.1</TargetFramework>
<PlatformTarget>x64</PlatformTarget>
<OutputType>Exe</OutputType>
<!-- ***NOTE***:
If deploying to SF cluster directly from Visual Studio, you must use single target RID:
For Windows, use win-x64. For Linux, use linux-x64.
<RuntimeIdentifier>win-x64</RuntimeIdentifier> -->
<!-- For multi-target publish (say, from Azure Pipeline build), you use multi-target RIDs:
linux-x64;win-x64. -->
<RuntimeIdentifiers>linux-x64;win-x64</RuntimeIdentifiers>
<RootNamespace>FabricHealer</RootNamespace>
<AssemblyName>FabricHealer</AssemblyName>
<AssemblyVersion>0.4.2</AssemblyVersion>
<FileVersion>0.4.2</FileVersion>
<AutoGenerateBindingRedirects>true</AutoGenerateBindingRedirects>
<IsServiceFabricServiceProject>true</IsServiceFabricServiceProject>
<StartupObject>FabricHealer.Program</StartupObject>
<SignAssembly>false</SignAssembly>
<DelaySign>false</DelaySign>
<NoWarn>CA1822;$(NoWarn)</NoWarn>
<ResolveComReferenceSilent>true</ResolveComReferenceSilent>
<Platforms>AnyCPU;x64</Platforms>
</PropertyGroup>
<ItemGroup>
<Compile Remove="Repair\Guan\ReimageVMPredicateType.cs" />
</ItemGroup>
<ItemGroup>
<None Remove="NLog.config" />
</ItemGroup>
<ItemGroup>
<PackageReference Include="Microsoft.ApplicationInsights" Version="2.15.0" />
<PackageReference Include="Microsoft.CSharp" Version="4.7.0" />
<PackageReference Include="Microsoft.ServiceFabric" Version="7.1.458" />
<PackageReference Include="Microsoft.ServiceFabric.Data" Version="4.1.458" />
<PackageReference Include="Microsoft.ServiceFabric.Data.Extensions" Version="4.1.458" />
<PackageReference Include="Microsoft.ServiceFabric.Data.Interfaces" Version="4.1.458" />
<PackageReference Include="Microsoft.ServiceFabric.Diagnostics.internal" Version="4.1.458" />
<PackageReference Include="Microsoft.ServiceFabric.Services" Version="4.1.458" />
<PackageReference Include="Microsoft.ServiceFabricApps.Guan" Version="1.0.0" />
<PackageReference Include="Newtonsoft.Json" Version="12.0.3" />
<PackageReference Include="NLog" Version="4.7.5" />
<PackageReference Include="System.Buffers" Version="4.5.1" />
<PackageReference Include="System.Collections" Version="4.3.0" />
<PackageReference Include="System.Collections.Concurrent" Version="4.3.0" />
<PackageReference Include="System.Collections.Immutable" Version="1.7.1" />
<PackageReference Include="System.ComponentModel" Version="4.3.0" />
<PackageReference Include="System.ComponentModel.Composition" Version="4.7.0" />
<PackageReference Include="System.Configuration.ConfigurationManager" Version="4.7.0" />
<PackageReference Include="System.Diagnostics.Contracts" Version="4.3.0" />
<PackageReference Include="System.Diagnostics.Debug" Version="4.3.0" />
<PackageReference Include="System.Diagnostics.DiagnosticSource" Version="4.7.1" />
<PackageReference Include="System.Globalization" Version="4.3.0" />
<PackageReference Include="System.IO" Version="4.3.0" />
<PackageReference Include="System.Linq" Version="4.3.0" />
<PackageReference Include="System.Linq.Expressions" Version="4.3.0" />
<PackageReference Include="System.Management" Version="4.7.0" />
<PackageReference Include="System.Memory" Version="4.5.4" />
<PackageReference Include="System.Net.Http" Version="4.3.4" />
<PackageReference Include="System.Numerics.Vectors" Version="4.5.0" />
<PackageReference Include="System.Reflection" Version="4.3.0" />
<PackageReference Include="System.Reflection.Metadata" Version="1.8.1" />
<PackageReference Include="System.Resources.ResourceManager" Version="4.3.0" />
<PackageReference Include="System.Runtime" Version="4.3.1" />
<PackageReference Include="System.Runtime.CompilerServices.Unsafe" Version="4.7.1" />
<PackageReference Include="System.Runtime.Extensions" Version="4.3.1" />
<PackageReference Include="System.Runtime.InteropServices" Version="4.3.0" />
<PackageReference Include="System.Runtime.InteropServices.RuntimeInformation" Version="4.3.0" />
<PackageReference Include="System.Text.Encoding" Version="4.3.0" />
<PackageReference Include="System.Text.Encoding.CodePages" Version="4.7.1" />
<PackageReference Include="System.Text.Encodings.Web" Version="4.7.1" />
<PackageReference Include="System.Threading" Version="4.3.0" />
<PackageReference Include="System.Threading.Tasks" Version="4.3.0" />
<PackageReference Include="System.Threading.Tasks.Extensions" Version="4.5.4" />
</ItemGroup>
</Project>

File diff not shown because it is too large.

@ -0,0 +1,31 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System.Collections.Generic;
using System.Fabric.Repair;
using System.Threading;
using System.Threading.Tasks;
using FabricHealer.Repair;
using FabricHealer.Utilities.Telemetry;
namespace FabricHealer.Interfaces
{
public interface IRepairTasks
{
Task ActivateServiceFabricNodeAsync(string nodeName, CancellationToken cancellationToken);
Task RemoveServiceFabricNodeStateAsync(string nodeName, CancellationToken cancellationToken);
Task<bool> RestartDeployedCodePackageAsync(RepairConfiguration repairConfiguration, CancellationToken cancellationToken);
Task<bool> RestartReplicaAsync(RepairConfiguration repairConfiguration, CancellationToken cancellationToken);
Task<bool> RemoveReplicaAsync(RepairConfiguration repairConfiguration, CancellationToken cancellationToken);
Task<bool> SafeRestartServiceFabricNodeAsync(string nodeName, RepairTask repairTask, CancellationToken cancellationToken);
Task StartRepairWorkflowAsync(TelemetryData foHealthData, List<string> repairRules, CancellationToken cancellationToken);
}
}

@ -0,0 +1,169 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Collections.Generic;
using System.Fabric.Health;
using System.Threading;
using System.Threading.Tasks;
using FabricHealer.Utilities;
using FabricHealer.Utilities.Telemetry;
namespace FabricHealer.Interfaces
{
/// <summary>
/// ITelemetryProvider interface.
/// </summary>
public interface ITelemetryProvider
{
/// <summary>
/// Gets or sets the telemetry API key.
/// </summary>
string Key { get; set; }
/// <summary>
/// Calls telemetry provider to track the availability.
/// </summary>
/// <param name="serviceUri">Service name.</param>
/// <param name="instance">Instance identifier.</param>
/// <param name="testName">Availability test name.</param>
/// <param name="captured">The time when the availability was captured.</param>
/// <param name="duration">The time taken for the availability test to run.</param>
/// <param name="location">Name of the location the availability test was run from.</param>
/// <param name="success">True if the availability test ran successfully.</param>
/// <param name="cancellationToken">CancellationToken instance.</param>
/// <param name="message">Error message on availability test run failure.</param>
/// <returns>a completed task.</returns>
Task ReportAvailabilityAsync(
Uri serviceUri,
string instance,
string testName,
DateTimeOffset captured,
TimeSpan duration,
string location,
bool success,
CancellationToken cancellationToken,
string message = null);
/// <summary>
/// Calls telemetry provider to report health.
/// </summary>
/// <param name="scope">Scope of health evaluation (Cluster, Node, etc.).</param>
/// <param name="propertyName">Name of the health property.</param>
/// <param name="state">Health state.</param>
/// <param name="unhealthyEvaluations">Unhealthy evaluations aggregated description.</param>
/// <param name="source">Source of emission.</param>
/// <param name="cancellationToken">CancellationToken instance.</param>
/// <param name="serviceName">Optional: TraceTelemetry context cloud service name.</param>
/// <param name="instanceName">Optional: TraceTelemetry context cloud instance name.</param>
/// <returns>a Task.</returns>
Task ReportHealthAsync(
HealthScope scope,
string propertyName,
HealthState state,
string unhealthyEvaluations,
string source,
CancellationToken cancellationToken,
string serviceName = null,
string instanceName = null);
/// <summary>
/// Calls telemetry provider to report a metric.
/// </summary>
/// <param name="name">Name of the metric.</param>
/// <param name="value">Value of the metric.</param>
/// <param name="source">Name of the observer emitting the signal.</param>
/// <param name="cancellationToken">CancellationToken instance.</param>
/// <returns>A completed task of bool.</returns>
Task<bool> ReportMetricAsync<T>(
string name,
T value,
string source,
CancellationToken cancellationToken);
/// <summary>
/// Calls telemetry provider to report a metric.
/// </summary>
/// <param name="telemetryData">TelemetryData instance.</param>
/// <param name="cancellationToken">CancellationToken instance.</param>
Task ReportMetricAsync(
TelemetryData telemetryData,
CancellationToken cancellationToken);
/// <summary>
/// Calls telemetry provider to report a metric.
/// </summary>
/// <param name="name">Name of the metric.</param>
/// <param name="value">Value of the metric.</param>
/// <param name="properties">IDictionary&lt;string, string&gt; containing name/value pairs of additional properties.</param>
/// <param name="cancellationToken">CancellationToken instance.</param>
/// <returns>A completed task.</returns>
Task ReportMetricAsync(
string name,
long value,
IDictionary<string, string> properties,
CancellationToken cancellationToken);
/// <summary>
/// Calls telemetry provider to report a metric.
/// </summary>
/// <param name="service">Name of the service.</param>
/// <param name="partition">Partition id.</param>
/// <param name="name">Name of the metric.</param>
/// <param name="value">Value of the metric.</param>
/// <param name="cancellationToken">CancellationToken instance.</param>
/// <returns>A completed task.</returns>
Task ReportMetricAsync(
string service,
Guid partition,
string name,
long value,
CancellationToken cancellationToken);
/// <summary>
/// Calls telemetry provider to report a metric.
/// </summary>
/// <param name="role">Name of the role.</param>
/// <param name="id">Replica or instance identifier.</param>
/// <param name="name">Name of the metric.</param>
/// <param name="value">Value of the metric.</param>
/// <param name="cancellationToken">CancellationToken instance.</param>
/// <returns>A completed task.</returns>
Task ReportMetricAsync(
string role,
long id,
string name,
long value,
CancellationToken cancellationToken);
/// <summary>
/// Calls telemetry provider to report a metric.
/// </summary>
/// <param name="roleName">Name of the role.</param>
/// <param name="instance">Instance identifier.</param>
/// <param name="name">Name of the metric.</param>
/// <param name="value">Value of the metric.</param>
/// <param name="count">Number of samples for this metric.</param>
/// <param name="min">Minimum value of the samples.</param>
/// <param name="max">Maximum value of the samples.</param>
/// <param name="sum">Sum of all of the samples.</param>
/// <param name="deviation">Standard deviation of the sample set.</param>
/// <param name="properties">IDictionary&lt;string, string&gt; containing name/value pairs of additional properties.</param>
/// <param name="cancellationToken">CancellationToken instance.</param>
/// <returns>A completed task.</returns>
Task ReportMetricAsync(
string roleName,
string instance,
string name,
long value,
int count,
long min,
long max,
long sum,
double deviation,
IDictionary<string, string> properties,
CancellationToken cancellationToken);
}
}

@ -0,0 +1,43 @@
## Logic rules for Application level repairs in the cluster.
## These internal predicate "calls" form the basis of the configuration for checking if we are inside a specified run interval for a specified repair target. Please see the documentation for supported target types.
IntervalForRepairTarget(AppName="fabric:/ClusterObserver", RunInterval=01:00:00).
IntervalForRepairTarget(AppName="fabric:/CpuStress", RunInterval=01:00:00).
IntervalForRepairTarget(AppName="fabric:/ContainerFoo2", RunInterval=00:30:00).
IntervalForRepairTarget(AppName="fabric:/MyApp42", RunInterval=00:30:00).
IntervalForRepairTarget(MetricName="ActiveTcpPorts", RunInterval=00:45:00).
IntervalForRepairTarget(MetricName="EphemeralPorts", RunInterval=00:30:00).
## This is the rule that will run for each IntervalForRepairTarget predicate specified above.
## The CheckInsideRunInterval external predicate compares the specified RunInterval TimeSpan value against repair job data managed by the RepairManagerService(RM),
## a stateful Service Fabric System Service that orchestrates repairs and manages repair state for a Service Fabric cluster. FH requires the presence of RM in order to function.
Mitigate() :- IntervalForRepairTarget(Target=?target, RunInterval=?timespan), CheckInsideRunInterval(RunInterval=?timespan), !.
## TimeScopedRestartCodePackage is an internal predicate to check for the number of times a repair has run to completion within a supplied time window.
## If the completed repair count is less than the supplied value, then run the RestartCodePackage mitigation.
TimeScopedRestartCodePackage(?count, ?time) :- GetRepairHistory(?repairCount, TimeWindow=?time), ?repairCount < ?count,
RestartCodePackage().
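## Example (a sketch, not an active rule): a call such as TimeScopedRestartCodePackage(5, 05:00:00) binds ?count=5 and ?time=05:00:00
## in the rule above, so it should behave as if the body had been written out by hand. For the ActiveTcpPorts target that would read:
## Mitigate(MetricName="ActiveTcpPorts") :- GetRepairHistory(?repairCount, TimeWindow=05:00:00), ?repairCount < 5,
##	RestartCodePackage().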
## Mitigation queries for multiple metrics and targets.
## CPU - Percent In Use.
Mitigate(AppName="fabric:/CpuStress", MetricName="CpuPercent", MetricValue=?MetricValue) :- ?MetricValue >= 20,
TimeScopedRestartCodePackage(5, 05:00:00).
## Memory - Percent In Use.
Mitigate(AppName="fabric:/CpuStress", MetricName="MemoryPercent", MetricValue=?MetricValue) :- ?MetricValue >= 30,
TimeScopedRestartCodePackage(5, 05:00:00).
## Memory - Megabytes In Use.
Mitigate(AppName="fabric:/CpuStress", MetricName="MemoryMB") :- TimeScopedRestartCodePackage(5, 05:00:00).
Mitigate(AppName="fabric:/ContainerFoo2", MetricName="MemoryMB") :- TimeScopedRestartCodePackage(5, 05:00:00).
## Active TCP Ports - Any app service.
Mitigate(MetricName="ActiveTcpPorts") :- TimeScopedRestartCodePackage(5, 05:00:00).
## Ephemeral TCP Ports - Any app service.
Mitigate(MetricName="EphemeralPorts") :- TimeScopedRestartCodePackage(5, 05:00:00).
## Ephemeral Ports - Specific Application - any of its services.
Mitigate(AppName="fabric:/MyApp42", MetricName="EphemeralPorts", MetricValue=?MetricValue) :- ?MetricValue >= 250,
TimeScopedRestartCodePackage(5, 05:00:00).
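## Example (a sketch, not active rules; the app name "fabric:/MyOtherApp" and the threshold of 50 are hypothetical).
## Following the pattern used in this file, adding another application would take one IntervalForRepairTarget fact,
## placed with the other facts at the top of the file, and one Mitigate rule:
## IntervalForRepairTarget(AppName="fabric:/MyOtherApp", RunInterval=00:30:00).
## Mitigate(AppName="fabric:/MyOtherApp", MetricName="MemoryPercent", MetricValue=?MetricValue) :- ?MetricValue >= 50,
##	TimeScopedRestartCodePackage(5, 05:00:00).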

@ -0,0 +1,13 @@
## This rule checks the size of the specified folders (for example, SF logs, observer logs, E:\temp). If the size exceeds the supplied threshold, it tries to delete files in the directory. Size (MaxFolderSizeGB or MaxFolderSizeMB) must be supplied as a positive whole number.
## Optional arguments for DeleteFiles:
##   SortOrder: File sort order, Ascending or Descending. Defaults to Ascending (oldest to newest).
##   MaxFilesToDelete: The maximum number of files to delete. If not specified (or 0), all files are deleted.
##   RecurseSubdirectories: Delete files in child folders of the specified directory. Defaults to false if not specified.
## First, check if we are inside run interval. If so, then cut (!).
Mitigate() :- CheckInsideRunInterval(RunInterval=01:00:00), !.
## Iterate over a list of folders with a system predicate, member (defined and implemented in Guan) and an internal predicate, config (an internal predicate needs no backing impl, it only exists in this logic).
## You can just write a rule for each folder should you need to have MB max size values for some folders and GB max size values for others. In the case below, all values are GB so it makes sense to use enumeration for convenience.
Mitigate() :- GetRepairHistory(?repairCount, TimeWindow=01:00:00),
?repairCount < 2,
member(config(?X,?Y), [config("C:\SFDevCluster\Log\Traces", 50), config("C:\observer_logs", 1), config("E:\temp", 40)]),
CheckFolderSize(FolderPath=?X, MaxFolderSizeGB=?Y),
DeleteFiles(FolderPath=?X, MaxRepairs=2, SortOrder=Ascending, MaxFilesToDelete=50, RecurseSubdirectories=true).
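## Example (a sketch, not an active rule; the 500 MB threshold is hypothetical): a single-folder rule that uses MaxFolderSizeMB
## instead of the GB-based enumeration above. If you enable it, consider removing the same folder from the list above so that it is not checked twice.
## Mitigate() :- GetRepairHistory(?repairCount, TimeWindow=01:00:00),
##	?repairCount < 2,
##	CheckFolderSize(FolderPath="C:\observer_logs", MaxFolderSizeMB=500),
##	DeleteFiles(FolderPath="C:\observer_logs", SortOrder=Ascending, MaxFilesToDelete=50).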

@ -0,0 +1,10 @@
## This houses both Fabric Node restart and remove rules.
## First check if we are inside the run interval. If so, cut (!).
Mitigate() :- CheckInsideRunInterval(RunInterval=01:00:00), !.
Mitigate() :- GetRepairHistory(?repairCount, TimeWindow=01:00:00),
?repairCount < 5,
RestartFabricNode().
## TODO: Fabric node removal rules.

@ -0,0 +1,18 @@
## This demonstrates a workflow that employs multiple external predicates to get to a solution for a single unhealthy replica scenario.
## First, check if we are inside run interval. If so, then cut (!), which means stop processing rules.
Mitigate() :- CheckInsideRunInterval(RunInterval=00:30:00), !.
Mitigate() :- GetRepairHistory(?repairCount, TimeWindow=01:00:00),
?repairCount < 2,
RestartReplica().
## Else, try this.
Mitigate() :- GetRepairHistory(?repairCount, TimeWindow=01:00:00),
?repairCount < 2,
RestartCodePackage().
## Else, try this.
Mitigate() :- GetRepairHistory(?repairCount, TimeWindow=01:00:00),
?repairCount < 2,
RestartFabricNode().

@ -0,0 +1,28 @@
## Logic rules for System Application level repairs in the cluster.
## First, check if we are inside the run interval. If inside run interval, then cut (no other rules will be processed).
## Note: FO only generates Application (System) level warnings for system services. There will only ever be ApplicationName as "fabric:/System" in the FO health data that FH emits, so this is an optional argument.
Mitigate(AppName="fabric:/System") :- CheckInsideRunInterval(RunInterval=01:00:00), !.
## TimeScopedRestartFabricNode is an internal predicate to check for the number of times a repair has run to completion within a supplied time window.
## If the completed repair count is less than the supplied value, then run the RestartFabricNode mitigation.
TimeScopedRestartFabricNode(?count, ?time) :- GetRepairHistory(?repairCount, TimeWindow=?time), ?repairCount < ?count,
RestartFabricNode().
## Mitigation queries for multiple metrics and targets.
## CPU Time - Percent
Mitigate(AppName="fabric:/System", MetricName="CpuPercent", MetricValue=?MetricValue) :- ?MetricValue >= 80,
TimeScopedRestartFabricNode(4, 08:00:00).
## Memory Use - Megabytes in use
Mitigate(AppName="fabric:/System", MetricName="MemoryMB", MetricValue=?MetricValue) :- ?MetricValue >= 2048,
TimeScopedRestartFabricNode(4, 08:00:00).
## Memory Use - Percent in use
Mitigate(AppName="fabric:/System", MetricName="MemoryPercent", MetricValue=?MetricValue) :- ?MetricValue >= 40,
TimeScopedRestartFabricNode(4, 08:00:00).
## Ephemeral Ports in Use
Mitigate(AppName="fabric:/System", MetricName="EphemeralPorts", MetricValue=?MetricValue) :- ?MetricValue >= 800,
TimeScopedRestartFabricNode(4, 08:00:00).

@ -0,0 +1,9 @@
## Logic rules for Virtual Machine level repairs in the cluster. Only OS reboot is supported today.
## First, check if we are inside run interval. If so, then cut (!).
Mitigate() :- CheckInsideRunInterval(RunInterval=02:00:00), !.
Mitigate(MetricName="MemoryPercent", MetricValue=?MetricValue) :- ?MetricValue >= 90,
GetRepairHistory(?repairCount, TimeWindow=08:00:00),
?repairCount < 5,
RestartVM().

@ -0,0 +1,58 @@
<?xml version="1.0" encoding="utf-8" ?>
<Settings xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schemas.microsoft.com/2011/01/fabric">
<Section Name="RepairManagerConfiguration">
<!-- Optional: This service makes async SF Api calls that are cluster-wide operations
and can take time in large clusters. -->
<Parameter Name="AsyncOperationTimeoutSeconds" Value="120" />
<Parameter Name="HealthCheckLoopSleepTimeSeconds" Value="" MustOverride="true" />
<!-- Required: Location on disk to store observer data, including ObserverManager.
Each observer will write to their own directory on this path.
**NOTE: For Linux targets, do not supply a drive prefix. Just supply a folder name.** -->
<Parameter Name="LocalLogPath" Value="fabrichealer_logs" />
<Parameter Name="EnableVerboseLogging" Value="false" />
<!-- Optional: Diagnostic Telemetry. Azure ApplicationInsights and LogAnalytics support is already implemented,
but you can implement whatever provider you want. See the ITelemetryProvider interface. -->
<Parameter Name="EnableTelemetryProvider" Value="false" />
<!-- Required: Values can be either AzureApplicationInsights or AzureLogAnalytics -->
<Parameter Name="TelemetryProvider" Value="AzureLogAnalytics" />
<!-- Required-If TelemetryProvider is AzureApplicationInsights. -->
<Parameter Name="AppInsightsInstrumentationKey" Value="" />
<!-- Required-If TelemetryProvider is AzureLogAnalytics. -->
<Parameter Name="LogAnalyticsWorkspaceId" Value="" />
<!-- Required-If TelemetryProvider is AzureLogAnalytics. -->
<Parameter Name="LogAnalyticsSharedKey" Value="" />
<!-- Required-If TelemetryProvider is AzureLogAnalytics. -->
<Parameter Name="LogAnalyticsLogType" Value="FabricHealer" />
<!-- Optional: EventSource Tracing. -->
<Parameter Name="EnableEventSourceProvider" Value="true" />
<Parameter Name="EventSourceProviderName" Value="FabricHealerETWProvider" />
<!-- Big on/off switch. You can be more granular below in the Repair policies sections. -->
<Parameter Name="EnableAutoMitigation" Value="" MustOverride="true" />
<Parameter Name="EnableRepairAuditTelemetry" Value="true" />
</Section>
<!-- Repair policies -->
<Section Name="AppRepairPolicy">
<Parameter Name="Enabled" Value="" MustOverride="true" />
<Parameter Name="LogicRulesConfigurationFile" Value="AppRules.config.txt" />
</Section>
<Section Name="DiskRepairPolicy">
<Parameter Name="Enabled" Value="" MustOverride="true" />
<Parameter Name="LogicRulesConfigurationFile" Value="DiskRules.config.txt" />
</Section>
<Section Name="FabricNodeRepairPolicy">
<Parameter Name="Enabled" Value="" MustOverride="true" />
<Parameter Name="LogicRulesConfigurationFile" Value="FabricNodeRules.config.txt" />
</Section>
<Section Name="ReplicaRepairPolicy">
<Parameter Name="Enabled" Value="" MustOverride="true" />
<Parameter Name="LogicRulesConfigurationFile" Value="ReplicaRules.config.txt" />
</Section>
<Section Name="SystemAppRepairPolicy">
<Parameter Name="Enabled" Value="" MustOverride="true" />
<Parameter Name="LogicRulesConfigurationFile" Value="SystemAppRules.config.txt" />
</Section>
<Section Name="VMRepairPolicy">
<Parameter Name="Enabled" Value="" MustOverride="true" />
<Parameter Name="LogicRulesConfigurationFile" Value="VmRules.config.txt" />
</Section>
</Settings>

@ -0,0 +1,25 @@
<?xml version="1.0" encoding="utf-8"?>
<ServiceManifest Name="FabricHealerPkg"
Version="0.4.2"
xmlns="http://schemas.microsoft.com/2011/01/fabric"
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<ServiceTypes>
<!-- This is the name of your ServiceType.
This name must match the string used in RegisterServiceType call in Program.cs. -->
<StatelessServiceType ServiceTypeName="FabricHealerType" />
</ServiceTypes>
<!-- Code package is your service executable. -->
<CodePackage Name="Code" Version="0.4.2">
<EntryPoint>
<ExeHost>
<Program>FabricHealer</Program>
</ExeHost>
</EntryPoint>
</CodePackage>
<!-- Config package is the contents of the Config directory under PackageRoot that contains an
independently-updateable and versioned set of custom configuration settings for your service. -->
<ConfigPackage Name="Config" Version="0.4.2" />
</ServiceManifest>

42
FabricHealer/Program.cs Normal file
@ -0,0 +1,42 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Diagnostics;
using System.Threading;
using Microsoft.ServiceFabric.Services.Runtime;
namespace FabricHealer
{
internal static class Program
{
/// <summary>
/// This is the entry point of the service host process.
/// </summary>
private static void Main()
{
try
{
// The ServiceManifest.XML file defines one or more service type names.
// Registering a service maps a service type name to a .NET type.
// When Service Fabric creates an instance of this service type,
// an instance of the class is created in this host process.
ServiceRuntime.RegisterServiceAsync("FabricHealerType",
context => new FabricHealer(context)).GetAwaiter().GetResult();
ServiceEventSource.Current.ServiceTypeRegistered(Process.GetCurrentProcess().Id, typeof(FabricHealer).Name);
// Prevents this host process from terminating so services keep running.
Thread.Sleep(Timeout.Infinite);
}
catch (Exception e)
{
ServiceEventSource.Current.ServiceHostInitializationFailed(e.ToString());
throw;
}
}
}
}

@ -0,0 +1,36 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
namespace FabricHealer.Repair
{
public class DiskRepairPolicy : RepairPolicy
{
public bool RecurseSubdirectories
{
get; set;
}
public string FolderPath
{
get; set;
}
public long MaxNumberOfFilesToDelete
{
get; set;
}
public FileSortOrder FileAgeSortOrder
{
get; set;
}
}
public enum FileSortOrder
{
Ascending,
Descending
}
}

@ -0,0 +1,15 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
namespace FabricHealer.Repair
{
public enum FabricNodeRepairStep
{
Activate,
Deactivate,
Restart,
Scheduled,
}
}

@ -0,0 +1,404 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Collections.Generic;
using System.Fabric;
using System.Fabric.Query;
using System.Fabric.Repair;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using FabricHealer.Utilities;
using FabricHealer.Utilities.Telemetry;
namespace FabricHealer.Repair
{
public static class FabricRepairTasks
{
public static async Task<bool> IsRepairTaskInDesiredStateAsync(
string taskId,
FabricClient fabricClient,
string executorName,
List<RepairTaskState> desiredStates)
{
IList<RepairTask> repairTaskList = await fabricClient.RepairManager.GetRepairTaskListAsync(
taskId,
RepairTaskStateFilter.All,
executorName).ConfigureAwait(true);
return desiredStates.Any(desiredState => repairTaskList.Any(rt => rt.State == desiredState));
}
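/// <summary>
/// Cancels a repair task based on its current state.
/// </summary>
/// <param name="repairTask"><see cref="RepairTask"/> to be cancelled.</param>
/// <param name="fabricClient"><see cref="FabricClient"/> instance used to call the Repair Manager.</param>
/// <returns>A task that represents the asynchronous cancel operation.</returns>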
public static async Task CancelRepairTaskAsync(RepairTask repairTask, FabricClient fabricClient)
{
switch (repairTask.State)
{
case RepairTaskState.Restoring:
case RepairTaskState.Completed:
break;
case RepairTaskState.Created:
case RepairTaskState.Claimed:
case RepairTaskState.Preparing:
_ = await fabricClient.RepairManager.CancelRepairTaskAsync(
repairTask.TaskId,
repairTask.Version,
true).ConfigureAwait(false);
break;
case RepairTaskState.Approved:
case RepairTaskState.Executing:
repairTask.State = RepairTaskState.Restoring;
repairTask.ResultStatus = RepairTaskResult.Cancelled;
_ = await fabricClient.RepairManager.UpdateRepairExecutionStateAsync(repairTask).ConfigureAwait(false);
break;
case RepairTaskState.Invalid:
break;
default:
throw new Exception($"Repair task {repairTask.TaskId} is in invalid state {repairTask.State}");
}
}
public static async Task<bool> CompleteCustomActionRepairJobAsync(
RepairTask repairTask,
FabricClient fabricClient,
StatelessServiceContext context,
CancellationToken token)
{
try
{
if (repairTask.ResultStatus == RepairTaskResult.Succeeded
|| repairTask.State == RepairTaskState.Completed
|| repairTask.State == RepairTaskState.Restoring)
{
return true;
}
repairTask.State = RepairTaskState.Restoring;
repairTask.ResultStatus = RepairTaskResult.Succeeded;
_ = await fabricClient.RepairManager.UpdateRepairExecutionStateAsync(
repairTask,
FabricHealerManager.ConfigSettings.AsyncTimeout,
token).ConfigureAwait(false);
}
catch (Exception e)
{
var telemetryUtilities = new TelemetryUtilities(fabricClient, context);
await telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Error,
"FabricRepairTasks.CompleteCustomActionRepairJobAsync",
$"Failed to Complete Repair Task {repairTask.TaskId} with " +
$"Unhandled Exception:{Environment.NewLine}{e}",
token).ConfigureAwait(false);
if (e is FabricException || e is TaskCanceledException || e is OperationCanceledException)
{
return false;
}
throw;
}
return true;
}
public static async Task<RepairTask> ScheduleRepairTaskAsync(
RepairConfiguration repairConfiguration,
RepairExecutorData executorData,
string executorName,
FabricClient fabricClient,
CancellationToken token)
{
var repairTaskEngine = new RepairTaskEngine(fabricClient);
RepairTask repairTask;
var repairAction = repairConfiguration.RepairPolicy.CurrentAction;
switch (repairAction)
{
case RepairAction.RestartVM:
repairTask = repairTaskEngine.CreateVmRebootTask(
repairConfiguration,
executorName);
break;
case RepairAction.DeleteFiles:
case RepairAction.RestartCodePackage:
case RepairAction.RestartFabricNode:
case RepairAction.RestartReplica:
repairTask = repairTaskEngine.CreateFabricHealerRmRepairTask(
repairConfiguration,
executorData);
break;
default:
FabricHealerManager.RepairLogger.LogWarning("Unknown or Unsupported FabricRepairAction specified.");
return null;
}
bool success = await TryCreateRepairTaskAsync(
fabricClient,
repairTask,
repairConfiguration,
token).ConfigureAwait(false);
if (success)
{
return repairTask;
}
return null;
}
private static async Task<bool> TryCreateRepairTaskAsync(
FabricClient fabricClient,
RepairTask repairTask,
RepairConfiguration repairConfiguration,
CancellationToken token)
{
if (repairTask == null)
{
return false;
}
try
{
var repairTaskEngine = new RepairTaskEngine(fabricClient);
var isRepairAlreadyInProgress =
await repairTaskEngine.IsFHRepairTaskRunningAsync(
repairTask.Executor,
repairConfiguration,
token).ConfigureAwait(false);
if (!isRepairAlreadyInProgress)
{
_ = await fabricClient.RepairManager.CreateRepairTaskAsync(
repairTask,
FabricHealerManager.ConfigSettings.AsyncTimeout,
token).ConfigureAwait(false);
return true;
}
}
catch (FabricException fe)
{
string message =
$"Unable to create repair task:{Environment.NewLine}{fe}";
FabricHealerManager.RepairLogger.LogWarning(message);
// Await the telemetry call instead of blocking on it inside an async method.
await FabricHealerManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"FabricRepairTasks::TryCreateRepairTaskAsync",
message,
token).ConfigureAwait(false);
}
return false;
}
public static async Task<long> SetFabricRepairJobStateAsync(
RepairTask repairTask,
RepairTaskState repairState,
RepairTaskResult repairResult,
FabricClient fabricClient,
CancellationToken token)
{
repairTask.State = repairState;
repairTask.ResultStatus = repairResult;
return
await fabricClient.RepairManager.UpdateRepairExecutionStateAsync(
repairTask,
FabricHealerManager.ConfigSettings.AsyncTimeout,
token).ConfigureAwait(false);
}
public static async Task<IEnumerable<Service>> GetInfrastructureServiceInstancesAsync(
FabricClient fabricClient,
CancellationToken cancellationToken)
{
var allSystemServices =
await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
fabricClient.QueryManager.GetServiceListAsync(
new Uri("fabric:/System"),
null,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken),
cancellationToken).ConfigureAwait(false);
var infraInstances = allSystemServices.Where(
i => i.ServiceTypeName.Equals(
RepairConstants.InfrastructureServiceType,
StringComparison.InvariantCultureIgnoreCase));
return infraInstances;
}
public static async Task<bool> IsLastCompletedFHRepairTaskWithinTimeRangeAsync(
TimeSpan interval,
FabricClient fabricClient,
TelemetryData foHealthData,
CancellationToken cancellationToken)
{
var allRecentFHRepairTasksCompleted =
await fabricClient.RepairManager.GetRepairTaskListAsync(
RepairTaskEngine.FHTaskIdPrefix,
RepairTaskStateFilter.Completed,
null,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken).ConfigureAwait(true);
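// Guard against a null list as well as an empty one; the foreach below would throw on null.
if (allRecentFHRepairTasksCompleted == null || allRecentFHRepairTasksCompleted.Count == 0)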
{
return false;
}
foreach (var repair in allRecentFHRepairTasksCompleted.Where(r => r.ResultStatus == RepairTaskResult.Succeeded))
{
if (cancellationToken.IsCancellationRequested)
{
return false;
}
var fhExecutorData =
SerializationUtility.TryDeserialize(repair.ExecutorData, out RepairExecutorData exData) ? exData : null;
// Non-VM repairs (FH is executor, custom repair ExecutorData supplied by FH.)
if (fhExecutorData != null)
{
if (foHealthData.RepairId != fhExecutorData.CustomIdentificationData)
{
continue;
}
if (repair.CompletedTimestamp == null || !repair.CompletedTimestamp.HasValue)
{
return false;
}
// Note: Completed aborted/cancelled repair tasks should not block repairs if they are inside run interval.
if (DateTime.UtcNow.Subtract(repair.CompletedTimestamp.Value) <= interval
&& repair.Flags != RepairTaskFlags.CancelRequested && repair.Flags != RepairTaskFlags.AbortRequested)
{
return true;
}
}
// VM repairs (IS is executor, ExecutorData supplied by IS. Custom FH repair id supplied as repair Description.)
else if (repair.Executor == $"fabric:/System/InfrastructureService/{foHealthData.NodeType}" && repair.Description == foHealthData.RepairId)
{
if (repair.CompletedTimestamp == null || !repair.CompletedTimestamp.HasValue)
{
return false;
}
// Note: Completed aborted/cancelled repair tasks should not block repairs if they are inside run interval.
if (DateTime.UtcNow.Subtract(repair.CompletedTimestamp.Value) <= interval
&& repair.Flags != RepairTaskFlags.CancelRequested && repair.Flags != RepairTaskFlags.AbortRequested)
{
return true;
}
}
}
return false;
}
public static async Task<int> GetCompletedRepairCountWithinTimeRangeAsync(
TimeSpan timeWindow,
FabricClient fabricClient,
TelemetryData foHealthData,
CancellationToken cancellationToken)
{
var allRecentFHRepairTasksCompleted =
await fabricClient.RepairManager.GetRepairTaskListAsync(
RepairTaskEngine.FHTaskIdPrefix,
RepairTaskStateFilter.Completed,
null,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken).ConfigureAwait(true);
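// Guard against a null list as well as an empty one; the foreach below would throw on null.
if (allRecentFHRepairTasksCompleted == null || allRecentFHRepairTasksCompleted.Count == 0)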
{
return 0;
}
int count = 0;
foreach (var repair in allRecentFHRepairTasksCompleted.Where(r => r.ResultStatus == RepairTaskResult.Succeeded))
{
if (cancellationToken.IsCancellationRequested)
{
return 0;
}
var fhExecutorData =
SerializationUtility.TryDeserialize(repair.ExecutorData, out RepairExecutorData exData) ? exData : null;
// Non-VM repairs (FH is executor, custom repair ExecutorData supplied by FH.)
if (fhExecutorData != null)
{
if (foHealthData.RepairId != fhExecutorData.CustomIdentificationData)
{
continue;
}
if (repair.CompletedTimestamp == null || !repair.CompletedTimestamp.HasValue)
{
continue;
}
// Note: Completed aborted/cancelled repair tasks should not block repairs if they are inside run interval.
if (DateTime.UtcNow.Subtract(repair.CompletedTimestamp.Value) <= timeWindow
&& repair.Flags != RepairTaskFlags.CancelRequested && repair.Flags != RepairTaskFlags.AbortRequested)
{
count++;
}
}
// VM repairs (IS is executor, ExecutorData supplied by IS. Custom FH repair id supplied as repair Description.)
else if (repair.Executor == $"fabric:/System/InfrastructureService/{foHealthData.NodeType}" && repair.Description == foHealthData.RepairId)
{
if (repair.CompletedTimestamp == null || !repair.CompletedTimestamp.HasValue)
{
continue;
}
// Note: Completed aborted/cancelled repair tasks should not block repairs if they are inside max time window for a repair cycle (of n repair attempts at a run interval of y)
if (DateTime.UtcNow.Subtract(repair.CompletedTimestamp.Value) <= timeWindow
&& repair.Flags != RepairTaskFlags.CancelRequested && repair.Flags != RepairTaskFlags.AbortRequested)
{
count++;
}
}
}
return count;
}
}
}
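// Illustrative sketch only; not part of the original commit. It isolates the time-window test that
// IsLastCompletedFHRepairTaskWithinTimeRangeAsync and GetCompletedRepairCountWithinTimeRangeAsync apply to each
// completed repair task, so the rule can be read without a FabricClient.
// Assumptions: the namespace, class, and method names below are hypothetical, and the current time is passed in
// as a parameter instead of being read from DateTime.UtcNow.
namespace FabricHealer.Repair.Sketches
{
internal static class RepairTimeWindowSketch
{
// Returns true when a repair that completed at completedUtc falls inside the window ending at nowUtc.
// A task without a completion timestamp is never treated as inside the window.
internal static bool IsInsideWindow(System.DateTime? completedUtc, System.TimeSpan window, System.DateTime nowUtc)
{
if (!completedUtc.HasValue)
{
return false;
}
return nowUtc.Subtract(completedUtc.Value) <= window;
}
}
}
// Possible usage, under the same assumptions:
// RepairTimeWindowSketch.IsInsideWindow(repair.CompletedTimestamp, interval, System.DateTime.UtcNow)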

View file

@ -0,0 +1,165 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using FabricHealer.Utilities;
using FabricHealer.Utilities.Telemetry;
using Guan.Common;
using Guan.Logic;
using System.IO;
using System.Linq;
namespace FabricHealer.Repair.Guan
{
public class CheckFolderSizePredicateType : PredicateType
{
private static CheckFolderSizePredicateType Instance;
private static RepairTaskManager RepairTaskManager;
private static TelemetryData FOHealthData;
class Resolver : BooleanPredicateResolver
{
public Resolver(
CompoundTerm input,
Constraint constraint,
QueryContext context)
: base(input, constraint, context)
{
}
protected override bool Check()
{
string folderPath = null;
long maxFolderSizeGB = 0;
long maxFolderSizeMB = 0;
int count = Input.Arguments.Count;
for (int i = 0; i < count; i++)
{
switch (Input.Arguments[i].Name.ToLower())
{
case "folderpath":
folderPath = (string)Input.Arguments[i].Value.GetEffectiveTerm().GetValue();
break;
case "maxfoldersizemb":
maxFolderSizeMB = (long)Input.Arguments[i].Value.GetEffectiveTerm().GetValue();
break;
case "maxfoldersizegb":
maxFolderSizeGB = (long)Input.Arguments[i].Value.GetEffectiveTerm().GetValue();
break;
default:
throw new GuanException($"Unsupported input: {Input.Arguments[i].Name}");
}
}
if (!Directory.Exists(folderPath))
{
RepairTaskManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"CheckFolderSizePredicate::DirectoryNotFound",
$"Directory {folderPath} does not exist.",
RepairTaskManager.Token).GetAwaiter().GetResult();
return false;
}
if (Directory.GetFiles(folderPath, "*", new EnumerationOptions { RecurseSubdirectories = true }).Length == 0)
{
RepairTaskManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"CheckFolderSizePredicate::NoFilesFound",
$"Directory {folderPath} does not contain any files.",
RepairTaskManager.Token).GetAwaiter().GetResult();
return false;
}
long size = 0;
if (maxFolderSizeGB > 0)
{
size = GetFolderSize(folderPath, SizeUnit.GB);
if (size >= maxFolderSizeGB)
{
return true;
}
}
else if (maxFolderSizeMB > 0)
{
size = GetFolderSize(folderPath, SizeUnit.MB);
if (size >= maxFolderSizeMB)
{
return true;
}
}
string message =
$"Repair {FOHealthData.RepairId}: Computed folder size ({size}{(maxFolderSizeGB > 0 ? "GB" : "MB")}) " +
$"for path {folderPath} is less than the supplied maximum folder size ({(maxFolderSizeGB > 0 ? maxFolderSizeGB.ToString() + "GB" : maxFolderSizeMB.ToString() + "MB")}). " +
"Will not attempt repair.";
RepairTaskManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"CheckFolderSizePredicate",
message,
RepairTaskManager.Token).GetAwaiter().GetResult();
return false;
}
private long GetFolderSize(string path, SizeUnit unit)
{
var dir = new DirectoryInfo(path);
var folderSizeInBytes = dir.EnumerateFiles("*", SearchOption.AllDirectories).Sum(fi => fi.Length);
RepairTaskManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"CheckFolderSizePredicate::Size",
$"Directory {path} size: {folderSizeInBytes} bytes.",
RepairTaskManager.Token).GetAwaiter().GetResult();
if (unit == SizeUnit.GB)
{
return folderSizeInBytes / 1024 / 1024 / 1024;
}
return folderSizeInBytes / 1024 / 1024;
}
}
public static CheckFolderSizePredicateType Singleton(
string name,
RepairTaskManager repairTaskManager,
TelemetryData foHealthData)
{
FOHealthData = foHealthData;
RepairTaskManager = repairTaskManager;
return Instance ??= new CheckFolderSizePredicateType(name);
}
private CheckFolderSizePredicateType(
string name)
: base(name, true, 2, 2)
{
}
public override PredicateResolver CreateResolver(CompoundTerm input, Constraint constraint, QueryContext context)
{
return new Resolver(input, constraint, context);
}
}
enum SizeUnit
{
GB,
MB
}
}
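// Illustrative sketch only; not part of the original commit. It mirrors the unit conversion in GetFolderSize to
// show that the size is computed with integer division, so any fractional megabyte or gigabyte is dropped before
// the value is compared with the supplied maximum.
// Assumptions: the namespace, class, and method names below are hypothetical.
namespace FabricHealer.Repair.Guan.Sketches
{
internal static class FolderSizeSketch
{
// Converts a byte count to whole megabytes or whole gigabytes (1024-based), truncating the remainder.
internal static long ToWholeUnits(long sizeInBytes, bool useGigabytes)
{
long megabytes = sizeInBytes / 1024 / 1024;
return useGigabytes ? megabytes / 1024 : megabytes;
}
}
}
// Worked example: a folder of 1,610,612,736 bytes (1.5 GB) converts to 1 when useGigabytes is true, so a rule
// with a maximum of 2 GB would not trigger a repair, while the same folder measured in MB converts to 1536.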

View file

@ -0,0 +1,101 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using FabricHealer.Utilities;
using FabricHealer.Utilities.Telemetry;
using Guan.Common;
using Guan.Logic;
namespace FabricHealer.Repair.Guan
{
public class CheckInsideRunIntervalPredicateType : PredicateType
{
private static CheckInsideRunIntervalPredicateType Instance;
private static RepairTaskManager RepairTaskManager;
private static TelemetryData FOHealthData;
class Resolver : BooleanPredicateResolver
{
public Resolver(
CompoundTerm input,
Constraint constraint,
QueryContext context)
: base(input, constraint, context)
{
}
protected override bool Check()
{
int count = Input.Arguments.Count;
bool insideRunInterval = false;
// Use the invariant culture so the argument name comparison does not depend on the host's locale.
if (count == 0 || Input.Arguments[0].Name.ToLowerInvariant() != "runinterval")
{
throw new GuanException("RunInterval argument is required.");
}
TimeSpan interval = (TimeSpan)Input.Arguments[0].Value.GetEffectiveTerm().GetValue();
// TimeSpan.MinValue means no usable run interval was supplied, so the repair history is not queried
// and the repair is treated as being outside the run interval.
if (interval > TimeSpan.MinValue)
{
// FH is a stateless service (instance count -1), so the run interval is checked against the completed repair tasks
// held by the Repair Manager rather than against state kept in a single FH instance.
insideRunInterval = FabricRepairTasks.IsLastCompletedFHRepairTaskWithinTimeRangeAsync(
interval,
RepairTaskManager.FabricClientInstance,
FOHealthData,
RepairTaskManager.Token).GetAwaiter().GetResult();
}
if (!insideRunInterval)
{
return false;
}
string message =
$"Repair {FOHealthData.RepairId}:{FabricObserverErrorWarningCodes.GetMetricNameFromCode(FOHealthData.Code)} has already run once within the specified run interval." +
$"{Environment.NewLine}Run interval: {interval}. Will not attempt repair at this time.";
RepairTaskManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
$"CheckRunIntervalPredicate::{FOHealthData.RepairId}",
message,
RepairTaskManager.Token).GetAwaiter().GetResult();
return insideRunInterval;
}
}
public static CheckInsideRunIntervalPredicateType Singleton(
string name,
RepairTaskManager repairTaskManager,
TelemetryData foHealthData)
{
RepairTaskManager = repairTaskManager;
FOHealthData = foHealthData;
return Instance ??= new CheckInsideRunIntervalPredicateType(name);
}
private CheckInsideRunIntervalPredicateType(
string name)
: base(name, true, 1, 3)
{
}
public override PredicateResolver CreateResolver(
CompoundTerm input,
Constraint constraint,
QueryContext context)
{
return new Resolver(input, constraint, context);
}
}
}

View file

@ -0,0 +1,137 @@
using FabricHealer.Utilities.Telemetry;
using Guan.Logic;
using System;
using FabricHealer.Utilities;
using Guan.Common;
namespace FabricHealer.Repair.Guan
{
public class DeleteFilesPredicateType : PredicateType
{
private static RepairTaskManager RepairTaskManager;
private static TelemetryData FOHealthData;
private static DeleteFilesPredicateType Instance;
class Resolver : BooleanPredicateResolver
{
private readonly RepairConfiguration repairConfiguration;
public Resolver(
CompoundTerm input,
Constraint constraint,
QueryContext context)
: base(input, constraint, context)
{
this.repairConfiguration = new RepairConfiguration
{
AppName = !string.IsNullOrEmpty(FOHealthData.ApplicationName) ? new Uri(FOHealthData.ApplicationName) : null,
FOHealthCode = FOHealthData.Code,
NodeName = FOHealthData.NodeName,
NodeType = FOHealthData.NodeType,
PartitionId = !string.IsNullOrEmpty(FOHealthData.PartitionId) ? new Guid(FOHealthData.PartitionId) : default,
ReplicaOrInstanceId = !string.IsNullOrEmpty(FOHealthData.ReplicaId) ? long.Parse(FOHealthData.ReplicaId) : default,
ServiceName = !string.IsNullOrEmpty(FOHealthData.ServiceName) ? new Uri(FOHealthData.ServiceName) : null,
FOHealthMetricValue = FOHealthData.Value,
RepairPolicy = new DiskRepairPolicy(),
};
}
protected override bool Check()
{
bool recurseSubDirectories = false;
string path = null;
// A value of 0 (the default) means delete all files.
long maxFilesToDelete = 0;
FileSortOrder direction = FileSortOrder.Ascending;
int count = Input.Arguments.Count;
for (int i = 0; i < count; i++)
{
switch (Input.Arguments[i].Name.ToLower())
{
case "sortorder":
direction = (FileSortOrder)Enum.Parse(typeof(FileSortOrder), (string)Input.Arguments[i].Value.GetEffectiveTerm().GetValue());
break;
case "folderpath":
path = (string)Input.Arguments[i].Value.GetEffectiveTerm().GetValue();
break;
case "maxfilestodelete":
maxFilesToDelete = (long)Input.Arguments[i].Value.GetEffectiveTerm().GetValue();
break;
case "recursesubdirectories":
recurseSubDirectories = bool.Parse((string)Input.Arguments[i].Value.GetEffectiveTerm().GetValue());
break;
default:
throw new GuanException($"Unsupported input: {Input.Arguments[i].Name}");
}
}
// RepairPolicy
repairConfiguration.RepairPolicy.CurrentAction = RepairAction.DeleteFiles;
((DiskRepairPolicy)repairConfiguration.RepairPolicy).FolderPath = path;
repairConfiguration.RepairPolicy.Id = FOHealthData.RepairId;
((DiskRepairPolicy)repairConfiguration.RepairPolicy).MaxNumberOfFilesToDelete = maxFilesToDelete;
((DiskRepairPolicy)repairConfiguration.RepairPolicy).FileAgeSortOrder = direction;
repairConfiguration.RepairPolicy.TargetType = RepairTargetType.VirtualMachine;
((DiskRepairPolicy)repairConfiguration.RepairPolicy).RecurseSubdirectories = recurseSubDirectories;
// Try to schedule repair with RM.
var repairTask = FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
RepairTaskManager.ScheduleFabricHealerRmRepairTaskAsync(
repairConfiguration,
RepairTaskManager.Token),
RepairTaskManager.Token).ConfigureAwait(true).GetAwaiter().GetResult();
if (repairTask == null)
{
return false;
}
// Try to execute repair (FH executor does this work and manages repair state).
bool success = FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
RepairTaskManager.ExecuteFabricHealerRmRepairTaskAsync(
repairTask,
repairConfiguration,
RepairTaskManager.Token),
RepairTaskManager.Token).ConfigureAwait(false).GetAwaiter().GetResult();
return success;
}
}
public static DeleteFilesPredicateType Singleton(
string name,
RepairTaskManager repairTaskManager,
TelemetryData foHealthData)
{
FOHealthData = foHealthData;
RepairTaskManager = repairTaskManager;
return Instance ??= new DeleteFilesPredicateType(name);
}
private DeleteFilesPredicateType(
string name)
: base(name, true, 1, 5)
{
}
public override PredicateResolver CreateResolver(
CompoundTerm input,
Constraint constraint,
QueryContext context)
{
return new Resolver(input, constraint, context);
}
}
}

View file

@ -0,0 +1,94 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System.Threading.Tasks;
using Guan.Common;
using Guan.Logic;
using System;
using FabricHealer.Utilities.Telemetry;
using FabricHealer.Utilities;
namespace FabricHealer.Repair.Guan
{
public class GetRepairHistoryPredicateType : PredicateType
{
private static RepairTaskManager RepairTaskManager;
private static TelemetryData FOHealthData;
private static GetRepairHistoryPredicateType Instance;
class Resolver : GroundPredicateResolver
{
public Resolver(
CompoundTerm input,
Constraint constraint,
QueryContext context)
: base(input, constraint, context, 1)
{
}
protected override Task<Term> GetNextTermAsync()
{
long repairCount = 0;
TimeSpan timeWindow = (TimeSpan)Input.Arguments[1].Value.GetEffectiveTerm().GetValue();
if (timeWindow > TimeSpan.MinValue)
{
repairCount = FabricRepairTasks.GetCompletedRepairCountWithinTimeRangeAsync(
timeWindow,
RepairTaskManager.FabricClientInstance,
FOHealthData,
RepairTaskManager.Token).GetAwaiter().GetResult();
}
else
{
string message = "You must supply a valid TimeSpan string for the TimeWindow argument of GetRepairHistoryPredicate. A default result (0) has been supplied.";
RepairTaskManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
$"GetRepairHistoryPredicate::{FOHealthData.RepairId}",
message,
RepairTaskManager.Token).GetAwaiter().GetResult();
}
var result = new CompoundTerm(Instance, null);
result.AddArgument(new Constant(repairCount), "0");
return Task.FromResult<Term>(result);
}
}
public static GetRepairHistoryPredicateType Singleton(
string name,
RepairTaskManager repairTaskManager,
TelemetryData foHealthData)
{
RepairTaskManager = repairTaskManager;
FOHealthData = foHealthData;
return Instance ??= new GetRepairHistoryPredicateType(name);
}
private GetRepairHistoryPredicateType(
string name)
: base(name, true, 2, 2)
{
}
public override PredicateResolver CreateResolver(CompoundTerm input, Constraint constraint, QueryContext context)
{
return new Resolver(input, constraint, context);
}
public override void AdjustTerm(CompoundTerm term, Rule rule)
{
if (!(term.Arguments[0].Value is IndexedVariable))
{
throw new GuanException("The first argument, ?repairCount, of GetRepairHistoryPredicateType must be a variable: {0}", term);
}
}
}
}

View file

@ -0,0 +1,104 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using FabricHealer.Utilities.Telemetry;
using Guan.Logic;
using System;
using FabricHealer.Utilities;
namespace FabricHealer.Repair.Guan
{
public class RestartCodePackagePredicateType : PredicateType
{
private static RepairTaskManager RepairTaskManager;
private static TelemetryData FOHealthData;
private static RestartCodePackagePredicateType Instance;
class Resolver : BooleanPredicateResolver
{
private readonly RepairConfiguration repairConfiguration;
public Resolver(
CompoundTerm input,
Constraint constraint,
QueryContext context)
: base(input, constraint, context)
{
this.repairConfiguration = new RepairConfiguration
{
AppName = !string.IsNullOrEmpty(FOHealthData.ApplicationName) ? new Uri(FOHealthData.ApplicationName) : null,
ContainerId = FOHealthData.ContainerId,
FOHealthCode = FOHealthData.Code,
NodeName = FOHealthData.NodeName,
NodeType = FOHealthData.NodeType,
PartitionId = !string.IsNullOrEmpty(FOHealthData.PartitionId) ? new Guid(FOHealthData.PartitionId) : default,
ReplicaOrInstanceId = !string.IsNullOrEmpty(FOHealthData.ReplicaId) ? long.Parse(FOHealthData.ReplicaId) : default,
ServiceName = !string.IsNullOrEmpty(FOHealthData.ServiceName) ? new Uri(FOHealthData.ServiceName) : null,
FOHealthMetricValue = FOHealthData.Value,
RepairPolicy = new RepairPolicy(),
};
}
protected override bool Check()
{
// RepairPolicy
repairConfiguration.RepairPolicy.CurrentAction = RepairAction.RestartCodePackage;
repairConfiguration.RepairPolicy.Id = FOHealthData.RepairId;
repairConfiguration.RepairPolicy.TargetType = RepairTargetType.Application;
// Try to schedule repair with RM.
var repairTask = FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
RepairTaskManager.ScheduleFabricHealerRmRepairTaskAsync(
repairConfiguration,
RepairTaskManager.Token),
RepairTaskManager.Token).ConfigureAwait(true).GetAwaiter().GetResult();
if (repairTask == null)
{
return false;
}
// Try to execute repair (FH executor does this work and manages repair state).
bool success = FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
RepairTaskManager.ExecuteFabricHealerRmRepairTaskAsync(
repairTask,
repairConfiguration,
RepairTaskManager.Token),
RepairTaskManager.Token).ConfigureAwait(false).GetAwaiter().GetResult();
return success;
}
}
public static RestartCodePackagePredicateType Singleton(
string name,
RepairTaskManager repairTaskManager,
TelemetryData foHealthData)
{
RepairTaskManager = repairTaskManager;
FOHealthData = foHealthData;
return Instance ??= new RestartCodePackagePredicateType(name);
}
private RestartCodePackagePredicateType(
string name)
: base(name, true, 0, 0)
{
}
public override PredicateResolver CreateResolver(
CompoundTerm input,
Constraint constraint,
QueryContext context)
{
return new Resolver(input, constraint, context);
}
}
}

View file

@ -0,0 +1,151 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using FabricHealer.Utilities.Telemetry;
using Guan.Logic;
using System;
using System.Fabric.Repair;
using FabricHealer.Utilities;
using Guan.Common;
namespace FabricHealer.Repair.Guan
{
public class RestartFabricNodePredicateType : PredicateType
{
private static RepairTaskManager RepairTaskManager;
private static RepairExecutorData RepairExecutorData;
private static RepairTaskEngine RepairTaskEngine;
private static TelemetryData FOHealthData;
private static RestartFabricNodePredicateType Instance;
class Resolver : BooleanPredicateResolver
{
private readonly RepairConfiguration repairConfiguration;
public Resolver(
CompoundTerm input,
Constraint constraint,
QueryContext context)
: base(input, constraint, context)
{
this.repairConfiguration = new RepairConfiguration
{
AppName = !string.IsNullOrEmpty(FOHealthData.ApplicationName) ? new Uri(FOHealthData.ApplicationName) : null,
FOHealthCode = FOHealthData.Code,
NodeName = FOHealthData.NodeName,
NodeType = FOHealthData.NodeType,
PartitionId = !string.IsNullOrEmpty(FOHealthData.PartitionId) ? new Guid(FOHealthData.PartitionId) : default,
ReplicaOrInstanceId = !string.IsNullOrEmpty(FOHealthData.ReplicaId) ? long.Parse(FOHealthData.ReplicaId) : default,
ServiceName = (!string.IsNullOrEmpty(FOHealthData.ServiceName) && FOHealthData.ServiceName.Contains("fabric:/")) ? new Uri(FOHealthData.ServiceName) : null,
FOHealthMetricValue = FOHealthData.Value,
RepairPolicy = new RepairPolicy(),
};
}
protected override bool Check()
{
RepairTask repairTask;
// Repair Policy
repairConfiguration.RepairPolicy.CurrentAction = RepairAction.RestartFabricNode;
repairConfiguration.RepairPolicy.Id = FOHealthData.RepairId;
repairConfiguration.RepairPolicy.TargetType = FOHealthData.ApplicationName == "fabric:/System" ? RepairTargetType.Application : RepairTargetType.Node;
bool success;
// This means it's a resumed repair.
if (RepairExecutorData != null)
{
// Historical info, like what step the healer was in when the node went down, is contained in the
// executordata instance.
repairTask = RepairTaskEngine.CreateFabricHealerRmRepairTask(this.repairConfiguration, RepairExecutorData);
success = RepairTaskManager.ExecuteFabricHealerRmRepairTaskAsync(
repairTask,
this.repairConfiguration,
RepairTaskManager.Token).ConfigureAwait(false).GetAwaiter().GetResult();
return success;
}
// Block attempts to create node-level repair tasks if one is already running in the cluster.
var repairTaskEngine = new RepairTaskEngine(RepairTaskManager.FabricClientInstance);
var isNodeRepairAlreadyInProgress =
repairTaskEngine.IsFHRepairTaskRunningAsync(
"FabricHealer",
repairConfiguration,
RepairTaskManager.Token).GetAwaiter().GetResult();
if (isNodeRepairAlreadyInProgress)
{
string message =
$"A Fabric Node repair, {FOHealthData.RepairId}, is already in progress in the cluster. Will not attempt repair at this time.";
RepairTaskManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
$"RestartFabricNodePredicateType::{FOHealthData.RepairId}",
message,
RepairTaskManager.Token).GetAwaiter().GetResult();
return false;
}
// Try to schedule repair with RM.
repairTask = FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
RepairTaskManager.ScheduleFabricHealerRmRepairTaskAsync(
this.repairConfiguration,
RepairTaskManager.Token),
RepairTaskManager.Token).ConfigureAwait(true).GetAwaiter().GetResult();
if (repairTask == null)
{
return false;
}
// Try to execute custom repair (FH executor).
success = FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
RepairTaskManager.ExecuteFabricHealerRmRepairTaskAsync(
repairTask,
this.repairConfiguration,
RepairTaskManager.Token),
RepairTaskManager.Token).ConfigureAwait(false).GetAwaiter().GetResult();
return success;
}
}
public static RestartFabricNodePredicateType Singleton(
string name,
RepairTaskManager repairTaskManager,
RepairExecutorData repairExecutorData,
RepairTaskEngine repairTaskEngine,
TelemetryData foHealthData)
{
RepairTaskManager = repairTaskManager;
RepairExecutorData = repairExecutorData;
RepairTaskEngine = repairTaskEngine;
FOHealthData = foHealthData;
return Instance ??= new RestartFabricNodePredicateType(name);
}
private RestartFabricNodePredicateType(
string name)
: base(name, true, 0, 0)
{
}
public override PredicateResolver CreateResolver(
CompoundTerm input,
Constraint constraint,
QueryContext context)
{
return new Resolver(input, constraint, context);
}
}
}

View file

@ -0,0 +1,101 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using FabricHealer.Utilities.Telemetry;
using Guan.Logic;
using System;
using FabricHealer.Utilities;
using Guan.Common;
namespace FabricHealer.Repair.Guan
{
public class RestartReplicaPredicateType : PredicateType
{
private static RepairTaskManager RepairTaskManager;
private static TelemetryData FOHealthData;
private static RestartReplicaPredicateType Instance;
class Resolver : BooleanPredicateResolver
{
private readonly RepairConfiguration repairConfiguration;
public Resolver(
CompoundTerm input,
Constraint constraint,
QueryContext context)
: base(input, constraint, context)
{
this.repairConfiguration = new RepairConfiguration
{
AppName = !string.IsNullOrEmpty(FOHealthData.ApplicationName) ? new Uri(FOHealthData.ApplicationName) : null,
FOHealthCode = FOHealthData.Code,
NodeName = FOHealthData.NodeName,
NodeType = FOHealthData.NodeType,
PartitionId = !string.IsNullOrEmpty(FOHealthData.PartitionId) ? new Guid(FOHealthData.PartitionId) : default,
ReplicaOrInstanceId = !string.IsNullOrEmpty(FOHealthData.ReplicaId) ? long.Parse(FOHealthData.ReplicaId) : default,
ServiceName = !string.IsNullOrEmpty(FOHealthData.ServiceName) ? new Uri(FOHealthData.ServiceName) : null,
FOHealthMetricValue = FOHealthData.Value,
RepairPolicy = new RepairPolicy(),
};
}
protected override bool Check()
{
repairConfiguration.RepairPolicy.CurrentAction = RepairAction.RestartReplica;
repairConfiguration.RepairPolicy.Id = FOHealthData.RepairId;
repairConfiguration.RepairPolicy.TargetType = RepairTargetType.Application;
// Try to schedule repair with RM.
var repairTask = FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
RepairTaskManager.ScheduleFabricHealerRmRepairTaskAsync(
repairConfiguration,
RepairTaskManager.Token),
RepairTaskManager.Token).ConfigureAwait(false).GetAwaiter().GetResult();
if (repairTask == null)
{
return false;
}
// Try to execute custom repair (FH executor).
bool success = FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
RepairTaskManager.ExecuteFabricHealerRmRepairTaskAsync(
repairTask,
repairConfiguration,
RepairTaskManager.Token),
RepairTaskManager.Token).ConfigureAwait(false).GetAwaiter().GetResult();
return success;
}
}
public static RestartReplicaPredicateType Singleton(
string name,
RepairTaskManager repairTaskManager,
TelemetryData foHealthData)
{
RepairTaskManager = repairTaskManager;
FOHealthData = foHealthData;
return Instance ??= new RestartReplicaPredicateType(name);
}
private RestartReplicaPredicateType(
string name)
: base(name, true, 0, 2)
{
}
public override PredicateResolver CreateResolver(
CompoundTerm input,
Constraint constraint,
QueryContext context)
{
return new Resolver(input, constraint, context);
}
}
}

View File

@ -0,0 +1,112 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using FabricHealer.Utilities.Telemetry;
using Guan.Logic;
using System;
using FabricHealer.Utilities;
using Guan.Common;
namespace FabricHealer.Repair.Guan
{
public class RestartVMPredicateType : PredicateType
{
private static RepairTaskManager RepairTaskManager;
private static TelemetryData FOHealthData;
private static RestartVMPredicateType Instance;
class Resolver : BooleanPredicateResolver
{
private readonly RepairConfiguration repairConfiguration;
public Resolver(
CompoundTerm input,
Constraint constraint,
QueryContext context)
: base(input, constraint, context)
{
this.repairConfiguration = new RepairConfiguration
{
AppName = !string.IsNullOrEmpty(FOHealthData.ApplicationName) ? new Uri(FOHealthData.ApplicationName) : null,
FOHealthCode = FOHealthData.Code,
NodeName = FOHealthData.NodeName,
NodeType = FOHealthData.NodeType,
PartitionId = !string.IsNullOrEmpty(FOHealthData.PartitionId) ? new Guid(FOHealthData.PartitionId) : default,
ReplicaOrInstanceId = !string.IsNullOrEmpty(FOHealthData.ReplicaId) ? long.Parse(FOHealthData.ReplicaId) : default,
ServiceName = !string.IsNullOrEmpty(FOHealthData.ServiceName) ? new Uri(FOHealthData.ServiceName) : null,
FOHealthMetricValue = FOHealthData.Value,
RepairPolicy = new RepairPolicy(),
};
}
protected override bool Check()
{
// Repair Policy
this.repairConfiguration.RepairPolicy.CurrentAction = RepairAction.RestartVM;
repairConfiguration.RepairPolicy.Id = FOHealthData.RepairId;
repairConfiguration.RepairPolicy.TargetType = RepairTargetType.VirtualMachine;
// FH does not execute VM-level repairs; InfrastructureService (IS) does.
// FH therefore only schedules VM repairs via RM, and IS (the executor) carries them out.
// Block attempts to create duplicate repair tasks.
var repairTaskEngine = new RepairTaskEngine(RepairTaskManager.FabricClientInstance);
var isRepairAlreadyInProgress =
repairTaskEngine.IsFHRepairTaskRunningAsync(
$"fabric:/System/InfrastructureService/{FOHealthData.NodeType}",
repairConfiguration,
RepairTaskManager.Token).GetAwaiter().GetResult();
if (isRepairAlreadyInProgress)
{
string message =
$"VM Repair {FOHealthData.RepairId} is already in progress. Will not attempt repair at this time.";
RepairTaskManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
$"RestartVMPredicateType::{FOHealthData.RepairId}",
message,
RepairTaskManager.Token).GetAwaiter().GetResult();
return false;
}
bool success = FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
RepairTaskManager.ExecuteRMInfrastructureRepairTask(
repairConfiguration,
RepairTaskManager.Token),
RepairTaskManager.Token).ConfigureAwait(false).GetAwaiter().GetResult();
return success;
}
}
public static RestartVMPredicateType Singleton(
string name,
RepairTaskManager repairTaskManager,
TelemetryData foHealthData)
{
RepairTaskManager = repairTaskManager;
FOHealthData = foHealthData;
return Instance ??= new RestartVMPredicateType(name);
}
private RestartVMPredicateType(
string name)
: base(name, true, 0, 2)
{
}
public override PredicateResolver CreateResolver(
CompoundTerm input,
Constraint constraint,
QueryContext context)
{
return new Resolver(input, constraint, context);
}
}
}

View File

@ -0,0 +1,66 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Fabric.Health;
using FabricHealer.Utilities;
namespace FabricHealer.Repair
{
public static class HealthEventChecker
{
public static bool IsHealthPropertyInError(
HealthEvent healthEvent,
HealthReportKind kind,
bool treatWarningAsError = false)
{
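// A missing event can never be in error.
if (healthEvent == null)
{
return false;
}
HealthState healthState = healthEvent.HealthInformation.HealthState;
// Only Error qualifies, plus Warning when the caller opts in.
if (healthState != HealthState.Error
&& !(treatWarningAsError && healthState == HealthState.Warning))
{
return false;
}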
switch (kind)
{
// App, Service, Replica/Instance, CodePackage, etc (as in service process/instance scope)
case HealthReportKind.Application:
case HealthReportKind.Service:
case HealthReportKind.DeployedApplication:
case HealthReportKind.StatefulServiceReplica:
case HealthReportKind.StatelessServiceInstance:
case HealthReportKind.DeployedServicePackage:
return healthEvent != null && FabricObserverErrorWarningCodes.AppErrorCodesDictionary.ContainsKey(healthEvent.HealthInformation.SourceId);
// Node level (as in VM scope)
case HealthReportKind.Node:
return healthEvent != null && FabricObserverErrorWarningCodes.NodeErrorCodesDictionary.ContainsKey(healthEvent.HealthInformation.SourceId);
// Report kinds with no FabricObserver error code mapping.
case HealthReportKind.Invalid:
case HealthReportKind.Partition:
case HealthReportKind.Cluster:
default:
return false;
}
}
public static bool HasHealthPropertyExpired(HealthEvent healthEvent)
{
if (healthEvent == null)
{
throw new ArgumentNullException(nameof(healthEvent), "HealthEvent can't be null.");
}
return healthEvent.IsExpired;
}
}
}

View File

@ -0,0 +1,23 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
namespace FabricHealer.Repair
{
/// <summary>
/// Not all of these actions have corresponding implementations yet.
/// </summary>
public enum RepairAction
{
DeleteFiles,
PauseFabricNode,
RemoveFabricNodeState,
RemoveReplica,
RepairPartition,
RestartCodePackage,
RestartFabricNode,
RestartReplica,
RestartVM,
}
}

View File

@ -0,0 +1,68 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Fabric.Query;
namespace FabricHealer.Repair
{
public class RepairConfiguration
{
public Uri AppName
{
get; set;
}
public DeployedCodePackage CodePackage
{
get; set;
}
public string ContainerId
{
get; set;
}
public string NodeType
{
get; set;
}
public string NodeName
{
get; set;
}
public Guid PartitionId
{
get; set;
} = Guid.Empty;
public RepairPolicy RepairPolicy
{
get; set;
}
public long ReplicaOrInstanceId
{
get; set;
} = default;
public Uri ServiceName
{
get; set;
}
public string FOHealthCode
{
get; set;
}
public object FOHealthMetricValue
{
get; set;
}
}
}

View File

@ -0,0 +1,106 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
namespace FabricHealer.Repair
{
public static class RepairConstants
{
// Queue Constants
public const int QueueRetries = 3;
// Time Constants
public const int QueueVisibilityTimeInMin = 60;
public const int TaskDelayTimeInMin = 3;
public const int QueueRetryTimeInSec = 3;
public const int QueueRetryCount = 5;
// Logic rules file parameter.
public const string LogicRulesConfigurationFile = "LogicRulesConfigurationFile";
// Health event sourceId constants.
public const string FabricObserverSourceId = "FabricObserver";
public const string InfrastructureServiceSourceId = "System.InfrastructureService";
public const string RepairPolicyEngineServiceSourceId = "RepairPolicyEngineService";
public const string MonitoringHealthProperty = "MonitoringHealth";
public const string InfrastructureServiceType = "InfrastructureServiceType";
// Telemetry Settings Parameters.
public const string TelemetryProviderType = "TelemetryProvider";
public const string LogAnalyticsLogTypeParameter = "LogAnalyticsLogType";
public const string LogAnalyticsSharedKeyParameter = "LogAnalyticsSharedKey";
public const string LogAnalyticsWorkspaceIdParameter = "LogAnalyticsWorkspaceId";
public const string EventSourceEventName = "FabricHealerDataEvent";
// RepairManager Settings Parameters.
public const string RepairManagerConfigurationSectionName = "RepairManagerConfiguration";
public const string EnableVerboseLoggingParameter = "EnableVerboseLogging";
public const string ShutdownGracePeriodInSeconds = "ShutdownGracePeriodInSeconds";
public const string AppInsightsTelemetryEnabled = "EnableTelemetryProvider";
public const string AppInsightsInstrumentationKeyParameter = "AppInsightsInstrumentationKey";
public const string FabricHealerpublicTelemetryEnabled = "FabricHealerpublicTelemetryEnabled";
public const string EnableEventSourceProvider = "EnableEventSourceProvider";
public const string EventSourceProviderName = "EventSourceProviderName";
public const string HealthCheckLoopSleepTimeSeconds = "HealthCheckLoopSleepTimeSeconds";
public const string LocalLogPathParameter = "LocalLogPath";
public const string AsyncOperationTimeout = "AsyncOperationTimeoutSeconds";
// General Repair Settings Parameters.
public const string EnableAutoMitigation = "EnableAutoMitigation";
// RepairPolicy Settings Sections.
public const string FabricNodeRepairPolicySectionName = "FabricNodeRepairPolicy";
public const string ReplicaRepairPolicySectionName = "ReplicaRepairPolicy";
public const string AppRepairPolicySectionName = "AppRepairPolicy";
public const string DiskRepairPolicySectionName = "DiskRepairPolicy";
public const string SystemAppRepairPolicySectionName = "SystemAppRepairPolicy";
public const string VmRepairPolicySectionName = "VMRepairPolicy";
// RepairPolicy Settings Parameters.
public const string ActionParameter = "RepairAction";
public const string Enabled = "Enabled";
public const string AppName = "AppName";
public const string ServiceName = "ServiceName";
public const string NodeName = "NodeName";
public const string NodeType = "NodeType";
public const string PartitionId = "PartitionId";
public const string ReplicaOrInstanceId = "ReplicaOrInstanceId";
public const string TargetType = "TargetType";
public const string CycleTimeDistributionType = "CycleTimeDistributionType";
public const string RunInterval = "RunInterval";
public const string FOErrorCode = "FOErrorCode";
public const string MetricName = "MetricName";
public const string MetricValue = "MetricValue";
// Repair Actions.
public const string DeleteFiles = "DeleteFiles";
public const string RestartCodePackage = "RestartCodePackage";
public const string RestartFabricNode = "RestartFabricNode";
public const string RestartReplica = "RestartReplica";
public const string RestartVM = "RestartVM";
// Helper Predicates.
public const string CheckInsideRunInterval = "CheckInsideRunInterval";
public const string CheckFolderSize = "CheckFolderSize";
public const string GetRepairHistory = "GetRepairHistory";
// Resource types.
public const string ActiveTcpPorts = "ActiveTcpPorts";
public const string Certificate = "Certificate";
public const string Cpu = "Cpu";
public const string CpuPercent = "CpuPercent";
public const string Disk = "Disk";
public const string DiskAverageQueueLength = "DiskAverageQueueLength";
public const string DiskSpaceMB = "DiskSpaceMB";
public const string DiskSpacePercent = "DiskSpacePercent";
public const string EphemeralPorts = "EphemeralPorts";
public const string EndpointUnreachable = "EndpointUnreachable";
public const string FirewallRules = "FirewallRules";
public const string Memory = "Memory";
public const string MemoryMB = "MemoryMB";
public const string MemoryPercent = "MemoryPercent";
public const string Network = "Network";
}
}

View File

@ -0,0 +1,716 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Threading.Tasks;
using System.Threading;
using System.Diagnostics;
using System.Fabric;
using System.Fabric.Result;
using System.Fabric.Query;
using System.Fabric.Health;
using FabricHealer.Utilities;
using FabricHealer.Utilities.Telemetry;
using System.Fabric.Repair;
using System.Net;
using System.Net.Sockets;
using System.IO;
using System.Security;
using System.Linq;
using System.Collections.Generic;
namespace FabricHealer.Repair
{
public class RepairExecutor
{
private const double MaxWaitTimeMinutesForNodeOperation = 60.0;
private readonly FabricClient fabricClient;
private readonly TelemetryUtilities telemetryUtilities;
private readonly StatelessServiceContext serviceContext;
public bool IsOneNodeCluster
{
get;
}
public RepairExecutor(
FabricClient fabricClient,
StatelessServiceContext context,
CancellationToken token)
{
this.serviceContext = context;
this.fabricClient = fabricClient;
this.telemetryUtilities = new TelemetryUtilities(fabricClient, context);
try
{
if (FabricHealerManager.ConfigSettings == null)
{
return;
}
IsOneNodeCluster =
this.fabricClient.QueryManager.GetNodeListAsync(
null,
FabricHealerManager.ConfigSettings.AsyncTimeout,
token).GetAwaiter().GetResult().Count == 1;
}
catch (FabricException fe)
{
FabricHealerManager.RepairLogger.LogWarning(
$"Unable to determine cluster size:{Environment.NewLine}{fe}");
}
}
public async Task<RestartDeployedCodePackageResult> RestartCodePackageAsync(
Uri appName,
Guid partitionId,
long replicaId,
Uri serviceName,
CancellationToken cancellationToken)
{
try
{
PartitionSelector partitionSelector = PartitionSelector.PartitionIdOf(serviceName, partitionId);
// Verify target replica still exists.
var replicaList = await fabricClient.QueryManager.GetReplicaListAsync(
partitionId,
replicaId,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken).ConfigureAwait(false);
if (replicaList.Count == 0)
{
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.RestartCodePackageAsync",
$"Execution failure: Replica {replicaId} not found in partition {partitionId}.",
cancellationToken).ConfigureAwait(false);
return null;
}
ReplicaSelector replicaSelector = ReplicaSelector.ReplicaIdOf(partitionSelector, replicaId);
var restartCodePackageResult = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
fabricClient.FaultManager.RestartDeployedCodePackageAsync(
appName,
replicaSelector,
CompletionMode.DoNotVerify, // There is a bug with Verify for Stateless services...
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken),
cancellationToken).ConfigureAwait(false);
return restartCodePackageResult;
}
catch (Exception ex)
when (ex is FabricException
|| ex is InvalidOperationException
|| ex is OperationCanceledException
|| ex is TimeoutException)
{
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Warning,
"RepairExecutor.RestartCodePackageAsync",
$"Execution failure: {ex}.",
cancellationToken).ConfigureAwait(false);
return null;
}
}
/// <summary>
/// Safely restarts a Service Fabric Node instance.
/// Algorithm:
/// 1 Deactivate target node.
/// 2 Wait for node to get into Disabled/Ok.
/// 3 Restart node (which is the Fabric.exe kill API in FaultManager).
/// 4 Wait for node to go Down.
/// 5 Wait for node to get to Disabled/Ok.
/// 6 Activate node.
/// 7 Wait for node to get to Up/Ok.
/// </summary>
/// <param name="nodeName">Name of the target node</param>
/// <param name="repairTask">The scheduled Repair Task</param>
/// <param name="cancellationToken">Task cancellation token</param>
/// <returns>true if the node was restarted and reactivated; otherwise, false.</returns>
public async Task<bool> SafeRestartFabricNodeAsync(
string nodeName,
RepairTask repairTask,
CancellationToken cancellationToken)
{
bool isTargetNodeHostingFH = nodeName == this.serviceContext.NodeContext.NodeName;
if (isTargetNodeHostingFH)
{
return false;
}
var nodes = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
this.fabricClient.QueryManager.GetNodeListAsync(
nodeName,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken),
cancellationToken).ConfigureAwait(false);
if (nodes.Count == 0)
{
string info =
$"Target node not found: {nodeName}. " +
$"Aborting node restart operation.";
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.SafeRestartFabricNodeAsync::NodeCount0",
info,
cancellationToken).ConfigureAwait(false);
FabricHealerManager.RepairLogger.LogInfo(info);
return false;
}
var allnodes = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
this.fabricClient.QueryManager.GetNodeListAsync(
null,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken),
cancellationToken).ConfigureAwait(false);
if (allnodes.Count < 3)
{
string info =
$"Unsupported repair for a {allnodes.Count} node cluster. " +
"Aborting fabric node restart operation.";
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.SafeRestartFabricNodeAsync::NodeCount",
info,
cancellationToken).ConfigureAwait(false);
FabricHealerManager.RepairLogger.LogInfo(info);
return false;
}
var nodeInstanceId = nodes[0].NodeInstanceId;
var stopwatch = new Stopwatch();
var maxWaitTimeout = TimeSpan.FromMinutes(MaxWaitTimeMinutesForNodeOperation);
string actionMessage = "Attempting to safely restart Fabric node " +
$"{nodeName} with InstanceId {nodeInstanceId}.";
FabricHealerManager.RepairLogger.LogInfo(actionMessage);
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.SafeRestartFabricNodeAsyncAttemptingRestart",
actionMessage,
cancellationToken).ConfigureAwait(false);
try
{
if (!SerializationUtility.TryDeserialize(repairTask.ExecutorData, out RepairExecutorData executorData))
{
return false;
}
if (executorData.LatestRepairStep == FabricNodeRepairStep.Scheduled)
{
executorData.LatestRepairStep = FabricNodeRepairStep.Deactivate;
if (SerializationUtility.TrySerialize(executorData, out string exData))
{
repairTask.ExecutorData = exData;
}
else
{
actionMessage = "Step = Deactivate => Did not successfully serialize executordata.";
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.SafeRestartFabricNodeAsyncAttemptingRestart::Deactivate",
actionMessage,
cancellationToken).ConfigureAwait(false);
return false;
}
await fabricClient.RepairManager.UpdateRepairExecutionStateAsync(
repairTask,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken).ConfigureAwait(false);
// Deactivate the node with intent to restart. Several health checks will
// take place to ensure safe deactivation, which includes giving services a
// chance to gracefully shut down, should they override OnAbort/OnClose.
await this.fabricClient.ClusterManager.DeactivateNodeAsync(
nodeName,
NodeDeactivationIntent.Restart,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken).ConfigureAwait(false);
stopwatch.Start();
// Wait for node to get into Disabled state.
while (stopwatch.Elapsed <= maxWaitTimeout)
{
var nodeList = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
this.fabricClient.QueryManager.GetNodeListAsync(
nodeName,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken),
cancellationToken).ConfigureAwait(false);
if (nodeList == null || nodeList.Count == 0)
{
break;
}
Node targetNode = nodeList[0];
// exit loop, this is the state we're looking for.
if (targetNode.NodeStatus == NodeStatus.Disabled)
{
break;
}
await Task.Delay(1000, cancellationToken).ConfigureAwait(false);
}
stopwatch.Stop();
stopwatch.Reset();
}
if (executorData.LatestRepairStep == FabricNodeRepairStep.Deactivate)
{
executorData.LatestRepairStep = FabricNodeRepairStep.Restart;
if (SerializationUtility.TrySerialize(executorData, out string exData))
{
repairTask.ExecutorData = exData;
}
else
{
return false;
}
await fabricClient.RepairManager.UpdateRepairExecutionStateAsync(
repairTask,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken).ConfigureAwait(false);
actionMessage = $"In Step Restart Node.{Environment.NewLine}{repairTask.ExecutorData}";
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.SafeRestartFabricNodeAsyncAttemptingRestart::RestartStep",
actionMessage,
cancellationToken).ConfigureAwait(false);
// Now, restart node.
_ = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
this.fabricClient.FaultManager.RestartNodeAsync(
nodeName,
nodes[0].NodeInstanceId,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken),
cancellationToken).ConfigureAwait(false);
stopwatch.Start();
// Wait for Disabled/OK
while (stopwatch.Elapsed <= maxWaitTimeout)
{
var nodeList = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
this.fabricClient.QueryManager.GetNodeListAsync(
nodeName,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken),
cancellationToken).ConfigureAwait(false);
Node targetNode = nodeList[0];
// Node is ready to be enabled.
if (targetNode.NodeStatus == NodeStatus.Disabled
&& targetNode.HealthState == HealthState.Ok)
{
break;
}
await Task.Delay(1000, cancellationToken).ConfigureAwait(false);
}
stopwatch.Stop();
stopwatch.Reset();
}
if (executorData.LatestRepairStep == FabricNodeRepairStep.Restart)
{
executorData.LatestRepairStep = FabricNodeRepairStep.Activate;
if (SerializationUtility.TrySerialize(executorData, out string exData))
{
repairTask.ExecutorData = exData;
}
else
{
return false;
}
await this.fabricClient.RepairManager.UpdateRepairExecutionStateAsync(
repairTask,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken).ConfigureAwait(false);
// Now, enable the node.
await this.fabricClient.ClusterManager.ActivateNodeAsync(
nodeName,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken).ConfigureAwait(false);
await Task.Delay(TimeSpan.FromSeconds(15), cancellationToken).ConfigureAwait(false);
var nodeList = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
this.fabricClient.QueryManager.GetNodeListAsync(
nodeName,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken),
cancellationToken).ConfigureAwait(false);
Node targetNode = nodeList[0];
// Make sure activation request went through.
if (targetNode.NodeStatus == NodeStatus.Disabled
&& targetNode.HealthState == HealthState.Ok)
{
await this.fabricClient.ClusterManager.ActivateNodeAsync(
nodeName,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken).ConfigureAwait(false);
}
await Task.Delay(TimeSpan.FromSeconds(15), cancellationToken).ConfigureAwait(false);
return true;
}
return false;
}
catch (Exception e)
//when (e is FabricException || e is OperationCanceledException || e is TimeoutException)
{
string err =
$"Error restarting Fabric node {nodeName}, " +
$"NodeInstanceId {nodeInstanceId}:{Environment.NewLine}{e}";
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.SafeRestartFabricNodeAsync::HandledException",
err,
cancellationToken).ConfigureAwait(false);
FabricHealerManager.RepairLogger.LogError(err);
return false;
}
}
public async Task<RestartReplicaResult> RestartReplicaAsync(
RepairConfiguration repairConfiguration,
CancellationToken cancellationToken)
{
string actionMessage = $"Attempting to restart replica {repairConfiguration.ReplicaOrInstanceId} " +
$"on partition {repairConfiguration.PartitionId} on node {repairConfiguration.NodeName}.";
FabricHealerManager.RepairLogger.LogInfo(actionMessage);
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.RestartReplicaAsync",
actionMessage,
cancellationToken,
repairConfiguration).ConfigureAwait(false);
RestartReplicaResult replicaResult;
try
{
PartitionSelector partitionSelector = PartitionSelector.PartitionIdOf(repairConfiguration.ServiceName, repairConfiguration.PartitionId);
ReplicaSelector replicaSelector = ReplicaSelector.ReplicaIdOf(partitionSelector, repairConfiguration.ReplicaOrInstanceId);
replicaResult = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
this.fabricClient.FaultManager.RestartReplicaAsync(
replicaSelector,
CompletionMode.DoNotVerify,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken),
cancellationToken).ConfigureAwait(false);
string statusSuccess =
$"Successfully restarted replica {repairConfiguration.ReplicaOrInstanceId} " +
$"on partition {repairConfiguration.PartitionId} " +
$"on node {repairConfiguration.NodeName}.";
FabricHealerManager.RepairLogger.LogInfo(statusSuccess);
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.RestartReplicaAsync",
statusSuccess,
cancellationToken,
repairConfiguration).ConfigureAwait(false);
}
catch (Exception e) when (e is FabricException || e is TimeoutException || e is OperationCanceledException)
{
string err =
$"Unable to restart replica {repairConfiguration.ReplicaOrInstanceId} " +
$"on partition {repairConfiguration.PartitionId} " +
$"on node {repairConfiguration.NodeName}.{Environment.NewLine}" +
$"Exception Info:{Environment.NewLine}{e}";
FabricHealerManager.RepairLogger.LogWarning(err);
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Warning,
"RepairExecutor.RestartReplicaAsync",
err,
cancellationToken,
repairConfiguration).ConfigureAwait(false);
return null;
}
return replicaResult;
}
public async Task<RemoveReplicaResult> RemoveReplicaAsync(
RepairConfiguration repairConfiguration,
CancellationToken cancellationToken)
{
string actionMessage =
$"Attempting to remove replica {repairConfiguration.ReplicaOrInstanceId} " +
$"on partition {repairConfiguration.PartitionId} on node {repairConfiguration.NodeName}.";
FabricHealerManager.RepairLogger.LogInfo(actionMessage);
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.RemoveReplicaAsync",
actionMessage,
cancellationToken,
repairConfiguration).ConfigureAwait(false);
RemoveReplicaResult replicaResult;
try
{
replicaResult = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
this.fabricClient.FaultManager.RemoveReplicaAsync(
repairConfiguration.NodeName,
repairConfiguration.PartitionId,
repairConfiguration.ReplicaOrInstanceId,
CompletionMode.DoNotVerify,
false,
FabricHealerManager.ConfigSettings.AsyncTimeout.TotalSeconds,
cancellationToken),
cancellationToken).ConfigureAwait(false);
string statusSuccess =
$"Successfully removed replica {repairConfiguration.ReplicaOrInstanceId} " +
$"on partition {repairConfiguration.PartitionId} " +
$"on node {repairConfiguration.NodeName}.";
FabricHealerManager.RepairLogger.LogInfo(statusSuccess);
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.RemoveReplicaAsync",
statusSuccess,
cancellationToken,
repairConfiguration).ConfigureAwait(false);
}
catch (Exception e) when (e is FabricException || e is TimeoutException || e is OperationCanceledException)
{
string err =
$"Unable to remove replica {repairConfiguration.ReplicaOrInstanceId} " +
$"on partition {repairConfiguration.PartitionId} " +
$"on node {repairConfiguration.NodeName}.{Environment.NewLine}" +
$"Exception Info:{Environment.NewLine}{e}";
FabricHealerManager.RepairLogger.LogWarning(err);
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Warning,
"RepairExecutor.RemoveReplicaAsync",
err,
cancellationToken,
repairConfiguration).ConfigureAwait(false);
return null;
}
return replicaResult;
}
internal async Task<bool> DeleteFilesAsync(
RepairConfiguration repairConfiguration,
CancellationToken cancellationToken)
{
string actionMessage =
$"Attempting to delete files in folder {((DiskRepairPolicy)repairConfiguration.RepairPolicy).FolderPath} " +
$"on node {repairConfiguration.NodeName}.";
FabricHealerManager.RepairLogger.LogInfo(actionMessage);
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.DeleteFilesAsync",
actionMessage,
cancellationToken,
repairConfiguration).ConfigureAwait(false);
string targetFolderPath = ((DiskRepairPolicy)repairConfiguration.RepairPolicy).FolderPath;
if (!Directory.Exists(targetFolderPath))
{
return false;
}
var dirInfo = new DirectoryInfo(targetFolderPath);
FileSortOrder direction = ((DiskRepairPolicy)repairConfiguration.RepairPolicy).FileAgeSortOrder;
List<string> files = null;
if (direction == FileSortOrder.Ascending)
{
files = (from file in dirInfo.EnumerateFiles("*", new EnumerationOptions { RecurseSubdirectories = ((DiskRepairPolicy)repairConfiguration.RepairPolicy).RecurseSubdirectories })
orderby file.LastWriteTimeUtc ascending
select file.FullName).Distinct().ToList();
}
else if (direction == FileSortOrder.Descending)
{
files = (from file in dirInfo.EnumerateFiles("*", new EnumerationOptions { RecurseSubdirectories = ((DiskRepairPolicy)repairConfiguration.RepairPolicy).RecurseSubdirectories })
orderby file.LastWriteTimeUtc descending
select file.FullName).Distinct().ToList();
}
// files stays null if the sort order is neither Ascending nor Descending.
int initialCount = files?.Count ?? 0;
int deletedFiles = 0;
long maxFiles = ((DiskRepairPolicy)repairConfiguration.RepairPolicy).MaxNumberOfFilesToDelete;
if (initialCount == 0)
{
return false;
}
foreach (var file in files)
{
if (maxFiles > 0 && deletedFiles == maxFiles)
{
break;
}
try
{
File.Delete(file);
deletedFiles++;
}
catch (Exception e) when (e is IOException || e is SecurityException || e is UnauthorizedAccessException)
{
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.DeleteFilesAsync::HandledException",
$"Unable to delete {file}:{Environment.NewLine}{e}",
cancellationToken,
repairConfiguration).ConfigureAwait(false);
}
}
if (maxFiles > 0 && initialCount > maxFiles && deletedFiles < maxFiles)
{
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.DeleteFilesAsync::IncompleteOperation",
$"Unable to delete specified number of files ({maxFiles}).",
cancellationToken,
repairConfiguration).ConfigureAwait(false);
return false;
}
if (maxFiles == 0 && deletedFiles < initialCount)
{
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.DeleteFilesAsync::IncompleteOperation",
$"Unable to delete all files.",
cancellationToken,
repairConfiguration).ConfigureAwait(false);
return false;
}
await this.telemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.DeleteFilesAsync::Success",
$"Successfully deleted {(maxFiles > 0 ? "up to " + maxFiles.ToString() : "all")} files in {targetFolderPath}",
cancellationToken,
repairConfiguration).ConfigureAwait(false);
return true;
}
/// <summary>
/// Returns the machine host name for the given Fabric node name, or null if it cannot be determined.
/// </summary>
/// <param name="nodeName">Fabric node name</param>
/// <param name="cancellationToken">Token used to cancel the node list query.</param>
internal async Task<string>
GetMachineHostNameFromFabricNodeNameAsync(string nodeName, CancellationToken cancellationToken)
{
try
{
var nodes = await this.fabricClient.QueryManager.GetNodeListAsync(
nodeName,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken).ConfigureAwait(true);
Node targetNode = nodes.Count > 0 ? nodes[0] : null;
if (targetNode == null)
{
return null;
}
string ipOrDnsName = targetNode.IpAddressOrFQDN;
var hostEntry = await Dns.GetHostEntryAsync(ipOrDnsName).ConfigureAwait(false);
var machineName = hostEntry.HostName;
return machineName;
}
catch (Exception e) when
(e is ArgumentException
|| e is FabricException
|| e is SocketException
|| e is OperationCanceledException
|| e is TimeoutException)
{
FabricHealerManager.RepairLogger.LogWarning(
$"Unable to determine machine host name from Fabric node name {nodeName}:{Environment.NewLine}{e}");
}
return null;
}
}
}

View file

@ -0,0 +1,83 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Runtime.Serialization;
namespace FabricHealer.Repair
{
/// <summary>
/// Executor Data is used to store the state of an executing repair task.
/// </summary>
[DataContract]
public class RepairExecutorData
{
[DataMember]
public int ExecutorSubState
{
get; set;
}
[DataMember]
public int ExecutorTimeoutInMinutes
{
get; set;
}
[DataMember]
public DateTime RestartRequestedTime
{
get; set;
}
[DataMember]
public string CustomIdentificationData
{
get; set;
}
[DataMember]
public FabricNodeRepairStep LatestRepairStep
{
get; set;
} = FabricNodeRepairStep.Scheduled;
[DataMember]
public RepairAction RepairAction
{
get; set;
}
[DataMember]
public string NodeType
{
get; set;
}
[DataMember]
public string NodeName
{
get; set;
}
[DataMember]
public RepairPolicy RepairPolicy
{
get; set;
}
[DataMember]
public string FOErrorCode
{
get; set;
}
[DataMember]
public object FOMetricValue
{
get; set;
}
}
}

View file

@ -0,0 +1,67 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
namespace FabricHealer.Repair
{
/// <summary>
/// Defines the type of repair to execute as specified in Settings.xml configuration Sections.
/// </summary>
public class RepairPolicy
{
public string Id
{
get; set;
}
public CycleTimeDistributionType CycleTimeDistributionType
{
get; set;
}
public RepairAction CurrentAction
{
get; set;
}
public long MaxRepairCycles
{
get; set;
}
public TimeSpan RepairCycleTimeWindow
{
get; set;
} = TimeSpan.MinValue;
public bool RepairInWarningState
{
get; set;
}
public TimeSpan RunInterval
{
get; set;
} = TimeSpan.MinValue;
public RepairTargetType TargetType
{
get; set;
}
}
/// <summary>
/// Type of interval time distribution to employ for a repair cycle's time window.
/// </summary>
public enum CycleTimeDistributionType
{
Even,
// TODO?...
/*Exponential,
Random,
Unknown,*/
}
}

View file

@ -0,0 +1,16 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
namespace FabricHealer.Repair
{
public enum RepairTargetType
{
Application,
Node,
Partition,
Replica,
VirtualMachine
}
}

View file

@ -0,0 +1,170 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Fabric;
using System.Fabric.Repair;
using System.Threading;
using System.Threading.Tasks;
using FabricHealer.Utilities;
namespace FabricHealer.Repair
{
public sealed class RepairTaskEngine
{
private readonly FabricClient fabricClient;
public const string HostVMReboot = "System.Reboot";
public const string FHTaskIdPrefix = "FH";
public const string AzureTaskIdPrefix = "Azure";
public const string FabricHealerExecutorName = "FabricHealer";
public bool IsFabricRepairManagerServiceDeployed
{
get; private set;
}
public RepairTaskEngine(
FabricClient fabricClient)
{
this.fabricClient = fabricClient;
}
public RepairTask CreateFabricHealerRmRepairTask(
RepairConfiguration repairConfiguration,
RepairExecutorData executorData)
{
NodeImpactLevel impact = NodeImpactLevel.None;
if (repairConfiguration.RepairPolicy.CurrentAction == RepairAction.RestartFabricNode)
{
impact = NodeImpactLevel.Restart;
}
else if (repairConfiguration.RepairPolicy.CurrentAction == RepairAction.RemoveFabricNodeState)
{
impact = NodeImpactLevel.RemoveData;
}
var nodeRepairImpact = new NodeRepairImpactDescription();
var impactedNode = new NodeImpact(repairConfiguration.NodeName, impact);
nodeRepairImpact.ImpactedNodes.Add(impactedNode);
string taskId = $"{FHTaskIdPrefix}/{Enum.GetName(typeof(RepairAction), repairConfiguration.RepairPolicy.CurrentAction)}/{Guid.NewGuid()}/{repairConfiguration.NodeName}";
var repairTask = new ClusterRepairTask(
taskId,
Enum.GetName(typeof(RepairAction), repairConfiguration.RepairPolicy.CurrentAction))
{
Target = new NodeRepairTargetDescription(repairConfiguration.NodeName),
Impact = nodeRepairImpact,
Description =
$"FabricHealer executing repair {Enum.GetName(typeof(RepairAction), executorData.RepairAction)} on node {repairConfiguration.NodeName}",
State = RepairTaskState.Preparing,
Executor = FabricHealerExecutorName,
ExecutorData = SerializationUtility.TrySerialize(executorData, out string exData) ? exData : null,
PerformPreparingHealthCheck = false,
PerformRestoringHealthCheck = false,
};
return repairTask;
}
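/// <summary>
/// Returns the FH repair tasks that are currently being processed.
/// </summary>
/// <param name="executorName">Name of the repair executor to filter on.</param>
/// <param name="cancellationToken">Token used to cancel the query.</param>
/// <returns>List of repair tasks that match the Active, Approved or Executing state filters.</returns>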
public async Task<RepairTaskList> GetFHRepairTasksCurrentlyProcessingAsync(
string executorName,
CancellationToken cancellationToken)
{
var repairTasks = await this.fabricClient.RepairManager.GetRepairTaskListAsync(
FHTaskIdPrefix,
RepairTaskStateFilter.Active |
RepairTaskStateFilter.Approved |
RepairTaskStateFilter.Executing,
executorName,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken).ConfigureAwait(false);
return repairTasks;
}
// This allows InfrastructureService to schedule and run reboot
public RepairTask CreateVmRebootTask(
RepairConfiguration repairConfiguration,
string executorName)
{
// Do not allow this to take place in one-node cluster.
var nodes = this.fabricClient.QueryManager.GetNodeListAsync().GetAwaiter().GetResult();
int nodeCount = nodes.Count;
if (nodeCount == 1)
{
return null;
}
string taskId = $"{FHTaskIdPrefix}/{HostVMReboot}/{Guid.NewGuid()}/{repairConfiguration.NodeName}/{repairConfiguration.NodeType}";
var repairTask = new ClusterRepairTask(taskId, HostVMReboot)
{
Target = new NodeRepairTargetDescription(repairConfiguration.NodeName),
Description = $"{repairConfiguration.RepairPolicy.Id}",
Executor = executorName,
PerformPreparingHealthCheck = false,
PerformRestoringHealthCheck = false,
State = RepairTaskState.Claimed,
};
return repairTask;
}
public async Task<bool> IsFHRepairTaskRunningAsync(
string executorName,
RepairConfiguration repairConfig,
CancellationToken token)
{
// All RepairTasks are prefixed with FH, regardless of repair target type (VM, fabric node, codepackage, replica...).
// For VM-level repair, RM will create a new task for IS that replaces FH executor data with IS job info, but the original FH repair task will
// remain in an active state which will block any duplicate scheduling by another FH instance.
var currentFHRepairTasksInProgress =
await fabricClient.RepairManager.GetRepairTaskListAsync(
FHTaskIdPrefix,
RepairTaskStateFilter.Active | RepairTaskStateFilter.Approved | RepairTaskStateFilter.Executing,
executorName,
FabricHealerManager.ConfigSettings.AsyncTimeout,
token).ConfigureAwait(true);
if (currentFHRepairTasksInProgress == null || currentFHRepairTasksInProgress.Count == 0)
{
return false;
}
foreach (var repair in currentFHRepairTasksInProgress)
{
var executorData =
SerializationUtility.TryDeserialize(repair.ExecutorData, out RepairExecutorData exData) ? exData : null;
if (executorData == null)
{
// This would block scheduling any VM level operation (reboot, reimage) already in flight. For IS repairs, state is stored in Description.
if (repair.Executor == $"fabric:/System/InfrastructureService/{repairConfig.NodeType}"
&& repair.Description == repairConfig.RepairPolicy.Id)
{
return true;
}
continue;
}
if (repairConfig.RepairPolicy.Id == executorData.CustomIdentificationData
|| executorData.RepairAction == RepairAction.RestartFabricNode)
{
return true;
}
}
return false;
}
}
}

View file

@ -0,0 +1,940 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Fabric;
using System.Fabric.Health;
using System.Fabric.Query;
using System.Fabric.Repair;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using FabricHealer.Utilities.Telemetry;
using FabricHealer.Interfaces;
using Guan.Logic;
using FabricHealer.Repair.Guan;
using FabricHealer.Utilities;
using Guan.Common;
using System.Data;
namespace FabricHealer.Repair
{
public class RepairTaskManager : IRepairTasks
{
private readonly RepairTaskEngine repairTaskEngine;
internal readonly RepairExecutor RepairExec;
internal readonly StatelessServiceContext Context;
internal readonly CancellationToken Token;
internal readonly TelemetryUtilities TelemetryUtilities;
public readonly FabricClient FabricClientInstance;
private TimeSpan AsyncTimeout
{
get;
} = TimeSpan.FromSeconds(60);
public static readonly TimeSpan MaxWaitTimeForInfraRepairTaskCompleted = TimeSpan.FromHours(2);
public RepairTaskManager(
FabricClient fabricClient,
StatelessServiceContext context,
CancellationToken token)
{
this.FabricClientInstance = fabricClient ?? throw new ArgumentNullException(nameof(fabricClient), "FabricClient can't be null");
this.Context = context;
this.Token = token;
this.RepairExec = new RepairExecutor(fabricClient, context, token);
this.repairTaskEngine = new RepairTaskEngine(fabricClient);
this.TelemetryUtilities = new TelemetryUtilities(fabricClient, context);
}
public async Task EnableServiceFabricNodeAsync(
string nodeName,
CancellationToken cancellationToken)
{
await ActivateServiceFabricNodeAsync(nodeName, cancellationToken).ConfigureAwait(true);
}
public async Task RemoveServiceFabricNodeStateAsync(
string nodeName,
CancellationToken cancellationToken)
{
// TODO...
await Task.CompletedTask.ConfigureAwait(false);
}
public async Task ActivateServiceFabricNodeAsync(
string nodeName,
CancellationToken cancellationToken)
{
await FabricClientInstance.ClusterManager.ActivateNodeAsync(
nodeName,
AsyncTimeout,
cancellationToken).ConfigureAwait(false);
}
public async Task<bool> SafeRestartServiceFabricNodeAsync(
string nodeName,
RepairTask repairTask,
CancellationToken cancellationToken)
{
FabricHealerManager.RepairLogger.LogInfo(
$"Taking down Fabric node {nodeName}.");
if (!await RepairExec.SafeRestartFabricNodeAsync(
nodeName,
repairTask,
cancellationToken).ConfigureAwait(false))
{
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"SafeRestartFabricNodeAsync",
$"Did not restart Fabric node {nodeName}",
cancellationToken).ConfigureAwait(false);
return false;
}
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"SafeRestartFabricNodeAsync",
$"Successfully restarted Fabric node {nodeName}",
cancellationToken).ConfigureAwait(false);
return true;
}
public async Task StartRepairWorkflowAsync(
TelemetryData foHealthData,
List<string> repairRules,
CancellationToken cancellationToken)
{
Node node = null;
if (foHealthData.NodeName != null)
{
node = await GetFabricNodeFromNodeNameAsync(
foHealthData.NodeName,
cancellationToken).ConfigureAwait(false);
}
if (node == null)
{
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Warning,
"RepairTaskManager.StartRepairWorkflowAsync",
$"Unable to attempt repair. Target node exists in cluster? {node != null}.",
cancellationToken).ConfigureAwait(false);
return;
}
try
{
if (repairRules.Any(r => r.Contains(RepairConstants.RestartVM)))
{
// Do not allow VM reboot to take place in one-node cluster.
var nodes = await FabricClientInstance.QueryManager.GetNodeListAsync(
null,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken).ConfigureAwait(false);
int nodeCount = nodes.Count;
if (nodeCount == 1)
{
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Warning,
$"RepairTaskManager.StartRepairWorkflowAsync::OneNodeCluster",
$"Will not attempt VM-level repair in a one node cluster.",
cancellationToken).ConfigureAwait(false);
return;
}
}
}
catch (Exception e) when (e is FabricException || e is OperationCanceledException || e is TimeoutException)
{
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Warning,
$"RepairTaskManager.StartRepairWorkflowAsync::NodeCount",
$"Unable to determine node count. Will not attempt VM level repairs:{Environment.NewLine}{e}",
cancellationToken).ConfigureAwait(false);
return;
}
foHealthData.NodeType = node.NodeType;
try
{
_ = await InitializeGuanAndRunQuery(foHealthData, repairRules);
}
catch (GuanException ge)
{
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Warning,
"StartRepairWorkflowAsync:GuanException",
$"Failed in Guan: {ge}",
cancellationToken).ConfigureAwait(false);
return;
}
}
public async Task<bool> InitializeGuanAndRunQuery(
TelemetryData foHealthData,
List<string> repairRules,
RepairExecutorData repairExecutorData = null)
{
// ----- Guan Processing Logic -----
// Add predicate types to functor table. Note that all health information data from FO are automatically passed to all predicates.
// This enables access to various health state values in any query. See Mitigate() in rules files, for examples.
FunctorTable functorTable = new FunctorTable();
// Add external helper predicates.
functorTable.Add(CheckFolderSizePredicateType.Singleton(RepairConstants.CheckFolderSize, this, foHealthData));
functorTable.Add(GetRepairHistoryPredicateType.Singleton(RepairConstants.GetRepairHistory, this, foHealthData));
functorTable.Add(CheckInsideRunIntervalPredicateType.Singleton(RepairConstants.CheckInsideRunInterval, this, foHealthData));
// Add external repair predicates.
functorTable.Add(DeleteFilesPredicateType.Singleton(RepairConstants.DeleteFiles, this, foHealthData));
functorTable.Add(RestartCodePackagePredicateType.Singleton(RepairConstants.RestartCodePackage, this, foHealthData));
functorTable.Add(RestartFabricNodePredicateType.Singleton(RepairConstants.RestartFabricNode, this, repairExecutorData, this.repairTaskEngine, foHealthData));
functorTable.Add(RestartReplicaPredicateType.Singleton(RepairConstants.RestartReplica, this, foHealthData));
functorTable.Add(RestartVMPredicateType.Singleton(RepairConstants.RestartVM, this, foHealthData));
// Parse rules
Module module = Module.Parse("Module", repairRules, functorTable);
var queryDispatcher = new GuanQueryDispatcher(module);
// Create guan query
List<CompoundTerm> terms = new List<CompoundTerm>();
CompoundTerm term = new CompoundTerm("Mitigate");
/* Pass default arguments in query. */
// The type of metric that led FO to generate the unhealthy evaluation for the entity (App, Node, VM, Replica, etc).
foHealthData.Metric = FabricObserverErrorWarningCodes.GetMetricNameFromCode(foHealthData.Code);
term.AddArgument(new Constant(foHealthData.ApplicationName), RepairConstants.AppName);
term.AddArgument(new Constant(foHealthData.Code), RepairConstants.FOErrorCode);
term.AddArgument(new Constant(foHealthData.Metric), RepairConstants.MetricName);
term.AddArgument(new Constant(foHealthData.NodeName), RepairConstants.NodeName);
term.AddArgument(new Constant(foHealthData.NodeType), RepairConstants.NodeType);
term.AddArgument(new Constant(foHealthData.ServiceName), RepairConstants.ServiceName);
term.AddArgument(new Constant(foHealthData.PartitionId), RepairConstants.PartitionId);
term.AddArgument(new Constant(foHealthData.ReplicaId), RepairConstants.ReplicaOrInstanceId);
// FO metric values can be doubles or ints. We don't care about fractional precision here,
// and converting to long keeps the default (long) Guan numeric comparison working.
// Convert.ToInt64(object) also handles a boxed int, which a direct (double) cast would reject.
term.AddArgument(new Constant(Convert.ToInt64(foHealthData.Value)), RepairConstants.MetricValue);
terms.Add(term);
// Dispatch query
return await queryDispatcher.RunQueryAsync(terms).ConfigureAwait(false);
}
// The repair will be executed by the SF Infrastructure Service (IS), not FH. This is the case for all
// VM-level repairs. IS will communicate with VMSS (for example) to guarantee safe repairs in MR-enabled
// clusters. RM, as usual, will orchestrate the repair cycle.
public async Task<bool> ExecuteRMInfrastructureRepairTask(
RepairConfiguration repairConfiguration,
CancellationToken cancellationToken)
{
var infraServices = await FabricRepairTasks.GetInfrastructureServiceInstancesAsync(
FabricClientInstance,
cancellationToken).ConfigureAwait(false);
if (infraServices.Count() == 0)
{
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairTaskManager.ExecuteRMInfrastructureRepairTask",
"Infrastructure Service not found. Will not attempt VM repair.",
cancellationToken,
repairConfiguration).ConfigureAwait(false);
return false;
}
string executorName = null;
foreach (var service in infraServices)
{
if (!service.ServiceName.OriginalString.Contains(repairConfiguration.NodeType))
{
continue;
}
executorName = service.ServiceName.OriginalString;
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairTaskManager.ExecuteRMInfrastructureRepairTask",
$"IS RepairTask {RepairTaskEngine.HostVMReboot} " +
$"Executor set to {executorName}.",
cancellationToken,
repairConfiguration).ConfigureAwait(false);
break;
}
if (executorName == null)
{
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairTaskManager.ExecuteRMInfrastructureRepairTask",
"Unable to determine InfrastructureService service instance. " +
"Exiting RepairTaskManager.ExecuteRMInfrastructureRepairTask.",
cancellationToken,
repairConfiguration).ConfigureAwait(false);
return false;
}
// Make sure there is not already a repair job executing reboot/reimage repair for target node.
var isRepairAlreadyInProgress =
await repairTaskEngine.IsFHRepairTaskRunningAsync(
executorName,
repairConfiguration,
cancellationToken).ConfigureAwait(true);
if (isRepairAlreadyInProgress)
{
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairTaskManager.ExecuteRMInfrastructureRepairTask",
$"Virtual machine repair task for VM " +
$"{await RepairExec.GetMachineHostNameFromFabricNodeNameAsync(repairConfiguration.NodeName, cancellationToken)} is already in progress. " +
$"Will not schedule another VM repair at this time.",
cancellationToken,
repairConfiguration).ConfigureAwait(false);
return false;
}
// Create repair task for target node.
var repairTask = await FabricRepairTasks.ScheduleRepairTaskAsync(
repairConfiguration,
null,
executorName,
FabricClientInstance,
cancellationToken).ConfigureAwait(false);
if (repairTask == null)
{
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairTaskManager.ExecuteRMInfrastructureRepairTask",
"Unable to create Repair Task.",
cancellationToken,
repairConfiguration).ConfigureAwait(false);
return false;
}
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairTaskManager.ExecuteRMInfrastructureRepairTask",
$"Successfully created Repair Task {repairTask.TaskId}",
cancellationToken,
repairConfiguration).ConfigureAwait(false);
var timer = Stopwatch.StartNew();
// It can take a while to get from a VM reboot/reimage to a healthy Fabric node, so block here until repair completes.
// Note that, by design, this will block any other FabricHealer-initiated repair from taking place in the cluster.
// FabricHealer is designed to be very conservative with respect to node level repairs.
// It is a good idea to not change this default behavior.
while (timer.Elapsed < MaxWaitTimeForInfraRepairTaskCompleted)
{
if (!await FabricRepairTasks.IsRepairTaskInDesiredStateAsync(
repairTask.TaskId,
this.FabricClientInstance,
executorName,
new List<RepairTaskState> { RepairTaskState.Completed }))
{
await Task.Delay(TimeSpan.FromSeconds(30), cancellationToken).ConfigureAwait(false);
continue;
}
timer.Stop();
return true;
}
// The repair task did not reach the Completed state within the maximum wait time.
return false;
}
public async Task<bool> DeleteFilesAsyncAsync(
RepairConfiguration repairConfiguration,
CancellationToken cancellationToken)
{
return await RepairExec.DeleteFilesAsync(repairConfiguration, cancellationToken);
}
public async Task<bool> RestartReplicaAsync(
RepairConfiguration repairConfiguration,
CancellationToken cancellationToken)
{
var result = await RepairExec.RestartReplicaAsync(
repairConfiguration ?? throw new ArgumentException("configuration can't be null."),
cancellationToken).ConfigureAwait(false);
return result != null;
}
public async Task<bool> RemoveReplicaAsync(
RepairConfiguration repairConfiguration,
CancellationToken cancellationToken)
{
var result = await RepairExec.RemoveReplicaAsync(
repairConfiguration ?? throw new ArgumentException("configuration can't be null."),
cancellationToken).ConfigureAwait(false);
return result != null;
}
public async Task<bool> RestartDeployedCodePackageAsync(
RepairConfiguration repairConfiguration,
CancellationToken cancellationToken)
{
string actionMessage =
$"Attempting to restart deployed code package for app {repairConfiguration?.AppName?.OriginalString}, " +
$"service manifest {repairConfiguration?.CodePackage?.ServiceManifestName}";
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.RestartCodePackageAsync",
actionMessage,
cancellationToken,
repairConfiguration).ConfigureAwait(false);
var result = await RepairExec.RestartCodePackageAsync(
repairConfiguration.AppName,
repairConfiguration.PartitionId,
repairConfiguration.ReplicaOrInstanceId,
repairConfiguration.ServiceName,
cancellationToken).ConfigureAwait(true);
if (result == null)
{
return false;
}
string statusSuccess =
"Successfully restarted " +
$"code package {result.CodePackageName} with Instance Id " +
$"{result.CodePackageInstanceId} " +
$"for application {repairConfiguration.AppName.OriginalString} on node " +
$"{repairConfiguration.NodeName}.";
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairExecutor.RestartCodePackageAsync",
statusSuccess,
cancellationToken,
repairConfiguration).ConfigureAwait(false);
return true;
}
public async Task<Node> GetFabricNodeFromNodeNameAsync(string nodeName, CancellationToken cancellationToken)
{
try
{
var nodes = await this.FabricClientInstance.QueryManager.GetNodeListAsync(
nodeName,
AsyncTimeout,
cancellationToken).ConfigureAwait(true);
return nodes.Count > 0 ? nodes[0] : null;
}
catch (FabricException fe)
{
FabricHealerManager.RepairLogger.LogError(
$"Error getting node {nodeName}:{Environment.NewLine}{fe}");
return null;
}
}
public async Task<RepairTask> ScheduleFabricHealerRmRepairTaskAsync(
RepairConfiguration repairConfiguration,
CancellationToken cancellationToken)
{
var isThisRepairTaskAlreadyInProgress =
await repairTaskEngine.IsFHRepairTaskRunningAsync(
RepairTaskEngine.FabricHealerExecutorName,
repairConfiguration,
cancellationToken).ConfigureAwait(true);
// For the cases where this repair is already in flight.
if (isThisRepairTaskAlreadyInProgress)
{
string message =
$"Node {repairConfiguration.NodeName} already has a " +
$"{Enum.GetName(typeof(RepairAction), repairConfiguration.RepairPolicy.CurrentAction)} repair in progress for repair Id {repairConfiguration.RepairPolicy.Id}. " +
"Exiting RepairTaskManager.ScheduleFabricHealerRmRepairTaskAsync.";
await TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"ScheduleRepairTask:RepairAlreadyInProgress",
message,
cancellationToken,
repairConfiguration).ConfigureAwait(false);
return null;
}
// Don't attempt a node level repair on a node where there is already an active node-level repair.
var currentlyExecutingRepairs =
await this.FabricClientInstance.RepairManager.GetRepairTaskListAsync(
RepairTaskEngine.FHTaskIdPrefix,
RepairTaskStateFilter.Active | RepairTaskStateFilter.Executing,
RepairTaskEngine.FabricHealerExecutorName,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken).ConfigureAwait(true);
if (currentlyExecutingRepairs.Count > 0)
{
foreach (var repair in currentlyExecutingRepairs.Where(task => task.ExecutorData != null && task.ExecutorData.Contains(repairConfiguration.NodeName)))
{
if (!SerializationUtility.TryDeserialize(repair.ExecutorData, out RepairExecutorData repairExecutorData))
{
continue;
}
if (repairExecutorData.RepairAction == RepairAction.RestartFabricNode
|| repairExecutorData.RepairAction == RepairAction.RestartVM)
{
string message =
$"Node {repairConfiguration.NodeName} already has a node-impactful repair in progress: " +
$"{Enum.GetName(typeof(RepairAction), repairExecutorData.RepairAction)}: {repair.TaskId}. " +
"Exiting RepairTaskManager.ScheduleFabricHealerRmRepairTaskAsync.";
await TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"ScheduleRepairTask::NodeRepairAlreadyInProgress",
message,
cancellationToken,
repairConfiguration).ConfigureAwait(false);
return null;
}
}
}
var executorData = new RepairExecutorData
{
CustomIdentificationData = repairConfiguration.RepairPolicy.Id,
ExecutorTimeoutInMinutes = (int)MaxWaitTimeForInfraRepairTaskCompleted.TotalMinutes,
RestartRequestedTime = DateTime.Now,
RepairAction = repairConfiguration.RepairPolicy.CurrentAction,
RepairPolicy = repairConfiguration.RepairPolicy,
FOErrorCode = repairConfiguration.FOHealthCode,
FOMetricValue = repairConfiguration.FOHealthMetricValue,
NodeName = repairConfiguration.NodeName,
NodeType = repairConfiguration.NodeType,
};
// Create custom FH repair task for target node.
var repairTask = await FabricRepairTasks.ScheduleRepairTaskAsync(
repairConfiguration,
executorData,
RepairTaskEngine.FabricHealerExecutorName,
FabricClientInstance,
cancellationToken).ConfigureAwait(false);
return repairTask;
}
public async Task<bool> ExecuteFabricHealerRmRepairTaskAsync(
RepairTask repairTask,
RepairConfiguration repairConfiguration,
CancellationToken cancellationToken)
{
// Execute the repair.
TimeSpan timeout = TimeSpan.FromMinutes(30);
Stopwatch stopWatch = Stopwatch.StartNew();
RepairTaskList repairs;
while (timeout >= stopWatch.Elapsed)
{
repairs =
await repairTaskEngine.GetFHRepairTasksCurrentlyProcessingAsync(
RepairTaskEngine.FabricHealerExecutorName,
cancellationToken).ConfigureAwait(true);
if (repairs == null || repairs.Count == 0)
{
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairTaskManager.ExecuteFabricHealerRmRepairTaskAsync",
$"Failed to schedule repair {repairTask.TaskId}.",
cancellationToken).ConfigureAwait(false);
return false;
}
if (repairs.All(task => task.TaskId != repairTask.TaskId))
{
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairTaskManager.ExecuteFabricHealerRmRepairTaskAsync",
$"Failed to find scheduled repair task {repairTask.TaskId}.",
cancellationToken).ConfigureAwait(false);
return false;
}
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairTaskManager.ExecuteFabricHealerRmRepairTaskAsync_WaitingForApproval",
$"Waiting for RM to Approve repair task {repairTask.TaskId}.",
cancellationToken).ConfigureAwait(false);
if (!repairs.Any(task => task.TaskId == repairTask.TaskId
&& task.State == RepairTaskState.Approved))
{
await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken).ConfigureAwait(true);
continue;
}
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairTaskManager.ExecuteFabricHealerRmRepairTaskAsync_Approved",
$"RM has Approved repair task {repairTask.TaskId}.",
cancellationToken).ConfigureAwait(false);
break;
}
stopWatch.Stop();
stopWatch.Reset();
await FabricRepairTasks.SetFabricRepairJobStateAsync(
repairTask,
RepairTaskState.Executing,
RepairTaskResult.Pending,
FabricClientInstance,
cancellationToken).ConfigureAwait(true);
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairTaskManager.ExecuteFabricHealerRmRepairTaskAsync_MovedExecuting",
$"Executing repair {repairTask.TaskId}.",
cancellationToken).ConfigureAwait(false);
bool success;
var repairAction = repairConfiguration.RepairPolicy.CurrentAction;
switch (repairAction)
{
case RepairAction.DeleteFiles:
success = await DeleteFilesAsyncAsync(
repairConfiguration,
cancellationToken).ConfigureAwait(true);
break;
// Note: For SF app container services, RestartDeployedCodePackage API does not work.
// Thus, using Restart/Remove(stateful/stateless)Replica API instead, which does restart container instances.
case RepairAction.RestartCodePackage:
if (string.IsNullOrEmpty(repairConfiguration.ContainerId))
{
success = await RestartDeployedCodePackageAsync(
repairConfiguration,
cancellationToken).ConfigureAwait(true);
}
else
{
// Need replica or instance details.
var repList = await FabricClientInstance.QueryManager.GetReplicaListAsync(
repairConfiguration.PartitionId,
repairConfiguration.ReplicaOrInstanceId,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken).ConfigureAwait(false);
if (repList.Count == 0)
{
success = false;
break;
}
var rep = repList[0];
// Restarting stateful replica will restart the container instance.
if (rep.ServiceKind == ServiceKind.Stateful)
{
success = await RestartReplicaAsync(
repairConfiguration,
cancellationToken).ConfigureAwait(true);
}
else
{
// For stateless instances, you need to remove the replica, which will
// restart the container instance.
success = await RemoveReplicaAsync(
repairConfiguration,
cancellationToken).ConfigureAwait(true);
}
}
break;
case RepairAction.RemoveReplica:
success = await RemoveReplicaAsync(
repairConfiguration,
cancellationToken).ConfigureAwait(true);
break;
case RepairAction.RestartReplica:
var replicaList = await FabricClientInstance.QueryManager.GetReplicaListAsync(
repairConfiguration.PartitionId,
repairConfiguration.ReplicaOrInstanceId,
FabricHealerManager.ConfigSettings.AsyncTimeout,
cancellationToken).ConfigureAwait(false);
if (replicaList.Count == 0)
{
success = false;
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairTaskManager.ExecuteFabricHealerRmRepairTaskAsync",
$"Replica or Instance {repairConfiguration.ReplicaOrInstanceId} not found on partition {repairConfiguration.PartitionId}.",
cancellationToken).ConfigureAwait(false);
break;
}
var replica = replicaList[0];
// Restart - stateful replica.
if (replica.ServiceKind == ServiceKind.Stateful)
{
success = await RestartReplicaAsync(
repairConfiguration,
cancellationToken).ConfigureAwait(true);
}
else
{
// For stateless replicas, you need to remove the replica. The runtime will create a new one
// and place it.
success = await RemoveReplicaAsync(
repairConfiguration,
cancellationToken).ConfigureAwait(true);
}
break;
case RepairAction.RestartFabricNode:
var executorData = repairTask.ExecutorData;
if (string.IsNullOrEmpty(executorData))
{
success = false;
}
else
{
success = await SafeRestartServiceFabricNodeAsync(
repairConfiguration.NodeName,
repairTask,
cancellationToken).ConfigureAwait(true);
}
break;
default:
return false;
}
if (success)
{
string target = Enum.GetName(
typeof(RepairTargetType),
repairConfiguration.RepairPolicy.TargetType);
TimeSpan maxWaitForHealthStateOk = TimeSpan.FromMinutes(60);
if ((repairConfiguration.RepairPolicy.TargetType == RepairTargetType.Application
&& repairConfiguration.AppName.OriginalString != "fabric:/System")
|| repairConfiguration.RepairPolicy.TargetType == RepairTargetType.Replica)
{
maxWaitForHealthStateOk = TimeSpan.FromMinutes(5);
}
else if (repairConfiguration.RepairPolicy.TargetType == RepairTargetType.Application
&& repairConfiguration.AppName.OriginalString == "fabric:/System")
{
maxWaitForHealthStateOk = TimeSpan.FromMinutes(20);
}
// Check the health state of the repair target to see if the repair worked.
if (await IsRepairTargetHealthyAfterCompletedRepair(
repairConfiguration,
maxWaitForHealthStateOk,
cancellationToken).ConfigureAwait(false))
{
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairTaskManager.ExecuteFabricHealerRmRepairTaskAsync",
$"{target} Repair target {repairConfiguration.RepairPolicy.Id} successfully healed on node {repairConfiguration.NodeName}.",
cancellationToken).ConfigureAwait(false);
// Tell RM we are ready to move to Completed state
// as our custom code has completed its repair execution successfully. This function
// puts the repair task into a Restoring State with ResultStatus Succeeded.
_ = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
FabricRepairTasks.CompleteCustomActionRepairJobAsync(
repairTask,
this.FabricClientInstance,
this.Context,
cancellationToken),
cancellationToken).ConfigureAwait(false);
return true;
}
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairTaskManager.ExecuteFabricHealerRmRepairTaskAsync",
$"{target} Repair target {repairConfiguration.RepairPolicy.Id} not successfully healed.",
cancellationToken).ConfigureAwait(false);
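// The repair did not restore the target to a healthy state within the specified time.
// The task is completed (not cancelled) so that RM can move it out of the Executing state.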
//await FabricRepairTasks.CancelRepairTaskAsync(repairTask, this.FabricClientInstance).ConfigureAwait(false);
_ = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() =>
FabricRepairTasks.CompleteCustomActionRepairJobAsync(
repairTask,
this.FabricClientInstance,
this.Context,
cancellationToken),
cancellationToken).ConfigureAwait(false);
return false;
}
// Executor failure. Cancel repair task.
await this.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"RepairTaskManager.ExecuteFabricHealerRmRepairTaskAsync_ExecuteFailed",
$"Executor failed for repair {repairTask.TaskId}. See logs for details. Cancelling repair task.",
cancellationToken).ConfigureAwait(false);
await FabricRepairTasks.CancelRepairTaskAsync(repairTask, this.FabricClientInstance).ConfigureAwait(false);
return false;
}
/// <summary>
/// This function checks to see if the target of a repair is healthy after the repair task completed.
/// This will signal the result via telemetry and as a health event.
/// </summary>
/// <param name="repairConfig">RepairConfiguration instance</param>
/// <param name="maxTimeToWait">Amount of time to wait for cluster to settle.</param>
/// <param name="token">CancellationToken instance</param>
/// <returns>Boolean representing whether the repair target is healthy after a completed repair operation.</returns>
public async Task<bool> IsRepairTargetHealthyAfterCompletedRepair(
RepairConfiguration repairConfig,
TimeSpan maxTimeToWait,
CancellationToken token)
{
if (repairConfig == null)
{
return false;
}
var stopwatch = Stopwatch.StartNew();
while (maxTimeToWait >= stopwatch.Elapsed)
{
if (token.IsCancellationRequested)
{
break;
}
if (await GetCurrentAggregatedHealthStateAsync(
repairConfig,
token).ConfigureAwait(false) == HealthState.Ok)
{
stopwatch.Stop();
return true;
}
await Task.Delay(TimeSpan.FromSeconds(5), token).ConfigureAwait(true);
}
stopwatch.Stop();
return false;
}
/// <summary>
/// Determines aggregated health state for repair target in supplied repair configuration.
/// </summary>
/// <param name="repairConfig">RepairConfiguration instance.</param>
/// <param name="token">CancellationToken instance.</param>
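/// <returns>The aggregated HealthState of the repair target, or HealthState.Unknown for unsupported target types.</returns>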
private async Task<HealthState> GetCurrentAggregatedHealthStateAsync(
RepairConfiguration repairConfig,
CancellationToken token)
{
switch (repairConfig.RepairPolicy.TargetType)
{
case RepairTargetType.Application:
var appHealth = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() => this.FabricClientInstance.HealthManager.GetApplicationHealthAsync(
repairConfig.AppName,
FabricHealerManager.ConfigSettings.AsyncTimeout,
token),
token);
// Code package restarts can spin up target app on a new node, so check to make sure the target app
// is no longer in error/warning on the node it was running before it was restarted. If the app is in error/warning
// on a new node after restart, then it should still be marked healed on the old node.
var isTargetAppHealedOnTargetNode =
!appHealth.HealthEvents.Any(h => h.HealthInformation.Description.Contains(repairConfig.NodeName));
return isTargetAppHealedOnTargetNode ? HealthState.Ok : appHealth.AggregatedHealthState;
case RepairTargetType.Node:
case RepairTargetType.VirtualMachine:
var nodeHealth = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() => this.FabricClientInstance.HealthManager.GetNodeHealthAsync(
repairConfig.NodeName,
FabricHealerManager.ConfigSettings.AsyncTimeout,
token),
token);
return nodeHealth.AggregatedHealthState;
case RepairTargetType.Replica:
// TODO. This needs to be fixed as the replica will no longer exist...
var replicaHealth = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
() => this.FabricClientInstance.HealthManager.GetReplicaHealthAsync(
repairConfig.PartitionId,
repairConfig.ReplicaOrInstanceId,
FabricHealerManager.ConfigSettings.AsyncTimeout,
token),
token);
return replicaHealth.AggregatedHealthState;
default:
return HealthState.Unknown;
}
}
}
}


@ -0,0 +1,276 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Diagnostics.Tracing;
using System.Fabric;
using System.Threading.Tasks;
namespace FabricHealer
{
[EventSource(Name = "Service-Fabric-FabricHealer", Guid = "344ea295-2ee1-53a6-fc40-4e893e25c60d")]
internal sealed class ServiceEventSource : EventSource
{
public static readonly ServiceEventSource Current = new ServiceEventSource();
static ServiceEventSource()
{
// A workaround for the problem where ETW activities do not get tracked until Tasks infrastructure is initialized.
// This problem will be fixed in .NET Framework 4.6.2.
Task.Run(() => { });
}
// Instance constructor is private to enforce singleton semantics
private ServiceEventSource() : base() { }
#region Keywords
// Event keywords can be used to categorize events.
// Each keyword is a bit flag. A single event can be associated with multiple keywords (via EventAttribute.Keywords property).
// Keywords must be defined as a public class named 'Keywords' inside EventSource that uses them.
public static class Keywords
{
public const EventKeywords Requests = (EventKeywords)0x1L;
public const EventKeywords ServiceInitialization = (EventKeywords)0x2L;
}
#endregion
#region Events
// Define an instance method for each event you want to record and apply an [Event] attribute to it.
// The method name is the name of the event.
// Pass any parameters you want to record with the event (only primitive integer types, DateTime, Guid & string are allowed).
// Each event method implementation should check whether the event source is enabled, and if it is, call WriteEvent() method to raise the event.
// The number and types of arguments passed to every event method must exactly match what is passed to WriteEvent().
// Put [NonEvent] attribute on all methods that do not define an event.
// For more information see https://msdn.microsoft.com/en-us/library/system.diagnostics.tracing.eventsource.aspx
[NonEvent]
public void Message(string message, params object[] args)
{
if (IsEnabled())
{
string finalMessage = string.Format(message, args);
Message(finalMessage);
}
}
private const int MessageEventId = 1;
[Event(MessageEventId, Level = EventLevel.Informational, Message = "{0}")]
public void Message(string message)
{
if (IsEnabled())
{
WriteEvent(MessageEventId, message);
}
}
[NonEvent]
public void ServiceMessage(StatefulServiceContext serviceContext, string message, params object[] args)
{
if (this.IsEnabled())
{
string finalMessage = string.Format(message, args);
ServiceMessage(
serviceContext.ServiceName.ToString(),
serviceContext.ServiceTypeName,
serviceContext.ReplicaId,
serviceContext.PartitionId,
serviceContext.CodePackageActivationContext.ApplicationName,
serviceContext.CodePackageActivationContext.ApplicationTypeName,
serviceContext.NodeContext.NodeName,
finalMessage);
}
}
// For very high-frequency events it might be advantageous to raise events using WriteEventCore API.
// This results in more efficient parameter handling, but requires explicit allocation of EventData structure and unsafe code.
// To enable this code path, define UNSAFE conditional compilation symbol and turn on unsafe code support in project properties.
private const int ServiceMessageEventId = 2;
[Event(ServiceMessageEventId, Level = EventLevel.Informational, Message = "{7}")]
private
#if UNSAFE
unsafe
#endif
void ServiceMessage(
string serviceName,
string serviceTypeName,
long replicaOrInstanceId,
Guid partitionId,
string applicationName,
string applicationTypeName,
string nodeName,
string message)
{
#if !UNSAFE
WriteEvent(ServiceMessageEventId, serviceName, serviceTypeName, replicaOrInstanceId, partitionId, applicationName, applicationTypeName, nodeName, message);
#else
const int numArgs = 8;
fixed (char* pServiceName = serviceName, pServiceTypeName = serviceTypeName, pApplicationName = applicationName, pApplicationTypeName = applicationTypeName, pNodeName = nodeName, pMessage = message)
{
EventData* eventData = stackalloc EventData[numArgs];
eventData[0] = new EventData { DataPointer = (IntPtr) pServiceName, Size = SizeInBytes(serviceName) };
eventData[1] = new EventData { DataPointer = (IntPtr) pServiceTypeName, Size = SizeInBytes(serviceTypeName) };
eventData[2] = new EventData { DataPointer = (IntPtr) (&replicaOrInstanceId), Size = sizeof(long) };
eventData[3] = new EventData { DataPointer = (IntPtr) (&partitionId), Size = sizeof(Guid) };
eventData[4] = new EventData { DataPointer = (IntPtr) pApplicationName, Size = SizeInBytes(applicationName) };
eventData[5] = new EventData { DataPointer = (IntPtr) pApplicationTypeName, Size = SizeInBytes(applicationTypeName) };
eventData[6] = new EventData { DataPointer = (IntPtr) pNodeName, Size = SizeInBytes(nodeName) };
eventData[7] = new EventData { DataPointer = (IntPtr) pMessage, Size = SizeInBytes(message) };
WriteEventCore(ServiceMessageEventId, numArgs, eventData);
}
#endif
}
private const int ServiceTypeRegisteredEventId = 3;
[Event(ServiceTypeRegisteredEventId, Level = EventLevel.Informational, Message = "Service host process {0} registered service type {1}", Keywords = Keywords.ServiceInitialization)]
public void ServiceTypeRegistered(int hostProcessId, string serviceType)
{
WriteEvent(ServiceTypeRegisteredEventId, hostProcessId, serviceType);
}
private const int ServiceHostInitializationFailedEventId = 4;
[Event(ServiceHostInitializationFailedEventId, Level = EventLevel.Error, Message = "Service host initialization failed", Keywords = Keywords.ServiceInitialization)]
public void ServiceHostInitializationFailed(string exception)
{
WriteEvent(ServiceHostInitializationFailedEventId, exception);
}
// A pair of events sharing the same name prefix with a "Start"/"Stop" suffix implicitly marks boundaries of an event tracing activity.
// These activities can be automatically picked up by debugging and profiling tools, which can compute their execution time, child activities,
// and other statistics.
private const int ServiceRequestStartEventId = 5;
[Event(ServiceRequestStartEventId, Level = EventLevel.Informational, Message = "Service request '{0}' started", Keywords = Keywords.Requests)]
public void ServiceRequestStart(string requestTypeName)
{
WriteEvent(ServiceRequestStartEventId, requestTypeName);
}
private const int ServiceRequestStopEventId = 6;
[Event(ServiceRequestStopEventId, Level = EventLevel.Informational, Message = "Service request '{0}' finished", Keywords = Keywords.Requests)]
public void ServiceRequestStop(string requestTypeName, string exception = "")
{
WriteEvent(ServiceRequestStopEventId, requestTypeName, exception);
}
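// Usage sketch (illustrative only; not part of the original commit). "GetHealthReport" is a
// hypothetical request name. Pairing the calls in try/finally means the Stop event is written
// even when the request throws, so tools can still compute the activity's duration:
//
//     ServiceEventSource.Current.ServiceRequestStart("GetHealthReport");
//     try
//     {
//         // ... handle the request ...
//     }
//     finally
//     {
//         ServiceEventSource.Current.ServiceRequestStop("GetHealthReport");
//     }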
[NonEvent]
public void VerboseMessage(string message, params object[] args)
{
if (this.IsEnabled())
{
string finalMessage = string.Format(message, args);
this.VerboseMessage(finalMessage);
}
}
[NonEvent]
public void InfoMessage(string message, params object[] args)
{
if (this.IsEnabled())
{
string finalMessage = string.Format(message, args);
this.InfoMessage(finalMessage);
}
}
[NonEvent]
public void ErrorMessage(string message, params object[] args)
{
if (this.IsEnabled())
{
string finalMessage = string.Format(message, args);
this.ErrorMessage(finalMessage);
}
}
private const int ErrorMessageEventId = 7;
[Event(ErrorMessageEventId, Level = EventLevel.Error, Message = "{0}")]
public void ErrorMessage(string message)
{
if (this.IsEnabled())
{
this.WriteEvent(ErrorMessageEventId, message);
}
}
private const int InfoMessageEventId = 8;
[Event(InfoMessageEventId, Level = EventLevel.Informational, Message = "{0}")]
public void InfoMessage(string message)
{
if (this.IsEnabled())
{
this.WriteEvent(InfoMessageEventId, message);
}
}
private const int PrintRepairTaskEventId = 9;
[Event(PrintRepairTaskEventId, Level = EventLevel.Verbose, Message = "TaskId = {0}, State = {1}, Action = {2}, Executor = {3}, Description = {4}, ExecutorData = {5}, Target = {6}")]
public void PrintRepairTasks(string taskId, string state, string action, string executor, string description, string executordata, string target)
{
if (this.IsEnabled())
{
this.WriteEvent(PrintRepairTaskEventId, taskId, state, action, executor, description, executordata, target);
}
}
// TelemetryLib implementation
private const int VerboseMessageEventId = 10;
[Event(VerboseMessageEventId, Level = EventLevel.Verbose, Message = "{0}")]
public void VerboseMessage(string message)
{
if (IsEnabled())
{
WriteEvent(VerboseMessageEventId, message);
}
}
private const int FabricHealerTelemetryEventId = 11;
[Event(FabricHealerTelemetryEventId, Level = EventLevel.Verbose,
Message = "FabricHealer Internal Diagnostic Event, " +
"eventSourceId = {0}, applicationVersion = {1}, " +
"fabricHealerConfiguration = {2}, " +
"fabricHealerHealthState = {3}")]
public void FabricHealerRuntimeNodeEvent(
string clusterId,
string applicationVersion,
string fhConfigInfo,
string fhHealthInfo)
{
if (IsEnabled())
{
WriteEvent(
FabricHealerTelemetryEventId,
clusterId,
applicationVersion,
fhConfigInfo,
fhHealthInfo);
}
}
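// Not an event: marked [NonEvent] so EventSource does not treat this unimplemented method as one.
[NonEvent]
public void FabricObserverRuntimeNodeEvent(string clusterId, string applicationVersion, string foConfigInfo, string foHealthInfo)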
{
throw new NotImplementedException();
}
#endregion
#region Private methods
#if UNSAFE
private int SizeInBytes(string s)
{
if (s == null)
{
return 0;
}
else
{
return (s.Length + 1) * sizeof(char);
}
}
#endif
#endregion
}
}


@ -0,0 +1,307 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using FabricHealer.Repair;
using System;
using System.Collections.Generic;
using System.Fabric;
using System.Fabric.Description;
using FabricHealer.Utilities.Telemetry;
namespace FabricHealer.Utilities
{
public class ConfigSettings
{
private readonly StatelessServiceContext context;
private ConfigurationSettings configSettings;
public bool EnableAutoMitigation
{
get; set;
}
public int ExecutionLoopSleepSeconds
{
get; set;
} = 30;
public bool EnableVerboseLocalLogging
{
get; set;
}
public bool TelemetryEnabled
{
get; set;
}
public TelemetryProviderType TelemetryProvider
{
get; set;
}
// For Azure ApplicationInsights Telemetry
public string AppInsightsInstrumentationKey
{
get; set;
}
// For Azure LogAnalytics Telemetry
public string LogAnalyticsWorkspaceId
{
get; set;
}
public string LogAnalyticsSharedKey
{
get; set;
}
public string LogAnalyticsLogType
{
get; set;
}
public TimeSpan AsyncTimeout
{
get; set;
} = TimeSpan.FromSeconds(120);
// For EventSource Telemetry
public bool EtwEnabled
{
get; set;
}
// For EventSource Telemetry
public string EtwProviderName
{
get; set;
}
public string LocalLogPathParameter
{
get; set;
}
// RepairPolicy Enablement
public bool EnableAppRepair
{
get; set;
}
public bool EnableNodeRepair
{
get; set;
}
public bool EnableReplicaRepair
{
get; set;
}
public bool EnableSystemAppRepair
{
get; set;
}
public bool EnableVmRepair
{
get; set;
}
public ConfigSettings(StatelessServiceContext context)
{
this.context = context ?? throw new ArgumentNullException(nameof(context), "Context can't be null.");
UpdateConfigSettings();
}
internal void UpdateConfigSettings(ConfigurationSettings settings = null)
{
this.configSettings = settings;
// General
if (bool.TryParse(
GetConfigSettingValue(
RepairConstants.RepairManagerConfigurationSectionName,
RepairConstants.EnableAutoMitigation),
out bool enableAutoMitigation))
{
EnableAutoMitigation = enableAutoMitigation;
}
if (int.TryParse(GetConfigSettingValue(
RepairConstants.RepairManagerConfigurationSectionName,
RepairConstants.AsyncOperationTimeout),
out int timeout))
{
AsyncTimeout = TimeSpan.FromSeconds(timeout);
}
// Logger
if (bool.TryParse(
GetConfigSettingValue(
RepairConstants.RepairManagerConfigurationSectionName,
RepairConstants.EnableVerboseLoggingParameter),
out bool enableVerboseLogging))
{
EnableVerboseLocalLogging = enableVerboseLogging;
}
LocalLogPathParameter = GetConfigSettingValue(
RepairConstants.RepairManagerConfigurationSectionName,
RepairConstants.LocalLogPathParameter);
if (int.TryParse(
GetConfigSettingValue(
RepairConstants.RepairManagerConfigurationSectionName,
RepairConstants.HealthCheckLoopSleepTimeSeconds),
out int execFrequency))
{
ExecutionLoopSleepSeconds = execFrequency;
}
// Telemetry (assumes a diagnostics/analytics cloud service is configured).
if (bool.TryParse(GetConfigSettingValue(
RepairConstants.RepairManagerConfigurationSectionName,
RepairConstants.AppInsightsTelemetryEnabled),
out bool telemEnabled))
{
TelemetryEnabled = telemEnabled;
if (TelemetryEnabled)
{
string telemetryProviderType = GetConfigSettingValue(
RepairConstants.RepairManagerConfigurationSectionName,
RepairConstants.TelemetryProviderType);
if (string.IsNullOrEmpty(telemetryProviderType))
{
// No provider specified: disable telemetry, but keep loading the remaining settings below.
TelemetryEnabled = false;
}
else if (Enum.TryParse(telemetryProviderType, out TelemetryProviderType telemetryProvider))
{
TelemetryProvider = telemetryProvider;
if (telemetryProvider == TelemetryProviderType.AzureLogAnalytics)
{
LogAnalyticsLogType = GetConfigSettingValue(
RepairConstants.RepairManagerConfigurationSectionName,
RepairConstants.LogAnalyticsLogTypeParameter);
LogAnalyticsSharedKey = GetConfigSettingValue(
RepairConstants.RepairManagerConfigurationSectionName,
RepairConstants.LogAnalyticsSharedKeyParameter);
LogAnalyticsWorkspaceId = GetConfigSettingValue(
RepairConstants.RepairManagerConfigurationSectionName,
RepairConstants.LogAnalyticsWorkspaceIdParameter);
}
else
{
AppInsightsInstrumentationKey = GetConfigSettingValue(
RepairConstants.RepairManagerConfigurationSectionName,
RepairConstants.AppInsightsInstrumentationKeyParameter);
}
}
}
}
// FabricHealer ETW telemetry.
if (bool.TryParse(GetConfigSettingValue(
RepairConstants.RepairManagerConfigurationSectionName,
RepairConstants.EnableEventSourceProvider),
out bool etwEnabled))
{
EtwEnabled = etwEnabled;
EtwProviderName = GetConfigSettingValue(
RepairConstants.RepairManagerConfigurationSectionName,
RepairConstants.EventSourceProviderName);
}
// Repair Policies
if (bool.TryParse(GetConfigSettingValue(
RepairConstants.AppRepairPolicySectionName,
RepairConstants.Enabled),
out bool appRepairEnabled))
{
this.EnableAppRepair = appRepairEnabled;
}
if (bool.TryParse(GetConfigSettingValue(
RepairConstants.FabricNodeRepairPolicySectionName,
RepairConstants.Enabled),
out bool nodeRepairEnabled))
{
this.EnableNodeRepair = nodeRepairEnabled;
}
if (bool.TryParse(GetConfigSettingValue(
RepairConstants.ReplicaRepairPolicySectionName,
RepairConstants.Enabled),
out bool replicaRepairEnabled))
{
this.EnableReplicaRepair = replicaRepairEnabled;
}
if (bool.TryParse(GetConfigSettingValue(
RepairConstants.SystemAppRepairPolicySectionName,
RepairConstants.Enabled),
out bool systemAppRepairEnabled))
{
this.EnableSystemAppRepair = systemAppRepairEnabled;
}
if (bool.TryParse(GetConfigSettingValue(
RepairConstants.VmRepairPolicySectionName,
RepairConstants.Enabled),
out bool vmRepairEnabled))
{
this.EnableVmRepair = vmRepairEnabled;
}
}
private string GetConfigSettingValue(string sectionName, string parameterName)
{
try
{
var settings = this.configSettings;
// This will always be null unless there is a configuration update.
if (settings == null)
{
settings = this.context.CodePackageActivationContext?.GetConfigurationPackageObject("Config")?.Settings;
if (settings == null)
{
return null;
}
}
var section = settings.Sections[sectionName];
var parameter = section.Parameters[parameterName];
// reset.
this.configSettings = null;
return parameter.Value;
}
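catch (KeyNotFoundException)
{
// The section or parameter is not defined; fall through and return null.
}
catch (FabricElementNotFoundException)
{
// The configuration element could not be found; fall through and return null.
}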
return null;
}
}
}


@ -0,0 +1,197 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using FabricHealer.Repair;
using System.Collections.Generic;
using System.Linq;
namespace FabricHealer.Utilities
{
// FabricObserver Error/Warning/Ok Codes.
public sealed class FabricObserverErrorWarningCodes
{
// Ok
public const string Ok = "FO000";
// CPU
public const string AppErrorCpuPercent = "FO001";
public const string AppWarningCpuPercent = "FO002";
public const string NodeErrorCpuPercent = "FO003";
public const string NodeWarningCpuPercent = "FO004";
// Certificate
public const string ErrorCertificateExpiration = "FO005";
public const string WarningCertificateExpiration = "FO006";
// Disk
public const string NodeErrorDiskSpacePercent = "FO007";
public const string NodeErrorDiskSpaceMB = "FO008";
public const string NodeWarningDiskSpacePercent = "FO009";
public const string NodeWarningDiskSpaceMB = "FO010";
public const string NodeErrorDiskAverageQueueLength = "FO011";
public const string NodeWarningDiskAverageQueueLength = "FO012";
// Memory
public const string AppErrorMemoryPercent = "FO013";
public const string AppWarningMemoryPercent = "FO014";
public const string AppErrorMemoryMB = "FO015";
public const string AppWarningMemoryMB = "FO016";
public const string NodeErrorMemoryPercent = "FO017";
public const string NodeWarningMemoryPercent = "FO018";
public const string NodeErrorMemoryMB = "FO019";
public const string NodeWarningMemoryMB = "FO020";
// Networking
public const string AppErrorNetworkEndpointUnreachable = "FO021";
public const string AppWarningNetworkEndpointUnreachable = "FO022";
public const string AppErrorTooManyActiveTcpPorts = "FO023";
public const string AppWarningTooManyActiveTcpPorts = "FO024";
public const string NodeErrorTooManyActiveTcpPorts = "FO025";
public const string NodeWarningTooManyActiveTcpPorts = "FO026";
public const string ErrorTooManyFirewallRules = "FO027";
public const string WarningTooManyFirewallRules = "FO028";
public const string AppErrorTooManyActiveEphemeralPorts = "FO029";
public const string AppWarningTooManyActiveEphemeralPorts = "FO030";
public const string NodeErrorTooManyActiveEphemeralPorts = "FO031";
public const string NodeWarningTooManyActiveEphemeralPorts = "FO032";
public static Dictionary<string, string> AppErrorCodesDictionary { get; } = new Dictionary<string, string>
{
{ Ok, "Ok" },
{ AppErrorCpuPercent, "AppErrorCpuPercent" },
{ AppWarningCpuPercent, "AppWarningCpuPercent" },
{ AppErrorMemoryPercent, "AppErrorMemoryPercent" },
{ AppWarningMemoryPercent, "AppWarningMemoryPercent" },
{ AppErrorMemoryMB, "AppErrorMemoryMB" },
{ AppWarningMemoryMB, "AppWarningMemoryMB" },
{ AppErrorNetworkEndpointUnreachable, "AppErrorNetworkEndpointUnreachable" },
{ AppWarningNetworkEndpointUnreachable, "AppWarningNetworkEndpointUnreachable" },
{ AppErrorTooManyActiveTcpPorts, "AppErrorTooManyActiveTcpPorts" },
{ AppWarningTooManyActiveTcpPorts, "AppWarningTooManyActiveTcpPorts" },
{ AppErrorTooManyActiveEphemeralPorts, "AppErrorTooManyActiveEphemeralPorts" },
{ AppWarningTooManyActiveEphemeralPorts, "AppWarningTooManyActiveEphemeralPorts" },
};
public static Dictionary<string, string> NodeErrorCodesDictionary { get; } = new Dictionary<string, string>
{
{ Ok, "Ok" },
{ NodeErrorCpuPercent, "NodeErrorCpuPercent" },
{ NodeWarningCpuPercent, "NodeWarningCpuPercent" },
{ ErrorCertificateExpiration, "ErrorCertificateExpiration" },
{ WarningCertificateExpiration, "WarningCertificateExpiration" },
{ NodeErrorDiskSpacePercent, "NodeErrorDiskSpacePercent" },
{ NodeErrorDiskSpaceMB, "NodeErrorDiskSpaceMB" },
{ NodeWarningDiskSpacePercent, "NodeWarningDiskSpacePercent" },
{ NodeWarningDiskSpaceMB, "NodeWarningDiskSpaceMB" },
{ NodeErrorDiskAverageQueueLength, "NodeErrorDiskAverageQueueLength" },
{ NodeWarningDiskAverageQueueLength, "NodeWarningDiskAverageQueueLength" },
{ NodeErrorMemoryPercent, "NodeErrorMemoryPercent" },
{ NodeWarningMemoryPercent, "NodeWarningMemoryPercent" },
{ NodeErrorMemoryMB, "NodeErrorMemoryMB" },
{ NodeWarningMemoryMB, "NodeWarningMemoryMB" },
{ NodeErrorTooManyActiveTcpPorts, "NodeErrorTooManyActiveTcpPorts" },
{ NodeWarningTooManyActiveTcpPorts, "NodeWarningTooManyActiveTcpPorts" },
{ ErrorTooManyFirewallRules, "NodeErrorTooManyFirewallRules" },
{ WarningTooManyFirewallRules, "NodeWarningTooManyFirewallRules" },
{ NodeErrorTooManyActiveEphemeralPorts, "NodeErrorTooManyActiveEphemeralPorts" },
{ NodeWarningTooManyActiveEphemeralPorts, "NodeWarningTooManyActiveEphemeralPorts" },
};
public static string GetErrorWarningNameFromCode(string id)
{
if (string.IsNullOrEmpty(id))
{
return null;
}
// Look up the code directly by key instead of scanning each dictionary twice.
if (AppErrorCodesDictionary.TryGetValue(id, out string appCodeName))
{
return appCodeName;
}
return NodeErrorCodesDictionary.TryGetValue(id, out string nodeCodeName) ? nodeCodeName : null;
}
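// Usage sketch (illustrative only; not part of the original commit), based on the
// dictionaries defined above. "FO999" is a made-up code used to show the not-found case:
//
//     GetErrorWarningNameFromCode("FO001");  // "AppErrorCpuPercent"
//     GetErrorWarningNameFromCode("FO003");  // "NodeErrorCpuPercent"
//     GetErrorWarningNameFromCode("FO999");  // null (unknown code)
//     GetErrorWarningNameFromCode("");       // null (null or empty input)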
public static string GetMetricNameFromCode(string code)
{
if (GetIsResourceType(code, RepairConstants.ActiveTcpPorts))
{
return RepairConstants.ActiveTcpPorts;
}
if (GetIsResourceType(code, RepairConstants.Certificate))
{
return RepairConstants.Certificate;
}
if (GetIsResourceType(code, RepairConstants.Cpu))
{
return RepairConstants.CpuPercent;
}
if (GetIsResourceType(code, RepairConstants.DiskAverageQueueLength))
{
return RepairConstants.DiskAverageQueueLength;
}
if (GetIsResourceType(code, RepairConstants.DiskSpaceMB))
{
return RepairConstants.DiskSpaceMB;
}
if (GetIsResourceType(code, RepairConstants.DiskSpacePercent))
{
return RepairConstants.DiskSpacePercent;
}
if (GetIsResourceType(code, RepairConstants.EndpointUnreachable))
{
return RepairConstants.EndpointUnreachable;
}
if (GetIsResourceType(code, RepairConstants.EphemeralPorts))
{
return RepairConstants.EphemeralPorts;
}
if (GetIsResourceType(code, RepairConstants.FirewallRules))
{
return RepairConstants.FirewallRules;
}
if (GetIsResourceType(code, RepairConstants.MemoryMB))
{
return RepairConstants.MemoryMB;
}
return GetIsResourceType(code, RepairConstants.MemoryPercent) ? RepairConstants.MemoryPercent : null;
}
private static bool GetIsResourceType(string id, string resourceType)
{
if (string.IsNullOrEmpty(id))
{
return false;
}
if (AppErrorCodesDictionary.Any(k => k.Key == id && k.Value.Contains(resourceType)))
{
return true;
}
if (NodeErrorCodesDictionary.Any(k => k.Key == id && k.Value.Contains(resourceType)))
{
return true;
}
return false;
}
}
}


@ -0,0 +1,193 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Collections.Generic;
using System.Fabric;
namespace FabricHealer.Utilities
{
/// <summary>
/// Class to define retry-able fabric client errors
/// </summary>
public class FabricClientRetryErrors
{
/// <summary>
/// Fabric errors that are retry-able for fabric client GetEntityHealth commands
/// </summary>
public static readonly Lazy<FabricClientRetryErrors> GetEntityHealthFabricErrors = new Lazy<FabricClientRetryErrors>(() =>
{
var retryErrors = new FabricClientRetryErrors();
retryErrors.RetryableFabricErrorCodes.Add(FabricErrorCode.FabricHealthEntityNotFound);
return retryErrors;
});
/// <summary>
/// Fabric errors that are retry-able for fabric client MoveSecondary commands
/// </summary>
public static readonly Lazy<FabricClientRetryErrors> MoveSecondaryFabricErrors = new Lazy<FabricClientRetryErrors>(() =>
{
var retryErrors = new FabricClientRetryErrors();
retryErrors.RetrySuccessFabricErrorCodes.Add(FabricErrorCode.AlreadySecondaryReplica);
retryErrors.RetryableFabricErrorCodes.Add(FabricErrorCode.PLBNotReady);
return retryErrors;
});
/// <summary>
/// Fabric errors that are retry-able for fabric client MovePrimary commands
/// </summary>
public static readonly Lazy<FabricClientRetryErrors> MovePrimaryFabricErrors = new Lazy<FabricClientRetryErrors>(() =>
{
var retryErrors = new FabricClientRetryErrors();
retryErrors.RetrySuccessFabricErrorCodes.Add(FabricErrorCode.AlreadyPrimaryReplica);
retryErrors.RetryableFabricErrorCodes.Add(FabricErrorCode.PLBNotReady);
return retryErrors;
});
/// <summary>
/// Fabric errors that are retry-able for fabric client RemoveReplica commands
/// </summary>
public static readonly Lazy<FabricClientRetryErrors> RemoveReplicaErrors = new Lazy<FabricClientRetryErrors>(() =>
{
var retryErrors = new FabricClientRetryErrors();
retryErrors.RetryableFabricErrorCodes.Add(FabricErrorCode.ObjectClosed);
return retryErrors;
});
/// <summary>
/// Fabric errors that are retry-able for fabric client RestartReplica commands
/// </summary>
public static readonly Lazy<FabricClientRetryErrors> RestartReplicaErrors = new Lazy<FabricClientRetryErrors>(() =>
{
var retryErrors = new FabricClientRetryErrors();
retryErrors.RetryableFabricErrorCodes.Add(FabricErrorCode.ObjectClosed);
return retryErrors;
});
/// <summary>
/// Fabric errors that are retry-able for fabric client GetPartitionList commands
/// </summary>
public static readonly Lazy<FabricClientRetryErrors> GetPartitionListFabricErrors = new Lazy<FabricClientRetryErrors>(() =>
{
var retryErrors = new FabricClientRetryErrors();
retryErrors.RetryableFabricErrorCodes.Add(FabricErrorCode.ServiceNotFound);
retryErrors.RetryableExceptions.Add(typeof(FabricServiceNotFoundException));
return retryErrors;
});
/// <summary>
/// Fabric errors that are retry-able for fabric client GetClusterManifest commands
/// </summary>
public static readonly Lazy<FabricClientRetryErrors> GetClusterManifestFabricErrors = new Lazy<FabricClientRetryErrors>(() =>
{
var retryErrors = new FabricClientRetryErrors();
return retryErrors;
});
/// <summary>
/// Fabric errors that are retry-able for fabric client Provision commands
/// </summary>
public static readonly Lazy<FabricClientRetryErrors> ProvisionFabricErrors = new Lazy<FabricClientRetryErrors>(() =>
{
var retryErrors = new FabricClientRetryErrors();
retryErrors.RetrySuccessFabricErrorCodes.Add(FabricErrorCode.FabricVersionAlreadyExists);
return retryErrors;
});
/// <summary>
/// Fabric errors that are retry-able for fabric client Upgrade commands
/// </summary>
public static readonly Lazy<FabricClientRetryErrors> UpgradeFabricErrors = new Lazy<FabricClientRetryErrors>(() =>
{
var retryErrors = new FabricClientRetryErrors();
retryErrors.RetrySuccessFabricErrorCodes.Add(FabricErrorCode.FabricUpgradeInProgress);
retryErrors.RetrySuccessFabricErrorCodes.Add(FabricErrorCode.FabricAlreadyInTargetVersion);
return retryErrors;
});
/// <summary>
/// Fabric errors that are retry-able for fabric client RemoveUnreliableTransportBehavior commands
/// </summary>
public static readonly Lazy<FabricClientRetryErrors> RemoveUnreliableTransportBehaviorErrors = new Lazy<FabricClientRetryErrors>(() =>
{
var retryErrors = new FabricClientRetryErrors();
retryErrors.publicRetrySuccessFabricErrorCodes.Add(2147949808);
return retryErrors;
});
/// <summary>
/// Setting SuccessFabricErrorCodes while performing CreateApp
/// </summary>
public static readonly Lazy<FabricClientRetryErrors> CreateAppErrors = new Lazy<FabricClientRetryErrors>(() =>
{
var retryErrors = new FabricClientRetryErrors();
retryErrors.RetrySuccessFabricErrorCodes.Add(FabricErrorCode.ApplicationAlreadyExists);
return retryErrors;
});
/// <summary>
/// Setting SuccessFabricErrorCodes while performing DeleteApp
/// </summary>
public static readonly Lazy<FabricClientRetryErrors> DeleteAppErrors = new Lazy<FabricClientRetryErrors>(() =>
{
var retryErrors = new FabricClientRetryErrors();
retryErrors.RetrySuccessFabricErrorCodes.Add(FabricErrorCode.ApplicationNotFound);
return retryErrors;
});
/// <summary>
/// Constructor that populates default retry-able errors
/// </summary>
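/// <example>
/// A minimal sketch, not taken from the original sources: a caller can start from the
/// defaults populated by this constructor and add operation-specific codes. The two
/// codes chosen below are only an illustration.
/// <code>
/// var errors = new FabricClientRetryErrors();
/// // Retried by FabricClientRetryHelper, in addition to the defaults.
/// errors.RetryableFabricErrorCodes.Add(FabricErrorCode.ServiceNotFound);
/// // Treated as success: no retry and no rethrow.
/// errors.RetrySuccessFabricErrorCodes.Add(FabricErrorCode.ApplicationAlreadyExists);
/// </code>
/// </example>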
public FabricClientRetryErrors()
{
this.RetryableExceptions = new List<Type>();
this.RetryableFabricErrorCodes = new List<FabricErrorCode>();
this.RetrySuccessExceptions = new List<Type>();
this.RetrySuccessFabricErrorCodes = new List<FabricErrorCode>();
this.publicRetrySuccessFabricErrorCodes = new List<uint>();
this.PopulateDefaultValues();
}
/// <summary>
/// List of exceptions that are retry-able
/// </summary>
public IList<Type> RetryableExceptions { get; private set; }
/// <summary>
/// List of Fabric error codes that are retry-able
/// </summary>
public IList<FabricErrorCode> RetryableFabricErrorCodes { get; private set; }
/// <summary>
/// List of exception types that are treated as success (the operation is not retried)
/// </summary>
public IList<Type> RetrySuccessExceptions { get; private set; }
/// <summary>
/// List of Fabric error codes that are treated as success (the operation is not retried)
/// </summary>
public IList<FabricErrorCode> RetrySuccessFabricErrorCodes { get; private set; }
/// <summary>
/// List of raw HRESULT values (read from an inner COMException) that are treated as success (the operation is not retried)
/// </summary>
public IList<uint> publicRetrySuccessFabricErrorCodes { get; private set; }
private void PopulateDefaultValues()
{
this.RetryableExceptions.Add(typeof(TimeoutException));
this.RetryableExceptions.Add(typeof(OperationCanceledException));
this.RetryableExceptions.Add(typeof(FabricNotReadableException));
this.RetryableFabricErrorCodes.Add(FabricErrorCode.OperationTimedOut);
this.RetryableFabricErrorCodes.Add(FabricErrorCode.CommunicationError);
// TODO: Enable after updating ServiceFabricClientPackage in SF-AppStore Repo
// this.RetryableFabricErrorCodes.Add(FabricErrorCode.GatewayNotReachable);
this.RetryableFabricErrorCodes.Add(FabricErrorCode.ServiceTooBusy);
}
}
}

View file

@ -0,0 +1,166 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Diagnostics;
using System.Fabric;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;
namespace FabricHealer.Utilities
{
/// <summary>
/// Helper class to execute fabric client operations with retry
/// </summary>
public static class FabricClientRetryHelper
{
private static readonly TimeSpan DefaultOperationTimeout = TimeSpan.FromMinutes(2);
private static readonly Logger Logger = new Logger("FabricClientRetryHelper");
/// <summary>
/// Helper method to execute the given function with the default FabricClientRetryErrors and the default operation timeout (2 minutes)
/// </summary>
/// <param name="function">Action to be performed</param>
/// <param name="cancellationToken">Cancellation token for Async operation</param>
/// <returns>Task object</returns>
public static async Task<T> ExecuteFabricActionWithRetryAsync<T>(
Func<Task<T>> function,
CancellationToken cancellationToken)
{
return await ExecuteFabricActionWithRetryAsync(
function,
new FabricClientRetryErrors(),
DefaultOperationTimeout,
cancellationToken).ConfigureAwait(false);
}
/// <summary>
/// Helper method to execute the given function with caller-supplied FabricClientRetryErrors and operation timeout
/// </summary>
/// <param name="function">Action to be performed</param>
/// <param name="errors">Fabric Client Errors that can be retried</param>
/// <param name="operationTimeout">Timeout for the operation</param>
/// <param name="cancellationToken">Cancellation token for Async operation</param>
/// <returns>Task object</returns>
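/// <example>
/// A minimal usage sketch, not taken from the original sources. It assumes the caller
/// already has a FabricClient (<c>fabricClient</c>), a service Uri (<c>serviceName</c>)
/// and a CancellationToken (<c>token</c>); the one-minute timeout is an arbitrary choice.
/// <code>
/// var partitions = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
///     () => fabricClient.QueryManager.GetPartitionListAsync(serviceName),
///     FabricClientRetryErrors.GetPartitionListFabricErrors.Value,
///     TimeSpan.FromMinutes(1),
///     token).ConfigureAwait(false);
/// // When an error listed as "success" is swallowed, the helper returns default(T),
/// // so a reference-type result such as this one may be null.
/// </code>
/// </example>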
public static async Task<T> ExecuteFabricActionWithRetryAsync<T>(
Func<Task<T>> function,
FabricClientRetryErrors errors,
TimeSpan operationTimeout,
CancellationToken cancellationToken)
{
bool needToWait = false;
var watch = new Stopwatch();
watch.Start();
while (true)
{
cancellationToken.ThrowIfCancellationRequested();
if (needToWait)
{
await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken).ConfigureAwait(false);
}
try
{
return await function().ConfigureAwait(false);
}
catch (Exception e)
{
if (HandleException(e, errors, out bool retryElseSuccess))
{
if (retryElseSuccess)
{
Logger.LogInfo(
$"ExecuteFabricActionWithRetryAsync: Retrying due to Exception: {e}");
if (watch.Elapsed > operationTimeout)
{
Logger.LogWarning(
"ExecuteFabricActionWithRetryAsync: Done Retrying. " +
$"Time Elapsed: {watch.Elapsed.TotalSeconds}, " +
$"Timeout: {operationTimeout.TotalSeconds}. " +
$"Throwing Exception: {e}");
throw;
}
needToWait = true;
continue;
}
Logger.LogInfo(
$"ExecuteFabricActionWithRetryAsync: Exception {e} Handled but No Retry.");
return default;
}
throw;
}
}
}
private static bool HandleException(
Exception e,
FabricClientRetryErrors errors,
out bool retryElseSuccess)
{
var fabricException = e as FabricException;
if (errors.RetryableExceptions.Contains(e.GetType()))
{
retryElseSuccess = true /*retry*/;
return true;
}
if (fabricException != null && errors.RetryableFabricErrorCodes.Contains(fabricException.ErrorCode))
{
retryElseSuccess = true /*retry*/;
return true;
}
if (errors.RetrySuccessExceptions.Contains(e.GetType()))
{
retryElseSuccess = false /*success*/;
return true;
}
if (fabricException != null
&& errors.RetrySuccessFabricErrorCodes.Contains(fabricException.ErrorCode))
{
retryElseSuccess = false /*success*/;
return true;
}
if (e.GetType() == typeof(FabricTransientException))
{
retryElseSuccess = true /*retry*/;
return true;
}
if (fabricException?.InnerException != null)
{
if (fabricException.InnerException is COMException ex
&& errors.publicRetrySuccessFabricErrorCodes.Contains((uint)ex.ErrorCode))
{
retryElseSuccess = false /*success*/;
return true;
}
}
retryElseSuccess = false;
return false;
}
}
}

View file

@ -0,0 +1,94 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Fabric;
using System.Fabric.Health;
namespace FabricHealer.Utilities
{
/// <summary>
/// Reports health data to Service Fabric Health Manager and logs locally (optional).
/// </summary>
public class FabricHealthReporter
{
private readonly FabricClient fabricClient;
/// <summary>
/// Initializes a new instance of the <see cref="FabricHealthReporter"/> class.
/// </summary>
/// <param name="fabricClient">FabricClient instance used to send health reports.</param>
public FabricHealthReporter(FabricClient fabricClient)
{
this.fabricClient = fabricClient ?? throw new ArgumentNullException(nameof(fabricClient), "FabricClient can't be null");
this.fabricClient.Settings.HealthReportSendInterval = TimeSpan.FromSeconds(1);
this.fabricClient.Settings.HealthReportRetrySendInterval = TimeSpan.FromSeconds(3);
}
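/// <summary>
/// Sends the supplied report to the Service Fabric Health Manager. Ok reports are sent
/// immediately; the time to live defaults to 5 minutes when the report does not set one.
/// </summary>
/// <example>
/// A minimal sketch, not taken from the original sources. The node name, source and
/// property values are hypothetical, and <c>fabricClient</c> is assumed to exist.
/// <code>
/// var reporter = new FabricHealthReporter(fabricClient);
/// reporter.ReportHealthToServiceFabric(new HealthReport
/// {
///     NodeName = "_Node_0",
///     ReportType = HealthReportType.Node,
///     State = HealthState.Ok,
///     Source = "FabricHealer",
///     Property = "RepairAction",
///     HealthMessage = "Repair completed.",
/// });
/// </code>
/// </example>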
public void ReportHealthToServiceFabric(HealthReport healthReport)
{
if (healthReport == null)
{
return;
}
var sendOptions = new HealthReportSendOptions { Immediate = false };
// Quickly send OK (clears warning/errors states).
if (healthReport.State == HealthState.Ok)
{
sendOptions.Immediate = true;
}
var timeToLive = TimeSpan.FromMinutes(5);
if (healthReport.HealthReportTimeToLive != default)
{
timeToLive = healthReport.HealthReportTimeToLive;
}
var healthInformation = new HealthInformation(
healthReport.Source,
healthReport.Code ?? healthReport.Property,
healthReport.State)
{
Description = healthReport.HealthMessage,
TimeToLive = timeToLive,
RemoveWhenExpired = true,
};
switch (healthReport.ReportType)
{
case HealthReportType.Application when healthReport.AppName != null:
var appHealthReport = new ApplicationHealthReport(healthReport.AppName, healthInformation);
this.fabricClient.HealthManager.ReportHealth(appHealthReport, sendOptions);
break;
case HealthReportType.Node when healthReport.NodeName != null:
var nodeHealthReport = new NodeHealthReport(healthReport.NodeName, healthInformation);
this.fabricClient.HealthManager.ReportHealth(nodeHealthReport, sendOptions);
break;
case HealthReportType.Service when healthReport.ServiceName != null:
var serviceHealthReport = new ServiceHealthReport(healthReport.ServiceName, healthInformation);
this.fabricClient.HealthManager.ReportHealth(serviceHealthReport, sendOptions);
break;
default:
break;
}
}
}
public enum HealthReportType
{
Application,
Node,
Service
}
}

View file

@ -0,0 +1,78 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Fabric.Health;
namespace FabricHealer.Utilities
{
public class HealthReport
{
public Uri AppName
{
get; set;
}
public string Code
{
get; set;
}
public TimeSpan HealthReportTimeToLive
{
get; set;
}
public string HealthMessage
{
get; set;
}
public HealthReportType ReportType
{
get; set;
}
public HealthState State
{
get; set;
}
public string NodeName
{
get; set;
}
public string Source
{
get; set;
}
public Guid PartitionId
{
get; set;
}
public string Property
{
get; set;
}
public long ReplicaId
{
get; set;
}
public string ResourceUsageDataProperty
{
get; set;
}
public Uri ServiceName
{
get; set;
}
}
}

View file

@ -0,0 +1,18 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
namespace FabricHealer.Utilities
{
public enum HealthScope
{
Application,
Cluster,
Node,
Partition,
Replica,
Service,
Unknown
}
}

View file

@ -0,0 +1,38 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using Newtonsoft.Json;
namespace FabricHealer.Utilities
{
public static class JsonHelper
{
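// Usage sketch (illustrative only; the Dictionary type argument is an arbitrary choice):
//
//   bool ok  = JsonHelper.IsJson<Dictionary<string, string>>("{\"a\":\"b\"}");
//   bool bad = JsonHelper.IsJson<Dictionary<string, string>>("not json");
//
// The first call is expected to return true and the second false. The check only
// verifies that the text deserializes into T: JSON whose members do not match T may
// still be accepted, because Json.NET ignores unknown members by default.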
public static bool IsJson<T>(string text)
{
if (string.IsNullOrWhiteSpace(text))
{
return false;
}
try
{
_ = JsonConvert.DeserializeObject<T>(text);
return true;
}
catch (JsonSerializationException)
{
return false;
}
catch (JsonReaderException)
{
return false;
}
catch (JsonWriterException)
{
return false;
}
}
}
}

View file

@ -0,0 +1,15 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
namespace FabricHealer.Utilities
{
public enum LogLevel
{
Info,
Warning,
Error,
Critical
}
}

View file

@ -0,0 +1,229 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Diagnostics.Tracing;
using System.IO;
using System.Runtime.InteropServices;
using System.Threading;
using NLog;
using NLog.Config;
using NLog.Targets;
using NLog.Time;
namespace FabricHealer.Utilities
{
public sealed class Logger
{
private const int Retries = 5;
private readonly string loggerName;
// Text file logger.
private ILogger OLogger
{
get; set;
}
public string FolderName
{
get;
}
public string Filename
{
get;
}
public bool EnableVerboseLogging
{
get; set;
} = false;
public string LogFolderBasePath
{
get; set;
}
public string FilePath
{
get; set;
}
public static EventSource EtwLogger
{
get;
}
static Logger()
{
if (!FabricHealerManager.ConfigSettings.EtwEnabled
|| string.IsNullOrEmpty(FabricHealerManager.ConfigSettings.EtwProviderName))
{
return;
}
if (EtwLogger == null)
{
EtwLogger = new EventSource(FabricHealerManager.ConfigSettings.EtwProviderName);
}
}
/// <summary>
/// Initializes a new instance of the <see cref="Logger"/> class.
/// </summary>
/// <param name="sourceName">Name of the log source; used as the folder name, the file name and the logger name.</param>
/// <param name="logFolderBasePath">Base folder path.</param>
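/// <example>
/// A minimal sketch, not taken from the original sources. The source name "RepairLog"
/// and the <c>nodeName</c> variable are hypothetical.
/// <code>
/// var logger = new Logger("RepairLog") { EnableVerboseLogging = true };
/// // LogInfo writes nothing unless EnableVerboseLogging is true.
/// logger.LogInfo("Repair started for node {0}", nodeName);
/// logger.LogWarning("Repair took longer than expected.");
/// Logger.Flush();
/// </code>
/// </example>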
public Logger(string sourceName, string logFolderBasePath = null)
{
FolderName = sourceName;
Filename = sourceName + ".log";
this.loggerName = sourceName;
if (!string.IsNullOrEmpty(logFolderBasePath))
{
LogFolderBasePath = logFolderBasePath;
}
InitializeLoggers();
}
public void InitializeLoggers()
{
string logFolderBase;
if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
{
string windrive = Environment.SystemDirectory.Substring(0, 3);
logFolderBase = windrive + "fabrichealer_logs";
}
else
{
logFolderBase = "/tmp/fabrichealer_logs";
}
// log directory supplied in config. Set in ObserverManager.
if (!string.IsNullOrEmpty(this.LogFolderBasePath))
{
// Add current drive letter if not supplied for Windows path target.
if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
{
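if (!this.LogFolderBasePath.Substring(0, 3).Contains(":\\"))
{
string windrive = Environment.SystemDirectory.Substring(0, 3);
logFolderBase = windrive + this.LogFolderBasePath;
}
else
{
// A fully qualified path was supplied; use it as-is.
logFolderBase = this.LogFolderBasePath;
}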
}
else
{
// Remove supplied drive letter if Linux is the runtime target.
if (this.LogFolderBasePath.Substring(0, 3).Contains(":\\"))
{
this.LogFolderBasePath = this.LogFolderBasePath.Remove(0, 3);
}
logFolderBase = this.LogFolderBasePath;
}
}
string file = Path.Combine(logFolderBase, "fabrichealer.log");
if (!string.IsNullOrEmpty(FolderName) && !string.IsNullOrEmpty(Filename))
{
string folderPath = Path.Combine(logFolderBase, FolderName);
file = Path.Combine(folderPath, Filename);
}
FilePath = file;
var targetName = this.loggerName + "LogFile";
if (LogManager.Configuration == null)
{
LogManager.Configuration = new LoggingConfiguration();
}
if ((FileTarget)LogManager.Configuration?.FindTargetByName(targetName) == null)
{
var target = new FileTarget
{
Name = targetName,
FileName = file,
Layout = "${longdate}--${uppercase:${level}}--${message}",
OpenFileCacheTimeout = 5,
ArchiveNumbering = ArchiveNumberingMode.DateAndSequence,
ArchiveEvery = FileArchivePeriod.Day,
AutoFlush = true,
};
LogManager.Configuration.AddTarget(this.loggerName + "LogFile", target);
var ruleInfo = new LoggingRule(this.loggerName, NLog.LogLevel.Debug, target);
LogManager.Configuration.LoggingRules.Add(ruleInfo);
LogManager.ReconfigExistingLoggers();
}
TimeSource.Current = new AccurateUtcTimeSource();
OLogger = LogManager.GetLogger(this.loggerName);
}
public void LogInfo(string format, params object[] parameters)
{
if (!EnableVerboseLogging)
{
return;
}
OLogger.Info(format, parameters);
}
public void LogError(string format, params object[] parameters)
{
OLogger.Error(format, parameters);
}
public void LogWarning(string format, params object[] parameters)
{
OLogger.Warn(format, parameters);
}
public static bool TryWriteLogFile(string path, string content)
{
if (string.IsNullOrEmpty(content))
{
return false;
}
for (var i = 0; i < Retries; i++)
{
try
{
string directory = Path.GetDirectoryName(path);
if (directory != null && !Directory.Exists(directory))
{
Directory.CreateDirectory(directory);
}
File.WriteAllText(path, content);
return true;
}
catch (Exception e) when (e is IOException || e is UnauthorizedAccessException)
{
}
Thread.Sleep(1000);
}
return false;
}
public static void Flush()
{
LogManager.Flush();
}
}
}

View file

@ -0,0 +1,71 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using Newtonsoft.Json;
using System;
using System.IO;
namespace FabricHealer.Utilities
{
/// <summary>
/// Serialization and deserialization utility for Json objects
/// </summary>
public static class SerializationUtility
{
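// Round-trip sketch (illustrative only; TelemetryData is just one possible payload
// type and `data` is assumed to be an existing instance):
//
//   if (SerializationUtility.TrySerialize(data, out string json)
//       && SerializationUtility.TryDeserialize(json, out TelemetryData copy))
//   {
//       // copy now holds the deserialized values.
//   }
//
// Note: the *File variants below call File.WriteAllText / File.ReadAllText outside of
// any try block, so I/O exceptions reach the caller despite the Try prefix.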
#pragma warning disable CA1720 // Identifier contains type name
public static bool TrySerialize<T>(T objTarget, out string obj)
#pragma warning restore CA1720 // Identifier contains type name
{
try
{
obj = JsonConvert.SerializeObject(objTarget);
return true;
}
catch (Exception)
{
obj = default;
return false;
}
}
public static bool TryDeserialize<T>(string serializedObj, out T obj)
{
try
{
obj = JsonConvert.DeserializeObject<T>(serializedObj);
return true;
}
catch (Exception)
{
obj = default;
return false;
}
}
#pragma warning disable CA1720 // Identifier contains type name
public static bool TrySerializeObjectToFile<T>(string fileName, T obj)
#pragma warning restore CA1720 // Identifier contains type name
{
if (TrySerialize(obj, out string file))
{
File.WriteAllText(fileName, file);
return true;
}
return false;
}
public static bool TryDeserializeObjectFromFile<T>(string fileName, out T obj)
{
if (TryDeserialize(File.ReadAllText(fileName), out obj))
{
return true;
}
obj = default;
return false;
}
}
}

View file

@ -0,0 +1,360 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Collections.Generic;
using System.Fabric.Health;
using System.Threading;
using System.Threading.Tasks;
using FabricHealer.Interfaces;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;
namespace FabricHealer.Utilities.Telemetry
{
/// <summary>
/// Abstracts the ApplicationInsights telemetry API calls allowing
/// other telemetry providers to be plugged in.
/// </summary>
public class AppInsightsTelemetry : ITelemetryProvider, IDisposable
{
/// <summary>
/// ApplicationInsights telemetry client.
/// </summary>
private readonly TelemetryClient telemetryClient;
private readonly Logger logger;
public AppInsightsTelemetry(string key)
{
if (string.IsNullOrWhiteSpace(key))
{
throw new ArgumentException("Argument is empty", nameof(key));
}
this.logger = new Logger("TelemetryLog");
this.telemetryClient = new TelemetryClient(new TelemetryConfiguration() { InstrumentationKey = key });
}
/// <summary>
/// Gets a value indicating whether telemetry is enabled or not.
/// </summary>
public bool IsEnabled => this.telemetryClient.IsEnabled() && FabricHealerManager.ConfigSettings.TelemetryEnabled;
/// <summary>
/// Gets or sets the key.
/// </summary>
public string Key
{
get => this.telemetryClient?.InstrumentationKey;
set => this.telemetryClient.InstrumentationKey = value;
}
/// <summary>
/// Calls AI to track the availability.
/// </summary>
/// <param name="serviceName">Service name.</param>
/// <param name="instance">Instance identifier.</param>
/// <param name="testName">Availability test name.</param>
/// <param name="captured">The time when the availability was captured.</param>
/// <param name="duration">The time taken for the availability test to run.</param>
/// <param name="location">Name of the location the availability test was run from.</param>
/// <param name="success">True if the availability test ran successfully.</param>
/// <param name="cancellationToken">CancellationToken instance.</param>
/// <param name="message">Error message on availability test run failure.</param>
/// <returns>A <see cref="Task"/> representing the asynchronous operation.</returns>
public Task ReportAvailabilityAsync(
Uri serviceName,
string instance,
string testName,
DateTimeOffset captured,
TimeSpan duration,
string location,
bool success,
CancellationToken cancellationToken,
string message = null)
{
if (!this.IsEnabled || cancellationToken.IsCancellationRequested)
{
return Task.FromResult(1);
}
var at = new AvailabilityTelemetry(testName, captured, duration, location, success, message);
at.Properties.Add("Service", serviceName?.OriginalString);
at.Properties.Add("Instance", instance);
this.telemetryClient.TrackAvailability(at);
return Task.FromResult(0);
}
/// <summary>
/// Calls AI to report health.
/// </summary>
/// <param name="scope">Scope of health evaluation (Cluster, Node, etc.).</param>
/// <param name="propertyName">Value of the property.</param>
/// <param name="state">Health state.</param>
/// <param name="unhealthyEvaluations">Unhealthy evaluations aggregated description.</param>
/// <param name="source">Source of emission.</param>
/// <param name="cancellationToken">CancellationToken instance.</param>
/// <param name="serviceName">Optional: TraceTelemetry context cloud service name.</param>
/// <param name="instanceName">Optional: TraceTelemetry context cloud instance name.</param>
/// <returns>A <see cref="Task"/> representing the asynchronous operation.</returns>
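/// <example>
/// A minimal sketch, not taken from the original sources. It assumes the caller has an
/// instrumentation key (<c>instrumentationKey</c>) and a CancellationToken (<c>token</c>);
/// the property name and message are hypothetical.
/// <code>
/// var telemetry = new AppInsightsTelemetry(instrumentationKey);
/// await telemetry.ReportHealthAsync(
///     HealthScope.Node,
///     "RepairAction",
///     HealthState.Warning,
///     "Node is low on disk space.",
///     "FabricHealer",
///     token).ConfigureAwait(false);
/// </code>
/// </example>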
public Task ReportHealthAsync(
HealthScope scope,
string propertyName,
HealthState state,
string unhealthyEvaluations,
string source,
CancellationToken cancellationToken,
string serviceName = null,
string instanceName = null)
{
if (!this.IsEnabled || cancellationToken.IsCancellationRequested)
{
return Task.FromResult(1);
}
try
{
cancellationToken.ThrowIfCancellationRequested();
var sev = (state == HealthState.Error) ? SeverityLevel.Error
: (state == HealthState.Warning) ? SeverityLevel.Warning : SeverityLevel.Information;
string healthInfo = string.Empty;
if (!string.IsNullOrEmpty(unhealthyEvaluations))
{
healthInfo += $"{Environment.NewLine}{unhealthyEvaluations}";
}
var tt = new TraceTelemetry($"Service Fabric Health report - {Enum.GetName(typeof(HealthScope), scope)}: {Enum.GetName(typeof(HealthState), state)} -> {source}:{propertyName}{healthInfo}", sev);
tt.Context.Cloud.RoleName = serviceName;
tt.Context.Cloud.RoleInstance = instanceName;
this.telemetryClient.TrackTrace(tt);
}
catch (Exception e)
{
this.logger.LogWarning($"Unhandled exception in TelemetryClient.ReportHealthAsync:{Environment.NewLine}{e}");
throw;
}
return Task.FromResult(0);
}
/// <summary>
/// Sends metrics to a telemetry service.
/// </summary>
/// <typeparam name="T">type of data.</typeparam>
/// <param name="name">name of metric.</param>
/// <param name="value">value of metric.</param>
/// <param name="source">source of event.</param>
/// <param name="cancellationToken">cancellation token.</param>
/// <returns>A Task of bool.</returns>
public async Task<bool> ReportMetricAsync<T>(
string name,
T value,
string source,
CancellationToken cancellationToken)
{
if (!this.IsEnabled || cancellationToken.IsCancellationRequested)
{
return false;
}
TraceTelemetry tt = new TraceTelemetry(name, SeverityLevel.Information);
this.telemetryClient?.TrackTrace(tt);
return await Task.FromResult(true).ConfigureAwait(false);
}
/// <summary>
/// Reports a metric to a telemetry service.
/// </summary>
/// <param name="telemetryData">TelemetryData instance.</param>
/// <param name="cancellationToken">Cancellation token.</param>
/// <returns>A task.</returns>
public Task ReportMetricAsync(
TelemetryData telemetryData,
CancellationToken cancellationToken)
{
if (telemetryData == null)
{
return Task.CompletedTask;
}
Dictionary<string, string> properties = new Dictionary<string, string>
{
{ "Application", telemetryData.ApplicationName ?? string.Empty },
{ "ClusterId", telemetryData.ClusterId ?? string.Empty },
{ "ErrorCode", telemetryData.Code ?? string.Empty },
{ "HealthEventDescription", telemetryData.HealthEventDescription ?? string.Empty },
{ "HealthState", telemetryData.HealthState ?? string.Empty },
{ "Metric", telemetryData.Metric ?? string.Empty },
{ "NodeName", telemetryData.NodeName ?? string.Empty },
{ "ObserverName", telemetryData.ObserverName ?? string.Empty },
{ "Partition", telemetryData.PartitionId ?? string.Empty },
{ "Replica", telemetryData.ReplicaId ?? string.Empty },
{ "Source", telemetryData.Source ?? string.Empty },
{ "Value", telemetryData.Value?.ToString() ?? string.Empty },
};
this.telemetryClient.TrackEvent(
$"{telemetryData.ObserverName ?? "FabricObserver"}DataEvent",
properties);
return Task.CompletedTask;
}
/// <summary>
/// Calls AI to report a metric.
/// </summary>
/// <param name="name">Name of the metric.</param>
/// <param name="value">Value of the metric.</param>
/// <param name="properties">IDictionary&lt;string, string&gt; containing name/value pairs of additional properties.</param>
/// <param name="cancellationToken">CancellationToken instance.</param>
/// <returns>A <see cref="Task"/> representing the asynchronous operation.</returns>
public Task ReportMetricAsync(
string name,
long value,
IDictionary<string, string> properties,
CancellationToken cancellationToken)
{
if (!this.IsEnabled || cancellationToken.IsCancellationRequested)
{
return Task.FromResult(1);
}
_ = this.telemetryClient.GetMetric(name).TrackValue(value, string.Join(";", properties));
return Task.FromResult(0);
}
/// <summary>
/// Calls AI to report a metric.
/// </summary>
/// <param name="role">Name of the service.</param>
/// <param name="partition">Guid of the partition.</param>
/// <param name="name">Name of the metric.</param>
/// <param name="value">Value of the metric.</param>
/// <param name="cancellationToken">CancellationToken instance.</param>
/// <returns>A <see cref="Task"/> representing the asynchronous operation.</returns>
public Task ReportMetricAsync(
string role,
Guid partition,
string name,
long value,
CancellationToken cancellationToken)
{
return this.ReportMetricAsync(role, partition.ToString(), name, value, 1, value, value, value, 0.0, null, cancellationToken);
}
/// <summary>
/// Calls AI to report a metric.
/// </summary>
/// <param name="role">Name of the service.</param>
/// <param name="id">Replica or Instance identifier.</param>
/// <param name="name">Name of the metric.</param>
/// <param name="value">Value of the metric.</param>
/// <param name="cancellationToken">CancellationToken instance.</param>
/// <returns>A <see cref="Task"/> representing the asynchronous operation.</returns>
public async Task ReportMetricAsync(
string role,
long id,
string name,
long value,
CancellationToken cancellationToken)
{
await this.ReportMetricAsync(role, id.ToString(), name, value, 1, value, value, value, 0.0, null, cancellationToken).ConfigureAwait(false);
}
/// <summary>
/// Calls AI to report a metric.
/// </summary>
/// <param name="roleName">Name of the role. Usually the service name.</param>
/// <param name="instance">Instance identifier.</param>
/// <param name="name">Name of the metric.</param>
/// <param name="value">Value of the metric.</param>
/// <param name="count">Number of samples for this metric.</param>
/// <param name="min">Minimum value of the samples.</param>
/// <param name="max">Maximum value of the samples.</param>
/// <param name="sum">Sum of all of the samples.</param>
/// <param name="deviation">Standard deviation of the sample set.</param>
/// <param name="properties">IDictionary&lt;string, string&gt; containing name/value pairs of additional properties.</param>
/// <param name="cancellationToken">CancellationToken instance.</param>
/// <returns>A <see cref="Task"/> representing the asynchronous operation.</returns>
public Task ReportMetricAsync(
string roleName,
string instance,
string name,
long value,
int count,
long min,
long max,
long sum,
double deviation,
IDictionary<string, string> properties,
CancellationToken cancellationToken)
{
if (!this.IsEnabled || cancellationToken.IsCancellationRequested)
{
return Task.FromResult(false);
}
var mt = new MetricTelemetry(name, value)
{
Count = count,
Min = min,
Max = max,
// Previously the sum parameter was accepted but never applied.
Sum = sum,
StandardDeviation = deviation,
};
mt.Context.Cloud.RoleName = roleName;
mt.Context.Cloud.RoleInstance = instance;
// Set the properties.
if (properties != null)
{
foreach (var prop in properties)
{
mt.Properties.Add(prop);
}
}
// Track the telemetry.
this.telemetryClient.TrackMetric(mt);
return Task.CompletedTask;
}
/// <inheritdoc/>
public void Dispose()
{
// Do not change this code. Put cleanup code in Dispose(bool disposing) below.
this.Dispose(true);
}
private bool disposedValue; // To detect redundant calls
protected virtual void Dispose(bool disposing)
{
if (this.disposedValue)
{
return;
}
if (disposing)
{
}
this.disposedValue = true;
}
}
}

View file

@ -0,0 +1,272 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Collections.Generic;
using System.Fabric.Health;
using System.Net;
using System.Runtime.InteropServices;
using System.Security.Cryptography;
using System.Text;
using System.Threading;
using System.Threading.Tasks;
using FabricHealer.Interfaces;
using Newtonsoft.Json;
namespace FabricHealer.Utilities.Telemetry
{
// LogAnalyticsTelemetry class is partially based on public (non-license-protected) sample https://dejanstojanovic.net/aspnet/2018/february/send-data-to-azure-log-analytics-from-c-code/
public class LogAnalyticsTelemetry : ITelemetryProvider
{
private const int MaxRetries = 5;
private readonly Logger logger;
private int retries;
public string WorkspaceId { get; set; }
public string Key { get; set; }
public string ApiVersion { get; set; }
public string LogType { get; set; }
public LogAnalyticsTelemetry(
string workspaceId,
string sharedKey,
string logType,
string apiVersion = "2016-04-01")
{
this.WorkspaceId = workspaceId;
this.Key = sharedKey;
this.LogType = logType;
this.ApiVersion = apiVersion;
this.logger = new Logger("TelemetryLogger");
}
public async Task ReportHealthAsync(
HealthScope scope,
string propertyName,
HealthState state,
string unhealthyEvaluations,
string source,
CancellationToken cancellationToken,
string serviceName = null,
string instanceName = null)
{
string jsonPayload = JsonConvert.SerializeObject(
new
{
id = $"FH_{Guid.NewGuid()}",
datetime = DateTime.UtcNow,
source = "FabricHealer",
property = propertyName,
healthScope = scope.ToString(),
healthState = state.ToString(),
healthEvaluation = unhealthyEvaluations,
osPlatform = RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ? "Windows" : "Linux",
serviceName = serviceName ?? string.Empty,
instanceName = instanceName ?? string.Empty,
});
await this.SendTelemetryAsync(jsonPayload, cancellationToken).ConfigureAwait(false);
}
public async Task ReportMetricAsync(
TelemetryData telemetryData,
CancellationToken cancellationToken)
{
if (telemetryData == null)
{
return;
}
if (!SerializationUtility.TrySerialize<TelemetryData>(telemetryData, out string jsonPayload))
{
return;
}
await SendTelemetryAsync(jsonPayload, cancellationToken).ConfigureAwait(false);
}
public async Task<bool> ReportMetricAsync<T>(
string name,
T value,
string source,
CancellationToken cancellationToken)
{
string jsonPayload = JsonConvert.SerializeObject(
new
{
id = $"FH_{Guid.NewGuid()}",
datetime = DateTime.UtcNow,
source,
osPlatform = RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ? "Windows" : "Linux",
property = name,
value,
});
await SendTelemetryAsync(jsonPayload, cancellationToken).ConfigureAwait(false);
return true;
}
// Implement functions below as you need.
public Task ReportAvailabilityAsync(
Uri serviceUri,
string instance,
string testName,
DateTimeOffset captured,
TimeSpan duration,
string location,
bool success,
CancellationToken cancellationToken,
string message = null)
{
return Task.CompletedTask;
}
public Task ReportMetricAsync(
string name,
long value,
IDictionary<string, string> properties,
CancellationToken cancellationToken)
{
return Task.CompletedTask;
}
public Task ReportMetricAsync(
string service,
Guid partition,
string name,
long value,
CancellationToken cancellationToken)
{
return Task.CompletedTask;
}
public Task ReportMetricAsync(
string role,
long id,
string name,
long value,
CancellationToken cancellationToken)
{
return Task.CompletedTask;
}
public Task ReportMetricAsync(
string roleName,
string instance,
string name,
long value,
int count,
long min,
long max,
long sum,
double deviation,
IDictionary<string, string> properties,
CancellationToken cancellationToken)
{
return Task.CompletedTask;
}
/// <summary>
/// Sends telemetry data to Azure LogAnalytics via REST.
/// </summary>
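/// <param name="payload">JSON string containing telemetry data.</param>
/// <param name="token">CancellationToken instance.</param>
/// <returns>A task that completes when the payload has been sent or the retries are exhausted. Exceptions raised while sending are logged, not rethrown.</returns>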
private async Task SendTelemetryAsync(string payload, CancellationToken token)
{
if (string.IsNullOrEmpty(this.WorkspaceId))
{
return;
}
var requestUri = new Uri($"https://{this.WorkspaceId}.ods.opinsights.azure.com/api/logs?api-version={this.ApiVersion}");
string date = DateTime.UtcNow.ToString("r");
// The signature must cover the body length in bytes, not characters (they differ for non-ASCII payloads).
byte[] content = Encoding.UTF8.GetBytes(payload);
string signature = this.GetSignature("POST", content.Length, "application/json", date, "/api/logs");
var request = (HttpWebRequest)WebRequest.Create(requestUri);
request.ContentType = "application/json";
request.Method = "POST";
request.Headers["Log-Type"] = this.LogType;
request.Headers["x-ms-date"] = date;
request.Headers["Authorization"] = signature;
if (token.IsCancellationRequested)
{
return;
}
try
{
using (var requestStreamAsync = await request.GetRequestStreamAsync())
{
if (token.IsCancellationRequested)
{
return;
}
await requestStreamAsync.WriteAsync(content, 0, content.Length);
}
using var responseAsync = await request.GetResponseAsync() as HttpWebResponse;
if (token.IsCancellationRequested)
{
return;
}
if (responseAsync.StatusCode == HttpStatusCode.OK ||
responseAsync.StatusCode == HttpStatusCode.Accepted)
{
this.retries = 0;
return;
}
this.logger.LogWarning($"Unexpected response from server in LogAnalyticsTelemetry.SendTelemetryAsync:{Environment.NewLine}{responseAsync.StatusCode}: {responseAsync.StatusDescription}");
}
catch (Exception e)
{
// An Exception during telemetry data submission should never take down FH process. Log it.
this.logger.LogWarning($"Handled Exception in LogAnalyticsTelemetry.SendTelemetryAsync:{Environment.NewLine}{e}");
}
if (this.retries < MaxRetries)
{
if (token.IsCancellationRequested)
{
return;
}
this.retries++;
await Task.Delay(1000).ConfigureAwait(false);
await SendTelemetryAsync(payload, token).ConfigureAwait(false);
}
else
{
// Exhausted retries. Reset counter.
this.logger.LogWarning($"Exhausted request retries in LogAnalyticsTelemetry.SendTelemetryAsync: {MaxRetries}. See logs for error details.");
this.retries = 0;
}
}
private string GetSignature(
string method,
int contentLength,
string contentType,
string date,
string resource)
{
string message = $"{method}\n{contentLength}\n{contentType}\nx-ms-date:{date}\n{resource}";
byte[] bytes = Encoding.UTF8.GetBytes(message);
using var encryptor = new HMACSHA256(Convert.FromBase64String(this.Key));
return $"SharedKey {this.WorkspaceId}:{Convert.ToBase64String(encryptor.ComputeHash(bytes))}";
}
}
}
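
A minimal sketch of the shared-key signature used by the Log Analytics HTTP Data Collector API, kept separate from the class above. The `SignatureSketch` class and its method names are illustrative assumptions and are not part of FabricHealer. It is meant to show that the content length in the signed string is the UTF-8 byte count of the body, which differs from the character count once the payload contains non-ASCII text.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration only; not part of the FabricHealer code base.
public static class SignatureSketch
{
    // Builds the string that gets signed. contentLength is the body size in bytes.
    public static string BuildStringToSign(int contentLength, string rfc1123Date)
    {
        return $"POST\n{contentLength}\napplication/json\nx-ms-date:{rfc1123Date}\n/api/logs";
    }

    // Computes the Authorization header value: "SharedKey <workspaceId>:<base64 HMAC-SHA256>".
    public static string BuildAuthorizationHeader(
        string workspaceId,
        string base64SharedKey,
        string jsonBody,
        string rfc1123Date)
    {
        // Count bytes, not characters: a multi-byte character makes the two differ.
        int contentLength = Encoding.UTF8.GetByteCount(jsonBody);
        byte[] message = Encoding.UTF8.GetBytes(BuildStringToSign(contentLength, rfc1123Date));

        using (var hmac = new HMACSHA256(Convert.FromBase64String(base64SharedKey)))
        {
            string hash = Convert.ToBase64String(hmac.ComputeHash(message));
            return $"SharedKey {workspaceId}:{hash}";
        }
    }
}
```

A caller would pass the same date string it sends in the `x-ms-date` header, for example `DateTime.UtcNow.ToString("r")`, so that the signed string and the request headers agree.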

View file

@ -0,0 +1,105 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using Newtonsoft.Json;
using System.Runtime.InteropServices;
namespace FabricHealer.Utilities.Telemetry
{
public class TelemetryData
{
public string ApplicationName
{
get; set;
}
public string ClusterId
{
get; set;
}
public string Code
{
get; set;
}
public string ContainerId
{
get; set;
}
public string HealthEventDescription
{
get; set;
}
public string HealthState
{
get; set;
}
public string Metric
{
get; set;
}
public string NodeName
{
get; set;
}
public string ObserverName
{
get; set;
}
public string OS
{
get; set;
} = RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ? "Windows" : "Linux";
public string PartitionId
{
get; set;
}
public string ReplicaId
{
get; set;
}
public string ServiceName
{
get; set;
}
public string Source
{
get; set;
}
public object Value
{
get; set;
}
public string NodeType
{
get;
set;
}
public string RepairId
{
get;
set;
}
[JsonConstructor]
public TelemetryData()
{
}
}
}

View file

@ -0,0 +1,13 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
namespace FabricHealer.Utilities.Telemetry
{
public enum TelemetryProviderType
{
AzureApplicationInsights,
AzureLogAnalytics,
}
}

View file

@ -0,0 +1,151 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using FabricHealer.Interfaces;
using FabricHealer.Repair;
using System;
using System.Fabric;
using System.Fabric.Health;
using System.Runtime.InteropServices;
using System.Threading;
using System.Threading.Tasks;
namespace FabricHealer.Utilities.Telemetry
{
public class TelemetryUtilities
{
private readonly FabricClient fabricClient;
private readonly StatelessServiceContext serviceContext;
private readonly ITelemetryProvider telemetryClient;
public TelemetryUtilities(FabricClient fabricClient, StatelessServiceContext serviceContext)
{
this.fabricClient = fabricClient;
this.serviceContext = serviceContext;
if (FabricHealerManager.ConfigSettings != null && FabricHealerManager.ConfigSettings.TelemetryEnabled)
{
switch (FabricHealerManager.ConfigSettings.TelemetryProvider)
{
case TelemetryProviderType.AzureApplicationInsights:
{
this.telemetryClient = new AppInsightsTelemetry(FabricHealerManager.ConfigSettings.AppInsightsInstrumentationKey);
break;
}
case TelemetryProviderType.AzureLogAnalytics:
{
this.telemetryClient = new LogAnalyticsTelemetry(
FabricHealerManager.ConfigSettings.LogAnalyticsWorkspaceId,
FabricHealerManager.ConfigSettings.LogAnalyticsSharedKey,
FabricHealerManager.ConfigSettings.LogAnalyticsLogType);
break;
}
}
}
}
/// <summary>
/// Emits Repair telemetry to AppInsights (or some other external service),
/// ETW (EventSource), and Health Event to Service Fabric.
/// </summary>
/// <param name="level">Log Level.</param>
/// <param name="source">Err/Warning source id.</param>
/// <param name="description">Message.</param>
/// <param name="token">Cancellation token.</param>
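/// <param name="repairConfig">Optional repair configuration that supplies the repair action, application, service, partition, replica, and node details.</param>
/// <returns>A <see cref="Task"/> representing the asynchronous operation.</returns>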
public async Task EmitTelemetryEtwHealthEventAsync(
LogLevel level,
string source,
string description,
CancellationToken token,
RepairConfiguration repairConfig = null)
{
bool hasRepairInfo = repairConfig != null;
string repairAction = string.Empty;
if (source != null)
{
source = source.Insert(0, "FabricHealer.");
}
if (hasRepairInfo)
{
repairAction = Enum.GetName(typeof(RepairAction), repairConfig.RepairPolicy.CurrentAction);
}
HealthState healthState = HealthState.Ok;
if (level == LogLevel.Error)
{
healthState = HealthState.Error;
}
else if (level == LogLevel.Warning)
{
healthState = HealthState.Warning;
}
if (FabricHealerManager.ConfigSettings.TelemetryEnabled && this.telemetryClient != null)
{
var telemData = new TelemetryData()
{
Metric = repairAction,
ApplicationName = repairConfig?.AppName?.OriginalString ?? string.Empty,
ServiceName = repairConfig?.ServiceName?.OriginalString ?? string.Empty,
PartitionId = repairConfig?.PartitionId.ToString() ?? string.Empty,
ReplicaId = repairConfig?.ReplicaOrInstanceId.ToString() ?? string.Empty,
HealthEventDescription = description,
HealthState = Enum.GetName(typeof(HealthState), healthState),
NodeName = repairConfig?.NodeName ?? string.Empty,
Source = $"{(RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ? "Windows_" : "Linux_")}{source}",
};
await this.telemetryClient.ReportMetricAsync(telemData, token).ConfigureAwait(false);
}
// ETW.
if (FabricHealerManager.ConfigSettings.EtwEnabled)
{
Logger.EtwLogger?.Write(
RepairConstants.EventSourceEventName,
new
{
Level = level,
Metric = repairAction,
ApplicationName = repairConfig?.AppName?.OriginalString ?? string.Empty,
ServiceName = repairConfig?.ServiceName?.OriginalString ?? string.Empty,
PartitionId = repairConfig?.PartitionId.ToString() ?? string.Empty,
ReplicaId = repairConfig?.ReplicaOrInstanceId.ToString() ?? string.Empty,
HealthEventDescription = description,
HealthState = Enum.GetName(typeof(HealthState), healthState),
NodeName = repairConfig?.NodeName ?? string.Empty,
Source = source,
OSPlatform = RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ? "Windows" : "Linux",
});
}
// Service Fabric HM - Informational Events
// The Warning/Error events are emitted on the offending node. This is
// for seeing information in SFX about what happened where vis a vis Repair/Healing.
var healthReporter = new FabricHealthReporter(this.fabricClient);
var healthReport = new HealthReport
{
Code = repairConfig?.RepairPolicy.Id,
HealthMessage = description,
NodeName = this.serviceContext.NodeContext.NodeName,
ReportType = HealthReportType.Node,
State = healthState,
HealthReportTimeToLive = TimeSpan.FromMinutes(5),
Property = "RepairStateInformation",
Source = source,
};
healthReporter.ReportHealthToServiceFabric(healthReport);
}
}
}
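
The log-level to health-state mapping used in `EmitTelemetryEtwHealthEventAsync` can be sketched on its own as follows. The `SketchLogLevel` and `SketchHealthState` enums are stand-ins invented for this example so that it compiles without the Service Fabric SDK; the real code uses FabricHealer's `LogLevel` and `System.Fabric.Health.HealthState`.

```csharp
// Stand-in enums (assumptions for illustration); the real types live in
// FabricHealer and System.Fabric.Health.
public enum SketchLogLevel { Info, Warning, Error }

public enum SketchHealthState { Ok, Warning, Error }

public static class HealthStateSketch
{
    // Error and Warning map to the matching health state; everything else is Ok.
    public static SketchHealthState ToHealthState(SketchLogLevel level)
    {
        switch (level)
        {
            case SketchLogLevel.Error:
                return SketchHealthState.Error;
            case SketchLogLevel.Warning:
                return SketchHealthState.Warning;
            default:
                return SketchHealthState.Ok;
        }
    }
}
```

Keeping the mapping in one small function would make the default (`Ok`) explicit and easy to unit test, though the inline `if`/`else if` in the method above behaves the same way.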

View file

@ -0,0 +1,202 @@
// ------------------------------------------------------------
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
using System.Collections.Generic;
using System.Fabric;
using System.Fabric.Query;
using System.Linq;
using System.Runtime.InteropServices.WindowsRuntime;
using System.Threading;
using System.Threading.Tasks;
using FabricHealer.Utilities;
namespace FabricHealer.Repair
{
public static class UpgradeChecker
{
private static readonly Logger Logger = new Logger("UpgradeLogger");
/// <summary>
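/// Gathers the upgrade domains (UDs) in which an application or cluster upgrade is currently in progress.
/// </summary>
/// <param name="fabricClient">FabricClient instance.</param>
/// <param name="token">CancellationToken instance.</param>
/// <returns>List of UDs. An entry is -1 when no upgrade is in progress and int.MaxValue when the query failed.</returns>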
public static async Task<IList<int>>
GetUDsWhereUpgradeInProgressAsync(FabricClient fabricClient, CancellationToken token)
{
var domainsWhereUpgradeInProgress = new List<int>();
domainsWhereUpgradeInProgress.AddRange(
await GetUdsWhereApplicationUpgradeInProgressAsync(fabricClient, token).ConfigureAwait(true));
domainsWhereUpgradeInProgress.Add(
await GetUdsWhereFabricUpgradeInProgressAsync(fabricClient, token).ConfigureAwait(true));
return domainsWhereUpgradeInProgress;
}
/// <summary>
/// Gets Application Upgrade Domains (integers) for application or applications
/// currently upgrading (or rolling back).
/// </summary>
/// <param name="fabricClient">FabricClient instance</param>
/// <param name="token">CancellationToken</param>
/// <param name="appName" type="optional">Application Name (Uri)</param>
/// <returns>List of integers representing UDs</returns>
internal static async Task<List<int>>
GetUdsWhereApplicationUpgradeInProgressAsync(
FabricClient fabricClient,
CancellationToken token,
Uri appName = null)
{
try
{
ApplicationList appList;
int currentUpgradeDomainInProgress = -1;
var upgradeDomainsInProgress = new List<int>();
if (appName == null)
{
appList = await fabricClient.QueryManager.GetApplicationListAsync().ConfigureAwait(true);
}
else
{
appList = await fabricClient.QueryManager.GetApplicationListAsync(
appName,
TimeSpan.FromMinutes(1),
token).ConfigureAwait(true);
}
foreach (var application in appList)
{
var upgradeProgress =
await fabricClient.ApplicationManager.GetApplicationUpgradeProgressAsync(
application.ApplicationName,
TimeSpan.FromMinutes(1),
token).ConfigureAwait(true);
if (upgradeProgress.UpgradeState.Equals(ApplicationUpgradeState.RollingBackInProgress)
|| upgradeProgress.UpgradeState.Equals(ApplicationUpgradeState.RollingForwardInProgress)
|| upgradeProgress.UpgradeState.Equals(ApplicationUpgradeState.RollingForwardPending))
{
if (int.TryParse(upgradeProgress.CurrentUpgradeDomainProgress.UpgradeDomainName, out currentUpgradeDomainInProgress))
{
Logger.LogInfo($"Application Upgrade for {application.ApplicationName} is in progress in {currentUpgradeDomainInProgress} upgrade domain.");
if (!upgradeDomainsInProgress.Contains(currentUpgradeDomainInProgress))
{
upgradeDomainsInProgress.Add(currentUpgradeDomainInProgress);
}
}
else
{
// When TryParse fails, the out value currentUpgradeDomainInProgress is set to 0.
// 0 is a valid UD name, so reset it to -1 to return the right value.
currentUpgradeDomainInProgress = -1;
}
}
}
// If no UD's are being upgraded then currentUpgradeDomainInProgress
// remains -1, otherwise it will be added only once
if (!upgradeDomainsInProgress.Any())
{
Logger.LogInfo(
$"No Application Upgrade is in progress in domain {currentUpgradeDomainInProgress}");
upgradeDomainsInProgress.Add(currentUpgradeDomainInProgress);
}
return upgradeDomainsInProgress;
}
catch (Exception e)
{
Logger.LogError(e.ToString());
return new List<int>{ int.MaxValue };
}
}
/// <summary>
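/// Gets the UD where a Service Fabric cluster upgrade is in progress.
/// </summary>
/// <param name="fabricClient">FabricClient instance.</param>
/// <param name="token">CancellationToken instance.</param>
/// <returns>The UD in progress, -1 if no upgrade is in progress, or int.MaxValue if the query failed.</returns>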
public static async Task<int> GetUdsWhereFabricUpgradeInProgressAsync(
FabricClient fabricClient,
CancellationToken token)
{
try
{
var fabricUpgradeProgress =
await fabricClient.ClusterManager.GetFabricUpgradeProgressAsync(
FabricHealerManager.ConfigSettings.AsyncTimeout,
token).ConfigureAwait(true);
int currentUpgradeDomainInProgress = -1;
if (fabricUpgradeProgress.UpgradeState.Equals(FabricUpgradeState.RollingBackInProgress)
|| fabricUpgradeProgress.UpgradeState.Equals(FabricUpgradeState.RollingForwardInProgress)
|| fabricUpgradeProgress.UpgradeState.Equals(FabricUpgradeState.RollingForwardPending))
{
if (int.TryParse(fabricUpgradeProgress.CurrentUpgradeDomainProgress.UpgradeDomainName, out currentUpgradeDomainInProgress))
{
return currentUpgradeDomainInProgress;
}
// When TryParse fails, the out value currentUpgradeDomainInProgress is set to 0.
// 0 is a valid UD name, so reset it to -1 to return the right value.
currentUpgradeDomainInProgress = -1;
}
return currentUpgradeDomainInProgress;
}
catch (Exception e) when (e is FabricException || e is OperationCanceledException || e is TimeoutException)
{
return int.MaxValue;
}
}
/// <summary>
/// Determines if an Azure tenant update is in progress for cluster VMs.
/// </summary>
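/// <param name="fabricClient">FabricClient instance</param>
/// <param name="nodeType">Node type name used to locate the InfrastructureService instance.</param>
/// <param name="token">CancellationToken instance</param>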
/// <returns>true if tenant update is in progress, false otherwise</returns>
public static async Task<bool> IsAzureTenantUpdateInProgress(
FabricClient fabricClient,
string nodeType,
CancellationToken token)
{
var repairTasks = await fabricClient.RepairManager.GetRepairTaskListAsync(
"Azure",
System.Fabric.Repair.RepairTaskStateFilter.Active | System.Fabric.Repair.RepairTaskStateFilter.Executing,
$"fabric:/System/InfrastructureService/{nodeType}",
FabricHealerManager.ConfigSettings.AsyncTimeout,
token).ConfigureAwait(false);
bool isAzureTenantRepairInProgress = repairTasks.Count > 0;
if (isAzureTenantRepairInProgress)
{
const string message = "Azure Tenant Update in progress. Will not attempt repairs at this time.";
await FabricHealerManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"IsAzureTenantUpdateInProgress::true",
message,
token).ConfigureAwait(false);
return true;
}
return false;
}
}
}
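
The upgrade domain parsing in `UpgradeChecker` relies on a sentinel because `int.TryParse` writes 0 to its out argument on failure, and 0 is a valid UD name. A minimal sketch of that idea in isolation follows; `UpgradeDomainSketch` and its member names are assumptions made for this example, not FabricHealer code.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class UpgradeDomainSketch
{
    // Sentinel meaning "no numeric upgrade domain could be determined".
    public const int NoUpgradeInProgress = -1;

    // Returns the UD as an integer, or the sentinel when the name is not numeric.
    // The out value of a failed TryParse (0) is never returned to the caller.
    public static int ParseUpgradeDomain(string upgradeDomainName)
    {
        return int.TryParse(upgradeDomainName, out int ud) ? ud : NoUpgradeInProgress;
    }
}
```

With this shape, callers compare against `NoUpgradeInProgress` instead of resetting the variable after a failed parse. Upgrade domain names that are not plain integers still map to the sentinel, which matches the behavior of the code above.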

View file

@ -0,0 +1,74 @@
<?xml version="1.0" encoding="utf-8"?>
<ApplicationManifest xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ApplicationTypeName="FabricHealerType" ApplicationTypeVersion="0.4.2" xmlns="http://schemas.microsoft.com/2011/01/fabric">
<Parameters>
<Parameter Name="FabricHealer_InstanceCount" DefaultValue="-1" />
<!-- FabricHealer Enablement, Monitor Loop Timeout -->
<Parameter Name="MonitorLoopSleepSeconds" DefaultValue="10" />
<Parameter Name="AutoMitigationEnabled" DefaultValue="true" />
<!-- Repair Policy Enablement -->
<Parameter Name="AppRepairEnabled" DefaultValue="true" />
<Parameter Name="DiskRepairEnabled" DefaultValue="true" />
<Parameter Name="NodeRepairEnabled" DefaultValue="true" />
<Parameter Name="ReplicaRepairEnabled" DefaultValue="false" />
<Parameter Name="SystemAppRepairEnabled" DefaultValue="true" />
<Parameter Name="VmRepairEnabled" DefaultValue="true" />
</Parameters>
<!-- Import the ServiceManifest from the ServicePackage. The ServiceManifestName and ServiceManifestVersion
should match the Name and Version attributes of the ServiceManifest element defined in the
ServiceManifest.xml file. -->
<ServiceManifestImport>
<ServiceManifestRef ServiceManifestName="FabricHealerPkg" ServiceManifestVersion="0.4.2" />
<ConfigOverrides>
<ConfigOverride Name="Config">
<Settings>
<!-- FabricHealerManager -->
<Section Name="RepairManagerConfiguration">
<Parameter Name="EnableAutoMitigation" Value="[AutoMitigationEnabled]" />
<Parameter Name="HealthCheckLoopSleepTimeSeconds" Value="[MonitorLoopSleepSeconds]" />
</Section>
<!-- Repair policies -->
<Section Name="AppRepairPolicy">
<Parameter Name="Enabled" Value="[AppRepairEnabled]" />
</Section>
<Section Name="DiskRepairPolicy">
<Parameter Name="Enabled" Value="[DiskRepairEnabled]" />
</Section>
<Section Name="FabricNodeRepairPolicy">
<Parameter Name="Enabled" Value="[NodeRepairEnabled]" />
</Section>
<Section Name="ReplicaRepairPolicy">
<Parameter Name="Enabled" Value="[ReplicaRepairEnabled]" />
</Section>
<Section Name="SystemAppRepairPolicy">
<Parameter Name="Enabled" Value="[SystemAppRepairEnabled]" />
</Section>
<Section Name="VMRepairPolicy">
<Parameter Name="Enabled" Value="[VmRepairEnabled]" />
</Section>
</Settings>
</ConfigOverride>
</ConfigOverrides>
<Policies>
<RunAsPolicy CodePackageRef="Code" UserRef="SystemUser" />
</Policies>
</ServiceManifestImport>
<DefaultServices>
<!-- The section below creates instances of service types, when an instance of this
application type is created. You can also create one or more instances of service type using the
ServiceFabric PowerShell module.
The attribute ServiceTypeName below must match the name defined in the imported ServiceManifest.xml file. -->
<Service Name="FabricHealer" ServicePackageActivationMode="ExclusiveProcess">
<StatelessService ServiceTypeName="FabricHealerType" InstanceCount="[FabricHealer_InstanceCount]">
<SingletonPartition />
</StatelessService>
</Service>
</DefaultServices>
<!-- Because of the actions FabricHealer takes in a cluster, it must run as Admin user on Windows and root on Linux.
SystemUser and LocalSystem AccountType map to both (System on Windows, root on Linux). -->
<Principals>
<Users>
<User Name="SystemUser" AccountType="LocalSystem" />
</Users>
</Principals>
</ApplicationManifest>

View file

@ -0,0 +1,6 @@
<?xml version="1.0" encoding="utf-8"?>
<Application xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Name="fabric:/FabricHealer" xmlns="http://schemas.microsoft.com/2011/01/fabric">
<Parameters>
<Parameter Name="FabricHealer_InstanceCount" Value="-1" />
</Parameters>
</Application>

View file

@ -0,0 +1,4 @@
<?xml version="1.0" encoding="utf-8"?>
<Application xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Name="fabric:/FabricHealer" xmlns="http://schemas.microsoft.com/2011/01/fabric">
<Parameters />
</Application>

View file

@ -0,0 +1,4 @@
<?xml version="1.0" encoding="utf-8"?>
<Application xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Name="fabric:/FabricHealer" xmlns="http://schemas.microsoft.com/2011/01/fabric">
<Parameters />
</Application>

View file

@ -0,0 +1,47 @@
<?xml version="1.0" encoding="utf-8"?>
<Project ToolsVersion="14.0" DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003" InitialTargets=";ValidateMSBuildFiles">
<Import Project="..\packages\Microsoft.VisualStudio.Azure.Fabric.MSBuild.1.7.3\build\Microsoft.VisualStudio.Azure.Fabric.Application.props" Condition="Exists('..\packages\Microsoft.VisualStudio.Azure.Fabric.MSBuild.1.7.3\build\Microsoft.VisualStudio.Azure.Fabric.Application.props')" />
<PropertyGroup Label="Globals">
<ProjectGuid>a977c8e0-2183-4845-95ea-7f3c3e795310</ProjectGuid>
<ProjectVersion>2.3</ProjectVersion>
<MinToolsVersion>1.5</MinToolsVersion>
<SupportedMSBuildNuGetPackageVersion>1.7.3</SupportedMSBuildNuGetPackageVersion>
<TargetFrameworkVersion>v4.5.2</TargetFrameworkVersion>
</PropertyGroup>
<ItemGroup Label="ProjectConfigurations">
<ProjectConfiguration Include="Debug|x64">
<Configuration>Debug</Configuration>
<Platform>x64</Platform>
</ProjectConfiguration>
<ProjectConfiguration Include="Release|x64">
<Configuration>Release</Configuration>
<Platform>x64</Platform>
</ProjectConfiguration>
</ItemGroup>
<ItemGroup>
<None Include="ApplicationPackageRoot\ApplicationManifest.xml" />
<None Include="ApplicationParameters\Cloud.xml" />
<None Include="ApplicationParameters\Local.1Node.xml" />
<None Include="ApplicationParameters\Local.5Node.xml" />
<None Include="PublishProfiles\Local.1Node.xml" />
<None Include="PublishProfiles\Local.5Node.xml" />
<None Include="Scripts\Deploy-FabricApplication.ps1" />
</ItemGroup>
<ItemGroup>
<ProjectReference Include="..\FabricHealer\FabricHealer.csproj" />
</ItemGroup>
<ItemGroup>
<Content Include="packages.config" />
<Content Include="PublishProfiles\Cloud.xml" />
</ItemGroup>
<Import Project="$(MSBuildToolsPath)\Microsoft.Common.targets" />
<PropertyGroup>
<ApplicationProjectTargetsPath>$(MSBuildExtensionsPath32)\Microsoft\VisualStudio\v$(VisualStudioVersion)\Service Fabric Tools\Microsoft.VisualStudio.Azure.Fabric.ApplicationProject.targets</ApplicationProjectTargetsPath>
</PropertyGroup>
<Import Project="$(ApplicationProjectTargetsPath)" Condition="Exists('$(ApplicationProjectTargetsPath)')" />
<Import Project="..\packages\Microsoft.VisualStudio.Azure.Fabric.MSBuild.1.7.3\build\Microsoft.VisualStudio.Azure.Fabric.Application.targets" Condition="Exists('..\packages\Microsoft.VisualStudio.Azure.Fabric.MSBuild.1.7.3\build\Microsoft.VisualStudio.Azure.Fabric.Application.targets')" />
<Target Name="ValidateMSBuildFiles" BeforeTargets="PrepareForBuild">
<Error Condition="!Exists('..\packages\Microsoft.VisualStudio.Azure.Fabric.MSBuild.1.7.3\build\Microsoft.VisualStudio.Azure.Fabric.Application.props')" Text="Unable to find the '..\packages\Microsoft.VisualStudio.Azure.Fabric.MSBuild.1.7.3\build\Microsoft.VisualStudio.Azure.Fabric.Application.props' file. Please restore the 'Microsoft.VisualStudio.Azure.Fabric.MSBuild' Nuget package." />
<Error Condition="!Exists('..\packages\Microsoft.VisualStudio.Azure.Fabric.MSBuild.1.7.3\build\Microsoft.VisualStudio.Azure.Fabric.Application.targets')" Text="Unable to find the '..\packages\Microsoft.VisualStudio.Azure.Fabric.MSBuild.1.7.3\build\Microsoft.VisualStudio.Azure.Fabric.Application.targets' file. Please restore the 'Microsoft.VisualStudio.Azure.Fabric.MSBuild' Nuget package." />
</Target>
</Project>

View file

@ -0,0 +1,11 @@
<?xml version="1.0" encoding="utf-8"?>
<PublishProfile xmlns="http://schemas.microsoft.com/2015/05/fabrictools">
<!-- ClusterConnectionParameters allows you to specify the PowerShell parameters to use when connecting to the Service Fabric cluster.
Valid parameters are any that are accepted by the Connect-ServiceFabricCluster cmdlet.
For a local cluster, you would typically not use any parameters.
For example: <ClusterConnectionParameters />
-->
<ClusterConnectionParameters />
<ApplicationParameterFile Path="..\ApplicationParameters\Local.1Node.xml" />
</PublishProfile>

View file

@ -0,0 +1,11 @@
<?xml version="1.0" encoding="utf-8"?>
<PublishProfile xmlns="http://schemas.microsoft.com/2015/05/fabrictools">
<!-- ClusterConnectionParameters allows you to specify the PowerShell parameters to use when connecting to the Service Fabric cluster.
Valid parameters are any that are accepted by the Connect-ServiceFabricCluster cmdlet.
For a local cluster, you would typically not use any parameters.
For example: <ClusterConnectionParameters />
-->
<ClusterConnectionParameters />
<ApplicationParameterFile Path="..\ApplicationParameters\Local.5Node.xml" />
</PublishProfile>

View file

@ -0,0 +1,273 @@
<#
.SYNOPSIS
Deploys a Service Fabric application type to a cluster.
.DESCRIPTION
This script deploys a Service Fabric application type to a cluster. It is invoked by Visual Studio when deploying a Service Fabric Application project.
.NOTES
WARNING: This script file is invoked by Visual Studio. Its parameters must not be altered but its logic can be customized as necessary.
.PARAMETER PublishProfileFile
Path to the file containing the publish profile.
.PARAMETER ApplicationPackagePath
Path to the folder of the packaged Service Fabric application.
.PARAMETER DeployOnly
Indicates that the Service Fabric application should not be created or upgraded after registering the application type.
.PARAMETER ApplicationParameter
Hashtable of the Service Fabric application parameters to be used for the application.
.PARAMETER UnregisterUnusedApplicationVersionsAfterUpgrade
Indicates whether to unregister any unused application versions that exist after an upgrade is finished.
.PARAMETER OverrideUpgradeBehavior
Indicates the behavior used to override the upgrade settings specified by the publish profile.
'None' indicates that the upgrade settings will not be overridden.
'ForceUpgrade' indicates that an upgrade will occur with default settings, regardless of what is specified in the publish profile.
'VetoUpgrade' indicates that an upgrade will not occur, regardless of what is specified in the publish profile.
.PARAMETER UseExistingClusterConnection
Indicates that the script should make use of an existing cluster connection that has already been established in the PowerShell session. The cluster connection parameters configured in the publish profile are ignored.
.PARAMETER OverwriteBehavior
Overwrite Behavior if an application exists in the cluster with the same name. Available Options are Never, Always, SameAppTypeAndVersion. This setting is not applicable when upgrading an application.
'Never' will not remove the existing application. This is the default behavior.
'Always' will remove the existing application even if its Application type and Version are different from the application being created.
'SameAppTypeAndVersion' will remove the existing application only if its Application type and Version are the same as the application being created.
.PARAMETER SkipPackageValidation
Switch signaling whether the package should be validated or not before deployment.
.PARAMETER SecurityToken
A security token for authentication to cluster management endpoints. Used for silent authentication to clusters that are protected by Azure Active Directory.
.PARAMETER CopyPackageTimeoutSec
Timeout in seconds for copying application package to image store.
.PARAMETER RegisterApplicationTypeTimeoutSec
Timeout in seconds for registering application type.
.EXAMPLE
. Scripts\Deploy-FabricApplication.ps1 -ApplicationPackagePath 'pkg\Debug'
Deploy the application using the default package location for a Debug build.
.EXAMPLE
. Scripts\Deploy-FabricApplication.ps1 -ApplicationPackagePath 'pkg\Debug' -DeployOnly
Deploy the application but do not create the application instance.
.EXAMPLE
. Scripts\Deploy-FabricApplication.ps1 -ApplicationPackagePath 'pkg\Debug' -ApplicationParameter @{CustomParameter1='MyValue'; CustomParameter2='MyValue'}
Deploy the application by providing values for parameters that are defined in the application manifest.
#>
Param
(
[String]
$PublishProfileFile,
[String]
$ApplicationPackagePath,
[Switch]
$DeployOnly,
[Hashtable]
$ApplicationParameter,
[Boolean]
$UnregisterUnusedApplicationVersionsAfterUpgrade,
[String]
[ValidateSet('None', 'ForceUpgrade', 'VetoUpgrade')]
$OverrideUpgradeBehavior = 'None',
[Switch]
$UseExistingClusterConnection,
[String]
[ValidateSet('Never','Always','SameAppTypeAndVersion')]
$OverwriteBehavior = 'Never',
[Switch]
$SkipPackageValidation,
[String]
$SecurityToken,
[int]
$CopyPackageTimeoutSec,
[int]
$RegisterApplicationTypeTimeoutSec
)
function Read-XmlElementAsHashtable
{
Param (
[System.Xml.XmlElement]
$Element
)
$hashtable = @{}
if ($Element.Attributes)
{
$Element.Attributes |
ForEach-Object {
$boolVal = $null
if ([bool]::TryParse($_.Value, [ref]$boolVal)) {
$hashtable[$_.Name] = $boolVal
}
else {
$hashtable[$_.Name] = $_.Value
}
}
}
return $hashtable
}
function Read-PublishProfile
{
Param (
[ValidateScript({Test-Path $_ -PathType Leaf})]
[String]
$PublishProfileFile
)
$publishProfileXml = [Xml] (Get-Content $PublishProfileFile)
$publishProfile = @{}
$publishProfile.ClusterConnectionParameters = Read-XmlElementAsHashtable $publishProfileXml.PublishProfile.Item("ClusterConnectionParameters")
$publishProfile.UpgradeDeployment = Read-XmlElementAsHashtable $publishProfileXml.PublishProfile.Item("UpgradeDeployment")
$publishProfile.CopyPackageParameters = Read-XmlElementAsHashtable $publishProfileXml.PublishProfile.Item("CopyPackageParameters")
$publishProfile.RegisterApplicationParameters = Read-XmlElementAsHashtable $publishProfileXml.PublishProfile.Item("RegisterApplicationParameters")
if ($publishProfileXml.PublishProfile.Item("UpgradeDeployment"))
{
$publishProfile.UpgradeDeployment.Parameters = Read-XmlElementAsHashtable $publishProfileXml.PublishProfile.Item("UpgradeDeployment").Item("Parameters")
if ($publishProfile.UpgradeDeployment["Mode"])
{
$publishProfile.UpgradeDeployment.Parameters[$publishProfile.UpgradeDeployment["Mode"]] = $true
}
}
$publishProfileFolder = (Split-Path $PublishProfileFile)
$publishProfile.ApplicationParameterFile = [System.IO.Path]::Combine($publishProfileFolder, $publishProfileXml.PublishProfile.ApplicationParameterFile.Path)
return $publishProfile
}
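# Illustrative only: a sketch of the publish profile shape that Read-PublishProfile appears to expect,
# inferred from the element and attribute names it reads. All attribute values are hypothetical,
# and a real profile may carry an XML namespace declaration that is omitted here.
#
# <PublishProfile>
#   <ClusterConnectionParameters ConnectionEndpoint="mycluster.example.com:19000" />
#   <ApplicationParameterFile Path="..\ApplicationParameters\Cloud.xml" />
#   <CopyPackageParameters CopyPackageTimeoutSec="600" CompressPackage="true" />
#   <RegisterApplicationParameters RegisterApplicationTypeTimeoutSec="600" />
#   <UpgradeDeployment Mode="Monitored" Enabled="true">
#     <Parameters FailureAction="Rollback" Force="True" />
#   </UpgradeDeployment>
# </PublishProfile>
#
# Attribute values that parse as booleans ("true"/"false", any casing) become [bool]; all other
# values stay strings, so CopyPackageTimeoutSec above would be read as the string "600".
# The Mode attribute is added to UpgradeDeployment.Parameters as a flag, e.g. Monitored = $true.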
$LocalFolder = (Split-Path $MyInvocation.MyCommand.Path)
if (!$PublishProfileFile)
{
$PublishProfileFile = "$LocalFolder\..\PublishProfiles\Local.xml"
}
if (!$ApplicationPackagePath)
{
$ApplicationPackagePath = "$LocalFolder\..\pkg\Release"
}
$ApplicationPackagePath = Resolve-Path $ApplicationPackagePath
$publishProfile = Read-PublishProfile $PublishProfileFile
if (-not $UseExistingClusterConnection)
{
$ClusterConnectionParameters = $publishProfile.ClusterConnectionParameters
if ($SecurityToken)
{
$ClusterConnectionParameters["SecurityToken"] = $SecurityToken
}
try
{
[void](Connect-ServiceFabricCluster @ClusterConnectionParameters)
}
catch [System.Fabric.FabricObjectClosedException]
{
Write-Warning "Service Fabric cluster may not be connected."
throw
}
}
$RegKey = "HKLM:\SOFTWARE\Microsoft\Service Fabric SDK"
$ModuleFolderPath = (Get-ItemProperty -Path $RegKey -Name FabricSDKPSModulePath).FabricSDKPSModulePath
Import-Module "$ModuleFolderPath\ServiceFabricSDK.psm1"
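# Upgrade when the publish profile enables it (unless vetoed by the caller), or when the caller forces an upgrade.
$IsUpgrade = ($publishProfile.UpgradeDeployment -and $publishProfile.UpgradeDeployment.Enabled -and $OverrideUpgradeBehavior -ne 'VetoUpgrade') -or $OverrideUpgradeBehavior -eq 'ForceUpgrade'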
$PublishParameters = @{
'ApplicationPackagePath' = $ApplicationPackagePath
'ApplicationParameterFilePath' = $publishProfile.ApplicationParameterFile
'ApplicationParameter' = $ApplicationParameter
'ErrorAction' = 'Stop'
}
if ($publishProfile.CopyPackageParameters.CopyPackageTimeoutSec)
{
$PublishParameters['CopyPackageTimeoutSec'] = $publishProfile.CopyPackageParameters.CopyPackageTimeoutSec
}
if ($publishProfile.CopyPackageParameters.CompressPackage)
{
$PublishParameters['CompressPackage'] = $publishProfile.CopyPackageParameters.CompressPackage
}
if ($publishProfile.RegisterApplicationParameters.RegisterApplicationTypeTimeoutSec)
{
$PublishParameters['RegisterApplicationTypeTimeoutSec'] = $publishProfile.RegisterApplicationParameters.RegisterApplicationTypeTimeoutSec
}
# CopyPackageTimeoutSec parameter overrides the value from the publish profile
if ($CopyPackageTimeoutSec)
{
$PublishParameters['CopyPackageTimeoutSec'] = $CopyPackageTimeoutSec
}
# RegisterApplicationTypeTimeoutSec parameter overrides the value from the publish profile
if ($RegisterApplicationTypeTimeoutSec)
{
$PublishParameters['RegisterApplicationTypeTimeoutSec'] = $RegisterApplicationTypeTimeoutSec
}
if ($IsUpgrade)
{
$Action = "RegisterAndUpgrade"
if ($DeployOnly)
{
$Action = "Register"
}
$UpgradeParameters = $publishProfile.UpgradeDeployment.Parameters
if ($OverrideUpgradeBehavior -eq 'ForceUpgrade')
{
# Warning: Do not alter these upgrade parameters. It will create an inconsistency with Visual Studio's behavior.
$UpgradeParameters = @{ UnmonitoredAuto = $true; Force = $true }
}
$PublishParameters['Action'] = $Action
$PublishParameters['UpgradeParameters'] = $UpgradeParameters
$PublishParameters['UnregisterUnusedVersions'] = $UnregisterUnusedApplicationVersionsAfterUpgrade
Publish-UpgradedServiceFabricApplication @PublishParameters
}
else
{
$Action = "RegisterAndCreate"
if ($DeployOnly)
{
$Action = "Register"
}
$PublishParameters['Action'] = $Action
$PublishParameters['OverwriteBehavior'] = $OverwriteBehavior
$PublishParameters['SkipPackageValidation'] = $SkipPackageValidation
Publish-NewServiceFabricApplication @PublishParameters
}

@ -0,0 +1,4 @@
<?xml version="1.0" encoding="utf-8"?>
<packages>
<package id="Microsoft.VisualStudio.Azure.Fabric.MSBuild" version="1.7.3" />
</packages>

21
LICENSE Normal file

@ -0,0 +1,21 @@
MIT License
Copyright (c) Microsoft Corporation.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE

41
README.md Normal file

@ -0,0 +1,41 @@
# FabricHealer Beta
### Configuration as logic and auto-mitigation in Service Fabric clusters
FabricHealer is a Service Fabric application that attempts to fix a set of reliably solvable problems that can occur in Service Fabric application services or host virtual machines, including logical disks (scoped to space usage problems only). These repairs mostly employ a set of Service Fabric API calls, but can also be fully custom. All repairs are orchestrated through Service Fabric's RepairManager service. Repair configuration is written as [Prolog](http://www.learnprolognow.org/)-like [logic](https://github.com/microsoft/service-fabric-healer/tree/main/FabricHealer/PackageRoot/Config/Rules) with [supporting external predicates](https://github.com/microsoft/service-fabric-healer/tree/main/FabricHealer/Repair/Guan) written in C#. This is made possible by a new logic programming system, [Guan](https://github.com/microsoft/guan). The fun starts when FabricHealer detects supported error or warning states reported by [FabricObserver](https://github.com/microsoft/service-fabric-observer) running in the same cluster.
To learn more about FabricHealer's configuration-as-logic model, [click here.](Documentation/LogicWorkflows.md)
```
FabricHealer requires that FabricObserver is deployed in the same cluster.
```
FabricHealer is implemented as a stateless singleton service that runs on all nodes
in a Linux or Windows Service Fabric cluster. It is a .NET Core 3.1 application and has been tested on
Windows Server (2016/2019) and Ubuntu (16.04/18.04).
All warning and error reports created by [FabricObserver](https://github.com/microsoft/service-fabric-observer) and subsequently repaired by FabricHealer are user-configured: developer control extends from the unhealthy event source to the related healing operations.
```
This is a pre-release and is not meant for use in production.
```
## Quickstart
To quickly learn how to use FabricHealer, please see the [simple scenario-based examples.](Documentation/Using.md)
## For Early Adopters while in Private Preview
Please [download the Guan nupkg](https://github.com/microsoft/Guan/releases/download/nupkg1.0/Microsoft.ServiceFabricApps.Guan.1.0.0.nupkg) to your local dev machine and install it into your local FH project in order to build FH successfully. This will be unnecessary when FH ships in Public Preview as Guan will be shipping concurrently and the Guan nupkg will be available in the nuget.org package gallery, as will FH.
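One possible way to script the step above is sketched here. This is a minimal sketch under stated assumptions, not an official procedure: the function name `Install-LocalGuanPackage`, the feed folder, the source name `LocalGuan`, and the project path are all hypothetical, and it assumes a .NET Core SDK recent enough to provide `dotnet nuget add source` (3.1.200 or later). Only the download URL comes from this README.

```powershell
# Hypothetical helper, not part of the FabricHealer repo.
function Install-LocalGuanPackage {
    param(
        [Parameter(Mandatory)]
        [string] $ProjectPath,                                      # path to your local FH .csproj (assumed)
        [string] $FeedFolder = (Join-Path $HOME 'LocalNuGetFeed')   # folder used as a local package feed (assumed)
    )

    $url  = 'https://github.com/microsoft/Guan/releases/download/nupkg1.0/Microsoft.ServiceFabricApps.Guan.1.0.0.nupkg'
    $file = Join-Path $FeedFolder 'Microsoft.ServiceFabricApps.Guan.1.0.0.nupkg'

    # Create the feed folder if needed and download the package into it.
    New-Item -ItemType Directory -Path $FeedFolder -Force | Out-Null
    Invoke-WebRequest -Uri $url -OutFile $file

    # Register the folder as an additional NuGet source, then restore the project
    # so that the Guan package can be resolved from the local folder.
    dotnet nuget add source $FeedFolder --name LocalGuan
    dotnet restore $ProjectPath
}
```

Calling `Install-LocalGuanPackage -ProjectPath .\FabricHealer\FabricHealer.csproj` (path assumed) would then download the package and restore the project. Registering the source a second time under the same name is expected to fail, so remove or rename the source before re-running.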
## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

41
SECURITY.md Normal file

@ -0,0 +1,41 @@
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.5 BLOCK -->
## Security
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [our GitHub organizations](https://opensource.microsoft.com/).
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)), please report it to us as described below.
## Reporting Security Issues
**Please do not report security vulnerabilities through public GitHub issues.**
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue
This information will help us triage your report more quickly.
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs.
## Preferred Languages
We prefer all communications to be in English.
## Policy
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).
<!-- END MICROSOFT SECURITY.MD BLOCK -->

25
SUPPORT.md Normal file

@ -0,0 +1,25 @@
# TODO: The maintainer of this repo has not yet edited this file
**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project?
- **No CSS support:** Fill out this template with information about how to file issues and get help.
- **Yes CSS support:** Fill out an intake form at [aka.ms/spot](https://aka.ms/spot). CSS will work with/help you to determine next steps. More details also available at [aka.ms/onboardsupport](https://aka.ms/onboardsupport).
- **Not sure?** Fill out a SPOT intake as though the answer were "Yes". CSS will help you decide.
*Then remove this first heading from this SUPPORT.MD file before publishing your repo.*
# Support
## How to file issues and get help
This project uses GitHub Issues to track bugs and feature requests. Please search the existing
issues before filing new issues to avoid duplicates. For new issues, file your bug or
feature request as a new Issue.
For help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE
FOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER
CHANNEL. WHERE WILL YOU HELP PEOPLE?**.
## Microsoft Support Policy
Support for this **PROJECT or PRODUCT** is limited to the resources listed above.

Binary data
icon.png Normal file

Binary file not shown.

Size: 17 KiB

Binary data
nuget.exe Normal file

Binary file not shown.