1.1.0-Preview (dev)
This commit is contained in:
Parent
a0b9b3d931
Commit
0b6372281d
|
@ -23,11 +23,11 @@ function Build-SFPkg {
|
|||
try {
|
||||
Push-Location $scriptPath
|
||||
|
||||
Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Linux.SelfContained.1.0.9-Preview" "$scriptPath\bin\release\FabricHealer\linux-x64\self-contained\FabricHealerType"
|
||||
Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Linux.FrameworkDependent.1.0.9-Preview" "$scriptPath\bin\release\FabricHealer\linux-x64\framework-dependent\FabricHealerType"
|
||||
Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Linux.SelfContained.1.1.0-Preview" "$scriptPath\bin\release\FabricHealer\linux-x64\self-contained\FabricHealerType"
|
||||
Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Linux.FrameworkDependent.1.1.0-Preview" "$scriptPath\bin\release\FabricHealer\linux-x64\framework-dependent\FabricHealerType"
|
||||
|
||||
Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Windows.SelfContained.1.0.9-Preview" "$scriptPath\bin\release\FabricHealer\win-x64\self-contained\FabricHealerType"
|
||||
Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Windows.FrameworkDependent.1.0.9-Preview" "$scriptPath\bin\release\FabricHealer\win-x64\framework-dependent\FabricHealerType"
|
||||
Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Windows.SelfContained.1.1.0-Preview" "$scriptPath\bin\release\FabricHealer\win-x64\self-contained\FabricHealerType"
|
||||
Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Windows.FrameworkDependent.1.1.0-Preview" "$scriptPath\bin\release\FabricHealer\win-x64\framework-dependent\FabricHealerType"
|
||||
}
|
||||
finally {
|
||||
Pop-Location
|
||||
|
|
|
@ -0,0 +1,80 @@
|
|||
## FabricHealer Operational Telemetry
|
||||
|
||||
FabricHealer (FH) operational data is transmitted to Microsoft and contains information about FabricHealer. This information helps us understand how FabricHealer is doing in the real world, what type of environment it runs in, and how many, if any, successful or unsuccessful repairs it conducts. This information will help us make sure we invest time in the right places. This data does not contain PII or any information about the services being repaired in your cluster.
|
||||
|
||||
**This information is only used by the Service Fabric team and will be retained for no more than 90 days.**
|
||||
|
||||
Disabling / Enabling transmission of Operational Data:
|
||||
|
||||
Transmission of operational data is controlled by a setting and can be easily turned off. The ```EnableOperationalTelemetry``` setting in ```ApplicationManifest.xml``` controls transmission of operational data. **Note that if you are deploying FabricHealer to a cluster running in a restricted region (China) or cloud (Gov), you should disable this feature before deploying to remain compliant. Please do not send data outside of any restricted boundary.**
|
||||
|
||||
**NOTE: We recommend that this feature be disabled if you are deploying FabricHealer to an Azure cluster running in a restricted region or cloud.**
|
||||
|
||||
Setting the value to false as below will prevent the transmission of operational data:
|
||||
|
||||
**\<Parameter Name="EnableOperationalTelemetry" DefaultValue="false" />**
|
||||
|
||||
As with most of FabricHealer's application settings, you can also do this with a versionless parameter-only application upgrade:
|
||||
|
||||
```Powershell
|
||||
Connect-ServiceFabricCluster ...
|
||||
|
||||
$appParams = @{ "EnableOperationalTelemetry" = "false"; }
|
||||
Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/FabricHealer -ApplicationParameter $appParams -ApplicationTypeVersion 1.1.0-Preview -UnMonitoredAuto
|
||||
|
||||
```
|
||||
|
||||
#### Questions we want to answer from data:
|
||||
|
||||
- Health of FH
|
||||
- If FH crashes with an unhandled exception that can be caught, related error information will be sent to us (this will include the offending FH stack). This will help us improve quality.
|
||||
- Enabled Observers
|
||||
- Helps us focus effort on the most useful observers.
|
||||
- Is FH successfully repairing issues? This data is represented in the total number of SuccessfulRepairs FH conducts in a 24-hour window.
|
||||
- This telemetry is sent once every 24 hours and internal error/warning counters are reset after each telemetry transmission.
|
||||
|
||||
#### Operational data details:
|
||||
|
||||
Here is a full example of exactly what is sent in one of these telemetry events, in this case, from an SFRP cluster:
|
||||
|
||||
```JSON
|
||||
{
|
||||
"EventName": "OperationalEvent",
|
||||
"TaskName": "FabricHealer",
|
||||
"EventRunInterval": "1.00:00:00",
|
||||
"ClusterId": "00000000-1111-1111-0000-00f00d000d",
|
||||
"ClusterType": "SFRP",
|
||||
"NodeNameHash": "3e83569d4c6aad78083cd081215dafc81e5218556b6a46cb8dd2b183ed0095ad",
|
||||
"FHVersion": "1.1.0-Preview",
|
||||
"UpTime": "00:00:00.1956784",
|
||||
"Timestamp": "2021-12-11T03:12:47.0410613Z",
|
||||
"OS": "Windows",
|
||||
"EnabledRepairCount": 2,
|
||||
"TotalRepairAttempts": 5,
|
||||
"SuccessfulRepairs": 5,
|
||||
"FailedRepairs": 0
|
||||
}
|
||||
```
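The `EventRunInterval` and `UpTime` strings in the sample event appear to use the .NET `TimeSpan` constant format (`[-][d.]hh:mm:ss[.fffffff]`). The following is a minimal Python sketch for anyone inspecting such an event outside of .NET. The helper name `parse_timespan` is hypothetical and is not part of FabricHealer.

```python
import re
from datetime import timedelta

# .NET TimeSpan constant ("c") format: [-][d.]hh:mm:ss[.fffffff]
_TIMESPAN = re.compile(
    r"^(?P<sign>-)?(?:(?P<days>\d+)\.)?"
    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2})"
    r"(?:\.(?P<frac>\d{1,7}))?$"
)


def parse_timespan(text: str) -> timedelta:
    """Convert a TimeSpan-style string into a timedelta (hypothetical helper)."""
    match = _TIMESPAN.match(text)
    if match is None:
        raise ValueError(f"not a TimeSpan string: {text!r}")
    # The fraction counts 100 ns ticks; timedelta stores microseconds,
    # so the seventh digit is dropped.
    frac = (match["frac"] or "").ljust(7, "0")
    value = timedelta(
        days=int(match["days"] or 0),
        hours=int(match["h"]),
        minutes=int(match["m"]),
        seconds=int(match["s"]),
        microseconds=int(frac[:6]),
    )
    return -value if match["sign"] else value


run_interval = parse_timespan("1.00:00:00")      # EventRunInterval from the sample
up_time = parse_timespan("00:00:00.1956784")     # UpTime from the sample
```

With the sample values, `run_interval` is one day, which matches the once-every-24-hours cadence described earlier.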
|
||||
|
||||
Let's take a look at the data and why we think it is useful to share with us. We'll go through each object property in the JSON above.
|
||||
- **EventName** - this is the name of the telemetry event.
|
||||
- **TaskName** - this specifies that the event is from FabricHealer.
|
||||
- **EventRunInterval** - this is how often this telemetry is sent from a node in a cluster.
|
||||
- **ClusterId** - this is used to both uniquely identify a telemetry event and to correlate data that comes from a cluster.
|
||||
- **ClusterType** - this is the type of cluster: Standalone or SFRP.
|
||||
- **NodeNameHash** - this is a sha256 hash of the name of the Fabric node from where the data originates. It is used to correlate data from specific nodes in a cluster (the hashed node name will be known to be part of the cluster with a specific cluster id).
|
||||
- **FHVersion** - this is the internal version of FH (if you have your own version naming, we will only know what the FH code version is (not your specific FH app version name)).
|
||||
- **UpTime** - this is the amount of time FH has been running since it last started.
|
||||
- **Timestamp** - this is the time, in UTC, when FH sent the telemetry.
|
||||
- **OS** - this is the operating system FH is running on (Windows or Linux).
|
||||
- **EnabledRepairCount** - this is the number of enabled repairs.
|
||||
- **TotalRepairAttempts** - this is the number of times repairs were attempted.
|
||||
- **SuccessfulRepairs** - this is the number of successful repairs.
|
||||
- **FailedRepairs** - this is the number of failed repairs.
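To illustrate what a value like **NodeNameHash** looks like, here is a minimal Python sketch of a SHA-256 hash of a node name. The UTF-8 encoding, the lowercase hex output, and the function name `node_name_hash` are assumptions made for illustration. FabricHealer's own C# implementation may encode or format the value differently.

```python
import hashlib


def node_name_hash(node_name: str) -> str:
    """Return a SHA-256 hex digest of the node name.

    Assumption: the name is encoded as UTF-8 and the digest is rendered
    as lowercase hex (64 characters), like the sample value above.
    """
    return hashlib.sha256(node_name.encode("utf-8")).hexdigest()


# "_Node_0" is a hypothetical node name used only for this example.
digest = node_name_hash("_Node_0")
print(len(digest))  # a SHA-256 hex digest is always 64 characters long
```

Because the hash is one-way, the node name itself is not recoverable from the telemetry event, while the same node always produces the same hash and can be correlated across events.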
|
||||
|
||||
|
||||
If the ClusterType is not SFRP then a TenantId (Guid) is sent for use in the same way we use ClusterId.
|
||||
|
||||
This information will **really** help us understand how FabricHealer is doing out there and we would greatly appreciate you sharing it with us!
|
||||
|
||||
|
|
@ -2,18 +2,16 @@
|
|||
<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
|
||||
<metadata minClientVersion="3.3.0">
|
||||
<id>%PACKAGE_ID%</id>
|
||||
<version>1.0.9-Preview</version>
|
||||
<version>1.1.0-Preview</version>
|
||||
<releaseNotes>
|
||||
- FabricHealer will let you know if a new version is available (Informational event).
|
||||
- Changes to telemetry to support the version check feature.
|
||||
- Application Parameters Renaming (breaking change).
|
||||
- More string consts to limit misspelling bugs.
|
||||
- TODO
|
||||
</releaseNotes>
|
||||
<authors>Microsoft</authors>
|
||||
<license type="expression">MIT</license>
|
||||
<requireLicenseAcceptance>false</requireLicenseAcceptance>
|
||||
<title>FabricHealer Service</title>
|
||||
<icon>icon.png</icon>
|
||||
<readme>fhnuget.md</readme>
|
||||
<language>en-US</language>
|
||||
<description>FabricHealer is a Service Fabric service that schedules and safely executes automatic repairs in Linux and Windows Service Fabric clusters after inspecting unhealthy events created by FabricObserver (FO) instances running in the same cluster. It employs a novel Configuration-as-Logic model to express repair workflows using Prolog-like semantics/syntax in text-based configuration files. This version, 1.0.2-Preview, is meant for Test clusters.</description>
|
||||
<contentFiles>
|
||||
|
@ -30,5 +28,6 @@
|
|||
<file src="**" target="contentFiles\any\any" />
|
||||
<file src="FabricHealerPkg\Code\FabricHealer.dll" target="lib\netstandard2.0" />
|
||||
<file src="%ROOT_PATH%\icon.png" target="" />
|
||||
<file src="%ROOT_PATH%\fhnuget.md" target="" />
|
||||
</files>
|
||||
</package>
|
|
@ -13,8 +13,10 @@ Project("{2150E333-8FDC-42A3-9474-1A3956D46DE8}") = "Solution Items", "Solution
|
|||
Build-NugetPackages.ps1 = Build-NugetPackages.ps1
|
||||
Build-SFPKGs.ps1 = Build-SFPKGs.ps1
|
||||
FabricHealer.nuspec.template = FabricHealer.nuspec.template
|
||||
fhnuget.md = fhnuget.md
|
||||
icon.png = icon.png
|
||||
Documentation\LogicWorkflows.md = Documentation\LogicWorkflows.md
|
||||
Documentation\OperationalTelemetry.md = Documentation\OperationalTelemetry.md
|
||||
README.md = README.md
|
||||
Documentation\Using.md = Documentation\Using.md
|
||||
EndProjectSection
|
||||
|
|
|
@ -28,7 +28,7 @@ namespace FabricHealer
|
|||
internal static RepairData RepairHistory;
|
||||
|
||||
// Folks often use their own version numbers. This is for internal diagnostic telemetry.
|
||||
private const string InternalVersionNumber = "1.0.9-Preview";
|
||||
private const string InternalVersionNumber = "1.1.0-Preview";
|
||||
private static FabricHealerManager singleton;
|
||||
private bool disposedValue;
|
||||
private readonly StatelessServiceContext serviceContext;
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<ServiceManifest Name="FabricHealerPkg"
|
||||
Version="1.0.9-Preview"
|
||||
Version="1.1.0-Preview"
|
||||
xmlns="http://schemas.microsoft.com/2011/01/fabric"
|
||||
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
|
||||
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
|
||||
|
@ -11,7 +11,7 @@
|
|||
</ServiceTypes>
|
||||
|
||||
<!-- Code package is your service executable. -->
|
||||
<CodePackage Name="Code" Version="1.0.9-Preview">
|
||||
<CodePackage Name="Code" Version="1.1.0-Preview">
|
||||
<EntryPoint>
|
||||
<ExeHost>
|
||||
<Program>FabricHealer</Program>
|
||||
|
@ -21,5 +21,5 @@
|
|||
|
||||
<!-- Config package is the contents of the Config directory under PackageRoot that contains an
|
||||
independently-updateable and versioned set of custom configuration settings for your service. -->
|
||||
<ConfigPackage Name="Config" Version="1.0.9-Preview" />
|
||||
<ConfigPackage Name="Config" Version="1.1.0-Preview" />
|
||||
</ServiceManifest>
|
|
@ -1,5 +1,5 @@
|
|||
<?xml version="1.0" encoding="utf-8"?>
|
||||
<ApplicationManifest xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ApplicationTypeName="FabricHealerType" ApplicationTypeVersion="1.0.9-Preview" xmlns="http://schemas.microsoft.com/2011/01/fabric">
|
||||
<ApplicationManifest xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ApplicationTypeName="FabricHealerType" ApplicationTypeVersion="1.1.0-Preview" xmlns="http://schemas.microsoft.com/2011/01/fabric">
|
||||
<Parameters>
|
||||
<Parameter Name="FabricHealer_InstanceCount" DefaultValue="-1" />
|
||||
<!-- FabricHealerManager Settings -->
|
||||
|
@ -25,7 +25,7 @@
|
|||
should match the Name and Version attributes of the ServiceManifest element defined in the
|
||||
ServiceManifest.xml file. -->
|
||||
<ServiceManifestImport>
|
||||
<ServiceManifestRef ServiceManifestName="FabricHealerPkg" ServiceManifestVersion="1.0.9-Preview" />
|
||||
<ServiceManifestRef ServiceManifestName="FabricHealerPkg" ServiceManifestVersion="1.1.0-Preview" />
|
||||
<ConfigOverrides>
|
||||
<ConfigOverride Name="Config">
|
||||
<Settings>
|
||||
|
|
41
README.md
|
@ -1,21 +1,17 @@
|
|||
# FabricHealer Preview
|
||||
## FabricHealer Preview
|
||||
### Configuration as Logic and auto-mitigation in Service Fabric clusters
|
||||
|
||||
FabricHealer is a Service Fabric application that attempts to automatically fix a set of reliably solvable problems that can take place in Service Fabric applications
|
||||
(including containers), host virtual machines, and logical disks (scoped to space usage problems only).
|
||||
These repairs mostly employ a set of Service Fabric API calls, but can also be fully customizable (like Disk repair).
|
||||
All repairs are safely orchestrated through Service Fabric’s RepairManager service.
|
||||
Repair workflow configuration is written as [Prolog](http://www.let.rug.nl/bos/lpn//lpnpage.php?pageid=online)-like [logic](/FabricHealer/PackageRoot/Config/LogicRules) with supporting external predicates written in C#.
|
||||
FabricHealer's Configuration-as-Logic feature is made possible by a new logic programming library for .NET, [Guan](https://github.com/microsoft/guan), also in Preview.
|
||||
The fun starts when FabricHealer detects supported error or warning health events reported by [FabricObserver](https://github.com/microsoft/service-fabric-observer).
|
||||
FabricHealer (FH) is a Service Fabric application that attempts to automatically fix a set of reliably solvable problems that can take place in Service Fabric applications (including containers), host virtual machines, and logical disks (scoped to space usage problems only). These repairs mostly employ a set of Service Fabric API calls, but can also be fully customizable (like Disk repair). All repairs are safely orchestrated through the Service Fabric RepairManager system service. Repair workflow configuration is written as [Prolog](http://www.let.rug.nl/bos/lpn//lpnpage.php?pageid=online)-like [logic](https://github.com/microsoft/service-fabric-healer/blob/main/FabricHealer/PackageRoot/Config/LogicRules) with supporting external predicates written in C#.
|
||||
|
||||
FabricHealer's Configuration-as-Logic feature is made possible by a new logic programming library for .NET, [Guan](https://github.com/microsoft/guan), also in Preview. The fun starts when FabricHealer detects supported error or warning health events reported by [FabricObserver](https://github.com/microsoft/service-fabric-observer).
|
||||
|
||||
FabricHealer is implemented as a stateless singleton service that runs on all nodes in a Linux or Windows Service Fabric cluster. It is a .NET Core 3.1 application and has been tested on Windows (2016/2019) and Ubuntu (16/18.04).
|
||||
|
||||
All warning and error health reports created by [FabricObserver](https://github.com/microsoft/service-fabric-observer) and subsequently repaired by FabricHealer are user-configured - developer control extends from unhealthy event source to related healing operations.
|
||||
All warning and error health reports created by [FabricObserver](https://github.com/microsoft/service-fabric-observer) and subsequently repaired by FabricHealer are user-configured - developer control extends from unhealthy event source to related healing operations.
|
||||
|
||||
FabricObserver and FabricHealer are part of a family of highly configurable Service Fabric observability tools that work together to keep your clusters green.
|
||||
|
||||
|
||||
To learn more about FabricHealer's configuration-as-logic model, [click here.](Documentation/LogicWorkflows.md)
|
||||
To learn more about FabricHealer's configuration-as-logic model, [click here.](https://github.com/microsoft/service-fabric-healer/blob/main/Documentation/LogicWorkflows.md)
|
||||
|
||||
```
|
||||
FabricHealer requires that FabricObserver and RepairManager (RM) service are deployed.
|
||||
|
@ -28,10 +24,7 @@ This is Preview technology and is not meant for production use. Only use in Test
|
|||
```
|
||||
We are very interested in your feedback on both repair reliability and the Configuration-as-Logic feature. Please let us know what you think. Simply create Issues with your feedback, any bugs you run into, and any enhancements you think are necessary. Thank you.
|
||||
|
||||
Also, a reminder that this is preview quality software and there are probably some minor bugs and the code will definitely churn, but it is stable and solid.
|
||||
It is capable as it is today and appropriate for use in **test** environments. It has been tested in both Linux and Windows deployments.
|
||||
The current set of repair workflows work and should perform correctly in your clusters. Please create Issues on this repo if you find bugs. If you are comfortable fixing them, then
|
||||
pull requests will be evaluated and merged if they meet the quality bar. Thanks in advance for your partnership and for experimenting with FabricHealer.
|
||||
Also, a reminder that this is preview-quality software: there are probably some minor bugs and the code will definitely churn, but it is stable and solid. It is capable as it is today and appropriate for use in **test** environments. It has been tested in both Linux and Windows deployments. The current set of repair workflows works and should perform correctly in your clusters. Please create Issues on this repo if you find bugs. If you are comfortable fixing them, pull requests will be evaluated and merged if they meet the quality bar. Thanks in advance for your partnership and for experimenting with FabricHealer.
|
||||
|
||||
## Build and run
|
||||
|
||||
|
@ -43,19 +36,18 @@ pull requests will be evaluated and merged if they meet the quality bar. Thanks
|
|||
|
||||
## Using FabricHealer
|
||||
|
||||
```FabricHealer is a service specifically designed to auto-mitigate Service Fabric service issues that are generally the result of bugs in user code.```
|
||||
```FabricHealer is a service specifically designed to auto-mitigate Service Fabric service issues that are generally the result of bugs in user code (leaking memory, leaking ephemeral ports, creating too many threads, abusing the cpu, etc...).```
|
||||
|
||||
Let's say you have a service that leaks memory or ephemeral ports. You would use FabricHealer to keep the problem in check while your developers figure out the root cause and fix the bug(s) that lead to resource usage over-consumption. FabricHealer is really just a temporary solution to problems, not a fix. This is how you should think about auto-mitigation, generally. FabricHealer aims to keep your cluster green while you fix your bugs. With its configuration-as-logic support, you can easily specify that some repair for some service should only be attempted for n weeks or months, while your dev team fixes the underlying issues with the problematic service. FabricHealer should be thought of as a "disappearing task force" in that it can provide stability during times of instability, then "go away" when bugs are fixed.
|
||||
|
||||
FabricHealer comes with a number of already-implemented/tested target-specific logic rules. You will only need to modify existing rules to get going quickly. FabricHealer is a rule-based repair service and the rules are defined in logic. These rules also form FabricHealer's repair workflow configuration. This is what is meant by Configuration-as-Logic. The only use of XML-based configuration with respect to repair workflow is enabling automitigation (big on/off switch), enabling repair policies, and specifying rule file names. The rest is just the typical Service Fabric application configuration that you know and love. Most of the settings in
|
||||
Settings.xml are overridable parameters and you set the values in ApplicationManifest.xml. This enables versionless parameter-only application upgrades, which means you can change Settings.xml-based settings without redeploying FabricHealer.
|
||||
FabricHealer comes with a number of already-implemented/tested target-specific logic rules. You will only need to modify existing rules to get going quickly. FabricHealer is a rule-based repair service and the rules are defined in logic. These rules also form FabricHealer's repair workflow configuration. This is what is meant by Configuration-as-Logic. The only use of XML-based configuration with respect to repair workflow is enabling automitigation (big on/off switch), enabling repair policies, and specifying rule file names. The rest is just the typical Service Fabric application configuration that you know and love. Most of the settings in Settings.xml are overridable parameters and you set the values in ApplicationManifest.xml. This enables versionless parameter-only application upgrades, which means you can change Settings.xml-based settings without redeploying FabricHealer.
|
||||
|
||||
### Repair ephemeral port-leaking service process
|
||||
|
||||
```Prolog
|
||||
## Ephemeral Ports - Specific Application: any of its services, constrained on number of local ephemeral ports open.
|
||||
## 5 repairs within 5 hour window.
|
||||
Mitigate(AppName="fabric:/MyApp42", MetricName="EphemeralPorts", MetricValue=?MetricValue) :- ?MetricValue > 5000,
|
||||
Mitigate(AppName="fabric:/IlikePorts", MetricName="EphemeralPorts", MetricValue=?MetricValue) :- ?MetricValue > 5000,
|
||||
TimeScopedRestartCodePackage(5, 05:00:00).
|
||||
```
|
||||
|
||||
|
@ -66,13 +58,18 @@ Mitigate(AppName="fabric:/MyApp42", MetricName="EphemeralPorts", MetricValue=?Me
|
|||
## 3 repairs within 30 minute window.
|
||||
Mitigate(AppName="fabric:/ILikeMemory", MetricName="MemoryPercent", MetricValue=?MetricValue) :- ?MetricValue >= 70,
|
||||
TimeScopedRestartCodePackage(3, 00:30:00).
|
||||
```
|
||||
```
|
||||
|
||||
## Quickstart
|
||||
|
||||
To quickly learn how to use FabricHealer, please see the [simple scenario-based examples.](Documentation/Using.md)
|
||||
To quickly learn how to use FabricHealer, please see the [simple scenario-based examples.](https://github.com/microsoft/service-fabric-healer/blob/main/Documentation/Using.md)
|
||||
|
||||
# Contributing
|
||||
# Operational Telemetry
|
||||
Please see [FabricHealer Operational Telemetry](/Documentation/OperationalTelemetry.md) for detailed information on the user agnostic (Non-PII) data FabricHealer sends to Microsoft (opt out with a simple configuration parameter change).
|
||||
Please consider leaving this enabled so your friendly neighborhood FabricHealer devs can understand how FH is doing in the real world. We would really appreciate it!
|
||||
|
||||
|
||||
## Contributing
|
||||
|
||||
This project welcomes contributions and suggestions. Most contributions require you to agree to a
|
||||
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
|
||||
|
|
|
@ -69,7 +69,7 @@ namespace FabricHealer.TelemetryLib
|
|||
{ "EventRunInterval", runInterval.ToString() },
|
||||
{ "ClusterId", clusterId },
|
||||
{ "ClusterType", clusterType },
|
||||
{ "NodeNameHash", nodeHashString ?? string.Empty },
|
||||
{ "NodeNameHash", nodeHashString },
|
||||
{ "FHVersion", repairData.Version },
|
||||
{ "UpTime", repairData.UpTime },
|
||||
{ "Timestamp", DateTime.UtcNow.ToString("o") },
|
||||
|
@ -138,7 +138,7 @@ namespace FabricHealer.TelemetryLib
|
|||
{ "ClusterId", clusterId },
|
||||
{ "ClusterType", clusterType },
|
||||
{ "TenantId", tenantId },
|
||||
{ "NodeNameHash", nodeHashString ?? string.Empty },
|
||||
{ "NodeNameHash", nodeHashString },
|
||||
{ "FHVersion", fhErrorData.Version },
|
||||
{ "CrashTime", fhErrorData.CrashTime },
|
||||
{ "ErrorMessage", fhErrorData.ErrorMessage },
|
||||
|
|
|
@ -0,0 +1,59 @@
|
|||
## FabricHealer Preview
|
||||
### Configuration as Logic and auto-mitigation in Service Fabric clusters
|
||||
|
||||
FabricHealer is a Service Fabric application that attempts to automatically fix a set of reliably solvable problems that can take place in Service Fabric applications (including containers), host virtual machines, and logical disks (scoped to space usage problems only). These repairs mostly employ a set of Service Fabric API calls, but can also be fully customizable (like Disk repair). All repairs are safely orchestrated through the Service Fabric RepairManager system service. Repair workflow configuration is written as [Prolog](http://www.let.rug.nl/bos/lpn//lpnpage.php?pageid=online)-like [logic](https://github.com/microsoft/service-fabric-healer/blob/main/FabricHealer/PackageRoot/Config/LogicRules) with supporting external predicates written in C#.
|
||||
|
||||
FabricHealer's Configuration-as-Logic feature is made possible by a new logic programming library for .NET, [Guan](https://github.com/microsoft/guan), also in Preview. The fun starts when FabricHealer detects supported error or warning health events reported by [FabricObserver](https://github.com/microsoft/service-fabric-observer).
|
||||
|
||||
FabricHealer is implemented as a stateless singleton service that runs on all nodes in a Linux or Windows Service Fabric cluster. It is a .NET Core 3.1 application and has been tested on Windows (2016/2019) and Ubuntu (16/18.04).
|
||||
|
||||
All warning and error health reports created by [FabricObserver](https://github.com/microsoft/service-fabric-observer) and subsequently repaired by FabricHealer are user-configured - developer control extends from unhealthy event source to related healing operations.
|
||||
|
||||
FabricObserver and FabricHealer are part of a family of highly configurable Service Fabric observability tools that work together to keep your clusters green.
|
||||
|
||||
To learn more about FabricHealer's configuration-as-logic model, [click here.](https://github.com/microsoft/service-fabric-healer/blob/main/Documentation/LogicWorkflows.md)
|
||||
|
||||
```
|
||||
FabricHealer requires that FabricObserver and RepairManager (RM) service are deployed.
|
||||
```
|
||||
```
|
||||
For VM level repair, InfrastructureService (IS) service must be deployed.
|
||||
```
|
||||
```
|
||||
This is Preview technology and is not meant for production use. Only use in Test environments.
|
||||
```
|
||||
We are very interested in your feedback on both repair reliability and the Configuration-as-Logic feature. Please let us know what you think. Simply create Issues with your feedback, any bugs you run into, and any enhancements you think are necessary. Thank you.
|
||||
|
||||
Also, a reminder that this is preview-quality software: there are probably some minor bugs and the code will definitely churn, but it is stable and solid. It is capable as it is today and appropriate for use in **test** environments. It has been tested in both Linux and Windows deployments. The current set of repair workflows works and should perform correctly in your clusters. Please create Issues on this repo if you find bugs. If you are comfortable fixing them, pull requests will be evaluated and merged if they meet the quality bar. Thanks in advance for your partnership and for experimenting with FabricHealer.
|
||||
|
||||
***Note: FabricHealer must be run under the LocalSystem account (see ApplicationManifest.xml) in order to function correctly. This means on Windows, by default, it will run as System user. On Linux, by default, it will run as root user. You do not have to make any changes to ApplicationManifest.xml for this to be the case.***
|
||||
|
||||
## Using FabricHealer
|
||||
|
||||
```FabricHealer is a service specifically designed to auto-mitigate Service Fabric service issues that are generally the result of bugs in user code (leaking memory, leaking ephemeral ports, creating too many threads, abusing the cpu, etc...).```
|
||||
|
||||
Let's say you have a service that leaks memory or ephemeral ports. You would use FabricHealer to keep the problem in check while your developers figure out the root cause and fix the bug(s) that lead to resource usage over-consumption. FabricHealer is really just a temporary solution to problems, not a fix. This is how you should think about auto-mitigation, generally. FabricHealer aims to keep your cluster green while you fix your bugs. With its configuration-as-logic support, you can easily specify that some repair for some service should only be attempted for n weeks or months, while your dev team fixes the underlying issues with the problematic service. FabricHealer should be thought of as a "disappearing task force" in that it can provide stability during times of instability, then "go away" when bugs are fixed.
|
||||
|
||||
FabricHealer comes with a number of already-implemented/tested target-specific logic rules. You will only need to modify existing rules to get going quickly. FabricHealer is a rule-based repair service and the rules are defined in logic. These rules also form FabricHealer's repair workflow configuration. This is what is meant by Configuration-as-Logic. The only use of XML-based configuration with respect to repair workflow is enabling automitigation (big on/off switch), enabling repair policies, and specifying rule file names. The rest is just the typical Service Fabric application configuration that you know and love. Most of the settings in Settings.xml are overridable parameters and you set the values in ApplicationManifest.xml. This enables versionless parameter-only application upgrades, which means you can change Settings.xml-based settings without redeploying FabricHealer.
|
||||
|
||||
### Repair ephemeral port-leaking service process
|
||||
|
||||
```Prolog
|
||||
## Ephemeral Ports - Specific Application: any of its services, constrained on number of local ephemeral ports open.
|
||||
## 5 repairs within 5 hour window.
|
||||
Mitigate(AppName="fabric:/IlikePorts", MetricName="EphemeralPorts", MetricValue=?MetricValue) :- ?MetricValue > 5000,
|
||||
TimeScopedRestartCodePackage(5, 05:00:00).
|
||||
```
|
||||
|
||||
### Repair memory-leaking service process
|
||||
|
||||
```Prolog
|
||||
## Memory - Percent In Use for Any SF Service Process belonging to the specified SF Application.
|
||||
## 3 repairs within 30 minute window.
|
||||
Mitigate(AppName="fabric:/ILikeMemory", MetricName="MemoryPercent", MetricValue=?MetricValue) :- ?MetricValue >= 70,
|
||||
TimeScopedRestartCodePackage(3, 00:30:00).
|
||||
```
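The time-scoped rules above read as "at most N repairs within a trailing time window". The following Python sketch shows one way to reason about that constraint. It is an illustration under that reading only, not how FabricHealer or Guan evaluates `TimeScopedRestartCodePackage`, and the function name `may_attempt_repair` is hypothetical.

```python
from datetime import datetime, timedelta


def may_attempt_repair(past_repairs, now, max_repairs, window):
    """Return True if fewer than max_repairs fall inside the trailing window.

    past_repairs: datetimes of earlier repairs for the same target.
    Assumption: a repair is allowed only while the count of repairs in
    the interval [now - window, now] is below max_repairs.
    """
    recent = [t for t in past_repairs if timedelta(0) <= now - t <= window]
    return len(recent) < max_repairs


now = datetime(2021, 12, 11, 12, 0, 0)
# Three repairs in the last 30 minutes, mirroring the (3, 00:30:00) rule.
history = [now - timedelta(minutes=m) for m in (5, 10, 20)]
allowed = may_attempt_repair(history, now, max_repairs=3, window=timedelta(minutes=30))
print(allowed)  # False: the limit for the window has been reached
```

Once the oldest repair ages out of the window, the same call returns True again, which is the behavior the "n repairs within a time window" comments in the rules describe.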
|
||||
|
||||
## Quickstart
|
||||
|
||||
To quickly learn how to use FabricHealer, please see the [simple scenario-based examples.](https://github.com/microsoft/service-fabric-healer/blob/main/Documentation/Using.md)
|
Binary file not shown.