1.1.19: RestartFabricNode re-impl, major doc updates, repair rule updates, tests.
Parent: 406c998aae
Commit: 5a7c730448
@@ -23,11 +23,11 @@ function Build-SFPkg {
try {
    Push-Location $scriptPath

-    Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Linux.SelfContained.1.1.18" "$scriptPath\bin\release\FabricHealer\linux-x64\self-contained\FabricHealerType"
-    Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Linux.FrameworkDependent.1.1.18" "$scriptPath\bin\release\FabricHealer\linux-x64\framework-dependent\FabricHealerType"
+    Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Linux.SelfContained.1.1.19" "$scriptPath\bin\release\FabricHealer\linux-x64\self-contained\FabricHealerType"
+    Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Linux.FrameworkDependent.1.1.19" "$scriptPath\bin\release\FabricHealer\linux-x64\framework-dependent\FabricHealerType"

-    Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Windows.SelfContained.1.1.18" "$scriptPath\bin\release\FabricHealer\win-x64\self-contained\FabricHealerType"
-    Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Windows.FrameworkDependent.1.1.18" "$scriptPath\bin\release\FabricHealer\win-x64\framework-dependent\FabricHealerType"
+    Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Windows.SelfContained.1.1.19" "$scriptPath\bin\release\FabricHealer\win-x64\self-contained\FabricHealerType"
+    Build-SFPkg "Microsoft.ServiceFabricApps.FabricHealer.Windows.FrameworkDependent.1.1.19" "$scriptPath\bin\release\FabricHealer\win-x64\framework-dependent\FabricHealerType"
}
finally {
    Pop-Location
@@ -11,7 +11,7 @@
        },
        "applicationTypeVersionFabricHealer": {
            "type": "string",
-            "defaultValue": "1.1.18",
+            "defaultValue": "1.1.19",
            "metadata": {
                "description": "Provide the app version number of FabricHealer. This must be identical to the version specified in the sfpkg."
            }
@@ -6,7 +6,7 @@
            "value": "<YOUR-CLUSTER-RESOURCE-NAME>"
        },
        "applicationTypeVersionFabricHealer": {
-            "value": "1.1.18"
+            "value": "1.1.19"
        },
        "packageUrlFabricHealer": {
            "value": "<PUBLIC-ACCESSIBLE-URL-FOR-FABRICHEALER-SFPKG>"
@@ -5,20 +5,18 @@ FabricHealer employs configuration-as-logic by leveraging the expressive power o

Supporting formal logic-based repair workflows gives users more tools and options to express their custom repair workflows. Formal logic gives users the power to express concepts like if/else statements, leverage boolean operators, and even things like recursion! Logic programming allows users to easily and concisely express complex repair workflows that leverage the complete power of a logic programming language. We use GuanLogic for our underlying logic processing, which is a general-purpose logic programming API written by Lu Xun (Microsoft) that enables Prolog-like (https://en.wikipedia.org/wiki/Prolog) rule definition and query execution in C#.

While not necessary, reading Chapters 1-10 of the [learnprolognow](http://www.learnprolognow.org/lpnpage.php?pagetype=html&pageid=lpn-htmlch1) online (free) book can be quite useful and is highly recommended if you want to create more advanced rules to suit your more complex needs over time. Note that the documentation here doesn't assume you have any experience with logic programming. This is also the case with respect to using FabricHealer: several rules are already in place and you will simply need to change parts of an existing rule (like supplying your app names, for example) to get up and running very quickly.

**Do I need experience with logic programming?**

No, using logic to express repair workflows is easy! One doesn't need a deep knowledge and understanding of logic programming to write their own repair workflows. However, the more sophisticated/complex your needs, the more knowledge you will need to possess. For now, let's start with a very simple example to help inspire your own logic-based repair workflows.

***Problem***: I want to perform a code package restart if FabricObserver emits a memory usage warning for a *specific* application in my cluster (e.g. "fabric:/App1").

***Solution***: We can leverage Guan and its built-in equals operator to check the name of the application that triggered the warning against the name of the application for which we want to perform a code package restart. For application-level health events, the repair workflow is defined inside the PackageRoot/Config/LogicRules/AppRules.config.txt file. Here is what we would enter:

```
-Mitigate(AppName="fabric:/App1", MetricName="MemoryPercent") :- RestartCodePackage().
+Mitigate(AppName="fabric:/App1", MetricName="MemoryPercent") :- RestartCodePackage.
```

Don't be alarmed if you don't understand how to read that repair action! We will go more in-depth later about the syntax and semantics of Guan. The takeaway is that expressing a Guan repair workflow doesn't require a deep knowledge of Prolog programming to get started. Hopefully this also gives you a general idea about the kinds of repair workflows we can express with GuanLogic.
@@ -37,61 +35,237 @@ Each repair policy has its own corresponding configuration file located in the F

Now let's look at *how* to actually define a Guan logic repair workflow, so that you will have the knowledge necessary to express your own.

-## Writing Logic Rules
+### Writing Logic Rules

This [site](https://www.metalevel.at/prolog/concepts) gives a good, fast overview of basic Prolog concepts, which you may find useful.

-The building block for creating Guan logic repair workflows is through the use and composition of **Predicates**. A predicate has a name, and zero or more arguments.
-In FH, there are two different kinds of predicates: **Internal Predicates** and **External Predicates**. Internal predicates are equivalent to standard
-predicates in Prolog. An internal predicate defines relations between their arguments and other internal predicates.
-External predicates on the other hand are more similar to functions in terms of behaviour. External predicates are usually used to perform
-actions such as checking values, performing calculations, performing repairs, and binding values to variables.
+The building block for creating Guan logic repair workflows is the use and composition of **Predicates**. A predicate has a name and zero or more arguments. In FH, there are two different kinds of predicates: **Internal Predicates** and **External Predicates**. Internal predicates are equivalent to standard predicates in Prolog: an internal predicate defines relations between its arguments and other internal predicates. External predicates, on the other hand, are more similar to functions in terms of behaviour; they are usually used to perform actions such as checking values, performing calculations, performing repairs, and binding values to variables.

+### FabricHealer Predicates

Here is a list of currently implemented **External Predicates**:

-**Repair Predicates**
+**Note**: Technically, an external predicate is not a function - though you can think of it as a function, as it optionally takes input and does specific things. In reality, an external predicate is a type.
+You can look at how each one of the external predicates below is defined and implemented by looking in the FabricHealer/Repair/Guan source folder.
+Given this, you only need to append "()" to the end of an external predicate if you supply arguments.
+E.g., you don't have to specify ```MyPredicate``` as ```MyPredicate()``` if you don't supply any arguments (or variables to which some value will be bound for use in subgoals). You can if you want. It's really up to you.

-```RestartCodePackage()```
+**Please read the [Optional Arguments](#Optional-Arguments) section to learn more about how optional argument support works in FabricHealer.**

-Attempts to restart the code package for the service that emitted the health event, returns true if successful, else false.
+**RestartCodePackage**

-```RestartFabricNode()```
+Attempts to restart a service code package (all related facts are already known by FH, as these facts were provided by either FO or FHProxy).

-Attempts to restart the node of the service that emitted the health event, returns true if successful, else false. Takes an optional Safe parameter: "safe" or "unsafe" which defines whether or not to perform a safe or unsafe node restart. A safe node restart will try to first deactivate the node before restarting, whereas an unsafe node restart will try restarting the node without first trying to deactivate it.
+Arguments:

-```RestartReplica()```
+- DoHealthChecks (Boolean), Optional
+- MaxWaitTimeForHealthStateOk (TimeSpan), Optional
+- MaxExecutionTime (TimeSpan), Optional

+Example:

+```Mitigate(MetricName="Threads") :- RestartCodePackage(false, 00:10:00, 00:30:00).```
+**RestartFabricNode**

+Restarts a Service Fabric Node. Note that the repair task that is created for this type of repair will have a NodeImpactLevel of Restart, so Service Fabric will disable the node before the job is Approved, after which time FabricHealer will execute the repair.

+Arguments: none.

+Example:

+```Mitigate(HealthState=Error) :- GetRepairHistory(?repairCount, 08:00:00), ?repairCount < 2, RestartFabricNode.```

+**DeactivateFabricNode**

+Schedules a Service Fabric Repair Job to deactivate a Service Fabric Node.

+Arguments:

+- ImpactLevel (string, supported values: Restart, RemoveData, RemoveNode), Optional (default is Restart).

+Example:

+```Mitigate(HealthState=Error) :- GetRepairHistory(?repairCount, 08:00:00), ?repairCount < 2, DeactivateFabricNode(RemoveData).```
+**RestartReplica**

+Attempts to restart a service replica (the replica (or instance) id is already known by FH, as that fact was provided by either FO or FHProxy).

+Arguments:

+- DoHealthChecks (Boolean), Optional
+- MaxWaitTimeForHealthStateOk (TimeSpan), Optional
+- MaxExecutionTime (TimeSpan), Optional

+Example:

+```Mitigate(MetricName="Threads", MetricValue=?value) :- ?value > 500, RestartReplica.```

-Attempts to restart the replica of the service that emitted the health event, returns true if successful, else false.

-```RestartVM()```
+**ScheduleMachineRepair**

-Attempts to restart the underlying virtual machine of the service that emitted the health event, returns true if successful, else false.
+Arguments:

-```RestartFabricSystemProcess()```
+- DoHealthChecks (Boolean), Optional
+- RepairAction (String), Required

+Attempts to schedule an infrastructure repair for the underlying virtual machine, returns true if successful, else false.
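Example (this mirrors the machine repair escalation rules updated later in this commit; System.Reboot is the repair action name supplied as the required argument):

```Mitigate :- GetRepairHistory(?repairCount, 08:00:00, System.Reboot), ?repairCount < 1, !, ScheduleMachineRepair(System.Reboot).```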
+**RestartFabricSystemProcess**

+Arguments:

+- DoHealthChecks (Boolean), Optional
+- MaxWaitTimeForHealthStateOk (TimeSpan), Optional
+- MaxExecutionTime (TimeSpan), Optional

+Attempts to restart a system service process that is misbehaving as per the FO health data.
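Example (taken from the SystemApp rules updated in this commit, where TimeScopedRestartFabricSystemProcess is an internal predicate wrapping the repair):

```TimeScopedRestartFabricSystemProcess(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount < ?count, RestartFabricSystemProcess(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:05:00, MaxExecutionTime=00:10:00).```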
-```DeleteFiles()```
+**DeleteFiles**

+Arguments:

+- Path (String, **must always be the first argument**), Required.
+- SortOrder (String, supported values are Ascending, Descending), Optional.
+- MaxFilesToDelete (long), Optional.
+- RecurseSubdirectories (Boolean), Optional.
+- SearchPattern (String), Optional.

+Attempts to delete files in a supplied path. You can supply target path, max number of files to remove, sort order (Ascending/Descending).
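Example (a sketch; the folder path here is purely illustrative - supply your own target path):

```Mitigate(MetricName="DiskSpacePercent") :- DeleteFiles("C:\SFDevCluster\Log\Traces", SortOrder=Ascending, MaxFilesToDelete=50, RecurseSubdirectories=false).```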
**Helper Predicates**

-```LogInfo(), LogWarning(), LogError()```
+**LogInfo, LogWarning, LogError**

These will emit a telemetry/ETW/health event at the corresponding level (Info, Warning, Error) from a rule and can help with debugging, auditing, upstream action (ETW/Telemetry -> Alerts, for example).

-```GetRepairHistory()```
+Arguments:

+- Message (String), Required

+Example (the input string is a formatted string with arguments, in this case):

+```LogInfo("0042_{0}: Specified Machine repair escalations have been exhausted for node {0}. Human intervention is required.", ?nodeName)```

+Example (simple string, unformatted):

+```LogInfo("This is a message...")```
+**GetRepairHistory**

+Gets the number of times a repair has been run within a supplied time window.

-```CheckFolderSize()```
+This is an example of a predicate that takes up to 3 arguments, where the first one must be a variable (as it will hold the result, which can then be used in subsequent subgoals within the rule).

+Example:

+```GetRepairHistory(?repairCount, 08:00:00, System.Azure.Heal)```

+The above example specifies that a variable named ?repairCount will hold the value of how many times System.Azure.Heal machine repair jobs were completed in 8 hours. For machine repairs, you must specify the action name. For other types of repairs, you do not need to do that. E.g., for a service-level repair,

+```GetRepairHistory(?repairCount, 02:00:00)``` is all FH needs, as it keeps track of the repairs that it executes (where it is the Executor, which it never is for machine-level repairs).
+**CheckFolderSize**

+Checks the size of a specified folder (full path) and returns a boolean value indicating whether the supplied max size (MB or GB) has been reached or exceeded.

-```CheckInsideRunInterval()```
+Arguments:

+- FolderPath (string), Required. You do not specify a name, just the value.
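Example (a sketch; the path and the MaxFolderSizeMB named argument are illustrative assumptions here - check the shipped disk rules for the exact supported named arguments):

```Mitigate(MetricName="FolderSizeMB", MetricValue=?MetricValue) :- CheckFolderSize("C:\SFDevCluster\Log\Traces", MaxFolderSizeMB=1024), ...```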
+**CheckInsideRunInterval**

+Checks if some repair has already run once within the specified time frame (TimeSpan).

+Arguments:

-**Forming a Logic Repair Workflow**
+- [Unnamed], (TimeSpan), Required. (You do not specify a name, just the value.)
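Example (this appears, commented out by default, at the top of the shipped rules files):

```Mitigate :- CheckInsideRunInterval(00:30:00), !.```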
+**CheckInsideHealthStateMinDuration**

+Checks to see if the entity has been in some HealthState for the specified duration. Facts like EntityType and HealthState are already known to FabricHealer. They are therefore not arguments this predicate supports (it doesn't have to).

+Arguments:

+[Unnamed], (TimeSpan), Required.

+Example:

+```
+## Don't proceed if the target node hasn't been in Error (including cyclic Up/Down) state for at least two hours.
+Mitigate :- CheckInsideHealthStateMinDuration(02:00:00), !.
+```
+**CheckOutstandingRepairs**

+Checks that the number of repairs (repair tasks) currently in flight in the cluster is less than or equal to the supplied argument.

+Arguments:

+[Unnamed], (long), Required.

+```
+## Don't proceed if there are already 2 or more entity-specific (implicit fact) repairs currently active in the cluster.
+Mitigate :- CheckOutstandingRepairs(2), !.
+```
+**CheckInsideScheduleInterval**

+Checks if the last related (context is the type of repair, which FH already knows) repair was scheduled within the supplied TimeSpan.

+Arguments:

+[Unnamed], (TimeSpan), Required.

+Example:

+```
+## Don't proceed if FH scheduled a machine repair less than 10 minutes ago.
+Mitigate :- CheckInsideScheduleInterval(00:10:00), !.
+```
+**CheckInsideNodeProbationPeriod**

+This is for machine or node level repairs. It checks if the last related repair was Completed less than the supplied TimeSpan ago.

+Arguments:

+[Unnamed], (TimeSpan), Required.

+Example:

+```
+## Don't proceed if target node is currently inside a post-repair health probation period (post-repair means a Completed repair; target node is still recovering).
+Mitigate :- CheckInsideNodeProbationPeriod(00:30:00), !.
+```
+**LogRule**

+Logs the entire repair rule in which LogRule is specified.

+Arguments:

+[Unnamed], (long), Required.

+Example:

+```
+Mitigate(Source=?source, Property=?property) :- LogRule(64), match(?source, "SomeOtherWatchdog"),
+	match(?property, "SomeOtherFailure"), DeactivateFabricNode(RemoveData).
+```

+The logic rule above begins at line number 64 in the related file (in this case, MachineRules.guan). By specifying LogRule(64), FabricHealer will emit the rule in its entirety as an SF health event (Ok HealthState), an ETW event and a telemetry event (ApplicationInsights/LogAnalytics). This is very useful for debugging and auditing rules.
+### Forming a Logic Repair Workflow

Now that we know what predicates are, let's learn how to form a logic repair workflow.
@@ -135,7 +309,7 @@ now the variable ?x is bound to the application name and ?y is bound to the serv
Here's a simple example that we've seen before:

```
-Mitigate() :- RestartCodePackage().
+Mitigate :- RestartCodePackage.
```

Essentially, what this logic repair workflow (mitigation scenario) describes is that if FO emits a health event Warning/Error related to any application entity and the App repair policy is enabled, then we will execute the repair action. FH will automatically detect that it is a logic workflow, so it will invoke the root rule ```Mitigate()```. Guan determines that the ```Mitigate()``` rule is defined inside the repair action, and it will then try to execute the body of the ```Mitigate()``` rule.
@@ -143,8 +317,8 @@ Essentially, what this logic repair workflow (mitigation scenario) is describing
Users can define multiple rules (separated by a newline) as part of a repair workflow; here is an example:

```
-Mitigate() :- RestartCodePackage().
-Mitigate() :- RestartFabricNode().
+Mitigate :- RestartCodePackage.
+Mitigate :- RestartFabricNode.
```

This seems confusing, as we've defined ```Mitigate()``` twice. Here is the execution flow explained in words: "Look for the *first* ```Mitigate()``` rule (read from top to bottom). The *first* ```Mitigate()``` rule is the one that calls ```RestartCodePackage()``` in its body. So we try to run the first rule. If the first rule fails (i.e. ```RestartCodePackage()``` returns false), then we check to see if there is another rule named ```Mitigate()```, which there is. The next ```Mitigate()``` rule we find is the one that calls ```RestartFabricNode()```, so we try to run the second rule."
@@ -154,18 +328,18 @@ This concept of retrying rules is important to understand. Imagine your goal is
**Important Syntax Rules**: Each rule must end with a period, and a single rule may be split up across multiple lines for readability:

```
-Mitigate() :- PredicateA(), <-- The first predicate in the rule must be inline with the Head of the rule like so
-              PredicateB(),
-              PredicateC().
+Mitigate :- PredicateA, <-- The first predicate in the rule must be inline with the Head of the rule like so
+            PredicateB,
+            PredicateC.
```

The following would be invalid:

```
-Mitigate() :-
-    PredicateA(),
-    PredicateB(),
-    PredicateC().
+Mitigate :-
+    PredicateA,
+    PredicateB,
+    PredicateC.
```

**Modelling Boolean Operators**
@@ -174,7 +348,7 @@ Let's look at how we can create AND/OR/NOT statements in Guan logic repair workf

**NOT**
```
-Mitigate() :- not((condition A)), (true branch B).
+Mitigate :- not((condition A)), (true branch B).
Can be read as: if (!A) then goto B
```
@@ -183,7 +357,7 @@ NOT behaviour is achieved by wrapping any predicate inside ```not()``` which is

**AND**
```
-Mitigate() :- (condition A), (condition B), (true branch C).
+Mitigate :- (condition A), (condition B), (true branch C).
Can be read as: if (A and B) then goto C
```
@@ -191,8 +365,8 @@ AND behavior is achieved by separating predicates with commas, similar to progra

**OR**
```
-Mitigate() :- (condition A), (true branch C).
-Mitigate() :- (condition B), (true branch C).
+Mitigate :- (condition A), (true branch C).
+Mitigate :- (condition B), (true branch C).
Can be read as: if (A or B) then goto C
```
@@ -203,8 +377,8 @@ OR behaviour is achieved by separating predicates by rule. Here is the execution
So far we've only looked at creating rules that are invoked from the root ```Mitigate()``` query, but users can also create their own rules like so:

```
-MyInternalPredicate() :- RestartCodePackage().
-Mitigate() :- MyInternalPredicate().
+MyInternalPredicate :- RestartCodePackage.
+Mitigate :- MyInternalPredicate.
```

Here we've defined an internal predicate named ```MyInternalPredicate()``` and we can see that it is invoked in the body of the ```Mitigate()``` rule. In order to fulfill the ```Mitigate()``` rule, we will need to fulfill the ```MyInternalPredicate()``` predicate, since it is part of the body of the ```Mitigate()``` rule. This repair workflow is identical in behaviour to one that directly calls ```RestartCodePackage()``` inside the body of ```Mitigate()```.
@@ -221,31 +395,17 @@ IntervalForRepairTarget(AppName="fabric:/CpuStress", RunInterval=00:15:00).
IntervalForRepairTarget(AppName="fabric:/ContainerFoo2", RunInterval=00:15:00).
IntervalForRepairTarget(MetricName="ActiveTcpPorts", RunInterval=00:15:00).

-Mitigate() :- IntervalForRepairTarget(?target, ?runinterval), CheckInsideRunInterval(?runinterval), !.
+Mitigate :- IntervalForRepairTarget(?target, ?runinterval), CheckInsideRunInterval(?runinterval), !.
```

IMPORTANT: the state machine holding the data that the CheckInsideRunInterval predicate compares your specified RunInterval TimeSpan value against is our friendly neighborhood RepairManagerService (RM), a stateful Service Fabric system service that orchestrates repairs and manages repair state. ***FH requires the presence of RM in order to function***.

### A note on named arguments in sub-rules

In Guan, a named argument is used to express an argument that is optional. It is also useful for adding names that describe what the argument represents. However, be careful here. Named arguments are not positional arguments. In fact, they are optional and should be treated as such in predicate implementations. ***In a nutshell, do not name required (positional) arguments in rules***.
If you write your own external predicates, then please keep this in mind: required arguments are positional, named arguments are not. Let's take a quick trip to a small and important piece of an external predicate implementation, for RestartCodePackage.

```C#
private RestartCodePackagePredicateType(string name)
    : base(name, true, 0)
{

}
```

The base type for all predicates in Guan is PredicateType. Its constructor takes a required string parameter (the name of the predicate as it is used in a rule) and optional parameters for public visibility (bool) and minimum/maximum positional arguments. Note that for RestartCodePackagePredicateType the minimum positional argument count is set to 0, which means there are no required arguments. That said, look at how it is called as the definition of an internal predicate, TimeScopedRestartCodePackage, in the App rules file:

```
-TimeScopedRestartCodePackage() :- RestartCodePackage(DoHealthChecks=true, MaxWaitTimeForHealthStateOk=00:10:00).
+TimeScopedRestartCodePackage :- RestartCodePackage(DoHealthChecks=true, MaxWaitTimeForHealthStateOk=00:10:00).
```
This means that the RestartCodePackage predicate takes two optional args (see RestartCodePackagePredicateType.cs to see how handling optional arguments is implemented) and, again, 0 required arguments. This is important to remember as you build rules for existing predicates and when you write your own.
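To make that shape concrete, here is a minimal sketch of a custom external predicate that follows the same pattern. The type name and the trivial CheckAsync body are invented for illustration, and exact Guan API details may differ slightly; the real implementations live under FabricHealer/Repair/Guan:

```C#
using System.Threading.Tasks;
using Guan.Logic;

// Hypothetical external predicate: usable in a rule as MyCheck or MyCheck(...).
public class MyCheckPredicateType : PredicateType
{
    private static MyCheckPredicateType instance;

    public static MyCheckPredicateType Singleton(string name)
    {
        return instance ??= new MyCheckPredicateType(name);
    }

    private MyCheckPredicateType(string name)
        : base(name, true, 0) // public predicate; 0 required (positional) arguments.
    {
    }

    public override PredicateResolver CreateResolver(CompoundTerm input, Constraint constraint, QueryContext context)
    {
        return new Resolver(input, constraint, context);
    }

    private sealed class Resolver : BooleanPredicateResolver
    {
        public Resolver(CompoundTerm input, Constraint constraint, QueryContext context)
            : base(input, constraint, context)
        {
        }

        protected override Task<bool> CheckAsync()
        {
            // Named/optional arguments would be read from Input here. Returning true
            // succeeds the subgoal; returning false fails it (and triggers backtracking).
            return Task.FromResult(true);
        }
    }
}
```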
@@ -274,7 +434,7 @@ Mitigate(AppName="fabric:/System", MetricName="EphemeralPorts", MetricValue=?Met
	TimeScopedRestartFabricNode(5, 01:00:00).
```

-**Filtering parameters from Mitigate()**
+**Filtering constraints in the head of a rule**

If you wish to do a single test for equality, such as ```?AppName == "fabric:/App1"```, you don't actually need to write this in the body of your rules; instead, you can specify these values inside Mitigate() like so:
@@ -283,7 +443,7 @@ If you wish to do a single test for equality such as ```?AppName == "fabric:/App
Mitigate(AppName="fabric:/App1") :- ...
```

-What that means, is that the rule will only execute when the AppName is "fabric:/App1". This is equivalent to the following:
+The above specifies that subgoals will only execute if the AppName fact is "fabric:/App1". This is equivalent to the following:

```
Mitigate(AppName=?appName) :- ?appName == "fabric:/App1", ...
@@ -303,3 +463,36 @@ Or, you are only interested in any AppName that is not fabric:/App1 or fabric:/A
```
Mitigate(AppName=?appName) :- not(?appName == "fabric:/App1" || ?appName == "fabric:/App42"), ...
```

+### Optional Arguments

+FabricHealer's Guan predicate implementations support optional arguments, and they can be specified in any order, as long as they are also named.

+E.g.,

+```Mitigate(AppName="fabric:/FooBar", MetricName="MemoryMB") :- RestartReplica(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:05:00, MaxExecutionTime=00:15:00).```

+The above rule will execute the subgoal when the AppName fact is "fabric:/FooBar" and the MetricName is "MemoryMB" (both facts would only come from FabricObserver or FHProxy). Let's just look at this piece:
+```RestartReplica(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:05:00, MaxExecutionTime=00:15:00)```

+Note the arguments. It could also be written as

+```RestartReplica(false, 00:05:00, 00:15:00)```

+However, if you just wanted to supply either the health checks boolean, the wait timespan, or the execution timespan argument, then you would have to also employ the relevant name:

+```RestartReplica(MaxWaitTimeForHealthStateOk=00:05:00)```

+Or

+```RestartReplica(MaxExecutionTime=00:15:00)```

+Or

+```RestartReplica(DoHealthChecks=false)```

+Or

+```RestartReplica(DoHealthChecks=false, MaxExecutionTime=00:15:00)```

+Hopefully, you can see the pattern here: you can employ any combination of predicate arguments in FabricHealer's Guan predicates (not the case in Guan's system predicates, however), but **they must each be named if you do not specify all of them**.
@@ -44,7 +44,7 @@ Here is a full example of exactly what is sent in one of these telemetry events,
    "ClusterId": "00000000-1111-1111-0000-00f00d000d",
    "ClusterType": "SFRP",
    "NodeNameHash": "3e83569d4c6aad78083cd081215dafc81e5218556b6a46cb8dd2b183ed0095ad",
-    "FHVersion": "1.1.18",
+    "FHVersion": "1.1.19",
    "UpTime": "00:00:00.2164523",
    "Timestamp": "2023-02-07T21:45:25.2443014Z",
    "OS": "Windows",
@@ -428,6 +428,70 @@ namespace FHTest
            }
        }

+        [TestMethod]
+        public async Task AllFabricNodeRules_EnsureWellFormedRules_QueryInitialized_Successful()
+        {
+            // This will be the data used to create a repair task.
+            var repairData = new TelemetryData
+            {
+                EntityType = EntityType.Node,
+                NodeName = NodeName,
+                HealthState = HealthState.Error
+            };
+
+            repairData.RepairPolicy = new RepairPolicy
+            {
+                RepairId = $"Test42_FabricNodeRepair_{NodeName}",
+                RepairIdPrefix = RepairConstants.FHTaskIdPrefix,
+                NodeName = repairData.NodeName,
+                HealthState = repairData.HealthState
+            };
+
+            var executorData = new RepairExecutorData
+            {
+                RepairPolicy = repairData.RepairPolicy
+            };
+
+            var file = Path.Combine(Environment.CurrentDirectory, "PackageRoot", "Config", "LogicRules", "FabricNodeRules.guan");
+            FabricHealerManager.CurrentlyExecutingLogicRulesFileName = "FabricNodeRules.guan";
+            List<string> repairRules = FabricHealerManager.ParseRulesFile(await File.ReadAllLinesAsync(file, token));
+
+            try
+            {
+                TimeSpan maxTestTime = TimeSpan.FromSeconds(30);
+
+                // don't block here.
+                _ = TestInitializeGuanAndRunQuery(repairData, repairRules, executorData);
+
+                var repairTasks = await fabricClient.RepairManager.GetRepairTaskListAsync(
+                                    RepairConstants.FHTaskIdPrefix, RepairTaskStateFilter.Active, null);
+
+                Stopwatch timer = Stopwatch.StartNew();
+
+                while (timer.Elapsed < maxTestTime)
+                {
+                    repairTasks = await fabricClient.RepairManager.GetRepairTaskListAsync(
+                                    RepairConstants.FHTaskIdPrefix, RepairTaskStateFilter.Active, null);
+
+                    if (!repairTasks.Any(r => r.Action == "RestartFabricNode"))
+                    {
+                        await Task.Delay(1000);
+                        continue;
+                    }
+
+                    await FabricRepairTasks.CancelRepairTaskAsync(repairTasks.First(r => r.Action == "RestartFabricNode"));
+                    return;
+                }
+
+                throw new InternalTestFailureException("FabricNode repair task did not get created within max test time of 30s.");
+            }
+            catch (GuanException ge)
+            {
+                throw new AssertFailedException(ge.Message, ge);
+            }
+        }

        [TestMethod]
        public async Task AllReplicaRules_EnsureWellFormedRules_QueryInitialized_Successful()
        {
@@ -53,7 +53,7 @@
## By having this as a top level rule, it means no subsequent rules in this file will run if we are inside the specified run interval.
## This is commented out by default. Just uncomment and set the global run interval for app level repairs to suit your needs.

-## Mitigate() :- CheckInsideRunInterval(00:30:00), !.
+## Mitigate :- CheckInsideRunInterval(00:30:00), !.

## Repair Lifetime Management - Specify that a repair can only run until some end date. This gives the dev team time to identify and fix the bug in user code that is causing the problem.
## Meanwhile, FabricHealer will keep the offending service green. Remember: Auto-mitigation is not a fix, it's a stop gap. Fix those bugs!
@@ -63,12 +63,12 @@
## The rule below reads: If any of the specified (set in Mitigate) app's service processes have put it into Warning due to CPU over-consumption and today's date is later than the supplied end date, emit a message, stop processing rules (!).
## You can use LogInfo, LogWarning or LogError predicates to generate a log event that will create a local text log entry, an ETW event, and an SF health report.

-Mitigate(AppName="fabric:/CpuStress", MetricName="CpuPercent") :- time() > DateTime("12/31/2022"),
+Mitigate(AppName="fabric:/CpuStress", MetricName="CpuPercent") :- time() > DateTime("4/15/2023"),
	LogInfo("Exceeded specified end date for repair of fabric:/CpuStress CpuPercent usage violations. Target end date: {0}. Current date (Utc): {1}", DateTime("12/31/2022"), time()), !.

## Alternatively, you could enforce repair end dates inline (as a subrule) to any rule, e.g.,

-Mitigate(AppName="fabric:/PortEater42", MetricName="EphemeralPorts", MetricValue=?MetricValue) :- time() < DateTime("12/30/2022"),
+Mitigate(AppName="fabric:/PortEater42", MetricName="EphemeralPorts", MetricValue=?MetricValue) :- time() < DateTime("4/15/2023"),
	?MetricValue >= 8500,
	TimeScopedRestartCodePackage(4, 01:00:00).

@@ -127,7 +127,6 @@ Mitigate(AppName="fabric:/MemoryStress", MetricName="MemoryPercent", MetricValue
## Memory - Megabytes In Use for Any SF Service Process belonging to the specified SF Applications. 5 repairs within 5 hour window.
Mitigate(AppName="fabric:/ContainerFoo", MetricName="MemoryMB") :- TimeScopedRestartCodePackage(5, 05:00:00).
Mitigate(AppName="fabric:/ContainerFoo2", MetricName="MemoryMB") :- TimeScopedRestartCodePackage(5, 05:00:00).
-Mitigate(AppName="fabric:/TestApp42", MetricName="MemoryMB") :- TimeScopedRestartCodePackage(5, 05:00:00).

## Note the constraint on HealthState in the head of the rule below, which only applies to one service, fabric:/Voting/VotingData, in this example (just change the fabric Uri for your target).
## This is important when you have both Warning and Error thresholds specified for some service for some metric in FabricObserver. You would do that
@@ -193,7 +192,7 @@ TimeScopedRestartReplica(?count, ?time) :- GetRepairHistory(?repairCount, ?time)
## by the repair executor (FabricHealer).

## Restart all replicas hosted in the process. If you do not specify a value for MaxExecutionTime argument, the default is 60 minutes.
-TimeScopedRestartCodePackage() :- RestartCodePackage(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:10:00, MaxExecutionTime=00:30:00).
+TimeScopedRestartCodePackage :- RestartCodePackage(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:10:00, MaxExecutionTime=00:30:00).

## Restart individual replica hosted in the process. If you do not specify a value for MaxExecutionTime argument, the default is 60 minutes.
-TimeScopedRestartReplica() :- RestartReplica(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:05:00, MaxExecutionTime=00:15:00).
+TimeScopedRestartReplica :- RestartReplica(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:05:00, MaxExecutionTime=00:15:00).
@@ -28,7 +28,7 @@
## First, check if we are inside run interval. If so, then cut (!).
## This is commented out by default. Just uncomment and set the global run interval for disk level repairs to suit your needs.

-## Mitigate() :- CheckInsideRunInterval(02:00:00), !.
+## Mitigate :- CheckInsideRunInterval(02:00:00), !.

## DeleteFiles external predicate takes 1 required positional argument (position 0), which is the full path to the directory to be cleaned, and 4 optional named arguments.
## Optional (named, position 1 to n in argument list, first arg (0) is reserved for folder path) arguments for DeleteFiles:
@@ -1,11 +1,5 @@
-## Logic rules for Service Fabric Node repairs.
+## Logic rule examples for Service Fabric Node repairs.

-## First check if we are inside the run interval. If so, cut (!). This means that no other rules will be processed (no back-tracking).
-## This is commented out by default. Just uncomment and set the global run interval for app Fabric node level repairs to suit your needs.
-
-## Mitigate() :- CheckInsideRunInterval(02:00:00), !.
-
-## This rule means that whatever the Fabric node-level warning data from the issuing service happens to be, restart the target Fabric node if
-## the repair hasn't run 4 times in the last 8 hours.
-
-Mitigate() :- GetRepairHistory(?repairCount, 08:00:00), ?repairCount < 4, RestartFabricNode(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:45:00, MaxExecutionTime=02:00:00).
+## Restart/Deactivate. Try Restarting the node twice in an 8 hour window. Else, deactivate (with node impact = RemoveData) the node.
+Mitigate(HealthState=Error) :- GetRepairHistory(?repairCount, 08:00:00), ?repairCount < 2, !, RestartFabricNode.
+Mitigate(HealthState=Error) :- GetRepairHistory(?repairCount, 08:00:00), ?repairCount < 2, DeactivateFabricNode(RemoveData).
@@ -7,13 +7,13 @@ Mitigate(Property=?property, Source=?source) :- not(?property == InfrastructureE

## Reboot.
## Don't process any other rules if scheduling succeeds OR fails (note the position of ! (cut operator)) and there are less than 1 of these repairs that have completed in the last 8 hours.
-Mitigate() :- GetRepairHistory(?repairCount, 08:00:00, System.Reboot), ?repairCount < 1, !, ScheduleMachineRepair(System.Reboot).
+Mitigate :- GetRepairHistory(?repairCount, 08:00:00, System.Reboot), ?repairCount < 1, !, ScheduleMachineRepair(System.Reboot).

## ReimageOS escalation. *This is not supported in VMSS-managed clusters*.
-Mitigate() :- GetRepairHistory(?repairCount, 08:00:00, System.ReimageOS), ?repairCount < 1, !, ScheduleMachineRepair(System.ReimageOS).
+Mitigate :- GetRepairHistory(?repairCount, 08:00:00, System.ReimageOS), ?repairCount < 1, !, ScheduleMachineRepair(System.ReimageOS).

## Azure.Heal escalation.
-Mitigate() :- GetRepairHistory(?repairCount, 08:00:00, System.Azure.Heal), ?repairCount < 1, !, ScheduleMachineRepair(System.Azure.Heal).
+Mitigate :- GetRepairHistory(?repairCount, 08:00:00, System.Azure.Heal), ?repairCount < 1, !, ScheduleMachineRepair(System.Azure.Heal).

## Triage escalation.
Mitigate(NodeName=?nodeName) :- LogInfo("0042_{0}: Specified Machine repair escalations have been exhausted for node {0}. Human intervention is required.", ?nodeName),
@@ -9,12 +9,12 @@
## First, check if we are inside run interval. If so, then cut (!), which effectively means stop processing rules (no backtracking to subsequent rules in the file).
## This is commented out by default. Just uncomment and set the global run interval for replica level repairs to suit your needs.

-## Mitigate() :- CheckInsideRunInterval(00:15:00), !.
+## Mitigate :- CheckInsideRunInterval(00:15:00), !.

## Set a repair count variable for use by any rule in this file (NOTE: all rules must have the same TimeWindow value) as an internal predicate, _mitigate(?count),
## where ?repairCount and ?count variables are unified when _mitigate(?count) predicate runs. The concept here is sharing a variable value across different rules.

-Mitigate() :- GetRepairHistory(?repairCount, 01:00:00), _mitigate(?repairCount).
+Mitigate :- GetRepairHistory(?repairCount, 01:00:00), _mitigate(?repairCount).

## Now, let's say you wanted to only repair specific Apps or Partitions where related repair TimeWindow values are *not* the same, unlike the above "global" variable rule.
## You could do something like the below three rules, which would mean the _mitigate internal predicate would only run if the supplied Mitigate argument values are matched:
@@ -29,15 +29,13 @@
## Note: FO only generates Application (System) level warnings for system services. There will only ever be ApplicationName as "fabric:/System" in the FO health data that FH emits, so this is an optional argument.
## This is commented out by default. Just uncomment and set the global run interval for System app level repairs to suit your needs.

-## Mitigate() :- CheckInsideRunInterval(00:10:00), !.
+## Mitigate :- CheckInsideRunInterval(00:10:00), !.

## TimeScopedRestartFabricNode is an internal predicate to check for the number of times a system service node restart repair has run to completion within a supplied time window.
## If Completed Repair count is less than the supplied value, then run RestartFabricNode mitigation.

-TimeScopedRestartFabricNode(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount < ?count,
-	RestartFabricNode(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:45:00, MaxExecutionTime=02:00:00).
+TimeScopedRestartFabricNode(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount < ?count, RestartFabricNode(DoHealthChecks=false).

## TimeScopedRestartFabricSystemProcess is an internal predicate to check for the number of times a System service process restart repair has run to completion within a supplied time window.
## If Completed Repair count is less than the supplied value, then run RestartFabricSystemProcess mitigation.
@@ -45,11 +43,9 @@ TimeScopedRestartFabricNode(?count, ?time) :- GetRepairHistory(?repairCount, ?ti
TimeScopedRestartFabricSystemProcess(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount < ?count,
	RestartFabricSystemProcess(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:05:00, MaxExecutionTime=00:10:00).
-
## Mitigation rules for multiple metrics and targets. NOTE: Do not restart Fabric or FabricHost processes unless you want to take the Fabric node down. For the latter (restart node),
## use TimeScopedRestartFabricNode (or RestartFabricNode predicate directly), which employs a safe Fabric node restart workflow (with deactivation step), not just a process kill.
-
## CPU Time - Percent

Mitigate(MetricName="CpuPercent", ProcessName=?SysProcName) :- not(?SysProcName == "Fabric" || ?SysProcName == "FabricHost"),
@@ -57,7 +53,6 @@ Mitigate(MetricName="CpuPercent", ProcessName=?SysProcName) :- not(?SysProcName
	?HealthEventCount >= 3,
	TimeScopedRestartFabricSystemProcess(5, 01:00:00).
-
## Memory Use - Megabytes in use

Mitigate(MetricName="MemoryMB", ProcessName=?SysProcName) :- not(?SysProcName == "Fabric" || ?SysProcName == "FabricHost"),
@@ -69,7 +64,6 @@ Mitigate(MetricName="MemoryMB", ProcessName=?SysProcName) :- not(?SysProcName ==
##	?HealthEventCount >= 3,
##	TimeScopedRestartFabricNode(1, 01:00:00).
-
## Memory Use - Percent in use

Mitigate(MetricName="MemoryPercent", ProcessName=?SysProcName) :- not(?SysProcName == "Fabric" || ?SysProcName == "FabricHost"),
@@ -77,7 +71,6 @@ Mitigate(MetricName="MemoryPercent", ProcessName=?SysProcNa
	?HealthEventCount >= 3,
	TimeScopedRestartFabricSystemProcess(5, 01:00:00).
-
## Ephemeral Ports in Use

Mitigate(MetricName="EphemeralPorts", ProcessName=?SysProcName) :- not(?SysProcName == "Fabric" || ?SysProcName == "FabricHost"),
@@ -85,7 +78,6 @@ Mitigate(MetricName="EphemeralPorts", ProcessName=?SysProcN
	?HealthEventCount >= 3,
	TimeScopedRestartFabricSystemProcess(5, 01:00:00).
-
## Threads

Mitigate(MetricName="Threads", ProcessName=?SysProcName) :- not(?SysProcName == "Fabric" || ?SysProcName == "FabricHost"),
@@ -103,7 +95,6 @@ Mitigate(MetricName="FileHandles", ProcessName=?SysProcNa
	?HealthEventCount >= 3,
	TimeScopedRestartFabricSystemProcess(15, 01:00:00).
-
## Open File Handles - Linux-only: Any SF system service besides Fabric or FabricHost.
## Restart the offending Fabric system process.

@@ -112,7 +103,6 @@ Mitigate(MetricName="FileHandles", OS="Linux", ProcessName=?SysProcName) :- not(
	?HealthEventCount >= 3,
	TimeScopedRestartFabricSystemProcess(5, 01:00:00).
-
## Open File Handles - Linux OS, Fabric process. In these cases, we want a safe (graceful) restart of the Fabric node; not just kill the process, which will restart the node, but not gracefully.
## Restart the Fabric node where the offending instance is running.
@@ -67,5 +67,6 @@ TimeScopedRestartCodePackage(?count, ?time) :- GetRepairHistory(?repairCount, ?t
TimeScopedRestartReplica(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount >= ?count,
	LogInfo("Exhausted specified run count, {0}, within specified max repair time window, {1}. Will not attempt RestartReplica repair at this time.", ?count, ?time).

-TimeScopedRestartCodePackage() :- RestartCodePackage(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:10:00).
-TimeScopedRestartReplica() :- RestartReplica(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:10:00).
+TimeScopedRestartCodePackage :- RestartCodePackage(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:10:00).
+TimeScopedRestartReplica :- RestartReplica(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:10:00).
+TimeScopedRestartFabricNode(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount < ?count, RestartFabricNode.
@@ -10,5 +10,5 @@ TimeScopedRestartCodePackage(?count, ?time) :- GetRepairHistory(?repairCount, ?t
TimeScopedRestartReplica(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount >= ?count,
	LogInfo("Exhausted specified run count, {0}, within specified max repair time window, {1}. Will not attempt RestartReplica repair at this time.", ?count, ?time).

-TimeScopedRestartCodePackage() :- RestartCodePackage(DoHealthChecks=false, MaxExecutionTime=00:00:02).
-TimeScopedRestartReplica() :- RestartReplica(DoHealthChecks=false, MaxExecutionTime=00:00:02).
+TimeScopedRestartCodePackage :- RestartCodePackage(DoHealthChecks=false, MaxExecutionTime=00:00:02).
+TimeScopedRestartReplica :- RestartReplica(DoHealthChecks=false, MaxExecutionTime=00:00:02).
@@ -2,7 +2,7 @@
<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata minClientVersion="3.3.0">
    <id>%PACKAGE_ID%</id>
-    <version>1.1.18</version>
+    <version>1.1.19</version>
    <releaseNotes>
      This release requires Service Fabric runtime version 9 and higher and at least Service Fabric SDK version 6.0.1017. There are several changes and improvements in this
      release including a new machine repair model, updated logic rules, bug fixes, and many code improvements.
@@ -25,8 +25,8 @@ Project("{2150E333-8FDC-42A3-9474-1A3956D46DE8}") = "Solution Items", "Solution
		Documentation\OperationalTelemetry.md = Documentation\OperationalTelemetry.md
		README.md = README.md
		Documentation\Deployment\service-fabric-healer.json = Documentation\Deployment\service-fabric-healer.json
-		Documentation\Deployment\service-fabric-healer.v1.1.18.parameters.json = Documentation\Deployment\service-fabric-healer.v1.1.18.parameters.json
		Documentation\Using.md = Documentation\Using.md
+		Documentation\Deployment\service-fabric-healer.v1.1.19.parameters.json = Documentation\Deployment\service-fabric-healer.v1.1.19.parameters.json
	EndProjectSection
EndProject
Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "FHTest", "FHTest\FHTest.csproj", "{8D9712BF-C026-4A36-B6D1-6345137D3B6F}"
@@ -12,8 +12,8 @@
		<RuntimeIdentifier>win-x64</RuntimeIdentifier>-->
		<RuntimeIdentifiers>linux-x64;win-x64</RuntimeIdentifiers>
		<Product>FabricHealer</Product>
-		<Version>1.1.18</Version>
-		<FileVersion>1.1.18</FileVersion>
+		<Version>1.1.19</Version>
+		<FileVersion>1.1.19</FileVersion>
		<StartupObject>FabricHealer.Program</StartupObject>
		<Platforms>x64</Platforms>
	</PropertyGroup>
@@ -37,7 +37,7 @@ namespace FabricHealer
        public static string CurrentlyExecutingLogicRulesFileName { get; set; }

        // Folks often use their own version numbers. This is for internal diagnostic telemetry.
-        private const string InternalVersionNumber = "1.1.18";
+        private const string InternalVersionNumber = "1.1.19";
        private static FabricHealerManager singleton;
        private static FabricClient _fabricClient;
        private bool disposedValue;
@@ -53,7 +53,7 @@
## By having this as a top level rule, it means no subsequent rules in this file will run if we are inside the specified run interval.
## This is commented out by default. Just uncomment and set the global run interval for app level repairs to suit your needs.

-## Mitigate() :- CheckInsideRunInterval(00:30:00), !.
+## Mitigate :- CheckInsideRunInterval(00:30:00), !.

## Repair Lifetime Management - Specify that a repair can only run until some end date. This gives the dev team time to identify and fix the bug in user code that is causing the problem.
## Meanwhile, FabricHealer will keep the offending service green. Remember: Auto-mitigation is not a fix, it's a stop gap. Fix those bugs!
@@ -63,12 +63,12 @@
## The rule below reads: If any of the specified (set in Mitigate) app's service processes have put it into Warning due to CPU over-consumption and today's date is later than the supplied end date, emit a message, stop processing rules (!).
## You can use LogInfo, LogWarning or LogError predicates to generate a log event that will create a local text log entry, an ETW event, and an SF health report.

-Mitigate(AppName="fabric:/CpuStress", MetricName="CpuPercent") :- time() > DateTime("12/31/2022"),
+Mitigate(AppName="fabric:/CpuStress", MetricName="CpuPercent") :- time() > DateTime("4/15/2023"),
	LogInfo("Exceeded specified end date for repair of fabric:/CpuStress CpuPercent usage violations. Target end date: {0}. Current date (Utc): {1}", DateTime("12/31/2022"), time()), !.

## Alternatively, you could enforce repair end dates inline (as a subrule) to any rule, e.g.,

-Mitigate(AppName="fabric:/PortEater42", MetricName="EphemeralPorts", MetricValue=?MetricValue) :- time() < DateTime("12/30/2022"),
+Mitigate(AppName="fabric:/PortEater42", MetricName="EphemeralPorts", MetricValue=?MetricValue) :- time() < DateTime("4/15/2023"),
	?MetricValue >= 8500,
	TimeScopedRestartCodePackage(4, 01:00:00).

@@ -192,7 +192,7 @@ TimeScopedRestartReplica(?count, ?time) :- GetRepairHistory(?repairCount, ?time)
## by the repair executor (FabricHealer).

## Restart all replicas hosted in the process. If you do not specify a value for MaxExecutionTime argument, the default is 60 minutes.
-TimeScopedRestartCodePackage() :- RestartCodePackage(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:10:00, MaxExecutionTime=00:30:00).
+TimeScopedRestartCodePackage :- RestartCodePackage(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:10:00, MaxExecutionTime=00:30:00).

## Restart individual replica hosted in the process. If you do not specify a value for MaxExecutionTime argument, the default is 60 minutes.
-TimeScopedRestartReplica() :- RestartReplica(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:05:00, MaxExecutionTime=00:15:00).
+TimeScopedRestartReplica :- RestartReplica(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:05:00, MaxExecutionTime=00:15:00).
@@ -28,7 +28,7 @@
## First, check if we are inside run interval. If so, then cut (!).
## This is commented out by default. Just uncomment and set the global run interval for disk level repairs to suit your needs.

-## Mitigate() :- CheckInsideRunInterval(02:00:00), !.
+## Mitigate :- CheckInsideRunInterval(02:00:00), !.

## DeleteFiles external predicate takes 1 required positional argument (position 0), which is the full path to the directory to be cleaned, and 4 optional named arguments.
## Optional (named, position 1 to n in argument list, first arg (0) is reserved for folder path) arguments for DeleteFiles:
@@ -1,6 +1,5 @@
## Logic rule examples for Service Fabric Node repairs.
-## These repairs are not executed by FabricHealer. FH creates repair tasks with the correct node impact specified and RM takes it from there.

-## Restart/Deactivate. Try Restart twice in 8 hour window. Else, deactivate (Pause) the Fabric node.
-Mitigate(HealthState=Error) :- GetRepairHistory(?repairCount, 08:00:00), ?repairCount < 2, !, RestartFabricNode().
-Mitigate(HealthState=Error) :- DeactivateFabricNode().
+## Restart/Deactivate. Try Restarting the node twice in an 8 hour window. Else, deactivate (with node impact = RemoveData) the node.
+Mitigate(HealthState=Error) :- GetRepairHistory(?repairCount, 08:00:00), ?repairCount < 2, !, RestartFabricNode.
+Mitigate(HealthState=Error) :- GetRepairHistory(?repairCount, 08:00:00), ?repairCount < 2, DeactivateFabricNode(RemoveData).
@ -42,28 +42,26 @@ Mitigate(HealthState=?healthState) :- not(?healthState == Error), !.
|
|||
##Mitigate(Source=?source) :- not(match(?source, "EventLogWatchdog")), !.
|
||||
|
||||
## Don't proceed if there are already 2 or more machine repairs currently active in the cluster.
|
||||
Mitigate() :- CheckOutstandingRepairs(2), !.
|
||||
Mitigate :- CheckOutstandingRepairs(2), !.
|
||||
|
||||
## Don't proceed if FH scheduled a machine repair less than 10 minutes ago.
|
||||
Mitigate() :- CheckInsideScheduleInterval(00:10:00), !.
|
||||
Mitigate :- CheckInsideScheduleInterval(00:10:00), !.
|
||||
|
||||
## Don't proceed if target node is currently inside a post-repair health probation period (post-repair means a Completed repair; target node is still recovering).
|
||||
Mitigate() :- CheckInsideNodeProbationPeriod(00:30:00), !.
|
||||
Mitigate :- CheckInsideNodeProbationPeriod(00:30:00), !.
|
||||
|
||||
## Don't proceed if the target node hasn't been in Error (including cyclic Up/Down) state for at least two hours.
|
||||
Mitigate() :- CheckInsideHealthStateMinDuration(02:00:00), !.
|
||||
Mitigate :- CheckInsideHealthStateMinDuration(00:01:00), !.
|
||||
## For certain environments, the correct mitigation is to deactivate the target node. The below rule schedules a node deactivation
## (here, the node impact level is RemoveData, but you can supply RestartNode or RemoveNode depending on your intention). If you do not specify an ImpactLevel,
## the default level used will be RestartNode.
##Mitigate(Source=?source, Property=?property) :- LogRule(59), match(?source, "EventLogWatchdog"), match(?property, "CriticalMachineFailure"),
##	DeactivateFabricNode(ImpactLevel=RemoveData).
##Mitigate(Source=?source, Property=?property) :- LogRule(59), match(?source, "EventLogWatchdog"), match(?property, "CriticalMachineFailure"), DeactivateFabricNode(RemoveData).

## If you employ multiple rules with the same repair predicate (e.g., DeactivateFabricNode(ImpactLevel=RemoveData)) and want FH to log them,
## then you must add the LogRule([LineNumber]) predicate to each rule in order for FabricHealer to trace correctly, regardless of the EnableLogicRuleTracing setting.
## Please see the Debugging/Auditing Rules section in Using.md to learn more.
##Mitigate(Source=?source, Property=?property) :- LogRule(65), match(?source, "SomeOtherWatchdog"), match(?property, "SomeOtherFailure"),
##	DeactivateFabricNode(ImpactLevel=RemoveData).
##Mitigate(Source=?source, Property=?property) :- LogRule(64), match(?source, "SomeOtherWatchdog"), match(?property, "SomeOtherFailure"), DeactivateFabricNode(RemoveData).

## Infra Mitigations (RM repair scheduling logic - InfrastructureService for the target node type will be the repair Executor, not FH).
## The logic below demonstrates how to specify a repair escalation path: Reboot -> Reimage -> Heal -> Triage (human intervention required).
@ -71,13 +69,13 @@ Mitigate() :- CheckInsideHealthStateMinDuration(02:00:00), !.

## Reboot.
## Don't process any other rules if scheduling succeeds OR fails (note the position of the ! (cut) operator) and fewer than 1 of these repairs has completed in the last 8 hours.
Mitigate() :- GetRepairHistory(?repairCount, 08:00:00, System.Reboot), ?repairCount < 1, !, ScheduleMachineRepair(System.Reboot).
Mitigate :- GetRepairHistory(?repairCount, 08:00:00, System.Reboot), ?repairCount < 1, !, ScheduleMachineRepair(System.Reboot).

## ReimageOS escalation. *This is not supported in VMSS-managed clusters*.
Mitigate() :- GetRepairHistory(?repairCount, 08:00:00, System.ReimageOS), ?repairCount < 1, !, ScheduleMachineRepair(System.ReimageOS).
Mitigate :- GetRepairHistory(?repairCount, 08:00:00, System.ReimageOS), ?repairCount < 1, !, ScheduleMachineRepair(System.ReimageOS).

## Azure.Heal escalation.
Mitigate() :- GetRepairHistory(?repairCount, 08:00:00, System.Azure.Heal), ?repairCount < 1, !, ScheduleMachineRepair(System.Azure.Heal).
Mitigate :- GetRepairHistory(?repairCount, 08:00:00, System.Azure.Heal), ?repairCount < 1, !, ScheduleMachineRepair(System.Azure.Heal).

## Triage escalation.
## If we end up here, then human intervention is required. LogInfo will generate ETW/Telemetry/Health events containing the message.
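The triage rule itself lies outside this hunk. A hedged sketch of what it might look like, assuming the NodeName binding and message text (both illustrative, not taken from this commit):

## Hypothetical triage sketch: no repair predicate; just emit ETW/Telemetry/Health events asking for operator intervention.
Mitigate(NodeName=?nodeName) :- LogInfo("Machine repair escalations exhausted for node {0}. Human intervention is required.", ?nodeName).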
@ -9,12 +9,12 @@

## First, check if we are inside the run interval. If so, then cut (!), which effectively means stop processing rules (no backtracking to subsequent rules in the file).
## This is commented out by default. Just uncomment it and set the global run interval for replica level repairs to suit your needs.

## Mitigate() :- CheckInsideRunInterval(00:15:00), !.
## Mitigate :- CheckInsideRunInterval(00:15:00), !.

## Set a repair count variable for use by any rule in this file (NOTE: all rules must have the same TimeWindow value) as an internal predicate, _mitigate(?count),
## where the ?repairCount and ?count variables are unified when the _mitigate(?count) predicate runs. The concept here is sharing a variable value across different rules.

Mitigate() :- GetRepairHistory(?repairCount, 01:00:00), _mitigate(?repairCount).
Mitigate :- GetRepairHistory(?repairCount, 01:00:00), _mitigate(?repairCount).
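The _mitigate definition is not part of this hunk. A hedged sketch of what such an internal predicate could look like, assuming a replica-level repair predicate such as RestartReplica is available (RestartReplica does not appear in this diff):

## Hypothetical internal predicate: only repair while fewer than 4 repairs have completed in the shared time window.
_mitigate(?count) :- ?count < 4, RestartReplica().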
## Now, let's say you wanted to only repair specific Apps or Partitions where the related repair TimeWindow values are *not* the same, unlike the above "global" variable rule.
## You could do something like the below three rules, which would mean the _mitigate internal predicate would only run if the supplied Mitigate argument values are matched
## (see the hedged sketch that follows).
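The three rules referenced above fall outside this hunk; a hypothetical illustration (the AppName argument, app URIs, time windows, and internal predicate names are illustrative, not taken from this commit):

## Hypothetical per-target variants, each with its own TimeWindow and its own internal predicate.
Mitigate(AppName="fabric:/App1") :- GetRepairHistory(?repairCount, 01:00:00), _mitigateApp1(?repairCount).
Mitigate(AppName="fabric:/App2") :- GetRepairHistory(?repairCount, 02:00:00), _mitigateApp2(?repairCount).
Mitigate(AppName="fabric:/App3") :- GetRepairHistory(?repairCount, 00:30:00), _mitigateApp3(?repairCount).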
@ -29,15 +29,12 @@

## Note: FO only generates Application (System) level warnings for system services. There will only ever be ApplicationName as "fabric:/System" in the FO health data that FH emits, so this is an optional argument.
## This is commented out by default. Just uncomment it and set the global run interval for System app level repairs to suit your needs.

## Mitigate() :- CheckInsideRunInterval(00:10:00), !.
## Mitigate :- CheckInsideRunInterval(00:10:00), !.

## TimeScopedRestartFabricNode is an internal predicate to check for the number of times a system service node restart repair has run to completion within a supplied time window.
## If the Completed Repair count is less than the supplied value, then run the RestartFabricNode mitigation.

TimeScopedRestartFabricNode(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount < ?count,
	RestartFabricNode(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:45:00, MaxExecutionTime=02:00:00).

TimeScopedRestartFabricNode(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount < ?count, RestartFabricNode.
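A hedged usage sketch of the internal predicate above (the Error trigger and the one-restart-per-two-hours budget are illustrative):

## Hypothetical caller: allow at most one completed node restart per two-hour window.
Mitigate(HealthState=Error) :- TimeScopedRestartFabricNode(1, 02:00:00).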
## TimeScopedRestartFabricSystemProcess is an internal predicate to check for the number of times a System service process restart repair has run to completion within a supplied time window.
## If the Completed Repair count is less than the supplied value, then run the RestartFabricSystemProcess mitigation.
@ -45,23 +42,20 @@ TimeScopedRestartFabricNode(?count, ?time) :- GetRepairHistory(?repairCount, ?ti

TimeScopedRestartFabricSystemProcess(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount < ?count,
	RestartFabricSystemProcess(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:05:00, MaxExecutionTime=00:10:00).

## Mitigation rules for multiple metrics and targets. NOTE: Do not restart the Fabric or FabricHost processes unless you want to take the Fabric node down. For the latter (restart node),
## use TimeScopedRestartFabricNode (or the RestartFabricNode predicate directly), which employs a safe Fabric node restart workflow (with a deactivation step), not just a process kill.

## CPU Time - Percent

Mitigate(MetricName="CpuPercent", ProcessName=?SysProcName) :- not(?SysProcName == "Fabric" || ?SysProcName == "FabricHost"),
	GetHealthEventHistory(?HealthEventCount, 00:15:00),
	GetHealthEventHistory(?HealthEventCount, 00:30:00),
	?HealthEventCount >= 3,
	TimeScopedRestartFabricSystemProcess(5, 01:00:00).

## Memory Use - Megabytes in use

Mitigate(MetricName="MemoryMB", ProcessName=?SysProcName) :- not(?SysProcName == "Fabric" || ?SysProcName == "FabricHost"),
	GetHealthEventHistory(?HealthEventCount, 00:15:00),
	GetHealthEventHistory(?HealthEventCount, 00:30:00),
	?HealthEventCount >= 3,
	TimeScopedRestartFabricSystemProcess(5, 01:00:00).
@ -69,27 +63,24 @@ Mitigate(MetricName="MemoryMB", ProcessName=?SysProcName) :- not(?SysProcName ==

##	?HealthEventCount >= 3,
##	TimeScopedRestartFabricNode(1, 01:00:00).

## Memory Use - Percent in use

Mitigate(MetricName="MemoryPercent", ProcessName=?SysProcName) :- not(?SysProcName == "Fabric" || ?SysProcName == "FabricHost"),
	GetHealthEventHistory(?HealthEventCount, 00:15:00),
	GetHealthEventHistory(?HealthEventCount, 00:30:00),
	?HealthEventCount >= 3,
	TimeScopedRestartFabricSystemProcess(5, 01:00:00).

## Ephemeral Ports in Use

Mitigate(MetricName="EphemeralPorts", ProcessName=?SysProcName) :- not(?SysProcName == "Fabric" || ?SysProcName == "FabricHost"),
	GetHealthEventHistory(?HealthEventCount, 00:15:00),
	GetHealthEventHistory(?HealthEventCount, 00:30:00),
	?HealthEventCount >= 3,
	TimeScopedRestartFabricSystemProcess(5, 01:00:00).

## Threads

Mitigate(MetricName="Threads", ProcessName=?SysProcName) :- not(?SysProcName == "Fabric" || ?SysProcName == "FabricHost"),
	GetHealthEventHistory(?HealthEventCount, 00:15:00),
	GetHealthEventHistory(?HealthEventCount, 00:30:00),
	?HealthEventCount >= 3,
	TimeScopedRestartFabricSystemProcess(5, 01:00:00).
@ -99,20 +90,18 @@ Mitigate(MetricName="Threads", ProcessName=?SysProcName) :- not(?SysProcName ==

## Restart the offending Fabric system process named FabricGateway, regardless of OS.

Mitigate(MetricName="FileHandles", ProcessName=?SysProcName) :- match(?SysProcName, "FabricGateway"),
	GetHealthEventHistory(?HealthEventCount, 00:15:00),
	GetHealthEventHistory(?HealthEventCount, 00:30:00),
	?HealthEventCount >= 3,
	TimeScopedRestartFabricSystemProcess(15, 01:00:00).

## Open File Handles - Linux-only: Any SF system service besides Fabric or FabricHost.
## Restart the offending Fabric system process.

Mitigate(MetricName="FileHandles", OS="Linux", ProcessName=?SysProcName) :- not(?SysProcName == "Fabric" || ?SysProcName == "FabricHost"),
	GetHealthEventHistory(?HealthEventCount, 00:15:00),
	GetHealthEventHistory(?HealthEventCount, 00:30:00),
	?HealthEventCount >= 3,
	TimeScopedRestartFabricSystemProcess(5, 01:00:00).

## Open File Handles - Linux OS, Fabric process. In these cases, we want a safe (graceful) restart of the Fabric node, not just a process kill (killing Fabric also restarts the node, but not gracefully).
## Restart the Fabric node where the offending instance is running.
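The Fabric-process rule the comment describes lies beyond this hunk; a hedged sketch (the repair budget is illustrative):

## Hypothetical sketch: graceful node restart when the Fabric process exhausts file handles on Linux.
Mitigate(MetricName="FileHandles", OS="Linux", ProcessName="Fabric") :- TimeScopedRestartFabricNode(1, 02:00:00).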
@ -1,6 +1,6 @@
<?xml version="1.0" encoding="utf-8"?>
<ServiceManifest Name="FabricHealerPkg"
                 Version="1.1.18"
                 Version="1.1.19"
                 xmlns="http://schemas.microsoft.com/2011/01/fabric"
                 xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

@ -11,7 +11,7 @@
  </ServiceTypes>

  <!-- Code package is your service executable. -->
  <CodePackage Name="Code" Version="1.1.18">
  <CodePackage Name="Code" Version="1.1.19">
    <EntryPoint>
      <ExeHost>
        <Program>FabricHealer</Program>

@ -21,5 +21,5 @@

  <!-- Config package is the contents of the Config directory under PackageRoot that contains an
       independently-updateable and versioned set of custom configuration settings for your service. -->
  <ConfigPackage Name="Config" Version="1.1.18" />
  <ConfigPackage Name="Config" Version="1.1.19" />
</ServiceManifest>
@ -32,7 +32,7 @@ namespace FabricHealer.Repair.Guan

if (Input.Arguments.Count == 0 || Input.Arguments[0].Value.GetObjectValue().GetType() != typeof(TimeSpan))
{
    throw new GuanException(
        "CheckEntityHealthStateDuration: One argument is required and it must be a TimeSpan " +
        "CheckInsideHealthStateMinDuration: One argument is required and it must be a TimeSpan " +
        "(xx:yy:zz format, for example 00:30:00 represents 30 minutes).");
}
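For reference, the rule form this guard validates, as used in the machine rules earlier in this diff:

## CheckInsideHealthStateMinDuration takes exactly one TimeSpan argument (xx:yy:zz).
Mitigate :- CheckInsideHealthStateMinDuration(02:00:00), !.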
@ -36,48 +36,19 @@ namespace FabricHealer.Repair.Guan

    _ = await RepairTaskEngine.TryTraceCurrentlyExecutingRuleAsync(Input.ToString(), RepairData, FabricHealerManager.Token);
}

int count = Input.Arguments.Count;

for (int i = 0; i < count; i++)
{
    var typeString = Input.Arguments[i].Value.GetEffectiveTerm().GetObjectValue().GetType().Name;

    switch (typeString)
    {
        case "Boolean" when i == 0 && count == 4 || Input.Arguments[i].Name.ToLower() == "dohealthchecks":
            RepairData.RepairPolicy.DoHealthChecks = (bool)Input.Arguments[i].Value.GetEffectiveTerm().GetObjectValue();
            break;

        case "TimeSpan" when i == 1 && count == 4 || Input.Arguments[i].Name.ToLower() == "maxwaittimeforhealthstateok":
            RepairData.RepairPolicy.MaxTimePostRepairHealthCheck = (TimeSpan)Input.Arguments[i].Value.GetEffectiveTerm().GetObjectValue();
            break;

        case "TimeSpan" when i == 2 && count == 4 || Input.Arguments[i].Name.ToLower() == "maxexecutiontime":
            RepairData.RepairPolicy.MaxExecutionTime = (TimeSpan)Input.Arguments[i].Value.GetEffectiveTerm().GetObjectValue();
            break;

        case "String" when i == 3 && count == 4 || Input.Arguments[i].Name.ToLower() == "impactlevel":

            string value = Input.Arguments[i].Value.GetEffectiveTerm().GetStringValue().ToLower();

            if (value == "removedata")
            {
                RepairData.RepairPolicy.NodeImpactLevel = NodeImpactLevel.RemoveData;
            }
            else if (value == "removenode")
            {
                RepairData.RepairPolicy.NodeImpactLevel = NodeImpactLevel.RemoveNode;
            }
            else
            {
                RepairData.RepairPolicy.NodeImpactLevel = NodeImpactLevel.Restart;
            }

            break;

        default:
            throw new GuanException($"Unsupported argument type for RestartFabricNode: {typeString}");
    }
}

string value = Input.Arguments[0].Value.GetEffectiveTerm().GetStringValue().ToLower();

if (value == "removedata")
{
    RepairData.RepairPolicy.NodeImpactLevel = NodeImpactLevel.RemoveData;
}
else if (value == "removenode")
{
    RepairData.RepairPolicy.NodeImpactLevel = NodeImpactLevel.RemoveNode;
}
else
{
    RepairData.RepairPolicy.NodeImpactLevel = NodeImpactLevel.Restart;
}

var isNodeRepairAlreadyInProgress =
@ -121,7 +92,7 @@ namespace FabricHealer.Repair.Guan
}

private DeactivateFabricNodePredicateType(string name)
    : base(name, true, 0)
    : base(name, true, 1)
{

}
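With the minimum argument count raised from 0 to 1, rules must now pass an ImpactLevel. Both invocation forms from the updated machine repair rules satisfy this:

Mitigate(HealthState=Error) :- GetRepairHistory(?repairCount, 08:00:00), ?repairCount < 2, DeactivateFabricNode(RemoveData).
Mitigate(HealthState=Error) :- GetRepairHistory(?repairCount, 08:00:00), ?repairCount < 2, DeactivateFabricNode(ImpactLevel=RemoveData).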
@ -28,38 +28,15 @@ namespace FabricHealer.Repair.Guan

protected override async Task<bool> CheckAsync()
{
    RepairData.RepairPolicy.RepairAction = RepairActionType.RestartFabricNode;
    RepairData.EntityType = EntityType.Node;
    RepairData.RepairPolicy.RepairIdPrefix = RepairConstants.FHTaskIdPrefix;
    RepairData.RepairPolicy.NodeImpactLevel = NodeImpactLevel.Restart;

    if (FabricHealerManager.ConfigSettings.EnableLogicRuleTracing)
    {
        _ = await RepairTaskEngine.TryTraceCurrentlyExecutingRuleAsync(Input.ToString(), RepairData, FabricHealerManager.Token);
    }

    int count = Input.Arguments.Count;

    for (int i = 0; i < count; i++)
    {
        var typeString = Input.Arguments[i].Value.GetEffectiveTerm().GetObjectValue().GetType().Name;

        switch (typeString)
        {
            case "Boolean" when i == 0 && count == 3 || Input.Arguments[i].Name.ToLower() == "dohealthchecks":
                RepairData.RepairPolicy.DoHealthChecks = (bool)Input.Arguments[i].Value.GetEffectiveTerm().GetObjectValue();
                break;

            case "TimeSpan" when i == 1 && count == 3 || Input.Arguments[i].Name.ToLower() == "maxwaittimeforhealthstateok":
                RepairData.RepairPolicy.MaxTimePostRepairHealthCheck = (TimeSpan)Input.Arguments[i].Value.GetEffectiveTerm().GetObjectValue();
                break;

            case "TimeSpan" when i == 2 && count == 3 || Input.Arguments[i].Name.ToLower() == "maxexecutiontime":
                RepairData.RepairPolicy.MaxExecutionTime = (TimeSpan)Input.Arguments[i].Value.GetEffectiveTerm().GetObjectValue();
                break;

            default:
                throw new GuanException($"Unsupported argument type for RestartFabricNode: {typeString}");
        }
    }

    // Block attempts to create node-level repair tasks if one is already running in the cluster.
    var isNodeRepairAlreadyInProgress =
        await RepairTaskEngine.IsRepairInProgressAsync(RepairData, FabricHealerManager.Token);
@ -80,18 +57,26 @@ namespace FabricHealer.Repair.Guan

        return false;
    }

    // Try to schedule repair with RM for Fabric Node Restart (FH will not be the executor).
    // Try to schedule the repair with RM for Fabric Node Restart (FH will also be the executor of the repair).
    RepairTask repairTask = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
                () => RepairTaskManager.ScheduleFabricHealerRepairTaskAsync(
                        RepairData,
                        FabricHealerManager.Token),
                FabricHealerManager.Token);

    if (repairTask == null)
    {
        return false;
    }

    return true;
    // Now execute the repair.
    bool success = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
                () => RepairTaskManager.ExecuteFabricHealerRepairTaskAsync(
                        repairTask,
                        RepairData,
                        FabricHealerManager.Token),
                FabricHealerManager.Token);

    return success;
}
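Since the argument-parsing loop is removed in this commit, the updated rules invoke the predicate bare, as in the system service rules earlier in this diff:

## Post-commit invocation style: no arguments.
TimeScopedRestartFabricNode(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount < ?count, RestartFabricNode.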
@ -21,6 +21,10 @@ using System.Collections.Generic;
using System.ComponentModel;
using Newtonsoft.Json;
using System.Fabric.Description;
using FabricHealer.TelemetryLib;
using System.Fabric.Repair;
using System.Numerics;
using Octokit;

namespace FabricHealer.Repair
{
@ -770,6 +774,168 @@ namespace FabricHealer.Repair
    }
}

/// <summary>
/// Restarts a Service Fabric Node.
/// </summary>
/// <param name="repairData">Repair configuration</param>
/// <param name="cancellationToken">Task cancellation token</param>
/// <returns>true if successful, false otherwise</returns>
public static async Task<bool> RestartFabricNodeAsync(TelemetryData repairData, CancellationToken cancellationToken)
{
    // If FH is installed on multiple nodes and this node is the target, then another FH instance should restart the node.
    if (FabricHealerManager.InstanceCount == -1 || FabricHealerManager.InstanceCount > 1)
    {
        if (repairData.NodeName.Equals(FabricHealerManager.ServiceContext.NodeContext.NodeName, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }
    }

    NodeList nodeList =
        await FabricHealerManager.FabricClientSingleton.QueryManager.GetNodeListAsync(
                repairData.NodeName,
                FabricHealerManager.ConfigSettings.AsyncTimeout,
                cancellationToken);

    if (!nodeList.Any(n => n.NodeName == repairData.NodeName))
    {
        string info = $"Fabric node {repairData.NodeName} does not exist.";

        await FabricHealerManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
                LogLevel.Info,
                "RestartFabricNodeAsync::MissingNode",
                info,
                cancellationToken,
                repairData,
                FabricHealerManager.ConfigSettings.EnableVerboseLogging);

        // The target node does not exist, so there is nothing to restart (indexing nodeList[0] below would throw on an empty list).
        return false;
    }

    BigInteger nodeInstanceId = nodeList[0].NodeInstanceId;
    Stopwatch stopwatch = new();
    TimeSpan maxWaitTimeout = TimeSpan.FromMinutes(MaxWaitTimeMinutesForNodeOperation);
    string actionMessage = $"Attempting to restart Fabric node {repairData.NodeName} with InstanceId {nodeInstanceId}.";

    await FabricHealerManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
            LogLevel.Info,
            $"AttemptingNodeRestart::{repairData.NodeName}",
            actionMessage,
            cancellationToken,
            repairData,
            FabricHealerManager.ConfigSettings.EnableVerboseLogging);
    try
    {
        RestartNodeResult result =
            await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
                    () =>
                    FabricHealerManager.FabricClientSingleton.FaultManager.RestartNodeAsync(
                        repairData.NodeName,
                        nodeInstanceId,
                        false,
                        CompletionMode.Verify,
                        FabricHealerManager.ConfigSettings.AsyncTimeout,
                        cancellationToken),
                    cancellationToken);

        if (result == null)
        {
            await FabricHealerManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
                    LogLevel.Info,
                    $"RestartFabricNodeAsync::Failure_{repairData.NodeName}",
                    $"Failed to restart Fabric node {repairData.NodeName}. FaultManager did not complete the operation successfully.",
                    cancellationToken,
                    repairData,
                    FabricHealerManager.ConfigSettings.EnableVerboseLogging);
        }

        stopwatch.Start();

        Node targetNode;

        // Wait for Disabled/OK states.
        while (stopwatch.Elapsed <= maxWaitTimeout)
        {
            if (cancellationToken.IsCancellationRequested)
            {
                return true;
            }

            nodeList =
                await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
                        () =>
                        FabricHealerManager.FabricClientSingleton.QueryManager.GetNodeListAsync(
                            repairData.NodeName,
                            FabricHealerManager.ConfigSettings.AsyncTimeout,
                            cancellationToken),
                        cancellationToken);

            targetNode = nodeList[0];

            // Node is ready to be enabled.
            if (targetNode.NodeStatus == NodeStatus.Disabled && targetNode.HealthState == HealthState.Ok)
            {
                break;
            }

            await Task.Delay(1000, cancellationToken);
        }

        stopwatch.Stop();
        stopwatch.Reset();

        // Enable the node.
        await FabricHealerManager.FabricClientSingleton.ClusterManager.ActivateNodeAsync(
                repairData.NodeName,
                FabricHealerManager.ConfigSettings.AsyncTimeout,
                cancellationToken);

        await Task.Delay(TimeSpan.FromSeconds(15), cancellationToken);

        nodeList =
            await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
                    () =>
                    FabricHealerManager.FabricClientSingleton.QueryManager.GetNodeListAsync(
                        repairData.NodeName,
                        FabricHealerManager.ConfigSettings.AsyncTimeout,
                        cancellationToken),
                    cancellationToken);

        targetNode = nodeList[0];

        // Make sure the activation request went through.
        if (targetNode.NodeStatus == NodeStatus.Disabled && targetNode.HealthState == HealthState.Ok)
        {
            await FabricHealerManager.FabricClientSingleton.ClusterManager.ActivateNodeAsync(
                    repairData.NodeName,
                    FabricHealerManager.ConfigSettings.AsyncTimeout,
                    cancellationToken);
        }

        await Task.Delay(TimeSpan.FromSeconds(15), cancellationToken);
        UpdateRepairHistory(repairData);
        return true;
    }
    catch (Exception e) when (e is FabricException || e is TimeoutException)
    {
#if DEBUG
        string err = $"Handled Exception restarting Fabric node {repairData.NodeName}, NodeInstanceId {nodeInstanceId}: {e.GetType().Name}";
        await FabricHealerManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
                LogLevel.Info,
                "RestartFabricNodeAsync::HandledException",
                err,
                cancellationToken,
                repairData,
                FabricHealerManager.ConfigSettings.EnableVerboseLogging);
        FabricHealerManager.RepairLogger.LogInfo(err);
#endif
        FabricHealerManager.RepairHistory.FailedRepairs++;
        return false;
    }
    catch (Exception e) when (e is OperationCanceledException || e is TaskCanceledException)
    {
        return true;
    }
}

/// <summary>
/// This function ensures the input is in fact a Guid.
/// </summary>
@ -93,7 +93,7 @@ namespace FabricHealer.Repair
    Executor = RepairConstants.FabricHealer,
    ExecutorData = JsonSerializationUtility.TrySerializeObject(executorData, out string exData) ? exData : null,
    PerformPreparingHealthCheck = doHealthChecks,
    PerformRestoringHealthCheck = doHealthChecks,
    PerformRestoringHealthCheck = doHealthChecks
};

return repairTask;
@ -492,6 +492,12 @@ namespace FabricHealer.Repair

string[] lines = File.ReadLines(ruleFilePath).ToArray();
predicate = predicate.Replace("'", "").Replace("\"", "").Replace(" ", "");

// Appending "()" to a predicate is optional. Just remove it and use the name only for string matching.
if (predicate.EndsWith("()"))
{
    predicate = predicate.Remove(predicate.Length - 2);
}

// Get all rules that contain the supplied predicate and "LogRule".
List<string> flattenedLines = FabricHealerManager.ParseRulesFile(lines);
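This normalization is why rule heads throughout this commit can drop their empty parentheses; both spellings refer to the same rule when FH matches predicates for tracing:

## Equivalent zero-argument rule heads.
Mitigate() :- CheckOutstandingRepairs(2), !.
Mitigate :- CheckOutstandingRepairs(2), !.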
@ -488,8 +488,8 @@ namespace FabricHealer.Repair
}

// Don't attempt a node-level repair on a node where there is already an active node-level repair.
if (repairData.RepairPolicy.RepairAction == RepairActionType.RestartFabricNode
    || repairData.RepairPolicy.RepairAction == RepairActionType.DeactivateNode
if ((repairData.RepairPolicy.RepairAction == RepairActionType.RestartFabricNode
    || repairData.RepairPolicy.RepairAction == RepairActionType.DeactivateNode)
    && await RepairTaskEngine.IsNodeLevelRepairCurrentlyInFlightAsync(repairData, cancellationToken))
{
    string message = $"Node {repairData.NodeName} already has a node-impactful repair in progress: " +
@ -807,12 +807,16 @@ namespace FabricHealer.Repair

        break;
    }
    case RepairActionType.RestartFabricNode:
    {
        success = await RestartFabricNodeAsync(repairData, cancellationToken);
        break;
    }

    default:
        return false;
}

// What was the target (a node, app, replica, etc..)?
string repairTarget = null;
@ -1011,6 +1015,11 @@ namespace FabricHealer.Repair
    return false;
}

private static Task<bool> RestartFabricNodeAsync(TelemetryData repairData, CancellationToken cancellationToken)
{
    return RepairExecutor.RestartFabricNodeAsync(repairData, cancellationToken);
}

// Support for GetHealthEventHistoryPredicateType, which enables time-scoping logic rules based on health events related to specific SF entities/targets.
internal static int GetEntityHealthEventCountWithinTimeRange(TelemetryData repairData, TimeSpan timeWindow)
{
@ -1151,7 +1160,10 @@ namespace FabricHealer.Repair

while (stopwatch.Elapsed <= maxTimeToWait)
{
    token.ThrowIfCancellationRequested();
    if (token.IsCancellationRequested)
    {
        return true;
    }

    if (await GetCurrentAggregatedHealthStateAsync(repairData, token) == HealthState.Ok)
    {
@ -1174,99 +1186,99 @@ namespace FabricHealer.Repair
/// <returns></returns>
private static async Task<HealthState> GetCurrentAggregatedHealthStateAsync(TelemetryData repairData, CancellationToken token)
{
    try
    {
        switch (repairData.EntityType)
        {
            case EntityType.Application:

                var appHealth = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
                        () => FabricHealerManager.FabricClientSingleton.HealthManager.GetApplicationHealthAsync(
                                new Uri(repairData.ApplicationName),
                                FabricHealerManager.ConfigSettings.AsyncTimeout,
                                token),
                        token);

                bool isTargetAppHealedOnTargetNode = false;

                // System Service repairs (process restarts). Note: the deserialized event data is bound to a distinct
                // name (healthData) so it can be compared against the repairData argument; reusing the name repairData
                // here would shadow the method parameter and turn the comparisons below into self-comparisons.
                if (repairData.ApplicationName == RepairConstants.SystemAppName)
                {
                    isTargetAppHealedOnTargetNode = appHealth.HealthEvents.Any(
                        h => JsonSerializationUtility.TryDeserializeObject(
                                h.HealthInformation.Description,
                                out TelemetryData healthData)
                             && healthData.NodeName == repairData.NodeName
                             && healthData.ProcessName == repairData.ProcessName
                             && healthData.HealthState == HealthState.Ok);
                }
                else // Application repairs (code package restarts)
                {
                    isTargetAppHealedOnTargetNode = appHealth.HealthEvents.Any(
                        h => JsonSerializationUtility.TryDeserializeObject(
                                h.HealthInformation.Description,
                                out TelemetryData healthData)
                             && healthData.NodeName == repairData.NodeName
                             && healthData.ApplicationName == repairData.ApplicationName
                             && healthData.HealthState == HealthState.Ok);
                }

                return isTargetAppHealedOnTargetNode ? HealthState.Ok : appHealth.AggregatedHealthState;

            case EntityType.Service:

                var serviceHealth = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
                        () => FabricHealerManager.FabricClientSingleton.HealthManager.GetServiceHealthAsync(
                                new Uri(repairData.ServiceName),
                                FabricHealerManager.ConfigSettings.AsyncTimeout,
                                token),
                        token);

                bool isTargetServiceHealedOnTargetNode = serviceHealth.HealthEvents.Any(
                    h => JsonSerializationUtility.TryDeserializeObject(
                            h.HealthInformation.Description,
                            out TelemetryData healthData)
                         && healthData.NodeName == repairData.NodeName
                         && healthData.ServiceName == repairData.ServiceName
                         && healthData.HealthState == HealthState.Ok);

                return isTargetServiceHealedOnTargetNode ? HealthState.Ok : serviceHealth.AggregatedHealthState;

            case EntityType.Node:
            case EntityType.Machine:

                var nodeHealth = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
                        () => FabricHealerManager.FabricClientSingleton.HealthManager.GetNodeHealthAsync(
                                repairData.NodeName,
                                FabricHealerManager.ConfigSettings.AsyncTimeout,
                                token),
                        token);

                return nodeHealth.AggregatedHealthState;

            case EntityType.Replica:

                if (!RepairExecutor.TryGetGuid(repairData.PartitionId, out Guid partitionId))
                {
                    return HealthState.Unknown;
                }

                // Make sure the Partition where the restarted replica was located is now healthy.
                var partitionHealth = await FabricClientRetryHelper.ExecuteFabricActionWithRetryAsync(
                        () => FabricHealerManager.FabricClientSingleton.HealthManager.GetPartitionHealthAsync(
                                partitionId,
                                FabricHealerManager.ConfigSettings.AsyncTimeout,
                                token),
                        token);

                return partitionHealth.AggregatedHealthState;

            default:
                return HealthState.Unknown;
        }
    }
    catch (Exception e) when (e is FabricException || e is OperationCanceledException || e is TaskCanceledException || e is TimeoutException)
    {
        return HealthState.Unknown;
    }
}
@ -1,5 +1,5 @@
<?xml version="1.0" encoding="utf-8"?>
<ApplicationManifest xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ApplicationTypeName="FabricHealerType" ApplicationTypeVersion="1.1.18" xmlns="http://schemas.microsoft.com/2011/01/fabric">
<ApplicationManifest xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ApplicationTypeName="FabricHealerType" ApplicationTypeVersion="1.1.19" xmlns="http://schemas.microsoft.com/2011/01/fabric">
  <Parameters>
    <!-- FabricHealerManager Settings -->
    <Parameter Name="AutoMitigationEnabled" DefaultValue="true" />

@ -31,7 +31,7 @@
       should match the Name and Version attributes of the ServiceManifest element defined in the
       ServiceManifest.xml file. -->
  <ServiceManifestImport>
    <ServiceManifestRef ServiceManifestName="FabricHealerPkg" ServiceManifestVersion="1.1.18" />
    <ServiceManifestRef ServiceManifestName="FabricHealerPkg" ServiceManifestVersion="1.1.19" />
    <ConfigOverrides>
      <ConfigOverride Name="Config">
        <Settings>
@ -1,4 +1,4 @@
## FabricHealer 1.1.18
## FabricHealer 1.1.19
### Configuration as Logic and auto-mitigation in Service Fabric clusters

FabricHealer (FH) is a .NET 6 Service Fabric application that attempts to automatically fix a set of reliably solvable problems that can take place in Service Fabric

@ -78,7 +78,7 @@ Register-ServiceFabricApplicationType -ApplicationPathInImageStore FH1110

#Create the FH application (if not already deployed at a lower version):

New-ServiceFabricApplication -ApplicationName fabric:/FabricHealer -ApplicationTypeName FabricHealerType -ApplicationTypeVersion 1.1.18
New-ServiceFabricApplication -ApplicationName fabric:/FabricHealer -ApplicationTypeName FabricHealerType -ApplicationTypeVersion 1.1.19

#Create the Service instance:

@ -87,7 +87,7 @@ New-ServiceFabricService -Stateless -PartitionSchemeSingleton -ApplicationName f

#OR if updating an existing version:

Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/FabricHealer -ApplicationTypeVersion 1.1.18 -Monitored -FailureAction rollback
Start-ServiceFabricApplicationUpgrade -ApplicationName fabric:/FabricHealer -ApplicationTypeVersion 1.1.19 -Monitored -FailureAction rollback
```

## Using FabricHealer
@ -1,4 +1,4 @@
## FabricHealer 1.1.18
## FabricHealer 1.1.19
### Configuration as Logic and auto-mitigation in Service Fabric clusters

FabricHealer (FH) is a .NET 6 Service Fabric application that attempts to automatically fix a set of reliably solvable problems that can take place in Service Fabric