Mitigate(AppName="fabric:/CpuStress", MetricName="CpuPercent") :- time() > DateTime("11/30/2021"),
    EmitMessage("Exceeded specified end date for repair of fabric:/CpuStress CpuPercent usage violations. Target end date: {0}. Current date (Utc): {1}", DateTime("11/30/2021"), time()), !.

## Alternatively, you could enforce repair end dates inline (as a subgoal) in any rule, e.g.,

Mitigate(AppName="fabric:/PortEater42", MetricName="EphemeralPorts", MetricValue=?MetricValue) :- time() < DateTime("11/30/2021"),
    ?MetricValue >= 8500,
    TimeScopedRestartCodePackage(4, 01:00:00).

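## A further sketch of the inline end-date pattern, left commented out so it does not change repair behavior.
## It combines the end-date subgoal with the GetHealthEventHistory check used by other rules in this file.
## The app name, metric threshold, repair count and date are placeholders chosen for illustration, not recommended values.
## Mitigate(AppName="fabric:/MyApp42", MetricName="Threads", MetricValue=?MetricValue) :- time() < DateTime("11/30/2021"),
##     ?MetricValue >= 400,
##     GetHealthEventHistory(?HealthEventCount, 00:15:00),
##     ?HealthEventCount >= 3,
##     TimeScopedRestartCodePackage(2, 01:00:00).
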
## Logic rules for multiple metrics and targets. The goal is to Mitigate!

## CPU

## CPU - Percent In Use - Constrained on AppName and number of times FabricObserver generates an Error/Warning Health report for CpuPercent metric within a specified timespan.
## This reads: Try to mitigate an SF Application in Error or Warning named fabric:/CpuStress where one of its services is consuming too much CPU (as a percentage of total CPU)
## and where at least 3 health events identifying this problem were produced in the last 15 minutes. This is useful to ensure you don't mitigate transient (short-lived)
## problems, as they will self-correct.

Mitigate(AppName="fabric:/CpuStress", MetricName="CpuPercent", MetricValue=?MetricValue) :- ?MetricValue >= 15,
    GetHealthEventHistory(?HealthEventCount, 00:15:00),
    ?HealthEventCount >= 3,
    TimeScopedRestartReplica(1, 00:15:00).

## CPU - Percent In Use - Constrained on AppName = "fabric:/MyApp42", observed Metric value and health event count within specified time range.
Mitigate(AppName="fabric:/MyApp42", MetricName="CpuPercent", MetricValue=?MetricValue) :- ?MetricValue >= 80,
    GetHealthEventHistory(?HealthEventCount, 00:15:00),
    ?HealthEventCount >= 3,
    TimeScopedRestartCodePackage(4, 01:00:00).

## CPU - Percent In Use - Specific application, any of its service processes. Your FO error/warning threshold alone prompts repair. This doesn't take into account transient misbehavior.
Mitigate(AppName="fabric:/MyApp", MetricName="CpuPercent") :- TimeScopedRestartCodePackage(5, 01:00:00).

## CPU - Percent In Use - Any application's service that exceeds 90% CPU usage: repair up to 5 times in a one hour window.
Mitigate(MetricName="CpuPercent", MetricValue=?MetricValue) :- ?MetricValue >= 90, TimeScopedRestartCodePackage(5, 01:00:00).

## File Handles

## This is also an example of how to use the Guan system predicate match (and notmatch). It takes two args: the first is the source, the second is a regular expression,
## in this case just a string of characters (no special regex characters).
## In practice, for this scenario, you would just pass the target app name string into Mitigate, e.g., Mitigate(AppName="fabric:/ClusterObserver", ...).
## Use of match here is just an example of how to use it. Note that in Prolog, this type of substring matching could be expressed as an internal predicate
## in a form that is much harder for humans to read: substring(X,S) :- append(_,T,S), append(X,_,T), X \= [].
## This is because in Prolog a string is a list of characters. In Guan, a string is just a built-in .NET object, System.String.
## This is one of the great things about Guan: It's .NET all the way down.

## Constrained on AppName, MetricName (FileHandles). 5 repairs within 1 hour window.
Mitigate(AppName=?AppName, MetricName="FileHandles") :- match(?AppName, "ClusterObserver"),
    TimeScopedRestartCodePackage(5, 01:00:00).

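## A minimal sketch of the complementary notmatch predicate, commented out so it has no effect on repairs.
## It assumes notmatch takes the same two arguments as match (source, regular expression) and succeeds when the source does not match.
## Read as: restart the code package for a FileHandles warning in any app whose name does not contain "ClusterObserver".
## Mitigate(AppName=?AppName, MetricName="FileHandles") :- notmatch(?AppName, "ClusterObserver"),
##     TimeScopedRestartCodePackage(5, 01:00:00).
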
## Constrained on AppName, MetricName (FileHandles). 5 repairs within 1 hour window.
Mitigate(AppName="fabric:/MyApp", MetricName="FileHandles") :- TimeScopedRestartCodePackage(5, 01:00:00).

## Memory

## Memory - Percent In Use for Any SF Service Process belonging to the specified SF Application. 3 repairs within 10 minute window.
Mitigate(AppName="fabric:/CpuStress", MetricName="MemoryPercent", MetricValue=?MetricValue) :- ?MetricValue >= 30,
    TimeScopedRestartCodePackage(3, 00:10:00).

## Memory - Megabytes In Use for Any SF Service Process belonging to the specified SF Applications. 5 repairs within 5 hour window.
Mitigate(AppName="fabric:/CpuStress", MetricName="MemoryMB") :- TimeScopedRestartCodePackage(5, 05:00:00).
Mitigate(AppName="fabric:/ContainerFoo", MetricName="MemoryMB") :- TimeScopedRestartCodePackage(5, 05:00:00).
Mitigate(AppName="fabric:/ContainerFoo2", MetricName="MemoryMB") :- TimeScopedRestartCodePackage(5, 05:00:00).

## Memory - Any app service that exceeds 1 GB private working set: restart the code package if warning data is provided at least 3 times within a 15 minute window. 1 repair per hour.
Mitigate(MetricName="MemoryMB", MetricValue=?MetricValue) :- ?MetricValue >= 1024,
    GetHealthEventHistory(?HealthEventCount, TimeRange=00:15:00),
    ?HealthEventCount >= 3,
    TimeScopedRestartCodePackage(1, 01:00:00).

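## Disk

## Disk - Constrained on ErrorCode (FO042 or FO043) and on repair history (fewer than 4 repairs within the last 8 hours).
## CheckFolderSize gates the repair on a 50 GB folder size limit; DeleteFiles then removes at most 25 files from the folder, without recursing into subdirectories.
Mitigate(ErrorCode=?ErrorCode) :- ?ErrorCode == "FO042" || ?ErrorCode == "FO043", GetRepairHistory(?repairCount, 08:00:00),
    ?repairCount < 4,
    CheckFolderSize("E:\SvcFab\Log\Traces", MaxFolderSizeGB=50),
    DeleteFiles("E:\SvcFab\Log\Traces", SortOrder=Ascending, MaxFilesToDelete=25, RecurseSubdirectories=false).
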
## Disk - Same constraints as the previous rule, with the folder path supplied through the %SOMEPATHVAR% placeholder instead of a literal path.
Mitigate(ErrorCode=?ErrorCode) :- ?ErrorCode == "FO042" || ?ErrorCode == "FO043", GetRepairHistory(?repairCount, 08:00:00),
    ?repairCount < 4,
    CheckFolderSize("%SOMEPATHVAR%", MaxFolderSizeGB=50),
    DeleteFiles("%SOMEPATHVAR%", SortOrder=Ascending, MaxFilesToDelete=25, RecurseSubdirectories=false).

## Ports

## Local Active TCP Ports - Any app service. 5 repairs within 5 hour window. This means if FO warns on Active Ports, then heal. There are no conditional checks (on MetricValue) to take place.
Mitigate(MetricName="ActiveTcpPorts") :- TimeScopedRestartCodePackage(5, 05:00:00).

## Ephemeral Ports - Specific Application: any of its services, constrained on number of local ephemeral ports open.
## 5 repairs within 5 hour window.
Mitigate(AppName="fabric:/MyApp42", MetricName="EphemeralPorts", MetricValue=?MetricValue) :- ?MetricValue > 5000, TimeScopedRestartCodePackage(5, 05:00:00).

## Ephemeral TCP Ports - Any app service. 5 repairs within 5 hour window. This means if FO warns on Ephemeral ports usage, then heal.
## There are no conditional checks.
Mitigate(MetricName="EphemeralPorts") :- TimeScopedRestartCodePackage(5, 05:00:00).

## Threads

## Threads - Ignore a specific application (FabricObserver, just as an example; it's generally fine to target FO for repairs), constrained on number of threads in use by the offending service process.
## 5 repairs within 5 hour window.
Mitigate(AppName=?AppName, MetricName="Threads", MetricValue=?MetricValue) :- ?AppName != "fabric:/FabricObserver", ?MetricValue >= 400, TimeScopedRestartCodePackage(5, 05:00:00).

## Threads - Any app service. 5 repairs within 5 hour window. This means if FO warns on Thread count, then heal. There are no conditional checks (on MetricValue) to take place.
## Mitigate(MetricName="Threads") :- TimeScopedRestartCodePackage(5, 05:00:00).

## Any service with a non-null ServiceName: restart the replica. 5 repairs within 5 hour window.
Mitigate(ServiceName=?ServiceName) :- ?ServiceName != null, TimeScopedRestartReplica(5, 05:00:00).

## Internal Predicates

## TimeScopedRestartCodePackage/TimeScopedRestartReplica are internal predicates that check the number of times a repair has run to completion within a supplied time window.
## If the completed repair count is less than the supplied value, then run the RestartCodePackage/RestartReplica mitigation. If not, emit a message so the developer has event data that describes why
## the repair was not attempted at this time. EmitMessage always succeeds.

TimeScopedRestartCodePackage(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount >= ?count,
    EmitMessage("Exhausted specified run count, {0}, within specified max repair time window, {1}. Will not attempt RestartCodePackage repair at this time.", ?count, ?time).

TimeScopedRestartReplica(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount >= ?count,
    EmitMessage("Exhausted specified run count, {0}, within specified max repair time window, {1}. Will not attempt RestartReplica repair at this time.", ?count, ?time).

## If we get here, it means the number of repairs for a target has not yet reached the maximum number specified to run within a time window.
## Note that you can add up to two optional arguments to RestartCodePackage/RestartReplica. You can name them whatever you want or omit the names; each just has to be either a TimeSpan value for how long to wait
## for the repair target to become healthy or a bool for whether or not RM should do health checks before and after the repair executes.
## See below for an example using both optional arguments (named arguments are just used for clarity; you could also just specify RestartCodePackage(true, 00:10:00), for example).

TimeScopedRestartCodePackage() :- RestartCodePackage(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:10:00).
TimeScopedRestartReplica() :- RestartReplica(DoHealthChecks=false, MaxWaitTimeForHealthStateOk=00:10:00).
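
## Sketches of the other argument forms described in the comments above, commented out so they do not change repair behavior.
## The values are illustrative only. The unnamed form mirrors the RestartCodePackage(true, 00:10:00) example mentioned above;
## the single-argument form assumes either optional argument may be supplied on its own, as those comments suggest.
## Both optional arguments, unnamed:
## TimeScopedRestartCodePackage() :- RestartCodePackage(true, 00:10:00).
## Only the wait time, named:
## TimeScopedRestartReplica() :- RestartReplica(MaxWaitTimeForHealthStateOk=00:05:00).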