New predicates for logging, updated rules. Updated documentation.
Parent
6568d8b5ea
Commit
5ec235df80
|
@ -74,9 +74,9 @@ Attempts to delete files in a supplied path. You can supply target path, max num
|
|||
|
||||
**Helper Predicates**
|
||||
|
||||
```EmitMessage()```
|
||||
```LogInfo(), LogWarning(), LogError()```
|
||||
|
||||
This will emit telemetry/etw/health report from a rule which enables informational messaging and can help with debugging.
|
||||
These emit a telemetry/ETW/health event at the corresponding level (Info, Warning, Error) from a rule and can help with debugging, auditing, and upstream action (ETW/Telemetry -> Alerts, for example).
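
A minimal sketch of how one of these predicates might be used in a rule, assuming the format-string-plus-arguments form used by the rules elsewhere in this commit. The rule itself, the message text, and the choice of facts are illustrative and not taken from the repo's rule files:

```
## Hypothetical rule: record a Warning-level event for the affected app, then stop processing rules (!).
Mitigate(AppName=?appName, MetricName="CpuPercent") :- LogWarning("CpuPercent violation reported for {0}.", ?appName), !.
```

Swapping LogWarning for LogInfo or LogError should change only the level of the emitted event.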
|
||||
|
||||
```GetRepairHistory()```
|
||||
|
||||
|
@ -276,17 +276,30 @@ Mitigate(AppName="fabric:/System", MetricName="EphemeralPorts", MetricValue=?Met
|
|||
|
||||
**Filtering parameters from Mitigate()**
|
||||
|
||||
If you wish to do equals checks such as ```?AppName == ...``` you don't actually need to write this in the body of your rules, instead you can specify these values inside Mitigate() like so:
|
||||
If you wish to do a single test for equality such as ```?AppName == "fabric:/App1"```, you don't actually need to write this in the body of your rules. Instead, you can specify these values inside Mitigate() like so:
|
||||
|
||||
```
|
||||
## This is the preferred way to do this. It is easier to read and employs less (unnecessary) basic logic.
|
||||
## This is the preferred way to do this for a single value test (note the use of = operator, not ==). It is easier to read and employs less (unnecessary) basic logic.
|
||||
Mitigate(AppName="fabric:/App1") :- ...
|
||||
```
|
||||
|
||||
What that means, is that the rule will only execute when the AppName is equal to "fabric:/App1". This is equivalent to the following:
|
||||
This means that the rule will only execute when the AppName is "fabric:/App1". This is equivalent to the following:
|
||||
|
||||
```
|
||||
Mitigate(AppName=?AppName) :- ?AppName == "fabric:/App1", ...
|
||||
Mitigate(AppName=?appName) :- ?appName == "fabric:/App1", ...
|
||||
```
|
||||
|
||||
Obviously, the first way of doing it is more succinct and, again, preferred.
|
||||
Obviously, the first way of doing it is more succinct and, again, preferred for simple cases where you are only interested in a single value for the fact. If, for example, you want to test for multiple values of AppName, then you have to move the test into the body of the rule as a subrule, because you can't add a logical expression to the head of a rule.
|
||||
|
||||
E.g., you want to proceed if AppName is either fabric:/App1 or fabric:/App42:
|
||||
|
||||
```
|
||||
Mitigate(AppName=?appName) :- ?appName == "fabric:/App1" || ?appName == "fabric:/App42", ...
|
||||
```
|
||||
|
||||
Or, you are only interested in AppNames other than fabric:/App1 and fabric:/App42:
|
||||
|
||||
```
|
||||
Mitigate(AppName=?appName) :- not(?appName == "fabric:/App1" || ?appName == "fabric:/App42"), ...
|
||||
```
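
As a sketch of how the two styles can be combined (the app names and message text are illustrative), a single-value test can stay in the head with =, while the multi-value test goes in the body with ==:

```
## Hypothetical rule: MetricName is filtered in the head; AppName is tested in the body.
Mitigate(AppName=?appName, MetricName="CpuPercent") :- ?appName == "fabric:/App1" || ?appName == "fabric:/App42",
    LogInfo("CpuPercent rule matched for {0}.", ?appName), !.
```

This keeps the head readable for the facts that need only one value and limits the body logic to the test that genuinely requires it.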
|
||||
|
|
|
@ -1,5 +1,5 @@
|
|||
Mitigate(AppName="fabric:/CpuStress", MetricName="CpuPercent") :- time() > DateTime("11/30/2021"),
|
||||
EmitMessage("Exceeded specified end date for repair of fabric:/MyApp CpuPercent usage violations. Target end date: {0}. Current date (Utc): {1}", DateTime("11/30/2021"), time()), !.
|
||||
LogWarning("Exceeded specified end date for repair of fabric:/CpuStress CpuPercent usage violations. Target end date: {0}. Current date (Utc): {1}", DateTime("11/30/2021"), time()), !.
|
||||
|
||||
## Alternatively, you could enforce repair end dates inline (as a subgoal) to any rule, e.g.,
|
||||
|
||||
|
@ -110,10 +110,10 @@ Mitigate(ServiceName=?ServiceName) :- ?ServiceName != null, TimeScopedRestartRep
|
|||
## the repair was not attempted at this time. EmitMessage always succeeds.
|
||||
|
||||
TimeScopedRestartCodePackage(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount >= ?count,
|
||||
EmitMessage("Exhausted specified run count, {0}, within specified max repair time window, {1}. Will not attempt RestartCodePackage repair at this time.", ?count, ?time).
|
||||
LogInfo("Exhausted specified run count, {0}, within specified max repair time window, {1}. Will not attempt RestartCodePackage repair at this time.", ?count, ?time).
|
||||
|
||||
TimeScopedRestartReplica(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount >= ?count,
|
||||
EmitMessage("Exhausted specified run count, {0}, within specified max repair time window, {1}. Will not attempt RestartReplica repair at this time.", ?count, ?time).
|
||||
LogInfo("Exhausted specified run count, {0}, within specified max repair time window, {1}. Will not attempt RestartReplica repair at this time.", ?count, ?time).
|
||||
|
||||
## If we get here, it means the number of repairs for a target has not exceeded the maximum number specified to run within a time window.
|
||||
## Note you can add up to two optional arguments to RestartCodePackage/RestartReplica, name them whatever you want or omit the names, it just has to be either a TimeSpan value for how long to wait
|
||||
|
|
|
@ -61,9 +61,10 @@
|
|||
## time() with no arguments returns DateTime.UtcNow. DateTime will return a DateTime object that represents the supplied datetime string.
|
||||
## *Note*: you must wrap the date string in quotes to make it explicit to Guan that the arg is a string as it contains mathematical operators (in this case a /).
|
||||
## The rule below reads: If any of the specified (set in Mitigate) app's service processes have put it into Warning due to CPU over-consumption and today's date is later than the supplied end date, emit a message, stop processing rules (!).
|
||||
## You can use LogInfo, LogWarning or LogError predicates to generate a log event that will create a local text log entry, an ETW event, and an SF health report.
|
||||
|
||||
Mitigate(AppName="fabric:/CpuStress", MetricName="CpuPercent") :- time() > DateTime("12/31/2022"),
|
||||
EmitMessage("Exceeded specified end date for repair of fabric:/CpuStress CpuPercent usage violations. Target end date: {0}. Current date (Utc): {1}", DateTime("12/31/2022"), time()), !.
|
||||
LogInfo("Exceeded specified end date for repair of fabric:/CpuStress CpuPercent usage violations. Target end date: {0}. Current date (Utc): {1}", DateTime("12/31/2022"), time()), !.
|
||||
|
||||
## Alternatively, you could enforce repair end dates inline (as a subrule) to any rule, e.g.,
|
||||
|
||||
|
@ -176,13 +177,13 @@ Mitigate() :- TimeScopedRestartReplica(10, 05:00:00).
|
|||
|
||||
## TimeScopedRestartCodePackage/TimeScopedRestartReplica are internal predicates to check for the number of times a repair has run to completion within a supplied time window.
|
||||
## If the completed repair count is less than the supplied value, then run the RestartCodePackage/RestartReplica mitigation. If not, emit a message so the developer has event data that describes why
|
||||
## the repair was not attempted at this time. EmitMessage always succeeds.
|
||||
## the repair was not attempted at this time. LogInfo/LogWarning/LogError always succeed.
|
||||
|
||||
TimeScopedRestartCodePackage(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount >= ?count,
|
||||
EmitMessage("Exhausted specified run count, {0}, within specified max repair time window, {1}. Will not attempt RestartCodePackage repair at this time.", ?count, ?time).
|
||||
LogInfo("Exhausted specified run count, {0}, within specified max repair time window, {1}. Will not attempt RestartCodePackage repair at this time.", ?count, ?time).
|
||||
|
||||
TimeScopedRestartReplica(?count, ?time) :- GetRepairHistory(?repairCount, ?time), ?repairCount >= ?count,
|
||||
EmitMessage("Exhausted specified run count, {0}, within specified max repair time window, {1}. Will not attempt RestartReplica repair at this time.", ?count, ?time).
|
||||
LogInfo("Exhausted specified run count, {0}, within specified max repair time window, {1}. Will not attempt RestartReplica repair at this time.", ?count, ?time).
|
||||
|
||||
## If we get here, it means the number of repairs for a target has not exceeded the maximum number specified to run within a time window.
|
||||
## Note you can add up to two optional arguments to RestartCodePackage/RestartReplica, name them whatever you want or omit the names, it just has to be either a TimeSpan value for how long to wait
|
||||
|
|
|
@ -1,57 +1,4 @@
|
|||
## Logic rules for scheduling Machine-level repair jobs in the cluster. EntityType fact is Machine.
|
||||
## FH does not conduct (execute) these repairs. It simply schedules them. InfrastructureService is always the Executor for these types of Repair Jobs.
|
||||
|
||||
## Applicable Named Arguments for Mitigate. Facts are supplied by FabricObserver, FHProxy or FH itself.
|
||||
## Any argument below with (FO/FHProxy) means that only FO or FHProxy will present the fact.
|
||||
## | Argument Name | Definition |
|
||||
## |---------------------------|------------------------------------------------------------------------|
|
||||
## | NodeName | Name of the node |
|
||||
## | NodeType | Type of node |
|
||||
## | ErrorCode (FO/FHProxy) | Supported Error Code emitted by caller (e.g. "FO002") |
|
||||
## | MetricName (FO/FHProxy) | Name of the Metric (e.g., CpuPercent or MemoryMB, etc.) |
|
||||
## | MetricValue (FO/FHProxy) | Corresponding Metric Value (e.g. "85" indicating 85% CPU usage) |
|
||||
## | OS | The name of the OS where FabricHealer is running (Linux or Windows) |
|
||||
## | HealthState | The HealthState of the target entity: Error or Warning |
|
||||
|
||||
## Metric Names, from FO or FHProxy.
|
||||
## | Name |
|
||||
## |--------------------------------|
|
||||
## | ActiveTcpPorts |
|
||||
## | CpuPercent |
|
||||
## | EphemeralPorts |
|
||||
## | EphemeralPortsPercent |
|
||||
## | MemoryMB |
|
||||
## | MemoryPercent |
|
||||
## | FileHandles (Linux-only) |
|
||||
## | FileHandlesPercent (Linux-only)|
|
||||
|
||||
## If this is what you need, then first check if we are inside the specified run interval for scheduling *any* machine-level repair for any reason.
|
||||
## Ending with a cut (!) means the goal (Mitigate) has been satisfied and Guan will immediately stop processing rules.
|
||||
## Mitigate() :- CheckInsideRunInterval(02:00:00), !.
|
||||
|
||||
## TimeScopedScheduleRepair is an internal predicate to check for the number of times the specified machine repair action has run to completion within a supplied time window.
|
||||
## If the completed machine repair count is less than the supplied value, then schedule an infrastructure repair via the ScheduleMachineRepair predicate.
|
||||
TimeScopedScheduleRepair(?count, ?time, ?repairAction) :- GetRepairHistory(?repairCount, ?time), ?repairCount < ?count, ScheduleMachineRepair(?repairAction).
|
||||
|
||||
## Metric-defined machine repair scheduling - facts supplied by FabricObserver service or some other service that employs the FHProxy library.
|
||||
|
||||
## Percent Memory in Use (of total physical).
|
||||
Mitigate(MetricName=MemoryPercent, MetricValue=?MetricValue) :- ?MetricValue >= 95,
|
||||
GetHealthEventHistory(?HealthEventCount, 00:15:00), ?HealthEventCount >= 3,
|
||||
TimeScopedScheduleRepair(4, 08:00:00, System.Reboot).
|
||||
|
||||
## File Handles/FDs. Linux-only.
|
||||
|
||||
## Percent Allocated, System-wide.
|
||||
Mitigate(MetricName=FileHandlesPercent, MetricValue=?MetricValue, OS=Linux) :- ?MetricValue >= 95,
|
||||
GetHealthEventHistory(?HealthEventCount, 00:15:00), ?HealthEventCount >= 3,
|
||||
TimeScopedScheduleRepair(2, 08:00:00, System.Reboot).
|
||||
|
||||
## Total Allocated, System-wide.
|
||||
Mitigate(MetricName=FileHandles, MetricValue=?MetricValue, OS=Linux) :- ?MetricValue >= 1000000,
|
||||
GetHealthEventHistory(?HealthEventCount, 00:15:00), ?HealthEventCount >= 3,
|
||||
TimeScopedScheduleRepair(2, 08:00:00, System.Reboot).
|
||||
|
||||
|
||||
## Non-FO/FHProxy machine repair logic - facts supplied by FabricHealer, based on health data from HM.
|
||||
## FabricHealerManager provides the facts used here by querying HM directly, versus supplying facts from serialized
|
||||
## TelemetryData instances generated by FabricObserver or FHProxy.
|
||||
|
@ -76,7 +23,7 @@ Mitigate() :- CheckInsideScheduleInterval(00:10:00), !.
|
|||
## Mitigations (RM repair scheduling logic - InfrastructureService for the target node type will be the repair Executor, not FH).
|
||||
## The logic below demonstrates how to specify a repair escalation path: Reboot -> Reimage -> Heal -> Triage (human intervention (TODO)).
|
||||
|
||||
## Reboot.
|
||||
## Reboot. Note that employing the internal predicate TimeScopedScheduleRepair will not work given the placement of the cut operator in the rules below.
|
||||
## Don't process any other rules if scheduling succeeds OR fails (note the position of ! (cut operator)) and there are fewer than 2 of these repairs that have completed in the last 4 hours.
|
||||
Mitigate() :- GetRepairHistory(?repairCount, 04:00:00, System.Reboot), ?repairCount < 2, !, ScheduleMachineRepair(System.Reboot).
|
||||
|
||||
|
@ -86,5 +33,5 @@ Mitigate() :- GetRepairHistory(?repairCount, 04:00:00, System.ReimageOS), ?repai
|
|||
## Heal.
|
||||
Mitigate() :- GetRepairHistory(?repairCount, 04:00:00, System.Azure.Heal), ?repairCount < 2, !, ScheduleMachineRepair(System.Azure.Heal).
|
||||
|
||||
## If we end up here, then human intervention is required (Triage).
|
||||
Mitigate(NodeName=?nodeName) :- EmitMessage("Machine repairs (escalations) have been exhausted for node {0}. Human intervention is requried.", ?nodeName), !.
|
||||
## If we end up here, then human intervention is required (Triage). LogWarning will generate an ETW event (FabricHealerDataEvent) containing the level and message.
|
||||
Mitigate(NodeName=?nodeName) :- LogWarning("Specified Machine repair escalations have been exhausted for node {0}. Human intervention is required.", ?nodeName), !.
|
|
@ -0,0 +1,88 @@
|
|||
// ------------------------------------------------------------
|
||||
// Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
|
||||
// ------------------------------------------------------------
|
||||
|
||||
using System.Globalization;
|
||||
using Guan.Logic;
|
||||
using FabricHealer.Utilities;
|
||||
using System.Threading.Tasks;
|
||||
|
||||
namespace FabricHealer.Repair.Guan
|
||||
{
|
||||
/// <summary>
|
||||
/// Helper external predicate that generates health/etw/telemetry events.
|
||||
/// </summary>
|
||||
public class LogErrorPredicateType : PredicateType
|
||||
{
|
||||
private static LogErrorPredicateType Instance;
|
||||
|
||||
private class Resolver : BooleanPredicateResolver
|
||||
{
|
||||
public Resolver(CompoundTerm input, Constraint constraint, QueryContext context)
|
||||
: base(input, constraint, context)
|
||||
{
|
||||
|
||||
}
|
||||
|
||||
protected override async Task<bool> CheckAsync()
|
||||
{
|
||||
int count = Input.Arguments.Count;
|
||||
string output, format;
|
||||
|
||||
if (count == 0)
|
||||
{
|
||||
throw new GuanException("At least 1 argument is required.");
|
||||
}
|
||||
|
||||
format = Input.Arguments[0].Value.GetEffectiveTerm().GetStringValue();
|
||||
|
||||
if (string.IsNullOrWhiteSpace(format))
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
// formatted args string?
|
||||
if (count > 1)
|
||||
{
|
||||
object[] args = new object[count - 1];
|
||||
|
||||
for (int i = 1; i < count; i++)
|
||||
{
|
||||
args[i - 1] = Input.Arguments[i].Value.GetEffectiveTerm().GetObjectValue();
|
||||
}
|
||||
|
||||
output = string.Format(CultureInfo.InvariantCulture, format, args);
|
||||
}
|
||||
else
|
||||
{
|
||||
output = format;
|
||||
}
|
||||
|
||||
await FabricHealerManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
|
||||
LogLevel.Error,
|
||||
"LogErrorPredicate",
|
||||
output,
|
||||
FabricHealerManager.Token);
|
||||
|
||||
return true;
|
||||
}
|
||||
}
|
||||
|
||||
public static LogErrorPredicateType Singleton(string name)
|
||||
{
|
||||
return Instance ??= new LogErrorPredicateType(name);
|
||||
}
|
||||
|
||||
private LogErrorPredicateType(string name)
|
||||
: base(name, true, 1)
|
||||
{
|
||||
|
||||
}
|
||||
|
||||
public override PredicateResolver CreateResolver(CompoundTerm input, Constraint constraint, QueryContext context)
|
||||
{
|
||||
return new Resolver(input, constraint, context);
|
||||
}
|
||||
}
|
||||
}
|
|
@ -13,9 +13,9 @@ namespace FabricHealer.Repair.Guan
|
|||
/// <summary>
|
||||
/// Helper external predicate that generates health/etw/telemetry events.
|
||||
/// </summary>
|
||||
public class EmitMessagePredicateType : PredicateType
|
||||
public class LogInfoPredicateType : PredicateType
|
||||
{
|
||||
private static EmitMessagePredicateType Instance;
|
||||
private static LogInfoPredicateType Instance;
|
||||
|
||||
private class Resolver : BooleanPredicateResolver
|
||||
{
|
||||
|
@ -28,15 +28,15 @@ namespace FabricHealer.Repair.Guan
|
|||
protected override async Task<bool> CheckAsync()
|
||||
{
|
||||
int count = Input.Arguments.Count;
|
||||
string output;
|
||||
string output, format;
|
||||
|
||||
if (count == 0)
|
||||
{
|
||||
throw new GuanException("At least one argument is required.");
|
||||
throw new GuanException("At least 1 argument is required.");
|
||||
}
|
||||
|
||||
string format = Input.Arguments[0].Value.GetEffectiveTerm().GetStringValue();
|
||||
|
||||
|
||||
format = Input.Arguments[0].Value.GetEffectiveTerm().GetStringValue();
|
||||
|
||||
if (string.IsNullOrWhiteSpace(format))
|
||||
{
|
||||
return true;
|
||||
|
@ -61,7 +61,7 @@ namespace FabricHealer.Repair.Guan
|
|||
|
||||
await FabricHealerManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
|
||||
LogLevel.Info,
|
||||
"EmitMessagePredicate",
|
||||
"LogInfoPredicate",
|
||||
output,
|
||||
FabricHealerManager.Token);
|
||||
|
||||
|
@ -69,12 +69,12 @@ namespace FabricHealer.Repair.Guan
|
|||
}
|
||||
}
|
||||
|
||||
public static EmitMessagePredicateType Singleton(string name)
|
||||
public static LogInfoPredicateType Singleton(string name)
|
||||
{
|
||||
return Instance ??= new EmitMessagePredicateType(name);
|
||||
return Instance ??= new LogInfoPredicateType(name);
|
||||
}
|
||||
|
||||
private EmitMessagePredicateType(string name)
|
||||
private LogInfoPredicateType(string name)
|
||||
: base(name, true, 1)
|
||||
{
|
||||
|
|
@ -0,0 +1,88 @@
|
|||
// ------------------------------------------------------------
|
||||
// Copyright (c) Microsoft Corporation. All rights reserved.
|
||||
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
|
||||
// ------------------------------------------------------------
|
||||
|
||||
using System.Globalization;
|
||||
using Guan.Logic;
|
||||
using FabricHealer.Utilities;
|
||||
using System.Threading.Tasks;
|
||||
|
||||
namespace FabricHealer.Repair.Guan
|
||||
{
|
||||
/// <summary>
|
||||
/// Helper external predicate that generates health/etw/telemetry events.
|
||||
/// </summary>
|
||||
public class LogWarningPredicateType : PredicateType
|
||||
{
|
||||
private static LogWarningPredicateType Instance;
|
||||
|
||||
private class Resolver : BooleanPredicateResolver
|
||||
{
|
||||
public Resolver(CompoundTerm input, Constraint constraint, QueryContext context)
|
||||
: base(input, constraint, context)
|
||||
{
|
||||
|
||||
}
|
||||
|
||||
protected override async Task<bool> CheckAsync()
|
||||
{
|
||||
int count = Input.Arguments.Count;
|
||||
string output, format;
|
||||
|
||||
if (count == 0)
|
||||
{
|
||||
throw new GuanException("At least 1 argument is required.");
|
||||
}
|
||||
|
||||
format = Input.Arguments[0].Value.GetEffectiveTerm().GetStringValue();
|
||||
|
||||
if (string.IsNullOrWhiteSpace(format))
|
||||
{
|
||||
return true;
|
||||
}
|
||||
|
||||
// formatted args string?
|
||||
if (count > 1)
|
||||
{
|
||||
object[] args = new object[count - 1];
|
||||
|
||||
for (int i = 1; i < count; i++)
|
||||
{
|
||||
args[i - 1] = Input.Arguments[i].Value.GetEffectiveTerm().GetObjectValue();
|
||||
}
|
||||
|
||||
output = string.Format(CultureInfo.InvariantCulture, format, args);
|
||||
}
|
||||
else
|
||||
{
|
||||
output = format;
|
||||
}
|
||||
|
||||
await FabricHealerManager.TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
|
||||
LogLevel.Warning,
|
||||
"LogWarningPredicate",
|
||||
output,
|
||||
FabricHealerManager.Token);
|
||||
|
||||
return true;
|
||||
}
|
||||
}
|
||||
|
||||
public static LogWarningPredicateType Singleton(string name)
|
||||
{
|
||||
return Instance ??= new LogWarningPredicateType(name);
|
||||
}
|
||||
|
||||
private LogWarningPredicateType(string name)
|
||||
: base(name, true, 1)
|
||||
{
|
||||
|
||||
}
|
||||
|
||||
public override PredicateResolver CreateResolver(CompoundTerm input, Constraint constraint, QueryContext context)
|
||||
{
|
||||
return new Resolver(input, constraint, context);
|
||||
}
|
||||
}
|
||||
}
|
|
@ -95,7 +95,9 @@ namespace FabricHealer.Repair
|
|||
public const string CheckInsideHealthStateMinDuration = "CheckInsideHealthStateMinDuration";
|
||||
public const string GetHealthEventHistory = "GetHealthEventHistory";
|
||||
public const string GetRepairHistory = "GetRepairHistory";
|
||||
public const string EmitMessage = "EmitMessage";
|
||||
public const string LogInfo = "LogInfo";
|
||||
public const string LogWarning = "LogWarning";
|
||||
public const string LogError = "LogError";
|
||||
|
||||
// Metric names.
|
||||
public const string ActiveTcpPorts = "ActiveTcpPorts";
|
||||
|
|
|
@ -139,7 +139,9 @@ namespace FabricHealer.Repair
|
|||
functorTable.Add(CheckInsideNodeProbationPeriodPredicateType.Singleton(RepairConstants.CheckInsideNodeProbationPeriod, repairData));
|
||||
functorTable.Add(CheckInsideScheduleIntervalPredicateType.Singleton(RepairConstants.CheckInsideScheduleInterval, repairData));
|
||||
functorTable.Add(CheckOutstandingRepairsPredicateType.Singleton(RepairConstants.CheckOutstandingRepairs, repairData));
|
||||
functorTable.Add(EmitMessagePredicateType.Singleton(RepairConstants.EmitMessage));
|
||||
functorTable.Add(LogInfoPredicateType.Singleton(RepairConstants.LogInfo));
|
||||
functorTable.Add(LogErrorPredicateType.Singleton(RepairConstants.LogError));
|
||||
functorTable.Add(LogWarningPredicateType.Singleton(RepairConstants.LogWarning));
|
||||
functorTable.Add(CheckInsideHealthStateMinDurationPredicateType.Singleton(RepairConstants.CheckInsideHealthStateMinDuration, repairData, this));
|
||||
functorTable.Add(GetHealthEventHistoryPredicateType.Singleton(RepairConstants.GetHealthEventHistory, this, repairData));
|
||||
functorTable.Add(GetRepairHistoryPredicateType.Singleton(RepairConstants.GetRepairHistory, repairData));
|
||||
|
@ -163,7 +165,7 @@ namespace FabricHealer.Repair
|
|||
List<CompoundTerm> compoundTerms = new();
|
||||
|
||||
// Mitigate is the head of the rules used in FH. It's the goal that Guan will try to accomplish based on the logical expressions (or subgoals) that form a given rule.
|
||||
CompoundTerm compoundTerm = new("Mitigate");
|
||||
CompoundTerm ruleHead = new("Mitigate");
|
||||
|
||||
// The type of metric that led FO to generate the unhealthy evaluation for the entity (App, Node, VM, Replica, etc).
|
||||
// We rename these for brevity for simplified use in logic rule composition (e.g., MetricName="Threads" instead of MetricName="Total Thread Count").
|
||||
|
@ -171,25 +173,25 @@ namespace FabricHealer.Repair
|
|||
|
||||
// These args hold the related values supplied by FO and are available anywhere Mitigate is used as a rule head.
|
||||
// Think of these as facts from FabricObserver.
|
||||
compoundTerm.AddArgument(new Constant(repairData.ApplicationName), RepairConstants.AppName);
|
||||
compoundTerm.AddArgument(new Constant(repairData.Code), RepairConstants.ErrorCode);
|
||||
compoundTerm.AddArgument(new Constant(repairData.EntityType.ToString()), RepairConstants.EntityType);
|
||||
compoundTerm.AddArgument(new Constant(repairData.HealthState.ToString()), RepairConstants.HealthState);
|
||||
compoundTerm.AddArgument(new Constant(repairData.Metric), RepairConstants.MetricName);
|
||||
compoundTerm.AddArgument(new Constant(Convert.ToInt64(repairData.Value)), RepairConstants.MetricValue);
|
||||
compoundTerm.AddArgument(new Constant(repairData.NodeName), RepairConstants.NodeName);
|
||||
compoundTerm.AddArgument(new Constant(repairData.NodeType), RepairConstants.NodeType);
|
||||
compoundTerm.AddArgument(new Constant(repairData.ObserverName), RepairConstants.ObserverName);
|
||||
compoundTerm.AddArgument(new Constant(repairData.OS), RepairConstants.OS);
|
||||
compoundTerm.AddArgument(new Constant(repairData.ServiceKind), RepairConstants.ServiceKind);
|
||||
compoundTerm.AddArgument(new Constant(repairData.ServiceName), RepairConstants.ServiceName);
|
||||
compoundTerm.AddArgument(new Constant(repairData.ProcessId), RepairConstants.ProcessId);
|
||||
compoundTerm.AddArgument(new Constant(repairData.ProcessName), RepairConstants.ProcessName);
|
||||
compoundTerm.AddArgument(new Constant(repairData.ProcessStartTime), RepairConstants.ProcessStartTime);
|
||||
compoundTerm.AddArgument(new Constant(repairData.PartitionId), RepairConstants.PartitionId);
|
||||
compoundTerm.AddArgument(new Constant(repairData.ReplicaId), RepairConstants.ReplicaOrInstanceId);
|
||||
compoundTerm.AddArgument(new Constant(repairData.ReplicaRole), RepairConstants.ReplicaRole);
|
||||
compoundTerms.Add(compoundTerm);
|
||||
ruleHead.AddArgument(new Constant(repairData.ApplicationName), RepairConstants.AppName);
|
||||
ruleHead.AddArgument(new Constant(repairData.Code), RepairConstants.ErrorCode);
|
||||
ruleHead.AddArgument(new Constant(repairData.EntityType.ToString()), RepairConstants.EntityType);
|
||||
ruleHead.AddArgument(new Constant(repairData.HealthState.ToString()), RepairConstants.HealthState);
|
||||
ruleHead.AddArgument(new Constant(repairData.Metric), RepairConstants.MetricName);
|
||||
ruleHead.AddArgument(new Constant(Convert.ToInt64(repairData.Value)), RepairConstants.MetricValue);
|
||||
ruleHead.AddArgument(new Constant(repairData.NodeName), RepairConstants.NodeName);
|
||||
ruleHead.AddArgument(new Constant(repairData.NodeType), RepairConstants.NodeType);
|
||||
ruleHead.AddArgument(new Constant(repairData.ObserverName), RepairConstants.ObserverName);
|
||||
ruleHead.AddArgument(new Constant(repairData.OS), RepairConstants.OS);
|
||||
ruleHead.AddArgument(new Constant(repairData.ServiceKind), RepairConstants.ServiceKind);
|
||||
ruleHead.AddArgument(new Constant(repairData.ServiceName), RepairConstants.ServiceName);
|
||||
ruleHead.AddArgument(new Constant(repairData.ProcessId), RepairConstants.ProcessId);
|
||||
ruleHead.AddArgument(new Constant(repairData.ProcessName), RepairConstants.ProcessName);
|
||||
ruleHead.AddArgument(new Constant(repairData.ProcessStartTime), RepairConstants.ProcessStartTime);
|
||||
ruleHead.AddArgument(new Constant(repairData.PartitionId), RepairConstants.PartitionId);
|
||||
ruleHead.AddArgument(new Constant(repairData.ReplicaId), RepairConstants.ReplicaOrInstanceId);
|
||||
ruleHead.AddArgument(new Constant(repairData.ReplicaRole), RepairConstants.ReplicaRole);
|
||||
compoundTerms.Add(ruleHead);
|
||||
|
||||
// Run Guan query.
|
||||
// This is where the supplied rules are run with FO data that may or may not lead to mitigation of some supported SF entity in trouble (or a VM/Disk).
|
||||
|
|
|
@ -139,11 +139,8 @@ namespace FabricHealer.Utilities.Telemetry
|
|||
// Anonymous types are supported by FH's ETW impl.
|
||||
var anonType = new
|
||||
{
|
||||
LogLevel = level,
|
||||
Source = source,
|
||||
Message = description,
|
||||
Property = property,
|
||||
EntityType = entityType
|
||||
LogLevel = level.ToString(),
|
||||
Message = description
|
||||
};
|
||||
|
||||
logger.LogEtw(RepairConstants.FabricHealerDataEvent, anonType);
|
||||
|
|