This commit is contained in:
Charles Torre 2022-06-01 16:29:28 -07:00
Parent 828d0f791d 2ee2742a83
Commit 3654bb3929
21 changed files: 226 additions and 146 deletions

View file

@ -23,16 +23,16 @@ Mitigate(AppName="fabric:/App1", MetricName="MemoryPercent") :- RestartCodePacka
Don't be alarmed if you don't understand how to read that repair action! We will go more in-depth later about the syntax and semantics of Guan. The takeaway is that expressing a Guan repair workflow doesn't require a deep knowledge of Prolog programming to get started. Hopefully this also gives you a general idea about the kinds of repair workflows we can express with GuanLogic.
Each repair policy has its own corresponding configuration file:
Each repair policy has its own corresponding configuration file located in the FabricHealer project's PackageRoot/Config/LogicRules folder:
| Repair Policy | Configuration File Name |
|---------------------------|------------------------------|
| AppRepairPolicy | AppRules.config.txt |
| DiskRepairPolicy | DiskRules.config.txt |
| FabricNodeRepairPolicy | FabricNodeRules.config.txt |
| ReplicaRepairPolicy | ReplicaRules.config.txt |
| SystemAppRepairPolicy | SystemAppRules.config.txt |
| VMRepairPolicy | VmRules.config.txt |
| AppRepairPolicy | AppRules.guan |
| DiskRepairPolicy | DiskRules.guan |
| FabricNodeRepairPolicy | FabricNodeRules.guan |
| MachineRepairPolicy | MachineRules.guan |
| ReplicaRepairPolicy | ReplicaRules.guan |
| SystemServiceRepairPolicy | SystemServiceRules.guan |
Now let's look at *how* to actually define a Guan logic repair workflow, so that you will have the knowledge necessary to express your own.
@ -113,17 +113,18 @@ By default, for logic-based repair workflows, FH will execute a query which call
| Argument Name | Definition |
|---------------------------|----------------------------------------------------------------------------------------------|
| AppName | Name of the SF application, format is fabric:/SomeApp |
| ServiceName | Name of the SF service, format is fabric:/SomeApp/SomeService |
| AppName | Name of the SF application, format is "fabric:/SomeApp" (Quotes are required) |
| ServiceName | Name of the SF service, format is "fabric:/SomeApp/SomeService" (Quotes are required) |
| NodeName | Name of the node |
| NodeType | Type of node |
| ObserverName | Name of Observer that generated the event (if the data comes from FabricObserver service) |
| PartitionId | Id of the partition |
| ReplicaOrInstanceId | Id of the replica or instance |
| FOErrorCode | Error Code emitted by FO (e.g. "FO002") |
| MetricName | Name of the resource supplied by FO (e.g., CpuPercent or MemoryMB, etc.) |
| MetricValue | Corresponding Metric Value supplied by FO (e.g. "85" indicating 85% CPU usage) |
| SystemServiceProcessName | The name of a Fabric system service process supplied in FO health data |
| OS | The name of the OS from which the FO data was collected (Linux or Windows) |
| ErrorCode | Supported Error Code emitted by caller (e.g. "FO002") |
| MetricName | Name of the metric (e.g., CpuPercent or MemoryMB, etc.) |
| MetricValue | Corresponding Metric Value (e.g. "85" indicating 85% CPU usage) |
| OS | The name of the OS from which the data was collected (Linux or Windows) |
| HealthState | The HealthState of the target entity: Error or Warning |
For example, if you wanted to use AppName and ServiceName in your repair workflow, you would specify them like so:
```

View file

@ -176,7 +176,7 @@ namespace FHTest
};
string testRulesFilePath = Path.Combine(Environment.CurrentDirectory, "testrules_wellformed");
string[] rules = await File.ReadAllLinesAsync(testRulesFilePath, token).ConfigureAwait(true);
string[] rules = await File.ReadAllLinesAsync(testRulesFilePath, token);
List<string> repairRules = ParseRulesFile(rules);
var repairData = new TelemetryData
{
@ -225,7 +225,7 @@ namespace FHTest
TelemetryEnabled = false
};
string[] rules = await File.ReadAllLinesAsync(Path.Combine(Environment.CurrentDirectory, "testrules_malformed"), token).ConfigureAwait(true);
string[] rules = await File.ReadAllLinesAsync(Path.Combine(Environment.CurrentDirectory, "testrules_malformed"), token);
List<string> repairAction = ParseRulesFile(rules);
var repairData = new TelemetryData

View file

@ -51,11 +51,6 @@ namespace FabricHealer
private set;
}
private bool FabricHealerOperationalTelemetryEnabled
{
get; set;
}
// CancellationToken from FabricHealer.RunAsync.
private CancellationToken Token
{
@ -98,7 +93,7 @@ namespace FabricHealer
};
RepairHistory = new RepairData();
healthReporter = new FabricHealthReporter(fabricClient);
healthReporter = new FabricHealthReporter(fabricClient, RepairLogger);
sfRuntimeVersion = GetServiceFabricRuntimeVersion();
}
@ -395,13 +390,13 @@ namespace FabricHealer
private FabricHealerOperationalEventData GetFabricHealerInternalTelemetryData()
{
FabricHealerOperationalEventData telemetryData = null;
FabricHealerOperationalEventData opsTelemData = null;
try
{
RepairHistory.EnabledRepairCount = GetEnabledRepairRuleCount();
telemetryData = new FabricHealerOperationalEventData
opsTelemData = new FabricHealerOperationalEventData
{
UpTime = DateTime.UtcNow.Subtract(StartDateTime).ToString(),
Version = InternalVersionNumber,
@ -414,7 +409,7 @@ namespace FabricHealer
}
return telemetryData;
return opsTelemData;
}
/// <summary>
@ -499,7 +494,7 @@ namespace FabricHealer
var repairRules =
GetRepairRulesFromConfiguration(
!string.IsNullOrWhiteSpace(
repairExecutorData.RepairData.SystemServiceProcessName) ? RepairConstants.SystemAppRepairPolicySectionName : RepairConstants.FabricNodeRepairPolicySectionName);
repairExecutorData.RepairData.SystemServiceProcessName) ? RepairConstants.SystemServiceRepairPolicySectionName : RepairConstants.FabricNodeRepairPolicySectionName);
var repairData = new TelemetryData
{
@ -565,7 +560,7 @@ namespace FabricHealer
string telemetryDescription = $"Cluster is currently upgrading in UD {udInClusterUpgrade}. Will not schedule or execute repairs at this time.";
await TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"MonitorRepairableHealthEventsAsync::ClusterUpgradeDetected",
"MonitorHealthEventsAsync::ClusterUpgradeDetected",
telemetryDescription,
Token,
null,
@ -582,13 +577,15 @@ namespace FabricHealer
}
catch (Exception e) when (e is FabricException || e is TimeoutException)
{
#if DEBUG
await TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"MonitorRepairableHealthEventsAsync::HandledException",
$"Failure in MonitorRepairableHealthEventAsync::Node:{Environment.NewLine}{e}",
"MonitorHealthEventsAsync::HandledException",
$"Failure in MonitorHealthEventsAsync::Node:{Environment.NewLine}{e}",
Token,
null,
ConfigSettings.EnableVerboseLogging);
#endif
}
var unhealthyEvaluations = clusterHealth.UnhealthyEvaluations;
@ -612,13 +609,15 @@ namespace FabricHealer
}
catch (Exception e) when (e is FabricException || e is TimeoutException)
{
#if DEBUG
await TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"MonitorRepairableHealthEventsAsync::HandledException",
$"Failure in MonitorRepairableHealthEventAsync::Node:{Environment.NewLine}{e}",
"MonitorHealthEventsAsync::HandledException",
$"Failure in MonitorHealthEventsAsync::Node:{Environment.NewLine}{e}",
Token,
null,
ConfigSettings.EnableVerboseLogging);
#endif
}
}
else if (kind != null && kind.Contains("Application"))
@ -658,13 +657,15 @@ namespace FabricHealer
}
catch (Exception e) when (e is FabricException || e is TimeoutException)
{
#if DEBUG
await TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"MonitorRepairableHealthEventsAsync::HandledException",
$"Failure in MonitorRepairableHealthEventAsync::Application:{Environment.NewLine}{e}",
"MonitorHealthEventsAsync::HandledException",
$"Failure in MonitorHealthEventsAsync::Application:{Environment.NewLine}{e}",
Token,
null,
ConfigSettings.EnableVerboseLogging);
#endif
}
}
}
@ -682,18 +683,20 @@ namespace FabricHealer
}
catch (Exception e) when (e is FabricException || e is TimeoutException)
{
#if DEBUG
await TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Info,
"MonitorRepairableHealthEventsAsync::HandledException",
$"Failure in MonitorRepairableHealthEventAsync::Replica:{Environment.NewLine}{e}",
"MonitorHealthEventsAsync::HandledException",
$"Failure in MonitorHealthEventsAsync::Replica:{Environment.NewLine}{e}",
Token,
null,
ConfigSettings.EnableVerboseLogging);
#endif
}
}
}
}
catch (FabricException)
catch (Exception e) when (e is ArgumentException || e is FabricException)
{
// Don't crash.
}
@ -701,15 +704,15 @@ namespace FabricHealer
{
await TelemetryUtilities.EmitTelemetryEtwHealthEventAsync(
LogLevel.Error,
"MonitorRepairableHealthEventsAsync::UnhandledException",
$"Failure in MonitorRepairableHealthEventAsync:{Environment.NewLine}{e}",
"MonitorHealthEventsAsync::UnhandledException",
$"Failure in MonitorHealthEventsAsync:{Environment.NewLine}{e}",
Token,
null,
ConfigSettings.EnableVerboseLogging);
RepairLogger.LogWarning($"Unhandled exception in MonitorRepairableHealthEventsAsync:{Environment.NewLine}{e}");
RepairLogger.LogWarning($"Unhandled exception in MonitorHealthEventsAsync:{Environment.NewLine}{e}");
// Fix the bug(s)..
// Fix the bug(s).
throw;
}
}
@ -1681,7 +1684,7 @@ namespace FabricHealer
case SupportedErrorCodes.AppWarningTooManyOpenFileHandles:
case SupportedErrorCodes.AppWarningTooManyThreads:
repairPolicySectionName = app == RepairConstants.SystemAppName ? RepairConstants.SystemAppRepairPolicySectionName : RepairConstants.AppRepairPolicySectionName;
repairPolicySectionName = app == RepairConstants.SystemAppName ? RepairConstants.SystemServiceRepairPolicySectionName : RepairConstants.AppRepairPolicySectionName;
break;
// VM repair.
@ -1733,7 +1736,7 @@ namespace FabricHealer
// System service repair.
case RepairConstants.FabricSystemObserver:
repairPolicySectionName = RepairConstants.SystemAppRepairPolicySectionName;
repairPolicySectionName = RepairConstants.SystemServiceRepairPolicySectionName;
break;
// Disk repair
@ -1776,7 +1779,7 @@ namespace FabricHealer
// System service process repair.
case EntityType.Application when repairData.SystemServiceProcessName != null:
case EntityType.Process:
repairPolicySectionName = RepairConstants.SystemAppRepairPolicySectionName;
repairPolicySectionName = RepairConstants.SystemServiceRepairPolicySectionName;
break;
// Disk repair.
@ -1974,7 +1977,7 @@ namespace FabricHealer
{
try
{
var healthReporter = new FabricHealthReporter(fabricClient);
var healthReporter = new FabricHealthReporter(fabricClient, RepairLogger);
var healthReport = new HealthReport
{
HealthMessage = "Clearing existing health reports as FabricHealer is stopping or updating.",

View file

@ -48,7 +48,7 @@ namespace FabricHealer.Interfaces
/// </summary>
string OS { get; }
/// <summary>
/// Required if the repair target is a Service. The Partition Id (as a string) where the replica or instance resides that is in Error or Warning state.
/// Required if the repair target is a Service. The Partition Id (as a nullable Guid) where the replica or instance resides that is in Error or Warning state.
/// </summary>
Guid? PartitionId { get; set; }
/// <summary>

View file

@ -5,8 +5,8 @@
These must be set in ApplicationManifest.xml -->
<Parameter Name="HealthCheckLoopSleepTimeSeconds" Value="" MustOverride="true" />
<Parameter Name="EnableVerboseLogging" Value="" MustOverride="true" />
<Parameter Name="EnableTelemetryProvider" Value="" MustOverride="true" />
<Parameter Name="EnableEventSourceProvider" Value="" MustOverride="true" />
<Parameter Name="EnableTelemetry" Value="" MustOverride="true" />
<Parameter Name="EnableETW" Value="" MustOverride="true" />
<Parameter Name="EnableAutoMitigation" Value="" MustOverride="true" />
<Parameter Name="EnableOperationalTelemetry" Value="" MustOverride="true" />
<!-- Folder name for local log output. You can use a full path or just a folder name. -->
@ -46,7 +46,7 @@
<Parameter Name="Enabled" Value="" MustOverride="true" />
<Parameter Name="LogicRulesConfigurationFile" Value="" MustOverride="true" />
</Section>
<Section Name="SystemAppRepairPolicy">
<Section Name="SystemServiceRepairPolicy">
<Parameter Name="Enabled" Value="" MustOverride="true" />
<Parameter Name="LogicRulesConfigurationFile" Value="" MustOverride="true" />
</Section>

View file

@ -3,12 +3,15 @@
// Licensed under the MIT License (MIT). See License.txt in the repo root for license information.
// ------------------------------------------------------------
using System;
namespace FabricHealer.Repair
{
/// <summary>
/// The type of repair action.
/// Not all of these have implementations yet.
/// </summary>
[Serializable]
public enum RepairActionType
{
DeleteFiles,

View file

@ -25,9 +25,9 @@ namespace FabricHealer.Repair
// RepairManager Settings Parameters.
public const string RepairManagerConfigurationSectionName = "RepairManagerConfiguration";
public const string EnableVerboseLoggingParameter = "EnableVerboseLogging";
public const string AppInsightsTelemetryEnabled = "EnableTelemetryProvider";
public const string EnableTelemetry = "EnableTelemetry";
public const string AppInsightsInstrumentationKeyParameter = "AppInsightsInstrumentationKey";
public const string EnableEventSourceProvider = "EnableEventSourceProvider";
public const string EnableETW = "EnableETW";
public const string HealthCheckLoopSleepTimeSeconds = "HealthCheckLoopSleepTimeSeconds";
public const string LocalLogPathParameter = "LocalLogPath";
public const string AsyncOperationTimeout = "AsyncOperationTimeoutSeconds";
@ -41,7 +41,7 @@ namespace FabricHealer.Repair
public const string ReplicaRepairPolicySectionName = "ReplicaRepairPolicy";
public const string AppRepairPolicySectionName = "AppRepairPolicy";
public const string DiskRepairPolicySectionName = "DiskRepairPolicy";
public const string SystemAppRepairPolicySectionName = "SystemAppRepairPolicy";
public const string SystemServiceRepairPolicySectionName = "SystemServiceRepairPolicy";
public const string MachineRepairPolicySectionName = "MachineRepairPolicy";
// RepairPolicy
@ -95,7 +95,7 @@ namespace FabricHealer.Repair
public const string FileHandlesPercent = "FileHandlesPercent";
public const string Threads = "Threads";
// Supported Observer Names
// Supported FabricObserver Observer Names
public const string AppObserver = "AppObserver";
public const string ContainerObserver = "ContainerObserver";
public const string DiskObserver = "DiskObserver";

View file

@ -185,11 +185,12 @@ namespace FabricHealer.Repair
if (!FabricHealerManager.RepairHistory.Repairs.ContainsKey(repairName))
{
FabricHealerManager.RepairHistory.Repairs.Add(repairName, 1);
FabricHealerManager.RepairHistory.Repairs.Add(repairName, (repairData.Source, 1));
}
else
{
FabricHealerManager.RepairHistory.Repairs[repairName]++;
double count = FabricHealerManager.RepairHistory.Repairs[repairName].Count + 1;
FabricHealerManager.RepairHistory.Repairs[repairName] = (repairData.Source, count);
}
FabricHealerManager.RepairHistory.RepairCount++;
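
The hunk above changes the repair history entries from a bare counter to a (source, count) pair. A minimal sketch of that update pattern follows, assuming the dictionary value is a named value tuple with a `double` count. `RepairHistorySketch` and `Record` are hypothetical names used only for illustration and are not part of FabricHealer.

```csharp
using System;
using System.Collections.Generic;

public static class RepairHistorySketch
{
    // Hypothetical stand-in for RepairHistory.Repairs: repair name -> (source, count).
    // The element types are an assumption inferred from the diff.
    public static readonly Dictionary<string, (string Source, double Count)> Repairs =
        new Dictionary<string, (string Source, double Count)>();

    public static void Record(string repairName, string source)
    {
        if (Repairs.TryGetValue(repairName, out var entry))
        {
            // Value tuples are structs, so the entry is replaced rather than mutated in place.
            Repairs[repairName] = (source, entry.Count + 1);
        }
        else
        {
            Repairs.Add(repairName, (source, 1));
        }
    }
}
```

Because a tuple stored in a dictionary cannot be incremented through the indexer, the whole entry has to be reassigned, which is what the new code in the hunk does.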

View file

@ -4,12 +4,15 @@
// ------------------------------------------------------------
using System;
using System.Diagnostics.Tracing;
namespace FabricHealer.Repair
{
/// <summary>
/// Defines the type of repair to execute.
/// </summary>
[EventData]
[Serializable]
public class RepairPolicy
{
/// <summary>
@ -23,6 +26,7 @@ namespace FabricHealer.Repair
/// <summary>
/// The type of repair execution (RestartCodePackage, RestartReplica, etc..)
/// </summary>
[EventField]
public RepairActionType RepairAction
{
get; set;
@ -31,6 +35,7 @@ namespace FabricHealer.Repair
/// <summary>
/// Maximum amount of time to check if health state of repaired target entity is Ok.
/// </summary>
[EventField]
public TimeSpan MaxTimePostRepairHealthCheck
{
get; set;
@ -40,6 +45,7 @@ namespace FabricHealer.Repair
/// Whether or not RepairManager should do preparing and restoring health checks before approving the target repair job.
/// Setting this to true will increase the time it takes to complete a repair.
/// </summary>
[EventField]
public bool DoHealthChecks
{
get; set;

View file

@ -168,7 +168,7 @@ namespace FabricHealer.Utilities
}
// Telemetry.
if (bool.TryParse(GetConfigSettingValue(RepairConstants.RepairManagerConfigurationSectionName, RepairConstants.AppInsightsTelemetryEnabled), out bool telemEnabled))
if (bool.TryParse(GetConfigSettingValue(RepairConstants.RepairManagerConfigurationSectionName, RepairConstants.EnableTelemetry), out bool telemEnabled))
{
TelemetryEnabled = telemEnabled;
@ -200,8 +200,8 @@ namespace FabricHealer.Utilities
}
}
// FabricHealer ETW telemetry.
if (bool.TryParse(GetConfigSettingValue(RepairConstants.RepairManagerConfigurationSectionName, RepairConstants.EnableEventSourceProvider), out bool etwEnabled))
// ETW.
if (bool.TryParse(GetConfigSettingValue(RepairConstants.RepairManagerConfigurationSectionName, RepairConstants.EnableETW), out bool etwEnabled))
{
EtwEnabled = etwEnabled;
}
@ -233,7 +233,7 @@ namespace FabricHealer.Utilities
EnableReplicaRepair = replicaRepairEnabled;
}
if (bool.TryParse(GetConfigSettingValue(RepairConstants.SystemAppRepairPolicySectionName, RepairConstants.Enabled), out bool systemAppRepairEnabled))
if (bool.TryParse(GetConfigSettingValue(RepairConstants.SystemServiceRepairPolicySectionName, RepairConstants.Enabled), out bool systemAppRepairEnabled))
{
EnableSystemAppRepair = systemAppRepairEnabled;
}
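
The settings code above repeats one pattern: read a string value, try to parse it as a bool, and assign only on success. A minimal sketch of that pattern as a helper follows, assuming the settings are available as a string dictionary. `ConfigSketch.ReadFlag` is a hypothetical helper, not FabricHealer's `GetConfigSettingValue`.

```csharp
using System;
using System.Collections.Generic;

public static class ConfigSketch
{
    // Hypothetical helper: returns the parsed flag, or the fallback when the
    // setting is missing, empty, or not a valid boolean string.
    public static bool ReadFlag(IReadOnlyDictionary<string, string> settings, string name, bool fallback = false)
    {
        if (settings.TryGetValue(name, out string raw) && bool.TryParse(raw, out bool value))
        {
            return value;
        }

        return fallback;
    }
}
```

With the renamed parameters from this commit, a caller would ask for `"EnableETW"` or `"EnableTelemetry"`. An empty `Value=""` that was never overridden falls through to the fallback instead of throwing.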

View file

@ -16,16 +16,18 @@ namespace FabricHealer.Utilities
public class FabricHealthReporter
{
private readonly FabricClient fabricClient;
private readonly Logger _logger;
/// <summary>
/// Initializes a new instance of the <see cref="FabricHealthReporter"/> class.
/// </summary>
/// <param name="fabricClient"></param>
public FabricHealthReporter(FabricClient fabricClient)
public FabricHealthReporter(FabricClient fabricClient, Logger logger)
{
this.fabricClient = fabricClient ?? throw new ArgumentException("FabricClient can't be null");
this.fabricClient.Settings.HealthReportSendInterval = TimeSpan.FromSeconds(1);
this.fabricClient.Settings.HealthReportRetrySendInterval = TimeSpan.FromSeconds(3);
_logger = logger;
}
public void ReportHealthToServiceFabric(HealthReport healthReport)
@ -57,6 +59,19 @@ namespace FabricHealer.Utilities
RemoveWhenExpired = true,
};
// Local file logging.
if (healthReport.EmitLogEvent)
{
if (healthReport.State == HealthState.Ok)
{
_logger.LogInfo(healthReport.HealthMessage);
}
else
{
_logger.LogWarning(healthReport.HealthMessage);
}
}
switch (healthReport.EntityType)
{
case EntityType.Application when healthReport.AppName != null:
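
The reporter now owns local file logging and only logs when the report asks for it. A minimal sketch of that gate follows. `ReportState`, `ListLogger`, and `LogReport` are hypothetical stand-ins for `HealthState`, FabricHealer's `Logger`, and the logic inside `ReportHealthToServiceFabric`. The null check on the logger is an addition of this sketch; the constructor in the diff does not validate the logger argument.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for System.Fabric.Health.HealthState.
public enum ReportState { Ok, Warning, Error }

// Hypothetical in-memory logger so the sketch runs without FabricHealer's Logger.
public class ListLogger
{
    public List<string> Lines { get; } = new List<string>();
    public void LogInfo(string message) => Lines.Add("INFO: " + message);
    public void LogWarning(string message) => Lines.Add("WARN: " + message);
}

public static class ReporterLoggingSketch
{
    public static void LogReport(ListLogger logger, bool emitLogEvent, ReportState state, string message)
    {
        // Nothing is logged unless the report opts in (and a logger was supplied).
        if (!emitLogEvent || logger == null)
        {
            return;
        }

        if (state == ReportState.Ok)
        {
            logger.LogInfo(message);
        }
        else
        {
            // Warning and Error reports both go through LogWarning, as in the diff.
            logger.LogWarning(message);
        }
    }
}
```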

View file

@ -31,7 +31,10 @@ namespace FabricHealer.Utilities
get; set;
}
public bool EmitLogEvent { get; set; } = true;
public bool EmitLogEvent
{
get; set;
}
public EntityType EntityType
{

View file

@ -16,13 +16,6 @@ namespace FabricHealer.Utilities.Telemetry
{
public static readonly ServiceEventSource Current = new ServiceEventSource();
static ServiceEventSource()
{
// A workaround for the problem where ETW activities do not get tracked until Tasks infrastructure is initialized.
// This problem is fixed in .NET Framework 4.6.2. If you are running this version or greater, then delete the below code.
_ = Task.Run(() => { });
}
// Instance constructor is private to enforce singleton semantics.
// FabricObserver ETW provider name is passed to base.ctor here instead of decorating this class.
private ServiceEventSource() : base(RepairConstants.EventSourceProviderName)

View file

@ -9,9 +9,12 @@ using FabricHealer.Interfaces;
using System.Fabric.Health;
using System;
using FabricHealer.Repair;
using System.Diagnostics.Tracing;
namespace FabricHealer.Utilities.Telemetry
{
[EventData]
[Serializable]
public class TelemetryData : ITelemetryData
{
private readonly string _os;
@ -41,11 +44,13 @@ namespace FabricHealer.Utilities.Telemetry
get; set;
}
[EventField]
public EntityType EntityType
{
get; set;
}
[EventField]
public HealthState HealthState
{
get; set;
@ -74,6 +79,7 @@ namespace FabricHealer.Utilities.Telemetry
get { return _os; }
}
[EventField]
public Guid? PartitionId
{
get; set;
@ -119,6 +125,7 @@ namespace FabricHealer.Utilities.Telemetry
get; set;
}
[EventField]
public RepairPolicy RepairPolicy
{
get; set;
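
Marking `TelemetryData` and `RepairPolicy` with `[EventData]` lets an `EventSource` write them as structured payloads. A minimal sketch using the standard `System.Diagnostics.Tracing` API follows. `RepairEventPayload`, `EtwSketch`, and the source and event names are hypothetical; FabricHealer's own `ServiceEventSource` wrapper methods are not reproduced here.

```csharp
using System;
using System.Diagnostics.Tracing;

// Hypothetical payload type; the property names are illustrative only.
// As far as I know, public properties of an [EventData] type become event fields.
[EventData]
public class RepairEventPayload
{
    public string NodeName { get; set; }
    public string RepairAction { get; set; }
    public bool DoHealthChecks { get; set; }
}

public static class EtwSketch
{
    // The name-only constructor creates a source that emits self-describing events.
    public static readonly EventSource Source = new EventSource("Sketch-RepairEvents");

    public static void WriteWarning(RepairEventPayload payload)
    {
        // EventSourceOptions carries the severity for this single write.
        var options = new EventSourceOptions { Level = EventLevel.Warning };
        Source.Write("RepairEvent", options, payload);
    }
}
```

When no listener is attached, `Write` returns without emitting anything, so the call is cheap on nodes where ETW collection is off.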

View file

@ -84,19 +84,16 @@ namespace FabricHealer.Utilities.Telemetry
_ => HealthState.Ok
};
// Do not write ETW/send Telemetry if the data is informational-only and verbose logging is not enabled.
// Do not write ETW/send Telemetry/create health report if the data is informational-only and verbose logging is not enabled.
// This means only Warning and Error messages will be transmitted. In general, however, it is best to enable Verbose Logging (default)
// in FabricHealer as it will not generate noisy local logs and you will have a complete record of mitigation steps in your AI or LA workspace.
// in FabricHealer as it will not generate noisy local logs and you will have a complete record of mitigation history in your AI or LA workspace.
if (!verboseLogging && level == LogLevel.Info)
{
return;
}
// Local Logging
logger.LogInfo(description);
// Service Fabric health report generation.
var healthReporter = new FabricHealthReporter(fabricClient);
var healthReporter = new FabricHealthReporter(fabricClient, logger);
var healthReport = new HealthReport
{
AppName = reportType == EntityType.Application ? new Uri("fabric:/FabricHealer") : null,
@ -108,52 +105,61 @@ namespace FabricHealer.Utilities.Telemetry
HealthReportTimeToLive = ttl == default ? TimeSpan.FromMinutes(5) : ttl,
Property = property,
SourceId = source,
EmitLogEvent = true
};
healthReporter.ReportHealthToServiceFabric(healthReport);
if (!FabricHealerManager.ConfigSettings.EtwEnabled && !FabricHealerManager.ConfigSettings.TelemetryEnabled)
{
return;
}
var telemData = new TelemetryData()
{
ApplicationName = repairData?.ApplicationName,
ClusterId = ClusterInformation.ClusterInfoTuple.ClusterId,
Code = repairData?.Code,
ContainerId = repairData?.ContainerId,
Description = description,
EntityType = reportType,
HealthState = healthState,
Metric = repairAction,
NodeName = repairData?.NodeName,
NodeType = repairData?.NodeType,
ObserverName = repairData?.ObserverName,
PartitionId = repairData?.PartitionId,
ProcessId = repairData != null ? repairData.ProcessId : -1,
Property = property,
ReplicaId = repairData != null ? repairData.ReplicaId : 0,
RepairPolicy = repairData?.RepairPolicy ?? new RepairPolicy(),
ServiceName = repairData?.ServiceName,
Source = source,
SystemServiceProcessName = repairData?.SystemServiceProcessName,
Value = repairData != null ? repairData.Value : -1,
};
// Telemetry.
if (FabricHealerManager.ConfigSettings.TelemetryEnabled)
{
var telemData = new TelemetryData()
{
ApplicationName = repairData?.ApplicationName ?? string.Empty,
ClusterId = ClusterInformation.ClusterInfoTuple.ClusterId,
Description = description,
HealthState = healthState,
Metric = repairAction,
NodeName = repairData?.NodeName ?? string.Empty,
PartitionId = repairData?.PartitionId != null ? repairData.PartitionId : default,
ReplicaId = repairData != null ? repairData.ReplicaId : 0,
ServiceName = repairData?.ServiceName ?? string.Empty,
Source = source,
SystemServiceProcessName = repairData?.SystemServiceProcessName ?? string.Empty,
};
await telemetryClient?.ReportMetricAsync(telemData, token);
}
// ETW.
if (FabricHealerManager.ConfigSettings.EtwEnabled)
{
ServiceEventSource.Current.Write(
RepairConstants.EventSourceEventName,
new
if (healthState == HealthState.Ok || healthState == HealthState.Unknown || healthState == HealthState.Invalid)
{
ApplicationName = repairData?.ApplicationName ?? string.Empty,
ClusterInformation.ClusterInfoTuple.ClusterId,
Description = description,
HealthState = Enum.GetName(typeof(HealthState), healthState),
Metric = repairAction,
PartitionId = repairData?.PartitionId.ToString() ?? string.Empty,
ReplicaId = repairData?.ReplicaId.ToString() ?? string.Empty,
Level = level,
NodeName = repairData?.NodeName ?? string.Empty,
OS = RuntimeInformation.IsOSPlatform(OSPlatform.Windows) ? "Windows" : "Linux",
ServiceName = repairData?.ServiceName ?? string.Empty,
Source = source,
SystemServiceProcessName = repairData?.SystemServiceProcessName ?? string.Empty,
});
ServiceEventSource.Current.DataTypeWriteInfo(RepairConstants.EventSourceEventName, telemData);
}
else if (healthState == HealthState.Warning)
{
ServiceEventSource.Current.DataTypeWriteWarning(RepairConstants.EventSourceEventName, telemData);
}
else
{
ServiceEventSource.Current.DataTypeWriteError(RepairConstants.EventSourceEventName, telemData);
}
}
}
}
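
The new ETW branch picks a write method from the health state: Ok, Unknown, and Invalid go out as informational events, Warning as a warning, and everything else as an error. A minimal sketch of that mapping as a pure function follows. `HealthStateSketch` and `LevelFor` are hypothetical names; the real code uses `System.Fabric.Health.HealthState` and calls the `DataTypeWrite*` methods directly.

```csharp
using System;

// Hypothetical stand-in for System.Fabric.Health.HealthState.
public enum HealthStateSketch { Invalid, Ok, Warning, Error, Unknown }

public static class EtwLevelSketch
{
    // Mirrors the branch order in the diff: Ok/Unknown/Invalid -> Info,
    // Warning -> Warning, anything else -> Error.
    public static string LevelFor(HealthStateSketch state)
    {
        switch (state)
        {
            case HealthStateSketch.Ok:
            case HealthStateSketch.Unknown:
            case HealthStateSketch.Invalid:
                return "Info";
            case HealthStateSketch.Warning:
                return "Warning";
            default:
                return "Error";
        }
    }
}
```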

View file

@ -3,9 +3,9 @@
<Parameters>
<!-- FabricHealerManager Settings -->
<Parameter Name="AutoMitigationEnabled" DefaultValue="true" />
<Parameter Name="EventSourceProviderEnabled" DefaultValue="true" />
<Parameter Name="EnableETW" DefaultValue="true" />
<Parameter Name="MonitorLoopSleepSeconds" DefaultValue="5" />
<Parameter Name="TelemetryProviderEnabled" DefaultValue="true" />
<Parameter Name="EnableTelemetry" DefaultValue="true" />
<!-- Set VerboseLoggingEnabled to true if you want detailed local logging and telemetry/ETW with repair data.
This data will live in a folder named RepairData, which will be created in your LocalLogPath directory.
Default is true. This is not noisy. Keep this enabled if you want a record of repair workflow steps. -->
@ -18,7 +18,7 @@
<Parameter Name="EnableFabricNodeRepair" DefaultValue="false" />
<Parameter Name="EnableMachineRepair" DefaultValue="false" />
<Parameter Name="EnableReplicaRepair" DefaultValue="false" />
<Parameter Name="EnableSystemAppRepair" DefaultValue="false" />
<Parameter Name="EnableSystemServiceRepair" DefaultValue="false" />
<!-- Logic rule files -->
<Parameter Name="AppRulesConfigurationFile" DefaultValue="AppRules.guan" />
<Parameter Name="DiskRulesConfigurationFile" DefaultValue="DiskRules.guan" />
@ -39,8 +39,8 @@
<Section Name="RepairManagerConfiguration">
<Parameter Name="HealthCheckLoopSleepTimeSeconds" Value="[MonitorLoopSleepSeconds]" />
<Parameter Name="EnableAutoMitigation" Value="[AutoMitigationEnabled]" />
<Parameter Name="EnableEventSourceProvider" Value="[EventSourceProviderEnabled]" />
<Parameter Name="EnableTelemetryProvider" Value="[TelemetryProviderEnabled]" />
<Parameter Name="EnableETW" Value="[EnableETW]" />
<Parameter Name="EnableTelemetry" Value="[EnableTelemetry]" />
<Parameter Name="EnableVerboseLogging" Value="[VerboseLoggingEnabled]" />
<Parameter Name="EnableOperationalTelemetry" Value="[OperationalTelemetryEnabled]" />
<Parameter Name="LocalLogPath" Value="[LocalLogPath]" />
@ -62,8 +62,8 @@
<Parameter Name="Enabled" Value="[EnableReplicaRepair]" />
<Parameter Name="LogicRulesConfigurationFile" Value="[ReplicaRulesConfigurationFile]" />
</Section>
<Section Name="SystemAppRepairPolicy">
<Parameter Name="Enabled" Value="[EnableSystemAppRepair]" />
<Section Name="SystemServiceRepairPolicy">
<Parameter Name="Enabled" Value="[EnableSystemServiceRepair]" />
<Parameter Name="LogicRulesConfigurationFile" Value="[SystemServiceRulesConfigurationFile]" />
</Section>
<Section Name="MachineRepairPolicy">

View file

@ -4,10 +4,10 @@ FabricHealerProxy is a .NET Standard 2.0 library that provides a very simple and
### How to use FabricHealerProxy
- Deploy [FabricHealer](https://github.com/microsoft/service-fabric-healer/releases) [TODO: this will point to Deployment doc folder] to your cluster (Do note that if you deploy FabricHealer as a singleton partition 1 (versus -1), then FH will only conduct SF-related repairs).
- Install FabricHealerProxy nupkg into your own service from where you want to initiate repair of SF entities (stateful/stateless services, Fabric nodes).
- Deploy [FabricHealer](https://github.com/microsoft/service-fabric-healer/releases) to your cluster (Do note that if you deploy FabricHealer as a singleton partition 1 (versus -1), then FH will only conduct SF-related repairs).
- Install FabricHealerProxy nupkg [TODO: Link to nuget.org section] into your own service from where you want to initiate repair of SF entities (stateful/stateless services, Fabric nodes).
FabricHealer will execute entity-related logic rules (housed in its FabricNodeRules.guan file in this case), and if any of the rules succeed, then FH will create a Repair Job with pre and post safety checks (default),
FabricHealer will execute entity-related logic rules (housed in FabricHealer's PackageRoot/Config/LogicRules folder), and if any of the related rules succeed, then FH will create a Repair Job with pre and post safety checks (default),
orchestrate RM through to repair completion (FH will be the executor of the repair), emit repair step information via telemetry, local logging, and ETW.
### Sample application (Stateless Service)
@ -49,22 +49,22 @@ namespace Stateless1
// already has a restart replica catch-all (applies to any service) rule that will restart the primary replica of
// the specified service below, deployed to the specified Fabric node.
// By default, if you only supply NodeName and ServiceName, then FabricHealerProxy assumes the target EntityType is Service. This is a convenience to limit how many facts
// you must supply in a RepairFacts instance. For any type of repair, NodeName is always required.
// you must supply in a RepairFacts instance. Note that for *any* type of repair, NodeName is always required.
var RepairFactsServiceTarget1 = new RepairFacts
{
ServiceName = "fabric:/HealthMetrics/DoctorActorServiceType",
ServiceName = "fabric:/GettingStartedApplication/MyActorService",
NodeName = "_Node_0"
};
var RepairFactsServiceTarget2 = new RepairFacts
{
ServiceName = "fabric:/HealthMetrics/BandActorServiceType",
ServiceName = "fabric:/GettingStartedApplication/StatefulBackendService",
NodeName = "_Node_0"
};
var RepairFactsServiceTarget3 = new RepairFacts
{
ServiceName = "fabric:/HealthMetrics/HealthMetrics.WebServiceType",
ServiceName = "fabric:/GettingStartedApplication/StatelessBackendService",
NodeName = "_Node_0"
};
@ -88,13 +88,7 @@ namespace Stateless1
var RepairFactsServiceTarget7 = new RepairFacts
{
ServiceName = "fabric:/ContainerFoo2/ContainerFooService",
NodeName = "_Node_0"
};
var RepairFactsServiceTarget8 = new RepairFacts
{
ServiceName = "fabric:/ContainerFoo2/ContainerService2",
ServiceName = "fabric:/GettingStartedApplication/WebService",
NodeName = "_Node_0"
};
@ -114,6 +108,26 @@ namespace Stateless1
EntityType = EntityType.Machine
};
// Restart system service process.
var SystemServiceRepairFacts = new RepairFacts
{
ApplicationName = "fabric:/System",
NodeName = "_Node_0",
SystemServiceProcessName = "FabricDCA",
ProcessId = 73588,
Code = SupportedErrorCodes.AppWarningMemoryMB
};
// Disk - Delete files. This only works if FabricHealer instance is present on the same target node.
// Note the rules in FabricHealer\PackageRoot\LogicRules\DiskRules.guan file in the FabricHealer project.
var DiskRepairFacts = new RepairFacts
{
NodeName = "_Node_0",
EntityType = EntityType.Disk,
Metric = SupportedMetricNames.DiskSpaceUsageMb,
Code = SupportedErrorCodes.NodeWarningDiskSpaceMB
};
// For use in the IEnumerable<RepairFacts> RepairEntityAsync overload.
List<RepairFacts> RepairFactsList = new List<RepairFacts>
{
@ -124,16 +138,17 @@ namespace Stateless1
RepairFactsServiceTarget4,
RepairFactsServiceTarget5,
RepairFactsServiceTarget6,
RepairFactsServiceTarget7,
RepairFactsServiceTarget8
RepairFactsServiceTarget7
};
// This demonstrates which exceptions will be thrown by the API. The first three are FabricHealerProxy custom exceptions and represent user error (most likely).
// The last two are internal SF issues which will be thrown only after a series of retries. How to handle these is up to you.
try
{
await FabricHealer.Proxy.RepairEntityAsync(RepairFactsMachineTarget, cancellationToken).ConfigureAwait(false);
await FabricHealer.Proxy.RepairEntityAsync(RepairFactsList, cancellationToken).ConfigureAwait(false);
await FabricHealer.Proxy.RepairEntityAsync(DiskRepairFacts, cancellationToken);
//await FabricHealer.Proxy.RepairEntityAsync(SystemServiceRepairFacts, cancellationToken);
//await FabricHealer.Proxy.RepairEntityAsync(RepairFactsMachineTarget, cancellationToken);
//await FabricHealer.Proxy.RepairEntityAsync(RepairFactsList, cancellationToken);
}
catch (MissingRepairFactsException)
{
@ -162,11 +177,15 @@ namespace Stateless1
// This means that something is wrong at the SF level, so you could wait and then try again later.
}
// FabricHealerProxy API is thread-safe. So, you can process the list of repair facts above in a parallel loop, for example.
/*_ = Parallel.For (0, RepairFactsList.Count, async (i, state) =>
// FabricHealerProxy API is thread-safe. So, you could also process the List<RepairFacts> above in a parallel loop, for example.
/*
_ = Parallel.For (0, RepairFactsList.Count, async (i, state) =>
{
await FabricHealer.Proxy.RepairEntityAsync(RepairFactsList[i], cancellationToken).ConfigureAwait(false);
});*/
});
*/
// Do nothing and wait.
while (!cancellationToken.IsCancellationRequested)
@ -181,7 +200,7 @@ namespace Stateless1
}
}
// When cancellationToken is cancelled (in this case by the SF runtime) any active health reports will be automatically cleared by FabricHealerProxy.
// When the RunAsync cancellationToken is cancelled (in this case by the SF runtime) any active health reports will be automatically cleared by FabricHealerProxy.
// Note: This does not guarantee that a repair already in progress for a target entity with an active FabricHealerProxy health report will be cancelled.
// FabricHealer does not currently support cancelling repairs.
}
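
The commented-out `Parallel.For` loop above passes an `async` lambda to a parameter of type `Action<int, ParallelLoopState>`, so the lambda compiles as `async void`: the loop returns before the awaited calls finish, and their exceptions never reach a surrounding `try`/`catch`. A minimal sketch of an alternative that starts the requests concurrently and awaits them all follows. `RepairAllAsync` and `FakeRepairAsync` are hypothetical names, and `FakeRepairAsync` only stands in for `FabricHealer.Proxy.RepairEntityAsync`; this is an illustration under those assumptions, not code from the commit.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ConcurrentRepairSketch
{
    // Starts one repair request per target and returns a task that completes
    // only when all of them have finished. Awaiting the returned task rethrows
    // the first failure, so a surrounding try/catch can observe it.
    public static Task RepairAllAsync<T>(
        IEnumerable<T> targets,
        Func<T, CancellationToken, Task> repairAsync,
        CancellationToken token)
    {
        return Task.WhenAll(targets.Select(target => repairAsync(target, token)));
    }

    public static async Task Main()
    {
        var targets = new List<string> { "fabric:/App1/Service1", "fabric:/App1/Service2" };

        // Hypothetical stand-in for FabricHealer.Proxy.RepairEntityAsync(repairFacts, token).
        static async Task FakeRepairAsync(string serviceName, CancellationToken token)
        {
            await Task.Delay(10, token);
            Console.WriteLine($"Repair requested for {serviceName}");
        }

        await RepairAllAsync(targets, FakeRepairAsync, CancellationToken.None);
        Console.WriteLine("All repair requests completed.");
    }
}
```

In the sample service, the same shape would apply to `RepairFactsList`, with the `await` placed inside the existing `try` block so the documented exception handlers still apply.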

View file

@ -27,6 +27,7 @@ namespace FabricHealerProxy
/// </summary>
public sealed class FabricHealer
{
private const string FHProxyId = "FabricHealerProxy";
private static FabricHealer instance;
private static readonly FabricClientSettings settings = new FabricClientSettings
@ -234,7 +235,7 @@ namespace FabricHealerProxy
CodePackageActivationContext context =
await FabricRuntime.GetActivationContextAsync(TimeSpan.FromSeconds(30), cancellationToken);
repairData.Source = context.GetServiceManifestName() + "_" + "FabricHealerProxy";
repairData.Source = $"{context.GetServiceManifestName()}_{FHProxyId}";
}
// Support for repair data that does not contain replica/partition facts for service level repair.

View file

@ -9,10 +9,10 @@ namespace FabricHealer.TelemetryLib
{
public class RepairData
{
public Dictionary<string, double> Repairs
public Dictionary<string, (string Source, double Count)> Repairs
{
get; set;
} = new Dictionary<string, double>();
} = new Dictionary<string, (string Source, double Count)>();
public double RepairCount
{

View file

@ -6,6 +6,7 @@
using System;
using System.Collections.Generic;
using System.Fabric;
using System.Fabric.Repair;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
@ -52,7 +53,7 @@ namespace FabricHealer.TelemetryLib
{
_ = TryGetHashStringSha256(serviceContext?.NodeContext.NodeName, out string nodeHashString);
IDictionary<string, string> eventProperties = new Dictionary<string, string>
var eventProperties = new Dictionary<string, string>
{
{ "EventName", OperationalEventName},
{ "TaskName", TaskName},
@ -75,7 +76,7 @@ namespace FabricHealer.TelemetryLib
}
}
Dictionary<string, double> eventMetrics = new Dictionary<string, double>
var eventMetrics = new Dictionary<string, double>
{
{ "EnabledRepairCount", repairData.RepairData.EnabledRepairCount },
{ "TotalRepairAttempts", repairData.RepairData.RepairCount },
@ -83,8 +84,21 @@ namespace FabricHealer.TelemetryLib
{ "FailedRepairs", repairData.RepairData.FailedRepairs },
};
Dictionary<string, double> repairs = repairData.RepairData.Repairs;
eventMetrics.Append(repairs);
// Add RepairData (repair name, count).
var repairDataNames = new Dictionary<string, double>();
foreach (var t in repairData.RepairData.Repairs)
{
repairDataNames.Add(t.Key, t.Value.Count);
}
eventMetrics.Append(repairDataNames);
// Add RepairData (source name, count).
var repairDataSources = new Dictionary<string, double>();
foreach (var t in repairData.RepairData.Repairs)
{
repairDataSources.Add(t.Value.Source, t.Value.Count);
}
eventMetrics.Append(repairDataSources);
telemetryClient?.TrackEvent($"{TaskName}.{OperationalEventName}", eventProperties, eventMetrics);
telemetryClient?.Flush();
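
One thing to watch in the source-name loop above: `Dictionary.Add` throws `ArgumentException` on a duplicate key, which would happen if two repairs share the same `Source`. A minimal sketch of summing counts per source instead is shown below. `CountBySource` is a hypothetical helper, not part of FabricHealer, the sample repair and source names are invented for illustration, and it assumes every `Source` is non-null.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RepairMetricsSketch
{
    // Hypothetical helper: sums repair counts per source so that two repairs
    // reported by the same source do not collide on the dictionary key.
    // Assumes Source is never null (ToDictionary rejects null keys).
    public static Dictionary<string, double> CountBySource(
        IReadOnlyDictionary<string, (string Source, double Count)> repairs)
    {
        return repairs.Values
            .GroupBy(r => r.Source)
            .ToDictionary(g => g.Key, g => g.Sum(r => r.Count));
    }

    public static void Main()
    {
        // Illustrative data only; keys are repair names, values are (source, count).
        var repairs = new Dictionary<string, (string Source, double Count)>
        {
            { "RestartCodePackage", ("FabricObserver", 2) },
            { "RestartReplica", ("FabricObserver", 1) },
            { "DeleteFiles", ("FabricHealerProxy", 4) },
        };

        foreach (var pair in CountBySource(repairs))
        {
            Console.WriteLine($"{pair.Key}: {pair.Value}");
        }
    }
}
```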

View file

@ -1,5 +1,6 @@
## FabricHealer
## FabricHealer 1.1.0
### Configuration as Logic and auto-mitigation in Service Fabric clusters
### Important: Requires Service Fabric version 8.x and higher
### (Requires net6.0+ and SF Runtime 9.0+)
@ -25,6 +26,12 @@ FabricHealer requires that RepairManager (RM) service is deployed.
For VM level repair, InfrastructureService (IS) service must be deployed.
```
## Build and run
1. Clone the repo.
2. Install [.NET Core 3.1](https://dotnet.microsoft.com/download/dotnet-core/3.1)
3. Build.
***Note: FabricHealer must be run under the LocalSystem account (see ApplicationManifest.xml) in order to function correctly. This means on Windows, by default, it will run as System user. On Linux, by default, it will run as root user. You do not have to make any changes to ApplicationManifest.xml for this to be the case.***
## Using FabricHealer
@ -34,7 +41,7 @@ FabricHealer is a service specifically designed to auto-mitigate Service Fabric
the result of bugs in user code.
```
Let's say you have a service that leaks memory or ephemeral ports. You would use FabricHealer to keep the problem in check while your developers figure out the root cause and fix the bug(s) that lead to resource usage over-consumption. FabricHealer is really just a temporary solution to problems, not a fix. This is how you should think about auto-mitigation, generally. FabricHealer aims to keep your cluster green while you fix your bugs. With its configuration-as-logic support, you can easily specify that some repair for some service should only be attempted for n weeks or months, while your dev team fixes the underlying issues with the problematic service. FabricHealer should be thought of as a "disappearing task force" in that it can provide stability during times of instability, then "go away" when bugs are fixed.
Let's say you have a service that is using too much memory or too many ephemeral ports, as defined in FabricObserver (which generates the Warnings) and, optionally, in your related logic rule. The rule-level check is optional because you can decide that whenever FabricObserver warns, FabricHealer should mitigate without re-testing the metric value that led to the Warning, which is based on thresholds you configured in FabricObserver. It's up to you. You would use FabricHealer to keep the problem in check while your developers figure out the root cause and fix the bug(s) that lead to resource over-consumption. FabricHealer is really just a temporary solution to problems, not a fix. This is how you should think about auto-mitigation, generally. FabricHealer aims to keep your cluster green while you fix your bugs. With its configuration-as-logic support, you can easily specify that some repair for some service should only be attempted for n weeks or months, while your dev team fixes the underlying issues with the problematic service. FabricHealer should be thought of as a "disappearing task force" in that it can provide stability during times of instability, then "go away" when bugs are fixed.
FabricHealer comes with a number of already-implemented/tested target-specific logic rules. You will only need to modify existing rules to get going quickly. FabricHealer is a rule-based repair service and the rules are defined in logic. These rules also form FabricHealer's repair workflow configuration. This is what is meant by Configuration-as-Logic. The only use of XML-based configuration with respect to repair workflow is enabling automitigation (big on/off switch), enabling repair policies, and specifying rule file names. The rest is just the typical Service Fabric application configuration that you know and love. Most of the settings in Settings.xml are overridable parameters and you set the values in ApplicationManifest.xml. This enables versionless parameter-only application upgrades, which means you can change Settings.xml-based settings without redeploying FabricHealer.
@ -60,4 +67,5 @@ Mitigate(AppName="fabric:/ILikeMemory", MetricName="MemoryPercent", MetricValue=
## Quickstart
To quickly learn how to use FabricHealer, please see the [simple scenario-based examples.](https://github.com/microsoft/service-fabric-healer/blob/main/Documentation/Using.md)