Merge with main branch
Commit dfb2dde747
@ -0,0 +1,50 @@
name: Upload Release Asset

on:
  push:
    # Sequence of patterns matched against refs/tags
    tags:
      - 'v*'

jobs:
  build:
    name: Upload Release Asset
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Build project # This would actually build your project, using zip for an example artifact
        run: |
          cd ./hcheck/hcheck/
          dotnet build -r linux-x64 --self-contained
      - name: Publish
        run: dotnet publish ./hcheck/hcheck/hcheck.csproj -c Release -o release -r linux-x64 --self-contained
      - name: copy send_log file
        run: cp ./hcheck/hcheck/src/send_log /home/runner/work/cyclecloud-nodehealth/cyclecloud-nodehealth/hcheck/hcheck/bin/Release/net6.0/linux-x64/
      - name: Get the version
        id: get_version
        run:
          echo ::set-output name=VERSION::${GITHUB_REF#refs/tags/}
      - name: tar files
        run: |
          echo ${{ steps.get_version.outputs.version }}
          cd /home/runner/work/cyclecloud-nodehealth/cyclecloud-nodehealth/hcheck/hcheck/bin/Release/net6.0/
          tar czf hcheck-linux-${{ steps.get_version.outputs.version }}.tgz ./linux-x64
      - name: Create Release
        id: create_release
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          tag_name: ${{ github.ref }}
          release_name: Release ${{ github.ref }}
          draft: false
          prerelease: true
      - name: Upload Release Asset
        id: upload-release-asset
        uses: actions/upload-release-asset@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          upload_url: ${{ steps.create_release.outputs.upload_url }} # This pulls from the CREATE RELEASE step above, referencing its ID to get its outputs object, which includes an `upload_url`. See this blog post for more info: https://jasonet.co/posts/new-features-of-github-actions/#passing-data-to-future-steps
          asset_path: /home/runner/work/cyclecloud-nodehealth/cyclecloud-nodehealth/hcheck/hcheck/bin/Release/net6.0/hcheck-linux-${{ steps.get_version.outputs.version }}.tgz
          asset_name: hcheck-linux-${{ steps.get_version.outputs.version }}.tgz
          asset_content_type: application/tgz
README.md
@ -4,9 +4,30 @@ Azure-healthcheck project is a helper that is capable of running custom healthch

This project supports [NHC](https://github.com/mej/nhc) healthcheck scripts and allows the addition of custom scripts. This was achieved with the help of work by Cormac Garvey, [cc_slurm_nhc](https://github.com/Azure/azurehpc/tree/master/experimental/cc_slurm_nhc). To learn more about this project and the advantages of running GPU healthchecks, refer to [this article](https://techcommunity.microsoft.com/t5/azure-global/automated-hpc-ai-compute-node-health-checks-integrated-with-the/ba-p/3113454).

## Table of Contents (by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc))
<!--ts-->
* [Installation](#installation)
   * [Prerequisites](#prerequisites)
   * [Building the project](#building-the-project)
   * [Uploading the executable files into the blobs storage](#uploading-the-executable-files-into-the-blobs-storage)
   * [Uploading the project to the Azure locker](#uploading-the-project-to-the-azure-locker)
* [Customizing healthchecks](#customizing-healtchecks)
   * [Importing the cluster template into CycleCloud](#importing-the-cluster-template-into-cyclecloud)
* [Running NHC healthcheck](#running-nhc-healthcheck)
   * [Designing custom NHC tests](#designing-custom-nhc-tests)
* [Running custom test scripts](#running-custom-test-scripts)
   * [Designing custom test scripts](#designing-custom-test-scripts)
* [Running the hcheck binary](#running-the-hcheck-binary)
* [Changing the script for reporting errors](#changing-the-script-for-reporting-errors)
* [Testing the project](#testing-the-project)
* [Sample healthcheck report](#sample-healthcheck-report)
* [Contributing](#contributing)
* [Trademarks](#trademarks)

## Installation

- ### Pre-requisites
+ ### Prerequisites

The instructions below assume that:

* you have a valid CycleCloud subscription
@ -16,11 +37,27 @@ The instructions below assume that:

### Building the project

- The project comes with a pre-built binary used to run the test scripts and build reports compatible with linux-x64. If you wish to build the source yourself, you will need to install .NET Core. Please refer to the deploy.sh for an example of steps you need to take
+ The project comes with a pre-built binary, compatible with linux-x64, used to run the test scripts and build reports. If you wish to build the source yourself, you will need to install .NET Core. Please refer to deploy.sh for an example of the steps you need to take.

```bash
cd ./hcheck/hcheck/
dotnet build -r linux-x64 --self-contained
```

### Uploading the executable files into the blobs storage

All the executable files used by the project (including the external script for sending logs) need to be archived and stored in the blobs folder. You can reference deploy.sh to see how this is achieved:

```bash
VERSION=$(cyclecloud project info | grep Version | cut -d: -f2 | cut -d" " -f2)
DEST_FILE=$(pwd)/blobs/hcheck-linux-$VERSION.tgz
cp ../../../src/send_log ./linux-x64
tar czf $DEST_FILE ./linux-x64
```

### Uploading the project to the Azure locker

- In order for you to be able to add the project to your CycleCloud cluster, you will first need to upload it to your Azure Locker.
+ In order for you to be able to add the project to your CycleCloud cluster, you will first need to upload it to your Azure Locker. The easiest way to do this is by editing deploy.sh:

```bash
cyclecloud project upload your-locker-name
@ -45,7 +82,6 @@ Most of them can be configured from the "Advanced Settings" tab in CycleCloud Se

![Alt](/images/advanced_settings.png "Advanced Settings")

### Importing the cluster template into CycleCloud

With CycleCloud CLI, upload the cluster template. Run the commands below to save your cluster settings (such as the region and configuration), and then import the cluster template along with those settings.

@ -55,7 +91,7 @@ cyclecloud export_parameters MyClusterName > param.json
cyclecloud import_cluster --force -f slurm.txt -c Slurm MyClusterName -p param.json
```

- ## Running NHC healthcheck:
+ ## Running NHC healthcheck

Which NHC checks are run is based on the .conf file. By default, this project includes a set of cluster-specific configuration files. If you want to use a custom configuration instead, put your .conf file into the nhc-config subfolder within your project's files directory and edit the parameter to reflect that name instead:

@ -68,19 +104,13 @@ Alternatively, you can change the cluster template directly. This can be useful
config = YOUR_CUSTOM_NAME.conf
```

- ### Designing custom tests:
+ ### Designing custom NHC tests

You can write your own test scripts to be run by the healthcheck tool.

- 1) NHC-based tests (.nhc files) have to be placed in the nhc-tests folder. In order for NHC to actually use them, you will need to create your own configuration files. Just place them in nhc-config folder and pass the name to the NHC config name parameter in the settings
+ NHC-based tests (.nhc files) have to be placed in the nhc-tests folder. In order for NHC to actually use them, you will need to create your own configuration files. Just place them in the nhc-config folder and pass the name to the NHC config name parameter in the settings. A configuration line might look like the sketch below.
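For illustration, a minimal sketch of such a configuration line, following standard NHC syntax; the check name `check_my_gpu` is hypothetical and would be defined in one of your .nhc files:

```bash
# nhc-config/YOUR_CUSTOM_NAME.conf
# apply the hypothetical check_my_gpu check on every node matching *
* || check_my_gpu
```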

2) Custom test scripts. Whether it is a bash or a python script, anything executable can be a test, as long as it adheres to the following rules:

- Exit code for a passing test is 0. Any non-zero exit code is considered a failure and will be reported
- To receive a meaningful report on the error, you need to output the message into stdout
- If you want the report to contain more information than a single message can convey, you can make your script output a json string - just make sure it has a "message" field that will be used to log the error. If you do this, everything but the message field will end up in the "extra-info" part of the report as valid json (please refer to the [Sample healthcheck report](#sample-healthcheck-report) section for an example). If there are any formatting issues or you fail to include the "message" field, the whole json construction will become the reported message instead

- ## Running custom test scripts:
+ ## Running custom test scripts

Put the custom scripts you want the healthcheck tool to run into the custom-tests directory. Update healthchecks.custom.pattern in the cluster-ini template to a pattern that the healthcheck will use to determine which test scripts to run.

@ -93,8 +123,16 @@ Alternatively, you can change the cluster template directly. This can be useful
pattern = *.sh
```

All your healthchecks should exit with code 0 upon the successful pass of a healthcheck, and non-zero otherwise.

### Designing custom test scripts

Whether it is a bash or a python script, anything executable can be a test, as long as it adheres to the following rules (see the sketch after this list):

- Your script should contain a [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix))
- Exit code for a passing test is 0. Any non-zero exit code is considered a failure and will be reported
- To receive a meaningful report on the error, you need to output the message into stdout
- If you want the report to contain more information than a single message can convey, you can make your script output a json string - just make sure it has a "message" field that will be used to log the error. If you do this, everything but the message field will end up in the "extra-info" part of the report as valid json (please refer to the [Sample healthcheck report](#sample-healthcheck-report) section for an example). If there are any formatting issues or you fail to include the "message" field, the whole json construction will become the reported message instead
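As an illustration, here is a minimal sketch of a bash test that follows these rules; the disk-space threshold and the `free-kb` field are made up for the example:

```bash
#!/usr/bin/env bash
# Hypothetical check: fail when less than 1 GB is free on the root filesystem
free_kb=$(df / | awk 'NR==2 {print $4}')
if [ "$free_kb" -lt 1048576 ]; then
    # "message" becomes the reported error; other fields land in "extra-info"
    echo "{\"message\": \"low disk space on /\", \"free-kb\": $free_kb}"
    exit 1
fi
exit 0
```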
## Running the hcheck binary
@ -110,9 +148,24 @@ You should never have to run the tool manually, but in the case you want to do s
| --nr | Number of reruns for the set of scripts | --nr 3 |
| --pt | Pattern for custom script detection | --pt *.sh |
| --rpath | Path to where the report will be generated | --rpath /tmp/log/report.json |
| --rscript | Path to the script reporting the results back to the portal | --rscript ./send_logs |
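Putting these flags together, a manual run might look like the following sketch (the paths and values are hypothetical):

```bash
./hcheck --nr 3 --pt "*.sh" --rpath /tmp/log/report.json --rscript ./send_log
```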

## Changing the script for reporting errors

- ## Testing the project:

Currently, the script reporting errors back to the portal is CycleCloud-specific and uses a custom version of the jetpack log command to send detailed information. If you wish to use another script to report the errors back, here are the inline parameters that it will be called with:

| Flag | Use |
| --- | --- |
| -m | Short message that shows up in CycleCloud logs |
| --level error | Level of the message |
| --info | Extra information about the tests in json format |
| --code | Exit code of the test |
| --testname | Name of the test |
| --nodeid | Id of the VM the tests were run on |
| --time | The time it took to run the test in ms |
| --error | Error message returned by the test script |
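For instance, a replacement reporting script should expect to be invoked roughly like this sketch (`my_report_script` and all values are hypothetical):

```bash
./my_report_script -m "disk check failed" --level error \
    --info '{"free-kb": 512000}' --code 1 --testname disk_check.sh \
    --nodeid node-123 --time 842 --error "low disk space on /"
```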
## Testing the project

You can test the project by putting custom scripts that return fixed results into the custom-tests folder and setting healthchecks.custom.pattern to a pattern that will detect them.
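For example, a minimal sketch of such a fixed-result script (the message is arbitrary):

```bash
#!/usr/bin/env bash
# Always fails with a known message, to verify detection and reporting end to end
echo "intentional failure for testing"
exit 1
```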
@ -122,6 +175,8 @@ C# tool itself also comes with unit tests that you can run yourself by going int
dotnet test
```

If you want to test how healthchecks work on a real cluster, you can use the provided evenfail.sh test located in the sample-healthchecks subfolder. Just copy it to the ./specs/default/cluster-init/files/custom-tests directory, import the slurm.txt template into your cluster (which should have a single dash in its name, for example "cycleslurm-demo"), and put "even*.sh" as the custom script pattern parameter. After this, you can run deploy.sh and start the cluster.

## Sample healthcheck report

All healthcheck scripts run by the tool are required to exit with a non-zero code upon encountering an error. If you want to store some extra information in the report and have it as a proper json field, make sure your script outputs a valid json that contains a "message" field - that field will be trimmed from the extra information and used as the main output of the script. A failure to add a "message" field or errors in the json will result in the whole json string being used as the message.

@ -159,8 +214,6 @@ All healthcheck scripts run by the tool are required to exit with a non-zero cod
}
```

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Binary file not shown.
@ -17,6 +17,6 @@ cp ../../../src/send_log /linux-x64
tar czf hcheck-linux-$VERSION.tgz linux-x64
cp hcheck-linux-$VERSION.tgz ../../../../../blobs/
cd $( dirname $0 )/
- echo \#!/usr/bin/env bash > ../../../../../specs/default/cluster-init/files/version.sh
- echo export HEALTHCHECK_VERSION=$VERSION >> ../../../../../specs/default/cluster-init/files/version.sh
+ echo \#!/usr/bin/env bash > ./specs/default/cluster-init/files/version.sh
+ #echo export HEALTHCHECK_VERSION=$VERSION >> ./specs/default/cluster-init/files/version.sh
+ #cyclecloud project upload azure-storage
@ -26,7 +26,7 @@ public class HealthcheckTest
{
    using (TestScriptGenerator tsg = new TestScriptGenerator("print(\"Hello, python world\")\nexit(0)", true))
    {
-       string[] args = { "-k", tsg.Path, "--rpath", "./report.json", "--python", "python3"};
+       string[] args = { "-k", tsg.Path, "--rpath", "./report.json"};
        Healthcheck.Main(args);
        ArgumentProcessor argus = new ArgumentProcessor(args);
        ReportBuilder builder = new ReportBuilder(argus, argus.FilePath);
@ -23,7 +23,7 @@ namespace hcheck
public bool isSuccess = false;
public DateTime startTime;
public DateTime exitTime;
- public virtual void RunProcess(string filePath, string[] args = null, int timeout = 1000)
+ public virtual void RunProcess(string filePath, string[]? args = null, int timeout = 1000)
{
    using (System.Diagnostics.Process pProcess = new System.Diagnostics.Process())
    {
@ -1,118 +1,118 @@
using System.Text.Json;
using System.Xml;

namespace hcheck
{
    public class TestRunner
    {
        private HealthReport report;
        public ProcessRunner? pr = null;

        public TestRunner(HealthReport header)
        {
            report = header;
        }

        public HealthReport getReport()
        {
            return report;
        }

        private void AddRepeatInfo(string testPath, Dictionary<string, object> testResults, ProcessRunner pr)
        {
            if (report.testresults[testPath].ContainsKey("repeat-history"))
            {
                LinkedList<object> list = JsonSerializer.Deserialize<LinkedList<object>>(report.testresults[testPath]["repeat-history"].ToString());
                list.AddLast(testResults);
                report.testresults[testPath]["repeat-history"] = list;
                //LinkedList<object> list = (LinkedList<object>)report.testresults[testPath]["repeat-history"];
                //list.AddLast(testResults);
            }
            else
            {
                LinkedList<object> list = new LinkedList<object>();
                list.AddLast(new Dictionary<string, object>(report.testresults[testPath]));
                list.AddLast(testResults);
                report.testresults[testPath]["repeat-history"] = list;
            }
            //first test run that returned an error should be reported
            if (report.testresults[testPath]["exit-code"].ToString() == "0" && pr.exitCode != 0)
            {
                report.testresults[testPath]["exit-code"] = pr.exitCode;
                report.testresults[testPath]["extra-info"] = testResults["extra-info"];
                report.testresults[testPath][key: "message"] = testResults["message"];
            }
        }

        public void RunTest(string testPath, ArgumentProcessor args)
        {
            //concern: properly initialize the test, success or fail,
            //how long it took, collect results, write the report
            //contract: if the external process is running, exit 0
            //invoke nvidia-smi
            if (pr == null) pr = new ProcessRunner();
            //python scripts need to be run with a python installation
            if (args.ReframePath != "")
            {
                string? reportPath = Path.GetDirectoryName(args.FilePath);
                // string actualReportPath = (reportPath == null) ? "/var/log/reframe_results.json" : reportPath + "reframe_results.json";
                string actualReportPath = "/var/log/reframe_results.json";
                pr.RunProcess(args.ReframePath, new string[] { "-C", args.ReframeConfigPath, "--force-local", "--report-file", actualReportPath, "-c", testPath, "-R", "-r" }, 10000);
                if (!pr.isSuccess)
                {
                    Console.WriteLine("There was an error in launching the script: " + pr.stderr);
                    return;
                }
                else
                    Console.WriteLine("reframe ran with output " + pr.stdout);
                string reframeErrorMessage = ReframeWorker.ReadReframeReport(actualReportPath);
                pr.stdout = (reframeErrorMessage == "") ? "No message" : reframeErrorMessage;
            }
            else
            {
                pr.RunProcess(testPath);
            }
            var options = new JsonSerializerOptions()
            {
                AllowTrailingCommas = true
            };
            if (!pr.isSuccess)
            {
                Console.WriteLine("There was an error in launching the script: " + pr.stderr);
                return;
            }
            Dictionary<string, object> testResults = new Dictionary<string, object>();
            try
            {
                testResults.Add("exit-code", pr.exitCode);
                testResults.Add("test-time", (pr.exitTime - pr.startTime).TotalMilliseconds);
                testResults.Add("extra-info", "None");
                Dictionary<string, object>? deserializedResult = JsonSerializer.Deserialize<Dictionary<string, object>>(pr.stdout, options);
                if (deserializedResult == null) throw new System.Text.Json.JsonException();
                Dictionary<string, object> extraInfo = new Dictionary<string, object>();
                foreach (KeyValuePair<string, object> record in deserializedResult)
                {
                    if (record.Key == "message") testResults.Add("message", record.Value);
                    else extraInfo.Add(record.Key, record.Value);
                }
                //if no "message" tag, treat the whole thing as a message
                if (!testResults.ContainsKey("message")) throw new System.Text.Json.JsonException("No message set");
                testResults["extra-info"] = extraInfo;
            }
            catch (System.Text.Json.JsonException ex) when (ex.Data != null) //if not parse-able, result was a simple message
            {
                testResults.Add("message", pr.stdout);
            }
            if (!report.testresults.ContainsKey(testPath))
                report.testresults.Add(testPath, value: testResults);
            else
            {
                AddRepeatInfo(testPath, testResults, pr);
            }
        }
    }
}
@ -16,11 +16,18 @@ namespace hcheck
public TestScriptGenerator(string script, bool isPython = false)
{
    _path = System.IO.Path.GetTempFileName();
-   if (isPython) _path += ".py"; //extension is used to detect python scripts
-   else File.WriteAllText(_path, "#!/usr/bin/env bash\n");
+   if (isPython)
+   {
+       _path += ".py"; //extension is used to detect python scripts
+       File.WriteAllText(_path, "#!/usr/bin/env python3\n");
+   }
+   else
+   {
+       File.WriteAllText(_path, "#!/usr/bin/env bash\n");
+   }
    File.AppendAllText(_path, script);
    ProcessRunner pr = new ProcessRunner();
-   pr.RunProcess("chmod", new string[]{"+x", _path});
+   pr.RunProcess("chmod", new string[] { "+x", _path });
}

public void Dispose()
@ -0,0 +1,117 @@
#!/opt/cycle/jetpack/system/embedded/bin/python3

from jetpack import util
import time
import json
import sys
import argparse


class SendError(Exception):
    pass


class LogError(Exception):
    pass


def detailed_log(message, exit_code, extra_info, node_id, test_name, error_message, test_time=0, level="error", priority=None):
    if level not in ["info", "warn", "error"]:
        raise LogError("Invalid level: %s" % level)

    priority = priority or _get_priority(level)

    message_data = {
        "level": level,
        "message": message,
        "priority": priority,
        "exit_code": exit_code,
        "extra_info": extra_info,
        "node_id": node_id,
        "test_name": test_name,
        "test_time": test_time,
        "error_message": error_message
    }

    send_internal_message(message_data, "log")


def send_internal_message(message_data, message_type):
    '''
    Sends a system message to CycleCloud
    parameters:
        message_data - this is a python dictionary
        message_type - type of message, examples are test, log, installation
    '''
    if not isinstance(message_data, dict):
        raise SendError("message_data parameter must be a dictionary")

    config = util.parse_config(None)

    try:
        identity = config["identity"]
        cluster_session_id = identity.get('cluster_session_id')
        cluster_name = identity["cluster_name"]
        instance_id = identity["instance_id"]
        cycle_server_config = config['cycle_server']
    except KeyError as e:
        raise SendError("Unable to find '%s' in config" % str(e))

    message_obj = {
        "cluster_name": cluster_name,
        "instance_id": instance_id,
        "timestamp": util.iso_8601_timestamp(),
        "cluster_session_id": cluster_session_id,
        "type": message_type,
        "data": message_data
    }

    def func():
        r = _post_message(message_obj)
        if r.status != 202:
            raise Exception("Failed to send message: %d" % r.status)

    return _retry_func(func)


def _post_message(message_obj):
    return util.query_cyclecloud("/clusterlink/messages", body=json.dumps({"messages": [message_obj]}), method="POST")


def _retry_func(func):
    wait_length = 5

    while True:
        try:
            return func()
        except Exception as e:
            # retry 5 times, waiting 5, 10, 20, then 40 seconds
            if wait_length < 41:
                time.sleep(wait_length)
                wait_length *= 2
            else:
                raise e


def _get_priority(level):
    if level == 'info':
        log_priority = 'medium'
    elif level == 'warn':
        log_priority = 'medium'
    elif level == 'error':
        log_priority = 'high'
    else:
        raise LogError("Invalid log level")
    return log_priority


parser = argparse.ArgumentParser(description='Send detailed healthcheck logs to CycleCloud')
parser.add_argument("-m", "--message", help="message displayed in CycleCloud logs")
parser.add_argument("--info", help="extra information about the tests in json format")
parser.add_argument("--code", help="exit code of the test")
parser.add_argument("--testname", help="name of the test")
parser.add_argument("--nodeid", help="the id of the vm the tests were run on")
parser.add_argument("--time", help="the time it took to run the test in ms")
parser.add_argument("-l", "--level", help="the level the log should be submitted at")
parser.add_argument("--error", help="the error message returned by the test script")

args = parser.parse_args()

detailed_log(args.message, args.code, args.info, args.nodeid, args.testname, args.error, args.time)
@ -1,9 +1,9 @@
[project]
name = healthcheck
type = application
- version = 1.0.9
+ version = 0.0.9

[blobs]
- Files = hcheck-linux-1.0.9.tgz
+ Files = hcheck-linux-0.0.9.tgz
@ -0,0 +1,7 @@
#!/usr/bin/env bash
#this is set to work on HPC nodes of a cluster that has a single dash in its name.
#you might need to change the number in the -f5 parameter of cut to adapt it to a different number of dashes in the cluster name
node_index=$(jetpack config cyclecloud.node.name | cut -d- -f5)
if [[ $(expr $node_index % 2) == 0 ]]; then
    echo failed; exit 1;
fi
@ -0,0 +1 @@
Use this folder to upload your custom test scripts
@ -0,0 +1 @@
Use this folder to store NHC configuration files