The azure-healthcheck project is a helper that runs custom healthcheck scripts and reports any issues with a virtual machine upon its initialization.
This project supports [NHC](https://github.com/mej/nhc) healthcheck scripts and allows the addition of custom scripts. This was achieved with the help of work by Cormac Garvey, [cc_slurm_nhc](https://github.com/Azure/azurehpc/tree/master/experimental/cc_slurm_nhc). To learn more about this project and the advantages of running GPU healthchecks, refer to [this article](https://techcommunity.microsoft.com/t5/azure-global/automated-hpc-ai-compute-node-health-checks-integrated-with-the/ba-p/3113454).
* Your Azure VM runs a Linux-based operating system and supports bash commands.
* You have the CycleCloud CLI installed and configured. Refer to [these instructions](https://docs.microsoft.com/en-us/azure/cyclecloud/how-to/install-cyclecloud-cli?view=cyclecloud-8) for the installation steps.
The project comes with a pre-built linux-x64 binary that runs the test scripts and builds the reports. If you wish to build it from source yourself, you will need to install .NET Core. Refer to deploy.sh for an example of the steps you need to take.
```bash
cd ./hcheck/hcheck/
dotnet build -r linux-x64 --self-contained
```
### Uploading the executable files into the blobs storage
All the executable files used by the project (including the external script for sending logs) need to be archived and stored in the blobs folder. You can reference deploy.sh to see how this is achieved:
```bash
VERSION=$(cyclecloud project info | grep Version | cut -d: -f2 | cut -d" " -f2)
```
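The `VERSION` line can be tried out without the CycleCloud CLI. The sketch below feeds the same `grep`/`cut` pipeline with a made-up sample of the `cyclecloud project info` output. The sample text and its field names are assumptions; the real output may look different.

```bash
# Hypothetical sample of what `cyclecloud project info` might print.
# The real output format is an assumption here.
sample_output='Name: healthcheck
Version: 1.0.0
Default Locker: azure-storage'

# Same pipeline as in deploy.sh, fed with the sample text instead of the CLI:
# keep the line containing "Version", take the text after the colon,
# then take the token after the leading space.
VERSION=$(printf '%s\n' "$sample_output" | grep Version | cut -d: -f2 | cut -d" " -f2)

echo "Parsed version: ${VERSION}"
```

Note that `grep Version` keeps every line containing that word, so the pipeline only yields a single value when exactly one such line is present.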
Before you can add the project to your CycleCloud cluster, you need to upload it to your Azure Locker. The easiest way to do this is by editing deploy.sh.
### Importing the cluster template into CycleCloud
With CycleCloud CLI, upload the cluster template. Run the commands below to save your cluster settings (such as the region and configuration), and then import the cluster template along with those settings.
The .conf file determines which NHC checks are run. By default, this project includes a set of cluster-specific configuration files. If you want to use a custom configuration instead, put your .conf file into the nhc-config subfolder within your project's files directory and edit the parameter to reflect that name instead:
NHC-based tests (.nhc files) have to be placed in the nhc-tests folder. For NHC to actually use them, you need to create your own configuration files. Place them in the nhc-config folder and pass the file name to the NHC config name parameter in the settings.
Put the custom scripts you want the healthcheck tool to run into the custom-tests directory. Update healthchecks.custom.pattern in the cluster-ini template to a pattern that the healthcheck will use to determine which test scripts to run.
![Alt](/images/user_pattern.png "Custom pattern")
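Before setting the pattern, it can help to preview which files it would select. The following is a minimal sketch that assumes the pattern behaves like a shell glob applied to file names; how the healthcheck tool matches internally is not confirmed here. The `preview_pattern` helper and the demo file names are hypothetical.

```bash
#!/usr/bin/env bash
# Preview which files in a directory a pattern such as "even*.sh" would select.
# Assumption: the pattern behaves like a shell glob applied to file names.

preview_pattern() {
    local dir="$1" pattern="$2" path name
    for path in "$dir"/*; do
        [ -e "$path" ] || continue      # skip the literal "*" if the dir is empty
        name="${path##*/}"              # strip the directory part
        case "$name" in
            $pattern) echo "$name" ;;   # unquoted so it is treated as a glob
        esac
    done
}

# Demo in a throwaway directory with hypothetical file names.
demo_dir=$(mktemp -d)
touch "$demo_dir/evenfail.sh" "$demo_dir/check_disk.sh" "$demo_dir/check_gpu.py"
preview_pattern "$demo_dir" "even*.sh"
preview_pattern "$demo_dir" "*.py"
rm -rf "$demo_dir"
```

In a real project you would point the helper at the custom-tests directory instead of the demo directory.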
Alternatively, you can change the cluster template directly. This can be useful if you are planning to set up multiple clusters using that template:
Whether it is a bash or a python script, anything executable can be a test, as long as it adheres to the following rules:
- Your script should contain a [shebang](https://en.wikipedia.org/wiki/Shebang_(Unix)).
- The exit code for a passing test is 0. Any non-zero exit code is considered a failure and will be reported.
- To receive a meaningful report on the error, write the message to stdout.
- If you want the report to contain more information than a single message can convey, make your script output a JSON string that has a "message" field; that field is used to log the error. Everything but the "message" field ends up in the "extra-info" part of the report as valid JSON (refer to the [Sample healthcheck report](#sample-healthcheck-report) section for an example). If the JSON is malformed or lacks the "message" field, the whole JSON string becomes the reported message instead.
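As an illustration of the rules above, here is a minimal sketch of a custom test. The check itself (a writable scratch directory), the function name, and the extra JSON fields are hypothetical; only the shebang, the exit-code convention, and the "message" field come from the rules.

```bash
#!/usr/bin/env bash
# Hypothetical custom test: verify that a scratch directory is writable.
# Passing: exit code 0. Failing: JSON on stdout with a "message" field,
# and a non-zero exit code.

check_scratch_dir() {
    local dir="$1"
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        return 0
    fi
    # Everything except "message" is expected to land in "extra-info".
    printf '{"message": "Scratch directory is not writable", "directory": "%s", "host": "%s"}\n' \
        "$dir" "$(uname -n)"
    return 1
}

# The status of the last command becomes the script's exit code.
check_scratch_dir "${1:-/tmp}"
```

The JSON is assembled with `printf` to keep the sketch dependency-free, so values containing quotes or backslashes would need escaping in a real test.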
Alternatively, you can change the cluster template directly. This can be useful if you are planning to set up multiple clusters using that template:
```ini
[[[configuration healthchecks.reframe]]]
pattern = *.py
```
003_run_reframe.sh clones Jon Shelley's repo to install and run the reframe tests and then, like the other tests, uses the hcheck project to send the log to CycleCloud and generate a report.
## Developer testing for reframe scripts
Example configuration for reframe tests for Dev testing:
If you are using CentOS, you need to edit the [azure_centos_7.py](https://github.com/JonShelley/reframe/blob/master/azure_nhc/config/azure_centos_7.py) file in Jon Shelley's repo to include the SKU configuration in the following way:
Currently, the script that reports errors back to the portal is CycleCloud-specific and uses a custom version of the jetpack log command to send detailed information. If you wish to use another script to report the errors, here are the inline parameters that it will be called with:
You can test the project by putting custom scripts that return fixed results into the custom-tests folder and setting healthchecks.custom.pattern to a pattern that matches them.
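A minimal sketch of such fixed-result scripts follows, together with a small local loop that runs them before you copy them into the project. The file names and the `run_test` helper are hypothetical, and the loop only imitates the exit-code and stdout convention described in this document; it is not the hcheck tool.

```bash
#!/usr/bin/env bash
# Create two fixed-result tests in a throwaway directory and run them locally:
# capture stdout, then decide pass/fail from the exit code.

work_dir=$(mktemp -d)

cat > "$work_dir/fixed_pass.sh" <<'EOF'
#!/usr/bin/env bash
exit 0
EOF

cat > "$work_dir/fixed_fail.sh" <<'EOF'
#!/usr/bin/env bash
echo "Simulated failure for testing"
exit 1
EOF

chmod +x "$work_dir"/fixed_*.sh

run_test() {
    local script="$1" output status
    # Record the exit code without aborting if the caller uses `set -e`.
    output=$("$script") && status=0 || status=$?
    if [ "$status" -eq 0 ]; then
        echo "PASS ${script##*/}"
    else
        echo "FAIL ${script##*/} (exit $status): $output"
    fi
}

for script in "$work_dir"/fixed_*.sh; do
    run_test "$script"
done

rm -rf "$work_dir"
```

Once the scripts behave as expected, copy them into the custom-tests folder and use a pattern such as `fixed_*.sh`.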
The C# tool itself also comes with unit tests that you can run by going into the hcheck-test directory and running:
If you want to test how healthchecks work on a real cluster, you can use the provided evenfail.sh test located in the sample-healthchecks subfolder. Copy it to the ./specs/default/cluster-init/files/custom-tests directory, import the slurm.txt template into your cluster (whose name should contain a single dash, for example "cycleslurm-demo"), and set "even*.sh" as the custom script pattern parameter. After this, you can run deploy.sh and start the cluster.
All healthcheck scripts run by the tool must exit with a non-zero code when they encounter an error. To store extra information in the report as a proper JSON field, make sure your script outputs valid JSON that contains a "message" field. That field is removed from the extra information and used as the main output of the script. If the "message" field is missing or the JSON is invalid, the whole JSON string is used as the message.