HPC Azure Cluster Management Service
Introduction
The service enables diagnostics scenarios for HPC clusters in Azure by providing the following features:
- Diagnostics Jobs
  With predefined diagnostic test definitions, the cluster admin can easily validate the health of an HPC cluster.
- Clusrun Jobs
  When you select a group of nodes and run clusrun, the commands are dispatched to the selected nodes, and the outputs are collected and shown interactively.
- Heatmap
  The heatmap is a real-time graphical view of a specific metric value across all nodes in the cluster. It lets you see the state of the cluster's metrics at a glance.
How to deploy
There are two ways to deploy the service.
Deploy from scratch
A cluster is deployed together with the diagnostic services, and the deployer can choose the scheduler, location, cluster size, portal name, etc. This is the easiest way to create an HPC cluster with diagnostics functionality enabled. For detailed usage of the deployment template, please refer to: Azure cluster deployment
Apply to an existing cluster
To enable the diagnostics functionality on an already deployed cluster, follow the steps below:
For how to use the template, please refer to: Build HPC ACM Diagnostic service
- Register the cluster with the service. (You can register multiple clusters with the same service by repeating this step for each of your clusters.)
  Download the script from: RegisterToAcm.ps1. Run it in an elevated PowerShell window:
.\RegisterToAcm.ps1 -resourceGroupName theResourceGroupOfYourCluster -acmRgName theResourceGroupOfAcmServices -subscriptionId theSubscriptionId
After the configuration, the VMs register themselves with the HPC ACM services, and you can check the Resources section in the portal to see them.
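
Because the registration step is repeated once per cluster, it can be scripted when several clusters share one service. The following is a minimal sketch, not part of the project: the helper names `build_register_commands` and `register_all` are hypothetical, and it assumes `powershell` is on the PATH, that `RegisterToAcm.ps1` is in the current directory, and that the process is already running elevated. The three script parameters are the ones shown in the command above.

```python
import subprocess


def build_register_commands(cluster_resource_groups, acm_resource_group,
                            subscription_id,
                            script_path=r".\RegisterToAcm.ps1"):
    """Return one PowerShell command (as an argument list) per cluster.

    Hypothetical helper: it only assembles the same command line shown in
    this README, once for each cluster resource group.
    """
    commands = []
    for resource_group in cluster_resource_groups:
        commands.append([
            "powershell", "-File", script_path,
            "-resourceGroupName", resource_group,
            "-acmRgName", acm_resource_group,
            "-subscriptionId", subscription_id,
        ])
    return commands


def register_all(commands):
    """Run each registration in turn and stop at the first failure."""
    for command in commands:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, check=True)


# Example usage (placeholder values, run from an elevated prompt):
# commands = build_register_commands(
#     ["clusterRgA", "clusterRgB"],
#     "theResourceGroupOfAcmServices",
#     "theSubscriptionId",
# )
# register_all(commands)
```

Building the argument lists separately from running them makes it easy to print and review the commands before anything is executed.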
Known issues
- The service only supports Linux for now.
- The service provides an HTTPS portal with a self-signed certificate, so you need to bypass certificate validation to visit the portal and use the REST API.
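
For REST API clients, bypassing certificate validation might look like the sketch below. It uses only the Python standard library. The function names are hypothetical, and the URL in the usage comment is a placeholder, since this README does not list the portal address or the API routes. Disabling validation removes protection against man-in-the-middle attacks, so treat this as suitable only for a portal you deployed yourself.

```python
import ssl
import urllib.request


def make_unverified_context():
    """Return an SSL context that accepts the portal's self-signed certificate."""
    context = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def get_text(url, context, timeout=30):
    """Send a GET request and return the response body as text."""
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example usage (placeholder URL; substitute your portal address and a real route):
# body = get_text("https://your-portal-name.example/your/api/route",
#                 make_unverified_context())
# print(body)
```

A safer alternative, if you can export the portal's certificate, is to trust it explicitly with `ssl.create_default_context(cafile=...)` instead of turning validation off.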