HPC Azure Cluster Management

HPC Azure Cluster Management Service

Introduction

The service enables diagnostics scenarios for HPC clusters in Azure by providing the following features:

  • Diagnostics Jobs

    With predefined diagnostic test definitions, the cluster admin can easily validate the health of an HPC cluster.

  • Clusrun Jobs

    When you select a group of nodes and run clusrun, the commands are dispatched to the selected nodes, and their output is collected and shown interactively.

  • Heatmap

    The heatmap is a real-time graphical view of a specific metric value of all nodes in the cluster. It provides a vivid way to view the cluster's metrics.

How to deploy

There are two ways to deploy the service.

Deploy from scratch

Deploy to Azure

A cluster is deployed together with the diagnostics services, and the deployer can choose the scheduler, location, cluster size, portal name, etc. This is the easiest way to create an HPC cluster with diagnostics functionality enabled. For detailed usage of the deployment template, please refer to: Azure cluster deployment

Apply to an existing cluster

To enable the diagnostics functionality on an already deployed cluster, follow the steps below:

  1. Create the service alone

    For how to use the template, please refer to: Build HPC ACM Diagnostic service

  2. Register the cluster with the service. You can register multiple clusters with the same service by repeating this step for each cluster.

    Download the script from: RegisterToAcm.ps1. Run it in an elevated PowerShell window:

    .\RegisterToAcm.ps1 -resourceGroupName theResourceGroupOfYourCluster -acmRgName theResourceGroupOfAcmServices -subscriptionId theSubscriptionId
    

    After the configuration, the VMs will register themselves to the HPC ACM services, and you can check the Resources section in the portal to see them.
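Since the registration step is repeated once per cluster, it can be scripted when several clusters share one service. The following is a minimal sketch, not part of the project: the resource group names `clusterRgA` and `clusterRgB` are hypothetical placeholders, and it assumes `RegisterToAcm.ps1` sits in the current directory and accepts the three parameters shown in the command above.

```powershell
# Hypothetical values: replace with your own resource groups and subscription.
$clusterResourceGroups = @('clusterRgA', 'clusterRgB')
$acmRgName             = 'theResourceGroupOfAcmServices'
$subscriptionId        = 'theSubscriptionId'

# Build one parameter set per cluster so each call mirrors the documented command.
$registrations = @(foreach ($rg in $clusterResourceGroups) {
    @{
        resourceGroupName = $rg
        acmRgName         = $acmRgName
        subscriptionId    = $subscriptionId
    }
})

# Run the registration script once per cluster (use an elevated PowerShell window).
foreach ($params in $registrations) {
    if (Test-Path -Path .\RegisterToAcm.ps1) {
        Write-Host "Registering cluster in resource group '$($params.resourceGroupName)'..."
        # Splatting passes the hashtable keys as named parameters to the script.
        .\RegisterToAcm.ps1 @params
    }
    else {
        Write-Warning "RegisterToAcm.ps1 not found in the current directory; skipping '$($params.resourceGroupName)'."
    }
}
```

Splatting keeps every call identical to the single-cluster command, so only the list of cluster resource groups needs to change.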

Known issues

  • The service only supports Linux for now.
  • The service provides an HTTPS portal with a self-signed certificate, so you need to bypass certificate validation to visit the portal and use the REST API.
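One way to bypass certificate validation when calling the REST API from a script is sketched below. It assumes PowerShell 6 or later, where `Invoke-RestMethod` supports the `-SkipCertificateCheck` switch. The helper name `Invoke-AcmRestRequest` and the URL in the usage comment are hypothetical, and the actual API paths are not described in this README.

```powershell
# Assumption: PowerShell 6 or later, where -SkipCertificateCheck is available.
# Invoke-AcmRestRequest is a hypothetical helper name, not part of the service.
function Invoke-AcmRestRequest {
    param(
        [Parameter(Mandatory)] [string] $Uri,
        [string] $Method = 'Get'
    )

    # -SkipCertificateCheck disables certificate validation for this request only,
    # which lets the call succeed against the portal's self-signed certificate.
    Invoke-RestMethod -Uri $Uri -Method $Method -SkipCertificateCheck
}

# Usage (hypothetical URL; substitute your portal's address and a real API path):
# Invoke-AcmRestRequest -Uri 'https://<your-portal-host>/<api-path>'
```

Skipping validation removes the protection against impersonated servers, so this is best limited to a portal you deployed yourself.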