savitamittal1 2024-06-17 11:20:32 -07:00 committed by GitHub
Parent b34a0c67e3
Commit 07d87f095d
No key found matching this signature
GPG key ID: B5690EEEBB952194
1 changed file: 2 additions and 1 deletion

@@ -1,5 +1,6 @@
## Run Node Health Checks (NHC)
This command job is used to test for and remove unhealthy compute nodes in an AzureML Cluster. During training, issues with the compute nodes can cause problems that may affect the training in unexpected ways. Oftentimes it can be hard to determine if the compute nodes are the source of the problem. This command job will check for any unhealthy nodes in a cluster and optionally remove them from the cluster.
For large-scale clusters, it is best to avoid scaling down, because once problematic nodes have been removed, the remaining nodes are less likely to cause issues. Instead, maintain the minimum and maximum node counts and avoid enabling auto-scaling. This approach helps minimize recurring issues, particularly InfiniBand (IB) performance problems in the backend infrastructure. For large customers, the advisable sequence is therefore: run the job, remove problematic nodes, and avoid scaling down afterward.
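One way to read "maintain minimum and maximum node counts and avoid enabling auto-scaling" is to pin both counts to the same value. The sketch below is illustrative only: the helper name `fixed_size_cluster_settings`, the cluster name, the VM size, and the node count are all hypothetical, and treating "no auto-scaling" as equal minimum and maximum counts is an assumption rather than something this README states.

```python
def fixed_size_cluster_settings(name: str, vm_size: str, node_count: int) -> dict:
    """Return compute settings that pin a cluster to a constant node count.

    Hypothetical helper: it only builds a plain dict and does not call Azure.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return {
        "name": name,
        "size": vm_size,
        # Equal minimum and maximum: the cluster has no room to scale down or up.
        "min_instances": node_count,
        "max_instances": node_count,
    }


# Example values are placeholders, not recommendations.
settings = fixed_size_cluster_settings("nhc-cluster", "Standard_ND96asr_v4", 16)
print(settings)

# With the azure-ai-ml package installed, the settings could be passed on as:
#   from azure.ai.ml.entities import AmlCompute
#   compute = AmlCompute(**settings)
#   ml_client.compute.begin_create_or_update(compute)
```

Keeping the settings in a plain dict means the sizing decision can be reviewed or logged before anything is sent to the service.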
### What does it do?
A series of node health checks is run on each node in a cluster to check for problems. If any node in the cluster fails a health check, the failing node is kicked out of the cluster and a healthy node is allocated to replace it. (Kicking nodes out can be turned on or off with an environment variable; see the 'How To Run' instructions.) The exact health checks that are run depend on the type of compute node being used. The node health check descriptions and results are written to the std_out file in the outputs after the job runs. The full list of node health checks that may be run is the following: