Commit Graph

3735 Commits

Author SHA1 Message Date
Davide Vanzo 2d57191cb3 Rename pyxis source archive 2024-10-31 11:33:25 -05:00
Davide Vanzo 7d0ca49b79 Fixed typo 2024-10-31 11:14:29 -05:00
Davide Vanzo 4a6c51aa46 Updated libnvidia version 2024-10-31 10:51:20 -05:00
Davide Vanzo ff24109606
Merge pull request #753 from Azure/update_pyxis_enroot
Update to pyxis 0.20.0
2024-10-31 10:15:36 -05:00
Davide Vanzo 4f28c718ee Update to pyxis 0.20.0 2024-10-31 10:05:40 -05:00
Cormac Garvey 013a778225
Merge pull request #752 from Azure/check_ecc_rp
Added support for retired pages (check_gpu_ecc)
2024-10-08 12:47:21 -05:00
Cormac Garvey 8083f0c62f Re-order imported python modules 2024-10-07 09:36:56 -05:00
Cormac Garvey 89c8aa7267 Corrected imports 2024-09-17 13:54:02 -05:00
Cormac Garvey 6ad2b3e849 Corrected SRAM uncorrectable error check. 2024-09-16 17:05:14 -05:00
edwardsp 76259ff0e4
Update cc_install.sh
Removed the cyclecloud download URL and put a placeholder, "<INSERT_CYCLECLOU_DOWNLOAD_URL>".
2024-09-13 10:16:47 +01:00
Cormac Garvey de02b1ea56 Modified SRAM ECC counter threshold 2024-09-07 11:59:35 -05:00
Cormac Garvey f49779f843 Added support for retired pages 2024-09-04 19:07:24 -05:00
Davide Vanzo 0896b0967c
Merge pull request #751 from Azure/update_pyxis_enroot
Updated pyxys/enroot versions
2024-07-18 09:13:28 -05:00
Davide Vanzo 01d28b37a8 Updated pyxys/enroot versions 2024-07-17 15:46:23 -05:00
Davide Vanzo 0fe82f56c5 Updated pyxis version 2024-07-17 15:39:24 -05:00
Cormac Garvey 7e6d1fc1fc
Merge pull request #750 from Azure/hpc_monitoring_node_meta
hpc monitoring (Added support to report node metadata metrics)
2024-07-16 11:32:25 -05:00
Cormac Garvey e57cf65e4e Added support to report node metadata metrics 2024-07-16 09:59:18 -05:00
Cormac Garvey 9efb4582dc
Merge pull request #749 from Azure/hpc_monitor_aks
Integration of hpc monitoring into AKS
2024-07-10 16:56:25 -05:00
Cormac Garvey a503b00c8a
Merge pull request #748 from Azure/aks_npd_draino_2
Added GPU VBIOS and GPU throttling tests. (to AKS NPD+DRAINO)
2024-07-10 16:55:53 -05:00
Cormac Garvey 2cffa04e68 Minor typo 2024-07-10 16:53:29 -05:00
Cormac Garvey 5eaa9b6b42 minor edit. 2024-07-09 15:41:22 -05:00
Cormac Garvey 31e56170aa Correct typo 2024-07-09 13:53:13 -05:00
Cormac Garvey 783c30f986 Initial version of hpc monitoring integration into AKS 2024-07-09 13:46:10 -05:00
Cormac Garvey b31e66a551 Changed NPD base container image to ngc cuda. 2024-07-05 16:22:02 -05:00
Cormac Garvey 08e5ac0097 Corrected expected GPU VBIOS version 2024-07-05 15:04:35 -05:00
Cormac Garvey e19d1b3c5a Changed NPD base container image 2024-07-03 21:17:50 -05:00
Cormac Garvey b63700f9e4 Fix minor typo. 2024-07-03 13:06:54 -05:00
Cormac Garvey d9e8351cfa Added GPU VBIOS and GPU throttling tests. 2024-07-03 12:47:03 -05:00
marco netto c3d8aceffe
Merge pull request #747 from Azure/fixhbv4topology
Fix HBV4 topology
2024-06-28 13:09:36 -07:00
Cormac Garvey 23fd8c339e
Merge pull request #746 from Azure/aks_npd_draino
Integrate GPU node health checks into AKS
2024-06-28 14:36:29 -05:00
Cormac Garvey 1500f47260 Add IB device health check. 2024-06-28 14:29:19 -05:00
Marco Netto 0fe481f7bd Fix HBV4 topology
From the VM view, each NUMA domain has 6 CCXs, first 4 with 8 cores and
last 4 with 6 cores.
2024-06-28 10:41:51 -07:00
Cormac Garvey 86c08773e4 Added GPU tests to readme. 2024-06-26 17:43:02 -05:00
Cormac Garvey cb15924bc2 Corrected GPU ECC comment 2024-06-26 17:03:40 -05:00
Cormac Garvey fb5dac5842 Added GPU ECC test. 2024-06-26 16:57:49 -05:00
Cormac Garvey 4b03fbc2bc added backslash 2024-06-25 20:44:11 -05:00
Cormac Garvey 51d3ef03a8 added single quote 2024-06-25 20:39:38 -05:00
Cormac Garvey f7e76692b1 Added modified draino manifest 2024-06-25 19:34:31 -05:00
Cormac Garvey da685e4db4 Initial version of aks_npd_draino 2024-06-25 19:33:27 -05:00
Davide Vanzo 8dadec21b1 Added NDv5 as supported SKU 2024-04-16 10:35:06 -05:00
Jingchao Zhang cd9ade3374
Merge pull request #743 from JZ-Azure/master
fix code for upsteam change
2024-04-05 10:10:59 -04:00
Davide Vanzo 58e3ab7c7f
Update azure_nccl_allreduce_ib_loopback.nhc
Fix misspelled vars
2024-04-04 13:22:41 -05:00
Davide Vanzo 69423ce2b9 Removed exit 0 as prevents execution of externally appended commands 2024-03-28 16:20:14 -05:00
Davide Vanzo 2caead70fd CC 8.5+ starts slurm services after cluster-init completion 2024-03-01 14:22:52 -06:00
Davide Vanzo 25ad0d9511 Prevent error if project is ran again 2024-03-01 11:01:11 -06:00
Davide Vanzo 972dfa31aa
Merge pull request #741 from Azure/nhc_exclusive
Add support for Ubuntu 22.04 in nhc-run script
2024-02-21 11:33:13 -06:00
Davide Vanzo ed9146309c Revert change 2024-02-21 11:27:32 -06:00
Davide Vanzo 9bbfc48943
Merge pull request #742 from Azure/ecc_check_ndv5
Add support for NDv5 in check_gpu_ecc
2024-02-21 11:18:54 -06:00
Davide Vanzo 443cb6a8a1 Fixed assignment order 2024-02-21 11:18:06 -06:00
Davide Vanzo ba009147de Revert change 2024-02-21 11:16:40 -06:00