Davide Vanzo
|
2d57191cb3
|
Rename pyxis source archive
|
2024-10-31 11:33:25 -05:00 |
Davide Vanzo
|
7d0ca49b79
|
Fixed typo
|
2024-10-31 11:14:29 -05:00 |
Davide Vanzo
|
4a6c51aa46
|
Updated libnvidia version
|
2024-10-31 10:51:20 -05:00 |
Davide Vanzo
|
ff24109606
|
Merge pull request #753 from Azure/update_pyxis_enroot
Update to pyxis 0.20.0
|
2024-10-31 10:15:36 -05:00 |
Davide Vanzo
|
4f28c718ee
|
Update to pyxis 0.20.0
|
2024-10-31 10:05:40 -05:00 |
Cormac Garvey
|
013a778225
|
Merge pull request #752 from Azure/check_ecc_rp
Added support for retired pages (check_gpu_ecc)
|
2024-10-08 12:47:21 -05:00 |
Cormac Garvey
|
8083f0c62f
|
Re-order imported python modules
|
2024-10-07 09:36:56 -05:00 |
Cormac Garvey
|
89c8aa7267
|
Corrected imports
|
2024-09-17 13:54:02 -05:00 |
Cormac Garvey
|
6ad2b3e849
|
Corrected SRAM uncorrectable error check.
|
2024-09-16 17:05:14 -05:00 |
edwardsp
|
76259ff0e4
|
Update cc_install.sh
Removed the cyclecloud download URL and put a placeholder, "<INSERT_CYCLECLOU_DOWNLOAD_URL>".
|
2024-09-13 10:16:47 +01:00 |
Cormac Garvey
|
de02b1ea56
|
Modified SRAM ECC counter threshold
|
2024-09-07 11:59:35 -05:00 |
Cormac Garvey
|
f49779f843
|
Added support for retired pages
|
2024-09-04 19:07:24 -05:00 |
Davide Vanzo
|
0896b0967c
|
Merge pull request #751 from Azure/update_pyxis_enroot
Updated pyxis/enroot versions
|
2024-07-18 09:13:28 -05:00 |
Davide Vanzo
|
01d28b37a8
|
Updated pyxis/enroot versions
|
2024-07-17 15:46:23 -05:00 |
Davide Vanzo
|
0fe82f56c5
|
Updated pyxis version
|
2024-07-17 15:39:24 -05:00 |
Cormac Garvey
|
7e6d1fc1fc
|
Merge pull request #750 from Azure/hpc_monitoring_node_meta
hpc monitoring (Added support to report node metadata metrics)
|
2024-07-16 11:32:25 -05:00 |
Cormac Garvey
|
e57cf65e4e
|
Added support to report node metadata metrics
|
2024-07-16 09:59:18 -05:00 |
Cormac Garvey
|
9efb4582dc
|
Merge pull request #749 from Azure/hpc_monitor_aks
Integration of hpc monitoring into AKS
|
2024-07-10 16:56:25 -05:00 |
Cormac Garvey
|
a503b00c8a
|
Merge pull request #748 from Azure/aks_npd_draino_2
Added GPU VBIOS and GPU throttling tests. (to AKS NPD+DRAINO)
|
2024-07-10 16:55:53 -05:00 |
Cormac Garvey
|
2cffa04e68
|
Minor typo
|
2024-07-10 16:53:29 -05:00 |
Cormac Garvey
|
5eaa9b6b42
|
minor edit.
|
2024-07-09 15:41:22 -05:00 |
Cormac Garvey
|
31e56170aa
|
Correct typo
|
2024-07-09 13:53:13 -05:00 |
Cormac Garvey
|
783c30f986
|
Initial version of hpc monitoring integration into AKS
|
2024-07-09 13:46:10 -05:00 |
Cormac Garvey
|
b31e66a551
|
Changed NPD base container image to ngc cuda.
|
2024-07-05 16:22:02 -05:00 |
Cormac Garvey
|
08e5ac0097
|
Corrected expected GPU VBIOS version
|
2024-07-05 15:04:35 -05:00 |
Cormac Garvey
|
e19d1b3c5a
|
Changed NPD base container image
|
2024-07-03 21:17:50 -05:00 |
Cormac Garvey
|
b63700f9e4
|
Fix minor typo.
|
2024-07-03 13:06:54 -05:00 |
Cormac Garvey
|
d9e8351cfa
|
Added GPU VBIOS and GPU throttling tests.
|
2024-07-03 12:47:03 -05:00 |
marco netto
|
c3d8aceffe
|
Merge pull request #747 from Azure/fixhbv4topology
Fix HBV4 topology
|
2024-06-28 13:09:36 -07:00 |
Cormac Garvey
|
23fd8c339e
|
Merge pull request #746 from Azure/aks_npd_draino
Integrate GPU node health checks into AKS
|
2024-06-28 14:36:29 -05:00 |
Cormac Garvey
|
1500f47260
|
Add IB device health check.
|
2024-06-28 14:29:19 -05:00 |
Marco Netto
|
0fe481f7bd
|
Fix HBV4 topology
From the VM view, each NUMA domain has 6 CCXs, first 4 with 8 cores and
last 4 with 6 cores.
|
2024-06-28 10:41:51 -07:00 |
Cormac Garvey
|
86c08773e4
|
Added GPU tests to readme.
|
2024-06-26 17:43:02 -05:00 |
Cormac Garvey
|
cb15924bc2
|
Corrected GPU ECC comment
|
2024-06-26 17:03:40 -05:00 |
Cormac Garvey
|
fb5dac5842
|
Added GPU ECC test.
|
2024-06-26 16:57:49 -05:00 |
Cormac Garvey
|
4b03fbc2bc
|
added backslash
|
2024-06-25 20:44:11 -05:00 |
Cormac Garvey
|
51d3ef03a8
|
added single quote
|
2024-06-25 20:39:38 -05:00 |
Cormac Garvey
|
f7e76692b1
|
Added modified draino manifest
|
2024-06-25 19:34:31 -05:00 |
Cormac Garvey
|
da685e4db4
|
Initial version of aks_npd_draino
|
2024-06-25 19:33:27 -05:00 |
Davide Vanzo
|
8dadec21b1
|
Added NDv5 as supported SKU
|
2024-04-16 10:35:06 -05:00 |
Jingchao Zhang
|
cd9ade3374
|
Merge pull request #743 from JZ-Azure/master
fix code for upstream change
|
2024-04-05 10:10:59 -04:00 |
Davide Vanzo
|
58e3ab7c7f
|
Update azure_nccl_allreduce_ib_loopback.nhc
Fix misspelled vars
|
2024-04-04 13:22:41 -05:00 |
Davide Vanzo
|
69423ce2b9
|
Removed exit 0 as it prevents execution of externally appended commands
|
2024-03-28 16:20:14 -05:00 |
Davide Vanzo
|
2caead70fd
|
CC 8.5+ starts slurm services after cluster-init completion
|
2024-03-01 14:22:52 -06:00 |
Davide Vanzo
|
25ad0d9511
|
Prevent error if project is run again
|
2024-03-01 11:01:11 -06:00 |
Davide Vanzo
|
972dfa31aa
|
Merge pull request #741 from Azure/nhc_exclusive
Add support for Ubuntu 22.04 in nhc-run script
|
2024-02-21 11:33:13 -06:00 |
Davide Vanzo
|
ed9146309c
|
Revert change
|
2024-02-21 11:27:32 -06:00 |
Davide Vanzo
|
9bbfc48943
|
Merge pull request #742 from Azure/ecc_check_ndv5
Add support for NDv5 in check_gpu_ecc
|
2024-02-21 11:18:54 -06:00 |
Davide Vanzo
|
443cb6a8a1
|
Fixed assignment order
|
2024-02-21 11:18:06 -06:00 |
Davide Vanzo
|
ba009147de
|
Revert change
|
2024-02-21 11:16:40 -06:00 |