savitamittal1
07d87f095d
Added more recommendations ( #3244 )
2024-06-17 11:20:32 -07:00
cassieesvelt
93bb9a8d8d
fix when the job fails and doesn't kick the node ( #3186 )
...
* fix when the job fails and doesn't kick the node
* reformat
2024-05-13 13:28:16 -07:00
cassieesvelt
fa2caa094f
fix readme with nhc descriptions ( #3162 )
2024-05-06 15:16:57 -07:00
cassieesvelt
a3a1c5946f
Add nhc command job ( #3016 )
...
* add nhc checks
* fix process_count_per_instance
* Add job description
* update readme
* Add wrapper script
* update readme with changes
* remove testing files
* remove unneeded env file
* add node kick command
* reformat
* add node id print
* add fallocate command
* update fallocate
* update readme and reformat
* add 3T
* use new mcr image
* Add kick_bad_node flag
* format code
* add entry to Training readme
* add setup script
2024-04-30 09:13:40 -07:00
cassieesvelt
4a14687dac
Add elastic training benchmark ( #2555 )
...
* Add elastic benchmark results
* add graphs + link
* reformat
2023-10-16 11:36:18 -07:00
kdestin
577a8a0522
ci: Refactor Python/Jupyter formatting CI ( #2337 )
...
* Add code quality checks for python/jupyter
* refactor: Remove .github/workflows/smoke.yml
Superseded by .github/workflows/code-quality-python.yml
* refactor: Remove smoke.yml badge from README files
* chore: Trigger .github/workflows/code-quality.yml on pushes to main
2023-10-11 15:28:30 -04:00
rdondera-microsoft
d9acebaec5
Fix internal links. ( #2393 )
2023-06-22 10:02:07 -07:00
rdondera-microsoft
7c695d99df
Initial set of guidelines for large scale training for Computer Vision ( #2381 )
...
* ViT-Pretrain folder.
* Update to the README file under Training.
* Move launcher.py and conda.yml to src folder.
* Merge descriptions of model pretraining into a single paragraph.
* Note about Infiniband addressing multi-node case only.
* Copyright header for image classification script.
2023-06-21 10:47:16 -07:00
Samuel Kemp
88465236d1
Samuel100/loadingupdate ( #2363 )
...
* updated data loading
* data loading update
* addressed feedback
2023-06-13 10:04:39 +01:00
Neehar Duvvuri
6eb684f054
Rename job_service_type to type ( #2253 )
2023-05-05 12:47:36 -04:00
Li, Xiaoran
d18f65698e
Change torch_nebula to nebulaml ( #2219 )
...
* Change torch_nebula to nebulaml
* Renaming the doc as README file
* Rename the package name from torch_nebula to nebulaml
---------
Co-authored-by: xiaoranli <xiaoranli@microsoft.com>
Co-authored-by: Ziqi Wang <zikeiwong@outlook.com>
2023-04-27 11:30:35 +08:00
savitamittal1
a6a8465afa
Update nebula.md for support of memory buffer size ( #2153 )
2023-03-27 16:23:12 -07:00
ccozianu
c720ee1648
Update README.md ( #2147 )
...
fix typo
2023-03-24 09:59:26 +05:30
savitamittal1
c7dbfb0014
Update README.md ( #2151 )
...
* Update README.md
* resolved comments
* removed space
* Changed Monitoring and optimization to Bold as well.
2023-03-23 16:07:23 -07:00
savitamittal1
31b4358caa
Table of content fix and added smoke yaml ( #2148 )
...
* Table of content fix and added smoke yaml
* added sample page_type
* changed description
* updated description
2023-03-23 12:34:43 -07:00
Razvan Tanase
ca3685b405
Fixing broken links in the BestPractices folder. ( #2146 )
...
Fixing broken links under BestPractices folder, used relative paths.
2023-03-21 15:15:16 -07:00
Razvan Tanase
2cbb042412
Adding best practices for large scale deep learning ( #2144 )
...
Adding best-practices for large-scale deep learning workloads.
2023-03-21 13:22:09 -07:00