Commit graph

17 Commits

Author SHA1 Message Date
savitamittal1 07d87f095d
Added more recommendations (#3244) 2024-06-17 11:20:32 -07:00
cassieesvelt 93bb9a8d8d
fix when the job fails and doesn't kick the node (#3186)
* fix when the job fails and doesn't kick the node

* reformat
2024-05-13 13:28:16 -07:00
cassieesvelt fa2caa094f
fix readme with nhc descriptions (#3162) 2024-05-06 15:16:57 -07:00
cassieesvelt a3a1c5946f
Add nhc command job (#3016)
* add nhc checks

* fix process_count_per_instance

* Add job description

* update readme

* Add wrapper script

* update readme with changes

* remove testing files

* remove uneeded env file

* add node kick command

* reformat

* add node id print

* add fallocate command

* update fallocate

* update readme and reformat

* add 3T

* use new mcr image

* Add kick_bad_node flag

* format code

* add entry to Training readme

* add setup script
2024-04-30 09:13:40 -07:00
cassieesvelt 4a14687dac
Add elastic training benchmark (#2555)
* Add elastic benchmark results

* add graphs + link

* reformat
2023-10-16 11:36:18 -07:00
kdestin 577a8a0522
ci: Refactor Python/Jupyter formatting CI (#2337)
* Add code quality checks for python/jupyter

* refactor: Remove .github/workflows/smoke.yml

    Superceded by .github/workflows/code-quality-python.yml

* refactor: Remove smoke.yml badge from README files

* chore: Trigger .github/workflows/code-quality.yml on pushes to main
2023-10-11 15:28:30 -04:00
rdondera-microsoft d9acebaec5
Fix internal links. (#2393) 2023-06-22 10:02:07 -07:00
rdondera-microsoft 7c695d99df
Initial set of guidelines for large scale training for Computer Vision (#2381)
* ViT-Pretrain folder.

* Update to the README file under Training.

* Move launcher.py and conda.yml to src folder.

* Merge descriptions of model pretraining into a single paragraph.

* Note about Infiniband addressing multi-node case only.

* Copyright header for image classification script.
2023-06-21 10:47:16 -07:00
Samuel Kemp 88465236d1
Samuel100/loadingupdate (#2363)
* updated data loading

* data loading update

* addressed feedback
2023-06-13 10:04:39 +01:00
Neehar Duvvuri 6eb684f054
Rename job_service_type to type (#2253) 2023-05-05 12:47:36 -04:00
Li, Xiaoran d18f65698e
Change torch_nebula to nebulaml (#2219)
* Change torch_nebula to nebulaml

* Renaming the doc as README file

* Rename the package name from torch_nebula to nebulaml

---------

Co-authored-by: xiaoranli <xiaoranli@microsoft.com>
Co-authored-by: Ziqi Wang <zikeiwong@outlook.com>
2023-04-27 11:30:35 +08:00
savitamittal1 a6a8465afa
Update nebula.md for support of memory buffer size (#2153) 2023-03-27 16:23:12 -07:00
ccozianu c720ee1648
Update README.md (#2147)
fix typo
2023-03-24 09:59:26 +05:30
savitamittal1 c7dbfb0014
Update README.md (#2151)
* Update README.md

* resolved comments

* removed space

* Changed Monitoring and optimization to Bold as well.
2023-03-23 16:07:23 -07:00
savitamittal1 31b4358caa
Table of content fix and added smoke yaml (#2148)
* Table of content fix and added smoke yaml

* added sample page_type

* changed description

* updated description
2023-03-23 12:34:43 -07:00
Razvan Tanase ca3685b405
Fixing broken links in the BestPractices folder. (#2146)
Fixing broken links under BestPractices folder, used relative paths.
2023-03-21 15:15:16 -07:00
Razvan Tanase 2cbb042412
Adding best practices for large scale deep learning (#2144)
Adding best-practices for large-scale deep learning workloads.
2023-03-21 13:22:09 -07:00