Commit Graph

375 Commits

Author SHA1 Message Date
Jeff Rasley 7435b2f10a
Ability to initialize distributed backend outside deepspeed runtime (#608) 2020-12-17 23:17:19 -08:00
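A note on #608: when a script is launched by something other than the `deepspeed` launcher, the distributed backend can be brought up explicitly before engine creation. A minimal sketch, assuming the `deepspeed.init_distributed()` entry point this PR adds and a stand-in model (the `deepspeed.initialize` keywords shown here vary across versions):
```
import torch
import deepspeed

# Bring up torch.distributed (NCCL by default) ourselves; a no-op if the
# process group has already been initialized by the surrounding runtime.
deepspeed.init_distributed(dist_backend="nccl")

model = torch.nn.Linear(10, 10)  # stand-in model for illustration only

# DeepSpeed then reuses the existing process group instead of creating one.
# The config argument is shown schematically; its exact form is version-dependent.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config="ds_config.json",
)
```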
Reza Yazdani fd2f970bdf
Transformer-kernel - supporting any arbitrary sequence-length (#587)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-12-17 10:13:54 -08:00
Jeff Rasley 6380ee3511
Fixes for RTD build errors (#606)
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
2020-12-15 15:29:21 -08:00
Stas Bekman 007466e576
[doc] xref to hostfile discussion (#604)
* [doc] xref to hostfile discussion

It wasn't clear where to find what was meant by `hostfile`, so this adds a link to where it is discussed.

* remove whitespace
2020-12-15 13:44:32 -08:00
Stas Bekman 9f8e8f3829
implement missing get_last_lr (#595)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-12-14 14:24:58 -08:00
Jeff Rasley c5a449f9a3
Update launcher to set local rank environ variable (#597)
* Update launch.py

* formatting
2020-12-11 14:54:45 -08:00
carefree0910 a4763f5516
Supported customizing kwargs for lr_scheduler (#584)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-12-11 13:52:06 -08:00
Stas Bekman 66268bd337
add DeepSpeedZeroConfig repr method (#596)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-12-11 12:40:14 -08:00
Stas Bekman 8a184b6b1d
[build] fix compute capability arch flags, add PTX, handle PTX (#591)
* fix arch flags, add PTX

* bug fix

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-12-11 10:15:33 -08:00
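For readers unfamiliar with the flags fixed in #591: each targeted compute capability is normally expressed as an nvcc `-gencode` pair, and additionally emitting PTX for the newest architecture keeps the binary forward compatible with future GPUs. A rough sketch of how such flags can be assembled (the capability list is illustrative, not the actual builder logic):
```
def cuda_arch_flags(capabilities=("6.0", "7.0", "8.0", "8.6")):
    """Build nvcc -gencode flags: SASS (code=sm_XX) for each capability,
    plus PTX (code=compute_XX) for the last, assumed newest, entry."""
    flags = []
    for cc in capabilities:
        num = cc.replace(".", "")
        flags.append(f"-gencode=arch=compute_{num},code=sm_{num}")
    newest = capabilities[-1].replace(".", "")
    flags.append(f"-gencode=arch=compute_{newest},code=compute_{newest}")  # PTX
    return flags

print(" ".join(cuda_arch_flags()))
```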
Jeff Rasley 0518252d64 add manual workflow to run tests with precompiled ops 2020-12-11 10:05:37 -08:00
Jeff Rasley 7300f3e328
Add AML video link 2020-12-09 12:42:40 -08:00
Jeff Rasley 19acd6cf17
Add papers/videos to readme/website (#592) 2020-12-09 12:25:47 -08:00
Jeff Rasley cb7c7da6f7
bump to 0.3.8 2020-12-09 09:04:08 -08:00
Jeff Rasley d901a6d2f5
Pin triton to 0.2.3 for now, 0.3.0 is broken 2020-12-09 09:03:05 -08:00
Shaden Smith 2f6269787a
Pipeline warnings and checkpoint portability (#588)
* Switch from deprecated allreduce interface.

* Make pipeline checkpoint files portable.
2020-12-08 09:42:08 -08:00
Stas Bekman e8b126d986
[build] add compute_86 (#577)
RTX-30 series are compute_86
```
python -c "import torch; print(torch.cuda.get_device_capability())"
```
This PR adds support for this compute capability.

Reference: https://developer.nvidia.com/cuda-gpus

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-12-07 12:49:50 -08:00
Stas Bekman ce363d0e06
[build] make builder smarter and configurable wrt compute capabilities + docs (#578) 2020-12-07 12:08:41 -08:00
Zhun 1e44d48d53
Fix potential random layout inconsistency issues in sparse attention modules (#534)
* Register the layout as a buffer of the module so that we can save/load checkpoints.

* Add a broadcast of the layout at the beginning to ensure different processes have a consistent layout during distributed training.

* Add docstring for max_seq_length argument in SparseSelfAttention

Co-authored-by: Zhun Liu <zhunliu@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-12-04 14:58:10 -08:00
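The pattern described in #534 (register the sparsity layout as a module buffer so it travels with checkpoints, then broadcast it once so all ranks agree) looks roughly like the following general idiom; it is not the actual `SparseSelfAttention` code:
```
import torch
import torch.distributed as dist

class LayoutHolder(torch.nn.Module):
    def __init__(self, layout: torch.Tensor):
        super().__init__()
        # A registered buffer is included in state_dict(), so the (possibly
        # randomly generated) layout is saved and restored with checkpoints.
        self.register_buffer("layout", layout)
        self._synced = False

    def maybe_sync_layout(self):
        # Broadcast rank 0's layout once so every process in distributed
        # training works with an identical layout.
        if not self._synced and dist.is_available() and dist.is_initialized():
            dist.broadcast(self.layout, src=0)
            self._synced = True
```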
Stas Bekman ff58fa7e5a
[build] build against installed cuda-11.1 while torch built w/ cuda-11.0 (#570) 2020-12-02 21:20:16 -08:00
Jeff Rasley be33bea475
Add compute capability 8.0 if on cuda 11+ (#572) 2020-12-02 17:22:16 -08:00
Stas Bekman 2d1f7c0172
[engine] train should be able to get `mode` arg (#571) 2020-12-02 16:54:00 -08:00
Jeff Rasley 845921b3b6
Add 'latest' checkpoint save/load support (#569) 2020-12-02 13:49:31 -08:00
Stas Bekman 7a75f8b36f
[cifar tutorial] improve readability (#567)
* [cifar tutorial] improve readability
2020-12-02 11:10:47 -08:00
Reza Yazdani 9f52a36fad
tracking optimizer step in cpu-adam when loading checkpoint (#564)
* tracking optimizer step in cpu-adam when loading checkpoint

* add warning/error message for updating optimizer step count

* resolve build issue

* supporting state update from the python side

* track step from python in all cases

* remove comma
2020-12-01 15:11:38 -08:00
Reza Yazdani c78c29f938
supporting different hidden dimensions (#559)
* supporting different hidden dimensions

* add support for larger hidden dimensions (greater than 8K)

* remove empty line

* add loop unrolling factor for dropout kernels

* update different kernels based on the reviews

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-12-01 14:01:24 -08:00
Stas Bekman 17f36f1b2e
[doc] typo fix and clarification (#563)
This PR:
* fixes a misspelled method name
* also `( () )` doesn't read too well, until one reads the code and understands that it's not a formatting bug. I proposed to simply say that it's a callable object.
2020-11-27 21:05:27 -08:00
Jeff Rasley c51fa65de8 bump to 0.3.7 2020-11-25 15:20:07 -08:00
Jeff Rasley e4e20662fd update manifest 2020-11-25 15:19:14 -08:00
Jeff Rasley 73c3262df6
bump to 0.3.6 and fix manifest to include reqs (#561) 2020-11-25 10:27:10 -08:00
Shaden Smith 6009713653
Adds long_description to setup.py (#560) 2020-11-25 09:43:53 -08:00
Jeff Rasley 16313a962b bump to 0.3.5 2020-11-23 04:51:53 -08:00
Jeff Rasley eec44af1e3
Turn back on PP tests (#558) 2020-11-24 17:29:08 -08:00
Ammar Ahmad Awan 0e831e23b6
Simplify dist init and only init if needed. (#553)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-11-24 16:37:13 -08:00
Olatunji Ruwase 6e65c2cc08
Deprecate client ability to disable gradient reduction (#552)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-11-24 15:14:37 -08:00
Jeff Rasley 1ef5cd2398
Update badges and CI name (#557) 2020-11-24 14:37:40 -08:00
Jeff Rasley 3347460ed1
Switch CI to GitHub Actions (#556) 2020-11-24 14:31:27 -08:00
Jeff Rasley c18fb0de91
Create main.yml 2020-11-24 14:03:29 -08:00
Samyam Rajbhandari 00c3a254a9
Bug fix for norm calculation in absence of model parallel group (#551)
In the absence of a model parallel group, model_parallel_allreduce should not do any reduction. This commit fixes a bug where a model parallel allreduce was performed across the world group when the model parallel group is None.
2020-11-23 11:29:20 -08:00
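A schematic of the fix in #551: skip the reduction entirely when there is no model parallel group, instead of silently falling back to the world group (names are illustrative, not the exact DeepSpeed code):
```
import torch.distributed as dist

def model_parallel_allreduce(tensor, group=None):
    # With no model parallel group there is nothing to reduce across; the bug
    # was that this case fell through to an all-reduce over the world group.
    if group is None:
        return tensor
    dist.all_reduce(tensor, group=group)
    return tensor
```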
Samyam Rajbhandari bcd56f9772
Adding static_loss_scale to unfused optimizer (#546) 2020-11-22 20:07:37 -08:00
Olatunji Ruwase 6021b70288
Support non-tensor state in checkpoint (#548) 2020-11-21 15:41:22 -08:00
Olatunji Ruwase 0178e6cc22
Fix unbalanced gradients bug in ZeRO-2 gradient accumulation (#545)
* Use zero-tensors for missing gradients to avoid size mismatch

* Unit test for unbalanced gradients in ZeRO

* Formatting fixes
2020-11-20 15:39:01 -08:00
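The first bullet of #545 amounts to substituting zero tensors for parameters that received no gradient in a micro-batch, so every rank reduces buffers of the same size; a minimal sketch of the idea (not ZeRO-2's actual bookkeeping):
```
import torch

def gradients_for_reduction(params):
    # Parameters untouched in this micro-batch have grad == None; use zeros of
    # the same shape so the flattened reduction buffers match across ranks.
    return [p.grad if p.grad is not None else torch.zeros_like(p) for p in params]
```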
Jeff Rasley 6b28bc5db5 bump version 0.3.4 2020-11-19 23:10:37 +00:00
Ammar Ahmad Awan 1b45917cf6
Discover variables for NCCL backend on AML without mpi4py (#542)
* Use AML method to set env vars instead of using mpi4py.

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-11-19 15:04:51 -08:00
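What the "AML method" in #542 typically amounts to is reading the OpenMPI variables that Azure ML already exports and mapping them onto the variables `torch.distributed`'s env:// initialization expects, without importing mpi4py. A hedged sketch, with master address/port handling simplified to placeholders:
```
import os

def set_torch_dist_env_from_ompi():
    # Azure ML launches ranks via OpenMPI, which exports OMPI_COMM_WORLD_*.
    os.environ["RANK"] = os.environ["OMPI_COMM_WORLD_RANK"]
    os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]
    os.environ["LOCAL_RANK"] = os.environ["OMPI_COMM_WORLD_LOCAL_RANK"]
    # Master address/port are deployment-specific; placeholders only.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
```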
Seunghwan Hong d81cb26d92
Fix setup.py for cpu-only environment installation (#538)
* Add a guard to avoid using `torch.version.cuda` in a no-CUDA environment.
* Fix several typos in setup.py.

Signed-off-by: Seunghwan Hong <seunghwan@scatterlab.co.kr>

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-11-19 13:49:31 -08:00
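The guard added in #538 is the usual check that CUDA metadata actually exists before parsing it in setup.py; roughly:
```
import torch

# On a CPU-only build of torch, torch.version.cuda is None; guard before parsing.
if torch.version.cuda is not None:
    cuda_major, cuda_minor = torch.version.cuda.split(".")[:2]
else:
    cuda_major, cuda_minor = None, None
```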
Jeff Rasley dce054dbba
backwards compatibility w. v020 ckpts, fix issue with zero-1 ckpts (#543) 2020-11-19 13:48:40 -08:00
Jeff Rasley 9de21b72b5
bump to v0.3.3 2020-11-19 08:36:19 -08:00
Jeff Rasley 08c96a1bc6
ZeRO-1 tune max-elems + bug fix (#532)
* zero-1 memory fix

* auto-tune max elems per comm to reduce padding/comm intervals

* clean-up and added previously missing reduction options

* fix testing backend to work with torch1.7
2020-11-19 08:16:27 -08:00
Jeff Rasley fdd81c305c
more fine-grained manifest file for includes/excludes (#540) 2020-11-18 16:42:19 -08:00
Jeff Rasley 5b09be60f7
append job-name if explicit output dir is given (#539) 2020-11-18 14:53:04 -08:00
Olatunji Ruwase 7752dc5ea1
Fix layout bug in ZeRO Stage 1 checkpoint logic (#531)
* Fix layout bug in ZeRO Stage 1 checkpoint logic
Add elastic checkpoint option for ZeRO stage 1, default to True

* Format fixes
2020-11-17 16:20:02 -08:00