Commit Graph

20 Commits

Author SHA1 Message Date
Jeff Rasley 7435b2f10a
Ability to initialize distributed backend outside deepspeed runtime (#608) 2020-12-17 23:17:19 -08:00
Jeff Rasley dce054dbba
backwards compatibility w. v020 ckpts, fix issue with zero-1 ckpts (#543) 2020-11-19 13:48:40 -08:00
Jeff Rasley 31f46feee2
DeepSpeed JIT op + PyPI support (#496)
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
2020-11-12 11:51:38 -08:00
Reza Yazdani f5aa2547d8
Add CPUAdam optimizer for zero-offload in deepspeed engine (#484)
* add adamW to CPU-ADAM implementation

* supporting cpu-adam optimizer for zero-offload on deepspeed side

* bump DSE to match cpu-adam updates

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-10-30 09:01:04 -07:00
Shaden Smith 65c2f974d8
Pipeline parallel training engine. (#392)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-09-09 23:14:55 -07:00
Jeff Rasley 41db1c2f03
ZeRO-Offload release (#391)
* ZeRO-Offload (squash) (#381)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
2020-09-09 17:14:12 -07:00
Jeff Rasley e5bbc2e559
Sparse attn + ops/runtime refactor + v0.3.0 (#343)
* Sparse attn + ops/runtime refactor + v0.3.0

Co-authored-by: Arash Ashari <arashari@microsoft.com>
2020-09-01 18:06:15 -07:00
Chunyang Wen e1ad8803eb
Add log util (#230)
* Add log util

* replace all occurrences of print and logging

* address format

* disable propagate to avoid duplicate log
2020-06-04 14:05:04 -07:00
Jeff Rasley 734d8991c8
Transformer kernel release (#242)
* Transformer kernels release

Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-05-29 13:15:36 -07:00
Jeff Rasley f2ac7eafd5
ZeRO-2 (#217)
Updates for ZeRO stage 2 + ZeRO stage 1 w. RS

Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: yuxionghe <yuxhe@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
2020-05-19 01:00:53 -07:00
Shaden Smith dd166ee6b6
README and RTD improvements. (#198) 2020-04-21 22:18:47 -07:00
Jeff Rasley 7e8132832f
MPI 3.x support via mpi4py (#107)
* add mpirun support for openmpi 4.0

* add master addr support from args

* switch mpi detection to use mpi4py

* set constant for default distributed port

* Make sure deepspeed_mpi exits in args
2020-02-27 07:22:56 -08:00
Jeff Rasley 5aa58b3878
Init distributed torch only if needed (#108)
* add auto-detect to torch dist init

* update tests to infer distributed init status

* prevent crash if dist_init_required is True but already initialized

* only init if safe to do so (forgot to add this file in prev commit)
2020-02-26 15:07:49 -08:00
Jeff Rasley 5897091eb9
add deprecated deepspeed flag for legacy code (#104) 2020-02-24 12:47:17 -08:00
Jeff Rasley 001abe2362
Refactor simple model test, fix pythonpath issue (#96)
Also a fix for #94
2020-02-20 14:16:41 -08:00
Shaden Smith 2abef1ef76
Updating MPU docs (#92) 2020-02-19 21:41:57 -08:00
Shaden Smith 50ae149f82
Moving to major/minor/patch versioning. (#51) 2020-02-09 20:03:35 -08:00
Olatunji Ruwase 8326aff279
Improve doc string for add_XXX_arguments (#32)
Unit tests for add_XXX_arguments
2020-02-06 13:14:22 -08:00
Shaden Smith b18eae24e8
Fixing file permissions (#1)
Fixing file permissions.
2020-02-03 10:55:19 -08:00
Jeff Rasley 6ef93347ed
add deepspeed init 2020-01-31 16:16:04 -08:00