Jeff Rasley
7435b2f10a
Ability to initialize distributed backend outside deepspeed runtime ( #608 )
2020-12-17 23:17:19 -08:00
Jeff Rasley
dce054dbba
backwards compatibility w. v020 ckpts, fix issue with zero-1 ckpts ( #543 )
2020-11-19 13:48:40 -08:00
Jeff Rasley
31f46feee2
DeepSpeed JIT op + PyPI support ( #496 )
...
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
2020-11-12 11:51:38 -08:00
Reza Yazdani
f5aa2547d8
Add CPUAdam optimizer for zero-offload in deepspeed engine ( #484 )
...
* add adamW to CPU-ADAM implementation
* supporting cpu-adam optimizer for zero-offload on deepspeed side
* bump DSE to match cpu-adam updates
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-10-30 09:01:04 -07:00
Shaden Smith
65c2f974d8
Pipeline parallel training engine. ( #392 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-09-09 23:14:55 -07:00
Jeff Rasley
41db1c2f03
ZeRO-Offload release ( #391 )
...
* ZeRO-Offload (squash) (#381 )
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
2020-09-09 17:14:12 -07:00
Jeff Rasley
e5bbc2e559
Sparse attn + ops/runtime refactor + v0.3.0 ( #343 )
...
* Sparse attn + ops/runtime refactor + v0.3.0
Co-authored-by: Arash Ashari <arashari@microsoft.com>
2020-09-01 18:06:15 -07:00
Chunyang Wen
e1ad8803eb
Add log util ( #230 )
...
* Add log util
* replace all occurrences of print and logging
* address format
* disable propagate to avoid duplicate log
2020-06-04 14:05:04 -07:00
Jeff Rasley
734d8991c8
Transformer kernel release ( #242 )
...
* Transformer kernels release
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-05-29 13:15:36 -07:00
Jeff Rasley
f2ac7eafd5
ZeRO-2 ( #217 )
...
Updates for ZeRO stage 2 + ZeRO stage 1 w. RS
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: yuxionghe <yuxhe@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
2020-05-19 01:00:53 -07:00
Shaden Smith
dd166ee6b6
README and RTD improvements. ( #198 )
2020-04-21 22:18:47 -07:00
Jeff Rasley
7e8132832f
MPI 3.x support via mpi4py ( #107 )
...
* add mpirun support for openmpi 4.0
* add master addr support from args
* switch mpi detection to use mpi4py
* set constant for default distributed port
* Make sure deepspeed_mpi exists in args
2020-02-27 07:22:56 -08:00
Jeff Rasley
5aa58b3878
Init distributed torch only if needed ( #108 )
...
* add auto-detect to torch dist init
* update tests to infer distributed init status
* prevent crash if dist_init_required is True but already initialized
* only init if safe to do so (forgot to add this file in prev commit)
2020-02-26 15:07:49 -08:00
Jeff Rasley
5897091eb9
add deprecated deepspeed flag for legacy code ( #104 )
2020-02-24 12:47:17 -08:00
Jeff Rasley
001abe2362
Refactor simple model test, fix pythonpath issue ( #96 )
...
Also a fix for #94
2020-02-20 14:16:41 -08:00
Shaden Smith
2abef1ef76
Updating MPU docs ( #92 )
2020-02-19 21:41:57 -08:00
Shaden Smith
50ae149f82
Moving to major/minor/patch versioning. ( #51 )
2020-02-09 20:03:35 -08:00
Olatunji Ruwase
8326aff279
Improve doc string for add_XXX_arguments ( #32 )
...
Unit tests for add_XXX_arguments
2020-02-06 13:14:22 -08:00
Shaden Smith
b18eae24e8
Fixing file permissions ( #1 )
...
Fixing file permissions.
2020-02-03 10:55:19 -08:00
Jeff Rasley
6ef93347ed
add deepspeed init
2020-01-31 16:16:04 -08:00