Commit graph

25 commits

Author SHA1 Message Date
digger-yu 077e42e68a
Update install.sh (#3270)
Code optimizations:
1. Use #!/usr/bin/env bash instead of #!/bin/bash to make the script more portable.
2. Use rm -rf instead of rm -r so that recursive directory removal does not prompt for confirmation and ignores nonexistent paths.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-04-17 20:41:08 -07:00
Jeff Rasley c3c8d5dd93
AMD support (#1430)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: rraminen <rraminen@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: okakarpa <okakarpa@amd.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
2022-03-03 01:53:35 +00:00
Samyam Rajbhandari 599258f979
ZeRO 3 Offload (#834)
* Squash stage3 v1 (#146)

Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

* Fix correctness bug (#147)

* formatting fix (#150)

* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)

* fp16 Z3 API update and bugfix

* revert debug change

* ZeRO-3 detach and race condition bugfixes (#149)

* trying out ZeRO-3 race condition fix

* CUDA sync instead of stream

* reduction stream sync

* remove commented code

* Fix optimizer state_dict KeyError (#148)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)

* Simplifying the logic for getting averaged gradients (#153)

* skip for now

* Z3 Docs redux (#154)

* removing some TODOs and commented code (#155)

* New Z3 defaults (#156)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* formatting

* megatron external params

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
2021-03-08 12:54:54 -08:00
Jeff Rasley 7bf1b837a4
[install] add -e/--examples flag to checkout submodules (#755)
* add -e/--examples flag to checkout submodules

* bump DSE commit
2021-02-12 10:19:37 -08:00
Stas Bekman 78e776a9ac
[install] fixes/improvements/docs (#752)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2021-02-12 09:50:03 -08:00
Jeff Rasley 7435b2f10a
Ability to initialize distributed backend outside deepspeed runtime (#608)
2020-12-17 23:17:19 -08:00
Jeff Rasley 31f46feee2
DeepSpeed JIT op + PyPI support (#496)
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
2020-11-12 11:51:38 -08:00
Jeff Rasley 5bc7d4e1e6
Remove pip --use-feature (#419)
2020-09-17 16:57:54 -07:00
Shaden Smith 5812e84544
readthedocs yaml configuration (#410)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-09-16 18:57:43 -07:00
Shaden Smith 65c2f974d8
Pipeline parallel training engine. (#392)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-09-09 23:14:55 -07:00
Jeff Rasley 41db1c2f03
ZeRO-Offload release (#391)
* ZeRO-Offload (squash) (#381)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
2020-09-09 17:14:12 -07:00
Ammar Ahmad Awan 01726ce2b8
Add 1-bit Adam support to DeepSpeed (#380)
* 1-bit adam (#353)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: tanghl1994 <htang14@ur.rochester.edu>
Co-authored-by: Hank <tanghl1994@gmail.com>
Co-authored-by: root <root@node2x12b.cs.rochester.edu>
Co-authored-by: Ammar Ahmad Awan <awan.ammar@microsoft.com>
2020-09-09 14:37:37 -07:00
Jeff Rasley e5bbc2e559
Sparse attn + ops/runtime refactor + v0.3.0 (#343)
* Sparse attn + ops/runtime refactor + v0.3.0

Co-authored-by: Arash Ashari <arashari@microsoft.com>
2020-09-01 18:06:15 -07:00
Jeff Rasley f5025506de
install update: no-sudo + clean build files (#258)
* install update: no-sudo + clean build files

Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
2020-06-09 10:46:35 -07:00
Jeff Rasley 7dc209c661
add basic post-install test (#209)
* add basic post-install test
2020-05-05 15:01:39 -07:00
Jeff Rasley e0f5cc688e
add skip reqs flag (#133)
2020-03-11 13:29:18 -07:00
Jeff Rasley 259f894a8b
Install specific apex hash (#132)
* allow installing a specific apex commit
2020-03-11 12:17:12 -07:00
Incomplete 5f6294bd04
Add two CLI options to help with the installation inside of conda (#113)
* Add --no_sudo to run without sudo

* Add --pip_mirror to set the pip mirror

* Default to running pip without sudo

* Typo

* Add --pip_sudo to Dockerfile and azure-pipelines.yml

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-03-09 08:30:18 -07:00
Jeff Rasley 001abe2362
Refactor simple model test, fix pythonpath issue (#96)
Also a fix for #94
2020-02-20 14:16:41 -08:00
Jeff Rasley bf2689a9dd
Fix bug in install script, bump TF version (#71)
* bump tf version in dockerfile

* Update install.sh
2020-02-12 17:06:22 -08:00
Shaden Smith 50ae149f82
Moving to major/minor/patch versioning. (#51)
2020-02-09 20:03:35 -08:00
Jeff Rasley 9f2e54c09e
DeepSpeed dockerfile, install reqs, update examples reqs (#26)
* update examples submodule

* install requirements.txt with install script

* add dockerfile
2020-02-05 14:58:13 -08:00
Jeff Rasley 00825428bb
update install to use pdcp to distribute wheels (#12)
update install to use pdcp to distribute wheels
2020-02-04 14:28:33 -08:00
Shaden Smith b18eae24e8
Fixing file permissions (#1)
Fixing file permissions.
2020-02-03 10:55:19 -08:00
Jeff Rasley 16be6de6f1
Install script
2020-01-31 16:03:36 -08:00