
31 Commits

Author SHA1 Message Date
Cheng Li 9caa74e577
Autotuning (#1554)
* [squash] Staging autotuning v4

Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* add new extra, guard xgboost, cleanup dead files (#268)

* Fix autotuning docs (#1553)

* fix docs

* rewording the goal

* fix typos

* fix typos (#1556)

* fix typos

* fix format

* fix bug (#1557)

* fix bug

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2021-11-13 08:56:55 +00:00
Jeff Rasley 2665c8b149
Fix 1bit extra issue (#1542) 2021-11-11 08:57:17 -08:00
Alex Hedges be789b1665
Fix many typos (#1423)
* Fix typos in docs/

* Fix typos in code comments and output strings

* Fix typos in the code itself

* Fix typos in tests/

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2021-10-01 19:56:32 -07:00
Anurag Kumar 8e577c923d
Update setup.py (#1361)
updated classifiers
2021-09-13 09:37:32 -07:00
Adam Moody e82060d090
query for libaio package using known package managers (#1250)
* aio: test for libaio with various package managers

* aio: note typical tool used to install libaio package

* setup: abort with error if cannot build requested op

* setup: define op_envvar to return op build environment variable

* setup: call is_compatible once for each op

* setup: only print suggestion to disable op when its envvar not set

* setup: add method to abort from fatal error

* Revert "setup: add method to abort from fatal error"

This reverts commit 0e4cde6b0a.

* setup: add method to abort from fatal error

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2021-07-28 22:42:27 -07:00
Jeff Rasley da1fe2f82c
Remove hard torch dependency at install (#1166) 2021-06-16 14:18:37 -07:00
eltonzheng 10104284a2
Create symlinks on Windows setup (#1099)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2021-05-24 22:53:44 -07:00
Reza Yazdani ed3de0c21b
Quantization + inference release (#1091)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: niumanar <60243342+niumanar@users.noreply.github.com>
2021-05-24 01:10:39 -07:00
Samyam Rajbhandari 599258f979
ZeRO 3 Offload (#834)
* Squash stage3 v1 (#146)

Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

* Fix correctness bug (#147)

* formatting fix (#150)

* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)

* fp16 Z3 API update and bugfix

* revert debug change

* ZeRO-3 detach and race condition bugfixes (#149)

* trying out ZeRO-3 race condition fix

* CUDA sync instead of stream

* reduction stream sync

* remove commented code

* Fix optimizer state_dict KeyError (#148)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)

* Simplifying the logic for getting averaged gradients (#153)

* skip for now

* Z3 Docs redux (#154)

* removing some TODOs and commented code (#155)

* New Z3 defaults (#156)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* formatting

* megatron external params

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
2021-03-08 12:54:54 -08:00
Jeff Rasley 81aeea361d
Elastic training support (#602)
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
2020-12-22 22:26:26 -08:00
Jeff Rasley be33bea475
Add compute capability 8.0 if on cuda 11+ (#572) 2020-12-02 17:22:16 -08:00
Shaden Smith 6009713653
Adds long_description to setup.py (#560) 2020-11-25 09:43:53 -08:00
Seunghwan Hong d81cb26d92
Fix setup.py for cpu-only environment installation (#538)
* Add guard against using `torch.version.cuda` in a no-CUDA environment.
* Fix several typos in setup.py.

Signed-off-by: Seunghwan Hong <seunghwan@scatterlab.co.kr>

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-11-19 13:49:31 -08:00
Jeff Rasley ca9ab1201f
ds_report bug fix on cpu and guard torch import in setup.py (#524)
* on cpu box error gracefully if cuda home doesn't exist

* guard against torch import issue

* fix syntax error

* fix import
2020-11-12 13:58:14 -08:00
Jeff Rasley 31f46feee2
DeepSpeed JIT op + PyPI support (#496)
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
2020-11-12 11:51:38 -08:00
Jeff Rasley 7ddfda8526
Add support for p100 in transformer kernels (#470)
add compute cap of 6.0, support p100
2020-10-14 10:44:16 -07:00
Jeff Rasley 1afca8f722
revert previous (accidental) change 2020-10-12 16:11:55 -07:00
Jeff Rasley b8eb40eb67
add compute cap of 6.0 to transformer kernels
2020-10-12 16:10:18 -07:00
Olatunji Ruwase 7b8be2a7d9
Disable default installation of CPU Adam (#450)
* Disable default installation of CPU Adam

* Handle cpufeature import/use errors separately
2020-09-29 14:43:34 -07:00
Shaden Smith 5812e84544
readthedocs yaml configuration (#410)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-09-16 18:57:43 -07:00
Jeff Rasley 240ea97b33
only add 1bit adam reqs if mpi is installed, update cond build for cpu-adam (#400) 2020-09-10 10:52:45 -07:00
Jeff Rasley 41db1c2f03
ZeRO-Offload release (#391)
* ZeRO-Offload (squash) (#381)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
2020-09-09 17:14:12 -07:00
Ammar Ahmad Awan 01726ce2b8
Add 1-bit Adam support to DeepSpeed (#380)
* 1-bit adam (#353)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: tanghl1994 <htang14@ur.rochester.edu>
Co-authored-by: Hank <tanghl1994@gmail.com>
Co-authored-by: root <root@node2x12b.cs.rochester.edu>
Co-authored-by: Ammar Ahmad Awan <awan.ammar@microsoft.com>
2020-09-09 14:37:37 -07:00
Jeff Rasley e5bbc2e559
Sparse attn + ops/runtime refactor + v0.3.0 (#343)
* Sparse attn + ops/runtime refactor + v0.3.0

Co-authored-by: Arash Ashari <arashari@microsoft.com>
2020-09-01 18:06:15 -07:00
Jeff Rasley 871f7e6305
Update setup.py (#298) 2020-07-22 08:33:05 -07:00
Jeff Rasley 734d8991c8
Transformer kernel release (#242)
* Transformer kernels release

Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-05-29 13:15:36 -07:00
Jeff Rasley 3d3f8d36a4
PyTorch 1.3+ build support (#135)
* add support for torch 1.3+ builds inside a docker build environment
* remove apex imports
2020-03-12 13:08:58 -07:00
Shaden Smith 50ae149f82 Moving to major/minor/patch versioning. (#51) 2020-02-09 20:03:35 -08:00
Jeff Rasley 20ff66a0c1 Azure tutorial updates and cleanup (#43) 2020-02-08 22:00:24 -08:00
Jeff Rasley a5acb5b22d updating formatting to pass yapf 2020-02-03 11:12:45 -08:00
Jeff Rasley 26e5ecab9b
Initial commit of setup 2020-01-31 15:57:11 -08:00