Jeff Rasley
bbfd0a6a3e
update email info
2023-03-15 14:16:26 -07:00
Logan Adams
b4d40e357b
Fix example command when building wheel with dev version specified ( #2815 )
2023-02-21 18:16:35 +00:00
Jeff Rasley
0b549ad70a
[install] only add deepspeed pkg at install ( #2714 )
...
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-01-18 10:26:26 -08:00
Jeff Rasley
cd271a4aa6
exclude benchmarks during install ( #2698 )
2023-01-13 14:24:30 -08:00
Ma, Guokai
9548d48f48
Abstract accelerator (step 2) ( #2560 )
...
* Abstract accelerator (step 2)
* more flex op_builder path for both installation and runtime
* add SpatialInferenceBuilder into cuda_accelerator.py
* use reflection to make cuda_accelerator adapt to CUDA op builder change automatically
* clean up deepspeed/__init__.py
* add comments in cuda_accelerator for no torch path
* Update deepspeed/env_report.py
Change env_report.py according to suggestion
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
* reduce the range of try...except for better code clarity
* Add porting for deepspeed/ops/random_ltd/dropping_utils.py
* move accelerator to top directory and create symlink under deepspeed
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-01-06 23:40:58 -05:00
Jeff Rasley
35eabb0a33
Fix issues w. python 3.6 + add py-version checks to CI ( #2589 )
2022-12-09 21:53:58 +00:00
Michael Wyatt
521d329b97
Fix CI issues related to cupy install ( #2483 )
...
* remove any cupy install when setting up environments
* revert previous changes to run on cu111 runners
* fix for when no cupy is installed
* remove cupy uninstall for workflows not using latest torch version
* update to cu116 for inference tests
* fix pip uninstall line
* move python environment list to after DS install
* remove cupy uninstall
* re-add --forked
* fix how we get cupy version (should be based on nvcc version)
2022-11-08 10:17:03 -08:00
eltonzheng
b85eb3b979
Fix build issues on Windows ( #2428 )
...
* Fix build issues on Windows
* small fix to compile with new version of Microsoft C++ Build Tools
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
2022-10-26 00:14:43 +00:00
Jeff Rasley
1b7c6791d5
only add deps if extra is explicitly called ( #2432 )
2022-10-18 13:57:02 -07:00
Alex Hedges
316c4a43e0
Add flake8 to pre-commit checks ( #2051 )
2022-07-25 16:48:08 -07:00
Alex Hedges
3540ce74d9
Check for bf16 support only if CUDA is available ( #2049 )
...
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2022-07-06 17:17:31 -06:00
Quentin Anthony
9b70ce56e7
Comms Benchmarks ( #2040 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2022-06-29 10:49:20 -07:00
Jeff Rasley
7c3344e215
DeepSpeed examples refresh ( #2021 )
2022-06-15 18:46:30 -07:00
Jeff Rasley
b666d5cd73
[inference] test suite for ds-kernels (bert, roberta, gpt2, gpt-neo, gpt-j) ( #1992 )
...
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
2022-06-15 14:21:19 -07:00
Michael Wyatt
7fc3065074
Add torch-latest and torch-nightly CI workflows ( #1990 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2022-06-06 16:19:00 -07:00
Jithun Nair
350d74ca39
Invoke hipify from op builder for JIT extension builds ( #1807 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2022-03-07 18:59:14 +00:00
Andrii Oriekhov
d7684f4e81
add GitHub URL for PyPI ( #1812 )
...
* add GitHub URL for PyPI
* add GitHub URL for PyPI, fix formatting
2022-03-06 04:42:03 +00:00
Jeff Rasley
c3c8d5dd93
AMD support ( #1430 )
...
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
Co-authored-by: rraminen <rraminen@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: okakarpa <okakarpa@amd.com>
Co-authored-by: rraminen <rraminen@amd.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: okakarpa <okakarpa@amd.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
2022-03-03 01:53:35 +00:00
Jeff Rasley
9351266f78
Multi-node save pid support + allow sparse-attn extra ( #1728 )
2022-01-27 12:35:18 -08:00
Jeff Rasley
e46d808a1b
MoE inference + PR-MoE model support ( #1705 )
...
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Zhewei Yao <zheweiy@berkeley.edu>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
2022-01-18 16:25:01 -08:00
Cheng Li
9caa74e577
Autotuning ( #1554 )
...
* [squash] Staging autotuning v4
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* add new extra, guard xgboost, cleanup dead files (#268 )
* Fix autotuning docs (#1553 )
* fix docs
* rewording the goal
* fix typos
* fix typos (#1556 )
* fix typos
* fix format
* fix bug (#1557 )
* fix bug
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2021-11-13 08:56:55 +00:00
Jeff Rasley
2665c8b149
Fix 1bit extra issue ( #1542 )
2021-11-11 08:57:17 -08:00
Alex Hedges
be789b1665
Fix many typos ( #1423 )
...
* Fix typos in docs/
* Fix typos in code comments and output strings
* Fix typos in the code itself
* Fix typos in tests/
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2021-10-01 19:56:32 -07:00
Anurag Kumar
8e577c923d
Update setup.py ( #1361 )
...
updated classifiers
2021-09-13 09:37:32 -07:00
Adam Moody
e82060d090
query for libaio package using known package managers ( #1250 )
...
* aio: test for libaio with various package managers
* aio: note typical tool used to install libaio package
* setup: abort with error if cannot build requested op
* setup: define op_envvar to return op build environment variable
* setup: call is_compatible once for each op
* setup: only print suggestion to disable op when its envvar not set
* setup: add method to abort from fatal error
* Revert "setup: add method to abort from fatal error"
This reverts commit 0e4cde6b0a.
* setup: add method to abort from fatal error
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2021-07-28 22:42:27 -07:00
Jeff Rasley
da1fe2f82c
Remove hard torch dependency at install ( #1166 )
2021-06-16 14:18:37 -07:00
eltonzheng
10104284a2
Create symlinks on Windows setup ( #1099 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2021-05-24 22:53:44 -07:00
Reza Yazdani
ed3de0c21b
Quantization + inference release ( #1091 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: niumanar <60243342+niumanar@users.noreply.github.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: niumanar <60243342+niumanar@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: niumanar <60243342+niumanar@users.noreply.github.com>
2021-05-24 01:10:39 -07:00
Samyam Rajbhandari
599258f979
ZeRO 3 Offload ( #834 )
...
* Squash stage3 v1 (#146 )
Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
* Fix correctness bug (#147 )
* formatting fix (#150 )
* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151 )
* fp16 Z3 API update and bugfix
* revert debug change
* ZeRO-3 detach and race condition bugfixes (#149 )
* trying out ZeRO-3 race condition fix
* CUDA sync instead of stream
* reduction stream sync
* remove commented code
* Fix optimizer state_dict KeyError (#148 )
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152 )
* Simplifying the logic for getting averaged gradients (#153 )
* skip for now
* Z3 Docs redux (#154 )
* removing some TODOs and commented code (#155 )
* New Z3 defaults (#156 )
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* formatting
* megatron external params
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
2021-03-08 12:54:54 -08:00
Jeff Rasley
81aeea361d
Elastic training support ( #602 )
...
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
2020-12-22 22:26:26 -08:00
Jeff Rasley
be33bea475
Add compute capability 8.0 if on cuda 11+ ( #572 )
2020-12-02 17:22:16 -08:00
Shaden Smith
6009713653
Adds long_description to setup.py ( #560 )
2020-11-25 09:43:53 -08:00
Seunghwan Hong
d81cb26d92
Fix setup.py for cpu-only environment installation ( #538 )
...
* Add guard against using `torch.version.cuda` in a no-CUDA environment.
* Fix several typos in setup.py.
Signed-off-by: Seunghwan Hong <seunghwan@scatterlab.co.kr>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-11-19 13:49:31 -08:00
Jeff Rasley
ca9ab1201f
ds_report bug fix on cpu and guard torch import in setup.py ( #524 )
...
* on cpu box error gracefully if cuda home doesn't exist
* guard against torch import issue
* fix syntax error
* fix import
2020-11-12 13:58:14 -08:00
Jeff Rasley
31f46feee2
DeepSpeed JIT op + PyPI support ( #496 )
...
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
2020-11-12 11:51:38 -08:00
Jeff Rasley
7ddfda8526
Add support for p100 in transformer kernels ( #470 )
...
add compute cap of 6.0, support p100
2020-10-14 10:44:16 -07:00
Jeff Rasley
1afca8f722
revert previous (accidental) change
2020-10-12 16:11:55 -07:00
Jeff Rasley
b8eb40eb67
add compute cap of 6.0 to transformer kernels
...
add compute cap of 6.0 to transformer kernels
2020-10-12 16:10:18 -07:00
Olatunji Ruwase
7b8be2a7d9
Disable default installation of CPU Adam ( #450 )
...
* Disable default installation of CPU Adam
* Handle cpufeature import/use errors separately
2020-09-29 14:43:34 -07:00
Shaden Smith
5812e84544
readthedocs yaml configuration ( #410 )
...
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-09-16 18:57:43 -07:00
Jeff Rasley
240ea97b33
only add 1bit adam reqs if mpi is installed, update cond build for cpu-adam ( #400 )
2020-09-10 10:52:45 -07:00
Jeff Rasley
41db1c2f03
ZeRO-Offload release ( #391 )
...
* ZeRO-Offload (squash) (#381 )
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Jie <37380896+jren73@users.noreply.github.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
2020-09-09 17:14:12 -07:00
Ammar Ahmad Awan
01726ce2b8
Add 1-bit Adam support to DeepSpeed ( #380 )
...
* 1-bit adam (#353 )
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Your Name <you@example.com>
Co-authored-by: tanghl1994 <htang14@ur.rochester.edu>
Co-authored-by: Hank <tanghl1994@gmail.com>
Co-authored-by: root <root@node2x12b.cs.rochester.edu>
Co-authored-by: Ammar Ahmad Awan <awan.ammar@microsoft.com>
2020-09-09 14:37:37 -07:00
Jeff Rasley
e5bbc2e559
Sparse attn + ops/runtime refactor + v0.3.0 ( #343 )
...
* Sparse attn + ops/runtime refactor + v0.3.0
Co-authored-by: Arash Ashari <arashari@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
2020-09-01 18:06:15 -07:00
Jeff Rasley
871f7e6305
Update setup.py ( #298 )
2020-07-22 08:33:05 -07:00
Jeff Rasley
734d8991c8
Transformer kernel release ( #242 )
...
* Transformer kernels release
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
2020-05-29 13:15:36 -07:00
Jeff Rasley
3d3f8d36a4
PyTorch 1.3+ build support ( #135 )
...
* add support for torch 1.3+ builds inside a docker build environment
* remove apex imports
2020-03-12 13:08:58 -07:00
Shaden Smith
50ae149f82
Moving to major/minor/patch versioning. ( #51 )
2020-02-09 20:03:35 -08:00
Jeff Rasley
20ff66a0c1
Azure tutorial updates and cleanup ( #43 )
2020-02-08 22:00:24 -08:00
Jeff Rasley
a5acb5b22d
updating formatting to pass yapf
2020-02-03 11:12:45 -08:00