* Change the sparse attention API to be compatible with latest changes on the triton side
* remove compatibility checks for CUDA 11
* Update requirements-sparse_attn.txt
Co-authored-by: Arash Ashari <arashari@microsoft.com>
* Unused parameters assert should be disabled by default
* Fix message
* Invert assert logic in unit test
* Change option for ignoring unused parameters
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* Add additional conditions when checking types of output from the model
* Add test
* Modify test to use torch.tensor as well
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
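For context, a minimal sketch (not the actual DeepSpeed code) of the kind of output-type handling the commits above describe, where a model may return either a bare `torch.Tensor` or a tuple/list containing the loss:

```python
import torch

def extract_loss(outputs):
    # Models may return a bare tensor, or a tuple/list whose first
    # element is the loss; accept both shapes.
    if isinstance(outputs, torch.Tensor):
        return outputs
    if isinstance(outputs, (tuple, list)) and len(outputs) > 0:
        return outputs[0]
    raise ValueError(f"Unexpected model output type: {type(outputs)}")
```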
* Add find_unused_parameters option
As unused parameters in modules may sometimes be unexpected,
add an explicit error message when they occur and an option to avoid the error: https://github.com/microsoft/DeepSpeed/issues/707
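A minimal config sketch of the resulting option. The key name changed several times during review (see the renaming commits below); `ignore_unused_parameters` under `zero_optimization` is assumed here, with `False` restoring the explicit error:

```python
# Hedged sketch of a DeepSpeed config dict (e.g., passed to
# deepspeed.initialize); the option name is an assumption.
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {
        "stage": 2,
        # Defaults to True (assert disabled); set False to get an
        # explicit error when a module has parameters that never
        # receive gradients (see issue #707).
        "ignore_unused_parameters": False,
    },
}
```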
* Fix syntax error
* Fix yapf error
* Fix yapf error
* Fix yapf error
* Fix yapf error
* Move stage2 find_unused_parameters to config file
* Add stage2 find_unused_parameters
* Add stage2 find_unused_parameters
* Add stage2_find_unused_parameters option
* Change error msg to reflect zero_optimization config change
* Fix yapf error
* Fix yapf errors
* Change find_unused_parameters option name
* Change find_unused_parameters option name
* Change find_unused_parameters option name
* Change find_unused_parameters option name
* Change find_unused_parameters option name
* Add UnusedParametersModel to test the find_unused_parameters option
* Add unit test for stage2 find_unused_parameters
* Add cpu-adam compatible check
* Remove duplicate import
* Trim spaces
* Fix yapf errors
* Trim spaces
* Add false-positive test check
* Fix find_unused_parameters test
* Trim spaces
* Fix yapf error
* Use amp autocast in ZeRO3 linear
* Fix typo
* Handle specific exceptions
* CI breaks on torch.distributed
* Add autocast unit test
* Format fixes
* Fix skip logic
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
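A hedged sketch of the autocast idea, assuming `torch.cuda.amp.autocast` is the autocast entry point (the actual ZeRO-3 linear override lives in DeepSpeed's stage-3 code and differs in detail):

```python
import torch
import torch.nn.functional as F

def zero3_linear(input, weight, bias=None):
    # Guard for older torch builds that predate torch.cuda.amp.autocast;
    # the "Handle specific exceptions" / "Fix skip logic" commits above
    # deal with exactly this kind of version gap.
    if hasattr(torch.cuda.amp, "autocast"):
        with torch.cuda.amp.autocast():
            return F.linear(input, weight, bias)
    return F.linear(input, weight, bias)
```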
Authors: @awan-10 @conglongli @samyam @jeffra
What's new:
An NCCL-based implementation that provides better performance and usability than the MPI-based implementation.
Added support for momentum masks for parameters with constant zero gradients during training.
Bug fixes (e.g., #813).
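A hedged sketch of selecting the NCCL backend in a DeepSpeed config; `OneBitAdam`, `freeze_step`, `comm_backend_name`, and `cuda_aware` follow the docs of this release and should be treated as assumptions if your version differs:

```python
# Hedged sketch: 1-bit Adam with the new NCCL communication backend.
ds_config = {
    "train_batch_size": 16,
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 1e-4,
            "freeze_step": 23000,          # warmup steps before compression starts
            "comm_backend_name": "nccl",   # NCCL path needs torch >= 1.8
            "cuda_aware": False,           # only relevant for the MPI backend
        },
    },
    "fp16": {"enabled": True},  # 1-bit compression is intended for fp16 training
}
```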
* NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)
* NCCL based 1-bit Implementation + Refactor to add communication backends (#593)
* add nccl 1-bit optim.
* temporary commit to save stuff.
* Use dist collectives instead of mpi routines.
* remove old code for comm.
* Fix bugs. still does not work.
* modify to test the nccl side code path
* Initial gather impl. Works intra-node.
* Updates to comm phase 2; NCCL comm passed the tests.
* refactor code to introduce nccl/mpi as backends for onebit adam.
* Refactor updates to test/engine.
* Fix compile/runtime errors.
* simplify support for nccl/mpi backends.
* Add missing file
* Add compression backend in constructor. Revert later.
* modify test with some perf counting.
* Implement a true non-blocking gather for nccl side.
* Revert "Add compression backend in constructor. Revert later."
This reverts commit df8c40d310.
* improve the 1-bit adam test.
* Refactor comm. and compression backend in 1-bit adam.
* Fix the test.
* Fix runtime errors and typos in nccl backend
* fix mpi backend. modify tests.
* modify nccl perf test.
* fix mpi side errors.
* Add an mpi perf test
* Sync DSE.
* Remove old collectives file.
* Undo a typo.
* Graceful failure for torch versions that don't support nccl pt2pt.
* Revert "Merge branch 'master' into staging-1bit-nccl-v2"
This reverts commit 7840085070, reversing
changes made to a6dba72aea.
* Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""
This reverts commit 6dbdd9858b.
* comm optimization + 1-bit lamb
* Saving/debugging commit.
* finalizing 1-bit lamb
* finalizing 1-bit lamb
* add momentum mask and chkpt handling for 1-bit adam
* Cleanup and modify nccl test to be runnable with deepspeed launcher.
* Fix format.
* fix formatting again.
* make test runnable without mpi4py
* Add dist.alltoall and dist.allgather instead of custom functions.
* remove debug prints.
* formatting and renaming
* renaming
* renaming
* add unit test, fix existing tests
* skip unit test when torch < 1.8
* revert 1-bit lamb
* flatten momentum when dimension is more than 1
* add warning message for 1-bit adam under fp32
* improve version check
* add fp32 test
* 1-bit adam doc
* fix file name
* doc fix
* torch 1.8 is released
* doc fix
* fix tests
* update news
* add doc for momentum mask
* fix checkpoint handling, add unit test
* checkpoint handling doc
* doc final cleanup
* bump dates
* update tests
* url change
* doc fix
* fix test
* doc update
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
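A hedged sketch of the version gate described by the graceful-failure and test-skip commits above; the exact check in DeepSpeed may differ, and the `packaging` library is assumed to be available:

```python
import torch
from packaging import version

# The NCCL 1-bit path relies on point-to-point ops that landed in
# torch 1.8, hence the "skip unit test when torch < 1.8" commit.
TORCH_VERSION = version.parse(torch.__version__.split("+")[0])

if TORCH_VERSION < version.parse("1.8"):
    raise RuntimeError("1-bit Adam's NCCL backend requires torch >= 1.8 "
                       "for point-to-point communication support.")
```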
* Fix misaligned grad
When a parameter's size is not divisible by the world size, the partitioned gradients are misaligned due to incorrect padding handling. This PR fixes that (a sketch of the intended padding follows this block).
* Formatting fix
* Adding static_scale test back for Z3, and changing hidden size to be not divisible by world_size
* also removing alignment from flat fp16 buffers
* Testing for hidden dim alignment
* inference hook fix
* Update stage3.py
* formatting
* [bug-fix] move params to gpu if offload params is turned off
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
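A hedged illustration of the intended padding, assuming a flat gradient buffer partitioned with `torch.chunk`; this is not the actual stage-3 code:

```python
import torch

def partition_with_padding(flat_grad: torch.Tensor, world_size: int):
    # Pad the flat gradient so it divides evenly across ranks;
    # misaligned shards are what the fix above addresses when
    # numel is not divisible by world size.
    remainder = flat_grad.numel() % world_size
    padding = (world_size - remainder) % world_size
    if padding:
        flat_grad = torch.cat([flat_grad, flat_grad.new_zeros(padding)])
    # Every rank now receives a shard of identical length.
    return list(torch.chunk(flat_grad, world_size))
```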
* Add Linear warmup+decay lr schedule
Update lr schedule unit tests
* LR scheduler unit tests for LR Range Test and 1Cycle
* Disable yapf to preserve parameterization
* Disable test_pipe.py for CI debugging
* Disable test_lr_scheduler for CI debugging
* Disable test_lr_scheduler for CI debugging
* Enable all unit tests for CI debugging
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
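A hedged config sketch of the linear warmup + decay schedule; it is assumed to surface as the `WarmupDecayLR` scheduler with the parameter names below:

```python
# Hedged sketch: linear ramp to warmup_max_lr, then linear decay.
ds_config = {
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0.0,
            "warmup_max_lr": 3e-4,
            "warmup_num_steps": 1000,    # linear warmup
            "total_num_steps": 100000,   # then decay over the remainder
        },
    },
}
```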
* supporting different hidden dimensions
* add support for larger hidden dimensions (greater than 8K)
* remove empty line
* add loop unrolling factor for dropout kernels
* update different kernels based on the reviews
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* zero-1 memory fix
* auto-tune max elems per comm to reduce padding/comm intervals
* clean up and add previously missing reduction options
* fix testing backend to work with torch 1.7
* add adamW to CPU-ADAM implementation
* supporting cpu-adam optimizer for zero-offload on deepspeed side
* bump DSE to match cpu-adam updates
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
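A hedged usage sketch, assuming the `adamw_mode` flag on `DeepSpeedCPUAdam` is the switch these commits added for decoupled weight decay:

```python
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

model = torch.nn.Linear(8, 8)  # toy model for illustration
# adamw_mode switches CPU-Adam to decoupled weight decay (AdamW),
# for use with ZeRO-Offload where the optimizer runs on the CPU.
optimizer = DeepSpeedCPUAdam(model.parameters(),
                             lr=1e-4,
                             weight_decay=0.01,
                             adamw_mode=True)
```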