* [doc] xref to hostfile discussion
It wasn't clear where to find what was meant by `hostfile`, so this adds a link to where it's discussed.
* remove whitespace
RTX 30-series GPUs are compute capability 8.6 (`compute_86`); a device's capability can be checked with:
```
python -c "import torch; print(torch.cuda.get_device_capability())"
```
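On an RTX 30-series GPU the snippet above prints `(8, 6)`.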
This PR adds support for this compute capability.
Reference: https://developer.nvidia.com/cuda-gpus
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* 1) Register the layout as a buffer of the module so that we can save/load checkpoints; 2) add a broadcast of the layout at the beginning to ensure different processes have a consistent layout during distributed training (both sketched below).
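A minimal sketch of both changes, assuming a PyTorch module that owns a sparse-attention `layout` tensor; the class and method names here are illustrative, not the actual DeepSpeed code:
```python
import torch
import torch.distributed as dist

class SparseAttentionLayout(torch.nn.Module):
    def __init__(self, layout: torch.Tensor):
        super().__init__()
        # (1) Registering `layout` as a buffer puts it in state_dict(),
        # so it is saved and restored with checkpoints.
        self.register_buffer("layout", layout)

    def sync_layout(self):
        # (2) Broadcast from rank 0 so every process starts distributed
        # training with an identical layout.
        if dist.is_available() and dist.is_initialized():
            dist.broadcast(self.layout, src=0)
```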
* Add docstring for max_seq_length argument in SparseSelfAttention
Co-authored-by: Zhun Liu <zhunliu@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* track the optimizer step in cpu-adam when loading a checkpoint
* add warning/error message for updating the optimizer step count
* resolve build issue
* support state updates from the Python side
* track the step from Python in all cases (see the sketch below)
* remove comma
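A minimal sketch of the step tracking, with illustrative names rather than the actual DeepSpeedCPUAdam code: the step count lives in the Python-side optimizer state, so `load_state_dict()` restores it from a checkpoint and the update resumes with the correct count.
```python
import torch

class CPUAdamLike(torch.optim.Optimizer):
    def __init__(self, params, lr=1e-3):
        super().__init__(params, dict(lr=lr))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                # Track the step from the Python side in all cases; after
                # load_state_dict() this resumes from the checkpointed value
                # instead of restarting at zero.
                state["step"] = state.get("step", 0) + 1
                # ...state["step"] would be passed to the CPU Adam kernel.
```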
* support different hidden dimensions
* add support for larger hidden dimensions (greater than 8K)
* remove empty line
* add a loop-unrolling factor for the dropout kernels
* update the different kernels based on review feedback
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
This PR:
* fixes a misspelled method name
* also, `( () )` doesn't read well until one reads the code and understands that it's not a formatting bug; I propose simply saying that it's a callable object.
In the absence of a model parallel group, `model_parallel_allreduce` should not do any reduction. This commit fixes a bug where a model-parallel allreduce was performed across the world group when the model parallel group is None.
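A minimal sketch of the fixed behavior, using an illustrative function name on top of `torch.distributed`:
```python
import torch
import torch.distributed as dist

def model_parallel_allreduce(tensor: torch.Tensor, group=None) -> torch.Tensor:
    # With no model parallel group there is nothing to reduce over:
    # return the tensor unchanged rather than letting all_reduce fall
    # back to the default (world) process group.
    if group is None:
        return tensor
    dist.all_reduce(tensor, group=group)
    return tensor
```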
* Add a guard to avoid using `torch.version.cuda` in a no-CUDA environment (see the sketch below).
* Fix several typos in setup.py.
Signed-off-by: Seunghwan Hong <seunghwan@scatterlab.co.kr>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
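A minimal sketch of the guard from the first bullet above: `torch.version.cuda` is `None` in CPU-only builds, so it is checked before use (illustrative, not the exact setup.py change):
```python
import torch

if torch.cuda.is_available() and torch.version.cuda is not None:
    cuda_major = int(torch.version.cuda.split(".")[0])
else:
    cuda_major = None  # no-CUDA environment: skip CUDA-specific build flags
```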
* zero-1 memory fix
* auto-tune max elements per comm to reduce padding/comm intervals (see the sketch below)
* clean up and add previously missing reduction options
* fix testing backend to work with torch 1.7
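A minimal sketch of the auto-tuning idea from the second bullet: scan downward from a cap and pick the per-communication element count that needs the least padding to split the flattened gradients into equal intervals. This is illustrative only, not the actual ZeRO-1 logic.
```python
def pick_max_elems_per_comm(total_elems: int, cap: int) -> int:
    # Prefer larger intervals (fewer comms); break ties toward less padding.
    best, best_pad = cap, (-total_elems) % cap
    for candidate in range(cap - 1, 0, -1):
        if best_pad == 0:
            break  # perfect split already found
        pad = (-total_elems) % candidate  # padding needed for equal intervals
        if pad < best_pad:
            best, best_pad = candidate, pad
    return best

# e.g. pick_max_elems_per_comm(1_000_000, 500_000) -> 500000 (no padding)
```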