Mirror of https://github.com/microsoft/DeepSpeed.git
2afa1c7f2f
This PR brings in communication optimizations for large-scale training systems with both dense and MoE architectures. In particular, we have focused on the backward communication collectives, such as AllReduce and AllGather, which are used for ZeRO stages 1 and 2. We have also added optimizations for sequence parallelism to reduce the All2All cost. With these optimizations, we improve training scalability, observing a proportional boost in training speed as the number of GPUs increases. --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
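As background for the sequence-parallelism part of this change, the sketch below illustrates the general all-to-all pattern that sequence parallelism relies on: each rank holds a shard of the sequence dimension and exchanges it for a shard of the attention-head dimension before attention, then reverses the exchange afterwards. This is a minimal illustration using `torch.distributed`, not the code introduced by this PR; the function and tensor names are assumptions for the example.

```python
# Illustrative sketch of a sequence-parallel all-to-all (not DeepSpeed's actual code).
# Assumes torch.distributed is already initialized and num_heads is divisible by
# the process-group size.
import torch
import torch.distributed as dist


def seq_to_head_all_to_all(x: torch.Tensor, group=None) -> torch.Tensor:
    """Exchange a sequence shard for a head shard.

    Input:  [seq_len / P, batch, num_heads, head_dim]  (local sequence shard, all heads)
    Output: [seq_len,     batch, num_heads / P, head_dim]  (full sequence, local heads)
    """
    world_size = dist.get_world_size(group)
    seq_shard, batch, heads, dim = x.shape

    # Split the head dimension into world_size groups; group j is sent to rank j.
    send = (
        x.reshape(seq_shard, batch, world_size, heads // world_size, dim)
        .permute(2, 0, 1, 3, 4)
        .contiguous()
    )
    recv = torch.empty_like(send)

    # Each rank scatters its sequence shard across head groups and gathers the
    # full sequence for its own head group in a single collective.
    dist.all_to_all_single(recv, send, group=group)

    # Chunks received from ranks 0..P-1 cover the whole sequence for our local heads.
    return recv.reshape(world_size * seq_shard, batch, heads // world_size, dim)
```

A symmetric all-to-all after the attention computation maps the tensor back from the head-sharded layout to the sequence-sharded layout, so the cost of this exchange is what the PR's All2All optimizations target.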
__init__.py
layer.py