Mirror of https://github.com/microsoft/DeepSpeed.git
2afa1c7f2f
This PR brings in communication optimizations for large-scale training systems with both dense and MoE architectures. In particular, we have focused on the backward communication collectives, such as AllReduce and AllGather, which are used for ZeRO stages 1 and 2. We have also added optimizations for sequence parallelism to reduce the All2All cost. With these optimizations, we improve training scalability, observing a proportional boost in training speed as the number of GPUs increases. --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
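As background for the sequence-parallelism part of this change, the sketch below illustrates the general all-to-all pattern that sequence parallelism relies on: each rank holds a shard of the sequence dimension and exchanges it for a shard of the attention-head dimension before attention, then reverses the exchange afterwards. This is a minimal illustration using `torch.distributed`, not the code introduced by this PR; the function and tensor names are assumptions for the example.

```python
# Illustrative sketch of a sequence-parallel all-to-all (not DeepSpeed's actual code).
# Assumes torch.distributed is already initialized and num_heads is divisible by
# the process-group size.
import torch
import torch.distributed as dist


def seq_to_head_all_to_all(x: torch.Tensor, group=None) -> torch.Tensor:
    """Exchange a sequence shard for a head shard.

    Input:  [seq_len / P, batch, num_heads, head_dim]  (local sequence shard, all heads)
    Output: [seq_len,     batch, num_heads / P, head_dim]  (full sequence, local heads)
    """
    world_size = dist.get_world_size(group)
    seq_shard, batch, heads, dim = x.shape

    # Split the head dimension into world_size groups; group j is sent to rank j.
    send = (
        x.reshape(seq_shard, batch, world_size, heads // world_size, dim)
        .permute(2, 0, 1, 3, 4)
        .contiguous()
    )
    recv = torch.empty_like(send)

    # Each rank scatters its sequence shard across head groups and gathers the
    # full sequence for its own head group in a single collective.
    dist.all_to_all_single(recv, send, group=group)

    # Chunks received from ranks 0..P-1 cover the whole sequence for our local heads.
    return recv.reshape(world_size * seq_shard, batch, heads // world_size, dim)
```

A symmetric all-to-all after the attention computation maps the tensor back from the head-sharded layout to the sequence-sharded layout, so the cost of this exchange is what the PR's All2All optimizations target.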
__init__.py
layer.py