# Ulysses-Offload: Democratizing Long Context LLM Training
<img src="./media/image1.png" style="width:6.5in;height:3.34583in"
alt="A screenshot of a computer Description automatically generated" />

Figure 1: Ulysses-Offload supports 16x longer sequence lengths than NVIDIA Megatron-SP and DeepSpeed Ulysses while sustaining 55% Model FLOPs Utilization (MFU).

To cite this work and for a more in-depth technical treatment of this release, please see
our [arXiv report](https://arxiv.org/abs/2408.16978):

    @article{yao2024ulysses,
      title={Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer},
      author={Jinghan Yao and Sam Ade Jacobs and Masahiro Tanaka and Olatunji Ruwase and Aamir Shafi and Hari Subramoni and Dhabaleswar K. (DK) Panda},
      journal={https://arxiv.org/abs/2408.16978},
      year={2024}
    }

## Introduction
In the rapidly evolving field of generative AI and scientific ML, the ability to train
large (language) models with ultra-long context capabilities is becoming increasingly
important. These models are essential for a variety of complex tasks, such as
understanding lengthy documents, generating images and videos, and processing extensive
sequences in computational biology. However, training such models efficiently poses
significant challenges due to the enormous GPU memory required.

Building on DeepSpeed Ulysses, our previous project that developed system optimizations
for training extremely long sequence transformer models, we are excited to present
Ulysses-Offload in this release. Ulysses-Offload is an innovative, resource-efficient
technique that offers benefits comparable to DeepSpeed Ulysses and other previous
long-context optimization methods, but with a lower hardware budget. Ulysses-Offload
makes ultra-long-context large language model (LLM) training and finetuning accessible
to everyone, including those with limited GPU resources. Ulysses-Offload enables
training with context lengths of up to 2 million tokens using just 4 NVIDIA A100-40GB
GPUs, and supports 16x longer sequence lengths than NVIDIA Megatron-SP and DeepSpeed
Ulysses while sustaining 55% Model FLOPs Utilization (MFU) (see Figure 1). The next
section highlights the key innovations of Ulysses-Offload, and subsequent sections
provide additional details on its design and usability, followed by experimental
results.

## Key Innovations
### 1. Fully Pipelined Distributed Transformer (FPDT)
The core innovation of our work is the Fully Pipelined Distributed Transformer (FPDT).
This approach leverages pipelined sequence chunking, which allows training LLMs with
sequence lengths of up to 2 million tokens on just 4 A100-40GB GPUs. By breaking the
sequence into manageable chunks and processing them in a pipelined manner,
Ulysses-Offload significantly reduces the memory footprint while maintaining high
computational efficiency. This method ensures that the GPUs are utilized effectively,
even when dealing with extremely long sequences.

### 2. Memory Optimization
One of the critical aspects of our approach is the comprehensive analysis and
optimization of the memory footprint during LLM training. We target the reduction of
redundant intermediate buffers in both the forward and backward passes of the training
process. By optimizing the use of GPU and host CPU memory, we can train larger models
with longer sequences without running into GPU memory limitations. This optimization is
crucial for enabling the training of ultra-long context models on a limited number of
GPUs. It is worth noting that Ulysses-Offload memory optimization is orthogonal and
complementary to the model-parameter-focused memory optimization techniques used by
DeepSpeed ZeRO and PyTorch FSDP: Ulysses-Offload optimizes the memory footprint of
activations associated with long sequences, while ZeRO and FSDP optimize the memory
footprint of model parameters.

### 3. Compatibility and Flexibility
Ulysses-Offload is designed to be agnostic to existing training techniques and works
efficiently across different LLM models, including popular architectures such as GPT
and Llama. This flexibility ensures that our approach can be easily integrated into
various training workflows. Additionally, Ulysses-Offload is compatible with advanced
memory optimization techniques such as DeepSpeed ZeRO and PyTorch FSDP, further
enhancing its usability and performance.

## Core Design of Ulysses-Offload
Figure 2 illustrates the core structure of Ulysses-Offload. Ulysses-Offload leverages
the multiple memory hierarchies in modern GPU clusters, boosting hardware efficiency
and cost-effectiveness while achieving very high Model FLOPs Utilization (MFU). The
design of Ulysses-Offload centers around pipelining, scheduling, and memory management.
These well-known optimization techniques are essential for scaling LLM context length
to the million-token scale with a few GPUs, and they are discussed in the subsequent
subsections.

<img src="./media/image2.png" style="width:6.5in;height:2.68634in"
alt="A screenshot of a computer Description automatically generated" />

Figure 2: Core design

### Pipelining and Scheduling
Ulysses-Offload employs sequence chunking and a pipelined computation design to manage
the memory and computational load efficiently. In a traditional Transformer model, the
input (hidden state) tensor is projected into q, k, v tensors. Each of these tensors
can be denoted *\[B, S, H, D\]*, where *B* is the batch size, *S* is the sequence
length, *H* is the number of heads, and *D* is the hidden dimension per head. With
sequence parallelism such as DeepSpeed Ulysses, the input tensor is partitioned along
the sequence dimension across a sequence-parallel group of size *P*, that is
*\[B, S/P, H, D\]*, prior to the alltoall collective communication. The alltoall
collective gathers the partitioned tensors along the sequence dimension and scatters
them along the head dimension, essentially transforming the tensor from
*\[B, S/P, H, D\]* to *\[B, S, H/P, D\]*. After the attention computation, a second
alltoall transforms *\[B, S, H/P, D\]* back to *\[B, S/P, H, D\]*.

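To make the tensor reshuffle concrete, below is a minimal PyTorch sketch of the
sequence-to-head alltoall described above. It is our own illustration (the function
name and head-grouping convention are assumptions, not the DeepSpeed implementation),
built on `torch.distributed.all_to_all_single` applied to a *\[B, S/P, H, D\]* shard.

```python
import torch
import torch.distributed as dist

def seq_to_head_all_to_all(x: torch.Tensor, group=None) -> torch.Tensor:
    """Redistribute a local [B, S/P, H, D] shard into [B, S, H/P, D].

    Illustrative sketch of the Ulysses-style alltoall: each rank sends a distinct
    slice of the head dimension to every peer and receives the peers' sequence
    slices back, so attention sees the full sequence for a subset of heads.
    """
    world = dist.get_world_size(group)
    b, s_shard, h, d = x.shape
    assert h % world == 0, "head count must be divisible by the sequence-parallel size"

    # Group heads by destination rank: [P, B, S/P, H/P, D]
    x = x.reshape(b, s_shard, world, h // world, d).permute(2, 0, 1, 3, 4).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)

    # The received chunks are the peers' sequence shards for our head slice;
    # stitch them together along the sequence dimension -> [B, S, H/P, D].
    return out.permute(1, 0, 2, 3, 4).reshape(b, world * s_shard, h // world, d)
```

The inverse redistribution after attention (back to *\[B, S/P, H, D\]*) follows the
same pattern with the roles of the sequence and head dimensions swapped.
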
In our Ulysses-Offload design, input sequences are partitioned at a much finer
granularity than in DeepSpeed Ulysses: we further subdivide each GPU's local *S/P*
sequence into *u* smaller chunks, so the input tensors are now represented as
*\[B, S/uP, H, D\]*. We denote these chunks as *T<sub>i</sub>*, where
$i \in \{0, 1, \ldots, u-1\}$. As shown in Figure 2, *T<sub>i</sub>* is projected to
query *q<sub>i</sub>*, key *k<sub>i</sub>*, and value *v<sub>i</sub>*. Then, similar to
DeepSpeed Ulysses, an alltoall collective communication gathers the partitioned tensors
along the sequence dimension and scatters them along the head dimension. In our chunked
design, the sequence length handled per chunk is reduced by a factor of *u* compared to
Ulysses. Please note that our Ulysses-Offload chunking procedure is generally
applicable to other sequence parallelism techniques.

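As a small illustration of this finer-grained partitioning (the helper name is
hypothetical, not part of the DeepSpeed API), each rank's local shard can simply be
split along the sequence dimension into *u* chunks that are then projected and
exchanged one at a time:

```python
import torch

def split_into_chunks(local_shard: torch.Tensor, u: int) -> list[torch.Tensor]:
    """Split a local [B, S/P, ...] shard into u sequence chunks T_0 ... T_{u-1},
    each of shape [B, S/(u*P), ...], along the sequence dimension."""
    assert local_shard.size(1) % u == 0, "local sequence length must be divisible by u"
    return list(torch.chunk(local_shard, u, dim=1))
```
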
<img src="./media/image3.png" style="width:6.5in;height:5.36042in"
|
||||||
|
alt="A screenshot of a computer Description automatically generated" />
|
||||||
|
|
||||||
|
Figure 3: Core design with offload description
|
||||||
|
|
||||||
|
Figure 3 gives an example of how the computation of chunk *T<sub>m</sub>* is performed.
After the alltoall collective communication, *GPU<sub>j</sub>* receives
$\widehat{q}_m$, $\widehat{k}_m$, and $\widehat{v}_m$. We then fetch the previous
sequence chunks, one chunk at a time, from host memory to *GPU<sub>j</sub>*, perform
online attention with the current $\widehat{q}_m$, and update the output chunk
accordingly. Note that, strictly speaking, at any given time only one set of chunks
$\widehat{k}_i$ and $\widehat{v}_i$ is placed in the GPU's HBM, reducing the memory
footprint to $\frac{1}{u}$ of the non-offloading version without double buffering. With
double buffering, the memory footprint is reduced to $\frac{2}{u}$.

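The per-chunk computation can be sketched as a streaming (online) attention loop: KV
chunks are pulled from host memory one at a time and folded into a numerically stable
running softmax. The snippet below is a simplified illustration under our own
assumptions (hypothetical helper name, no causal mask, no double buffering), not the
DeepSpeed kernel:

```python
import torch

def chunked_online_attention(q, kv_chunks_cpu, scale):
    """Online attention for one query chunk `q` of shape [B, H, Sq, D].

    `kv_chunks_cpu` is a list of (k, v) pairs of shape [B, H, Skc, D] kept in
    (pinned) host memory; only the chunk currently being processed occupies HBM.
    """
    out = torch.zeros_like(q)
    running_max = torch.full(q.shape[:-1], float("-inf"), device=q.device)
    running_sum = torch.zeros(q.shape[:-1], device=q.device)

    for k_cpu, v_cpu in kv_chunks_cpu:
        k = k_cpu.to(q.device, non_blocking=True)
        v = v_cpu.to(q.device, non_blocking=True)
        scores = torch.einsum("bhsd,bhtd->bhst", q, k) * scale

        # Numerically stable streaming softmax: rescale what has been accumulated
        # so far to the new running maximum, then fold in the current chunk.
        new_max = torch.maximum(running_max, scores.amax(dim=-1))
        correction = torch.exp(running_max - new_max)
        p = torch.exp(scores - new_max.unsqueeze(-1))
        out = out * correction.unsqueeze(-1) + torch.einsum("bhst,bhtd->bhsd", p, v)
        running_sum = running_sum * correction + p.sum(dim=-1)
        running_max = new_max

    return out / running_sum.unsqueeze(-1)
```
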
### Memory Management
Ulysses-Offload optimizes memory usage by carefully managing the allocation and
deallocation of buffers during training. This involves:

1. Double Buffering:

   - Two sets of buffers are maintained to overlap computation with data transfer.

   - While one set of buffers is used for computation, the other set is preloaded with
     the next chunk of data (see the sketch after this list).

2. Hierarchical Memory Utilization:

   - GPU High Bandwidth Memory (HBM) is used for active computation.

   - Host memory is used to store intermediate results that are not immediately needed,
     reducing the pressure on GPU memory.

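Below is a minimal sketch of the double-buffering idea, assuming pinned host-memory
chunks and a dedicated CUDA copy stream; the helper is our illustration, not the
DeepSpeed implementation.

```python
import torch

def prefetch_pipeline(chunks_cpu, compute_fn, device="cuda"):
    """Consume host-resident chunks one by one while prefetching the next chunk.

    While chunk i is processed by `compute_fn` on the default stream, chunk i+1 is
    copied host->device on a side stream so the transfer overlaps the computation.
    `chunks_cpu` should be pinned-memory tensors for the copies to be asynchronous.
    """
    copy_stream = torch.cuda.Stream(device=device)
    results = []

    # Preload the first chunk on the copy stream.
    with torch.cuda.stream(copy_stream):
        current = chunks_cpu[0].to(device, non_blocking=True)

    for i in range(len(chunks_cpu)):
        # Ensure the pending transfer has finished before computing on `current`.
        torch.cuda.current_stream(device).wait_stream(copy_stream)
        nxt = None
        if i + 1 < len(chunks_cpu):
            with torch.cuda.stream(copy_stream):
                nxt = chunks_cpu[i + 1].to(device, non_blocking=True)
        # Tell the caching allocator that `current` is consumed on the default stream.
        current.record_stream(torch.cuda.current_stream(device))
        results.append(compute_fn(current))
        current = nxt
    return results
```
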
## Integration with Existing Frameworks
Ulysses-Offload is designed to integrate seamlessly with popular deep learning
frameworks such as PyTorch. Ulysses-Offload provides user-friendly APIs that abstract
the complexities of pipelined training and memory management, so users can adopt
Ulysses-Offload with minimal changes to their existing codebases.

## Experimental Results
<img src="./media/image4.png" style="width:6.5in;height:3.37431in"
alt="A collage of graphs Description automatically generated" />

Figure 4: Supported sequence lengths and corresponding Model FLOPs Utilization (MFU)
using Megatron-SP, Ulysses, and our proposed Ulysses-Offload (FPDT). OOM marks the
point beyond which increasing the sequence length causes out-of-memory failures. We
show Ulysses-Offload's performance only for sequence lengths above 128K, as shorter
sequences can be handled properly by existing strategies.

### Extended Sequence Lengths
In our experimental setup, we compare Ulysses-Offload with two existing methods:
Microsoft DeepSpeed Ulysses and NVIDIA Megatron-SP. Both DeepSpeed Ulysses and
Megatron-SP employ similar approaches to sequence parallelism but differ in the
collective communication used for gathering sequences before the attention block: the
former utilizes alltoall communication, whereas the latter employs allgather.
Ulysses-Offload builds upon the DeepSpeed Ulysses approach. The primary advantage of
Ulysses-Offload is its capability to support the training of large language models
(LLMs) with ultra-long sequence lengths using fewer GPUs. As shown in Figure 4, our
method enables the training of 8B-parameter models with sequence lengths of 2 million
tokens using only 4 GPUs. For even larger models, such as GPT-30B and Llama-70B,
Ulysses-Offload supports sequence lengths of up to 3 million and 4 million tokens using
16 GPUs and 32 GPUs, respectively. This represents a 16x increase in sequence length
compared to current state-of-the-art solutions (see Figure 5), making Ulysses-Offload a
game-changer for tasks that require processing long sequences.

### High Hardware Efficiency
As shown in Figure 4, across model sizes ranging from GPT-2.7B to Llama-70B,
Ulysses-Offload achieves over 55% Model FLOPs Utilization (MFU), ensuring that the
hardware resources are utilized effectively. This high level of efficiency is
maintained even when dealing with extremely long sequences (up to 4 million tokens of
context), making Ulysses-Offload an ideal solution for training large-scale LLMs. By
maximizing the use of available hardware, Ulysses-Offload reduces the overall cost and
complexity of training long-context models. Our
[technical report](https://arxiv.org/abs/2408.16978) offers further insights into
choosing sequence chunk sizes to balance the trade-off between memory usage and MFU.

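For reference, MFU here follows the usual definition: achieved model FLOPs per second
divided by the hardware's peak FLOPs. A back-of-the-envelope estimate, using the
standard per-token FLOP approximation (a rule-of-thumb assumption, not a formula taken
from this blog), looks like:

```python
def model_flops_utilization(tokens_per_sec: float, n_params: float, n_layers: int,
                            d_model: int, seq_len: int, peak_flops_per_sec: float) -> float:
    """Estimate MFU with the common ~(6*N + 12*L*d_model*T) training FLOPs per token,
    where N is the parameter count and T the sequence length."""
    flops_per_token = 6 * n_params + 12 * n_layers * d_model * seq_len
    return tokens_per_sec * flops_per_token / peak_flops_per_sec
```
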
<img src="./media/image5.png" style="width:6.5in;height:2.01667in" />
|
||||||
|
|
||||||
|
Figure 5: A comprehensive analysis on long-context LLM training with
|
||||||
|
different training techniques: tensor parallelism (TP), activation
|
||||||
|
checkpoint (AC), activation checkpoint with CPU offloading (OC), Ulysses
|
||||||
|
(UL), and our approach Ulysses-Offload (FPDT).
|
||||||
|
|
||||||
|
## Implementation and Usability
Ulysses-Offload is designed to be easily integrated with popular deep learning
frameworks such as DeepSpeed, Megatron-DeepSpeed, and PyTorch. Users can adopt our
approach with minimal changes to their existing training pipeline, making it accessible
to a broad audience. The integration process involves setting up the sequence chunk
pipeline and configuring the memory optimization techniques, both of which are
straightforward and well-documented (see the
[tutorial](https://www.deepspeed.ai/tutorials/fpdt/)).

Our pipeline design and memory optimization techniques are straightforward to
implement, making Ulysses-Offload accessible to researchers and practitioners aiming to
train long-context LLMs efficiently. We provide a detailed
[technical report](https://arxiv.org/abs/2408.16978), documentation, and examples to
guide users through the setup process, ensuring a smooth transition to using
Ulysses-Offload. Additionally, in the tradition of DeepSpeed, Ulysses-Offload provides
a user-friendly API that abstracts the complexities of mixed precision training and
memory optimization, allowing users to focus on their research and development tasks.

## General Availability of DeepSpeed Ulysses-Offload
We are excited to release Ulysses-Offload. Ulysses-Offload has been fully integrated
with Megatron-DeepSpeed and is accessible through both the DeepSpeed and
Megatron-DeepSpeed GitHub repos. See the detailed
[tutorial](https://www.deepspeed.ai/tutorials/fpdt/) for usage.

We invite the community to explore our implementation, contribute to further
advancements, and join us in pushing the boundaries of what is possible in LLM and AI.
This release is part of the bigger DeepSpeed ecosystem of large-scale AI training,
finetuning, and inference. For more details on all DeepSpeed technologies and
innovations, please visit our [website](https://www.deepspeed.ai/) and follow us on X,
formerly Twitter, ([English](https://twitter.com/MSFTDeepSpeed),
[Japanese](https://twitter.com/MSFTDeepSpeedJP)) and
[Chinese Zhihu](https://www.zhihu.com/people/deepspeed).