# Ulysses-Offload: Democratizing Long Context LLM Training
<img src="./media/image1.png" style="width:6.5in;height:3.34583in"
alt="A screenshot of a computer Description automatically generated" />

Figure 1: Ulysses-Offload supports 16x longer sequence lengths than NVIDIA Megatron-SP and DeepSpeed Ulysses while sustaining 55% Model FLOPs Utilization (MFU).

To cite this work and for a more in-depth technical treatment of this release, please see
our [arXiv report](https://arxiv.org/abs/2408.16978):

    @article{yao2024ulysses,
      title={Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer},
      author={Jinghan Yao and Sam Ade Jacobs and Masahiro Tanaka and Olatunji Ruwase and Aamir Shafi and Hari Subramoni and Dhabaleswar K. (DK) Panda},
      journal={https://arxiv.org/abs/2408.16978},
      year={2024}
    }

## Introduction
In the rapidly evolving field of generative AI and scientific ML, the ability to train
large (language) models with ultra-long context capabilities is becoming increasingly
important. These models are essential for a variety of complex tasks, such as
understanding lengthy documents, generating images and videos, and processing extensive
sequences in computational biology. However, training such models efficiently poses
significant challenges due to the enormous GPU memory required.

Building on DeepSpeed Ulysses, our previous project that developed system optimizations
for training extremely long sequence transformer models, we are excited to present
Ulysses-Offload in this release. Ulysses-Offload is an innovative, resource-efficient
technique that offers benefits comparable to DeepSpeed Ulysses and other previous
long-context optimization methods, but with a lower hardware budget. Ulysses-Offload
makes ultra-long-context large language model (LLM) training and finetuning accessible
to everyone, including those with limited GPU resources. Ulysses-Offload enables
training with context lengths of up to 2 million tokens using just 4 NVIDIA A100-40GB
GPUs, and supports 16x longer sequence lengths than NVIDIA Megatron-SP and DeepSpeed
Ulysses while sustaining 55% Model FLOPs Utilization (MFU) (see Figure 1). The next
section highlights the key innovations of Ulysses-Offload, and subsequent sections
provide additional details on its design and usability, followed by experimental
results.

## Key Innovations
### 1. Fully Pipelined Distributed Transformer (FPDT)
The core innovation of our work is the Fully Pipelined Distributed Transformer (FPDT).
This approach leverages pipelined sequence chunking, which allows training LLMs with
sequence lengths of up to 2 million tokens on just 4 A100-40GB GPUs. By breaking the
sequence into manageable chunks and processing them in a pipelined manner,
Ulysses-Offload significantly reduces the memory footprint while maintaining high
computational efficiency. This method ensures that the GPUs are utilized effectively,
even when dealing with extremely long sequences.

### 2. Memory Optimization
One of the critical aspects of our approach is the comprehensive analysis and
optimization of the memory footprint during LLM training. We target the reduction of
redundant intermediate buffers in both the forward and backward passes of the training
process. By optimizing the use of GPU and host CPU memory, we can train larger models
with longer sequences without running into GPU memory limitations. This optimization is
crucial for enabling the training of ultra-long context models on a limited number of
GPUs. It is worth noting that Ulysses-Offload memory optimization is orthogonal and
complementary to the model-parameter-focused memory optimization techniques used by
DeepSpeed ZeRO and PyTorch FSDP: Ulysses-Offload optimizes the memory footprint of
activations associated with long sequences, while ZeRO and FSDP optimize the memory
footprint of model parameters.

### 3. Compatibility and Flexibility
Ulysses-Offload is designed to be agnostic to existing training techniques and works
efficiently across different LLM models, including popular architectures such as GPT
and Llama. This flexibility ensures that our approach can be easily integrated into
various training workflows. Additionally, Ulysses-Offload is compatible with advanced
memory optimization techniques such as DeepSpeed ZeRO and PyTorch FSDP, further
enhancing its usability and performance.

## Core Design of Ulysses-Offload
Figure 2 illustrates the core structure of Ulysses-Offload. Ulysses-Offload leverages
the multiple memory hierarchies in modern GPU clusters, boosting hardware efficiency
and cost-effectiveness while achieving very high Model FLOPs Utilization (MFU). The
design of Ulysses-Offload centers around pipelining, scheduling, and memory management.
These well-known optimization techniques are essential for scaling LLM context length
to the million-token scale with a few GPUs, and they are discussed in the subsequent
subsections.

<img src="./media/image2.png" style="width:6.5in;height:2.68634in"
alt="A screenshot of a computer Description automatically generated" />

Figure 2: Core design

### Pipelining and Scheduling
Ulysses-Offload employs sequence chunking and a pipelined computation design to manage
the memory and computational load efficiently. In a traditional Transformer model, the
input (hidden state) tensor is projected into q, k, v tensors. Each of these tensors
can be denoted *\[B, S, H, D\]*, where *B* is the batch size, *S* is the sequence
length, *H* is the number of heads, and *D* is the hidden dimension per head. With
sequence parallelism such as DeepSpeed Ulysses, the input tensor is partitioned along
the sequence dimension across a sequence-parallel group of size *P*, that is
*\[B, S/P, H, D\]*, prior to the alltoall collective communication. The alltoall
collective gathers the partitioned tensors along the sequence dimension and scatters
them along the head dimension, essentially transforming the tensor from
*\[B, S/P, H, D\]* to *\[B, S, H/P, D\]*. After the attention computation, a second
alltoall transforms *\[B, S, H/P, D\]* back to *\[B, S/P, H, D\]*.

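To make the tensor reshuffle concrete, below is a minimal PyTorch sketch of the
sequence-to-head alltoall described above. It is our own illustration (the function
name and head-grouping convention are assumptions, not the DeepSpeed implementation),
built on `torch.distributed.all_to_all_single` applied to a *\[B, S/P, H, D\]* shard.

```python
import torch
import torch.distributed as dist

def seq_to_head_all_to_all(x: torch.Tensor, group=None) -> torch.Tensor:
    """Redistribute a local [B, S/P, H, D] shard into [B, S, H/P, D].

    Illustrative sketch of the Ulysses-style alltoall: each rank sends a distinct
    slice of the head dimension to every peer and receives the peers' sequence
    slices back, so attention sees the full sequence for a subset of heads.
    """
    world = dist.get_world_size(group)
    b, s_shard, h, d = x.shape
    assert h % world == 0, "head count must be divisible by the sequence-parallel size"

    # Group heads by destination rank: [P, B, S/P, H/P, D]
    x = x.reshape(b, s_shard, world, h // world, d).permute(2, 0, 1, 3, 4).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)

    # The received chunks are the peers' sequence shards for our head slice;
    # stitch them together along the sequence dimension -> [B, S, H/P, D].
    return out.permute(1, 0, 2, 3, 4).reshape(b, world * s_shard, h // world, d)
```

The inverse redistribution after attention (back to *\[B, S/P, H, D\]*) follows the
same pattern with the roles of the sequence and head dimensions swapped.
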
In our Ulysses-Offload design, input sequences are partitioned at a much finer
granularity than in DeepSpeed Ulysses: we further subdivide each GPU's local *S/P*
sequence into *u* smaller chunks, so the input tensors are now represented as
*\[B, S/uP, H, D\]*. We denote these chunks as *T<sub>i</sub>*, where
$i \in \{0, 1, \ldots, u-1\}$. As shown in Figure 2, *T<sub>i</sub>* is projected to
query *q<sub>i</sub>*, key *k<sub>i</sub>*, and value *v<sub>i</sub>*. Then, similar to
DeepSpeed Ulysses, an alltoall collective communication gathers the partitioned tensors
along the sequence dimension and scatters them along the head dimension. In our chunked
design, the sequence length handled per chunk is reduced by a factor of *u* compared to
Ulysses. Please note that our Ulysses-Offload chunking procedure is generally
applicable to other sequence parallelism techniques.

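As a small illustration of this finer-grained partitioning (the helper name is
hypothetical, not part of the DeepSpeed API), each rank's local shard can simply be
split along the sequence dimension into *u* chunks that are then projected and
exchanged one at a time:

```python
import torch

def split_into_chunks(local_shard: torch.Tensor, u: int) -> list[torch.Tensor]:
    """Split a local [B, S/P, ...] shard into u sequence chunks T_0 ... T_{u-1},
    each of shape [B, S/(u*P), ...], along the sequence dimension."""
    assert local_shard.size(1) % u == 0, "local sequence length must be divisible by u"
    return list(torch.chunk(local_shard, u, dim=1))
```
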
<img src="./media/image3.png" style="width:6.5in;height:5.36042in"
|
||||||
|
alt="A screenshot of a computer Description automatically generated" />
|
||||||
|
|
||||||
|
Figure 3: Core design with offload description
|
||||||
|
|
||||||
|
Figure 3 gives an example of how the computation of chunk *T<sub>m</sub>* is performed.
After the alltoall collective communication, *GPU<sub>j</sub>* receives
$\widehat{q}_m$, $\widehat{k}_m$, and $\widehat{v}_m$. We then fetch the previous
sequence chunks, one chunk at a time, from host memory to *GPU<sub>j</sub>*, perform
online attention with the current $\widehat{q}_m$, and update the output chunk
accordingly. Note that, strictly speaking, at any given time only one set of chunks
$\widehat{k}_i$ and $\widehat{v}_i$ is placed in the GPU's HBM, reducing the memory
footprint to $\frac{1}{u}$ of the non-offloading version without double buffering. With
double buffering, the memory footprint is reduced to $\frac{2}{u}$.

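The per-chunk computation can be sketched as a streaming (online) attention loop: KV
chunks are pulled from host memory one at a time and folded into a numerically stable
running softmax. The snippet below is a simplified illustration under our own
assumptions (hypothetical helper name, no causal mask, no double buffering), not the
DeepSpeed kernel:

```python
import torch

def chunked_online_attention(q, kv_chunks_cpu, scale):
    """Online attention for one query chunk `q` of shape [B, H, Sq, D].

    `kv_chunks_cpu` is a list of (k, v) pairs of shape [B, H, Skc, D] kept in
    (pinned) host memory; only the chunk currently being processed occupies HBM.
    """
    out = torch.zeros_like(q)
    running_max = torch.full(q.shape[:-1], float("-inf"), device=q.device)
    running_sum = torch.zeros(q.shape[:-1], device=q.device)

    for k_cpu, v_cpu in kv_chunks_cpu:
        k = k_cpu.to(q.device, non_blocking=True)
        v = v_cpu.to(q.device, non_blocking=True)
        scores = torch.einsum("bhsd,bhtd->bhst", q, k) * scale

        # Numerically stable streaming softmax: rescale what has been accumulated
        # so far to the new running maximum, then fold in the current chunk.
        new_max = torch.maximum(running_max, scores.amax(dim=-1))
        correction = torch.exp(running_max - new_max)
        p = torch.exp(scores - new_max.unsqueeze(-1))
        out = out * correction.unsqueeze(-1) + torch.einsum("bhst,bhtd->bhsd", p, v)
        running_sum = running_sum * correction + p.sum(dim=-1)
        running_max = new_max

    return out / running_sum.unsqueeze(-1)
```
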
### Memory Management
Ulysses-Offload optimizes memory usage by carefully managing the allocation and
deallocation of buffers during training. This involves:

1. Double Buffering:

   - Two sets of buffers are maintained to overlap computation with data transfer.

   - While one set of buffers is used for computation, the other set is preloaded with
     the next chunk of data (see the sketch after this list).

2. Hierarchical Memory Utilization:

   - GPU High Bandwidth Memory (HBM) is used for active computation.

   - Host memory is used to store intermediate results that are not immediately needed,
     reducing the pressure on GPU memory.

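Below is a minimal sketch of the double-buffering idea, assuming pinned host-memory
chunks and a dedicated CUDA copy stream; the helper is our illustration, not the
DeepSpeed implementation.

```python
import torch

def prefetch_pipeline(chunks_cpu, compute_fn, device="cuda"):
    """Consume host-resident chunks one by one while prefetching the next chunk.

    While chunk i is processed by `compute_fn` on the default stream, chunk i+1 is
    copied host->device on a side stream so the transfer overlaps the computation.
    `chunks_cpu` should be pinned-memory tensors for the copies to be asynchronous.
    """
    copy_stream = torch.cuda.Stream(device=device)
    results = []

    # Preload the first chunk on the copy stream.
    with torch.cuda.stream(copy_stream):
        current = chunks_cpu[0].to(device, non_blocking=True)

    for i in range(len(chunks_cpu)):
        # Ensure the pending transfer has finished before computing on `current`.
        torch.cuda.current_stream(device).wait_stream(copy_stream)
        nxt = None
        if i + 1 < len(chunks_cpu):
            with torch.cuda.stream(copy_stream):
                nxt = chunks_cpu[i + 1].to(device, non_blocking=True)
        # Tell the caching allocator that `current` is consumed on the default stream.
        current.record_stream(torch.cuda.current_stream(device))
        results.append(compute_fn(current))
        current = nxt
    return results
```
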
## Integration with Existing Frameworks
Ulysses-Offload is designed to integrate seamlessly with popular deep learning
frameworks such as PyTorch. Ulysses-Offload provides user-friendly APIs that abstract
the complexities of pipelined training and memory management, so users can adopt
Ulysses-Offload with minimal changes to their existing codebases.

## Experimental Results
<img src="./media/image4.png" style="width:6.5in;height:3.37431in"
alt="A collage of graphs Description automatically generated" />

Figure 4: Supported sequence lengths and corresponding Model FLOPs Utilization (MFU)
using Megatron-SP, Ulysses, and our proposed Ulysses-Offload (FPDT). OOM marks the
point beyond which increasing the sequence length causes out-of-memory failures. We
show Ulysses-Offload's performance only for sequence lengths above 128K, as shorter
sequences can be handled properly by existing strategies.

### Extended Sequence Lengths
In our experimental setup, we compare Ulysses-Offload with two existing methods:
Microsoft DeepSpeed Ulysses and NVIDIA Megatron-SP. Both DeepSpeed Ulysses and
Megatron-SP employ similar approaches to sequence parallelism but differ in the
collective communication used for gathering sequences before the attention block: the
former utilizes alltoall communication, whereas the latter employs allgather.
Ulysses-Offload builds upon the DeepSpeed Ulysses approach. The primary advantage of
Ulysses-Offload is its capability to support the training of large language models
(LLMs) with ultra-long sequence lengths using fewer GPUs. As shown in Figure 4, our
method enables the training of 8B-parameter models with sequence lengths of 2 million
tokens using only 4 GPUs. For even larger models, such as GPT-30B and Llama-70B,
Ulysses-Offload supports sequence lengths of up to 3 million and 4 million tokens using
16 GPUs and 32 GPUs, respectively. This represents a 16x increase in sequence length
compared to current state-of-the-art solutions (see Figure 5), making Ulysses-Offload a
game-changer for tasks that require processing long sequences.

### High Hardware Efficiency
As shown in Figure 4, across model sizes ranging from GPT-2.7B to Llama-70B,
Ulysses-Offload achieves over 55% Model FLOPs Utilization (MFU), ensuring that the
hardware resources are utilized effectively. This high level of efficiency is
maintained even when dealing with extremely long sequences (up to 4 million tokens of
context), making Ulysses-Offload an ideal solution for training large-scale LLMs. By
maximizing the use of available hardware, Ulysses-Offload reduces the overall cost and
complexity of training long-context models. Our
[technical report](https://arxiv.org/abs/2408.16978) offers further insights into
choosing sequence chunk sizes to balance the trade-off between memory usage and MFU.

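For reference, MFU here follows the usual definition: achieved model FLOPs per second
divided by the hardware's peak FLOPs. A back-of-the-envelope estimate, using the
standard per-token FLOP approximation (a rule-of-thumb assumption, not a formula taken
from this blog), looks like:

```python
def model_flops_utilization(tokens_per_sec: float, n_params: float, n_layers: int,
                            d_model: int, seq_len: int, peak_flops_per_sec: float) -> float:
    """Estimate MFU with the common ~(6*N + 12*L*d_model*T) training FLOPs per token,
    where N is the parameter count and T the sequence length."""
    flops_per_token = 6 * n_params + 12 * n_layers * d_model * seq_len
    return tokens_per_sec * flops_per_token / peak_flops_per_sec
```
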
<img src="./media/image5.png" style="width:6.5in;height:2.01667in" />
|
||||||
|
|
||||||
|
Figure 5: A comprehensive analysis on long-context LLM training with
|
||||||
|
different training techniques: tensor parallelism (TP), activation
|
||||||
|
checkpoint (AC), activation checkpoint with CPU offloading (OC), Ulysses
|
||||||
|
(UL), and our approach Ulysses-Offload (FPDT).
|
||||||
|
|
||||||
|
## Implementation and Usability
Ulysses-Offload is designed to be easily integrated with popular deep learning
frameworks such as DeepSpeed, Megatron-DeepSpeed, and PyTorch. Users can adopt our
approach with minimal changes to their existing training pipeline, making it accessible
to a broad audience. The integration process involves setting up the sequence chunk
pipeline and configuring the memory optimization techniques, both of which are
straightforward and well-documented (see the
[tutorial](https://www.deepspeed.ai/tutorials/fpdt/)).

Our pipeline design and memory optimization techniques are straightforward to
implement, making Ulysses-Offload accessible to researchers and practitioners aiming to
train long-context LLMs efficiently. We provide a detailed
[technical report](https://arxiv.org/abs/2408.16978), documentation, and examples to
guide users through the setup process, ensuring a smooth transition to using
Ulysses-Offload. Additionally, in the tradition of DeepSpeed, Ulysses-Offload provides
a user-friendly API that abstracts the complexities of mixed precision training and
memory optimization, allowing users to focus on their research and development tasks.

## General Availability of DeepSpeed Ulysses-Offload
We are excited to release Ulysses-Offload. Ulysses-Offload has been fully integrated
with Megatron-DeepSpeed and is accessible through both the DeepSpeed and
Megatron-DeepSpeed GitHub repos. See the detailed
[tutorial](https://www.deepspeed.ai/tutorials/fpdt/) for usage.

We invite the community to explore our implementation, contribute to further
advancements, and join us in pushing the boundaries of what is possible in LLM and AI.
This release is part of the bigger DeepSpeed ecosystem of large-scale AI training,
finetuning, and inference. For more details on all DeepSpeed technologies and
innovations, please visit our [website](https://www.deepspeed.ai/) and follow us on X,
formerly Twitter, ([English](https://twitter.com/MSFTDeepSpeed),
[Japanese](https://twitter.com/MSFTDeepSpeedJP)) and
[Chinese Zhihu](https://www.zhihu.com/people/deepspeed).