This commit is contained in:
Sam Ade Jacobs 2024-07-01 15:58:21 -07:00 committed by GitHub
Parent dd7a5be53d
Commit 3d347276ce
GPG key ID: B5690EEEBB952194
1 changed file with 4 additions and 6 deletions


---
title: "Universal Checkpointing with DeepSpeed: A Practical Guide"
tags: checkpointing, training, deepspeed
---
# Universal Checkpointing with DeepSpeed: A Practical Guide
DeepSpeed's Universal Checkpointing feature is a powerful tool for saving and loading model checkpoints efficiently and flexibly, enabling seamless continuation and fine-tuning of training across different model architectures, parallelism techniques, and training configurations. Tailored for both beginners and experienced users, this tutorial provides a step-by-step guide to creating ZeRO checkpoints, converting them into the universal format, and resuming training from these universal checkpoints in your DeepSpeed-powered applications. This approach is crucial for leveraging pre-trained models and facilitating seamless model training across different setups.
Follow the three simple steps below:
### Step 1: Create ZeRO Checkpoint
The first step in leveraging DeepSpeed Universal Checkpointing is to create a ZeRO checkpoint. [ZeRO](/tutorials/zero/) (Zero Redundancy Optimizer) is a memory optimization technology in DeepSpeed that allows for efficient training of large models. To create a ZeRO checkpoint, you'll need to:
- Initialize your model with DeepSpeed using the ZeRO optimizer.
- Train your model to the desired state (iterations).
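The two bullets above can be sketched with a minimal ZeRO configuration (the values are illustrative; any ZeRO stage produces a checkpoint that Step 2 can convert):

```json
{
  "train_batch_size": 8,
  "zero_optimization": {
    "stage": 2
  },
  "optimizer": {
    "type": "Adam",
    "params": { "lr": 1e-4 }
  }
}
```

Passing this config to `deepspeed.initialize(...)` and, after training, calling `model_engine.save_checkpoint(save_dir, tag)` writes a standard ZeRO checkpoint to `save_dir`.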
With the Universal checkpoint ready, you can now resume training on a potentially different parallelism configuration or hardware setup.
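Assuming the ZeRO checkpoint has already been converted to the universal format, resuming is typically enabled through the checkpoint section of the DeepSpeed config (a sketch; verify the exact key name against your DeepSpeed version):

```json
{
  "checkpoint": {
    "load_universal": true
  }
}
```

With this setting in place, `model_engine.load_checkpoint(load_dir, tag)` can load the universal checkpoint even when the number of GPUs or the parallelism layout differs from the run that saved it.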
## Conclusion
DeepSpeed Universal Checkpointing simplifies the management of model states, making it easier to save, load, and transfer model states across different training sessions and parallelism techniques. By following the steps outlined in this tutorial, you can integrate Universal Checkpointing into your DeepSpeed applications, enhancing your model training and development workflow.
For more detailed examples and advanced configurations, please refer to the [Megatron-DeepSpeed examples](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/universal_checkpointing).
For an in-depth technical treatment of DeepSpeed Universal Checkpointing, please see the [arXiv manuscript](https://arxiv.org/abs/2406.18820) and the [blog post](https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-ucp/).
Happy training!