Commit graph

2602 commits

Author SHA1 Message Date
xyxie 1b58ba5ec0
Merge LoCo with Zero++ (#6730)
### Integration of LoCo Method into ZeRO++

#### Overview
This PR introduces the integration of the **LoCo** method, as outlined
in [this paper](https://arxiv.org/abs/2407.04480), into the ZeRO++
framework of DeepSpeed. The key enhancement involves applying error
feedback compensation to 4-bit gradients before communication. This
approach ***improves pre-training loss outcomes without additional time
overhead***, though it requires extra GPU memory. The extent of this
memory increase depends on model size and training configuration.
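For intuition, here is a minimal sketch of the error-feedback idea, assuming an illustrative 4-bit quantizer and residual buffer (these names are placeholders, not the actual ZeRO++ kernels): each rank adds the residual from the previous step to the fresh gradient before quantizing, then stores the new quantization error for the next step.

```python
import torch

def compensated_compress(grad, error_buf, quantize_4bit, dequantize_4bit):
    # Add the residual left over from the previous step's quantization.
    compensated = grad + error_buf
    # Lossy 4-bit compression of the compensated gradient (to be communicated).
    packed = quantize_4bit(compensated)
    # Store the new quantization error as feedback for the next step.
    error_buf.copy_(compensated - dequantize_4bit(packed))
    return packed
```

The extra GPU memory cost mentioned above comes from keeping `error_buf` alive across steps.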

#### Experimental Results
We conducted pre-training experiments using the Llama2 architecture,
adjusting the number of layers and hidden size. The experiments
included:
- **A smaller-scale model with 0.8B parameters trained on 30B tokens**.
- **A larger-scale model with 8B parameters trained on 5B tokens**.

The training data was sampled from **Redpajama-V2**.
<p align="center">
<img
src="https://github.com/user-attachments/assets/e7db9487-728c-4a17-9806-c15afa12f62e"
width="49%" />
<img
src="https://github.com/user-attachments/assets/3efec895-b71d-43ab-b5ce-65468ba8b9f1"
width="49%" />
</p>

**Findings**:
- **Smaller Models (0.8B parameters)**: Significant gains were observed
when applying the LoCo method.
- **Larger Models (8B parameters)**: The gains were present but less
pronounced. This could be due to:
  1. Relatively smaller data volume.
  2. Lower pre-training loss for larger models, making significant
improvements harder to achieve.

However, even a smaller pre-training loss gap in larger models can
translate to meaningful gains in downstream tasks.

#### Example Script
For reference, the
[run.sh](https://github.com/user-attachments/files/17679552/zeroplus-7b3.zip)
script used for the 8B parameter, 5B tokens experiment is attached. The
experiment was conducted using the **DeepSpeed-Megatron** platform.



#### Acknowledgments
Special thanks to @GuanhuaWang for ongoing communication and guidance
throughout this work.

---

We appreciate your consideration of this PR and welcome any feedback or
questions!

---------

Co-authored-by: ChuanxinTang <tangchuanxin.chn@gmail.com>
Co-authored-by: root <pan.jiachun@outlook.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
2024-12-10 10:31:11 -08:00
Logan Adams 06f1d3609e
Unpin pytest-subtests now that 0.14.1 is released (#6844)
The issue we encountered is covered here:
https://github.com/pytest-dev/pytest-subtests/issues/173

It is resolved by the latest changes from this PR:
https://github.com/pytest-dev/pytest-subtests/issues/174, which are
published in the latest version, 0.14.1.
2024-12-09 22:14:59 -08:00
Raza Sikander 0c92c39dd0
Inference UTs check for triton support from accelerator (#6782)
Instead of checking whether Triton is installed, check whether the
accelerator supports it, and skip if not supported.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-12-09 16:15:42 -08:00
Logan Adams 08b907a226
Pin pytest-subtests version for accelerate tests (#6842) 2024-12-09 12:24:33 -08:00
Hoa La 9a41ccaf44
Flops profiler support einops.einsum (#6755)
- Added support in FlopsProfiler for the einops.einsum operation
- Added _patch_miscellaneous_operations() and
_reload_miscellaneous_operations() to include this operation and
potentially other miscellaneous operations in the future
- Added _einops_einsum_flops_compute(), which mimics the existing
_einsum_flops_compute()
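A hedged sketch of the patch/reload mechanism described above, with simplified names (the real hook derives the FLOP count from the einsum equation and operand shapes, mirroring the existing torch.einsum handler):

```python
import einops

_original_einsum = einops.einsum

def _profiled_einsum(*args, **kwargs):
    # The real profiler computes FLOPs from the pattern and tensor shapes
    # and adds them to its running totals; only the wrapping is shown here.
    return _original_einsum(*args, **kwargs)

def patch_miscellaneous_operations():
    einops.einsum = _profiled_einsum

def reload_miscellaneous_operations():
    einops.einsum = _original_einsum
```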

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-12-09 09:56:54 -08:00
Logan Adams 9ca6016017
Pin HPU tests (#6831)
HPU tests are impacted by the same issue as other tests that use the
latest transformers. This PR pins transformers to a version from before
the fix.
2024-12-06 14:29:00 -08:00
Logan Adams a4499668fe
Update version.txt after 0.16.1 release (#6826)
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.1
Author           - @loadams

Co-authored-by: loadams <loadams@users.noreply.github.com>
2024-12-05 14:21:53 -08:00
Logan Adams 177832ed45
Update pre-commit version (#6821) 2024-12-05 13:51:05 -08:00
Logan Adams 95ead2a055
Pin transformers version in cpu-torch-latest due to multiprocessing error. (#6823)
This is a copy of https://github.com/microsoft/DeepSpeed/pull/6820 for
the cpu-torch-latest tests.

This PR will revert/fix the following:
https://github.com/microsoft/DeepSpeed/pull/6822
2024-12-05 12:16:46 -08:00
Sam Ade Jacobs 2ea181f0c3
Update README.md (#6825)
Add Ulysses-offload to News page

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-12-05 11:32:36 -08:00
Sam Ade Jacobs 0e92f9b41f
Update README.md (#6824)
Fix broken tutorial link
2024-12-05 11:31:52 -08:00
Sam Ade Jacobs 7b9fc8c74d
add FPDT tutorial (#6813)
Tutorial page for Ulysses-Offload (FPDT), blog page to follow.

---------

Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw02.ten.osc.edu>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-12-05 16:44:00 +00:00
Sam Ade Jacobs 0b0fef3d41
Ulysses offload blog (#6814)
Ulysses-Offload (FPDT) blog, please see corresponding tutorial page at
[link](https://github.com/microsoft/DeepSpeed/pull/6813).

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-12-05 16:39:44 +00:00
Logan Adams b966e1f97f
Pin transformers to avoid errors with latest version (#6820) 2024-12-05 08:38:01 -08:00
Jinghan Yao 60a1b57b98
Adding the new feature of FPDT (#6462)
[FPDT](https://arxiv.org/abs/2408.16978) can only be used with [this
version](https://github.com/microsoft/Megatron-DeepSpeed/pull/441) of
Megatron-DeepSpeed.

---------

Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw02.ten.osc.edu>
Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw01.ten.osc.edu>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
2024-12-04 15:29:45 -08:00
Logan Adams ed7d183bed
Update python version but now we need to include setuptools on our own (#6787)
TODO:
- [x] determine if this means we should technically add setuptools to
the requirements.txt
2024-12-04 12:39:45 -08:00
Xu Song fc230070ef
Fix zero checkpoint (#6792)
Fix #6791

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-12-04 10:16:56 -08:00
Guanhua Wang 0c6c981109
Domino news update on readme.md (#6815) 2024-12-03 08:12:21 -08:00
Logan Adams f743feca03
Update version.txt after 0.16.0 release (#6786)
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.0
Author           - @loadams

Co-authored-by: loadams <loadams@users.noreply.github.com>
2024-11-25 12:12:44 -08:00
Logan Adams e5570b10ee
Revert release workflow (#6785) 2024-11-25 12:10:05 -08:00
Logan Adams 03845dbc85
Update version.txt before release (#6784) 2024-11-25 12:06:06 -08:00
Guanhua Wang ec6cc49034
Domino Blog (#6776)
This PR adds the Domino blog to our public site.

cc @tjruwase

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-11-25 11:59:04 -08:00
Logan Adams fabcf407f9
Cleanup code docs warnings (#6783)
We have a number of warnings in our readthedocs sphinx/autodoc .rst
files; this cleans some of those up so that we can fix real issues there.
2024-11-25 11:30:47 -08:00
Wentao Ye d6410f9051
Fix Doc Error: ZeRO Stage 2 gradient partitioning (#6775)
Fix the issue described in
https://github.com/microsoft/DeepSpeed/issues/6707
2024-11-25 10:19:27 -08:00
谭九鼎 5e16f255a6
docs: fix HF links (#6780)
The current link
https://huggingface.co/docs/transformers/main_classes/deepspeed is very
unhelpful.

It turns out that in the past it contained some guides:
https://huggingface.co/docs/transformers/v4.27.1/main_classes/deepspeed#shared-configuration

Later the content was refreshed and moved to
https://huggingface.co/docs/transformers/deepspeed
2024-11-25 10:10:08 -08:00
Logan Adams f57b1ef18a
Unpin with latest transformers fixes (#6763)
Reverts #6759

Requires these changes from transformers:
https://github.com/huggingface/transformers/pull/34816
https://github.com/huggingface/transformers/pull/34800

Todo:
- [x] Need to merge first PR to get support for torch 2.4
2024-11-22 10:31:59 -08:00
ChenWenbin cd20a3bbc7
Fix potential memory issues when use deepspeed Z3 (#6726)
I had an OOM problem when doing DPO training using ZeRO-3. DPO needs to
call the module twice in one training step, and the second call runs
under no_grad(). The problem is caused by two bugs:
1. "__n_available_params", which helps control the number of fetched
parameters, becomes negative after the release_and_reset_all() function.
2. module.ds_grads_remaining becomes negative in backward() if we call
the module more than once in one training step.

I created two patches to fix these issues.
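For context, a minimal sketch of the calling pattern that triggers both bugs (the loss helper is hypothetical; the point is two module calls per step, the second under no_grad()):

```python
import torch

def training_step(engine, batch):
    policy_out = engine(batch)             # first call: gradients tracked
    with torch.no_grad():
        reference_out = engine(batch)      # second call: no_grad reference pass
    loss = dpo_loss(policy_out, reference_out)  # hypothetical loss helper
    engine.backward(loss)                  # ds_grads_remaining went negative here
    engine.step()
```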

---------

Signed-off-by: Wenbin Chen <wenbin.chen@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
2024-11-21 18:32:03 +00:00
Hyeonseung Lee f515104e95
Removes unnecessary cloning (#6761)
In the `clone_tensors_for_torch_save()` function:

When `item.device` differs from the `device` input, `tensor.clone()` is
not actually required, because the `to()` function also clones the
original tensor.


Additionally, I observed memory bloat under the following conditions:
* Training a Whisper model with the `transformers` framework under `ZeRO-0`
and `ZeRO-1` configurations.
* The memory bloat can be observed every time the model state_dict is
cloned using `clone_tensors_for_torch_save()`.

After removing the unnecessary `clone()`, the problem appears to be
solved.
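A minimal sketch of the resulting logic, assuming a simplified tensor-only path (the actual function also walks dicts and lists of tensors):

```python
import torch

def clone_for_save(item, device=torch.device("cpu")):
    if item.device != device:
        # .to() already materializes a copy on the target device,
        # so an extra .clone() would just double the memory.
        return item.detach().to(device)
    # Same device: .to() would be a no-op, so clone explicitly.
    return item.detach().clone()
```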

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-11-21 17:37:29 +00:00
Max Kovalenko b5709cce66
Enable torch compile on _allgather_params (#6769)
* Previously, ZeRO-3 was crashing when trying to compile _allgather_params
* Disabling grad solves the issue
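A hedged sketch of the idea (placeholder body; the real helper gathers partitioned ZeRO-3 parameters): wrapping the function in no_grad keeps torch.compile from tracing the autograd state it previously crashed on.

```python
import torch

@torch.no_grad()
def _allgather_params_sketch(partitioned):
    # Placeholder for the gather logic; runs without autograd tracking.
    return [p.clone() for p in partitioned]

compiled_allgather = torch.compile(_allgather_params_sketch)
```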
2024-11-21 16:01:13 +00:00
Quentin Gallouédec 83e4364fbd
Use `json_schema_extra` instead of extra keyword in `Field` (#6764)
> Using extra keyword arguments on `Field` is deprecated and will be
removed. Use `json_schema_extra` instead. (Extra keys: 'new_param').
Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2
Migration Guide at https://errors.pydantic.dev/2.9/migration/
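The migration, in a minimal sketch (the field name and metadata key are illustrative, echoing the `new_param` key from the warning):

```python
from pydantic import BaseModel, Field

class ExampleConfig(BaseModel):
    # Deprecated in Pydantic v2:  Field(0, new_param="meta")
    # Replacement: carry the same metadata via json_schema_extra.
    value: int = Field(0, json_schema_extra={"new_param": "meta"})
```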

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-11-20 01:04:47 +00:00
Nadav Elyahu 065398d5de
Fix setup.py bash cmd generation to correctly extract git info (#6762)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-11-19 13:54:53 -08:00
Logan Adams 2e0c39b55c
Add explicit parameters for torch.load (#6751)
Successor PR to #6094:

> FutureWarning: You are using torch.load with weights_only=False (the
current default value), which uses the default pickle module implicitly.
It is possible to construct malicious pickle data which will execute
arbitrary code during unpickling (See
https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models
for more details). In a future release, the default value for
weights_only will be flipped to True. This limits the functions that
could be executed during unpickling. Arbitrary objects will no longer be
allowed to be loaded via this mode unless they are explicitly
allowlisted by the user via torch.serialization.add_safe_globals. We
recommend you start setting weights_only=True for any use case where you
don't have full control of the loaded file. Please open an issue on
GitHub for any issues related to this experimental feature.

Todo:
- [ ] Update values in non-test files to True where necessary.
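For illustration, what the explicit parameter looks like (paths are placeholders):

```python
import torch

# Safer default for files you don't fully control: only tensors and
# allowlisted types can be unpickled.
state = torch.load("checkpoint.pt", map_location="cpu", weights_only=True)

# Where arbitrary pickled objects are still required (e.g. some tests),
# the old behavior is now spelled out explicitly instead of implied:
legacy_state = torch.load("checkpoint.pt", map_location="cpu", weights_only=False)
```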
2024-11-19 11:09:52 -08:00
baodi 1fdad1fa52
make xpu ops compatible with oneapi 2025.0 (#6760)
Compatibility update for xpu ops

This PR introduces changes that will make xpu ops compatible with the
OneAPI 2025.0 toolkit. This is an important update that will allow us to
develop and ship our most demanding models on this innovative hardware.

---------

Signed-off-by: baodii <di.bao@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-11-19 17:38:27 +00:00
Logan Adams 8488beea29
Pin transformers version to work around latest torch requirements (#6759)
The latest transformers release seems to break our tests that aren't on
the latest torch (>= 2.5). Issue opened here:
https://github.com/huggingface/transformers/issues/34795. This pins our
version so these tests can pass in the meantime.
2024-11-19 01:36:51 +00:00
Xu Song dd40269426
A faster and more memory-efficient implementation of `zero_to_fp32` (#6658)
It is a faster and more memory-efficient implementation of
`zero_to_fp32`.


The previous version doubled the memory usage, which caused CPU OOM for
very large models (e.g. Llama 405B).

b647fb2470/deepspeed/utils/zero_to_fp32.py (L438-L441)


## How does it work?

1. **Lazy loading**: Load the checkpoint with `mmap=True`, so the weights
are memory-mapped rather than all storages being loaded into memory.
2. **Lazy merge**: `GatheredTensor` contains the mmapped weights and a
tensor offset. It is a memory-efficient pseudo tensor. Only when
`tensor.contiguous()` is called does it load the related weights into
memory and merge them into a single tensor.
3. **Timely memory release**: Save checkpoints shard by shard, and
release the memory once a shard is saved.


Throughout the process, only one shard of tensors is kept in memory.
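A minimal sketch of the lazy-loading idea, assuming a single consolidated checkpoint file (the actual `GatheredTensor` logic merges per-rank ZeRO partitions and is more involved):

```python
import torch

# mmap=True keeps weights on disk; pages are read only when accessed.
state = torch.load("pytorch_model.bin", map_location="cpu",
                   mmap=True, weights_only=True)

# Illustrative single shard; the real code splits keys by accumulated size.
shards = [list(state.keys())]
for shard_id, names in enumerate(shards):
    # Materialization happens here, one shard at a time.
    shard = {name: state[name].contiguous() for name in names}
    torch.save(shard, f"shard-{shard_id:05d}.pt")
    del shard  # release this shard's memory before building the next
```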

## How much benefit in speed and memory?

Experiments were conducted on a Linux host with 1TB of memory. Here is a
detailed comparison:

| | world size | peak memory (GB) | elapsed time (h:mm:ss) |
|----------------------|------------|------------------|------------------------|
| llama3-8B (old -> new) | 8 | 90 -> 41 | 0:02:17 -> 0:01:10 |
| llama2-13B (old -> new) | 8 | 146 -> 54 | 0:02:30 -> 0:01:47 |
| llama2-70B (old -> new) | 16 | 789 -> 159 | 0:20:47 -> 0:20:45 |
| qwen1.5-110B (old -> new) | 32 | OOM -> 217 | ? -> 0:34:21 |
| llama3-405B (old -> new) | 192 | OOM -> 262 | ? -> 2:09:59 |


You can reproduce with the following script:
```sh
# 1. install requirements
apt-get install time
# 2. prepare zero-3 checkpoints
# 3. convert zero to fp32 checkpoints
/usr/bin/time -v python zero_to_fp32.py . output_dir/ --safe_serialization
```

- **memory**: Theoretically, this PR reduces the memory cost from `2M`
to `(1/n)M`, where `M` is the memory cost of the full weights and `n` is
the number of shards.
- **speed**: The speed gain mainly comes from avoiding extra tensor
copies. The benefit may be slight.




## Impl history

- [v1](19712a1c75 (diff-6a2ca3427fa608c387b7351359f98cfc1313be6e960cee86344ff246bf1b8326R441-R447)):
a hf_hub-compatible approach.
It was discarded due to the controversial implementation of
`data_ptr()`.
- [v2](https://github.com/microsoft/DeepSpeed/pull/6658/files): a simple
approach with `torch.empty`

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-11-18 20:14:35 +00:00
Logan Adams f594dbe3df
Disable failing python tests (#6758) 2024-11-18 10:16:21 -08:00
Raza Sikander e3b5a4b6e0
Gaudi2 Nightly job for daily check (#6753)
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-11-15 15:11:59 -08:00
Olatunji Ruwase fc4e73370d
Add no_sync context manager (#6675)
Fix #1902
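A hedged usage sketch, following torch DDP's no_sync() semantics (the engine and data loader setup are omitted, and the exact API surface is defined by this PR):

```python
accum_steps = 4  # illustrative accumulation window
for step, batch in enumerate(data_loader):
    loss = engine(batch)  # assumes the engine returns a scalar loss
    if (step + 1) % accum_steps != 0:
        with engine.no_sync():      # skip gradient all-reduce this micro-step
            engine.backward(loss)
    else:
        engine.backward(loss)       # boundary step: reduce gradients
        engine.step()
```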

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-11-14 18:52:51 +00:00
Minjia Zhang d702eb5f79
Adding the governance doc (#6748)
Drafted governance doc for the LFAI.

Co-authored-by: Minjia Zhang <minjiaz@illinois.edu>
2024-11-14 12:01:53 -08:00
Logan Adams 9a2c209cee
Sanitize inputs to eval() (#6745) 2024-11-13 09:04:56 -08:00
Logan Adams 877aa0dba6
Update path for BingBertSquad from DeepSpeedExamples (#6746)
In https://github.com/microsoft/DeepSpeedExamples/pull/245, the
DeepSpeedExamples directory structure was refactored, this updates the
DeepSpeed examples from those changes.
2024-11-12 18:50:02 +00:00
Joe Mayer b692cdea47
AIO File Offsets (#6641)
Adding the option for a file offset to the read/write functions of AIO &
GDS ops.

---------

Co-authored-by: jomayeri <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-11-12 16:34:17 +00:00
inkcherry 7af3a4beb5
Add zero3 `module_granularity_threshold` to zero optimization (#6649)
This PR adds ZeRO-3 coalesced fetch to zero optimization. Some of the
existing logic can be reused, but it is difficult to surface it as an
optimization choice (I only discovered this logic when trying to
implement the feature).

The benefit of this approach is reduced host overhead (many fewer hooks)
during recursive parameter fetching, especially in fine-grained models
such as those with a large number of MoE experts. This is particularly
helpful for host-sensitive devices (such as HPU), where it achieved a
40% performance improvement in our customer workloads.
FYI @delock @deepcharm
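A minimal config sketch showing where the new knob lives (the threshold value is illustrative, and the semantics in the comment are an assumption based on this PR's description):

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Assumed semantics: modules under this granularity are fetched as
        # one coalesced unit, trading memory for fewer host-side hooks.
        "module_granularity_threshold": 10000,
    },
}
```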

---------

Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-11-12 14:25:33 +00:00
Hongwei Chen 73d974ee64
Add data type check for bf16 (#6742)
Add data type check for bf16 to fix #6723
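A hedged sketch of the kind of guard this adds (the actual check goes through DeepSpeed's accelerator abstraction, not torch.cuda directly):

```python
import torch

def check_bf16_support():
    # Fail fast instead of crashing later in mixed-precision setup.
    if not torch.cuda.is_bf16_supported():
        raise ValueError("bf16 was requested but this accelerator does not support it")
```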
2024-11-12 13:01:31 +00:00
Chengming Zhang fabab197f7
Add Domino code (#6733)
add domino code

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-11-11 23:55:09 +00:00
Xinyu Lian 99e9cbed16
Fix Type Name Inconsistency & Typo in cpu_adam (#6732)
There is a typo and naming inconsistency in the cpu-adam code that,
while not affecting functionality, impacts code readability.
Specifically, the type name `ds_params_percision_t` contains a typo
('percision'), whereas the related type name `ds_state_precision_t` is
spelled correctly. Fixing this typo and inconsistency improves code
readability, maintainability, and further development.
I have tested the corrected version of cpu_adam, and it compiles and
runs successfully.

Compilation Log:
<img width="2560" alt="image"
src="https://github.com/user-attachments/assets/b7bc307d-9c9d-4ab7-8671-34e565903ca5">

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-11-11 23:31:45 +00:00
Logan Adams b45ca26354
Update AMD apex version (#6739) 2024-11-11 13:26:41 -08:00
Olatunji Ruwase b7e2ff5080
Add COMMITTER file (#6741)
Add COMMITTER file
2024-11-11 11:51:10 -08:00
Logan Adams 0855566228
Update GH hosted workflows to 24.04 (#6717)
`ubuntu-latset` is moving to be 24.04, so we should test updating as
well to ensure it doesn't break any of our workflows.
2024-11-11 06:22:08 -08:00
Logan Adams 057d25be67
Update version.txt after 0.15.4 release (#6731)
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.15.4
Author           - @loadams

Co-authored-by: loadams <loadams@users.noreply.github.com>
2024-11-08 08:34:20 -08:00