> Using extra keyword arguments on `Field` is deprecated and will be
removed. Use `json_schema_extra` instead. (Extra keys: 'new_param').
Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2
Migration Guide at https://errors.pydantic.dev/2.9/migration/
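For reference, the migration the warning points to looks roughly like this (the model, field, and key names below are made up for illustration; they are not the actual DeepSpeed config fields):

```python
from pydantic import BaseModel, Field

class ExampleConfig(BaseModel):
    # Pydantic v2: arbitrary metadata goes into json_schema_extra instead of
    # being passed as an extra keyword argument to Field().
    value: int = Field(default=0, json_schema_extra={"new_param": "some_metadata"})
```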
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Successor PR to #6094:
> FutureWarning: You are using torch.load with weights_only=False (the
current default value), which uses the default pickle module implicitly.
It is possible to construct malicious pickle data which will execute
arbitrary code during unpickling (See
https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models
for more details). In a future release, the default value for
weights_only will be flipped to True. This limits the functions that
could be executed during unpickling. Arbitrary objects will no longer be
allowed to be loaded via this mode unless they are explicitly
allowlisted by the user via torch.serialization.add_safe_globals. We
recommend you start setting weights_only=True for any use case where you
don't have full control of the loaded file. Please open an issue on
GitHub for any issues related to this experimental feature.
Todo:
- [ ] Update values in non-test files to True where necessary.
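The todo item amounts to changes of this form at each call site (a minimal sketch with a placeholder path; the real call sites are spread across the repo):

```python
import torch

# Load with weights_only=True so unpickling is restricted to allowlisted types;
# keep weights_only=False only where the file is fully trusted and requires it.
state_dict = torch.load("checkpoint.pt", map_location="cpu", weights_only=True)
```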
Compatibility update for xpu ops
This PR introduces changes that will make xpu ops compatible with the
OneAPI 2025.0 toolkit. This is an important update that will allow us to
develop and ship our most demanding models on this innovative hardware.
---------
Signed-off-by: baodii <di.bao@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
It is a faster and more memory-efficient implementation of
`zero_to_fp32`.
The previous version doubled the memory usage, which caused CPU OOM for
very large models (e.g. llama 405B).
b647fb2470/deepspeed/utils/zero_to_fp32.py (L438-L441)
## How does it work?
1. **Lazy loading**: Load the checkpoint with `mmap=True`, so the weights
are mmapped rather than all storages being loaded into memory.
2. **Lazy merge**: `GatheredTensor` holds the mmapped weights and the
tensor offset. It is a memory-efficient pseudo tensor. Only when
`tensor.contiguous()` is called does it load the related weights into
memory and merge them into a single tensor.
3. **Release memory in time**: Save checkpoints shard by shard, and
release the memory once a shard is saved.
Throughout the process, only one shard of tensors is kept in memory.
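A heavily simplified sketch of the lazy-merge idea (the class name, fields, and layout below are assumptions for illustration, not the real `GatheredTensor`):

```python
import torch

class LazyGatheredTensor:
    """Pseudo tensor over mmapped ZeRO-3 flat shards; merges only on contiguous()."""

    def __init__(self, flat_shards, offsets, numels, shape):
        self.flat_shards = flat_shards  # 1-D tensors from torch.load(..., mmap=True)
        self.offsets = offsets          # start offset of this param within each shard
        self.numels = numels            # number of elements owned by each rank
        self.shape = shape

    def contiguous(self) -> torch.Tensor:
        # Only here are the mmapped storages actually read and copied, once,
        # into a single buffer that can be saved and then released.
        out = torch.empty(sum(self.numels), dtype=self.flat_shards[0].dtype)
        pos = 0
        for flat, off, n in zip(self.flat_shards, self.offsets, self.numels):
            out[pos:pos + n].copy_(flat[off:off + n])
            pos += n
        return out.view(self.shape)
```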
## How much benefit in speed and memory?
Experiments were conducted on a Linux host with 1TB of memory. Here is a
detailed comparison.
| model (old -> new)   | world size | peak memory (GB) | elapsed time (h:mm:ss) |
|----------------------|------------|------------------|------------------------|
| llama3-8B(old->new) | 8 | 90 -> 41 | 0:02:17 -> 0:01:10 |
| llama2-13B(old->new) | 8 | 146 -> 54 | 0:02:30 -> 0:01:47 |
| llama2-70B(old->new) | 16 | 789 -> 159 | 0:20:47 -> 0:20:45 |
| qwen1.5-110B(old->new) | 32 | OOM -> 217 | ? -> 0:34:21 |
| llama3-405B(old->new) | 192 | OOM -> 262 | ? -> 2:09:59 |
You can reproduce with the following scripts
```sh
# 1. install requirements
apt-get install time
# 2. prepare zero-3 checkpoints
# 3. convert zero to fp32 checkpoints
/usr/bin/time -v python zero_to_fp32.py . output_dir/ --safe_serialization
```
- **memory**: Theoretically, this PR reduces the memory cost from `2M`
to `(1/n)M`, where `M` is the memory cost of the full weights and `n` is
the number of shards.
- **speed**: The speed gain mainly comes from avoiding extra tensor
copying. The benefit may be slight.
## Impl history
-
[v1](19712a1c75 (diff-6a2ca3427fa608c387b7351359f98cfc1313be6e960cee86344ff246bf1b8326R441-R447)):
an hf_hub-compatible approach.
It was discarded due to the controversial implementation of
`data_ptr()`.
- [v2](https://github.com/microsoft/DeepSpeed/pull/6658/files): a simple
approach with `torch.empty`
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds Z3 coalesced fetch to zero optimization. Currently, some of
this logic could be reused, but it is difficult to expose it as an
optimization choice (I only discovered this logic while trying to
implement the feature).
The benefit of this approach is reduced host overhead (far fewer hooks)
during the recursive fetching of parameters (especially in fine-grained
models, such as those with a large number of MoE experts). This is
particularly helpful for host-sensitive devices (such as HPU), where it
achieved a 40% performance improvement in our customer workloads.
FYI @delock @deepcharm
---------
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
There is a typo and an inconsistency in the cpu-adam code that, while
not affecting functionality, impact code readability. Specifically, the
type name `ds_params_percision_t` contains a typo ('percision'), whereas
the related type name `ds_state_precision_t` is spelled correctly. I
think it is beneficial to fix this typo and inconsistency to improve
code readability, maintainability, and further development.
I have tested the corrected version of cpu_adam, and it compiles and
runs successfully.
Compilation Log:
<img width="2560" alt="image"
src="https://github.com/user-attachments/assets/b7bc307d-9c9d-4ab7-8671-34e565903ca5">
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.15.4
Author - @loadams
Co-authored-by: loadams <loadams@users.noreply.github.com>
Add support for testing compilation with python 3.11/3.12.
Also add the dockerfiles used to build those images.
---------
Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>
This PR is useful for updating the flake8 checks we run, but is mostly
needed to update flake8 so that it can run on the newer Python versions
included in the newer ubuntu-latest images from GitHub that we update to
in #6717.
This update is needed to eventually support running on ubuntu-24.04 from
GitHub, specifically because the Python version there is updated to 3.12,
which results in the following error: `ModuleNotFoundError: No module
named 'lib2to3'`, since that package is deprecated.
The parameter coordinator in ZeRO3 throws a "backward pass is invalid
for module in evaluation mode" error when the training mode is
unexpected, as it expects all modules to be in training mode during the
backward pass. This is an unnecessarily strict restriction.
This PR relaxes the restriction by using a single parameter coordinator
(instead of separate ones for training and evaluation modes) and
resetting the prefetch state before starting a forward pass.
Use of `is_compiling` needs to be fixed after #6663 is merged.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
When launching the apply_rotary_pos_half kernel, only threads_per_head
of 64 was supported for a wavefront size of 64.
This change adds support for threads_per_head < 64, such as 4, 8, and 16.
Fixes the issue introduced in
https://github.com/microsoft/DeepSpeed/pull/5402
---------
Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Tests in universal checkpointing were not freeing the engine after use
when `reuse_dist_env` was set to `True`, leading to memory leaks.
This PR ensures the engine is freed in the tests and enables
`reuse_dist_env`.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The Git-base model is an image-text model. After supporting the llama3.2
vision model, we set num_kv_heads dynamically.
Git-base only includes vision_config, so we need to add an attribute
check for vision_config/text_config when setting num_kv_heads.
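A minimal sketch of the kind of check described above (the helper and the fallback attribute names are assumptions, not the exact DeepSpeed code):

```python
def resolve_num_kv_heads(hf_config):
    # Prefer text_config when present (e.g. llama3.2 vision), otherwise fall
    # back to vision_config (e.g. git-base), otherwise use the top-level config.
    if hasattr(hf_config, "text_config"):
        cfg = hf_config.text_config
    elif hasattr(hf_config, "vision_config"):
        cfg = hf_config.vision_config
    else:
        cfg = hf_config
    # Assumed attribute names; some configs expose num_key_value_heads,
    # others only num_attention_heads.
    return getattr(cfg, "num_key_value_heads",
                   getattr(cfg, "num_attention_heads", None))
```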
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Hi guys,
I found there is an assertion failure when I train Hugging Face's
LoRA-based model in pipeline style.
Here are the steps I followed to create my model:
1) Load the pre-trained chatglm-6b model from huggingface, as Model_A
2) Use huggingface peft's `get_peft_model(...)` and my
`LoraConfig(...)` on Model_A to create the LoRA model, as Model_B
3) Create my own pipeline-based model Model_C from Model_B
I run Model_C on two 3090 Ti GPUs, and the assertion failure looks
like this:
```text
Traceback (most recent call last):
File "/home/ubuntu/proj/chatglm-finetuning/train_pipeline.py", line 372, in <module>
main()
File "/home/ubuntu/proj/chatglm-finetuning/train_pipeline.py", line 351, in main
loss = engine.train_batch(data_iter=train_dataloader)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 375, in train_batch
self._exec_schedule(sched)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 1375, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 276, in _exec_reduce_tied_grads
dist.all_reduce(grad, group=group)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 496, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 159, in all_reduce
return torch.distributed.all_reduce(tensor=tensor, op=op, group=group, async_op=async_op)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1520, in all_reduce
_check_single_tensor(tensor, "tensor")
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 463, in _check_single_tensor
raise RuntimeError(
RuntimeError: Invalid function argument. Expected parameter `tensor` to be of type torch.Tensor.
```
After some debugging, I found out the root cause is that my LoRA
configuration (below) only adds extra LoRA layers to the QKV-related
layers, but not to the embedding layer, so all of the embedding layer's
parameters are frozen.
```python
from peft import LoraConfig

lora_config = LoraConfig(r=8,  # copied from finetuning_lora.py
                         lora_alpha=32,
                         target_modules=["query_key_value"],
                         lora_dropout=0.1,
                         bias="none",
                         task_type="CAUSAL_LM",
                         inference_mode=False,
                         )
```
In my implementation of the pipeline-based model, I declared the
embedding layer as a tied layer. So the situation is that there are no
gradients at all for the embedding layer, yet the embedding layer, being
a tied layer, needs to be synced between the two GPUs. The gradient value
is None but is still passed to the `all_reduce` operation.
Currently, my fix is simply to add a check for whether this `grad` is None.
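The fix boils down to something like this (a simplified sketch of the idea, not the exact change in `_exec_reduce_tied_grads`; the helper and its arguments are illustrative):

```python
import torch.distributed as dist

def reduce_tied_grads(tied_params, group):
    for p in tied_params:
        if p.grad is None:
            # A frozen tied parameter (e.g. a LoRA-frozen embedding) produced
            # no gradient, so skip the collective instead of passing None.
            continue
        dist.all_reduce(p.grad, group=group)
```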
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Updated docker version to 1.18.0-latest
Note: for this update the firmware on the Gaudi2 node had to be updated
to use firmware version 1.18.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Importing `torch.compiler.is_compiling` causes an error with older
versions of PyTorch.
This PR adds a fallback for `is_compiling` that uses an equivalent
function from older PyTorch versions.
This will resolve #6656.
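A sketch of the kind of fallback this implies (the exact module path for the older-PyTorch equivalent is an assumption here and may differ from the actual patch):

```python
try:
    from torch.compiler import is_compiling  # newer PyTorch
except ImportError:
    try:
        # assumed location of an equivalent helper in older torch 2.x releases
        from torch._dynamo.external_utils import is_compiling
    except ImportError:
        def is_compiling() -> bool:
            # very old torch: compilation is never active
            return False
```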
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
In sequence_parallel (Ulysses), the sequence parallel size is
constrained by the requirement that the number of heads be divisible by
it, which prevents some models/workloads from using a specific sequence
parallel size. This PR implements uneven all-to-all heads splitting.
- Supports both batch-first (b, s, ...) and seq_len-first (s, b, ...) layouts.
- Added unit tests with numerical checks. Locally also tested with **7
heads with sp=4** and **20 heads with sp=8**, and both passed.
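The head distribution behind the uneven split can be illustrated as follows (a self-contained sketch of the counting logic only, not the actual all-to-all implementation):

```python
def split_heads_unevenly(num_heads: int, sp_size: int) -> list[int]:
    # Distribute heads across sequence-parallel ranks; the first `rem` ranks
    # carry one extra head when num_heads is not divisible by sp_size.
    base, rem = divmod(num_heads, sp_size)
    return [base + (1 if rank < rem else 0) for rank in range(sp_size)]

assert split_heads_unevenly(7, 4) == [2, 2, 2, 1]
assert split_heads_unevenly(20, 8) == [3, 3, 3, 3, 2, 2, 2, 2]
```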
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Dynamo uses FakeTensor to trace tensor ops. In some cases, this
mechanism breaks compiling with DeepSpeed.
An example can be found at
https://gist.github.com/oraluben/9b8240c2fe482eb4382453d6c97a5f76; to
see the issues, install deepspeed==0.14.4 instead of my fork.
Without this PR, llama cannot be compiled.
Detailed explanation:
1. `ZeROOrderedDict`: Dynamo uses deepcopy to copy tensors, which calls
`object.__reduce__`. When copying a `ZeROOrderedDict`, the default
implementation does not copy its `_parent_module`, which leads to
failure (see the sketch after this list).
2. `param` may be a FakeTensor that does not have `ds_status` yet;
during tracing it is fine to simply skip `register_external_parameter`,
since it should have been done well before.
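A hedged sketch of what fix 1 could look like, carrying `_parent_module` through `__reduce__` so deepcopy can reconstruct the dict (simplified and assumption-laden; not the exact patch):

```python
from collections import OrderedDict

class ZeROOrderedDict(OrderedDict):
    def __init__(self, parent_module=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._parent_module = parent_module

    def __reduce__(self):
        # OrderedDict.__reduce__ returns (cls, args, state, list_it, dict_it);
        # inject _parent_module into the constructor args so deepcopy keeps it.
        cls, _args, *rest = super().__reduce__()
        return (cls, (self._parent_module,), *rest)
```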
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Starting with torch 2.4, in
[`init_device_mesh()`](de4c2a3b4e/torch/distributed/device_mesh.py (L915)),
a device type with a GPU index, such as "cuda:0", is not allowed.
![image](https://github.com/user-attachments/assets/1ddb61bf-8a15-4e0a-9115-a3681d7f19ff)
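For illustration, the call pattern that torch 2.4 expects (assumes a process group is already initialized; this is not the exact DeepSpeed call site):

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

world_size = dist.get_world_size()
mesh = init_device_mesh("cuda", (world_size,))      # OK: bare device type
# mesh = init_device_mesh("cuda:0", (world_size,))  # rejected starting with torch 2.4
```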
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.15.3
Author - @jomayeri
Co-authored-by: jomayeri <jomayeri@users.noreply.github.com>
We have found that #6592 uses `_pinned_tensor_mgr` to create the CPU
bounce buffer, which is the same as what our XPU accelerator is
currently doing, so there is no need for the XPU-device-specific
`cpu_op_desc_t`.
In this PR:
1. remove the custom csrc/xpu/aio/deepspeed_cpu_op.cpp
2. modify the xpu async_io opbuilder.
This cannot easily be done by reverting #6532, because we added some
source files the last time the GDS feature went into DeepSpeed, so we
are filing this new PR :)
With intel-extension-for-pytorch==2.3.110 released last month, the
max1100 CI workflow can be updated too. Software versions are aligned
with #6570.
An increased CI test scope for torch/ipex 2.3 will come in a later PR.
This workflow passed on the self-hosted runner of my cloned repo.