This PR adds support for Qwen1.5-MoE-A2.7B models.
Addresses https://github.com/microsoft/DeepSpeed-MII/issues/457.
### Test Code
For the MII pipeline:
```python
import mii
pipe = mii.pipeline("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B")
responses = pipe("DeepSpeed is", max_new_tokens=128, do_sample=False)
if pipe.is_rank_0:
    print(responses[0])
```
For HuggingFace:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B")
model = AutoModelForCausalLM.from_pretrained("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B", device_map="auto", torch_dtype=torch.float16, trust_remote_code=True).eval()
print(model)
inputs = tokenizer('DeepSpeed is', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs, max_new_tokens=128, do_sample=False, repetition_penalty=1.0)
test = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(test)
```
### Qwen1.5-MoE-A2.7B
Huggingface output with prompt "DeepSpeed is":
```
a deep learning framework that is designed to accelerate the training of large-scale neural networks. It is built on top of PyTorch and provides a set of tools and techniques for optimizing the performance of deep learning models.
DeepSpeed supports a variety of hardware accelerators, including GPUs, TPUs, and FPGAs, and can be used to train models on distributed systems, such as clusters of GPUs or TPUs.
One of the key features of DeepSpeed is its ability to automatically parallelize the training of deep learning models across multiple GPUs or TPUs. This can significantly reduce the time required to train large models, as it allows the
```
DeepSpeed-FastGen output with prompt "DeepSpeed is":
```
a deep learning framework that is designed to accelerate the training of large-scale neural networks. It is built on top of PyTorch and provides a set of tools and techniques for optimizing the performance of deep learning models.
DeepSpeed supports a variety of hardware accelerators, including GPUs, TPUs, and FPGAs, and can be used to train models on distributed systems, such as clusters of GPUs or TPUs.
One of the key features of DeepSpeed is its ability to automatically parallelize the training of deep learning models across multiple GPUs or TPUs. This can significantly reduce the time required to train large models, as it allows the
```
DeepSpeed-FastGen output with prompt "DeepSpeed is" with 8-way sharding:
```
a deep learning framework that is designed to accelerate the training of large-scale neural networks. It is built on top of PyTorch and provides a set of tools and techniques for optimizing the performance of deep learning models.
DeepSpeed supports a variety of hardware accelerators, including GPUs, TPUs, and FPGAs, and can be used to train models on distributed systems, such as clusters of GPUs or TPUs.
One of the key features of DeepSpeed is its ability to automatically parallelize the training of deep learning models across multiple GPUs or TPUs. This can significantly reduce the time required to train large models, as it allows the
```
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Abhishek Kulkarni <11399+adk9@users.noreply.github.com>
SP is a fantastic piece of work: it is very elegant and concise. At the current stage, a transformer layer's forward and backward passes involve 8 all-to-all operations, with 5 opportunities for overlapping communication with computation:
- Forward pass: the QKV matrix operations can be pipelined alongside some of the all-to-all communications.
- Backward pass: the DQ, DK, and DV all-to-all communications can be pipelined alongside matrix operations.
- Backward pass: DO_w can run in parallel with DO_input, overlapping matrix operations with all-to-all communications. Similar overlap-comm strategies are used in Megatron for TP/TP-sp parallelism.
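To make the pattern concrete, here is a minimal sketch of overlapping an all-to-all with an independent matrix operation, assuming an initialized torch.distributed process group; the tensor names are hypothetical and this is not the actual SP implementation:

```python
import torch
import torch.distributed as dist

def overlapped_step(comm_buf: torch.Tensor, x: torch.Tensor, w: torch.Tensor):
    """Overlap an all-to-all with an independent matmul (illustrative only)."""
    out = torch.empty_like(comm_buf)
    # Launch the all-to-all asynchronously so it proceeds in the background.
    work = dist.all_to_all_single(out, comm_buf, async_op=True)
    # An independent matrix operation (e.g. part of the QKV projection)
    # executes while the communication is in flight.
    y = x @ w
    # Wait only when the all-to-all result is actually needed.
    work.wait()
    return out, y
```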
I tested under 1N8C with ZeRO-1, activation checkpointing disabled, ds-sp=8, and gbs=16, on two settings:
- 1B, 64K
- 7B, 16K
Both showed over 10% improvement (I also found that for Mega-DS, using split QKV by itself can improve performance by removing slice + cat operations in fwd/bwd), even though some of the TFLOPs numbers were already at a relatively good level.
This works together with https://github.com/microsoft/Megatron-DeepSpeed/pull/415.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Previously we obtained ROCm version information from the /opt/rocm/.info/version-dev file.
This PR modifies the code to get the ROCm version from /opt/rocm/.info/version instead, adding compatibility with the ROCm CentOS 9 docker images.
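For reference, a minimal sketch of the intended lookup; the file names come from the description above, and the fallback order is an assumption:

```python
import os

def rocm_version_from_info():
    # Prefer the new /opt/rocm/.info/version file; fall back to the old
    # version-dev file for images that only ship the latter.
    for path in ("/opt/rocm/.info/version", "/opt/rocm/.info/version-dev"):
        if os.path.isfile(path):
            with open(path) as f:
                return f.read().strip()
    return None
```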
cc: @jithunnair-amd
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
README and media for the GDS blog.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR fixes CPU Adam JIT compilation by including the `CUDA_LIB64`
path in the `extra_ldflags` list before calling `load()`.
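A minimal sketch of the idea, assuming `torch.utils.cpp_extension.load` and an illustrative `CUDA_LIB64` path; the source list and linked libraries are placeholders, not DeepSpeed's actual builder wiring:

```python
import os
from torch.utils.cpp_extension import load

CUDA_HOME = os.environ.get("CUDA_HOME", "/usr/local/cuda")  # assumed location
CUDA_LIB64 = os.path.join(CUDA_HOME, "lib64")

# Pass the CUDA library directory to the linker before JIT-compiling the op,
# so that CUDA runtime/library symbols resolve when load() links the extension.
cpu_adam = load(
    name="cpu_adam",
    sources=["csrc/adam/cpu_adam.cpp"],          # illustrative source path
    extra_ldflags=[f"-L{CUDA_LIB64}", "-lcudart"],
    verbose=True,
)
```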
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
ROCm is packaged natively on Fedora. Its install locations do not match the AMD release,
so this adds some Fedora-specific logic to find the ROCm version and uses
rocminfo when the attempts based on the AMD release layout fail.
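A rough sketch of that fallback (the paths and the rocminfo parsing are assumptions, not the exact code in this change):

```python
import re
import shutil
import subprocess

def detect_rocm_version():
    # AMD-release layout: the version is stored in a plain text file.
    try:
        with open("/opt/rocm/.info/version") as f:
            return f.read().strip()
    except OSError:
        pass
    # Fedora's native packaging: fall back to rocminfo on the PATH.
    if shutil.which("rocminfo"):
        out = subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
        match = re.search(r"Runtime Version:\s*([\d.]+)", out)
        if match:
            return match.group(1)
    return None
```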
Signed-off-by: Tom Rix <trix@redhat.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The standard library needed to be updated to C++20 for CUDA 12.5 to fix
compilation issues in the op_builder.
TODO:
The fix may need to be extended to CUDA 12.4; needs testing.
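Illustrative only: a sketch of gating the C++ standard flag on the detected CUDA version (the helper is hypothetical, not the actual op_builder code):

```python
import torch

def cxx_std_flag():
    # CUDA 12.5 needs C++20 to compile the ops; older toolkits keep C++17.
    major, minor = (int(v) for v in torch.version.cuda.split(".")[:2])
    return "-std=c++20" if (major, minor) >= (12, 5) else "-std=c++17"
```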
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
This PR adds a new fused kernel for the dense GeMM using fp8-quantized weights.
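For context, here is an unfused reference of the computation, assuming a per-tensor scale; this is a hypothetical sketch, not the fused kernel:

```python
import torch

def fp8_weight_gemm_reference(x: torch.Tensor, w_fp8: torch.Tensor, scale: torch.Tensor):
    # Dequantize the fp8 weight to the activation dtype, then run a dense GeMM.
    # A fused kernel performs the dequantization inside the GeMM instead of
    # materializing the full-precision weight.
    w = w_fp8.to(x.dtype) * scale
    return x @ w.t()
```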
---------
Co-authored-by: Jeff Rasley <jeffra45@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Use host time as a workaround to replace XPU event elapsed_time; on XPU devices, using XPU events to measure time will be consolidated in IPEX 2.5.
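A minimal sketch of what such a host-time measurement can look like, assuming a `torch.xpu` backend; this is illustrative, not the actual timer code:

```python
import time
import torch

def host_timed(fn, *args):
    # Workaround: use host wall-clock time instead of XPU event elapsed_time.
    # Synchronize so queued device work does not leak into or out of the window.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.synchronize()
    start = time.perf_counter()
    out = fn(*args)
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return out, elapsed_ms
```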
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
We encountered a performance issue when running torch.compile on a model that uses
the pipeline engine (Mixtral).
The issue was traced to the is_checkpointable function, which is called
in the engine's forward function.
This function creates a graph break under torch.compile, leading to
decreased performance (particularly since it happens on every forward
call). We propose changing how is_checkpointable is checked: precompute
and store its value before the forward call and read the stored value
inside the forward function.
With this change the graph break in the forward call is avoided, which
should lead to better performance with torch.compile.
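A simplified sketch of the approach, using a hypothetical engine class rather than the actual DeepSpeed pipeline engine:

```python
import torch
from torch.utils.checkpoint import checkpoint

class PipelineEngineSketch(torch.nn.Module):
    """Hypothetical stand-in used only to illustrate the precompute idea."""

    def __init__(self, layers, is_checkpointable):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)
        # Precompute once, before any compiled forward, so torch.compile
        # never has to trace through is_checkpointable() at runtime.
        self.checkpointable = [is_checkpointable(layer) for layer in layers]

    def forward(self, x):
        for layer, ckpt in zip(self.layers, self.checkpointable):
            # Reading a precomputed Python bool does not introduce a graph break.
            x = checkpoint(layer, x, use_reentrant=False) if ckpt else layer(x)
        return x
```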
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR aims to enable chatglm2 & chatglm3 AutoTP. Similar to phi3,
these models use a chunked MLP layer, so we adjust the weight order via the
'shard_mlp_chunk' func. Please kindly review~ Thanks!
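For readers unfamiliar with the chunked-MLP issue, here is a hedged sketch of the kind of reordering involved, assuming the fused weight stacks the gate and up projections along dim 0; the real shard_mlp_chunk logic may differ:

```python
import torch

def reorder_chunked_mlp_weight(weight: torch.Tensor, tp_size: int) -> torch.Tensor:
    # The fused MLP weight stores [gate; up] stacked along dim 0. Splitting it
    # naively across tp_size ranks would give some ranks only gate rows and
    # others only up rows, so regroup it into (gate_i, up_i) pairs per rank.
    gate, up = weight.chunk(2, dim=0)
    gate_shards = gate.chunk(tp_size, dim=0)
    up_shards = up.chunk(tp_size, dim=0)
    return torch.cat([torch.cat(pair, dim=0) for pair in zip(gate_shards, up_shards)], dim=0)
```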
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
When the model is quantized, the hidden sizes cannot be determined from
`ds_shape` and `shape` because they are one-dimensional. This PR fixes
the bug by determining the hidden sizes from `in_features` and
`out_features`.
This PR fixes #5398.
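A condensed sketch of the fallback; the helper name and attribute checks are illustrative:

```python
def get_hidden_sizes(module):
    weight = module.weight
    # Quantized weights are flattened to 1-D, so ds_shape/shape cannot be used
    # to recover (out_size, in_size); fall back to the Linear metadata instead.
    if weight.dim() > 1:
        out_size, in_size = getattr(weight, "ds_shape", weight.shape)
    else:
        in_size, out_size = module.in_features, module.out_features
    return in_size, out_size
```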
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Example: E + M + D parallel
world_size = 8
model_degree = 2
expert_degree = 4
mp_group = [0, 1], [2, 3], [4, 5], [6, 7]
expert_parallel_group = [0, 2, 4, 6], [1, 3, 5, 7]
In the original execution flow there was no drop operation before executing the
experts, and the two EP groups each performed their own all-to-all. In the end
both obtained the complete data, but ranks 0 and 1 received exactly the same
data, and likewise ranks 2 and 3, and so on.
Therefore we can drop the duplicated data before the all-to-all, and then
execute an allgather after the all-to-all to recover the complete data.
After executing the experts, the data on ranks 0 and 1 is again exactly the
same, so we can drop it, execute the all-to-all, and then execute an allgather
to obtain the complete data.
1. Non-expert uses TP, expert does not use TP: drop -> alltoall -> exec MoE ->
alltoall -> allgather
2. Both non-expert and expert use TP:
- original execution order: alltoall -> exec MoE -> allreduce -> alltoall
- optimized execution order: drop -> alltoall -> allgather -> exec MoE ->
drop -> alltoall -> allgather
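A schematic sketch of the optimized path for case 2 (both non-expert and expert use TP), assuming the TP and EP process groups are already built; this is illustrative Python, not the actual implementation:

```python
import torch
import torch.distributed as dist

def moe_forward_optimized(tokens, experts, tp_group, ep_group):
    tp_rank = dist.get_rank(tp_group)
    tp_size = dist.get_world_size(tp_group)

    # Drop: every TP rank holds an identical copy of the tokens, so each rank
    # keeps only its own 1/tp_size slice before communicating.
    local = tokens.chunk(tp_size, dim=0)[tp_rank].contiguous()

    # All-to-all across the expert-parallel group on the reduced volume.
    dispatched = torch.empty_like(local)
    dist.all_to_all_single(dispatched, local, group=ep_group)

    # Allgather across TP to rebuild the complete expert input.
    parts = [torch.empty_like(dispatched) for _ in range(tp_size)]
    dist.all_gather(parts, dispatched, group=tp_group)
    expert_out = experts(torch.cat(parts, dim=0))

    # The TP ranks now hold identical expert outputs again: drop, all-to-all
    # back across EP, then allgather across TP to recover the complete data.
    local_out = expert_out.chunk(tp_size, dim=0)[tp_rank].contiguous()
    combined = torch.empty_like(local_out)
    dist.all_to_all_single(combined, local_out, group=ep_group)
    out_parts = [torch.empty_like(combined) for _ in range(tp_size)]
    dist.all_gather(out_parts, combined, group=tp_group)
    return torch.cat(out_parts, dim=0)
```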
Signed-off-by: --local <zhiwei.tao@enflame-tech.com>
Co-authored-by: --local <zhiwei.tao@enflame-tech.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Move the global variable `warned` from
`deepspeed/runtime/zero/parameter_offload.py` to
`deepspeed/runtime/zero/utils.py` to avoid `NameError: name 'warned' is
not defined` when calling `apply_to_tensors_only()` (defined in
`deepspeed/runtime/zero/utils.py`).
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds support for Microsoft Phi-3 model to FastGen.
DeepSpeed-FastGen output with prompt "DeepSpeed is":
```
an AI-powered platform designed to optimize and scale distributed deep learning models across clusters.**
DeepSpeed is a cutting-edge AI-driven toolkit that empowers users to enhance and scale deep learning models across distributed computing environments. By harnessing the power of artificial intelligence, DeepSpeed provides innovative solutions for optimizing resource allocation, managing data synchronization, and improving model parallelism. This enables efficient scaling and execution of complex deep learning tasks, unlocking the full potential of distributed computing systems.
### Key Features of DeepSpeed:
1.
```
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Updates to the three models supported in deepspeed-fastgen since the
last Chinese README update.
Co-authored-by: weifangyuan <i.weifangyuan@yuewen.com>
This patch promotes `state` in bf16_optimizer so it is accessible in downstream
DeepSpeed use cases.
For example, without the patch we found the following issue in the
Megatron-DeepSpeed Llama showcase:
```
[rank3]: Traceback (most recent call last):
[rank3]: File "/yahao/Megatron-DeepSpeed/pretrain_gpt.py", line 356, in <module>
[rank3]: pretrain(train_valid_test_datasets_provider,
[rank3]: File "/yahao/Megatron-DeepSpeed/megatron/training.py", line 222, in pretrain
[rank3]: iteration = train(forward_step_func,
[rank3]: File "/yahao/Megatron-DeepSpeed/megatron/training.py", line 1264, in train
[rank3]: report_memory_flag = training_log(loss_dict, total_loss_dict,
[rank3]: File "/yahao/Megatron-DeepSpeed/megatron/training.py", line 999, in training_log
[rank3]: opt_stats[0] += (torch.norm(optimizer.state[param]['exp_avg_sq']).item())**2
[rank3]: AttributeError: 'BF16_Optimizer' object has no attribute 'state'
```
With the patch, the invocation passes smoothly.
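As a hedged sketch of what such a promotion can look like (not necessarily the exact change in this patch), the wrapper exposes a `state` mapping so code written against torch.optim, e.g. `optimizer.state[param]['exp_avg_sq']`, keeps resolving:

```python
from collections import defaultdict

class BF16OptimizerSketch:
    """Hypothetical stand-in for BF16_Optimizer, for illustration only."""

    def __init__(self, inner_optimizer):
        self.optimizer = inner_optimizer
        # Promote the wrapped optimizer's per-parameter state so downstream
        # code that expects a torch.optim-style .state attribute still works.
        self.state = getattr(inner_optimizer, "state", defaultdict(dict))
```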
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR allows `deepspeed.comm.inference_all_reduce()` to enter the
torch.compile graph even though it is implemented as a C++ kernel in DeepSpeed.
The previous implementation registered the `inference_all_reduce()` C++ kernel
as a pybind function so it could be called from Python code. However, a pybind
function cannot be recognized by PyTorch, so the graph breaks when
`inference_all_reduce` is called.
We address this issue by registering `inference_all_reduce` as a PyTorch custom
op, `torch.ops.deepspeed.inference_all_reduce`, so it can be built into the
PyTorch graph.
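For illustration, a minimal sketch of registering such a custom op from Python via `torch.library`; the real PR dispatches to the C++ kernel, and this Python fallback is only an assumption-laden stand-in:

```python
import torch
import torch.distributed as dist

# Define the op schema in a "deepspeed" namespace so it appears as
# torch.ops.deepspeed.inference_all_reduce and can be captured by torch.compile.
lib = torch.library.Library("deepspeed", "DEF")
lib.define("inference_all_reduce(Tensor self) -> Tensor")

def _inference_all_reduce_cpu(tensor: torch.Tensor) -> torch.Tensor:
    # Reference implementation for the sketch; the actual op calls into
    # DeepSpeed's C++ all-reduce kernel instead.
    dist.all_reduce(tensor)
    return tensor

lib.impl("inference_all_reduce", _inference_all_reduce_cpu, "CPU")
```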
The output trace code from TorchInductor:
```
class GraphModule(torch.nn.Module):
def forward(self, primals_1: "f32[5, 4]", primals_2: "f32[5]", primals_3: "f32[4, 4]"):
# File: /home/gma/DeepSpeed/deepspeed/comm/torch.py:161 in inference_all_reduce, code: return torch.ops.deepspeed.inference_all_reduce_(tensor)
inference_all_reduce: "f32[4, 4]" = torch.ops.deepspeed.inference_all_reduce.default(primals_3)
# File: /home/gma/allreduce_graph/test_allreduce.py:33 in forward, code: return self.linear(input)
permute: "f32[4, 5]" = torch.ops.aten.permute.default(primals_1, [1, 0]); primals_1 = None
addmm: "f32[4, 5]" = torch.ops.aten.addmm.default(primals_2, inference_all_reduce, permute); primals_2 = permute = None
# No stacktrace found for following nodes
copy_: "f32[4, 4]" = torch.ops.aten.copy_.default(primals_3, inference_all_reduce); primals_3 = None
return [addmm, inference_all_reduce]
```
Note that in this PR the inference_all_reduce op for CPU does not handle
multi-node setups or the FP16 data type. For FP16 data type support, we will
align with the PyTorch CPU FP16 plan. For multi-node, we are still looking at
the possibility of upstreaming oneCCL integration into PyTorch, so that we can
make use of oneCCL for multi-node tensor-parallel inference with PyTorch.
This PR is independent of
https://github.com/microsoft/DeepSpeed/pull/5571; they can work
separately or together without issue.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
We identified a memory leak when training with NVMe-offloaded optimizer
states. The issue occurs when `pipeline_write=true`: the tensors that have
been swapped out and written to NVMe are not deallocated, leading to a
memory leak.
This PR resolves the issue by deallocating the unused tensors that have been
swapped out to NVMe.
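A minimal sketch of that kind of cleanup, assuming a hypothetical list of tensors whose NVMe writes have already completed; the actual swapper bookkeeping differs:

```python
import torch

def release_swapped_out(swapped_out_tensors):
    # Once the NVMe write has completed, drop the in-memory storage so buffers
    # do not accumulate across pipeline_write steps.
    for tensor in swapped_out_tensors:
        tensor.data = torch.empty(0, dtype=tensor.dtype, device=tensor.device)
    swapped_out_tensors.clear()
```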
Co-authored-by: amaurya <am6429@cs.rit.edu>
This PR fixes the issue mentioned in
[PR5722](https://github.com/microsoft/DeepSpeed/pull/5722) that causes
the hangs in the nv-torch-latest-v100 tests.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Train {GPT, LLaMA, Phi}-like models (or any model) at ultra-low cost with
DeepSpeed Universal Checkpointing (UCP). UCP abstracts away the
complexities of saving and loading model states. See the arXiv paper, blog,
and tutorial in this PR for details.
---------
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Update the way the queue is obtained for the FusedAdam OpBuilder.
---------
Signed-off-by: baodii <di.bao@intel.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR solves
[Issue-5430](https://github.com/microsoft/DeepSpeed/issues/5430).
It enables the universal checkpoint feature for other platforms, such as the
HuggingFace Trainer, without requiring changes to the HuggingFace code.
It does this by adding an argument that allows injecting the minimal
necessary information into the state before this
[assertion](ebf82e8f3a/deepspeed/checkpoint/ds_to_universal.py (L358)).
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Abhishek Kulkarni <abkulkarni@microsoft.com>