Commit graph

2602 commits

Author SHA1 Message Date
Perry Zou 249c1db2fb
Add fp16 support of Qwen1.5MoE models (A2.7B) to DeepSpeed-FastGen (#5403)
This PR adds support for Qwen1.5-MoE-A2.7B models.

Addresses the support request in https://github.com/microsoft/DeepSpeed-MII/issues/457.

### Test Code

For the MII pipeline:
```python
import mii

pipe = mii.pipeline("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B")
responses = pipe("DeepSpeed is", max_new_tokens=128, do_sample=False)
if pipe.is_rank_0:
    print(responses[0])
```
For Hugging Face:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B")
model = AutoModelForCausalLM.from_pretrained("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B", device_map="auto", torch_dtype=torch.float16, trust_remote_code=True).eval()
print(model)
inputs = tokenizer('DeepSpeed is', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs, max_new_tokens=128, do_sample=False, repetition_penalty=1.0)
test = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(test)
```

### Qwen1.5-MoE-A2.7B
Hugging Face output with prompt "DeepSpeed is":
```
 a deep learning framework that is designed to accelerate the training of large-scale neural networks. It is built on top of PyTorch and provides a set of tools and techniques for optimizing the performance of deep learning models.

DeepSpeed supports a variety of hardware accelerators, including GPUs, TPUs, and FPGAs, and can be used to train models on distributed systems, such as clusters of GPUs or TPUs.

One of the key features of DeepSpeed is its ability to automatically parallelize the training of deep learning models across multiple GPUs or TPUs. This can significantly reduce the time required to train large models, as it allows the
```
DeepSpeed-FastGen output with prompt "DeepSpeed is":
```
 a deep learning framework that is designed to accelerate the training of large-scale neural networks. It is built on top of PyTorch and provides a set of tools and techniques for optimizing the performance of deep learning models.

DeepSpeed supports a variety of hardware accelerators, including GPUs, TPUs, and FPGAs, and can be used to train models on distributed systems, such as clusters of GPUs or TPUs.

One of the key features of DeepSpeed is its ability to automatically parallelize the training of deep learning models across multiple GPUs or TPUs. This can significantly reduce the time required to train large models, as it allows the
```

DeepSpeed-FastGen output with prompt "DeepSpeed is" with 8-way sharding:
```
 a deep learning framework that is designed to accelerate the training of large-scale neural networks. It is built on top of PyTorch and provides a set of tools and techniques for optimizing the performance of deep learning models.

DeepSpeed supports a variety of hardware accelerators, including GPUs, TPUs, and FPGAs, and can be used to train models on distributed systems, such as clusters of GPUs or TPUs.

One of the key features of DeepSpeed is its ability to automatically parallelize the training of deep learning models across multiple GPUs or TPUs. This can significantly reduce the time required to train large models, as it allows the
```

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Abhishek Kulkarni <11399+adk9@users.noreply.github.com>
2024-08-01 10:27:24 -07:00
Logan Adams 23d0e0221f
Update to ROCm6 (#5491) 2024-08-01 09:25:08 -07:00
inkcherry 17ed7c77c5
sequence parallel with communication overlap (#5691)
SP is a fantastic piece of work: elegant and concise. At the current
stage, a transformer layer's forward and backward passes involve
8 all-to-all operations, offering 5 opportunities for overlapping
communication:

- Forward pass: the QKV matrix operations can be pipelined alongside some
of the all-to-all communications.
- Backward pass: the DQ, DK, DV all-to-all communications can be pipelined
alongside matrix operations.
- Backward pass: DO_w can run in parallel with DO_input, overlapping matrix
operations and all-to-all communications. Similar overlap-comm
strategies are used in Megatron for TP/TP-sp parallelism.
I tested under the conditions of 1N8C, ZeRO-1, activation checkpointing
disabled, ds-sp=8, and gbs=16:
- 1B, 64K
- 7B, 16K

Both showed over 10% improvement (where I also found that for mega-ds,
using split QKV by itself can improve performance by reducing slice +
cat operations in fwd/bwd), despite TFLOPs already being at a
relatively good level.
Works together with https://github.com/microsoft/Megatron-DeepSpeed/pull/415.
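As a rough illustration of the overlap idea (this is not the Megatron-DeepSpeed implementation; tensor shapes, function names, and the choice of which matmul overlaps which all-to-all are hypothetical), an asynchronous all-to-all can be launched while an independent matmul proceeds:

```python
import torch
import torch.distributed as dist

def overlapped_qkv_all2all(q_in, kv_local, w_kv, sp_group=None):
    """Sketch: overlap the sequence-parallel all-to-all for Q with an
    independent KV projection matmul (hypothetical shapes/names)."""
    q_out = torch.empty_like(q_in)

    # Launch the all-to-all asynchronously so compute can proceed.
    handle = dist.all_to_all_single(q_out, q_in, group=sp_group, async_op=True)

    # Independent matmul overlaps with the in-flight communication.
    kv_proj = torch.matmul(kv_local, w_kv)

    handle.wait()  # communication done; q_out is now safe to use
    return q_out, kv_proj
```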

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
2024-08-01 09:14:36 -07:00
Ramya Ramineni f82d08862f
[ROCm] Get rocm version from /opt/rocm/.info/version (#5815)
Previously we obtained ROCm version information from the
/opt/rocm/.info/version-dev file.
This PR modifies the code to read the ROCm version from
/opt/rocm/.info/version instead, for compatibility with ROCm
CentOS 9 Docker images.
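For reference, a minimal sketch of reading the version from the new location (the actual op_builder parsing may differ):

```python
import os

def installed_rocm_version(rocm_home="/opt/rocm"):
    """Read e.g. '6.1.2-66' from /opt/rocm/.info/version and return (major, minor)."""
    with open(os.path.join(rocm_home, ".info", "version")) as f:
        raw = f.read().strip()
    major, minor = raw.split(".")[:2]
    return int(major), int(minor)
```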

cc: @jithunnair-amd

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-01 09:13:40 -07:00
Joe Mayer 324ee65cb0
GDS AIO Blog (#5817)
README and media for the GDS blog.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-01 09:15:10 -04:00
Lev Kurilenko 681be6f558
Fix CPU Adam JIT compilation (#5780)
This PR fixes CPU Adam JIT compilation by including the `CUDA_LIB64`
path in the `extra_ldflags` list before calling `load()`.
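A simplified sketch of the idea (the source file list and library flags are illustrative, not the exact builder code):

```python
import os
from torch.utils.cpp_extension import load

CUDA_HOME = os.environ.get("CUDA_HOME", "/usr/local/cuda")
CUDA_LIB64 = os.path.join(CUDA_HOME, "lib64")

# The CUDA library path must be in extra_ldflags *before* load() is called,
# otherwise the JIT-compiled CPU Adam extension fails to link against CUDA libs.
cpu_adam = load(
    name="cpu_adam",
    sources=["csrc/adam/cpu_adam.cpp", "csrc/adam/cpu_adam_impl.cpp"],
    extra_cflags=["-O3", "-fopenmp"],
    extra_ldflags=[f"-L{CUDA_LIB64}", "-lcurand"],
    verbose=True,
)
```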

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-31 14:34:59 -07:00
trixirt 550f9c75bc
Find ROCm on Fedora (#5705)
ROCm is packaged natively on Fedora. Its install locations do not match
the AMD release.

So add some Fedora-specific logic to find the ROCm version, and use
rocminfo when attempts to use the AMD release paths fail.
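A hedged sketch of this kind of fallback (not the actual op_builder code; the rocminfo output format assumed here is illustrative):

```python
import os
import re
import shutil
import subprocess

def detect_rocm_version():
    """Try the AMD release layout first, then fall back to rocminfo."""
    # AMD release layout
    version_file = "/opt/rocm/.info/version"
    if os.path.isfile(version_file):
        with open(version_file) as f:
            return f.read().strip()

    # Fedora's native packaging installs ROCm elsewhere; query rocminfo instead.
    if shutil.which("rocminfo"):
        out = subprocess.check_output(["rocminfo"], text=True)
        m = re.search(r"Runtime Version:\s*([\d.]+)", out)  # assumed output format
        if m:
            return m.group(1)
    return None
```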

Signed-off-by: Tom Rix <trix@redhat.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-31 12:42:40 -07:00
keshavkowshik 08598dbb3a
Fix op_builder for CUDA 12.5 (#5806)
The C++ standard needed to be bumped to C++20 for CUDA 12.5 to fix
compilation issues in the op_builder.

TODO:
The fix may need to be extended to CUDA 12.4; needs testing.
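A hedged sketch of the version-dependent flag selection (the helper name and exact flag lists are illustrative, not the op_builder code):

```python
def cxx_std_flag(cuda_major: int, cuda_minor: int) -> str:
    """Pick the -std flag based on the installed CUDA toolkit version.

    CUDA 12.5 needs C++20 to avoid op_builder compilation errors; older
    toolkits keep C++17. Whether 12.4 also needs C++20 is still untested.
    """
    if (cuda_major, cuda_minor) >= (12, 5):
        return "-std=c++20"
    return "-std=c++17"

# Example: flags passed to the extension build
nvcc_args = ["-O3", cxx_std_flag(12, 5)]
```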

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
2024-07-31 10:37:52 -07:00
Logan Adams 5e8a27ad6d
Pin transformers version for MII tests (#5807)
Corresponding PR to https://github.com/microsoft/DeepSpeed-MII/pull/510,
made due to changes in transformers introduced by
https://github.com/huggingface/transformers/pull/31747
2024-07-29 18:24:21 -07:00
Heyang Qin 58241b1d71
fix: handle exception when loading cache file in test_inference.py (#5802)
This PR fixes CI failures such as
https://github.com/microsoft/DeepSpeed/actions/runs/10085903860/job/27887546470#step:8:3616
cc @tjruwase
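The gist of the fix is to treat a corrupt or partially written cache file as a cache miss rather than letting the test error out; a minimal sketch (the file name and helper are hypothetical, not the exact test code):

```python
import pickle

def load_query_cache(path="query_cache.pkl"):
    """Return cached responses, or an empty cache if the file is unusable."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (FileNotFoundError, EOFError, pickle.UnpicklingError) as e:
        # A corrupt/partial cache (e.g. from an interrupted CI run) should not
        # fail the test -- fall back to regenerating the results.
        print(f"Ignoring unusable cache file {path}: {e}")
        return {}
```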

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-29 14:18:50 -07:00
Liangliang Ma afe1b9ede1
Add doc of compressed backend in Onebit optimizers (#5782)
This is a documentation supplement for
https://github.com/microsoft/DeepSpeed/pull/5473.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-29 11:38:03 -07:00
Reza Yazdani 4f9506729f
Add fp8-fused gemm kernel (#5764)
This PR adds a new fused kernel for dense GeMM using fp8-quantized
weights.

---------

Co-authored-by: Jeff Rasley <jeffra45@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2024-07-29 11:07:00 -07:00
Logan Adams f80394349d
Update MII tests to pull correct torchvision (#5800) 2024-07-29 08:54:49 -07:00
YiSheng5 45b363504e
[XPU] Use host time to replace XPU time when the IPEX version is older than 2.5. (#5796)
Use host time to replace the XPU event elapsed_time as a workaround; on
the XPU device, using XPU events to measure time will be consolidated in
IPEX 2.5.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-25 15:03:07 -07:00
Logan Adams ffd0a0e3ef
Update other workflows to run on Ubuntu 22.04 (#5798) 2024-07-24 02:11:39 +00:00
Logan Adams e661ecb35a
Unpin transformers version (#5650)
Reverts changes in #5629 after fixes have been applied to MII repo/MII
tests.

---------

Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
2024-07-23 23:10:21 +00:00
Nir Sonnenschein 6d0dbf86e1
move is_checkpointable call reducing torch.compile Graph breaks (#5759)
We encountered a performance issue when running torch.compile on a
model utilizing the pipeline engine (Mixtral).
The issue was traced to the is_checkpointable function, which is called
in the engine's forward function.
This function creates a graph break when using torch.compile, leading to
decreased performance (particularly since this happens on every forward
call). We propose changing the way is_checkpointable is checked: its
value is precomputed and stored before the forward call, and the stored
value is accessed in the forward function.
With this change the graph break in the forward call is avoided, which
should lead to better performance with torch.compile.
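Conceptually the change looks like this (a simplified sketch, not the actual PipelineEngine code; class and attribute names are illustrative):

```python
from torch.utils.checkpoint import checkpoint

class PipelineEngineSketch:
    def __init__(self, layers, is_checkpointable_fn):
        self.layers = layers
        # Precompute once, outside the compiled region, so torch.compile never
        # traces through is_checkpointable() and no graph break is introduced.
        self.checkpointable_layers = [is_checkpointable_fn(l) for l in layers]

    def forward(self, x):
        for layer, ckpt in zip(self.layers, self.checkpointable_layers):
            if ckpt:
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x
```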

Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-23 21:46:01 +00:00
Logan Adams bf696ab696
Update torch version in cpu-torch-latest and nv-torch-latest-v100 tests to 2.4 (#5797)
Now that the tests have moved to using torch 2.4, we need to update the
tests or they will fail.
2024-07-23 13:05:00 -07:00
penn513 5a100f6b06
Fix accuracy error of NPUFusedAdam (#5777)
Co-authored-by: gp513 <guopeng34@huawei.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-23 04:40:01 +00:00
Yejing-Lai acdf136785
Add new autotp supported model in doc (#5785)
This PR refreshes the list of models supported by AutoTP. Newly added
models are:

- mixtral
- yuan
- phi
- qwen2 [reviewing PR #5786 ]
- chatglm2&chatglm3 [reviewing PR #5540 ]

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-23 02:12:24 +00:00
Yejing-Lai 0d3bb77b33
Add chatglm2 & chatglm3 autotp (#5540)
This PR aims to enable chatglm2 & chatglm3 AutoTP. Similar to phi3,
these models use a chunked MLP layer, so we adjust the weight order via
the 'shard_mlp_chunk' function. Please kindly review~ Thanks!

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
2024-07-23 02:11:21 +00:00
Yang, Bo 9fa4c42443
fix: quantization with DeepSpeed HE (#5624)
When the model is quantized, the hidden sizes cannot be determined from
`ds_shape` and `shape` because they are one-dimensional. This PR fixes
the bug by determining hidden sizes from `in_features` and
`out_features`.

This PR fixes #5398
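The idea, roughly (a hypothetical helper; the real fix lives in the HE injection code):

```python
import torch.nn as nn

def get_hidden_sizes(layer: nn.Linear):
    """Derive (input_size, output_size) from the module itself.

    For a quantized linear layer the weight may be stored as a 1-D packed
    tensor, so weight.shape / ds_shape cannot be used; in_features and
    out_features remain reliable either way.
    """
    return layer.in_features, layer.out_features
```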

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-23 00:59:51 +00:00
Omar Elayan 830d0c0a10
[INF] Add Qwen2RMSNorm to loaded layers in auto_tp (#5786)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-23 00:54:19 +00:00
Logan Adams 85c66fd783
Update Ubuntu version for running python tests (#5783) 2024-07-22 23:46:36 +00:00
taozhiwei f5d6c6311e
reduce all-to-all communication volume when both expert and non-expert are tensor-parallel (#5626)
Example: E + M + D parallel
world_size = 8
model_degree = 2
expert_degree = 4 
mp_group = [0, 1], [2,3], [4,5],[6,7]
expert_parallel_group = [0,2,4,6], [1,3,5,7]

In the original execution, before the expert computation there was no
drop operation, and the two EP groups performed their all-to-alls
separately. In the end they both obtained the complete data, but ranks 0
and 1 obtained exactly the same data; similarly ranks 2 and 3, and so
on. Therefore, we can drop the duplicated data before the all-to-all,
and then run an allgather after the all-to-all to recover the complete
data.

After the expert computation, the data on ranks 0 and 1 is again exactly
the same, so we can drop it, run the all-to-all, and then run an
allgather to obtain the complete data.


1. Non-expert uses TP, expert does not use TP: drop -> alltoall -> run MoE ->
alltoall -> allgather
2. Both non-expert and expert use TP (see the sketch below):
- original execution order: alltoall -> run MoE -> allreduce ->
alltoall
- optimized execution order: drop -> alltoall -> allgather -> run MoE ->
drop -> alltoall -> allgather
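A rough sketch of the optimized ordering for the tensor-parallel case (the collective wrappers, group handles, and expert function are illustrative, not the actual MoE layer code):

```python
import torch
import torch.distributed as dist

def moe_forward_tp(x, experts_fn, tp_group, ep_group):
    """Sketch: drop -> all-to-all -> allgather -> experts -> drop -> all-to-all -> allgather.

    Within a TP group every rank holds identical token data, so each rank only
    needs to send its 1/tp_size slice through the expert all-to-all and can
    rebuild the full tensor with an allgather afterwards.
    """
    tp_size = dist.get_world_size(tp_group)
    tp_rank = dist.get_rank(tp_group)

    # Drop: keep only this TP rank's slice before the expert all-to-all.
    x_slice = x.chunk(tp_size, dim=0)[tp_rank].contiguous()

    buf = torch.empty_like(x_slice)
    dist.all_to_all_single(buf, x_slice, group=ep_group)

    # Allgather across TP to restore the full token set for the experts.
    gathered = [torch.empty_like(buf) for _ in range(tp_size)]
    dist.all_gather(gathered, buf, group=tp_group)
    expert_out = experts_fn(torch.cat(gathered, dim=0))

    # Same pattern on the way back: drop, all-to-all, allgather.
    out_slice = expert_out.chunk(tp_size, dim=0)[tp_rank].contiguous()
    buf2 = torch.empty_like(out_slice)
    dist.all_to_all_single(buf2, out_slice, group=ep_group)
    out_parts = [torch.empty_like(buf2) for _ in range(tp_size)]
    dist.all_gather(out_parts, buf2, group=tp_group)
    return torch.cat(out_parts, dim=0)
```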

Signed-off-by: --local <zhiwei.tao@enflame-tech.com>
Co-authored-by: --local <zhiwei.tao@enflame-tech.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-22 23:41:14 +00:00
Logan Adams 213e2d975f
Fixes for latest Huggingface_hub changes on modelId -> id (#5789)
PRs in huggingface_hub that mirror this:

https://github.com/huggingface/huggingface_hub/pull/2405
2024-07-22 12:38:48 -07:00
Francesco Cariaggi 879c6cd082
Misplaced global variable `warned` (#5725)
Move the global variable `warned` from
`deepspeed.runtime.zero.parameter_offload.py` to
`deepspeed.runtime.zero.utils.py` to avoid `NameError: name 'warned' is
not defined` when calling `apply_to_tensors_only()` (defined in
`deepspeed.runtime.zero.utils.py`).

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-20 05:44:26 +00:00
Abhishek Kulkarni 6a163e03f4
Add support for Microsoft Phi-3 model to DeepSpeed-FastGen (#5559)
This PR adds support for Microsoft Phi-3 model to FastGen.

DeepSpeed-FastGen output with prompt "DeepSpeed is":
```
an AI-powered platform designed to optimize and scale distributed deep learning models across clusters.**

DeepSpeed is a cutting-edge AI-driven toolkit that empowers users to enhance and scale deep learning models across distributed computing environments. By harnessing the power of artificial intelligence, DeepSpeed provides innovative solutions for optimizing resource allocation, managing data synchronization, and improving model parallelism. This enables efficient scaling and execution of complex deep learning tasks, unlocking the full potential of distributed computing systems.

### Key Features of DeepSpeed:

1.
```

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-20 03:18:36 +00:00
beep-bebop 78c6c449c9
Update the list of supported models in the Chinese README of fastgen (#5773)
Adds the three models that deepspeed-fastgen has started supporting
since the last Chinese README update.

Co-authored-by: weifangyuan <i.weifangyuan@yuewen.com>
2024-07-16 13:32:16 +00:00
Dogacan Colak acbaca3223
Launcher mode with SSH bypass (#5728)
https://github.com/microsoft/DeepSpeed/issues/5510

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-16 13:31:20 +00:00
billishyahao 98272d14fe
[bugfix] promote state in bf16_optimizer (#5767)
This patch promotes `state` in bf16_optimizer so that it is accessible
in downstream DeepSpeed use cases.

For example, without the patch we found the following issue in the
Megatron-DeepSpeed LLaMA showcase:
```
[rank3]: Traceback (most recent call last):                                                                                                                             
[rank3]:   File "/yahao/Megatron-DeepSpeed/pretrain_gpt.py", line 356, in <module>                                                                                      
[rank3]:     pretrain(train_valid_test_datasets_provider,                                                                                                               
[rank3]:   File "/yahao/Megatron-DeepSpeed/megatron/training.py", line 222, in pretrain                                                                                 
[rank3]:     iteration = train(forward_step_func,                                                                                                                       
[rank3]:   File "/yahao/Megatron-DeepSpeed/megatron/training.py", line 1264, in train                                                                                   
[rank3]:     report_memory_flag = training_log(loss_dict, total_loss_dict,                                                                                              
[rank3]:   File "/yahao/Megatron-DeepSpeed/megatron/training.py", line 999, in training_log                                                                             
[rank3]:     opt_stats[0] += (torch.norm(optimizer.state[param]['exp_avg_sq']).item())**2                                                                               
[rank3]: AttributeError: 'BF16_Optimizer' object has no attribute 'state'
```

With the patch, the invocation passes smoothly.
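A minimal illustration of what "promoting state" means here (a sketch only; the real BF16_Optimizer wraps an inner optimizer and its attribute names may differ):

```python
class BF16OptimizerSketch:
    def __init__(self, inner_optimizer):
        self.optimizer = inner_optimizer  # e.g. torch.optim.AdamW over fp32 copies

    @property
    def state(self):
        # Expose the wrapped optimizer's per-parameter state (exp_avg, exp_avg_sq, ...)
        # so downstream code like Megatron-DeepSpeed's training_log can read it.
        return self.optimizer.state

    @property
    def param_groups(self):
        return self.optimizer.param_groups
```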

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-16 00:54:02 +00:00
Max Kovalenko 61e07786d5
Added wrappers for hpu tensors based on dtype (#5771)
This avoids graph breaks when using torch.compile.
2024-07-16 00:00:25 +00:00
Ma, Guokai ec6cbb3c08
[CPU] Allow deepspeed.comm.inference_all_reduce in torch.compile graph (#5604)
This PR allows `deepspeed.comm.inference_all_reduce()` to enter the
torch.compile graph even though it is implemented as a C++ kernel in
DeepSpeed.

The previous implementation registered the `inference_all_reduce()` C++
kernel as a pybind function so that it could be called from Python code.
However, pybind functions cannot be recognized by PyTorch, so the graph
breaks when `inference_all_reduce` is called.

We address the issue by registering `inference_all_reduce` as a PyTorch
custom op, `torch.ops.deepspeed.inference_all_reduce`, so it can be
built into the PyTorch graph.
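The registration pattern looks roughly like the following (a hedged sketch using the Python torch.library API; the actual DeepSpeed binding is done in C++ and differs in detail, and the placeholder body here is not the real kernel):

```python
import torch

# Define the op schema under a "deepspeed" namespace so it becomes
# torch.ops.deepspeed.inference_all_reduce and is visible to the compiler.
lib = torch.library.Library("deepspeed", "DEF")
lib.define("inference_all_reduce(Tensor self) -> Tensor")

def _inference_all_reduce_cpu(tensor):
    # Placeholder body: the real kernel calls the oneCCL-backed C++ all-reduce.
    return tensor

lib.impl("inference_all_reduce", _inference_all_reduce_cpu, "CPU")

x = torch.ones(4, 4)
y = torch.ops.deepspeed.inference_all_reduce(x)  # traceable by torch.compile
```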

The output trace code from torchinductor
```
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[5, 4]", primals_2: "f32[5]", primals_3: "f32[4, 4]"):
        # File: /home/gma/DeepSpeed/deepspeed/comm/torch.py:161 in inference_all_reduce, code: return torch.ops.deepspeed.inference_all_reduce_(tensor)
        inference_all_reduce: "f32[4, 4]" = torch.ops.deepspeed.inference_all_reduce.default(primals_3)

        # File: /home/gma/allreduce_graph/test_allreduce.py:33 in forward, code: return self.linear(input)
        permute: "f32[4, 5]" = torch.ops.aten.permute.default(primals_1, [1, 0]);  primals_1 = None
        addmm: "f32[4, 5]" = torch.ops.aten.addmm.default(primals_2, inference_all_reduce, permute);  primals_2 = permute = None

        # No stacktrace found for following nodes
        copy_: "f32[4, 4]" = torch.ops.aten.copy_.default(primals_3, inference_all_reduce);  primals_3 = None
        return [addmm, inference_all_reduce]
```

Note that in this PR the inference_all_reduce op for CPU does not handle
multinode or the FP16 data type. For FP16 support we will align with the
PyTorch CPU FP16 plan. For multinode, we are still looking at the
possibility of upstreaming oneCCL integration into PyTorch, so that we
can make use of oneCCL for multinode tensor-parallel inference with
PyTorch.

This PR is independent of
https://github.com/microsoft/DeepSpeed/pull/5571; they can work
separately or together without issue.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-07-15 22:24:11 +00:00
Yejing-Lai a07a3c5d22
Fix phi3 mini 128k load error (#5765)
Fix phi3 mini 128k load error.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-15 16:57:45 +00:00
Xu Song 0af9ac314f
Remove duplicated variable (#5727)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-15 16:57:14 +00:00
Avinash Maurya db5a875b8d
Fix memory leak for pipelined optimizer swapper (#5700)
We identified a memory leak when training with NVMe-offloaded optimizer
states. The issue occurs when `pipeline_write=true`: tensors that have
been swapped out and written to NVMe are not deallocated, leading to a
memory leak.

This PR resolves the issue by deallocating the unused tensors that have
been swapped out to NVMe.
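In spirit, the fix is to release the host buffers once their writes complete instead of keeping references to them; a simplified sketch (the handle/tensor bookkeeping and names are hypothetical, not the swapper's actual code):

```python
def complete_pending_writes(pending_writes):
    """Wait for in-flight NVMe writes and free the swapped-out tensors.

    `pending_writes` is assumed to be a list of (aio_handle, tensor) pairs
    queued while pipeline_write=true.
    """
    for handle, tensor in pending_writes:
        handle.wait()                            # ensure the async write finished
        tensor.data = tensor.data.new_empty(0)   # drop the pinned CPU storage
    pending_writes.clear()                       # release remaining references
```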

Co-authored-by: amaurya <am6429@cs.rit.edu>
2024-07-15 16:56:27 +00:00
Heyang Qin 83aa184351
Unit Test: Add error handling for rate limit exceeded in model list (#5715)
This PR fixes random failures in our unit tests caused by HTTP 429 (rate limit exceeded).
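The handling amounts to backing off and retrying when the Hugging Face Hub returns HTTP 429; a hedged sketch of the pattern, not the exact test code:

```python
import time

from huggingface_hub import HfApi
from requests.exceptions import HTTPError

def list_models_with_retry(retries=3, delay=30, **kwargs):
    """Call HfApi.list_models, retrying with a fixed delay on rate limiting."""
    api = HfApi()
    for attempt in range(retries):
        try:
            return list(api.list_models(**kwargs))
        except HTTPError as e:
            rate_limited = e.response is not None and e.response.status_code == 429
            if rate_limited and attempt < retries - 1:
                time.sleep(delay)  # rate limited: back off and retry
            else:
                raise
```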

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-11 08:42:43 -07:00
Costin Eseanu 74f3dcab62
Add Windows scripts (deepspeed, ds_report). (#5699)
Co-authored-by: Costin Eseanu <costineseanu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-09 01:05:09 +00:00
Omar Elayan 7b1ea2256e
[INF] Enable torch compile for inference (#5612)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-07-09 00:06:25 +00:00
Logan Adams 2105976eaf
Update checkout action for nv-human-eval workflow (#5757)
Update workflows that can be updated to use node20/checkout@v4.
2024-07-08 23:55:30 +00:00
Xinyu Lian 774b897736
fix the missing argument in test and typo (#5730)
This PR fixes the issue mentioned in
[PR5722](https://github.com/microsoft/DeepSpeed/pull/5722) that causes
hangs in the nv-torch-latest-v100 tests.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-08 21:44:33 +00:00
Logan Adams 8411816583
Update node16 check on self-hosted runners and remove python 3.6 (#5756)
With GitHub [finally
deprecating](https://github.com/actions/checkout/issues/1474) [node16-based
runners](https://github.blog/changelog/2024-03-07-github-actions-all-actions-will-run-on-node20-instead-of-node16-by-default/)
(which the checkout@v3 action uses), we need to make changes to support
this.

To do this, there are two changes. First, we remove the Python 3.6
check: with the pydantic v2 changes that will be merged soon we will be
removing this check there anyway, so removing it now keeps future PRs
cleaner and makes it clear why some changes have been made.

Second, node16 is the default on some of our self-hosted runners. To
work around tests failing on these, we [set the GitHub env var to
override this
check](https://github.blog/changelog/2024-03-07-github-actions-all-actions-will-run-on-node20-instead-of-node16-by-default/).

Other relevant links:
https://github.com/actions/checkout/issues/1474
https://github.com/easybuilders/easybuild-framework/pull/4574/files
https://github.com/actions/checkout/issues/1809
https://github.com/actions/runner/issues/3373
2024-07-08 12:33:52 -07:00
Sam Ade Jacobs 3d347276ce
Fix tutorial links (#5714) 2024-07-01 15:58:21 -07:00
Heyang Qin dd7a5be53d
UCP Chinese Blog (#5713)
Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com>
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
2024-07-01 15:57:52 -07:00
Sam Ade Jacobs 121efdbd5c
DeepSpeed Universal Checkpointing: Blog and Tutorial (#5711)
Train {GPT, LLaMA, Phi}-like models (or any model) at ultra-low cost with
DeepSpeed Universal Checkpointing (UCP). UCP abstracts away the
complexities of saving and loading model states. See the arXiv paper,
blog, and tutorial in this PR for details.

---------

Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-01 14:37:24 -07:00
baodi e39229676c
update xpu fusedadam opbuilder for pytorch 2.3 (#5702)
Update the way the queue is obtained for the FusedAdam OpBuilder.

---------

Signed-off-by: baodii <di.bao@intel.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-01 12:34:11 -07:00
Logan Adams df58a784c8
Update XPU docker version (#5712) 2024-07-01 11:33:12 -07:00
Logan Adams aecfec7f51
Add additional paths to trigger xpu tests (#5707) 2024-06-28 13:19:21 -07:00
Liangliang-Ma 4b8a4a0729
Change source of CPUAdam for xpu accelerator (#5703)
Since CPU Adam for the CUDA/CPU accelerator has removed its dependency
on CUDA, we can now use the same source.
2024-06-28 12:50:36 -07:00
Xinyu Lian f0e3f01d7c
Add an argument to enable the injection of missing state during the conversion of universal checkpoints (#5608)
This PR resolves
[Issue-5430](https://github.com/microsoft/DeepSpeed/issues/5430).

The PR enables the universal checkpoint feature for other platforms like
HuggingFace Trainer without requiring changes to the HuggingFace code.
It does this by adding an argument that allows the injection of minimal
necessary information into the state before this
[assertion](ebf82e8f3a/deepspeed/checkpoint/ds_to_universal.py (L358)).

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Abhishek Kulkarni <abkulkarni@microsoft.com>
2024-06-27 00:34:26 -07:00