This PR adds support for Qwen1.5-MoE-A2.7B models.
Addresses https://github.com/microsoft/DeepSpeed-MII/issues/457.
### Test Code
For the MII pipeline:
```python
import mii
pipe = mii.pipeline("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B")
responses = pipe("DeepSpeed is", max_new_tokens=128, do_sample=False)
if pipe.is_rank_0:
    print(responses[0])
```
For HuggingFace:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B")
model = AutoModelForCausalLM.from_pretrained("/data/zonepg/models/Qwen/Qwen1.5-MoE-A2.7B", device_map="auto", torch_dtype=torch.float16, trust_remote_code=True).eval()
print(model)
inputs = tokenizer('DeepSpeed is', return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs, max_new_tokens=128, do_sample=False, repetition_penalty=1.0)
test = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False)
print(test)
```
### Qwen1.5-MoE-A2.7B
Huggingface output with prompt "DeepSpeed is":
```
a deep learning framework that is designed to accelerate the training of large-scale neural networks. It is built on top of PyTorch and provides a set of tools and techniques for optimizing the performance of deep learning models.
DeepSpeed supports a variety of hardware accelerators, including GPUs, TPUs, and FPGAs, and can be used to train models on distributed systems, such as clusters of GPUs or TPUs.
One of the key features of DeepSpeed is its ability to automatically parallelize the training of deep learning models across multiple GPUs or TPUs. This can significantly reduce the time required to train large models, as it allows the
```
DeepSpeed-FastGen output with prompt "DeepSpeed is":
```
a deep learning framework that is designed to accelerate the training of large-scale neural networks. It is built on top of PyTorch and provides a set of tools and techniques for optimizing the performance of deep learning models.
DeepSpeed supports a variety of hardware accelerators, including GPUs, TPUs, and FPGAs, and can be used to train models on distributed systems, such as clusters of GPUs or TPUs.
One of the key features of DeepSpeed is its ability to automatically parallelize the training of deep learning models across multiple GPUs or TPUs. This can significantly reduce the time required to train large models, as it allows the
```
DeepSpeed-FastGen output with prompt "DeepSpeed is" with 8-way sharding:
```
a deep learning framework that is designed to accelerate the training of large-scale neural networks. It is built on top of PyTorch and provides a set of tools and techniques for optimizing the performance of deep learning models.
DeepSpeed supports a variety of hardware accelerators, including GPUs, TPUs, and FPGAs, and can be used to train models on distributed systems, such as clusters of GPUs or TPUs.
One of the key features of DeepSpeed is its ability to automatically parallelize the training of deep learning models across multiple GPUs or TPUs. This can significantly reduce the time required to train large models, as it allows the
```
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Abhishek Kulkarni <11399+adk9@users.noreply.github.com>
SP is a fantastic piece of work: it is very elegant and concise. At the current stage, a transformer layer's forward and backward passes involve 8 all-to-all operations, with 5 opportunities for overlapping communication with computation:
- Forward pass: the QKV matrix operations can be pipelined alongside some of the all-to-all communications.
- Backward pass: the DQ, DK, and DV all-to-all communications can be pipelined alongside matrix operations.
- Backward pass: DO_w can run in parallel with DO_input, overlapping matrix operations with all-to-all communications. Similar overlap-comm strategies are used in Megatron for TP/TP-sp parallelism.
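To make the pattern concrete, here is a minimal sketch of overlapping an all-to-all with an independent matrix operation, assuming an initialized torch.distributed process group; the tensor names are hypothetical and this is not the actual SP implementation:

```python
import torch
import torch.distributed as dist

def overlapped_step(comm_buf: torch.Tensor, x: torch.Tensor, w: torch.Tensor):
    """Overlap an all-to-all with an independent matmul (illustrative only)."""
    out = torch.empty_like(comm_buf)
    # Launch the all-to-all asynchronously so it proceeds in the background.
    work = dist.all_to_all_single(out, comm_buf, async_op=True)
    # An independent matrix operation (e.g. part of the QKV projection)
    # executes while the communication is in flight.
    y = x @ w
    # Wait only when the all-to-all result is actually needed.
    work.wait()
    return out, y
```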
I tested under 1N8C with ZeRO-1, activation checkpointing disabled, ds-sp=8, and gbs=16, on two settings:
- 1B, 64K
- 7B, 16K
Both showed over 10% improvement (I also found that for Mega-DS, using split QKV by itself can improve performance by removing slice + cat operations in fwd/bwd), even though some of the TFLOPs numbers were already at a relatively good level.
This works together with https://github.com/microsoft/Megatron-DeepSpeed/pull/415.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Previously we obtained ROCm version information from the /opt/rocm/.info/version-dev file.
This PR modifies the code to get the ROCm version from /opt/rocm/.info/version instead, adding compatibility with the ROCm CentOS 9 docker images.
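For reference, a minimal sketch of the intended lookup; the file names come from the description above, and the fallback order is an assumption:

```python
import os

def rocm_version_from_info():
    # Prefer the new /opt/rocm/.info/version file; fall back to the old
    # version-dev file for images that only ship the latter.
    for path in ("/opt/rocm/.info/version", "/opt/rocm/.info/version-dev"):
        if os.path.isfile(path):
            with open(path) as f:
                return f.read().strip()
    return None
```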
cc: @jithunnair-amd
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
README and media for the GDS blog.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR fixes CPU Adam JIT compilation by including the `CUDA_LIB64`
path in the `extra_ldflags` list before calling `load()`.
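A minimal sketch of the idea, assuming `torch.utils.cpp_extension.load` and an illustrative `CUDA_LIB64` path; the source list and linked libraries are placeholders, not DeepSpeed's actual builder wiring:

```python
import os
from torch.utils.cpp_extension import load

CUDA_HOME = os.environ.get("CUDA_HOME", "/usr/local/cuda")  # assumed location
CUDA_LIB64 = os.path.join(CUDA_HOME, "lib64")

# Pass the CUDA library directory to the linker before JIT-compiling the op,
# so that CUDA runtime/library symbols resolve when load() links the extension.
cpu_adam = load(
    name="cpu_adam",
    sources=["csrc/adam/cpu_adam.cpp"],          # illustrative source path
    extra_ldflags=[f"-L{CUDA_LIB64}", "-lcudart"],
    verbose=True,
)
```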
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
ROCm is packaged natively on Fedora. Its install locations do not match the AMD release,
so this adds some Fedora-specific logic to find the ROCm version and uses
rocminfo when the attempts based on the AMD release layout fail.
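A rough sketch of that fallback (the paths and the rocminfo parsing are assumptions, not the exact code in this change):

```python
import re
import shutil
import subprocess

def detect_rocm_version():
    # AMD-release layout: the version is stored in a plain text file.
    try:
        with open("/opt/rocm/.info/version") as f:
            return f.read().strip()
    except OSError:
        pass
    # Fedora's native packaging: fall back to rocminfo on the PATH.
    if shutil.which("rocminfo"):
        out = subprocess.run(["rocminfo"], capture_output=True, text=True).stdout
        match = re.search(r"Runtime Version:\s*([\d.]+)", out)
        if match:
            return match.group(1)
    return None
```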
Signed-off-by: Tom Rix <trix@redhat.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The standard library needed to be updated to C++20 for CUDA 12.5 to fix
compilation issues in the op_builder.
TODO:
The fix may need to be extended to CUDA 12.4; needs testing.
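Illustrative only: a sketch of gating the C++ standard flag on the detected CUDA version (the helper is hypothetical, not the actual op_builder code):

```python
import torch

def cxx_std_flag():
    # CUDA 12.5 needs C++20 to compile the ops; older toolkits keep C++17.
    major, minor = (int(v) for v in torch.version.cuda.split(".")[:2])
    return "-std=c++20" if (major, minor) >= (12, 5) else "-std=c++17"
```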
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
This PR adds a new fused kernel for the dense GeMM using fp8-quantized weights.
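For context, here is an unfused reference of the computation, assuming a per-tensor scale; this is a hypothetical sketch, not the fused kernel:

```python
import torch

def fp8_weight_gemm_reference(x: torch.Tensor, w_fp8: torch.Tensor, scale: torch.Tensor):
    # Dequantize the fp8 weight to the activation dtype, then run a dense GeMM.
    # A fused kernel performs the dequantization inside the GeMM instead of
    # materializing the full-precision weight.
    w = w_fp8.to(x.dtype) * scale
    return x @ w.t()
```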
---------
Co-authored-by: Jeff Rasley <jeffra45@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Use host time as a workaround to replace XPU event elapsed_time; on XPU devices, using XPU events to measure time will be consolidated in IPEX 2.5.
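A minimal sketch of what such a host-time measurement can look like, assuming a `torch.xpu` backend; this is illustrative, not the actual timer code:

```python
import time
import torch

def host_timed(fn, *args):
    # Workaround: use host wall-clock time instead of XPU event elapsed_time.
    # Synchronize so queued device work does not leak into or out of the window.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.synchronize()
    start = time.perf_counter()
    out = fn(*args)
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        torch.xpu.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return out, elapsed_ms
```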
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
We encountered a performance issue when running torch.compile on a model that uses
the pipeline engine (Mixtral).
The issue was traced to the is_checkpointable function, which is called
in the engine's forward function.
This function creates a graph break under torch.compile, leading to
decreased performance (particularly since it happens on every forward
call). We propose changing how is_checkpointable is checked: precompute
and store its value before the forward call and read the stored value
inside the forward function.
With this change the graph break in the forward call is avoided, which
should lead to better performance with torch.compile.
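A simplified sketch of the approach, using a hypothetical engine class rather than the actual DeepSpeed pipeline engine:

```python
import torch
from torch.utils.checkpoint import checkpoint

class PipelineEngineSketch(torch.nn.Module):
    """Hypothetical stand-in used only to illustrate the precompute idea."""

    def __init__(self, layers, is_checkpointable):
        super().__init__()
        self.layers = torch.nn.ModuleList(layers)
        # Precompute once, before any compiled forward, so torch.compile
        # never has to trace through is_checkpointable() at runtime.
        self.checkpointable = [is_checkpointable(layer) for layer in layers]

    def forward(self, x):
        for layer, ckpt in zip(self.layers, self.checkpointable):
            # Reading a precomputed Python bool does not introduce a graph break.
            x = checkpoint(layer, x, use_reentrant=False) if ckpt else layer(x)
        return x
```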
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR aims to enable chatglm2 & chatglm3 AutoTP. Similar to phi3,
these models use a chunked MLP layer, so we adjust the weight order via the
'shard_mlp_chunk' func. Please kindly review~ Thanks!
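For readers unfamiliar with the chunked-MLP issue, here is a hedged sketch of the kind of reordering involved, assuming the fused weight stacks the gate and up projections along dim 0; the real shard_mlp_chunk logic may differ:

```python
import torch

def reorder_chunked_mlp_weight(weight: torch.Tensor, tp_size: int) -> torch.Tensor:
    # The fused MLP weight stores [gate; up] stacked along dim 0. Splitting it
    # naively across tp_size ranks would give some ranks only gate rows and
    # others only up rows, so regroup it into (gate_i, up_i) pairs per rank.
    gate, up = weight.chunk(2, dim=0)
    gate_shards = gate.chunk(tp_size, dim=0)
    up_shards = up.chunk(tp_size, dim=0)
    return torch.cat([torch.cat(pair, dim=0) for pair in zip(gate_shards, up_shards)], dim=0)
```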
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
When the model is quantized, the hidden sizes cannot be determined from
`ds_shape` and `shape` because they are one-dimensional. This PR fixes
the bug by determining the hidden sizes from `in_features` and
`out_features`.
This PR fixes #5398.
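A condensed sketch of the fallback; the helper name and attribute checks are illustrative:

```python
def get_hidden_sizes(module):
    weight = module.weight
    # Quantized weights are flattened to 1-D, so ds_shape/shape cannot be used
    # to recover (out_size, in_size); fall back to the Linear metadata instead.
    if weight.dim() > 1:
        out_size, in_size = getattr(weight, "ds_shape", weight.shape)
    else:
        in_size, out_size = module.in_features, module.out_features
    return in_size, out_size
```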
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Example: E + M + D parallel
world_size = 8
model_degree = 2
expert_degree = 4
mp_group = [0, 1], [2, 3], [4, 5], [6, 7]
expert_parallel_group = [0, 2, 4, 6], [1, 3, 5, 7]
In the original execution flow there was no drop operation before executing the
experts, and the two EP groups each performed their own all-to-all. In the end
both obtained the complete data, but ranks 0 and 1 received exactly the same
data, and likewise ranks 2 and 3, and so on.
Therefore we can drop the duplicated data before the all-to-all, and then
execute an allgather after the all-to-all to recover the complete data.
After executing the experts, the data on ranks 0 and 1 is again exactly the
same, so we can drop it, execute the all-to-all, and then execute an allgather
to obtain the complete data.
1. Non-expert uses TP, expert does not use TP: drop -> alltoall -> exec MoE ->
alltoall -> allgather
2. Both non-expert and expert use TP:
- original execution order: alltoall -> exec MoE -> allreduce -> alltoall
- optimized execution order: drop -> alltoall -> allgather -> exec MoE ->
drop -> alltoall -> allgather
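A schematic sketch of the optimized path for case 2 (both non-expert and expert use TP), assuming the TP and EP process groups are already built; this is illustrative Python, not the actual implementation:

```python
import torch
import torch.distributed as dist

def moe_forward_optimized(tokens, experts, tp_group, ep_group):
    tp_rank = dist.get_rank(tp_group)
    tp_size = dist.get_world_size(tp_group)

    # Drop: every TP rank holds an identical copy of the tokens, so each rank
    # keeps only its own 1/tp_size slice before communicating.
    local = tokens.chunk(tp_size, dim=0)[tp_rank].contiguous()

    # All-to-all across the expert-parallel group on the reduced volume.
    dispatched = torch.empty_like(local)
    dist.all_to_all_single(dispatched, local, group=ep_group)

    # Allgather across TP to rebuild the complete expert input.
    parts = [torch.empty_like(dispatched) for _ in range(tp_size)]
    dist.all_gather(parts, dispatched, group=tp_group)
    expert_out = experts(torch.cat(parts, dim=0))

    # The TP ranks now hold identical expert outputs again: drop, all-to-all
    # back across EP, then allgather across TP to recover the complete data.
    local_out = expert_out.chunk(tp_size, dim=0)[tp_rank].contiguous()
    combined = torch.empty_like(local_out)
    dist.all_to_all_single(combined, local_out, group=ep_group)
    out_parts = [torch.empty_like(combined) for _ in range(tp_size)]
    dist.all_gather(out_parts, combined, group=tp_group)
    return torch.cat(out_parts, dim=0)
```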
Signed-off-by: --local <zhiwei.tao@enflame-tech.com>
Co-authored-by: --local <zhiwei.tao@enflame-tech.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Move the global variable `warned` from
`deepspeed/runtime/zero/parameter_offload.py` to
`deepspeed/runtime/zero/utils.py` to avoid `NameError: name 'warned' is
not defined` when calling `apply_to_tensors_only()` (defined in
`deepspeed/runtime/zero/utils.py`).
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds support for Microsoft Phi-3 model to FastGen.
DeepSpeed-FastGen output with prompt "DeepSpeed is":
```
an AI-powered platform designed to optimize and scale distributed deep learning models across clusters.**
DeepSpeed is a cutting-edge AI-driven toolkit that empowers users to enhance and scale deep learning models across distributed computing environments. By harnessing the power of artificial intelligence, DeepSpeed provides innovative solutions for optimizing resource allocation, managing data synchronization, and improving model parallelism. This enables efficient scaling and execution of complex deep learning tasks, unlocking the full potential of distributed computing systems.
### Key Features of DeepSpeed:
1.
```
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Updates to the three models supported in deepspeed-fastgen since the
last Chinese README update.
Co-authored-by: weifangyuan <i.weifangyuan@yuewen.com>
This patch promotes `state` in bf16_optimizer so it is accessible in downstream
DeepSpeed use cases.
For example, without the patch we found the following issue in the
Megatron-DeepSpeed Llama showcase:
```
[rank3]: Traceback (most recent call last):
[rank3]: File "/yahao/Megatron-DeepSpeed/pretrain_gpt.py", line 356, in <module>
[rank3]: pretrain(train_valid_test_datasets_provider,
[rank3]: File "/yahao/Megatron-DeepSpeed/megatron/training.py", line 222, in pretrain
[rank3]: iteration = train(forward_step_func,
[rank3]: File "/yahao/Megatron-DeepSpeed/megatron/training.py", line 1264, in train
[rank3]: report_memory_flag = training_log(loss_dict, total_loss_dict,
[rank3]: File "/yahao/Megatron-DeepSpeed/megatron/training.py", line 999, in training_log
[rank3]: opt_stats[0] += (torch.norm(optimizer.state[param]['exp_avg_sq']).item())**2
[rank3]: AttributeError: 'BF16_Optimizer' object has no attribute 'state'
```
With the patch, the invocation passes smoothly.
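As a hedged sketch of what such a promotion can look like (not necessarily the exact change in this patch), the wrapper exposes a `state` mapping so code written against torch.optim, e.g. `optimizer.state[param]['exp_avg_sq']`, keeps resolving:

```python
from collections import defaultdict

class BF16OptimizerSketch:
    """Hypothetical stand-in for BF16_Optimizer, for illustration only."""

    def __init__(self, inner_optimizer):
        self.optimizer = inner_optimizer
        # Promote the wrapped optimizer's per-parameter state so downstream
        # code that expects a torch.optim-style .state attribute still works.
        self.state = getattr(inner_optimizer, "state", defaultdict(dict))
```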
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR allows `deepspeed.comm.inference_all_reduce()` to enter the
torch.compile graph even though it is implemented as a C++ kernel in DeepSpeed.
The previous implementation registered the `inference_all_reduce()` C++ kernel
as a pybind function so it could be called from Python code. However, a pybind
function cannot be recognized by PyTorch, so the graph breaks when
`inference_all_reduce` is called.
We address this issue by registering `inference_all_reduce` as a PyTorch custom
op, `torch.ops.deepspeed.inference_all_reduce`, so it can be built into the
PyTorch graph.
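For illustration, a minimal sketch of registering such a custom op from Python via `torch.library`; the real PR dispatches to the C++ kernel, and this Python fallback is only an assumption-laden stand-in:

```python
import torch
import torch.distributed as dist

# Define the op schema in a "deepspeed" namespace so it appears as
# torch.ops.deepspeed.inference_all_reduce and can be captured by torch.compile.
lib = torch.library.Library("deepspeed", "DEF")
lib.define("inference_all_reduce(Tensor self) -> Tensor")

def _inference_all_reduce_cpu(tensor: torch.Tensor) -> torch.Tensor:
    # Reference implementation for the sketch; the actual op calls into
    # DeepSpeed's C++ all-reduce kernel instead.
    dist.all_reduce(tensor)
    return tensor

lib.impl("inference_all_reduce", _inference_all_reduce_cpu, "CPU")
```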
The output trace code from TorchInductor:
```
class GraphModule(torch.nn.Module):
def forward(self, primals_1: "f32[5, 4]", primals_2: "f32[5]", primals_3: "f32[4, 4]"):
# File: /home/gma/DeepSpeed/deepspeed/comm/torch.py:161 in inference_all_reduce, code: return torch.ops.deepspeed.inference_all_reduce_(tensor)
inference_all_reduce: "f32[4, 4]" = torch.ops.deepspeed.inference_all_reduce.default(primals_3)
# File: /home/gma/allreduce_graph/test_allreduce.py:33 in forward, code: return self.linear(input)
permute: "f32[4, 5]" = torch.ops.aten.permute.default(primals_1, [1, 0]); primals_1 = None
addmm: "f32[4, 5]" = torch.ops.aten.addmm.default(primals_2, inference_all_reduce, permute); primals_2 = permute = None
# No stacktrace found for following nodes
copy_: "f32[4, 4]" = torch.ops.aten.copy_.default(primals_3, inference_all_reduce); primals_3 = None
return [addmm, inference_all_reduce]
```
Note that in this PR the inference_all_reduce op for CPU does not handle
multi-node setups or the FP16 data type. For FP16 data type support, we will
align with the PyTorch CPU FP16 plan. For multi-node, we are still looking at
the possibility of upstreaming oneCCL integration into PyTorch, so that we can
make use of oneCCL for multi-node tensor-parallel inference with PyTorch.
This PR is independent of
https://github.com/microsoft/DeepSpeed/pull/5571; they can work
separately or together without issue.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
We identified a memory leak when training with NVMe-offloaded optimizer
states. The issue occurs when `pipeline_write=true`: the tensors that have
been swapped out and written to NVMe are not deallocated, leading to a
memory leak.
This PR resolves the issue by deallocating the unused tensors that have been
swapped out to NVMe.
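A minimal sketch of that kind of cleanup, assuming a hypothetical list of tensors whose NVMe writes have already completed; the actual swapper bookkeeping differs:

```python
import torch

def release_swapped_out(swapped_out_tensors):
    # Once the NVMe write has completed, drop the in-memory storage so buffers
    # do not accumulate across pipeline_write steps.
    for tensor in swapped_out_tensors:
        tensor.data = torch.empty(0, dtype=tensor.dtype, device=tensor.device)
    swapped_out_tensors.clear()
```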
Co-authored-by: amaurya <am6429@cs.rit.edu>
This PR fixes the issue mentioned in
[PR5722](https://github.com/microsoft/DeepSpeed/pull/5722) that causes
the hangs in the nv-torch-latest-v100 tests.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Train {GPT, LLaMA, Phi}-like models (or any model) at ultra-low cost with
DeepSpeed Universal Checkpointing (UCP). UCP abstracts away the
complexities of saving and loading model states. See the arXiv paper, blog,
and tutorial in this PR for details.
---------
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Update the way the queue is obtained for the FusedAdam OpBuilder.
---------
Signed-off-by: baodii <di.bao@intel.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR solves
[Issue-5430](https://github.com/microsoft/DeepSpeed/issues/5430).
It enables the universal checkpoint feature for other platforms, such as the
HuggingFace Trainer, without requiring changes to the HuggingFace code.
It does this by adding an argument that allows injecting the minimal
necessary information into the state before this
[assertion](ebf82e8f3a/deepspeed/checkpoint/ds_to_universal.py (L358)).
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Abhishek Kulkarni <abkulkarni@microsoft.com>