> Using extra keyword arguments on `Field` is deprecated and will be
removed. Use `json_schema_extra` instead. (Extra keys: 'new_param').
Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2
Migration Guide at https://errors.pydantic.dev/2.9/migration/
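For reference, the migration the warning points to looks roughly like this (the model, field, and key names below are made up for illustration; they are not the actual DeepSpeed config fields):

```python
from pydantic import BaseModel, Field

class ExampleConfig(BaseModel):
    # Pydantic v2: arbitrary metadata goes into json_schema_extra instead of
    # being passed as an extra keyword argument to Field().
    value: int = Field(default=0, json_schema_extra={"new_param": "some_metadata"})
```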
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Successor PR to #6094:
> FutureWarning: You are using torch.load with weights_only=False (the
current default value), which uses the default pickle module implicitly.
It is possible to construct malicious pickle data which will execute
arbitrary code during unpickling (See
https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models
for more details). In a future release, the default value for
weights_only will be flipped to True. This limits the functions that
could be executed during unpickling. Arbitrary objects will no longer be
allowed to be loaded via this mode unless they are explicitly
allowlisted by the user via torch.serialization.add_safe_globals. We
recommend you start setting weights_only=True for any use case where you
don't have full control of the loaded file. Please open an issue on
GitHub for any issues related to this experimental feature.
Todo:
- [ ] Update values in non-test files to True where necessary.
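The todo item amounts to changes of this form at each call site (a minimal sketch with a placeholder path; the real call sites are spread across the repo):

```python
import torch

# Load with weights_only=True so unpickling is restricted to allowlisted types;
# keep weights_only=False only where the file is fully trusted and requires it.
state_dict = torch.load("checkpoint.pt", map_location="cpu", weights_only=True)
```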
Compatibility update for xpu ops
This PR introduces changes that will make xpu ops compatible with the
OneAPI 2025.0 toolkit. This is an important update that will allow us to
develop and ship our most demanding models on this innovative hardware.
---------
Signed-off-by: baodii <di.bao@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
It is a faster and more memory-efficient implementation of
`zero_to_fp32`.
The previous version doubled the memory usage, which caused CPU OOM for
very large models (e.g. llama 405B).
b647fb2470/deepspeed/utils/zero_to_fp32.py (L438-L441)
## How does it work?
1. **Lazy loading**: Load the checkpoint with `mmap=True`, so the weights
are mmapped rather than all storages being loaded into memory.
2. **Lazy merge**: `GatheredTensor` holds the mmapped weights and the
tensor offset. It is a memory-efficient pseudo tensor. Only when
`tensor.contiguous()` is called does it load the related weights into
memory and merge them into a single tensor.
3. **Release memory in time**: Save checkpoints shard by shard, and
release the memory once a shard is saved.
Throughout the process, only one shard of tensors is kept in memory.
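A heavily simplified sketch of the lazy-merge idea (the class name, fields, and layout below are assumptions for illustration, not the real `GatheredTensor`):

```python
import torch

class LazyGatheredTensor:
    """Pseudo tensor over mmapped ZeRO-3 flat shards; merges only on contiguous()."""

    def __init__(self, flat_shards, offsets, numels, shape):
        self.flat_shards = flat_shards  # 1-D tensors from torch.load(..., mmap=True)
        self.offsets = offsets          # start offset of this param within each shard
        self.numels = numels            # number of elements owned by each rank
        self.shape = shape

    def contiguous(self) -> torch.Tensor:
        # Only here are the mmapped storages actually read and copied, once,
        # into a single buffer that can be saved and then released.
        out = torch.empty(sum(self.numels), dtype=self.flat_shards[0].dtype)
        pos = 0
        for flat, off, n in zip(self.flat_shards, self.offsets, self.numels):
            out[pos:pos + n].copy_(flat[off:off + n])
            pos += n
        return out.view(self.shape)
```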
## How much benefit in speed and memory?
Experiments were conducted on a Linux host with 1TB of memory. Here is a
detailed comparison.
| model (old -> new)   | world size | peak memory (GB) | elapsed time (h:mm:ss) |
|----------------------|------------|------------------|------------------------|
| llama3-8B(old->new) | 8 | 90 -> 41 | 0:02:17 -> 0:01:10 |
| llama2-13B(old->new) | 8 | 146 -> 54 | 0:02:30 -> 0:01:47 |
| llama2-70B(old->new) | 16 | 789 -> 159 | 0:20:47 -> 0:20:45 |
| qwen1.5-110B(old->new) | 32 | OOM -> 217 | ? -> 0:34:21 |
| llama3-405B(old->new) | 192 | OOM -> 262 | ? -> 2:09:59 |
You can reproduce with the following scripts
```sh
# 1. install requirements
apt-get install time
# 2. prepare zero-3 checkpoints
# 3. convert zero to fp32 checkpoints
/usr/bin/time -v python zero_to_fp32.py . output_dir/ --safe_serialization
```
- **memory**: Theoretically, this PR reduces the memory cost from `2M`
to `(1/n)M`, where `M` is the memory cost of the full weights and `n` is
the number of shards.
- **speed**: The speed gain mainly comes from avoiding extra tensor
copying. The benefit may be slight.
## Impl history
-
[v1](19712a1c75 (diff-6a2ca3427fa608c387b7351359f98cfc1313be6e960cee86344ff246bf1b8326R441-R447)):
an hf_hub-compatible approach.
It was discarded due to the controversial implementation of
`data_ptr()`.
- [v2](https://github.com/microsoft/DeepSpeed/pull/6658/files): a simple
approach with `torch.empty`
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds Z3 coalesced fetch to zero optimization. Currently, some of
this logic could be reused, but it is difficult to expose it as an
optimization choice (I only discovered this logic while trying to
implement the feature).
The benefit of this approach is reduced host overhead (far fewer hooks)
during the recursive fetching of parameters (especially in fine-grained
models, such as those with a large number of MoE experts). This is
particularly helpful for host-sensitive devices (such as HPU), where it
achieved a 40% performance improvement in our customer workloads.
FYI @delock @deepcharm
---------
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
There is a typo and an inconsistency in the cpu-adam code that, while
not affecting functionality, impact code readability. Specifically, the
type name `ds_params_percision_t` contains a typo ('percision'), whereas
the related type name `ds_state_precision_t` is spelled correctly. I
think it is beneficial to fix this typo and inconsistency to improve
code readability, maintainability, and further development.
I have tested the corrected version of cpu_adam, and it compiles and
runs successfully.
Compilation Log:
<img width="2560" alt="image"
src="https://github.com/user-attachments/assets/b7bc307d-9c9d-4ab7-8671-34e565903ca5">
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.15.4
Author - @loadams
Co-authored-by: loadams <loadams@users.noreply.github.com>
Add support for testing compilation with python 3.11/3.12.
Also add the dockerfiles used to build those images.
---------
Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>
This PR is useful for updating the flake8 checks we run, but is mostly
needed to update flake8 so that it can run on the newer Python versions
included in the newer ubuntu-latest images from GitHub that we update to
in #6717.
This update is needed to eventually support running on ubuntu-24.04 from
GitHub, specifically because the Python version there is updated to 3.12,
which results in the following error: `ModuleNotFoundError: No module
named 'lib2to3'`, since that package is deprecated.
The parameter coordinator in ZeRO3 throws a "backward pass is invalid
for module in evaluation mode" error when the training mode is
unexpected, as it expects all modules to be in training mode during the
backward pass. This is an unnecessarily strict restriction.
This PR relaxes the restriction by using a single parameter coordinator
(instead of separate ones for training and evaluation modes) and
resetting the prefetch state before starting a forward pass.
Use of `is_compiling` needs to be fixed after #6663 is merged.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
When launching the apply_rotary_pos_half kernel, only threads_per_head
of 64 was supported for a wavefront size of 64.
This change adds support for threads_per_head < 64, such as 4, 8, and 16.
Fixes the issue introduced in
https://github.com/microsoft/DeepSpeed/pull/5402
---------
Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Tests in universal checkpointing were not freeing the engine after use
when `reuse_dist_env` was set to `True`, leading to memory leaks.
This PR ensures the engine is freed in the tests and enables
`reuse_dist_env`.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The Git-base model is an image-text model. After supporting the llama3.2
vision model, we set num_kv_heads dynamically.
Git-base only includes vision_config, so we need to add an attribute
check for vision_config/text_config when setting num_kv_heads.
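A minimal sketch of the kind of check described above (the helper and the fallback attribute names are assumptions, not the exact DeepSpeed code):

```python
def resolve_num_kv_heads(hf_config):
    # Prefer text_config when present (e.g. llama3.2 vision), otherwise fall
    # back to vision_config (e.g. git-base), otherwise use the top-level config.
    if hasattr(hf_config, "text_config"):
        cfg = hf_config.text_config
    elif hasattr(hf_config, "vision_config"):
        cfg = hf_config.vision_config
    else:
        cfg = hf_config
    # Assumed attribute names; some configs expose num_key_value_heads,
    # others only num_attention_heads.
    return getattr(cfg, "num_key_value_heads",
                   getattr(cfg, "num_attention_heads", None))
```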
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Hi guys,
I found there is an assertion failure when I train Hugging Face's
LoRA-based model in pipeline style.
Here are the steps I followed to create my model:
1) Load the pre-trained chatglm-6b model from huggingface, as Model_A
2) Use huggingface peft's `get_peft_model(...)` and my
`LoraConfig(...)` on Model_A to create the LoRA model, as Model_B
3) Create my own pipeline-based model Model_C from Model_B
I run Model_C on two 3090 Ti GPUs, and the assertion failure looks
like this:
```text
Traceback (most recent call last):
File "/home/ubuntu/proj/chatglm-finetuning/train_pipeline.py", line 372, in <module>
main()
File "/home/ubuntu/proj/chatglm-finetuning/train_pipeline.py", line 351, in main
loss = engine.train_batch(data_iter=train_dataloader)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 375, in train_batch
self._exec_schedule(sched)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 1375, in _exec_schedule
self._exec_instr(**cmd.kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 276, in _exec_reduce_tied_grads
dist.all_reduce(grad, group=group)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
return func(*args, **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 496, in all_reduce
return cdb.all_reduce(tensor, op, group, async_op)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 159, in all_reduce
return torch.distributed.all_reduce(tensor=tensor, op=op, group=group, async_op=async_op)
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1520, in all_reduce
_check_single_tensor(tensor, "tensor")
File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 463, in _check_single_tensor
raise RuntimeError(
RuntimeError: Invalid function argument. Expected parameter `tensor` to be of type torch.Tensor.
```
After some debugging, I found out the root cause is that my LoRA
configuration (below) only adds extra LoRA layers to the QKV-related
layers, but not to the embedding layer, so all of the embedding layer's
parameters are frozen.
```python
from peft import LoraConfig

lora_config = LoraConfig(r=8,  # copied from finetuning_lora.py
                         lora_alpha=32,
                         target_modules=["query_key_value"],
                         lora_dropout=0.1,
                         bias="none",
                         task_type="CAUSAL_LM",
                         inference_mode=False,
                         )
```
In my implementation of the pipeline-based model, I declared the
embedding layer as a tied layer. So the situation is that there are no
gradients at all for the embedding layer, yet the embedding layer, being
a tied layer, needs to be synced between the two GPUs. The gradient value
is None but is still passed to the `all_reduce` operation.
Currently, my fix is simply to add a check for whether this `grad` is None.
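The fix boils down to something like this (a simplified sketch of the idea, not the exact change in `_exec_reduce_tied_grads`; the helper and its arguments are illustrative):

```python
import torch.distributed as dist

def reduce_tied_grads(tied_params, group):
    for p in tied_params:
        if p.grad is None:
            # A frozen tied parameter (e.g. a LoRA-frozen embedding) produced
            # no gradient, so skip the collective instead of passing None.
            continue
        dist.all_reduce(p.grad, group=group)
```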
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Updated docker version to 1.18.0-latest
Note: for this update the firmware on the Gaudi2 node had to be updated
to use firmware version 1.18.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Importing `torch.compiler.is_compiling` causes an error with older
versions of PyTorch.
This PR adds a fallback for `is_compiling` that uses an equivalent
function from older PyTorch versions.
This will resolve #6656.
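A sketch of the kind of fallback this implies (the exact module path for the older-PyTorch equivalent is an assumption here and may differ from the actual patch):

```python
try:
    from torch.compiler import is_compiling  # newer PyTorch
except ImportError:
    try:
        # assumed location of an equivalent helper in older torch 2.x releases
        from torch._dynamo.external_utils import is_compiling
    except ImportError:
        def is_compiling() -> bool:
            # very old torch: compilation is never active
            return False
```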
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
In sequence_parallel (Ulysses), the sequence parallel size is
constrained by the requirement that the number of heads be divisible by
it, which prevents some models/workloads from using a specific sequence
parallel size. This PR implements uneven all-to-all heads splitting.
- Supports both batch-first (b, s, ...) and seq_len-first (s, b, ...) layouts.
- Added unit tests with numerical checks. Locally also tested with **7
heads with sp=4** and **20 heads with sp=8**, and both passed.
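The head distribution behind the uneven split can be illustrated as follows (a self-contained sketch of the counting logic only, not the actual all-to-all implementation):

```python
def split_heads_unevenly(num_heads: int, sp_size: int) -> list[int]:
    # Distribute heads across sequence-parallel ranks; the first `rem` ranks
    # carry one extra head when num_heads is not divisible by sp_size.
    base, rem = divmod(num_heads, sp_size)
    return [base + (1 if rank < rem else 0) for rank in range(sp_size)]

assert split_heads_unevenly(7, 4) == [2, 2, 2, 1]
assert split_heads_unevenly(20, 8) == [3, 3, 3, 3, 2, 2, 2, 2]
```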
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Dynamo uses FakeTensor to trace tensor ops. In some cases, this
mechanism breaks compiling with DeepSpeed.
An example can be found at
https://gist.github.com/oraluben/9b8240c2fe482eb4382453d6c97a5f76; to
see the issues, install deepspeed==0.14.4 instead of my fork.
Without this PR, llama cannot be compiled.
Detailed explanation:
1. `ZeROOrderedDict`: Dynamo uses deepcopy to copy tensors, which calls
`object.__reduce__`. When copying a `ZeROOrderedDict`, the default
implementation does not copy its `_parent_module`, which leads to
failure (see the sketch after this list).
2. `param` may be a FakeTensor that does not have `ds_status` yet;
during tracing it is fine to simply skip `register_external_parameter`,
since it should have been done well before.
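A hedged sketch of what fix 1 could look like, carrying `_parent_module` through `__reduce__` so deepcopy can reconstruct the dict (simplified and assumption-laden; not the exact patch):

```python
from collections import OrderedDict

class ZeROOrderedDict(OrderedDict):
    def __init__(self, parent_module=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._parent_module = parent_module

    def __reduce__(self):
        # OrderedDict.__reduce__ returns (cls, args, state, list_it, dict_it);
        # inject _parent_module into the constructor args so deepcopy keeps it.
        cls, _args, *rest = super().__reduce__()
        return (cls, (self._parent_module,), *rest)
```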
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Starting with torch 2.4, in
[`init_device_mesh()`](de4c2a3b4e/torch/distributed/device_mesh.py (L915)),
a device type with a GPU index, such as "cuda:0", is not allowed.
![image](https://github.com/user-attachments/assets/1ddb61bf-8a15-4e0a-9115-a3681d7f19ff)
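For illustration, the call pattern that torch 2.4 expects (assumes a process group is already initialized; this is not the exact DeepSpeed call site):

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

world_size = dist.get_world_size()
mesh = init_device_mesh("cuda", (world_size,))      # OK: bare device type
# mesh = init_device_mesh("cuda:0", (world_size,))  # rejected starting with torch 2.4
```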
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.15.3
Author - @jomayeri
Co-authored-by: jomayeri <jomayeri@users.noreply.github.com>
We have found that #6592 uses `_pinned_tensor_mgr` to create the CPU
bounce buffer, which is the same as what our XPU accelerator is
currently doing, so there is no need for the XPU-device-specific
`cpu_op_desc_t`.
In this PR:
1. remove the custom csrc/xpu/aio/deepspeed_cpu_op.cpp
2. modify the xpu async_io opbuilder.
This cannot easily be done by reverting #6532, because we added some
source files the last time the GDS feature went into DeepSpeed, so we
are filing this new PR :)
With intel-extension-for-pytorch==2.3.110 released last month, the
max1100 CI workflow can be updated too. Software versions are aligned
with #6570.
An increased CI test scope for torch/ipex 2.3 will come in a later PR.
This workflow passed on the self-hosted runner of my cloned repo.