DeepSpeed

Commit graph

Author	SHA1	Message	Date
baodi	1fdad1fa52	make xpu ops compatible with oneapi 2025.0 (#6760 ) Compatibility update for xpu ops This PR introduces changes that will make xpu ops compatible with the OneAPI 2025.0 toolkit. This is an important update that will allow us to develop and ship our most demanding models on this innovative hardware. --------- Signed-off-by: baodii <di.bao@intel.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Logan Adams <loadams@microsoft.com>	2024-11-19 17:38:27 +00:00
Joe Mayer	b692cdea47	AIO File Offsets (#6641 ) Adding the option for a file offset to the read/write functions of AIO & GDS ops. --------- Co-authored-by: jomayeri <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-11-12 16:34:17 +00:00
Xinyu Lian	99e9cbed16	Fix Type Name Inconsistency & Typo in cpu_adam (#6732 ) There is a typing error & inconsistency in cpu-adam code, while not affecting functionality, impacts code readability. Specifically, the type name `ds_params_percision_t` contains a typo ('percision'), whereas the related type name `ds_state_precision_t` is spelled correctly. I think it is beneficial to fix this typo&inconsistency to improve code readability, maintainability and further development. I have tested the corrected version of cpu_adam, and it compiles and runs successfully. Compilation Log: <img width="2560" alt="image" src="https://github.com/user-attachments/assets/b7bc307d-9c9d-4ab7-8671-34e565903ca5"> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-11-11 23:31:45 +00:00
Jagadish Krishnamoorthy	2b41d6212c	[Bug Fix] Support threads_per_head < 64 for wavefront size of 64 (#6622 ) When launching apply_rotary_pos_half kernel, only threads_per_head of 64 is supported for wavefront size of 64. This change adds support for threads_per_head < 64 such as 4, 8, 16. Fixes the issue introduced in https://github.com/microsoft/DeepSpeed/pull/5402 --------- Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Logan Adams <loadams@microsoft.com>	2024-11-04 21:51:27 +00:00
Liangliang Ma	a24cdd6b67	[XPU] [DeepNVMe] use same cpu_op_desc_t with cuda (#6645 ) We have found that #6592 uses `_pinned_tensor_mgr` to create cpu bounce buffer, which is same with what our xpu accelerator currently doing. So no need to use xpu device specific cpu_op_desc_t. In this PR: 1. remove custom csrc/xpu/aio/deepspeed_cpu_op.cpp 2. modify xpu async_io opbuilder. This issue cannot be easily done with revert #6532 , for we added some source file as last time GDS feature going in DS. So file this new PR :)	2024-10-22 14:45:05 +00:00
Joe Mayer	6eefc3d0ea	Fix Memory Leak In AIO (#6630 ) Fixing a memory leak in AIO pinned tensor as well as an incorrect function type for gds op. --------- Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>	2024-10-18 02:58:06 +00:00
Joe Mayer	a1f98bdc70	AIO CPU Locked Tensor (#6592 ) Restoring the functionality of the cpu locked tensor in the AIO library. Make async_io operator available for CPU accelerator, i.e., CPU only environment. --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-10-09 21:07:31 +00:00
Omar Elayan	645639bcf8	Rearrange inference OPS and stop using builder.load (#5490 ) This PR mainly handles all places where InferenceBuilder is used to access any op or a specific implementation for an op. Instead an op is defined, and its proper implementation is picked inside and the usage will be transparent to the user. What was done in the PR: 1) Added missing ops (added a py file with fallback mechanism) 2) Added missing fallback implementations for existing ops 3) removed all usages for builder.load and replaced them with ops instead. 4) added workspace op and inferenceContext which contains all workspace related functions and inferenceContext is the python fallback of inferenceContext in CUDA 5) a small change to softmax_context signature to fit the fallback signature. --------- Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com> Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-10-09 01:22:28 +00:00
Liangliang Ma	d45cfd3455	[XPU] Support DeepNVMe new code structure (#6532 ) In the DeepNVMe GDS update, many functions were made more abstract and some files were added. These changes break ZeRO-Infinity on XPU. To bring this feature back, this PR: 1. modifies the aio opbuilder for the new files. 2. adds a custom cpu_op_desc_t for XPU users (XPU doesn't handle buffer alignment here). --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-09-26 20:39:59 +00:00
Olatunji Ruwase	659f6be105	Avoid security issues of subprocess shell (#6498 ) Avoid security issues of `shell=True` in subprocess --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-09-11 20:07:06 +00:00
Omar Elayan	c27483933d	wrap include cuda_bf16.h with ifdef BF16_AVAILABLE (#6520 )	2024-09-10 16:08:50 +00:00
Joe Mayer	f2739b4f72	Change GDS to 1 AIO thread (#6459 ) The `numThreads` config option determines how many threads are used to read from the file. In the CPU case these threads are created via AIO, in the GDS case they are handled by the GDS library via the cufile.json. If we were to also create AIO threads it would have a multiplicative effect. Example 8 AIO threads * 8 GDS threads would be 64 threads reading from the file when the user really only intended for 8 threads. Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-08-29 15:59:32 +00:00
Joe Mayer	e2654bfd1a	Fix Type Mismatch (#6410 ) `num_bytes_per_thread` was a smaller type than `file_num_bytes`, this caused issues when dividing by `num_threads`. Co-authored-by: jomayeri <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-08-23 23:17:38 +00:00
Joe Mayer	5f0d177fd7	DeepNVMe GDS (#5852 ) PR for the GDS AIO code. --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <loadams@microsoft.com> Co-authored-by: Ubuntu <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-08-19 04:28:50 +00:00
Logan Adams	297a6840e1	Update clang-format version from 16 to 18. (#5839 ) We used a slightly old version of clang-format before, this caused issues when folks installed the latest via apt or similar rather than python to try and fix their formatting issues. Plus installing older versions is a pain and the formatting style of the newer version seems better?	2024-08-06 09:14:21 -07:00
Reza Yazdani	4f9506729f	Add fp8-fused gemm kernel (#5764 ) This PR adds the new fused kernel for the Dense GeMM using fp8-quantized weight. --------- Co-authored-by: Jeff Rasley <jeffra45@gmail.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>	2024-07-29 11:07:00 -07:00
Ma, Guokai	ec6cbb3c08	[CPU] Allow deepspeed.comm.inference_all_reduce in torch.compile graph (#5604 ) This PR allows `deepspeed.comm.inference_all_reduce()` enters torch.compile graph even it is implemented as C++ kernel in DeepSpeed. Previous implementation register `inference_all_reduce()` C++ kernel as pybind function so it can be called inside PyThon code. However pybind function cannot be recognized by PyTorch so graph breaks when `inference_all_reduce` is called. We address issue by register `inference_all_reduce` as a PyTorch custom op `torch.ops.deepspeed.inference_all_reduce`, so it can be built into PyTorch graph The output trace code from torchinductor ``` class GraphModule(torch.nn.Module): def forward(self, primals_1: "f32[5, 4]", primals_2: "f32[5]", primals_3: "f32[4, 4]"): # File: /home/gma/DeepSpeed/deepspeed/comm/torch.py:161 in inference_all_reduce, code: return torch.ops.deepspeed.inference_all_reduce_(tensor) inference_all_reduce: "f32[4, 4]" = torch.ops.deepspeed.inference_all_reduce.default(primals_3) # File: /home/gma/allreduce_graph/test_allreduce.py:33 in forward, code: return self.linear(input) permute: "f32[4, 5]" = torch.ops.aten.permute.default(primals_1, [1, 0]); primals_1 = None addmm: "f32[4, 5]" = torch.ops.aten.addmm.default(primals_2, inference_all_reduce, permute); primals_2 = permute = None # No stacktrace found for following nodes copy_: "f32[4, 4]" = torch.ops.aten.copy_.default(primals_3, inference_all_reduce); primals_3 = None return [addmm, inference_all_reduce] ``` Note in this PR the inference_all_reduce op for CPU does not handle multinode and FP16 data type. For FP16 data type support, we will align with PyTorch CPU FP16 plan. For multinode, we are still looking at the possibility to upstream oneCCL integration into PyTorch, so we are able to get use of oneCCL for multinode tensor parallel inference with PyTorch. This PR is independent to https://github.com/microsoft/DeepSpeed/pull/5571. 
They can work separately or together without issue. --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>	2024-07-15 22:24:11 +00:00
baodi	e39229676c	update xpu fusedadam opbuilder for pytorch 2.3 (#5702 ) update the way to get queue for FusedAdam OpBuilder. --------- Signed-off-by: baodii <di.bao@intel.com> Co-authored-by: Logan Adams <loadams@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-07-01 12:34:11 -07:00
Liangliang-Ma	4b8a4a0729	Change source of CPUAdam for xpu accelerator (#5703 ) Noted that cpu adam for cuda/cpu accelerator has removed the dependency of CUDA, we can now use the same source.	2024-06-28 12:50:36 -07:00
Ma, Guokai	19da95f783	[CPU] add fp16 support to shm inference_all_reduce (#5669 ) This PR adds FP16 support to DeepSpeed SHM inference_all_reduce. Previously only FP32 and BF16 were supported. This is to align with PyTorch CPU support for the FP16 datatype. --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-06-26 18:25:15 +00:00
Costin Eseanu	b3767d01d4	Fixed Windows inference build. (#5609 ) Fix #2427 --------- Co-authored-by: Costin Eseanu <costineseanu@gmail.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-06-24 13:39:18 -07:00
Ma, Guokai	eda5075b88	[CPU] SHM based allreduce improvement for small message size (#5571 ) On CPU server, when running SHM based allreduce for small messages, the performance is pretty much dominated by synchronization latency. These latency includes the following two situations: 1. Wait for status change from other ranks. 2. Use `#pragma omp parallel for` to accelerator memory bandwidth bound operations such as `parallel_memcpy` or `reduce`. Each synchronization add a little time to allreduce latency. In current implementation, for small messages, 5 syncs on rank 0 are needed. This includes: 1) copy-in; 2) wait for other ranks done copy; 3) reduce; 4) copy-out; 5) wait for other ranks finish copy-out We redesign the algorithm for small message allreduce (called `symmetric_naive_allreduce`) to have only three syncs, each rank do exactly the same steps: 1) copy-in; 2) wait for other ranks done copy; 3) reduce to output buffer directly. We use double buffer so we can skip the last wait and go directly to next call using another buffer. We have a carefully designed state check to avoid using global barrier among ranks. Test shows for message size < 1MB, allreduce latency will reduce 30% to 50%. This is especially helpful for tensor parallel decoding with small batch size, where the tensor size is usually a few 10s of KBytes. \|message size(bytes)\|new method latency(us)\|old method latency(us)\| \|---\|---\|---\| \| 2 \| 13.34\|20.39 \| 4 \| 13.44\|19.57 \| 8 \| 13.70\|19.76 \| 16 \| 13.27\|20.43 \| 32 \| 13.42\|19.75 \| 64 \| 13.38\|19.80 \| 128 \| 13.70\|19.44 \| 256 \| 13.99\|20.33 \| 512 \| 13.91\|20.28 \| 1024 \| 15.00\|22.86 \| 2048 \| 15.82\|20.93 \| 4096 \| 16.00\|21.08 \| 8192 \| 16.31\|21.50 \| 16384 \| 16.27\|22.95 \| 32768 \| 16.13\|25.17 \| 65536 \| 18.92\|25.90 \| 131072 \| 21.12\|27.42 \| 262144 \| 23.09\|32.36 \| 524288 \| 32.78\|42.80 Because the new method would compute same reduce value on all ranks. 
Caution needs to be taken to ensure the result is identical on all ranks. We use the test in the link https://github.com/delock/ds_allreduce_bench/blob/main/ds_comm_bench.py#L70 to ensure the implementation is correct. https://github.com/delock/ds_allreduce_bench/blob/main/validate.sh is a test script for better coverage. --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Abhishek Kulkarni <11399+adk9@users.noreply.github.com>	2024-06-12 21:00:20 +00:00
Liangliang-Ma	11a62a0635	Add Compressedbackend for Onebit optimizers (#5473 ) In the process of adding onebit optimizers support for XPU devices, we have noticed that for different accelerator, the main difference of implementation of `compressed_allreduce` lies on `packbits` and `unpackbits`. CUDA uses cupy and NPU uses torch_npu. Instead of replace these to xpu only functions, we provided a CompressedBackend to do the `compressed_allreduce` work where users can add their own packbits/unpackbits kernels, which is a general path for all kinds of accelerators. In this PR, we: 1. Add CompressedBackend for onebitAdam, onebitLamb and zerooneAdam 2. Add XPU implement of packbits/unpackbits with SYCL, built in PackbitsBuilder 3. Add tests for onebit with CompressedBackend --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-06-05 20:28:46 +00:00
Costin Eseanu	e7dd28a23d	Fixed the Windows build. (#5596 ) Fixed the Windows build. Fixes applied: - Remove some more ops that don't build on Windows. - Remove the use of symlinks that didn't work correctly and replace with `shutil.copytree()`. - Small fixes to make the C++ code compile. Tested with Python 3.9 and CUDA 12.1. --------- Co-authored-by: Costin Eseanu <costineseanu@gmail.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-05-31 22:11:10 +00:00
Logan Adams	bf66acdbae	Rename files in fp_quantize op from quantize.* to fp_quantize.* (#5577 ) Fixes #5535. Todo: need to test.	2024-05-28 20:01:48 +00:00
Liran Bachar	69af361167	CPUAdam fp16 and bf16 support (#5409 ) Hi. Please review the following changes I added support for BF16 to cpu adam. BF16, FP16 and float are supported at compilation time. the correct template is called at runtime according to input params dtype. --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-05-20 12:50:20 +00:00
Ramya Ramineni	76c9c69fb1	Rocm warp size fix (#5402 ) This PR enables building the below extensions for AMD GPUs with warp size 32. - transformer_inference - quantizer - random_ltd This PR works stand-alone for torch version <=2.0. For the latest versions, https://github.com/microsoft/DeepSpeed/pull/5401 is required to be merged in addition to this PR. Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2) on NAVI3x: transformer_inference: pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/transformer/inference Before this PR: ===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s (0:01:09) ===== After this PR: ========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s ========== quantizer: pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/quantizer Before this PR: ==== 244 failed, 8 warnings in 30.53s ==== After this PR: ====== 186 failed, 58 passed, 8 warnings in 8.89s ====== I could not find random_ltd related unit tests to run. Fixes: https://github.com/microsoft/DeepSpeed/issues/4753 https://github.com/microsoft/DeepSpeed/issues/5474 https://github.com/ROCm/DeepSpeed/issues/68 cc: @jithunnair-amd --------- Co-authored-by: rraminen@amd.com <rraminen> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-05-17 20:35:58 +00:00
Ramya Ramineni	d3dd8e7454	rocblas -> hipblas changes for ROCm (#5401 ) Fixes https://github.com/microsoft/DeepSpeed/issues/4989 In addition to this PR, below changes are required to build below extensions successfully. Please note that not all unit tests for these extensions will pass with this PR. More details on the unit test results are below. These unit tests are skipped in CI anyway, so they will not break the CI. - transformer_inference - quantizer - random_ltd - https://github.com/pytorch/pytorch/pull/121030 - https://github.com/microsoft/DeepSpeed/pull/5402 Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2) on MI200: transformer_inference: pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/transformer/inference Before this PR: ==== 674 failed, 622 skipped, 8 warnings, 1728 errors in 123.66s (0:02:03) ===== After this PR: ========== 555 failed, 983 passed, 1486 skipped, 8 warnings in 14.35s ========== quantizer: pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/quantizer Before this PR: ==== 244 failed, 8 warnings in 48.02s ==== After this PR: ===== 187 failed, 57 passed, 8 warnings in 14.74s ==== I could not find random_ltd related unit tests to run. --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Logan Adams <loadams@microsoft.com>	2024-05-17 01:57:00 +00:00
Wei Fu	9b6ef9e1f0	64bit indexing fused adam (#5187 ) ## The Issue Applying `FusedAdam` on large tensors will cause an error `CUDA error: an illegal memory access was encountered`. https://github.com/microsoft/DeepSpeed/issues/3429 https://github.com/NVIDIA/apex/issues/1654 ## PR Content Following the solution in the apex repository (https://github.com/NVIDIA/apex/pull/1765), changing indexing type to `int64` if necessary. --------- Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-04-22 19:47:00 +00:00
Reza Yazdani	c632ea09f8	Selective dequantization (#5375 ) This PR adds a new functionality for the dequantizer function, called `selective_dequantize`, which enables partially dequantizing a 3-dimensional matrix in case we don't need to dequantize all the data from lower bit (like fp8/fp6) to bf16. I also added a unit test to check its functionality. --------- Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-04-19 15:58:27 +00:00
Ma, Guokai	b22706a721	[CPU] Support SHM based inference_all_reduce in TorchBackend (#5391 ) This PR adds an SHM based `inference_all_reduce` kernel to the `TorchBackend` communication backend. When running inference on a CPU server, this path replaces the default `torch.distributed.all_reduce`, which eventually uses the gloo backend. This PR will improve inference performance with AutoTP when only stock PyTorch is installed without Intel Extension for PyTorch. Compared with the gloo backend, the SHM based inference_all_reduce kernel is a more direct path and performs much better on a single node. \| message size \| gloo all_reduce(ms) \| SHM all_reduce(ms) \| \| --- \| --- \| --- \| \| 32MB \| 30.7 \| 0.65 \| \| 64KB \| 0.23 \| 0.028 \| In text generation of bloom-3b with AutoTP, average token latency improved 1.45x with this PR on 2S Xeon node. --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-04-17 18:52:36 +00:00
Reza Yazdani	a8b821535a	Optimize the fp-dequantizer to get high memory-BW utilization (#5373 ) This PR removes the for loop inside the dequantizer kernel and uses as many threads and blocks as needed to dequantize the quantized matrix. The previous implementation processed each group per thread block, which can reduce efficiency with a smaller group size, and also processed more data per thread, which is unnecessary; more parallelism can improve the dequantization performance. Based on my testing results, for a 4K by 4K matrix, dequantizing from fp8 to bf16 gives 2.5x speedup (improving the BW efficiency from 1 TB/s to 2.5 TB/s on Nvidia H100 GPU). --------- Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>	2024-04-10 18:50:47 +00:00
Ma, Guokai	731fd68299	CPU SHM based inference_all_reduce improve (#5320 ) This PR improves SHM based inference_all_reduce on CPU: 1. Optimize for larger message size which affects performance of first token generation with long context. For example, for llama2 70b with 1024 input sequence length, all_reduce message size is 32MB with single batch. * Increased SHM buffer size from 1MB/worker to 32MB/worker * Each worker allocates its SHM buffer on its own NUMA node, instead of rank 0 allocating SHM buffers for all other workers * For message size > 1MB, a more distributed algorithm is used to make memory bandwidth and computation evenly distributed among workers 2. Decouple SHM based collective code from oneCCL based code, making it ready to integrate with other backends, e.g. the gloo backend 3. Loosen the condition under which SHM based allreduce is used, i.e. message size does not have to be divisible by 32 bytes. The new distributed algorithm, combined with the larger per worker SHM buffer, brings ~3x allreduce performance improvement for 32MB message size on a 2 socket machine. --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Logan Adams <loadams@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-04-04 22:25:56 +00:00
Jeff Rasley	3fbd01ccca	FP [6,8,12] quantizer op (#5336 ) Flexible-bit quantizer-dequantizer library with fp6/fp12/fp8 support Requires Ampere+ architecture, this is due to the initial focus of this op only on `bfloat16` input types. Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>	2024-04-04 19:58:08 +00:00
Mingzhen WANG	0d9cfa01e2	[NPU]ZeRO-Infinity feature compatibility (#5077 ) NPU hardware platform does not support `O_DIRECT`. Surrounded by `#if defined`, it will have no impact on other hardware platforms such as GPUs --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-02-12 20:28:20 +00:00
yzhblind	f517903162	Fix confusing width in simd_load (#4714 ) The width used in simd_load differs from the one used in simd_store, which is confusing. The cause is missing parentheses around the type conversion of x in the SIMD_LOAD2 macro definition, which disrupts the intended semantics of the width variable. This PR is a quick fix for it. Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-01-10 23:38:53 +00:00
Ma, Guokai	d8d865f492	[Fix] Fix cpu inference UT failure (#4430 ) This PR fix UT test error as described in this PR and the following test job. This PR skips `TestModelTask` if dtype is not supported by accelerator, or `InferenceBuilder` is not implemented by accelerator. https://github.com/microsoft/DeepSpeed/pull/4419 https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538 --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Liangliang-Ma <1906710196@qq.com> Co-authored-by: Quentin Anthony <qganthony@yahoo.com> Co-authored-by: Dashiell Stander <dash.stander@gmail.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com> Co-authored-by: Xie Zejian <xiezej@gmail.com> Co-authored-by: Conglong Li <conglong.li@gmail.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>	2024-01-08 23:03:44 +00:00
Mingzhen WANG	b596963b06	[NPU]Add ZeRO-Infinity feature for NPU (#4809 ) Add ZeRO-Infinity feature for NPU devices. I add a new `async_io.py` in `op_builder/npu` and compilation preprocessing judgment in `deepspeed_aio_thread.cpp` specifically for NPU, which will be isolated from other devices such as the GPU and will not affect each other. See what we have already done in https://github.com/microsoft/DeepSpeed/issues/4567 . --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-01-05 20:56:42 +00:00
Ma, Guokai	f4f31317ed	[XPU] XPU accelerator support for Intel GPU device (#4547 ) This PR includes XPU support for Intel GPU. With this PR, DeepSpeed can support XPU devices without install Intel Extension for DeepSpeed. --------- Co-authored-by: Liangliang-Ma <1906710196@qq.com> Co-authored-by: baodi <di.bao@intel.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yizhou Wang <yizhou.wang@intel.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>	2024-01-05 12:29:07 -08:00
CurryRice233	d873ce6159	[NPU] Fix npu offload bug (#4883 ) There are some syntax errors in the NPU offload. There may be no AVX instruction set on our server due to environment variables, as a result, this problem is not verified in our tests. Sorry for the inconvenience and we will be more cautious in the next PRs. Co-authored-by: jializheng <jializheng@huawei.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-01-02 20:58:39 +00:00
hipudding	c1ba6a104f	[CANN] Support cpu offload optimizer for Ascend NPU (#4568 ) Support cpu_adam, cpu_adagrad and cpu_lion optimizers for Ascend NPU. All these optimizers run on the host; the difference between backends is the way params are copied back to the device. This commit adds a new symbol called "__ENABLE_CANN__". This symbol enables compiling code adapted to NPU. The NPU builder adds the required header files and libraries for compiling, according to CANN's compilation manual. Note that there is no FusedLion implementation for NPU, so the test_cpu_lion test case should be disabled until the FusedLion optimizer is implemented. Besides, when NPU is selected as the accelerator, ds_report will show torch_npu and CANN information. With this PR, the deepspeed test cases in [huggingface/accelerate](https://github.com/huggingface/accelerate/tree/main/tests/deepspeed) all pass. It's a part of the feature list for Ascend NPU support, @see #4567 --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2023-11-14 13:37:16 +00:00
cctry	1d1a20c5a1	Fix the openfold training. (#4657 ) This PR removes the bias created as placeholders, which causes a crash in openfold's training pipeline. --------- Co-authored-by: Conglong Li <conglong.li@gmail.com>	2023-11-09 02:04:12 +00:00
Connor Holmes	38b41dffa1	DeepSpeed-FastGen (#4604 ) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com> Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2023-11-03 15:07:35 -07:00
Xie Zejian	8f168c2f8d	fix multiple definition while building evoformer (#4556 ) Current builder for evoformer use the same name for `attention.cpp` and `attention.cu`, leading to same intermediate filename `attention.o`: ```shell march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe - isystem /home/zejianxie/.conda/envs/dll/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/zejianxie/.conda/envs/dll/include build/temp.linux-x86_64-cpython- 310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython- 310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython- 310/csrc/deepspeed4science/evoformer_attn/attention_back.o ``` and ``` `attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)': tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): multiple definition of `attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): first defined here /home/zejianxie/.conda/envs/dll/bin/../lib/gcc/x86_64-conda-linux-gnu/11.4.0/../../../../x86_64-conda-linux-gnu/bin/ld: build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): multiple definition of `torch::autograd::(anonymous namespace)::graph_task_id'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): first defined here ``` I use following to reproduce and confirm my fix works: ``` git clone https://github.com/NVIDIA/cutlass --depth 1 CUTLASS_PATH=$PWD/cutlass DS_BUILD_EVOFORMER_ATTN=1 pip install ./DeepSpeed --global-option="build_ext" ``` ![image](https://github.com/microsoft/DeepSpeed/assets/41792945/9e406b37-330c-431c-8bf9-6be378dee4ff) 
Co-authored-by: Conglong Li <conglong.li@gmail.com>	2023-10-26 21:48:07 +00:00
Logan Adams	869629c210	Add missing RocBlas include (#4557 )	2023-10-23 21:11:04 +00:00
Logan Adams	c7724c6181	Switch from HIP_PLATFORM_HCC to HIP_PLATFORM_AMD (#4539 ) * Switch from HIP_PLATFORM_HCC to HIP_PLATFORM_AMD * Merge changes and fix from #4528	2023-10-19 21:01:48 +00:00
Ramya Ramineni	3e4a587135	Added rocblas header (#4538 ) Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2023-10-19 18:14:01 +00:00
Ilya Vologin	beed962c25	[Bug fix] Add rope_theta for llama config (#4480 ) * Add rope_theta for llama config * Add rope_theta to bias_add_transform_0213 * Fix CI problems * Add rope_theta to linear layer --------- Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>	2023-10-19 16:48:29 +00:00
Liangliang-Ma	4fc181b010	[CCLBackend] update API (#4378 ) * unify ccl.py for cpu and gpu * handle all coll in one func * add corresponding func in cclbackend * add corresponding pybind func * change device setting --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2023-10-12 14:17:44 +00:00
stephen youn	6c86ff393f	adding 8bit dequantization kernel for asym fine-grained block quantization in zero-inference (#4450 ) * kernels added for asym fine-grained block quantization with 8bits * formatting * clean up the code * rename quantize_int4.cu to quantize_intX.cu * rename test_int4_quantization.py to test_intX_quantization.py * "rename test_int4_quantization.py to test_intX_quantization.py" This reverts commit `2d341405b2`. * rename * fix after the pr comments * increased coverage of QuantLinear test (w/ and w/o the cuda kernels) * formatting --------- Co-authored-by: Stephen Youn <styoun@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2023-10-11 03:01:48 +00:00
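
Note on commit e2654bfd1a (#6410): the message says `num_bytes_per_thread` had a smaller type than `file_num_bytes`, which caused issues when dividing by `num_threads`. The commit does not state the exact widths, so the sketch below simply assumes a 32-bit signed result versus a 64-bit one, and the two helper functions are hypothetical illustrations rather than DeepSpeed code. It uses Python's `ctypes` integer types, which wrap silently instead of raising on overflow, to mimic the narrowing.

```python
import ctypes


def bytes_per_thread_narrow(file_num_bytes: int, num_threads: int) -> int:
    """Hypothetical helper: store the quotient in a 32-bit signed integer.

    The 32-bit width is an assumption for illustration; the commit only
    says the type was smaller than that of file_num_bytes.
    """
    # ctypes performs no overflow checking, so the value wraps to 32 bits.
    return ctypes.c_int32(file_num_bytes // num_threads).value


def bytes_per_thread_wide(file_num_bytes: int, num_threads: int) -> int:
    """Hypothetical helper: store the quotient in a 64-bit signed integer."""
    return ctypes.c_int64(file_num_bytes // num_threads).value


# A 16 GiB file split across 2 threads gives 8 GiB (2**33 bytes) per thread,
# which does not fit in 32 bits.
file_num_bytes = 16 * 2**30
num_threads = 2

narrow = bytes_per_thread_narrow(file_num_bytes, num_threads)
wide = bytes_per_thread_wide(file_num_bytes, num_threads)

print("narrow (assumed 32-bit):", narrow)
print("wide   (assumed 64-bit):", wide)
```

With these assumed widths the narrow result loses the high bits of the per-thread byte count while the wide result keeps them, which is the kind of mismatch a same-width type avoids.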


207 commits