Commit Graph

207 Commits

Author SHA1 Message Date
baodi 1fdad1fa52
make xpu ops compatible with oneapi 2025.0 (#6760)
Compatibility update for xpu ops

This PR introduces changes that will make xpu ops compatible with the
OneAPI 2025.0 toolkit. This is an important update that will allow us to
develop and ship our most demanding models on this innovative hardware.

---------

Signed-off-by: baodii <di.bao@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-11-19 17:38:27 +00:00
Joe Mayer b692cdea47
AIO File Offsets (#6641)
Adding the option for a file offset to the read/write functions of AIO &
GDS ops.
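
A minimal illustration of what a byte offset adds to a read call, using plain `os.pread` rather than the AIO/GDS handles themselves (the file name is hypothetical):

```python
# Read a 4 KiB chunk starting at a byte offset instead of at the start of the
# file; the AIO/GDS read/write entry points gain an analogous offset argument.
import os

fd = os.open("checkpoint.bin", os.O_RDONLY)  # hypothetical file
chunk = os.pread(fd, 4096, 10 * 4096)        # (fd, length, offset)
os.close(fd)
```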

---------

Co-authored-by: jomayeri <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-11-12 16:34:17 +00:00
Xinyu Lian 99e9cbed16
Fix Type Name Inconsistency & Typo in cpu_adam (#6732)
There is a typo and a naming inconsistency in the cpu-adam code that, while
not affecting functionality, impact code readability. Specifically, the
type name `ds_params_percision_t` contains a typo ('percision'), whereas
the related type name `ds_state_precision_t` is spelled correctly. Fixing
this typo and inconsistency improves code readability, maintainability,
and further development.
I have tested the corrected version of cpu_adam, and it compiles and
runs successfully.

Compilation log screenshot: https://github.com/user-attachments/assets/b7bc307d-9c9d-4ab7-8671-34e565903ca5

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-11-11 23:31:45 +00:00
Jagadish Krishnamoorthy 2b41d6212c
[Bug Fix] Support threads_per_head < 64 for wavefront size of 64 (#6622)
When launching the apply_rotary_pos_half kernel, only a threads_per_head of
64 was supported for a wavefront size of 64.
This change adds support for threads_per_head values below 64, such as 4, 8, and 16.

Fixes the issue introduced in
https://github.com/microsoft/DeepSpeed/pull/5402

---------

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-11-04 21:51:27 +00:00
Liangliang Ma a24cdd6b67
[XPU] [DeepNVMe] use same cpu_op_desc_t with cuda (#6645)
We found that #6592 uses `_pinned_tensor_mgr` to create the CPU bounce
buffer, which is the same as what our XPU accelerator currently does, so
there is no need for an XPU-specific cpu_op_desc_t.
In this PR:
1. remove the custom csrc/xpu/aio/deepspeed_cpu_op.cpp
2. modify the xpu async_io opbuilder.

This issue cannot simply be resolved by reverting #6532, because we added some
source files the last time the GDS feature went into DeepSpeed. So we file this new PR :)
2024-10-22 14:45:05 +00:00
Joe Mayer 6eefc3d0ea
Fix Memory Leak In AIO (#6630)
Fixing a memory leak in AIO pinned tensor as well as an incorrect
function type for gds op.

---------

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-10-18 02:58:06 +00:00
Joe Mayer a1f98bdc70
AIO CPU Locked Tensor (#6592)
Restores the functionality of the CPU locked tensor in the AIO library and
makes the async_io operator available for the CPU accelerator, i.e., a
CPU-only environment.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-09 21:07:31 +00:00
Omar Elayan 645639bcf8
Rearrange inference OPS and stop using builder.load (#5490)
This PR mainly handles all places where InferenceBuilder is used to
access any op or a specific implementation of an op.
Instead, each op is defined once, its proper implementation is picked
internally, and the choice is transparent to the user (a sketch of this
dispatch pattern follows below).
What was done in the PR:
1) Added missing ops (a py file with a fallback mechanism)
2) Added missing fallback implementations for existing ops
3) Removed all usages of builder.load and replaced them with ops
4) Added a workspace op and inferenceContext, which contains all
workspace-related functions; inferenceContext is the Python fallback of
inferenceContext in CUDA
5) A small change to the softmax_context signature to fit the fallback
signature.
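
A rough sketch of the dispatch pattern described above, assuming an op module that prefers the compiled kernel and otherwise falls back to a pure-PyTorch implementation (the `bias_gelu` op and the kernel attribute are illustrative, not the exact DeepSpeed layout):

```python
import torch

_inference_module = None  # compiled extension, loaded lazily; False if unavailable

def _get_module():
    global _inference_module
    if _inference_module is None:
        try:
            from deepspeed.ops.op_builder import InferenceBuilder  # assumed import path
            _inference_module = InferenceBuilder().load()
        except Exception:
            _inference_module = False  # builder not implemented/compilable: use the fallback
    return _inference_module

def bias_gelu(activations: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    module = _get_module()
    if module:
        return module.bias_gelu(activations, bias)       # compiled kernel path
    return torch.nn.functional.gelu(activations + bias)  # Python fallback, same result
```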

---------

Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-09 01:22:28 +00:00
Liangliang Ma d45cfd3455
[XPU] Support DeepNVMe new code structure (#6532)
In the DeepNVMe GDS update, many functions were made more abstract and
some files were added. These changes break ZeRO-Infinity on XPU. To
bring the feature back, this PR:
1. modifies the aio opbuilder for the new files.
2. Adds a custom cpu_op_desc_t for XPU users (XPU does not handle buffer
alignment here).

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-09-26 20:39:59 +00:00
Olatunji Ruwase 659f6be105
Avoid security issues of subprocess shell (#6498)
Avoid security issues of `shell=True` in subprocess
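
For reference, a minimal sketch of the general shape of the change, with an illustrative command and input:

```python
import subprocess

user_input = "foo; rm -rf ~"  # illustrative untrusted value

# Risky: shell=True hands the whole string to /bin/sh, so the ';' starts a second command.
# subprocess.run(f"echo {user_input}", shell=True)

# Safer: an argument list bypasses the shell; the value stays one literal argument.
subprocess.run(["echo", user_input], check=True)
```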

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-09-11 20:07:06 +00:00
Omar Elayan c27483933d
wrap include cuda_bf16.h with ifdef BF16_AVAILABLE (#6520) 2024-09-10 16:08:50 +00:00
Joe Mayer f2739b4f72
Change GDS to 1 AIO thread (#6459)
The `numThreads` config option determines how many threads are used to
read from the file. In the CPU case these threads are created via AIO,
in the GDS case they are handled by the GDS library via the cufile.json.
If we were to also create AIO threads, it would have a multiplicative
effect: for example, 8 AIO threads * 8 GDS threads would be 64 threads
reading from the file when the user really only intended 8.

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-08-29 15:59:32 +00:00
Joe Mayer e2654bfd1a
Fix Type Mismatch (#6410)
`num_bytes_per_thread` was a smaller type than `file_num_bytes`, which
caused issues when dividing by `num_threads`.

Co-authored-by: jomayeri <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-08-23 23:17:38 +00:00
Joe Mayer 5f0d177fd7
DeepNVMe GDS (#5852)
PR for the GDS AIO code.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Ubuntu <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-19 04:28:50 +00:00
Logan Adams 297a6840e1
Update clang-format version from 16 to 18. (#5839)
We used a slightly old version of clang-format before, which caused
issues when folks installed the latest via apt or similar (rather than
via Python) to try to fix their formatting issues. Installing older
versions is also a pain, and the formatting style of the newer version
seems better.
2024-08-06 09:14:21 -07:00
Reza Yazdani 4f9506729f
Add fp8-fused gemm kernel (#5764)
This PR adds the new fused kernel for the Dense GeMM using fp8-quantized
weight.

---------

Co-authored-by: Jeff Rasley <jeffra45@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2024-07-29 11:07:00 -07:00
Ma, Guokai ec6cbb3c08
[CPU] Allow deepspeed.comm.inference_all_reduce in torch.compile graph (#5604)
This PR allows `deepspeed.comm.inference_all_reduce()` to enter the
torch.compile graph even though it is implemented as a C++ kernel in DeepSpeed.

The previous implementation registered the `inference_all_reduce()` C++ kernel
as a pybind function so it could be called from Python code. However, a pybind
function cannot be recognized by PyTorch, so the graph breaks when
`inference_all_reduce` is called.

We address this by registering `inference_all_reduce` as a PyTorch custom op,
`torch.ops.deepspeed.inference_all_reduce`, so it can be built into the
PyTorch graph.

The output trace code from torchinductor:
```
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[5, 4]", primals_2: "f32[5]", primals_3: "f32[4, 4]"):
        # File: /home/gma/DeepSpeed/deepspeed/comm/torch.py:161 in inference_all_reduce, code: return torch.ops.deepspeed.inference_all_reduce_(tensor)
        inference_all_reduce: "f32[4, 4]" = torch.ops.deepspeed.inference_all_reduce.default(primals_3)

        # File: /home/gma/allreduce_graph/test_allreduce.py:33 in forward, code: return self.linear(input)
        permute: "f32[4, 5]" = torch.ops.aten.permute.default(primals_1, [1, 0]);  primals_1 = None
        addmm: "f32[4, 5]" = torch.ops.aten.addmm.default(primals_2, inference_all_reduce, permute);  primals_2 = permute = None

        # No stacktrace found for following nodes
        copy_: "f32[4, 4]" = torch.ops.aten.copy_.default(primals_3, inference_all_reduce);  primals_3 = None
        return [addmm, inference_all_reduce]
```

Note that in this PR the inference_all_reduce op for CPU does not handle
multinode or the FP16 data type. For FP16 support, we will align with the
PyTorch CPU FP16 plan. For multinode, we are still looking at the
possibility of upstreaming oneCCL integration into PyTorch, so that we can
use oneCCL for multinode tensor-parallel inference with PyTorch.

This PR is independent of
https://github.com/microsoft/DeepSpeed/pull/5571. They can work
separately or together without issue.
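
A minimal Python-level sketch of the custom-op registration approach (DeepSpeed registers the real C++ kernel; the namespace, schema, and the stand-in all-reduce body below are illustrative only):

```python
import torch
import torch.distributed as dist

lib = torch.library.Library("deepspeed", "DEF")           # op namespace (illustrative)
lib.define("inference_all_reduce(Tensor x) -> Tensor")

def _inference_all_reduce_cpu(x: torch.Tensor) -> torch.Tensor:
    out = x.clone()
    if dist.is_initialized():
        dist.all_reduce(out)                               # stand-in for the SHM/oneCCL kernel
    return out

lib.impl("inference_all_reduce", _inference_all_reduce_cpu, "CPU")
# A meta implementation lets torch.compile trace shapes without running the kernel.
lib.impl("inference_all_reduce", lambda x: torch.empty_like(x), "Meta")

y = torch.ops.deepspeed.inference_all_reduce(torch.ones(4, 4))
```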

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-07-15 22:24:11 +00:00
baodi e39229676c
update xpu fusedadam opbuilder for pytorch 2.3 (#5702)
Updates the way the queue is obtained for the FusedAdam OpBuilder.

---------

Signed-off-by: baodii <di.bao@intel.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-01 12:34:11 -07:00
Liangliang-Ma 4b8a4a0729
Change source of CPUAdam for xpu accelerator (#5703)
Since CPU Adam for the CUDA/CPU accelerators has removed its dependency on
CUDA, we can now use the same source.
2024-06-28 12:50:36 -07:00
Ma, Guokai 19da95f783
[CPU] add fp16 support to shm inference_all_reduce (#5669)
This PR adds FP16 support to DeepSpeed SHM inference_all_reduce.
Previously only FP32 and BF16 were supported. This aligns with PyTorch
CPU support for the FP16 datatype.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-06-26 18:25:15 +00:00
Costin Eseanu b3767d01d4
Fixed Windows inference build. (#5609)
Fix #2427

---------

Co-authored-by: Costin Eseanu <costineseanu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-06-24 13:39:18 -07:00
Ma, Guokai eda5075b88
[CPU] SHM based allreduce improvement for small message size (#5571)
On a CPU server, when running SHM-based allreduce for small messages, the
performance is largely dominated by synchronization latency. This latency
comes from two situations:
1. Waiting for status changes from other ranks.
2. Using `#pragma omp parallel for` to accelerate memory-bandwidth-bound
operations such as `parallel_memcpy` or `reduce`.

Each synchronization adds a little time to the allreduce latency. In the
current implementation, for small messages, 5 syncs are needed on rank 0:
1) copy-in; 2) wait for other ranks to finish copying; 3) reduce; 4)
copy-out; 5) wait for other ranks to finish copy-out.

We redesigned the small-message allreduce (called
`symmetric_naive_allreduce`) to need only three syncs, with each rank doing
exactly the same steps: 1) copy-in; 2) wait for other ranks to finish
copying; 3) reduce directly into the output buffer. We use a double buffer
so we can skip the last wait and go directly to the next call using the
other buffer, with a carefully designed state check to avoid a global
barrier among ranks (a simplified sketch follows after the table below).

Tests show that for message sizes < 1MB, allreduce latency is reduced by 30%
to 50%. This is especially helpful for tensor-parallel decoding with small
batch sizes, where the tensor size is usually a few tens of KBytes.

| message size (bytes) | new method latency (us) | old method latency (us) |
|---|---|---|
| 2 | 13.34 | 20.39 |
| 4 | 13.44 | 19.57 |
| 8 | 13.70 | 19.76 |
| 16 | 13.27 | 20.43 |
| 32 | 13.42 | 19.75 |
| 64 | 13.38 | 19.80 |
| 128 | 13.70 | 19.44 |
| 256 | 13.99 | 20.33 |
| 512 | 13.91 | 20.28 |
| 1024 | 15.00 | 22.86 |
| 2048 | 15.82 | 20.93 |
| 4096 | 16.00 | 21.08 |
| 8192 | 16.31 | 21.50 |
| 16384 | 16.27 | 22.95 |
| 32768 | 16.13 | 25.17 |
| 65536 | 18.92 | 25.90 |
| 131072 | 21.12 | 27.42 |
| 262144 | 23.09 | 32.36 |
| 524288 | 32.78 | 42.80 |

Because the new method computes the same reduced value on all ranks, care
must be taken to ensure the result is identical on all ranks. We use the
test in
https://github.com/delock/ds_allreduce_bench/blob/main/ds_comm_bench.py#L70
to ensure the implementation is correct.
https://github.com/delock/ds_allreduce_bench/blob/main/validate.sh is a
test script for better coverage.
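
Below is a much-simplified, single-process emulation of the three-sync scheme (threads stand in for ranks, `threading.Event` for the shared state flags); the real kernel works on shared memory in C++ and recycles buffers with the state checks mentioned above, which this sketch omits:

```python
import threading
import numpy as np

WORLD = 4
NBUF = 2                                           # double buffer: alternate between calls
bufs = [[np.zeros(8) for _ in range(WORLD)] for _ in range(NBUF)]
ready = [[threading.Event() for _ in range(WORLD)] for _ in range(NBUF)]

def symmetric_naive_allreduce(rank, data, call_idx):
    slot = call_idx % NBUF
    bufs[slot][rank][:] = data                     # 1) copy-in
    ready[slot][rank].set()
    for r in range(WORLD):                         # 2) wait for every rank's copy-in
        ready[slot][r].wait()
    return sum(bufs[slot])                         # 3) every rank reduces into its own output

results = [None] * WORLD
threads = [threading.Thread(
    target=lambda r=r: results.__setitem__(r, symmetric_naive_allreduce(r, np.full(8, r + 1.0), 0)))
    for r in range(WORLD)]
for t in threads: t.start()
for t in threads: t.join()
assert np.allclose(results[0], results[WORLD - 1])  # identical result on all ranks
```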

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Abhishek Kulkarni <11399+adk9@users.noreply.github.com>
2024-06-12 21:00:20 +00:00
Liangliang-Ma 11a62a0635
Add Compressedbackend for Onebit optimizers (#5473)
While adding onebit optimizer support for XPU devices, we noticed that
across accelerators the main difference in the `compressed_allreduce`
implementation lies in `packbits` and `unpackbits`: CUDA uses cupy and NPU
uses torch_npu. Instead of replacing these with XPU-only functions, we
provide a CompressedBackend to do the `compressed_allreduce` work, where
users can plug in their own packbits/unpackbits kernels. This is a general
path for all kinds of accelerators (see the sketch below).
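
To make the packbits/unpackbits role concrete, here is a tiny numpy-based illustration of the 1-bit compress/decompress step that each accelerator backend has to supply (cupy on CUDA, torch_npu on NPU, SYCL kernels for XPU); numpy stands in purely for demonstration:

```python
import numpy as np

grad = np.random.randn(1024).astype(np.float32)
scale = np.abs(grad).mean()                  # 1-bit compression keeps only signs plus a scale
bits = np.packbits(grad > 0)                 # 1024 floats -> 128 bytes on the wire
signs = np.unpackbits(bits).astype(np.float32) * 2 - 1
reconstructed = signs * scale                # what compressed_allreduce exchanges and averages
```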

In this PR, we:
1. Add CompressedBackend for onebitAdam, onebitLamb and zerooneAdam
2. Add an XPU implementation of packbits/unpackbits with SYCL, built in
PackbitsBuilder
3. Add tests for onebit with CompressedBackend

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-06-05 20:28:46 +00:00
Costin Eseanu e7dd28a23d
Fixed the Windows build. (#5596)
Fixed the Windows build.

Fixes applied:
- Remove some more ops that don't build on Windows.
- Remove the use of symlinks that didn't work correctly and replace them
with `shutil.copytree()` (see the sketch below).
- Small fixes to make the C++ code compile.

Tested with Python 3.9 and CUDA 12.1.
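
A small sketch of the symlink-to-copy substitution mentioned in the list above (paths are illustrative):

```python
import os
import shutil

src, dst = "csrc", "build/csrc"  # illustrative paths

try:
    os.symlink(os.path.abspath(src), dst, target_is_directory=True)  # may need privileges on Windows
except OSError:
    shutil.copytree(src, dst, dirs_exist_ok=True)                    # portable replacement
```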

---------

Co-authored-by: Costin Eseanu <costineseanu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-05-31 22:11:10 +00:00
Logan Adams bf66acdbae
Rename files in fp_quantize op from quantize.* to fp_quantize.* (#5577)
Fixes #5535.

Todo: need to test.
2024-05-28 20:01:48 +00:00
Liran Bachar 69af361167
CPUAdam fp16 and bf16 support (#5409)
Hi, please review the following changes.
I added BF16 support to CPU Adam. BF16, FP16 and float are supported
at compilation time; the correct template is called at runtime according
to the input params' dtype.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-05-20 12:50:20 +00:00
Ramya Ramineni 76c9c69fb1
Rocm warp size fix (#5402)
This PR enables building the below extensions for AMD GPUs with warp
size 32.
- transformer_inference
- quantizer
- random_ltd


This PR works stand-alone for torch version <=2.0. For the latest
versions, https://github.com/microsoft/DeepSpeed/pull/5401 is required
to be merged in addition to this PR.

Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on NAVI3x:

**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference

Before this PR:
===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s
(0:01:09) =====

After this PR:
========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s
==========

**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer

Before this PR:
     ==== 244 failed, 8 warnings in 30.53s ====

After this PR:
    ====== 186 failed, 58 passed, 8 warnings in 8.89s ======

I could not find random_ltd related unit tests to run.

Fixes: 
https://github.com/microsoft/DeepSpeed/issues/4753
https://github.com/microsoft/DeepSpeed/issues/5474
https://github.com/ROCm/DeepSpeed/issues/68

cc: @jithunnair-amd

---------

Co-authored-by: rraminen@amd.com <rraminen>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-05-17 20:35:58 +00:00
Ramya Ramineni d3dd8e7454
rocblas -> hipblas changes for ROCm (#5401)
Fixes https://github.com/microsoft/DeepSpeed/issues/4989

In addition to this PR, the changes below are required to build the
following extensions successfully. Please note that not all unit tests for
these extensions will pass with this PR; more details on the unit test
results are below. These unit tests are skipped in CI anyway, so they will
not break the CI.
- transformer_inference
- quantizer
- random_ltd

- https://github.com/pytorch/pytorch/pull/121030
- https://github.com/microsoft/DeepSpeed/pull/5402


Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on MI200:

**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference

Before this PR: 
==== 674 failed, 622 skipped, 8 warnings, 1728 errors in 123.66s
(0:02:03) =====

After this PR:
========== 555 failed, 983 passed, 1486 skipped, 8 warnings in 14.35s
==========

**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer

Before this PR: 
==== 244 failed, 8 warnings in 48.02s ====

After this PR:
===== 187 failed, 57 passed, 8 warnings in 14.74s ====

I could not find random_ltd related unit tests to run.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-05-17 01:57:00 +00:00
Wei Fu 9b6ef9e1f0
64bit indexing fused adam (#5187)
## The Issue

Applying `FusedAdam` on large tensors will cause an error `CUDA error:
an illegal memory access was encountered`.

https://github.com/microsoft/DeepSpeed/issues/3429

https://github.com/NVIDIA/apex/issues/1654

## PR Content

Following the solution in the apex repository
(https://github.com/NVIDIA/apex/pull/1765), this PR changes the indexing
type to `int64` when necessary.
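
A tiny host-side illustration of the condition behind the fix (2**31 - 1 is the signed 32-bit limit; the real change selects an int64-indexed kernel variant):

```python
import torch

def needs_64bit_indexing(t: torch.Tensor) -> bool:
    return t.numel() > 2**31 - 1  # offsets no longer fit in int32

p = torch.empty(3 * 10**9, dtype=torch.bfloat16, device="meta")  # ~3B elements, no allocation
assert needs_64bit_indexing(p)   # tensors like this previously hit illegal memory accesses
```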

---------

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-04-22 19:47:00 +00:00
Reza Yazdani c632ea09f8
Selective dequantization (#5375)
This PR adds new functionality to the dequantizer, called
`selective_dequantize`, which enables partially dequantizing a
3-dimensional matrix when we don't need to dequantize all the data
from a lower-bit format (like fp8/fp6) to bf16.
I also added a unit test to check its functionality.
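
A conceptual sketch of the selective path, using a uniform per-tensor scale for simplicity (the real kernel handles fp8/fp6 group-wise layouts):

```python
import torch

q = torch.randint(-127, 128, (16, 64, 64), dtype=torch.int8)  # stand-in quantized 3-D tensor
scale = 0.05

def selective_dequantize(q, scale, rows):
    # Dequantize only the requested rows of the middle dimension to bf16.
    return q[:, rows, :].to(torch.bfloat16) * scale

partial = selective_dequantize(q, scale, rows=slice(0, 8))  # touches 1/8 of the data
```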

---------

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-04-19 15:58:27 +00:00
Ma, Guokai b22706a721
[CPU] Support SHM based inference_all_reduce in TorchBackend (#5391)
This PR adds an SHM-based `inference_all_reduce` kernel to the `TorchBackend`
communication backend. When running inference on a CPU server, this path
replaces the default `torch.distributed.all_reduce`, which eventually uses
the gloo backend. This PR improves inference performance with AutoTP when
only stock PyTorch is installed, without Intel Extension for PyTorch.

Compared with the gloo backend, the SHM-based inference_all_reduce kernel is
a more direct path and performs much better on a single node.

| message size | gloo all_reduce(ms) | SHM all_reduce(ms) |
| --- | --- | --- |
| 32MB | 30.7 | 0.65 |
| 64KB | 0.23 | 0.028 |

In text generation with bloom-3b and AutoTP, average token latency
improved 1.45x with this PR on a 2-socket Xeon node.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-04-17 18:52:36 +00:00
Reza Yazdani a8b821535a
Optimize the fp-dequantizer to get high memory-BW utilization (#5373)
This PR removes the for loop inside the dequantizer kernel and uses as
many threads and blocks as needed to dequantize the quantized matrix.
The previous implementation processed one group per thread block, which
reduces efficiency with smaller group sizes and processes more data
per thread than necessary; we can use more parallelism to improve
dequantization performance.

Based on my testing results, for a 4K by 4K matrix, dequantizing from
fp8 to bf16 gives 2.5x speedup (improving the BW efficiency from 1 TB/s
to 2.5 TB/s on Nvidia H100 GPU).

---------

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
2024-04-10 18:50:47 +00:00
Ma, Guokai 731fd68299
CPU SHM based inference_all_reduce improve (#5320)
This PR improves SHM-based inference_all_reduce on CPU:
1. Optimize for larger message sizes, which affect the performance of first
token generation with long context. For example, for llama2 70b with a
1024-token input sequence, the all_reduce message size is 32MB with a
single batch.
    * Increased SHM buffer size from 1MB/worker to 32MB/worker
    * Each worker allocates its SHM buffer on its own NUMA node, instead of
rank 0 allocating SHM buffers for all other workers
    * For message sizes > 1MB, a more distributed algorithm is used to spread
memory bandwidth and computation evenly among workers
2. Decouple the SHM-based collective code from the oneCCL-based code, making
it ready to integrate with other backends, i.e. the gloo backend
3. Loosen the condition under which SHM-based allreduce is used, i.e. the
message size no longer has to be divisible by 32 bytes.

The new distributed algorithm, combined with the larger per-worker SHM
buffer, brings ~3x allreduce performance improvement for 32MB messages
on a 2-socket machine.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-04-04 22:25:56 +00:00
Jeff Rasley 3fbd01ccca
FP [6,8,12] quantizer op (#5336)
Flexible-bit quantizer-dequantizer library with fp6/fp12/fp8 support

Requires Ampere+ architecture; this is due to the initial focus of this
op on `bfloat16` input types only.

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
2024-04-04 19:58:08 +00:00
Mingzhen WANG 0d9cfa01e2
[NPU]ZeRO-Infinity feature compatibility (#5077)
The NPU hardware platform does not support `O_DIRECT`.
The change is surrounded by `#if defined`, so it has no impact on other
hardware platforms such as GPUs.
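
The C++ change guards the flag with `#if defined`; a Python analogue of the same idea, for illustration:

```python
import os

flags = os.O_RDONLY
if hasattr(os, "O_DIRECT"):   # the flag simply does not exist on unsupported platforms
    flags |= os.O_DIRECT      # request unbuffered I/O only where it is available
```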

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-02-12 20:28:20 +00:00
yzhblind f517903162
Fix confusing width in simd_load (#4714)
I found that the width used in simd_load is different from simd_store,
which is confusing.
The reason lies in the missing parentheses around the type conversion of x
in the SIMD_LOAD2 macro definition, disrupting the intended semantics of
the width variable.
This is a quick fix for it.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-01-10 23:38:53 +00:00
Ma, Guokai d8d865f492
[Fix] Fix cpu inference UT failure (#4430)
This PR fixes the UT error described in the following PR and test job. It
skips `TestModelTask` if the dtype is not supported by the accelerator, or
if `InferenceBuilder` is not implemented by the accelerator.
https://github.com/microsoft/DeepSpeed/pull/4419

https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538
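
Roughly, the skip takes this shape (the accelerator helper names are my recollection of DeepSpeed's accelerator interface and should be treated as assumptions):

```python
import pytest
import torch
from deepspeed.accelerator import get_accelerator

def skip_if_unsupported(dtype):
    if dtype == torch.float16 and not get_accelerator().is_fp16_supported():
        pytest.skip("fp16 is not supported by this accelerator")
    if dtype == torch.bfloat16 and not get_accelerator().is_bf16_supported():
        pytest.skip("bf16 is not supported by this accelerator")
```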

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Liangliang-Ma <1906710196@qq.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Xie Zejian <xiezej@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2024-01-08 23:03:44 +00:00
Mingzhen WANG b596963b06
[NPU]Add ZeRO-Infinity feature for NPU (#4809)
Adds the ZeRO-Infinity feature for NPU devices.
I add a new `async_io.py` in `op_builder/npu` and compilation
preprocessing checks in `deepspeed_aio_thread.cpp` specifically for
NPU, which are isolated from other devices such as the GPU and will
not affect them.
See what we have already done in
https://github.com/microsoft/DeepSpeed/issues/4567 .

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-01-05 20:56:42 +00:00
Ma, Guokai f4f31317ed
[XPU] XPU accelerator support for Intel GPU device (#4547)
This PR includes XPU support for Intel GPUs. With this PR, DeepSpeed can
support XPU devices without installing Intel Extension for DeepSpeed.

---------

Co-authored-by: Liangliang-Ma <1906710196@qq.com>
Co-authored-by: baodi <di.bao@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Yizhou Wang <yizhou.wang@intel.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2024-01-05 12:29:07 -08:00
CurryRice233 d873ce6159
[NPU] Fix npu offload bug (#4883)
There are some syntax errors in the NPU offload.

The AVX instruction set may be absent on our server due to environment
variables; as a result, this problem was not caught by our tests.

Sorry for the inconvenience; we will be more cautious in the next
PRs.

Co-authored-by: jializheng <jializheng@huawei.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-01-02 20:58:39 +00:00
hipudding c1ba6a104f
[CANN] Support cpu offload optimizer for Ascend NPU (#4568)
Supports the cpu_adam, cpu_adagrad and cpu_lion optimizers for Ascend NPU.
All these optimizers run on the host; the difference between backends is
the way params are copied back to the device. This commit adds a new symbol
called "__ENABLE_CANN__" that guards the code adapted to NPU.
The NPU builder adds the required header files and libraries for
compiling, according to CANN's compilation manual.
Note that there is no FusedLion implementation for NPU; the test_cpu_lion
test case should be disabled until a FusedLion optimizer is implemented.

Besides, when NPU is selected as the accelerator, ds_report will show
torch_npu and CANN information.

With this PR, the DeepSpeed test cases in
[huggingface/accelerate](https://github.com/huggingface/accelerate/tree/main/tests/deepspeed)
all pass.

It's part of the feature list for Ascend NPU support, @see #4567

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-11-14 13:37:16 +00:00
cctry 1d1a20c5a1
Fix the openfold training. (#4657)
This PR removes the bias tensors created as placeholders, which cause a
crash in openfold's training pipeline.

---------

Co-authored-by: Conglong Li <conglong.li@gmail.com>
2023-11-09 02:04:12 +00:00
Connor Holmes 38b41dffa1
DeepSpeed-FastGen (#4604)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-11-03 15:07:35 -07:00
Xie Zejian 8f168c2f8d
fix multiple definition while building evoformer (#4556)
The current builder for evoformer uses the same base name for `attention.cpp`
and `attention.cu`, leading to the same intermediate filename `attention.o`:
```shell
march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -
isystem /home/zejianxie/.conda/envs/dll/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem 
/home/zejianxie/.conda/envs/dll/include build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention_back.o
```
and
```
`attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)':
      tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): multiple definition of `attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): first defined here
      /home/zejianxie/.conda/envs/dll/bin/../lib/gcc/x86_64-conda-linux-gnu/11.4.0/../../../../x86_64-conda-linux-gnu/bin/ld: build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): multiple definition of `torch::autograd::(anonymous namespace)::graph_task_id'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): first defined here
```

I used the following to reproduce and confirm my fix works:
```
git clone https://github.com/NVIDIA/cutlass --depth 1
CUTLASS_PATH=$PWD/cutlass DS_BUILD_EVOFORMER_ATTN=1 pip install ./DeepSpeed --global-option="build_ext"
```

![image](https://github.com/microsoft/DeepSpeed/assets/41792945/9e406b37-330c-431c-8bf9-6be378dee4ff)

Co-authored-by: Conglong Li <conglong.li@gmail.com>
2023-10-26 21:48:07 +00:00
Logan Adams 869629c210
Add missing RocBlas include (#4557) 2023-10-23 21:11:04 +00:00
Logan Adams c7724c6181
Switch from HIP_PLATFORM_HCC to HIP_PLATFORM_AMD (#4539)
* Switch from HIP_PLATFORM_HCC to HIP_PLATFORM_AMD

* Merge changes and fix from #4528
2023-10-19 21:01:48 +00:00
Ramya Ramineni 3e4a587135
Added rocblas header (#4538)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-10-19 18:14:01 +00:00
Ilya Vologin beed962c25
[Bug fix] Add rope_theta for llama config (#4480)
* Add rope_theta for llama config

* Add rope_theta to bias_add_transform_0213

* Fix CI problems

* Add rope_theta to linear layer

---------

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
2023-10-19 16:48:29 +00:00
Liangliang-Ma 4fc181b010
[CCLBackend] update API (#4378)
* unify ccl.py for cpu and gpu

* handle all collectives in one function

* add corresponding functions in CCLBackend

* add corresponding pybind functions

* change device setting

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-10-12 14:17:44 +00:00
stephen youn 6c86ff393f
adding 8bit dequantization kernel for asym fine-grained block quantization in zero-inference (#4450)
* kernels added for asym fine-grained block quantization with 8bits

* formatting

* clean up the code

* rename quantize_int4.cu to quantize_intX.cu

* rename test_int4_quantization.py to test_intX_quantization.py

* "rename test_int4_quantization.py to test_intX_quantization.py"

This reverts commit 2d341405b2.

* rename

* fix after the pr comments

* increased coverage of QuantLinear test
(w/ and w/o the cuda kernels)

* formatting

---------

Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-10-11 03:01:48 +00:00