Commit Graph

198 Commits

Author SHA1 Message Date
Olatunji Ruwase 659f6be105
Avoid security issues of subprocess shell (#6498)
Avoid security issues of `shell=True` in subprocess

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-09-11 20:07:06 +00:00
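A minimal sketch of the pattern behind the change above, not the actual DeepSpeed diff: pass an argument list (splitting any user-supplied string with `shlex`) so nothing is interpreted by a shell. The `nvcc --version` command here is just an illustrative example.

```python
import shlex
import subprocess

def query_nvcc(extra_args: str = "") -> str:
    # Unsafe variant (what the change removes): the whole string goes through a
    # shell, so metacharacters in extra_args (";", "&&", "$(...)") would execute.
    #   subprocess.check_output(f"nvcc --version {extra_args}", shell=True)

    # Safer variant: build an argument vector; nothing is shell-interpreted.
    cmd = ["nvcc", "--version"] + shlex.split(extra_args)
    return subprocess.check_output(cmd, text=True)

if __name__ == "__main__":
    print(query_nvcc())
```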
Omar Elayan c27483933d
wrap include cuda_bf16.h with ifdef BF16_AVAILABLE (#6520) 2024-09-10 16:08:50 +00:00
Joe Mayer f2739b4f72
Change GDS to 1 AIO thread (#6459)
The `numThreads` config option determines how many threads are used to
read from the file. In the CPU case these threads are created via AIO;
in the GDS case they are handled by the GDS library via cufile.json.
If we were to also create AIO threads it would have a multiplicative
effect: for example, 8 AIO threads * 8 GDS threads would mean 64 threads
reading from the file when the user really only intended 8 threads.

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-08-29 15:59:32 +00:00
Joe Mayer e2654bfd1a
Fix Type Mismatch (#6410)
`num_bytes_per_thread` was a smaller type than `file_num_bytes`, which
caused issues when dividing by `num_threads`.

Co-authored-by: jomayeri <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-08-23 23:17:38 +00:00
Joe Mayer 5f0d177fd7
DeepNVMe GDS (#5852)
PR for the GDS AIO code.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Ubuntu <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-19 04:28:50 +00:00
Logan Adams 297a6840e1
Update clang-format version from 16 to 18. (#5839)
We previously used a slightly old version of clang-format, which caused
issues when folks installed the latest version via apt or similar (rather
than via Python) to try to fix their formatting issues. Installing older
versions is also a pain, and the formatting style of the newer version
seems better.
2024-08-06 09:14:21 -07:00
Reza Yazdani 4f9506729f
Add fp8-fused gemm kernel (#5764)
This PR adds the new fused kernel for the Dense GeMM using fp8-quantized
weight.

---------

Co-authored-by: Jeff Rasley <jeffra45@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2024-07-29 11:07:00 -07:00
Ma, Guokai ec6cbb3c08
[CPU] Allow deepspeed.comm.inference_all_reduce in torch.compile graph (#5604)
This PR allows `deepspeed.comm.inference_all_reduce()` to enter the
torch.compile graph even though it is implemented as a C++ kernel in
DeepSpeed.

The previous implementation registered `inference_all_reduce()` as a
pybind function so it could be called from Python code. However, pybind
functions are not recognized by PyTorch, so the graph breaks when
`inference_all_reduce` is called.

We address this issue by registering `inference_all_reduce` as a PyTorch
custom op, `torch.ops.deepspeed.inference_all_reduce`, so it can be built
into the PyTorch graph.

The output trace code from TorchInductor:
```
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[5, 4]", primals_2: "f32[5]", primals_3: "f32[4, 4]"):
        # File: /home/gma/DeepSpeed/deepspeed/comm/torch.py:161 in inference_all_reduce, code: return torch.ops.deepspeed.inference_all_reduce_(tensor)
        inference_all_reduce: "f32[4, 4]" = torch.ops.deepspeed.inference_all_reduce.default(primals_3)

        # File: /home/gma/allreduce_graph/test_allreduce.py:33 in forward, code: return self.linear(input)
        permute: "f32[4, 5]" = torch.ops.aten.permute.default(primals_1, [1, 0]);  primals_1 = None
        addmm: "f32[4, 5]" = torch.ops.aten.addmm.default(primals_2, inference_all_reduce, permute);  primals_2 = permute = None

        # No stacktrace found for following nodes
        copy_: "f32[4, 4]" = torch.ops.aten.copy_.default(primals_3, inference_all_reduce);  primals_3 = None
        return [addmm, inference_all_reduce]
```

Note that in this PR the inference_all_reduce op for CPU does not handle
multi-node setups or the FP16 data type. For FP16 support, we will align
with the PyTorch CPU FP16 plan. For multi-node, we are still looking at
the possibility of upstreaming oneCCL integration into PyTorch, so that
we can use oneCCL for multi-node tensor-parallel inference with PyTorch.

This PR is independent of
https://github.com/microsoft/DeepSpeed/pull/5571. They can work
separately or together without issue.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-07-15 22:24:11 +00:00
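A self-contained sketch of the registration pattern described in the commit above, assuming PyTorch 2.4+ for `torch.library.custom_op`; the namespace `demo_ds` and the Python body are stand-ins for DeepSpeed's real `torch.ops.deepspeed.inference_all_reduce` C++ kernel.

```python
import torch

# Hypothetical namespace; DeepSpeed binds its C++ SHM kernel instead of this Python body.
@torch.library.custom_op("demo_ds::inference_all_reduce", mutates_args=())
def inference_all_reduce(x: torch.Tensor) -> torch.Tensor:
    # Stand-in: a real backend would all-reduce x across ranks here.
    return x.clone()

@inference_all_reduce.register_fake
def _(x: torch.Tensor) -> torch.Tensor:
    # Shape/dtype propagation lets torch.compile trace the op without running it.
    return torch.empty_like(x)

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 5)

    def forward(self, x):
        # As a registered custom op, this call is captured in the compiled graph
        # instead of causing a graph break like a plain pybind function would.
        return self.linear(torch.ops.demo_ds.inference_all_reduce(x))

compiled = torch.compile(TinyModel())
print(compiled(torch.randn(4, 4)).shape)  # torch.Size([4, 5])
```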
baodi e39229676c
update xpu fusedadam opbuilder for pytorch 2.3 (#5702)
Update the way the queue is obtained for the FusedAdam OpBuilder.

---------

Signed-off-by: baodii <di.bao@intel.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-01 12:34:11 -07:00
Liangliang-Ma 4b8a4a0729
Change source of CPUAdam for xpu accelerator (#5703)
Noting that CPUAdam for the CUDA/CPU accelerators has removed its
dependency on CUDA, we can now use the same source.
2024-06-28 12:50:36 -07:00
Ma, Guokai 19da95f783
[CPU] add fp16 support to shm inference_all_reduce (#5669)
This PR adds FP16 support to DeepSpeed SHM inference_all_reduce.
Previously only FP32 and BF16 were supported. This aligns with PyTorch
CPU support for the FP16 data type.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-06-26 18:25:15 +00:00
Costin Eseanu b3767d01d4
Fixed Windows inference build. (#5609)
Fix #2427

---------

Co-authored-by: Costin Eseanu <costineseanu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-06-24 13:39:18 -07:00
Ma, Guokai eda5075b88
[CPU] SHM based allreduce improvement for small message size (#5571)
On a CPU server, when running SHM-based allreduce for small messages, the
performance is largely dominated by synchronization latency. This latency
comes from two situations:
1. Waiting for status changes from other ranks.
2. Using `#pragma omp parallel for` to accelerate memory-bandwidth-bound
operations such as `parallel_memcpy` or `reduce`.

Each synchronization adds a little time to the allreduce latency. In the
current implementation, for small messages, 5 syncs on rank 0 are needed:
1) copy-in; 2) wait for other ranks to finish copying; 3) reduce; 4)
copy-out; 5) wait for other ranks to finish copy-out.

We redesigned the algorithm for small-message allreduce (called
`symmetric_naive_allreduce`) to have only three syncs, with each rank
doing exactly the same steps: 1) copy-in; 2) wait for other ranks to
finish copying; 3) reduce directly into the output buffer. We use a
double buffer so we can skip the last wait and go directly to the next
call using the other buffer. We have a carefully designed state check to
avoid using a global barrier among ranks.

Tests show that for message sizes < 1MB, allreduce latency is reduced by
30% to 50%. This is especially helpful for tensor-parallel decoding with
small batch sizes, where the tensor size is usually a few tens of
kilobytes.

| message size (bytes) | new method latency (us) | old method latency (us) |
|---|---|---|
| 2 | 13.34 | 20.39 |
| 4 | 13.44 | 19.57 |
| 8 | 13.70 | 19.76 |
| 16 | 13.27 | 20.43 |
| 32 | 13.42 | 19.75 |
| 64 | 13.38 | 19.80 |
| 128 | 13.70 | 19.44 |
| 256 | 13.99 | 20.33 |
| 512 | 13.91 | 20.28 |
| 1024 | 15.00 | 22.86 |
| 2048 | 15.82 | 20.93 |
| 4096 | 16.00 | 21.08 |
| 8192 | 16.31 | 21.50 |
| 16384 | 16.27 | 22.95 |
| 32768 | 16.13 | 25.17 |
| 65536 | 18.92 | 25.90 |
| 131072 | 21.12 | 27.42 |
| 262144 | 23.09 | 32.36 |
| 524288 | 32.78 | 42.80 |

Because the new method computes the same reduce value independently on
all ranks, care needs to be taken to ensure the result is identical on
every rank. We use the test at
https://github.com/delock/ds_allreduce_bench/blob/main/ds_comm_bench.py#L70
to ensure the implementation is correct.
https://github.com/delock/ds_allreduce_bench/blob/main/validate.sh is a
test script that provides better coverage.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Abhishek Kulkarni <11399+adk9@users.noreply.github.com>
2024-06-12 21:00:20 +00:00
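A single-process numpy sketch of the data flow in `symmetric_naive_allreduce` as described above, not the SHM/OpenMP implementation: every rank writes its chunk into the current double-buffer slot, then every rank reduces all chunks directly into its own output, so there is no rank-0-only reduce or copy-out. The wait points that the real flag-based state check implements are marked as comments.

```python
import numpy as np

WORLD_SIZE = 4
NUM_SLOTS = 2  # double buffer: consecutive calls alternate slots
CHUNK = 8

# Stand-in for the shared-memory region: one buffer per rank per slot.
shm = np.zeros((NUM_SLOTS, WORLD_SIZE, CHUNK), dtype=np.float32)

def symmetric_naive_allreduce(rank_inputs, call_idx):
    """Simulate one allreduce call; rank_inputs[r] is rank r's local tensor."""
    slot = call_idx % NUM_SLOTS
    # sync 1: each rank copies its data into its own region of the slot
    for rank, x in enumerate(rank_inputs):
        shm[slot, rank] = x
    # sync 2: each rank waits until every peer has finished its copy
    #         (per-rank state flags in the real implementation)
    # sync 3: each rank reduces all regions directly into its own output buffer;
    #         no copy-out or trailing barrier, because the next call uses the
    #         other slot.
    return [shm[slot].sum(axis=0) for _ in range(WORLD_SIZE)]

inputs = [np.full(CHUNK, r, dtype=np.float32) for r in range(WORLD_SIZE)]
outputs = symmetric_naive_allreduce(inputs, call_idx=0)
assert all(np.allclose(o, outputs[0]) for o in outputs)  # identical on every rank
print(outputs[0])  # 0+1+2+3 = 6 in every element
```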
Liangliang-Ma 11a62a0635
Add Compressedbackend for Onebit optimizers (#5473)
In the process of adding onebit optimizer support for XPU devices, we
noticed that across accelerators, the main difference in the
implementation of `compressed_allreduce` lies in `packbits` and
`unpackbits`: CUDA uses cupy and NPU uses torch_npu. Instead of replacing
these with XPU-only functions, we provide a CompressedBackend to do the
`compressed_allreduce` work, where users can add their own
packbits/unpackbits kernels, which is a general path for all kinds of
accelerators.

In this PR, we:
1. Add CompressedBackend for onebitAdam, onebitLamb and zerooneAdam
2. Add an XPU implementation of packbits/unpackbits in SYCL, built via
PackbitsBuilder
3. Add tests for onebit with CompressedBackend

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-06-05 20:28:46 +00:00
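A rough numpy sketch of what the packbits/unpackbits pair in `compressed_allreduce` boils down to (1-bit sign compression plus a scale); the function names are hypothetical, and the real backends implement this with cupy, torch_npu, or the SYCL kernels built by PackbitsBuilder.

```python
import numpy as np

def compress_1bit(tensor: np.ndarray):
    """Keep only the sign of each element plus a single scale value."""
    signs = tensor >= 0                 # boolean sign mask
    packed = np.packbits(signs)         # 8 signs per byte
    scale = np.abs(tensor).mean()       # scale used to reconstruct magnitudes
    return packed, scale, tensor.size

def decompress_1bit(packed: np.ndarray, scale: float, numel: int) -> np.ndarray:
    signs = np.unpackbits(packed)[:numel].astype(np.float32)
    return (2.0 * signs - 1.0) * scale  # map {0, 1} -> {-scale, +scale}

grad = np.random.randn(1000).astype(np.float32)
packed, scale, n = compress_1bit(grad)
restored = decompress_1bit(packed, scale, n)
print(packed.nbytes, "bytes on the wire vs", grad.nbytes)  # ~32x smaller
print((np.sign(restored) == np.sign(grad)).mean())         # signs are preserved
```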
Costin Eseanu e7dd28a23d
Fixed the Windows build. (#5596)
Fixed the Windows build.

Fixes applied:
- Remove some more ops that don't build on Windows.
- Remove the use of symlinks that didn't work correctly and replace with
`shutil.copytree()`.
- Small fixes to make the C++ code compile.

Tested with Python 3.9 and CUDA 12.1.

---------

Co-authored-by: Costin Eseanu <costineseanu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-05-31 22:11:10 +00:00
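A small sketch of the symlink-to-copy fallback mentioned above; `link_or_copy_tree` is a hypothetical helper rather than the actual build-script change, but it shows the idea: use a symlink where possible and fall back to `shutil.copytree()` where symlinks are unavailable (typically Windows without developer mode or admin rights).

```python
import os
import shutil
import tempfile

def link_or_copy_tree(src: str, dst: str) -> None:
    """Prefer a directory symlink; copy the tree when symlinks are not usable."""
    if os.path.exists(dst):
        return
    try:
        os.symlink(src, dst, target_is_directory=True)
    except (OSError, NotImplementedError):
        # Windows without symlink privileges lands here.
        shutil.copytree(src, dst)

# Tiny self-contained demo.
root = tempfile.mkdtemp()
src = os.path.join(root, "csrc")
os.makedirs(src)
open(os.path.join(src, "kernel.cpp"), "w").close()
link_or_copy_tree(src, os.path.join(root, "staged_csrc"))
print(os.listdir(os.path.join(root, "staged_csrc")))  # ['kernel.cpp']
```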
Logan Adams bf66acdbae
Rename files in fp_quantize op from quantize.* to fp_quantize.* (#5577)
Fixes #5535.

Todo: need to test.
2024-05-28 20:01:48 +00:00
Liran Bachar 69af361167
CPUAdam fp16 and bf16 support (#5409)
Hi, please review the following changes.
I added support for BF16 to CPUAdam. BF16, FP16 and float are supported
at compile time; the correct template is called at runtime according to
the input params' dtype.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-05-20 12:50:20 +00:00
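A conceptual Python sketch of the runtime dtype dispatch described above; the actual change selects a C++ template specialization inside the CPUAdam op, so the dispatch table and the placeholder update below are illustrative only.

```python
import torch

def _adam_update(param: torch.Tensor, grad: torch.Tensor, lr: float = 1e-3) -> None:
    # Placeholder math standing in for the vectorized Adam update.
    param.add_(grad, alpha=-lr)

# One kernel variant per dtype compiled into the op; the right one is picked at step time.
_KERNELS = {
    torch.float32:  _adam_update,
    torch.float16:  _adam_update,
    torch.bfloat16: _adam_update,
}

def cpu_adam_step(param: torch.Tensor, grad: torch.Tensor) -> None:
    kernel = _KERNELS.get(param.dtype)
    if kernel is None:
        raise TypeError(f"unsupported dtype for this sketch: {param.dtype}")
    kernel(param, grad)

p = torch.zeros(4, dtype=torch.bfloat16)
cpu_adam_step(p, torch.ones_like(p))
print(p)  # updated in bf16 without an fp32 round-trip
```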
Ramya Ramineni 76c9c69fb1
Rocm warp size fix (#5402)
This PR enables building the below extensions for AMD GPUs with warp
size 32.
- transformer_inference
- quantizer
- random_ltd


This PR works stand-alone for torch version <=2.0. For the latest
versions, https://github.com/microsoft/DeepSpeed/pull/5401 is required
to be merged in addition to this PR.

Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on NAVI3x:

**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference

Before this PR:
===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s
(0:01:09) =====

After this PR:
========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s
==========

**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer

Before this PR:
     ==== 244 failed, 8 warnings in 30.53s ====

After this PR:
    ====== 186 failed, 58 passed, 8 warnings in 8.89s ======

I could not find random_ltd related unit tests to run.

Fixes: 
https://github.com/microsoft/DeepSpeed/issues/4753
https://github.com/microsoft/DeepSpeed/issues/5474
https://github.com/ROCm/DeepSpeed/issues/68

cc: @jithunnair-amd

---------

Co-authored-by: rraminen@amd.com <rraminen>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-05-17 20:35:58 +00:00
Ramya Ramineni d3dd8e7454
rocblas -> hipblas changes for ROCm (#5401)
Fixes https://github.com/microsoft/DeepSpeed/issues/4989

In addition to this PR, the changes below are required to build the
following extensions successfully. Please note that not all unit tests
for these extensions will pass with this PR; more details on the unit
test results are below. These unit tests are skipped in CI anyway, so
they will not break the CI.
- transformer_inference
- quantizer
- random_ltd

- https://github.com/pytorch/pytorch/pull/121030
- https://github.com/microsoft/DeepSpeed/pull/5402


Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on MI200:

**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference

Before this PR: 
==== 674 failed, 622 skipped, 8 warnings, 1728 errors in 123.66s
(0:02:03) =====

After this PR:
========== 555 failed, 983 passed, 1486 skipped, 8 warnings in 14.35s
==========

**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer

Before this PR: 
==== 244 failed, 8 warnings in 48.02s ====

After this PR:
===== 187 failed, 57 passed, 8 warnings in 14.74s ====

I could not find random_ltd related unit tests to run.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-05-17 01:57:00 +00:00
Wei Fu 9b6ef9e1f0
64bit indexing fused adam (#5187)
## The Issue

Applying `FusedAdam` on large tensors will cause an error `CUDA error:
an illegal memory access was encountered`.

https://github.com/microsoft/DeepSpeed/issues/3429

https://github.com/NVIDIA/apex/issues/1654

## PR Content

Following the solution in the apex repository
(https://github.com/NVIDIA/apex/pull/1765), this changes the indexing
type to `int64` when necessary.

---------

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-04-22 19:47:00 +00:00
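A hedged sketch of the selection the fix implies; `pick_index_type` is a hypothetical helper, and the real change picks the index template parameter of the CUDA multi-tensor kernel, but the decision rule is the same: switch to 64-bit indexing once any tensor exceeds what a 32-bit offset can address.

```python
import torch

INT32_MAX = 2**31 - 1

def pick_index_type(tensors) -> str:
    """Return which indexing variant a fused kernel launch would need."""
    if any(t.numel() > INT32_MAX for t in tensors):
        return "int64"  # 32-bit offsets would overflow -> illegal memory access
    return "int32"      # cheaper indexing for the common case

params = [torch.empty(1 << 20) for _ in range(8)]
print(pick_index_type(params))  # 'int32'; a tensor with > 2**31 - 1 elements flips it to 'int64'
```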
Reza Yazdani c632ea09f8
Selective dequantization (#5375)
This PR adds new functionality to the dequantizer, called
`selective_dequantize`, which enables partially dequantizing a
3-dimensional matrix when we don't need to dequantize all the data from a
lower-bit format (like fp8/fp6) to bf16.
I also added a unit test to check its functionality.

---------

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-04-19 15:58:27 +00:00
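A rough torch sketch of the idea, using int8 affine dequantization as a stand-in for the fp8/fp6 formats and a hypothetical function name: gather only the requested slices of the 3-D quantized tensor and convert just those to bf16.

```python
import torch

def selective_dequantize_sketch(q: torch.Tensor, scale: torch.Tensor,
                                indices: torch.Tensor) -> torch.Tensor:
    """q: [B, R, C] quantized data, scale: scalar scale, indices: rows to dequantize."""
    selected = q.index_select(dim=1, index=indices)   # only the needed rows
    return selected.to(torch.bfloat16) * scale        # dequantize just that subset

q = torch.randint(-128, 127, (2, 16, 8), dtype=torch.int8)
scale = torch.tensor(0.05)
rows = torch.tensor([0, 3, 7])
print(selective_dequantize_sketch(q, scale, rows).shape)  # torch.Size([2, 3, 8])
```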
Ma, Guokai b22706a721
[CPU] Support SHM based inference_all_reduce in TorchBackend (#5391)
This PR adds an SHM-based `inference_all_reduce` kernel to the
`TorchBackend` communication backend. When running inference on a CPU
server, this path replaces the default `torch.distributed.all_reduce`,
which eventually uses the gloo backend. This PR will improve inference
performance with AutoTP when only stock PyTorch is installed, without
Intel Extension for PyTorch.

Compared with the gloo backend, the SHM-based inference_all_reduce kernel
is a more direct path and performs much better on a single node.

| message size | gloo all_reduce(ms) | SHM all_reduce(ms) |
| --- | --- | --- |
| 32MB | 30.7 | 0.65 |
| 64KB | 0.23 | 0.028 |

In text generation with bloom-3b under AutoTP, average token latency
improved 1.45x with this PR on a 2-socket Xeon node.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-04-17 18:52:36 +00:00
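A hedged sketch of the dispatch described above; `shm_all_reduce` is a stand-in for DeepSpeed's shared-memory kernel and the dtype set is an assumption, but the shape of the logic is the point: take the SHM fast path when it applies, otherwise fall back to the default `torch.distributed.all_reduce`.

```python
import torch
import torch.distributed as dist

_SHM_DTYPES = {torch.float32, torch.bfloat16}  # assumed supported set for this sketch

def shm_all_reduce(tensor: torch.Tensor) -> None:
    # Stand-in for the intra-node shared-memory kernel; delegate so the sketch still runs.
    dist.all_reduce(tensor)

def inference_all_reduce(tensor: torch.Tensor, single_node: bool) -> None:
    if single_node and tensor.dtype in _SHM_DTYPES:
        shm_all_reduce(tensor)      # fast path: shared memory on one host
    else:
        dist.all_reduce(tensor)     # fallback: default backend (e.g. gloo on CPU)

# With torch.distributed initialized: inference_all_reduce(t, single_node=True)
```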
Reza Yazdani a8b821535a
Optimize the fp-dequantizer to get high memory-BW utilization (#5373)
This PR removes the for loop inside the dequantizer kernel and uses as
many threads and blocks as needed to dequantize the quantized matrix.
The previous implementation processed one group per thread block, which
can reduce efficiency with smaller group sizes, and it also processes
more data per thread than necessary; we can use more parallelism to
improve dequantization performance.

Based on my testing results, for a 4K by 4K matrix, dequantizing from
fp8 to bf16 gives a 2.5x speedup (improving the BW efficiency from 1 TB/s
to 2.5 TB/s on an Nvidia H100 GPU).

---------

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
2024-04-10 18:50:47 +00:00
Ma, Guokai 731fd68299
CPU SHM based inference_all_reduce improve (#5320)
This PR improves SHM-based inference_all_reduce on CPU:
1. Optimize for larger message sizes, which affect the performance of
first-token generation with long context. For example, for llama2 70b
with a 1024 input sequence length, the all_reduce message size is 32MB
with a single batch.
    * Increased the SHM buffer size from 1MB/worker to 32MB/worker
* Each worker allocates its SHM buffer on its own NUMA node, instead of
rank 0 allocating SHM buffers for all other workers
* For message sizes > 1MB, a more distributed algorithm is used to spread
memory bandwidth and computation evenly among workers
2. Decouple the SHM-based collective code from the oneCCL-based code,
making it ready to integrate with other backends, i.e. the gloo backend
3. Loosen the condition under which SHM-based allreduce is used, i.e. the
message size no longer has to be divisible by 32 bytes.

The new distributed algorithm, combined with the larger per-worker SHM
buffer, brings ~3x allreduce performance improvement for 32MB messages on
a 2-socket machine.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-04-04 22:25:56 +00:00
Jeff Rasley 3fbd01ccca
FP [6,8,12] quantizer op (#5336)
Flexible-bit quantizer-dequantizer library with fp6/fp12/fp8 support

Requires Ampere+ architecture; this is due to the initial focus of this
op on `bfloat16` input types only.

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
2024-04-04 19:58:08 +00:00
Mingzhen WANG 0d9cfa01e2
[NPU]ZeRO-Infinity feature compatibility (#5077)
The NPU hardware platform does not support `O_DIRECT`.
The change is surrounded by `#if defined`, so it has no impact on other
hardware platforms such as GPUs.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-02-12 20:28:20 +00:00
yzhblind f517903162
Fix confusing width in simd_load (#4714)
I found that the width used in simd_load is different from simd_store,
which confused me.
The reason lies in the missing parentheses around the type conversion of
x in the SIMD_LOAD2 macro definition, which disrupts the intended
semantics of the width variable.
This is a quick fix for it.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-01-10 23:38:53 +00:00
Ma, Guokai d8d865f492
[Fix] Fix cpu inference UT failure (#4430)
This PR fixes the UT error described in the PR and test job below. It
skips `TestModelTask` if the dtype is not supported by the accelerator,
or `InferenceBuilder` is not implemented by the accelerator.
https://github.com/microsoft/DeepSpeed/pull/4419

https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Liangliang-Ma <1906710196@qq.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Xie Zejian <xiezej@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2024-01-08 23:03:44 +00:00
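A small sketch of the skip logic; the two probe helpers are hypothetical (the real test queries DeepSpeed's accelerator abstraction), but the pytest pattern is the same: skip `TestModelTask` cases when the accelerator cannot handle the dtype or has no InferenceBuilder.

```python
import pytest
import torch

def accelerator_supports_dtype(dtype: torch.dtype) -> bool:
    # Hypothetical probe; the real check asks the DeepSpeed accelerator object.
    return dtype in (torch.float32, torch.bfloat16)

def accelerator_has_inference_builder() -> bool:
    # Hypothetical probe; pretend the builder is implemented on this accelerator.
    return True

@pytest.mark.parametrize("dtype", [torch.float32, torch.float16, torch.bfloat16])
def test_model_task(dtype):
    if not accelerator_supports_dtype(dtype):
        pytest.skip(f"dtype {dtype} not supported by this accelerator")
    if not accelerator_has_inference_builder():
        pytest.skip("InferenceBuilder not implemented by this accelerator")
    ...  # the actual model-task assertions would run here
```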
Mingzhen WANG b596963b06
[NPU]Add ZeRO-Infinity feature for NPU (#4809)
Add the ZeRO-Infinity feature for NPU devices.
I added a new `async_io.py` in `op_builder/npu` and compile-time
preprocessing guards in `deepspeed_aio_thread.cpp` specifically for NPU,
which are isolated from other devices such as the GPU and do not affect
them.
See what we have already done in
https://github.com/microsoft/DeepSpeed/issues/4567 .

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-01-05 20:56:42 +00:00
Ma, Guokai f4f31317ed
[XPU] XPU accelerator support for Intel GPU device (#4547)
This PR includes XPU support for Intel GPU. With this PR, DeepSpeed can
support XPU devices without installing Intel Extension for DeepSpeed.

---------

Co-authored-by: Liangliang-Ma <1906710196@qq.com>
Co-authored-by: baodi <di.bao@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Yizhou Wang <yizhou.wang@intel.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2024-01-05 12:29:07 -08:00
CurryRice233 d873ce6159
[NPU] Fix npu offload bug (#4883)
There were some syntax errors in the NPU offload.

The AVX instruction set may not have been available on our server due to
environment variables; as a result, this problem was not caught by our
tests.

Sorry for the inconvenience; we will be more cautious in the next PRs.

Co-authored-by: jializheng <jializheng@huawei.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-01-02 20:58:39 +00:00
hipudding c1ba6a104f
[CANN] Support cpu offload optimizer for Ascend NPU (#4568)
Support the cpu_adam, cpu_adagrad and cpu_lion optimizers for Ascend NPU.
All these optimizers run on the host; the difference between backends is
how params are copied back to the device. This commit adds a new symbol
called "__ENABLE_CANN__", which gates code adapted to NPU. The NPU
builder adds the required header files and libraries for compiling,
according to CANN's compilation manual.
Note that there is no FusedLion implementation for NPU, so the
test_cpu_lion test case should be disabled until a FusedLion optimizer is
implemented.

Besides, when NPU is selected as the accelerator, ds_report will show
torch_npu and CANN information.

With this PR, the DeepSpeed test cases in
[huggingface/accelerate](https://github.com/huggingface/accelerate/tree/main/tests/deepspeed)
all pass.

It is part of the feature list for Ascend NPU support; see #4567

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-11-14 13:37:16 +00:00
cctry 1d1a20c5a1
Fix the openfold training. (#4657)
This PR removes the biases created as placeholders, which caused a crash
in OpenFold's training pipeline.

---------

Co-authored-by: Conglong Li <conglong.li@gmail.com>
2023-11-09 02:04:12 +00:00
Connor Holmes 38b41dffa1
DeepSpeed-FastGen (#4604)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-11-03 15:07:35 -07:00
Xie Zejian 8f168c2f8d
fix multiple definition while building evoformer (#4556)
The current builder for evoformer uses the same name for `attention.cpp`
and `attention.cu`, leading to the same intermediate filename
`attention.o`:
```shell
march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -
isystem /home/zejianxie/.conda/envs/dll/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem 
/home/zejianxie/.conda/envs/dll/include build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention_back.o
```
and
```
`attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)':
      tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): multiple definition of `attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): first defined here
      /home/zejianxie/.conda/envs/dll/bin/../lib/gcc/x86_64-conda-linux-gnu/11.4.0/../../../../x86_64-conda-linux-gnu/bin/ld: build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): multiple definition of `torch::autograd::(anonymous namespace)::graph_task_id'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): first defined here
```

I use the following to reproduce and confirm my fix works:
```
git clone https://github.com/NVIDIA/cutlass --depth 1
CUTLASS_PATH=$PWD/cutlass DS_BUILD_EVOFORMER_ATTN=1 pip install ./DeepSpeed --global-option="build_ext"
```

![image](https://github.com/microsoft/DeepSpeed/assets/41792945/9e406b37-330c-431c-8bf9-6be378dee4ff)

Co-authored-by: Conglong Li <conglong.li@gmail.com>
2023-10-26 21:48:07 +00:00
Logan Adams 869629c210
Add missing RocBlas include (#4557) 2023-10-23 21:11:04 +00:00
Logan Adams c7724c6181
Switch from HIP_PLATFORM_HCC to HIP_PLATFORM_AMD (#4539)
* Switch from HIP_PLATFORM_HCC to HIP_PLATFORM_AMD

* Merge changes and fix from #4528
2023-10-19 21:01:48 +00:00
Ramya Ramineni 3e4a587135
Added rocblas header (#4538)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-10-19 18:14:01 +00:00
Ilya Vologin beed962c25
[Bug fix] Add rope_theta for llama config (#4480)
* Add rope_theta for llama config

* Add rope_theta to bias_add_transform_0213

* Fix CI problems

* Add rope_theta to linear layer

---------

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
2023-10-19 16:48:29 +00:00
Liangliang-Ma 4fc181b010
[CCLBackend] update API (#4378)
* unify ccl.py for cpu and gpu

* handle all collectives in one function

* add corresponding functions in CCLBackend

* add corresponding pybind functions

* change device setting

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-10-12 14:17:44 +00:00
stephen youn 6c86ff393f
adding 8bit dequantization kernel for asym fine-grained block quantization in zero-inference (#4450)
* kernels added for asym fine-grained block quantization with 8bits

* formatting

* clean up the code

* rename quantize_int4.cu to quantize_intX.cu

* rename test_int4_quantization.py to test_intX_quantization.py

* "rename test_int4_quantization.py to test_intX_quantization.py"

This reverts commit 2d341405b2.

* rename

* fix after the pr comments

* increased coverage of QuantLinear test
(w/ and w/o the cuda kernels)

* formatting

---------

Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-10-11 03:01:48 +00:00
Hongjiu "Enneamer" Zhang 8e64c3b550
feat: add Lion optimizer (#4331)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-10-05 22:32:14 +00:00
Ma, Guokai 9a55291452
[CCLBackend] Using parallel memcpy for inference_all_reduce (#4404)
* use parallel version of memcpy

* include max buf size to 16MB per rank

* support any input buffer size

* fix format error
2023-10-03 13:38:20 +00:00
Liangliang-Ma 1760627eb9
Zero infinity xpu support (#4130)
* zero infinity xpu support

* remove env var depends

* client align mem

* sync with all accelerators'

* format fix

* add align in pin_memory api

* add missing brackets

* remove align

* modify pin_memory api

* modify pin_memory api to use only on align para

* change value of align bytes

* Update csrc/aio/common/deepspeed_aio_common.cpp

* add version check and change format

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-10-03 13:17:08 +00:00
cctry c58146471e
Openfold fix (#4368)
* update

* format

---------

Co-authored-by: Conglong Li <conglong.li@gmail.com>
2023-09-20 01:44:16 -07:00
Reza Yazdani 468882fb68
Add the policy to run llama model from the official repo (#4313)
* Add the llama2 support from the official llama repo

* add back commented function

* add new policy & implementation for llama2

* add some changes to inject/run the 70b llama model

* remove debugging code

* remove more debugging code

* formatting

* use num_kv only when it has positive value

* use the num_kv param only if  it is positive

* fix syntax and format errors.

* fix an issue with the float32 transform kernel

---------

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
2023-09-19 16:57:55 +00:00
Conglong Li f876d81d34
DeepSpeed4Science (#4357)
* zero++ tutorial PR (#3783)

* [Fix] _conv_flops_compute when padding is a str and stride=1 (#3169)

* fix conv_flops_compute when padding is a str when stride=1

* fix error

* change type of paddings to tuple

* fix padding calculation

* apply formatting check

---------

Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* fix interpolate flops compute (#3782)

* use `Flops Profiler` to test `model.generate()` (#2515)

* Update profiler.py

* pre-commit run --all-files

* Delete .DS_Store

* Delete .DS_Store

* Delete .DS_Store

---------

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>

* revert PR #3611 (#3786)

* bump to 0.9.6

* ZeRO++ chinese blog (#3793)

* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* remove staging trigger (#3792)

* DeepSpeed-Triton for Inference (#3748)

Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO++ (#3784)

Co-authored-by: HeyangQin <heyangqin@microsoft.com>
Co-authored-by: GuanhuaWang <alexwgh333@gmail.com>
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>

* adding zero++ to navigation panel of deepspeed.ai (#3796)

* Add ZeRO++ Japanese blog (#3797)

* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* add ZeRO++ Japanese blog

* add links

---------

Co-authored-by: HeyangQin <heyangqin@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>

* Bug Fixes for autotuner and flops profiler (#1880)

* fix autotuner when backward is not called

* fix format

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Missing strided copy for gated MLP (#3788)

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* Requires grad checking. (#3789)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump to 0.10.0

* Fix Bug in transform.cu (#3534)

* Bug fix

* Fixed formatting error

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* bug fix: triton importing error (#3799)

Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* DeepSpeed4Science (#569)

* Integrating evoformer attention

* add cutlass version check

* Update error message

* add benchmark

* Update

* Update evoformer_attn.py

* Update run_evoformer_test.py

* Update evoformer_attn.py

* Update run_evoformer_test.py

* support more GPU archs

* add copyright

* add tests

* Fix bugs

* Update benchmark

* update

* Fix nvcc macro

* clean code

* fix formatting

* fix yaml import

* skip unit test when not compatible

* fix yaml requirement

* revert changes

* update tutorial

* update

* fix formatting

* fix format

* skip evoformer attn in pre-compile-ops

* revert changes

* update tutorial

* fix cutlass check

* update tutorial

* refactor tutorial

* revise

* Updated the Megatron-DS section (#565)

* Updated the Megatron-DS section

* minor fix

* minor fix

* minor fix

* separate evoformer tutorial

* Revised the ds4science landing page (#566)

* Updated the Megatron-DS section

* minor fix

* minor fix

* minor fix

* Revised the landing page

* Revised the landing page

* Removing unused file

* fix links image position

* modify main page

* fix doc

---------

Co-authored-by: Shiyang Chen <csycfl@gmail.com>
Co-authored-by: Minjia Zhang <33713995+minjiaz@users.noreply.github.com>

---------

Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Bill Luo <50068224+zhiruiluo@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Guorun <84232793+CaffreyR@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: stephen youn <13525892+stephen-youn@users.noreply.github.com>
Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: GuanhuaWang <alexwgh333@gmail.com>
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Shiyang Chen <csycfl@gmail.com>
Co-authored-by: Minjia Zhang <33713995+minjiaz@users.noreply.github.com>
2023-09-18 22:16:08 +00:00
Alex Kogan 9bf77782b2
Fix a bug in the implementation of dequantization for inference (#3433)
* bugfix in launch_dequantize()

Get rid of `hid_cnt` and simply set #blocks to output size / #groups

* add a unit test for dequantization

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
2023-09-14 21:09:41 +00:00
Olatunji Ruwase aa4a7401f8
ZeRO-Inference refresh (#4197)
* INT4 weight only quantization (#479)

* INT4 weight only quantization

* pre commit

* fix UT

* fix UT

* fix UT

* fix UT

* fix UT

* fix UT

* fix UT

* add zero3 test

* quantize small weight first to prevent oom

* fold quantization config into ds_config

* Fix license & refactor ds_config & rebase master

* fix UT

* Moving quantization into post_init_method and add int4 dequantization kernel (#522)

* Add experimental int4 dequantize kernel

* move quantiation into post_init_method

* fix

* Refactor: move int4 code to deepspeed/inference (#528)

* Move int 4 code to deepspeed/inference

* fix

* fix

* fix

* zero++ tutorial PR (#3783)

* [Fix] _conv_flops_compute when padding is a str and stride=1 (#3169)

* fix conv_flops_compute when padding is a str when stride=1

* fix error

* change type of paddings to tuple

* fix padding calculation

* apply formatting check

---------

Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* fix interpolate flops compute (#3782)

* use `Flops Profiler` to test `model.generate()` (#2515)

* Update profiler.py

* pre-commit run --all-files

* Delete .DS_Store

* Delete .DS_Store

* Delete .DS_Store

---------

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>

* revert PR #3611 (#3786)

* bump to 0.9.6

* ZeRO++ chinese blog (#3793)

* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* remove staging trigger (#3792)

* DeepSpeed-Triton for Inference (#3748)

Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO++ (#3784)

Co-authored-by: HeyangQin <heyangqin@microsoft.com>
Co-authored-by: GuanhuaWang <alexwgh333@gmail.com>
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>

* adding zero++ to navigation panel of deepspeed.ai (#3796)

* Add ZeRO++ Japanese blog (#3797)

* zeropp chinese blog

* try better quality images

* make title larger

* even larger...

* various fix

* center captions

* more fixes

* fix format

* add ZeRO++ Japanese blog

* add links

---------

Co-authored-by: HeyangQin <heyangqin@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>

* Bug Fixes for autotuner and flops profiler (#1880)

* fix autotuner when backward is not called

* fix format

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Missing strided copy for gated MLP (#3788)

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* Requires grad checking. (#3789)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump to 0.10.0

* Fix Bug in transform.cu (#3534)

* Bug fix

* Fixed formatting error

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* bug fix: triton importing error (#3799)

Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Fix dequant bug

* Address PR feedback

* Use super() __exit__

* Fix unit tests

---------

Co-authored-by: Donglin Zhuang <donglinzhuang@outlook.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Bill Luo <50068224+zhiruiluo@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Guorun <84232793+CaffreyR@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: stephen youn <13525892+stephen-youn@users.noreply.github.com>
Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: GuanhuaWang <alexwgh333@gmail.com>
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
2023-09-11 16:19:21 +00:00
Connor Holmes 542dc0d5cb
AMD Kernel Compatibility Fixes (#3180)
* Guard against APIs not available on AMD in reduction_utils, code cleanup

* More API alignment simplification

* Int conversion fix

* Syntax

---------

Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
2023-09-08 16:54:57 +00:00