Compatibility update for xpu ops
This PR makes the XPU ops compatible with the OneAPI 2025.0 toolkit. This is an important update that allows us to develop and ship our most demanding models on this hardware.
---------
Signed-off-by: baodii <di.bao@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
There is a typo and naming inconsistency in the cpu-adam code that, while not affecting functionality, hurts readability. Specifically, the type name `ds_params_percision_t` contains a typo ('percision'), whereas the related type name `ds_state_precision_t` is spelled correctly. Fixing this typo and inconsistency improves code readability, maintainability, and further development.
I have tested the corrected version of cpu_adam, and it compiles and
runs successfully.
Compilation Log:
<img width="2560" alt="image"
src="https://github.com/user-attachments/assets/b7bc307d-9c9d-4ab7-8671-34e565903ca5">
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
When launching the apply_rotary_pos_half kernel, only threads_per_head of 64 is supported for a wavefront size of 64.
This change adds support for threads_per_head < 64, such as 4, 8, and 16.
Fixes the issue introduced in
https://github.com/microsoft/DeepSpeed/pull/5402
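For intuition only, here is a minimal sketch, not the kernel's actual launch logic (the `round_up_pow2` helper and the sample per-head work sizes are assumptions), of how a launcher might pick a power-of-two threads_per_head that still divides a 64-wide wavefront:
```python
# Illustration only: pick a power-of-two threads_per_head no larger than the
# wavefront, so values such as 4, 8 or 16 become valid on a 64-wide wavefront.
def round_up_pow2(n: int) -> int:
    p = 1
    while p < n:
        p *= 2
    return p

WAVEFRONT_SIZE = 64
for elements_per_head in (3, 7, 13, 40):   # hypothetical per-head work sizes
    threads_per_head = min(round_up_pow2(elements_per_head), WAVEFRONT_SIZE)
    assert WAVEFRONT_SIZE % threads_per_head == 0
    print(elements_per_head, "->", threads_per_head)
```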
---------
Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
We found that #6592 uses `_pinned_tensor_mgr` to create the CPU bounce buffer, which is the same as what our XPU accelerator currently does, so there is no need for the XPU-specific cpu_op_desc_t.
In this PR:
1. remove the custom csrc/xpu/aio/deepspeed_cpu_op.cpp
2. modify the xpu async_io opbuilder.
This cannot simply be done by reverting #6532, because we added some source files the last time the GDS feature went into DeepSpeed, so we file this new PR :)
Fixes a memory leak in the AIO pinned tensor as well as an incorrect function type for the gds op.
---------
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Restoring the functionality of the CPU locked tensor in the AIO library.
Make the async_io operator available for the CPU accelerator, i.e., a CPU-only environment.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This PR mainly handles all places where InferenceBuilder is used to access any op or a specific implementation of an op. Instead, an op is defined and its proper implementation is picked internally, so the usage is transparent to the user.
What was done in the PR (a fallback sketch follows the list):
1) Added missing ops (added a py file with a fallback mechanism)
2) Added missing fallback implementations for existing ops
3) Removed all usages of builder.load and replaced them with ops instead
4) Added a workspace op and inferenceContext, which contains all workspace-related functions; inferenceContext is the Python fallback of inferenceContext in CUDA
5) A small change to the softmax_context signature to fit the fallback signature
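As a rough illustration of the fallback mechanism (the module layout, `softmax_op`, and `softmax_fallback` names are assumptions, not the actual DeepSpeed file structure), an op module can try the compiled builder first and fall back to a pure-PyTorch implementation:
```python
import torch

# Sketch of a per-op fallback: prefer the compiled kernel if its builder loads,
# otherwise use a pure-PyTorch implementation so usage stays transparent.
_compiled_module = None

def _try_load_compiled():
    global _compiled_module
    if _compiled_module is None:
        try:
            from deepspeed.ops.op_builder import InferenceBuilder  # assumed import path
            _compiled_module = InferenceBuilder().load()
        except Exception:
            _compiled_module = False  # builder unavailable; remember the failure
    return _compiled_module

def softmax_fallback(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Pure-PyTorch implementation used when the compiled op is not available.
    return torch.softmax(x, dim=dim)

def softmax_op(x: torch.Tensor, dim: int = -1) -> torch.Tensor:
    mod = _try_load_compiled()
    if mod and hasattr(mod, "softmax"):
        return mod.softmax(x, dim)
    return softmax_fallback(x, dim)
```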
---------
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
In the DeepNVMe GDS update, many functions were changed to be more abstract, and some files were added. These changes broke ZeRO-Infinity on XPU. To bring this feature back, this PR:
1. modifies the aio opbuilder for the new files.
2. adds a custom cpu_op_desc_t for XPU users (XPU doesn't handle buffer alignment here).
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The `numThreads` config option determines how many threads are used to read from the file. In the CPU case these threads are created via AIO; in the GDS case they are handled by the GDS library via cufile.json. If we were to also create AIO threads, the effect would be multiplicative: for example, 8 AIO threads * 8 GDS threads would mean 64 threads reading from the file when the user really only intended 8.
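A tiny illustration of the point above (the function name and flag are assumptions, not the engine's actual code): when GDS manages its own threads, the engine should not spawn AIO reader threads on top of them.
```python
# With GDS enabled, threading is delegated to the GDS library (cufile.json),
# so spawning num_threads AIO readers on top of it would multiply the thread count.
def effective_reader_threads(num_threads: int, use_gds: bool) -> int:
    return 1 if use_gds else num_threads

print(effective_reader_threads(8, use_gds=False))  # 8 AIO reader threads
print(effective_reader_threads(8, use_gds=True))   # 1; GDS spawns its own 8, not 8 * 8 = 64
```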
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
`num_bytes_per_thread` was a smaller integer type than `file_num_bytes`, which caused issues when dividing by `num_threads`.
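A small sketch of this class of bug (the file size and thread count are made up): storing the per-thread byte count in a 32-bit field truncates once the file is large enough.
```python
import numpy as np

file_num_bytes = np.int64(6 * 1024**3)   # a 6 GiB file, held in a 64-bit counter
num_threads = np.int64(2)

# Narrowing the quotient to 32 bits silently wraps, since 3 GiB > 2**31 - 1.
num_bytes_per_thread = (file_num_bytes // num_threads).astype(np.int32)
print(int(num_bytes_per_thread))           # negative, truncated value
print(int(file_num_bytes // num_threads))  # correct value when kept in 64 bits
```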
Co-authored-by: jomayeri <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
We previously pinned a slightly old version of clang-format, which caused issues when folks installed the latest version via apt or similar (rather than via Python) to try to fix their formatting issues. In addition, installing older versions is a pain, and the formatting style of the newer version seems better.
This PR adds the new fused kernel for the Dense GeMM using fp8-quantized
weight.
---------
Co-authored-by: Jeff Rasley <jeffra45@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
This PR allows `deepspeed.comm.inference_all_reduce()` to enter the torch.compile graph even though it is implemented as a C++ kernel in DeepSpeed.
The previous implementation registered the `inference_all_reduce()` C++ kernel as a pybind function so it could be called from Python code. However, pybind functions cannot be recognized by PyTorch, so the graph breaks when `inference_all_reduce` is called.
We address this issue by registering `inference_all_reduce` as a PyTorch custom op, `torch.ops.deepspeed.inference_all_reduce`, so it can be built into the PyTorch graph.
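A minimal sketch of this registration pattern (the `ds_sketch` namespace and the Python placeholder body are assumptions; the real op is backed by the C++ kernel):
```python
import torch

# Register a schema and a CPU implementation so the op is visible to the dispatcher
# and traceable by torch.compile instead of causing a graph break.
lib = torch.library.Library("ds_sketch", "DEF")
lib.define("inference_all_reduce(Tensor tensor) -> Tensor")

def _inference_all_reduce_cpu(tensor: torch.Tensor) -> torch.Tensor:
    # Placeholder body; the real kernel performs the SHM/oneCCL all-reduce in C++.
    return tensor.clone()

lib.impl("inference_all_reduce", _inference_all_reduce_cpu, "CPU")

# The op now appears as a single node in the traced graph.
out = torch.ops.ds_sketch.inference_all_reduce(torch.ones(4, 4))
```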
The output trace code from torchinductor
```
class GraphModule(torch.nn.Module):
def forward(self, primals_1: "f32[5, 4]", primals_2: "f32[5]", primals_3: "f32[4, 4]"):
# File: /home/gma/DeepSpeed/deepspeed/comm/torch.py:161 in inference_all_reduce, code: return torch.ops.deepspeed.inference_all_reduce_(tensor)
inference_all_reduce: "f32[4, 4]" = torch.ops.deepspeed.inference_all_reduce.default(primals_3)
# File: /home/gma/allreduce_graph/test_allreduce.py:33 in forward, code: return self.linear(input)
permute: "f32[4, 5]" = torch.ops.aten.permute.default(primals_1, [1, 0]); primals_1 = None
addmm: "f32[4, 5]" = torch.ops.aten.addmm.default(primals_2, inference_all_reduce, permute); primals_2 = permute = None
# No stacktrace found for following nodes
copy_: "f32[4, 4]" = torch.ops.aten.copy_.default(primals_3, inference_all_reduce); primals_3 = None
return [addmm, inference_all_reduce]
```
Note that in this PR the inference_all_reduce op for CPU does not handle multi-node setups or the FP16 data type. For FP16 support, we will align with the PyTorch CPU FP16 plan. For multi-node, we are still looking at the possibility of upstreaming the oneCCL integration into PyTorch, so we can use oneCCL for multi-node tensor-parallel inference with PyTorch.
This PR is independent of https://github.com/microsoft/DeepSpeed/pull/5571. They can work separately or together without issue.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Update the way the queue is obtained for the FusedAdam OpBuilder.
---------
Signed-off-by: baodii <di.bao@intel.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds FP16 support to the DeepSpeed SHM inference_all_reduce. Previously only FP32 and BF16 were supported. This aligns with PyTorch CPU support for the FP16 datatype.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
On a CPU server, when running SHM-based allreduce for small messages, the performance is pretty much dominated by synchronization latency. This latency comes from two situations:
1. Waiting for status changes from other ranks.
2. Using `#pragma omp parallel for` to accelerate memory-bandwidth-bound operations such as `parallel_memcpy` or `reduce`.
Each synchronization adds a little time to the allreduce latency. In the current implementation, for small messages, 5 syncs are needed on rank 0: 1) copy-in; 2) wait for other ranks to finish copying; 3) reduce; 4) copy-out; 5) wait for other ranks to finish copy-out.
We redesigned the algorithm for small-message allreduce (called `symmetric_naive_allreduce`) to have only three syncs, with each rank doing exactly the same steps: 1) copy-in; 2) wait for other ranks to finish copying; 3) reduce directly into the output buffer. We use double buffering so we can skip the last wait and go directly to the next call using the other buffer, and a carefully designed state check avoids a global barrier among ranks.
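For illustration, here is a single-process sketch of the symmetric scheme, under the assumption that threads stand in for ranks, numpy arrays for the SHM slots, and a `threading.Barrier` for the lighter-weight per-rank state check; the double-buffering trick is omitted.
```python
import threading
import numpy as np

WORLD_SIZE, N = 4, 8
inputs = [np.random.rand(N).astype(np.float32) for _ in range(WORLD_SIZE)]
outputs = [np.zeros(N, dtype=np.float32) for _ in range(WORLD_SIZE)]
shm = [np.zeros(N, dtype=np.float32) for _ in range(WORLD_SIZE)]  # one slot per rank
ready = threading.Barrier(WORLD_SIZE)  # stands in for the per-rank state check

def symmetric_naive_allreduce(rank: int) -> None:
    shm[rank][:] = inputs[rank]                 # 1) copy-in to this rank's slot
    ready.wait()                                # 2) wait for all ranks to finish copy-in
    outputs[rank][:] = np.sum(shm, axis=0)      # 3) reduce straight into the output buffer

threads = [threading.Thread(target=symmetric_naive_allreduce, args=(r,))
           for r in range(WORLD_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every rank computed the reduction independently, so the results must match.
assert all(np.allclose(outputs[0], outputs[r]) for r in range(WORLD_SIZE))
```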
Tests show that for message sizes < 1MB, allreduce latency drops by 30% to 50%. This is especially helpful for tensor-parallel decoding with small batch sizes, where the tensor size is usually a few tens of kilobytes.
|message size (bytes)|new method latency (us)|old method latency (us)|
|---|---|---|
| 2 | 13.34 | 20.39 |
| 4 | 13.44 | 19.57 |
| 8 | 13.70 | 19.76 |
| 16 | 13.27 | 20.43 |
| 32 | 13.42 | 19.75 |
| 64 | 13.38 | 19.80 |
| 128 | 13.70 | 19.44 |
| 256 | 13.99 | 20.33 |
| 512 | 13.91 | 20.28 |
| 1024 | 15.00 | 22.86 |
| 2048 | 15.82 | 20.93 |
| 4096 | 16.00 | 21.08 |
| 8192 | 16.31 | 21.50 |
| 16384 | 16.27 | 22.95 |
| 32768 | 16.13 | 25.17 |
| 65536 | 18.92 | 25.90 |
| 131072 | 21.12 | 27.42 |
| 262144 | 23.09 | 32.36 |
| 524288 | 32.78 | 42.80 |
Because the new method computes the reduced value independently on every rank, care must be taken to ensure the result is identical across ranks. We use the test at
https://github.com/delock/ds_allreduce_bench/blob/main/ds_comm_bench.py#L70
to ensure the implementation is correct.
https://github.com/delock/ds_allreduce_bench/blob/main/validate.sh is a test script for better coverage.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Abhishek Kulkarni <11399+adk9@users.noreply.github.com>
In the process of adding onebit optimizer support for XPU devices, we noticed that across different accelerators, the main difference in the implementation of `compressed_allreduce` lies in `packbits` and `unpackbits`: CUDA uses cupy and NPU uses torch_npu. Instead of replacing these with XPU-only functions, we provide a CompressedBackend to do the `compressed_allreduce` work, where users can add their own packbits/unpackbits kernels; this is a general path for all kinds of accelerators (a reference sketch of the packbits/unpackbits contract follows the list).
In this PR, we:
1. Add CompressedBackend for onebitAdam, onebitLamb and zerooneAdam
2. Add an XPU implementation of packbits/unpackbits with SYCL, built in PackbitsBuilder
3. Add tests for onebit with CompressedBackend
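A reference-style sketch of the packbits/unpackbits contract in pure PyTorch (not the SYCL, cupy, or torch_npu kernels; the 8-bits-per-byte layout and function signatures are assumptions):
```python
import torch

# Pack sign bits 8-per-byte and expand them back; any accelerator-specific kernel
# plugged into the CompressedBackend needs to preserve this round-trip property.
_WEIGHTS = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], dtype=torch.uint8)

def packbits(sign_bits: torch.Tensor) -> torch.Tensor:
    # sign_bits: uint8 tensor of 0/1 values, length a multiple of 8
    bits = sign_bits.view(-1, 8)
    return (bits * _WEIGHTS).sum(dim=1).to(torch.uint8)

def unpackbits(packed: torch.Tensor, numel: int) -> torch.Tensor:
    bits = (packed.unsqueeze(1) & _WEIGHTS) > 0
    return bits.to(torch.uint8).view(-1)[:numel]

x = (torch.randn(32) > 0).to(torch.uint8)
assert torch.equal(unpackbits(packbits(x), 32), x)
```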
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Fixed the Windows build.
Fixes applied:
- Remove some more ops that don't build on Windows.
- Remove the use of symlinks that didn't work correctly and replace them with
`shutil.copytree()`.
- Small fixes to make the C++ code compile.
Tested with Python 3.9 and CUDA 12.1.
---------
Co-authored-by: Costin Eseanu <costineseanu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Hi,
please review the following changes.
I added BF16 support to cpu_adam. BF16, FP16 and float are supported at compilation time; the correct template is called at runtime according to the input params' dtype.
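As an illustration of the runtime dispatch (the kernel names and the wrapper below are hypothetical; the real extension resolves the template on the C++ side):
```python
import torch

# Map each supported parameter dtype to the compiled kernel variant it should use.
_CPU_ADAM_KERNELS = {
    torch.float32: "adam_update_fp32",
    torch.float16: "adam_update_fp16",
    torch.bfloat16: "adam_update_bf16",
}

def dispatch_cpu_adam(params: torch.Tensor) -> str:
    try:
        return _CPU_ADAM_KERNELS[params.dtype]
    except KeyError:
        raise ValueError(f"cpu_adam: unsupported parameter dtype {params.dtype}")

print(dispatch_cpu_adam(torch.zeros(4, dtype=torch.bfloat16)))  # adam_update_bf16
```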
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This PR enables building the below extensions for AMD GPUs with warp
size 32.
- transformer_inference
- quantizer
- random_ltd
This PR works stand-alone for torch version <=2.0. For the latest
versions, https://github.com/microsoft/DeepSpeed/pull/5401 is required
to be merged in addition to this PR.
Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on NAVI3x:
**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference
Before this PR:
===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s
(0:01:09) =====
After this PR:
========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s
==========
**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer
Before this PR:
==== 244 failed, 8 warnings in 30.53s ====
After this PR:
====== 186 failed, 58 passed, 8 warnings in 8.89s ======
I could not find random_ltd related unit tests to run.
Fixes:
- https://github.com/microsoft/DeepSpeed/issues/4753
- https://github.com/microsoft/DeepSpeed/issues/5474
- https://github.com/ROCm/DeepSpeed/issues/68
cc: @jithunnair-amd
---------
Co-authored-by: rraminen@amd.com <rraminen>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Fixes https://github.com/microsoft/DeepSpeed/issues/4989
In addition to this PR, the changes listed below are required to build the following extensions successfully. Please note that not all unit tests for these extensions will pass with this PR; more details on the unit test results are below. These unit tests are skipped in CI anyway, so they will not break the CI.
Extensions:
- transformer_inference
- quantizer
- random_ltd
Required changes:
- https://github.com/pytorch/pytorch/pull/121030
- https://github.com/microsoft/DeepSpeed/pull/5402
Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on MI200:
**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference
Before this PR:
==== 674 failed, 622 skipped, 8 warnings, 1728 errors in 123.66s
(0:02:03) =====
After this PR:
========== 555 failed, 983 passed, 1486 skipped, 8 warnings in 14.35s
==========
**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer
Before this PR:
==== 244 failed, 8 warnings in 48.02s ====
After this PR:
===== 187 failed, 57 passed, 8 warnings in 14.74s ====
I could not find random_ltd related unit tests to run.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
This PR adds a new function to the dequantizer, called `selective_dequantize`, which enables partially dequantizing a 3-dimensional matrix when we don't need to dequantize all the data from a lower-bit format (like fp8/fp6) to bf16.
I also added a unit test to check its functionality.
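To make the idea concrete, here is a reference-style sketch; the signature, int8 storage, and per-row scales are assumptions and do not match the CUDA kernel's actual interface.
```python
import torch

def selective_dequantize(q: torch.Tensor, scales: torch.Tensor,
                         indices: torch.Tensor) -> torch.Tensor:
    """Dequantize only the selected slices along dim 0 of a 3-D quantized tensor."""
    # q: [B, M, N] int8 data, scales: [B, M, 1] per-row scales, indices: [K] int64
    selected = q.index_select(0, indices).to(torch.bfloat16)
    return selected * scales.index_select(0, indices).to(torch.bfloat16)

q = torch.randint(-128, 127, (4, 8, 16), dtype=torch.int8)
scales = torch.rand(4, 8, 1)
out = selective_dequantize(q, scales, torch.tensor([0, 2]))
print(out.shape, out.dtype)  # torch.Size([2, 8, 16]) torch.bfloat16
```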
---------
Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds an SHM-based `inference_all_reduce` kernel to the `TorchBackend` communication backend. When running inference on a CPU server, this path replaces the default `torch.distributed.all_reduce`, which eventually uses the gloo backend. This PR improves inference performance with AutoTP when only stock PyTorch is installed, without Intel Extension for PyTorch.
Compared with the gloo backend, the SHM-based inference_all_reduce kernel is a more direct path and performs much better on a single node.
| message size | gloo all_reduce(ms) | SHM all_reduce(ms) |
| --- | --- | --- |
| 32MB | 30.7 | 0.65 |
| 64KB | 0.23 | 0.028 |
In text generation with bloom-3b and AutoTP, average token latency improved 1.45x with this PR on a 2-socket Xeon node.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This PR removes the for loop inside the dequantizer kernel and uses as many threads and blocks as needed to dequantize the quantized matrix. The previous implementation processed one group per thread block, which can reduce efficiency with smaller group sizes, and it processed more data per thread than necessary; we can use more parallelism to improve dequantization performance.
Based on my testing results, for a 4K by 4K matrix, dequantizing from fp8 to bf16 gives a 2.5x speedup (improving the BW efficiency from 1 TB/s to 2.5 TB/s on an Nvidia H100 GPU).
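For a sense of scale, some illustrative launch-geometry arithmetic under assumed values (the threads-per-block choice is not from the PR): covering the matrix with one thread per element removes the need for a per-thread loop.
```python
rows, cols, threads_per_block = 4096, 4096, 256   # threads_per_block is assumed
elements = rows * cols
blocks = (elements + threads_per_block - 1) // threads_per_block
print(f"{elements} elements -> {blocks} blocks of {threads_per_block} threads, "
      f"one element per thread and no per-thread loop")
```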
---------
Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
This PR improves SHM-based inference_all_reduce on CPU:
1. Optimize for larger message sizes, which affect the performance of first-token generation with long context. For example, for llama2 70b with a 1024-token input sequence, the all_reduce message size is 32MB with a single batch.
* Increased SHM buffer size from 1MB/worker to 32MB/worker
* Each worker allocates its SHM buffer on its own NUMA node, instead of rank 0 allocating SHM buffers for all other workers
* For message sizes > 1MB, a more distributed algorithm is used to spread memory bandwidth and computation evenly among workers
2. Decouple the SHM-based collective code from the oneCCL-based code, making it ready to integrate with other backends, e.g. the gloo backend
3. Loosen the condition under which SHM-based allreduce is used, i.e. the message size no longer has to be divisible by 32 bytes.
The new distributed algorithm, combined with the larger per-worker SHM buffer, brings ~3x allreduce performance improvement for 32MB messages on a 2-socket machine.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Flexible-bit quantizer-dequantizer library with fp6/fp12/fp8 support.
It requires Ampere+ architecture, due to the initial focus of this op on `bfloat16` input types only.
Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
The NPU hardware platform does not support `O_DIRECT`.
The change is surrounded by `#if defined`, so it has no impact on other hardware platforms such as GPUs.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
I found that the width used in simd_load is different from simd_store, which confused me.
The reason lies in the missing parentheses around the type conversion of x in the SIMD_LOAD2 macro definition, which disrupts the intended semantics of the width variable.
This is a quick fix for it.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR fixes the UT error described in the PR and test job below. It skips `TestModelTask` if the dtype is not supported by the accelerator, or if `InferenceBuilder` is not implemented by the accelerator.
https://github.com/microsoft/DeepSpeed/pull/4419
https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Liangliang-Ma <1906710196@qq.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Xie Zejian <xiezej@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Add the ZeRO-Infinity feature for NPU devices.
I add a new `async_io.py` in `op_builder/npu` and a compile-time preprocessor guard in `deepspeed_aio_thread.cpp` specifically for NPU, which is isolated from other devices such as the GPU and will not affect them.
See what we have already done in
https://github.com/microsoft/DeepSpeed/issues/4567 .
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This PR adds XPU support for Intel GPUs. With this PR, DeepSpeed can support XPU devices without installing Intel Extension for DeepSpeed.
---------
Co-authored-by: Liangliang-Ma <1906710196@qq.com>
Co-authored-by: baodi <di.bao@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Yizhou Wang <yizhou.wang@intel.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
There were some syntax errors in the NPU offload code.
The AVX instruction set may have been unavailable on our server due to environment variables; as a result, this problem was not caught by our tests.
Sorry for the inconvenience; we will be more careful in future PRs.
Co-authored-by: jializheng <jializheng@huawei.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Support the cpu_adam, cpu_adagrad and cpu_lion optimizers for Ascend NPU. All of these optimizers run on the host; the difference between backends is the way parameters are copied back to the device. This commit adds a new symbol called `__ENABLE_CANN__`, which guards code adapted for NPU. The NPU builder adds the header files and libraries required for compilation, according to CANN's compilation manual (a hypothetical builder sketch follows below).
Note that there is no FusedLion implementation for NPU, so the test_cpu_lion test case should be disabled until the FusedLion optimizer is implemented.
Besides, when NPU is selected as the accelerator, ds_report will show torch_npu and CANN information.
With this PR, the deepspeed test cases in
[huggingface/accelerate](https://github.com/huggingface/accelerate/tree/main/tests/deepspeed)
all pass.
It is part of the feature list for Ascend NPU support, @see #4567
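A hypothetical sketch of the builder pattern described above (the environment variable, paths, and library names are assumptions based on typical CANN installs, not the actual op_builder code):
```python
import os

class NPUCPUAdamBuilderSketch:
    """Illustrative only: surface CANN headers/libs and the __ENABLE_CANN__ guard."""

    def _cann_home(self) -> str:
        return os.environ.get("ASCEND_HOME_PATH", "/usr/local/Ascend")  # assumed default

    def include_paths(self):
        return [os.path.join(self._cann_home(), "include"), "csrc/includes"]

    def extra_ldflags(self):
        return [f"-L{os.path.join(self._cann_home(), 'lib64')}", "-lascendcl"]

    def cxx_args(self):
        return ["-O3", "-fopenmp", "-D__ENABLE_CANN__"]
```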
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This PR removes the bias created as placeholders, which causes a crash
in openfold's training pipeline.
---------
Co-authored-by: Conglong Li <conglong.li@gmail.com>
The current builder for evoformer uses the same base name for `attention.cpp` and `attention.cu`, leading to the same intermediate filename `attention.o`:
```shell
march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -
isystem /home/zejianxie/.conda/envs/dll/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem
/home/zejianxie/.conda/envs/dll/include build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention_back.o
```
and
```
`attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)':
tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): multiple definition of `attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): first defined here
/home/zejianxie/.conda/envs/dll/bin/../lib/gcc/x86_64-conda-linux-gnu/11.4.0/../../../../x86_64-conda-linux-gnu/bin/ld: build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): multiple definition of `torch::autograd::(anonymous namespace)::graph_task_id'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): first defined here
```
I used the following to reproduce the issue and confirm that my fix works:
```
git clone https://github.com/NVIDIA/cutlass --depth 1
CUTLASS_PATH=$PWD/cutlass DS_BUILD_EVOFORMER_ATTN=1 pip install ./DeepSpeed --global-option="build_ext"
```
![image](https://github.com/microsoft/DeepSpeed/assets/41792945/9e406b37-330c-431c-8bf9-6be378dee4ff)
Co-authored-by: Conglong Li <conglong.li@gmail.com>
* Add rope_theta for llama config
* Add rope_theta to bias_add_transform_0213
* Fix CI problems
* Add rope_theta to linear layer
---------
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
* unify ccl.py for cpu and gpu
* handle all collectives in one function
* add corresponding functions in cclbackend
* add corresponding pybind functions
* change device setting
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* kernels added for asymmetric fine-grained block quantization with 8 bits
* formatting
* clean up the code
* rename quantize_int4.cu to quantize_intX.cu
* rename test_int4_quantization.py to test_intX_quantization.py
* "rename test_int4_quantization.py to test_intX_quantization.py"
This reverts commit 2d341405b2.
* rename
* fix after the pr comments
* increased coverage of QuantLinear test
(w/ and w/o the cuda kernels)
* formatting
---------
Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>