Use host time to replace XPU event elapsed_time as a workaround; on XPU devices, using XPU events to measure elapsed time will be consolidated in IPEX 2.5.
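For reference, a minimal sketch of the workaround, assuming a PyTorch build with XPU support: wrap the workload in host timers around synchronize() calls instead of calling Event.elapsed_time().

```python
import time

import torch

# Host-timer measurement around explicit synchronization points; the
# matmul below is just an example workload.
x = torch.randn(1024, 1024, device="xpu")
torch.xpu.synchronize()
start = time.time()
y = x @ x
torch.xpu.synchronize()
print(f"elapsed: {(time.time() - start) * 1000.0:.3f} ms")
```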
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
# Motivation
From our next release, XPU DeepSpeed-related kernels will be moved into
intel_extension_for_pytorch. This PR adds the new op builders and uses the
kernel path from intel_extension_for_pytorch. More ops such as MoE and WOQ
will be added.
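As an illustration of the direction (the class and return value below are hypothetical, not the final API), an op builder can defer kernel loading to intel_extension_for_pytorch instead of JIT-compiling sources:

```python
# Illustrative sketch only: a builder whose load() hands back kernels
# shipped with intel_extension_for_pytorch.
class IpexOpBuilderSketch:
    NAME = "ipex_op_sketch"  # hypothetical op name

    def is_compatible(self, verbose=True):
        try:
            import intel_extension_for_pytorch  # noqa: F401
            return True
        except ImportError:
            return False

    def load(self):
        # A real builder would return the specific kernel module that
        # IPEX exports; the top-level module is a placeholder here.
        import intel_extension_for_pytorch as ipex
        return ipex
```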
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Previously, the lazy_call function was wrapped by
torch.xpu.lazy_init._lazy_call, which has now changed to
torch.xpu._lazy_call.
We therefore change this function to adapt to both versions.
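A minimal sketch of the version-adaptive dispatch, using the two attribute locations named above:

```python
import torch

def lazy_call(callback):
    # Newer PyTorch XPU builds expose _lazy_call directly; older ones
    # nest it under torch.xpu.lazy_init.
    if hasattr(torch.xpu, "_lazy_call"):
        return torch.xpu._lazy_call(callback)
    return torch.xpu.lazy_init._lazy_call(callback)
```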
While adding onebit optimizer support for XPU devices, we noticed that
across accelerators the main difference in the implementation of
`compressed_allreduce` lies in `packbits` and `unpackbits`: CUDA uses cupy
and NPU uses torch_npu. Instead of replacing these with XPU-only functions,
we provide a CompressedBackend to do the `compressed_allreduce` work, where
users can add their own packbits/unpackbits kernels; this is a general path
for all kinds of accelerators.
In this PR, we:
1. Add CompressedBackend for onebitAdam, onebitLamb and zerooneAdam
2. Add an XPU implementation of packbits/unpackbits in SYCL, built by
PackbitsBuilder (a CPU reference sketch of the expected semantics follows
this list)
3. Add tests for onebit with CompressedBackend
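For reference, a CPU sketch of the semantics the accelerator kernels must implement (the helper names are ours, not DeepSpeed's API; the SYCL, cupy and torch_npu kernels compute the same thing on their devices):

```python
import numpy as np
import torch

def packbits_reference(bits: torch.Tensor) -> torch.Tensor:
    # bits: flat tensor of 0/1 values; packs 8 bits into each output byte.
    return torch.from_numpy(np.packbits(bits.to(torch.uint8).cpu().numpy()))

def unpackbits_reference(packed: torch.Tensor, count: int) -> torch.Tensor:
    # Inverse of packbits_reference; count trims any padding bits.
    return torch.from_numpy(np.unpackbits(packed.cpu().numpy(), count=count))

bits = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0], dtype=torch.uint8)
assert torch.equal(unpackbits_reference(packbits_reference(bits), 8), bits)
```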
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Fix the following error:
/datadisk2/wengshiy/llm.devkit/DeepSpeed/deepspeed/runtime/utils.py
return get_accelerator().FloatTensor(float(v)).detach()
TypeError: new(): data must be a sequence (got float)
The CUDA accelerator modified this interface to fix a warning:
177dc14331
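A minimal reproduction, with one way to build the scalar tensor that avoids the error (shown for illustration; the actual change follows the CUDA accelerator commit above):

```python
import torch

v = 0.5
# torch.FloatTensor(float(v))  # TypeError: new(): data must be a sequence (got float)
t = torch.tensor(float(v), dtype=torch.float32).detach()  # scalar tensor works
```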
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Add getter and setter methods for `compile_backend` across accelerators,
which provide a mechanism to retrieve and set the compile backend. These
APIs handle user-defined backend selection and raise a `ValueError` with
informative error messages for unsupported backends.
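A minimal sketch of the accessor pattern (the class is illustrative and the supported-backend list is a placeholder; the method names follow the description above):

```python
class AcceleratorSketch:
    def __init__(self):
        self._compile_backend = "inductor"
        self._supported_backends = ["inductor", "eager"]  # placeholder list

    def get_compile_backend(self):
        return self._compile_backend

    def set_compile_backend(self, backend):
        # Reject user-defined backends this accelerator cannot handle.
        if backend not in self._supported_backends:
            raise ValueError(
                f"{backend} not supported by this accelerator. "
                f"Supported backends: {self._supported_backends}")
        self._compile_backend = backend
```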
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Thank you for the [pr](https://github.com/microsoft/DeepSpeed/pull/5369)
and to @delock for contributing ideas.
As mentioned in this
[pr](https://github.com/microsoft/DeepSpeed/pull/5369), each device has
its own environment variables.
We create visible_devices_envs() and set_visible_devices_envs() methods
on the accelerator class so that each accelerator can implement its env
settings within the interface, which is more generic for other
accelerators.
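A minimal sketch of the two methods for the NPU case (ASCEND_RT_VISIBLE_DEVICES is the Ascend analogue of CUDA_VISIBLE_DEVICES; the class name is illustrative):

```python
class NpuAcceleratorSketch:
    def visible_devices_envs(self):
        return ["ASCEND_RT_VISIBLE_DEVICES"]

    def set_visible_devices_envs(self, current_env, local_accelerator_ids):
        # Write the launcher-selected device ids into every env var the
        # accelerator declares, e.g. "0,1,2,3".
        for env in self.visible_devices_envs():
            current_env[env] = ",".join(map(str, local_accelerator_ids))
```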
This commit has been tested on NPU nodes, each with 8 Ascend NPUs.
---------
Co-authored-by: yangcheng <yangcheng104@huawei.com>
Co-authored-by: eigen2017 <wobushiliu2@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
DeepSpeed currently calls is_synchronized_device() to decide how to use
the device.
HPU does not fit this definition, since it behaves as if all streams were
blocking streams: they preserve order among each other but are asynchronous
with respect to the CPU (see cudaStreamCreateWithFlags).
**has_data_dependency_resolving()**
The HPU device is considered synchronized with respect to the CPU:
operations execute in script order regardless of the stream they were
enqueued on, and tensor data is guaranteed to be valid.
There is no need for stream dependencies or CPU synchronizations.
**use_host_timers()**
HPU device execution is asynchronous, so to measure device execution time
we must use device timers.
**has_memory_backpressure()**
Limiting the number of in-flight fetched params and in-flight grad
reduce_scatter calls is unnecessary, since HPU stops enqueuing calls when
memory is full, creating internal backpressure on the CPU until memory
becomes available.
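Putting the three queries together, a sketch of how HPU might answer them (method names as above; return values reflect the behavior described, not necessarily DeepSpeed's exact code):

```python
class HpuAcceleratorSketch:
    def is_synchronized_device(self):
        return False

    def has_data_dependency_resolving(self):
        # Ops execute in script order across streams; tensor data is valid.
        return True

    def use_host_timers(self):
        # Execution is async w.r.t. the CPU, so device timers are needed.
        return False

    def has_memory_backpressure(self):
        # HPU stalls enqueue when memory is full; no manual limits needed.
        return True
```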
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds XPU support for Intel GPUs. With this PR, DeepSpeed can
support XPU devices without installing Intel Extension for DeepSpeed.
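A quick way to check the result on an Intel GPU machine (assuming a PyTorch build with XPU support):

```python
from deepspeed.accelerator import get_accelerator

acc = get_accelerator()
print(acc.device_name())   # expected: 'xpu'
print(acc.is_available())
```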
---------
Co-authored-by: Liangliang-Ma <1906710196@qq.com>
Co-authored-by: baodi <di.bao@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Yizhou Wang <yizhou.wang@intel.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>