Fixed the Windows build.
Fixes applied:
- Remove some more ops that don't build on Windows.
- Replace the use of symlinks, which did not work correctly on Windows, with `shutil.copytree()` (see the sketch after this list).
- Small fixes to make the C++ code compile.
Tested with Python 3.9 and CUDA 12.1.
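For reference, a minimal sketch of the symlink-to-copy replacement (the paths and helper name are illustrative, not the exact code in this PR):
```python
import shutil

# Hypothetical paths for illustration; on Windows the op sources are copied
# instead of symlinked.
def copy_ops_sources(src: str, dst: str) -> None:
    # dirs_exist_ok requires Python 3.8+, which is fine for the tested Python 3.9.
    shutil.copytree(src, dst, dirs_exist_ok=True)

copy_ops_sources("csrc", "deepspeed/ops/csrc")
```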
---------
Co-authored-by: Costin Eseanu <costineseanu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Until now, only the last layer (idx=-1) was considered, using FINAL_LAYER_NORM_INDEX, which is set to -1.
This PR allows the user to pass a custom value for models where this default value does not apply.
See an example of usage in the HabanaAI/Megatron-DeepSpeed fork repository:
c9feb8caca/tools/verify_checkpoint_non_tp_consistency.py (L296)
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
This PR updates the `nv-ds-chat` GitHub workflow to include the `hybrid_engine.py` file in the path. This ensures the DS-Chat flow is tested whenever any changes are made to the Hybrid Engine.
The `DistributedAttention` in DeepSpeed-Ulysses is compatible with the training code in
[Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/model/transformer.py#L811)
because it only accepts the query/key/value sequences as input parameters. However,
it is not compatible with frequently used scenarios that pass additional
parameters, such as the following call when using Flash Attention:
```python
ulysses_attn = DistributedAttention(local_attention=flash_attn_func, sequence_process_group=None, scatter_idx=2, gather_idx=1)
attn_output = ulysses_attn(
    query_states,
    key_states,
    value_states,
    dropout,
    softmax_scale,
    causal=causal,
)
```
Therefore, a `**kwargs` parameter has been added to increase compatibility with more local attention implementations while requiring minimal code modifications.
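As a rough sketch of the change (illustrative only; the all-to-all logic is elided and this is not the exact diff), the forward method now forwards extra positional and keyword arguments to the wrapped local attention:
```python
from torch import Tensor
from torch.nn import Module

class DistributedAttentionSketch(Module):
    # Sketch: only the argument forwarding relevant to this PR is shown.
    def __init__(self, local_attention, sequence_process_group, scatter_idx: int = 2, gather_idx: int = 1) -> None:
        super().__init__()
        self.local_attn = local_attention
        self.spg = sequence_process_group
        self.scatter_idx = scatter_idx
        self.gather_idx = gather_idx

    def forward(self, query: Tensor, key: Tensor, value: Tensor, *args, **kwargs) -> Tensor:
        # ... all-to-all over the sequence dimension would happen here ...
        # Extra arguments (e.g. dropout, softmax_scale, causal=...) are passed
        # straight through to the local attention implementation.
        context = self.local_attn(query, key, value, *args, **kwargs)
        # ... all-to-all back to the original layout ...
        return context
```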
Co-authored-by: Kwen-Chen <2133949025@qq.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The new "timers" section describes configuration for different timers.
Specifically, in the "throughput" section, it is possible to disable the
throughput timer (enabled by default). This allows to avoid the
performance degradation whenever the throughput measurement is not
needed, for example in production environment.
No device synchronize() is invoked when "synchronized" is set to False
(default is True). This allows to produce approximate throughput
measurements with minimal performance penalty.
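For example, a configuration sketch based on the description above (verify the exact key names against the docs):
```python
ds_config = {
    "timers": {
        "throughput": {
            # Set to False to disable the throughput timer entirely (default: True).
            "enabled": True,
            # Set to False to skip device synchronize() and accept approximate
            # throughput measurements with minimal overhead (default: True).
            "synchronized": False
        }
    }
}
```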
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Addresses the following warning:
```
/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/transformers/utils/hub.py:123: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
```
and the code on the transformers side is
[here](1a585c1222/src/transformers/utils/hub.py (L86C1-L96C81)).
**Fix overwriting of the compiled wrapper class attributes by those of
the wrapped class itself: copy only those attributes which are not
already present in the wrapper.**
In the current implementation of the `CompiledModuleWrapper`, the wrapper
attributes (e.g. the `forward` method) are overwritten by `self.__dict__ =
module.__dict__.copy()`:
```python
def CompiledModuleWrapper(mod, compile_config: Union[CompileConfig, None] = None):

    class wrapper(mod.__class__):

        def __init__(self, module, compile_config: Union[CompileConfig, None] = None):
            self.__dict__ = module.__dict__.copy()
```
This causes the `wrapper`'s `forward` method not to be called and,
consequently, the wrapped module not to be compiled. Instead, the wrapped
module's own `forward` method is called, as illustrated in the diagram
below (a real scenario from DeepSpeed-Chat):
![compiled_module_wrapper_bug](https://github.com/microsoft/DeepSpeed/assets/75629718/00eeb3d1-927c-49c7-84ab-f882821cc452)
The proposed fix copies only those attributes which are not already present
in the wrapper class, thus implementing the desired inheritance behavior of
the wrapper.
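A minimal sketch of the idea (illustrative only; the class and compile handling here are simplified and not DeepSpeed's actual implementation):
```python
import torch

def CompiledModuleWrapperSketch(mod, compile_config=None):

    class wrapper(mod.__class__):

        def __init__(self, module, compile_config=None):
            # Copy only attributes the wrapper class does not define itself,
            # so wrapper methods such as `forward` are not clobbered.
            for name, value in module.__dict__.items():
                if name not in wrapper.__dict__:
                    self.__dict__[name] = value
            self._compile_config = compile_config
            self._compiled_fn = None

        def forward(self, *args, **kwargs):
            # Compile the wrapped module's forward lazily, then delegate to it.
            if self._compiled_fn is None:
                self._compiled_fn = torch.compile(super().forward)
            return self._compiled_fn(*args, **kwargs)

    return wrapper(mod, compile_config)
```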
Attached is a simple reproducer of the problem.
[compiled_module_wrapper_bug.zip](https://github.com/microsoft/DeepSpeed/files/15378282/compiled_module_wrapper_bug.zip)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The following error occurs on XPU while running the unit test
"DeepSpeed/tests/unit/moe/test_moe.py":
DeepSpeed/deepspeed/moe/sharded_moe.py, line 223, in top1gating
RuntimeError: Expected all tensors to be on the same device, but found
at least two devices, xpu:0 and cpu!
Fix it by converting the offending tensor to the correct device.
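A generic sketch of the fix pattern (the actual tensor names in `top1gating` are not shown):
```python
import torch

def to_same_device(reference: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
    # Move a tensor that may have been created on the CPU onto the device of
    # the tensor it is combined with (e.g. xpu:0), avoiding the mixed-device error.
    return other.to(reference.device)
```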
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Fixes the following error:
/datadisk2/wengshiy/llm.devkit/DeepSpeed/deepspeed/runtime/utils.py
return get_accelerator().FloatTensor(float(v)).detach()
TypeError: new(): data must be a sequence (got float)
The CUDA accelerator modified this interface while fixing a warning:
177dc14331
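One possible way to avoid the error (a sketch, not necessarily the exact change in this PR) is to build the scalar tensor with `torch.tensor` rather than passing a bare float to the `FloatTensor` constructor:
```python
import torch
from deepspeed.accelerator import get_accelerator

v = 1.5  # illustrative scalar value
total_norm = torch.tensor(float(v),
                          dtype=torch.float,
                          device=get_accelerator().current_device_name()).detach()
```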
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Hi, please review the following changes.
I added BF16 support to CPU Adam. BF16, FP16, and float (FP32) are supported
at compilation time; the correct template is called at runtime according to
the input parameters' dtype.
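A hedged usage sketch (the dtype dispatch itself happens inside the C++ extension; whether your build supports BF16 depends on this PR being applied):
```python
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

# BF16 parameter on the CPU; the extension picks the matching template at runtime.
param = torch.nn.Parameter(torch.randn(1024, dtype=torch.bfloat16))
optimizer = DeepSpeedCPUAdam([param], lr=1e-3)

param.grad = torch.randn_like(param)
optimizer.step()
```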
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* Use all_reduce instead of all_gather to fetch module parameters. This
improves performance by removing the overhead of concatenation and
slicing, which are no longer required.
* Instead, all tensor views are created prior to the collective
(all_reduce), so upon its completion only the parameter status is
updated.
* The behavior is enabled via a new boolean flag under the
"zero_optimization" section: { "stage3_use_all_reduce_for_fetch_params": true }
(see the example below).
* By default, the optimization is disabled.
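Example configuration (sketch):
```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Opt in to the all_reduce-based parameter fetch path described above.
        "stage3_use_all_reduce_for_fetch_params": True
    }
}
```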
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This PR enables building the below extensions for AMD GPUs with warp
size 32.
- transformer_inference
- quantizer
- random_ltd
This PR works stand-alone for torch versions <= 2.0. For the latest
versions, https://github.com/microsoft/DeepSpeed/pull/5401 must be merged
in addition to this PR.
Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2) on NAVI3x:
**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/transformer/inference
Before this PR:
===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s (0:01:09) =====
After this PR:
========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s ==========
**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/quantizer
Before this PR:
==== 244 failed, 8 warnings in 30.53s ====
After this PR:
====== 186 failed, 58 passed, 8 warnings in 8.89s ======
I could not find random_ltd related unit tests to run.
Fixes:
- https://github.com/microsoft/DeepSpeed/issues/4753
- https://github.com/microsoft/DeepSpeed/issues/5474
- https://github.com/ROCm/DeepSpeed/issues/68
cc: @jithunnair-amd
---------
Co-authored-by: rraminen@amd.com <rraminen>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Fixes https://github.com/microsoft/DeepSpeed/issues/4989
In addition to this PR, the changes listed below are required to
successfully build the following extensions. Please note that not all unit
tests for these extensions will pass with this PR; more details on the unit
test results are below. These unit tests are skipped in CI anyway, so they
will not break the CI.
Extensions:
- transformer_inference
- quantizer
- random_ltd
Required changes:
- https://github.com/pytorch/pytorch/pull/121030
- https://github.com/microsoft/DeepSpeed/pull/5402
Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2) on MI200:
**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/transformer/inference
Before this PR:
==== 674 failed, 622 skipped, 8 warnings, 1728 errors in 123.66s (0:02:03) =====
After this PR:
========== 555 failed, 983 passed, 1486 skipped, 8 warnings in 14.35s ==========
**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/quantizer
Before this PR:
==== 244 failed, 8 warnings in 48.02s ====
After this PR:
===== 187 failed, 57 passed, 8 warnings in 14.74s ====
I could not find random_ltd related unit tests to run.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Enhance testing: Skip fused_optimizer tests if not supported.
Added condition check to skip fused_optimizer tests if FusedAdam and
FusedLamb are not supported by the accelerator. This enhancement ensures
that the tests are appropriately skipped when the hardware configuration
does not support these optimizers, preventing potential issues.
Details:
- Introduced a condition check to determine support for FusedAdam and
FusedLamb.
- If not supported, fused_optimizer tests are skipped to improve test
reliability.
- Improved compatibility and stability across different hardware
configurations.
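A minimal sketch of the skip condition (using the op builders' compatibility check; the actual helper used in this PR may differ):
```python
import pytest
from deepspeed.ops.op_builder import FusedAdamBuilder, FusedLambBuilder

# Skip fused optimizer tests when the accelerator cannot build the fused ops.
fused_supported = FusedAdamBuilder().is_compatible() and FusedLambBuilder().is_compatible()

@pytest.mark.skipif(not fused_supported,
                    reason="FusedAdam/FusedLamb are not supported by this accelerator")
def test_fused_optimizer_step():
    ...  # actual fused optimizer test body
```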
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Previously, the reported optimizer name was the one that was configured,
not the optimizer actually selected after this function's processing.
The two are not always aligned.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR introduces a new monitoring option, `CometMonitor`, which is an
official integration with [CometML](https://www.comet.com/site/).
The new monitor is covered by unit tests.
Notes:
* We've updated `docs/code-docs/source/monitor.rst`, but it doesn't look
like it is used anymore.
* We've updated the "Monitoring Module" section name in `config-json.md`
to be generic so the next integration won't require updating it.
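A hedged configuration sketch (the section and key names are assumptions; consult the updated monitoring documentation for the exact schema):
```python
ds_config = {
    "comet": {
        "enabled": True,
        # Illustrative keys only.
        "project": "my-deepspeed-project",
        "experiment_name": "zero3-baseline"
    }
}
```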
---------
Co-authored-by: Boris Feld <lothiraldan@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Hi @loadams, could you please help review this PR?
After adding the hpp files in csrc, we found that the hpp headers were
sometimes still excluded from the op source packaging, so we also add the
hpp files under deepspeed to make sure the headers ship in the deepspeed
package, ensuring that JIT load can compile the xpu/fused_adam ops in 0.14.2.
This PR aims to enable Phi-3 mini autotp.
Phi-3 mini uses a chunked MLP. We adjust the linear layer weight order to
support this model.
Please kindly review~ Thanks!
---------
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
The compile wrapper will inherit from the user module class and copy its
`__dict__`.
This should resolve most issues in #5383 except for potential extra user
forward hooks.
@tohtana @loadams
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Creating a Torch tensor with the parameter
`device=get_accelerator().current_device()` can result in a crash when
using an NPU.
This issue arises because the `current_device` API across all
accelerators is expected to return a device id as an integer, according
to the [interface
docs.](fa8458b1a8/docs/_tutorials/accelerator-abstraction-interface.md?plain=1#L52C1-L56C103)
However, specifying `device` as an integer when creating tensors directs
Torch to use the CUDA backend by default, which leads to crashes on NPUs
(and potentially other accelerators as well).
To resolve this, we should use `get_accelerator().current_device_name()`
instead, which returns the correct device identifier string, such as
`"npu:0"`, `"cuda:0"`, or `"xpu:0"`. This API provides the appropriate
context needed for creating tensors on specific hardware accelerators.
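For example, a minimal sketch of the pattern this PR moves to:
```python
import torch
from deepspeed.accelerator import get_accelerator

# current_device_name() returns a device string such as "npu:0", "cuda:0" or "xpu:0",
# whereas current_device() returns a bare integer index.
x = torch.empty(16, device=get_accelerator().current_device_name())
```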
I also noticed that `device=get_accelerator().current_device()` is used
across several files under deepspeed/inference, which may also lead to
crashes on other accelerators.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The order of parameters in the create_dir_symlink method looks wrong. Because
of this, we get the error "PermissionError: [WinError 5] Denied access:
'.\\deepspeed\\ops\\csrc'" when installing deepspeed >= 0.4.0 on a Windows
environment.
Please check this out @eltonzheng and @jeffra.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This reverts commit 54c0687264 due to
#5256 causing bugs when the ZeRO3 + ZeRO Offload features are enabled.
This bug was discovered due to failures in the DS Chat CI workflow.
Failing tests across CI failures:
| Failing Test Name |
| --- |
| test_ds_chat[zero3--offload-] |
| test_ds_chat[zero3--offload-lora] |
| test_ds_chat[zero3-he-offload-] |
| test_ds_chat[zero3-he-offload-lora] |
Error message:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cpu!
```
It seems that `torch.stack()` or `torch.norm()` has issues when the offload
feature is enabled and tensors are split between CPU/GPU; however, this is
just an initial guess and would require more investigation.
@nelyahu Since you are the original author of the PR, if you have some
bandwidth, any help here is greatly appreciated!
After reverting this commit, all tests pass in the DS Chat CI workflow:
https://github.com/microsoft/DeepSpeed/actions/runs/8824064414/job/24225802763
@tjruwase for context.
PyTorch v2.3 throws an error when it tries to compile `iter_params` used
for ZeRO3.
This PR excludes the function from the compilation targets.
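One common way to exclude a function from compilation is to mark it with `torch.compiler.disable` (a sketch; the function body shown and the exact mechanism used in this PR are illustrative):
```python
import torch

@torch.compiler.disable
def iter_params(module, recurse=False):
    # torch.compile skips tracing this helper and falls back to eager execution.
    return (param for _, param in module.named_parameters(recurse=recurse))
```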
After this PR is merged, we can [unpin the torch version for unit
tests](https://github.com/microsoft/DeepSpeed/pull/5459).
Add getter and setter methods for `compile_backend` across accelerators,
which provide a mechanism to retrieve and set the compile backend. These APIs
handle user-defined backend selection and raise a `ValueError` with an
informative error message for unsupported backends.
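A hedged sketch of the accessor pattern (the class, attribute, and backend names here are assumptions, not the exact accelerator code):
```python
class AcceleratorCompileBackendSketch:
    _supported_backends = ["inductor", "eager"]  # illustrative list

    def __init__(self):
        self._compile_backend = "inductor"

    def get_compile_backend(self):
        return self._compile_backend

    def set_compile_backend(self, backend):
        if backend not in self._supported_backends:
            raise ValueError(
                f"{backend} not supported by this accelerator. "
                f"Supported backends: {self._supported_backends}")
        self._compile_backend = backend
```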
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.14.2
Author - @loadams
Co-authored-by: loadams <loadams@users.noreply.github.com>
Optimized version of `nn.Linear` that adds features such as the following
(illustrative usage sketch below):
* LoRA with base weight sharding
* FP [6,8,12] quantization
Depends on #5336 being merged first.
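A purely illustrative usage sketch; the import path, config objects, and constructor arguments below are assumptions and may differ from the merged API:
```python
import torch
# Assumed module path and config classes; verify against the merged API.
from deepspeed.linear import OptimizedLinear, LoRAConfig, QuantizationConfig

layer = OptimizedLinear(
    input_dim=4096,
    output_dim=4096,
    lora_config=LoRAConfig(lora_r=16, lora_alpha=32),    # LoRA with base weight sharding
    quantization_config=QuantizationConfig(q_bits=8),    # FP quantization (6/8/12 bits)
)
out = layer(torch.randn(2, 4096, dtype=torch.bfloat16))
```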
Co-authored-by: @rajhans
Co-authored-by: @aurickq
---------
Co-authored-by: Rajhans Samdani <rajhans.samdani@snowflake.com>
Co-authored-by: Jeff Rasley <jeff.rasley@snowflake.com>
As discussed in #5175, set the default to use `set_to_none` for clearing
gradients in the BF16 optimizer.
Additionally, for the case of zeroing gradients in place, use `foreach_zero`.
Verified correctness with mega-ds llama 7B training.
FYI @loadams
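A minimal sketch of the two clearing paths (illustrative; the BF16 optimizer's actual code differs):
```python
import torch

def clear_grads(params, set_to_none: bool = True):
    if set_to_none:
        # New default: drop the gradient tensors entirely.
        for p in params:
            p.grad = None
    else:
        # Zero the gradients in place with a single fused foreach call.
        grads = [p.grad for p in params if p.grad is not None]
        if grads:
            torch._foreach_zero_(grads)
```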
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The conversion script from a regular checkpoint to the universal one
runs the following steps in parallel:
1. extract ZeRO-sharded optimizer states
2. merge the shards
However, it passes `map()` only a small set of tasks at a time (the number
specified as workers), so it has to wait for the slowest task in each set
to finish.
This PR submits all the tasks to the pool and waits until the futures are
ready, so all workers can be kept running.
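A sketch of the submission pattern (generic `concurrent.futures` usage; the conversion script's actual worker functions and arguments are not shown):
```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_all(worker_fn, tasks, num_workers):
    # Submit every task up front so the pool always has work queued, instead of
    # handing map() small batches and waiting on the slowest task in each batch.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(worker_fn, task) for task in tasks]
        return [future.result() for future in as_completed(futures)]
```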
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>