Fixed the Windows build.
Fixes applied:
- Remove some more ops that don't build on Windows.
- Replace the use of symlinks, which did not work correctly on Windows, with `shutil.copytree()` (see the sketch after this list).
- Small fixes to make the C++ code compile.
Tested with Python 3.9 and CUDA 12.1.
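For reference, a minimal sketch of the symlink-to-copy replacement (the paths and helper name are illustrative, not the exact code in this PR):
```python
import shutil

# Hypothetical paths for illustration; on Windows the op sources are copied
# instead of symlinked.
def copy_ops_sources(src: str, dst: str) -> None:
    # dirs_exist_ok requires Python 3.8+, which is fine for the tested Python 3.9.
    shutil.copytree(src, dst, dirs_exist_ok=True)

copy_ops_sources("csrc", "deepspeed/ops/csrc")
```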
---------
Co-authored-by: Costin Eseanu <costineseanu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Until now, only the last layer (idx=-1) was considered, using FINAL_LAYER_NORM_INDEX, which is set to -1.
This PR allows the user to pass a custom value for models where this default value does not apply.
See an example of usage in the HabanaAI/Megatron-DeepSpeed fork repository:
c9feb8caca/tools/verify_checkpoint_non_tp_consistency.py (L296)
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
This PR updates the `nv-ds-chat` GitHub workflow to include the `hybrid_engine.py` file in the path. This ensures the DS-Chat flow is tested whenever any changes are made to the Hybrid Engine.
The `DistributedAttention` in DeepSpeed-Ulysses is compatible with the training code in
[Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/model/transformer.py#L811)
because it only accepts the query/key/value sequences as input parameters. However,
it is not compatible with frequently used scenarios that pass additional
parameters, such as the following call when using Flash Attention:
```python
ulysses_attn = DistributedAttention(local_attention=flash_attn_func, sequence_process_group=None, scatter_idx=2, gather_idx=1)
attn_output = ulysses_attn(
    query_states,
    key_states,
    value_states,
    dropout,
    softmax_scale,
    causal=causal,
)
```
Therefore, a `**kwargs` parameter has been added to increase compatibility with more local attention implementations while requiring minimal code modifications.
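As a rough sketch of the change (illustrative only; the all-to-all logic is elided and this is not the exact diff), the forward method now forwards extra positional and keyword arguments to the wrapped local attention:
```python
from torch import Tensor
from torch.nn import Module

class DistributedAttentionSketch(Module):
    # Sketch: only the argument forwarding relevant to this PR is shown.
    def __init__(self, local_attention, sequence_process_group, scatter_idx: int = 2, gather_idx: int = 1) -> None:
        super().__init__()
        self.local_attn = local_attention
        self.spg = sequence_process_group
        self.scatter_idx = scatter_idx
        self.gather_idx = gather_idx

    def forward(self, query: Tensor, key: Tensor, value: Tensor, *args, **kwargs) -> Tensor:
        # ... all-to-all over the sequence dimension would happen here ...
        # Extra arguments (e.g. dropout, softmax_scale, causal=...) are passed
        # straight through to the local attention implementation.
        context = self.local_attn(query, key, value, *args, **kwargs)
        # ... all-to-all back to the original layout ...
        return context
```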
Co-authored-by: Kwen-Chen <2133949025@qq.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The new "timers" section describes configuration for different timers.
Specifically, in the "throughput" section, it is possible to disable the
throughput timer (enabled by default). This allows to avoid the
performance degradation whenever the throughput measurement is not
needed, for example in production environment.
No device synchronize() is invoked when "synchronized" is set to False
(default is True). This allows to produce approximate throughput
measurements with minimal performance penalty.
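For example, a configuration sketch based on the description above (verify the exact key names against the docs):
```python
ds_config = {
    "timers": {
        "throughput": {
            # Set to False to disable the throughput timer entirely (default: True).
            "enabled": True,
            # Set to False to skip device synchronize() and accept approximate
            # throughput measurements with minimal overhead (default: True).
            "synchronized": False
        }
    }
}
```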
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Addresses the following warning:
```
/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/transformers/utils/hub.py:123: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
```
and the code on the transformers side is
[here](1a585c1222/src/transformers/utils/hub.py (L86C1-L96C81)).
**Fix overwriting of the compiled wrapper class attributes by those of
the wrapped class itself: copy only those attributes which are not
already present in the wrapper.**
In the current implementation of the `CompiledModuleWrapper`, the wrapper
attributes (e.g. the `forward` method) are overwritten by `self.__dict__ =
module.__dict__.copy()`:
```python
def CompiledModuleWrapper(mod, compile_config: Union[CompileConfig, None] = None):

    class wrapper(mod.__class__):

        def __init__(self, module, compile_config: Union[CompileConfig, None] = None):
            self.__dict__ = module.__dict__.copy()
```
This causes the `wrapper`'s `forward` method not to be called and,
consequently, the wrapped module not to be compiled. Instead, the wrapped
module's own `forward` method is called, as illustrated in the diagram
below (a real scenario from DeepSpeed-Chat):
![compiled_module_wrapper_bug](https://github.com/microsoft/DeepSpeed/assets/75629718/00eeb3d1-927c-49c7-84ab-f882821cc452)
The proposed fix copies only those attributes which are not already present
in the wrapper class, thus implementing the desired inheritance behavior of
the wrapper.
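A minimal sketch of the idea (illustrative only; the class and compile handling here are simplified and not DeepSpeed's actual implementation):
```python
import torch

def CompiledModuleWrapperSketch(mod, compile_config=None):

    class wrapper(mod.__class__):

        def __init__(self, module, compile_config=None):
            # Copy only attributes the wrapper class does not define itself,
            # so wrapper methods such as `forward` are not clobbered.
            for name, value in module.__dict__.items():
                if name not in wrapper.__dict__:
                    self.__dict__[name] = value
            self._compile_config = compile_config
            self._compiled_fn = None

        def forward(self, *args, **kwargs):
            # Compile the wrapped module's forward lazily, then delegate to it.
            if self._compiled_fn is None:
                self._compiled_fn = torch.compile(super().forward)
            return self._compiled_fn(*args, **kwargs)

    return wrapper(mod, compile_config)
```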
Attached is a simple reproducer of the problem.
[compiled_module_wrapper_bug.zip](https://github.com/microsoft/DeepSpeed/files/15378282/compiled_module_wrapper_bug.zip)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The following error occurs on XPU while running the unit test
"DeepSpeed/tests/unit/moe/test_moe.py":
DeepSpeed/deepspeed/moe/sharded_moe.py, line 223, in top1gating
RuntimeError: Expected all tensors to be on the same device, but found
at least two devices, xpu:0 and cpu!
Fix it by converting the offending tensor to the correct device.
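A generic sketch of the fix pattern (the actual tensor names in `top1gating` are not shown):
```python
import torch

def to_same_device(reference: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
    # Move a tensor that may have been created on the CPU onto the device of
    # the tensor it is combined with (e.g. xpu:0), avoiding the mixed-device error.
    return other.to(reference.device)
```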
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Fixes the following error:
/datadisk2/wengshiy/llm.devkit/DeepSpeed/deepspeed/runtime/utils.py
return get_accelerator().FloatTensor(float(v)).detach()
TypeError: new(): data must be a sequence (got float)
The CUDA accelerator modified this interface while fixing a warning:
177dc14331
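One possible way to avoid the error (a sketch, not necessarily the exact change in this PR) is to build the scalar tensor with `torch.tensor` rather than passing a bare float to the `FloatTensor` constructor:
```python
import torch
from deepspeed.accelerator import get_accelerator

v = 1.5  # illustrative scalar value
total_norm = torch.tensor(float(v),
                          dtype=torch.float,
                          device=get_accelerator().current_device_name()).detach()
```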
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Hi, please review the following changes.
I added BF16 support to CPU Adam. BF16, FP16, and float (FP32) are supported
at compilation time; the correct template is called at runtime according to
the input parameters' dtype.
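A hedged usage sketch (the dtype dispatch itself happens inside the C++ extension; whether your build supports BF16 depends on this PR being applied):
```python
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

# BF16 parameter on the CPU; the extension picks the matching template at runtime.
param = torch.nn.Parameter(torch.randn(1024, dtype=torch.bfloat16))
optimizer = DeepSpeedCPUAdam([param], lr=1e-3)

param.grad = torch.randn_like(param)
optimizer.step()
```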
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* Use all_reduce instead of all_gather to fetch module parameters. This
improves performance by removing the overhead of concatenation and
slicing, which are no longer required.
* Instead, all tensor views are created prior to the collective
(all_reduce), so upon its completion only the parameter status is
updated.
* The behavior is enabled via a new boolean flag under the
"zero_optimization" section: { "stage3_use_all_reduce_for_fetch_params": true }
(see the example below).
* By default, the optimization is disabled.
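Example configuration (sketch):
```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Opt in to the all_reduce-based parameter fetch path described above.
        "stage3_use_all_reduce_for_fetch_params": True
    }
}
```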
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This PR enables building the below extensions for AMD GPUs with warp
size 32.
- transformer_inference
- quantizer
- random_ltd
This PR works stand-alone for torch versions <= 2.0. For the latest
versions, https://github.com/microsoft/DeepSpeed/pull/5401 must be merged
in addition to this PR.
Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2) on NAVI3x:
**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/transformer/inference
Before this PR:
===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s (0:01:09) =====
After this PR:
========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s ==========
**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/quantizer
Before this PR:
==== 244 failed, 8 warnings in 30.53s ====
After this PR:
====== 186 failed, 58 passed, 8 warnings in 8.89s ======
I could not find random_ltd related unit tests to run.
Fixes:
- https://github.com/microsoft/DeepSpeed/issues/4753
- https://github.com/microsoft/DeepSpeed/issues/5474
- https://github.com/ROCm/DeepSpeed/issues/68
cc: @jithunnair-amd
---------
Co-authored-by: rraminen@amd.com <rraminen>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Fixes https://github.com/microsoft/DeepSpeed/issues/4989
In addition to this PR, the changes listed below are required to
successfully build the following extensions. Please note that not all unit
tests for these extensions will pass with this PR; more details on the unit
test results are below. These unit tests are skipped in CI anyway, so they
will not break the CI.
Extensions:
- transformer_inference
- quantizer
- random_ltd
Required changes:
- https://github.com/pytorch/pytorch/pull/121030
- https://github.com/microsoft/DeepSpeed/pull/5402
Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2) on MI200:
**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/transformer/inference
Before this PR:
==== 674 failed, 622 skipped, 8 warnings, 1728 errors in 123.66s (0:02:03) =====
After this PR:
========== 555 failed, 983 passed, 1486 skipped, 8 warnings in 14.35s ==========
**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/quantizer
Before this PR:
==== 244 failed, 8 warnings in 48.02s ====
After this PR:
===== 187 failed, 57 passed, 8 warnings in 14.74s ====
I could not find random_ltd related unit tests to run.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Enhance testing: Skip fused_optimizer tests if not supported.
Added condition check to skip fused_optimizer tests if FusedAdam and
FusedLamb are not supported by the accelerator. This enhancement ensures
that the tests are appropriately skipped when the hardware configuration
does not support these optimizers, preventing potential issues.
Details:
- Introduced a condition check to determine support for FusedAdam and
FusedLamb.
- If not supported, fused_optimizer tests are skipped to improve test
reliability.
- Improved compatibility and stability across different hardware
configurations.
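A minimal sketch of the skip condition (using the op builders' compatibility check; the actual helper used in this PR may differ):
```python
import pytest
from deepspeed.ops.op_builder import FusedAdamBuilder, FusedLambBuilder

# Skip fused optimizer tests when the accelerator cannot build the fused ops.
fused_supported = FusedAdamBuilder().is_compatible() and FusedLambBuilder().is_compatible()

@pytest.mark.skipif(not fused_supported,
                    reason="FusedAdam/FusedLamb are not supported by this accelerator")
def test_fused_optimizer_step():
    ...  # actual fused optimizer test body
```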
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Previously, the reported optimizer name was the one that was configured,
not the optimizer actually selected after this function's processing.
The two are not always aligned.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR introduces a new monitoring option, `CometMonitor`, which is an
official integration with [CometML](https://www.comet.com/site/).
The new monitor is covered by unit tests.
Notes:
* We've updated `docs/code-docs/source/monitor.rst`, but it doesn't look
like it is used anymore.
* We've updated the "Monitoring Module" section name in `config-json.md`
to be generic so the next integration won't require updating it.
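A hedged configuration sketch (the section and key names are assumptions; consult the updated monitoring documentation for the exact schema):
```python
ds_config = {
    "comet": {
        "enabled": True,
        # Illustrative keys only.
        "project": "my-deepspeed-project",
        "experiment_name": "zero3-baseline"
    }
}
```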
---------
Co-authored-by: Boris Feld <lothiraldan@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Hi @loadams, could you please help review this PR?
After adding the hpp files in csrc, we found that the hpp headers were
sometimes still excluded from the op source packaging, so we also add the
hpp files under deepspeed to make sure the headers ship in the deepspeed
package, ensuring that JIT load can compile the xpu/fused_adam ops in 0.14.2.
This PR aims to enable Phi-3 mini autotp.
Phi-3 mini uses a chunked MLP. We adjust the linear layer weight order to
support this model.
Please kindly review~ Thanks!
---------
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
The compile wrapper will inherit from the user module class and copy its
`__dict__`.
This should resolve most issues in #5383 except for potential extra user
forward hooks.
@tohtana @loadams
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Creating a Torch tensor with the parameter
`device=get_accelerator().current_device()` can result in a crash when
using an NPU.
This issue arises because the `current_device` API across all
accelerators is expected to return a device id as an integer, according
to the [interface
docs.](fa8458b1a8/docs/_tutorials/accelerator-abstraction-interface.md?plain=1#L52C1-L56C103)
However, specifying `device` as an integer when creating tensors directs
Torch to use the CUDA backend by default, which leads to crashes on NPUs
(and potentially other accelerators as well).
To resolve this, we should use `get_accelerator().current_device_name()`
instead, which returns the correct device identifier string, such as
`"npu:0"`, `"cuda:0"`, or `"xpu:0"`. This API provides the appropriate
context needed for creating tensors on specific hardware accelerators.
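For example, a minimal sketch of the pattern this PR moves to:
```python
import torch
from deepspeed.accelerator import get_accelerator

# current_device_name() returns a device string such as "npu:0", "cuda:0" or "xpu:0",
# whereas current_device() returns a bare integer index.
x = torch.empty(16, device=get_accelerator().current_device_name())
```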
I also noticed that `device=get_accelerator().current_device()` is used
across several files under deepspeed/inference, which may also lead to
crashes on other accelerators.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The order of parameters in the create_dir_symlink method looks wrong. Because
of this, we get the error "PermissionError: [WinError 5] Denied access:
'.\\deepspeed\\ops\\csrc'" when installing deepspeed >= 0.4.0 on a Windows
environment.
Please check this out @eltonzheng and @jeffra.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This reverts commit 54c0687264 due to
#5256 causing bugs when the ZeRO3 + ZeRO Offload features are enabled.
This bug was discovered due to failures in the DS Chat CI workflow.
Failing tests across CI failures:
| Failing Test Name |
| --- |
| test_ds_chat[zero3--offload-] |
| test_ds_chat[zero3--offload-lora] |
| test_ds_chat[zero3-he-offload-] |
| test_ds_chat[zero3-he-offload-lora] |
Error message:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cpu!
```
It seems that `torch.stack()` or `torch.norm()` has issues when the offload
feature is enabled and tensors are split between CPU/GPU; however, this is
just an initial guess and would require more investigation.
@nelyahu Since you are the original author of the PR, if you have some
bandwidth, any help here is greatly appreciated!
After reverting this commit, all tests pass in the DS Chat CI workflow:
https://github.com/microsoft/DeepSpeed/actions/runs/8824064414/job/24225802763
@tjruwase for context.
PyTorch v2.3 throws an error when it tries to compile `iter_params` used
for ZeRO3.
This PR excludes the function from the compilation targets.
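One common way to exclude a function from compilation is to mark it with `torch.compiler.disable` (a sketch; the function body shown and the exact mechanism used in this PR are illustrative):
```python
import torch

@torch.compiler.disable
def iter_params(module, recurse=False):
    # torch.compile skips tracing this helper and falls back to eager execution.
    return (param for _, param in module.named_parameters(recurse=recurse))
```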
After this PR is merged, we can [unpin the torch version for unit
tests](https://github.com/microsoft/DeepSpeed/pull/5459).
Add getter and setter methods for `compile_backend` across accelerators,
which provide a mechanism to retrieve and set the compile backend. These APIs
handle user-defined backend selection and raise a `ValueError` with an
informative error message for unsupported backends.
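A hedged sketch of the accessor pattern (the class, attribute, and backend names here are assumptions, not the exact accelerator code):
```python
class AcceleratorCompileBackendSketch:
    _supported_backends = ["inductor", "eager"]  # illustrative list

    def __init__(self):
        self._compile_backend = "inductor"

    def get_compile_backend(self):
        return self._compile_backend

    def set_compile_backend(self, backend):
        if backend not in self._supported_backends:
            raise ValueError(
                f"{backend} not supported by this accelerator. "
                f"Supported backends: {self._supported_backends}")
        self._compile_backend = backend
```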
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.14.2
Author - @loadams
Co-authored-by: loadams <loadams@users.noreply.github.com>
Optimized version of `nn.Linear` that adds features such as the following
(illustrative usage sketch below):
* LoRA with base weight sharding
* FP [6,8,12] quantization
Depends on #5336 being merged first.
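A purely illustrative usage sketch; the import path, config objects, and constructor arguments below are assumptions and may differ from the merged API:
```python
import torch
# Assumed module path and config classes; verify against the merged API.
from deepspeed.linear import OptimizedLinear, LoRAConfig, QuantizationConfig

layer = OptimizedLinear(
    input_dim=4096,
    output_dim=4096,
    lora_config=LoRAConfig(lora_r=16, lora_alpha=32),    # LoRA with base weight sharding
    quantization_config=QuantizationConfig(q_bits=8),    # FP quantization (6/8/12 bits)
)
out = layer(torch.randn(2, 4096, dtype=torch.bfloat16))
```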
Co-authored-by: @rajhans
Co-authored-by: @aurickq
---------
Co-authored-by: Rajhans Samdani <rajhans.samdani@snowflake.com>
Co-authored-by: Jeff Rasley <jeff.rasley@snowflake.com>
As discussed in #5175, set the default to use `set_to_none` for clearing
gradients in the BF16 optimizer.
Additionally, for the case of zeroing gradients in place, use `foreach_zero`.
Verified correctness with mega-ds llama 7B training.
FYI @loadams
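A minimal sketch of the two clearing paths (illustrative; the BF16 optimizer's actual code differs):
```python
import torch

def clear_grads(params, set_to_none: bool = True):
    if set_to_none:
        # New default: drop the gradient tensors entirely.
        for p in params:
            p.grad = None
    else:
        # Zero the gradients in place with a single fused foreach call.
        grads = [p.grad for p in params if p.grad is not None]
        if grads:
            torch._foreach_zero_(grads)
```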
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The conversion script from a regular checkpoint to the universal one
runs the following steps in parallel:
1. extract ZeRO-sharded optimizer states
2. merge the shards
However, it passes `map()` only a small set of tasks at a time (the number
specified as workers), so it has to wait for the slowest task in each set
to finish.
This PR submits all the tasks to the pool and waits until the futures are
ready, so all workers can be kept running.
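A sketch of the submission pattern (generic `concurrent.futures` usage; the conversion script's actual worker functions and arguments are not shown):
```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_all(worker_fn, tasks, num_workers):
    # Submit every task up front so the pool always has work queued, instead of
    # handing map() small batches and waiting on the slowest task in each batch.
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(worker_fn, task) for task in tasks]
        return [future.result() for future in as_completed(futures)]
```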
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>