Until now, only the last layer (idx=-1) was considered, via
FINAL_LAYER_NORM_INDEX, which is set to -1.
This PR allows the user to pass a custom value for models where this
default does not apply.
See an example of usage in the HabanaAI/Megatron-DeepSpeed fork repository:
c9feb8caca/tools/verify_checkpoint_non_tp_consistency.py (L296)
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
The `DistributedAttention` in DeepSpeed-Ulysses is compatible with the
training code in
[Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/model/transformer.py#L811)
because it only takes the sequence tensors as positional input parameters.
However, it is not compatible with frequently used scenarios that pass
additional parameters, such as the following one using Flash Attention:
```python
ulysses_attn = DistributedAttention(local_attention=flash_attn_func, sequence_process_group=None, scatter_idx=2, gather_idx=1)
attn_output = ulysses_attn(
query_states,
key_states,
value_states,
dropout,
softmax_scale,
causal=causal,
)
```
Therefore, a `**kwargs` parameter has been added to increase
compatibility with more local attention implementations while keeping the
code changes minimal.
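A simplified sketch of the idea (illustrative only; the all-to-all logic and exact signatures of the real `DistributedAttention` are omitted): extra positional and keyword arguments are simply forwarded to the wrapped local attention.
```python
import torch.nn as nn


class DistributedAttentionSketch(nn.Module):
    """Illustrative wrapper that forwards extra args/kwargs to the local attention."""

    def __init__(self, local_attention, sequence_process_group, scatter_idx=2, gather_idx=1):
        super().__init__()
        self.local_attn = local_attention
        self.spg = sequence_process_group
        self.scatter_idx = scatter_idx
        self.gather_idx = gather_idx

    def forward(self, query, key, value, *args, **kwargs):
        # ... all-to-all over the sequence dimension would happen here ...
        context = self.local_attn(query, key, value, *args, **kwargs)
        # ... a second all-to-all would restore the original layout ...
        return context
```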
Co-authored-by: Kwen-Chen <2133949025@qq.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The new "timers" section describes configuration for different timers.
Specifically, in the "throughput" section, it is possible to disable the
throughput timer (enabled by default). This allows to avoid the
performance degradation whenever the throughput measurement is not
needed, for example in production environment.
No device synchronize() is invoked when "synchronized" is set to False
(default is True). This allows to produce approximate throughput
measurements with minimal performance penalty.
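A hedged sketch of what such a config fragment could look like, based only on the description above (key names follow this text; consult the DeepSpeed config documentation for the authoritative schema):
```python
ds_config = {
    "train_batch_size": 8,
    "timers": {
        "throughput": {
            # Set to False to disable the throughput timer entirely
            # (it is enabled by default).
            "enabled": True,
            # When False, no device synchronize() is invoked, producing
            # approximate throughput numbers with minimal overhead
            # (default is True).
            "synchronized": False,
        }
    },
}
```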
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
**Fix overwriting of the compiled wrapper class attributes by those of
the wrapped class itself: Copy only those attributes which are not
already present in the wrapper.**
In the current implementation of `CompiledModuleWrapper`, the wrapper
attributes (e.g., the `forward` method) are overwritten by `self.__dict__ =
module.__dict__.copy()`:
```
def CompiledModuleWrapper(mod, compile_config: Union[CompileConfig, None] = None):
class wrapper(mod.__class__):
def __init__(self, module, compile_config: Union[CompileConfig, None] = None):
self.__dict__ = module.__dict__.copy()
```
This causes the `wrapper`'s `forward` method not to be called and,
consequently, the wrapped module not to be compiled. Instead, the wrapped
module's `forward` method is called, as illustrated in the diagram
below (a real scenario from DeepSpeed-Chat):
![compiled_module_wrapper_bug](https://github.com/microsoft/DeepSpeed/assets/75629718/00eeb3d1-927c-49c7-84ab-f882821cc452)
The proposed fix copies only those attributes which are not already present
in the wrapper class, thus implementing the desired inheritance behavior of
the wrapper.
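A minimal sketch of the idea (illustrative helper, not the exact DeepSpeed code):
```python
def copy_missing_attributes(wrapper_instance, wrapped_module):
    """Copy only attributes the wrapper does not already define, so wrapper
    methods such as `forward` keep taking precedence over the wrapped module's."""
    for name, value in wrapped_module.__dict__.items():
        if name not in wrapper_instance.__dict__:
            wrapper_instance.__dict__[name] = value
```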
Attached is a simple reproducer of the problem.
[compiled_module_wrapper_bug.zip](https://github.com/microsoft/DeepSpeed/files/15378282/compiled_module_wrapper_bug.zip)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The following error occurs on XPU when running the unit tests in
"DeepSpeed/tests/unit/moe/test_moe.py":
DeepSpeed/deepspeed/moe/sharded_moe.py line 223, in top1gating
RuntimeError: Expected all tensors to be on the same device, but found
at least two devices, xpu:0 and cpu!
This PR fixes it by converting the tensor to the correct device.
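A tiny sketch of what "device conversion" means here (illustrative; shown on CPU, but the same `.to(...)` call applies to XPU tensors):
```python
import torch

logits = torch.randn(4, 2)         # stands in for the gating logits on xpu:0
helper = torch.tensor([0.5])       # helper tensor created on the CPU by default
helper = helper.to(logits.device)  # move it to the logits' device before combining
out = logits * helper
```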
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Hi.
Please review the following changes.
I added BF16 support to CPU Adam. BF16, FP16, and float are supported
at compilation time; the correct template is called at runtime according
to the input parameters' dtype.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* Use all_reduce instead of all_gather to fetch module parameters. This
improves performance by reducing the overhead of concatenation and
slicing, which are no longer required.
* Instead, all tensor views are created prior to the collective
(all_reduce), so upon its completion only the parameter status is
updated.
* The behavior is enabled via a new boolean flag under the
"zero_optimization" section: { "stage3_use_all_reduce_for_fetch_params": true }
(see the config sketch below).
* By default the optimization is not enabled.
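A sketch of the config fragment (illustrative values; only the new flag is relevant here):
```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # New flag: fetch parameters with all_reduce instead of all_gather.
        "stage3_use_all_reduce_for_fetch_params": True,
    },
}
```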
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This PR enables building the below extensions for AMD GPUs with warp
size 32.
- transformer_inference
- quantizer
- random_ltd
This PR works stand-alone for torch versions <= 2.0. For the latest
versions, https://github.com/microsoft/DeepSpeed/pull/5401 must be merged
in addition to this PR.
Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on NAVI3x:
**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference
Before this PR:
===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s
(0:01:09) =====
After this PR:
========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s
==========
**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer
Before this PR:
==== 244 failed, 8 warnings in 30.53s ====
After this PR:
====== 186 failed, 58 passed, 8 warnings in 8.89s ======
I could not find random_ltd related unit tests to run.
Fixes:
- https://github.com/microsoft/DeepSpeed/issues/4753
- https://github.com/microsoft/DeepSpeed/issues/5474
- https://github.com/ROCm/DeepSpeed/issues/68
cc: @jithunnair-amd
---------
Co-authored-by: rraminen@amd.com <rraminen>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Fixes https://github.com/microsoft/DeepSpeed/issues/4989
In addition to this PR, the changes listed below are required to build the
following extensions successfully. Please note that not all unit tests for
these extensions will pass with this PR; more details on the unit test
results are below. These unit tests are skipped in CI anyway, so they will
not break the CI.
- transformer_inference
- quantizer
- random_ltd
- https://github.com/pytorch/pytorch/pull/121030
- https://github.com/microsoft/DeepSpeed/pull/5402
Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on MI200:
**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference
Before this PR:
==== 674 failed, 622 skipped, 8 warnings, 1728 errors in 123.66s
(0:02:03) =====
After this PR:
========== 555 failed, 983 passed, 1486 skipped, 8 warnings in 14.35s
==========
**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer
Before this PR:
==== 244 failed, 8 warnings in 48.02s ====
After this PR:
===== 187 failed, 57 passed, 8 warnings in 14.74s ====
I could not find random_ltd related unit tests to run.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
The optimizer name that was configured was being reported, rather than the
optimizer actually in effect after this function's processing.
The two are not always the same.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR introduces a new monitoring option, `CometMonitor`, which is an
official integration with
[CometML](https://www.comet.com/site/).
The new monitor is covered with unit tests.
Notes:
* We've updated `docs/code-docs/source/monitor.rst`, but it doesn't look
used anymore.
* We've updated the "Monitoring Module" section name in `config-json.md`
to be generic, so the next integration won't require updating it.
---------
Co-authored-by: Boris Feld <lothiraldan@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR aims to enable Phi-3 mini AutoTP.
Phi-3 mini uses a chunked MLP. We adjust the linear layer weight order to
support this model.
Please kindly review. Thanks!
---------
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
The compile wrapper will inherit from the user module class and copy its
`__dict__`.
This should resolve most issues in #5383, except for potential extra user
forward hooks.
@tohtana @loadams
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Creating a Torch tensor with the parameter
`device=get_accelerator().current_device()` can result in a crash when
using an NPU.
This issue arises because the `current_device` API across all
accelerators is expected to return a device id as an integer, according
to the [interface
docs.](fa8458b1a8/docs/_tutorials/accelerator-abstraction-interface.md?plain=1#L52C1-L56C103)
However, specifying `device` as an integer when creating tensors directs
Torch to use the CUDA backend by default, which leads to crashes on
NPUs (and potentially other accelerators as well).
To resolve this, we should use `get_accelerator().current_device_name()`
instead, which returns the correct device identifier strings such as
`"npu:0"`, `"cuda:0"`, or `"xpu:0"`. This API provides the appropriate
context needed for creating tensors on specific hardware accelerators.
I also noticed that `device=get_accelerator().current_device()` is used
across several files under deepspeed/inference, and may also lead to
crashes on other accelerators.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This reverts commit 54c0687264 due to
#5256 causing bugs when the ZeRO3 + ZeRO Offload features are enabled.
This bug was discovered due to failures in the DS Chat CI workflow.
Failing tests across CI failures:
| Failing Test Name |
| --- |
| test_ds_chat[zero3--offload-] |
| test_ds_chat[zero3--offload-lora] |
| test_ds_chat[zero3-he-offload-] |
| test_ds_chat[zero3-he-offload-lora] |
Error message:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cpu!
```
It seems that `torch.stack()` or `torch.norm()` has issues when
the offload feature is enabled and tensors are split between CPU and GPU;
however, this is just an initial guess and would require more
investigation.
@nelyahu Since you are the original author of the PR, if you have some
bandwidth, any help here is greatly appreciated!
After reverting this commit, all tests pass in the DS Chat CI workflow:
https://github.com/microsoft/DeepSpeed/actions/runs/8824064414/job/24225802763
@tjruwase for context.
PyTorch v2.3 throws an error when it tries to compile `iter_params` used
for ZeRO3.
This PR excludes the function from the compilation targets.
After this PR is merged, we can [unpin the torch version for unit
tests](https://github.com/microsoft/DeepSpeed/pull/5459).
Optimized version of `nn.Linear` that adds features such as:
* LoRA with base weight sharding
* FP [6,8,12] quantization
Depends on #5336 being merged first
Co-authored-by: @rajhans
Co-authored-by: @aurickq
---------
Co-authored-by: Rajhans Samdani <rajhans.samdani@snowflake.com>
Co-authored-by: Jeff Rasley <jeff.rasley@snowflake.com>
As discussed in #5175, this sets the default to `set_to_none` for clearing
gradients in the BF16 optimizer.
Additionally, for the case of zeroing the gradients, `foreach_zero` is used.
Correctness was verified with Megatron-DeepSpeed Llama 7B training.
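A minimal sketch of the described behavior (not the actual BF16_Optimizer code; function and parameter names are illustrative):
```python
import torch


def clear_lp_grads(params, set_to_none: bool = True):
    """Free gradient memory by default; when zeroing is requested, zero all
    gradients in one fused call instead of looping tensor by tensor."""
    if set_to_none:
        for p in params:
            p.grad = None
    else:
        grads = [p.grad for p in params if p.grad is not None]
        if grads:
            torch._foreach_zero_(grads)
```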
FYI @loadams
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The conversion script from a regular checkpoint to the universal one
runs the following steps in parallel:
1. extract the ZeRO-sharded optimizer states
2. merge the shards
However, it passes `map()` only a small set of tasks at a time (the number
specified as workers), so it has to wait for the slowest task of every set
to finish.
This PR submits all the tasks to the pool and waits until the futures are
ready, which keeps all workers busy.
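A rough sketch of the scheduling change (illustrative only; the real script's task functions and pool setup differ):
```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def run_tasks(worker_fn, tasks, num_workers):
    """Submit every task up front so all workers stay busy, instead of handing
    map() fixed-size batches that must wait for their slowest member."""
    results = []
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(worker_fn, task) for task in tasks]
        for future in as_completed(futures):
            results.append(future.result())
    return results
```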
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Thank you for the [PR](https://github.com/microsoft/DeepSpeed/pull/5369) and
to @delock for the contribution of ideas.
As mentioned in that
[PR](https://github.com/microsoft/DeepSpeed/pull/5369), each device has
its own environment variables.
We add visible_devices_envs() and set_visible_devices_envs() methods
to the accelerator class so that each accelerator can implement its env
settings within the interface, which is more generic for other
accelerators.
This commit has been tested on NPU nodes, each with 8 Ascend NPUs.
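A minimal sketch of the two interface methods described above (method names follow this description; the CUDA-style values are only an example of what one accelerator might return):
```python
import os
from typing import List


class ExampleAccelerator:
    """Illustrative accelerator implementing the env-visibility interface."""

    def visible_devices_envs(self) -> List[str]:
        # Each accelerator reports which environment variable(s) control
        # its device visibility (e.g. CUDA_VISIBLE_DEVICES for CUDA).
        return ["CUDA_VISIBLE_DEVICES"]

    def set_visible_devices_envs(self, current_env: dict, local_accelerator_ids: List[int]) -> None:
        # The launcher sets each reported variable to the device ids assigned
        # to the local process.
        for env in self.visible_devices_envs():
            current_env[env] = ",".join(map(str, local_accelerator_ids))


# Usage sketch: restrict a launched process to devices 0 and 1.
env = dict(os.environ)
ExampleAccelerator().set_visible_devices_envs(env, [0, 1])
```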
---------
Co-authored-by: yangcheng <yangcheng104@huawei.com>
Co-authored-by: eigen2017 <wobushiliu2@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This PR resolves the issue reported in #5283.
To resolve the issue, we sort files of sharded optimizer states based on
DP indices.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This PR adds new functionality to the dequantizer, called
`selective_dequantize`, which enables partially dequantizing a
3-dimensional tensor when we don't need to dequantize all the data
from a lower-bit format (like fp8/fp6) to bf16.
I also added a unit test to check its functionality.
---------
Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
- Adds multi-CPU processing to the `DistributedDataAnalyzer` map
operation (parallelism set with the `num_workers` parameter). It works with
a `SharedMemory` / `Manager` queue per metric, written to concurrently by
the processes.
- Makes `write_buffer_to_file` in the `DistributedDataAnalyzer` reduce
operation much faster by copying the output tensor to CPU and detaching it.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Bug description: on a dataset of 20 samples, when running 4 workers with
8 threads per worker, `split_dataset` would return the following for worker
id `1`:
```
self.worker_splits
[[0, 5], [5, 10], [10, 15], [15, 20]]
self.thread_splits
[[5, 6], [6, 7], [7, 8], [8, 9], [9, 10], [10, 10], [11, 10], [12, 10]]
```
`thread_splits` is wrong and causes a crash in the `DataAnalyzer`: the
end sample id is lower than the start sample id on the last two threads.
This PR fixes that by correcting the behaviour of `split_index`.
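One way such a `split_index` can be implemented (an illustrative sketch, not necessarily the exact code of the fix):
```python
def split_index(start: int, end: int, num_splits: int):
    """Partition the range [start, end) into num_splits contiguous, nearly equal
    chunks; an end index is never smaller than its start index."""
    total = end - start
    base, remainder = divmod(total, num_splits)
    splits, cursor = [], start
    for i in range(num_splits):
        size = base + (1 if i < remainder else 0)
        splits.append([cursor, cursor + size])
        cursor += size
    return splits
```
For worker 1's range [5, 10] with 8 threads, this sketch yields [[5, 6], [6, 7], [7, 8], [8, 9], [9, 10], [10, 10], [10, 10], [10, 10]], so no thread ends before it starts.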
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds an SHM-based `inference_all_reduce` kernel to the `TorchBackend`
communication backend. For inference on a CPU server, this path replaces the
default `torch.distributed.all_reduce`, which eventually uses the gloo
backend. This PR improves inference performance with AutoTP when
only stock PyTorch is installed, without Intel Extension for PyTorch.
Compared with the gloo backend, the SHM-based inference_all_reduce kernel is
a more direct path and performs much better on a single node.
| message size | gloo all_reduce(ms) | SHM all_reduce(ms) |
| --- | --- | --- |
| 32MB | 30.7 | 0.65 |
| 64KB | 0.23 | 0.028 |
In text generation with bloom-3b and AutoTP, the average token latency
improved by 1.45x with this PR on a 2-socket Xeon node.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
The `state_dict` function in torch's `module.py` emits a warning if
arguments are passed positionally instead of as keyword arguments.
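An illustrative example of the keyword-argument form that avoids the warning:
```python
import torch.nn as nn

module = nn.Linear(4, 4)
# Passing these as keyword arguments (rather than positionally) avoids the
# warning emitted by torch.nn.Module.state_dict.
sd = module.state_dict(destination=None, prefix="", keep_vars=False)
```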
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: ebonnafoux <ebonnafoux156@headmind.com>
Some users are concerned that changes in TP topology during MoE training
may interfere with experiments, having noticed similar issues:
- https://github.com/microsoft/Megatron-DeepSpeed/issues/151
- https://github.com/microsoft/Megatron-DeepSpeed/pull/176/files
We found a grad_norm calculation error after enabling TP. The error
occurs because the flattened grad of a param group is used, and the group
contains both non-TP and TP parameters. Therefore, a single attribute cannot
determine whether the flattened grad needs to contribute to the norm. In the
current code logic, all params are assumed to be non-TP, so only the
tp_rank 0 grads participate in the grad_norm computation; the grads on other
tp_ranks have grad_norm_sum equal to 0. We tested and found that between
TP=1 and TP=4, the difference in grad_norm is approximately a factor of two
(sqrt(4)), which aligns with the aforementioned issue. This problem should
also affect dense models.
In BF16 this problem is avoided because the param group grads are not
flattened.
We tested the loss curve on the 1.3B model. As the TP size increases, the
inconsistency gap should grow larger.
With this change: 1.3B with EP=4, TP=4 & 1, fp16, mbs=1, gbs=16
![image](https://github.com/microsoft/DeepSpeed/assets/27563729/855042c8-ac8a-4192-b465-5fa60c1a7c59)
Without this change: 1.3B with EP=4, TP=4 & 1, fp16, mbs=1, gbs=16
![image](https://github.com/microsoft/DeepSpeed/assets/27563729/66854d14-7b83-4b09-a669-b452d6157ea0)
---------
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Refine the guards of FP6 kernel compilation. Fix the `undefined symbol`
problem of FP6 kernels on non-Ampere architectures.
Related issue: https://github.com/microsoft/DeepSpeed-MII/issues/443.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
ZeRO offload case:
Fix the issue of pageable H2D memcpy in the step process. The H2D memcpy now
uses pinned memory.
This speeds up the H2D memcpy by 6x on a single GPU and 4-5x on an 8-GPU node.
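A small sketch of the pattern (illustrative only; requires a CUDA-capable device):
```python
import torch

# Stage a host->device copy through pinned (page-locked) memory, which allows
# fast, asynchronous DMA transfers instead of slower pageable copies.
src_cpu = torch.randn(1024, 1024)                    # pageable host tensor
pinned = torch.empty_like(src_cpu, pin_memory=True)  # page-locked staging buffer
pinned.copy_(src_cpu)
dst_gpu = torch.empty(1024, 1024, device="cuda")
dst_gpu.copy_(pinned, non_blocking=True)             # fast, async H2D copy from pinned memory
```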
cc @tjruwase
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ubuntu <deepspeed@deepspeed-login.2d1icxc5dsxehnpuwt3ifc34ph.gvxx.internal.cloudapp.net>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds more flexibility to define weight tensor reshaping for
universal checkpointing.
Currently universal checkpointing assumes a few patterns of partitioning
for tensor parallelism, such as column/row wise partitioning of a 2-dim
tensor. However, these are not flexible enough to define partitioning
for more complex usages. Here are some examples:
1) MoE: The user may define the weight tensor for MoE's FFN as
[n_experts * hidden_out, hidden_in]. For TP, we need to *view* this
tensor as a 3-dim tensor and partition it along the `hidden_out` dimension.
2) GQA: The weights for QKV are often represented as one tensor, and Q, K,
and V may have different sizes. The tensor shape will be
[q_size + k_size + v_size, hidden]. We partition this along the first
dimension, but separately for each of Q, K, and V. In this case, we first
need to partition Q, K, and V separately and then concatenate them to get a
shard for TP.
We propose a new pattern `PARAMETER_WITH_SUB_PARAMS` to support this.
Here is the usage to cover the above use cases. You can define the view
of the weight tensor and specify the dimension for partitioning based on
the view.
```python
from deepspeed.checkpoint import PARAMETER_WITH_SUB_PARAMS, SubparamShape
info[PARAMETER_WITH_SUB_PARAMS] = [
asdict(SubparamShape(patterns=[layers_prefix + r"\d+moe.fc1.weight"],
shape=(num_experts, hidden_out, hidden_in), partition_dim=1)),
asdict(SubparamShape(patterns=[layers_prefix + r"\d+.qkv.weight"],
shape=((q_size, k_size, v_size), hidden_size), partition_dim=0)),
...
]
```
The conversion script (`ds_to_universal.py`) merges TP-sharded weight
tensors, and the universal checkpoint loader partitions them again
following this information.
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- Move `required_torch_version` check from deepspeed.runtime.utils to
deepspeed.utils.torch (newly created).
- Remove unused duplicate definition from `tests/unit/util.py`.
- Update all references to this function.
- Switch checks in `deepspeed/runtime/pipe/p2p.py` to use this function.
- Switch checks in `deepspeed/comm/torch.py` to use this function.
---------
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
This PR enhances DeepSpeed to support MoE for pipeline models (e.g.
GPTModelPipe from Megatron-DeepSpeed).
Main changes:
- Enhance expert groups creation for pipeline (enhance both flavors:
DP/PP/EP and DP/TP/PP/EP)
- Fix MoE save/load checkpoint for PipelineModule based models.
- Display MoE loss for PipelineModule based models.
- Support gradient reduction with BF16_Optimizer for
PipelineModule.<br>Note that the same commit also fixes a gradient reduction
error when using Megatron-DeepSpeed GPTModelPipe with BF16_Optimizer,
also for a dense (non-MoE) model.
- When using no-drop tokens, all-reduce the capacity (op=max) using
expert parallel group instead of world group
---------
Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
When using frameworks like HF Accelerate with MoE models in HF, there's
an issue when DeepSpeed creates the optimizer: we have no way
to automatically create compatible MoE param groups. This PR detects
the case where no client optimizer is set and model_parameters are passed to
DeepSpeed, and either verifies that they are MoE compatible or makes them
MoE compatible automatically.
This was never an issue previously since (1) MoE hasn't really been
tested outside Megatron-DeepSpeed (MDS) and (2) MDS manually converts the
weight-decay param groups into being MoE compatible before
deepspeed.initialize.
The error that is raised when the param groups are not MoE compatible
originates here:
cc897ecf15/deepspeed/runtime/zero/stage_1_and_2.py (L610-L612)
Tagging @tohtana and @ykim362 to help review
---------
Co-authored-by: Jeff Rasley <jeff.rasley@snowflake.com>
Flexible-bit quantizer-dequantizer library with fp6/fp12/fp8 support.
It requires an Ampere+ architecture, due to the initial focus of this
op on `bfloat16` input types only.
Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
When fine-tuning, we were running into issues where the capacity would
trigger the following error after some amount of training time. This was
caused by the size of the inputs to top1gating not being aligned
across ranks.
```
...
File "/shared/users/jrasley/DeepSpeed/deepspeed/moe/sharded_moe.py", line 427, in forward
gate_output = top1gating(logits, self.capacity_factor if self.training else self.eval_capacity_factor,
File "/shared/users/jrasley/DeepSpeed/deepspeed/moe/sharded_moe.py", line 240, in top1gating
top_idx = _top_idx(mask1_rand, capacity)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "/shared/users/jrasley/DeepSpeed/deepspeed/moe/sharded_moe.py", line 172, in _top_idx
@torch.jit.script
def _top_idx(source, k):
return torch.topk(source, k=k, dim=0)[1]
~~~~~~~~~~ <--- HERE
RuntimeError: selected index k out of range
```
Co-authored with: @rajhans
Reviewed/approved by: @samyam, @yaozhewei
Tagging @tohtana and @ykim362 to help review
Minor fix to resolve the logger import issue caused by a torch upstream
cleanup:
b6201a60c5
The log variable was renamed in torch master. This change creates the logger
using the public API to avoid compatibility issues.
Fixes: https://github.com/microsoft/DeepSpeed/pull/5346
---------
Signed-off-by: roger feng <roger.feng@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
`deepspeed.initialize` does not honor the `distributed_port` argument,
and always uses `TORCH_DISTRIBUTED_DEFAULT_PORT` to initialize the
distributed environment.
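For reference, a hedged usage sketch of how the argument is intended to be passed (the config values are illustrative; this assumes running under the DeepSpeed launcher with a distributed environment available):
```python
import torch
import deepspeed

model = torch.nn.Linear(8, 8)
ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# With this fix, the port passed here should actually be used to initialize
# the distributed environment instead of TORCH_DISTRIBUTED_DEFAULT_PORT.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
    distributed_port=29501,
)
```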
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The conversion from a regular checkpoint to a universal one relies on
sorting the ZeRO checkpoint files to merge sharded optimizer states. This
merge can silently produce wrong results because the sorting is
alphabetical.
The merging logic assumes that files are given in this order:
1. pp_index=0 tp_index=0 dp_index=0
2. pp_index=0 tp_index=0 dp_index=1
...
The optimizer state of a parameter can be sharded across multiple ranks.
If it is sharded across dp_index 9-11, the files will be
- bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt
- bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt
- bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt
Because they are sorted alphabetically, the script merges the sharded
fragments in the order [10, 11, 9].
This PR fixes the sort by extracting the dp ranks from the file names and
sorting the files treating the ranks as numbers.
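A sketch of the numeric sort (illustrative; the regex assumes the file naming shown above):
```python
import re


def sort_by_dp_rank(files):
    """Sort checkpoint shards by their dp rank as a number, not as a string."""
    def dp_rank(name):
        return int(re.search(r"zero_pp_rank_(\d+)_mp_rank", name).group(1))
    return sorted(files, key=dp_rank)


files = [
    "bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt",
    "bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt",
    "bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt",
]
# Numeric order: rank 9, then 10, then 11 (alphabetical order would give 10, 11, 9).
print(sort_by_dp_rank(files))
```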
Fixes #5283
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This fix solves the following:
- The previous iteration's lp grads are still alive during the next
iteration's forward, which increases the memory footprint.
- The hook behavior is not aligned with its name,
accumulate_hp_grads_and_remove_lp.
Co-authored-by: qunyang <quyang@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
When starting a job with `deepspeed --hostfile hostfile --master_addr
$MASTER_IP --ssh_port 20023 src/train_bash.py`, we get the error
KeyError: 'PDSH_SSH_ARGS_APPEND' in
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/launcher/multinode_runner.py#L77
because PDSH_SSH_ARGS_APPEND is not in the environment.
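A one-line sketch of the kind of fix (illustrative):
```python
import os

# Read the variable with a default instead of indexing os.environ directly,
# so a missing PDSH_SSH_ARGS_APPEND no longer raises KeyError.
pdsh_ssh_args_append = os.environ.get("PDSH_SSH_ARGS_APPEND", "")
```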
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>