Until now, only the last layer (idx=-1) was considered, via
FINAL_LAYER_NORM_INDEX, which is set to -1.
This PR allows the user to pass a custom value for models where this
default does not apply.
See an example of usage in the HabanaAI/Megatron-DeepSpeed fork repository:
c9feb8caca/tools/verify_checkpoint_non_tp_consistency.py (L296)
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
The `DistributedAttention` in DeepSpeed-Ulysses is compatible with the
training code in
[Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed/blob/main/megatron/model/transformer.py#L811)
because it only takes the sequence tensors as positional input parameters.
However, it is not compatible with frequently used scenarios that pass
additional parameters, such as the following one using Flash Attention:
```python
ulysses_attn = DistributedAttention(local_attention=flash_attn_func, sequence_process_group=None, scatter_idx=2, gather_idx=1)
attn_output = ulysses_attn(
query_states,
key_states,
value_states,
dropout,
softmax_scale,
causal=causal,
)
```
Therefore, a `**kwargs` parameter has been added to increase
compatibility with more local attention implementations while keeping the
code changes minimal.
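A simplified sketch of the idea (illustrative only; the all-to-all logic and exact signatures of the real `DistributedAttention` are omitted): extra positional and keyword arguments are simply forwarded to the wrapped local attention.
```python
import torch.nn as nn


class DistributedAttentionSketch(nn.Module):
    """Illustrative wrapper that forwards extra args/kwargs to the local attention."""

    def __init__(self, local_attention, sequence_process_group, scatter_idx=2, gather_idx=1):
        super().__init__()
        self.local_attn = local_attention
        self.spg = sequence_process_group
        self.scatter_idx = scatter_idx
        self.gather_idx = gather_idx

    def forward(self, query, key, value, *args, **kwargs):
        # ... all-to-all over the sequence dimension would happen here ...
        context = self.local_attn(query, key, value, *args, **kwargs)
        # ... a second all-to-all would restore the original layout ...
        return context
```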
Co-authored-by: Kwen-Chen <2133949025@qq.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The new "timers" section describes configuration for different timers.
Specifically, in the "throughput" section, it is possible to disable the
throughput timer (enabled by default). This allows to avoid the
performance degradation whenever the throughput measurement is not
needed, for example in production environment.
No device synchronize() is invoked when "synchronized" is set to False
(default is True). This allows to produce approximate throughput
measurements with minimal performance penalty.
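A hedged sketch of what such a config fragment could look like, based only on the description above (key names follow this text; consult the DeepSpeed config documentation for the authoritative schema):
```python
ds_config = {
    "train_batch_size": 8,
    "timers": {
        "throughput": {
            # Set to False to disable the throughput timer entirely
            # (it is enabled by default).
            "enabled": True,
            # When False, no device synchronize() is invoked, producing
            # approximate throughput numbers with minimal overhead
            # (default is True).
            "synchronized": False,
        }
    },
}
```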
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
**Fix overwriting of the compiled wrapper class attributes by those of
the wrapped class itself: Copy only those attributes which are not
already present in the wrapper.**
In the current implementation of `CompiledModuleWrapper`, the wrapper
attributes (e.g., the `forward` method) are overwritten by `self.__dict__ =
module.__dict__.copy()`:
```
def CompiledModuleWrapper(mod, compile_config: Union[CompileConfig, None] = None):
class wrapper(mod.__class__):
def __init__(self, module, compile_config: Union[CompileConfig, None] = None):
self.__dict__ = module.__dict__.copy()
```
This causes the `wrapper`'s `forward` method not to be called and,
consequently, the wrapped module not to be compiled. Instead, the wrapped
module's `forward` method is called, as illustrated in the diagram
below (a real scenario from DeepSpeed-Chat):
![compiled_module_wrapper_bug](https://github.com/microsoft/DeepSpeed/assets/75629718/00eeb3d1-927c-49c7-84ab-f882821cc452)
The proposed fix copies only those attributes which are not already present
in the wrapper class, thus implementing the desired inheritance behavior of
the wrapper.
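A minimal sketch of the idea (illustrative helper, not the exact DeepSpeed code):
```python
def copy_missing_attributes(wrapper_instance, wrapped_module):
    """Copy only attributes the wrapper does not already define, so wrapper
    methods such as `forward` keep taking precedence over the wrapped module's."""
    for name, value in wrapped_module.__dict__.items():
        if name not in wrapper_instance.__dict__:
            wrapper_instance.__dict__[name] = value
```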
Attached is a simple reproducer of the problem.
[compiled_module_wrapper_bug.zip](https://github.com/microsoft/DeepSpeed/files/15378282/compiled_module_wrapper_bug.zip)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The following error occurs on XPU when running the unit tests in
"DeepSpeed/tests/unit/moe/test_moe.py":
DeepSpeed/deepspeed/moe/sharded_moe.py line 223, in top1gating
RuntimeError: Expected all tensors to be on the same device, but found
at least two devices, xpu:0 and cpu!
This PR fixes it by converting the tensor to the correct device.
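A tiny sketch of what "device conversion" means here (illustrative; shown on CPU, but the same `.to(...)` call applies to XPU tensors):
```python
import torch

logits = torch.randn(4, 2)         # stands in for the gating logits on xpu:0
helper = torch.tensor([0.5])       # helper tensor created on the CPU by default
helper = helper.to(logits.device)  # move it to the logits' device before combining
out = logits * helper
```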
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Hi.
Please review the following changes.
I added BF16 support to CPU Adam. BF16, FP16, and float are supported
at compilation time; the correct template is called at runtime according
to the input parameters' dtype.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* Use all_reduce instead of all_gather to fetch module parameters. This
improves performance by reducing the overhead of concatenation and
slicing, which are no longer required.
* Instead, all tensor views are created prior to the collective
(all_reduce), so upon its completion only the parameter status is
updated.
* The behavior is enabled via a new boolean flag under the
"zero_optimization" section: { "stage3_use_all_reduce_for_fetch_params": true }
(see the config sketch below).
* By default the optimization is not enabled.
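A sketch of the config fragment (illustrative values; only the new flag is relevant here):
```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # New flag: fetch parameters with all_reduce instead of all_gather.
        "stage3_use_all_reduce_for_fetch_params": True,
    },
}
```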
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This PR enables building the below extensions for AMD GPUs with warp
size 32.
- transformer_inference
- quantizer
- random_ltd
This PR works stand-alone for torch versions <= 2.0. For the latest
versions, https://github.com/microsoft/DeepSpeed/pull/5401 must be merged
in addition to this PR.
Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on NAVI3x:
**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference
Before this PR:
===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s
(0:01:09) =====
After this PR:
========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s
==========
**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer
Before this PR:
==== 244 failed, 8 warnings in 30.53s ====
After this PR:
====== 186 failed, 58 passed, 8 warnings in 8.89s ======
I could not find random_ltd related unit tests to run.
Fixes:
- https://github.com/microsoft/DeepSpeed/issues/4753
- https://github.com/microsoft/DeepSpeed/issues/5474
- https://github.com/ROCm/DeepSpeed/issues/68
cc: @jithunnair-amd
---------
Co-authored-by: rraminen@amd.com <rraminen>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Fixes https://github.com/microsoft/DeepSpeed/issues/4989
In addition to this PR, the changes listed below are required to build the
following extensions successfully. Please note that not all unit tests for
these extensions will pass with this PR; more details on the unit test
results are below. These unit tests are skipped in CI anyway, so they will
not break the CI.
- transformer_inference
- quantizer
- random_ltd
- https://github.com/pytorch/pytorch/pull/121030
- https://github.com/microsoft/DeepSpeed/pull/5402
Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on MI200:
**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference
Before this PR:
==== 674 failed, 622 skipped, 8 warnings, 1728 errors in 123.66s
(0:02:03) =====
After this PR:
========== 555 failed, 983 passed, 1486 skipped, 8 warnings in 14.35s
==========
**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer
Before this PR:
==== 244 failed, 8 warnings in 48.02s ====
After this PR:
===== 187 failed, 57 passed, 8 warnings in 14.74s ====
I could not find random_ltd related unit tests to run.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
The optimizer name that was configured was being reported, rather than the
optimizer actually in effect after this function's processing.
The two are not always the same.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR introduces a new monitoring option, `CometMonitor`, which is an
official integration with
[CometML](https://www.comet.com/site/).
The new monitor is covered with unit tests.
Notes:
* We've updated `docs/code-docs/source/monitor.rst`, but it doesn't look
used anymore.
* We've updated the "Monitoring Module" section name in `config-json.md`
to be generic, so the next integration won't require updating it.
---------
Co-authored-by: Boris Feld <lothiraldan@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR aims to enable Phi-3 mini AutoTP.
Phi-3 mini uses a chunked MLP. We adjust the linear layer weight order to
support this model.
Please kindly review. Thanks!
---------
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
The compile wrapper will inherit from the user module class and copy its
`__dict__`.
This should resolve most issues in #5383, except for potential extra user
forward hooks.
@tohtana @loadams
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Creating a Torch tensor with the parameter
`device=get_accelerator().current_device()` can result in a crash when
using an NPU.
This issue arises because the `current_device` API across all
accelerators is expected to return a device id as an integer, according
to the [interface
docs.](fa8458b1a8/docs/_tutorials/accelerator-abstraction-interface.md?plain=1#L52C1-L56C103)
However, specifying `device` as an integer when creating tensors directs
Torch to use the CUDA backend by default, which leads to crashes on
NPUs (and potentially other accelerators as well).
To resolve this, we should use `get_accelerator().current_device_name()`
instead, which returns the correct device identifier strings such as
`"npu:0"`, `"cuda:0"`, or `"xpu:0"`. This API provides the appropriate
context needed for creating tensors on specific hardware accelerators.
I also noticed that `device=get_accelerator().current_device()` is used
across several files under deepspeed/inference, and may also lead to
crashes on other accelerators.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This reverts commit 54c0687264 due to
#5256 causing bugs when the ZeRO3 + ZeRO Offload features are enabled.
This bug was discovered due to failures in the DS Chat CI workflow.
Failing tests across CI failures:
| Failing Test Name |
| --- |
| test_ds_chat[zero3--offload-] |
| test_ds_chat[zero3--offload-lora] |
| test_ds_chat[zero3-he-offload-] |
| test_ds_chat[zero3-he-offload-lora] |
Error message:
```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cpu!
```
It seems that `torch.stack()` or `torch.norm()` has issues when
the offload feature is enabled and tensors are split between CPU and GPU;
however, this is just an initial guess and would require more
investigation.
@nelyahu Since you are the original author of the PR, if you have some
bandwidth, any help here is greatly appreciated!
After reverting this commit, all tests pass in the DS Chat CI workflow:
https://github.com/microsoft/DeepSpeed/actions/runs/8824064414/job/24225802763
@tjruwase for context.
PyTorch v2.3 throws an error when it tries to compile `iter_params` used
for ZeRO3.
This PR excludes the function from the compilation targets.
After this PR is merged, we can [unpin the torch version for unit
tests](https://github.com/microsoft/DeepSpeed/pull/5459).
Optimized version of `nn.Linear` that adds features such as:
* LoRA with base weight sharding
* FP [6,8,12] quantization
Depends on #5336 being merged first
Co-authored-by: @rajhans
Co-authored-by: @aurickq
---------
Co-authored-by: Rajhans Samdani <rajhans.samdani@snowflake.com>
Co-authored-by: Jeff Rasley <jeff.rasley@snowflake.com>
As discussed in #5175, this sets the default to `set_to_none` for clearing
gradients in the BF16 optimizer.
Additionally, for the case of zeroing the gradients, `foreach_zero` is used.
Correctness was verified with Megatron-DeepSpeed Llama 7B training.
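A minimal sketch of the described behavior (not the actual BF16_Optimizer code; function and parameter names are illustrative):
```python
import torch


def clear_lp_grads(params, set_to_none: bool = True):
    """Free gradient memory by default; when zeroing is requested, zero all
    gradients in one fused call instead of looping tensor by tensor."""
    if set_to_none:
        for p in params:
            p.grad = None
    else:
        grads = [p.grad for p in params if p.grad is not None]
        if grads:
            torch._foreach_zero_(grads)
```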
FYI @loadams
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The conversion script from a regular checkpoint to the universal one
runs the following steps in parallel:
1. extract the ZeRO-sharded optimizer states
2. merge the shards
However, it passes `map()` only a small set of tasks at a time (the number
specified as workers), so it has to wait for the slowest task of every set
to finish.
This PR submits all the tasks to the pool and waits until the futures are
ready, which keeps all workers busy.
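A rough sketch of the scheduling change (illustrative only; the real script's task functions and pool setup differ):
```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def run_tasks(worker_fn, tasks, num_workers):
    """Submit every task up front so all workers stay busy, instead of handing
    map() fixed-size batches that must wait for their slowest member."""
    results = []
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(worker_fn, task) for task in tasks]
        for future in as_completed(futures):
            results.append(future.result())
    return results
```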
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Thank you for the [PR](https://github.com/microsoft/DeepSpeed/pull/5369) and
to @delock for the contribution of ideas.
As mentioned in that
[PR](https://github.com/microsoft/DeepSpeed/pull/5369), each device has
its own environment variables.
We add visible_devices_envs() and set_visible_devices_envs() methods
to the accelerator class so that each accelerator can implement its env
settings within the interface, which is more generic for other
accelerators.
This commit has been tested on NPU nodes, each with 8 Ascend NPUs.
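A minimal sketch of the two interface methods described above (method names follow this description; the CUDA-style values are only an example of what one accelerator might return):
```python
import os
from typing import List


class ExampleAccelerator:
    """Illustrative accelerator implementing the env-visibility interface."""

    def visible_devices_envs(self) -> List[str]:
        # Each accelerator reports which environment variable(s) control
        # its device visibility (e.g. CUDA_VISIBLE_DEVICES for CUDA).
        return ["CUDA_VISIBLE_DEVICES"]

    def set_visible_devices_envs(self, current_env: dict, local_accelerator_ids: List[int]) -> None:
        # The launcher sets each reported variable to the device ids assigned
        # to the local process.
        for env in self.visible_devices_envs():
            current_env[env] = ",".join(map(str, local_accelerator_ids))


# Usage sketch: restrict a launched process to devices 0 and 1.
env = dict(os.environ)
ExampleAccelerator().set_visible_devices_envs(env, [0, 1])
```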
---------
Co-authored-by: yangcheng <yangcheng104@huawei.com>
Co-authored-by: eigen2017 <wobushiliu2@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This PR resolves the issue reported in #5283.
To resolve the issue, we sort files of sharded optimizer states based on
DP indices.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This PR adds new functionality to the dequantizer, called
`selective_dequantize`, which enables partially dequantizing a
3-dimensional tensor when we don't need to dequantize all the data
from a lower-bit format (like fp8/fp6) to bf16.
I also added a unit test to check its functionality.
---------
Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
- Adds multi-CPU processing to the `DistributedDataAnalyzer` map
operation (parallelism set with the `num_workers` parameter). It works with
a `SharedMemory` / `Manager` queue per metric, written to concurrently by
the processes.
- Makes `write_buffer_to_file` in the `DistributedDataAnalyzer` reduce
operation much faster by copying the output tensor to CPU and detaching it.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Bug description: on a dataset of 20 samples, when running 4 workers with
8 threads per worker, `split_dataset` would return the following for worker
id `1`:
```
self.worker_splits
[[0, 5], [5, 10], [10, 15], [15, 20]]
self.thread_splits
[[5, 6], [6, 7], [7, 8], [8, 9], [9, 10], [10, 10], [11, 10], [12, 10]]
```
`thread_splits` is wrong and causes a crash in the `DataAnalyzer`: the
end sample id is lower than the start sample id on the last two threads.
This PR fixes that by correcting the behaviour of `split_index`.
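One way such a `split_index` can be implemented (an illustrative sketch, not necessarily the exact code of the fix):
```python
def split_index(start: int, end: int, num_splits: int):
    """Partition the range [start, end) into num_splits contiguous, nearly equal
    chunks; an end index is never smaller than its start index."""
    total = end - start
    base, remainder = divmod(total, num_splits)
    splits, cursor = [], start
    for i in range(num_splits):
        size = base + (1 if i < remainder else 0)
        splits.append([cursor, cursor + size])
        cursor += size
    return splits
```
For worker 1's range [5, 10] with 8 threads, this sketch yields [[5, 6], [6, 7], [7, 8], [8, 9], [9, 10], [10, 10], [10, 10], [10, 10]], so no thread ends before it starts.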
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds an SHM-based `inference_all_reduce` kernel to the `TorchBackend`
communication backend. For inference on a CPU server, this path replaces the
default `torch.distributed.all_reduce`, which eventually uses the gloo
backend. This PR improves inference performance with AutoTP when
only stock PyTorch is installed, without Intel Extension for PyTorch.
Compared with the gloo backend, the SHM-based inference_all_reduce kernel is
a more direct path and performs much better on a single node.
| message size | gloo all_reduce(ms) | SHM all_reduce(ms) |
| --- | --- | --- |
| 32MB | 30.7 | 0.65 |
| 64KB | 0.23 | 0.028 |
In text generation with bloom-3b and AutoTP, the average token latency
improved by 1.45x with this PR on a 2-socket Xeon node.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
The `state_dict` function in torch's `module.py` emits a warning if
arguments are passed positionally instead of as keyword arguments.
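An illustrative example of the keyword-argument form that avoids the warning:
```python
import torch.nn as nn

module = nn.Linear(4, 4)
# Passing these as keyword arguments (rather than positionally) avoids the
# warning emitted by torch.nn.Module.state_dict.
sd = module.state_dict(destination=None, prefix="", keep_vars=False)
```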
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: ebonnafoux <ebonnafoux156@headmind.com>
Some users are concerned that changes in TP topology during MoE training
may interfere with experiments, having noticed similar issues:
- https://github.com/microsoft/Megatron-DeepSpeed/issues/151
- https://github.com/microsoft/Megatron-DeepSpeed/pull/176/files
We found a grad_norm calculation error after enabling TP. The error
occurs because the flattened grad of a param group is used, and the group
contains both non-TP and TP parameters. Therefore, a single attribute cannot
determine whether the flattened grad needs to contribute to the norm. In the
current code logic, all params are assumed to be non-TP, so only the
tp_rank 0 grads participate in the grad_norm computation; the grads on other
tp_ranks have grad_norm_sum equal to 0. We tested and found that between
TP=1 and TP=4, the difference in grad_norm is approximately a factor of two
(sqrt(4)), which aligns with the aforementioned issue. This problem should
also affect dense models.
In BF16 this problem is avoided because the param group grads are not
flattened.
We tested the loss curve on the 1.3B model. As the TP size increases, the
inconsistency gap should grow larger.
With this change: 1.3B with EP=4, TP=4 & 1, fp16, mbs=1, gbs=16
![image](https://github.com/microsoft/DeepSpeed/assets/27563729/855042c8-ac8a-4192-b465-5fa60c1a7c59)
Without this change: 1.3B with EP=4, TP=4 & 1, fp16, mbs=1, gbs=16
![image](https://github.com/microsoft/DeepSpeed/assets/27563729/66854d14-7b83-4b09-a669-b452d6157ea0)
---------
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Refine the guards of FP6 kernel compilation. Fix the `undefined symbol`
problem of FP6 kernels on non-Ampere architectures.
Related issue: https://github.com/microsoft/DeepSpeed-MII/issues/443.
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
ZeRO offload case:
Fix the issue of pageable H2D memcpy in the step process. The H2D memcpy now
uses pinned memory.
This speeds up the H2D memcpy by 6x on a single GPU and 4-5x on an 8-GPU node.
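A small sketch of the pattern (illustrative only; requires a CUDA-capable device):
```python
import torch

# Stage a host->device copy through pinned (page-locked) memory, which allows
# fast, asynchronous DMA transfers instead of slower pageable copies.
src_cpu = torch.randn(1024, 1024)                    # pageable host tensor
pinned = torch.empty_like(src_cpu, pin_memory=True)  # page-locked staging buffer
pinned.copy_(src_cpu)
dst_gpu = torch.empty(1024, 1024, device="cuda")
dst_gpu.copy_(pinned, non_blocking=True)             # fast, async H2D copy from pinned memory
```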
cc @tjruwase
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ubuntu <deepspeed@deepspeed-login.2d1icxc5dsxehnpuwt3ifc34ph.gvxx.internal.cloudapp.net>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds more flexibility to define weight tensor reshaping for
universal checkpointing.
Currently universal checkpointing assumes a few patterns of partitioning
for tensor parallelism, such as column/row wise partitioning of a 2-dim
tensor. However, these are not flexible enough to define partitioning
for more complex usages. Here are some examples:
1) MoE: The user may define the weight tensor for MoE's FFN as
[n_experts * hidden_out, hidden_in]. For TP, we need to *view* this
tensor as a 3-dim tensor and partition it along the `hidden_out` dimension.
2) GQA: The weights for QKV are often represented as one tensor, and Q, K,
and V may have different sizes. The tensor shape will be
[q_size + k_size + v_size, hidden]. We partition this along the first
dimension, but separately for each of Q, K, and V. In this case, we first
need to partition Q, K, and V separately and then concatenate them to get a
shard for TP.
We propose a new pattern `PARAMETER_WITH_SUB_PARAMS` to support this.
Here is the usage to cover the above use cases. You can define the view
of the weight tensor and specify the dimension for partitioning based on
the view.
```python
from deepspeed.checkpoint import PARAMETER_WITH_SUB_PARAMS, SubparamShape
info[PARAMETER_WITH_SUB_PARAMS] = [
asdict(SubparamShape(patterns=[layers_prefix + r"\d+moe.fc1.weight"],
shape=(num_experts, hidden_out, hidden_in), partition_dim=1)),
asdict(SubparamShape(patterns=[layers_prefix + r"\d+.qkv.weight"],
shape=((q_size, k_size, v_size), hidden_size), partition_dim=0)),
...
]
```
The conversion script (`ds_to_universal.py`) merges TP-sharded weight
tensors, and the universal checkpoint loader partitions them again
following this information.
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- Move `required_torch_version` check from deepspeed.runtime.utils to
deepspeed.utils.torch (newly created).
- Remove unused duplicate definition from `tests/unit/util.py`.
- Update all references to this function.
- Switch checks in `deepspeed/runtime/pipe/p2p.py` to use this function.
- Switch checks in `deepspeed/comm/torch.py` to use this function.
---------
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
This PR enhances DeepSpeed to support MoE for pipeline models (e.g.
GPTModelPipe from Megatron-DeepSpeed).
Main changes:
- Enhance expert groups creation for pipeline (enhance both flavors:
DP/PP/EP and DP/TP/PP/EP)
- Fix MoE save/load checkpoint for PipelineModule based models.
- Display MoE loss for PipelineModule based models.
- Support gradient reduction with BF16_Optimizer for
PipelineModule.<br>Note that the same commit also fixes a gradient reduction
error when using Megatron-DeepSpeed GPTModelPipe with BF16_Optimizer,
also for a dense (non-MoE) model.
- When using no-drop tokens, all-reduce the capacity (op=max) using
expert parallel group instead of world group
---------
Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
When using frameworks like HF Accelerate with MoE models in HF, there's
an issue when DeepSpeed creates the optimizer: we have no way
to automatically create compatible MoE param groups. This PR detects
the case where no client optimizer is set and model_parameters are passed to
DeepSpeed, and either verifies that they are MoE compatible or makes them
MoE compatible automatically.
This was never an issue previously since (1) MoE hasn't really been
tested outside Megatron-DeepSpeed (MDS) and (2) MDS manually converts the
weight-decay param groups into being MoE compatible before
deepspeed.initialize.
The error that is raised when the param groups are not MoE compatible
originates here:
cc897ecf15/deepspeed/runtime/zero/stage_1_and_2.py (L610-L612)
Tagging @tohtana and @ykim362 to help review
---------
Co-authored-by: Jeff Rasley <jeff.rasley@snowflake.com>
Flexible-bit quantizer-dequantizer library with fp6/fp12/fp8 support.
It requires an Ampere+ architecture, due to the initial focus of this
op on `bfloat16` input types only.
Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
When fine-tuning, we were running into issues where the capacity would
trigger the following error after some amount of training time. This was
caused by the size of the inputs to top1gating not being aligned
across ranks.
```
...
File "/shared/users/jrasley/DeepSpeed/deepspeed/moe/sharded_moe.py", line 427, in forward
gate_output = top1gating(logits, self.capacity_factor if self.training else self.eval_capacity_factor,
File "/shared/users/jrasley/DeepSpeed/deepspeed/moe/sharded_moe.py", line 240, in top1gating
top_idx = _top_idx(mask1_rand, capacity)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "/shared/users/jrasley/DeepSpeed/deepspeed/moe/sharded_moe.py", line 172, in _top_idx
@torch.jit.script
def _top_idx(source, k):
return torch.topk(source, k=k, dim=0)[1]
~~~~~~~~~~ <--- HERE
RuntimeError: selected index k out of range
```
Co-authored with: @rajhans
Reviewed/approved by: @samyam, @yaozhewei
Tagging @tohtana and @ykim362 to help review
Minor fix to resolve the logger import issue caused by a torch upstream
cleanup:
b6201a60c5
The log variable was renamed in torch master. This change creates the logger
using the public API to avoid compatibility issues.
Fixes: https://github.com/microsoft/DeepSpeed/pull/5346
---------
Signed-off-by: roger feng <roger.feng@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
`deepspeed.initialize` does not honor the `distributed_port` argument,
and always uses `TORCH_DISTRIBUTED_DEFAULT_PORT` to initialize the
distributed environment.
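For reference, a hedged usage sketch of how the argument is intended to be passed (the config values are illustrative; this assumes running under the DeepSpeed launcher with a distributed environment available):
```python
import torch
import deepspeed

model = torch.nn.Linear(8, 8)
ds_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# With this fix, the port passed here should actually be used to initialize
# the distributed environment instead of TORCH_DISTRIBUTED_DEFAULT_PORT.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
    distributed_port=29501,
)
```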
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The conversion from a regular checkpoint to a universal one relies on
sorting the ZeRO checkpoint files to merge sharded optimizer states. This
merge can silently produce wrong results because the sorting is
alphabetical.
The merging logic assumes that files are given in this order:
1. pp_index=0 tp_index=0 dp_index=0
2. pp_index=0 tp_index=0 dp_index=1
...
The optimizer state of a parameter can be sharded across multiple ranks.
If it is sharded across dp_index 9-11, the files will be
- bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt
- bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt
- bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt
Because they are sorted alphabetically, the script merges the sharded
fragments in the order [10, 11, 9].
This PR fixes the sort by extracting the dp ranks from the file names and
sorting the files treating the ranks as numbers.
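A sketch of the numeric sort (illustrative; the regex assumes the file naming shown above):
```python
import re


def sort_by_dp_rank(files):
    """Sort checkpoint shards by their dp rank as a number, not as a string."""
    def dp_rank(name):
        return int(re.search(r"zero_pp_rank_(\d+)_mp_rank", name).group(1))
    return sorted(files, key=dp_rank)


files = [
    "bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt",
    "bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt",
    "bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt",
]
# Numeric order: rank 9, then 10, then 11 (alphabetical order would give 10, 11, 9).
print(sort_by_dp_rank(files))
```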
Fixes #5283
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This fix solves the following:
- The previous iteration's lp grads are still alive during the next
iteration's forward, which increases the memory footprint.
- The hook behavior is not aligned with its name,
accumulate_hp_grads_and_remove_lp.
Co-authored-by: qunyang <quyang@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
When starting a job with `deepspeed --hostfile hostfile --master_addr
$MASTER_IP --ssh_port 20023 src/train_bash.py`, we get the error
KeyError: 'PDSH_SSH_ARGS_APPEND' in
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/launcher/multinode_runner.py#L77
because PDSH_SSH_ARGS_APPEND is not in the environment.
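A one-line sketch of the kind of fix (illustrative):
```python
import os

# Read the variable with a default instead of indexing os.environ directly,
# so a missing PDSH_SSH_ARGS_APPEND no longer raises KeyError.
pdsh_ssh_args_append = os.environ.get("PDSH_SSH_ARGS_APPEND", "")
```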
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>