This PR adds an API, `deepspeed.runtime.zero.offload_states.get_state_devices`, which returns the devices of offload states, as suggested in this
[comment](https://github.com/microsoft/DeepSpeed/pull/6011#issuecomment-2358068777).
We could lift this up to `deepspeed.utils` but would need to resolve a
circular import: User code -> `deepspeed.utils` ->
`deepspeed.utils.offload_states` -> `deepspeed.runtime.zero` ->
`deepspeed.runtime.zero.partition_parameters` -> `deepspeed.utils`
Resolving this would require significant refactoring as long as
`OffloadStateTypeEnum` lives in `deepspeed.runtime.zero`.
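A minimal usage sketch (the call below assumes `get_state_devices` takes the ZeRO-3 engine and an `OffloadStateTypeEnum` member and returns the devices holding that state; see the PR for the exact signature):
```python
# Hypothetical usage sketch; the exact signature/return type is defined in this PR.
from deepspeed.runtime.zero import OffloadStateTypeEnum
from deepspeed.runtime.zero.offload_states import get_state_devices

# `model_engine` is a ZeRO-3 initialized DeepSpeed engine; `optim_states` is an
# assumed enum member used purely for illustration.
devices = get_state_devices(model_engine, OffloadStateTypeEnum.optim_states)
print(f"optimizer states currently live on: {devices}")
```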
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Restores the functionality of the CPU-locked tensor in the AIO library.
Makes the async_io operator available for the CPU accelerator, i.e., a
CPU-only environment.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Modified `_replace_module` in auto_tp.py:
The modification keeps the 'shared_expert_gate' and 'gate' layers in
qwen2-moe as their original type, torch.nn.Linear, instead of converting
them to LinearLayer. This way, their weights are not split across multiple
HPU/GPU cards, and qwen2-moe can run on multiple HPU/GPU cards.
Since the weights of 'gate' are not split across cards, all-gather
operations are not needed, which may improve performance.
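A rough sketch of the kind of name-based guard this adds (the helper below is illustrative, not the actual code in `auto_tp.py`):
```python
# Illustrative only: skip tensor-parallel replacement for the qwen2-moe gate
# layers so their nn.Linear weights stay replicated on every card.
KEEP_AS_LINEAR = ("shared_expert_gate", "gate")

def should_keep_as_linear(module_name: str) -> bool:
    # Match the layer names used by qwen2-moe; other modules are still replaced.
    return module_name.split(".")[-1] in KEEP_AS_LINEAR
```
Because every rank then holds the full gate weights, the gate output needs no all-gather across the tensor-parallel group.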
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Added Intel Gaudi to the list of accelerators in the setup guide.
Co-authored-by: sakell <sakell@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds the setup instructions for Huawei Ascend NPU. Please refer
to the remainder of the guide for instructions on other devices.
---------
Co-authored-by: sjh <sjh1270@163.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
DeepSpeed on Windows blog
---------
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This document provides a place to hold accelerator setup guides. It is
intended to be a single place to look up installation guides for different
accelerators. Currently, CPU and XPU setup guides are included, and the
document can be extended to other accelerators.
---------
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Update instructions with webrick dependency
Restore Gemfile that was accidentally removed in #5821
---------
Co-authored-by: Logan Adams <loadams@microsoft.com>
Train {GPT, LLaMA, Phi}-like models (or any model) at ultra-low cost with
DeepSpeed Universal Checkpointing (UCP). UCP abstracts away the
complexities of saving and loading model states. See the arXiv paper, blog,
and tutorial in this PR for details.
---------
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR introduces a new monitoring option, `CometMonitor`, which is an
official integration with
[CometML](https://www.comet.com/site/).
The new monitor is covered by unit tests.
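A sketch of enabling the new monitor via the DeepSpeed config (the `comet` section and its field names below are assumptions based on how the existing monitors are configured; see the updated monitoring docs for the authoritative schema):
```python
# Assumed config sketch; field names may differ from the actual CometMonitor schema.
ds_config = {
    "train_batch_size": 8,
    "comet": {
        "enabled": True,
        "project": "my-deepspeed-runs",  # hypothetical project name
    },
}
```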
Notes:
* We've updated `docs/code-docs/source/monitor.rst`, but it doesn't appear
to be used anymore.
* We've updated the "Monitoring Module" section name in `config-json.md`
to be generic so the next integration won't require updating it.
---------
Co-authored-by: Boris Feld <lothiraldan@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The right sidebar disappears off the right side of the page. These
changes bring the content back and place it correctly on the
page.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The default value should be `1e5` as in
[config.py](2eafe41be7/deepspeed/runtime/zero/config.py (L200)).
Signed-off-by: byhsu <byhsu@linkedin.com>
Co-authored-by: byhsu <byhsu@linkedin.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
The config variable for accumulation steps is
```gradient_accumulation_steps```, but the docs explaining batch-size-related
parameters refer to it as ```gradient_accumulation``` in the note
at the top. This could lead to misconfiguration if someone uses that
note as their reference for configuration, and it makes the docs less
clear to read because it is not obvious that
```gradient_accumulation``` actually refers to
```gradient_accumulation_steps```.
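For context, a small illustration of how the correctly named key fits into the batch-size arithmetic (the values are hypothetical):
```python
# With 8 GPUs: train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * n_gpus
#                               = 4 * 8 * 8 = 256
ds_config = {
    "train_batch_size": 256,
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,  # note the "_steps" suffix
}
```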
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
This PR fixes the UT error described in the PR and test job linked below.
It skips `TestModelTask` if the dtype is not supported by the
accelerator, or if `InferenceBuilder` is not implemented for the accelerator.
https://github.com/microsoft/DeepSpeed/pull/4419
https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538
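A minimal sketch of the skip pattern (the guard below is illustrative; the real checks live in the test fixtures and may use different helpers):
```python
# Illustrative skip guard; the exact helper usage in the test suite may differ.
import pytest
import torch
from deepspeed.accelerator import get_accelerator

def skip_if_unsupported(dtype):
    if dtype == torch.float16 and not get_accelerator().is_fp16_supported():
        pytest.skip("fp16 is not supported by this accelerator")
    if dtype == torch.bfloat16 and not get_accelerator().is_bf16_supported():
        pytest.skip("bf16 is not supported by this accelerator")
```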
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Liangliang-Ma <1906710196@qq.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Xie Zejian <xiezej@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
DeepSpeed currently supports a set of debugging APIs to
[get](https://deepspeed.readthedocs.io/en/latest/zero3.html#debugging)
and
[set](https://deepspeed.readthedocs.io/en/latest/zero3.html#modifying-partitioned-states)
the **full** model states (parameters, gradients, and optimizer states).
However, in some scenarios only the **local states** are needed, for
example when pruning some model layers based on a local criterion. After
calling `model_engine.step()`, we need to apply the local mask to the
partitioned parameters owned by each process. This PR therefore
introduces new APIs to `get` and `set` ZeRO-3 partial (local) model
states.
### APIs intro
```python
def safe_get_local_fp32_param(param):
    """Get the fp32 partitioned parameter."""

def safe_get_local_grad(param):
    """Get the fp32 gradient of a partitioned parameter."""

def safe_get_local_optimizer_state(param, optim_state_key):
    """Get the fp32 optimizer state of a partitioned parameter."""

def safe_set_local_fp32_param(param, value):
    """Update the partitioned fp32 parameter."""

def safe_set_local_optimizer_state(param, value, optim_state_key):
    """Update the fp32 optimizer state of a partitioned parameter."""
```
### Usage
```python
# local API
from deepspeed.utils import (
    safe_get_local_fp32_param,
    safe_get_local_grad,
    safe_get_local_optimizer_state,
    safe_set_local_fp32_param,
    safe_set_local_optimizer_state
)
```
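Building on the pruning scenario above, here is a hedged usage sketch of applying a local mask after the optimizer step (the magnitude criterion and the `model_engine` setup are illustrative, not part of this PR):
```python
# Illustrative only: prune each rank's local fp32 shard after the step.
from deepspeed.utils import safe_get_local_fp32_param, safe_set_local_fp32_param

model_engine.step()  # ZeRO-3 engine returned by deepspeed.initialize(...)
for param in model_engine.module.parameters():
    local_fp32 = safe_get_local_fp32_param(param)           # this rank's shard only
    mask = (local_fp32.abs() > 1e-3).to(local_fp32.dtype)   # toy local criterion
    safe_set_local_fp32_param(param, local_fp32 * mask)
```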
### TODO
- [x] Add local APIs
- [x] Add UTs
- [x] Update Docs
@tjruwase
---------
Signed-off-by: yliu <test@do_not_reply@neuralstudio.intel.com>
Co-authored-by: yliu <test@do_not_reply@neuralstudio.intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds a Japanese version of the DeepSpeed-FastGen blog.
(It also includes small typo fixes in the original blog.)
---------
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Thanks @xiaoxiawu-microsoft and @conglongli for reviewing and improving
it!
---------
Co-authored-by: Xiaoxia (Shirley) Wu <94406484+xiaoxiawu-microsoft@users.noreply.github.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
This PR introduces the Twin-Flow feature of ZeRO-Offload++, which improves
end-to-end training iteration time by up to 6x on DGX-H100s.
This PR includes:
* the Twin-Flow implementation inside the ZeRO optimizer
* a JSON config tutorial (a hedged config sketch is shown after this list)
* an example using DeepSpeed
* unit tests
cc @jeffra @awan-10 @tjruwase @mrwyattii
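A hedged sketch of what enabling Twin-Flow in the ZeRO config might look like (the `ratio` field under `offload_optimizer` is an assumption; the JSON config tutorial in this PR documents the actual keys):
```python
# Assumed config sketch; see the PR's config tutorial for the real schema.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "ratio": 0.3,  # assumed: fraction of parameter updates kept on CPU
        },
    },
}
```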
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>