Commit graph

462 commits

Author SHA1 Message Date
Olatunji Ruwase 65ab64481f
Add API for updating ZeRO gradients (#6590) 2024-10-14 17:35:41 +00:00
Masahiro Tanaka adec99121b
Add API to get devices of offload states (#6586)
This PR adds an API `deepspeed.runtime.zero.offload_states.get_state_devices`, which gets the devices of offload states, as suggested in this
[comment](https://github.com/microsoft/DeepSpeed/pull/6011#issuecomment-2358068777).
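
A minimal usage sketch, assuming the signature `get_state_devices(model, state)` returns a set of `torch.device`, and that `OffloadStateTypeEnum` lives in `deepspeed.runtime.zero.offload_config`:

```python
from deepspeed.runtime.zero.offload_states import get_state_devices
from deepspeed.runtime.zero.offload_config import OffloadStateTypeEnum

# model_engine is a hypothetical ZeRO-initialized DeepSpeed engine
devices = get_state_devices(model_engine, OffloadStateTypeEnum.hp_params)
print(devices)  # e.g. {device(type='cuda', index=0)} before offloading
```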

We could lift this up to `deepspeed.utils` but would need to resolve a
circular import: User code -> `deepspeed.utils` ->
`deepspeed.utils.offload_states` -> `deepspeed.runtime.zero` ->
`deepspeed.runtime.zero.partition_parameters` -> `deepspeed.utils`

This will require a significant refactoring as long as we have
`OffloadStateTypeEnum` in `deepspeed.runtime.zero`.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-10 02:59:26 +00:00
Joe Mayer a1f98bdc70
AIO CPU Locked Tensor (#6592)
Restores the functionality of the CPU locked tensor in the AIO library and
makes the async_io operator available for the CPU accelerator, i.e., a
CPU-only environment.
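
A hedged sketch of exercising the restored locked-tensor path; the handle constructor arguments and method names are assumptions based on the AIO operator's typical interface:

```python
import torch
from deepspeed.ops.op_builder import AsyncIOBuilder

aio_ops = AsyncIOBuilder().load()
# assumed args: block_size, queue_depth, single_submit, overlap_events, num_threads
handle = aio_ops.aio_handle(1024 * 1024, 128, False, False, 1)

# allocate a page-locked CPU buffer of 1024 uint8 elements
pinned = handle.new_cpu_locked_tensor(1024, torch.empty(0, dtype=torch.uint8))
# ... use the pinned buffer for asynchronous file I/O ...
handle.free_cpu_locked_tensor(pinned)
```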

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-09 21:07:31 +00:00
gyou2021 474a3288cd
Enabled Qwen2-MoE Tensor Parallelism (TP) inference (#6551)
Modified `_replace_module` in auto_tp.py: the modification keeps the layers
'shared_expert_gate' and 'gate' in qwen2-moe as the original type
torch.nn.Linear instead of converting them into LinearLayer. This way their
weights are not split across multiple HPU/GPU cards, so qwen2-moe can run
on multiple HPU/GPU cards. Since the weights of 'gate' are not split across
cards, no all-gather operations are needed, which may improve performance
(see the sketch below).
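
An illustrative sketch of the idea (not the actual `auto_tp.py` code); `to_linear_layer` stands in for DeepSpeed's replacement routine:

```python
import torch

# qwen2-moe gating layers that must keep whole (unsharded) weights
KEEP_AS_LINEAR = ('shared_expert_gate', 'gate')

def maybe_replace(name: str, child: torch.nn.Module, to_linear_layer):
    if name in KEEP_AS_LINEAR and isinstance(child, torch.nn.Linear):
        return child  # keep torch.nn.Linear: no weight split, no all-gather
    return to_linear_layer(child)  # shard everything else as usual
```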

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-09 15:23:16 +00:00
Masahiro Tanaka 047bcf6af6
Add APIs to offload states of model, optimizer, and engine (#6011)
This PR adds the following APIs to offload model, optimizer, and engine
states.

```python
def offload_states(self,
                   include: Container[OffloadStateTypeEnum] = None,
                   device: OffloadDeviceEnum = OffloadDeviceEnum.cpu,
                   pin_memory: bool = True,
                   non_blocking: bool = False) -> None:
    """Move the ZeRO optimizer buffers to the specified device.

    Arguments:
        include: Optional. The set of states to offload. If not provided, all states are offloaded.
        device: Optional. The device to move the ZeRO optimizer buffers to.
        pin_memory: Optional. Whether to pin the memory of the offloaded states.
        non_blocking: Optional. Whether to offload the states asynchronously.
...
def offload_states_back(self, non_blocking: bool = False) -> None:
```

Here is the typical usage.
```python
# Offload after forward, backward, and step
model.offload_states()
# Do something requiring a lot of device memory
...
# Load states back to device memory
model.offload_states_back()
```

You can selectively offload states to balance the offloading overhead
and memory saving.
```python
# enum locations assumed to be deepspeed.runtime.zero.offload_config
from deepspeed.runtime.zero.offload_config import OffloadStateTypeEnum, OffloadDeviceEnum

model.offload_states(include=set([OffloadStateTypeEnum.hp_params, OffloadStateTypeEnum.opt_states]),
                     device=OffloadDeviceEnum.cpu)
```

Performance (4.3B parameters / 4x A100)
- Environment (4x A100, [benchmark
script](https://gist.github.com/tohtana/05d5faba5068cf839abfc7b1e38b85e4))
  - Average device-to-host transfer bandwidth: 2.45 GB/s (aggregated: 9.79 GB/s)
  - Average host-to-device transfer bandwidth: 11.05 GB/s (aggregated: 44.19 GB/s)
- Memory (allocated by PyTorch)
  - Before offloading: 18.2 GB
  - After offloading: 17.7 MB
- Time ([benchmark
script](https://github.com/microsoft/DeepSpeedExamples/tree/tohtana/offload_states/training/offload_states),
offloading time/loading time)

| Run | pin_memory=0 non_blocking=0 | pin_memory=0 non_blocking=1 | pin_memory=1 non_blocking=0 | pin_memory=1 non_blocking=1 |
|--:|---|---|---|---|
| 1 | 4.34 / 3.42 | 4.99 / 2.37 | 6.5 / 2.42 | 6.0 / 2.39 |
| 2 | 9.9 / 3.28 | 5.1 / 2.34 | 6.21 / 2.42 | 6.25 / 2.45 |
| 3 | 9.92 / 3.19 | 6.71 / 2.35 | 6.33 / 2.38 | 5.93 / 2.42 |
| 4 | 9.55 / 2.82 | 7.11 / 2.39 | 6.9 / 2.38 | 6.5 / 2.43 |
| 5 | 4.4 / 3.35 | 6.04 / 2.41 | 6.26 / 2.41 | 6.32 / 2.47 |
| 6 | 4.4 / 3.57 | 6.58 / 2.42 | 6.88 / 2.4 | 6.35 / 2.43 |
| 7 | 9.51 / 3.12 | 6.9 / 2.39 | 6.9 / 2.39 | 6.46 / 2.4 |
| 8 | 4.77 / 3.64 | 6.69 / 2.39 | 7.39 / 2.42 | 6.56 / 2.46 |
| 9 | 9.5 / 3.07 | 7.18 / 2.42 | 6.67 / 2.39 | 7.38 / 2.46 |

TODO:
- Enable offloading to NVMe storage -> NVMe support is non-trivial. I
suggest adding that support in another PR.
- [DONE] Discard the buffer (and recreate it) instead of offloading. We
don't need to restore the contiguous buffer for reduce.
- [DONE] Check whether pin_memory improves performance.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-09-27 05:37:32 +00:00
Olatunji Ruwase a5400974df
DeepNVMe perf tuning (#6560)
Add performance tuning utilities: `ds_nvme_tune` and `ds_io`.  
Update tutorial with tuning section.

---------

Co-authored-by: Ubuntu <jomayeri@microsoft.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
2024-09-26 13:07:19 +00:00
ShifaAbu 2a56f53395
Added Intel Gaudi to Accelerator Setup Guide (#6543)
Added Intel Gaudi to the list of accelerators in the setup guide.

Co-authored-by: sakell <sakell@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-09-16 15:24:45 -07:00
Roger Feng 2a647c51d4
Fix the broken url link (#6500)
Simple changes to fix the Intel CPU example link and add more XPU
examples.

Signed-off-by: roger feng <roger.feng@intel.com>
2024-09-06 13:09:30 +00:00
Olatunji Ruwase 5df12a4a85
DeepNVMe tutorial (#6449)
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: jomayeri <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>
2024-09-04 15:31:31 +00:00
Roger Feng 405b6d5e33
Add the accelerator setup guide link in Getting Started page (#6452)
Add a link to
https://www.deepspeed.ai/tutorials/accelerator-setup-guide/ in the
installation section of the Getting Started page so that users can easily
find the doc.

Signed-off-by: roger feng <roger.feng@intel.com>
2024-08-28 16:55:33 +00:00
Dogacan Colak 1041c8a172
Add documentation for launcher without SSH (#6455)
#5728

---------

Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-08-28 15:28:10 +00:00
jiahao su 1bfa341bbd
add Huawei Ascend NPU setup guide (#6445)
This PR adds the setup instructions for Huawei Ascend NPU. Please refer
to the remainder of the guide for instructions on other devices.

---------

Co-authored-by: sjh <sjh1270@163.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-08-27 18:15:48 +00:00
Olatunji Ruwase 01fe65b300
DeepSpeed on Windows blog (#6364)
DeepSpeed on Windows blog

---------

Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-19 11:16:22 -07:00
Ma, Guokai 19b01e1d60
Add accelerator setup guides (#5827)
This document provides a place to hold accelerator setup guides. It is
intended to be a single place to look up installation guides for different
accelerators. Currently CPU and XPU setup guides are added to this
document, and it could be extended to other accelerators.

---------

Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-14 22:43:37 +00:00
Olatunji Ruwase 0584689d43
Fix docs building guide (#5825)
Update instructions with the webrick dependency.
Restore the Gemfile that was accidentally removed in #5821.

---------

Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-08-05 08:51:26 -07:00
Olatunji Ruwase 2ef8223210
Fix NV references (#5821)
Fix NVIDIA references and typos.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-02 10:18:01 -07:00
Olatunji Ruwase 029bb5274a
Link GDS blog to site (#5820) 2024-08-01 13:35:26 -07:00
Liangliang Ma afe1b9ede1
Add doc of compressed backend in Onebit optimizers (#5782)
This is a documentation supplement for
https://github.com/microsoft/DeepSpeed/pull/5473.
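
A hedged config sketch for enabling it, assuming `"compressed"` is accepted by OneBitAdam's `comm_backend_name` field as added in #5473:

```python
ds_config = {
    "train_batch_size": 16,
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 1e-4,
            "freeze_step": 1000,  # warmup steps before 1-bit compression kicks in
            "comm_backend_name": "compressed",
        },
    },
}
```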

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-29 11:38:03 -07:00
Yejing-Lai acdf136785
Add new autotp supported model in doc (#5785)
This PR refreshes the list of models supported by AutoTP. Newly added
models are:

- mixtral
- yuan
- phi
- qwen2 [reviewing PR #5786 ]
- chatglm2&chatglm3 [reviewing PR #5540 ]

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-23 02:12:24 +00:00
Sam Ade Jacobs 3d347276ce
Fix tutorial links (#5714) 2024-07-01 15:58:21 -07:00
Sam Ade Jacobs 121efdbd5c
DeepSpeed Universal Checkpointing: Blog and Tutorial (#5711)
Train {GPT, LLaMA, Phi}-like models (or any model) at ultra-low cost with
DeepSpeed Universal Checkpointing (UCP). UCP abstracts away the
complexities of saving and loading model states. See the arxiv paper, blog,
and tutorial in this PR for details.

---------

Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-01 14:37:24 -07:00
Masahiro Tanaka 77c949421e
Add slide deck for meetup in Japan (#5598) 2024-05-31 14:05:19 -07:00
Logan Adams 4deb40de67
Update to fix sidebar over text (#5567)
- [x] Needs to be tested.

Fixes #5494.

Sample screenshot:
https://github.com/microsoft/DeepSpeed/assets/114770087/f89f642b-bca1-4d45-b3f1-ec7943ab2ad4
2024-05-28 15:42:05 -07:00
Aliaksandr Kuzmik 488a823f64
New integration - CometMonitor (#5466)
This PR introduces a new monitoring option - `CometMonitor`, which is an
official integration with [CometML](https://www.comet.com/site/).

The new monitor is covered with unit tests.

Notes:
* We've updated `docs/code-docs/source/monitor.rst`, but it doesn't appear
to be used anymore
* We've updated the "Monitoring Module" section name in `config-json.md`
to be generic so the next integration won't require updating it.
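
A hedged sketch of enabling the monitor through the DeepSpeed config; key names other than `enabled` are assumptions drawn from the Comet integration docs:

```python
ds_config = {
    "comet": {
        "enabled": True,
        "project": "my-ds-project",      # assumed key name
        "experiment_name": "zero3-run",  # assumed key name
    }
}
```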

---------

Co-authored-by: Boris Feld <lothiraldan@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-05-15 16:04:44 +00:00
Shafiq Jetha a9cbd688f0
Update _sidebar.scss (#5293)
The right sidebar disappears off the right side of the page. These
changes will help bring the content back and place it correctly on the
page.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-04-16 20:49:35 +00:00
Georg Herstein cea5ea1eb6
Docs typos fix and grammar suggestions (#5322)
Hey, this commit contains a few typo fixes and grammar suggestions for
you to consider.
2024-03-27 15:41:30 +00:00
William Kaiser 4520edd61c
Fixed Accelerate Link (#5314)
The current link was broken. Fixed it.
2024-03-26 09:51:55 -07:00
Xiaoxia (Shirley) Wu d1536e4494
Fp6 blog chinese (#5239) 2024-03-07 17:33:50 -08:00
ByronHsu 3e6d606957
[doc/1-line change] default stage3_param_persistence_threshold is wrong in the doc (#5073)
The default value should be `1e5`, as set in
`deepspeed/runtime/zero/config.py` (L200 at commit 2eafe41be7).
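
For reference, the corrected default in a minimal ZeRO-3 config sketch:

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # params smaller than this many elements stay resident on each device
        "stage3_param_persistence_threshold": 1e5,
    }
}
```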

Signed-off-by: byhsu <byhsu@linkedin.com>
Co-authored-by: byhsu <byhsu@linkedin.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-02-05 09:54:51 -08:00
segyges dde64b000c
Make batch size documentation clearer (#5072)
The config variable for accumulation steps is
`gradient_accumulation_steps`, but the docs explaining batch-size-related
parameters state it as `gradient_accumulation` in the note at the top. This
could lead to misconfiguration if someone uses the note as their reference,
and it makes the docs less clear to read because it is not necessarily
obvious that `gradient_accumulation` actually refers to
`gradient_accumulation_steps`.
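
The relation the note is documenting, written out (standard DeepSpeed batch-size arithmetic):

```python
train_micro_batch_size_per_gpu = 4
gradient_accumulation_steps = 8  # the key the note misnames
world_size = 2                   # number of data-parallel ranks

# DeepSpeed requires these three to satisfy:
train_batch_size = (train_micro_batch_size_per_gpu
                    * gradient_accumulation_steps
                    * world_size)  # = 64
```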

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2024-02-05 09:49:08 -08:00
Yun Dai 76ec8b4927
[doc] update inference related docs from `mp_size` to `tensor_parallel` for TP (#5048)
The `mp_size` field is deprecated in favor of `tensor_parallel`/`tp` (see
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/inference/engine.py),
so this updates related docs that still use `mp_size`.
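
A before/after sketch; `model` is a hypothetical HF/torch module and `tp_size=2` is just an example:

```python
import torch
import deepspeed

# deprecated style:
# engine = deepspeed.init_inference(model, mp_size=2, dtype=torch.float16)

# current style:
engine = deepspeed.init_inference(model,
                                  tensor_parallel={"tp_size": 2},
                                  dtype=torch.float16)
```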

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2024-02-01 16:55:12 -08:00
Matthew Hoffman 971d82b573
MoE type hints (#5043)
This PR fixes 5 pyright errors in `deepspeed`. My main goal is to fix
the type signatures of
`split_params_into_different_moe_groups_for_optimizer` since this
affects my project's linting.

I made a few other improvements along the way:

* use more descriptive variable names (`param_group` instead of `v1`,
`moe_group` instead of `v`)
* remove a few unused variables by choosing better-suited iterators like
`dict.values()` instead of `dict.items()` or `nn.Module.parameters()`
instead of `nn.Module.named_parameters()`
* fix incorrect function type signatures
* [use a simple `dict()` shallow copy instead of an unnecessary for loop
that excludes a key which is then immediately
overwritten](https://github.com/microsoft/DeepSpeed/compare/master...ringohoffman:moe-type-hints?expand=1#diff-cec48b3c7def770ef2d14ac7398bfbdf0f209d2558645ffd47d0028988fa66a3L134-L138)
* [use a ternary to avoid duplicating a long
expression](https://github.com/microsoft/DeepSpeed/compare/master...ringohoffman:moe-type-hints?expand=1#diff-cec48b3c7def770ef2d14ac7398bfbdf0f209d2558645ffd47d0028988fa66a3L101-L104)
* `isinstance()` instead of `type(...) is ...`
* `typing.cast(List[nn.Parameter], param_group['params'])` as a general
pattern for improved type hinting of its elements during iteration
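
An illustrative sketch of these patterns (not the actual DeepSpeed code):

```python
from typing import Dict, List, cast

import torch.nn as nn

def collect_params(param_groups: List[Dict]) -> List[nn.Parameter]:
    out: List[nn.Parameter] = []
    for param_group in param_groups:  # descriptive name instead of `v1`
        # cast() documents the element type for the type checker
        params = cast(List[nn.Parameter], param_group['params'])
        for param in params:
            if isinstance(param, nn.Parameter):  # not `type(...) is ...`
                out.append(param)
    return out
```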

---------

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2024-02-01 14:03:32 -08:00
Michael Wyatt 24f20ef0a1
update inference pages to point to FastGen (#5029) 2024-01-30 16:52:04 -08:00
Michael Wyatt 9144b1742a
Update index.md 2024-01-19 15:13:45 -08:00
Ma, Guokai 7739c0aca4
[docs] Add new autotp supported model in tutorial (#4960)
This PR refreshes the list of models supported by AutoTP. Newly added
models are:
- baichuan
- codellama
- falcon
- llama2
- mistral
- qwen
- starcoder
2024-01-16 16:32:35 +00:00
Logan Adams 05cc3462c9
Fix docs inconsistency on default value for `ignore_unused_parameters` (#4949)
Link to the code where the default is set:
`deepspeed/runtime/zero/config.py` (L242 at commit 13d84b4912)
2024-01-12 17:29:44 +00:00
Ma, Guokai d8d865f492
[Fix] Fix cpu inference UT failure (#4430)
This PR fixes the UT failure described in the PR and test job linked
below. It skips `TestModelTask` if the dtype is not supported by the
accelerator, or if `InferenceBuilder` is not implemented by the accelerator.
https://github.com/microsoft/DeepSpeed/pull/4419

https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Liangliang-Ma <1906710196@qq.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Xie Zejian <xiezej@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2024-01-08 23:03:44 +00:00
Dean Wyatte 59c5f37e7a
Add WarmupCosineLR to Read the Docs (#4916)
I found this scheduler via code search. It has been working well for me,
so if it is meant to be released, it would be good to document it.
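
A hedged config sketch; the parameter names are assumptions based on the scheduler's constructor and may differ across DeepSpeed versions:

```python
ds_config = {
    "scheduler": {
        "type": "WarmupCosineLR",
        "params": {
            "total_num_steps": 10000,  # assumed parameter name
            "warmup_num_steps": 1000,  # assumed parameter name
        },
    }
}
```
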
2024-01-08 19:54:58 +00:00
Gavin Goodship 75c7720214
doc corrections (#4861) 2023-12-21 19:13:24 +00:00
Gavin Goodship a00bdde86a
Update zeropp.md (#4835)
Doc corrections

---------

Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-12-18 21:17:50 +00:00
Michael Wyatt d1f1d45f4b
Update broken link in docs (#4822)
resolves #4821
2023-12-15 13:02:17 -08:00
Jeff Rasley 6b8103b46e
[docs] Intel inference blog (#4734) 2023-11-28 08:27:54 -08:00
Yi30 0ec2d3e4bf
Add get and set APIs for the ZeRO-3 partitioned parameters (#4681)
DeepSpeed currently supports a set of debugging APIs to
[get](https://deepspeed.readthedocs.io/en/latest/zero3.html#debugging)
and
[set](https://deepspeed.readthedocs.io/en/latest/zero3.html#modifying-partitioned-states)
the **full** model states (parameters, gradients, and optimizer states).
However, in some scenarios only **local states** are needed, for
example when pruning some model layers based on a local criterion.
After calling `model_engine.step()`, we need to apply the local mask to
the partitioned parameters owned by each process. Therefore, this PR
introduces new APIs to `get` and `set` ZeRO-3 partial model states.

### APIs intro
```python
def safe_get_local_fp32_param(param):
    """Get the fp32 partitioned parameter."""

def safe_get_local_grad(param):
    """Get the fp32 gradient of a partitioned parameter."""

def safe_get_local_optimizer_state(param, optim_state_key):
    """Get the fp32 optimizer state of a partitioned parameter."""

def safe_set_local_fp32_param(param, value):
    """Update the partitioned fp32 parameter."""

def safe_set_local_optimizer_state(param, value, optim_state_key):
    """Update the fp32 optimizer state of a partitioned parameter."""
```

### Usage
```python
# local API
from deepspeed.utils import (
    safe_get_local_fp32_param,
    safe_get_local_grad,
    safe_get_local_optimizer_state,
    safe_set_local_fp32_param,
    safe_set_local_optimizer_state
    )
```
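
A hedged sketch of the pruning scenario described above; `model_engine` is a ZeRO-3 DeepSpeed engine and the magnitude threshold is purely illustrative:

```python
for param in model_engine.module.parameters():
    local_fp32 = safe_get_local_fp32_param(param)  # this rank's partition only
    mask = (local_fp32.abs() > 1e-3).to(local_fp32.dtype)  # local criterion
    safe_set_local_fp32_param(param, local_fp32 * mask)
```
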
### TODO
- [x] Add local APIs
- [x] Add UTs
- [x] Update Docs

@tjruwase

---------

Signed-off-by: yliu <test@do_not_reply@neuralstudio.intel.com>
Co-authored-by: yliu <test@do_not_reply@neuralstudio.intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-11-17 21:58:47 +00:00
Masahiro Tanaka ab6b1e16bb
Add Japanese blog for DeepSpeed-FastGen (#4651)
This PR adds a Japanese blog for DeepSpeed-FastGen
(and includes small fixes of typos in the original blog).

---------

Co-authored-by: Conglong Li <conglong.li@gmail.com>
2023-11-07 10:10:45 -08:00
Heyang Qin 00df0c1998
DeepSpeed-FastGen Chinese Blog (#4642)
Thanks @xiaoxiawu-microsoft and @conglongli for reviewing and improving
it!

---------

Co-authored-by: Xiaoxia (Shirley) Wu <94406484+xiaoxiawu-microsoft@users.noreply.github.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
2023-11-06 21:16:53 -08:00
Jeff Rasley cbec96b00e
[docs] update news items (#4640)
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
2023-11-06 15:55:25 -08:00
Guanhua Wang b1cb0dfc46
Guanhua/partial offload rebase v2 (#590) (#4636)
This PR introduces the Twin-Flow feature of ZeRO-Offload++, which improves
e2e training iteration time by up to 6x on DGX-H100s.

This PR includes:

* Twin-Flow implementation inside the ZeRO optimizer
* JSON config tutorial (sketched below)
* example using deepspeed
* unit tests
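
A hedged sketch of the tutorial's config knob, assuming the `ratio` field under `offload_optimizer` controls the fraction of optimizer work kept on CPU:

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "ratio": 0.3,  # assumed: ~30% of parameter updates done on CPU
        },
    }
}
```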


cc @jeffra @awan-10 @tjruwase @mrwyattii

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-11-06 14:15:16 -08:00
Jeff Rasley 1d9e256c03
DeepSpeed-FastGen blog (#4607)
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
2023-11-03 15:32:40 -07:00
Jeff Rasley 4199dc25af [docs] fix deepspeed.ai links 2023-10-30 14:24:11 -07:00
Jeff Rasley 45b07bf944
[docs] paper updates (#4584) 2023-10-30 14:17:18 -07:00