Commit graph

462 commits

Author SHA1 Message Date
Olatunji Ruwase 65ab64481f
Add API for updating ZeRO gradients (#6590) 2024-10-14 17:35:41 +00:00
Masahiro Tanaka adec99121b
Add API to get devices of offload states (#6586)
This PR adds an API `deepspeed.runtime.zero.offload_states.get_state_devices`, which gets the devices of offload states, as suggested in this
[comment](https://github.com/microsoft/DeepSpeed/pull/6011#issuecomment-2358068777).
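
A minimal usage sketch, assuming the signature `get_state_devices(model, state)` returns a set of `torch.device`, and that `OffloadStateTypeEnum` lives in `deepspeed.runtime.zero.offload_config`:

```python
from deepspeed.runtime.zero.offload_states import get_state_devices
from deepspeed.runtime.zero.offload_config import OffloadStateTypeEnum

# model_engine is a hypothetical ZeRO-initialized DeepSpeed engine
devices = get_state_devices(model_engine, OffloadStateTypeEnum.hp_params)
print(devices)  # e.g. {device(type='cuda', index=0)} before offloading
```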

We could lift this up to `deepspeed.utils` but would need to resolve a
circular import: User code -> `deepspeed.utils` ->
`deepspeed.utils.offload_states` -> `deepspeed.runtime.zero` ->
`deepspeed.runtime.zero.partition_parameters` -> `deepspeed.utils`

This will require a significant refactoring as long as we have
`OffloadStateTypeEnum` in `deepspeed.runtime.zero`.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-10 02:59:26 +00:00
Joe Mayer a1f98bdc70
AIO CPU Locked Tensor (#6592)
Restores the functionality of the CPU locked tensor in the AIO library and
makes the async_io operator available for the CPU accelerator, i.e., a
CPU-only environment.
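
A hedged sketch of exercising the restored locked-tensor path; the handle constructor arguments and method names are assumptions based on the AIO operator's typical interface:

```python
import torch
from deepspeed.ops.op_builder import AsyncIOBuilder

aio_ops = AsyncIOBuilder().load()
# assumed args: block_size, queue_depth, single_submit, overlap_events, num_threads
handle = aio_ops.aio_handle(1024 * 1024, 128, False, False, 1)

# allocate a page-locked CPU buffer of 1024 uint8 elements
pinned = handle.new_cpu_locked_tensor(1024, torch.empty(0, dtype=torch.uint8))
# ... use the pinned buffer for asynchronous file I/O ...
handle.free_cpu_locked_tensor(pinned)
```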

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-09 21:07:31 +00:00
gyou2021 474a3288cd
Enabled Qwen2-MoE Tensor Parallelism (TP) inference (#6551)
Modified `_replace_module` in auto_tp.py: the modification keeps the layers
'shared_expert_gate' and 'gate' in qwen2-moe as the original type
torch.nn.Linear instead of converting them into LinearLayer. This way their
weights are not split across multiple HPU/GPU cards, so qwen2-moe can run
on multiple HPU/GPU cards. Since the weights of 'gate' are not split across
cards, no all-gather operations are needed, which may improve performance
(see the sketch below).
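
An illustrative sketch of the idea (not the actual `auto_tp.py` code); `to_linear_layer` stands in for DeepSpeed's replacement routine:

```python
import torch

# qwen2-moe gating layers that must keep whole (unsharded) weights
KEEP_AS_LINEAR = ('shared_expert_gate', 'gate')

def maybe_replace(name: str, child: torch.nn.Module, to_linear_layer):
    if name in KEEP_AS_LINEAR and isinstance(child, torch.nn.Linear):
        return child  # keep torch.nn.Linear: no weight split, no all-gather
    return to_linear_layer(child)  # shard everything else as usual
```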

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-09 15:23:16 +00:00
Masahiro Tanaka 047bcf6af6
Add APIs to offload states of model, optimizer, and engine (#6011)
This PR adds the following APIs to offload model, optimizer, and engine
states.

```python
def offload_states(self,
                   include: Container[OffloadStateTypeEnum] = None,
                   device: OffloadDeviceEnum = OffloadDeviceEnum.cpu,
                   pin_memory: bool = True,
                   non_blocking: bool = False) -> None:
    """Move the ZeRO optimizer buffers to the specified device.

    Arguments:
        include: Optional. The set of states to offload. If not provided, all states are offloaded.
        device: Optional. The device to move the ZeRO optimizer buffers to.
        pin_memory: Optional. Whether to pin the memory of the offloaded states.
        non_blocking: Optional. Whether to offload the states asynchronously.
...
def offload_states_back(self, non_blocking: bool = False) -> None:
```

Here is the typical usage.
```python
# Offload after forward, backward, and step
model.offload_states()
# Do something requiring a lot of device memory
...
# Load states back to device memory
model.offload_states_back()
```

You can selectively offload states to balance the offloading overhead
and memory saving.
```python
# enum locations assumed to be deepspeed.runtime.zero.offload_config
from deepspeed.runtime.zero.offload_config import OffloadStateTypeEnum, OffloadDeviceEnum

model.offload_states(include=set([OffloadStateTypeEnum.hp_params, OffloadStateTypeEnum.opt_states]),
                     device=OffloadDeviceEnum.cpu)
```

Performance (4.3B parameters / 4x A100)
- Environment (4x A100, [benchmark
script](https://gist.github.com/tohtana/05d5faba5068cf839abfc7b1e38b85e4))
  - Average device-to-host transfer bandwidth: 2.45 GB/s (aggregated: 9.79 GB/s)
  - Average host-to-device transfer bandwidth: 11.05 GB/s (aggregated: 44.19 GB/s)
- Memory (allocated by PyTorch)
  - Before offloading: 18.2 GB
  - After offloading: 17.7 MB
- Time ([benchmark
script](https://github.com/microsoft/DeepSpeedExamples/tree/tohtana/offload_states/training/offload_states),
offloading time/loading time)

| Run | pin_memory=0 non_blocking=0 | pin_memory=0 non_blocking=1 | pin_memory=1 non_blocking=0 | pin_memory=1 non_blocking=1 |
|--:|---|---|---|---|
| 1 | 4.34 / 3.42 | 4.99 / 2.37 | 6.5 / 2.42 | 6.0 / 2.39 |
| 2 | 9.9 / 3.28 | 5.1 / 2.34 | 6.21 / 2.42 | 6.25 / 2.45 |
| 3 | 9.92 / 3.19 | 6.71 / 2.35 | 6.33 / 2.38 | 5.93 / 2.42 |
| 4 | 9.55 / 2.82 | 7.11 / 2.39 | 6.9 / 2.38 | 6.5 / 2.43 |
| 5 | 4.4 / 3.35 | 6.04 / 2.41 | 6.26 / 2.41 | 6.32 / 2.47 |
| 6 | 4.4 / 3.57 | 6.58 / 2.42 | 6.88 / 2.4 | 6.35 / 2.43 |
| 7 | 9.51 / 3.12 | 6.9 / 2.39 | 6.9 / 2.39 | 6.46 / 2.4 |
| 8 | 4.77 / 3.64 | 6.69 / 2.39 | 7.39 / 2.42 | 6.56 / 2.46 |
| 9 | 9.5 / 3.07 | 7.18 / 2.42 | 6.67 / 2.39 | 7.38 / 2.46 |

TODO:
- Enable offloading to NVMe storage -> NVMe support is non-trivial. I
suggest adding that support in another PR.
- [DONE] Discard the buffer (and recreate it) instead of offloading. We
don't need to restore the contiguous buffer for reduce.
- [DONE] Check whether pin_memory improves performance.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-09-27 05:37:32 +00:00
Olatunji Ruwase a5400974df
DeepNVMe perf tuning (#6560)
Add performance tuning utilities: `ds_nvme_tune` and `ds_io`.  
Update tutorial with tuning section.

---------

Co-authored-by: Ubuntu <jomayeri@microsoft.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
2024-09-26 13:07:19 +00:00
ShifaAbu 2a56f53395
Added Intel Gaudi to Accelerator Setup Guide (#6543)
Added Intel Gaudi to the list of accelerators in the setup guide.

Co-authored-by: sakell <sakell@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-09-16 15:24:45 -07:00
Roger Feng 2a647c51d4
Fix the broken url link (#6500)
Simple changes to fix the Intel CPU example link and add more XPU
examples.

Signed-off-by: roger feng <roger.feng@intel.com>
2024-09-06 13:09:30 +00:00
Olatunji Ruwase 5df12a4a85
DeepNVMe tutorial (#6449)
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: jomayeri <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>
2024-09-04 15:31:31 +00:00
Roger Feng 405b6d5e33
Add the accelerator setup guide link in Getting Started page (#6452)
Add a link to
https://www.deepspeed.ai/tutorials/accelerator-setup-guide/ in the
installation section of the Getting Started page so that users can easily
find the doc.

Signed-off-by: roger feng <roger.feng@intel.com>
2024-08-28 16:55:33 +00:00
Dogacan Colak 1041c8a172
Add documentation for launcher without SSH (#6455)
#5728

---------

Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-08-28 15:28:10 +00:00
jiahao su 1bfa341bbd
add Huawei Ascend NPU setup guide (#6445)
This PR adds the setup instructions for Huawei Ascend NPU. Please refer
to the remainder of the guide for instructions on other devices.

---------

Co-authored-by: sjh <sjh1270@163.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-08-27 18:15:48 +00:00
Olatunji Ruwase 01fe65b300
DeepSpeed on Windows blog (#6364)
DeepSpeed on Windows blog

---------

Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-19 11:16:22 -07:00
Ma, Guokai 19b01e1d60
Add accelerator setup guides (#5827)
This document provides a place to hold accelerator setup guides. It is
intended to be a single place to look up installation guides for different
accelerators. Currently CPU and XPU setup guides are added to this
document, and it could be extended to other accelerators.

---------

Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-14 22:43:37 +00:00
Olatunji Ruwase 0584689d43
Fix docs building guide (#5825)
Update instructions with the webrick dependency.
Restore the Gemfile that was accidentally removed in #5821.

---------

Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-08-05 08:51:26 -07:00
Olatunji Ruwase 2ef8223210
Fix NV references (#5821)
Fix NVIDIA references and typos.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-02 10:18:01 -07:00
Olatunji Ruwase 029bb5274a
Link GDS blog to site (#5820) 2024-08-01 13:35:26 -07:00
Liangliang Ma afe1b9ede1
Add doc of compressed backend in Onebit optimizers (#5782)
This is a documentation supplement for
https://github.com/microsoft/DeepSpeed/pull/5473.
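
A hedged config sketch for enabling it, assuming `"compressed"` is accepted by OneBitAdam's `comm_backend_name` field as added in #5473:

```python
ds_config = {
    "train_batch_size": 16,
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 1e-4,
            "freeze_step": 1000,  # warmup steps before 1-bit compression kicks in
            "comm_backend_name": "compressed",
        },
    },
}
```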

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-29 11:38:03 -07:00
Yejing-Lai acdf136785
Add new autotp supported model in doc (#5785)
This PR refreshes the list of models supported by AutoTP. Newly added
models are:

- mixtral
- yuan
- phi
- qwen2 [reviewing PR #5786 ]
- chatglm2&chatglm3 [reviewing PR #5540 ]

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-23 02:12:24 +00:00
Sam Ade Jacobs 3d347276ce
Fix tutorial links (#5714) 2024-07-01 15:58:21 -07:00
Sam Ade Jacobs 121efdbd5c
DeepSpeed Universal Checkpointing: Blog and Tutorial (#5711)
Train {GPT, LLaMA, Phi}-like models (or any model) at ultra-low cost with
DeepSpeed Universal Checkpointing (UCP). UCP abstracts away the
complexities of saving and loading model states. See the arxiv paper, blog,
and tutorial in this PR for details.

---------

Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-07-01 14:37:24 -07:00
Masahiro Tanaka 77c949421e
Add slide deck for meetup in Japan (#5598) 2024-05-31 14:05:19 -07:00
Logan Adams 4deb40de67
Update to fix sidebar over text (#5567)
- [x] Needs to be tested.

Fixes #5494.

Sample screenshot:
https://github.com/microsoft/DeepSpeed/assets/114770087/f89f642b-bca1-4d45-b3f1-ec7943ab2ad4
2024-05-28 15:42:05 -07:00
Aliaksandr Kuzmik 488a823f64
New integration - CometMonitor (#5466)
This PR introduces a new monitoring option - `CometMonitor`, which is an
official integration with [CometML](https://www.comet.com/site/).

The new monitor is covered with unit tests.

Notes:
* We've updated `docs/code-docs/source/monitor.rst`, but it doesn't appear
to be used anymore
* We've updated the "Monitoring Module" section name in `config-json.md`
to be generic so the next integration won't require updating it.
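
A hedged sketch of enabling the monitor through the DeepSpeed config; key names other than `enabled` are assumptions drawn from the Comet integration docs:

```python
ds_config = {
    "comet": {
        "enabled": True,
        "project": "my-ds-project",      # assumed key name
        "experiment_name": "zero3-run",  # assumed key name
    }
}
```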

---------

Co-authored-by: Boris Feld <lothiraldan@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-05-15 16:04:44 +00:00
Shafiq Jetha a9cbd688f0
Update _sidebar.scss (#5293)
The right sidebar disappears off the right side of the page. These
changes will help bring the content back and place it correctly on the
page.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-04-16 20:49:35 +00:00
Georg Herstein cea5ea1eb6
Docs typos fix and grammar suggestions (#5322)
Hey, this commit contains a few typo fixes and grammar suggestions for
you to consider.
2024-03-27 15:41:30 +00:00
William Kaiser 4520edd61c
Fixed Accelerate Link (#5314)
The current link was broken. Fixed it.
2024-03-26 09:51:55 -07:00
Xiaoxia (Shirley) Wu d1536e4494
Fp6 blog chinese (#5239) 2024-03-07 17:33:50 -08:00
ByronHsu 3e6d606957
[doc/1-line change] default stage3_param_persistence_threshold is wrong in the doc (#5073)
The default value should be `1e5`, as set in
`deepspeed/runtime/zero/config.py` (L200 at commit 2eafe41be7).
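
For reference, the corrected default in a minimal ZeRO-3 config sketch:

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # params smaller than this many elements stay resident on each device
        "stage3_param_persistence_threshold": 1e5,
    }
}
```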

Signed-off-by: byhsu <byhsu@linkedin.com>
Co-authored-by: byhsu <byhsu@linkedin.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-02-05 09:54:51 -08:00
segyges dde64b000c
Make batch size documentation clearer (#5072)
The config variable for accumulation steps is
`gradient_accumulation_steps`, but the docs explaining batch-size-related
parameters state it as `gradient_accumulation` in the note at the top. This
could lead to misconfiguration if someone uses the note as their reference,
and it makes the docs less clear to read because it is not necessarily
obvious that `gradient_accumulation` actually refers to
`gradient_accumulation_steps`.
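
The relation the note is documenting, written out (standard DeepSpeed batch-size arithmetic):

```python
train_micro_batch_size_per_gpu = 4
gradient_accumulation_steps = 8  # the key the note misnames
world_size = 2                   # number of data-parallel ranks

# DeepSpeed requires these three to satisfy:
train_batch_size = (train_micro_batch_size_per_gpu
                    * gradient_accumulation_steps
                    * world_size)  # = 64
```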

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2024-02-05 09:49:08 -08:00
Yun Dai 76ec8b4927
[doc] update inference related docs from `mp_size` to `tensor_parallel` for TP (#5048)
The `mp_size` field is deprecated in favor of `tensor_parallel`/`tp` (see
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/inference/engine.py),
so this updates related docs that still use `mp_size`.
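
A before/after sketch; `model` is a hypothetical HF/torch module and `tp_size=2` is just an example:

```python
import torch
import deepspeed

# deprecated style:
# engine = deepspeed.init_inference(model, mp_size=2, dtype=torch.float16)

# current style:
engine = deepspeed.init_inference(model,
                                  tensor_parallel={"tp_size": 2},
                                  dtype=torch.float16)
```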

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2024-02-01 16:55:12 -08:00
Matthew Hoffman 971d82b573
MoE type hints (#5043)
This PR fixes 5 pyright errors in `deepspeed`. My main goal is to fix
the type signatures of
`split_params_into_different_moe_groups_for_optimizer` since this
affects my project's linting.

I made a few other improvements along the way:

* use more descriptive variable names (`param_group` instead of `v1`,
`moe_group` instead of `v`)
* remove a few unused variables by choosing better-suited iterators like
`dict.values()` instead of `dict.items()` or `nn.Module.parameters()`
instead of `nn.Module.named_parameters()`
* fix incorrect function type signatures
* [use a simple `dict()` shallow copy instead of an unnecessary for loop
that excludes a key which is then immediately
overwritten](https://github.com/microsoft/DeepSpeed/compare/master...ringohoffman:moe-type-hints?expand=1#diff-cec48b3c7def770ef2d14ac7398bfbdf0f209d2558645ffd47d0028988fa66a3L134-L138)
* [use a ternary to avoid duplicating a long
expression](https://github.com/microsoft/DeepSpeed/compare/master...ringohoffman:moe-type-hints?expand=1#diff-cec48b3c7def770ef2d14ac7398bfbdf0f209d2558645ffd47d0028988fa66a3L101-L104)
* `isinstance()` instead of `type(...) is ...`
* `typing.cast(List[nn.Parameter], param_group['params'])` as a general
pattern for improved type hinting of its elements during iteration
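
An illustrative sketch of these patterns (not the actual DeepSpeed code):

```python
from typing import Dict, List, cast

import torch.nn as nn

def collect_params(param_groups: List[Dict]) -> List[nn.Parameter]:
    out: List[nn.Parameter] = []
    for param_group in param_groups:  # descriptive name instead of `v1`
        # cast() documents the element type for the type checker
        params = cast(List[nn.Parameter], param_group['params'])
        for param in params:
            if isinstance(param, nn.Parameter):  # not `type(...) is ...`
                out.append(param)
    return out
```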

---------

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2024-02-01 14:03:32 -08:00
Michael Wyatt 24f20ef0a1
update inference pages to point to FastGen (#5029) 2024-01-30 16:52:04 -08:00
Michael Wyatt 9144b1742a
Update index.md 2024-01-19 15:13:45 -08:00
Ma, Guokai 7739c0aca4
[docs] Add new autotp supported model in tutorial (#4960)
This PR refreshes the list of models supported by AutoTP. Newly added
models are:
- baichuan
- codellama
- falcon
- llama2
- mistral
- qwen
- starcoder
2024-01-16 16:32:35 +00:00
Logan Adams 05cc3462c9
Fix docs inconsistency on default value for `ignore_unused_parameters` (#4949)
Link to the code where the default is set:
`deepspeed/runtime/zero/config.py` (L242 at commit 13d84b4912)
2024-01-12 17:29:44 +00:00
Ma, Guokai d8d865f492
[Fix] Fix cpu inference UT failure (#4430)
This PR fixes the UT failure described in the PR and test job linked
below. It skips `TestModelTask` if the dtype is not supported by the
accelerator, or if `InferenceBuilder` is not implemented by the accelerator.
https://github.com/microsoft/DeepSpeed/pull/4419

https://github.com/microsoft/DeepSpeed/actions/runs/6341645987/job/17235544538

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Liangliang-Ma <1906710196@qq.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Xie Zejian <xiezej@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2024-01-08 23:03:44 +00:00
Dean Wyatte 59c5f37e7a
Add WarmupCosineLR to Read the Docs (#4916)
I found this scheduler via code search. It has been working well for me,
so if it is meant to be released, it would be good to document it.
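
A hedged config sketch; the parameter names are assumptions based on the scheduler's constructor and may differ across DeepSpeed versions:

```python
ds_config = {
    "scheduler": {
        "type": "WarmupCosineLR",
        "params": {
            "total_num_steps": 10000,  # assumed parameter name
            "warmup_num_steps": 1000,  # assumed parameter name
        },
    }
}
```
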
2024-01-08 19:54:58 +00:00
Gavin Goodship 75c7720214
doc corrections (#4861) 2023-12-21 19:13:24 +00:00
Gavin Goodship a00bdde86a
Update zeropp.md (#4835)
Doc corrections

---------

Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-12-18 21:17:50 +00:00
Michael Wyatt d1f1d45f4b
Update broken link in docs (#4822)
resolves #4821
2023-12-15 13:02:17 -08:00
Jeff Rasley 6b8103b46e
[docs] Intel inference blog (#4734) 2023-11-28 08:27:54 -08:00
Yi30 0ec2d3e4bf
Add get and set APIs for the ZeRO-3 partitioned parameters (#4681)
DeepSpeed currently supports a set of debugging APIs to
[get](https://deepspeed.readthedocs.io/en/latest/zero3.html#debugging)
and
[set](https://deepspeed.readthedocs.io/en/latest/zero3.html#modifying-partitioned-states)
the **full** model states (parameters, gradients, and optimizer states).
However, in some scenarios only **local states** are needed, for
example when pruning some model layers based on a local criterion.
After calling `model_engine.step()`, we need to apply the local mask to
the partitioned parameters owned by each process. Therefore, this PR
introduces new APIs to `get` and `set` ZeRO-3 partial model states.

### APIs intro
```python
def safe_get_local_fp32_param(param):
    """Get the fp32 partitioned parameter."""

def safe_get_local_grad(param):
    """Get the fp32 gradient of a partitioned parameter."""

def safe_get_local_optimizer_state(param, optim_state_key):
    """Get the fp32 optimizer state of a partitioned parameter."""

def safe_set_local_fp32_param(param, value):
    """Update the partitioned fp32 parameter."""

def safe_set_local_optimizer_state(param, value, optim_state_key):
    """Update the fp32 optimizer state of a partitioned parameter."""
```

### Usage
```python
# local API
from deepspeed.utils import (
    safe_get_local_fp32_param,
    safe_get_local_grad,
    safe_get_local_optimizer_state,
    safe_set_local_fp32_param,
    safe_set_local_optimizer_state
    )
```
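
A hedged sketch of the pruning scenario described above; `model_engine` is a ZeRO-3 DeepSpeed engine and the magnitude threshold is purely illustrative:

```python
for param in model_engine.module.parameters():
    local_fp32 = safe_get_local_fp32_param(param)  # this rank's partition only
    mask = (local_fp32.abs() > 1e-3).to(local_fp32.dtype)  # local criterion
    safe_set_local_fp32_param(param, local_fp32 * mask)
```
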
### TODO
- [x] Add local APIs
- [x] Add UTs
- [x] Update Docs

@tjruwase

---------

Signed-off-by: yliu <test@do_not_reply@neuralstudio.intel.com>
Co-authored-by: yliu <test@do_not_reply@neuralstudio.intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-11-17 21:58:47 +00:00
Masahiro Tanaka ab6b1e16bb
Add Japanese blog for DeepSpeed-FastGen (#4651)
This PR adds a Japanese blog for DeepSpeed-FastGen
(and includes small fixes of typos in the original blog).

---------

Co-authored-by: Conglong Li <conglong.li@gmail.com>
2023-11-07 10:10:45 -08:00
Heyang Qin 00df0c1998
DeepSpeed-FastGen Chinese Blog (#4642)
Thanks @xiaoxiawu-microsoft and @conglongli for reviewing and improving
it!

---------

Co-authored-by: Xiaoxia (Shirley) Wu <94406484+xiaoxiawu-microsoft@users.noreply.github.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
2023-11-06 21:16:53 -08:00
Jeff Rasley cbec96b00e
[docs] update news items (#4640)
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
2023-11-06 15:55:25 -08:00
Guanhua Wang b1cb0dfc46
Guanhua/partial offload rebase v2 (#590) (#4636)
This PR introduces the Twin-Flow feature of ZeRO-Offload++, which improves
e2e training iteration time by up to 6x on DGX-H100s.

This PR includes:

* Twin-Flow implementation inside the ZeRO optimizer
* JSON config tutorial (sketched below)
* example using deepspeed
* unit tests
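
A hedged sketch of the tutorial's config knob, assuming the `ratio` field under `offload_optimizer` controls the fraction of optimizer work kept on CPU:

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "ratio": 0.3,  # assumed: ~30% of parameter updates done on CPU
        },
    }
}
```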


cc @jeffra @awan-10 @tjruwase @mrwyattii

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-11-06 14:15:16 -08:00
Jeff Rasley 1d9e256c03
DeepSpeed-FastGen blog (#4607)
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
2023-11-03 15:32:40 -07:00
Jeff Rasley 4199dc25af [docs] fix deepspeed.ai links 2023-10-30 14:24:11 -07:00
Jeff Rasley 45b07bf944
[docs] paper updates (#4584) 2023-10-30 14:17:18 -07:00