### Integration of LoCo Method into ZeRO++
#### Overview
This PR introduces the integration of the **LoCo** method, as outlined
in [this paper](https://arxiv.org/abs/2407.04480), into the ZeRO++
framework of DeepSpeed. The key enhancement involves applying error
feedback compensation to 4-bit gradients before communication. This
approach ***improves pre-training loss outcomes without additional time
overhead***, though it requires extra GPU memory. The extent of this
memory increase depends on model size and training configuration.
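To make the mechanism concrete, here is a minimal, framework-agnostic sketch of error-feedback compensation applied to quantized gradients. The uniform 4-bit quantizer and helper names below are illustrative assumptions, not DeepSpeed's actual ZeRO++ kernels; the per-parameter `residual` buffer is the extra GPU memory mentioned above.
```python
import torch

def quantize_dequantize_4bit(x: torch.Tensor) -> torch.Tensor:
    # Symmetric uniform quantization to 16 levels, then back to float
    # (an illustrative stand-in for the real 4-bit gradient quantizer).
    scale = x.abs().max().clamp(min=1e-8) / 7.0
    return torch.clamp(torch.round(x / scale), -8, 7) * scale

def compensate_and_quantize(grad: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
    compensated = grad + residual              # apply error feedback
    q = quantize_dequantize_4bit(compensated)  # what would be communicated
    residual.copy_(compensated - q)            # keep the new quantization error
    return q

grad = torch.randn(1024)
residual = torch.zeros_like(grad)              # the extra memory LoCo needs
q = compensate_and_quantize(grad, residual)
```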
#### Experimental Results
We conducted pre-training experiments using the Llama2 architecture,
adjusting the number of layers and hidden size. The experiments
included:
- **A smaller-scale model with 0.8B parameters trained on 30B tokens**.
- **A larger-scale model with 8B parameters trained on 5B tokens**.
The training data was sampled from **Redpajama-V2**.
<p align="center">
<img
src="https://github.com/user-attachments/assets/e7db9487-728c-4a17-9806-c15afa12f62e"
width="49%" />
<img
src="https://github.com/user-attachments/assets/3efec895-b71d-43ab-b5ce-65468ba8b9f1"
width="49%" />
</p>
**Findings**:
- **Smaller Models (0.8B parameters)**: Significant gains were observed
when applying the LoCo method.
- **Larger Models (8B parameters)**: The gains were present but less
pronounced. This could be due to:
1. Relatively smaller data volume.
2. Lower pre-training loss for larger models, making significant
improvements harder to achieve.
However, even the smaller pre-training loss gap seen in larger models can
translate to meaningful gains in downstream tasks.
#### Example Script
For reference, the
[run.sh](https://github.com/user-attachments/files/17679552/zeroplus-7b3.zip)
script used for the 8B parameter, 5B tokens experiment is attached. The
experiment was conducted using the **DeepSpeed-Megatron** platform.
#### Acknowledgments
Special thanks to @GuanhuaWang for ongoing communication and guidance
throughout this work.
---
We appreciate your consideration of this PR and welcome any feedback or
questions!
---------
Co-authored-by: ChuanxinTang <tangchuanxin.chn@gmail.com>
Co-authored-by: root <pan.jiachun@outlook.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Instead of checking whether it is installed or not, check for support. Skip if not
supported.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
- Added support in FlopsProfiler for the `einops.einsum` operation
- Added `_patch_miscellaneous_operations()` and
`_reload_miscellaneous_operations()` to cover this operation and
potentially other miscellaneous operations in the future; a sketch of this
patch/restore pattern follows below
- Added `_einops_einsum_flops_compute()`, which mimics the existing
`_einsum_flops_compute()`
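As a rough illustration, a minimal sketch of the patch/restore pattern under simplified assumptions: a plain call recorder stands in for the real FLOPs hooks, and the helper names are hypothetical.
```python
import einops

_original_einsum = einops.einsum

def _patch_einops_einsum(recorded_patterns):
    # Replace einops.einsum with a wrapper that records each call
    # (the real profiler would estimate FLOPs here instead).
    def wrapped(*args, **kwargs):
        # einops.einsum takes the tensors first and the pattern string last.
        recorded_patterns.append(args[-1])
        return _original_einsum(*args, **kwargs)
    einops.einsum = wrapped

def _unpatch_einops_einsum():
    # Restore the original implementation after profiling.
    einops.einsum = _original_einsum
```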
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.1
Author - @loadams
Co-authored-by: loadams <loadams@users.noreply.github.com>
[FPDT](https://arxiv.org/abs/2408.16978) can only be used with [this
version](https://github.com/microsoft/Megatron-DeepSpeed/pull/441) of
Megatron-DeepSpeed.
---------
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw02.ten.osc.edu>
Co-authored-by: Sam Ade Jacobs <samjacobs@microsoft.com>
Co-authored-by: Jinghan Yao <yjhmitweb@ascend-rw01.ten.osc.edu>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jinghan Yao <yjhmitweb@cardinal-rw02.ten.osc.edu>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.0
Author - @loadams
Co-authored-by: loadams <loadams@users.noreply.github.com>
This PR adds the Domino blog to our public site.
cc @tjruwase
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
I had an OOM problem when doing DPO training with ZeRO-3. DPO needs to call
the module twice in one training step, and the second call is under no_grad().
The problem is caused by two bugs:
1. `__n_available_params`, which helps control the number of fetched parameters,
becomes negative after the release_and_reset_all() function.
2. `module.ds_grads_remaining` becomes negative in backward() if we call the
module more than once in one training step.
I created two patches to fix these issues; a minimal sketch of the triggering call pattern is shown below.
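For context, this is the kind of call pattern that triggers the issue, with generic model/optimizer names and a placeholder loss rather than an actual DPO objective:
```python
import torch
import torch.nn as nn

def training_step(model: nn.Module, batch: torch.Tensor, optimizer: torch.optim.Optimizer):
    policy_out = model(batch)            # first module call, with gradients
    with torch.no_grad():
        reference_out = model(batch)     # second module call, under no_grad()
    loss = (policy_out - reference_out).pow(2).mean()  # placeholder loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model = nn.Linear(8, 8)
training_step(model, torch.randn(4, 8), torch.optim.SGD(model.parameters(), lr=0.1))
```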
---------
Signed-off-by: Wenbin Chen <wenbin.chen@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
`clone_tensors_for_torch_save()` function:
When `item.device` differs from the `device` input, `tensor.clone()` is not
actually required, because the `to()` function also clones the original tensor.
Additionally, I observed memory bloat under the following conditions:
* Training a Whisper model with the `transformers` framework under a `ZeRO-0` or
`ZeRO-1` configuration.
* The memory bloat occurred every time the model state_dict was cloned using
`clone_tensors_for_torch_save()`.
After removing the unnecessary `clone()`, the problem appears to be solved.
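A minimal sketch of the idea (not the actual DeepSpeed code): clone only when the tensor is already on the target device, and otherwise rely on `to()` producing a copy.
```python
import torch

def clone_for_save(item: torch.Tensor, device: torch.device) -> torch.Tensor:
    if item.device == device:
        # Same device: an explicit copy is still needed.
        return item.detach().clone()
    # Different device: to() already returns a new tensor, so an extra
    # clone() here would only double the memory.
    return item.detach().to(device)
```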
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
> Using extra keyword arguments on `Field` is deprecated and will be
removed. Use `json_schema_extra` instead. (Extra keys: 'new_param').
Deprecated in Pydantic V2.0 to be removed in V3.0. See Pydantic V2
Migration Guide at https://errors.pydantic.dev/2.9/migration/
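A minimal sketch of the migration the warning asks for, using a hypothetical config field:
```python
from pydantic import BaseModel, Field

class ExampleConfig(BaseModel):
    # Before (deprecated in Pydantic V2): Field(0, new_param="old_name")
    # After: move extra keys into json_schema_extra.
    value: int = Field(0, json_schema_extra={"new_param": "old_name"})
```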
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Successor PR to #6094:
> FutureWarning: You are using torch.load with weights_only=False (the
current default value), which uses the default pickle module implicitly.
It is possible to construct malicious pickle data which will execute
arbitrary code during unpickling (See
https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models
for more details). In a future release, the default value for
weights_only will be flipped to True. This limits the functions that
could be executed during unpickling. Arbitrary objects will no longer be
allowed to be loaded via this mode unless they are explicitly
allowlisted by the user via torch.serialization.add_safe_globals. We
recommend you start setting weights_only=True for any use case where you
don't have full control of the loaded file. Please open an issue on
GitHub for any issues related to this experimental feature.
Todo:
- [ ] Update values in non-test files to True where necessary.
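For reference, a minimal sketch of the intended usage, with a placeholder checkpoint path:
```python
import torch

# Restrict unpickling to plain tensors/containers; anything else must be
# explicitly allowlisted via torch.serialization.add_safe_globals.
state_dict = torch.load("checkpoint.pt", map_location="cpu", weights_only=True)
```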
Compatibility update for xpu ops
This PR introduces changes that will make xpu ops compatible with the
OneAPI 2025.0 toolkit. This is an important update that will allow us to
develop and ship our most demanding models on this innovative hardware.
---------
Signed-off-by: baodii <di.bao@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
This is a faster and more memory-efficient implementation of
`zero_to_fp32`.
The previous version doubled the memory usage, which caused CPU OOM for
very large models (e.g. Llama 405B).
b647fb2470/deepspeed/utils/zero_to_fp32.py (L438-L441)
## How does it work?
1. **Lazy loading**: Load the checkpoint with `mmap=True`, so the weights
are memory-mapped rather than loading all the storages into memory.
2. **Lazy merge**: `GatheredTensor` holds the mmaped weights and tensor
offsets. It is a memory-efficient pseudo tensor. Only when
`tensor.contiguous()` is called does it load the related weights into memory
and merge them into a single tensor.
3. **Release memory in time**: Save the checkpoint shard by shard, and
release the memory once a shard is saved.
Throughout the process, only one shard of tensors is kept in memory.
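A minimal sketch of steps 1 and 3, assuming a recent PyTorch with `mmap` support in `torch.load` and `safetensors` for sharded saving; the file names and sharding scheme are illustrative, not the actual `zero_to_fp32` implementation.
```python
import torch
from safetensors.torch import save_file

# 1. Lazy loading: tensors stay memory-mapped until they are actually touched.
state = torch.load("pytorch_model.bin", map_location="cpu", mmap=True, weights_only=True)

# 3. Save shard by shard and drop each shard as soon as it is written, so only
#    one shard of materialized tensors lives in memory at a time.
#    (.contiguous() marks the point where the real implementation merges weights.)
keys = list(state.keys())
shard_size = max(1, len(keys) // 4)
for i in range(0, len(keys), shard_size):
    shard = {k: state[k].contiguous() for k in keys[i:i + shard_size]}
    save_file(shard, f"model-{i // shard_size:05d}.safetensors")
    del shard
```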
## How much benefit in speed and memory?
Experiments were conducted on a Linux host with 1TB of memory. Here is a
detailed comparison:
| model (old -> new) | world size | peak memory (GB) | elapsed time (h:mm:ss) |
|----------------------|------------|--------------|--------------------|
| llama3-8B(old->new) | 8 | 90 -> 41 | 0:02:17 -> 0:01:10 |
| llama2-13B(old->new) | 8 | 146 -> 54 | 0:02:30 -> 0:01:47 |
| llama2-70B(old->new) | 16 | 789 -> 159 | 0:20:47 -> 0:20:45 |
| qwen1.5-110B(old->new) | 32 | OOM -> 217 | ? -> 0:34:21 |
| llama3-405B(old->new) | 192 | OOM -> 262 | ? -> 2:09:59 |
You can reproduce with the following scripts
```sh
# 1. install requirements
apt-get install time
# 2. prepare zero-3 checkpoints
# 3. convert zero to fp32 checkpoints
/usr/bin/time -v python zero_to_fp32.py . output_dir/ --safe_serialization
```
- **memory**: Theoretically, this PR reduces the memory cost from `2M`
to `(1/n)M`, where `M` is the memory cost of the full weights and `n` is
the number of shards.
- **speed**: The speed gain mainly comes from avoiding extra tensor
copying. The benefit may be slight.
## Impl history
- [v1](19712a1c75 (diff-6a2ca3427fa608c387b7351359f98cfc1313be6e960cee86344ff246bf1b8326R441-R447)): a hf_hub compatible approach. It was discarded due to the controversial implementation of `data_ptr()`.
- [v2](https://github.com/microsoft/DeepSpeed/pull/6658/files): a simple
approach with `torch.empty`
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds Z3 coalesced fetch to the ZeRO optimization options. Some existing
logic could be reused, but it is difficult to expose it as an optimization
choice (I only discovered this logic while trying to implement the feature).
The benefit of this approach is reduced host overhead (far fewer hooks) during
recursive parameter fetching, especially in fine-grained models such as those
with a large number of MoE experts. This is particularly helpful for
host-sensitive devices (such as HPU), where it achieved a 40% performance
improvement in our customer workloads.
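A hypothetical, framework-agnostic illustration of the hook-reduction idea; the `fetch()` placeholder stands in for the ZeRO-3 all-gather and is not DeepSpeed's API.
```python
import torch
import torch.nn as nn

def fetch(params):
    pass  # placeholder for the coalesced ZeRO-3 parameter all-gather

def register_coalesced_fetch(root: nn.Module):
    # One pre-forward hook on the parent gathers every descendant parameter
    # at once, instead of one hook (and one fetch) per leaf module.
    params = list(root.parameters())
    root.register_forward_pre_hook(lambda module, inputs: fetch(params))

experts = nn.Sequential(*[nn.Linear(16, 16) for _ in range(8)])
register_coalesced_fetch(experts)  # one hook instead of eight
_ = experts(torch.randn(2, 16))
```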
FYI @delock @deepcharm
---------
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
There is a typo and naming inconsistency in the cpu-adam code that, while not
affecting functionality, impacts code readability. Specifically, the
type name `ds_params_percision_t` contains a typo ('percision'), whereas
the related type name `ds_state_precision_t` is spelled correctly. Fixing this
typo and inconsistency improves code readability, maintainability, and further
development.
I have tested the corrected version of cpu_adam, and it compiles and
runs successfully.
Compilation Log:
<img width="2560" alt="image"
src="https://github.com/user-attachments/assets/b7bc307d-9c9d-4ab7-8671-34e565903ca5">
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.15.4
Author - @loadams
Co-authored-by: loadams <loadams@users.noreply.github.com>