Commit graph

2573 commits

Author SHA1 Message Date
Nadav Elyahu 3b09d945ea
fix pipeline eval_batch micro_batches argument for schedule (#6484)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-09-05 22:07:57 +00:00
Jinxing Pan 4f803852ac
Op_builder->is_compatible quiet warning (#6093)
Set the default value of op_builder/xxx.py/is_compatible()/verbose to
False to quiet warnings.
Add a verbose check before
op_builder/xxx.py/is_compatible()/self.warning(...).
Otherwise the verbose arg has no effect.
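The gating described above can be sketched in plain Python (hypothetical class and method names, not the actual DeepSpeed op builder code):

```python
class OpBuilderSketch:
    """Minimal sketch of an op builder with a verbose-gated warning."""

    def __init__(self):
        self.messages = []

    def warning(self, msg):
        self.messages.append(msg)

    def is_compatible(self, verbose=False):
        # verbose defaults to False, keeping build output quiet.
        compatible = False  # pretend a compatibility probe failed
        if not compatible and verbose:
            # Guarding the warning on verbose is what makes the
            # argument actually take effect.
            self.warning("op not compatible")
        return compatible


builder = OpBuilderSketch()
builder.is_compatible()              # quiet: no warning recorded
builder.is_compatible(verbose=True)  # warning recorded
```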

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-09-05 17:03:34 +00:00
Nadav Elyahu 857780a85a
HPU: add required ENV vars to accelerator init (#6495)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-09-05 15:43:08 +00:00
Logan Adams c210e601e3
Update version.txt after 0.15.1 release (#6493)
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.15.1
Author           - @loadams

Co-authored-by: loadams <loadams@users.noreply.github.com>
2024-09-04 18:34:42 -07:00
Alex Morehead 10ba3dde84
Handle an edge case where `CUDA_HOME` is not defined on ROCm systems (#6488)
* Handles an edge case when building `gds` where `CUDA_HOME` is not
defined on ROCm systems
2024-09-04 22:28:13 +00:00
Olatunji Ruwase 662a421b05
Safe usage of popen (#6490)
Avoid shell=True security issues with Popen
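A minimal illustration of the safer pattern (illustrative command, not the actual DeepSpeed call sites):

```python
import subprocess

# Pass the command as an argument list and keep shell=False (the default):
# this avoids the shell-injection risk of shell=True with a command string
# built from untrusted input.
proc = subprocess.Popen(["echo", "hello"], stdout=subprocess.PIPE)
out, _ = proc.communicate()
result = out.decode().strip()
```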
2024-09-04 21:06:04 +00:00
Jiancheng Liu ddd3571823
Add default value to "checkpoint_folder" in "load_state_dict" of bf16_optimizer (#6446)
Add default value `checkpoint_folder=None` for compatibility.
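The compatibility pattern can be sketched as follows (a hypothetical stand-in, not the actual bf16_optimizer code):

```python
# Sketch of the backward-compatible signature change: older callers that
# never passed checkpoint_folder keep working once it defaults to None.
def load_state_dict_sketch(state_dict, checkpoint_folder=None):
    # Hypothetical stand-in for the optimizer's load_state_dict.
    if checkpoint_folder is None:
        return "loaded from state_dict"
    return f"loaded from {checkpoint_folder}"


assert load_state_dict_sketch({}) == "loaded from state_dict"
assert load_state_dict_sketch({}, "ckpt/") == "loaded from ckpt/"
```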

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-09-04 20:32:02 +00:00
Olatunji Ruwase 5d1a30c033
DS_BUILD_OPS should build only compatible ops (#6489)
Currently DS_BUILD_OPS=1 fails on incompatible ops. This is a deviation
from
[documentation](https://www.deepspeed.ai/tutorials/advanced-install/#pre-install-deepspeed-ops)
which states that only compatible ops are built.

(screenshot of the failure attached in the PR)
2024-09-04 20:30:56 +00:00
Masahiro Tanaka ddeb0c19a0
Fix patch for parameter partitioning in zero.Init() (#6388)
This PR fixes an issue addressed in #5921.
With this change, we only apply the parameter-partitioning patch to
classes that define their own `__init__`, so that the patch is not
applied multiple times.
A class without its own `__init__` now uses its superclass's `__init__`.
So this PR also applies the patch to the root class,
`torch.nn.modules.module.Module`.
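The idea can be sketched in plain Python (a hypothetical patcher, not the actual zero.Init implementation):

```python
def all_subclasses(cls):
    # Collect every (transitive) subclass of cls.
    out = []
    for sub in cls.__subclasses__():
        out.append(sub)
        out.extend(all_subclasses(sub))
    return out


def patch_init_once(root_cls, wrap):
    """Wrap __init__ only where a class defines its own, so a subclass
    that merely inherits __init__ is not patched a second time."""
    for cls in [root_cls] + all_subclasses(root_cls):
        if "__init__" in cls.__dict__:  # own __init__, not inherited
            cls.__init__ = wrap(cls.__init__)


calls = []


def wrap(fn):
    def wrapped(self, *args, **kwargs):
        calls.append(type(self).__name__)
        return fn(self, *args, **kwargs)
    return wrapped


class Root:            # stands in for torch.nn.modules.module.Module
    def __init__(self):
        pass


class Child(Root):     # no own __init__: inherits Root's patched one
    pass


patch_init_once(Root, wrap)
Child()
```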

Thanks @VeryLazyBoy for the report and initial solution.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-09-04 18:43:20 +00:00
Joshua C. Randall 9d17116fcd
print warning if actual triton cache dir is on NFS, not just for default (#6487)
Move the logic that prints a warning when the triton cache dir is on NFS
so that it acts on the actual calculated cache_dir rather than on the default.

this means that:
- when the default directory (in the user's home directory) is on NFS
but `TRITON_CACHE_DIR` is set to a non-NFS directory, no warning will be
printed whereas prior to this change a spurious and confusing warning
was printed
- when the user's home directory is not on NFS but `TRITON_CACHE_DIR` is
set to an NFS directory, a warning will be printed whereas prior to this
change no warning would be printed
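The decision logic can be sketched as follows (hypothetical helper names; the real check inspects the filesystem type):

```python
def should_warn_nfs(env, default_dir, is_nfs):
    # Warn based on the directory actually used, not the default:
    # TRITON_CACHE_DIR overrides the default location when set.
    cache_dir = env.get("TRITON_CACHE_DIR", default_dir)
    return is_nfs(cache_dir)


# Stand-in for a real NFS detector (e.g. via statfs).
nfs_paths = {"/nfs/home/user/.triton"}
is_nfs = lambda path: path in nfs_paths

# Default on NFS, but override points elsewhere: no warning.
assert not should_warn_nfs({"TRITON_CACHE_DIR": "/tmp/triton"},
                           "/nfs/home/user/.triton", is_nfs)
# Default fine, but override on NFS: warn.
assert should_warn_nfs({"TRITON_CACHE_DIR": "/nfs/home/user/.triton"},
                       "/scratch/.triton", is_nfs)
```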
 
fixes #6486
2024-09-04 18:22:07 +00:00
Olatunji Ruwase 5df12a4a85
DeepNVMe tutorial (#6449)
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: jomayeri <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>
2024-09-04 15:31:31 +00:00
Nadav Elyahu cfc6ed3722
bf16_optimizer: fixes to different grad acc dtype (#6485)
- fix the step function to cast gradients to FP32 before the optimizer
step when a different gradient accumulation data type is used
- remove the redundant function initialize_optimizer_states()
2024-09-04 15:27:26 +00:00
Logan Adams 9b7fc54524
Add workflow to build DS without torch to better test before releases (#6450)
- Adds a nightly workflow that tests to confirm we can build DeepSpeed
without torch as a dependency, as this often only surfaces when doing a
release.
2024-08-29 23:43:21 +00:00
Raza Sikander 89c4d9f5a7
TestLowCpuMemUsage UT get device by device_name (#6397)
Co-authored-by: Shaik Raza Sikander <srsikander@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-29 22:05:20 +00:00
Ramya Ramineni a7ffe540fc
Avoid gds build errors on ROCm (#6456)
This PR is to avoid the below error during DeepSpeed build on ROCm. The
error is because of the incompatibility of GDSBuilder extension on ROCm.

```
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-lv1v39xc/setup.py", line 180, in <module>
        op_compatible = builder.is_compatible()
      File "/tmp/pip-req-build-lv1v39xc/op_builder/gds.py", line 47, in is_compatible
        CUDA_LIB64 = os.path.join(CUDA_HOME, "lib64")
      File "/opt/conda/envs/py_3.9/lib/python3.9/posixpath.py", line 76, in join
        a = os.fspath(a)
    TypeError: expected str, bytes or os.PathLike object, not NoneType
    Total number of unsupported CUDA function calls: 0


    Total number of replaced kernel launches: 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output
```
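The failure mode is `os.path.join` receiving `None` when `CUDA_HOME` is unset on ROCm. A hedged sketch of the guard (hypothetical function name, not the actual `op_builder/gds.py` code):

```python
import os


def gds_compatible_sketch(cuda_home):
    # Bail out early instead of letting os.path.join(None, ...) raise
    # "TypeError: expected str, bytes or os.PathLike object, not NoneType".
    if cuda_home is None:
        return False
    cuda_lib64 = os.path.join(cuda_home, "lib64")
    return os.path.isdir(cuda_lib64)


assert gds_compatible_sketch(None) is False  # no crash on ROCm-style setups
```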

cc: @jithunnair-amd

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-29 17:15:52 +00:00
Yizhou Wang 0cd9bf5978
[CCL] fix condition issue in ccl.py (#6443)
The previous condition check was incorrect; it caused the condition to
always evaluate to True.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-29 16:42:53 +00:00
Joe Mayer f2739b4f72
Change GDS to 1 AIO thread (#6459)
The `numThreads` config option determines how many threads are used to
read from the file. In the CPU case these threads are created via AIO,
in the GDS case they are handled by the GDS library via the cufile.json.
If we were to also create AIO threads, the effect would be
multiplicative: for example, 8 AIO threads * 8 GDS threads would be 64
threads reading from the file when the user really only intended 8.
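The thread arithmetic can be made concrete (illustrative numbers only):

```python
# Sketch of the multiplicative thread effect: if AIO threads were
# created in addition to the GDS library's own threads, the effective
# reader count would be their product, not numThreads.
num_threads = 8          # what the user configured (numThreads)
aio_threads = num_threads
gds_threads = num_threads
effective = aio_threads * gds_threads
assert effective == 64   # 8x what the user intended

# The fix: in the GDS case, use a single AIO thread so the GDS
# library's thread pool alone determines parallelism.
aio_threads = 1
assert aio_threads * gds_threads == num_threads
```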

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-08-29 15:59:32 +00:00
Siddartha Naidu 4864991f53
Allow triton==3.0.x for fp_quantizer (#6447)
Tested with triton==3.0.x and the kernel tests pass so adding as an
allowed version.

Triton 2.3.x is not well supported on arm64. Triton 3.0.0 is supported
on arm64 and it appears the fp8 kernel works fine with triton==3.0.0 so
this simplifies usage on arm hosts (GH200).
2024-08-28 18:04:40 +00:00
Roger Feng 405b6d5e33
Add the accelerator setup guide link in Getting Started page (#6452)
Add the link of
https://www.deepspeed.ai/tutorials/accelerator-setup-guide/ into the
installation section in Getting Started page so that users can easily
find the doc.

Signed-off-by: roger feng <roger.feng@intel.com>
2024-08-28 16:55:33 +00:00
Raza Sikander 9bc4cd01b7
Store/Load CIFAR from local/offline (#6390)
CIFAR10_DATASET_PATH -> path where the dataset is stored
STORE_CIFAR10        -> store the dataset (1/0)
CIFAR10_OFFLINE      -> use the offline dataset (1/0)
Misc:
Added getDeviceId to get the device id by name for the accelerator
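A hypothetical reader for these variables (the real benchmark code may parse them differently):

```python
def cifar_config_sketch(env):
    # Illustrative defaults; treat "1" as enabled, anything else as off.
    return {
        "dataset_path": env.get("CIFAR10_DATASET_PATH", "./data"),
        "store": env.get("STORE_CIFAR10", "0") == "1",
        "offline": env.get("CIFAR10_OFFLINE", "0") == "1",
    }


cfg = cifar_config_sketch({"CIFAR10_DATASET_PATH": "/datasets/cifar10",
                           "CIFAR10_OFFLINE": "1"})
```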

---------

Co-authored-by: Shaik Raza Sikander <srsikander@habana.ai>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-08-28 16:28:22 +00:00
Raza Sikander b5cf30a085
Dtype support check for accelerator in UTs (#6360)
Check whether the dtype is supported by the accelerator, and skip the test if it is not.

---------

Co-authored-by: Shaik Raza Sikander <srsikander@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-08-28 16:02:51 +00:00
Dogacan Colak 1041c8a172
Add documentation for launcher without SSH (#6455)
#5728

---------

Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-08-28 15:28:10 +00:00
Logan Adams eb37cacf22 Revert "Revert "Fix torch check (#6402)""
This reverts commit 77f61e6771.
2024-08-27 13:02:44 -07:00
Logan Adams 77f61e6771 Revert "Fix torch check (#6402)"
This reverts commit 55b4cae80f.
2024-08-27 13:02:07 -07:00
jiahao su 1bfa341bbd
add Huawei Ascend NPU setup guide (#6445)
This PR adds the setup instructions for Huawei Ascend NPU. Please refer
to the remainder of the guide for instructions on other devices.

---------

Co-authored-by: sjh <sjh1270@163.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-08-27 18:15:48 +00:00
Sam Ade Jacobs 8ac42ed73a
Fix redundant seq data parallel grp argument in Z3/MiCS (#5352)
Deprecate the redundant sequence_data_parallel_group argument.
Users/client code control the process group across which ZeRO-3
parameters are partitioned by choosing one of [None,
data_parallel_group, sequence_data_parallel].

---------

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-27 01:10:21 +00:00
Joe Mayer e2654bfd1a
Fix Type Mismatch (#6410)
`num_bytes_per_thread` was a smaller type than `file_num_bytes`, which
caused issues when dividing by `num_threads`.
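A sketch of the kind of bug described (illustrative only; the actual fields live in the C++ AIO code, modeled here by masking to 32 bits):

```python
# If file_num_bytes is a 64-bit count but num_bytes_per_thread is stored
# in a 32-bit field, the per-thread share silently truncates.
file_num_bytes = 8 * 2**30          # 8 GiB
num_threads = 8
per_thread_64 = file_num_bytes // num_threads
per_thread_32 = per_thread_64 & 0xFFFFFFFF  # what a 32-bit field keeps
assert per_thread_32 == per_thread_64       # 1 GiB still fits in 32 bits

# With a larger file the 32-bit field overflows:
file_num_bytes = 64 * 2**30         # 64 GiB
per_thread_64 = file_num_bytes // num_threads
per_thread_32 = per_thread_64 & 0xFFFFFFFF
assert per_thread_64 == 8 * 2**30
assert per_thread_32 == 0           # truncated: widening the type fixes this
```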

Co-authored-by: jomayeri <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-08-23 23:17:38 +00:00
Logan Adams ca4449e843
Update version.txt after 0.15.0 release (#6403)
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.15.0
Author           - @loadams

Co-authored-by: loadams <loadams@users.noreply.github.com>
2024-08-22 15:48:34 -07:00
Logan Adams 55b4cae80f
Fix torch check (#6402) 2024-08-22 15:46:10 -07:00
Michael Wyatt 0a4457cc48
Pydantic v2 migration (#5167)
Pydantic v2 has been out for some time now. We have been relying on
using the v1 API available in v2 until now. This is a refresh of #3902
to bring proper v2 support to DeepSpeed.

Corresponding DeepSpeed-MII PR
[here](https://github.com/microsoft/DeepSpeed-MII/pull/423).

@loadams

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Abhishek Kulkarni <11399+adk9@users.noreply.github.com>
Co-authored-by: Abhishek Kulkarni <abkulkarni@microsoft.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
2024-08-22 15:38:13 -07:00
Logan Adams 8c2be7e942 Update to prepare for next release 2024-08-22 15:37:17 -07:00
Yizhou Wang b81b197ec3
[XPU] API align with new intel pytorch extension release (#6395)
With the new intel_extension_for_pytorch release, the DeepSpeed kernel
API changed; the upstream DeepSpeed XPU op builder needs to be aligned with it.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-22 17:24:38 +00:00
Guanhua Wang 51da191eb5
add pip install cutlass version check (#6393)
fix this issue: https://github.com/microsoft/DeepSpeed/issues/6006

cc. @tjruwase
2024-08-22 17:18:17 +00:00
Logan Adams 0f0f231e8a
Correct op_builder path to xpu files for trigger XPU tests (#6398) 2024-08-22 08:38:42 -07:00
Masahiro Tanaka 649b078571
Add Japanese translation of Windows support blog (#6394)
This PR adds the Japanese translation of the release blog of Windows
support.
2024-08-21 18:24:27 -07:00
Perry Zou e6fcc226c7
fix fp16 Qwen2 series model to DeepSpeed-FastGen (#6028)
Based on PR #5403 (Qwen1.5-MOE) and #5219 (Qwen1.5), add support for
the Qwen2 series models, including the 0.5B, 1.5B, 7B, 57B-A14B, and
72B models.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-21 23:59:39 +00:00
ranzhejiang 7260890452
reduce cpu host overhead when using moe (#5578)
The operation `.to('cpu')` is not necessary for exp_counts, and it
causes a device-to-host synchronization that hurts performance.

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-08-21 21:52:48 +00:00
Sam Ade Jacobs 8b191d7ccf
Long sequence parallelism (Ulysses) integration with HuggingFace (#5774)
This PR enhances capabilities of [DeepSpeed long sequence (context)
parallelism (aka DS
Ulysses)](https://dl.acm.org/doi/10.1145/3662158.3662806) with support
for HuggingFace (and by extension other frameworks) models. With HF
integration, users can use sequence parallelism for model
pre/mid/post-training, finetuning, etc. Usage requires both _torch
>=2.2.2 and flash-attention_. ZeRO-1 and ZeRO-2 are supported; ZeRO-3
and SDPA support is in progress. The corresponding PR in HF is
[PR32305](https://github.com/huggingface/transformers/pull/32305).

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-21 01:46:50 +00:00
Joe Mayer b65ea50631
GDS Swapping Fix (#6386)
Fix the GDS API call.

Co-authored-by: Ubuntu <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>
2024-08-20 23:25:37 +00:00
Jinxing Pan 96393f561d
Update linear.py compatible with torch 2.4.0 (#5811)
deepspeed/runtime/zero/linear.py:67: FutureWarning:
`torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use
`torch.amp.custom_bwd(args..., device_type='cuda')` instead.

Fixes: #5682

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-08-20 17:22:02 +00:00
Joe Mayer 3831d4b57a
Bug Fix 5880 (#6378)
Allow HF args to be passed through the class to `AutoConfig.from_pretrained`.

Co-authored-by: Ubuntu <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-20 16:04:37 +00:00
Olatunji Ruwase 01fe65b300
DeepSpeed on Windows blog (#6364)
DeepSpeed on Windows blog

---------

Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-19 11:16:22 -07:00
Logan Adams ba0ab7138d
Pin transformers version on nv-nightly (#6002)
nv-nightly was failing due to updates in transformers. We will need to
introduce a real fix for these, but this at least gets the test
passing; we also need to update transformers support for MII.
2024-08-19 09:51:31 -07:00
Joe Mayer 5f0d177fd7
DeepNVMe GDS (#5852)
PR for the GDS AIO code.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Ubuntu <deepspeed@H100-VM2.shlnn55tgwve1eacvp21ie45dg.jx.internal.cloudapp.net>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-08-19 04:28:50 +00:00
Joe Mayer c2e3a706b5
Add and Remove ZeRO 3 Hooks (#5658)
Gives the ability to add and remove the forward hooks in ZeRO 3 by using
a context manager. These code changes were taken from a Huggingface
[PR](https://github.com/huggingface/trl/pull/1617) and integrated for
direct support in DeepSpeed.

This is useful in the inference case and the speedup can be observed
[here](https://github.com/huggingface/trl/pull/1483).

---------

Co-authored-by: root <root@deepspeed-c000004.2d1icxc5dsxehnpuwt3ifc34ph.gvxx.internal.cloudapp.net>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-08-16 18:40:25 +00:00
Masahiro Tanaka 1ab1928d79
Enable dynamic shapes for pipeline parallel engine inputs (#5481)
This PR enables dynamic shapes for inputs to pipeline parallel (PP)
engine.

Currently the PP engine checks tensor shapes and allocates
communication buffers during the first forward/backward passes. This
causes a tensor shape mismatch error when input tensor shapes change.

This PR adds an option to check tensor shapes at every iteration and
allocate buffer based on the shapes. As shown below, you can enable this
feature by passing `dynamic_shape=True` to `PipelineModule`.
Note that this might have a performance impact, so the option defaults
to False.

```python
model = PipelineModule(
...
   dynamic_shape=True
)
```

This will increase the overhead of buffer allocation and communication
for tensor metadata. To mitigate the overhead, this PR also includes
these improvements:
- Consolidate multiple communication calls to send/recv tensor shapes
9f96ad4049
- Reuse (extend) communication buffer instead of creating a new one
b3c07504be

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-08-16 15:19:12 +00:00
Liran Bachar 4d4ff0eddd
Move inf_or_nan_tracker to cpu for cpu offload (#5826)
Must use the same device as grad_partitions_flat_buffer

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-08-15 22:42:32 +00:00
inkcherry 9a3ede7079
add moe topk(k>2) gate support (#5881)
Some users need topk > 2 to train MoE models, for example:
https://huggingface.co/Qwen/Qwen2-57B-A14B/blob/main/config.json. This
PR adds support for topk (k > 2) gates.

- add topk (k>2) support
- add drop token policy based on position and probabilities.
- unit tests

---------

Co-authored-by: Kurt Chen <kurt.chen@intel.com>
Co-authored-by: Jin, Youzhi <youzhi.jin@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-08-15 17:43:45 +00:00
Rohan Potdar 30428d0318
move pynvml install to setup.py (#5840)
Only install pynvml on NVIDIA GPUs, not on all accelerators.
2024-08-15 16:27:10 +00:00
Logan Adams 3f6df9a236
Update version.txt after 0.14.5 release (#5982)
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.14.5
Author           - @loadams

Co-authored-by: loadams <loadams@users.noreply.github.com>
2024-08-15 11:06:46 -07:00