Commit Graph

2528 Commits

Author SHA1 Message Date
wyooyw b647fb2470
Fix expert grad scaling problem with ZeRO optimizer (#6546)
Fix [#6545]

Work:
- expert gradient average: divide by dp_world_size instead of edp_world_size (see the sketch below)
- unit test: make sure models with different dp/ep configurations have the same expert
gradients
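
A minimal sketch of the scaling change, with hypothetical names for the gradient tensor and world-size argument:

```python
import torch

# Minimal sketch (hypothetical names): scale expert gradients by the full
# data-parallel world size rather than the expert-data-parallel world size,
# so expert and non-expert gradients are scaled consistently.
def average_expert_grad(grad: torch.Tensor, dp_world_size: int) -> torch.Tensor:
    return grad / dp_world_size  # previously divided by edp_world_size
```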

---------

Co-authored-by: wangyiou <wangyiou@xiaohongshu.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-23 00:08:39 +00:00
Logan Adams bf03f48352
Update version.txt after 0.15.3 release (#6652)
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.15.3
Author           - @jomayeri

Co-authored-by: jomayeri <jomayeri@users.noreply.github.com>
2024-10-22 14:15:45 -07:00
Liangliang Ma a24cdd6b67
[XPU] [DeepNVMe] use same cpu_op_desc_t with cuda (#6645)
We found that #6592 uses `_pinned_tensor_mgr` to create the CPU bounce
buffer, which is the same as what our XPU accelerator currently does,
so there is no need for an XPU-specific cpu_op_desc_t.
In this PR:
1. remove the custom csrc/xpu/aio/deepspeed_cpu_op.cpp
2. modify the xpu async_io opbuilder.

This cannot easily be done by reverting #6532, because we added some
source files the last time the GDS feature went into DS. So we filed this new PR :)
2024-10-22 14:45:05 +00:00
Yizhou Wang 11bbf45af5
[XPU] host timer check version from Torch 2.5 to Torch 2.6 (#6633)
Elapsed time will be supported in Torch 2.6.

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-10-22 06:53:15 +00:00
Liangliang Ma 40bde528bc
[XPU] upgrade xpu max1100 CI workflow to pytorch2.3 (#6646)
With intel-extension-for-pytorch 2.3.110 released last month, the max1100 CI
workflow can be updated too. Software versions are aligned with #6570.

An increased CI test scope for torch/ipex 2.3 will come in a later PR.

This workflow passed on the self-hosted runner in my cloned repo.
2024-10-21 12:25:11 +00:00
Joe Mayer 6eefc3d0ea
Fix Memory Leak In AIO (#6630)
Fixes a memory leak in the AIO pinned tensor as well as an incorrect
function type for the GDS op.

---------

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-10-18 02:58:06 +00:00
Masahiro Tanaka c9fc34a4be
Use file store for tests (#6632)
This PR changes the `init_method` for tests to `FileStore` for
robustness.
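
As an illustration only (not the test harness itself), a minimal sketch of a file-backed `init_method` with a single-process gloo group:

```python
import os
import tempfile
import torch.distributed as dist

# Sketch: a file:// init_method uses a FileStore under the hood, avoiding
# TCP-store flakiness. Single-process gloo group purely for demonstration.
store_path = os.path.join(tempfile.mkdtemp(), "ds_test_store")
dist.init_process_group(backend="gloo",
                        init_method=f"file://{store_path}",
                        rank=0,
                        world_size=1)
dist.destroy_process_group()
```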
2024-10-17 22:15:25 +00:00
Masahiro Tanaka a36db9cc1c
Update torch version in workflows (#6631)
Set PyTorch version in CI workflows to v2.5.

Context: The
[error](https://github.com/microsoft/DeepSpeed/actions/runs/11371525624/job/31633793986?pr=6630)
in #6630 might have been caused by a PyTorch version mismatch.
2024-10-17 17:50:55 +00:00
jiahao su c9899dc14a
Add README Pipeline Status for Huawei Ascend NPU (#6588)
Hello! Following the merge of
https://github.com/microsoft/DeepSpeed/pull/6445, I have implemented a
CI pipeline to validate the Huawei Ascend NPU.

---------

Co-authored-by: sjh <sjh1270@163.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-10-15 23:36:10 +00:00
Masahiro Tanaka 1a45bd8e8c
Lock cache file of HF model list (#6628)
The error in the following log suggests that the cache file for the HF model
list can become corrupted:

https://github.com/microsoft/DeepSpeed/actions/runs/11343665365/job/31546708118?pr=6614

The actual cause of the above error is unclear, but `_hf_model_list`
potentially breaks the cache file when it is concurrently called from
multiple processes. This PR locks the cache file to ensure
`_hf_model_list` safely reads and writes the file.
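
A minimal sketch of the locking idea (not the actual helper; the function name is illustrative):

```python
import fcntl
import json

# Sketch: hold an exclusive advisory lock while rewriting the shared cache
# file so concurrent processes cannot corrupt it.
def update_model_list_cache(path, model_list):
    with open(path, "a+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            f.seek(0)
            f.truncate()
            json.dump(model_list, f)
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```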
2024-10-15 21:49:37 +00:00
Shelly Nahir ce468c3756
add option to disable logger while compiling to avoid graph breaks (#6496)
Adds an option to disable logger calls while compiling to avoid
graph breaks. Here an environment variable determines whether
to activate this option, but it could also be set via the JSON
config file or any other way you see fit.
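
A minimal sketch of the idea, with a hypothetical environment variable name:

```python
import logging
import os

# Sketch: short-circuit logger calls when the user opts out via an (assumed)
# environment variable, so torch.compile never traces a graph-breaking call.
_LOGS_DISABLED = os.getenv("DS_DISABLE_LOGS_WHILE_COMPILING", "0") == "1"  # hypothetical name

def maybe_log(logger: logging.Logger, msg: str) -> None:
    if not _LOGS_DISABLED:
        logger.info(msg)
```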

---------

Co-authored-by: snahir <snahir@habana.ai>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-10-15 18:30:42 +00:00
Xu Song bf60fc0ca6
Support safetensors export (#6579)
## Feature

This commit implements the following features:

- [x] support saving checkpoint as safetensors (more commonly used
format)
- [x] support sharding checkpoints (which is important for very large
models)

Most of the codes are borrowed from
https://github.com/huggingface/transformers/blob/v4.45.1/src/transformers/modeling_utils.py#L2490

## Usage

For `pytorch_model.bin` export
```
python zero_to_fp32.py . output_dir/
```

For  `model.safetensors` export
```
python zero_to_fp32.py . output_dir/ --safe_serialization
```
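
For reference, a rough sketch of what `--safe_serialization` boils down to once the fp32 state dict has been consolidated (tensor names illustrative):

```python
import torch
from safetensors.torch import save_file

# Sketch: write a consolidated state dict in the safetensors format.
state_dict = {"linear.weight": torch.zeros(4, 4), "linear.bias": torch.zeros(4)}
save_file(state_dict, "model.safetensors")
```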

---------

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-15 11:22:31 +00:00
Joe Mayer 85b7469ea0
Add first Step in LR Schedulers (#6597)
Some (not all) of the LR schedulers in runtime were missing the
initialization of the optimizer group lr.
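
A minimal sketch of the missing initialization, using a simplified scheduler skeleton (names assumed, not the actual DeepSpeed classes):

```python
import torch

# Sketch: set every param group's lr to the scheduler's starting value at
# construction time -- the "first step" that some schedulers were missing.
class WarmupLRSketch:
    def __init__(self, optimizer: torch.optim.Optimizer, warmup_min_lr: float = 0.0):
        self.optimizer = optimizer
        for group in optimizer.param_groups:
            group["lr"] = warmup_min_lr
```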

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-14 19:31:45 +00:00
diskkid 13c16c9562
Accept btl_tcp_if_include option through launcher_args (#6613)
This patch fixes issue #4460.
When `btl_tcp_if_include` option is provided through `--launcher_args`,
we use the provided option instead of the hardcoded `--mca
btl_tcp_if_include eth0`. Otherwise we use `--mca btl_tcp_if_include
eth0` as the default for compatibility.

Fixes #4460
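
A minimal sketch of the described behavior (helper name hypothetical):

```python
# Sketch: append the hardcoded interface only when the user did not already
# pass one via --launcher_args.
def build_mpi_launcher_args(launcher_args: str) -> list:
    args = launcher_args.split()
    if "btl_tcp_if_include" not in launcher_args:
        args += ["--mca", "btl_tcp_if_include", "eth0"]
    return args

print(build_mpi_launcher_args("--mca btl_tcp_if_include ens1"))
```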

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-14 19:26:24 +00:00
Olatunji Ruwase 65ab64481f
Add API for updating ZeRO gradients (#6590) 2024-10-14 17:35:41 +00:00
Ma, Guokai cf41e8c4e8
[compile] Show breakdown of graph break (#6601)
This PR extends https://github.com/microsoft/DeepSpeed/pull/6570 by
showing a breakdown of graph breaks. So we can see how graph breaks are
distributed among different reasons. An example of graph break output
can be seen from the following workflow run
https://github.com/microsoft/DeepSpeed/actions/runs/11199157962
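
A minimal sketch of how such a breakdown can be collected, assuming `torch._dynamo.explain` (this is not the workflow's exact code):

```python
import torch
import torch._dynamo as dynamo

# Sketch: report the number of graph breaks and the reason for each one.
def graph_break_breakdown(fn, *args):
    explanation = dynamo.explain(fn)(*args)
    print(f"graph breaks: {explanation.graph_break_count}")
    for reason in explanation.break_reasons:
        print(reason)
```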
2024-10-14 17:31:34 +00:00
Masahiro Tanaka 7a5bc4fdf9
Ignore reuse_dist_env (#6623)
Tests with `reuse_dist_env = True` often cause memory leaks. This PR
ignores `reuse_dist_env` and forcibly sets it to `False`. This change
might slow down the tests, but it is better than having to manually restart
runners and relaunch tests.

Memory usages (See #6578):
- `reuse_dist_env == True`:
https://github.com/microsoft/DeepSpeed/actions/runs/11302940871/job/31439471512
- `reuse_dist_env == False`:
https://github.com/microsoft/DeepSpeed/actions/runs/11303250613/job/31440137894
2024-10-14 16:08:44 +00:00
Masahiro Tanaka 5c4b97f109 apply fp16 autocast only to floating point values 2024-10-11 19:41:10 +00:00
Masahiro Tanaka adec99121b
Add API to get devices of offload states (#6586)
This PR adds an API `deepspeed.runtime.zero.offload_states.get_state_devices`,
which gets the devices of offload states as suggested in
this
[comment](https://github.com/microsoft/DeepSpeed/pull/6011#issuecomment-2358068777).

We could lift this up to `deepspeed.utils` but would need to resolve a
circular import: User code -> `deepspeed.utils` ->
`deepspeed.utils.offload_states` -> `deepspeed.runtime.zero` ->
`deepspeed.runtime.zero.partition_parameters` -> `deepspeed.utils`

This will require a significant refactoring as long as we have
`OffloadStateTypeEnum` in `deepspeed.runtime.zero`.
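
A usage sketch based on the PR description; the exact import paths are assumptions, and `model` stands for an initialized ZeRO-3 engine:

```python
# Usage sketch (import paths assumed, `model` is an initialized ZeRO-3 engine):
from deepspeed.runtime.zero.offload_states import get_state_devices
from deepspeed.runtime.zero.offload_config import OffloadStateTypeEnum

devices = get_state_devices(model, OffloadStateTypeEnum.hp_params)
print(devices)  # e.g. {device(type='cuda', index=0)}
```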

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-10 02:59:26 +00:00
Nir Sonnenschein d7ca3d8373
reduce setting global variables to reduce torch compile graph breaks (#6541)
Setting global variables during training creates graph breaks when
using torch.compile (reading global variables doesn't). This commit
attempts to reduce the setting of global variables in the checkpointing
flows.
There are two main uses of setting global variables:
1. Share data between functions
2. Establish that this is the first call to the code

In most cases, the data in the global variables can
be computed on demand or set once in an initial state in a configure
function.
For the "check that this is the first run" use case, the code was moved to
the configure function (see the sketch below).
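
A minimal sketch of the pattern (names assumed):

```python
# Sketch: the "is this the first call" state is resolved once in a configure
# function rather than being mutated inside the compiled checkpointing path.
_configured = False

def configure(config: dict) -> None:
    global _configured
    if not _configured:
        _configured = True
        # one-time setup derived from config would go here
```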

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-10 00:47:44 +00:00
Joe Mayer a1f98bdc70
AIO CPU Locked Tensor (#6592)
Restores the functionality of the CPU locked tensor in the AIO library
and makes the async_io operator available for the CPU accelerator, i.e., a
CPU-only environment.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-09 21:07:31 +00:00
Masahiro Tanaka 7d751ee890
Clean up prefetched parameters (#6557)
Parameters prefetched by ZeRO3 are sometimes not used. This occurs when
the actual sub-module execution differs from previous tracing. As a
result, the state of the allgather handle for such a parameter remains
`INFLIGHT`, causing functions like `empty_partition_cache` to detect it
and throw an error.
This PR resolves the issue by ensuring that communication finishes and
the parameters are freed.

As this issue was mentioned in #6011, this includes the change of the
branch. We need to merge #6011 first.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-09 15:23:33 +00:00
Logan Adams 55f7f3789e
Update version.txt after 0.15.2 release (#6615)
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.15.2
Author           - @jomayeri

Co-authored-by: jomayeri <jomayeri@users.noreply.github.com>
2024-10-09 10:48:39 -07:00
gyou2021 474a3288cd
Enabled Qwen2-MoE Tensor Parallelism (TP) inference (#6551)
Modified _replace_module in auto_tp.py:
the modification keeps the 'shared_expert_gate' and 'gate' layers in
Qwen2-MoE as their original torch.nn.Linear type and does not change them into
LinearLayer. This way, their weights will not be split across multiple
HPU/GPU cards, so Qwen2-MoE can run on multiple HPU/GPU cards.
Since the weights of 'gate' are not split across multiple HPU/GPU cards,
all-gather operations are not needed, which may improve performance.
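
A minimal sketch of the rule (helper name hypothetical):

```python
import torch.nn as nn

# Sketch: gate layers in Qwen2-MoE stay as nn.Linear so auto-TP does not shard
# their weights across cards.
KEEP_AS_LINEAR = ("shared_expert_gate", "gate")

def should_convert_to_tp_layer(name: str, module: nn.Module) -> bool:
    return isinstance(module, nn.Linear) and not name.endswith(KEEP_AS_LINEAR)
```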

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-09 15:23:16 +00:00
Logan Adams 1062a0c658
Unpin accelerate tests, update lightning with node16 removal. (#6611)
HF accelerate fixes implemented in
https://github.com/huggingface/accelerate/pull/3145 mean that we no
longer need to pin the Accelerate version!

nv-lightning tests now run on Ubuntu 20.04+, so we support Node.js > 16 and
can remove the explicit permissions for that in the env config.
2024-10-09 08:22:41 -07:00
Omar Elayan 645639bcf8
Rearrange inference OPS and stop using builder.load (#5490)
This PR mainly handles all places where InferenceBuilder is used to
access any op or a specific implementation of an op.
Instead, an op is defined and its proper implementation is picked internally,
so the usage is transparent to the user.
What was done in the PR:
1) Added missing ops (added a py file with a fallback mechanism)
2) Added missing fallback implementations for existing ops
3) Removed all usages of builder.load and replaced them with ops
instead.
4) Added a workspace op and inferenceContext, which contains all
workspace-related functions; inferenceContext is the Python fallback of
inferenceContext in CUDA.
5) A small change to the softmax_context signature to fit the fallback
signature (see the sketch of the op/fallback pattern below).
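
A minimal sketch of the op/fallback pattern (the kernel entry point name is hypothetical):

```python
import torch

# Sketch: use the compiled inference kernel when it loads, otherwise fall back
# to plain PyTorch so the caller never has to touch the builder directly.
def softmax_op(x: torch.Tensor) -> torch.Tensor:
    try:
        from deepspeed.ops.op_builder import InferenceBuilder
        kernel = InferenceBuilder().load()
        return kernel.softmax_fp32(x)  # hypothetical kernel symbol
    except Exception:
        return torch.softmax(x, dim=-1)  # python fallback path
```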

---------

Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-09 01:22:28 +00:00
Yichen Yan ca8b1fe945
Handle when `backend` is also in compile_kwargs (#6502)
cc @tohtana
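
A minimal sketch of the described handling (not the engine's exact code):

```python
import torch

# Sketch: if the user also put `backend` inside compile_kwargs, prefer it
# instead of passing the argument twice to torch.compile.
def compile_module(module, backend="inductor", compile_kwargs=None):
    kwargs = dict(compile_kwargs or {})
    backend = kwargs.pop("backend", backend)
    return torch.compile(module, backend=backend, **kwargs)
```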

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-10-08 23:38:43 +00:00
Masahiro Tanaka 5cbbff40bd
Fix device selection using CUDA_VISIBLE_DEVICES (#6530)
This PR addresses #5818.
Instead of contiguous numbers based on the device count, this PR uses
device indices in `--include`.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-08 20:41:44 +00:00
Olatunji Ruwase f74ea69abf
Improve DS logging control (#6602)
Disable `steps_per_print` by default.
2024-10-08 18:38:51 +00:00
Yejing-Lai e97b453645
Add llama3.2 vision autotp (#6577)
Llama3.2-11b and Llama3.2-90b include both a vision model and a text model;
these two models have different num_kv_heads, so we need to set
num_kv_heads dynamically.
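
A minimal sketch of the dynamic lookup (attribute names assumed):

```python
# Sketch: read the kv-head count from each sub-model's config instead of
# assuming a single value for the whole checkpoint.
def get_num_kv_heads(sub_model_config):
    return getattr(sub_model_config, "num_key_value_heads",
                   getattr(sub_model_config, "num_attention_heads", None))
```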

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-08 18:16:04 +00:00
Logan Adams 745dd48b90
Pin accelerate to fix CI failures/issues (#6610) 2024-10-08 11:15:46 -07:00
Logan Adams 00c4b98ba0
Fix SD workflow (#6609)
The SD workflow needed updates for pydantic 2 support that were never
added when we moved to pydantic 2.

Passing nv-sd workflow
[here](https://github.com/microsoft/DeepSpeed/actions/runs/11239699283)
2024-10-08 10:42:22 -07:00
Logan Adams 20695b39b1
Move V100 workflows from cuda 11.1/11.7 to 12.1 (#6607) 2024-10-08 04:06:51 +00:00
Logan Adams 940887ded1
Add SSF Best practices badge (#6604)
Work in progress to ensure we meet SSF best practices:
https://www.bestpractices.dev/en/projects/9530
2024-10-07 11:22:05 -07:00
Logan Adams 239b83a77e
Cleanup CODEOWNERS file to be valid (#6603) 2024-10-07 10:01:53 -07:00
Jagadish Krishnamoorthy b93c7a20c8
[ROCm] Fix subprocess error (#6587)
Fixes https://github.com/microsoft/DeepSpeed/issues/6585
Use shell=True for subprocess.check_output() in the case of ROCm commands.
Do not use shlex.split(), since the command string relies on wildcard expansion.
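
A minimal sketch of the change (the command shown is illustrative only):

```python
import subprocess

# Sketch: keep the command as a single string and let the shell expand the
# wildcard, instead of shlex.split() plus the list form.
output = subprocess.check_output("echo /opt/rocm*", shell=True)
print(output.decode().strip())
```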

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
2024-10-04 14:31:25 -07:00
Logan Adams 8cded575a9
Fix torch include in `op_builder/mlu/fused_adam.py` and update no-torch workflow triggers (#6584)
Changes from #6472 caused the no-torch workflow that is an example of
how we build the DeepSpeed release package to fail (so we caught this
before a release, see more in #6402). These changes also copy the style
used to include torch in other accelerator op_builder implementations,
such as npu
[here](https://github.com/microsoft/DeepSpeed/blob/master/op_builder/npu/fused_adam.py#L8)
and hpu
[here](828ddfbbda/op_builder/hpu/fused_adam.py (L15)).

This also updates the no-torch workflow to run on all changes to the
op_builder directory. The test runs quickly and shouldn't add any
additional testing burden there.

Resolves: #6576
2024-09-27 13:32:48 -07:00
Logan Adams 828ddfbbda
Fixes on the accelerate side mean we do not need to skip this test (#6583)
HF accelerate implemented fixes here:
https://github.com/huggingface/accelerate/pull/3131

This means we can revert the changes from #6574
2024-09-27 09:22:13 -07:00
Yizhou Wang d4e1895076
[COMPILE] workflow for deepspeed + torch.compile (#6570)
We use a simple model + DeepSpeed ZeRO 3 + torch.compile and count graph
break numbers to demonstrate the current status of combining DeepSpeed +
torch.compile.

---------

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-09-27 06:45:42 +00:00
Nadav Elyahu 1caf6e8107
add bfloat16 to inference support dtypes (#6528)
to allow running inference tasks using bfloat16

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-09-27 06:11:06 +00:00
Masahiro Tanaka 047bcf6af6
Add APIs to offload states of model, optimizer, and engine (#6011)
This PR adds the following APIs to offload model, optimizer, and engine
states.

```python
def offload_states(self,
                   include: Container[OffloadStateTypeEnum] = None,
                   device: OffloadDeviceEnum = OffloadDeviceEnum.cpu,
                   pin_memory: bool = True,
                   non_blocking: bool = False) -> None:
    """Move the ZeRO optimizer buffers to the specified device.

    Arguments:
        include: Optional. The set of states to offload. If not provided, all states are offloaded.
        device: Optional. The device to move the ZeRO optimizer buffers to.
        pin_memory: Optional. Whether to pin the memory of the offloaded states.
        non_blocking: Optional. Whether to offload the states asynchronously.
...
def offload_states_back(self, non_blocking: bool = False) -> None:
```

Here is the typical usage.
```python
# Offload after forward, backward, and step
model.offload_states()
# Do something requiring a lot of device memory
...
# Load states back to device memory
model.offload_states_back()
```

You can selectively offload states to balance the offloading overhead
and memory saving.
```python
model.offload_states(include=set([OffloadStateTypeEnum.hp_params, OffloadStateTypeEnum.opt_states]), device=OffloadDeviceEnum.cpu)
```

Performance (4.3B parameters / 4x A100)
- Environment (4x A100, [benchmark
script](https://gist.github.com/tohtana/05d5faba5068cf839abfc7b1e38b85e4))
  - Average Device to Host transfer: 2.45 GB/s, aggregated: 9.79 GB/s
  - Average Host to Device transfer: 11.05 GB/s, aggregated: 44.19 GB/s
- Mem (allocated by PyTorch)
  - Before offload 18.2GB
  - After offloading 17.7MB
- Time ([benchmark
script](https://github.com/microsoft/DeepSpeedExamples/tree/tohtana/offload_states/training/offload_states),
offloading time/loading time)

python output_table.py

| |pin_memory=0 non_blocking=0|pin_memory=0 non_blocking=1|pin_memory=1 non_blocking=0|pin_memory=1 non_blocking=1|
|--:|---------------------------|---------------------------|---------------------------|---------------------------|
| 1|4.34 / 3.42 |4.99 / 2.37 |6.5 / 2.42 |6.0 / 2.39 |
| 2|9.9 / 3.28 |5.1 / 2.34 |6.21 / 2.42 |6.25 / 2.45 |
| 3|9.92 / 3.19 |6.71 / 2.35 |6.33 / 2.38 |5.93 / 2.42 |
| 4|9.55 / 2.82 |7.11 / 2.39 |6.9 / 2.38 |6.5 / 2.43 |
| 5|4.4 / 3.35 |6.04 / 2.41 |6.26 / 2.41 |6.32 / 2.47 |
| 6|4.4 / 3.57 |6.58 / 2.42 |6.88 / 2.4 |6.35 / 2.43 |
| 7|9.51 / 3.12 |6.9 / 2.39 |6.9 / 2.39 |6.46 / 2.4 |
| 8|4.77 / 3.64 |6.69 / 2.39 |7.39 / 2.42 |6.56 / 2.46 |
| 9|9.5 / 3.07 |7.18 / 2.42 |6.67 / 2.39 |7.38 / 2.46 |

TODO:
- Enable offloading to NVMe storage -> NVMe support is non-trivial; I
suggest adding that support in another PR
- [DONE] Discard buffer (and recreate it) instead of offloading. We
don't need to restore the contiguous buffer for reduce.
- [DONE] Check pin_memory improves performance or not

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-09-27 05:37:32 +00:00
Liangliang Ma d45cfd3455
[XPU] Support DeepNVMe new code structure (#6532)
In the DeepNVMe GDS update, many functions were changed in a more abstract
way, and some files were added. These changes break ZeRO-Infinity on XPU. To
bring this feature back, we have this PR:
1. modify the aio opbuilder for the new files.
2. Add a custom cpu_op_desc_t for XPU users. (XPU doesn't handle buffer
alignment here)

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-09-26 20:39:59 +00:00
Nir Sonnenschein ba58682a13
fix errors when setting zero3 leaf modules with torch.compile (#6564)
When setting zero3 leaf modules to a higher-level module and running
with torch.compile, there are a few errors from ZeROOrderedDict.

First, it doesn't support deep copy because it has no constructor that
takes no parameters.

Second, it doesn't check the existence of the ds_status attribute on a param
before accessing the attribute.

Change contributed by Haifeng Chen
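
A heavily simplified sketch of the two fixes (not the real class):

```python
from collections import OrderedDict

# Sketch: (1) default arguments so deepcopy can reconstruct the object,
# (2) check for ds_status before touching it.
class ZeROOrderedDictSketch(OrderedDict):
    def __init__(self, parent_module=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._parent_module = parent_module

    def __getitem__(self, key):
        param = super().__getitem__(key)
        if hasattr(param, "ds_status"):
            pass  # ZeRO-3 gather logic would run here
        return param
```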

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-09-26 14:55:12 +00:00
Masahiro Tanaka c85c8703bc
Fix gradient accumulation for Z2+offload (#6550)
The ZeRO 1/2 optimizer performs incorrect gradient accumulation in the
path for ZeRO2 + Offloading. This issue has two main causes:

1) The micro_step_id in the ZeRO 1/2 optimizer is:

- Initialized to 0 in the constructor.
- Reset to -1 during the backward pass.

For example, given a gradient accumulation step of 4, the micro_step_id
changes as follows:

- For the first global step: 1, 2, 3, 4.
- Subsequently: 0, 1, 2, 3.

2) Gradients are copied to the buffer on the first micro step and
accumulated in the buffer during the following micro steps. However, the
current code incorrectly copies gradients at steps that are not at the
accumulation boundary.

This PR aligns the micro_step_id initialization in both the constructor
and the backward pass, and corrects the condition for copying and
accumulating gradients.
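
A minimal sketch of the corrected accumulation condition (variable names assumed):

```python
import torch

# Sketch: copy into the accumulation buffer only on the first micro step of
# each window, and add on the following micro steps.
def accumulate_grad(buffer: torch.Tensor, grad: torch.Tensor, micro_step_id: int) -> None:
    if micro_step_id == 0:
        buffer.copy_(grad)
    else:
        buffer.add_(grad)
```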

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-09-26 13:11:24 +00:00
andyG 0fbe96a502
[Accelerator] Cambricon MLU support (#6472)
### Description
This PR includes Cambricon MLU accelerator support.
With this PR, DeepSpeed supports MLU as a backend for training and
inference tasks.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-09-26 13:10:52 +00:00
Olatunji Ruwase a5400974df
DeepNVMe perf tuning (#6560)
Add performance tuning utilities: `ds_nvme_tune` and `ds_io`.  
Update tutorial with tuning section.

---------

Co-authored-by: Ubuntu <jomayeri@microsoft.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
2024-09-26 13:07:19 +00:00
Masahiro Tanaka 7622cd9e68
Use msgpack for p2p comm (#6547)
Use msgpack for P2P communication in pipeline engine.
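
A minimal sketch of msgpack-based metadata P2P (not the pipeline engine's exact code):

```python
import msgpack
import torch
import torch.distributed as dist

# Sketch: pack the Python object, send its length, then send the raw bytes as
# a uint8 tensor to the destination rank.
def send_meta(obj, dst: int) -> None:
    payload = msgpack.packb(obj, use_bin_type=True)
    dist.send(torch.tensor([len(payload)], dtype=torch.long), dst)
    dist.send(torch.tensor(list(payload), dtype=torch.uint8), dst)
```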

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-09-26 00:34:38 +00:00
Logan Adams 61de017176
Skip failing newly added tests in accelerate (#6574)
Adding the new tests in
https://github.com/huggingface/accelerate/pull/3097 caused the
nv-accelerate-v100 tests to fail. Due to other CI issues we didn't
notice this at first. This just skips the problematic test for now.

cc: @stas00 / @muellerzr
2024-09-25 16:18:44 -07:00
ShifaAbu 2a56f53395
Added Intel Gaudi to Accelerator Setup Guide (#6543)
Added Intel Gaudi to the list of accelerators in the setup guide.

Co-authored-by: sakell <sakell@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-09-16 15:24:45 -07:00
Logan Adams 170b46e8b1
Add conditional on torch version for scaled_dot_product_attention (#6517)
Changes from #4724 broke support for torch<2.0 in the flops profiler as
the scaled_dot_product_attention [wasn't
added](https://pytorch.org/docs/2.0/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention)
until a beta version in torch 2.0
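
A minimal sketch of such a guard (profiler wiring omitted):

```python
import torch
from packaging import version

# Sketch: only reference scaled_dot_product_attention when the installed torch
# actually provides it (added in the 2.0 betas).
if version.parse(torch.__version__) >= version.parse("2.0"):
    sdpa = torch.nn.functional.scaled_dot_product_attention
else:
    sdpa = None
```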

Resolved: #5534

Todo:
- [ ] Test this
- [ ] Issue resolution with users.
2024-09-11 23:21:43 +00:00