DeepSpeed

Commit Graph

Author	SHA1	Message	Date
wyooyw	b647fb2470	Fix expert grad scaling problem with ZeRO optimizer (#6546 ) Fix [#6545] work: - expert gradient average: divide edp_world_size -> divide dp_world_size - unit test: make sure model with different dp/ep has same expert gradient --------- Co-authored-by: wangyiou <wangyiou@xiaohongshu.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-10-23 00:08:39 +00:00
Logan Adams	bf03f48352	Update version.txt after 0.15.3 release (#6652 ) Auto-generated PR to update version.txt after a DeepSpeed release Released version - 0.15.3 Author - @jomayeri Co-authored-by: jomayeri <jomayeri@users.noreply.github.com>	2024-10-22 14:15:45 -07:00
Liangliang Ma	a24cdd6b67	[XPU] [DeepNVMe] use same cpu_op_desc_t with cuda (#6645 ) We have found that #6592 uses `_pinned_tensor_mgr` to create cpu bounce buffer, which is same with what our xpu accelerator currently doing. So no need to use xpu device specific cpu_op_desc_t. In this PR: 1. remove custom csrc/xpu/aio/deepspeed_cpu_op.cpp 2. modify xpu async_io opbuilder. This issue cannot be easily done with revert #6532 , for we added some source file as last time GDS feature going in DS. So file this new PR :)	2024-10-22 14:45:05 +00:00
Yizhou Wang	11bbf45af5	[XPU] host timer check version from Torch 2.5 to Torch 2.6 (#6633 ) Elapsed time would be supported in Torch 2.6. Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>	2024-10-22 06:53:15 +00:00
Liangliang Ma	40bde528bc	[XPU] upgrade xpu max1100 CI workflow to pytorch2.3 (#6646 ) With intel-extension-for-pytorch=2.3.110 released last month, max1100 CI workflow can be updated too. Software versions aligned with #6570 . Increased CI tests scope for torch/ipex2.3 will be in later PR. This workflow passed in my cloned repo self-hosted runner.	2024-10-21 12:25:11 +00:00
Joe Mayer	6eefc3d0ea	Fix Memory Leak In AIO (#6630 ) Fixing a memory leak in AIO pinned tensor as well as an incorrect function type for gds op. --------- Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>	2024-10-18 02:58:06 +00:00
Masahiro Tanaka	c9fc34a4be	Use file store for tests (#6632 ) This PR changes the `init_method` for tests to `FileStore` for robustness.	2024-10-17 22:15:25 +00:00
Masahiro Tanaka	a36db9cc1c	Update torch version in workflows (#6631 ) Set PyTorch version in CI workflows to v2.5. Context: The [error](https://github.com/microsoft/DeepSpeed/actions/runs/11371525624/job/31633793986?pr=6630) in #6630 might have been caused by the PyTorch version mismatch or something.	2024-10-17 17:50:55 +00:00
jiahao su	c9899dc14a	Add README Pipeline Status for Huawei Ascend NPU (#6588 ) Hello! Following the merge of https://github.com/microsoft/DeepSpeed/pull/6445, I have implemented a CI pipeline to validate the Huawei Ascend NPU. --------- Co-authored-by: sjh <sjh1270@163.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>	2024-10-15 23:36:10 +00:00
Masahiro Tanaka	1a45bd8e8c	Lock cache file of HF model list (#6628 ) The error in the following log suggests that the cache file for HF model list can be broken: https://github.com/microsoft/DeepSpeed/actions/runs/11343665365/job/31546708118?pr=6614 The actual cause of the above error is unclear, but `_hf_model_list` potentially breaks the cache file when it is concurrently called from multiple processes. This PR locks the cache file to ensure `_hf_model_list` safely reads and writes the file.	2024-10-15 21:49:37 +00:00
Shelly Nahir	ce468c3756	add option to disable logger while compiling to avoid graph breaks (#6496 ) adding an option to disable calls for logger while compiling to avoid graph breaks. Here I used an environment variable to determine whether to activate this option, but it can also be determined using the json config file or any other way you see fit. --------- Co-authored-by: snahir <snahir@habana.ai> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>	2024-10-15 18:30:42 +00:00
Xu Song	bf60fc0ca6	Support safetensors export (#6579 ) ## Feature This commit implements the following features: - [x] support saving checkpoint as safetensors (more commonly used format) - [x] support sharding checkpoints (which is important for very large models) Most of the codes are borrowed from https://github.com/huggingface/transformers/blob/v4.45.1/src/transformers/modeling_utils.py#L2490 ## Usage For `pytorch_model.bin` export ``` python zero_to_fp32.py . output_dir/ ``` For `model.safetensors` export ``` python zero_to_fp32.py . output_dir/ --safe_serialization ``` --------- Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-10-15 11:22:31 +00:00
Joe Mayer	85b7469ea0	Add first Step in LR Schedulers (#6597 ) Some (not all) of the LR schedulers in runtime were missing the initialization of the optimizer group lr. --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-10-14 19:31:45 +00:00
diskkid	13c16c9562	Accept btl_tcp_if_include option through launcher_args (#6613 ) This patch fixes issue #4460. When `btl_tcp_if_include` option is provided through `--launcher_args`, we use the provided option instead of the hardcoded `--mca btl_tcp_if_include eth0`. Otherwise we use `--mca btl_tcp_if_include eth0` as the default for compatibility. Fixes #4460 --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-10-14 19:26:24 +00:00
Olatunji Ruwase	65ab64481f	Add API for updating ZeRO gradients (#6590 )	2024-10-14 17:35:41 +00:00
Ma, Guokai	cf41e8c4e8	[compile] Show breakdown of graph break (#6601 ) This PR extends https://github.com/microsoft/DeepSpeed/pull/6570 by showing a breakdown of graph breaks. So we can see how graph breaks are distributed among different reasons. An example of graph break output can be seen from the following workflow run https://github.com/microsoft/DeepSpeed/actions/runs/11199157962	2024-10-14 17:31:34 +00:00
Masahiro Tanaka	7a5bc4fdf9	Ignore reuse_dist_env (#6623 ) Tests with `reuse_dist_env = True` often cause memory leaks. This PR ignores `reuse_dist_env` and forcibly sets it to `False`. This change might slow down the tests, but I think it is better than manually restarting runners and relaunching tests. Memory usage (See #6578): - `reuse_dist_env == True`: https://github.com/microsoft/DeepSpeed/actions/runs/11302940871/job/31439471512 - `reuse_dist_env == False`: https://github.com/microsoft/DeepSpeed/actions/runs/11303250613/job/31440137894	2024-10-14 16:08:44 +00:00
Masahiro Tanaka	5c4b97f109	apply fp16 autocast only to floating point values	2024-10-11 19:41:10 +00:00
Masahiro Tanaka	adec99121b	Add API to get devices of offload states (#6586 ) This PR adds an API `deepspeed.runtime.zero.offload_states get_state_devices`, which gets devices of offload states as suggested in this [comment](https://github.com/microsoft/DeepSpeed/pull/6011#issuecomment-2358068777). We could lift this up to `deepspeed.utils` but would need to resolve a circular import: User code -> `deepspeed.utils` -> `deepspeed.utils.offload_states` -> `deepspeed.runtime.zero` -> `deepspeed.runtime.zero.partition_parameters` -> `deepspeed.utils` This will require a significant refactoring as long as we have `OffloadStateTypeEnum` in `deepspeed.runtime.zero`. --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-10-10 02:59:26 +00:00
Nir Sonnenschein	d7ca3d8373	reduce setting global variables to reduce torch compile graph breaks (#6541 ) Setting global variables during training creates graph breaks when using torch.compile (reading global variables doesn't). This commit attempts to reduce the setting of global variables in the checkpointing flows. There are 2 main uses of setting global variables: 1. Share data between functions 2. Establish that this is the first call to the code. In most cases the data in the global variables can be computed on demand or set once in an initial state in a configure function. For the "check that this is the first run" use case, the code was moved to the configure function. --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-10-10 00:47:44 +00:00
Joe Mayer	a1f98bdc70	AIO CPU Locked Tensor (#6592 ) Restoring the functionality of the cpu locked tensor in the AIO library. Make async_io operator available for CPU accelerator, i.e., CPU only environment. --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-10-09 21:07:31 +00:00
Masahiro Tanaka	7d751ee890	Clean up prefetched parameters (#6557 ) Parameters prefetched by ZeRO3 are sometimes not used. This occurs when the actual sub-module execution differs from previous tracing. As a result, the state of the allgather handle for such a parameter remains `INFLIGHT`, causing functions like `empty_partition_cache` to detect it and throw an error. This PR resolves the issue by ensuring that communication finishes and the parameters are freed. As this issue was mentioned in #6011, this includes the change of the branch. We need to merge #6011 first. --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-10-09 15:23:33 +00:00
Logan Adams	55f7f3789e	Update version.txt after 0.15.2 release (#6615 ) Auto-generated PR to update version.txt after a DeepSpeed release Released version - 0.15.2 Author - @jomayeri Co-authored-by: jomayeri <jomayeri@users.noreply.github.com>	2024-10-09 10:48:39 -07:00
gyou2021	474a3288cd	Enabled Qwen2-MoE Tensor Parallelism (TP) inference (#6551 ) Modified _replace_module in auto_tp.py : The modification keeps the layers 'shared_expert_gate' and 'gate' in qwen2-moe the original type torch.nn.Linear and not changes them into LinearLayer. In this way, their weights will not be split into multiple HPU/GPU cards. Then the qwen2-moe can run on multiple HPU/GPU cards. Since the weights of 'gate' are not split into multiple HPU/GPU cards, all gather operations are not needed, which may improve performance. --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-10-09 15:23:16 +00:00
Logan Adams	1062a0c658	Unpin accelerate tests, update lightning with node16 removal. (#6611 ) HF accelerate fixes implemented in https://github.com/huggingface/accelerate/pull/3145 mean that we no longer need to pin the Accelerate version! nv-lightning tests now run on Ubuntu 20.04+, so we support >node 16, so we can remove the explicit permissions for that in the env config.	2024-10-09 08:22:41 -07:00
Omar Elayan	645639bcf8	Rearrange inference OPS and stop using builder.load (#5490 ) This PR mainly handles all places where InferenceBuilder is used to access any op or a specific implementation for an op. Instead an op is defined, and its proper implementation is picked inside and the usage will be transparent to the user. What was done in the PR: 1) Added missing ops (added a py file with fallback mechanism) 2) Added missing fallback implementations for existing ops 3) removed all usages for builder.load and replaced them with ops instead. 4) added workspace op and inferenceContext which contains all workspace related functions and inferenceContext is the python fallback of inferenceContext in CUDA 5) a small change to softmax_context signature to fit the fallback signature. --------- Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com> Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-10-09 01:22:28 +00:00
Yichen Yan	ca8b1fe945	Handle when `backend` is also in compile_kwargs (#6502 ) cc @tohtana Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>	2024-10-08 23:38:43 +00:00
Masahiro Tanaka	5cbbff40bd	Fix device selection using CUDA_VISIBLE_DEVICES (#6530 ) This PR addresses #5818. Instead of contiguous numbers based on the device count, this PR uses device indices in `--include`. --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-10-08 20:41:44 +00:00
Olatunji Ruwase	f74ea69abf	Improve DS logging control (#6602 ) Disable `steps_per_print` by default.	2024-10-08 18:38:51 +00:00
Yejing-Lai	e97b453645	Add llama3.2 vision autotp (#6577 ) Llama3.2-11b and llama3.2-90b including vision model and text model, these two models have different num_kv_heads, so we need to set num_kv_heads dynamically. Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-10-08 18:16:04 +00:00
Logan Adams	745dd48b90	Pin accelerate to fix CI failures/issues (#6610 )	2024-10-08 11:15:46 -07:00
Logan Adams	00c4b98ba0	Fix SD workflow (#6609 ) SD workflow needed updates when we moved to pydantic 2 support that was never added before. Passing nv-sd workflow [here](https://github.com/microsoft/DeepSpeed/actions/runs/11239699283)	2024-10-08 10:42:22 -07:00
Logan Adams	20695b39b1	Move V100 workflows from cuda 11.1/11.7 to 12.1 (#6607 )	2024-10-08 04:06:51 +00:00
Logan Adams	940887ded1	Add SSF Best practices badge (#6604 ) Work in progress to ensure we meet SSF best practices: https://www.bestpractices.dev/en/projects/9530	2024-10-07 11:22:05 -07:00
Logan Adams	239b83a77e	Cleanup CODEOWNERS file to be valid (#6603 )	2024-10-07 10:01:53 -07:00
Jagadish Krishnamoorthy	b93c7a20c8	[ROCm] Fix subprocess error (#6587 ) Fixes https://github.com/microsoft/DeepSpeed/issues/6585 Use shell=True for subprocess.check_output() in case of ROCm commands. Do not use shlex.split() since command string has wildcard expansion. Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>	2024-10-04 14:31:25 -07:00
Logan Adams	8cded575a9	Fix torch include in `op_builder/mlu/fused_adam.py` and update no-torch workflow triggers (#6584 ) Changes from #6472 caused the no-torch workflow that is an example of how we build the DeepSpeed release package to fail (so we caught this before a release, see more in #6402). These changes also copy the style used to include torch in other accelerator op_builder implementations, such as npu [here](https://github.com/microsoft/DeepSpeed/blob/master/op_builder/npu/fused_adam.py#L8) and hpu [here](`828ddfbbda/op_builder/hpu/fused_adam.py (L15)`). This also updates the no-torch workflow to run on all changes to the op_builder directory. The test runs quickly and shouldn't add any additional testing burden there. Resolves: #6576	2024-09-27 13:32:48 -07:00
Logan Adams	828ddfbbda	Fixes on the accelerate side mean we do not need to skip this test (#6583 ) HF accelerate implemented fixes here: https://github.com/huggingface/accelerate/pull/3131 This means we can revert the changes from #6574	2024-09-27 09:22:13 -07:00
Yizhou Wang	d4e1895076	[COMPILE] workflow for deepspeed + torch.compile (#6570 ) We use a simple model + deepspeed zero 3 + torch.compile and count graph breaks to demonstrate the current status of combining deepspeed + torch.compile. --------- Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>	2024-09-27 06:45:42 +00:00
Nadav Elyahu	1caf6e8107	add bfloat16 to inference support dtypes (#6528 ) to allow running inference tasks using bfloat16 --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Logan Adams <loadams@microsoft.com>	2024-09-27 06:11:06 +00:00
Masahiro Tanaka	047bcf6af6	Add APIs to offload states of model, optimizer, and engine (#6011 ) This PR adds the following APIs to offload model, optimizer, and engine states. ```python def offload_states(self, include: Container[OffloadStateTypeEnum] = None, device: OffloadDeviceEnum = OffloadDeviceEnum.cpu, pin_memory: bool = True, non_blocking: bool = False) -> None: """Move the ZeRO optimizer buffers to the specified device. Arguments: include: Optional. The set of states to offload. If not provided, all states are offloaded. device: Optional. The device to move the ZeRO optimizer buffers to. pin_memory: Optional. Whether to pin the memory of the offloaded states. non_blocking: Optional. Whether to offload the states asynchronously. ... def offload_states_back(self, non_blocking: bool = False) -> None: ``` Here is the typical usage. ```python # Offload after forward, backward, and step model.offload_states() # Do something requiring a lot of device memory ... # Load states back to device memory model.offload_states_back() ``` You can selectively offload states to balance the offloading overhead and memory saving. 
```python model.offload_states(include=set([OffloadStateTypeEnum.hp_params, OffloadStateTypeEnum.opt_states]), device=OffloadDeviceEnum.cpu) ``` Performance (4.3B parameters / 4x A100) - Environment (4x A100, [benchmark script](https://gist.github.com/tohtana/05d5faba5068cf839abfc7b1e38b85e4)) - Average Device to Host transfer time: 2.45 GB/s, aggregated: 9.79 GB/s - Average Host to Device transfer: 11.05 GB/s, aggregated: 44.19 GB/s - Mem (allocated by PyTorch) - Before offload 18.2GB - After offloading 17.7MB - Time ([benchmark script](https://github.com/microsoft/DeepSpeedExamples/tree/tohtana/offload_states/training/offload_states), offloading time/loading time) python output_table.py \| \|pin_memory=0 non_blocking=0\|pin_memory=0 non_blocking=1\|pin_memory=1 non_blocking=0\|pin_memory=1 non_blocking=1\| \|--:\|---------------------------\|---------------------------\|---------------------------\|---------------------------\| \| 1\|4.34 / 3.42 \|4.99 / 2.37 \|6.5 / 2.42 \|6.0 / 2.39 \| \| 2\|9.9 / 3.28 \|5.1 / 2.34 \|6.21 / 2.42 \|6.25 / 2.45 \| \| 3\|9.92 / 3.19 \|6.71 / 2.35 \|6.33 / 2.38 \|5.93 / 2.42 \| \| 4\|9.55 / 2.82 \|7.11 / 2.39 \|6.9 / 2.38 \|6.5 / 2.43 \| \| 5\|4.4 / 3.35 \|6.04 / 2.41 \|6.26 / 2.41 \|6.32 / 2.47 \| \| 6\|4.4 / 3.57 \|6.58 / 2.42 \|6.88 / 2.4 \|6.35 / 2.43 \| \| 7\|9.51 / 3.12 \|6.9 / 2.39 \|6.9 / 2.39 \|6.46 / 2.4 \| \| 8\|4.77 / 3.64 \|6.69 / 2.39 \|7.39 / 2.42 \|6.56 / 2.46 \| \| 9\|9.5 / 3.07 \|7.18 / 2.42 \|6.67 / 2.39 \|7.38 / 2.46 \| TODO: - Enable offloading to a NVMe storage -> NVMe support is non-trivial. I suggest adding the support in another PR - [DONE] Discard buffer (and recreate it) instead of offloading. We don't need to restore the contiguous buffer for reduce. - [DONE] Check pin_memory improves performance or not --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>	2024-09-27 05:37:32 +00:00
Liangliang Ma	d45cfd3455	[XPU] Support DeepNVMe new code structure (#6532 ) In DeepNVMe GDS update, many functions are changed into a more abstract way. Also added some files. These change break zero-infinity on XPU. To bring this feature back, we have this PR: 1. modify the aio opbuilder for new files. 2. Add custom cpu_op_desc_t for xpu users. (XPU don't handle buffer aligned here) --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-09-26 20:39:59 +00:00
Nir Sonnenschein	ba58682a13	fix errors when setting zero3 leaf modules with torch.compile (#6564 ) When setting zero3 leaf modules to a higher level module and running with torch.compile, there are a few errors from ZeROOrderedDict. First it doesn't support Deep copy for not having a constructor with no parameters. Second, it doesn't check the existence of ds_status attr on param before accessing the attr. change contributed by Haifeng Chen Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-09-26 14:55:12 +00:00
Masahiro Tanaka	c85c8703bc	Fix gradient accumulation for Z2+offload (#6550 ) The ZeRO 1/2 optimizer performs incorrect gradient accumulation in the path for ZeRO2 + Offloading. This issue is caused by two main reasons: 1) The micro_step_id in the ZeRO 1/2 optimizer is: - Initialized to 0 in the constructor. - Reset to -1 during the backward pass. For example, given a gradient accumulation step of 4, the micro_step_id changes as follows: - For the first global step: 1, 2, 3, 4. - Subsequently: 0, 1, 2, 3. 2) Gradients are copied to the buffer on the first micro step and accumulated in the buffer during the following micro steps. However, the current code incorrectly copies gradients at steps that are not at the accumulation boundary. This PR aligns the micro_step_id initialization in both the constructor and the backward pass, and corrects the condition for copying and accumulating gradients. Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-09-26 13:11:24 +00:00
andyG	0fbe96a502	[Accelerator] Cambricon MLU support (#6472 ) ### Description This PR includes Cambricon MLU accelerator support. With this PR, DeepSpeed supports MLU as backend for training and inference tasks. --------- Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-09-26 13:10:52 +00:00
Olatunji Ruwase	a5400974df	DeepNVMe perf tuning (#6560 ) Add performance tuning utilities: `ds_nvme_tune` and `ds_io`. Update tutorial with tuning section. --------- Co-authored-by: Ubuntu <jomayeri@microsoft.com> Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>	2024-09-26 13:07:19 +00:00
Masahiro Tanaka	7622cd9e68	Use msgpack for p2p comm (#6547 ) Use msgpack for P2P communication in pipeline engine. Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-09-26 00:34:38 +00:00
Logan Adams	61de017176	Skip failing newly added tests in accelerate (#6574 ) Adding the new tests in https://github.com/huggingface/accelerate/pull/3097 caused the nv-accelerate-v100 tests to fail. Due to other CI issues we didn't notice this at first. This just skips the problematic test for now. cc: @stas00 / @muellerzr	2024-09-25 16:18:44 -07:00
ShifaAbu	2a56f53395	Added Intel Gaudi to Accelerator Setup Guide (#6543 ) Added Intel Gaudi to the list of accelerators in the setup guide. Co-authored-by: sakell <sakell@habana.ai> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>	2024-09-16 15:24:45 -07:00
Logan Adams	170b46e8b1	Add conditional on torch version for scaled_dot_product_attention (#6517 ) Changes from #4724 broke support for torch<2.0 in the flops profiler as the scaled_dot_product_attention [wasn't added](https://pytorch.org/docs/2.0/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention) until a beta version in torch 2.0 Resolved: #5534 Todo: - [ ] Test this - [ ] Issue resolution with users.	2024-09-11 23:21:43 +00:00


2528 commits (all branches)