Commit graph

2552 Commits

Author SHA1 Message Date
Logan Adams a1b0c35a1d
Switch what versions of python are supported (#5676)
Add support for testing compilation with python 3.11/3.12.  

Also add the dockerfiles used to build those images.

---------

Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>
2024-11-06 20:37:52 -08:00
Logan Adams 3beda32e94
Update flake8 version (#6722)
This PR is useful for updating the flake8 checks we run, but is mostly
needed so that flake8 can run on the newer versions of Python that ship
with the newer ubuntu-latest images from GitHub, which we update to in
#6717.
2024-11-06 15:17:48 -08:00
Logan Adams d2a4718946
Update yapf version (#6721)
This update is needed to support eventually running on ubuntu-24.04 from
GitHub, specifically because the Python version there is updated to 3.12,
which results in the following error: `ModuleNotFoundError: No module named
'lib2to3'`, since that package is deprecated.
2024-11-06 18:57:12 +00:00
Masahiro Tanaka 351569dd4a
Use one param coordinator for both train/inference scenarios (#6662)
The parameter coordinator in ZeRO3 throws a "backward pass is invalid
for module in evaluation mode" error when the training mode is
unexpected, as it expects all modules to be in training mode during the
backward pass. This is an unnecessarily strict restriction.
This PR relaxes the restriction by using a single parameter coordinator
(instead of separate ones for training and evaluation modes) and
resetting the prefetch state before starting a forward pass.

Use of `is_compiling` needs to be fixed after #6663 is merged.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-11-05 22:53:01 +00:00
Jagadish Krishnamoorthy 2b41d6212c
[Bug Fix] Support threads_per_head < 64 for wavefront size of 64 (#6622)
When launching the apply_rotary_pos_half kernel, only a threads_per_head of 64
was supported for a wavefront size of 64.
This change adds support for threads_per_head values smaller than 64, such as 4, 8, and 16.

Fixes the issue introduced in
https://github.com/microsoft/DeepSpeed/pull/5402

---------

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-11-04 21:51:27 +00:00
Logan Adams 6c08b7f932
Pin transformers to 4.45.2 in nv-ds-chat workflow (#6710)
This commit causes breaking changes that we need to fix; for now we will pin
the version, but we will fix it shortly.

https://github.com/huggingface/transformers/pull/33325
2024-11-04 20:51:01 +00:00
jiahao su 9068acb6fb
Update URL in README Pipeline Status for Huawei Ascend NPU (#6706) 2024-11-04 17:49:21 +00:00
Masahiro Tanaka b24dfa9d08
Explicitly set device when reusing dist env (#6696)
The rank of a process can change when the distributed environment is reused.
This PR explicitly sets the device when reusing the environment.
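A minimal sketch of the idea, assuming `local_rank` is known to the reused process (names illustrative, not the actual test code):

```python
import torch

def rebind_device(local_rank: int) -> None:
    # When a worker process is reused, its rank may differ from the first run,
    # so re-select the CUDA device explicitly instead of relying on whatever
    # device was set when the environment was first created.
    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
```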
2024-11-01 12:57:47 +00:00
Masahiro Tanaka 95ea95fcd6
Free memory in universal checkpointing tests (#6693)
Tests in universal checkpointing were not freeing the engine after use
when `reuse_dist_env` was set to `True`, leading to memory leaks.
This PR ensures the engine is freed in the tests and enables
`reuse_dist_env`.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-31 11:51:11 -07:00
Xinyu Lian ff1c54351f
fix memcpy issue on backward for zero-infinity (#6670)
This PR is similar to
[PR#5301](https://github.com/microsoft/DeepSpeed/pull/5301), which
optimizes the D2H time by using pinned memory.

Previously, the D2H memcpy was the bottleneck during the final
backward pass of each iteration for ZeRO-Infinity (offload), as shown in
Trace-1. The new version eliminates the bottleneck, as shown in
Trace-2.
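A minimal sketch of the pinned-buffer copy pattern this relies on (illustrative names; not the actual DeepSpeed code):

```python
import torch

def async_d2h_copy(src_gpu: torch.Tensor, stream: torch.cuda.Stream) -> torch.Tensor:
    # Copying into page-locked (pinned) host memory lets the device-to-host
    # transfer run asynchronously and overlap with the remaining backward
    # compute, instead of serializing on a slow pageable memcpy.
    dst_cpu = torch.empty(src_gpu.shape, dtype=src_gpu.dtype,
                          device="cpu", pin_memory=True)
    with torch.cuda.stream(stream):
        dst_cpu.copy_(src_gpu, non_blocking=True)
    return dst_cpu
```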

_Trace-1_
<img width="480" alt="image"
src="https://github.com/user-attachments/assets/891e3770-351b-4e03-8a59-b491bc44d03b">

_Trace-2_
<img width="192" alt="image"
src="https://github.com/user-attachments/assets/f1cf9037-77f8-42a6-adc8-d5c6bacde0aa">

cc @tjruwase

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-31 10:56:09 -07:00
Yejing-Lai c7f58c899f
Add attribute check to support git-base autotp (#6688)
The git-base model is an image-text model. Since adding support for the llama3.2
vision model, we set num_kv_heads dynamically.
Git-base only includes vision_config, so we need to add an attribute
check for vision_config/text_config when setting num_kv_heads.
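A hedged sketch of the kind of check described (the config attribute names follow the description; the helper itself is illustrative):

```python
def resolve_num_kv_heads(config):
    # Multi-modal configs may carry a nested text_config; git-base carries only
    # a vision_config, so fall back gracefully instead of assuming text_config.
    if hasattr(config, "text_config"):
        return getattr(config.text_config, "num_key_value_heads", None)
    if hasattr(config, "vision_config"):
        return None  # no text decoder KV heads to configure
    return getattr(config, "num_key_value_heads", None)
```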

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-31 00:48:52 +00:00
Logan Adams 9b547313c6
Update checkout action to latest version (#5021)
The latest checkout action uses the latest (non-deprecated) version of Node (16 -> 20).

More information
[here](https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/):
```
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v3. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
```

Checkout action: https://github.com/actions/checkout

Node 20 requires a minimum of Ubuntu 20.04, so workflows currently using
18.04 are failing/will fail.
2024-10-30 17:36:53 +00:00
xuanhua e4a247ed13
Fix training of pipeline based peft's lora model (#5477)
Hi, guys

I found an assertion failure when training Hugging Face's LoRA-based
model in pipeline style.

Here are the steps I used to create my model:
1) Load the pre-trained chatglm-6b model from huggingface, as Model_A
2) Use huggingface's peft `get_peft_model(...)` and my
`LoraConfig(...)` on Model_A to create the LoRA model, as Model_B
3) Create my own pipeline-based model, Model_C, from Model_B

I ran Model_C on two 3090 Ti GPUs, and the assertion failure looks
like this:
```text
Traceback (most recent call last):
  File "/home/ubuntu/proj/chatglm-finetuning/train_pipeline.py", line 372, in <module>
    main()
  File "/home/ubuntu/proj/chatglm-finetuning/train_pipeline.py", line 351, in main
    loss = engine.train_batch(data_iter=train_dataloader)
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 375, in train_batch
    self._exec_schedule(sched)
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 1375, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 276, in _exec_reduce_tied_grads
    dist.all_reduce(grad, group=group)
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 496, in all_reduce
    return cdb.all_reduce(tensor, op, group, async_op)
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 159, in all_reduce
    return torch.distributed.all_reduce(tensor=tensor, op=op, group=group, async_op=async_op)
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1520, in all_reduce
    _check_single_tensor(tensor, "tensor")
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 463, in _check_single_tensor
    raise RuntimeError(
RuntimeError: Invalid function argument. Expected parameter `tensor` to be of type torch.Tensor.
```

After some debugging, I found that the root cause is that my LoRA
configuration (below) only adds extra LoRA layers to the QKV-related layers
but not to the embedding layer, so all of the embedding layer's parameters
are frozen.
```python
lora_config = LoraConfig(r=8, # copied from finetuning_lora.py
                        lora_alpha=32,
                        target_modules=["query_key_value"],
                        lora_dropout=0.1,
                        bias="none",
                        task_type="CAUSAL_LM",
                        inference_mode=False,
                        )   
```
In my implementation of the pipeline-based model, I declared the
embedding layer as a tied layer. So the situation is that there are no
gradients at all for the embedding layer, yet the embedding layer, as a tied
layer, needs to be synced between the two GPUs. The gradient value is None
but is still passed to the `all_reduce` operation.

Currently, my fix is simple: add a check for whether this `grad` is None, roughly as sketched below.
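A hypothetical sketch of that check (function and variable names are illustrative, not the actual pipeline engine code):

```python
import torch.distributed as dist

def reduce_tied_grads(tied_params, group):
    # Frozen tied weights (e.g. an embedding that LoRA leaves untrained) have
    # grad == None, so there is nothing to all-reduce across pipeline stages.
    for p in tied_params:
        if p.grad is None:
            continue
        dist.all_reduce(p.grad, group=group)
```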

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-10-29 16:04:35 +00:00
Logan Adams 07cac9e021
Remove packages that no longer need to be updated in the latest container (#6682) 2024-10-28 21:12:29 -07:00
Logan Adams 0e11b081be
Update base docker image for A6000 GPU tests (#6681)
Update to a [container
(24.03)](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-03.html)
with python 3.10 as transformers dropped support for python 3.8 in their
latest release.

Note: nv-human-eval.yml was never completed and isn't used, it is just
updated for any potential future support.

Resolves: #6672
2024-10-28 16:06:02 -07:00
Raza Sikander e6357c28cd
Update gaudi2 docker version to latest release (1.18) (#6648)
Updated docker version to 1.18.0-latest

Note: for this update the firmware on the Gaudi2 node had to be updated
to use firmware version 1.18.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-28 12:58:25 -07:00
Logan Adams b3e959460b
Update Gaudi2 docker image (#6677) 2024-10-28 09:57:53 -07:00
Logan Adams 229960a5e9
Add support for H100/sm_90 arch compilation (#6669)
Resolves: #6549
2024-10-28 03:39:51 +00:00
Logan Adams 54903e09eb
Update profiler registration check (#6668)
Resolves #5432.
2024-10-25 22:14:26 +00:00
Masahiro Tanaka 24285d6c73
Add fallback for is_compiling (#6663)
Importing `torch.compiler.is_compiling` causes an error with older
versions of PyTorch.
This PR adds a fallback so that `is_compiling` uses the equivalent function
from older PyTorch versions.
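A minimal sketch of such a fallback (illustrative only; not necessarily the exact code in this PR):

```python
try:
    from torch.compiler import is_compiling
except ImportError:
    try:
        # Older PyTorch exposes an equivalent check under torch._dynamo.
        from torch._dynamo import is_compiling
    except ImportError:
        def is_compiling() -> bool:
            # Very old PyTorch: no compile support, so never "compiling".
            return False
```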

This will resolve #6656.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-25 20:47:22 +00:00
inkcherry 5fb71c0a18
sequence parallel for uneven heads (#6392)
In sequence_parallel (Ulysses), the number of attention heads must be
divisible by the sequence parallel size, which prevents some models/workloads
from setting a specific sequence parallel size. This PR implements uneven
all-to-all head splitting.

- Supports both batch-first (b, s, ...) and seq_len-first (s, b, ...) layouts.
- Added unit tests with numerical checks. Locally also tested with **7
heads with sp=4** and **20 heads with sp=8**; both passed.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-10-25 18:26:47 +00:00
Yichen Yan 3d5cf739ea
Fix dynamo issue (#6527)
Dynamo uses fake tensors to trace tensor ops. In some cases, this mechanism
breaks compiling with DeepSpeed.

An example can be found at
https://gist.github.com/oraluben/9b8240c2fe482eb4382453d6c97a5f76; to
see the issues, install deepspeed==0.14.4 instead of my fork.

Without this PR, llama cannot be compiled.

Detailed explanation:

1. `ZeROOrderedDict`:
Dynamo uses deepcopy to copy tensors, which calls
`object.__reduce__`. When copying a `ZeROOrderedDict`, the default
implementation does not copy its `_parent_module`, which leads to
failure (a simplified sketch follows below).
2. `param` may be a fake tensor and not have `ds_status` yet, but during
tracing it is fine to simply skip `register_external_parameter`; it
should have been done well before.
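A simplified, self-contained sketch of the `__reduce__` fix for item 1 (the real `ZeROOrderedDict` lives inside DeepSpeed and its fix may differ; this stand-in only illustrates why carrying `_parent_module` through deepcopy matters):

```python
import copy
from collections import OrderedDict

class ZeROOrderedDictSketch(OrderedDict):
    """Simplified stand-in for illustration; not the actual DeepSpeed class."""

    def __init__(self, parent_module=None, data=()):
        super().__init__(data)
        self._parent_module = parent_module

    def __reduce__(self):
        # The default OrderedDict.__reduce__ drops _parent_module, so the
        # deepcopy that Dynamo performs while tracing produced a broken copy.
        return (self.__class__, (self._parent_module, tuple(self.items())))

d = ZeROOrderedDictSketch(parent_module="mlp", data={"weight": 1}.items())
assert copy.deepcopy(d)._parent_module == "mlp"
```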

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-10-25 00:17:30 +00:00
Lzhang-hub 6e6563d3c8
fix init_device_mesh for torch 2.4 (#6614)
Starting with torch 2.4, in
[`init_device_mesh()`](de4c2a3b4e/torch/distributed/device_mesh.py (L915)),
a device type with a GPU index, such as "cuda:0", is not allowed.
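A minimal sketch of the constraint, assuming `torch.distributed` has already been initialized:

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# From torch 2.4, init_device_mesh accepts only a device *type*;
# an indexed device such as "cuda:0" is rejected.
world_size = dist.get_world_size()
mesh = init_device_mesh("cuda", (world_size,))       # OK
# mesh = init_device_mesh("cuda:0", (world_size,))   # raises from torch 2.4
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())  # pick the index separately
```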


![image](https://github.com/user-attachments/assets/1ddb61bf-8a15-4e0a-9115-a3681d7f19ff)

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
2024-10-23 20:29:30 +00:00
Yejing-Lai e06bb518aa
Add attribute check for language_model when replace last linear module (#6650)
Fixes the "module has no attribute 'language_model'" issue.

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-10-23 20:22:59 +00:00
wyooyw b647fb2470
Fix expert grad scaling problem with ZeRO optimizer (#6546)
Fixes #6545.

Work:
- expert gradient averaging: divide by dp_world_size instead of edp_world_size
(a one-line sketch follows below)
- unit test: make sure a model with different dp/ep sizes has the same expert
gradients
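A hedged sketch of the scaling change (the function and its names are illustrative, not the optimizer's actual code):

```python
import torch

def average_expert_grad(expert_grad: torch.Tensor, dp_world_size: int) -> torch.Tensor:
    # Average over the full data-parallel world size (previously the
    # expert-data-parallel size, edp_world_size, was used), so dense and
    # expert parameters see the same effective gradient scale.
    return expert_grad.div_(dp_world_size)
```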

---------

Co-authored-by: wangyiou <wangyiou@xiaohongshu.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-23 00:08:39 +00:00
Logan Adams bf03f48352
Update version.txt after 0.15.3 release (#6652)
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.15.3
Author           - @jomayeri

Co-authored-by: jomayeri <jomayeri@users.noreply.github.com>
2024-10-22 14:15:45 -07:00
Liangliang Ma a24cdd6b67
[XPU] [DeepNVMe] use same cpu_op_desc_t with cuda (#6645)
We found that #6592 uses `_pinned_tensor_mgr` to create the CPU bounce
buffer, which is the same as what our XPU accelerator currently does,
so there is no need for an XPU-specific cpu_op_desc_t.
In this PR:
1. remove the custom csrc/xpu/aio/deepspeed_cpu_op.cpp
2. modify the xpu async_io opbuilder.

This cannot easily be done by reverting #6532, because we added some
source files when the GDS feature last went into DeepSpeed, hence this new PR :)
2024-10-22 14:45:05 +00:00
Yizhou Wang 11bbf45af5
[XPU] host timer check version from Torch 2.5 to Torch 2.6 (#6633)
Elapsed time will be supported in Torch 2.6.

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-10-22 06:53:15 +00:00
Liangliang Ma 40bde528bc
[XPU] upgrade xpu max1100 CI workflow to pytorch2.3 (#6646)
With intel-extension-for-pytorch==2.3.110 released last month, the max1100 CI
workflow can be updated too. Software versions are aligned with #6570.

An increased CI test scope for torch/ipex 2.3 will come in a later PR.

This workflow passed on a self-hosted runner in my cloned repo.
2024-10-21 12:25:11 +00:00
Joe Mayer 6eefc3d0ea
Fix Memory Leak In AIO (#6630)
Fixes a memory leak in the AIO pinned tensor as well as an incorrect
function type for the GDS op.

---------

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-10-18 02:58:06 +00:00
Masahiro Tanaka c9fc34a4be
Use file store for tests (#6632)
This PR changes the `init_method` for tests to `FileStore` for
robustness.
2024-10-17 22:15:25 +00:00
Masahiro Tanaka a36db9cc1c
Update torch version in workflows (#6631)
Set PyTorch version in CI workflows to v2.5.

Context: The
[error](https://github.com/microsoft/DeepSpeed/actions/runs/11371525624/job/31633793986?pr=6630)
in #6630 might have been caused by a PyTorch version mismatch.
2024-10-17 17:50:55 +00:00
jiahao su c9899dc14a
Add README Pipeline Status for Huawei Ascend NPU (#6588)
Hello! Following the merge of
https://github.com/microsoft/DeepSpeed/pull/6445, I have implemented a
CI pipeline to validate the Huawei Ascend NPU.

---------

Co-authored-by: sjh <sjh1270@163.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-10-15 23:36:10 +00:00
Masahiro Tanaka 1a45bd8e8c
Lock cache file of HF model list (#6628)
The error in the following log suggests that the cache file for the HF model
list can become corrupted:

https://github.com/microsoft/DeepSpeed/actions/runs/11343665365/job/31546708118?pr=6614

The actual cause of the above error is unclear, but `_hf_model_list` can
corrupt the cache file when it is called concurrently from
multiple processes. This PR locks the cache file to ensure
`_hf_model_list` safely reads and writes the file.
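A hedged sketch of the locking idea using a `filelock`-style inter-process lock (the file path and helper name are illustrative, not DeepSpeed's actual code):

```python
from filelock import FileLock

CACHE_FILE = "hf_model_list.cache"

def read_model_list_cache():
    # Only one process at a time may read or rewrite the cache file, so
    # concurrent test workers cannot leave it half-written.
    with FileLock(CACHE_FILE + ".lock"):
        return _load_or_refresh(CACHE_FILE)  # illustrative helper
```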
2024-10-15 21:49:37 +00:00
Shelly Nahir ce468c3756
add option to disable logger while compiling to avoid graph breaks (#6496)
Adds an option to disable logger calls while compiling to avoid
graph breaks. Here I used an environment variable to determine whether
to activate this option, but it could also be controlled via the JSON
config file or any other way you see fit.
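An illustrative sketch of such an environment-variable gate (the variable and function names here are assumptions, not the ones actually used):

```python
import os
import torch

_DISABLE_WHILE_COMPILING = os.getenv("DS_DISABLE_COMPILE_LOGGING", "0") == "1"

def safe_log(logger, msg):
    # Skip logger calls while Dynamo is tracing so they do not cause graph breaks.
    if _DISABLE_WHILE_COMPILING and torch.compiler.is_compiling():
        return
    logger.info(msg)
```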

---------

Co-authored-by: snahir <snahir@habana.ai>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2024-10-15 18:30:42 +00:00
Xu Song bf60fc0ca6
Support safetensors export (#6579)
## Feature

This commit implements the following features:

- [x] support saving checkpoint as safetensors (more commonly used
format)
- [x] support sharding checkpoints (which is important for very large
models)

Most of the code is borrowed from
https://github.com/huggingface/transformers/blob/v4.45.1/src/transformers/modeling_utils.py#L2490

## Usage

For `pytorch_model.bin` export
```
python zero_to_fp32.py . output_dir/
```

For  `model.safetensors` export
```
python zero_to_fp32.py . output_dir/ --safe_serialization
```

---------

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-15 11:22:31 +00:00
Joe Mayer 85b7469ea0
Add first Step in LR Schedulers (#6597)
Some (not all) of the LR schedulers in the runtime were missing the
initialization of the optimizer group's lr.
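A hedged sketch of the kind of initialization that was missing (class and field names illustrative):

```python
class LinearWarmupSketch:
    def __init__(self, optimizer, warmup_start_lr: float):
        self.optimizer = optimizer
        self.warmup_start_lr = warmup_start_lr
        # The missing "first step": without this, param groups keep whatever
        # lr the optimizer was constructed with until the first scheduler step.
        for group in self.optimizer.param_groups:
            group["lr"] = warmup_start_lr
```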

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-14 19:31:45 +00:00
diskkid 13c16c9562
Accept btl_tcp_if_include option through launcher_args (#6613)
This patch fixes issue #4460.
When `btl_tcp_if_include` option is provided through `--launcher_args`,
we use the provided option instead of the hardcoded `--mca
btl_tcp_if_include eth0`. Otherwise we use `--mca btl_tcp_if_include
eth0` as the default for compatibility.
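A hedged sketch of the selection logic (the function name is illustrative, not the launcher's actual code):

```python
def build_mpi_launcher_args(launcher_args: str) -> list:
    # If the user already passed a btl_tcp_if_include setting via --launcher_args,
    # respect it; otherwise fall back to the old hardcoded eth0 default.
    args = launcher_args.split()
    if "btl_tcp_if_include" not in launcher_args:
        args += ["--mca", "btl_tcp_if_include", "eth0"]
    return args
```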

Fixes #4460

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-14 19:26:24 +00:00
Olatunji Ruwase 65ab64481f
Add API for updating ZeRO gradients (#6590) 2024-10-14 17:35:41 +00:00
Ma, Guokai cf41e8c4e8
[compile] Show breakdown of graph break (#6601)
This PR extends https://github.com/microsoft/DeepSpeed/pull/6570 by
showing a breakdown of graph breaks. So we can see how graph breaks are
distributed among different reasons. An example of graph break output
can be seen from the following workflow run
https://github.com/microsoft/DeepSpeed/actions/runs/11199157962
2024-10-14 17:31:34 +00:00
Masahiro Tanaka 7a5bc4fdf9
Ignore reuse_dist_env (#6623)
Tests with `reuse_dist_env = True` often cause memory leaks. This PR
ignores `reuse_dist_env` and forcibly sets it to `False`. This change
might slow down the tests, but I think it is better than having to manually
restart runners and relaunch tests.

Memory usages (See #6578):
- `reuse_dist_env == True`:
https://github.com/microsoft/DeepSpeed/actions/runs/11302940871/job/31439471512
- `reuse_dist_env == False`:
https://github.com/microsoft/DeepSpeed/actions/runs/11303250613/job/31440137894
2024-10-14 16:08:44 +00:00
Masahiro Tanaka 5c4b97f109 apply fp16 autocast only to floating point values 2024-10-11 19:41:10 +00:00
Masahiro Tanaka adec99121b
Add API to get devices of offload states (#6586)
This PR adds an API, `deepspeed.runtime.zero.offload_states.get_state_devices`,
which gets the devices of offload states, as suggested in
this
[comment](https://github.com/microsoft/DeepSpeed/pull/6011#issuecomment-2358068777).

We could lift this up to `deepspeed.utils` but would need to resolve a
circular import: User code -> `deepspeed.utils` ->
`deepspeed.utils.offload_states` -> `deepspeed.runtime.zero` ->
`deepspeed.runtime.zero.partition_parameters` -> `deepspeed.utils`

This will require a significant refactoring as long as we have
`OffloadStateTypeEnum` in `deepspeed.runtime.zero`.
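A hedged usage sketch: the function path follows the PR description, while the enum's import path, the enum member name, and the `model` engine are assumptions:

```python
from deepspeed.runtime.zero.offload_states import get_state_devices
from deepspeed.runtime.zero.offload_config import OffloadStateTypeEnum  # assumed path

# `model` is assumed to be a ZeRO-3 engine whose states may have been offloaded.
devices = get_state_devices(model, OffloadStateTypeEnum.optim_states)  # assumed member
print(devices)  # e.g. a set containing device(type='cpu') after offloading optimizer states
```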

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-10 02:59:26 +00:00
Nir Sonnenschein d7ca3d8373
reduce setting global variables to reduce torch compile graph breaks (#6541)
Setting global variables during training creates graph breaks when
using torch.compile (reading global variables doesn't). This commit
attempts to reduce the setting of global variables in the checkpointing
flows.
There are 2 main uses of setting global variables:
1. Sharing data between functions
2. Establishing that this is the first call to the code

For most of the cases, the data in the global variables can
be computed on demand or set once to an initial state in a configure
function.
For the "check that this is the first run" use case, the code was moved to
the configure function (an illustrative sketch follows below).
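An illustrative before/after of that pattern (not the actual checkpointing code):

```python
import torch

_buffer = None  # module-level state

def configure(example: torch.Tensor) -> None:
    # Set the global once, outside any torch.compile'd region.
    global _buffer
    _buffer = torch.zeros_like(example)

def checkpoint_step(x: torch.Tensor) -> torch.Tensor:
    # Only *reads* the global; assigning to it here (the old lazy
    # "first call" initialization) would force a graph break.
    return x + _buffer
```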

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-10 00:47:44 +00:00
Joe Mayer a1f98bdc70
AIO CPU Locked Tensor (#6592)
Restores the functionality of the CPU locked tensor in the AIO library
and makes the async_io operator available for the CPU accelerator, i.e., a
CPU-only environment.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-09 21:07:31 +00:00
Masahiro Tanaka 7d751ee890
Clean up prefetched parameters (#6557)
Parameters prefetched by ZeRO3 are sometimes not used. This occurs when
the actual sub-module execution differs from previous tracing. As a
result, the state of the allgather handle for such a parameter remains
`INFLIGHT`, causing functions like `empty_partition_cache` to detect it
and throw an error.
This PR resolves the issue by ensuring that communication finishes and
the parameters are freed.

As this issue was mentioned in #6011, this PR includes the changes from that
branch. We need to merge #6011 first.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-09 15:23:33 +00:00
Logan Adams 55f7f3789e
Update version.txt after 0.15.2 release (#6615)
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.15.2
Author           - @jomayeri

Co-authored-by: jomayeri <jomayeri@users.noreply.github.com>
2024-10-09 10:48:39 -07:00
gyou2021 474a3288cd
Enabled Qwen2-MoE Tensor Parallelism (TP) inference (#6551)
Modified _replace_module in auto_tp.py:
The modification keeps the 'shared_expert_gate' and 'gate' layers in
qwen2-moe as the original torch.nn.Linear type instead of changing them into
LinearLayer. This way, their weights will not be split across multiple
HPU/GPU cards, and qwen2-moe can run on multiple HPU/GPU cards.
Since the weights of 'gate' are not split across multiple HPU/GPU cards,
all-gather operations are not needed, which may improve performance.
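A hedged sketch of the keep-as-is check during module replacement (the helper and constant names are illustrative):

```python
import torch.nn as nn

KEEP_UNSHARDED = ("gate", "shared_expert_gate")

def maybe_replace(name: str, module: nn.Module, replace_fn):
    # Leave the MoE routing gates as plain nn.Linear so their (small) weights
    # are replicated on every card instead of being split, which also avoids
    # gathering the router outputs across cards.
    if isinstance(module, nn.Linear) and name.split(".")[-1] in KEEP_UNSHARDED:
        return module
    return replace_fn(module)
```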

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-10-09 15:23:16 +00:00
Logan Adams 1062a0c658
Unpin accelerate tests, update lightning with node16 removal. (#6611)
HF accelerate fixes implemented in
https://github.com/huggingface/accelerate/pull/3145 mean that we no
longer need to pin the Accelerate version!

nv-lightning tests now run on Ubuntu 20.04+, so we support Node > 16 and
can remove the explicit permissions for that in the env config.
2024-10-09 08:22:41 -07:00
Omar Elayan 645639bcf8
Rearrange inference OPS and stop using builder.load (#5490)
This PR mainly handles all places where InferenceBuilder is used to
access any op or a specific implementation of an op.
Instead, an op is defined and its proper implementation is picked internally,
so the usage is transparent to the user.
What was done in the PR:
1) Added missing ops (added a py file with a fallback mechanism)
2) Added missing fallback implementations for existing ops
3) Removed all usages of builder.load and replaced them with ops
instead.
4) Added a workspace op and InferenceContext, which contains all
workspace-related functions; InferenceContext is the Python fallback of
InferenceContext in CUDA.
5) A small change to the softmax_context signature to fit the fallback
signature.

---------

Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-10-09 01:22:28 +00:00