Successor PR to #6094:
> FutureWarning: You are using torch.load with weights_only=False (the
current default value), which uses the default pickle module implicitly.
It is possible to construct malicious pickle data which will execute
arbitrary code during unpickling (See
https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models
for more details). In a future release, the default value for
weights_only will be flipped to True. This limits the functions that
could be executed during unpickling. Arbitrary objects will no longer be
allowed to be loaded via this mode unless they are explicitly
allowlisted by the user via torch.serialization.add_safe_globals. We
recommend you start setting weights_only=True for any use case where you
don't have full control of the loaded file. Please open an issue on
GitHub for any issues related to this experimental feature.
Todo:
- [ ] Update values in non-test files to True where necessary.
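A minimal sketch of the pattern the warning recommends, assuming a recent PyTorch release; `checkpoint.pt` and `MyConfig` are placeholders:
```python
import torch

# Plain tensors / state dicts load fine with weights_only=True.
state = torch.load("checkpoint.pt", map_location="cpu", weights_only=True)

# If a checkpoint pickles custom classes, allowlist them explicitly
# (torch.serialization.add_safe_globals exists in recent PyTorch releases):
# from torch.serialization import add_safe_globals
# from my_package import MyConfig   # hypothetical custom class
# add_safe_globals([MyConfig])
```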
This update is needed to eventually support running on GitHub's
ubuntu-24.04 runners, specifically because the Python version is updated
to 3.12, which results in the following error: `ModuleNotFoundError: No
module named 'lib2to3'`, since that package is deprecated.
The git-base model is an image-text model. Since adding support for the
llama3.2 vision model, we set num_kv_heads dynamically.
Git-base only includes vision_config, so we need to add an attribute
check for vision_config/text_config when setting num_kv_heads.
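A minimal sketch of the attribute check, assuming the Hugging Face config layout; `resolve_num_kv_heads` and the fallback order are illustrative, not the exact helper in auto_tp:
```python
def resolve_num_kv_heads(config):
    # Multi-modal configs (e.g. llama3.2 vision) nest the attention settings
    # in text_config; git-base only exposes vision_config, so fall back to
    # the top-level config when text_config is absent.
    if hasattr(config, "text_config"):
        config = config.text_config
    for name in ("num_key_value_heads", "num_kv_heads", "n_head_kv"):
        if hasattr(config, name):
            return getattr(config, name)
    return None  # leave sharding to the default path when undefined
```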
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
In sequence_parallel (Ulysses), the sequence parallel size is
constrained by the requirement that the number of heads be divisible by
it, which prevents some models/workloads from setting a specific sequence
parallel size. This PR implements uneven all-to-all head splitting (a
sketch of the partitioning follows the list below).
- Supports both batch-first (b, s, ...) and seq_len-first (s, b, ...) layouts.
- Added unit tests with numerical checks. Locally also tested with **7
heads with sp=4** and **20 heads with sp=8**, and both passed.
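A minimal sketch of the uneven head partitioning, assuming the per-rank counts are then fed to the all-to-all as unequal split sizes; the function name is illustrative:
```python
def heads_per_rank(num_heads: int, sp_size: int) -> list[int]:
    # e.g. 7 heads with sp=4 -> [2, 2, 2, 1];
    #      20 heads with sp=8 -> [3, 3, 3, 3, 2, 2, 2, 2]
    base, rem = divmod(num_heads, sp_size)
    return [base + (1 if r < rem else 0) for r in range(sp_size)]
```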
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Modified `_replace_module` in auto_tp.py:
The modification keeps the 'shared_expert_gate' and 'gate' layers in
qwen2-moe as their original torch.nn.Linear type instead of changing
them into LinearLayer. This way their weights are not split across
multiple HPU/GPU cards, and qwen2-moe can run on multiple HPU/GPU cards.
Since the weights of 'gate' are not split across cards, all-gather
operations are not needed, which may improve performance.
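A minimal sketch of the idea, assuming a simplified replacement loop; the names and signature are illustrative rather than the exact `_replace_module` code:
```python
KEEP_AS_LINEAR = ("shared_expert_gate", "gate")

def maybe_replace(child_name, child_module, to_tp_linear):
    # Keep the routing gates as plain nn.Linear so their weights stay
    # replicated on every card; everything else becomes the sharded LinearLayer.
    if any(child_name.endswith(k) for k in KEEP_AS_LINEAR):
        return child_module
    return to_tp_linear(child_module)
```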
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Llama3.2-11b and llama3.2-90b include both a vision model and a text
model. These two sub-models have different num_kv_heads, so we need to
set num_kv_heads dynamically.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR aims to enable chatglm2 & chatglm3 AutoTP. Similar to phi3,
these models use a chunked MLP layer, so we adjust the weight order via
the 'shard_mlp_chunk' func (a sketch of the idea is below). Please
kindly review~ Thanks!
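A minimal sketch of the chunked-MLP reordering, assuming the fused projection stacks [gate; up] along dim 0; this illustrates what a 'shard_mlp_chunk'-style helper does, not its exact signature:
```python
import torch

def reorder_chunked_mlp(weight: torch.Tensor, tp_size: int) -> torch.Tensor:
    # Interleave per-rank chunks of the gate and up halves so that plain
    # row-sharding later gives every rank its own slice of both halves.
    gate, up = weight.chunk(2, dim=0)
    pairs = zip(gate.chunk(tp_size, dim=0), up.chunk(tp_size, dim=0))
    return torch.cat([torch.cat((g, u), dim=0) for g, u in pairs], dim=0)
```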
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
When the model is quantized, the hidden sizes cannot be determined from
`ds_shape` and `shape`, because they are 1 dimensional. This PR fixes
the bug by determining hidden sizes from `in_features` and
`out_features`.
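A minimal sketch of the fallback, assuming the usual torch.nn.Linear attribute names; the helper is illustrative:
```python
def get_hidden_sizes(weight, module):
    # Quantized weights are stored flattened to 1-D, so ds_shape/shape no
    # longer carry both dimensions; fall back to the layer metadata instead.
    if weight.dim() > 1:
        return weight.shape[1], weight.shape[0]      # (in_features, out_features)
    return module.in_features, module.out_features
```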
This PR fixes #5398
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR aims to enable Yuan model AutoTP and add conv TP.
The Yuan model uses shared QK.
For example:
q_linear_out = [q1, q2, q3, q4, q5, ... , q16]
k_linear_out = [k1, k2, k3, k4, k5, ... , k16]
after share qk:
TP=1:
q' = [q1,q2,q3,q4, q9,q10,q11,q12, k1,k2,k3,k4, k9,k10,k11,k12]
k' = [q5,q6,q7,q8, q13,q14,q15,q16, k5,k6,k7,k8, k13,k14,k15,k16]
v' = [v1,v2,v3,v4, v5,v6,v7,v8, v9,v10,v11,v12, v13,v14,v15,v16]
TP=2:
rank0:
q'_0 = [q1,q2,q3,q4, k1,k2,k3,k4]
k'_0 = [q5,q6,q7,q8, k5,k6,k7,k8]
v'_0 = [v1,v2,v3,v4, v5,v6,v7,v8] -> v'_0 is wrong! The expected value is:
[v1,v2,v3,v4, v9,v10,v11,v12]
rank1:
q'_1 = [q9,q10,q11,q12, k9,k10,k11,k12]
k'_1 = [q13,q14,q15,q16, k13,k14,k15,k16]
v'_1 = [v9,v10,v11,v12, v13,v14,v15,v16] -> v'_1 is wrong! The expected
value is: [v5,v6,v7,v8, v13,v14,v15,v16]
To avoid modifying the modeling code, we adjust the value and o_proj
weights to fit this QK layout.
We also added conv TP to support models that include heavy conv
computation. It is similar to the linear TP policy (a sketch follows the
steps below).
if not last_conv_layer:
- 1. Divide the conv weight across ranks along the output channel
dimension.
- 2. Apply conv2d.
else:
- 1. Divide the conv weight across ranks along the input channel
dimension.
- 2. Apply conv2d.
- 3. Use all-reduce to sum the outputs.
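A minimal sketch of this conv TP policy, assuming the torch.nn.Conv2d weight layout (out_channels, in_channels, kH, kW) and that the input to the last conv layer already holds this rank's input-channel slice; the helper names are illustrative:
```python
import torch.nn.functional as F
import torch.distributed as dist

def shard_conv_weight(weight, rank, tp_size, last_conv_layer):
    if not last_conv_layer:
        # split along output channels: each rank produces a slice of the channels
        return weight.chunk(tp_size, dim=0)[rank]
    # last conv layer: split along input channels; partial outputs are summed below
    return weight.chunk(tp_size, dim=1)[rank]

def tp_conv2d(x, weight_shard, last_conv_layer, group=None):
    out = F.conv2d(x, weight_shard)
    if last_conv_layer:
        dist.all_reduce(out, group=group)  # sum partial results across ranks
    return out
```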
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR aims to enable phi3-mini AutoTP.
Phi3-mini uses a chunked MLP; we adjust the linear layer weight order to
support this model.
Please kindly review~ Thanks!
---------
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
The Bloom flow in Hybrid Engine applies the same transformation of the
input mask that is already performed earlier by the transformers
BloomModel::forward.
This results in non-convergence of scores, specifically in DeepSpeed-Chat
on different accelerators, including CUDA and HPU.
The fix removes the redundant mask transformation and application,
producing correct convergence.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
Fused-QKV models cannot correctly choose the fused_qkv type; the
module_name_matches need to be updated.
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
This PR adds backwards compatibility for older versions of `diffusers`
(`<0.25.0`) by updating the `vae` container import logic to account for
changes between the various versions.
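A minimal sketch of version-gated import logic of this kind, assuming `AutoencoderKL` is the class the `vae` container needs; the exact module paths moved between diffusers releases, so treat them as assumptions:
```python
from packaging import version
import diffusers

if version.parse(diffusers.__version__) < version.parse("0.25.0"):
    # older layout (pre-0.25)
    from diffusers.models.autoencoder_kl import AutoencoderKL
else:
    # newer layout (0.25.0 and later)
    from diffusers.models.autoencoders.autoencoder_kl import AutoencoderKL
```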
The selection of the fused type depends on the order of fused_type_dict.
If "DecoderLayer" is put in front of "FalconDecoderLayer", Falcon will
still incorrectly choose glmtype, so "DecoderLayer" needs to be put at
the last position of fused_type_dict.
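A minimal sketch of why the order matters, assuming a substring-based lookup; the entries shown are illustrative, not the full dictionary:
```python
fused_type_dict = {
    "BloomBlock": "bloomtype",
    "GLMBlock": "glmtype",
    "FalconDecoderLayer": "bloomtype",
    "DecoderLayer": "glmtype",   # generic fallback, must stay last
}

def get_fused_type(module_name: str):
    # The first key contained in the module name wins, so the generic
    # "DecoderLayer" entry would shadow "FalconDecoderLayer" if it came first.
    for key, fused_type in fused_type_dict.items():
        if key.lower() in module_name.lower():
            return fused_type
    return None
```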
---------
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Fix 'NotImplementedError: Cannot copy out of meta tensor; no data!'
when loading T5 and Mistral from the meta device.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR aims to balance the shard size of each worker as evenly as
possible.
1. We refactor the tp_shard logic so that AutoTP works when
split_shape % num_kv_heads != 0.
2. When num_kv_heads is defined, the attention module relies on it for
sharding, but the mlp and lm_head modules can use near-even division to
get more balanced shards, which gives better performance (see the sketch
below).
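A minimal sketch of the two sharding paths, assuming a simplified signature; the real get_shard_size in module_inject/tp_shard.py differs in its interface:
```python
def get_shard_size(total_size, tp_size, rank, num_kv_heads=None):
    if num_kv_heads:
        # attention path: keep whole kv heads together on each rank
        heads = num_kv_heads // tp_size + (1 if rank < num_kv_heads % tp_size else 0)
        return heads * (total_size // num_kv_heads)
    # mlp / lm_head path: near-even split of the raw dimension
    base, rem = divmod(total_size, tp_size)
    return base + (1 if rank < rem else 0)
```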
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
Enabled autoTP for the Qwen model, added some module matching, and
adjusted TP-related variables. Verification was conducted on Qwen-1_8B
and Qwen-72B-chat.
---------
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This [PR](https://github.com/microsoft/DeepSpeed/pull/4721) added the
"DecoderLayer": glmtype entry. It causes the Falcon model to choose the
"glmtype" fused_qkv_type. The Falcon model (including FalconDecoderLayer)
needs to choose 'bloomtype' explicitly.
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Falcon-40b fails with uneven AutoTP; 'num_kv_heads' needs to be added to
the kv_head_names list.
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
1. In both files, applied the same logic: when the device is meta, there
is no need to move the tensors to the device.
2. Deleted an unused member of the class.
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
refactor: improve how we decide whether a variable is None
fix: type mismatch when checking whether the current accelerator is in
SUPPORTED_ACCELERATOR_LIST
---------
Co-authored-by: ryan <ruanzhixiang1@huawei.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* allow number of heads not divisible by number of ranks
* get num_heads from model config, more robust
* simplify logic where num_head itself is sharded
* name tweaks
* make code more robust where num_attention_heads may not be defined in model_config
* support num_key_value_heads < num_attention_heads which is used by llama2
* add test for 5 ranks
* change odd rank # to 3 to avoid test skip
* add get_shard_size function
* modify sharding mechanism according to latest auto TP
* fix accuracy issue
* fix format
* skip tests with fusedqkv
* remove skip of fusedqkv tests
* skip test fusedqkv with odd number of ranks
* support model with n_heads in model_config
* fix TestInjectionPolicy::test[fp32-t5]
* fix uneven_heads on some fusedqkv types (#12)
* odd support fusedqkv
* fix format and clear text
* better fix when activation size cannot be divided by number of heads
* move tp_shard.py under module_inject
* Add get_num_kv_heads in tp_shard.py
* Refine according to comments
* remove old comment
* fix bug in getting num_kv_heads
* support uneven sharding of lm_head tensor parallel
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
Co-authored-by: mzl <mingzhi.liu@intel.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
* Add rope_theta for llama config
* Add rope_theta to bias_add_transform_0213
* Fix CI problems
* Add rope_theta to linear layer
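A minimal sketch of where rope_theta enters the rotary embedding, assuming the standard inverse-frequency formulation (llama's default base is 10000.0); the value is plumbed from the config into the kernels:
```python
import torch

def rope_inv_freq(head_dim: int, rope_theta: float = 10000.0) -> torch.Tensor:
    # rope_theta is the base of the rotary frequencies; changing it in the
    # llama config changes the frequencies the RoPE kernels apply.
    return 1.0 / (rope_theta ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
```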
---------
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
* Adapt to Llama when using meta tensor to load
* Fix gated mlp parameter mp
* Re-enable meta tensor for kernel injection
Fix layer params loading in meta tensor
* Revert mlp_inter_mp for gated mlp as it is fixed
* Monkey patch for fixing llama output
* Fix formatting
* Add comment
---------
Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
* Add the llama2 support from the official llama repo
* add back commented function
* add new policy & implementation for llama2
* add some changes to inject/run the 70b llama model
* remove debugging code
* remove more debugging code
* formatting
* use num_kv only when it has positive value
* use the num_kv param only if it is positive
* fix syntax and format errors.
* fix an issue with the float32 transform kernel
---------
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
* correct inference with some debug codes.
* remove prints
* update transformer import set_qkv and format
* support some lora abstract method
* fix attn_ob
* some debug
* leave orig layer set by user
* remove debugs
* move attn ob to mlp module
* move import transformer
* init orig class only once
* remove copyright
---------
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>