Mirror of https://github.com/microsoft/DeepSpeed.git
2b41d6212c
When launching the apply_rotary_pos_half kernel, only a threads_per_head of 64 is supported for a wavefront size of 64. This change adds support for threads_per_head values below 64, such as 4, 8, and 16. Fixes the issue introduced in https://github.com/microsoft/DeepSpeed/pull/5402 --------- Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Logan Adams <loadams@microsoft.com> |
||
---|---|---|
.. | ||
inference | ||
cublas_wrappers.cu | ||
dropout_kernels.cu | ||
ds_transformer_cuda.cpp | ||
gelu_kernels.cu | ||
general_kernels.cu | ||
normalize_kernels.cu | ||
softmax_kernels.cu | ||
transform_kernels.cu |