* Add the llama2 support from the official llama repo
* add back commented function
* add new policy & implementation for llama2
* add some changes to inject/run the 70b llama model
* remove debugging code
* remove more debugging code
* formatting
* use num_kv only when it has positive value
* use the num_kv param only if it is positive
* fix syntax and format errors.
* fix an issue with the float32 transform kernel
---------
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
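As a hedged illustration of the num_kv guard mentioned in the commits above (attribute and function names here are hypothetical, not the actual DeepSpeed config fields):
```python
# Illustrative only: treat a non-positive num_kv as "not set" and fall back
# to one KV head per attention head (i.e. plain multi-head attention).
def resolve_kv_heads(num_attention_heads: int, num_kv: int = -1) -> int:
    return num_kv if num_kv > 0 else num_attention_heads
```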
* bugfix in launch_dequantize()
Get rid of `hid_cnt` and simply set #blocks to output size / #groups
* add a unit test for dequantization
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
* Guard against APIs not available on AMD in reduction_utils, code cleanup
* More API alignment simplification
* Int conversion fix
* Syntax
---------
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
* Fix deadlock when allreduce spins too fast
* Change state to enum to increase readability
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* fused adam can build
* use cpu adam to implement fused adam
* enable zero stage 1 and 2 for synchronized accelerator (a.k.a. CPU)
* remove unused parameters
* fix format error
* Remove adam class
* fix format
* support stage3
* reuse simd.h
* fix format
* make memory_stat return meaningful dict
* fix format
* add cpu_adam
* reuse cpu_adam
* header cleanup
* fix cpu_adam
* fix format, add missing file
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node)
* add fp32 support for SHM allreduce
* avoid assertion for FP16 data type
* fix format
* change 'allreduce_low_latency' to 'inference_allreduce'
* Fix according to comments
* change inference_allreduce to inference_all_reduce to keep naming consistency
* check whether LOCAL_SIZE is defined in ccl.cpp, also define LOCAL_SIZE in test_distributed
* fix format
* Fix format error
* Update tests/unit/comm/test_dist.py
Fix world_size to 4 in UT
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
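A minimal usage sketch for the renamed `inference_all_reduce` API from the commits above, assuming a distributed launch; keyword arguments beyond the tensor are omitted because they may vary across versions:
```python
# Sketch only: run under a DeepSpeed/torch distributed launcher.
import torch
import deepspeed
import deepspeed.comm as dist

deepspeed.init_distributed()              # may select the CCL backend on a CPU-only setup
t = torch.ones(4096, dtype=torch.float32)
dist.inference_all_reduce(t)              # low-latency (SHM) all-reduce on a single node
```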
The `type()` function is deprecated and `is_cuda()` has been available for a very long time.
Switching avoids many warnings when compiling extensions.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
* include cublas error details when getting cublas handle fails
* run clang-format
* just use raw enum value to avoid depending on minimum cuda version
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
* add fallback path for kernels used in megatron
* temporary numactl WA for SPR 56core
* adapt core allocation according to number of ranks
* add switch to turn on numactl
* detect number of cores on the system
* allow selecting a subset of the cores on the system to bind
* remove unneeded changes
* add ccl backend
* change nccl to ccl
* remove unused code
* add comm/ccl to ops
* initial ccl comm support
* first broadcast case passed
* add CCL_Backend to DeepSpeed
* support comm timer for CPU
* support barrier for comm backend
* support specifying master address from deepspeed command line
* support pytorch 2.0
* remove 'block' from api
* Tweak for debug
Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
* Remove unnecessary directory
Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
* Add bf16 kernel support for inference
* Add temporary torch implementation for cpu inference
* Add softmax ops cpu fallback for inference
* bind cores to numa domain as well
* merge latest change in gma/numactl
* initial bf16 kernel support with fallback path
* initial fallback path for bloom kernel injection
* fix softmax attn mask
* check KMP_AFFINITY to avoid conflict with numactl
* New CCLBackend which utilizes TorchBackend for initialization
* roll back last change because it causes a result error
* fix issue where bloom injection policy TP could not work.
injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")}
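A hedged example of how such an injection policy is typically passed to `deepspeed.init_inference` for AutoTP; the model id and tensor-parallel degree below are illustrative:
```python
# Sketch: run under a 2-process tensor-parallel launch.
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.models.bloom.modeling_bloom import BloomBlock

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")  # example model
model = deepspeed.init_inference(
    model,
    mp_size=2,  # tensor-parallel degree, illustrative
    dtype=model.dtype,
    injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")},
)
```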
* Use TorchBackend to initialize CCLBackend, make behavior consistent
* remove comm under deepspeed/ops
* add license header
* code clean up
* fix format issue
* remove magic number in main address
* add caching support but do not turn it on by default
* change name of inference_cuda_module to inference_module
* Check for is_synchronized_device in accelerator before getting Event
* fix typo
* Fix fallback path of softmax kernel on CUDA device for BF16 data type: because CUDA tril does not support the BF16 datatype, enforce fp32 data type
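A minimal sketch of the fp32 workaround described in the commit above, assuming an attention-score tensor masked with `torch.tril` (shapes and the helper name are illustrative):
```python
import torch

def causal_mask_scores(scores: torch.Tensor) -> torch.Tensor:
    # Some CUDA/torch versions do not support torch.tril on bfloat16,
    # so build and apply the causal mask in fp32, then cast back.
    orig_dtype = scores.dtype
    seq = scores.shape[-1]
    mask = torch.tril(torch.ones(seq, seq, dtype=torch.float32, device=scores.device))
    scores = scores.float().masked_fill(mask == 0, float("-inf"))
    return scores.to(orig_dtype)
```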
* add cpu backend files
* change CPU_Accelerator op_builder_dir
* remove cpu_kernel_path
* using CPU_Accelerator on non-cuda device
* fix deepspeed.op_builder => deepspeed.ops.op_builder
* add alias for num_gpus: num_accelerators
* allow loading cpu_builder in build stage
* Assume cuda is available if torch is not installed
* add oneccl_binding_pt to requirements
* move oneccl-binding-pt to separate requirements-cpu.txt
* add missing file
* use dependency_links in setuptools.setup() call for additional dependency links
* install oneccl_bind_pt in workflows
* change oneccl_bind_pt's version from 1.13 to 2.0
* use intel_extension_for_pytorch as indicator that CPU_Accelerator should be used
* Add indicator for Accelerator used
* change foo.c to foo.cpp
* exclude 'cpu' directory in CUDA op builder reflection
* add a cpu-inference workflow
* run cpu-inference workflow on self-hosted instance
* change cpu runs-on node to v100 node
* print out python version in workflow
* add verbose flag to pip command to understand oneccl_bind_pt install issue
* update cpu-inference workflow
* add a stage to detect instance instruction sets
* add back bf16 support for CPU inference
* enable autoTP for bloom
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update workflow to detect cpu instruction sets
* temporary WA for Intel Extension for PyTorch AVX2 instruction set detection
* change cpu-inference workflow machine to ubuntu-20.04
* add sharded checkpoint loading for AutoTP path to reduce the peak memory in initialization stage
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable policy for llama
* use a special ipex build to test the avx2 detection fix
* fix format
* fix test fail issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix gptj sharded checkpoint loading problem
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* return a not-implemented builder from get_op_builder in cpu_backend
* support cpu device in tests
* use cpuinfo to extract number of CPUs
* use ~/tmp as transformer cache rather than /blob/
* Add support for mpich launcher with prefer_deepspeed_comm
* add missing modification in accelerator
* enable IMPI launcher
* remove unused file and fix formatting
* clean up ccl.cpp
* Less confusing error message when certain op builders are not implemented
* Fix license header
* Add license header
* add license headers
* add license header
* fix cuda specific code in test
* update CPU workflow
* use numactl to bind to core
* allow bind_cores_to_rank in multi-node impi runner
* fix format error
* Remove InferenceBuilder
* fix format error in numa.py
* check whether op is in installed ops in ds_report.py
* allow overriding accelerator with DS_ACCELERATOR='cuda', 'cpu' or 'xpu'
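A hedged usage sketch of the override; it assumes the variable must be set before the first accelerator query:
```python
import os
os.environ["DS_ACCELERATOR"] = "cpu"      # also accepts "cuda" or "xpu"

from deepspeed.accelerator import get_accelerator
print(get_accelerator().device_name())    # expected to report "cpu"
```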
* lazy init class_dict in CUDA_Accelerator to avoid cyclic initialization of CUDA_Accelerator
* put short path at the beginning of real_accelerator.py
* device_count returns number of NUMA nodes
* fix typo
* install numactl in cpu workflow
* Follow comments
* Better implementation of device_count() and current_device()
* remove dependency_link for Intel Extension for DeepSpeed
* check is_synchronized_device in timer only once
* remove env mapping WA in cpu_accelerator
* fix duplicate definition
* fix format error
* refine ccl backend selection
* move comments to the right place
* remove prefer_deepspeed_comm, use CCLBackend by default
* refactor fallback path
* Fix execution failure in kernel injection path
* do not refactor kernel injection fallback path in residual_add because it contains a function call with side effects
* guard residual_add fallback path with environ DS_KI_FALLBACK=True
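An illustrative sketch of the environment-variable guard (not the actual DeepSpeed residual_add implementation, whose details differ):
```python
import os

def residual_add(hidden, residual):
    if os.environ.get("DS_KI_FALLBACK", "").lower() == "true":
        # Plain-PyTorch fallback path, enabled only when DS_KI_FALLBACK=True.
        return hidden + residual
    # Otherwise the custom inference kernel would be invoked here (omitted in this sketch).
    raise NotImplementedError("kernel path not shown in this sketch")
```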
* fix format error
* add test for allreduce on CPU workflow
* fix format error
* Fall back to TorchBackend if CCLBackend kernels are not implemented
* Update Intel Extension for PyTorch installation link
* Don't specify version number of Intel Extension for PyTorch
* install oneCCL for CCLBackend
* fix link path for CPU comm kernels
* fix source oneCCL environment
* source oneCCL env before run UT
* Give more specific instructions when CCL_ROOT is not defined
---------
Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: sdp <sdp@aia-sdp-spr-108864.jf.intel.com>
Co-authored-by: Cao, Zhong Z <zhong.z.cao@intel.com>
Co-authored-by: Zhenhuan Chen <zhenhuan.chen@intel.com>
Co-authored-by: baodii <di.bao@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: jianan-gu <jianan.gu@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
* Add cg headers hipification
* Exclude including cuda_bf16.h on ROCm
* Merge
* Restrict inclusion of cuda_bf16.h with the BF16_AVAILABLE var
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
* temporary workaround until __double2half support is enabled in HIP
* workaround only for hipcc
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
* Fixes for asymmetric quantization
* additional offset to further improve accuracy
* put the 0.5 into offset rather than applying it later
* update unit test for quantization
* fix format
* attempt to fix format
---------
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
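A hedged numeric sketch of the asymmetric-quantization change above (folding the 0.5 rounding term into the offset); this is plain NumPy arithmetic, not the DeepSpeed kernel:
```python
import numpy as np

def asym_quantize(x: np.ndarray, num_bits: int = 8):
    qmin, qmax = 0, (1 << num_bits) - 1
    scale = max((x.max() - x.min()) / (qmax - qmin), 1e-8)
    # The offset absorbs both the minimum value and the +0.5 rounding term,
    # so quantization reduces to one fused multiply-add followed by floor.
    offset = 0.5 - x.min() / scale
    q = np.clip(np.floor(x / scale + offset), qmin, qmax).astype(np.uint8)
    dq = (q - (offset - 0.5)) * scale  # dequantize using the same offset
    return q, dq
```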
* Enable page-locked memory in cpu only env
* Enable page-locked memory in cpu only env
* Formatting
* Add TODOs; Release page-locked memory
* Update perf microbenchmark; Reduce unit test memory
* Reduce CI mem usage
* Reset KV-cache at the beginning of text-generation
* Add new backward kernel to handle large softmax-length
* remove unrelated changes
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
* Integrate accelerator abstraction interface into deepspeed/
* Fix error message in fp16/fused_optimizer
* fix error message in fp16/unfused_optimizer.py
* assign get_accelerator().pin_memory() result to input Tensor name
* no need to check cuda or whether nvtx is supported
* move try-except into inner most block
* call Event() and Stream() in get_accelerator() for data type
* Make Stream and Event properties of the abstract interface so they can be used as data types in deepspeed
* Apply op_builder backend api change from #2705 from @jeffra
* fix tests where Builder NAME is used
* keep original ...Builder.NAME interface instead of ...Builder().NAME interface
* fix builder closure for installation
* fix randomltd builder
* add comments to clarify create_op_builder and get_op_builder
* fix compatibility with pip install -e
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* CPU-Adam: add compile-flag to enable param-copy from CPU to GPU
* guard the CUDA-related include files and variables
* remove CUDA dependency from op_builder when building against CPU
* fixing the builder issues
* fix formatting
* return true when there is no mismatch on the cuda version
* guard for when cuda is not available & test with cpu-only environment
* Update cpu_adam and cpu_adagrad
* Format fixes
* Add configurable half precision type; Build/run in CUDA environment
* Run cpu_adam and cpu_adagrad in cpu only environment
* Mark CUDA-only unit tests
* CPU environment CI
* Format fixes
* Remove --forked
* Add --forked
* CPU only CI should pass
* Format fixes
* Format fixes
* Remove scattered pytest.skip
* Fix cpu_adam unit test
* Update .github/workflows/nv-torch-latest-cpu.yml
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
* Update .github/workflows/nv-torch-latest-cpu.yml
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
* Address PR feedback
* OpenMP linking
* Fix unit tests
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
* Abstract accelerator (step 2)
* more flexible op_builder path for both installation and runtime
* add SpatialInferenceBuilder into cuda_accelerator.py
* use reflection to make cuda_accelerator adapt to CUDA op builder changes automatically
* clean up deepspeed/__init__.py
* add comments in cuda_accelerator for no torch path
* Update deepspeed/env_report.py
Change env_report.py according to suggestion
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
* reduce the range of try...except for better code clarity
* Add porting for deepspeed/ops/random_ltd/dropping_utils.py
* move accelerator to top directory and create symlink under deepspeed
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* fix checkpoint loading when it is a dictionary
* fix some issues with saving ckpt & int8 inference
* fix quantized-inference & add generic support of checkpoint loading
* remove int8 hard-coded flag
* fix mlp return tensors
* fix several issues loading checkpoints of GPT-J, GPT-NeoX, and OPT with different TP sizes
* add more comments & description for checkpoint-loading module
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
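A hedged sketch of kernel-injection inference driven by a checkpoint descriptor, as touched by the commits above; the model id, TP degree, and checkpoint file name are illustrative:
```python
# Sketch: run under a 4-rank tensor-parallel launch.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")  # example model
model = deepspeed.init_inference(
    model,
    mp_size=4,                      # TP degree may differ from the checkpoint's
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    checkpoint="checkpoints.json",  # a dict with the same fields is also accepted
)
```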
* Initial commit Deepspeed quantization library
* Match function signatures
* Add Quantization Kernel
* adding offset comparison and precommit changes
* format fixes
* File name changes
* pt_binding_changes
* test name change
* Integer quantization, minor refactors
* Add directed test_case
* format fixes
* Move param calculation to constructor of params class
* Use local function and add elemsPerBlock
* change function to be specialized
* sub block reduce
* add new schedule
* Add new schedule test case
* fix illegal writes in sch1
* Style fixes in comments
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
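As a rough reference for what the quantization kernels above compute, a plain-PyTorch sketch of group-wise symmetric int8 quantization (group layout and naming are illustrative, not the library's API):
```python
import torch

def quantize_groupwise(x: torch.Tensor, num_groups: int, num_bits: int = 8):
    qmax = (1 << (num_bits - 1)) - 1                    # 127 for int8
    g = x.reshape(num_groups, -1)                       # one scale per group
    scale = (g.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(g / scale), -qmax - 1, qmax).to(torch.int8)
    return q.reshape(x.shape), scale.squeeze(1)
```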
* Fix build issues on Windows
* small fix to compile with new version of Microsoft C++ Build Tools
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
* Unify macro definitions and constants in a single file
* Conversion utility implementation.
* Fix reversion from formatting
* Bugfixes after testing with correct DeepSpeed
* Inline markers are available on both HIP + CUDA
* mem access for quantize kernel
* format
* format fp32
* modify quant kernel
* modify quant kernel2
* modify format
* format
* fix comments in pytest
* fix comments in pytest
* format
* rerun
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
* Fix the tensor-slicing copy for qkv parameters
* remove the random-generator from context during inference
* formatting
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* Fix the half-precision version of CPU-Adam
* remove unexpected return
* fix the increase width (fp32/fp16)
* support fp16 tests for cpu-adam
* fix the fp16 data-loading
* change unit-test for fp16 check & slight change to parameter size
* fix for numpy error
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
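A hedged sketch exercising the half-precision CPU-Adam path described above; tensor sizes and hyper-parameters are illustrative:
```python
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

param = torch.nn.Parameter(torch.randn(1024, dtype=torch.float16))
opt = DeepSpeedCPUAdam([param], lr=1e-3)

param.grad = torch.randn_like(param)  # stand-in gradient for the sketch
opt.step()
```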