* Add the llama2 support from the official llama repo
* add back commented function
* add new policy & implementation for llama2
* add some changes to inject/run the 70b llama model
* remove debugging code
* remove more debugging code
* formatting
* use num_kv only when it has positive value
* use the num_kv param only if it is positive
* fix syntax and format errors.
* fix an issue with the float32 transform kernel
---------
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
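As a hedged illustration of the num_kv guard mentioned in the commits above (attribute and function names here are hypothetical, not the actual DeepSpeed config fields):
```python
# Illustrative only: treat a non-positive num_kv as "not set" and fall back
# to one KV head per attention head (i.e. plain multi-head attention).
def resolve_kv_heads(num_attention_heads: int, num_kv: int = -1) -> int:
    return num_kv if num_kv > 0 else num_attention_heads
```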
* bugfix in launch_dequantize()
Get rid of `hid_cnt` and simply set #blocks to output size / #groups
* add a unit test for dequantization
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
* Guard against APIs not available on AMD in reduction_utils, code cleanup
* More API alignment simplification
* Int conversion fix
* Syntax
---------
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
* Fix deadlock when allreduce spins too fast
* Change state to enum to increase readability
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* fused adam can build
* use cpu adam to implement fused adam
* enable zero stage 1 and 2 for synchronized accelerator (a.k.a. CPU)
* remove unused parameters
* fix format error
* Remove adam class
* fix format
* support stage3
* reuse simd.h
* fix format
* make memory_stat return meaningful dict
* fix format
* add cpu_adam
* reuse cpu_adam
* header cleanup
* fix cpu_adam
* fix format, add missing file
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node)
* add fp32 support for SHM allreduce
* avoid assertion for FP16 data type
* fix format
* change 'allreduce_low_latency' to 'inference_allreduce'
* Fix according to comments
* change inference_allreduce to inference_all_reduce to keep naming consistency
* check whether LOCAL_SIZE is defined in ccl.cpp, also define LOCAL_SIZE in test_distributed
* fix format
* Fix format error
* Update tests/unit/comm/test_dist.py
Fix world_size to 4 in UT
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
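A minimal usage sketch for the renamed `inference_all_reduce` API from the commits above, assuming a distributed launch; keyword arguments beyond the tensor are omitted because they may vary across versions:
```python
# Sketch only: run under a DeepSpeed/torch distributed launcher.
import torch
import deepspeed
import deepspeed.comm as dist

deepspeed.init_distributed()              # may select the CCL backend on a CPU-only setup
t = torch.ones(4096, dtype=torch.float32)
dist.inference_all_reduce(t)              # low-latency (SHM) all-reduce on a single node
```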
The `type()` function is deprecated and `is_cuda()` has been available for a very long time.
Switching avoids many warnings when compiling extensions.
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
* include cublas error details when getting cublas handle fails
* run clang-format
* just use raw enum value to avoid depending on minimum cuda version
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
* add fallback path for kernels used in megatron
* temporary numactl WA for SPR 56core
* adapt core allocation according to number of ranks
* add switch to turn on numactl
* detect number of cores on the system
* allow selecting a subset of the cores on the system to bind
* remove unneeded changes
* add ccl backend
* change nccl to ccl
* remove unused code
* add comm/ccl to ops
* initial ccl comm support
* first broadcast case passed
* add CCL_Backend to DeepSpeed
* support comm timer for CPU
* support barrier for comm backend
* support specifying master address from deepspeed command line
* support pytorch 2.0
* remove 'block' from api
* Tweak for debug
Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
* Remove unnecessary directory
Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
* Add bf16 kernel support for inference
* Add temporary torch implementation for cpu inference
* Add softmax ops cpu fallback for inference
* bind cores to numa domain as well
* merge latest change in gma/numactl
* initial bf16 kernel support with fallback path
* initial fallback path for bloom kernel injection
* fix softmax attn mask
* check KMP_AFFINITY to avoid conflict with numactl
* New CCLBackend which utilizes TorchBackend for initialization
* roll back last change because it causes a result error
* fix issue where bloom injection policy TP could not work.
injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")}
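A hedged example of how such an injection policy is typically passed to `deepspeed.init_inference` for AutoTP; the model id and tensor-parallel degree below are illustrative:
```python
# Sketch: run under a 2-process tensor-parallel launch.
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.models.bloom.modeling_bloom import BloomBlock

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")  # example model
model = deepspeed.init_inference(
    model,
    mp_size=2,  # tensor-parallel degree, illustrative
    dtype=model.dtype,
    injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")},
)
```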
* Use TorchBackend to initialize CCLBackend, make behavior consistent
* remove comm under deepspeed/ops
* add license header
* code clean up
* fix format issue
* remove magic number in main address
* add caching support but do not turn it on by default
* change name of inference_cuda_module to inference_module
* Check for is_synchronized_device in accelerator before getting Event
* fix typo
* Fix fallback path of softmax kernel on CUDA device for BF16 data type: because CUDA tril does not support the BF16 datatype, enforce fp32 data type
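A minimal sketch of the fp32 workaround described in the commit above, assuming an attention-score tensor masked with `torch.tril` (shapes and the helper name are illustrative):
```python
import torch

def causal_mask_scores(scores: torch.Tensor) -> torch.Tensor:
    # Some CUDA/torch versions do not support torch.tril on bfloat16,
    # so build and apply the causal mask in fp32, then cast back.
    orig_dtype = scores.dtype
    seq = scores.shape[-1]
    mask = torch.tril(torch.ones(seq, seq, dtype=torch.float32, device=scores.device))
    scores = scores.float().masked_fill(mask == 0, float("-inf"))
    return scores.to(orig_dtype)
```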
* add cpu backend files
* change CPU_Accelerator op_builder_dir
* remove cpu_kernel_path
* using CPU_Accelerator on non-cuda device
* fix deepspeed.op_builder => deepspeed.ops.op_builder
* add alias for num_gpus: num_accelerators
* allow loading cpu_builder in build stage
* Assume cuda is available if torch is not installed
* add oneccl_binding_pt to requirements
* move oneccl-binding-pt to separate requirements-cpu.txt
* add missing file
* use dependency_links in setuptools.setup() call for additional dependency links
* install oneccl_bind_pt in workflows
* change oneccl_bind_pt's version from 1.13 to 2.0
* use intel_extension_for_pytorch as indicator that CPU_Accelerator should be used
* Add indicator for Accelerator used
* change foo.c to foo.cpp
* exclude 'cpu' directory in CUDA op builder reflection
* add a cpu-inference workflow
* run cpu-inference workflow on self-hosted instance
* change cpu runs-on node to v100 node
* print out python version in workflow
* add verbose flag to pip command to understand oneccl_bind_pt install issue
* update cpu-inference workflow
* add a stage to detect instance instruction sets
* add back bf16 support for CPU inference
* enable autoTP for bloom
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* update workflow to detect cpu instruction sets
* temporary WA for Intel Extension for PyTorch AVX2 instruction set detection
* change cpu-inference workflow machine to ubuntu-20.04
* add sharded checkpoint loading for AutoTP path to reduce the peak memory in initialization stage
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* enable policy for llama
* use a special ipex build to test the avx2 detection fix
* fix format
* fix test fail issue
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* fix gptj sharded checkpoint loading problem
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
* return a not-implemented builder from get_op_builder in cpu_backend
* support cpu device in tests
* use cpuinfo to extract number of CPUs
* use ~/tmp as transformer cache rather than /blob/
* Add support for mpich launcher with prefer_deepspeed_comm
* add missing modification in accelerator
* enable IMPI launcher
* remove unused file and fix formatting
* clean up ccl.cpp
* Less confusing error message when certain op builders are not implemented
* Fix license header
* Add license header
* add license headers
* add license header
* fix cuda specific code in test
* update CPU workflow
* use numactl to bind to core
* allow bind_cores_to_rank in multi-node impi runner
* fix format error
* Remove InferenceBuilder
* fix format error in numa.py
* check whether op is in installed ops in ds_report.py
* allow overriding accelerator with DS_ACCELERATOR='cuda', 'cpu' or 'xpu'
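A hedged usage sketch of the override; it assumes the variable must be set before the first accelerator query:
```python
import os
os.environ["DS_ACCELERATOR"] = "cpu"      # also accepts "cuda" or "xpu"

from deepspeed.accelerator import get_accelerator
print(get_accelerator().device_name())    # expected to report "cpu"
```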
* lazy init class_dict in CUDA_Accelerator to avoid cyclic initialization of CUDA_Accelerator
* put short path at the beginning of real_accelerator.py
* device_count returns number of NUMA nodes
* fix typo
* install numactl in cpu workflow
* Follow comments
* Better implementation of device_count() and current_device()
* remove dependency_link for Intel Extension for DeepSpeed
* check is_synchronized_device in timer only once
* remove env mapping WA in cpu_accelerator
* fix duplicate definition
* fix format error
* refine ccl backend selection
* move comments to the right place
* remove prefer_deepspeed_comm, use CCLBackend by default
* refactor fallback path
* Fix execution failure in kernel injection path
* do not refactor kernel injection fallback path in residual_add because it contains a function call with side effects
* guard residual_add fallback path with environ DS_KI_FALLBACK=True
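An illustrative sketch of the environment-variable guard (not the actual DeepSpeed residual_add implementation, whose details differ):
```python
import os

def residual_add(hidden, residual):
    if os.environ.get("DS_KI_FALLBACK", "").lower() == "true":
        # Plain-PyTorch fallback path, enabled only when DS_KI_FALLBACK=True.
        return hidden + residual
    # Otherwise the custom inference kernel would be invoked here (omitted in this sketch).
    raise NotImplementedError("kernel path not shown in this sketch")
```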
* fix format error
* add test for allreduce on CPU workflow
* fix format error
* Fall back to TorchBackend if CCLBackend kernels are not implemented
* Update Intel Extension for PyTorch installation link
* Don't specify version number of Intel Extension for PyTorch
* install oneCCL for CCLBackend
* fix link path for CPU comm kernels
* fix source oneCCL environment
* source oneCCL env before run UT
* Give more specific instructions when CCL_ROOT is not defined
---------
Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: sdp <sdp@aia-sdp-spr-108864.jf.intel.com>
Co-authored-by: Cao, Zhong Z <zhong.z.cao@intel.com>
Co-authored-by: Zhenhuan Chen <zhenhuan.chen@intel.com>
Co-authored-by: baodii <di.bao@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: jianan-gu <jianan.gu@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
* Add cg headers hipification
* Exclude including cuda_bf16.h on ROCm
* Merge
* Restrict inclusion of cuda_bf16.h with the BF16_AVAILABLE var
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
* temporary workaround until __double2half support is enabled in HIP
* workaround only for hipcc
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
* Fixes for asymmetric quantization
* additional offset to further improve accuracy
* put the 0.5 into offset rather than applying it later
* update unit test for quantization
* fix format
* attempt to fix format
---------
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
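A hedged numeric sketch of the asymmetric-quantization change above (folding the 0.5 rounding term into the offset); this is plain NumPy arithmetic, not the DeepSpeed kernel:
```python
import numpy as np

def asym_quantize(x: np.ndarray, num_bits: int = 8):
    qmin, qmax = 0, (1 << num_bits) - 1
    scale = max((x.max() - x.min()) / (qmax - qmin), 1e-8)
    # The offset absorbs both the minimum value and the +0.5 rounding term,
    # so quantization reduces to one fused multiply-add followed by floor.
    offset = 0.5 - x.min() / scale
    q = np.clip(np.floor(x / scale + offset), qmin, qmax).astype(np.uint8)
    dq = (q - (offset - 0.5)) * scale  # dequantize using the same offset
    return q, dq
```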
* Enable page-locked memory in cpu only env
* Enable page-locked memory in cpu only env
* Formatting
* Add TODOs; Release page-locked memory
* Update perf microbenchmark; Reduce unit test memory
* Reduce CI mem usage
* Reset KV-cache at the beginning of text-generation
* Add new backward kernel to handle large softmax-length
* remove unrelated changes
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
* Integrate accelerator abstraction interface into deepspeed/
* Fix error message in fp16/fused_optimizer
* fix error message in fp16/unfused_optimizer.py
* assign get_accelerator().pin_memory() result to input Tensor name
* no need to check cuda or whether nvtx is supported
* move try-except into inner most block
* call Event() and Stream() in get_accelerator() for data type
* Make Stream and Event properties of the abstract interface so they can be used as data types in deepspeed
* Apply op_builder backend api change from #2705 from @jeffra
* fix tests where Builder NAME is used
* keep original ...Builder.NAME interface instead of ...Builder().NAME interface
* fix builder closure for installation
* fix randomltd builder
* add comments to clarify create_op_builder and get_op_builder
* fix compatibility with pip install -e
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* CPU-Adam: add compile-flag to enable param-copy from CPU to GPU
* guard the CUDA-related include files and variables
* remove CUDA dependency from op_builder when building against CPU
* fixing the builder issues
* fix formatting
* return true when there is no mismatch on the cuda version
* guard for when cuda is not available & test with cpu-only environment
* Update cpu_adam and cpu_adagrad
* Format fixes
* Add configurable half precision type; Build/run in CUDA environment
* Run cpu_adam and cpu_adagrad in cpu only environment
* Mark CUDA-only unit tests
* CPU environment CI
* Format fixes
* Remove --forked
* Add --forked
* CPU only CI should pass
* Format fixes
* Format fixes
* Remove scattered pytest.skip
* Fix cpu_adam unit test
* Update .github/workflows/nv-torch-latest-cpu.yml
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
* Update .github/workflows/nv-torch-latest-cpu.yml
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
* Address PR feedback
* OpenMP linking
* Fix unit tests
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
* Abstract accelerator (step 2)
* more flexible op_builder path for both installation and runtime
* add SpatialInferenceBuilder into cuda_accelerator.py
* use reflection to make cuda_accelerator adapt to CUDA op builder changes automatically
* clean up deepspeed/__init__.py
* add comments in cuda_accelerator for no torch path
* Update deepspeed/env_report.py
Change env_report.py according to suggestion
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
* reduce the range of try...except for better code clarity
* Add porting for deepspeed/ops/random_ltd/dropping_utils.py
* move accelerator to top directory and create symlink under deepspeed
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* fix checkpoint loading when it is a dictionary
* fix some issues with saving ckpt & int8 inference
* fix quantized-inference & add generic support of checkpoint loading
* remove int8 hard-coded flag
* fix mlp return tensors
* fix several issues loading checkpoints of GPT-J, GPT-NeoX, and OPT with different TP sizes
* add more comments & description for checkpoint-loading module
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
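A hedged sketch of kernel-injection inference driven by a checkpoint descriptor, as touched by the commits above; the model id, TP degree, and checkpoint file name are illustrative:
```python
# Sketch: run under a 4-rank tensor-parallel launch.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")  # example model
model = deepspeed.init_inference(
    model,
    mp_size=4,                      # TP degree may differ from the checkpoint's
    dtype=torch.float16,
    replace_with_kernel_inject=True,
    checkpoint="checkpoints.json",  # a dict with the same fields is also accepted
)
```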
* Initial commit Deepspeed quantization library
* Match function signatures
* Add Quantization Kernel
* adding offset comparison and precommit changes
* format fixes
* File name changes
* pt_binding_changes
* test name change
* Integer quantization, minor refactors
* Add directed test_case
* format fixes
* Move param calculation to constructor of params class
* Use local function and add elemsPerBlock
* change function to be specialized
* sub block reduce
* add new schedule
* Add new schedule test case
* fix illegal writes in sch1
* Style fixes in comments
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
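As a rough reference for what the quantization kernels above compute, a plain-PyTorch sketch of group-wise symmetric int8 quantization (group layout and naming are illustrative, not the library's API):
```python
import torch

def quantize_groupwise(x: torch.Tensor, num_groups: int, num_bits: int = 8):
    qmax = (1 << (num_bits - 1)) - 1                    # 127 for int8
    g = x.reshape(num_groups, -1)                       # one scale per group
    scale = (g.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(g / scale), -qmax - 1, qmax).to(torch.int8)
    return q.reshape(x.shape), scale.squeeze(1)
```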
* Fix build issues on Windows
* small fix to compile with new version of Microsoft C++ Build Tools
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
* Unify macro definitions and constants in a single file
* Conversion utility implementation.
* Fix reversion from formatting
* Bugfixes after testing with correct DeepSpeed
* Inline markers are available on both HIP + CUDA
* mem access for quantize kernel
* format
* format fp32
* modify quant kernel
* modify quant kernel2
* modify format
* format
* fix comments in pytest
* fix comments in pytest
* format
* rerun
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
* Fix the tensor-slicing copy for qkv parameters
* remove the random-generator from context during inference
* formatting
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* Fix the half-precision version of CPU-Adam
* remove unexpected return
* fix the increase width (fp32/fp16)
* support fp16 tests for cpu-adam
* fix the fp16 data-loading
* change unit-test for fp16 check & slight change to parameter size
* fix for numpy error
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
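A hedged sketch exercising the half-precision CPU-Adam path described above; tensor sizes and hyper-parameters are illustrative:
```python
import torch
from deepspeed.ops.adam import DeepSpeedCPUAdam

param = torch.nn.Parameter(torch.randn(1024, dtype=torch.float16))
opt = DeepSpeedCPUAdam([param], lr=1e-3)

param.grad = torch.randn_like(param)  # stand-in gradient for the sketch
opt.step()
```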