Commit graph

171 commits

Author SHA1 Message Date
Wang, Yi 5e16eb2c93
enable autoTP for mpt in huggingface model hub without trust_remote_code (#4062)
see https://github.com/huggingface/transformers/tree/main/src/transformers/models/mpt

Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-08-22 13:41:11 +00:00
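A minimal sketch of the AutoTP usage this commit enables, assuming the transformers-native MPT implementation linked above; the model name and tp size are illustrative:

```python
# Hedged sketch: AutoTP on an MPT checkpoint served by transformers'
# native implementation, so trust_remote_code is not required.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mosaicml/mpt-7b")

# replace_with_kernel_inject=False takes the automatic tensor-parallel
# (AutoTP) path instead of fused-kernel injection.
model = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 2},
    dtype=torch.float16,
    replace_with_kernel_inject=False,
)
```

Launched with e.g. `deepspeed --num_gpus 2 script.py`, so the two ranks each hold half of the attention/MLP shards.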
Molly Smith 341cefd2a4
Return nn.parameter type for weights and biases (#4146)
* Return nn.parameter type for weights and biases

* whitespace

* Fix bias tensor size
2023-08-15 20:39:38 +00:00
digger yu 4cde5da88e
fix typo: change polciies to policies (#4090) 2023-08-04 17:51:43 +00:00
Lev Kurilenko 1ba4098918
Fix Stable Diffusion Injection (#4078)
* Initial commit

* Clean up

* Fix formatting
2023-08-03 17:58:11 +00:00
Molly Smith 94c7233a8b
Refactor autoTP inference for HE (#4040)
* Refactor autoTP inference for HE

* Formatting

* Move redundant functions to autotp

* Remove self from loading class

* formatting

* Some gpt2 autotp path fixes

* precommit
2023-08-01 04:41:43 +00:00
mzl 6b877d2dbc
autoTP for fused qkv weight (#3844)
* autoTP for fused qkv weight

* fix format

* clean up

* clean up

* clean up

* update

* make logic flow to util and move to file

* fix formatting

* remove empty line

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-07-27 21:30:14 +00:00
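A hedged, self-contained sketch of why fused QKV needs special handling in AutoTP, assuming a layout that stacks Q, K, V along dim 0 (real models also use per-head interleaving); shapes are illustrative:

```python
# Each tensor-parallel rank needs matching slices of Q, K and V, so a
# naive row-chunk of the fused [3h, h] matrix would give rank 0 all of
# Q plus half of K, which is wrong.
import torch

hidden, tp = 8, 2
w_fused = torch.arange(3 * hidden * hidden, dtype=torch.float32)
w_fused = w_fused.reshape(3 * hidden, hidden)

q, k, v = w_fused.chunk(3, dim=0)  # recover the three projections
shards = [
    torch.cat([m.chunk(tp, dim=0)[r] for m in (q, k, v)], dim=0)
    for r in range(tp)
]
assert shards[0].shape == (3 * hidden // tp, hidden)
```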
Wang, Yi 0bafeac491
enable autoTP for MPT (#3861)
* enable autoTP for MPT

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* add model specific func to auto_tp_model_utils.py

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-07-27 21:27:38 +00:00
Wang, Yi 76953a37b7
fix opt-350m shard loading issue in AutoTP (#3600)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
2023-07-27 21:20:22 +00:00
digger yu 389bf69319
fix: Remove duplicate word the (#4051) 2023-07-27 09:33:13 -07:00
Minjia Zhang 15f94ae756
Engine side fix for loading llama checkpoint fine-tuned with zero3 (#3981)
* Engine side fix for loading llama checkpoint fine-tuned with zero3

* Fixes to support llama fine-tuning in ds-chat

* Refactored the code to avoid using an except block.

* formatting

* revert permissions change

---------

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-07-26 22:18:12 +00:00
Puneesh Khanna ad661b8e35
Remove print of weight parameter in RMS norm (#4031) 2023-07-25 19:56:40 +00:00
Dino Chen f3943cf910
add llama2 autoTP support in replace_module (#4022) 2023-07-24 21:55:29 +00:00
Ma, Guokai 1bc3b78423
[CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) (#3919)
* use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node)

* add fp32 support for SHM allreduce

* avoid assertion for FP16 data type

* fix format

* change 'allreduce_low_latency' to 'inference_allreduce'

* Fix according to comments

* change inference_allreduce to inference_all_reduce to keep naming consistency

* check whether LOCAL_SIZE is defined in ccl.cpp, also define LOCAL_SIZE in test_distributed

* fix format

* Fix format error

* Update tests/unit/comm/test_dist.py

Fix world_size to 4 in UT

Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
2023-07-19 20:57:54 +00:00
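A small sketch in the spirit of the unit test mentioned above (world size fixed to 4). Per the commit messages, AutoTP's inference layers call the new inference_all_reduce, which the CPU backend serves with a single-node shared-memory implementation; the snippet below just exercises a plain deepspeed.comm all-reduce:

```python
# Run with 4 ranks, e.g.: deepspeed --num_gpus 4 allreduce_test.py
# (script name illustrative).
import torch
import deepspeed
import deepspeed.comm as dist

deepspeed.init_distributed()
rank = dist.get_rank()

t = torch.full((1024,), float(rank + 1))
dist.all_reduce(t)  # default reduce op is SUM

# With world_size == 4: 1 + 2 + 3 + 4 == 10 in every element.
assert torch.allclose(t, torch.full_like(t, 10.0))
```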
digger yu ce535945e6
fix: change ==NONE to is (#3923) 2023-07-11 16:56:43 +00:00
Lev Kurilenko cc3a7c9cba
Fix Meta Tensor checkpoint load for BLOOM models (#3885)
This PR fixes Meta Tensor checkpoint loading for BLOOM models where the SD keys start with `transformer.`.
2023-07-06 06:41:16 +00:00
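An illustrative sketch of the key-prefix issue (made-up keys, not the actual loader code):

```python
# BLOOM state-dict keys carry a "transformer." prefix that the
# meta-tensor checkpoint loader must account for when mapping
# checkpoint entries onto the injected modules.
sd = {
    "transformer.word_embeddings.weight": None,
    "transformer.h.0.self_attention.dense.weight": None,
}
prefix = "transformer."
module_keys = {k[len(prefix):] for k in sd if k.startswith(prefix)}
assert "h.0.self_attention.dense.weight" in module_keys
```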
Yejing-Lai d6f622176d
Add GPTNeoX AutoTP support (#3778)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
2023-07-06 03:42:55 +00:00
Reza Yazdani f3c93b056d
Add FALCON Auto-TP Support (#3640)
* Add FALCON auto-tp support
* added (skipped) unit test, refactored code to be more readable

---------

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-07-05 13:01:02 -07:00
Xingjian Shi d81dfdabcc
Fix LoRA Fuse/Unfuse in Hybrid Engine (#3563)
* fix lora fuse unfuse in hybrid_engine

* fix name

* fix typo

* remove empty lines

* Update gptj.py

* add lora test-case + fix gptneo implementation

* try to fix format

* try to accelerate testcase by reducing max length

* reduce test runtime

* Fix bloom / gpt-neox and add test for bloom

* fix CI + fix issue in engine

---------

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-07-05 08:42:43 -04:00
Connor Holmes c86e4e31b8 Missing strided copy for gated MLP (#3788)
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-06-23 14:30:49 -07:00
stephen youn 69d1b9f978 DeepSpeed-Triton for Inference (#3748)
Co-authored-by: Stephen Youn <styoun@microsoft.com>
Co-authored-by: Arash Bakhtiari <arash@bakhtiari.org>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Ethan Doe <yidoe@microsoft.com>
Co-authored-by: yidoe <68296935+yidoe@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-06-23 14:30:49 -07:00
tensor-tang 45466afa34
fix hybrid engine mlp module (#3736)
* fix gated_mlp.py

* fix hybrid_engine.py

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-06-13 21:44:47 +00:00
Logan Adams fc8e5c8858
Fix typo in name of hybrid engine function (#3704)
* Fix typo in name of hybrid engine function

* Fix
2023-06-08 11:19:41 -07:00
Reza Yazdani 34a9fbf1a3
Fix gpt-j inference issue (#3639)
* fix gpt-j inference issue for mlp_gemm_func call

* bring back the gpt-j inference-test

* fix formatting

* fix the neox and pythia injection issue
2023-06-07 19:38:46 +00:00
Logan Adams 7e59ef1230
Revert "fix typo name (#3689)" (#3702)
This reverts commit f2f5f21b52.
2023-06-07 09:56:21 -07:00
tensor-tang f2f5f21b52
fix typo name (#3689)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-06-06 16:43:29 -07:00
Lev Kurilenko 7667988491
Fix Hybrid Engine for BLOOM (#3580)
This PR fixes Hybrid Engine (HE) support for the BLOOM model, which was accidentally broken during the HE refactor in GH-3425.

The BLOOM container now inherits the HybridEngineContainer feature and defines a set_lora_params() function necessary for the feature to work. get_lora_params() is correspondingly removed from the BLOOM policy class as well.

GPT-NeoX's policy was also cleaned up by removing its get_lora_params() function, since it is no longer used.
2023-05-23 18:46:59 +00:00
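A self-contained sketch of the contract this fix restores; all names and bodies are illustrative stand-ins, not the actual DeepSpeed containers:

```python
# Hybrid Engine features call set_lora_params() on the container, so
# any container inheriting the feature must define it.
class HybridEngineContainer:
    def fuse_lora(self):
        self.set_lora_params()  # subclass must populate self.lora_params
        for name in self.lora_params:
            pass  # fuse the low-rank A/B factors into the base weight

    def set_lora_params(self):
        raise NotImplementedError  # the bug: BLOOM lacked this override


class BloomContainer(HybridEngineContainer):
    def set_lora_params(self):
        # Expose the weights LoRA fuse/unfuse operates on (illustrative
        # names); previously this lived in the policy as get_lora_params().
        self.lora_params = ["self_attention.dense", "mlp.dense_4h_to_h"]


BloomContainer().fuse_lora()  # works once set_lora_params() is defined
```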
Ma, Guokai 1f72082fc0
[CPU] Support Intel CPU inference (#3041)
* add fallback path for kernels used in megatron

* temporary numactl WA for SPR 56core

* adapt core allocation according to number of ranks

* add switch to turn on numactl

* detect number of cores on the system

* allow select a subset of the cores on the system to bind

* remove unneeded changes

* add ccl backend

* change nccl to ccl

* remove unused code

* add comm/ccl to ops

* initial ccl comm support

* first broadcast case passed

* add CCL_Backend to DeepSpeed

* support comm timer for CPU

* support barrier for comm backend

* support specify master address from deepspeed command line

* support pytorch 2.0

* remove 'block' from api

* Tweak for debug

Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>

* Remove unnecessary directory

Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>

* Add bf16 kernel support for inference

* Add temporary torch implement for cpu inference

* Add softmax ops cpu fallback for inference

* bind cores to numa domain as well

* merge latest change in gma/numactl

* initial bf16 kernel support with fallback path

* initial fallback path for bloom kernel injection

* fix softmax attn mask

* check KMP_AFFINITY to avoid conflict with numactl

* New CCLBackend which utilize TorchBackend for initialization

* rollback last change because there is result error

* fix issue where TP with the bloom injection policy could not work.

injection_policy={BloomBlock: ("self_attention.dense", "mlp.dense_4h_to_h")}

* Use TorchBackend to initialize CCLBackend, make behavior consistent

* remove comm under deepspeed/ops

* add license header

* code clean up

* fix format issue

* remove magic number in main address

* add caching support but not turn on by default

* change name of inference_cuda_module to inference_module

* Check for is_synchronized_device in accelerator before get Event

* fix typo

* Fix fallback path of softmax kernel on CUDA device for BF16 data type: because CUDA tril does not support the BF16 datatype, enforce fp32

* add cpu backend files

* change CPU_Accelerator op_builder_dir

* remove cpu_kernel_path

* using CPU_Accelerator on non-cuda device

* fix deepspeed.op_builder => deepspeed.ops.op_builder

* add alias for num_gpus: num_accelerators

* allow loading cpu_builder in build stage

* Assume cuda available if torch not installed

* add oneccl_binding_pt to requirements

* move oneccl-binding-pt to separate requirements-cpu.txt

* add missing file

* use dependency_links in setuptools.setup() call for additional dependency links

* install oneccl_bind_pt in workflows

* change oneccl_bind_pt's version from 1.13 to 2.0

* use intel_extension_for_pytorch as indicator that CPU_Accelerator should be used

* Add indicator for Accelerator used

* change foo.c to foo.cpp

* exclude 'cpu' directory in CUDA op builder reflection

* add a cpu-inference workflow

* run cpu-inference workflow on self-hosted instance

* change cpu runs-on node to v100 node

* print out python version in workflow

* add verbose in pip command to understand oneccl_bind_pt install issue

* update cpu-inference workflow

* add a stage to detect instance instruction sets

* add back bf16 support for CPU inference

* enable autoTP for bloom

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* update workflow to detect cpu instruction sets

* temporary WA for Intel Extension for PyTorch AVX2 instruction set detection

* change cpu-inference workflow machine to ubuntu-20.04

* add sharded checkpoint loading for AutoTP path to reduce the peak memory in initialization stage

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* enable policy for llama

* use a special build ipex to test avx2 detection fix

* fix format

* fix test fail issue

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix gptj sharded checkpoint loading problem

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* return a not implemented build in get_op_builder in cpu_backend

* support cpu device in tests

* use cpuinfo to extract number of CPUs

* use ~/tmp as transformer cache rather than /blob/

* Add support for mpich launcher with prefer_deepspeed_comm

* add missing modification in accelerator

* enable IMPI launcher

* remove unused file and fix formatting

* clean up ccl.cpp

* Less confusing error message when certain op builders are not implemented

* Fix license header

* Add license header

* add license headers

* add license header

* fix cuda specific code in test

* update CPU workflow

* use numactl to bind to core

* allow bind_cores_to_rank in multi-node impi runner

* fix format error

* Remove InferenceBuilder

* fix format error in numa.py

* check whether op is in installed ops in ds_report.py

* allow override accelerator with DS_ACCELERATOR='cuda','cpu' or 'xpu'

* lazy init class_dict in CUDA_Accelerator to avoid cyclic initialization of CUDA_Accelerator

* put short path in the beginning in real_accelerator.py

* device_count returns number of NUMA nodes

* fix typo

* install numactl in cpu workflow

* Follow comments

* Better implementation of device_count() and current_device()

* remove dependency_link for Intel Extension for DeepSpeed

* use check is_synchronized_device in timer only once

* remove env mapping WA in cpu_accelerator

* fix duplicate definition

* fix format error

* refine ccl backend selection

* move comments to the right place

* remove prefer_deepspeed_comm, use CCLBackend by default

* refactor fallback path

* Fix execution failure in kernel injection path

* do not refactor the kernel injection fallback path in residual_add because it contains a function call with side effects

* guard residual_add fallback path with environ DS_KI_FALLBACK=True

* fix format error

* add test for allreduce on CPU workflow

* fix format error

* Fallback to TorchBackend if CCLBackend kernel are not implemented

* Update Intel Extension for PyTorch installation link

* Don't specify version number of Intel Extension for PyTorch

* install oneCCL for CCLBackend

* fix link path for CPU comm kernels

* fix source oneCCL environment

* source oneCCL env before run UT

* Give more specific instruction when CCL_ROOT not defined

---------

Signed-off-by: Cao, Zhong Z <zhong.z.cao@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: sdp <sdp@aia-sdp-spr-108864.jf.intel.com>
Co-authored-by: Cao, Zhong Z <zhong.z.cao@intel.com>
Co-authored-by: Zhenhuan Chen <zhenhuan.chen@intel.com>
Co-authored-by: baodii <di.bao@intel.com>
Co-authored-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: jianan-gu <jianan.gu@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-05-16 11:59:22 -04:00
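A short sketch of the accelerator selection this PR adds, assuming the CPU backend's dependencies (oneCCL bindings, Intel Extension for PyTorch) are installed:

```python
# Force the CPU accelerator via the DS_ACCELERATOR override described
# above, then query it through the abstraction layer.
import os
os.environ["DS_ACCELERATOR"] = "cpu"  # also accepts "cuda" or "xpu"

from deepspeed.accelerator import get_accelerator

acc = get_accelerator()
print(acc.device_name())   # -> "cpu"
print(acc.device_count())  # per this PR, the number of NUMA nodes
```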
digger-yu 254663a28c
fix spelling error with deepspeed/runtime/ (#3509)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2023-05-11 16:34:19 +00:00
Lev Kurilenko 194053bd58
Hybrid Engine Fix Llama (#3505)
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
2023-05-10 19:29:08 -07:00
Wang, Yi b31b46c0d1
fix regression in shard checkpoint loading in the AutoTP path caused by the deletion of qkv_copy(), and add a UT case for shard checkpoint loading in AutoTP (#3457)
* add UT case for shard checkpoint loading in AutoTP

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* autoTP path also support shard loading

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
2023-05-10 06:52:44 -04:00
Lev Kurilenko db26f8b413
Update Inference Engine checkpoint loading + meta tensor assertions (#2940)
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
2023-05-10 03:00:43 +00:00
Wang, Yi d10b8ca011
add sharded checkpoint loading for AutoTP path to reduce the peak mem… (#3102)
* add sharded checkpoint loading for AutoTP path to reduce the peak memory in initialization stage

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

* fix gptj sharded checkpoint loading problem

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

---------

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-05-04 13:34:15 -04:00
Connor Holmes 0a61d5d664
Hybrid Engine Refactor and Llama Inference Support (#3425)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-05-03 17:20:07 -07:00
jianan-gu 2c63e349e4
Enable auto TP policy for llama model (#3170)
* Enable auto TP policy for llama model

* Update automatic-tensor-parallelism.md

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
2023-05-03 22:49:33 +00:00
Connor Holmes 52d7e80aac
OPT Activation Function Hotfix (#3400)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Zhewei Yao <zheweiyao@gmail.com>
2023-05-01 21:08:13 -07:00
Reza Yazdani 3e8564645d
Add HE support for the rest of model containers (#3191)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-05-01 13:37:57 -07:00
Molly Smith 496a9a3a62
Diffusers 0.15.0 bug fix (#3345)
* diffusers 0.15.0 cross attention class check

* revert diffusers_attention.py
2023-04-21 15:30:24 -07:00
Connor Holmes 793c23e5c1
Explicitly check for OPT activation function (#3278)
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-04-20 17:38:43 -07:00
Olatunji Ruwase 47f9f13bd3
DeepSpeed Chat (#3186)
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-04-11 11:53:38 -07:00
Wang, Yi 6ba0024d54
Enable autoTP for bloom (#3035)
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-04-05 10:58:33 -04:00
Michael Wyatt b361c72761
Update DeepSpeed copyright license to Apache 2.0 (#3111)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-30 17:14:38 -07:00
Jeff Rasley 91d63e0228
update formatter version and style settings (#3098) 2023-03-27 07:55:19 -04:00
Molly Smith 9ea0fdc2ce
Assert mp_size is factor of model dimensions (#2891)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-21 14:50:43 -07:00
Lev Kurilenko 3798e60519
Fix Meta Tensor checkpoint load for OPT models (#2990)
This PR fixes Meta Tensor checkpoint loading for OPT models where the SD keys start with `model.`.
2023-03-10 11:45:36 -08:00
Ma, Guokai 0acf7e9c48
[RFC] add device abstraction to allow other device than CUDA be used (#2221)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-07 09:40:17 -08:00
Lev Kurilenko 87eaf8f99a
Check for local CUDA graphs when enable_cuda_graph=True (#2941) 2023-03-06 17:38:50 -08:00
Molly Smith 2ede0d942a
AutoTP Assert Kernel Injection Support (#2939)
* check kernel injection supported models

* Clarify why user should use kernel injection
2023-03-06 14:23:55 -08:00
Molly Smith 4ae3a3da0d
TP unsupported models and assertions (#2810)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-01 14:18:13 -08:00
Heyang Qin dc01cee5ca
using container when loading inference checkpoints (#2875)
This PR updates the replace_fn function when loading inference checkpoints. The container is now passed to load_model_with_checkpoint() so we can call load_params() from there. load_params() is also updated to access the variables in the policy.
2023-02-28 14:59:23 +00:00
Jeff Rasley da84e60d98
add missing license info to top of all source code (#2889)
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-02-27 11:20:41 -08:00
Lev Kurilenko fd1449c766
Port Reza's INT8-quantization fix to container architecture (#2725)
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Heyang Qin <heyangqin@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-02-16 10:12:18 -08:00
Molly Smith 46784cb58e
Fix auto TP for duplicate modules with different gems (#2784)
* Fix auto TP for duplicate modules with different gems

* precommit and comments

* Comment

* Combine gem list of same named modules

* remove duplicates from gem_list before updating policy

* Add module attribute with name variation for ProphetNet

---------

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-02-15 12:50:32 -08:00
Lev Kurilenko 10f3c301a0
Add container load checkpoint error reporting + refactor (#2792)
This PR refactors the organization of meta tensor checkpoint loading as follows:

- Move get_param_names() abstract method definition from TransformerPolicy into MetaTensorContainer
- Model-specific get_param_names() definitions moved from policy into model-specific container
- selected_policy_g, megatron_v2_g, and transformer_config_g globals replaced with a single container_g global, since the container will contain all of the information those globals previously captured
- ckpt_load_enabled flag added to containers that's set to False by default in the base.py container and gets set to True when the MetaTensorContainer feature is inherited
- Assertion added to replace_transformer_layer before performing checkpoint loading to check that ckpt_load_enabled == True; otherwise an error message is printed saying that the container does not support meta tensor checkpoint loading.

The aim of these changes is to more closely couple meta tensor checkpoint loading code to the MetaTensorContainer and to allow for better error reporting of load checkpoint use on model types that don't support this feature.
2023-02-07 23:18:30 +00:00
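A minimal sketch of the ckpt_load_enabled gate described above (illustrative names, not the actual source):

```python
# Meta tensor checkpoint loading is only attempted when the container
# opted in by inheriting the MetaTensorContainer feature.
class BaseContainer:
    ckpt_load_enabled = False  # default in the base container


class MetaTensorContainer(BaseContainer):
    ckpt_load_enabled = True  # inheriting the feature flips the flag


def load_checkpoint(container):
    assert container.ckpt_load_enabled, (
        f"{type(container).__name__} does not support "
        "meta tensor checkpoint loading")
    # ...proceed with get_param_names() / load_params()...


load_checkpoint(MetaTensorContainer())  # ok
# load_checkpoint(BaseContainer())      # would raise the assertion
```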
Lev Kurilenko 0a73e6e613
Container param cleanup + remove qkv_merging (#2780)
This PR cleans up some container items and removes an unused qkv_merging parameter:

- Remove qkv_merging=True from BERT containers
- Change containers config object to ds_model_config
- Remove qkv_merging param
2023-02-03 21:49:33 +00:00
Reza Yazdani 9f41ffe4a6
Reset KV-cache at the beginning of text-generation (#2669)
Co-authored-by: Martin Cai <martincai@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-02-03 12:07:44 -08:00
Reza Yazdani 2c6e819450
Fix Checkpoint-loading with Meta-tensor (#2781)
* Reset KV-cache at the beginning of text-generation

* Pass the ckpt-loading arguments to work with meta-tensor

* remove unrelated changes
2023-02-03 07:12:53 +00:00
Michael Wyatt ef6a958e70
Fix for diffusers v0.12.0 (#2753)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-01-31 15:54:46 -08:00
Ma, Guokai 98cc35b6a8
Abstract accelerator (step 3) (#2677)
* Integrate accelerator abstraction interface into deepspeed/

* Fix error message in fp16/fused_optimizer

* fix error message in fp16/unfused_optimizer.py

* assign get_accelerator().pin_memory() result to input Tensor name

* no need to check cuda and whether nvtx supported

* move try-except into inner most block

* call Event() and Stream() in get_accelerator() for data type

* Make Stream and Event as properties of abstract interface so they can be used as data type in deepspeed

* Apply op_builder backend api change from #2705 from @jeffra

* fix tests where Builder NAME is used

* keep original ...Builder.NAME interface instead of ...Builder().NAME interface

* fix builder closure for installation

* fix randomltd builder

* add comments to clarify create_op_builder and get_op_builder

* fix compatibility with pip install -e

Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-01-26 06:03:12 -08:00
Molly Smith d59b572911
Automatic tensor parallelism v2 (#2670)
* loop through pipe.model

* tp_parser first draft

* client_module must be type object

* Simplify layernorm tracking. Add unittest.

* cleanup

* Add more models to unittest

* cleanup inference pytest for merging

* Add unittest

* cleanup

* pre-commit

* unittest id and pytest marker

* try marian for unittest

* precommit

* Move tp code to separate file

* Add new auto tp file

* pre-commit and type

* Update deepspeed/module_inject/auto_tp.py

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Update deepspeed/module_inject/auto_tp.py

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* Update tests/unit/inference/test_inference.py

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>

* remove unused fillmask function

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-01-24 15:05:48 -08:00
Ammar Ahmad Awan 867da307d0
Inference Refactor (replace_with_policy, model_implementations) (#2554)
Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-01-19 14:10:03 -08:00
Reza Yazdani 95d9a1b6c3
Fix Opt injection (#2541)
* fix Opt injection & add injection verification check at inference test

* fix several issues

* remove fixture

* remove check_injection when no kernel is injected

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-01-06 13:21:49 -08:00
Jeff Rasley d9b788d773
tweaks to ds-attn, distilbert policy, and mup (#2649) 2022-12-28 10:16:02 -08:00
Jeff Rasley e0aa84c5b5
Fix issue w. bloom when changing tp size (#2645) 2022-12-23 03:27:33 +00:00
Lev Kurilenko 503706ac44
Remove GatheredParameters context from replace_with_policy (#2591)
This PR removes the zero-inference GatheredParameters context from replace_with_policy, since zero-inference is no longer needed after the introduction of meta tensor support for BLOOM.
2022-12-16 13:43:28 -08:00
Jeff Rasley 35eabb0a33
Fix issues w. python 3.6 + add py-version checks to CI (#2589) 2022-12-09 21:53:58 +00:00
Michael Wyatt ccb8eb81fb
Add checkpoint sharding unit tests (#2561)
* added checkpoint sharding tests
2022-12-08 14:35:43 -08:00
Lev Kurilenko 731965db33
Fix MegatronLayerPolicy to have megatron_v2=True (#2579)
This PR updates the MegatronLayerPolicy to set megatron_v2=True, which is required in order to properly transpose in the replace_with_policy() function.

After the change in this PR, in conjunction with PR #99 in the Megatron-DeepSpeed fork, the Megatron text-generation example works with DS inference.
2022-12-07 09:26:09 -08:00
Reza Yazdani 35b350b28c
Fix quantized-inference & Add generic support of checkpoint loading (#2547)
* fix checkpoint loading when it is a dictionary

* fix some issues with saving ckpt & int8 inference

* fix quantized-inference & add generic support of checkpoint loading

* remove int8 hard-coded flag

* fix mlp return tensors

* fix several issues when loading checkpoints of GPT-J, GPT-NeoX, and OPT with different TP sizes

* add more comments & description for checkpoint-loading module

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2022-12-06 13:49:29 -08:00
Ammar Ahmad Awan 90ae688442
Pass down the new DS inference config to replace_transformer_layer. (#2539)
* pass down the new DS inference config to replace_transformer_layer.

* remove quantize_settings and rename the ep_mp_group.

* Fix model_config passing. Fixes gptj issue with wrong output.

* fix small bug in gpt-neo.

Co-authored-by: Reza Yazdani and Michael Wyatt
2022-11-23 19:50:11 +00:00
Ammar Ahmad Awan b5d18a6ab3
DeepSpeed inference config. (#2459) (#2472)
Changes the inference API to accept a config dict and cleans up the Inference Engine to utilize the newly added inference config.

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2022-11-15 00:45:43 +00:00
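A hedged sketch of the config-dict style this change introduced (model choice illustrative):

```python
# init_inference now accepts a single config dict in place of loose
# keyword arguments.
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

ds_config = {
    "dtype": torch.float16,
    "tensor_parallel": {"tp_size": 1},
    "replace_with_kernel_inject": True,
}
engine = deepspeed.init_inference(model, config=ds_config)
```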
lokoppakmsft f2710bbe1d
Make data contiguous before the inplace reshape-copy_ function (#2489)
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2022-11-11 14:04:31 -08:00
Connor Holmes e7e7595502
Stable Diffusion Enhancements (#2491)
Co-authored-by: cmikeh2 <connorholmes@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
2022-11-09 17:40:59 -08:00
Kevin Ko 6f77da1bae
Add `scale_attn_by_inverse_layer_idx` feature (#2486)
* Add scale_attn_by_inverse_layer_idx feature

* Fix layer_id bug

* Fix scaling value

Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
2022-11-09 15:29:10 -08:00
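For context, scale_attn_by_inverse_layer_idx is a GPT-2-style Hugging Face config flag (attention scores additionally scaled by 1/(layer_idx + 1)); a hedged sketch of building a model that exercises it:

```python
# Injecting DeepSpeed kernels into a model configured this way is the
# case this commit supports.
from transformers import GPT2Config, GPT2LMHeadModel

cfg = GPT2Config(scale_attn_by_inverse_layer_idx=True)
model = GPT2LMHeadModel(cfg)
print(cfg.scale_attn_by_inverse_layer_idx)  # True
```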
Reza Yazdani 9cfcf7431a
Add correct memory-allocation at DeepSpeed-Attention (#2474)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
2022-11-07 16:23:25 -08:00
Ammar Ahmad Awan 35458da0e0
Create a new folder structure to isolate model-specific code in DS (#2464) 2022-11-03 17:00:44 -07:00
Connor Holmes 10e9d04c23
Cache Allocation and Softmax Fixes (#2433)
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2022-11-02 10:48:18 -07:00
Jeff Rasley ec13da6ba7
add SD injection policy (#2381)
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
2022-10-13 16:47:12 -07:00
Andrey Chernykh cd3a70953a
Fix GPT Neo-X multi-gpu inference (#2401)
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2022-10-13 10:18:03 -07:00
lekurile 46a886c068
Change type to tuple in replace_wo_policy isinstance check (#2387)
Update the isinstance check inside the `replace_wo_policy` function to `tuple` and `str` instead of `dict`, since the layers are provided as a `tuple` type.

Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Molly Smith <mosm@microsoft.com>
Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
2022-10-07 15:32:10 -07:00
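A one-line illustration of the type fix (values illustrative):

```python
# The layers argument arrives as a tuple of module-name strings, so
# the check must accept tuple/str rather than dict.
layers = ("self_attention.dense", "mlp.dense_4h_to_h")
assert isinstance(layers, (tuple, str))
assert not isinstance(layers, dict)
```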
Ammar Ahmad Awan 993264388d
Inference profiling updates/fixes (#2348) (#2349)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2022-09-23 14:38:09 -07:00
Stas Bekman b146aa3523
[ds-inference] fix progress bar (#2286)
when loading a non-sharded checkpoint, update the progress bar (fix by @RezaYazdaniAminabadi) - I've tested it and it works.

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2022-09-04 18:12:36 -04:00
Reza Yazdani afdc72879f
Ds-inference Int8 support through ZeroQuant technology (#2217)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2022-08-30 16:39:34 -07:00
Molly Smith a7ee688a6f
Update replace_module.py, test-gptj.py related fix (#2269)
Fix RuntimeError: Boolean value of Tensor with more than one value is ambiguous when running test-gptj.py
2022-08-26 23:25:27 -07:00
Reza Yazdani c35bfe89f6
fix ds-inference without policy (#2247)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2022-08-23 14:44:09 -07:00
Arash Bakhtiari fae896ef60
Make OPT policy backward compatible with pre-OPT transformers versions (#2254) 2022-08-23 14:38:48 -07:00
Jeff Rasley dce3acaac7
allow saving ckpt w/o ckpt json + bloom copy fix (#2237) 2022-08-19 15:01:15 -07:00
Arash Bakhtiari 8b2a63717a
Add support of OPT models (#2205)
* add opt replace policy

* simplify inf. api

* fix opt replace policy

* fix use-cache & add relu

* Add support of custom MLP act. function

* Revert "simplify inf. api"

This reverts commit 9e910fcbd5471dec9b3c92008426f5ba590bf0b6.

* fix the inference API (temp. solution)

* fix code formatting

* add unit tests for OPT models.

* refactor pre-attention layer norm configuration

* add support of opt-350m model

* refactor the HF model config initialization

* fix hf model config issue

Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
2022-08-15 07:31:51 -07:00
Reza Yazdani 8920308c66
Fix the tensor-slicing copy for qkv parameters (#2198)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2022-08-10 09:34:57 -07:00
Reza Yazdani e7d9959540
fixing model partitioning without injection (#2179) 2022-08-03 20:49:11 -07:00
Reza Yazdani 556f005152
Fix random token-generation issue + MP-checkpoint loading/saving (#2132)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2022-07-28 17:24:07 -07:00
Alex Hedges 316c4a43e0
Add flake8 to pre-commit checks (#2051) 2022-07-25 16:48:08 -07:00
Michael Wyatt ee7ea3b805
use HF NeoX (#2087)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2022-07-19 12:50:58 -07:00
Stas Bekman 16699d839f
[ds-inference] checkpoint loading => tqdm (#2107)
* [ds-inference] checkpoint loading => tqdm

solves 2 issues:
- less noise, using a tqdm progress bar
- more informative - tells users how long to wait and how many shards to load

New way:

```
Loading 72 checkpoints:  12%|█▎        | 9/72 [01:12<08:39,  8.25s/it]
```

* write only from one process

* style
2022-07-19 09:21:19 -07:00
Reza Yazdani aa88137b8d
Add Inference support for running the BigScience-BLOOM Architecture (#2083)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2022-07-18 16:27:12 -07:00
Alex Hedges 76ea0534c1
Fix missing import in replace_module.py (#2050)
* Fix missing import in replace_module.py

* Change import from torch.distributed to deepspeed.comm
2022-06-29 17:16:26 +00:00
Jeff Rasley b666d5cd73
[inference] test suite for ds-kernels (bert, roberta, gpt2, gpt-neo, gpt-j) (#1992)
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
2022-06-15 14:21:19 -07:00
Ammar Ahmad Awan 36ad3119d5
DeepSpeed comm backend v1 (#1985)
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2022-06-10 16:47:33 -07:00
Reza Yazdani 8164ea9e6d
Fixing several bugs in the inference-api and the kernels (#1951)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2022-05-24 13:27:50 -07:00
Jeff Rasley b4fcd98ff0
Inference PP changes for neox (#1899)
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
2022-04-26 11:50:38 -07:00
Samyam Rajbhandari c13457b756
Supporting multiple modules injection with a single policy when they have identical architectures (#1869)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2022-03-30 17:47:19 +00:00