* Fix auto TP for duplicate modules with different gems
* precommit and comments
* Comment
* Combine gem list of same named modules
* remove duplicates from gem_list before updating policy
* Add module attribute with name variation for ProphetNet
---------
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
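A purely hypothetical sketch of the gem-list combining and de-duplication steps described in the commits above; `gem_map` and `combine_gems` are illustrative names, not the actual auto-TP code in deepspeed/module_inject.

```python
# Hypothetical sketch: merge gem lists for same-named modules and drop
# duplicates before the policy is updated (all names here are illustrative).
from collections import defaultdict

def combine_gems(named_modules):
    """named_modules: iterable of (module_name, gem_list) pairs."""
    gem_map = defaultdict(list)
    for name, gems in named_modules:
        gem_map[name].extend(gems)  # same-named modules share one combined list
    # remove duplicates while keeping a stable order
    return {name: list(dict.fromkeys(gems)) for name, gems in gem_map.items()}
```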
This PR refactors the organization of meta tensor checkpoint loading as follows:
- Move get_param_names() abstract method definition from TransformerPolicy into MetaTensorContainer
- Model-specific get_param_names() definitions moved from policy into model-specific container
- selected_policy_g, megatron_v2_g, and transformer_config_g globals replaced with a single container_g global, since the container will contain all of the information those globals previously captured
- ckpt_load_enabled flag added to containers; it defaults to False in the base.py container and is set to True when the MetaTensorContainer feature is inherited
- Assertion added to replace_transformer_layer before checkpoint loading, checking that ckpt_load_enabled == True; otherwise an error message is printed saying the container does not support meta tensor checkpoint loading.
The aim of these changes is to couple the meta tensor checkpoint loading code more closely to the MetaTensorContainer and to allow better error reporting when checkpoint loading is used on model types that don't support this feature.
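A rough sketch of the flag-and-assert pattern described above; the class and function names mirror the description, but the bodies are illustrative rather than the actual DeepSpeed implementation.

```python
# Illustrative sketch of the ckpt_load_enabled pattern.
class BaseTransformerContainer:
    def __init__(self):
        # meta tensor checkpoint loading is off unless a container opts in
        self.ckpt_load_enabled = False

class MetaTensorContainer(BaseTransformerContainer):
    def __init__(self):
        super().__init__()
        self.ckpt_load_enabled = True

    def get_param_names(self):
        # model-specific containers override this with their parameter names
        raise NotImplementedError

def replace_transformer_layer(container, checkpoint=None):
    if checkpoint is not None:
        assert container.ckpt_load_enabled, \
            f"{type(container).__name__} does not support meta tensor checkpoint loading"
        # ... perform checkpoint loading ...
```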
This PR cleans up some container items and removes an unused qkv_merging parameter:
- Remove qkv_merging=True from BERT containers
- Change containers config object to ds_model_config
- Remove qkv_merging param
* Integrate accelerator abstraction interface into deepspeed/
* Fix error message in fp16/fused_optimizer
* fix error message in fp16/unfused_optimizer.py
* assign get_accelerator().pin_memory() result to input Tensor name
* no need to check cuda and whether nvtx supported
* move try-except into inner most block
* call Event() and Stream() in get_accelerator() for data type
* Make Stream and Event properties of the abstract interface so they can be used as data types in deepspeed
* Apply op_builder backend api change from #2705 from @jeffra
* fix tests where Builder NAME is used
* keep original ...Builder.NAME interface instead of ...Builder().NAME interface
* fix builder closure for installation
* fix randomltd builder
* add comments to clarify create_op_builder and get_op_builder
* fix compatibility with pip install -e
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
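A short usage sketch of the accelerator abstraction referenced in the commits above; `get_accelerator()` is the entry point in `deepspeed.accelerator`, but the surrounding code is only illustrative.

```python
import torch
from deepspeed.accelerator import get_accelerator

# device-agnostic replacements for direct torch.cuda usage
device = get_accelerator().device_name()            # e.g. "cuda" on NVIDIA backends
stream = get_accelerator().Stream()                 # Stream exposed by the abstract interface
event = get_accelerator().Event(enable_timing=True) # Event usable as a data type

# pin_memory result assigned back to the same tensor name, as in the commit above
buf = torch.empty(1024)
buf = get_accelerator().pin_memory(buf)
```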
* loop through pipe.model
* tp_parser first draft
* client_module must be type object
* Simplify layernorm tracking. Add unittest.
* cleanup
* Add more models to unittest
* cleanup inference pytest for merging
* Add unittest
* cleanup
* pre-commit
* unittest id and pytest marker
* try marian for unittest
* precommit
* Move tp code to separate file
* Add new auto tp file
* pre-commit and type
* Update deepspeed/module_inject/auto_tp.py
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
* Update deepspeed/module_inject/auto_tp.py
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
* Update tests/unit/inference/test_inference.py
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
* remove unused fillmask function
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
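For the automatic tensor-parallelism (auto TP) work above, a hedged usage sketch: with kernel injection disabled, DeepSpeed inference falls back to the auto-TP parser in deepspeed/module_inject/auto_tp.py. The model and sizes are illustrative (a Marian model, matching the unit test mentioned above).

```python
import torch
import deepspeed
from transformers import AutoModelForSeq2SeqLM

# illustrative model; auto TP is exercised when kernel injection is off
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-de")

model = deepspeed.init_inference(
    model,
    mp_size=2,                         # tensor-parallel degree
    dtype=torch.float16,
    replace_with_kernel_inject=False,  # no custom kernels -> auto TP parser is used
)
```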
* fix Opt injection & add injection verification check at inference test
* fix several issues
* remove fixture
* remove check_injection when no kernel is injected
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
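A minimal sketch of an injection verification check like the one added to the inference test above; detecting replaced modules by class name is an assumption, not the exact test code.

```python
def check_injection(model):
    # assumption: kernel injection replaces transformer blocks with
    # DeepSpeedTransformerInference modules, so at least one should be present
    replaced = [
        m for m in model.modules()
        if m.__class__.__name__ == "DeepSpeedTransformerInference"
    ]
    assert len(replaced) > 0, "No DeepSpeed inference modules were injected"
```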
This PR removes the zero-inference GatheredParameters context from replace_with_policy, since zero-inference is no longer needed after the introduction of meta tensor support for BLOOM.
This PR updates the MegatronLayerPolicy to set megatron_v2=True, which is required in order to properly transpose in the replace_with_policy() function.
After the change in this PR, in conjunction with PR #99 in the Megatron-DeepSpeed fork, the Megatron text-generation example works with DS inference.
* fix checkpoint loading when it is a dictionary
* fix some issues with saving ckpt & int8 inference
* fix quantized-inference & add generic support of checkpoint loading
* remove int8 hard-coded flag
* fix mlp return tensors
* fix several issues with loading checkpoints of GPT-J, GPT-NeoX, and OPT with different TP sizes
* add more comments & description for checkpoint-loading module
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
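A hedged sketch of the generic checkpoint-loading path described above: a small JSON file lists the shards and is passed to deepspeed.init_inference via the checkpoint argument. The file name, descriptor fields, and model are illustrative.

```python
import json
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# illustrative checkpoint descriptor; the exact fields depend on the model family
ckpt_desc = {
    "type": "ds_model",
    "version": 1.0,
    "checkpoints": ["shard_00.pt", "shard_01.pt"],
}
with open("checkpoints.json", "w") as f:
    json.dump(ckpt_desc, f)

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
engine = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.float16,
    checkpoint="checkpoints.json",
    replace_with_kernel_inject=True,
)
```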
* pass down the new DS inference config to replace_transformer_layer.
* remove quantize_settings and rename the ep_mp_group.
* Fix model_config passing. Fixes gptj issue with wrong output.
* fix small bug in gpt-neo.
Co-authored-by: Reza Yazdani and Michael Wyatt
Changes to the inference API to accept a config dict, and cleanup of the Inference Engine to utilize the newly added inference config.
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
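A hedged usage sketch of the config-dict style of the inference API referenced above; the keys shown are common DeepSpeed inference config fields, but treat the exact set as an assumption.

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# the whole inference setup is expressed as one config dict
ds_config = {
    "dtype": torch.float16,
    "tensor_parallel": {"tp_size": 1},
    "replace_with_kernel_inject": True,
}
engine = deepspeed.init_inference(model, config=ds_config)
```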
Update the isinstance check inside the `replace_wo_policy` function to `tuple` and `str` instead of `dict`, since the layers are provided as a `tuple` type.
Co-authored-by: Lev Kurilenko <lekurile@microsoft.com>
Co-authored-by: Molly Smith <mosm@microsoft.com>
Co-authored-by: Lok Chand Koppaka <lokoppak@microsoft.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
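For the `replace_wo_policy` isinstance change above, a tiny illustrative sketch; the helper name and surrounding logic are assumptions, only the `(tuple, str)` check reflects the description.

```python
# Illustrative only: layer names reach replace_wo_policy as a tuple of names
# (or a single string), so membership checks must not assume a dict.
def contains_layer(layers, name):
    if isinstance(layers, str):
        layers = (layers,)
    assert isinstance(layers, tuple), "layers are expected as a tuple of names"
    return name in layers
```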
When loading a non-sharded checkpoint, update the progress bar (fix by @RezaYazdaniAminabadi) - I've just tested it and it works.
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* [ds-inference] checkpoint loading => tqdm
Solves two issues:
- less noise, using a tqdm progress bar
- more informative: tells users how long to wait and how many shards to load
New way:
```
Loading 72 checkpoints: 12%|█▎ | 9/72 [01:12<08:39, 8.25s/it]
```
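A minimal sketch of the tqdm-based loading loop behind the output above, writing the progress bar from a single process; the loading helper itself is a stand-in.

```python
import torch
from tqdm import tqdm

def load_shards(checkpoint_files, rank=0):
    shards = []
    # disable=rank != 0 keeps only one process writing the progress bar
    for path in tqdm(checkpoint_files,
                     desc=f"Loading {len(checkpoint_files)} checkpoints",
                     disable=rank != 0):
        shards.append(torch.load(path, map_location="cpu"))  # stand-in loader
    return shards
```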
* write only from one process
* style
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* DeepSpeedInferenceConfig
get epsilon value from config
* epsilon -> layer_norm_eps
to keep the variable name the same as in DeepSpeedTransformerConfig
* DeepSpeedTransformerConfig
get epsilon value from config
* configurable stochastic_mode
e.g.:
1. True for LM pre-training
2. False for LM fine-tuning on a task
* Updated replace_module.py
check whether layer_norm_eps is an attribute of the config, defaulting to 1e-12 (see the sketch below)
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
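A sketch of the fallback described in the replace_module.py item above, assuming a Hugging Face style config object; 1e-12 matches the stated default, the helper name is illustrative.

```python
def get_layer_norm_eps(hf_config):
    # fall back to 1e-12 when the model config does not define layer_norm_eps
    return getattr(hf_config, "layer_norm_eps", 1e-12)
```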
* Fix typos in docs/
* Fix typos in code comments and output strings
* Fix typos in the code itself
* Fix typos in tests/
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* fix inference API for FP32 and non-masking GPT-based models
* use a dummy tensor if input_mask is None (see the sketch after this block)
* fix input_mask
* minor fix
* send input_mask to the compute_attn function for checking
* fix links for inference tutorial
* Fix automatic injection. Add the local-attention for GPT-Neo
* fix the inference for generation of large sequences (>1K & <32K)
* fix format
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
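A minimal sketch of the dummy input_mask handling from the bullets above; the shapes, function name, and downstream use are illustrative.

```python
import torch

def prepare_attention_inputs(hidden_states, input_mask=None):
    # non-masking GPT-style models may pass no mask at all; substitute a dummy
    # all-zeros tensor so compute_attn-style code can check it uniformly
    if input_mask is None:
        batch, seq_len = hidden_states.shape[:2]
        input_mask = torch.zeros(batch, seq_len,
                                 dtype=hidden_states.dtype,
                                 device=hidden_states.device)
    return hidden_states, input_mask
```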
* fix the bias-add precision and indexing, and add layer-norm-eps as a configurable parameter for the transformer
* add ACC_HALF config
* use defined() to check whether ACC_HALF is defined