DeepSpeed/deepspeed/linear
Jeff Rasley 6e5d58d248
OptimizedLinear updates (#5791)
This is a refresh of `OptimizedLinear` with the following features to
improve performance and usability:
* More efficient sharing of base weights using `all_gather_into_tensor`
* Flattened sharded weights
* Selective offloading of frozen weights to CPU
* `deepspeed.linear.Init` context manager that allows injecting OptimizedLinear during
  model construction (similar to zero.Init); see the usage sketch after this list
* Support for loading state dicts directly into OptimizedLinear, which allows
  loading HF model weights correctly into sharded params
* Various bug fixes for the LoRA implementation introduced previously
* Several new unit tests
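
Roughly, the new context manager is meant to be used as below. This is a minimal sketch, assuming `Init` accepts `lora_config`/`quant_config` keyword arguments and that the config field names (`lora_r`, `lora_alpha`, `base_weight_sharding`, `offload`, `q_bits`) match `deepspeed.linear.config`; none of these specifics are confirmed by this commit message, and the model name is a placeholder.

```python
# Minimal usage sketch (assumed API, not a verified reference).
import torch
from transformers import AutoModelForCausalLM
from deepspeed.linear import Init, LoRAConfig, QuantizationConfig

lora_config = LoRAConfig(
    lora_r=64,               # assumed field names; check deepspeed.linear.config
    lora_alpha=16,
    base_weight_sharding=8,  # shard the frozen base weights across 8 ranks
    offload=True,            # selectively offload frozen weights to CPU
)
quant_config = QuantizationConfig(q_bits=8)  # FP8 quantization of dense weights

# Injects OptimizedLinear in place of torch.nn.Linear while the model is being
# constructed, similar in spirit to deepspeed.zero.Init.
with Init(lora_config=lora_config, quant_config=quant_config):
    model = AutoModelForCausalLM.from_pretrained("org/placeholder-model")
```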
 
Builds on top of @RezaYazdaniAminabadi's previous FP8 updates (#5764) to
support dense model FP8 quantization.
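
For dense FP8 quantization without LoRA, a standalone layer can be constructed directly. Again a hedged sketch: the `input_dim`/`output_dim`/`quantization_config` parameter names and the `q_bits` field are assumptions about the `OptimizedLinear` constructor, not confirmed here.

```python
# Hypothetical standalone FP8 usage; parameter and field names are assumptions.
import torch
from deepspeed.linear import OptimizedLinear, QuantizationConfig

layer = OptimizedLinear(
    input_dim=4096,
    output_dim=4096,
    quantization_config=QuantizationConfig(q_bits=8),  # store weights in FP8
    dtype=torch.bfloat16,
)

x = torch.randn(2, 4096, dtype=torch.bfloat16)
y = layer(x)  # weights are dequantized as needed during the forward pass
```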

Example usage of this to fine-tune llama-3.1-405B on a single node:
https://github.com/Snowflake-Labs/snowflake-arctic/tree/main/training/llama3.1

---------

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
Co-authored-by: Reza Yazdani <152926435+sfc-gh-reyazda@users.noreply.github.com>
2024-08-13 23:36:22 +00:00
__init__.py OptimizedLinear updates (#5791) 2024-08-13 23:36:22 +00:00
config.py OptimizedLinear updates (#5791) 2024-08-13 23:36:22 +00:00
context_manager.py OptimizedLinear updates (#5791) 2024-08-13 23:36:22 +00:00
optimized_linear.py OptimizedLinear updates (#5791) 2024-08-13 23:36:22 +00:00
quantization.py OptimizedLinear updates (#5791) 2024-08-13 23:36:22 +00:00