DeepSpeed/deepspeed/model_implementations
inkcherry d5a7c1e0b4
Capture short kernel sequences to graph (#4318)
**Motivation:**
1. There is a class of cases where short kernel sequences are launched
and executed serially (with no dynamic shapes), and the launch overhead is
much higher than the execution overhead. We can use a graph to solve
this problem. Compared to ```multi-tensor-apply```, using a graph is more
concise and only requires PyTorch as a dependency.
2. Some device software stacks also support lazy-mode PyTorch, enabling
full utilization of the compiler to perform graph optimization. However,
in lazy mode the operation accumulation time (host time) can become
significantly higher than the device time in such scenarios, and the
device is usually not well utilized. By using the same API as CUDA
graphs (after adding it to the accelerator abstraction, cc @delock),
this issue can also be resolved.

**Change:**
We modified three functions.

For ```update_hp_grads```, we execute the CPU and GPU operations separately, because a graph cannot record the execution of CPU operations. Additionally, the data consumed by the graph must keep a fixed address, or the address change must itself be captured (in that case, set ```replay_first_step``` to ```True```). Therefore, we changed ```grad=None``` to ```grad.zero_()```. Similarly, we placed some inputs that require fixed addresses in the ```graph_cache```.
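As a quick illustration of the fixed-address requirement (a minimal sketch, not DeepSpeed code): in-place zeroing keeps the gradient's storage address stable, which is exactly what a captured graph relies on.

```python
import torch

# Illustration of the fixed-address constraint (not DeepSpeed code).
# A captured graph bakes in the memory addresses of its inputs, so the
# gradient buffer must be reused in place rather than rebound.
grad = torch.ones(4)
addr_before = grad.data_ptr()

grad.zero_()                      # in-place: same storage, now all zeros
assert grad.data_ptr() == addr_before

# By contrast, `grad = None` followed by a fresh allocation on the next
# step yields a new buffer that a previously captured graph never sees.
```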

For ```clip_tensors_by_global_norm```, ```clip_coef``` is a scalar whose value changes between steps, so it needs to be moved to the GPU when using a graph.
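A minimal sketch of the idea (illustrative names and shapes, not the actual DeepSpeed implementation): keeping ```clip_coef``` as a device tensor instead of a Python float means a captured graph reads its current value on every replay.

```python
import torch

# Sketch with illustrative names: compute the clipping factor as a tensor
# so that graph replay picks up the fresh value instead of a baked-in
# Python scalar (shown on CPU; the same code runs on GPU tensors).
tensors = [torch.tensor([3.0, 4.0])]          # global norm = 5
max_norm = 1.0
total_norm = torch.sqrt(sum(t.pow(2).sum() for t in tensors))
clip_coef = torch.clamp(max_norm / (total_norm + 1e-6), max=1.0)  # stays a tensor
for t in tensors:
    t.mul_(clip_coef)             # in-place scale; tensor coef is graph-friendly
```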


For ```total_norm = sum([t.data.float().norm(norm_type).item() ** norm_type for t in input_tensors])```, ```item()``` is a synchronous operation, which is also not supported by graph capture. We instead perform the ```sum``` and ```** norm_type``` on the GPU.
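The device-side version can be sketched like this (a minimal, CPU-runnable illustration; the real code operates on GPU tensors):

```python
import torch

norm_type = 2.0
input_tensors = [torch.tensor([3.0, 4.0]), torch.tensor([5.0, 12.0])]

# Host-side version (not capturable: .item() synchronizes per tensor):
#   total = sum(t.data.float().norm(norm_type).item() ** norm_type
#               for t in input_tensors)

# Device-side version: the sum and ** norm_type stay as tensor ops,
# so no synchronization point is introduced during capture.
total_norm = sum(t.data.float().norm(norm_type) ** norm_type
                 for t in input_tensors)
total_norm = total_norm ** (1.0 / norm_type)   # final root shown for completeness
```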

Other similar scenarios can also use this ```graph_process()```, or a slightly modified version of it.
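A capture-once / replay-thereafter wrapper in the spirit of ```graph_process()``` might look like this (a hedged sketch: the actual signature and caching live in DeepSpeed's accelerator abstraction, and the eager fallback here is only for machines without CUDA):

```python
import torch

_graph_cache = {}

def graph_process(key, func, *args):
    """Capture `func` into a CUDA graph on first use, replay afterwards.

    Hypothetical helper modeled on the PR's graph_process(); falls back
    to eager execution when CUDA is unavailable.
    """
    if not torch.cuda.is_available():
        return func(*args)                      # eager fallback
    if key not in _graph_cache:
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):               # record the kernel sequence
            func(*args)
        _graph_cache[key] = g
        g.replay()                              # run the freshly captured graph
    else:
        _graph_cache[key].replay()              # subsequent calls just replay
```

Note that, per the constraint above, every tensor argument must keep a stable address across calls for the replayed graph to see current data.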

You can check out
[4abab21](4abab212c8) and set it to ```True``` there to do some benchmarking.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-12-20 20:51:36 +00:00