DeepSpeed/deepspeed/model_implementations
inkcherry d5a7c1e0b4
Capture short kernel sequences to graph (#4318)
**Motivation:**
1. There is a class of cases where short kernel sequences are launched
and executed serially (with no dynamic shapes), and the launch overhead is
much higher than the execution overhead. We can use a graph to solve
this problem. Compared to ```multi-tensor-apply```, using a graph is more
concise and only requires PyTorch as a dependency.
2. Some device software stacks also support lazy-mode PyTorch, enabling
full utilization of the compiler to perform graph optimization. However,
in lazy mode the operation accumulation time (host time) can become
significantly higher than the device time in such scenarios, and the
device is usually not well utilized. By using the same API as CUDA
graphs (after adding it to the accelerator abstraction, cc @delock),
this issue can also be resolved.

**Change:**
We modified three functions.

For ```update_hp_grads```, we execute the CPU and GPU operations separately, because a graph cannot record the execution of CPU operations. Additionally, the data consumed by the graph must keep a fixed address, or the address change must itself be captured (in that case, set ```replay_first_step``` to ```True```). Therefore, we changed ```grad=None``` to ```grad.zero_()```. Similarly, we placed some inputs that require fixed addresses in the ```graph_cache```.
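As a quick illustration of the fixed-address requirement (a minimal sketch, not DeepSpeed code): in-place zeroing keeps the gradient's storage address stable, which is exactly what a captured graph relies on.

```python
import torch

# Illustration of the fixed-address constraint (not DeepSpeed code).
# A captured graph bakes in the memory addresses of its inputs, so the
# gradient buffer must be reused in place rather than rebound.
grad = torch.ones(4)
addr_before = grad.data_ptr()

grad.zero_()                      # in-place: same storage, now all zeros
assert grad.data_ptr() == addr_before

# By contrast, `grad = None` followed by a fresh allocation on the next
# step yields a new buffer that a previously captured graph never sees.
```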

For ```clip_tensors_by_global_norm```, ```clip_coef``` is a scalar whose value changes between steps, so it needs to be moved to the GPU when using a graph.
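A minimal sketch of the idea (illustrative names and shapes, not the actual DeepSpeed implementation): keeping ```clip_coef``` as a device tensor instead of a Python float means a captured graph reads its current value on every replay.

```python
import torch

# Sketch with illustrative names: compute the clipping factor as a tensor
# so that graph replay picks up the fresh value instead of a baked-in
# Python scalar (shown on CPU; the same code runs on GPU tensors).
tensors = [torch.tensor([3.0, 4.0])]          # global norm = 5
max_norm = 1.0
total_norm = torch.sqrt(sum(t.pow(2).sum() for t in tensors))
clip_coef = torch.clamp(max_norm / (total_norm + 1e-6), max=1.0)  # stays a tensor
for t in tensors:
    t.mul_(clip_coef)             # in-place scale; tensor coef is graph-friendly
```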


For ```total_norm = sum([t.data.float().norm(norm_type).item() ** norm_type for t in input_tensors])```, ```item()``` is a synchronous operation, which is also not supported by graph capture. We instead perform the ```sum``` and ```** norm_type``` on the GPU.
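The device-side version can be sketched like this (a minimal, CPU-runnable illustration; the real code operates on GPU tensors):

```python
import torch

norm_type = 2.0
input_tensors = [torch.tensor([3.0, 4.0]), torch.tensor([5.0, 12.0])]

# Host-side version (not capturable: .item() synchronizes per tensor):
#   total = sum(t.data.float().norm(norm_type).item() ** norm_type
#               for t in input_tensors)

# Device-side version: the sum and ** norm_type stay as tensor ops,
# so no synchronization point is introduced during capture.
total_norm = sum(t.data.float().norm(norm_type) ** norm_type
                 for t in input_tensors)
total_norm = total_norm ** (1.0 / norm_type)   # final root shown for completeness
```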

Other similar scenarios can also use this ```graph_process()```, or a slightly modified version of it.
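A capture-once / replay-thereafter wrapper in the spirit of ```graph_process()``` might look like this (a hedged sketch: the actual signature and caching live in DeepSpeed's accelerator abstraction, and the eager fallback here is only for machines without CUDA):

```python
import torch

_graph_cache = {}

def graph_process(key, func, *args):
    """Capture `func` into a CUDA graph on first use, replay afterwards.

    Hypothetical helper modeled on the PR's graph_process(); falls back
    to eager execution when CUDA is unavailable.
    """
    if not torch.cuda.is_available():
        return func(*args)                      # eager fallback
    if key not in _graph_cache:
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):               # record the kernel sequence
            func(*args)
        _graph_cache[key] = g
        g.replay()                              # run the freshly captured graph
    else:
        _graph_cache[key].replay()              # subsequent calls just replay
```

Note that, per the constraint above, every tensor argument must keep a stable address across calls for the replayed graph to see current data.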

You can check out
[4abab21](4abab212c8) and set it to ```True``` there to do some benchmarking.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-12-20 20:51:36 +00:00