diff --git a/docs/source/main_classes/trainer.rst b/docs/source/main_classes/trainer.rst
index c0ce5c4e2..f8527ee88 100644
--- a/docs/source/main_classes/trainer.rst
+++ b/docs/source/main_classes/trainer.rst
@@ -290,6 +290,16 @@ full support for:

 1. Optimizer State Partitioning (ZeRO stage 1)
 2. Gradient Partitioning (ZeRO stage 2)
+3. Custom fp16 handling
+4. A range of fast CUDA-extension-based optimizers
+5. ZeRO-Offload
+
+ZeRO-Offload has its own dedicated paper: `ZeRO-Offload: Democratizing Billion-Scale Model Training
+<https://arxiv.org/abs/2101.06840>`__.
+
+DeepSpeed is currently used only for training, as none of the currently available features are useful for inference.
+
+
 Installation
 =======================================================================================================================

@@ -329,6 +339,11 @@ Unlike, ``torch.distributed.launch`` where you have to specify how many GPUs to
 full details on how to configure various nodes and GPUs can be found `here `__.

+In fact, you can continue using ``-m torch.distributed.launch`` with DeepSpeed as long as you don't need to use
+``deepspeed`` launcher-specific arguments. Typically, if you don't need a multi-node setup, you're not required to use
+the ``deepspeed`` launcher. But since the DeepSpeed documentation uses it everywhere, we will use it here as well for
+consistency.
+
 Here is an example of running ``finetune_trainer.py`` under DeepSpeed deploying all available GPUs:

 .. code-block:: bash

@@ -402,12 +417,42 @@ find more details in the discussion below.

 For a practical usage example of this type of deployment, please, see this `post `__.

+Notes:
+
+- if you need to run on a specific GPU other than GPU 0, you can't use ``CUDA_VISIBLE_DEVICES`` to limit the
+  visible scope of available GPUs. Instead, you have to use the following syntax:
+
+  .. code-block:: bash
+
+      deepspeed --include localhost:1 ./finetune_trainer.py
+
+  In this example, we tell DeepSpeed to use GPU 1.
+
+
 Configuration
 =======================================================================================================================

 For the complete guide to the DeepSpeed configuration options that can be used in its configuration file please refer
 to the `following documentation `__.

+You can find dozens of DeepSpeed configuration examples that address various practical needs in `the DeepSpeedExamples
+repo <https://github.com/microsoft/DeepSpeedExamples>`__:
+
+.. code-block:: bash
+
+    git clone https://github.com/microsoft/DeepSpeedExamples
+    cd DeepSpeedExamples
+    find . -name '*json'
+
+Continuing from the example above, let's say you want to configure the Lamb optimizer. You can then search through the
+example ``.json`` files with:
+
+.. code-block:: bash
+
+    grep -i Lamb $(find . -name '*json')
+
+Some more examples can be found in the `main repo <https://github.com/microsoft/DeepSpeed>`__ as well.
+
 While you always have to supply the DeepSpeed configuration file, you can configure the DeepSpeed integration in
 several ways:

@@ -547,7 +592,11 @@ Notes:
 - ``"overlap_comm": true`` trades off increased GPU RAM usage to lower all-reduce latency. ``overlap_comm`` uses 4.5x
   the ``allgather_bucket_size`` and ``reduce_bucket_size`` values. So if they are set to 5e8, this requires a 9GB
   footprint (``5e8 x 2Bytes x 2 x 4.5``). Therefore, if you have a GPU with 8GB or less RAM, to avoid getting
-  OOM-errors you will need to reduce those parameters to about ``2e8``, which would require 3.6GB.
+  OOM-errors you will need to reduce those parameters to about ``2e8``, which would require 3.6GB. You will want to do
+  the same on a larger-capacity GPU as well if you start hitting OOM.
+- when reducing these buffers you're trading communication speed for more available GPU RAM. The smaller the buffer
+  size, the slower the communication and the more GPU RAM is available to other tasks. So if a bigger batch size is
+  important, a slightly slower training time could be a good trade, as in the sketch below.
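+
+  As an illustration only, here is a partial sketch (not a complete DeepSpeed configuration file) of a ZeRO stage 2
+  ``zero_optimization`` section with both buffers reduced to ``2e8``; merge these keys into your full configuration
+  and adjust the values to your GPU:
+
+  .. code-block:: json
+
+      {
+          "zero_optimization": {
+              "stage": 2,
+              "overlap_comm": true,
+              "allgather_bucket_size": 2e8,
+              "reduce_bucket_size": 2e8
+          }
+      }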

 This section has to be configured exclusively via DeepSpeed configuration - the :class:`~transformers.Trainer` provides
 no equivalent command line arguments.
@@ -717,6 +766,11 @@ Main DeepSpeed Resources
 - `API docs `__
 - `Blog posts `__

+Papers:
+
+- `ZeRO: Memory Optimizations Toward Training Trillion Parameter Models <https://arxiv.org/abs/1910.02054>`__
+- `ZeRO-Offload: Democratizing Billion-Scale Model Training <https://arxiv.org/abs/2101.06840>`__
+
 Finally, please, remember that, HuggingFace :class:`~transformers.Trainer` only integrates DeepSpeed, therefore if you
 have any problems or questions with regards to DeepSpeed usage, please, file an issue with `DeepSpeed GitHub
 <https://github.com/microsoft/DeepSpeed/issues>`__.