# Efficient data loading for large training workloads
One key business objective when training AI models is to keep the GPUs on your compute fully utilized so that costs stay as low as possible (no idle compute). Serving training data to the GPU in a performant manner goes a long way toward full utilization. If data is served to the GPU slowly relative to the processing of an epoch, the GPU may sit idle while it waits for the data to arrive.
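As a minimal sketch (the dataset shape and batch size are placeholders, not values from this guide), a PyTorch `DataLoader` can be configured to keep the GPU fed by using multiple worker processes, pinned memory and prefetching:
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; in a real job this would be your tokenized training data.
dataset = TensorDataset(torch.randn(10_000, 512))

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=8,            # CPU worker processes preparing batches ahead of the GPU
    pin_memory=True,          # page-locked host memory speeds up host-to-GPU copies
    prefetch_factor=4,        # batches each worker keeps ready in advance
    persistent_workers=True,  # avoid re-spawning workers at every epoch
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for (batch,) in loader:
    # non_blocking=True overlaps the copy with compute when pin_memory is enabled
    batch = batch.to(device, non_blocking=True)
    # ... forward/backward pass would go here ...
```
Tune ``num_workers`` against the CPU cores available on the training SKU: too few workers starves the GPU, while too many oversubscribes the CPU and the storage backend.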
## Optimized Environment for large scale distributed training
To effectively run optimized and significantly faster training and inference for large models on AzureML, we recommend the new Azure Container for PyTorch (ACPT) environment, which includes the best of Microsoft technologies for training with PyTorch on Azure. In addition to AzureML packages, this environment includes the latest training optimization technologies, such as [Onnx / Onnx Runtime / Onnx Runtime Training](https://onnxruntime.ai/).
Operationalizing a model is about taking the model that has been trained and evaluated and making it available for use in a live production environment.
description: An official step-by-step guide of best practices, techniques and optimizations for running large scale distributed training on AzureML. Includes all aspects of the data science workflow needed to manage an enterprise-grade MLOps lifecycle, from resource setup and data loading to training optimizations, evaluation and optimizations for inference.
---
# AzureML Large Scale Deep Learning Best Practices
This example focuses on pretraining a BERT model for Masked Language Modeling (MLM) on the GLUE dataset. BERT is a large model, and in this article you will learn tips and tricks to train it with high compute and memory efficiency without impacting model quality.
The BigScience Large Open-science Open-access Multilingual Language Model (BLOOM) is an autoregressive language model based on the GPT-3 architecture. BLOOM is trained on data from 46 natural languages and 13 programming languages and is the largest publicly available open multilingual model. Training a model of this size required multiple optimizations to run efficiently; this guide details the process.
## Using DeepSpeed Autotuning to generate an optimal DeepSpeed configuration file
DeepSpeed Autotuning finds the configuration file that maximizes the training speed and memory efficiency of a model for a given hardware configuration. This gives users the best possible performance without spending time manually tweaking hyperparameters.
Large scale training has led to state-of-the-art accuracies across a range of tasks, and numerous customers have been using Azure Machine Learning to train models with millions or billions of parameters. While large scale training delivers high accuracy, it also comes with challenges.
This guide shows best practices that allow you to train large models very efficiently with high throughput in AzureML, leveraging full GPU utilization to keep costs low.
- [Setup](#setup)
- [Estimate Your Memory Requirements](#estimate-memory-requirements)
- [Compute Cluster](#compute-cluster)
- [Linear Scaling with Infiniband Enabled SKUs](#linear-scaling-with-infiniband-enabled-skus)
- [Environment](#environment)
- [Data Loading](#data-loading)
- [Training Optimizations for Compute and Memory Efficiency](#optimizations)
For a large training job, it's important to know how much memory is required by the model parameters, gradients and optimizer states. In addition, you will also need enough memory to fit activation calculations and any temporary memory for intermediate calculations, which for long sequences can be significant. Here is an estimated calculation for a model trained with FP16 precision and the Adam optimizer:
```
FP16 parameter: 2 bytes
FP16 gradient: 2 bytes
FP32 optimizer states (Adam: FP32 parameter copy, momentum and variance): 12 bytes
Total: ~16 bytes per model parameter
```
DeepSpeed provides an [API to estimate memory usage for model state consumption (but not activations)](https://deepspeed.readthedocs.io/en/latest/memory.html), along with several accelerations/optimizations to reduce GPU memory.
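For example, by this estimate a 10-billion-parameter model needs roughly 10B × 16 bytes ≈ 160 GB of GPU memory for model states alone, before activations, which is why these states must be partitioned across GPUs with techniques such as ZeRO. The DeepSpeed estimation API linked above can be run offline against your model as a quick sanity check; below is a minimal sketch, assuming the `transformers` and `deepspeed` packages are installed and using `bert-large-uncased` purely as an illustration:
```python
from transformers import AutoModelForMaskedLM
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Any torch.nn.Module works here; bert-large-uncased is just an example.
model = AutoModelForMaskedLM.from_pretrained("bert-large-uncased")

# Prints the per-GPU and per-node memory needed for parameters, gradients and
# optimizer states under ZeRO Stage 3 (activations are NOT included in this estimate).
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=2)
```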
Ideally, as the number of VMs training a given model increases, the time to train that model should decrease linearly. For instance, if training a model on one VM takes 100 seconds, training the same model on two VMs should ideally take 50 seconds. Also ideally, model quality/accuracy should not be affected by the number of VMs used. Linear scaling is the ideal, but as the number of machines increases, the communication cost among nodes also increases; an important step toward attaining linear scaling is to use InfiniBand, which helps offset this cost and increase throughput.
#### **Linear Scaling with Infiniband Enabled SKUs**
AzureML offers optimized supercomputer hardware with high bandwidth interconnects to enable low latency, GPU-to-GPU communication across nodes in a cluster.
GPUs within a node are connected by NVLink and NVSwitch; GPUs across nodes are connected by NVIDIA Mellanox 200 Gbps InfiniBand cards, providing 2.8 exaflop/s of peak AI performance in aggregate.
> While InfiniBand helps attain linear scaling, there are other factors that can impact linear scaling; these, along with solutions, are covered later in this document.
- ### **Environment**
The recommended environment for a large scale distributed training job is an Azure Container for PyTorch (ACPT) environment, which ships with several built-in optimization libraries and is described in more detail [here](../Environment/ACPT.md). This environment is built and ready to use under the 'Environments' tab in AzureML studio. Some of the optimization technologies included in the environment are listed below; a YAML snippet showing how a job references this environment follows the list.
- Onnx Runtime: built-in optimizations that deliver up to 1.4X faster training
- DeepSpeed: trains trillion-parameter models at low cost by achieving excellent system throughput and scaling efficiently to thousands of GPUs
- Nebula: a new fast checkpointing feature that saves your checkpoints up to 1000 times faster
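For reference, the sketch below shows how a command job YAML can point at a curated ACPT environment. The compute and environment names are placeholders, not values from this guide; substitute the ACPT image listed under the 'Environments' tab that matches your PyTorch/CUDA versions:
```
command: python train.py
compute: azureml:<your-gpu-cluster>
environment: azureml:<curated-ACPT-environment-name>@latest
```
Referencing the environment with ``@latest`` keeps the job on the most recent build of that curated image.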
To load data in the most efficient way with large scale distributed training jobs, follow [this guide](../Data-loading/data-loading.md).
## Optimizations
To achieve the best possible performance and resource utilization of jobs on AzureML, we employ several different optimization tools, showcased below.
- ### **DeepSpeed**
[DeepSpeed](https://github.com/microsoft/DeepSpeed) is an open-source library developed by Microsoft that optimizes the training of large deep learning models. It aims to reduce the time and memory requirements needed for training large models with trillions of parameters on distributed GPU clusters.
An example showing this implementation can be found [here](https://github.com/Azure/azureml-examples/tree/main/cli/jobs/deepspeed/deepspeed-training).
For a full set of DeepSpeed features see this [API doc](https://www.deepspeed.ai/docs/config-json/).
When running a job with DeepSpeed, it is always necessary to include a ``ds_config.json`` file containing the configuration that DeepSpeed will use for training. However, it is hard to know which settings are best for your scenario. This is where autotuning comes in. [DeepSpeed Autotuning](https://www.deepspeed.ai/tutorials/autotuning/) will find the configuration that maximizes the training speed and memory efficiency of a model for a given hardware configuration, giving users the best possible performance without spending time manually tweaking hyperparameters. There are several configuration values in particular that Autotuning will help find the best settings for (a minimal example configuration is sketched after the list below):
- ``train_micro_batch_size_per_gpu`` - The batch size for a single step on a GPU.
- ``gradient_accumulation_steps`` - Number of training steps over which gradients are accumulated before an optimizer update. Increasing this allows training with larger effective batch sizes.
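A minimal sketch of such a configuration is shown below, based on the DeepSpeed Autotuning tutorial linked above; the ``arg_mappings`` entries assume a HuggingFace-style training script and should be adapted to your own script's argument names:
```
{
  "train_micro_batch_size_per_gpu": "auto",
  "fp16": { "enabled": true },
  "autotuning": {
    "enabled": true,
    "fast": false,
    "arg_mappings": {
      "train_micro_batch_size_per_gpu": "--per_device_train_batch_size",
      "gradient_accumulation_steps": "--gradient_accumulation_steps"
    }
  }
}
```
The search is then started by adding ``--autotuning run`` (or ``--autotuning tune`` to only search for the best configuration without launching the full run) to the ``deepspeed`` launcher command.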
In addition to DeepSpeed, we can also use the HuggingFace [Optimum](https://huggingface.co/docs/optimum/index) library and [Onnx Runtime](https://onnxruntime.ai/docs/) (ORT) to optimize our training. ORT can provide several benefits to a training job, including flexibility across different hardware configurations and memory optimizations that allow larger models to fit than with base PyTorch. More details on how exactly Onnx Runtime improves training time and throughput can be found [here](https://huggingface.co/blog/optimum-onnxruntime-training).
```
--optim adamw_ort_fused
```
This is an extra argument added with ORTTrainingArguments that applies the Fused Adam Optimizer for a little extra performance gain. For a training example that uses ORT, see the [BERT Pretrain example](./Bert-Pretrain/README.md).
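The sketch below illustrates how these pieces fit together in a script. It is not the code of the linked BERT example; the model, dataset and output paths are placeholders, and the exact ``ORTTrainer`` arguments can vary between Optimum versions:
```python
from datasets import load_dataset
from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling

model_name = "bert-base-uncased"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Tiny illustrative dataset; a real pretraining job would use a much larger corpus.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

args = ORTTrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=8,
    fp16=True,
    optim="adamw_ort_fused",  # ORT fused Adam, as discussed above
)

trainer = ORTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer),
)
trainer.train()
```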
Machine learning model training is usually an iterative process and requires significant experimentation. With the Azure Machine Learning interactive job experience, we can access the container where the job is running to iterate on training scripts, monitor progress and even debug the job remotely, just as we would on a local machine.
Depending on the tool you want to use, add the corresponding service to your Azure CLI v2 command job YAML file:
```
services:
  my_jupyterlab:
    job_service_type: jupyter_lab
  my_vscode:
    job_service_type: vs_code
    nodes: all
```
To access these services once the job starts, go to the job overview page and click on ``Monitor and Debug``. This will open a sidebar page like the one in the image below, showing links to JupyterLab, TensorBoard and VSCode.
With TensorBoard we can monitor metrics while the job is running. It can also show resource utilization via the PyTorch Profiler (more on this later).
For more information on interacting with jobs, see [this page](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-interactive-jobs?tabs=ui).
Given how long training can take and how scarce compute resources may be for a large scale training job, it is important to monitor resource utilization. For a clear and concise way to do this while a job is running, we can use the PyTorch Profiler.
If you are using the HuggingFace Transformers library in your training script, one way to start using the profiler is to use a custom HuggingFace trainer callback.
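A minimal sketch of such a callback is shown below; it assumes ``trainer`` is the HuggingFace ``Trainer`` already built in your training script, and the trace directory ``./outputs/profiler`` is an illustrative choice:
```python
import torch.profiler
from transformers import TrainerCallback

class ProfilerCallback(TrainerCallback):
    """Advance the PyTorch profiler by one step after every training step."""
    def __init__(self, profiler):
        self.profiler = profiler

    def on_step_end(self, args, state, control, **kwargs):
        self.profiler.step()

prof = torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./outputs/profiler"),
    profile_memory=True,
    with_stack=True,
)

prof.start()
trainer.add_callback(ProfilerCallback(prof))  # `trainer` is your existing HuggingFace Trainer
trainer.train()
prof.stop()
```
The resulting traces show up under the TensorBoard link in the ``Monitor and Debug`` panel described earlier.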
If you are not using the HuggingFace Transformers ``Trainer`` class in your training script and instead using your own training loop, try [this tutorial](https://pytorch.org/tutorials/intermediate/tensorboard_profiler_tutorial.html).
The DeepSpeed Flops Profiler provides users with metrics that can help them understand performance and spot inefficiencies. More information can be found [here](https://www.deepspeed.ai/tutorials/flops-profiler/). To enable the Flops Profiler while using DeepSpeed in your jobs, you can pass the `flops_profiler` settings to ds_config.json:
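For example, a configuration along the following lines enables the profiler for a single step (field values are illustrative; see the linked documentation for the full option list):
```
{
  "flops_profiler": {
    "enabled": true,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
  }
}
```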
When training with multiple compute nodes, the likelihood of hardware faults occurring increases. Fortunately, AzureML will automatically restart training jobs that fail due to hardware errors. Given the length and resource consumption of large scale distributed training jobs, however, it is best that training is not restarted from scratch. With model checkpointing, training state is saved at periodic checkpoints, and if the job fails due to a hardware fault, training can resume from the most recent checkpoint rather than from the beginning. Nebula Checkpointing is an optimized version of this feature.
Nebula Checkpointing improves on standard model checkpointing by saving checkpoints up to 1000 times faster.
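As a rough sketch of how Nebula is typically enabled when training with DeepSpeed, a ``nebula`` section is added to ``ds_config.json``. The field names below follow the Azure Nebula checkpointing documentation and the storage path is a placeholder, so verify both against the current docs before use:
```
{
  "nebula": {
    "enabled": true,
    "persistent_storage_path": "/outputs/nebula_checkpoints",
    "persistent_time_interval": 2,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true
  }
}
```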
```
shm_size: 3100m
```
## **Examples**
- ### **Pretraining a model**
Pretraining a language model is a process of training a model on a large corpus of unlabeled text using self-supervision, which means that the model learns to predict some parts of the text from other parts. Pretraining helps the model learn general language knowledge and skills that can be useful for various downstream tasks. Pretraining from scratch means training a model from random initialization without using any existing pretrained models. Pretraining from scratch can be beneficial when you have a large amount of domain-specific data that differs significantly from general text corpora, or when you want to customize your model architecture or hyperparameters. However, pretraining from scratch can also be more costly and time-consuming than finetuning an existing pretrained model.
[This example](./Bloom-Pretrain/README.md) shows how to pretrain the Bloom model in AzureML. The following results were found using 16 NVIDIA A100 80GB GPUs (2 nodes NVLink enabled).
|Experiment|Model size|GPU count|TP (tensor parallel)|PP (pipeline parallel)|MBS (micro batch size)|TFlops|Samples per second|GPU memory utilized|