Tutorials on running distributed deep learning on Batch AI

azure batch-ai convolutional-neural-networks deep-learning distributed-training nvidia nvidia-docker


Mat 90f2473195 Updates Makefile Update to jupyter changes the way --ip is used. Update to makefile to reflect this. Fixes issue #5		2018-10-30 17:13:45 +00:00
Docker	Update environment.yml	2018-08-14 18:59:29 +01:00
HorovodKeras	clean up repo	2018-08-22 14:56:11 +00:00
HorovodPytorch	enhance .md file, add comments to notebooks	2018-08-22 14:04:05 +00:00
HorovodTF	clean up repo	2018-08-22 14:56:11 +00:00
common	Fixes 🐛	2018-08-10 17:05:12 +01:00
include	Fixes docstrings	2018-08-10 15:30:24 +01:00
.gitignore	Initial commit	2018-08-05 11:11:03 -07:00
LICENSE	Initial commit	2018-08-05 11:11:06 -07:00
Makefile	Updates Makefile	2018-10-30 17:13:45 +00:00
README.md	modify README.md	2018-08-29 13:29:44 +00:00

README.md

Distributed Training on Batch AI

Object recognition in images is a widely applied technique in computer vision. It is often implemented by training a convolutional neural network (CNN). Training can take weeks on a single GPU, and hyperparameter tuning or experimenting with model architectures takes prohibitively longer still.

This repo shows how to train a CNN model in a distributed fashion using Azure Batch AI, a managed service that enables deep learning (DL) models to be trained on clusters of Azure virtual machines, including VMs with GPU support.

We train a CNN model (ResNet50) on the ImageNet dataset using Horovod. You can choose from three DL frameworks: TensorFlow, Keras, or PyTorch.

To get started with the tutorial, follow these steps in order.

Prerequisites
Setup
TensorFlow version, Keras version, or PyTorch version

Prerequisites

Local host machine OS: Linux
Docker installed
Dockerhub account
Port 9999 open
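
Some of the prerequisites above can be checked from a terminal. The following is a minimal sketch, not part of the repository: the helper names check_cmd and check_os are hypothetical, and the check for make is an assumption based on the make commands used later in this README. It does not verify the Docker Hub login or whether port 9999 is open.

```shell
#!/bin/sh
# Sketch of a prerequisite check; helper names are hypothetical.

# Print "ok" if the given command is on PATH, "missing" otherwise.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok"
    else
        echo "missing"
    fi
}

# Print "ok" if the kernel name reported by uname is Linux, "unsupported" otherwise.
check_os() {
    if [ "$(uname -s)" = "Linux" ]; then
        echo "ok"
    else
        echo "unsupported"
    fi
}

echo "OS is Linux:      $(check_os)"
echo "docker installed: $(check_cmd docker)"
# Assumption: make is needed because the setup steps are run through a Makefile.
echo "make installed:   $(check_cmd make)"
```

Any line that reports "missing" or "unsupported" points to a prerequisite to fix before continuing with the setup.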

Setup

Before you begin, make sure you are logged in to your Docker Hub account by running the following on your machine:

docker login

Setup Batch AI Images

We need to create the images that will run our code on Batch AI. For your chosen framework, first navigate to its directory and then build the Docker image. Taking the TensorFlow model as an example, navigate to the HorovodTF folder and run the following command to build the image (replace any instance of <dockerhub account> with your own Docker Hub account name):

make build dockerhub=<dockerhub account>

Then push the image to your registry with:

make push dockerhub=<dockerhub account>
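
The build and push steps above can be combined into one small script. This is a sketch under stated assumptions rather than something shipped with the repository: it assumes it is run from the repository root, the names DOCKERHUB_ACCOUNT, FRAMEWORK_DIR, DRY_RUN, run and build_and_push are hypothetical, and mydockerhubuser is a placeholder. It relies on make's -C option, which runs make in the given directory instead of requiring a cd first. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: build and push the Batch AI image for one framework.
# All variable and function names here are hypothetical.

DOCKERHUB_ACCOUNT="${DOCKERHUB_ACCOUNT:-mydockerhubuser}"  # placeholder account name
FRAMEWORK_DIR="${FRAMEWORK_DIR:-HorovodTF}"                # or HorovodKeras, HorovodPytorch
DRY_RUN="${DRY_RUN:-1}"                                    # 1 = only print the commands

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Build the image in the given framework directory, then push it.
# The push is skipped if the build fails.
build_and_push() {
    run make -C "$1" build dockerhub="$2" &&
    run make -C "$1" push dockerhub="$2"
}

build_and_push "$FRAMEWORK_DIR" "$DOCKERHUB_ACCOUNT"
```

To execute the commands for real, set DRY_RUN=0 and DOCKERHUB_ACCOUNT to your own account name in the environment before running the script.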

Setup Execution Environment

Before you can run anything, you need to configure your local host machine with the environment in which you will execute the Batch AI commands. There are a number of dependencies, so we provide a Dockerfile that takes care of them for you. To build the image, run the following command in this directory, the repository root (replace all instances of <dockerhub account> with your own Docker Hub account name):

make build dockerhub=<dockerhub account>

Then start the jupyter notebook on port 9999:

make jupyter dockerhub=<dockerhub account>

Follow the instructions shown in the output of the above command and point your browser to the IP address or DNS name of your machine. From there you can navigate to the folders for the tutorials on each framework, such as HorovodTF.
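
If the browser cannot connect, it can help to check from the host itself whether anything answers on the notebook port. The following is a rough sketch that assumes curl is available: NOTEBOOK_HOST, notebook_url and probe are hypothetical names, and 9999 is the port named in this README. Any HTTP response, including a redirect to a login page, counts as reachable.

```shell
#!/bin/sh
# Sketch: check whether something answers over HTTP on the notebook port.
# NOTEBOOK_HOST and the helper names are hypothetical.

NOTEBOOK_HOST="${NOTEBOOK_HOST:-localhost}"
NOTEBOOK_PORT=9999   # port named in this README

# Print the base URL to open in the browser.
notebook_url() {
    echo "http://$1:$2"
}

# Print "reachable" if an HTTP request to the URL gets any response.
# Print "unreachable" if the request fails or curl is not installed.
probe() {
    if command -v curl >/dev/null 2>&1 &&
       curl --silent --output /dev/null --max-time 5 "$1"; then
        echo "reachable"
    else
        echo "unreachable"
    fi
}

url="$(notebook_url "$NOTEBOOK_HOST" "$NOTEBOOK_PORT")"
echo "$url is $(probe "$url")"
```

If the port is reachable locally but not from another machine, the likely cause is a firewall or network rule blocking port 9999 rather than the notebook server itself.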

Alternatively, if you don't want to use Docker, look at the Dockerfile and the environment.yml file in the Docker directory for the dependencies.
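
As a starting point for the non-Docker route, the sketch below assumes that conda is installed and that Docker/environment.yml is a conda environment file. Neither is stated in this README, so treat both as assumptions. ENV_FILE, DRY_RUN and create_env are hypothetical names. The Dockerfile may install further dependencies that this step does not cover, so check it as well.

```shell
#!/bin/sh
# Sketch: create the Python environment without Docker.
# ASSUMPTIONS: conda is installed and Docker/environment.yml is a conda
# environment file. ENV_FILE, DRY_RUN and create_env are hypothetical names.

ENV_FILE="${ENV_FILE:-Docker/environment.yml}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command

# Create the environment from the given file, or print the command when DRY_RUN=1.
create_env() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ conda env create --file $1"
    else
        conda env create --file "$1"
    fi
}

create_env "$ENV_FILE"
```

Run it from the repository root with DRY_RUN=0 to create the environment.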

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.