# TensorFlow-Distributed
This recipe shows how to run TensorFlow in distributed mode across multiple CPUs or GPUs, on a single node or across multiple nodes, using N-series Azure VM instances in an Azure Batch compute pool.
## Configuration
Please refer to this set of sample configuration files for this recipe.
### Pool Configuration
The pool configuration should enable the following properties if running on multiple GPUs:
* `vm_size` must be a GPU enabled VM size if using GPUs. Because TensorFlow is a GPU-accelerated compute application, you should choose an `ND`, `NC` or `NCv2` VM instance size if utilizing GPUs. If not using GPUs, any other appropriate CPU based VM size can be selected.
* `vm_configuration` is the VM configuration
  * `platform_image` specifies to use a platform image
    * `publisher` should be `Canonical` or `OpenLogic` if using GPUs. Other supported publishers can be used if not.
    * `offer` should be `UbuntuServer` for Canonical or `CentOS` for OpenLogic if using GPUs. Other supported offers can be used if not.
    * `sku` should be `16.04-LTS` for Ubuntu or `7.3` for CentOS if using GPUs. Other supported skus can be used if not.
If running on multiple CPUs:
* `max_tasks_per_node` must be set to 1 or omitted

Other pool properties such as `publisher`, `offer`, `sku`, `vm_size` and `vm_count` should be set to your desired values for a multiple CPU configuration.
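Putting the properties above together, a minimal GPU pool sketch might look like the following. This is only an illustrative fragment: the pool `id`, the `vm_count` of 2, and the `STANDARD_NC6` size are assumptions and not taken from this recipe's sample files, so consult those files for the authoritative layout.

```yaml
pool_specification:
  id: tensorflow-distributed  # illustrative pool id (assumption)
  vm_configuration:
    platform_image:
      publisher: Canonical
      offer: UbuntuServer
      sku: 16.04-LTS
  vm_count:
    dedicated: 2              # two nodes for multi-node training (assumption)
  vm_size: STANDARD_NC6       # an NC-series GPU VM size
```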
### Global Configuration
The global configuration should set the following properties:
* `docker_images` array must have a reference to a valid TensorFlow Docker image that can work with multi-instance tasks. The `alfpark/tensorflow` images have been prepared by using Google's TensorFlow Dockerfile as a base and extending the image to work with Batch Shipyard.
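As a sketch, assuming the YAML configuration schema, the corresponding global configuration fragment could be:

```yaml
global_resources:
  docker_images:
  - alfpark/tensorflow:1.2.1-gpu  # or alfpark/tensorflow:1.2.1-cpu for CPU pools
```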
### Jobs Configuration
The jobs configuration should set the following properties within the `tasks` array, which should have a task definition containing:
* `image` should be the name of the Docker image for this container invocation, e.g., `alfpark/tensorflow:1.2.1-gpu` or `alfpark/tensorflow:1.2.1-cpu`
* `command` should contain the command to pass to the Docker run invocation. To run the MNIST replica example, the `command` would look like: `"/bin/bash -c \"/shipyard/launcher.sh /shipyard/mnist_replica.py\""`. The launcher will automatically detect the number of GPUs and pass the correct number to the TensorFlow script. Please see the launcher.sh for the launcher source.
* `gpu` must be set to `true` if running on GPUs. This enables invoking the `nvidia-docker` wrapper. This property should be omitted or set to `false` if running on CPUs.
* `multi_instance` property must be defined
  * `num_instances` should be set to `pool_specification_vm_count_dedicated`, `pool_specification_vm_count_low_priority`, `pool_current_dedicated`, or `pool_current_low_priority`
  * `coordination_command` should be unset or `null`. For pools with `native` container support, this command should be supplied if a non-standard `sshd` is required.
  * `resource_files` should be unset or the array can be empty
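The task properties above can be sketched as a jobs configuration fragment. Again, this is an assumption-laden illustration (the job `id` is invented and the exact schema may vary by Batch Shipyard version); the sample configuration files for this recipe are authoritative.

```yaml
job_specifications:
- id: tensorflow-mnist  # illustrative job id (assumption)
  tasks:
  - image: alfpark/tensorflow:1.2.1-gpu
    command: /bin/bash -c "/shipyard/launcher.sh /shipyard/mnist_replica.py"
    gpu: true           # omit or set to false on CPU pools
    multi_instance:
      num_instances: pool_current_dedicated
```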
## Dockerfile and supplementary files
The `Dockerfile` for the Docker images can be found here.

You must agree to the following license prior to use: