# CNTK-GPU-OpenMPI
This recipe shows how to run CNTK on GPUs using N-series Azure VM instances in an Azure Batch compute pool.
Please note that CNTK currently uses MPI even for multiple GPUs on a single node.
## Configuration
Please refer to this set of sample configuration files for this recipe.
### Pool Configuration
The pool configuration should enable the following properties:
* `vm_size` must be one of `STANDARD_NC6`, `STANDARD_NC12`, `STANDARD_NC24`, `STANDARD_NV6`, `STANDARD_NV12`, `STANDARD_NV24`. `NC` VM instances feature K80 GPUs for GPU compute acceleration while `NV` VM instances feature M60 GPUs for visualization workloads. Because CNTK is a GPU-accelerated compute application, it is best to choose `NC` VM instances.
* `vm_configuration` is the VM configuration
  * `platform_image` specifies to use a platform image
    * `publisher` should be `Canonical` or `OpenLogic`.
    * `offer` should be `UbuntuServer` for Canonical or `CentOS` for OpenLogic.
    * `sku` should be `16.04-LTS` for Ubuntu or `7.3` for CentOS.
* `inter_node_communication_enabled` must be set to `true`
* `max_tasks_per_node` must be set to 1 or omitted
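
Taken together, a minimal sketch of the relevant pool settings might look like the following. This assumes the YAML configuration layout used by Batch Shipyard; the pool `id` and node counts are illustrative and not requirements of this recipe.

```yaml
pool_specification:
  id: cntk-gpu-openmpi            # illustrative pool name
  vm_configuration:
    platform_image:
      publisher: Canonical
      offer: UbuntuServer
      sku: 16.04-LTS
  vm_count:
    dedicated: 2                  # illustrative node count
    low_priority: 0
  vm_size: STANDARD_NC6
  inter_node_communication_enabled: true
  # max_tasks_per_node omitted (defaults to 1)
```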
### Global Configuration
The global configuration should set the following properties:
* `docker_images` array must have a reference to a valid CNTK GPU-enabled Docker image. For single node (non-MPI) jobs, you can use the official Microsoft CNTK Docker images. For MPI jobs, you will need to use Batch Shipyard compatible Docker images, which can be found in the alfpark/cntk repository. Images with the `refdata` tag suffix can be used for this recipe as they contain reference data for the MNIST and CIFAR-10 examples. If you do not need this reference data, you can use the images without the `refdata` suffix on the image tag.
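
For example, a sketch of the corresponding global configuration fragment, assuming the YAML layout, could be (the `storage_account_settings` link name is illustrative):

```yaml
batch_shipyard:
  storage_account_settings: mystorageaccount   # illustrative credentials link
global_resources:
  docker_images:
    # MPI-capable image with MNIST/CIFAR-10 reference data included
    - alfpark/cntk:2.1-gpu-1bitsgd-py35-cuda8-cudnn6-refdata
```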
### Non-MPI Jobs Configuration (SingleNode+SingleGPU)
The jobs configuration should set the following properties within the `tasks` array which should have a task definition containing:
* `image` should be the name of the Docker image for this container invocation, e.g., `microsoft/cntk:2.1-gpu-python3.5-cuda8.0-cudnn6.0`
* `command` should contain the command to pass to the Docker run invocation. For the `microsoft/cntk:2.1-gpu-python3.5-cuda8.0-cudnn6.0` Docker image, and to run the MNIST convolutional example on a single GPU, the `command` would be: `"/bin/bash -c \"source /cntk/activate-cntk && cd /cntk/Examples/Image/DataSets/MNIST && python -u install_mnist.py && cd /cntk/Examples/Image/Classification/ConvNet/Python && python -u ConvNet_MNIST.py\""`
* `gpu` must be set to `true`. This enables invoking the `nvidia-docker` wrapper.
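
A sketch of such a single-node, single-GPU task definition, assuming the YAML jobs layout, might be (the job `id` is illustrative):

```yaml
job_specifications:
  - id: cntk-mnist-convnet        # illustrative job name
    tasks:
      - image: microsoft/cntk:2.1-gpu-python3.5-cuda8.0-cudnn6.0
        gpu: true                 # invoke the nvidia-docker wrapper
        command: >-
          /bin/bash -c "source /cntk/activate-cntk &&
          cd /cntk/Examples/Image/DataSets/MNIST &&
          python -u install_mnist.py &&
          cd /cntk/Examples/Image/Classification/ConvNet/Python &&
          python -u ConvNet_MNIST.py"
```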
### MPI Jobs Configuration (SingleNode+MultiGPU, MultiNode+SingleGPU, MultiNode+MultiGPU)
The jobs configuration should set the following properties within the `tasks` array which should have a task definition containing:
* `image` should be the name of the Docker image for this container invocation. For this example, this can be `alfpark/cntk:2.1-gpu-1bitsgd-py35-cuda8-cudnn6-refdata`. Please note that the `docker_images` in the Global Configuration should match this image name.
* `command` should contain the command to pass to the Docker run invocation. For this example, we will run the ResNet-20 distributed training on CIFAR-10 example in the `alfpark/cntk:2.1-gpu-1bitsgd-py35-cuda8-cudnn6-refdata` Docker image. The application `command` to run would be: `"/cntk/run_cntk.sh -s /cntk/Examples/Image/Classification/ResNet/Python/TrainResNet_CIFAR10_Distributed.py -- --network resnet20 -q 1 -a 0 --datadir /cntk/Examples/Image/DataSets/CIFAR-10 --outputdir $AZ_BATCH_TASK_WORKING_DIR/output"`
  * `run_cntk.sh` has two parameters
    * `-s` for the Python script to run
    * `-w` for the working directory (not required for this example to run)
  * parameters specified after `--` are given verbatim to the Python script
* `gpu` must be set to `true`. This enables invoking the `nvidia-docker` wrapper.
* `multi_instance` property must be defined for multinode executions
  * `num_instances` should be set to `pool_specification_vm_count_dedicated`, `pool_specification_vm_count_low_priority`, `pool_current_dedicated`, or `pool_current_low_priority`
  * `coordination_command` should be unset or `null`
  * `resource_files` should be unset or the array can be empty
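
A sketch of the corresponding multi-instance task definition, again assuming the YAML jobs layout with an illustrative job `id`, might be:

```yaml
job_specifications:
  - id: cntk-resnet20-cifar10     # illustrative job name
    tasks:
      - image: alfpark/cntk:2.1-gpu-1bitsgd-py35-cuda8-cudnn6-refdata
        gpu: true                 # invoke the nvidia-docker wrapper
        multi_instance:
          num_instances: pool_current_dedicated
          # coordination_command left unset; resource_files omitted
        command: >-
          /cntk/run_cntk.sh
          -s /cntk/Examples/Image/Classification/ResNet/Python/TrainResNet_CIFAR10_Distributed.py
          -- --network resnet20 -q 1 -a 0
          --datadir /cntk/Examples/Image/DataSets/CIFAR-10
          --outputdir $AZ_BATCH_TASK_WORKING_DIR/output
```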
## Dockerfile and supplementary files
The `Dockerfile` for the Docker image can be found here.
You must agree to the following licenses prior to use: