# CNTK-CPU-Infiniband-IntelMPI
This recipe shows how to run CNTK on CPUs across Infiniband/RDMA-enabled Azure VMs via Intel MPI.

## Configuration
Please refer to this set of sample configuration files for this recipe.
### Pool Configuration
The pool configuration should enable the following properties:
* `vm_size` should be a CPU-only RDMA-enabled instance
* `vm_configuration` is the VM configuration. Please select an appropriate `platform_image` with IB/RDMA as supported by Batch Shipyard.
* `inter_node_communication_enabled` must be set to `true`
* `max_tasks_per_node` must be set to `1` or omitted
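The properties above can be sketched in a pool configuration as follows. This is a minimal, illustrative fragment assuming the YAML configuration format; the pool `id`, `vm_size`, and `platform_image` values shown are examples only and should be replaced with an IB/RDMA-capable combination valid for your subscription:

```yaml
pool_specification:
  id: cntk-cpu-ib  # illustrative pool id
  vm_configuration:
    platform_image:
      # example only: choose an IB/RDMA-capable image supported by Batch Shipyard
      publisher: OpenLogic
      offer: CentOS-HPC
      sku: '7.4'
  vm_count:
    dedicated: 2
  vm_size: STANDARD_H16R  # a CPU-only RDMA-enabled instance size
  inter_node_communication_enabled: true
  max_tasks_per_node: 1
```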
### Global Configuration
The global configuration should set the following properties:
* `docker_images` array must have a reference to a valid CNTK CPU-enabled Docker image that can be run with Intel MPI. Images denoted with `cpu` and `intelmpi` tags found in alfpark/cntk are compatible with Azure VMs. Images denoted with the `refdata` tag suffix found in alfpark/cntk can be used for this recipe; these contain reference data for the MNIST and CIFAR-10 examples. If you do not need this reference data, you can use the images without the `refdata` suffix on the image tag. For this example, `alfpark/cntk:2.1-cpu-1bitsgd-py36-intelmpi-refdata` can be used.
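A minimal sketch of the corresponding global configuration fragment, assuming the YAML configuration format:

```yaml
global_resources:
  docker_images:
    # must match the docker_image referenced in the jobs configuration
    - alfpark/cntk:2.1-cpu-1bitsgd-py36-intelmpi-refdata
```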
### MPI Jobs Configuration (MultiNode)
The jobs configuration should set the following properties within the `tasks` array, which should have a task definition containing:
* `docker_image` should be the name of the Docker image for this container invocation. For this example, this should be `alfpark/cntk:2.1-cpu-1bitsgd-py36-intelmpi-refdata`. Please note that the `docker_images` array in the Global Configuration should match this image name.
* `multi_instance` property must be defined
  * `num_instances` should be set to `pool_specification_vm_count_dedicated`, `pool_specification_vm_count_low_priority`, `pool_current_dedicated`, or `pool_current_low_priority`
  * `coordination_command` should be unset or `null`. For pools with `native` container support, this command should be supplied if a non-standard `sshd` is required.
  * `resource_files` should be unset or the array can be empty
  * `pre_execution_command` should source the Intel MPI `mpivars.sh` script: `source /opt/intel/compilers_and_libraries/linux/mpi/bin64/mpivars.sh`
  * `mpi` property must be defined
    * `runtime` should be set to `intelmpi`
    * `processes_per_node` should be set to `1`
* `command` should contain the command to pass to the `mpirun` invocation. For this example, we will run the CIFAR-10 convolutional example with data augmentation in the `alfpark/cntk:2.1-cpu-1bitsgd-py36-intelmpi-refdata` Docker image. The application `command` to run would be: `python -u /cntk/Examples/Image/Classification/ConvNet/Python/ConvNet_CIFAR10_DataAug_Distributed.py -q 1 -datadir /cntk/Examples/Image/DataSets/CIFAR-10 -outputdir $AZ_BATCH_TASK_WORKING_DIR/output`
* `infiniband` can be set to `true`; however, it is implicitly enabled by Batch Shipyard when executing on an RDMA-enabled compute pool.
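Putting the properties above together, the task definition can be sketched as follows. This is an illustrative fragment assuming the YAML configuration format; the job `id` is an example only:

```yaml
job_specifications:
  - id: cntk-cpu-ib-job  # illustrative job id
    tasks:
      - docker_image: alfpark/cntk:2.1-cpu-1bitsgd-py36-intelmpi-refdata
        multi_instance:
          num_instances: pool_current_dedicated
          # source the Intel MPI environment before mpirun is invoked
          pre_execution_command: source /opt/intel/compilers_and_libraries/linux/mpi/bin64/mpivars.sh
          mpi:
            runtime: intelmpi
            processes_per_node: 1
        # command passed to the mpirun invocation
        command: >-
          python -u /cntk/Examples/Image/Classification/ConvNet/Python/ConvNet_CIFAR10_DataAug_Distributed.py
          -q 1 -datadir /cntk/Examples/Image/DataSets/CIFAR-10
          -outputdir $AZ_BATCH_TASK_WORKING_DIR/output
        infiniband: true  # optional: implicitly enabled on RDMA pools
```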
## Dockerfile and supplementary files
Supplementary files can be found here.
You must agree to the following licenses prior to use: