MXNet-CPU

This recipe shows how to run MXNet on CPUs on one or more compute nodes via SSH.

Configuration

Please refer to this set of sample configuration files for this recipe.

Pool Configuration

The pool configuration should enable the following properties (see the sketch after this list):

  • inter_node_communication_enabled must be set to true
  • max_tasks_per_node must be set to 1 or omitted
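
A minimal pool configuration sketch is shown below, assuming the JSON configuration layout used by Batch Shipyard recipes of this era; the pool id, VM size, and node count are illustrative placeholders and the authoritative values are in the sample configuration files referenced above.

```json
{
  "pool_specification": {
    "id": "mxnet-cpu",
    "vm_size": "STANDARD_D2_V2",
    "vm_count": {
      "dedicated": 2
    },
    "inter_node_communication_enabled": true,
    "max_tasks_per_node": 1
  }
}
```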

Global Configuration

The global configuration should set the following properties (see the sketch after this list):

  • docker_images array must have a reference to a valid MXNet CPU-enabled Docker image. alfpark/mxnet:cpu can be used for this recipe.
  • docker_volumes must be populated with the following if running an MXNet multi-node job:
    • shared_data_volumes should contain an Azure File Docker volume driver, a GlusterFS share, or a manually configured NFS share. Batch Shipyard has automatic support for setting up Azure File Docker Volumes and GlusterFS; please refer to the Batch Shipyard Configuration doc.
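
The following global configuration sketch shows both properties together, assuming a GlusterFS shared data volume; the volume name glustervol and the container path are illustrative placeholders, and the exact schema should be checked against the sample configuration files and the Batch Shipyard Configuration doc.

```json
{
  "global_resources": {
    "docker_images": [
      "alfpark/mxnet:cpu"
    ],
    "docker_volumes": {
      "shared_data_volumes": {
        "glustervol": {
          "volume_driver": "glusterfs",
          "container_path": "$AZ_BATCH_NODE_SHARED_DIR/gfs"
        }
      }
    }
  }
}
```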

SingleNode Jobs Configuration

The jobs configuration should set the following properties within the tasks array, which should have a task definition containing (see the sketch after this list):

  • image should be the name of the Docker image for this container invocation, e.g., alfpark/mxnet:cpu
  • command should contain the command to pass to the Docker run invocation. For the alfpark/mxnet:cpu Docker image and to run the MNIST python-backend example on a single CPU, the command would simply be: "/mxnet/run_mxnet.sh mnist-py . --model-prefix $AZ_BATCH_TASK_WORKING_DIR/mnist-model". The source for run_mxnet.sh can be found here.
    • The first argument to run_mxnet.sh is the training example to run. This can be one of: cifar-10-r, cifar-10-py, mnist-r, mnist-py. cifar-10 examples run resnet. mnist examples run lenet.
    • The second argument to run_mxnet.sh is the shared file system location. For single node executions, this should simply be "." (the task's current working directory).
    • Arguments after the second are passed to the training script. In this example, we specify where to save the model.
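
Putting these together, a single node task definition might look like the following sketch; the job id is an illustrative placeholder and the overall layout follows the sample configuration files referenced above.

```json
{
  "job_specifications": [
    {
      "id": "mxnet-mnist-singlenode",
      "tasks": [
        {
          "image": "alfpark/mxnet:cpu",
          "command": "/mxnet/run_mxnet.sh mnist-py . --model-prefix $AZ_BATCH_TASK_WORKING_DIR/mnist-model"
        }
      ]
    }
  ]
}
```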

MultiNode Jobs Configuration

The jobs configuration should set the following properties within the tasks array, which should have a task definition containing (see the sketch after this list):

  • image should be the name of the Docker image for this container invocation. This can be alfpark/mxnet:cpu. Please note that the docker_images in the Global Configuration should match this image name.
  • command should contain the command to pass to the Docker run invocation. For this example, we will run the CIFAR-10 example across distributed nodes in the alfpark/mxnet:cpu Docker image. The application command to run would be: "/mxnet/run_mxnet.sh cifar-10-py $AZ_BATCH_NODE_SHARED_DIR/gfs --model-prefix $AZ_BATCH_TASK_WORKING_DIR/cifar-10-model". The source for run_mxnet.sh can be found here.
    • NOTE: tasks that span multiple compute nodes will need their input stored on a shared file system, otherwise MXNet will not be able to start. To override the input directory for the example above, set the second argument to the shell script to the location of the shared file system, such as an Azure File Docker volume, NFS, or GlusterFS share. The example above already writes to a GlusterFS share.
  • shared_data_volumes should have a valid volume name as defined in the global configuration file. Please see the global configuration section above for details.
  • multi_instance property must be defined
    • num_instances should be set to pool_specification_vm_count_dedicated, pool_specification_vm_count_low_priority, pool_current_dedicated, or pool_current_low_priority
    • coordination_command should be unset or null
    • resource_files array can be empty
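
A multi-node task definition combining these properties might look like the following sketch; the job id and the glustervol volume name are illustrative placeholders (the latter must match a shared data volume defined in the global configuration), and num_instances is set here to span all dedicated nodes currently in the pool.

```json
{
  "job_specifications": [
    {
      "id": "mxnet-cifar10-multinode",
      "tasks": [
        {
          "image": "alfpark/mxnet:cpu",
          "shared_data_volumes": ["glustervol"],
          "multi_instance": {
            "num_instances": "pool_current_dedicated",
            "coordination_command": null
          },
          "command": "/mxnet/run_mxnet.sh cifar-10-py $AZ_BATCH_NODE_SHARED_DIR/gfs --model-prefix $AZ_BATCH_TASK_WORKING_DIR/cifar-10-model"
        }
      ]
    }
  ]
}
```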

Dockerfile and supplementary files

The Dockerfile for the Docker image can be found here.

You must agree to the following licenses before using this image: