Rename MADL recipe to HPMLA
Parent: 124bb429a0
Commit: 6f35a50bf7

@@ -0,0 +1,29 @@
## HPMLA-CPU-OpenMPI Data Shredding
This Data Shredding recipe shows how to shred and deploy your training data
for HPMLA prior to running the training job on Azure VMs via Open MPI.

### Data Shredding Configuration
Rename `configuration-template.json` to `configuration.json`.
The configuration should set the following properties (a minimal example
sketch follows the list):
* `node_count` should be set to the number of VMs in the compute pool.
* `thread_count` should be set to the number of threads per VM.
* `training_data_shred_count` the number of shreds the training data is split into. It is advisable to set this number high so that you only do this step once and can reuse the shreds for different VM configurations.
* `dataset_local_directory` a local directory used to download and shred the training data according to `training_data_shred_count`.
* `shredded_dataset_Per_Node` a local directory that holds the final data shreds before they are deployed to Azure blobs.
* `container_name` the container name where the sliced data will be stored.
* `trainind_dataset_name` the name for the dataset, used when creating the data blobs.
* `subscription_id` the Azure subscription id.
* `secret_key` the Azure password.
* `resource_group` the resource group name.
* `storage_account` the storage account name and access key.
* `training_data_container_name` the container name where the training data is hosted.
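
A minimal `configuration.json` sketch is shown below. The key names follow the list above, but all values are illustrative placeholders, and the exact shape of entries such as `storage_account` should be taken from `configuration-template.json` rather than from this example:

```json
{
    "node_count": 2,
    "thread_count": 4,
    "training_data_shred_count": 64,
    "dataset_local_directory": "/data/hpmla/raw",
    "shredded_dataset_Per_Node": "/data/hpmla/shreds",
    "container_name": "hpmla-shredded-data",
    "trainind_dataset_name": "mytrainingdata",
    "subscription_id": "<azure-subscription-id>",
    "secret_key": "<azure-password>",
    "resource_group": "<resource-group-name>",
    "storage_account": {
        "name": "<storage-account-name>",
        "key": "<storage-account-access-key>"
    },
    "training_data_container_name": "<container-holding-the-raw-training-data>"
}
```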

You can use your own access mechanism (password, access key, etc.); the above
is only one example. However, make sure to update the Python script every time
you make a configuration change.

You must agree to the following licenses prior to use:
* [High Performance ML Algorithms License](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/High%20Performance%20ML%20Algorithms%20-%20Standalone%20(free)%20Use%20Terms%20V2%20(06-06-18).txt)
* [TPN Ubuntu Container](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/TPN_Ubuntu%20Container_16-04-FINAL.txt)
* [Microsoft Third Party Notice](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/MicrosoftThirdPartyNotice.txt)

@@ -1,6 +1,6 @@
# MADL-CPU-OpenMPI
This recipe shows how to run High Performance ML Algorithms Learner on CPUs across
Azure VMs via Open MPI.
# HPMLA-CPU-OpenMPI
This recipe shows how to run High Performance ML Algorithms (HPMLA) on CPUs
across Azure VMs via Open MPI.

## Configuration
Please refer to this [set of sample configuration files](./config) for
@@ -8,30 +8,31 @@ this recipe.

### Pool Configuration
The pool configuration should enable the following properties (a minimal
sketch follows the list):
* `vm_size` should be a CPU-only instance, for example, 'STANDARD_D2_V2'.
* `vm_size` should be a CPU-only instance, for example, `STANDARD_D2_V2`.
* `inter_node_communication_enabled` must be set to `true`
* `max_tasks_per_node` must be set to 1 or omitted
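
As a rough illustration of these settings, a pool configuration fragment could look like the sketch below; the `id` and `vm_count` values are assumptions, and the authoritative layout is in the [sample configuration files](./config):

```yaml
pool_specification:
  id: hpmla-cpu-pool            # assumed pool id
  vm_size: STANDARD_D2_V2       # CPU-only instance
  vm_count:
    dedicated: 2                # should match node_count in the data shredding configuration
  inter_node_communication_enabled: true
  max_tasks_per_node: 1
```
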
### Global Configuration
The global configuration should set the following properties (see the sketch
after this list):
* `docker_images` array must have a reference to a valid MADL
Docker image that can be run with OpenMPI. The image denoted with `0.0.1` tag found in [msmadl/symsgd:0.0.1](https://hub.docker.com/r/msmadl/symsgd/)
* `docker_images` array must have a reference to a valid HPMLA
Docker image that can be run with OpenMPI. The image denoted with the `0.0.1`
tag, found at [msmadl/symsgd:0.0.1](https://hub.docker.com/r/msmadl/symsgd/),
is compatible with Azure Batch Shipyard VMs.
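
A corresponding global configuration fragment might look like the following sketch; the credential link name and the shared data volume name are assumptions here, and the real values live in the [sample config file](./config/config.yaml):

```yaml
batch_shipyard:
  storage_account_settings: mystorageaccount   # assumed link to credentials
global_resources:
  docker_images:
    - msmadl/symsgd:0.0.1
  volumes:
    shared_data_volumes:
      azblob:                                  # assumed volume name, referenced from the jobs configuration
        volume_driver: azureblob
        storage_account_settings: mystorageaccount
        azure_blob_container_name: "<container_name from the data shredding configuration file>"
```
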
### MPI Jobs Configuration (MultiNode)
The jobs configuration should set the following properties within the `tasks`
array, which should have a task definition containing (a consolidated sketch
follows the list):
* `docker_image` should be the name of the Docker image for this container invocation.
For this example, this should be
* `docker_image` should be the name of the Docker image for this container
invocation. For this example, this should be
`msmadl/symsgd:0.0.1`.
Please note that the `docker_images` in the Global Configuration should match
this image name.
* `command` should contain the command to pass to the Docker run invocation.
For this MADL training example with the `msmadl/symsgd:0.0.1` Docker image, the
For this HPMLA training example with the `msmadl/symsgd:0.0.1` Docker image, the
application `command` to run would be:
`"/parasail/run_parasail.sh -w /parasail/supersgd -l 1e-4 -k 32 -m 1e-2 -e 10 -r 10 -f $AZ_BATCH_NODE_SHARED_DIR/azblob/<container_name from the data shredding configuration file> -t 1 -g 1 -d $AZ_BATCH_TASK_WORKING_DIR/models -b $AZ_BATCH_NODE_SHARED_DIR/azblob/<container_name from the data shredding configuration file>"`
* [`run_parasail.sh`](docker/run_parasail.sh) has these parameters
* `-w` the MADL superSGD directory
* `-w` the HPMLA superSGD directory
* `-l` learning rate
* `-k` approximation rank constant
* `-m` model combiner convergence threshold

@@ -42,8 +43,8 @@ application `command` to run would be:

* `-g` log global models every this many epochs
* `-d` log global models to this directory at the host
* `-b` location for the algorithm's binary
* The training data will need to be shredded to match the number of VMs and the thread count per VM, and then deployed to a mounted Azure blob container to which the VM Docker images have read/write access.
A basic Python script that can be used to shred and deploy the training data to a blob container, along with other data shredding files, can be found [here](./DataShredding).
* `shared_data_volumes` should contain the shared data volume with an `azureblob` volume driver as specified in the global configuration file found [here](./config/config.yaml).
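
Putting these task properties together, a jobs configuration sketch could look like the following; the job `id`, the multi-instance settings, and the shared data volume name are assumptions, and the authoritative layout is in the [sample configuration files](./config):

```yaml
job_specifications:
  - id: hpmla-training-job                     # assumed job id
    tasks:
      - docker_image: msmadl/symsgd:0.0.1
        shared_data_volumes:
          - azblob                             # assumed volume name from the global configuration
        multi_instance:
          num_instances: pool_current_dedicated
        command: >-
          /parasail/run_parasail.sh -w /parasail/supersgd -l 1e-4 -k 32 -m 1e-2
          -e 10 -r 10
          -f $AZ_BATCH_NODE_SHARED_DIR/azblob/<container_name from the data shredding configuration file>
          -t 1 -g 1
          -d $AZ_BATCH_TASK_WORKING_DIR/models
          -b $AZ_BATCH_NODE_SHARED_DIR/azblob/<container_name from the data shredding configuration file>
```
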
@@ -57,4 +58,4 @@ Supplementary files can be found [here](./docker).

You must agree to the following licenses prior to use:
* [High Performance ML Algorithms License](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/High%20Performance%20ML%20Algorithms%20-%20Standalone%20(free)%20Use%20Terms%20V2%20(06-06-18).txt)
* [TPN Ubuntu Container](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/TPN_Ubuntu%20Container_16-04-FINAL.txt)
* [Microsoft Third Party Notice](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/MicrosoftThirdPartyNotice.txt)

@@ -1,4 +1,4 @@
#Dockerfile for MADL (Microsoft Distributed Learners)
#Dockerfile for HPMLA (Microsoft High Performance ML Algorithms)

FROM ubuntu:16.04
MAINTAINER Saeed Maleki Todd Mytkowicz Madan Musuvathi Dany rouhana https://github.com/saeedmaleki/Distributed-Linear-Learner

@@ -15,7 +15,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
    openmpi-common \
    libopenmpi-dev && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# configure ssh server and keys
RUN mkdir -p /root/.ssh && \

@@ -27,12 +27,12 @@ RUN mkdir -p /root/.ssh && \
    ssh-keygen -f /root/.ssh/id_rsa -t rsa -N '' && \
    chmod 600 /root/.ssh/config && \
    chmod 700 /root/.ssh && \
    cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys

# set parasail dir
WORKDIR /parasail

# to create your own image, first download the supersgd from the link supplied in the read me file,
# and then put it in the same dir as this file.
COPY supersgd /parasail
COPY run_parasail.sh /parasail
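
Assuming `supersgd` and `run_parasail.sh` have been placed next to this Dockerfile as the comments above describe, the image can be built and tagged with a standard Docker command; the tag below simply mirrors the published image name:

```
docker build -t msmadl/symsgd:0.0.1 .
```
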
@@ -1,26 +0,0 @@
## MADL-CPU-OpenMPI Data Shredding
This Data Shredding recipe shows how to shred and deploy your training data prior to running a training job on Azure VMs via Open MPI.

### Data Shredding Configuration
Rename the configuration-template.json to configuration.json. The configuration should enable the following properties:
* `node_count` should be set to the number of VMs in the compute pool.
* `thread_count` thread's count per VM.
* `training_data_shred_count` It's advisable to set this number high. This way you only do this step once, and use it for different VMs configuration.
* 'dataset_local_directory' A local directory to download and shred the training data according to 'training_data_shred_count'.
* 'shredded_dataset_Per_Node' A local directory to hold the final data shreds before deploying them to Azure blobs.
* 'container_name' container name where the sliced data will be stored.
* 'trainind_dataset_name' name for the dataset. Used when creating the data blobs.
* 'subscription_id' Azure subscription id.
* 'secret_key' Azure password.
* 'resource_group' Resource group name.
* 'storage_account' storage account name and access key.
* 'training_data_container_name' Container name where the training data is hosted.
*''

You can use your own access mechanism (password, access key, etc.). The above is only a one example. Although, make sure to update the python script
every time you make a configuration change.

You must agree to the following licenses prior to use:
* [High Performance ML Algorithms License](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/High%20Performance%20ML%20Algorithms%20-%20Standalone%20(free)%20Use%20Terms%20V2%20(06-06-18).txt)
* [TPN Ubuntu Container](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/TPN_Ubuntu%20Container_16-04-FINAL.txt)
* [Microsoft Third Party Notice](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/MicrosoftThirdPartyNotice.txt)

@@ -105,9 +105,10 @@ This Keras+Theano-GPU recipe contains information on how to containerize
[Theano](http://www.deeplearning.net/software/theano/) backend for use with
N-Series Azure VMs.

#### [MADL-CPU-OpenMPI](./MADL-CPU-OpenMPI)
#### [HPMLA-CPU-OpenMPI](./HPMLA-CPU-OpenMPI)
This recipe contains information on how to containerize the Microsoft High
Performance ML Algorithms Learner for use across multiple compute nodes.
Performance ML Algorithms (HPMLA) for use across multiple compute
nodes.

#### [MXNet-CPU](./MXNet-CPU)
This MXNet-CPU recipe contains information on how to containerize