This commit is contained in:
Fred Park 2018-10-04 09:59:19 -07:00
Parent 124bb429a0
Commit 6f35a50bf7
No key found matching this signature
GPG key ID: 3C4D545F457737EB
14 changed files with 51 additions and 46 deletions


@@ -0,0 +1,29 @@
## HPMLA-CPU-OpenMPI Data Shredding
This Data Shredding recipe shows how to shred and deploy your training data
for HPMLA prior to running the training job on Azure VMs via Open MPI.
### Data Shredding Configuration
Rename the `configuration-template.json` to `configuration.json`.
The configuration should set the following properties:
* `node_count` should be set to the number of VMs in the compute pool.
* `thread_count` the number of threads per VM.
* `training_data_shred_count` the number of shreds to split the training data into. It's advisable to set this number high; that way you only shred the data once and can reuse the shreds for different VM configurations.
* `dataset_local_directory` a local directory where the training data is downloaded and shredded according to `training_data_shred_count`.
* `shredded_dataset_Per_Node` a local directory to hold the final data shreds before deploying them to Azure blobs.
* `container_name` the container name where the shredded data will be stored.
* `trainind_dataset_name` the name for the dataset, used when creating the data blobs.
* `subscription_id` the Azure subscription id.
* `secret_key` the Azure password.
* `resource_group` the resource group name.
* `storage_account` the storage account name and access key.
* `training_data_container_name` the container name where the original training data is hosted.
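A minimal sketch of what a filled-in `configuration.json` might look like is
shown below. All values are illustrative placeholders, and the exact shape of
`storage_account` (here an object holding the name and access key) is an
assumption; the authoritative key list comes from `configuration-template.json`:

```json
{
  "node_count": 2,
  "thread_count": 4,
  "training_data_shred_count": 64,
  "dataset_local_directory": "/data/dataset",
  "shredded_dataset_Per_Node": "/data/shreds",
  "container_name": "shreddedtrainingdata",
  "trainind_dataset_name": "mytrainingset",
  "subscription_id": "00000000-0000-0000-0000-000000000000",
  "secret_key": "<your-secret>",
  "resource_group": "my-resource-group",
  "storage_account": {
    "name": "mystorageaccount",
    "key": "<your-storage-access-key>"
  },
  "training_data_container_name": "trainingdata"
}
```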
You can use your own access mechanism (password, access key, etc.); the
above is only one example. However, make sure to update the Python script
every time you make a configuration change.
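The shredding step itself is conceptually a round-robin split. The sketch
below is not the repository's script but a hypothetical illustration of the
idea, assuming a single line-oriented input file named `training_data.txt`:

```python
import json
import os

# Load the settings described above (only a minimal subset is used here).
with open("configuration.json") as f:
    cfg = json.load(f)

shred_count = cfg["training_data_shred_count"]
out_dir = cfg["dataset_local_directory"]
os.makedirs(out_dir, exist_ok=True)

# Open one output file per shred and deal input lines out round-robin.
paths = [os.path.join(out_dir, "shred_%05d.txt" % i) for i in range(shred_count)]
shreds = [open(p, "w") for p in paths]
with open("training_data.txt") as data:  # hypothetical input file name
    for i, line in enumerate(data):
        shreds[i % shred_count].write(line)
for s in shreds:
    s.close()

# Each of the node_count * thread_count workers later receives an even
# slice of the shreds, e.g. worker w takes paths[w::workers].
workers = cfg["node_count"] * cfg["thread_count"]
per_worker = {w: paths[w::workers] for w in range(workers)}
```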
You must agree to the following licenses prior to use:
* [High Performance ML Algorithms License](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/High%20Performance%20ML%20Algorithms%20-%20Standalone%20(free)%20Use%20Terms%20V2%20(06-06-18).txt)
* [TPN Ubuntu Container](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/TPN_Ubuntu%20Container_16-04-FINAL.txt)
* [Microsoft Third Party Notice](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/MicrosoftThirdPartyNotice.txt)


@@ -1,6 +1,6 @@
# HPMLA-CPU-OpenMPI
This recipe shows how to run High Performance ML Algorithms (HPMLA) on CPUs
across Azure VMs via Open MPI.
## Configuration
Please refer to this [set of sample configuration files](./config) for
@@ -8,30 +8,31 @@ this recipe.
### Pool Configuration
The pool configuration should enable the following properties:
* `vm_size` should be a CPU-only instance, for example, `STANDARD_D2_V2`.
* `inter_node_communication_enabled` must be set to `true`
* `max_tasks_per_node` must be set to 1 or omitted
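A pool fragment consistent with these requirements might look like the sketch
below. The pool id and VM count are placeholders, and the exact layout (for
example, the `vm_count`/`dedicated` nesting) is an assumption to be checked
against the [sample configuration files](./config):

```yaml
pool_specification:
  id: hpmla-pool                          # placeholder pool id
  vm_size: STANDARD_D2_V2                 # CPU-only instance
  vm_count:
    dedicated: 2                          # assumed nesting; one MPI rank per node
  inter_node_communication_enabled: true  # required for MPI
  max_tasks_per_node: 1
```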
### Global Configuration
The global configuration should set the following properties:
* `docker_images` array must have a reference to a valid HPMLA
Docker image that can be run with OpenMPI. The image denoted with the `0.0.1`
tag, found at [msmadl/symsgd:0.0.1](https://hub.docker.com/r/msmadl/symsgd/),
is compatible with Azure Batch Shipyard VMs.
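In the global configuration this is simply an entry in the `docker_images`
array under `global_resources`; the same file also declares the `azureblob`
shared data volume referenced later by the job (see the sample
[config.yaml](./config/config.yaml)). A minimal sketch:

```yaml
global_resources:
  docker_images:
    - msmadl/symsgd:0.0.1   # the HPMLA image referenced by the job's tasks
```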
### MPI Jobs Configuration (MultiNode)
The jobs configuration should set the following properties within the `tasks`
array, which should have a task definition containing:
* `docker_image` should be the name of the Docker image for this container
invocation. For this example, this should be
`msmadl/symsgd:0.0.1`.
Please note that the `docker_images` in the Global Configuration should match
this image name.
* `command` should contain the command to pass to the Docker run invocation.
For this HPMLA training example with the `msmadl/symsgd:0.0.1` Docker image,
the application `command` to run would be:
`"/parasail/run_parasail.sh -w /parasail/supersgd -l 1e-4 -k 32 -m 1e-2 -e 10 -r 10 -f $AZ_BATCH_NODE_SHARED_DIR/azblob/<container_name from the data shredding configuration file> -t 1 -g 1 -d $AZ_BATCH_TASK_WORKING_DIR/models -b $AZ_BATCH_NODE_SHARED_DIR/azblob/<container_name from the data shredding configuration file>"`
* [`run_parasail.sh`](docker/run_parasail.sh) has these parameters
* `-w` the HPMLA superSGD directory
* `-l` learning rate
* `-k` approximation rank constant
* `-m` model combiner convergence threshold
@@ -42,8 +43,8 @@ application `command` to run would be:
* `-g` log global models every this many epochs
* `-d` log global models to this directory at the host
* `-b` location for the algorithm's binary
* The training data will need to be shredded to match the number of VMs and
the thread count per VM, and then deployed to a mounted Azure blob to which
the VM Docker containers have read/write access.
A basic Python script that can be used to shred and deploy the training data to a blob container, along with other data shredding files, can be found [here](./DataShredding).
* `shared_data_volumes` should contain the shared data volume with an `azureblob` volume driver as specified in the global configuration file found [here](./config/config.yaml).
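Putting these task properties together, a jobs configuration sketch might look
like the following. The job id, volume name, filled-in container name
(`shreddedtrainingdata`), and the `multi_instance` settings are placeholders
or assumptions; the [sample configuration files](./config) are authoritative:

```yaml
job_specifications:
  - id: hpmla-training                    # placeholder job id
    tasks:
      - docker_image: msmadl/symsgd:0.0.1
        shared_data_volumes:
          - azureblob_vol                 # placeholder; must match config.yaml
        multi_instance:
          num_instances: pool_current_dedicated  # assumption: use every pool node
        command: >-
          /parasail/run_parasail.sh -w /parasail/supersgd -l 1e-4 -k 32
          -m 1e-2 -e 10 -r 10
          -f $AZ_BATCH_NODE_SHARED_DIR/azblob/shreddedtrainingdata
          -t 1 -g 1 -d $AZ_BATCH_TASK_WORKING_DIR/models
          -b $AZ_BATCH_NODE_SHARED_DIR/azblob/shreddedtrainingdata
```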
@@ -57,4 +58,4 @@ Supplementary files can be found [here](./docker).
You must agree to the following licenses prior to use:
* [High Performance ML Algorithms License](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/High%20Performance%20ML%20Algorithms%20-%20Standalone%20(free)%20Use%20Terms%20V2%20(06-06-18).txt)
* [TPN Ubuntu Container](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/TPN_Ubuntu%20Container_16-04-FINAL.txt)
* [Microsoft Third Party Notice](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/MicrosoftThirdPartyNotice.txt)


@@ -1,4 +1,4 @@
# Dockerfile for HPMLA (Microsoft High Performance ML Algorithms)
FROM ubuntu:16.04
MAINTAINER Saeed Maleki Todd Mytkowicz Madan Musuvathi Dany rouhana https://github.com/saeedmaleki/Distributed-Linear-Learner
@@ -15,7 +15,7 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
openmpi-common \
libopenmpi-dev && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# configure ssh server and keys
RUN mkdir -p /root/.ssh && \
@@ -27,12 +27,12 @@ RUN mkdir -p /root/.ssh && \
ssh-keygen -f /root/.ssh/id_rsa -t rsa -N '' && \
chmod 600 /root/.ssh/config && \
chmod 700 /root/.ssh && \
cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
# set parasail dir
WORKDIR /parasail
# To create your own image, first download supersgd from the link supplied
# in the README, and then put it in the same dir as this file.
COPY supersgd /parasail
COPY run_parasail.sh /parasail
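With `supersgd` and `run_parasail.sh` placed alongside this Dockerfile, an
image can be built and published along these lines (the tag below is a
placeholder; substitute your own registry and repository):

```sh
# Run from the directory containing the Dockerfile, supersgd, and run_parasail.sh.
docker build -t myregistry/symsgd:0.0.1 .
docker push myregistry/symsgd:0.0.1     # push so Batch Shipyard nodes can pull it
```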


@@ -1,26 +0,0 @@
## MADL-CPU-OpenMPI Data Shredding
This Data Shredding recipe shows how to shred and deploy your training data prior to running a training job on Azure VMs via Open MPI.
### Data Shredding Configuration
Rename `configuration-template.json` to `configuration.json`. The configuration should set the following properties:
* `node_count` should be set to the number of VMs in the compute pool.
* `thread_count` the number of threads per VM.
* `training_data_shred_count` the number of shreds to split the training data into. It's advisable to set this number high; that way you only shred the data once and can reuse the shreds for different VM configurations.
* `dataset_local_directory` a local directory where the training data is downloaded and shredded according to `training_data_shred_count`.
* `shredded_dataset_Per_Node` a local directory to hold the final data shreds before deploying them to Azure blobs.
* `container_name` the container name where the shredded data will be stored.
* `trainind_dataset_name` the name for the dataset, used when creating the data blobs.
* `subscription_id` the Azure subscription id.
* `secret_key` the Azure password.
* `resource_group` the resource group name.
* `storage_account` the storage account name and access key.
* `training_data_container_name` the container name where the original training data is hosted.
You can use your own access mechanism (password, access key, etc.); the above is only one example. However, make sure to update the Python script
every time you make a configuration change.
You must agree to the following licenses prior to use:
* [High Performance ML Algorithms License](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/High%20Performance%20ML%20Algorithms%20-%20Standalone%20(free)%20Use%20Terms%20V2%20(06-06-18).txt)
* [TPN Ubuntu Container](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/TPN_Ubuntu%20Container_16-04-FINAL.txt)
* [Microsoft Third Party Notice](https://github.com/saeedmaleki/Distributed-Linear-Learner/blob/master/MicrosoftThirdPartyNotice.txt)


@@ -105,9 +105,10 @@ This Keras+Theano-GPU recipe contains information on how to containerize
[Keras](https://keras.io/) with the
[Theano](http://www.deeplearning.net/software/theano/) backend for use with
N-Series Azure VMs.
#### [HPMLA-CPU-OpenMPI](./HPMLA-CPU-OpenMPI)
This recipe contains information on how to containerize the Microsoft High
Performance ML Algorithms (HPMLA) for use across multiple compute
nodes.
#### [MXNet-CPU](./MXNet-CPU)
This MXNet-CPU recipe contains information on how to containerize