* fix prepare.sh in mpi folder

* add horovod example && add auto-test support

* add horovod example

* fix json of horovod

* add sleep for horovod

* add ssh readiness detection shell script

* add ssh service detection script

* add hdfs support

* revise document of horovod

* revise examples document

* revise document

* fix review

* fix review

* fix review
This commit is contained in:
qyyy 2018-11-08 14:35:20 +08:00 committed by GitHub
Parent 498b1cefb7
Commit 20a497fbb8
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
14 changed files: 350 additions and 32 deletions


@ -4,6 +4,7 @@
- [Quick start: how to write and submit a CIFAR-10 job](#quickstart)
- [List of off-the-shelf examples](#offtheshelf)
- [List of customized job template](#customize)
- [What if an example fails](#debug)
- [Contributing](#contributing)
## Quick start: how to write and submit a CIFAR-10 job <a name="quickstart"></a>
@ -98,6 +99,15 @@ These user could customize and run these jobs over OpenPAI.
1. [Open MPI TensorFlow CIFAR-10](./mpi#open-mpi-tensorflow-cifar-10-example)
2. [Open MPI CNTK grapheme-to-phoneme conversion](./mpi#open-mpi-cntk-grapheme-to-phoneme-conversion-example)
## What if an example fails <a name="debug"></a>
An example in this folder may fail for any of the following reasons:
1. The format of the json file is incorrect. You may get an error when you paste the json into the web portal, usually because the portal's config format has been updated; refer to the latest version of the portal.
2. The docker image has been removed. You will see this error on your job tracking page. Open an issue to report it, or rebuild the image from the dockerfile in the example's folder, push it to another docker registry, and update the json file's image field. See the README or DOCKER file in that example's folder.
3. If the example contains a prepare.sh shell script, the job may fail because the source of the data or code has changed or is unstable. You will see the error on your job tracking page; check the script and try to fix it.
4. The versions of the code, tools, or libraries have drifted. You may hit this error if you rebuild the docker image, because some examples do not pin the versions of their dependencies; check and pin the versions.
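For reason 1, a quick local check is to validate the json before pasting it into the web portal. A minimal sketch, where `job.json` is a hypothetical local copy of an example's config and the inline content is a stand-in:

```shell
#!/bin/bash
# Write a stand-in config, then validate it with Python's json parser.
cat > job.json <<'EOF'
{"jobName": "horovod-mpi-cifar10", "taskRoles": []}
EOF
if python -m json.tool job.json > /dev/null 2>&1; then
    echo "json ok"       # prints "json ok" for this well-formed stand-in
else
    echo "json broken"
fi
```

A config with unescaped quotes or trailing commas prints `json broken` instead, which is cheaper to discover locally than in the portal.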
## Contributing <a name="contributing"></a>
If you want to contribute a job example that can be run on PAI, please open a new pull request.


@ -6,6 +6,8 @@
- [Parameters of the project](#Parameters_of_the_start.sh)
- [The mode of the project](#mode)
- [Note](#Note)
- [Add new example](#Add)
- [Delete example](#Delete)
## Introduction <a name="Introduction"></a>
This python project can help you test all the examples in this folder.
@ -59,7 +61,7 @@ And during the runtime of this shell script, it will require you input F/S or jo
- Enter job names like `cntk-mpi,tensorflow-mpi,sklearn-mnist` means you want to run just the three examples.
Here is an example to start the script: `/bin/bash pai_tmp/examples/auto-test/start.sh normal http://10.20.30.40:9186/api/v1/ 10.20.30.40:9000 http://10.20.30.40:50070 test test`
Here is an example to start the script: `echo "S" | /bin/bash pai_tmp/examples/auto-test/start.sh normal http://10.20.30.40:9186/api/v1/ 10.20.30.40:9000 http://10.20.30.40:50070 test test`
### mode <a name="mode"></a>
The project offers 3 different modes.
1. **ci mode**: If a job runs correctly within 10 minutes, the project regards it as succeeded.
@ -70,4 +72,13 @@ Use "release" as the first parameter of start.sh to enter this mode.
Use "normal" as the first parameter of start.sh to enter this mode.
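The ci-mode timeout can be pictured as a polling loop: check the job's state every few seconds and give up after the threshold. A hedged sketch, with `check_job` as a hypothetical stub standing in for a real REST-API query:

```shell
#!/bin/bash
threshold=600          # ci mode: 10 minutes
check_job() {          # stub so the sketch runs standalone; a real version
    echo "SUCCEEDED"   # would query the job's state over the REST API
}
poll_until_done() {
    local waited=0 interval=5
    while [ "$waited" -lt "$threshold" ]; do
        if [ "$(check_job)" = "SUCCEEDED" ]; then
            echo "PASS after ${waited}s"
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    echo "FAIL: exceeded ${threshold}s"
    return 1
}
poll_until_done
```

With the stub always reporting success, the sketch prints `PASS after 0s`; the release-mode variant would simply use a larger threshold.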
## Note <a name="Note"></a>
If a parameter contains special characters such as '&', please wrap that parameter in single quotation marks.
As of 27 September 2018, the mpi examples are still not ready; ignore them!
If you want to add or delete an example, please follow these steps:
### For adding <a name="Add"></a>
1. Prepare your example, including the json file. If data and code must be prepared before the job is submitted, also write a prepare shell script named "prepare.sh". You can refer to [prepare.sh](https://github.com/Microsoft/pai/blob/master/examples/tensorflow/prepare.sh).
2. Add the job name from your json file to [start.sh](./start.sh) (see the comment at line 17): add it to the "full" line, and also to the "stable" line if the job is stable (i.e. it can run correctly at any time).
3. Open a pull request.
### For deleting <a name="Delete"></a>
1. Delete your example.
2. Remove the job name from the "full" line of [start.sh](./start.sh) (see the comment at line 17), and from the "stable" line if it appears there.
3. Open a pull request.
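The list edit in the adding step can be scripted; a sketch on a throwaway copy of the two lists, where `my-new-job` is a placeholder and the real start.sh lists are much longer:

```shell
#!/bin/bash
# Miniature stand-in for start.sh's job lists.
cat > start.sh.demo <<'EOF'
full="cntk-mpi,tensorflow-mpi"
stable="sklearn-mnist"
EOF
# Prepend the new job name to both the "full" and "stable" lines.
sed -i 's/^full="/full="my-new-job,/; s/^stable="/stable="my-new-job,/' start.sh.demo
grep '^full=' start.sh.demo    # prints: full="my-new-job,cntk-mpi,tensorflow-mpi"
```

Deleting is the mirror image: remove the `my-new-job,` token from both lines.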


@ -14,11 +14,12 @@ if [ ! -d "pai/" ]; then
else
threshold=30
fi
full="cntk-mpi,tensorflow-mpi,sklearn-mnist,sklearn-text-vectorizers,tensorflow-cifar10,tensorflow-tensorboard,tensorflow-distributed-cifar10,kafka,mxnet-image-classification,mxnet-autoencoder,jupyter_example,tensorflow-serving,xgboost_gpu_hist,cntk-g2p,keras_cntk_backend_mnist,keras_tensorflow_backend_mnist,caffe-mnist,pytorch-regression,pytorch-mnist,chainer-cifar,caffe2-resnet50"
stable="sklearn-mnist,sklearn-text-vectorizers,tensorflow-cifar10,tensorflow-tensorboard,tensorflow-distributed-cifar10,kafka,mxnet-image-classification,mxnet-autoencoder,jupyter_example,tensorflow-serving,xgboost_gpu_hist,cntk-g2p,keras_cntk_backend_mnist,keras_tensorflow_backend_mnist,caffe-mnist,pytorch-regression,pytorch-mnist,chainer-cifar,caffe2-resnet50"
# If you want to add or delete examples, add or delete the name here
full="ocr-serving,horovod-mpi-cifar10,cntk-mpi,tensorflow-mpi,sklearn-mnist,sklearn-text-vectorizers,tensorflow-cifar10,tensorflow-tensorboard,tensorflow-distributed-cifar10,kafka,mxnet-image-classification,mxnet-autoencoder,jupyter_example,tensorflow-serving,xgboost_gpu_hist,cntk-g2p,keras_cntk_backend_mnist,keras_tensorflow_backend_mnist,caffe-mnist,pytorch-regression,pytorch-mnist,chainer-cifar,caffe2-resnet50"
stable="ocr-serving,horovod-mpi-cifar10,sklearn-mnist,sklearn-text-vectorizers,tensorflow-cifar10,tensorflow-tensorboard,tensorflow-distributed-cifar10,kafka,mxnet-image-classification,mxnet-autoencoder,jupyter_example,tensorflow-serving,xgboost_gpu_hist,cntk-g2p,keras_cntk_backend_mnist,keras_tensorflow_backend_mnist,caffe-mnist,pytorch-regression,pytorch-mnist,chainer-cifar,caffe2-resnet50"
echo "There are some errors in the mpi examples, so just ignore them!"
read -p "Please input the names of the examples you want to run, separated by ',', or input F/S to run all jobs or only the stable jobs:" mode
if [[ $mode =~ ^[a-zA-Z0-9_,]+$ ]]; then
if [[ $mode =~ ^[a-zA-Z0-9_,-]+$ ]]; then
echo "Run the job of "$mode
else
echo "Invalid job name input!"


@ -8,7 +8,7 @@
"cpuNumber": 2,
"memoryMB": 8192,
"gpuNumber": 0,
"command": "git clone https://github.com/Microsoft/pai.git && mv pai pai_tmp && echo \"S\" | /bin/bash pai_tmp/examples/auto-test/start.sh normal http://10.20.30.40:9186/api/v1/user/your_username/ 10.20.30.40:9000 http://10.20.30.40:50070 username password && rm -rf pai_tmp"
"command": "git clone https://github.com/Microsoft/pai.git && mv pai pai_tmp && echo \"S\" | /bin/bash pai_tmp/examples/auto-test/start.sh normal http://10.20.30.40:9186/api/v1/user/your_username/ 10.20.30.40:9000 http://10.20.30.40:50070 username password && rm -rf pai_tmp"
}
]
}


@ -57,8 +57,6 @@ fi
#make directory on HDFS
echo "Make cntk directory, waiting..."
hdfs dfs -mkdir -p hdfs://$1/examples/
hdfs dfs -mkdir -p hdfs://$1/examples/cntk
hdfs dfs -mkdir -p hdfs://$1/examples/cntk/code
hdfs dfs -mkdir -p hdfs://$1/examples/cntk/data
hdfs dfs -mkdir -p hdfs://$1/examples/cntk/output
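`hdfs dfs -mkdir -p` creates intermediate directories, so a leaf path is enough and parent-by-parent mkdir calls are redundant. A local sketch of the same idea, with plain `mkdir -p` standing in for the HDFS command:

```shell
#!/bin/bash
# One invocation with -p creates every missing parent along the way.
mkdir -p demo/examples/cntk/code demo/examples/cntk/data demo/examples/cntk/output
ls demo/examples/cntk    # lists: code, data, output
```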


@ -0,0 +1,96 @@
FROM nvidia/cuda:9.0-devel-ubuntu16.04
# TensorFlow version is tightly coupled to CUDA and cuDNN so it should be selected carefully
ENV TENSORFLOW_VERSION=1.10.0
ENV PYTORCH_VERSION=0.4.1
ENV CUDNN_VERSION=7.3.1.20-1+cuda9.0
ENV NCCL_VERSION=2.3.5-2+cuda9.0
# Python 2.7 or 3.5 is supported by Ubuntu Xenial out of the box
ARG python=2.7
ENV PYTHON_VERSION=${python}
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
cmake \
git \
curl \
vim \
wget \
ca-certificates \
libcudnn7=${CUDNN_VERSION} \
libnccl2=${NCCL_VERSION} \
libnccl-dev=${NCCL_VERSION} \
libjpeg-dev \
libpng-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev
RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python
RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py
# Install TensorFlow, Keras and PyTorch
RUN pip install tensorflow-gpu==${TENSORFLOW_VERSION} keras h5py torch==${PYTORCH_VERSION} torchvision
# Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v3.1/downloads/openmpi-3.1.2.tar.gz && \
tar zxf openmpi-3.1.2.tar.gz && \
cd openmpi-3.1.2 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi
# Install Horovod, temporarily using CUDA stubs
RUN ldconfig /usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs && \
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 pip install --no-cache-dir horovod && \
ldconfig
# Create a wrapper for OpenMPI to allow running as root by default
RUN mv /usr/local/bin/mpirun /usr/local/bin/mpirun.real && \
echo '#!/bin/bash' > /usr/local/bin/mpirun && \
echo 'mpirun.real --allow-run-as-root "$@"' >> /usr/local/bin/mpirun && \
chmod a+x /usr/local/bin/mpirun
# Configure OpenMPI to run good defaults:
# --bind-to none --map-by slot --mca btl_tcp_if_exclude lo,docker0
RUN echo "hwloc_base_binding_policy = none" >> /usr/local/etc/openmpi-mca-params.conf && \
echo "rmaps_base_mapping_policy = slot" >> /usr/local/etc/openmpi-mca-params.conf && \
echo "btl_tcp_if_exclude = lo,docker0" >> /usr/local/etc/openmpi-mca-params.conf
# Set default NCCL parameters
RUN echo NCCL_DEBUG=INFO >> /etc/nccl.conf && \
echo NCCL_SOCKET_IFNAME=^docker0 >> /etc/nccl.conf
# Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd
# Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config
WORKDIR "/root"
# Download jdk
RUN wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u191-b12/2787e4a523244c269598db4e85c51e0c/jdk-8u191-linux-x64.tar.gz && tar xvf jdk-8u191-linux-x64.tar.gz && rm jdk-8u191-linux-x64.tar.gz
# Download hadoop
RUN wget http://archive.apache.org/dist/hadoop/core/hadoop-3.1.1/hadoop-3.1.1.tar.gz && tar zxvf hadoop-3.1.1.tar.gz && rm hadoop-3.1.1.tar.gz
# Set java and hdfs env
ENV JAVA_HOME=/root/jdk1.8.0_191 \
HADOOP_HOME=/root/hadoop-3.1.1 \
PATH=$HADOOP_HOME/bin:$PATH
ENV HADOOP_HDFS_HOME=${HADOOP_HOME} \
LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${JAVA_HOME}/jre/lib/amd64/server \
PATH=$JAVA_HOME:$PATH


@ -0,0 +1,83 @@
<!--
Copyright (c) Microsoft Corporation
All rights reserved.
MIT License
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-->
# Horovod with MPI on OpenPAI
This guide introduces how to run an [Open MPI](https://www.open-mpi.org/) workload on OpenPAI.
We use the [tensorflow benchmark](https://github.com/tensorflow/benchmarks/tree/cnn_tf_v1.10_compatible/scripts/tf_cnn_benchmarks) as the example. It is hard to run with plain Open MPI and TensorFlow alone,
so we use [horovod](https://github.com/uber/horovod) as the runtime environment.
Other customized MPI code can be run similarly.
# Open MPI TensorFlow CIFAR-10 example
### Prepare work
1. Prepare the data:
* TensorFlow: Go to the [official website](http://www.cs.toronto.edu/~kriz/cifar.html) and download the python version of the data via this [url](http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz): `wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz && tar zxvf cifar-10-python.tar.gz && rm cifar-10-python.tar.gz`
After downloading the data, upload it to HDFS: `hdfs dfs -put filename hdfs://ip:port/examples/mpi/tensorflow/data` or `hdfs dfs -put filename hdfs://ip:port/examples/tensorflow/distributed-cifar-10/data`
Note that we use the same data as the [tensorflow distributed cifar-10 example](https://github.com/Microsoft/pai/tree/master/examples/tensorflow), so if you have already run that example, just reuse that data path.
2. Prepare the executable code:
* Tensorflow: We use the [tensorflow benchmark](https://github.com/tensorflow/benchmarks/tree/cnn_tf_v1.10_compatible) as the example code. Pay attention to the version: the example here uses the v1.10 code.
3. Prepare a docker image and upload it to docker hub. We use the [horovod official image](https://hub.docker.com/r/uber/horovod/tags/), tag `0.14.1-tf1.10.0-torch0.4.0-py2.7`. If you want a customized image, refer to the [official Dockerfile](https://github.com/uber/horovod/blob/master/Dockerfile) and make your own, then build it and push it to docker hub.
4. Prepare a script to detect whether the containers are ready before running the mpi job. [Here](./start.sh) is an example.
5. Prepare a job configuration file and submit it through the web portal. A config example follows.
**Note** that you can simply run prepare.sh to do the above preparation work, but you must make sure the HDFS client works on your local machine. If it does, just run the shell script with your HDFS socket as the parameter: `/bin/bash prepare.sh ip:port`
Here is a configuration file example:
## Open MPI TensorFlow CIFAR-10 example
### [TensorFlow cifar10 benchmark](https://git.io/vF4wT)
```js
{
"jobName": "horovod-mpi-cifar10",
"image": "openpai/example.horovod.mpi",
"dataDir": "$PAI_DEFAULT_FS_URI/examples/tensorflow/distributed-cifar-10/data",
"outputDir": "$PAI_DEFAULT_FS_URI/examples/horovod/output",
"codeDir": "$PAI_DEFAULT_FS_URI/examples/horovod/code",
"virtualCluster": "default",
"retryCount": 0,
"taskRoles": [
{
"name": "main",
"taskNumber": 1,
"cpuNumber": 4,
"memoryMB": 16384,
"shmMB": 64,
"gpuNumber": 2,
"minFailedTaskCount": 1,
"minSucceededTaskCount": 1,
"command": "/bin/bash code/start.sh"
},
{
"name": "worker",
"taskNumber": 1,
"cpuNumber": 4,
"memoryMB": 16384,
"shmMB": 64,
"gpuNumber": 2,
"minFailedTaskCount": 1,
"command": "sleep infinity"
}
]
}
```
For more details on how to write a job configuration file, please refer to [job tutorial](../../docs/job_tutorial.md#json-config-file-for-job-submission).


@ -0,0 +1,26 @@
{
"jobName": "horovod-mpi-cifar10",
"image": "openpai/example.horovod.mpi",
"dataDir": "$PAI_DEFAULT_FS_URI/examples/tensorflow/distributed-cifar-10/data",
"outputDir": "$PAI_DEFAULT_FS_URI/examples/horovod/output",
"codeDir": "$PAI_DEFAULT_FS_URI/examples/horovod/code",
"taskRoles": [
{
"name": "main",
"taskNumber": 1,
"cpuNumber": 4,
"memoryMB": 16384,
"gpuNumber": 2,
"minSucceededTaskCount": 1,
"command": "/bin/bash code/start.sh"
},
{
"name": "worker",
"taskNumber": 1,
"cpuNumber": 4,
"memoryMB": 16384,
"gpuNumber": 2,
"command": "sleep infinity"
}
]
}


@ -0,0 +1,48 @@
#horovod tensorflow cifar-10 prepare
function horovod_prepare_data(){
#download the data
wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz && tar zxvf cifar-10-python.tar.gz && rm cifar-10-python.tar.gz
#upload the data to HDFS
echo "Uploading cifar-10 data, waiting..."
for i in `ls cifar-10-batches-py`
do
hdfs dfs -put cifar-10-batches-py/$i hdfs://$1/examples/tensorflow/distributed-cifar-10/data
done
}
function horovod_prepare_code(){
#download the code
git clone -b cnn_tf_v1.10_compatible https://github.com/tensorflow/benchmarks.git
wget https://github.com/Microsoft/pai/raw/master/examples/horovod/start.sh
#upload the code to HDFS
echo "Uploading benchmarks code, waiting..."
hdfs dfs -put benchmarks/ hdfs://$1/examples/horovod/code
hdfs dfs -put start.sh hdfs://$1/examples/horovod/code
}
echo "Make horovod directory, waiting..."
hdfs dfs -mkdir -p hdfs://$1/examples/horovod/output
hdfs dfs -mkdir -p hdfs://$1/examples/horovod/code
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/distributed-cifar-10/data
hdfs dfs -test -e hdfs://$1/examples/horovod/code/*
if [ $? -eq 0 ] ;then
echo "Code exists on HDFS!"
else
horovod_prepare_code $1
echo "Have prepared code!"
fi
hdfs dfs -test -e hdfs://$1/examples/tensorflow/distributed-cifar-10/data/*
if [ $? -eq 0 ] ;then
echo "Data exists on HDFS!"
else
horovod_prepare_data $1
echo "Have prepared data"
fi
rm -rf cifar-10-batches-py*/ benchmarks*/ start.sh
echo "Removed local cifar-10 code and data successfully!"
echo "Preparing the horovod example based on horovod and tensorflow is done!"
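The `hdfs dfs -test -e` / exit-code pattern above makes the script idempotent: re-running it skips uploads that are already done. A local sketch of that pattern, with plain `test -e` standing in for the HDFS check:

```shell
#!/bin/bash
dir=$(mktemp -d)
prepare_once() {
    if test -e "$dir/marker"; then   # stand-in for: hdfs dfs -test -e <path>
        echo "Data exists!"
    else
        touch "$dir/marker"          # stand-in for the actual upload
        echo "Have prepared data"
    fi
}
prepare_once   # first run does the work: prints "Have prepared data"
prepare_once   # second run is a no-op: prints "Data exists!"
```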

examples/horovod/start.sh (new file, 17 lines)

@ -0,0 +1,17 @@
# Detect the files touched on HDFS when the containers start
PAI_HDFS_PREFIX=${PAI_DEFAULT_FS_URI}/Container
sshConnectInfoFolder=${PAI_HDFS_PREFIX}/${PAI_USER_NAME}/${PAI_JOB_NAME}/ssh/${APP_ID}
echo $sshConnectInfoFolder
# hdfs dfs -count prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
res=`hdfs dfs -count $sshConnectInfoFolder`
fileNum=`echo $res | awk -F ' ' '{print $2}'`
# Wait until every task role has written its ssh file; re-read the count on
# each iteration, otherwise the loop can never terminate
while [ $fileNum != $PAI_JOB_TASK_ROLE_COUNT ]
do
    sleep 30
    res=`hdfs dfs -count $sshConnectInfoFolder`
    fileNum=`echo $res | awk -F ' ' '{print $2}'`
done
sleep 30
# Run mpi work
mpirun -np 4 -H worker-0:2,main-0:2 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_DISABLE=1 -x CLASSPATH=$($HADOOP_HDFS_HOME/bin/hadoop classpath --glob) -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude docker0,lo,eth1 python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet20 --batch_size 32 --data_dir=$PAI_DATA_DIR --data_name=cifar10 --train_dir=$PAI_OUTPUT_DIR --variable_update horovod
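The readiness barrier in this script can be exercised locally: poll a directory until it holds one file per task role, with plain local files standing in for the ssh files on HDFS:

```shell
#!/bin/bash
roleCount=2                           # e.g. the "main" and "worker" task roles
dir=$(mktemp -d)
touch "$dir/main-0" "$dir/worker-0"   # pretend both roles have checked in
fileNum=$(ls "$dir" | wc -l)
while [ "$fileNum" -ne "$roleCount" ]; do
    sleep 1
    fileNum=$(ls "$dir" | wc -l)      # re-count on every iteration
done
echo "all $roleCount task roles ready"
```

Only once the count matches the number of task roles does the real script proceed to `mpirun`.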


@ -51,11 +51,7 @@ RUN mkdir /bazel && \
# Download and build TensorFlow.
WORKDIR /tensorflow
RUN git clone -b r1.4 https://github.com/tensorflow/tensorflow.git . && \
git cherry-pick -n f73d7c && \
sed -i '1 i\#define TENSORFLOW_USE_MPI' tensorflow/contrib/mpi_collectives/mpi_ops.cc && \
sed -i '1 i\#define TENSORFLOW_USE_MPI' tensorflow/contrib/mpi_collectives/ring.cc && \
sed -i '1 i\#define TENSORFLOW_USE_MPI' tensorflow/contrib/mpi_collectives/ring.cu.cc && \
sed -i '1 i\#define TENSORFLOW_USE_MPI' tensorflow/contrib/mpi_collectives/ring.h
git cherry-pick -n f73d7c
ENV TF_NEED_CUDA=1 \
TF_CUDA_COMPUTE_CAPABILITIES=3.0,3.5,5.2,6.0,6.1 \
TF_CUDA_VERSION=8.0 \
@ -84,4 +80,3 @@ RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/lib
rm -rf /root/.cache
WORKDIR /root
ADD tf-mpi.py tf-mpi.py


@ -5,9 +5,9 @@
// prepare cmudict corpus in CNTK format https://git.io/vbT5A and upload to hdfs
"dataDir": "$PAI_DEFAULT_FS_URI/examples/cntk/data",
// make a new dir for output on hdfs
"outputDir": "$PAI_DEFAULT_FS_URI/examples/cntk/output",
"outputDir": "$PAI_DEFAULT_FS_URI/examples/mpi/cntk/output",
// prepare g2p distributed training script cntk-mpi.sh and upload to hdfs
"codeDir": "$PAI_DEFAULT_FS_URI/examples/cntk/code",
"codeDir": "$PAI_DEFAULT_FS_URI/examples/mpi/cntk/code",
"virtualCluster": "default",
"taskRoles": [


@ -1,7 +1,7 @@
#mpi cntk prepare
echo "Prepare for the mpi example!"
function prepare_data(){
function mpi_cntk_prepare_data(){
#download data
echo "Downloading mpi cntk data, waiting..."
@ -37,7 +37,7 @@ function prepare_data(){
}
function prepare_code(){
function mpi_cntk_prepare_code(){
#code
#G2P.cntk
echo "Downloading mpi cntk code, waiting..."
@ -55,18 +55,15 @@ fi
#make directory on HDFS
echo "Make mpi cntk directory, waiting..."
hdfs dfs -mkdir -p hdfs://$1/examples/
hdfs dfs -mkdir -p hdfs://$1/examples/mpi
hdfs dfs -mkdir -p hdfs://$1/examples/mpi/cntk
hdfs dfs -mkdir -p hdfs://$1/examples/mpi/cntk/code
hdfs dfs -mkdir -p hdfs://$1/examples/mpi/cntk/data
hdfs dfs -mkdir -p hdfs://$1/examples/mpi/cntk/output
hdfs dfs -mkdir -p hdfs://$1/examples/cntk/data
hdfs dfs -test -e hdfs://$1/examples/mpi/cntk/code/*
if [ $? -eq 0 ] ;then
echo "Code exists on HDFS!"
else
prepare_code $1
mpi_cntk_prepare_code $1
echo "Have prepared code!"
fi
@ -74,7 +71,7 @@ hdfs dfs -test -e hdfs://$1/examples/cntk/data/*
if [ $? -eq 0 ] ;then
echo "Data exists on HDFS!"
else
prepare_data $1
mpi_cntk_prepare_data $1
echo "Have prepared data"
fi
@ -83,8 +80,48 @@ rm cntk-mpi.sh* G2P.cntk* cmudict* tiny.ctf*
echo "Removed local mpi cntk code and data successfully!"
#mpi tensorflow cifar-10 prepare
echo "Make mpi tensorflow directory, waiting..."
hdfs dfs -mkdir -p hdfs://$1/examples/mpi/tensorflow
hdfs dfs -mkdir -p hdfs://$1/examples/mpi/tensorflow/output
function mpi_tensorflow_prepare_data(){
#download the data
wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz && tar zxvf cifar-10-python.tar.gz && rm cifar-10-python.tar.gz
echo "Prepare for the mpi example done!"
#upload the data to HDFS
echo "Uploading cifar-10 data, waiting..."
for i in `ls cifar-10-batches-py`
do
hdfs dfs -put cifar-10-batches-py/$i hdfs://$1/examples/tensorflow/distributed-cifar-10/data
done
}
function mpi_tensorflow_prepare_code(){
#download the code
git clone -b tf_benchmark_stage https://github.com/tensorflow/benchmarks.git
#upload the code to HDFS
echo "Uploading benchmarks code, waiting..."
hdfs dfs -put benchmarks/ hdfs://$1/examples/tensorflow/distributed-cifar-10/code
}
echo "Make mpi tensorflow directory, waiting..."
hdfs dfs -mkdir -p hdfs://$1/examples/mpi/tensorflow/output
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/distributed-cifar-10/code
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/distributed-cifar-10/data
hdfs dfs -test -e hdfs://$1/examples/tensorflow/distributed-cifar-10/code/*
if [ $? -eq 0 ] ;then
echo "Code exists on HDFS!"
else
mpi_tensorflow_prepare_code $1
echo "Have prepared code!"
fi
hdfs dfs -test -e hdfs://$1/examples/tensorflow/distributed-cifar-10/data/*
if [ $? -eq 0 ] ;then
echo "Data exists on HDFS!"
else
mpi_tensorflow_prepare_data $1
echo "Have prepared data"
fi
rm -r cifar-10-batches-py*/ benchmarks*/
echo "Removed local cifar-10 code and data successfully!"
echo "Preparing the mpi example based on horovod and tensorflow is done!"


@ -36,9 +36,6 @@ echo "You must input hdfs socket as the only parameter! Or you cannot run this s
#make directory on HDFS
echo "Make imageNet directory, waiting..."
hdfs dfs -mkdir -p hdfs://$1/examples/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/imageNet/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/imageNet/data/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/imageNet/code/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/imageNet/output/
@ -92,7 +89,6 @@ function distributed_prepare_code(){
#make directory on HDFS
echo "Make distributed cifar-10 directory, waiting..."
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/distributed-cifar-10/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/distributed-cifar-10/code/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/distributed-cifar-10/data/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/distributed-cifar-10/output/