Mirror of https://github.com/microsoft/pai.git
[example]revise document (#1656)
* fix prepare.sh in mpi folder
* add horovod example && add auto-test support
* add horovod example
* fix json of horovod
* add sleep for horovod
* add ssh readiness detection shell script
* add ssh service detection script
* add hdfs support
* revise document of horovod
* revise examples document
* revise document
* fix review
* fix review
* fix review
This commit is contained in:
Parent 498b1cefb7
Commit 20a497fbb8

@@ -4,6 +4,7 @@
- [Quick start: how to write and submit a CIFAR-10 job](#quickstart)
- [List of off-the-shelf examples](#offtheshelf)
- [List of customized job templates](#customize)
- [What if an example fails](#debug)
- [Contributing](#contributing)

## Quick start: how to write and submit a CIFAR-10 job <a name="quickstart"></a>

@@ -98,6 +99,15 @@ These users could customize and run these jobs on OpenPAI.
1. [Open MPI TensorFlow CIFAR-10](./mpi#open-mpi-tensorflow-cifar-10-example)
2. [Open MPI CNTK grapheme-to-phoneme conversion](./mpi#open-mpi-cntk-grapheme-to-phoneme-conversion-example)

## What if an example fails <a name="debug"></a>

An example in this folder could fail for the following reasons:
1. The format of the json file is incorrect. You may get an error when you copy the json file to the webportal, possibly because the webportal has been updated; refer to the latest version of the webportal.
2. The docker image has been removed. You will find this error on your job tracking page. You can create an issue to report it, or build the image yourself from the dockerfile in the example's folder, push it to another docker registry, and modify the json file's image field (see the sketch after this list). Refer to the README or DOCKER file in that example's folder.
3. If the example you submit contains a prepare.sh shell script, it may fail because the source of the data or code has changed or is unstable. You may see the error on your job tracking page; check and try to fix it.
4. The version of the code, tools, or libraries has changed. You may get this error if you rebuild the docker image; some examples do not pin the versions of their dependencies, so check the versions.
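
For the second reason, here is a minimal sketch of rebuilding and republishing an image (the dockerfile name, the `your-registry` account, and the image tag are illustrative placeholders, not names from this repo):

```sh
# Build the image from the dockerfile shipped in the failing example's folder.
docker build -f Dockerfile.example -t your-registry/pai.example:latest .
# Push it to a registry you control.
docker push your-registry/pai.example:latest
# Then set the "image" field of the job's json file to "your-registry/pai.example:latest".
```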

## Contributing <a name="contributing"></a>

If you want to contribute a job example that can be run on PAI, please open a new pull request.

@@ -6,6 +6,8 @@
- [Parameters of the project](#Parameters_of_the_start.sh)
- [The mode of the project](#mode)
- [Note](#Note)
- [Add new example](#Add)
- [Delete example](#Delete)

## Introduction <a name="Introduction"></a>
This python project can help you test all the examples in this folder.

@@ -59,7 +61,7 @@ And during the runtime of this shell script, it will require you to input F/S or jo
- Entering job names like `cntk-mpi,tensorflow-mpi,sklearn-mnist` means you want to run just those three examples.

Here is an example to start the script: `/bin/bash pai_tmp/examples/auto-test/start.sh normal http://10.20.30.40:9186/api/v1/ 10.20.30.40:9000 http://10.20.30.40:50070 test test`
Here is an example to start the script: `echo "S" | /bin/bash pai_tmp/examples/auto-test/start.sh normal http://10.20.30.40:9186/api/v1/ 10.20.30.40:9000 http://10.20.30.40:50070 test test`

### mode <a name="mode"></a>
The project offers 3 different modes.
1. **ci mode**: If the job can run correctly within 10 minutes, the project will regard it as succeeded.

@@ -70,4 +72,13 @@ Use "release" as the first parameter of start.sh to enter this mode.
Use "normal" as the first parameter of start.sh to enter this mode.
|
||||

## Note <a name="Note"></a>
If a parameter contains special characters such as '&', please wrap that parameter in single quotation marks, as in the sketch below.
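
For instance, with a hypothetical REST endpoint that carries a query string (the URL is illustrative):

```sh
# Single quotes stop the shell from treating '&' as a background operator.
/bin/bash pai_tmp/examples/auto-test/start.sh normal 'http://10.20.30.40:9186/api/v1/?token=abc&expire=3600' 10.20.30.40:9000 http://10.20.30.40:50070 test test
```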

As of 27 September 2018, the mpi examples are still not ready; ignore them!

If you want to add or delete an example, please follow these steps:

### For adding <a name="Add"></a>
1. Prepare your example, including the json file. If data and code must be prepared before submitting the job, also write a prepare shell script named "prepare.sh". You can refer to [prepare.sh](https://github.com/Microsoft/pai/blob/master/examples/tensorflow/prepare.sh).
2. Add the job name from your json file to [start.sh](./start.sh); see the comment at line 17. Add your job name to the "full" line, and also to the "stable" line if the job is stable (it can run correctly at any time). A sketch follows this list.
3. Open your pull request.
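
A minimal sketch of step 2, assuming a new job named `my-example` (a hypothetical name; the real lists in start.sh are much longer and are abbreviated here):

```sh
# In start.sh (around line 17): append the new job name to both lists.
full="ocr-serving,horovod-mpi-cifar10,my-example"
stable="ocr-serving,horovod-mpi-cifar10,my-example"
```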

### For deleting <a name="Delete"></a>
1. Delete your example.
2. Delete the job name from the "full" line in [start.sh](./start.sh), and from the "stable" line if it is listed there; see the comment at line 17.
3. Open your pull request.

@@ -14,11 +14,12 @@ if [ ! -d "pai/" ]; then
else
    threshold=30
fi
full="cntk-mpi,tensorflow-mpi,sklearn-mnist,sklearn-text-vectorizers,tensorflow-cifar10,tensorflow-tensorboard,tensorflow-distributed-cifar10,kafka,mxnet-image-classification,mxnet-autoencoder,jupyter_example,tensorflow-serving,xgboost_gpu_hist,cntk-g2p,keras_cntk_backend_mnist,keras_tensorflow_backend_mnist,caffe-mnist,pytorch-regression,pytorch-mnist,chainer-cifar,caffe2-resnet50"
stable="sklearn-mnist,sklearn-text-vectorizers,tensorflow-cifar10,tensorflow-tensorboard,tensorflow-distributed-cifar10,kafka,mxnet-image-classification,mxnet-autoencoder,jupyter_example,tensorflow-serving,xgboost_gpu_hist,cntk-g2p,keras_cntk_backend_mnist,keras_tensorflow_backend_mnist,caffe-mnist,pytorch-regression,pytorch-mnist,chainer-cifar,caffe2-resnet50"
# If you want to add or delete examples, add or delete the name here
full="ocr-serving,horovod-mpi-cifar10,cntk-mpi,tensorflow-mpi,sklearn-mnist,sklearn-text-vectorizers,tensorflow-cifar10,tensorflow-tensorboard,tensorflow-distributed-cifar10,kafka,mxnet-image-classification,mxnet-autoencoder,jupyter_example,tensorflow-serving,xgboost_gpu_hist,cntk-g2p,keras_cntk_backend_mnist,keras_tensorflow_backend_mnist,caffe-mnist,pytorch-regression,pytorch-mnist,chainer-cifar,caffe2-resnet50"
stable="ocr-serving,horovod-mpi-cifar10,sklearn-mnist,sklearn-text-vectorizers,tensorflow-cifar10,tensorflow-tensorboard,tensorflow-distributed-cifar10,kafka,mxnet-image-classification,mxnet-autoencoder,jupyter_example,tensorflow-serving,xgboost_gpu_hist,cntk-g2p,keras_cntk_backend_mnist,keras_tensorflow_backend_mnist,caffe-mnist,pytorch-regression,pytorch-mnist,chainer-cifar,caffe2-resnet50"
echo "There are some error within the mpi example, so, just ignore them!"
|
||||
read -p "Please input name of the examples you want to run with ',' between two names, or you can just input F/S to run full jobs or only stable jobs:" mode
|
||||
if [[ $mode =~ ^[a-zA-Z0-9_,]+$ ]]; then
if [[ $mode =~ ^[a-zA-Z0-9_,-]+$ ]]; then
echo "Run the job of $mode"
else
echo "Invalid job names!"

@@ -8,7 +8,7 @@
"cpuNumber": 2,
|
||||
"memoryMB": 8192,
|
||||
"gpuNumber": 0,
|
||||
"command": "git clone https://github.com/Microsoft/pai.git && mv pai pai_tmp && echo \"S\" | /bin/bash pai_tmp/examples/auto-test/start.sh normal http://10.20.30.40:9186/api/v1/user/your_username/ 10.20.30.40:9000 http://10.20.30.40:50070 username password && rm -rf pai_tmp"
|
||||
"command": "git clone https://github.com/Microsoft/pai.git && mv pai pai_tmp && echo "S" | /bin/bash pai_tmp/examples/auto-test/start.sh normal http://10.20.30.40:9186/api/v1/user/your_username/ 10.20.30.40:9000 http://10.20.30.40:50070 username password && rm -rf pai_tmp"
|
||||
}
]
}

@@ -57,8 +57,6 @@ fi

#make directory on HDFS
echo "Make cntk directory, waiting..."
hdfs dfs -mkdir -p hdfs://$1/examples/
hdfs dfs -mkdir -p hdfs://$1/examples/cntk
hdfs dfs -mkdir -p hdfs://$1/examples/cntk/code
hdfs dfs -mkdir -p hdfs://$1/examples/cntk/data
hdfs dfs -mkdir -p hdfs://$1/examples/cntk/output

@@ -0,0 +1,96 @@
FROM nvidia/cuda:9.0-devel-ubuntu16.04

# TensorFlow version is tightly coupled to CUDA and cuDNN so it should be selected carefully
ENV TENSORFLOW_VERSION=1.10.0
ENV PYTORCH_VERSION=0.4.1
ENV CUDNN_VERSION=7.3.1.20-1+cuda9.0
ENV NCCL_VERSION=2.3.5-2+cuda9.0

# Python 2.7 or 3.5 is supported by Ubuntu Xenial out of the box
ARG python=2.7
ENV PYTHON_VERSION=${python}

RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential \
        cmake \
        git \
        curl \
        vim \
        wget \
        ca-certificates \
        libcudnn7=${CUDNN_VERSION} \
        libnccl2=${NCCL_VERSION} \
        libnccl-dev=${NCCL_VERSION} \
        libjpeg-dev \
        libpng-dev \
        python${PYTHON_VERSION} \
        python${PYTHON_VERSION}-dev

RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python

RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
    python get-pip.py && \
    rm get-pip.py

# Install TensorFlow, Keras and PyTorch
RUN pip install tensorflow-gpu==${TENSORFLOW_VERSION} keras h5py torch==${PYTORCH_VERSION} torchvision

# Install Open MPI
RUN mkdir /tmp/openmpi && \
    cd /tmp/openmpi && \
    wget https://www.open-mpi.org/software/ompi/v3.1/downloads/openmpi-3.1.2.tar.gz && \
    tar zxf openmpi-3.1.2.tar.gz && \
    cd openmpi-3.1.2 && \
    ./configure --enable-orterun-prefix-by-default && \
    make -j $(nproc) all && \
    make install && \
    ldconfig && \
    rm -rf /tmp/openmpi

# Install Horovod, temporarily using CUDA stubs
RUN ldconfig /usr/local/cuda-9.0/targets/x86_64-linux/lib/stubs && \
    HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 pip install --no-cache-dir horovod && \
    ldconfig

# Create a wrapper for OpenMPI to allow running as root by default
RUN mv /usr/local/bin/mpirun /usr/local/bin/mpirun.real && \
    echo '#!/bin/bash' > /usr/local/bin/mpirun && \
    echo 'mpirun.real --allow-run-as-root "$@"' >> /usr/local/bin/mpirun && \
    chmod a+x /usr/local/bin/mpirun

# Configure OpenMPI to run with good defaults:
#   --bind-to none --map-by slot --mca btl_tcp_if_exclude lo,docker0
RUN echo "hwloc_base_binding_policy = none" >> /usr/local/etc/openmpi-mca-params.conf && \
    echo "rmaps_base_mapping_policy = slot" >> /usr/local/etc/openmpi-mca-params.conf && \
    echo "btl_tcp_if_exclude = lo,docker0" >> /usr/local/etc/openmpi-mca-params.conf

# Set default NCCL parameters
RUN echo NCCL_DEBUG=INFO >> /etc/nccl.conf && \
    echo NCCL_SOCKET_IFNAME=^docker0 >> /etc/nccl.conf

# Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
    mkdir -p /var/run/sshd

# Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
    echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
    mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config

WORKDIR "/root"

# Download jdk
RUN wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/8u191-b12/2787e4a523244c269598db4e85c51e0c/jdk-8u191-linux-x64.tar.gz && tar xvf jdk-8u191-linux-x64.tar.gz && rm jdk-8u191-linux-x64.tar.gz

# Download hadoop
RUN wget http://archive.apache.org/dist/hadoop/core/hadoop-3.1.1/hadoop-3.1.1.tar.gz && tar zxvf hadoop-3.1.1.tar.gz && rm hadoop-3.1.1.tar.gz

# Set java and hdfs env
ENV JAVA_HOME=/root/jdk1.8.0_191 \
    HADOOP_HOME=/root/hadoop-3.1.1 \
    PATH=$HADOOP_HOME/bin:$PATH

ENV HADOOP_HDFS_HOME=${HADOOP_HOME} \
    LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${JAVA_HOME}/jre/lib/amd64/server \
    PATH=$JAVA_HOME/bin:$PATH
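
If you rebuild this image yourself, a minimal sketch of publishing it (the `your-registry` account is a placeholder; the job configs below expect the image name `openpai/example.horovod.mpi`):

```sh
# From the folder containing this Dockerfile:
docker build -t your-registry/example.horovod.mpi .
docker push your-registry/example.horovod.mpi
```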

@@ -0,0 +1,83 @@
<!--
Copyright (c) Microsoft Corporation
All rights reserved.

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-->

# Horovod with mpi on OpenPAI

This guide introduces how to run an [Open MPI](https://www.open-mpi.org/) workload on OpenPAI.
We use the [tensorflow benchmark](https://github.com/tensorflow/benchmarks/tree/cnn_tf_v1.10_compatible/scripts/tf_cnn_benchmarks) as the example. Since it is hard to run this benchmark with only openmpi and tensorflow, we use [horovod](https://github.com/uber/horovod) as the runtime environment.
Other customized MPI code can be run similarly.

# Open MPI TensorFlow CIFAR-10 example

### Prepare work
1. Prepare the data:
    * TensorFlow: Go to the [official website](http://www.cs.toronto.edu/~kriz/cifar.html) and download the python version of the data from this [url](http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz): `wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz && tar zxvf cifar-10-python.tar.gz && rm cifar-10-python.tar.gz`

    After downloading the data, upload it to HDFS: `hdfs dfs -put filename hdfs://ip:port/examples/mpi/tensorflow/data` or `hdfs dfs -put filename hdfs://ip:port/examples/tensorflow/distributed-cifar-10/data`

    Note that we use the same data as the [tensorflow distributed cifar-10 example](https://github.com/Microsoft/pai/tree/master/examples/tensorflow), so if you have already run that example, just reuse that data path.
2. Prepare the executable code:
    * Tensorflow: We use the [tensorflow benchmark](https://github.com/tensorflow/benchmarks/tree/cnn_tf_v1.10_compatible) as the example code. Pay attention to the version: the example here uses the v1.10 code.
3. Prepare a docker image and upload it to docker hub. We use the [horovod official image](https://hub.docker.com/r/uber/horovod/tags/) with tag `0.14.1-tf1.10.0-torch0.4.0-py2.7`. If you want to use a customized image, refer to the [official Docker file](https://github.com/uber/horovod/blob/master/Dockerfile) and make your own; then build it and push it to docker hub.
4. Prepare a script to detect whether the containers are ready before running the mpi job. [Here](./start.sh) is an example.
5. Prepare a job configuration file and submit it through the webportal. Config examples follow.

**Note** that you can simply run prepare.sh to do the above preparation work, but you must make sure you can use the HDFS client on your local machine. If you can, just run the shell script with your HDFS socket as the parameter: `/bin/bash prepare.sh ip:port`

Here is a configuration file example:

## Open MPI TensorFlow CIFAR-10 example

### [TensorFlow cifar10 benchmark](https://git.io/vF4wT)

```js
{
  "jobName": "horovod-mpi-cifar10",
  "image": "openpai/example.horovod.mpi",
  "dataDir": "$PAI_DEFAULT_FS_URI/examples/tensorflow/distributed-cifar-10/data",
  "outputDir": "$PAI_DEFAULT_FS_URI/examples/horovod/output",
  "codeDir": "$PAI_DEFAULT_FS_URI/examples/horovod/code",
  "virtualCluster": "default",
  "retryCount": 0,
  "taskRoles": [
    {
      "name": "main",
      "taskNumber": 1,
      "cpuNumber": 4,
      "memoryMB": 16384,
      "shmMB": 64,
      "gpuNumber": 2,
      "minFailedTaskCount": 1,
      "minSucceededTaskCount": 1,
      "command": "/bin/bash code/start.sh"
    },
    {
      "name": "worker",
      "taskNumber": 1,
      "cpuNumber": 4,
      "memoryMB": 16384,
      "shmMB": 64,
      "gpuNumber": 2,
      "minFailedTaskCount": 1,
      "command": "sleep infinity"
    }
  ]
}
```

For more details on how to write a job configuration file, please refer to the [job tutorial](../../docs/job_tutorial.md#json-config-file-for-job-submission).

@@ -0,0 +1,26 @@
{
  "jobName": "horovod-mpi-cifar10",
  "image": "openpai/example.horovod.mpi",
  "dataDir": "$PAI_DEFAULT_FS_URI/examples/tensorflow/distributed-cifar-10/data",
  "outputDir": "$PAI_DEFAULT_FS_URI/examples/horovod/output",
  "codeDir": "$PAI_DEFAULT_FS_URI/examples/horovod/code",
  "taskRoles": [
    {
      "name": "main",
      "taskNumber": 1,
      "cpuNumber": 4,
      "memoryMB": 16384,
      "gpuNumber": 2,
      "minSucceededTaskCount": 1,
      "command": "/bin/bash code/start.sh"
    },
    {
      "name": "worker",
      "taskNumber": 1,
      "cpuNumber": 4,
      "memoryMB": 16384,
      "gpuNumber": 2,
      "command": "sleep infinity"
    }
  ]
}

@@ -0,0 +1,48 @@
#horovod tensorflow cifar-10 prepare
function horovod_prepare_data(){
    #download the data
    wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz && tar zxvf cifar-10-python.tar.gz && rm cifar-10-python.tar.gz

    #upload the data to HDFS
    echo "Uploading cifar-10 data, waiting..."
    for i in `ls cifar-10-batches-py`
    do
        hdfs dfs -put cifar-10-batches-py/$i hdfs://$1/examples/tensorflow/distributed-cifar-10/data
    done
}

function horovod_prepare_code(){
    #download the code
    git clone -b cnn_tf_v1.10_compatible https://github.com/tensorflow/benchmarks.git
    wget https://github.com/Microsoft/pai/raw/master/examples/horovod/start.sh

    #upload the code to HDFS
    echo "Uploading benchmarks code, waiting..."
    hdfs dfs -put benchmarks/ hdfs://$1/examples/horovod/code
    hdfs dfs -put start.sh hdfs://$1/examples/horovod/code
}

echo "Make horovod directory, waiting..."
hdfs dfs -mkdir -p hdfs://$1/examples/horovod/output
hdfs dfs -mkdir -p hdfs://$1/examples/horovod/code
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/distributed-cifar-10/data

hdfs dfs -test -e hdfs://$1/examples/horovod/code/*
if [ $? -eq 0 ] ;then
    echo "Code exists on HDFS!"
else
    horovod_prepare_code $1
    echo "Have prepared code!"
fi

hdfs dfs -test -e hdfs://$1/examples/tensorflow/distributed-cifar-10/data/*
if [ $? -eq 0 ] ;then
    echo "Data exists on HDFS!"
else
    horovod_prepare_data $1
    echo "Have prepared data!"
fi

rm -rf cifar-10-batches-py*/ benchmarks*/ start.sh
echo "Removed local cifar-10 code and data successfully!"
echo "Preparing the horovod example based on horovod and tensorflow is done!"

@@ -0,0 +1,17 @@
# Detect the files touched on HDFS when the containers start
PAI_HDFS_PREFIX=${PAI_DEFAULT_FS_URI}/Container
sshConnectInfoFolder=${PAI_HDFS_PREFIX}/${PAI_USER_NAME}/${PAI_JOB_NAME}/ssh/${APP_ID}
echo $sshConnectInfoFolder
res=`hdfs dfs -count $sshConnectInfoFolder`

# Split the result and get the number of existing files
fileNum=`echo $res | awk -F ' ' '{print $2}'`
while [ $fileNum != $PAI_JOB_TASK_ROLE_COUNT ]
do
    sleep 30
    # Re-count on every pass so the loop can end once all task roles have touched their files
    res=`hdfs dfs -count $sshConnectInfoFolder`
    fileNum=`echo $res | awk -F ' ' '{print $2}'`
done
sleep 30

# Run the mpi work
mpirun -np 4 -H worker-0:2,main-0:2 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_DISABLE=1 -x CLASSPATH=$($HADOOP_HDFS_HOME/bin/hadoop classpath --glob) -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude docker0,lo,eth1 python code/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet20 --batch_size 32 --data_dir=$PAI_DATA_DIR --data_name=cifar10 --train_dir=$PAI_OUTPUT_DIR --variable_update horovod
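
A quick gloss of the mpirun invocation above (flag meanings per the Open MPI and Horovod documentation; `main-0` and `worker-0` are the hostnames PAI derives from the two task roles):

```sh
# -np 4                          launch 4 MPI processes in total
# -H worker-0:2,main-0:2         place 2 processes on each task-role container
# -bind-to none -map-by slot     Horovod-recommended binding and mapping settings
# -x NAME                        forward an environment variable to every rank
# -mca pml ob1 -mca btl ^openib  force plain TCP instead of InfiniBand verbs
# -mca btl_tcp_if_exclude ...    keep MPI traffic off the docker and loopback interfaces
```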

@@ -51,11 +51,7 @@ RUN mkdir /bazel && \
# Download and build TensorFlow.
WORKDIR /tensorflow
RUN git clone -b r1.4 https://github.com/tensorflow/tensorflow.git . && \
    git cherry-pick -n f73d7c && \
    sed -i '1 i\#define TENSORFLOW_USE_MPI' tensorflow/contrib/mpi_collectives/mpi_ops.cc && \
    sed -i '1 i\#define TENSORFLOW_USE_MPI' tensorflow/contrib/mpi_collectives/ring.cc && \
    sed -i '1 i\#define TENSORFLOW_USE_MPI' tensorflow/contrib/mpi_collectives/ring.cu.cc && \
    sed -i '1 i\#define TENSORFLOW_USE_MPI' tensorflow/contrib/mpi_collectives/ring.h
    git cherry-pick -n f73d7c
ENV TF_NEED_CUDA=1 \
    TF_CUDA_COMPUTE_CAPABILITIES=3.0,3.5,5.2,6.0,6.1 \
    TF_CUDA_VERSION=8.0 \

@@ -84,4 +80,3 @@ RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/lib
    rm -rf /root/.cache

WORKDIR /root
ADD tf-mpi.py tf-mpi.py

@@ -5,9 +5,9 @@
// prepare cmudict corpus in CNTK format https://git.io/vbT5A and upload to hdfs
"dataDir": "$PAI_DEFAULT_FS_URI/examples/cntk/data",
// make a new dir for output on hdfs
"outputDir": "$PAI_DEFAULT_FS_URI/examples/cntk/output",
"outputDir": "$PAI_DEFAULT_FS_URI/examples/mpi/cntk/output",
// prepare g2p distributed training script cntk-mpi.sh and upload to hdfs
"codeDir": "$PAI_DEFAULT_FS_URI/examples/cntk/code",
"codeDir": "$PAI_DEFAULT_FS_URI/examples/mpi/cntk/code",
"virtualCluster": "default",

"taskRoles": [

@@ -1,7 +1,7 @@
#mpi cntk prepare
echo "Prepare for the mpi example!"

function prepare_data(){
function mpi_cntk_prepare_data(){

    #download data
    echo "Downloading mpi cntk data, waiting..."

@@ -37,7 +37,7 @@ function prepare_data(){
}

function prepare_code(){
function mpi_cntk_prepare_code(){
    #code
    #G2P.cntk
    echo "Downloading mpi cntk code, waiting..."

@@ -55,18 +55,15 @@ fi

#make directory on HDFS
echo "Make mpi cntk directory, waiting..."
hdfs dfs -mkdir -p hdfs://$1/examples/
hdfs dfs -mkdir -p hdfs://$1/examples/mpi
hdfs dfs -mkdir -p hdfs://$1/examples/mpi/cntk
hdfs dfs -mkdir -p hdfs://$1/examples/mpi/cntk/code
hdfs dfs -mkdir -p hdfs://$1/examples/mpi/cntk/data
hdfs dfs -mkdir -p hdfs://$1/examples/mpi/cntk/output
hdfs dfs -mkdir -p hdfs://$1/examples/cntk/data

hdfs dfs -test -e hdfs://$1/examples/mpi/cntk/code/*
if [ $? -eq 0 ] ;then
echo "Code exists on HDFS!"
else
prepare_code $1
mpi_cntk_prepare_code $1
echo "Have prepared code!"
fi

@@ -74,7 +71,7 @@ hdfs dfs -test -e hdfs://$1/examples/cntk/data/*
if [ $? -eq 0 ] ;then
echo "Data exists on HDFS!"
else
prepare_data $1
mpi_cntk_prepare_data $1
echo "Have prepared data"
fi

@@ -83,8 +80,48 @@ rm cntk-mpi.sh* G2P.cntk* cmudict* tiny.ctf*
echo "Removed local mpi cntk code and data succeeded!"
|
||||
|
||||
#mpi tensorflow cifar-10 prepare
|
||||
echo "Make mpi tensorflow directory, waiting..."
|
||||
hdfs dfs -mkdir -p hdfs://$1/examples/mpi/tensorflow
|
||||
hdfs dfs -mkdir -p hdfs://$1/examples/mpi/tensorflow/output
|
||||
function mpi_tensorflow_prepare_data(){
|
||||
#download the data
|
||||
wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz && tar zxvf cifar-10-python.tar.gz && rm cifar-10-python.tar.gz
|
||||
|
||||
echo "Prepare for the mpi example done!"
|
||||
#upload the data to HDFS
|
||||
echo "Uploading cifar-10 data, waiting..."
|
||||
for i in `ls cifar-10-batches-py`
|
||||
do
|
||||
hdfs dfs -put cifar-10-batches-py/$i hdfs://$1/examples/tensorflow/distributed-cifar-10/data
|
||||
done
|
||||
}
|
||||
|
||||
function mpi_tensorflow_prepare_code(){
|
||||
#download the code
|
||||
git clone -b tf_benchmark_stage https://github.com/tensorflow/benchmarks.git
|
||||
|
||||
#upload the code to HDFS
|
||||
echo "Uploading benchmarks code, waiting..."
|
||||
hdfs dfs -put benchmarks/ hdfs://$1/examples/tensorflow/distributed-cifar-10/code
|
||||
}
|
||||
|
||||
echo "Make mpi tensorflow directory, waiting..."
|
||||
hdfs dfs -mkdir -p hdfs://$1/examples/mpi/tensorflow/output
|
||||
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/distributed-cifar-10/code
|
||||
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/distributed-cifar-10/data
|
||||
|
||||
hdfs dfs -test -e hdfs://$1/examples/tensorflow/distributed-cifar-10/code/*
|
||||
if [ $? -eq 0 ] ;then
|
||||
echo "Code exists on HDFS!"
|
||||
else
|
||||
mpi_tensorflow_prepare_code $1
|
||||
echo "Have prepared code!"
|
||||
fi
|
||||
|
||||
hdfs dfs -test -e hdfs://$1/examples/tensorflow/distributed-cifar-10/data/*
|
||||
if [ $? -eq 0 ] ;then
|
||||
echo "Data exists on HDFS!"
|
||||
else
|
||||
mpi_tensorflow_prepare_data $1
|
||||
echo "Have prepared data"
|
||||
fi
|
||||
|
||||
rm -r cifar-10-batches-py*/ benchmarks*/
|
||||
echo "Removed local cifar-10 code and data succeeded!"
|
||||
echo "Prepare mpi example based on horovod and tensorflow done!"
|
||||

@@ -36,9 +36,6 @@ echo "You must input hdfs socket as the only parameter! Or you cannot run this s

#make directory on HDFS
echo "Make imageNet directory, waiting..."
hdfs dfs -mkdir -p hdfs://$1/examples/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/imageNet/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/imageNet/data/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/imageNet/code/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/imageNet/output/

@@ -92,7 +89,6 @@ function distributed_prepare_code(){

#make directory on HDFS
echo "Make distributed cifar-10 directory, waiting..."
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/distributed-cifar-10/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/distributed-cifar-10/code/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/distributed-cifar-10/data/
hdfs dfs -mkdir -p hdfs://$1/examples/tensorflow/distributed-cifar-10/output/