Manuel Möhlmann 2018-11-05 06:54:36 +01:00, committed by fan yang
Parent ce46332b16
Commit 7461d3091a
12 changed files: 18 additions and 18 deletions


@@ -1,6 +1,6 @@
 # Goal
-Monitoring all compoments in pai, provide insight on detectiving system/hardware failuring and
+Monitoring all components in pai, provide insight on detectiving system/hardware failuring and
 analysing jobs performance.
 # Architecture
@@ -23,7 +23,7 @@ metrics to volume mounted in `/datastorage/prometheus`.
 Metrics generated by `watchdog` and `gpu_exporter` are collected by `node_exporter` container running
 inside `exporter` pod. Those metrics are scraped by `node_exporter` container. `node_exporter` also
-expose node metricss like node cpu/memory/disk usage.
+expose node metrics like node cpu/memory/disk usage.
 # Metrics collected
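
As a quick sanity check of the metrics flow this hunk describes, you can query the `node_exporter` endpoint on a node directly. A minimal sketch, assuming `node_exporter` is reachable on its default port 9100 (a PAI deployment may expose a different one):

```bash
# List a few node-level metrics from the exporter endpoint; replace
# <node-ip> with a cluster node's address. Port 9100 is node_exporter's
# default and is an assumption here, not confirmed by this doc.
curl -s http://<node-ip>:9100/metrics | grep -E '^node_(cpu|memory|filesystem)' | head
```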


@@ -34,7 +34,7 @@ Usually there will have multiple patch files, the newest one is the last known g
 Below are step-by-step build for advance user:
-1. Prepare linux enviroment
+1. Prepare linux environment
 Ubuntu 16.04 is the default system. This dependencies must be installed:


@@ -44,7 +44,7 @@ Pylon starts a [nginx](http://nginx.org/) instance in a Docker container to prov
 ### For deploying as a standalone service (debugging)
-If the nginx in Pylon is to be deployed as a stand alone service (usually for debugging purpose), the following envirionment variables must be set in advance:
+If the nginx in Pylon is to be deployed as a stand alone service (usually for debugging purpose), the following environment variables must be set in advance:
 - `REST_SERVER_URI`: String. The root url of the REST server.
 - `K8S_API_SERVER_URI`: String. The root url of Kubernetes's API server.
 - `WEBHDFS_URI`: String. The root url of WebHDFS's API server.
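
For reference, setting these three variables before launching a standalone Pylon might look like the sketch below; the hosts and ports are placeholders for your own cluster's endpoints, not defaults guaranteed by Pylon:

```bash
# Placeholder endpoints; substitute the actual addresses of your
# REST server, Kubernetes API server, and WebHDFS service.
export REST_SERVER_URI=http://10.0.0.1:9186
export K8S_API_SERVER_URI=http://10.0.0.1:8080
export WEBHDFS_URI=http://10.0.0.1:50070
```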


@@ -44,7 +44,7 @@ If web portal is deployed within PAI cluster, the following config field could b
 ---
-If web portal is deployed as a standalone service, the following envioronment variables must be configured:
+If web portal is deployed as a standalone service, the following environment variables must be configured:
 * `REST_SERVER_URI`: URI of [REST Server](../rest-server)
 * `PROMETHEUS_URI`: URI of [Prometheus](../../src/prometheus)
@@ -70,7 +70,7 @@ The deployment of web portal goes with the bootstrapping process of the whole PA
 ---
-If web portal is need to be deplyed as a standalone service, follow these steps:
+If web portal is need to be deployed as a standalone service, follow these steps:
 1. Go into the `webportal` directory.
 2. Make sure the environment variables is fully configured.
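
A hedged sketch of those two steps: the web portal is a Node.js application, so a standalone run could look like the following, where the endpoint values are placeholders and `npm install`/`npm start` are assumptions about the package scripts rather than commands confirmed by this hunk:

```bash
cd webportal
export REST_SERVER_URI=http://10.0.0.1:9186   # placeholder REST Server endpoint
export PROMETHEUS_URI=http://10.0.0.1:9090    # placeholder Prometheus endpoint
npm install   # assumed package scripts; check the webportal README
npm start
```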


@@ -58,8 +58,8 @@ Users can refer to this tutorial [submit a job in web portal](https://github.com
 Examples which can be run by submitting the json straightly without any modification.
-* [tensorflow.cifar10.json](./tensorflow/tensorflow.cifar10.json): Single GPU trainning on CIFAR-10 using TensorFlow.
-* [pytorch.mnist.json](./pytorch/pytorch.mnist.json): Single GPU trainning on MNIST using PyTorch.
+* [tensorflow.cifar10.json](./tensorflow/tensorflow.cifar10.json): Single GPU training on CIFAR-10 using TensorFlow.
+* [pytorch.mnist.json](./pytorch/pytorch.mnist.json): Single GPU training on MNIST using PyTorch.
 * [pytorch.regression.json](./pytorch/pytorch.regression.json): Regression using PyTorch.
 * [mxnet.autoencoder.json](./mxnet/mxnet.autoencoder.json): Autoencoder using MXNet.
 * [mxnet.image-classification.json](./mxnet/mxnet.image-classification.json): Image
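
To inspect one of these ready-made configs before submitting it, you can fetch it from the repository; the raw URL below is inferred from the relative links in this README and is an assumption about the repository layout:

```bash
# Download and pretty-print one example job config for review.
wget https://github.com/Microsoft/pai/raw/master/examples/tensorflow/tensorflow.cifar10.json
python -m json.tool tensorflow.cifar10.json
```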


@@ -28,10 +28,10 @@ The following contents show some basic CNTK examples, other customized CNTK code
 ### prepare
 To run CNTK examples in OpenPAI, you need to do the following things:
 1. Prepare the data by downloading all files in https://git.io/vbT5A(`wget https://github.com/Microsoft/CNTK/raw/master/Examples/SequenceToSequence/CMUDict/Data/cmudict-0.7b`) and put them up to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/cntk/data`.
-2. Prepare the execable code(`wget https://github.com/Microsoft/pai/raw/master/examples/cntk/cntk-g2p.sh`) and config(`wget https://github.com/Microsoft/CNTK/raw/master/Examples/SequenceToSequence/CMUDict/BrainScript/G2P.cntk`). And upload them to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/cntk/code`.
+2. Prepare the executable code(`wget https://github.com/Microsoft/pai/raw/master/examples/cntk/cntk-g2p.sh`) and config(`wget https://github.com/Microsoft/CNTK/raw/master/Examples/SequenceToSequence/CMUDict/BrainScript/G2P.cntk`). And upload them to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/cntk/code`.
 3. Prepare a docker image and upload it to docker hub. You can get the tutorial below.
 4. Prepare a job configuration file and submit it through webportal.
-Note that you can simply run the prepare.sh to do the above preparing work, but you must make sure you can use HDFS client on your local mechine. If you can, just run the shell script with a parameter of your HDFS socket!`/bin/bash prepare.sh ip:port`
+Note that you can simply run the prepare.sh to do the above preparing work, but you must make sure you can use HDFS client on your local machine. If you can, just run the shell script with a parameter of your HDFS socket!`/bin/bash prepare.sh ip:port`
 OpenPAI packaged the docker env required by the job for user to use. User could refer to [DOCKER.md](./DOCKER.md) to customize this example docker env. If user have built a customized image and pushed it to Docker Hub, replace our pre-built image `openpai/pai.example.caffe` with your own.
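
Spelled out, steps 1 and 2 above amount to the following sketch, assuming an HDFS client is available locally and `ip:port` stands for your HDFS namenode socket:

```bash
# Data (step 1): download the CMUDict file and upload it to HDFS.
wget https://github.com/Microsoft/CNTK/raw/master/Examples/SequenceToSequence/CMUDict/Data/cmudict-0.7b
hdfs dfs -put cmudict-0.7b hdfs://ip:port/examples/cntk/data
# Code and config (step 2): fetch the script and BrainScript config, then upload.
wget https://github.com/Microsoft/pai/raw/master/examples/cntk/cntk-g2p.sh
wget https://github.com/Microsoft/CNTK/raw/master/Examples/SequenceToSequence/CMUDict/BrainScript/G2P.cntk
hdfs dfs -put cntk-g2p.sh G2P.cntk hdfs://ip:port/examples/cntk/code
```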


@@ -74,6 +74,6 @@ For more details on how to write a job configuration file, please refer to [job
 ### Note:
-Since PAI runs Keras jobs in Docker, the trainning speed on PAI should be similar to speed on host.
+Since PAI runs Keras jobs in Docker, the training speed on PAI should be similar to speed on host.
 We provide two stable docker images by adding the data to the images. If you want to use them, add `stable` tag to the image name: `openpai/pai.example.keras.cntk:stable` or `openpai/pai.example.keras.tensorflow:stable`.
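
If you opt for the stable images, pulling them ahead of time lets you verify availability; the image names are taken from the note above, though their continued presence on Docker Hub is not guaranteed:

```bash
# Pre-fetch the two stable Keras images mentioned in this README.
docker pull openpai/pai.example.keras.cntk:stable
docker pull openpai/pai.example.keras.tensorflow:stable
```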


@@ -40,13 +40,13 @@ After you downloading the data, upload them to HDFS:`hdfs dfs -put filename hdfs
 Note that we use the same data as tensorflow distributed cifar-10 example. So, if you have already run that example, just use that data path.
 * CNTK: Download all files in https://git.io/vbT5A `wget https://github.com/Microsoft/CNTK/raw/master/Examples/SequenceToSequence/CMUDict/Data/cmudict-0.7b` and put them up to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/cntk/data` or `hdfs dfs -put filename hdfs://ip:port/examples/mpi/cntk/data`.
 Note that we use the same data as cntk example. So, if you have already run that example, just use that data path.
-2. Prepare the execable code:
+2. Prepare the executable code:
 * Tensorflow: We use the same code as tensorflow distributed cifar-10 example. You can follow [that document](https://github.com/Microsoft/pai/blob/master/examples/tensorflow/README.md).
 * cntk: Download the script example from [github](https://github.com/Microsoft/pai/blob/master/examples/mpi/cntk-mpi.sh)`wget https://github.com/Microsoft/pai/raw/master/examples/mpi/cntk-mpi.sh`. Then upload them to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/mpi/cntk/code/`
 3. Prepare a docker image and upload it to docker hub. OpenPAI packaged the docker env required by the job for user to use. User could refer to [DOCKER.md](./DOCKER.md) to customize this example docker env. If user have built a customized image and pushed it to Docker Hub, replace our pre-built image `openpai/pai.example.tensorflow-mpi`, `openpai/pai.example.cntk-mp` with your own.
 4. Prepare a job configuration file and submit it through webportal. The config examples are following.
-**Note** that you can simply run the prepare.sh to do the above preparing work, but you must make sure you can use HDFS client on your local mechine. If you can, just run the shell script with a parameter of your HDFS socket! `/bin/bash prepare.sh ip:port`
+**Note** that you can simply run the prepare.sh to do the above preparing work, but you must make sure you can use HDFS client on your local machine. If you can, just run the shell script with a parameter of your HDFS socket! `/bin/bash prepare.sh ip:port`
 Here're some configuration file examples:
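
(The configuration file examples themselves follow in the file, beyond this hunk.) As a complement, the CNTK branch of step 2 can be done manually with a short sketch, assuming a local HDFS client and with `ip:port` standing for your HDFS namenode socket:

```bash
# Fetch the MPI launch script and upload it to the path used by step 2.
wget https://github.com/Microsoft/pai/raw/master/examples/mpi/cntk-mpi.sh
hdfs dfs -put cntk-mpi.sh hdfs://ip:port/examples/mpi/cntk/code/
```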


@@ -79,6 +79,6 @@ For more details on how to write a job configuration file, please refer to [job
 ### Note:
-Since PAI runs MXNet jobs in Docker, the trainning speed on PAI should be similar to speed on host.
+Since PAI runs MXNet jobs in Docker, the training speed on PAI should be similar to speed on host.
 We provide a stable docker image by adding the data to the image. If you want to use it, add `stable` tag to the image name: `openpai/pai.example.mxnet:stable`.


@@ -79,4 +79,4 @@ For more details on how to write a job configuration file, please refer to [job
 ## Note:
-Since PAI runs PyTorch jobs in Docker, the trainning speed on PAI should be similar to speed on host.
+Since PAI runs PyTorch jobs in Docker, the training speed on PAI should be similar to speed on host.


@@ -79,6 +79,6 @@ For more details on how to write a job configuration file, please refer to [job
 ### Note:
-Since PAI runs PyTorch jobs in Docker, the trainning speed on PAI should be similar to speed on host.
+Since PAI runs PyTorch jobs in Docker, the training speed on PAI should be similar to speed on host.
 We provide a stable docker image by adding the data to the image. If you want to use it, add `stable` tag to the image name: `openpai/pai.example.sklearn:stable`.


@@ -42,13 +42,13 @@ Pay attention to your disk, because the data size is about 500GB.
 After you download the data, upload them to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/tensorflow/imageNet/data/`
 * cifar-10: Just go to the [official website](http://www.cs.toronto.edu/~kriz/cifar.html) and download the python version data by the [url](http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz). `wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz && tar zxvf cifar-10-python.tar.gz && rm cifar-10-python.tar.gz`
 After you downloading the data, upload them to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/tensorflow/distributed-cifar-10/data`
-2. Prepare the execable code:
+2. Prepare the executable code:
 * imageNet: The *slim* folder you just downloaded contains the code. If you download the data manually, refer to the automatic method to get the code.
 After you download the data, upload them to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/tensorflow/distributed-cifar-10/code/`
 * cifar-10: We use the [tensorflow official benchmark code](https://github.com/tensorflow/benchmarks). Pay attention to the version. We use *tf_benchmark_stage* branch. `git clone -b tf_benchmark_stage https://github.com/tensorflow/benchmarks.git`
 * After you download the data, upload them to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/tensorflow/distributed-cifar-10/code/`
 3. Prepare a docker image and upload it to docker hub. OpenPAI packaged the docker env required by the job for user to use. User could refer to [DOCKER.md](./DOCKER.md) to customize this example docker env. If user have built a customized image and pushed it to Docker Hub, replace our pre-built image `openpai/pai.example.tensorflow` with your own.
-4. Prepare a job configuration file and submit it through webportal. Note that you can simply run the prepare.sh to do the above preparing work, but you must make sure you can use HDFS client on your local mechine. If you can, just run the shell script with a parameter of your HDFS socket! `/bin/bash prepare.sh ip:port`
+4. Prepare a job configuration file and submit it through webportal. Note that you can simply run the prepare.sh to do the above preparing work, but you must make sure you can use HDFS client on your local machine. If you can, just run the shell script with a parameter of your HDFS socket! `/bin/bash prepare.sh ip:port`
 Note that, the default operation of the prepare script has closed the data preparing of imageNet due to its size. If you want to open it, just remove the "#" in the line 52.
 5. Prepare a job configuration file and submit it through webportal. The config examples are following.
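
The cifar-10 branch of steps 1 and 2 above, condensed into one sketch: a local HDFS client is assumed, `ip:port` stands for your HDFS namenode socket, and the tarball unpacks into a `cifar-10-batches-py` folder.

```bash
# Data (step 1): download, unpack, and upload the python-version cifar-10 set.
wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar zxvf cifar-10-python.tar.gz && rm cifar-10-python.tar.gz
hdfs dfs -put cifar-10-batches-py hdfs://ip:port/examples/tensorflow/distributed-cifar-10/data
# Code (step 2): clone the benchmark branch and upload it.
git clone -b tf_benchmark_stage https://github.com/tensorflow/benchmarks.git
hdfs dfs -put benchmarks hdfs://ip:port/examples/tensorflow/distributed-cifar-10/code/
```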