Mirror of https://github.com/microsoft/pai.git
Fixed typos (#1607)
Parent: ce46332b16
Commit: 7461d3091a
@@ -1,6 +1,6 @@
 # Goal
 
-Monitoring all compoments in pai, provide insight on detectiving system/hardware failuring and
+Monitoring all components in pai, provide insight on detectiving system/hardware failuring and
 analysing jobs performance.
 
 # Architecture

@@ -23,7 +23,7 @@ metrics to volume mounted in `/datastorage/prometheus`.
 
 Metrics generated by `watchdog` and `gpu_exporter` are collected by `node_exporter` container running
 inside `exporter` pod. Those metrics are scraped by `node_exporter` container. `node_exporter` also
-expose node metricss like node cpu/memory/disk usage.
+expose node metrics like node cpu/memory/disk usage.
 
 # Metrics collected

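The hunk above ends with `node_exporter` serving node-level metrics. As a rough sketch of how such a scrape could be checked by hand (the node address is a placeholder, and 9100 is only node_exporter's conventional port, not something this document states), the command is printed as a dry run so it needs no cluster:

```shell
# Placeholder node address; node_exporter conventionally listens on :9100.
NODE=10.0.0.1
# Dry run: print the scrape command instead of executing it.
echo "curl -s http://${NODE}:9100/metrics | grep -E '^node_(cpu|memory|filesystem)'"
```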
@@ -34,7 +34,7 @@ Usually there will have multiple patch files, the newest one is the last known g
 
 Below are step-by-step build for advance user:
 
-1. Prepare linux enviroment
+1. Prepare linux environment
 
    Ubuntu 16.04 is the default system. This dependencies must be installed:

@@ -44,7 +44,7 @@ Pylon starts a [nginx](http://nginx.org/) instance in a Docker container to prov
 
 ### For deploying as a standalone service (debugging)
 
-If the nginx in Pylon is to be deployed as a stand alone service (usually for debugging purpose), the following envirionment variables must be set in advance:
+If the nginx in Pylon is to be deployed as a stand alone service (usually for debugging purpose), the following environment variables must be set in advance:
 - `REST_SERVER_URI`: String. The root url of the REST server.
 - `K8S_API_SERVER_URI`: String. The root url of Kubernetes's API server.
 - `WEBHDFS_URI`: String. The root url of WebHDFS's API server.

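For the standalone-debugging case this hunk describes, the three variables might be exported like so before launching the container. All endpoint values below are placeholders of my own, not defaults taken from Pylon:

```shell
# Placeholder endpoints -- substitute the real services of your cluster.
export REST_SERVER_URI=http://10.0.0.1:9186      # root url of the REST server
export K8S_API_SERVER_URI=http://10.0.0.1:8080   # root url of the Kubernetes API server
export WEBHDFS_URI=http://10.0.0.1:50070         # root url of the WebHDFS API server
```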
@@ -44,7 +44,7 @@ If web portal is deployed within PAI cluster, the following config field could b
 
 ---
 
-If web portal is deployed as a standalone service, the following envioronment variables must be configured:
+If web portal is deployed as a standalone service, the following environment variables must be configured:
 
 * `REST_SERVER_URI`: URI of [REST Server](../rest-server)
 * `PROMETHEUS_URI`: URI of [Prometheus](../../src/prometheus)

@@ -70,7 +70,7 @@ The deployment of web portal goes with the bootstrapping process of the whole PA
 
 ---
 
-If web portal is need to be deplyed as a standalone service, follow these steps:
+If web portal is need to be deployed as a standalone service, follow these steps:
 
 1. Go into the `webportal` directory.
 2. Make sure the environment variables is fully configured.

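The two standalone-service hunks above (the variable list and the first deployment steps) could be sketched together as below. The endpoint values are placeholders I chose for illustration, and the fail-fast check is my own addition, not part of the web portal's scripts:

```shell
# Placeholder endpoints for a standalone web portal run.
export REST_SERVER_URI=http://10.0.0.1:9186   # REST Server
export PROMETHEUS_URI=http://10.0.0.1:9090    # Prometheus
# Step 2's "make sure the environment variables is fully configured",
# expressed as a fail-fast check before starting the service:
: "${REST_SERVER_URI:?must be set}" "${PROMETHEUS_URI:?must be set}"
echo "environment ok"
```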
@@ -58,8 +58,8 @@ Users can refer to this tutorial [submit a job in web portal](https://github.com
 
 Examples which can be run by submitting the json straightly without any modification.
 
-* [tensorflow.cifar10.json](./tensorflow/tensorflow.cifar10.json): Single GPU trainning on CIFAR-10 using TensorFlow.
-* [pytorch.mnist.json](./pytorch/pytorch.mnist.json): Single GPU trainning on MNIST using PyTorch.
+* [tensorflow.cifar10.json](./tensorflow/tensorflow.cifar10.json): Single GPU training on CIFAR-10 using TensorFlow.
+* [pytorch.mnist.json](./pytorch/pytorch.mnist.json): Single GPU training on MNIST using PyTorch.
 * [pytorch.regression.json](./pytorch/pytorch.regression.json): Regression using PyTorch.
 * [mxnet.autoencoder.json](./mxnet/mxnet.autoencoder.json): Autoencoder using MXNet.
 * [mxnet.image-classification.json](./mxnet/mxnet.image-classification.json): Image

@@ -28,10 +28,10 @@ The following contents show some basic CNTK examples, other customized CNTK code
 ### prepare
 To run CNTK examples in OpenPAI, you need to do the following things:
 1. Prepare the data by downloading all files in https://git.io/vbT5A(`wget https://github.com/Microsoft/CNTK/raw/master/Examples/SequenceToSequence/CMUDict/Data/cmudict-0.7b`) and put them up to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/cntk/data`.
-2. Prepare the execable code(`wget https://github.com/Microsoft/pai/raw/master/examples/cntk/cntk-g2p.sh`) and config(`wget https://github.com/Microsoft/CNTK/raw/master/Examples/SequenceToSequence/CMUDict/BrainScript/G2P.cntk`). And upload them to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/cntk/code`.
+2. Prepare the executable code(`wget https://github.com/Microsoft/pai/raw/master/examples/cntk/cntk-g2p.sh`) and config(`wget https://github.com/Microsoft/CNTK/raw/master/Examples/SequenceToSequence/CMUDict/BrainScript/G2P.cntk`). And upload them to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/cntk/code`.
 3. Prepare a docker image and upload it to docker hub. You can get the tutorial below.
 4. Prepare a job configuration file and submit it through webportal.
-Note that you can simply run the prepare.sh to do the above preparing work, but you must make sure you can use HDFS client on your local mechine. If you can, just run the shell script with a parameter of your HDFS socket!`/bin/bash prepare.sh ip:port`
+Note that you can simply run the prepare.sh to do the above preparing work, but you must make sure you can use HDFS client on your local machine. If you can, just run the shell script with a parameter of your HDFS socket!`/bin/bash prepare.sh ip:port`
 
 
 OpenPAI packaged the docker env required by the job for user to use. User could refer to [DOCKER.md](./DOCKER.md) to customize this example docker env. If user have built a customized image and pushed it to Docker Hub, replace our pre-built image `openpai/pai.example.caffe` with your own.

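Steps 1 and 2 of the CNTK preparation above amount to something like the following sketch. The `10.0.0.1:9000` socket is a placeholder for the `ip:port` argument that `prepare.sh` takes, and the `wget`/`hdfs` commands are printed as dry-run echoes so nothing is actually fetched:

```shell
# Placeholder HDFS socket -- prepare.sh takes the real one: /bin/bash prepare.sh ip:port
HDFS_SOCKET="10.0.0.1:9000"

# Data, code and config URLs, taken verbatim from the steps above.
DATA_URL=https://github.com/Microsoft/CNTK/raw/master/Examples/SequenceToSequence/CMUDict/Data/cmudict-0.7b
CODE_URL=https://github.com/Microsoft/pai/raw/master/examples/cntk/cntk-g2p.sh
CFG_URL=https://github.com/Microsoft/CNTK/raw/master/Examples/SequenceToSequence/CMUDict/BrainScript/G2P.cntk

# Dry run: print each step's command instead of executing it.
echo "wget $DATA_URL && hdfs dfs -put $(basename "$DATA_URL") hdfs://$HDFS_SOCKET/examples/cntk/data"
for url in "$CODE_URL" "$CFG_URL"; do
  echo "wget $url && hdfs dfs -put $(basename "$url") hdfs://$HDFS_SOCKET/examples/cntk/code"
done
```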
@@ -74,6 +74,6 @@ For more details on how to write a job configuration file, please refer to [job
 
 ### Note:
 
-Since PAI runs Keras jobs in Docker, the trainning speed on PAI should be similar to speed on host.
+Since PAI runs Keras jobs in Docker, the training speed on PAI should be similar to speed on host.
 
 We provide two stable docker images by adding the data to the images. If you want to use them, add `stable` tag to the image name: `openpai/pai.example.keras.cntk:stable` or `openpai/pai.example.keras.tensorflow:stable`.

@@ -40,13 +40,13 @@ After you downloading the data, upload them to HDFS:`hdfs dfs -put filename hdfs
 Note that we use the same data as tensorflow distributed cifar-10 example. So, if you have already run that example, just use that data path.
 * CNTK: Download all files in https://git.io/vbT5A `wget https://github.com/Microsoft/CNTK/raw/master/Examples/SequenceToSequence/CMUDict/Data/cmudict-0.7b` and put them up to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/cntk/data` or `hdfs dfs -put filename hdfs://ip:port/examples/mpi/cntk/data`.
 Note that we use the same data as cntk example. So, if you have already run that example, just use that data path.
-2. Prepare the execable code:
+2. Prepare the executable code:
 * Tensorflow: We use the same code as tensorflow distributed cifar-10 example. You can follow [that document](https://github.com/Microsoft/pai/blob/master/examples/tensorflow/README.md).
 * cntk: Download the script example from [github](https://github.com/Microsoft/pai/blob/master/examples/mpi/cntk-mpi.sh)`wget https://github.com/Microsoft/pai/raw/master/examples/mpi/cntk-mpi.sh`. Then upload them to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/mpi/cntk/code/`
 3. Prepare a docker image and upload it to docker hub. OpenPAI packaged the docker env required by the job for user to use. User could refer to [DOCKER.md](./DOCKER.md) to customize this example docker env. If user have built a customized image and pushed it to Docker Hub, replace our pre-built image `openpai/pai.example.tensorflow-mpi`, `openpai/pai.example.cntk-mp` with your own.
 4. Prepare a job configuration file and submit it through webportal. The config examples are following.
 
-**Note** that you can simply run the prepare.sh to do the above preparing work, but you must make sure you can use HDFS client on your local mechine. If you can, just run the shell script with a parameter of your HDFS socket! `/bin/bash prepare.sh ip:port`
+**Note** that you can simply run the prepare.sh to do the above preparing work, but you must make sure you can use HDFS client on your local machine. If you can, just run the shell script with a parameter of your HDFS socket! `/bin/bash prepare.sh ip:port`
 
 Here're some configuration file examples:

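Several of these notes pass an `ip:port` HDFS socket to `prepare.sh`. A minimal sketch of how that argument might be sanity-checked is below; `check_socket` is a hypothetical helper of my own, not a function from the repository's scripts:

```shell
# Hypothetical guard for the ip:port argument that prepare.sh expects.
check_socket() {
  case "$1" in
    *:[0-9]*) return 0 ;;  # looks like host:port
    *) echo "usage: /bin/bash prepare.sh ip:port" >&2; return 1 ;;
  esac
}
check_socket "10.0.0.1:9000" && echo "socket looks valid"
```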
@@ -79,6 +79,6 @@ For more details on how to write a job configuration file, please refer to [job
 
 ### Note:
 
-Since PAI runs MXNet jobs in Docker, the trainning speed on PAI should be similar to speed on host.
+Since PAI runs MXNet jobs in Docker, the training speed on PAI should be similar to speed on host.
 
 We provide a stable docker image by adding the data to the image. If you want to use it, add `stable` tag to the image name: `openpai/pai.example.mxnet:stable`.

@@ -79,4 +79,4 @@ For more details on how to write a job configuration file, please refer to [job
 
 ## Note:
 
-Since PAI runs PyTorch jobs in Docker, the trainning speed on PAI should be similar to speed on host.
+Since PAI runs PyTorch jobs in Docker, the training speed on PAI should be similar to speed on host.

@@ -79,6 +79,6 @@ For more details on how to write a job configuration file, please refer to [job
 
 ### Note:
 
-Since PAI runs PyTorch jobs in Docker, the trainning speed on PAI should be similar to speed on host.
+Since PAI runs PyTorch jobs in Docker, the training speed on PAI should be similar to speed on host.
 
 We provide a stable docker image by adding the data to the image. If you want to use it, add `stable` tag to the image name: `openpai/pai.example.sklearn:stable`.

@@ -42,13 +42,13 @@ Pay attention to your disk, because the data size is about 500GB.
 After you download the data, upload them to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/tensorflow/imageNet/data/`
 * cifar-10: Just go to the [official website](http://www.cs.toronto.edu/~kriz/cifar.html) and download the python version data by the [url](http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz). `wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz && tar zxvf cifar-10-python.tar.gz && rm cifar-10-python.tar.gz`
 After you downloading the data, upload them to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/tensorflow/distributed-cifar-10/data`
-2. Prepare the execable code:
+2. Prepare the executable code:
 * imageNet: The *slim* folder you just downloaded contains the code. If you download the data manually, refer to the automatic method to get the code.
 After you download the data, upload them to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/tensorflow/distributed-cifar-10/code/`
 * cifar-10: We use the [tensorflow official benchmark code](https://github.com/tensorflow/benchmarks). Pay attention to the version. We use *tf_benchmark_stage* branch. `git clone -b tf_benchmark_stage https://github.com/tensorflow/benchmarks.git`
 * After you download the data, upload them to HDFS:`hdfs dfs -put filename hdfs://ip:port/examples/tensorflow/distributed-cifar-10/code/`
 3. Prepare a docker image and upload it to docker hub. OpenPAI packaged the docker env required by the job for user to use. User could refer to [DOCKER.md](./DOCKER.md) to customize this example docker env. If user have built a customized image and pushed it to Docker Hub, replace our pre-built image `openpai/pai.example.tensorflow` with your own.
-4. Prepare a job configuration file and submit it through webportal. Note that you can simply run the prepare.sh to do the above preparing work, but you must make sure you can use HDFS client on your local mechine. If you can, just run the shell script with a parameter of your HDFS socket! `/bin/bash prepare.sh ip:port`
+4. Prepare a job configuration file and submit it through webportal. Note that you can simply run the prepare.sh to do the above preparing work, but you must make sure you can use HDFS client on your local machine. If you can, just run the shell script with a parameter of your HDFS socket! `/bin/bash prepare.sh ip:port`
 Note that, the default operation of the prepare script has closed the data preparing of imageNet due to its size. If you want to open it, just remove the "#" in the line 52.
 5. Prepare a job configuration file and submit it through webportal. The config examples are following.

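The note in the last hunk says to enable imageNet preparation by removing the `#` on line 52 of the prepare script. One way to do that is `sed` line addressing (shown below on a throwaway file so it is safe to run anywhere; `sed -i` without a suffix argument assumes GNU sed):

```shell
# Line 2 of this demo file stands in for prepare.sh's commented line 52.
printf '%s\n' 'echo one' '#echo two' > demo.sh
sed -i '2 s/^#//' demo.sh   # strip the leading '#' from line 2 only
cat demo.sh
```

Against the real script the equivalent would be `sed -i '52 s/^#//' prepare.sh`.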