This commit is contained in:
Yifan Xiong 2018-07-26 21:10:34 +08:00
Parents: 1439e4adcd 7e2d587f8a
Commit: 831bf32aa4
36 changed files: 287 additions and 178 deletions

151
README.md

@ -1,98 +1,83 @@
# Open Platform for AI (PAI) ![alt text][logo]
# Open Platform for AI (OpenPAI) ![alt text][logo]
[logo]: ./pailogo.jpg "OpenPAI"
[![Build Status](https://travis-ci.org/Microsoft/pai.svg?branch=master)](https://travis-ci.org/Microsoft/pai)
[![Coverage Status](https://coveralls.io/repos/github/Microsoft/pai/badge.svg?branch=master)](https://coveralls.io/github/Microsoft/pai?branch=master)
OpenPAI is an open-source platform that provides complete AI model training and resource management capabilities. It is easy to extend and supports on-premise, cloud, and hybrid environments at various scales.
## Introduction
# Table of Contents
1. [When to consider OpenPAI](#when-to-consider-openpai)
2. [Why choose OpenPAI](#why-choose-openpai)
3. [How to deploy](#how-to-deploy)
4. [How to use](#how-to-use)
5. [Resources](#resources)
6. [Get Involved](#get-involved)
7. [How to contribute](#how-to-contribute)
Platform for AI (PAI) is a platform for cluster management and resource scheduling.
The platform incorporates the mature design that has a proven track record in Microsoft's large scale production environment.
PAI supports AI jobs (e.g., deep learning jobs) running in a GPU cluster. The platform provides PAI [runtime environment](https://github.com/Microsoft/pai/blob/master/job-tutorial/README.md) support, with which existing [deep learning frameworks](./examples/README.md), e.g., CNTK and TensorFlow, can onboard PAI without any code changes. The runtime environment support provides great extensibility: new workloads can leverage it to run on PAI with just a few extra lines of script and/or Python code.
PAI supports GPU scheduling, a key requirement of deep learning jobs.
For better performance, PAI supports fine-grained topology-aware job placement that can request GPUs at a specific location (e.g., under the same PCI-E switch).
PAI embraces a [microservices](https://en.wikipedia.org/wiki/Microservices) architecture: every component runs in a container.
The system leverages [Kubernetes](https://kubernetes.io/) to deploy and manage static components in the system.
The more dynamic deep learning jobs are scheduled and managed by [Hadoop](http://hadoop.apache.org/) YARN with our [GPU enhancement](./hadoop-ai/README.md).
The training data and training results are stored in Hadoop HDFS.
## An Open AI Platform for R&D and Education
One key purpose of PAI is to support the highly diversified requirements from academia and industry. PAI is completely open: it is under the MIT license. PAI is architected in a modular way: different modules can be plugged in as appropriate. This makes PAI particularly attractive for evaluating various research ideas, which include but are not limited to the following components:
* Scheduling mechanism for deep learning workload
* Deep neural network application that requires evaluation under realistic platform environment
* New deep learning framework
* AutoML
* Compiler technique for AI
* High performance networking for AI
* Profiling tool, including network, platform, and AI job profiling
* AI Benchmark suite
* New hardware for AI, including FPGA, ASIC, Neural Processor
* AI Storage support
* AI platform management
PAI operates in an open model. It is initially designed and developed by [Microsoft Research (MSR)](https://www.microsoft.com/en-us/research/group/systems-research-group-asia/) and [Microsoft Search Technology Center (STC)](https://www.microsoft.com/en-us/ard/company/introduction.aspx) platform team.
We are glad to have [Peking University](http://eecs.pku.edu.cn/EN/), [Xi'an Jiaotong University](http://www.aiar.xjtu.edu.cn/), [Zhejiang University](http://www.cesc.zju.edu.cn/index_e.htm), and [University of Science and Technology of China](http://eeis.ustc.edu.cn/) join us to develop the platform jointly.
Contributions from academia and industry are all highly welcome.
## System Deployment
### Prerequisite
The system runs in a cluster of machines each equipped with one or multiple GPUs.
Each machine in the cluster should:
1. Run Ubuntu 16.04 LTS.
2. Have a static IP address assigned.
3. Have either no Docker installed, or Docker with API version >= 1.26.
4. Have access to a Docker registry service (e.g., [Docker hub](https://docs.docker.com/docker-hub/))
to store the Docker images for the services to be deployed.
The system also requires a dev machine that runs in the same environment and has full access to the cluster.
The system also needs an [NTP](http://www.ntp.org/) service for clock synchronization.
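The Docker API version requirement above can be stated as a simple numeric comparison of dotted version strings. A minimal sketch (the helper function is illustrative, not part of PAI; on a real machine the installed API version is reported by `docker version`):

```python
# Minimal sketch of the "Docker API version >= 1.26" prerequisite check.
# The helper is illustrative, not part of PAI.
def meets_min_api(version: str, minimum: str = "1.26") -> bool:
    """Compare dotted version strings numerically, component by component."""
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(version) >= parse(minimum)

print(meets_min_api("1.30"))  # True: API 1.30 satisfies the requirement
print(meets_min_api("1.24"))  # False: too old, upgrade Docker first
```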
### Deployment process
Deploying and using the system consists of the following steps.
1. Deploy PAI following our [bootup process](./pai-management/doc/cluster-bootup.md)
2. Access [web portal](./webportal/README.md) for job submission and cluster management
## When to consider OpenPAI
1. When your organization needs to share powerful AI computing resources (GPU/FPGA farms, etc.) among teams.
2. When your organization needs to share and reuse common AI assets such as models, data, and environments.
3. When your organization needs an easy IT ops platform for AI.
4. When you want to run a complete training pipeline in one place.
#### Job management
## Why choose OpenPAI
The platform incorporates the mature design that has a proven track record in Microsoft's large-scale production environment.
After the system services have been deployed, users can access the web portal, a Web UI, for cluster management and job management.
Please refer to this [tutorial](job-tutorial/README.md) for details about job submission.
### Support on-premises and easy to deploy
#### Cluster management
OpenPAI is a full-stack solution. It not only supports on-premises, hybrid, or public cloud deployment, but also supports single-box deployment for trial users.
The web portal also provides Web UI for cluster management.
### Support popular AI frameworks and heterogeneous hardware
## System Architecture
Pre-built Docker images are provided for popular AI frameworks. Heterogeneous hardware is easy to include, and distributed training, such as distributed TensorFlow, is supported.
<p style="text-align: left;">
<img src="./sysarch.png" title="System Architecture" alt="System Architecture" />
</p>
### Most complete solution and easy to extend
The system architecture is illustrated above.
User submits jobs or monitors cluster status through the [Web Portal](./webportal/README.md),
which calls APIs provided by the [REST server](./rest-server/README.md).
Third party tools can also call REST server directly for job management.
Upon receiving API calls, the REST server coordinates with [FrameworkLauncher](./frameworklauncher/README.md) (short for Launcher)
to perform job management.
The Launcher Server handles requests from the REST Server and submits jobs to Hadoop YARN.
The job, scheduled by YARN with [GPU enhancement](https://issues.apache.org/jira/browse/YARN-7481),
can leverage GPUs in the cluster for deep learning computation. Other types of CPU-based AI workloads or traditional big data jobs
can also run on the platform, coexisting with the GPU-based jobs.
The platform leverages HDFS to store data. All jobs are assumed to support HDFS.
All the static services (blue-lined box) are managed by Kubernetes, while jobs (purple-lined box) are managed by Hadoop YARN.
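As a sketch of the third-party path described above, a tool can talk to the REST server over plain HTTP. The host, port, endpoint path, and token scheme below are assumptions for illustration only; consult the [REST server](./rest-server/README.md) documentation for the actual API.

```python
import json
import urllib.request

# Hypothetical job-submission request against the REST server. The URL,
# endpoint path, and Bearer-token scheme are assumptions for illustration;
# see rest-server/README.md for the real API.
def build_submit_request(host: str, token: str, job_config: dict) -> urllib.request.Request:
    """Build (but do not send) an HTTP request that submits a job config."""
    return urllib.request.Request(
        url=f"http://{host}/api/v1/jobs",
        data=json.dumps(job_config).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_submit_request("pai-master:9186", "my-token", {"jobName": "demo"})
print(req.get_method(), req.full_url)  # POST http://pai-master:9186/api/v1/jobs
```

Building the request separately from sending it keeps the sketch runnable without a live cluster; sending would be a single `urllib.request.urlopen(req)` call.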
OpenPAI is a complete solution for deep learning: it supports virtual clusters, is compatible with the Hadoop / Kubernetes ecosystems, and runs a complete training pipeline in one cluster. OpenPAI is architected in a modular way: different modules can be plugged in as appropriate.
## Contributing
## How to deploy
#### 1 Prerequisites
Before you start, you need to meet the following requirements:
- Ubuntu 16.04
- Each server is assigned a static IP address, and the network is reachable between servers.
- Servers can access the external network; in particular, they need access to a Docker registry service (e.g., Docker hub) to pull the Docker images for the services to be deployed.
- The SSH service is enabled on all machines; they share the same username / password, and the user has sudo privilege.
- The NTP service is enabled.
- Recommended: either no Docker installed, or Docker with API version >= 1.26.
#### 2 Deploy OpenPAI
##### 2.1 [Quick deploy with default settings](./pai-management/doc/cluster-bootup.md#quickdeploy)
##### 2.2 [Customized deploy](./pai-management/doc/cluster-bootup.md#customizeddeploy)
## How to use
### How to train jobs
- How to write PAI jobs
- [Learn from Example Jobs](./examples/README.md)
- [Write a job from scratch in depth](./docs/job_tutorial.md)
- How to submit PAI jobs
- [Submit a job in Visual Studio](https://github.com/Microsoft/vs-tools-for-ai/blob/master/docs/pai.md)
- [Submit a job in Visual Studio Code](https://github.com/Microsoft/vscode-tools-for-ai/blob/master/docs/quickstart-05-pai.md)
- [Submit a job in web portal](https://github.com/Microsoft/pai/blob/master/job-tutorial/README.md#job-submission)
- How to request on-demand resources for in-place training
- [Launch a jupyter notebook and work in it](./examples/jupyter/README.md)
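A PAI job is described by a JSON configuration file. The sketch below shows a minimal configuration in the shape described by the [job tutorial](./docs/job_tutorial.md); the image name and command are placeholders, not real artifacts.

```python
import json

# Minimal job configuration in the shape described by docs/job_tutorial.md.
# The image name and the training command are placeholders.
job_config = {
    "jobName": "tensorflow-example",
    "image": "your-registry/pai.run.tensorflow",  # image pushed to your Docker registry
    "taskRoles": [
        {
            "name": "worker",
            "taskNumber": 1,     # number of task instances for this role
            "cpuNumber": 4,      # CPUs per task
            "memoryMB": 8192,    # memory per task
            "gpuNumber": 1,      # GPUs per task
            "command": "python train.py",
        }
    ],
}
print(json.dumps(job_config, indent=2))
```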
### Cluster administration
- [Deployment infrastructure](./pai-management/doc/cluster-bootup.md)
- [Cluster maintenance](https://github.com/Microsoft/pai/wiki/Maintenance-(Service-&-Machine))
- [Monitoring](./webportal/README.md)
## Resources
The OpenPAI user [documentation](./docs/documentation.md) provides in-depth instructions for using OpenPAI.
## Get Involved
- [StackOverflow:](./docs/stackoverflow.md) If you have questions about OpenPAI, please submit them on Stack Overflow under the tag: openpai
- [Report an issue:](https://github.com/Microsoft/pai/wiki/Issue-tracking) If you have an issue, a bug, or a feature request, please submit it on GitHub
## How to contribute
#### Contributor License Agreement
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.microsoft.com.
@ -104,3 +89,15 @@ provided by the bot. You will only need to do this once across all repos using o
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
#### Who should consider contributing to OpenPAI?
- Folks who want to add support for other ML and DL frameworks
- Folks who want to make OpenPAI a richer AI platform (e.g. support for more ML pipelines, hyperparameter tuning)
- Folks who want to write tutorials/blog posts showing how to use OpenPAI to solve AI problems
#### Contributors
One key purpose of PAI is to support the highly diversified requirements from academia and industry. PAI is completely open: it is under the MIT license. This makes PAI particularly attractive for evaluating various research ideas, which include but are not limited to these [components](./docs/research_education.md).
PAI operates in an open model. It is initially designed and developed by [Microsoft Research (MSR)](https://www.microsoft.com/en-us/research/group/systems-research-group-asia/) and [Microsoft Search Technology Center (STC)](https://www.microsoft.com/en-us/ard/company/introduction.aspx) platform team.
We are glad to have [Peking University](http://eecs.pku.edu.cn/EN/), [Xi'an Jiaotong University](http://www.aiar.xjtu.edu.cn/), [Zhejiang University](http://www.cesc.zju.edu.cn/index_e.htm), and [University of Science and Technology of China](http://eeis.ustc.edu.cn/) join us to develop the platform jointly.
Contributions from academia and industry are all highly welcome.

10
docs/documentation.md Normal file

@ -0,0 +1,10 @@
## Documentation
### Architecture and OpenPAI core
- [System architecture](./system_architecture.md)
- [Job Scheduling: scheduling resources across OpenPAI jobs](../hadoop-ai/README.md)
- [FrameworkLauncher: launching customized frameworks via the Launcher service](../frameworklauncher/README.md)
### Configuration and API
- [Configuration: customize OpenPAI via its configuration](../pai-management/doc/how-to-write-pai-configuration.md#cluster_configuration)
- [OpenPAI Programming Guides](../examples/README.md)
- [Restful API Docs](../rest-server/README.md)

Binary data
docs/images/PAI_ask_question1.PNG Normal file
Binary file not shown. Size: 36 KiB

Binary data
docs/images/PAI_ask_question2.PNG Normal file
Binary file not shown. Size: 17 KiB

Binary data
docs/images/PAI_ask_question3.PNG Normal file
Binary file not shown. Size: 13 KiB

Binary data
docs/images/PAI_ask_question4.PNG Normal file
Binary file not shown. Size: 7.1 KiB


@ -38,14 +38,14 @@ This guide assumes the system has already been deployed properly and a docker re
The system launches a deep learning job in one or more Docker containers. A Docker image is required in advance.
The system provides base Docker images with HDFS, CUDA, and cuDNN support, on which users can build their own custom Docker images.
To build a base Docker image, for example [Dockerfile.build.base](Dockerfiles/cuda8.0-cudnn6/Dockerfile.build.base), run:
To build a base Docker image, for example [Dockerfile.build.base](../job-tutorial/Dockerfiles/cuda8.0-cudnn6/Dockerfile.build.base), run:
```sh
docker build -f Dockerfiles/Dockerfile.build.base -t pai.build.base:hadoop2.7.2-cuda8.0-cudnn6-devel-ubuntu16.04 Dockerfiles/
```
Then a custom docker image can be built based on it by adding `FROM pai.build.base:hadoop2.7.2-cuda8.0-cudnn6-devel-ubuntu16.04` in the Dockerfile.
As an example, we customize a TensorFlow Docker image using [Dockerfile.run.tensorflow](Dockerfiles/cuda8.0-cudnn6/Dockerfile.run.tensorflow):
As an example, we customize a TensorFlow Docker image using [Dockerfile.run.tensorflow](../job-tutorial/Dockerfiles/cuda8.0-cudnn6/Dockerfile.run.tensorflow):
```sh
docker build -f Dockerfiles/Dockerfile.run.tensorflow -t pai.run.tensorflow Dockerfiles/
```
@ -294,4 +294,4 @@ You can ssh connect to a specified container either from outside or inside conta
You can use the `ssh $PAI_CURRENT_TASK_ROLE_NAME-$PAI_CURRENT_TASK_ROLE_CURRENT_TASK_INDEX` command to connect to another container that belongs to the same job. For example, if there are two taskRoles, master and worker, you can connect to the worker-0 container directly with the command below:
```sh
ssh worker-0
```


@ -0,0 +1,18 @@
## An Open AI Platform for R&D and Education
One key purpose of PAI is to support the highly diversified requirements from academia and industry. PAI is completely open: it is under the MIT license. PAI is architected in a modular way: different modules can be plugged in as appropriate. This makes PAI particularly attractive for evaluating various research ideas, which include but are not limited to the following components:
* Scheduling mechanism for deep learning workload
* Deep neural network application that requires evaluation under realistic platform environment
* New deep learning framework
* AutoML
* Compiler technique for AI
* High performance networking for AI
* Profiling tool, including network, platform, and AI job profiling
* AI Benchmark suite
* New hardware for AI, including FPGA, ASIC, Neural Processor
* AI Storage support
* AI platform management
PAI operates in an open model. It is initially designed and developed by [Microsoft Research (MSR)](https://www.microsoft.com/en-us/research/group/systems-research-group-asia/) and [Microsoft Search Technology Center (STC)](https://www.microsoft.com/en-us/ard/company/introduction.aspx) platform team.
We are glad to have [Peking University](http://eecs.pku.edu.cn/EN/), [Xi'an Jiaotong University](http://www.aiar.xjtu.edu.cn/), [Zhejiang University](http://www.cesc.zju.edu.cn/index_e.htm), and [University of Science and Technology of China](http://eeis.ustc.edu.cn/) join us to develop the platform jointly.
Contributions from academia and industry are all highly welcome.

30
docs/stackoverflow.md Normal file

@ -0,0 +1,30 @@
## How to ask a question on Stack Overflow about OpenPAI
### 1. Click the "Ask Question" button.
Navigate to the Stack Overflow homepage in your browser at stackoverflow.com. In the upper right-hand corner of the page, you should see the Ask Question button; click it to continue.
![PAI_ask_question1](./images/PAI_ask_question1.PNG)
### 2. Read the disclaimer.
Then check the box indicating you have read and understood the disclaimer and click "Proceed." Now you're ready to ask your question!
![PAI_ask_question2](./images/PAI_ask_question2.PNG)
### 3. Fill in the necessary information.
This is where your problem description and title come in handy. Fill in the information and take a moment to double-check spelling and grammar. The last thing you want is someone quibbling over your wording instead of answering your question. Then click "Post your question."
![PAI_ask_question3](./images/PAI_ask_question3.PNG)
### 4. Add any relevant tags.
In the tags field, when you begin typing, the Stack Overflow system will automatically suggest likely tags to help you with this process. Be sure you read the descriptions for your tags. An incorrect tag can seriously limit potential responses.
![PAI_ask_question4](./images/PAI_ask_question4.PNG)
## How to search OpenPAI related questions
OpenPAI's Stack Overflow tag is openpai. Users can view and ask questions under this tag.
- [StackOverflow: tag openpai](https://stackoverflow.com/questions/tagged/openpai)
References:
- https://stackoverflow.com/help/how-to-ask
- https://www.wikihow.com/Ask-a-Question-on-Stack-Overflow


@ -0,0 +1,18 @@
## System Architecture
<p style="text-align: left;">
<img src="https://github.com/Microsoft/pai/blob/master/sysarch.png" title="System Architecture" alt="System Architecture" />
</p>
The system architecture is illustrated above.
User submits jobs or monitors cluster status through the [Web Portal](../webportal/README.md),
which calls APIs provided by the [REST server](../rest-server/README.md).
Third party tools can also call REST server directly for job management.
Upon receiving API calls, the REST server coordinates with [FrameworkLauncher](../frameworklauncher/README.md) (short for Launcher)
to perform job management.
The Launcher Server handles requests from the REST Server and submits jobs to Hadoop YARN.
The job, scheduled by YARN with [GPU enhancement](https://issues.apache.org/jira/browse/YARN-7481),
can leverage GPUs in the cluster for deep learning computation. Other types of CPU-based AI workloads or traditional big data jobs
can also run on the platform, coexisting with the GPU-based jobs.
The platform leverages HDFS to store data. All jobs are assumed to support HDFS.
All the static services (blue-lined box) are managed by Kubernetes, while jobs (purple-lined box) are managed by Hadoop YARN.


@ -105,4 +105,4 @@ Here's one configuration file example to train a model on the [forest cover type
]
}
```
For more details on how to write a job configuration file, please refer to [job tutorial](../../job-tutorial/README.md#json-config-file-for-job-submission).
For more details on how to write a job configuration file, please refer to [job tutorial](../../docs/job_tutorial.md#json-config-file-for-job-submission).


@ -105,4 +105,4 @@ Here's one configuration file example:
]
}
```
For more details on how to write a job configuration file, please refer to [job tutorial](../../job-tutorial/README.md#json-config-file-for-job-submission).
For more details on how to write a job configuration file, please refer to [job tutorial](../../docs/job_tutorial.md#json-config-file-for-job-submission).


@ -112,4 +112,4 @@ Here's one configuration file example:
}
```
For more details on how to write a job configuration file, please refer to [job tutorial](../../job-tutorial/README.md#json-config-file-for-job-submission).
For more details on how to write a job configuration file, please refer to [job tutorial](../../docs/job_tutorial.md#json-config-file-for-job-submission).


@ -106,4 +106,4 @@ Here're some configuration file examples:
}
```
For more details on how to write a job configuration file, please refer to [job tutorial](../../job-tutorial/README.md#json-config-file-for-job-submission).
For more details on how to write a job configuration file, please refer to [job tutorial](../../docs/job_tutorial.md#json-config-file-for-job-submission).


@ -179,4 +179,4 @@ Here're some configuration file examples:
}
```
For more details on how to write a job configuration file, please refer to [job tutorial](../../job-tutorial/README.md#json-config-file-for-job-submission).
For more details on how to write a job configuration file, please refer to [job tutorial](../../docs/job_tutorial.md#json-config-file-for-job-submission).


@ -104,10 +104,10 @@ Please built your image and pushed it to your Docker registry, replace image `op
}
```
For more details on how to write a job configuration file, please refer to [job tutorial](../../job-tutorial/README.md#json-config-file-for-job-submission).
For more details on how to write a job configuration file, please refer to [job tutorial](../../docs/job_tutorial.md#json-config-file-for-job-submission).
### Access Jupyter Notebook
Once the job is successfully submitted to PAI, you can view the job info in the web portal and access your Jupyter Notebook via http://${container_ip}:${container_port}/jupyter/notebooks/mnist.ipynb.
![avatar](example.png)
For example, from the above job info page, you can access your Jupyter Notebook via http://10.151.40.202:4836/jupyter/notebooks/mnist.ipynb
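The notebook URL is simply composed from the container IP and port shown on the job info page. A small sketch using the example values from the text above:

```python
# Compose the Jupyter Notebook URL from the container IP and port shown on
# the job info page. The values used below are the example from the text.
def notebook_url(container_ip: str, container_port: int,
                 path: str = "jupyter/notebooks/mnist.ipynb") -> str:
    return f"http://{container_ip}:{container_port}/{path}"

print(notebook_url("10.151.40.202", 4836))
# → http://10.151.40.202:4836/jupyter/notebooks/mnist.ipynb
```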


@ -151,11 +151,11 @@ Here're some configuration file examples:
}
```
For more details on how to write a job configuration file, please refer to [job tutorial](../../job-tutorial/README.md#json-config-file-for-job-submission).
For more details on how to write a job configuration file, please refer to [job tutorial](../../docs/job_tutorial.md#json-config-file-for-job-submission).
## FAQ
### Speed
Since PAI runs Keras jobs in Docker, the training speed on PAI should be similar to the speed on the host.


@ -153,4 +153,4 @@ Here're some configuration file examples:
}
```
For more details on how to write a job configuration file, please refer to [job tutorial](../../job-tutorial/README.md#json-config-file-for-job-submission).
For more details on how to write a job configuration file, please refer to [job tutorial](../../docs/job_tutorial.md#json-config-file-for-job-submission).


@ -147,7 +147,7 @@ Here're some configuration file examples:
}
```
For more details on how to write a job configuration file, please refer to [job tutorial](../../job-tutorial/README.md#json-config-file-for-job-submission).
For more details on how to write a job configuration file, please refer to [job tutorial](../../docs/job_tutorial.md#json-config-file-for-job-submission).
## FAQ


@ -147,7 +147,7 @@ Here're some configuration file examples:
}
```
For more details on how to write a job configuration file, please refer to [job tutorial](../../job-tutorial/README.md#json-config-file-for-job-submission).
For more details on how to write a job configuration file, please refer to [job tutorial](../../docs/job_tutorial.md#json-config-file-for-job-submission).
## FAQ


@ -147,7 +147,7 @@ Here're some configuration file examples:
}
```
For more details on how to write a job configuration file, please refer to [job tutorial](../../job-tutorial/README.md#json-config-file-for-job-submission).
For more details on how to write a job configuration file, please refer to [job tutorial](../../docs/job_tutorial.md#json-config-file-for-job-submission).
## FAQ


@ -105,4 +105,4 @@ Here're some configuration file examples:
}
```
For more details on how to write a job configuration file, please refer to [job tutorial](../../job-tutorial/README.md#json-config-file-for-job-submission).
For more details on how to write a job configuration file, please refer to [job tutorial](../../docs/job_tutorial.md#json-config-file-for-job-submission).


@ -219,4 +219,4 @@ Here're some configuration file examples:
}
```
For more details on how to write a job configuration file, please refer to [job tutorial](../../job-tutorial/README.md#json-config-file-for-job-submission).
For more details on how to write a job configuration file, please refer to [job tutorial](../../docs/job_tutorial.md#json-config-file-for-job-submission).


@ -388,7 +388,7 @@
"steppedLine": false,
"targets": [
{
"expr": "100 - avg (irate(node_cpu{mode=\"idle\"}[5m])) * 100",
"expr": "100 - avg (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "cpu utilization",
@ -464,21 +464,21 @@
"steppedLine": false,
"targets": [
{
"expr": "sum(node_memory_MemTotal) - sum(node_memory_MemFree) - sum(node_memory_Buffers) - sum(node_memory_Cached)",
"expr": "sum(node_memory_MemTotal_bytes) - sum(node_memory_MemFree_bytes) - sum(node_memory_Buffers_bytes) - sum(node_memory_Cached_bytes)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "usage",
"refId": "A"
},
{
"expr": "sum(node_memory_MemFree)",
"expr": "sum(node_memory_MemFree_bytes)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "free",
"refId": "D"
},
{
"expr": "sum(node_memory_Buffers) + sum(node_memory_Cached)",
"expr": "sum(node_memory_Buffers_bytes) + sum(node_memory_Cached_bytes)",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "buff/cache",
@ -553,14 +553,14 @@
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(node_network_receive_bytes{device!~\"lo\"}[5m]))",
"expr": "sum(rate(node_network_receive_bytes_total{device!~\"lo\"}[5m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "in",
"refId": "A"
},
{
"expr": "sum(rate(node_network_transmit_bytes{device!~\"lo\"}[5m]))",
"expr": "sum(rate(node_network_transmit_bytes_total{device!~\"lo\"}[5m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "out",
@ -647,14 +647,14 @@
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(node_disk_bytes_read[5m]))",
"expr": "sum(rate(node_disk_read_bytes_total[5m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "read",
"refId": "A"
},
{
"expr": "sum(rate(node_disk_bytes_written[5m]))",
"expr": "sum(rate(node_disk_written_bytes_total[5m]))",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "write",


@ -74,7 +74,7 @@
"refId": "A"
},
{
"expr": "100 - (avg by (instance)(irate(node_cpu{mode=\"idle\",instance=~\"$node\"}[5m])) * 100)",
"expr": "100 - (avg by (instance)(irate(node_cpu_seconds_total{mode=\"idle\",instance=~\"$node\"}[5m])) * 100)",
"format": "time_series",
"hide": false,
"intervalFactor": 2,
@ -175,7 +175,7 @@
"steppedLine": false,
"targets": [
{
"expr": "node_memory_MemTotal{instance=~'$node'} - node_memory_MemFree{instance=~'$node'} - node_memory_Buffers{instance=~'$node'} - node_memory_Cached{instance=~'$node'}",
"expr": "node_memory_MemTotal_bytes{instance=~'$node'} - node_memory_MemFree_bytes{instance=~'$node'} - node_memory_Buffers_bytes{instance=~'$node'} - node_memory_Cached_bytes{instance=~'$node'}",
"format": "time_series",
"interval": "",
"intervalFactor": 2,
@ -186,7 +186,7 @@
"target": ""
},
{
"expr": "node_memory_MemFree{instance=~'$node'}",
"expr": "node_memory_MemFree_bytes{instance=~'$node'}",
"format": "time_series",
"hide": false,
"interval": "",
@ -196,7 +196,7 @@
"step": 600
},
{
"expr": "node_memory_Buffers{instance=~'$node'} + node_memory_Cached{instance=~'$node'}",
"expr": "node_memory_Buffers_bytes{instance=~'$node'} + node_memory_Cached_bytes{instance=~'$node'}",
"format": "time_series",
"intervalFactor": 2,
"legendFormat": "buff/cache",
@ -281,7 +281,7 @@
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(node_network_receive_bytes{instance=~\"$node\"}[5m]))",
"expr": "sum(rate(node_network_receive_bytes_total{instance=~\"$node\"}[5m]))",
"format": "time_series",
"interval": "",
"intervalFactor": 2,
@ -292,7 +292,7 @@
"target": ""
},
{
"expr": "sum(rate(node_network_transmit_bytes{instance=~\"$node\"}[5m]))",
"expr": "sum(rate(node_network_transmit_bytes_total{instance=~\"$node\"}[5m]))",
"format": "time_series",
"hide": true,
"interval": "",
@ -388,7 +388,7 @@
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(node_disk_bytes_read{instance=~\"$node\"}[5m]))",
"expr": "sum(rate(node_disk_read_bytes_total{instance=~\"$node\"}[5m]))",
"format": "time_series",
"interval": "",
"intervalFactor": 2,
@ -399,7 +399,7 @@
"target": ""
},
{
"expr": "sum(rate(node_disk_bytes_written{instance=~\"$node\"}[5m]))",
"expr": "sum(rate(node_disk_written_bytes_total{instance=~\"$node\"}[5m]))",
"format": "time_series",
"hide": false,
"intervalFactor": 2,
@ -768,7 +768,7 @@
"multiFormat": "regex values",
"name": "node",
"options": [],
"query": "label_values(node_boot_time, instance)",
"query": "label_values(node_uname_info, instance)",
"refresh": 1,
"regex": "",
"sort": 1,
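The dashboard edits in this file and the one above track metric renames introduced by newer node_exporter releases. A small helper, with a non-exhaustive mapping taken from the expressions in these diffs, can migrate a PromQL expression to the new names:

```python
# Old -> new node_exporter metric names, as seen in the dashboard diffs above.
# Non-exhaustive; only the metrics these dashboards use.
RENAMES = {
    "node_cpu": "node_cpu_seconds_total",
    "node_memory_MemTotal": "node_memory_MemTotal_bytes",
    "node_memory_MemFree": "node_memory_MemFree_bytes",
    "node_memory_Buffers": "node_memory_Buffers_bytes",
    "node_memory_Cached": "node_memory_Cached_bytes",
    "node_network_receive_bytes": "node_network_receive_bytes_total",
    "node_network_transmit_bytes": "node_network_transmit_bytes_total",
    "node_disk_bytes_read": "node_disk_read_bytes_total",
    "node_disk_bytes_written": "node_disk_written_bytes_total",
}

def migrate_expr(expr: str) -> str:
    """Rewrite a PromQL expression to the new metric names (longest match first)."""
    for old in sorted(RENAMES, key=len, reverse=True):
        expr = expr.replace(old, RENAMES[old])
    return expr

print(migrate_expr('sum(rate(node_disk_bytes_read[5m]))'))
# → sum(rate(node_disk_read_bytes_total[5m]))
```

Replacing longest names first avoids a shorter old name (e.g. `node_cpu`) matching inside a longer one before it is handled.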


@ -67,7 +67,7 @@ spec:
#- '--no-collector.loadavg' Exposes load average.
- '--no-collector.mdadm'
#- '--no-collector.meminfo' Exposes memory statistics.
- '--no-collector.netdev'
#- '--no-collector.netdev' Exposes network interface statistics such as bytes transferred.
#- '--no-collector.netstat' Exposes network statistics from /proc/net/netstat. This is the same information as netstat -s.
- '--no-collector.nfs'
- '--no-collector.nfsd'


@ -5,38 +5,99 @@ This document introduces the detailed procedures to boot up PAI on a cluster. Pl
Please refer to the [single box deployment](./single-box-deployment.md) section if you would like to deploy PAI on a single server.
Table of contents:
## Table of contents:
<!-- TOC depthFrom:2 depthTo:3 -->
- [Overview](#overview)
- [Step 1a. Prepare PAI configuration: Manual approach](#step-1a)
- [Step 1b. Prepare PAI configuration: Using `paictl` tool for a quickstart deployment](#step-1b)
- [Step 2. Boot up Kubernetes](#step-2)
- [Step 3. Start all PAI services](#step-3)
- [Quick deploy with default settings](#quickdeploy)
- [Customized deploy](#customizeddeploy)
- [Appendix: Default values in auto-generated configuration files](#appendix)
<!-- /TOC -->
## Overview <a name="overview"></a>
We assume that the whole cluster has already been configured by the system maintainer to meet the following requirements:
- A [dev-box](./how-to-setup-dev-box.md) has been set up and can access the cluster.
- SSH service is enabled on each of the machines.
- All machines share the same username / password for the SSH service on each of them.
- The username that can be used to login to each machine should have sudo privilege.
- All machines to be set up as masters should be in the same network segment.
- A load balancer is prepared if there are multiple masters to be set up.
We assume that the whole cluster has already been configured by the system maintainer to meet the [Prerequisites](../../README.md#how-to-deploy).
With the cluster being set up, the steps to bring PAI up on it are as follows:
- Step 0. Prepare dev-box
- Step 1. Prepare PAI configuration.
- (For advanced users) This step can either be done by writing the configuration files manually,
- (For novice users) or be done using the `paictl` tool.
- Step 2. Boot up Kubernetes.
- Step 3. Start all PAI services.
## Step 1a. Prepare PAI configuration: Manual approach <a name="step-1a"></a>
## Quick deploy with default settings <a name="quickdeploy"></a>
### Step 0. Prepare the dev-box
It is recommended to perform the operations below in a dev-box.
Please refer to this [section](./how-to-setup-dev-box.md) for the details of setting up a dev-box.
### Step 1. Prepare the quick-start.yaml file <a name="step-1a"></a>
An example yaml file is shown below. Note that you should change the IP addresses of the machines and the SSH information accordingly.
```yaml
# quick-start.yaml
# (Required) Please fill in the IP address of the server you would like to deploy OpenPAI
machines:
- 192.168.1.11
- 192.168.1.12
- 192.168.1.13
# (Required) Log-in info of all machines. System administrator should guarantee
# that the username/password pair is valid and has sudo privilege.
ssh-username: pai
ssh-password: pai-password
# (Optional, default=22) Port number of ssh service on each machine.
#ssh-port: 22
# (Optional, default=DNS of the first machine) Cluster DNS.
#dns: <ip-of-dns>
# (Optional, default=10.254.0.0/16) IP range used by Kubernetes. Note that
# this IP range should NOT conflict with the current network.
#service-cluster-ip-range: <ip-range-for-k8s>
```
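Before feeding the file to `paictl`, it can help to sanity-check the required fields. A minimal sketch of such a check; the `validate_quick_start` helper is hypothetical, not part of paictl, and assumes the file has already been parsed into a dict (e.g. with PyYAML):

```python
# Hypothetical sanity check for a parsed quick-start.yaml (not part of paictl).

def validate_quick_start(cfg):
    """Return a list of problems found in the quick-start configuration dict."""
    problems = []
    if not cfg.get("machines"):
        problems.append("'machines' must list at least one IP address")
    for key in ("ssh-username", "ssh-password"):
        if not cfg.get(key):
            problems.append("'%s' is required" % key)
    port = cfg.get("ssh-port", 22)  # optional, defaults to 22
    if not (isinstance(port, int) and 0 < port < 65536):
        problems.append("'ssh-port' must be a valid port number")
    return problems

cfg = {
    "machines": ["192.168.1.11", "192.168.1.12", "192.168.1.13"],
    "ssh-username": "pai",
    "ssh-password": "pai-password",
}
assert validate_quick_start(cfg) == []
```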
### Step 2. Generate OpenPAI configuration files
After the quick-start.yaml is ready, use it to generate four configuration yaml files as follows.
```
python paictl.py cluster generate-configuration -i ~/quick-start.yaml -o ~/pai-config -f
```
The command will generate the following four yaml files.
```
cluster-configuration.yaml
k8s-role-definition.yaml
kubernetes-configuration.yaml
services-configuration.yaml
```
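A quick way to confirm the generation step succeeded is to check the output directory; a minimal sketch (four yaml files are expected after a successful run):

```python
import glob
import os

def generated_yaml_files(config_dir):
    """List the yaml files paictl wrote to config_dir (four are expected)."""
    return sorted(os.path.basename(p)
                  for p in glob.glob(os.path.join(config_dir, "*.yaml")))

# After a successful run:
#   len(generated_yaml_files(os.path.expanduser("~/pai-config"))) == 4
```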
Please refer to this [section](./how-to-write-pai-configuration.md) for the details of the configuration files.
### Step 3. Boot up Kubernetes
Use the four yaml files to boot up k8s.
Please refer to this [section](./cluster-bootup.md#step-2) for details.
### Step 4. Start all OpenPAI services
After k8s starts, boot up all OpenPAI services.
Please refer to this [section](./cluster-bootup.md#step-3) for details.
## Customized deploy <a name="customizeddeploy"></a>
### Step 0. Prepare the dev-box
It is recommended to perform the operations below in a dev-box.
Please refer to this [section](./how-to-setup-dev-box.md) for the details of setting up a dev-box.
### Step 1. Prepare PAI configuration: Manual approach <a name="step-1a"></a>
This method is for advanced users. PAI configuration consists of 4 YAML files:
@@ -49,41 +110,7 @@ There are two ways to prepare the above 4 PAI configuration files. The first one
If you want to deploy PAI in single box environment, please refer to [Single Box Deployment](single-box-deployment.md) to edit configuration files.
## Step 1b. Prepare PAI configuration: A quick start approach using `paictl` tool <a name="step-1b"></a>
The second way, which is designed for fast deployment, is to generate a set of default configuration files from a very simple starting-point file using the `paictl` maintenance tool:
```
python paictl.py cluster generate-configuration \
-i quick-start.yaml \
-o /path/to/cluster-configuration/dir
```
The 4 configuration files will be stored into the `/path/to/cluster-configuration/dir` folder. Note that most of the fields in the 4 configuration files are automatically generated using default values. See [Appendix](#appendix) for an incomplete list of these default values.
The `quick-start.yaml` file consists of the following sections:
- `machines` - The list of all machines. The first machine will be configured as the master, and all other machines will be configured as workers.
- `ssh-username` and `ssh-password`: Log-in info of all machines.
- (Optional, default=22) `ssh-port` - Port number of the SSH service on each machine.
- (Optional, default=DNS of the first machine) `dns` - Cluster DNS.
- (Optional, default=10.254.0.0/16) `service-cluster-ip-range` - IP range used by Kubernetes. Note that this IP range should NOT conflict with the current network.
Example:
```yaml
machines:
- 192.168.1.11
- 192.168.1.12
- 192.168.1.13
ssh-username: pai-admin
ssh-password: pai-admin-password
```
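The note about `service-cluster-ip-range` not conflicting with the current network can be checked programmatically with the standard `ipaddress` module; a minimal sketch, using the machine IPs from the example above:

```python
import ipaddress

def conflicting_machines(machines, service_cluster_ip_range="10.254.0.0/16"):
    """Return the machine IPs that fall inside the Kubernetes service IP range."""
    net = ipaddress.ip_network(service_cluster_ip_range)
    return [ip for ip in machines if ipaddress.ip_address(ip) in net]

machines = ["192.168.1.11", "192.168.1.12", "192.168.1.13"]
assert conflicting_machines(machines) == []  # the default range is safe here
assert conflicting_machines(["10.254.3.7"]) == ["10.254.3.7"]  # would conflict
```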
An example quick-start.yaml file is available [here](../quick-start/quick-start-example.yaml).
Note that the quick-start approach does not support high availability or customized deployment; for those, use the [manual approach](#step-1a).
## Step 2. Boot up Kubernetes <a name="step-2"></a>
### Step 2. Boot up Kubernetes <a name="step-2"></a>
After the configuration files are prepared, the Kubernetes services can be started using `paictl` tool:
@@ -104,7 +131,7 @@ http://<master>:9090
```
where `<master>` denotes the IP address of the load balancer of Kubernetes master nodes. When there is only one master node and a load balancer is not used, it is usually the IP address of the master node itself.
## Step 3. Start all PAI services <a name="step-3"></a>
### Step 3. Start all PAI services <a name="step-3"></a>
When Kubernetes is up and running, PAI services can then be deployed to it using `paictl` tool:

View file

@@ -44,13 +44,13 @@ spec:
args:
- --apiserver-host=http://{{ clusterconfig['api-servers-ip'] }}:8080
resources:
# keep request = limit to keep this container in guaranteed class
# According to our experiments, the ideal CPU setting is 3.5. If the server hosting the k8s dashboard has enough resources, users can change this to a larger value.
limits:
cpu: 100m
memory: 300Mi
cpu: "1"
memory: 3000Mi
requests:
cpu: 100m
memory: 100Mi
cpu: "0.5"
memory: 1000Mi
ports:
- containerPort: 9090
livenessProbe:

View file

@@ -3,9 +3,17 @@
"user": {
"name": "test"
},
"retryPolicy": {
"maxRetryCount": 0,
"fancyRetryPolicy": true
},
"taskRoles": {
"Master": {
"taskNumber": 10,
"taskRetryPolicy": {
"maxRetryCount": 0,
"fancyRetryPolicy": true
},
"taskService": {
"version": 23,
"entryPoint": "echo 'TEST'",

View file

@@ -17,7 +17,7 @@
copy-list:
- src: ../job-tutorial
- src: ../docs
dst: src/webportal/copied_file
- src: ../examples
dst: src/webportal/copied_file

View file

@@ -26,7 +26,7 @@ REST Server exposes a set of interfaces that allows you to manage jobs.
1. Job config file
Prepare a job config file as described in [examples/README.md](../job-tutorial/README.md#json-config-file-for-job-submission), for example, `exampleJob.json`.
Prepare a job config file as described in [examples/README.md](../docs/job_tutorial.md#json-config-file-for-job-submission), for example, `exampleJob.json`.
2. Authentication
@@ -468,7 +468,7 @@ Configure the rest server port in [services-configuration.yaml](../cluster-confi
*Parameters*
[job config json](../job-tutorial/README.md#json-config-file-for-job-submission)
[job config json](../docs/job_tutorial.md#json-config-file-for-job-submission)
*Response if succeeded*
```

View file

@@ -370,7 +370,8 @@ class Job {
generateFrameworkDescription(data) {
const gpuType = data.gpuType || null;
const fancyRetryPolicy = (data.retryCount >= -1);
const fancyRetryPolicy = (data.retryCount !== -2);
const minSucceededTaskCount = (data.killAllOnCompletedTaskNumber > 0) ? 1 : null;
const virtualCluster = (!data.virtualCluster) ? 'default' : data.virtualCluster;
const frameworkDescription = {
'version': 10,
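The semantics of the change above (from `retryCount >= -1` to `retryCount !== -2`) can be sketched in a few lines: with the new condition, only the sentinel value -2 disables the fancy retry policy, while any other value enables it. A Python sketch of the logic, not the actual server code; the meaning of -1 as "unlimited retries" is an assumption for illustration:

```python
# Sketch of the retry-policy flag derivation (mirrors the JS change; not the server code).

def fancy_retry_policy(retry_count):
    """-2 is the sentinel that disables the fancy retry policy; anything else enables it."""
    return retry_count != -2

old = lambda retry_count: retry_count >= -1  # previous behavior

# The behaviors differ only for values below -2 (e.g. a malformed -3):
assert fancy_retry_policy(-1) and old(-1)        # -1 (assumed unlimited): both enable
assert not fancy_retry_policy(-2) and not old(-2)
assert fancy_retry_policy(-3) and not old(-3)    # the changed case
```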

View file

@@ -26,7 +26,7 @@ BASH_XTRACEFD=17
function exit_handler()
{
printf "%s %s\n" \
"[DEBUG]" "EXIT signal received in docker container, exiting ..."
"[DEBUG]" "Docker container exit handler: EXIT signal received in docker container, exiting ..."
kill 0
}

View file

@@ -29,9 +29,9 @@ BASH_XTRACEFD=13
function exit_handler()
{
printf "%s %s\n" \
"[DEBUG]" "EXIT signal received in yarn container, exiting ..."
"[DEBUG]" "Yarn container exit handler: EXIT signal received in yarn container, exiting ..."
pid=$(docker inspect --format={{{ inspectFormat }}} $docker_name) && kill -9 $pid ||\
printf "%s %s\n" "[DEBUG]" "$docker_name does not exist."
printf "%s %s\n" "[DEBUG]" "Yarn container exit handler tries to kill the container $docker_name that does not exist. It probably has already exited."
kill 0
}

View file

@@ -14,7 +14,7 @@ The deployment of web portal goes with the bootstrapping process of the whole PA
### Submit a job
Click the tab "Submit Job" to show a button asking you to select a json file for the submission. The job config file must follow the format shown in [job tutorial](../job-tutorial/README.md).
Click the tab "Submit Job" to show a button asking you to select a json file for the submission. The job config file must follow the format shown in [job tutorial](../docs/job_tutorial.md).
### View job status
@@ -30,4 +30,4 @@ Click the tab "Cluster View" to see the status of the whole cluster. Specificall
### Read documents
Click the tab "Documents" to read the tutorial of submitting a job.
Click the tab "Documents" to read the tutorial of submitting a job.

View file

@@ -24,7 +24,7 @@ const webportalConfig = require('./webportal.config');
// copy docs to app
fse.copySync(
helpers.root('../job-tutorial/README.md'),
helpers.root('../docs/job_tutorial.md'),
helpers.root('src/app/job/job-docs/job-docs.md'));
fse.copySync(