@@ -228,4 +228,4 @@ After start rest-server, please ensure that the following task is successfully e
##### If test failed
-Please try to delete the rest-server, and then try to start it again. If fail again, please provide detail information and create issue ticket in github.
+Please try to delete the rest-server, and then try to start it again. If it fails again, please provide detailed information and create an issue ticket on GitHub.
diff --git a/docs/manual/cluster-admin/how-to-set-up-storage.md b/docs/manual/cluster-admin/how-to-set-up-storage.md
index d6011cb66..946bee331 100644
--- a/docs/manual/cluster-admin/how-to-set-up-storage.md
+++ b/docs/manual/cluster-admin/how-to-set-up-storage.md
@@ -1,16 +1,16 @@
# How to Set Up Storage
-This document describes how to use Kubernetes Persistent Volumes (PV) as storage on PAI. To set up existing storage (nfs, samba, Azure blob, etc.), you need:
+This document describes how to use Kubernetes Persistent Volumes (PV) as storage on PAI. To set up existing storage (NFS, Samba, Azure blob, etc.), you need:
1. Create PV and PVC as PAI storage on Kubernetes.
- 2. Confirm the worker nodes have proper package to mount the PVC. For example, the `NFS` PVC requires package `nfs-common` to work on Ubuntu.
+ 2. Confirm the worker nodes have the proper package to mount the PVC. For example, the `NFS` PVC requires package `nfs-common` to work on Ubuntu.
3. Assign PVC to specific user groups.
Users can mount those PV/PVCs into their jobs after you set up the storage properly. The PVC name is used as the storage name on PAI.
## Create PV/PVC on Kubernetes
-There're many approches to create PV/PVC, you could refer to [Kubernetes docs](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) if you are not familiar yet. Followings are some commonly used PV/PVC examples.
+There are many approaches to create PV/PVCs; you can refer to the [Kubernetes docs](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) if you are not familiar with them yet. The following are some commonly used PV/PVC examples.
### NFS
@@ -56,9 +56,9 @@ spec:
Save the above file as `nfs-storage.yaml` and run `kubectl apply -f nfs-storage.yaml` to create a PV named `nfs-storage-pv` and a PVC named `nfs-storage` for the NFS server `nfs://10.0.0.1:/data`. The PVC will be bound to the specific PV through a label selector, using the label `name: nfs-storage`.
-Users could use PVC name `nfs-storage` as storage name to mount this nfs storage in their jobs.
+Users could use the PVC name `nfs-storage` as the storage name to mount this NFS storage in their jobs.
-If you want to configure the above nfs as personal storage so that each user could only visit their own directory on PAI like Linux home directory, for example, Alice can only mount `/data/Alice` while Bob can only mount `/data/Bob`, you could add a `share: "false"` label to PVC. In this case, PAI will use `${PAI_USER_NAME}` as sub path when mounting to job containers.
+If you want to configure the above NFS as personal storage, so that each user can only visit their own directory on PAI (like a Linux home directory; for example, Alice can only mount `/data/Alice` while Bob can only mount `/data/Bob`), you can add a `share: "false"` label to the PVC. In this case, PAI will use `${PAI_USER_NAME}` as the subpath when mounting to job containers.
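+For example, a minimal sketch of adding this label to the existing PVC with `kubectl` (assuming the PVC is named `nfs-storage` as above and lives in the default namespace):
+
+```bash
+# add the share label to the PVC; the value must be the string "false"
+kubectl label pvc nfs-storage share=false
+# verify that the label is present
+kubectl get pvc nfs-storage --show-labels
+```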
### Samba
@@ -70,7 +70,7 @@ Please refer to [this document](https://github.com/Azure/kubernetes-volume-drive
#### Tips
-If you cannot mount blobfuse PVC into containers and the corresponding job in OpenPAI sticks in `WAITING` status, please double check the following requirements:
+If you cannot mount blobfuse PVC into containers and the corresponding job in OpenPAI sticks in `WAITING` status, please double-check the following requirements:
**requirement 1.** Every worker node should have `blobfuse` installed. Try the following commands to verify:
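+For instance, a sketch of such a check and install (the package repository URL assumes Ubuntu 16.04; adjust it to your distribution):
+
+```bash
+# install blobfuse from the Microsoft package repository if it is missing
+which blobfuse || {
+  wget https://packages.microsoft.com/config/ubuntu/16.04/packages-microsoft-prod.deb
+  sudo dpkg -i packages-microsoft-prod.deb
+  sudo apt update && sudo apt install -y blobfuse fuse
+}
+```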
@@ -90,13 +90,13 @@ curl -s https://raw.githubusercontent.com/Azure/kubernetes-volume-drivers/master
| kubectl apply -f -
```
-> NOTE: There is a known issue [#4637](https://github.com/microsoft/pai/issues/4637) to mount same PV multiple times on same node, please either:
+> NOTE: There is a known issue [#4637](https://github.com/microsoft/pai/issues/4637) when mounting the same PV multiple times on the same node. Please either:
> * use the [patched blobfuse flexvolume installer](https://github.com/microsoft/pai/issues/4637#issuecomment-647434815) instead.
> * use the [earlier version 1.1.1](https://github.com/Azure/kubernetes-volume-drivers/issues/66#issuecomment-649188681) instead.
### Azure File
-First create a Kubernetes secret to access the Azure file share.
+First, create a Kubernetes secret to access the Azure file share.
```sh
kubectl create secret generic azure-secret --from-literal=azurestorageaccountname=$AKS_PERS_STORAGE_ACCOUNT_NAME --from-literal=azurestorageaccountkey=$STORAGE_KEY
@@ -185,17 +185,17 @@ spec:
.......
```
-Please notice, `PersistentVolume.Spec.AccessModes` and `PersistentVolumeClaim.Spec.AccessModes` doesn't affect whether a storage is writable in PAI. They only take effect during binding time between PV and PVC.
+Please notice that `PersistentVolume.Spec.AccessModes` and `PersistentVolumeClaim.Spec.AccessModes` don't affect whether a storage is writable in PAI. They only take effect at binding time between the PV and PVC.
## Confirm Environment on Worker Nodes
-The [notice in Kubernetes' document](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistent-volumes) mentions: helper program may be required to consume certain type of PersistentVolume. For example, all worker nodes should have `nfs-common` installed if you want to use `NFS` PV. You can confirm it using the command `apt install nfs-common` on every worker node.
+The [notice in Kubernetes' document](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistent-volumes) mentions that a helper program may be required to consume certain types of PersistentVolumes. For example, all worker nodes should have `nfs-common` installed if you want to use an `NFS` PV. You can confirm it by running `apt install nfs-common` on every worker node.
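+For example, a quick sketch of confirming the NFS client on a worker (using the example server address `10.0.0.1` from above):
+
+```bash
+# install the helper package and list the exports of the NFS server to confirm connectivity
+sudo apt install -y nfs-common
+showmount -e 10.0.0.1
+```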
-Since different PVs have different requirements, you should check the environment according to document of the PV.
+Since different PVs have different requirements, you should check the environment according to the documentation of the corresponding PV type.
## Assign Storage to PAI Groups
-The PVC name is used as storage name in OpenPAI. After you have set up the PV/PVC and checked the environment, you need to assign storage to users. In OpenPAI, the name of the PVC is used as the storage name, and the access of different storages is managed by [user groups](./how-to-manage-users-and-groups.md). To assign storage to a user, please use RESTful API to assign storage to the groups of the user.
+The PVC name is used as the storage name in OpenPAI. After you have set up the PV/PVC and checked the environment, you need to assign storage to users: access to different storages is managed by [user groups](./how-to-manage-users-and-groups.md). To assign storage to a user, please use the RESTful API to assign it to the groups of that user.
Before querying the API, you should get an access token for the API. Go to your profile page and copy one:
@@ -208,7 +208,7 @@ For example, if you want to assign `nfs-storage` PVC to `default` group. First,
```json
{
"groupname": "default",
- "description": "group for default vc",
+ "description": "group for default vc",
"externalName": "",
"extension": {
"acls": {
@@ -220,7 +220,7 @@ For example, if you want to assign `nfs-storage` PVC to `default` group. First,
}
```
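+The group config above can be fetched with a GET request; a minimal `curl` sketch (`<token>` and `<pai-master-ip>` are placeholders for your own values):
+
+```bash
+# fetch the current config of the "default" group via the rest-server API
+curl -H "Authorization: Bearer <token>" \
+  "https://<pai-master-ip>/rest-server/api/v2/groups/default"
+```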
-The GET request must use header `Authorization: Bearer ` for authorization. This remains the same for all API calls. You may notice the `storageConfigs` in the return body. In fact it controls which storage a group can use. To add a `nfs-storage` to it, PUT `http(s):///rest-server/api/v2/groups`. Request body is:
+The GET request must use the header `Authorization: Bearer ` for authorization. This remains the same for all API calls. You may notice the `storageConfigs` in the return body; it controls which storages a group can use. To add `nfs-storage` to it, send a PUT request to `http(s):///rest-server/api/v2/groups`. The request body is:
```json
{
@@ -242,7 +242,7 @@ Do not omit any fields in `extension` or it will change the `virtualClusters` se
## Example: Use Storage Manager to Create an NFS + SAMBA Server
-To help you set up the storage, OpenPAI provides a storage manager, which can set up an NFS + SAMBA server. In the cluster, the NFS storage can be accessed in OpenPAI containers. Out of the cluster, users can mount the storage on Unix-like system, or access it in File Explorer on Windows.
+To help you set up the storage, OpenPAI provides a storage manager, which can set up an NFS + SAMBA server. Inside the cluster, the NFS storage can be accessed in OpenPAI containers. Outside the cluster, users can mount the storage on a Unix-like system or access it in File Explorer on Windows.
Please read the document about [service management and paictl](./basic-management-operations.md#pai-service-management-and-paictl) first, and start a dev box container. Then, in the dev box container, pull the configuration by:
@@ -250,7 +250,7 @@ Please read the document about [service management and paictl](./basic-managemen
./paictl config pull -o /cluster-configuration
```
-To use storage manager, you should first decide a machine in PAI system to be the storage server. The machine **must** be one of PAI workers, not PAI master. Please open `/cluster-configuration/layout.yaml`, choose a worker machine, then add a `pai-storage: "true"` field to it. Here is an example of the edited `layout.yaml`:
+To use the storage manager, you should first decide on a machine in the PAI system to be the storage server. The machine **must** be one of the PAI workers, not the PAI master. Please open `/cluster-configuration/layout.yaml`, choose a worker machine, then add a `pai-storage: "true"` field to it. Here is an example of the edited `layout.yaml`:
```yaml
......
@@ -287,7 +287,7 @@ storage-manager:
smbpwd: smbpwd
```
-The `localpath` determines the root data dir for NFS on the storage server. The `smbuser` and `smbpwd` determines the username and password when you access the storage in File Explorer on Windows.
+The `localpath` determines the root data dir for NFS on the storage server. The `smbuser` and `smbpwd` determine the username and password when you access the storage in File Explorer on Windows.
Follow these commands to start the storage manager:
@@ -352,11 +352,11 @@ spec:
Use `kubectl create -f nfs-storage.yaml` to create the PV and PVC.
-Since the Kuberentes PV requires the node using it has the corresponding driver, we should use `apt install nfs-common` to install the `nfs-common` package on every worker node.
+Since a Kubernetes PV requires the node using it to have the corresponding driver, we should run `apt install nfs-common` to install the `nfs-common` package on every worker node.
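+If you have many workers, an ad-hoc `ansible` command can do this in one shot; a sketch, assuming the kubespray inventory at `~/pai-deploy/kubespray/inventory/pai/hosts.yml`:
+
+```bash
+# install nfs-common on every host in the inventory
+ansible all -i ~/pai-deploy/kubespray/inventory/pai/hosts.yml \
+  -m apt -a "name=nfs-common state=present update_cache=yes" --become
+```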
Finally, [assign storage to PAI groups](#assign-storage-to-pai-groups) by rest-server API. Then you can mount it into job containers.
-How to upload data to the storage server? On Windows, open the File Explorer, type in `\\10.0.0.1` (please change `10.0.0.1` to your storage server IP), and press ENTER. The File Explorer will ask you for authorization. Please use `smbuser` and `smbpwd` as username and password to login. On a Unix-like system, you can mount the NFS folder to the file system. For example, on Ubuntu, use the following command to mount it:
+How to upload data to the storage server? On Windows, open the File Explorer, type in `\\10.0.0.1` (please change `10.0.0.1` to your storage server IP), and press ENTER. The File Explorer will ask you for authorization. Please use `smbuser` and `smbpwd` as the username and password to log in. On a Unix-like system, you can mount the NFS folder to the file system. For example, on Ubuntu, use the following command to mount it:
```bash
# replace 10.0.0.1 with your storage server IP
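+# the remaining steps, as a sketch; the /data export follows the earlier example — adjust it to your `localpath`
+sudo apt update && sudo apt install -y nfs-common
+sudo mount -t nfs 10.0.0.1:/data /mnt
+```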
diff --git a/docs/manual/cluster-admin/how-to-set-up-virtual-clusters.md b/docs/manual/cluster-admin/how-to-set-up-virtual-clusters.md
index 4cf9da321..557cb5600 100644
--- a/docs/manual/cluster-admin/how-to-set-up-virtual-clusters.md
+++ b/docs/manual/cluster-admin/how-to-set-up-virtual-clusters.md
@@ -2,7 +2,7 @@
## What is Hived Scheduler and How to Configure it
-OpenPAI supports two kinds of scheduler: the Kubernetes scheduler, and [hivedscheduler](https://github.com/microsoft/hivedscheduler). [Hivedscheduler](https://github.com/microsoft/hivedscheduler) is a Kubernetes Scheduler for Deep Learning. It supports virtual cluster division, topology-aware resource guarantee and optimized gang scheduling, which are not supported in k8s default scheduler. If you didn't specify `enable_hived_scheduler: false` during installation, hived scheduler is enabled by default. Please notice only hivedscheduler supports virtual cluster setup, k8s default scheduler doesn't support it.
+OpenPAI supports two kinds of schedulers: the Kubernetes default scheduler and [hivedscheduler](https://github.com/microsoft/hivedscheduler). [Hivedscheduler](https://github.com/microsoft/hivedscheduler) is a Kubernetes Scheduler for Deep Learning. It supports virtual cluster division, topology-aware resource guarantees, and optimized gang scheduling, which are not supported in the k8s default scheduler. If you didn't specify `enable_hived_scheduler: false` during installation, hivedscheduler is enabled by default. Please notice that only hivedscheduler supports virtual cluster setup; the k8s default scheduler doesn't support it.
## Set Up Virtual Clusters
@@ -45,7 +45,7 @@ hivedscheduler:
...
```
-If you have followed the [installation guide](./installation-guide.md), you would find similar setting in your [`services-configuration.yaml`](./basic-management-operations.md#pai-service-management-and-paictl). The detailed explanation of these fields are in the [hived scheduler document](https://github.com/microsoft/hivedscheduler/blob/master/doc/user-manual.md). You can update the configuration and set up virtual clusters. For example, in the above settings, we have 3 nodes, `worker1`, `worker2` and `worker3`. They are all in the `default` virtual cluster. If we want to create two VCs, one is called `default` and has 2 nodes, the other is called `new` and has 1 node, we can first modify `services-configuration.yaml`:
+If you have followed the [installation guide](./installation-guide.md), you will find a similar setting in your [`services-configuration.yaml`](./basic-management-operations.md#pai-service-management-and-paictl). The detailed explanation of these fields is in the [hived scheduler document](https://github.com/microsoft/hivedscheduler/blob/master/doc/user-manual.md). You can update the configuration and set up virtual clusters. For example, in the above settings, we have 3 nodes: `worker1`, `worker2`, and `worker3`. They are all in the `default` virtual cluster. If we want to create two VCs, one called `default` with 2 nodes and the other called `new` with 1 node, we can first modify `services-configuration.yaml`:
```yaml
# services-configuration.yaml
@@ -198,7 +198,7 @@ This should be self-explanatory. The `virtualClusters` field is used to manage V
./paictl.py service start -n rest-server
```
-## Different Hardwares in Worker Nodes
+## Different Hardware in Worker Nodes
We recommend that one VC contain the same kind of hardware, which leads to one `skuType` per VC in the hived scheduler setting. If you have different types of worker nodes (e.g. different GPU types on different nodes), please configure them in different VCs. Here is an example of 2 kinds of nodes:
@@ -251,7 +251,7 @@ hivedscheduler:
cellNumber: 3
```
-In the above example, we set up 2 VCs: `default` and `v100`. The `default` VC has 2 K80 nodes, and `V100` VC has 3 V100 nodes. Every K80 node has 4 K80 GPUs and Every V100 nodes has 4 V100 GPUs.
+In the above example, we set up 2 VCs: `default` and `v100`. The `default` VC has 2 K80 nodes, and the `v100` VC has 3 V100 nodes. Every K80 node has 4 K80 GPUs and every V100 node has 4 V100 GPUs.
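+To double-check the GPU count Kubernetes reports for a node, a quick sketch (`worker1` is one of the example node names):
+
+```bash
+# the nvidia.com/gpu capacity should match the cellNumber settings above
+kubectl describe node worker1 | grep -i "nvidia.com/gpu"
+```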
## Configure CPU and GPU SKU on the Same Node
@@ -305,14 +305,14 @@ hivedscheduler:
cellNumber: 2
```
-Currently we only support mixing CPU and GPU types on one NVIDIA GPU node or one AMD GPU node,
+Currently, we only support mixing CPU and GPU types on one NVIDIA GPU node or one AMD GPU node;
rare cases, such as NVIDIA cards and AMD cards on the same node, are not supported.
## Use Pinned Cell to Reserve Certain Node in a Virtual Cluster
-In some cases, you might want to reserve a certain node in a virtual cluster, and submit job to this node explicitly for debugging or quick testing. OpenPAI provides you with a way to "pin" a node to a virtual cluster.
+In some cases, you might want to reserve a certain node in a virtual cluster, and submit jobs to this node explicitly for debugging or quick testing. OpenPAI provides you with a way to "pin" a node to a virtual cluster.
-For example, assuming you have three worker nodes: `worker1`, `worker2`, and `worker3`, and 2 virtual clusters: `default` and `new`. The `default` VC has 2 workers, and `new` VC only has one worker. The following is an example for the configuration:
+For example, assuming you have three worker nodes: `worker1`, `worker2`, and `worker3`, and 2 virtual clusters: `default` and `new`. The `default` VC has 2 workers, and the `new` VC only has one worker. The following is an example of the configuration:
```yaml
# services-configuration.yaml
diff --git a/docs/manual/cluster-admin/how-to-uninstall-openpai.md b/docs/manual/cluster-admin/how-to-uninstall-openpai.md
index 9ae6c74ec..f2a51e8e3 100644
--- a/docs/manual/cluster-admin/how-to-uninstall-openpai.md
+++ b/docs/manual/cluster-admin/how-to-uninstall-openpai.md
@@ -2,7 +2,7 @@
## Uninstallation Guide for OpenPAI >= v1.0.0
-The uninstallation of OpenPAI >= `v1.0.0` is irreversible: all the data will be removed and you cannot find them back. If you need a backup, do it before uninstallation.
+The uninstallation of OpenPAI >= `v1.0.0` is irreversible: all the data will be removed and you cannot get it back. If you need a backup, do it before the uninstallation.
First, log in to the dev box machine and delete all PAI services with the [dev box container](./basic-management-operations.md#pai-service-management-and-paictl):
@@ -16,17 +16,17 @@ Now all PAI services and data are deleted. If you want to destroy the Kubernetes
ansible-playbook -i inventory/pai/hosts.yml reset.yml --become --become-user=root -e "@inventory/pai/openpai.yml"
```
-We recommend you to keep the folder `~/pai-deploy` for re-installation.
+We recommend you keep the folder `~/pai-deploy` for re-installation.
## Uninstallation Guide for OpenPAI < v1.0.0
### Save your Data to a Different Place
-During the uninstallation of OpenPAI < `v1.0.0`, you cannot preserve any useful data: all jobs, user information, dataset will be lost inevitably and irreversibly. Thus, if you have any useful data in previous deployment, please make sure you have saved them to a different place.
+During the uninstallation of OpenPAI < `v1.0.0`, you cannot preserve any useful data: all jobs, user information, and datasets will be lost inevitably and irreversibly. Thus, if you have any useful data in the previous deployment, please make sure you have saved it to a different place.
#### HDFS Data
-Before `v1.0.0`, PAI will deploy an HDFS server for you. After `v1.0.0`, the HDFS server won't be deployed and previous data will be removed in upgrade. The following commands could be used to transfer your HDFS data:
+Before `v1.0.0`, PAI deployed an HDFS server for you. After `v1.0.0`, the HDFS server won't be deployed, and previous data will be removed during the upgrade. The following commands can be used to transfer your HDFS data:
``` bash
# check data structure
@@ -35,11 +35,11 @@ hdfs dfs -ls hdfs://:/
hdfs dfs -copyToLocal hdfs://:/
```
-`` and `` is the ip of PAI master and `9000` if you did't modify the default setting. Please make sure your local folder has enough capacity to hold the data you want to save.
+`` and `` are the IP of the PAI master and `9000`, respectively, if you didn't modify the default settings. Please make sure your local folder has enough capacity to hold the data you want to save.
#### Metadata of Jobs and Users
-Metadata of jobs and users will also be lost, including job records, job log, user name, user password, etc. We do not have an automatical tool for you to backup these data. Please transfer the data manually if you find some are valuable.
+Metadata of jobs and users will also be lost, including job records, job logs, usernames, user passwords, etc. We do not have an automatic tool for you to back up these data. Please transfer the data manually if you find any of it valuable.
#### Other Resources on Kubernetes
@@ -55,7 +55,7 @@ cd pai
# checkout to a different branch if you have a different version
git checkout pai-0.14.y
-# delete all pai service and remove all service data
+# delete all PAI services and remove all service data
./paictl.py service delete
# delete k8s cluster
@@ -86,7 +86,7 @@ lsmod | grep -qE "^nvidia" &&
done
rmmod nvidia ||
{
- echo "The driver nvidia is still in use, can't unload it."
+ echo "The nvidia driver is still in use, can't unload it."
exit 1
}
}
diff --git a/docs/manual/cluster-admin/how-to-use-alert-system.md b/docs/manual/cluster-admin/how-to-use-alert-system.md
index 25d87f9ac..209772f99 100644
--- a/docs/manual/cluster-admin/how-to-use-alert-system.md
+++ b/docs/manual/cluster-admin/how-to-use-alert-system.md
@@ -1,6 +1,6 @@
# How to Use Alert System
-OpenPAI has a built-in alert system. The alert system has some existing alert rules and actions. It can also let admin customize them. In this document, we will have a detailed introduction to this topic.
+OpenPAI has a built-in alert system. The alert system has some existing alert rules and actions. It can also let the admin customize them. In this document, we will have a detailed introduction to this topic.
## Alert Rules
@@ -13,12 +13,12 @@ alert: GpuUsedByExternalProcess
expr: gpu_used_by_external_process_count > 0
for: 5m
annotations:
- summary: found nvidia used by external process in {{$labels.instance}}
+ summary: found NVIDIA GPU used by external process in {{$labels.instance}}
```
For the detailed syntax of alert rules, please refer to [here](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/).
-All alerts fired by the alert rules, including the pre-defined rules and the customized rules, will be shown on the home page of Webportal (on the top-right corner).
+All alerts fired by the alert rules, including the pre-defined rules and the customized rules, will be shown on the home page of webportal (on the top-right corner).
### Existing Alert Rules
@@ -53,9 +53,9 @@ prometheus:
```
The `PAIJobGpuPercentLowerThan0_3For1h` alert will be fired when a job on the virtual cluster `default` has a task-level average GPU percent lower than `30%` for more than `1 hour`.
-The alert severity can be defined as `info`, `warn`, `error` or `fatal` by adding a label.
+The alert severity can be defined as `info`, `warn`, `error`, or `fatal` by adding a label.
Here we use `warn`.
-Here the metric `task_gpu_percent` is used, which describes the GPU utilization at task level.
+Here the metric `task_gpu_percent` is used, which describes the GPU utilization at the task level.
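+For instance, a hedged sketch of inspecting this metric through the Prometheus HTTP API (the label names here are assumptions; check the labels of your own metrics first):
+
+```bash
+# average GPU percent per job on the "default" virtual cluster; <master-ip> is a placeholder
+curl -sG "http://<master-ip>:9091/api/v1/query" \
+  --data-urlencode 'query=avg by (job_name) (task_gpu_percent{virtual_cluster="default"})'
+```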
Remember to push service config to the cluster and restart the `prometheus` service after your modification with the following commands [in the dev-box container](./basic-management-operations.md#pai-service-management-and-paictl):
```bash
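+# a sketch of the usual sequence; /cluster-configuration is the config path used elsewhere in this manual
+./paictl.py config push -p /cluster-configuration -m service
+./paictl.py service stop -n prometheus
+./paictl.py service start -n prometheus
+```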
@@ -68,7 +68,7 @@ Please refer to [Prometheus Alerting Rules](https://prometheus.io/docs/prometheu
## Alert Actions and Routes
-Admin can choose how to handle the alerts by different alert actions. We provide some basic alert actions and you can also customize your own actions. In this section, we will first introduce the existing actions and the matching rules between these actions and alerts. Then we will let you know how to add new alert actions. The actions and matching rules are both handled by [`alert-manager`](https://prometheus.io/docs/alerting/latest/alertmanager/).
+The admin can choose how to handle the alerts with different alert actions. We provide some basic alert actions, and you can also customize your own actions. In this section, we will first introduce the existing actions and the matching rules between these actions and alerts. Then we will show you how to add new alert actions. The actions and matching rules are both handled by [`alert-manager`](https://prometheus.io/docs/alerting/latest/alertmanager/).
### Existing Actions and Matching Rules
diff --git a/docs/manual/cluster-admin/installation-faqs-and-troubleshooting.md b/docs/manual/cluster-admin/installation-faqs-and-troubleshooting.md
index d3cebe64a..85a80a327 100644
--- a/docs/manual/cluster-admin/installation-faqs-and-troubleshooting.md
+++ b/docs/manual/cluster-admin/installation-faqs-and-troubleshooting.md
@@ -4,20 +4,20 @@
#### Which version of NVIDIA driver should I install?
-First, check out the [NVIDIA site](https://www.nvidia.com/Download/index.aspx) to verify the newest driver version of your GPU card. Then, check out [this table](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver) to see the CUDA requirement of driver version.
+First, check out the [NVIDIA site](https://www.nvidia.com/Download/index.aspx) to verify the newest driver version for your GPU card. Then, check out [this table](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver) to see the CUDA requirements of each driver version.
-Please note that, some docker images with new CUDA version cannot be used on machine with old driver. As for now, we recommend to install the NVIDIA driver 418 as it supports CUDA 9.0 to CUDA 10.1, which is used by most deep learning frameworks.
+Please note that some docker images with new CUDA versions cannot be used on machines with old drivers. As for now, we recommend installing the NVIDIA driver 418 as it supports CUDA 9.0 to CUDA 10.1, which is used by most deep learning frameworks.
-#### How to fasten deploy speed on large cluster?
+#### How to speed up deployment on a large cluster?
By default, `Ansible` uses 5 forks to execute commands in parallel on all hosts. If your cluster is large, this may be slow for you.
-To fasten the deploy speed, you can add `-f ` to all commands using `ansible` or `ansible-playbook`. See [ansible doc](https://docs.ansible.com/ansible/latest/cli/ansible.html#cmdoption-ansible-f) for reference.
+To speed up the deployment, you can add `-f ` to all commands using `ansible` or `ansible-playbook`. See the [ansible doc](https://docs.ansible.com/ansible/latest/cli/ansible.html#cmdoption-ansible-f) for reference.
-#### How to remove k8s network plugin
+#### How to remove the k8s network plugin
-After installation, if you use [weave](https://github.com/weaveworks/weave) as k8s network plugin and you encounter some errors about the network, such as some pods failed to connect internet, you could remove network plugin to solve this issue.
+After installation, if you use [weave](https://github.com/weaveworks/weave) as the k8s network plugin and you encounter some network errors, such as some pods failing to connect to the internet, you can remove the network plugin to solve this issue.
Please run `kubectl delete ds weave-net -n kube-system` to remove the `weave-net` daemon set first.
@@ -71,18 +71,18 @@ To remove the network plugin, you could use following `ansible-playbook`:
executable: /bin/bash
```
-After these steps you need to change the `coredns` to fix dns resolution issue.
+After these steps, you need to change the `coredns` to fix the DNS resolution issue.
Please run `kubectl edit cm coredns -n kube-system`, change `.:53` to `.:9053`
Please run `kubectl edit service coredns -n kube-system`, change `targetPort: 53` to `targetPort: 9053`
Please run `kubectl edit deployment coredns -n kube-system`, change `containerPort: 53` to `containerPort: 9053`. Add `hostNetwork: true` in pod spec.
#### How to check whether the GPU driver is installed?
-For Nvidia GPU, use command `nvidia-smi` to check.
+For NVIDIA GPU, use the command `nvidia-smi` to check.
#### How to install GPU driver?
-For Nvidia GPU, please first determine which version of driver you want to install (see [this question](#which-version-of-nvidia-driver-should-i-install) for details). Then follow these commands:
+For NVIDIA GPU, please first determine which version of the driver you want to install (see [this question](#which-version-of-nvidia-driver-should-i-install) for details). Then follow these commands:
```bash
sudo add-apt-repository ppa:graphics-drivers/ppa
@@ -91,9 +91,9 @@ sudo apt install nvidia-418
sudo reboot
```
-Here we use nvidia driver version 418 as an example. Please modify `nvidia-418` if you want to install a different version, and refer to the Nvidia community for help if encounter any problem.
+Here we use NVIDIA driver version 418 as an example. Please modify `nvidia-418` if you want to install a different version, and refer to the NVIDIA community for help if you encounter any problems.
-#### How to install nvidia-container-runtime?
+#### How to install `nvidia-container-runtime`?
Please refer to the [official document](https://github.com/NVIDIA/nvidia-container-runtime#installation). Don't forget to set it as Docker's default runtime in [docker-config-file](https://docs.docker.com/config/daemon/#configure-the-docker-daemon). Here is an example of `/etc/docker/daemon.json`:
@@ -117,7 +117,7 @@ Please refer to [this document](https://github.com/microsoft/pai/tree/master/con
#### Ansible reports `Failed to update apt cache` or `Apt install ` fails
-Please first check if there is any network-related issues. Besides network, another reason for this problem is: `ansible` sometimes runs a `apt update` to update the cache before the package installation. If `apt update` exits with a non-zero code, the whole command will be considered to be failed.
+Please first check if there are any network-related issues. Besides the network, another reason for this problem is that `ansible` sometimes runs an `apt update` to update the cache before the package installation. If `apt update` exits with a non-zero code, the whole command will be considered failed.
You can check this by running `sudo apt update; echo $?` on the corresponding machine. If the exit code is not 0, please fix it. Here are 2 common causes of this problem:
@@ -137,7 +137,7 @@ sudo apt update
#### Ansible playbook exits because of timeout.
-Sometimes, if you assign a different hostname for a certain machine, any commands with `sudo` will be very slow on that machine. Because the system DNS try to find the new hostname, but it will fail due to a timeout.
+Sometimes, if you assign a different hostname to a certain machine, any commands with `sudo` will be very slow on that machine, because the system DNS tries to resolve the new hostname and only fails after a timeout.
To fix this problem, on each machine, you can add the new hostname to its `/etc/hosts` by:
@@ -149,7 +149,7 @@ sudo chmod 644 /etc/hosts
#### Ansible exits because `sudo` is timed out.
-The same as `1. Ansible playbook exits because of timeout.` .
+The same as `1. Ansible playbook exits because of timeout.`
#### Ansible reports `Could not import python modules: apt, apt_pkg. Please install python3-apt package.`
@@ -166,12 +166,12 @@ During installation, the script will download kubeadm and hyperkube from `storag
- `kubeadm`: `https://storage.googleapis.com/kubernetes-release/release/v1.15.11/bin/linux/amd64/kubeadm`
- `hyperkube`: `https://storage.googleapis.com/kubernetes-release/release/v1.15.11/bin/linux/amd64/hyperkube`
-Please find alternative urls for downloading this two files and modify `kubeadm_download_url` and `hyperkube_download_url` in your `config` file.
+Please find alternative URLs for downloading these two files and modify `kubeadm_download_url` and `hyperkube_download_url` in your `config` file.
**Cannot download image**
Please first check the log to see which image blocks the installation process, and modify `gcr_image_repo`, `kube_image_repo`, `quay_image_repo`, or `docker_image_repo` to a mirror repository correspondingly in `config` file.
-For example, if you cannot pull images from `gcr.io`, you should fisrt find a mirror repository (We recommend you to use `gcr.azk8s.cn` if you are in China). Then, modify `gcr_image_repo` and `kube_image_repo`.
+For example, if you cannot pull images from `gcr.io`, you should first find a mirror repository (we recommend `gcr.azk8s.cn` if you are in China). Then, modify `gcr_image_repo` and `kube_image_repo`.
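+A sketch of the corresponding entries in the config file (the file name and mirror addresses are examples, not verified defaults):
+
+```bash
+# append mirror settings to the installer's config file
+cat >> config.yaml <<'EOF'
+gcr_image_repo: "gcr.azk8s.cn"
+kube_image_repo: "gcr.azk8s.cn/google-containers"
+EOF
+```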
Especially for `gcr.io`, we found some image links in kubespray that do not adopt `gcr_image_repo` and `kube_image_repo`. You should modify them manually in `~/pai-deploy/kubespray`. The command `grep -r --color gcr.io ~/pai-deploy/kubespray` will be helpful.
\ No newline at end of file
diff --git a/docs/manual/cluster-admin/installation-guide.md b/docs/manual/cluster-admin/installation-guide.md
index c0beb4601..d3abdb831 100644
--- a/docs/manual/cluster-admin/installation-guide.md
+++ b/docs/manual/cluster-admin/installation-guide.md
@@ -66,17 +66,17 @@ We recommend you to use CPU-only machines for dev box and master. The detailed r
The worker machines are used to run jobs. You can use multiple workers during installation.
-We support various types of workers: CPU worker, GPU worker, and workers with other computing device (e.g. TPU, NPU).
+We support various types of workers: CPU workers, GPU workers, and workers with other computing devices (e.g. TPU, NPU).
-In the same time, we also support two schedulers: the Kubernetes default scheduler, and [hivedscheduler](https://github.com/microsoft/hivedscheduler).
+At the same time, we also support two schedulers: the Kubernetes default scheduler, and [hivedscheduler](https://github.com/microsoft/hivedscheduler).
-Hivedscheduler is the default for OpenPAI. It supports virtual cluster division, topology-aware resource guarantee and optimized gang scheduling, which are not supported in k8s default scheduler.
+Hivedscheduler is the default for OpenPAI. It supports virtual cluster division, topology-aware resource guarantee, and optimized gang scheduling, which are not supported in the k8s default scheduler.
-For now, the support for CPU/Nvidia GPU workers and workers with other computing device is different:
+For now, the support for CPU/NVIDIA GPU workers and workers with other computing devices is different:
- - For CPU worker and NVIDIA GPU worker, both k8s default scheduler and hived scheduler can be used.
- - For workers with other types of computing device (e.g. TPU, NPU), currently we only support the usage of k8s default scheduler. You can only include workers with the same computing device in the cluster. For example, you can use TPU workers, but all workers should be TPU workers. You cannot use TPU workers together with GPU workers in one cluster.
+ - For CPU workers and NVIDIA GPU workers, both k8s default scheduler and hived scheduler can be used.
+ - For workers with other types of computing devices (e.g. TPU, NPU), currently, we only support the usage of the k8s default scheduler. You can only include workers with the same computing device in the cluster. For example, you can use TPU workers, but all workers should be TPU workers. You cannot use TPU workers together with GPU workers in one cluster.
Please check the following requirements for different types of worker machines:
@@ -167,10 +167,10 @@ git checkout v1.5.0
```
Please edit the `layout.yaml` and `config.yaml` files under the `/contrib/kubespray/config` folder.
-These two files spedify the cluster layout and the customized configuration, respectively.
+These two files specify the cluster layout and the customized configuration, respectively.
The following shows the format and an example of each of these 2 files.
-**Tips for China Users**: If you are a China user, before you edit these files, please refer to [here](./configuration-for-china.md) first.
+**Tips for Chinese Users**: If you are in Mainland China, please refer to [here](./configuration-for-china.md) first before you edit these files.
#### `layout.yaml` format
@@ -189,7 +189,7 @@ machine-sku:
vcore: 24
gpu-machine:
computing-device:
- # For `type`, please follow the same format specified in device plugin.
+ # For `type`, please follow the same format specified in the device plugin.
# For example, `nvidia.com/gpu` is for NVIDIA GPU, `amd.com/gpu` is for AMD GPU,
# and `enflame.com/dtu` is for Enflame DTU.
# Reference: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
@@ -306,9 +306,9 @@ The `user` and `password` is the SSH username and password from dev box machine
**For Azure Users**: If you are deploying OpenPAI in Azure, please uncomment `openpai_kube_network_plugin: calico` in the config file above, and change it to `openpai_kube_network_plugin: weave`. It is because Azure doesn't support calico. See [here](https://docs.projectcalico.org/reference/public-cloud/azure#why-doesnt-azure-support-calico-networking) for details.
-**For those who use workers other than CPU workers and NVIDIA GPU workers**: Now we only support Kubernetes default scheduler (not Hivedscheduler) for device other than NVIDIA GPU and CPU. Please uncomment `# enable_hived_scheduler: true` and set it to `enable_hived_scheduler: false`.
+**For those who use workers other than CPU workers and NVIDIA GPU workers**: Now we only support Kubernetes default scheduler (not Hivedscheduler) for devices other than NVIDIA GPU and CPU. Please uncomment `# enable_hived_scheduler: true` and set it to `enable_hived_scheduler: false`.
-**If qos-switch is enabled**: OpenPAI daemons will request additional resources in each node. Please check following table and reserve sufficient resources for OpenPAI daemons.
+**If qos-switch is enabled**: OpenPAI daemons will request additional resources in each node. Please check the following table and reserve sufficient resources for OpenPAI daemons.
| Service Name | Memory Request | CPU Request |
| :-----------: | :------------: | :---------: |
@@ -331,7 +331,7 @@ Please run the following script to deploy Kubernetes first. As the name explains
/bin/bash quick-start-kubespray.sh
```
-If there is any problem, please double check the environment requirements first. Here we provide a requirement checker to help you verify:
+If there is any problem, please double-check the environment requirements first. Here we provide a requirement checker to help you verify:
``` bash
/bin/bash requirement.sh -l config/layout.yaml -c config/config.yaml
@@ -342,16 +342,16 @@ You can also refer to [the installation troubleshooting](./installation-faqs-and
The `quick-start-kubespray.sh` will output the following information if k8s is successfully installed:
```
-You can run the following commands to setup kubectl on you local host:
+You can run the following commands to set up kubectl on your local host:
ansible-playbook -i ${HOME}/pai-deploy/kubespray/inventory/pai/hosts.yml set-kubectl.yml --ask-become-pass
```
-By default, we don't setup `kubeconfig` or install `kubectl` client on the dev box machine, but we put the Kubernetes config file in `~/pai-deploy/kube/config`. You can use the config with any Kubernetes client to verify the installation.
+By default, we don't set up `kubeconfig` or install `kubectl` client on the dev box machine, but we put the Kubernetes config file in `~/pai-deploy/kube/config`. You can use the config with any Kubernetes client to verify the installation.
Also, you can use the command `ansible-playbook -i ${HOME}/pai-deploy/kubespray/inventory/pai/hosts.yml set-kubectl.yml --ask-become-pass` to set up `kubeconfig` and `kubectl` on the dev box machine. It will copy the config to `~/.kube/config` and set up the `kubectl` client. After it is executed, you can use `kubectl` on the dev box machine directly.
#### Tips for Network-related Issues
-If you are facing network issues such as the machine cannot download some file, or cannot connect to some docker registry, please combine the prompted error log and kubespray as a keyword, and search for solution. You can also refer to the [installation troubleshooting](./installation-faqs-and-troubleshooting.md#troubleshooting) and [this issue](https://github.com/microsoft/pai/issues/4516).
+If you are facing network issues, such as a machine being unable to download some files or to connect to some docker registry, please combine the prompted error log and kubespray as keywords, and search for a solution. You can also refer to the [installation troubleshooting](./installation-faqs-and-troubleshooting.md#troubleshooting) and [this issue](https://github.com/microsoft/pai/issues/4516).
## Start OpenPAI Services
@@ -375,16 +375,16 @@ You can go to http://, then use the default username and passwor
As the message says, you can use `admin` and `admin-password` to log in to the webportal, then submit a job to validate your installation. We have generated the configuration files of OpenPAI in the folder `~/pai-deploy/cluster-cfg`; they will be used if you need further customization in the future.
-**For those who use workers other than CPU workers, NVIDIA GPU workers, AMD GPU workers, and Enflame DTU workers**: Please manually deploy the device's device plugin in Kubernetes. Otherwise the Kubernetes default scheduler won't work. Supported device plugins are listed [in this file](https://github.com/microsoft/pai/blob/master/src/device-plugin/deploy/start.sh.template). PRs are welcome.
+**For those who use workers other than CPU workers, NVIDIA GPU workers, AMD GPU workers, and Enflame DTU workers**: Please manually deploy the device's device plugin in Kubernetes. Otherwise, the Kubernetes default scheduler won't work. Supported device plugins are listed [in this file](https://github.com/microsoft/pai/blob/master/src/device-plugin/deploy/start.sh.template). PRs are welcome.
## Keep a Folder
-We highly recommend you to keep the folder `~/pai-deploy` for future operations such as upgrade, maintenance, and uninstallation. The most important contents in this folder are:
+We highly recommend you keep the folder `~/pai-deploy` for future operations such as upgrade, maintenance, and uninstallation. The most important contents in this folder are:
- + Kubernetes cluster config (the default is `~/pai-deploy/kube/config`): Kubernetes config file. It is used by `kubectl` to connect to k8s api server.
+ + Kubernetes cluster config (the default is `~/pai-deploy/kube/config`): Kubernetes config file. It is used by `kubectl` to connect to the k8s API server.
+ OpenPAI cluster config (the default is `~/pai-deploy/cluster-cfg`): It is a folder containing machine layout and OpenPAI service configurations.
If possible, you can make a backup of `~/pai-deploy` in case it is deleted unexpectedly.
-Apart from the folder, you should remember your OpenPAI cluster ID, which is used to indicate your OpenPAI cluster.
-The default value is `pai`. Some management operation needs a confirmation of this cluster ID.
+Apart from the folder, you should remember your OpenPAI cluster ID, which is used to identify your OpenPAI cluster.
+The default value is `pai`. Some management operations need a confirmation of this cluster ID.
diff --git a/docs/manual/cluster-admin/recommended-practice.md b/docs/manual/cluster-admin/recommended-practice.md
index cd62969ca..cfe2f64a1 100644
--- a/docs/manual/cluster-admin/recommended-practice.md
+++ b/docs/manual/cluster-admin/recommended-practice.md
@@ -4,19 +4,19 @@ Managing one or more clusters is not an easy task. In most cases, the administra
## Team Shared Practice
-There are mainly two kinds of resource in OpenPAI, namely virtual cluster and storage, and they are managed by groups. All users are in the `default` group. So every one should have access to the `default` virtual cluster and one or more storages which are set for `default` group. Besides `default` group, you can set up different groups for different users. **In practice, we recommend you to assign each team with a single group.**
+There are mainly two kinds of resources in OpenPAI, namely virtual clusters and storage, and they are managed by groups. All users are in the `default` group, so everyone has access to the `default` virtual cluster and one or more storages that are set for the `default` group. Besides the `default` group, you can set up different groups for different users. **In practice, we recommend assigning each team a single group.**
-For example, if you have two teams: team A working on project A; and team B working on project B. You can set up group A and group B for team A and B, correspondingly. Thus each team can share computing and storage resource internally. If a user joins a team, just add him into the corresponding group.
+For example, if you have two teams, Team A working on project A and Team B working on project B, you can set up group A and group B for Team A and Team B, correspondingly. Thus each team can share computing and storage resources internally. If a user joins a team, just add them to the corresponding group.
-By default, OpenPAI uses basic authentication mode. In this mode, virtual cluster is exactly bound to groups, which means setting up a group means set up a virtual cluster in basic authentication mode. In AAD mode, group and virtual cluster are different concepts. Please refer to [How to Set Up Virtual Clusters](./how-to-set-up-virtual-clusters.md) for details.
+By default, OpenPAI uses basic authentication mode. In this mode, the virtual cluster is exactly bound to groups, which means setting up a group means setting up a virtual cluster in basic authentication mode. In AAD mode, group and virtual cluster are different concepts. Please refer to [How to Set Up Virtual Clusters](./how-to-set-up-virtual-clusters.md) for details.
## Onboarding Practice
For new user onboarding in basic authentication mode, the PAI admin should create this user manually in the backend, and notify the user of some instructions and guidelines. In our practice, we will send an e-mail to the new user. Besides account information, we also include the following content in the e-mail:
- - Let the user read the [user manual](../cluster-user/) to learn how to submit job, debug job, and use client tool.
+ - Let the user read the [user manual](../cluster-user/) to learn how to submit jobs, debug jobs, and use the client tool.
- Let the user know their completed jobs may be deleted after 30 days.
- - Let the user know he/she shouldn't always run low-efficiency job (e.g. sleep for several days in the container). Otherwise the administrator may kill the job.
+ - Let the user know he/she shouldn't always run low-efficiency jobs (e.g. sleep for several days in the container). Otherwise, the administrator may kill the job.
 - Let the user know how to contact the administrator in case they find any problems or have any questions.
## DRI Practice
@@ -35,7 +35,7 @@ The DRI should be aware of different severities:
- Severity 2: Some jobs fail consistently, or some users hit problems frequently.
- Severity 3: Random job failures with low probability.
-In addition, if there are multiple clusters, he/she should also be aware of the different priorities of different clusters.
+Additionally, if there are multiple clusters, he/she should also be aware of the different priorities of the clusters.
Based on severity and priority, we can make an SLA for the cluster management. Here is an example:
@@ -48,6 +48,6 @@ Based on severity and priority, we can make an SLA for the cluster management. H
If an issue is raised, the DRI should follow these steps to address it:
-1. All questions or notification sent to DRIs should be updated by the DRI owner proactively.
-2. DRI owner should send ACK to each incident of PAI alerts. As there are many duplicated alerts, so that it doesn't need to ACK on each one.
-3. Besides ACK, the DRI owner should reply to the questions/notification/alerts, if there are updates or it is resolved. Further more, the DRI owner should think about why this incident happens, how to avoid it in the next time, after it is resolved. If it's applicable, create issues on Github.
\ No newline at end of file
+1. All questions or notifications sent to DRIs should be updated by the DRI owner proactively.
+2. The DRI owner should send an ACK for each incident of PAI alerts. Since many alerts are duplicated, there is no need to ACK every single alert.
+3. Besides the ACK, the DRI owner should reply to the questions/notifications/alerts when there are updates or the issue is resolved. Furthermore, after an incident is resolved, the DRI owner should think about why it happened and how to avoid it next time. If applicable, create issues on GitHub.
\ No newline at end of file
diff --git a/docs/manual/cluster-admin/troubleshooting.md b/docs/manual/cluster-admin/troubleshooting.md
index be45a6428..a408f2d3a 100644
--- a/docs/manual/cluster-admin/troubleshooting.md
+++ b/docs/manual/cluster-admin/troubleshooting.md
@@ -1,22 +1,22 @@
# Troubleshooting
-This ducument includes some troubleshooting cases in practice.
+This document includes some troubleshooting cases in practice.
### PaiServicePodNotReady Alert
-This is a kind of alert from alert manager, and usually caused by container being killed by operator or OOM killer. To check if it was killed by OOM killer, you can check node's free memory via Prometheus:
+This is a kind of alert from the alert manager and is usually caused by the container being killed by the operator or OOM killer. To check if it was killed by OOM killer, you can check the node's free memory via Prometheus:
1. Visit the Prometheus web page; it is usually `http://:9091`.
2. Enter query `node_memory_MemFree_bytes`.
- 3. If free memory drop to near 0, the container should be killed by OOM killer
- 4. You can double check this by logging into node and run command `dmesg` and looking for phase `oom`. Or you can run `docker inspect ` to get more detailed information.
+ 3. If free memory drops to near 0, the container was probably killed by the OOM killer.
+ 4. You can double-check this by logging into the node, running the command `dmesg`, and looking for the phrase `oom`. Or you can run `docker inspect ` to get more detailed information (see the sketch below).
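+A sketch of both checks on the node (`<container-id>` is a placeholder):
+
+```bash
+# search kernel messages for OOM killer activity
+dmesg -T | grep -i -E "oom|out of memory"
+# check whether docker recorded an OOM kill for the container
+docker inspect --format '{{.State.OOMKilled}}' <container-id>
+```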
Solutions:
- 1. Force remove unhealth containers with this command in terminal:
+ 1. Force remove unhealthy containers with this command in terminal:
`kubectl delete pod pod-name --grace-period=0 --force`
- 2. Recreate pod in Kubernetes, this operation may block indefinitely because dockerd may not functioning correctly after OOM. If recreate blocked too long, you can log into the node and restart dockerd via `/etc/init.d/docker restart`.
- 3. If restarting doesn't solve it, you can consider increase the pod's memory limit.
+ 2. Recreate the pod in Kubernetes. This operation may block indefinitely because dockerd may not function correctly after an OOM. If the recreation is blocked for too long, you can log into the node and restart dockerd via `/etc/init.d/docker restart`.
+ 3. If restarting doesn't solve it, you can increase the pod's memory limit.
### NodeNotReady Alert
@@ -26,13 +26,13 @@ This is a kind of alert from alert manager, and is reported by watchdog service.
pai_node_count{disk_pressure="false",instance="10.0.0.1:9101",job="pai_serivce_exporter",memory_pressure="false",host_ip="10.0.0.2",out_of_disk="false",pai_service_name="watchdog",ready="true",scraped_from="watchdog-5ddd945975-kwhpr"}
```
-The name label indicate what node this metric represents.
+The name label indicates what node this metric represents.
-If the node's ready label has value "unknown", this means the node may disconnect from Kubernetes master, this may due to several reasons:
+If the node's ready label has the value "unknown", the node may have disconnected from the Kubernetes master. This may be due to several reasons:
- Node is down
- Kubelet is down
- - Network partition between node and Kubernetes master
+ - Network partition between the node and Kubernetes master
You can first try to log into the node. If you cannot, and there is no ping response, the node may be down, and you should boot it up.
@@ -50,17 +50,17 @@ You should check what caused this connectivity problem.
### NodeFilesystemUsage Alert
-This is a kind of alert from alert manager, and is used to monitor disk space of each server. If usage of disk space is greater than 80%, this alert will be triggered. OpenPAI has two services may use a lot of disk space. They are storage manager and docker image cache. If there is other usage of OpenPAI servers, they should be checked to avoid the disk usage is caused by outside of OpenPAI.
+This is a kind of alert from the alert manager and is used to monitor the disk space of each server. If the usage of disk space is greater than 80%, this alert will be triggered. OpenPAI has two services that may use a lot of disk space: the storage manager and the docker image cache. If there are other usages of the OpenPAI servers, they should be checked to rule out disk usage caused outside of OpenPAI.
Solutions:
- 1. Check user file on the NFS storage server launched by storage manager. If you didn't set up a storage manager, ignore this step.
- 2. Check the docker cache. The docker may use too many disk space for caching, it's worth to have a check.
+ 1. Check the user file on the NFS storage server launched by the storage manager. If you didn't set up a storage manager, ignore this step.
+ 2. Check the docker cache. Docker may use too much disk space for caching; it's worth having a check.
3. Check PAI log folder size. The path is `/var/log/pai`.
### NodeGpuCountChanged Alert
-This is an alert from alert manager and is used to monitor the GPU count of each node.
+This is an alert from the alert manager and is used to monitor the GPU count of each node.
This alert will be triggered when the GPU count detected is different from the GPU count specified in `layout.yaml`.
If you find that the real GPU count is correct but the alerts still keep being fired, it's possibly caused by the wrong specification in `layout.yaml`.
@@ -86,17 +86,17 @@ If you cannot use GPU in your job, please check the following items on the corre
If the GPU number shown in webportal is wrong, check the [hivedscheduler and VC configuration](./how-to-set-up-virtual-clusters.md).
### NvidiaSmiDoubleEccError
-This is a kind of alert from alert manager.
-It means that nvidia cards from the related nodes have double ecc error.
-When this alert occurs, the nodes related will be automatically cordoned by alert manager.
+This is a kind of alert from the alert manager.
+It means that NVIDIA cards on the related nodes have double ECC errors.
+When this alert occurs, the nodes related will be automatically cordoned by the alert manager.
After the problem is resolved, you can uncordon the node manually with the following command:
```bash
kubectl uncordon
```
### NodeGpuLowPerfState
-This is a kind of alert from alert manager.
-It means the nvidia cards from related node downgrade into low peroformance state unexpectedly.
+This is a kind of alert from the alert manager.
+It means the NVIDIA cards on the related node have unexpectedly downgraded into a low performance state.
To fix this, please run the following commands:
```bash
sudo nvidia-smi -pm ENABLED -i
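+# optionally pin the application clocks to a supported value; <mem-clock>,<gpu-clock> and <gpu-id> are placeholders
+sudo nvidia-smi -ac <mem-clock>,<gpu-clock> -i <gpu-id>
+```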
@@ -106,16 +106,16 @@ You can get the supported clock by `sudo nvidia-smi -q -d SUPPORTED_CLOCKS`
### Cannot See Utilization Information.
-If you cannot see utilization information (e.g. GPU, CPU, and network usage) in cluster, please check if the service `prometheus`, `grafana`, `job-exporter`, and `node-exporter` are working.
+If you cannot see utilization information (e.g. GPU, CPU, and network usage) in the cluster, please check if the service `prometheus`, `grafana`, `job-exporter`, and `node-exporter` are working.
More specifically, you can [exec into a dev box container](./basic-management-operations.md#pai-service-management-and-paictl), then check the service status by `kubectl get pod`. You can see the pod log by `kubectl logs `. After you fix the problem, you can [restart the whole cluster using paictl](./basic-management-operations.md#pai-service-management-and-paictl).
### Node is De-allocated and doesn't Appear in Kubernetes System when it Comes Back
-Working nodes can be de-allocated if you are using a cloud service and set up PAI on low-priority machines. Usually, if the node is lost temporarily, you can wait until the node comes back. It doesn't need any special care.
+Working nodes can be de-allocated if you are using a cloud service and have set up PAI on low-priority machines. Usually, if the node is lost temporarily, you can wait until the node comes back. It doesn't need any special care.
-However, some cloud service providers not only de-allocate nodes, but also remove all disk contents on the certain nodes. Thus the node cannot connect to Kubernetes automatically when it comes back. If it is your case, we recommend you to set up a crontab job on the dev box node to bring back these nodes periodically.
+However, some cloud service providers not only de-allocate nodes but also remove all disk contents on certain nodes. Thus, the node cannot reconnect to Kubernetes automatically when it comes back. If this is your case, we recommend you set up a crontab job on the dev box node to bring back these nodes periodically.
-In [How to Add and Remove Nodes](how-to-add-and-remove-nodes.md), we have described how to add a node. The crontab job doesn't need to do all of these things. It only needs to add the node to the Kubernetes. It figures out which nodes have come back but are still considered `NotReady` in Kubernetes, then, run the following command to bring it back:
+In [How to Add and Remove Nodes](how-to-add-and-remove-nodes.md), we have described how to add a node. The crontab job doesn't need to do all of those things. It only needs to add the node back to Kubernetes: it figures out which nodes have come back but are still considered `NotReady` in Kubernetes, and then runs the following command to bring them back:
@@ -123,11 +123,11 @@ In [How to Add and Remove Nodes](how-to-add-and-remove-nodes.md), we have descri
ansible-playbook -i inventory/mycluster/hosts.yml upgrade-cluster.yml --become --become-user=root --limit=${limit_list} -e "@inventory/mycluster/openpai.yml"
```
-`${limit_list}` stands for the names of these de-allocated nodes. For example, if the crontab job finds node `a` and node `b` are available now, but they are still in `NotReady` status in Kuberentes, then it can set `limit_list=a,b`.
+`${limit_list}` stands for the names of these de-allocated nodes. For example, if the crontab job finds node `a` and node `b` are available now, but they are still in `NotReady` status in Kubernetes, then it can set `limit_list=a,b`.
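+
+A minimal sketch of such a crontab helper follows. It is only an illustration: it assumes the dev box node can run `kubectl`, that node names are resolvable by `ping`, and that the kubespray directory from installation is `~/pai-deploy/kubespray`; adjust everything to your environment:
+
+```bash
+#!/bin/bash
+# Hypothetical helper, e.g. /root/bin/readd-nodes.sh, run hourly from crontab:
+#   0 * * * * /root/bin/readd-nodes.sh
+# Collect nodes that are reachable again but still NotReady in Kubernetes.
+back_nodes=()
+for node in $(kubectl get nodes --no-headers | awk '$2 ~ /NotReady/ {print $1}'); do
+  if ping -c 1 -W 2 "${node}" > /dev/null 2>&1; then
+    back_nodes+=("${node}")
+  fi
+done
+if [ "${#back_nodes[@]}" -gt 0 ]; then
+  limit_list=$(IFS=,; echo "${back_nodes[*]}")
+  cd ~/pai-deploy/kubespray  # assumed kubespray directory from installation
+  ansible-playbook -i inventory/mycluster/hosts.yml upgrade-cluster.yml \
+    --become --become-user=root --limit="${limit_list}" -e "@inventory/mycluster/openpai.yml"
+fi
+```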
### How to Enlarge Internal Storage Size
-Currently, OpenPAI uses [internal storage](https://github.com/microsoft/pai/tree/master/src/internal-storage) to hold database. Internal storage is a limited size storage. It leverages loop device in Linux to provide a storage with strictly limited quota. The default quota is 30 GB (or 10GB for OpenPAI <= `v1.1.0`), which can hold about 1,000,000 jobs. If you want a larger space to hold more jobs, please follow these steps to enlarge the internal storage:
+Currently, OpenPAI uses [internal storage](https://github.com/microsoft/pai/tree/master/src/internal-storage) to hold its database. Internal storage is a limited-size storage: it leverages loop devices in Linux to provide storage with a strictly limited quota. The default quota is 30 GB (or 10 GB for OpenPAI <= `v1.1.0`), which can hold about 1,000,000 jobs. If you want a larger space to hold more jobs, please follow these steps to enlarge the internal storage:
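+
+The loop-device mechanism is easy to reproduce by hand, which helps when reasoning about the quota. Below is a minimal illustration of the idea, not the actual internal-storage implementation:
+
+```bash
+# Create a 30 GB sparse file, format it, and mount it through a loop device;
+# the mounted filesystem can never grow beyond the size of the backing file.
+sudo truncate -s 30G /tmp/storage.img
+sudo mkfs.ext4 -F /tmp/storage.img
+sudo mkdir -p /mnt/limited-storage
+sudo mount -o loop /tmp/storage.img /mnt/limited-storage
+```
+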
Step 1. [Exec into a dev box container.](./basic-management-operations.md#pai-service-management-and-paictl)
diff --git a/docs/manual/cluster-admin/upgrade-guide.md b/docs/manual/cluster-admin/upgrade-guide.md
index 1599915fe..bae50e196 100644
--- a/docs/manual/cluster-admin/upgrade-guide.md
+++ b/docs/manual/cluster-admin/upgrade-guide.md
@@ -6,7 +6,7 @@ The upgrade process is mainly about modifying `services-configuration.yaml` and
## Stop All Services and Previous Dev Box Container
-First, launch a dev box container of current PAI version, stop all services by:
+First, launch a dev box container of the current PAI version, and stop all services by:
```bash
./paictl.py service stop
@@ -28,7 +28,7 @@ sudo docker rm dev-box
## Modify `services-configuration.yaml`
-Now, launch a dev box container of new version. For example, if you want to upgrade to `v1.1.0`, you should use docker `openpai/dev-box:v1.1.0`.
+Now, launch a dev box container of the new version. For example, if you want to upgrade to `v1.1.0`, you should use the docker image `openpai/dev-box:v1.1.0`.
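+
+The exact launch command is described in [basic management operations](./basic-management-operations.md#pai-service-management-and-paictl); a typical launch, assuming the kubeconfig was saved to `~/pai-deploy/kube` during installation, looks like:
+
+```bash
+sudo docker run -itd \
+    -v /var/run/docker.sock:/var/run/docker.sock \
+    -v ${HOME}/pai-deploy/kube:/root/.kube \
+    --privileged=true \
+    --net=host \
+    --name=dev-box \
+    openpai/dev-box:v1.1.0
+```
+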
Then, retrieve your configuration by:
@@ -41,7 +41,7 @@ Find the following section in `/services-configuration.yaml`:
```yaml
cluster:
- # the docker registry to store docker images that contain system services like frameworklauncher, hadoop, etc.
+ # the docker registry to store docker images that contain system services like FrameworkLauncher, Hadoop, etc.
docker-registry:
......
@@ -72,4 +72,4 @@ If you didn't stop `storage-manager`, start other services by:
./paictl.py service start --skip-service-list storage-manager
```
-After all services is started, your OpenPAI cluster is successfully upgraded.
\ No newline at end of file
+After all the services are started, your OpenPAI cluster is successfully upgraded.
\ No newline at end of file
diff --git a/docs/manual/cluster-user/how-to-manage-data.md b/docs/manual/cluster-user/how-to-manage-data.md
index e51571296..31773433e 100644
--- a/docs/manual/cluster-user/how-to-manage-data.md
+++ b/docs/manual/cluster-user/how-to-manage-data.md
@@ -46,7 +46,7 @@ You could access `NFS` data by `Windows File Explorer` directly if:
-To access it, use the file location `\\NFS_SERVER_ADDRESS` in `Window File Explorer`. It will prompt you to type in a username and a password:
+To access it, use the file location `\\NFS_SERVER_ADDRESS` in `Windows File Explorer`. It will prompt you to type in a username and a password:
- - If OpenPAI is in basic authentication mode (this mode means you use a basic username/password to log in to OpenPAI web portal), you can access NFS data through its configured username and password. Please note it is different from the one you use to log in to OpenPAI. If the administrator uses `storage-manager`, the default username/password for NFS is `smbuser` and `smbpwd`.
+ - If OpenPAI is in basic authentication mode (this mode means you use a basic username/password to log in to the OpenPAI webportal), you can access NFS data through its configured username and password. Please note it is different from the one you use to log in to OpenPAI. If the administrator uses `storage-manager`, the default username/password for NFS is `smbuser` and `smbpwd`.
- If OpenPAI is in AAD authentication mode, you can access NFS data through the user domain name and password.
diff --git a/docs/manual/cluster-user/quick-start.md b/docs/manual/cluster-user/quick-start.md
index 2bdc698a8..1bf2492ce 100644
--- a/docs/manual/cluster-user/quick-start.md
+++ b/docs/manual/cluster-user/quick-start.md
@@ -14,7 +14,7 @@ Now your first OpenPAI job has been kicked off!
## Browse Stdout, Stderr, Full logs, and Metrics
-The hello world job is implemented by TensorFlow. It trains a simple model on the CIFAR-10 dataset for 1,000 steps with downloaded data. You can monitor the job by checking its logs and running metrics on the web portal.
+The hello world job is implemented by TensorFlow. It trains a simple model on the CIFAR-10 dataset for 1,000 steps with downloaded data. You can monitor the job by checking its logs and running metrics on the webportal.
Click the `Stdout` and `Stderr` buttons to see the stdout and stderr logs for a job on the job detail page. If you want to see a merged log, you can click `...` on the right and then select `Stdout + Stderr`.
@@ -32,7 +32,7 @@ On the job detail page, you can also see metrics by clicking `Go to Job Metrics
Instead of importing a job configuration file, you can submit the hello world job directly through the web page. The following is a step-by-step guide:
-**Step 1.** Login to OpenPAI web portal.
+**Step 1.** Log in to the OpenPAI webportal.
**Step 2.** Click **Submit Job** on the left pane, then click `Single` to reach this page.
diff --git a/docs/manual/cluster-user/use-marketplace.md b/docs/manual/cluster-user/use-marketplace.md
index 7454b3627..4b229a64d 100644
--- a/docs/manual/cluster-user/use-marketplace.md
+++ b/docs/manual/cluster-user/use-marketplace.md
@@ -4,7 +4,7 @@
## Entrance
-If your administrator enables marketplace plugin, you will find a link in the `Plugin` section on the web portal, like:
+If your administrator enables the marketplace plugin, you will find a link in the `Plugin` section on the webportal, like:
-> If you are PAI admin, you could check [deployment doc](https://github.com/microsoft/openpaimarketplace/blob/master/docs/deployment.md) to see how to deploy and enable marketplace plugin.
+> If you are a PAI admin, you could check the [deployment doc](https://github.com/microsoft/openpaimarketplace/blob/master/docs/deployment.md) to see how to deploy and enable the marketplace plugin.
diff --git a/docs/manual/cluster-user/use-vscode-extension.md b/docs/manual/cluster-user/use-vscode-extension.md
index 4ad769ab1..d7f68e952 100644
--- a/docs/manual/cluster-user/use-vscode-extension.md
+++ b/docs/manual/cluster-user/use-vscode-extension.md
@@ -34,6 +34,7 @@ If there are multiple OpenPAI clusters, you can follow the above steps again to
![add cluster host](https://raw.githubusercontent.com/Microsoft/openpaivscode/0.3.0/assets/add_cluster_host.png)
+
-4. If the `authn_type` of the cluster is `OIDC`, a website will be open and ask you to log in. If your login was successful, the username and token fields are auto-filled, and you can change them if needed. Once it completes, click the *Finish* button at the bottom right corner. Notice, the settings will not take effect if you save and close the file directly.
+4. If the `authn_type` of the cluster is `OIDC`, a website will open and ask you to log in. If your login was successful, the username and token fields are auto-filled, and you can change them if needed. Once it completes, click the *Finish* button at the bottom right corner. Notice that the settings will not take effect if you just save and close the file directly.
![add cluster configuration](https://raw.githubusercontent.com/Microsoft/openpaivscode/0.3.0/assets/add_aad_cluster.gif)
@@ -46,12 +47,14 @@ After added a cluster configuration, you can find the cluster in the *PAI CLUSTE
![pai cluster explorer](https://raw.githubusercontent.com/Microsoft/openpaivscode/0.3.0/assets/pai_cluster_explorer.png)
+
To submit a job config YAML file, please follow the steps below:
-1. Double-click `Create Job Config...` in OpenPAI cluster Explorer, and then specify file name and location to create a job configuration file.
-2. Update job configuration as needed.
-3. Right-click on the created job configuration file, then click on `Submit Job to PAI Cluster`. The client will then upload files to OpenPAI and create a job. Once it's done, there is a notification at the bottom right corner, you can click to open the job detail page.
+1. Double-click `Create Job Config...` in OpenPAI cluster Explorer, and then specify the file name and location to create a job configuration file.
+2. Update the job configuration as needed (a minimal sketch of such a file is shown after this list).
+3. Right-click on the created job configuration file, then click on `Submit Job to PAI Cluster`. The client will then upload files to OpenPAI and create a job. Once it's done, there is a notification at the bottom right corner; you can click it to open the job detail page.
+
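+For reference, a minimal job configuration file might look like the sketch below; the docker image, resource numbers, and command are placeholders to adapt:
+
+```yaml
+protocolVersion: 2
+name: hello_world_example
+type: job
+prerequisites:
+  - type: dockerimage
+    name: image
+    uri: openpai/standard:python_3.6-pytorch_1.2.0-gpu
+taskRoles:
+  taskrole:
+    instances: 1
+    dockerImage: image
+    resourcePerInstance:
+      cpu: 4
+      memoryMB: 8192
+      gpu: 1
+    commands:
+      - echo "hello world"
+```
+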
If there are multiple OpenPAI clusters, you need to choose one.
This animation shows the above steps.
diff --git a/docs_zh_CN/manual/cluster-admin/how-to-use-alert-system.md b/docs_zh_CN/manual/cluster-admin/how-to-use-alert-system.md
index 2b9c73634..33d630161 100644
--- a/docs_zh_CN/manual/cluster-admin/how-to-use-alert-system.md
+++ b/docs_zh_CN/manual/cluster-admin/how-to-use-alert-system.md
@@ -13,7 +13,7 @@ alert: GpuUsedByExternalProcess
expr: gpu_used_by_external_process_count > 0
for: 5m
annotations:
- summary: found nvidia used by external process in {{$labels.instance}}
+ summary: found NVIDIA used by external process in {{$labels.instance}}
```
关于报警规则的详细语法,请参考[这里](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/)。
diff --git a/docs_zh_CN/manual/cluster-admin/installation-faqs-and-troubleshooting.md b/docs_zh_CN/manual/cluster-admin/installation-faqs-and-troubleshooting.md
index f4df44a98..980062fdf 100644
--- a/docs_zh_CN/manual/cluster-admin/installation-faqs-and-troubleshooting.md
+++ b/docs_zh_CN/manual/cluster-admin/installation-faqs-and-troubleshooting.md
@@ -78,11 +78,11 @@
#### 如何检查GPU驱动被正确安装了?
-对于Nvidia GPU, 您可以使用命令`nvidia-smi`来检查。
+对于NVIDIA GPU, 您可以使用命令`nvidia-smi`来检查。
#### 如何安装GPU驱动?
-对于Nvidia GPU,请先确认您想安装哪个版本的GPU(您可以参考[这个问题](#which-version-of-nvidia-driver-should-i-install))。然后参考下面的步骤:
+对于NVIDIA GPU,请先确认您想安装哪个版本的GPU驱动(您可以参考[这个问题](#which-version-of-nvidia-driver-should-i-install))。然后参考下面的步骤:
```bash
sudo add-apt-repository ppa:graphics-drivers/ppa
diff --git a/docs_zh_CN/manual/cluster-admin/troubleshooting.md b/docs_zh_CN/manual/cluster-admin/troubleshooting.md
index e39f5dd27..8911962d6 100644
--- a/docs_zh_CN/manual/cluster-admin/troubleshooting.md
+++ b/docs_zh_CN/manual/cluster-admin/troubleshooting.md
@@ -84,7 +84,7 @@ pai_node_count{disk_pressure="false",instance="10.0.0.1:9101",job="pai_serivce_e
-如果您无法在您的任务中使用GPU,您可以在Worker结点上follow下面的步骤来检查:
+如果您无法在您的任务中使用GPU,您可以在Worker节点上按照下面的步骤来检查:
- 1. 显卡驱动安装正确。如果是Nvidia卡的话,使用`nvidia-smi`来检查。
+ 1. 显卡驱动安装正确。如果是NVIDIA卡的话,使用`nvidia-smi`来检查。
2. [nvidia-container-runtime](https://github.com/NVIDIA/nvidia-container-runtime)已经被正确安装,并且被设置为Docker的默认runtime。您可以用`docker info -f "{{json .DefaultRuntime}}"`来检查。
如果是在Webportal中显示的GPU数目有出入,请参考[这个文档](./how-to-set-up-virtual-clusters.md)对集群重新进行配置。