Mirror of https://github.com/microsoft/pai.git
Docs modification (admin) (#5299)
* Readme.md and some grammarly errors due to proper nouns * Some missing grammarly errors due to proper nouns * Some missing grammarly errors
This commit is contained in:
Parent
f652c996a7
Commit
ce378418ad
@ -2,7 +2,7 @@
## Management on Webportal

The webportal provides some basic administration functions. If you log in to it as an administrator, you will find several buttons about administration on the left sidebar, as shown in the following image.
The webportal provides some basic administration functions. If you log in to it as an administrator, you will find several buttons about the administration on the left sidebar, as shown in the following image.

<img src="./imgs/administration.png" width="100%" height="100%" />

@ -10,7 +10,7 @@ Most of these functions are easy to understand. We will go through them quickly
### Hardware Utilization Page

The hardware page shows the CPU, GPU, memory, disk, network utilization of each node in your cluster. The utilization is shown in different color basically. If you hover your mouse on these colored circles, exact utilization percentage will be shown.
The hardware page shows the CPU, GPU, memory, disk, network utilization of each node in your cluster. The utilization is shown in different colors. If you hover your mouse on these colored circles, the exact utilization percentage will be shown.

<img src="./imgs/hardware.png" width="100%" height="100%" />

@ -36,7 +36,7 @@ On the homepage, there is an `abnormal jobs` section for administrators. A job i
### Access Kubernetes Dashboard

There is a shortcut to k8s dashboard on the webportal. However, it needs special authentication for security issues.
There is a shortcut to the k8s dashboard on the webportal. However, it needs special authentication for security issues.

<img src="./imgs/k8s-dashboard.png" width="100%" height="100%" />

@ -67,11 +67,11 @@ subjects:
**Step 2.** Run `kubectl apply -f admin-user.yaml`

**Step 3.** Run `kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | grep admin-user | awk '{print $1}')`. It will print the token which can be used to login k8s-dashboard.
**Step 3.** Run `kubectl -n kube-system describe secret $(kubectl -n kube-system get secret | grep admin-user | awk '{print $1}')`. It will print the token which can be used to login to k8s-dashboard.
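For convenience, the token printed in Step 3 can also be captured into a shell variable. The following is an illustrative sketch using standard `kubectl` output formatting; it is not a command from the original document:

```bash
# Capture the admin-user token (assumes the service account from Step 1 lives in kube-system)
SECRET_NAME=$(kubectl -n kube-system get secret | grep admin-user | awk '{print $1}')
TOKEN=$(kubectl -n kube-system get secret "${SECRET_NAME}" -o jsonpath='{.data.token}' | base64 --decode)
echo "${TOKEN}"   # paste this value into the k8s dashboard login page
```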
## PAI Service Management and Paictl

Generally speaking, PAI services are daemon sets, deployments or stateful sets created by PAI system, running on Kubernetes. You can find them on the [k8s dashboard](#access-kubernetes-dashboard) and [services page](#services-page). For example, `webportal` is a PAI service which provides front-end interface, and `rest-server` is another one for back-end APIs. These services are all configurable. If you have followed the [installation-guide](./installation-guide.md), you can find two files, `layout.yaml` and `services-configuration.yaml`, in folder `~/pai-deploy/cluster-cfg` on the dev box machine. These two files are the default service configuration.
Generally speaking, PAI services are daemon sets, deployments, or stateful sets created by PAI system, running on Kubernetes. You can find them on the [k8s dashboard](#access-kubernetes-dashboard) and [services page](#services-page). For example, `webportal` is a PAI service which provides front-end interface, and `rest-server` is another one for back-end APIs. These services are all configurable. If you have followed the [installation-guide](./installation-guide.md), you can find two files, `layout.yaml` and `services-configuration.yaml`, in folder `~/pai-deploy/cluster-cfg` on the dev box machine. These two files are the default service configuration.

`paictl` is a CLI tool which helps you manage cluster configuration and PAI services. To use it, we recommend you to leverage our dev box docker image to avoid environment-related problems. First, go to the dev box machine, launch the dev box docker by:

@ -103,7 +103,7 @@ mkdir -p ~/.kube
vim ~/.kube/config
```

Go to folder `/pai`, try to retrieve your cluster id:
Go to folder `/pai`, try to retrieve your cluster-id:

```bash
cd /pai
@ -120,7 +120,7 @@ Here are some basic usage examples of `paictl`:
# pull service config to a certain folder
# the configuration containers two files: layout.yaml and services-configuration.yaml
# if <config-folder> already has these files, they will be overrided
# if <config-folder> already has these files, they will be overridden
./paictl.py config pull -o <config-folder>

# push service config to the cluster
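Taken together, a typical configuration-edit cycle with `paictl` looks roughly like the sketch below. The `config push` flags are an assumption and may differ between versions, so check `./paictl.py config push -h` before running it:

```bash
# Illustrative edit cycle, run inside the dev-box container from /pai
./paictl.py config pull -o /cluster-configuration              # fetch layout.yaml and services-configuration.yaml
vim /cluster-configuration/services-configuration.yaml          # edit the section you care about
./paictl.py config push -p /cluster-configuration -m service   # assumed flags; verify with --help
./paictl.py service stop -n <service-name>
./paictl.py service start -n <service-name>
```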
@ -142,7 +142,7 @@ Here are some basic usage examples of `paictl`:
If you want to change configuration of some services, please follow the steps of `service stop`, `config push` and `service start`.

For example, if you want to customize webportal, you should modify the `webportal` section in `services-configuration.yaml`. Then use the following command to push the configuration and restart webportal:
For example, if you want to customize webportal, you should modify the `webportal` section in `services-configuration.yaml`. Then use the following command to push the configuration and restart the webportal:

```bash
./paictl.py service stop -n webportal
@ -158,7 +158,7 @@ Another example is to restart the whole cluster:
./paictl.py service start
```

You can use `exit` to leave the dev-box container, and use `sudo docker exec -it dev-box bash` to re-enter it if you desire so. If you don't need it any more, use `sudo docker stop dev-box` and `sudo docker rm dev-box` to delete the docker container.
You can use `exit` to leave the dev-box container, and use `sudo docker exec -it dev-box bash` to re-enter it if you desire so. If you don't need it anymore, use `sudo docker stop dev-box` and `sudo docker rm dev-box` to delete the docker container.
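The dev-box lifecycle commands mentioned in that paragraph, collected into a single sketch:

```bash
exit                                # leave the dev-box container
sudo docker exec -it dev-box bash   # re-enter it later
sudo docker stop dev-box            # stop it when it is no longer needed
sudo docker rm dev-box              # and remove the container
```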
## How To Set Up HTTPS
@ -1,6 +1,6 @@
# How to Add and Remove Nodes

OpenPAI doesn't support changing master nodes, thus, only the solution of adding/removing worker nodes is provided. You can add CPU workers, GPU workers, and other computing device (e.g. TPU, NPU) into the cluster.
OpenPAI doesn't support changing master nodes, thus, only the solution of adding/removing worker nodes is provided. You can add CPU workers, GPU workers, and other computing devices (e.g. TPU, NPU) into the cluster.

## How to Add Nodes

@ -133,7 +133,7 @@ If you have configured any PV/PVC storage, please confirm the added worker node
## How to Remove Nodes

Please refer to the operation of add nodes. They are very similar.
Please refer to the operation of adding nodes. They are very similar.

To remove nodes from the cluster, there is no need to modify `hosts.yml`.
Go into `~/pai-deploy/kubespray/`, run
@ -4,7 +4,7 @@
Webportal plugin provides a way to add custom web pages to the OpenPAI webportal. It can communicate with other PAI services, like the rest-server. It could provide customized solutions to different requirements.

As an administrator, you can configure the web portal plugins in the `webportal.plugins` field of `services-configuration.yaml` (If you don't know what `services-configuration.yaml` is, please refer to [PAI Service Management and Paictl](./basic-management-operations.md#pai-service-management-and-paictl)):
As an administrator, you can configure the webportal plugins in the `webportal.plugins` field of `services-configuration.yaml` (If you don't know what `services-configuration.yaml` is, please refer to [PAI Service Management and Paictl](./basic-management-operations.md#pai-service-management-and-paictl)):

```yaml
webportal:

@ -18,9 +18,9 @@ webportal:
```

- The `title` field is the title of the web portal plugin listed in the menu, it could be customized by administrators for the same plugin with different configurations.
- The `uri` field is the entry file of the web portal plugin, usually previded by the plugin developer. It may be an absolute URL or a root-relative URL, as the different deploy type of the web portal plugin.
- The `config` field is a key-value dictionary to configure the web portal plugin, available configs are listed in web portal plugin's specific document.
- The `title` field is the title of the webportal plugin listed in the menu, it could be customized by administrators for the same plugin with different configurations.
- The `uri` field is the entry file of the webportal plugin, usually provided by the plugin developer. It may be an absolute URL or a root-relative URL, as the different deploy type of the webportal plugin.
- The `config` field is a key-value dictionary to configure the webportal plugin, available configs are listed in the webportal plugin's specific document.

After modifying the configuration, push it to the cluster and restart webportal by:

@ -30,12 +30,12 @@ After modifying the configuration, push it to the cluster and restart webportal
./paictl.py service start -n webportal
```

## Deploy Openpaimarketplace as webportal plugin
## Deploy Openpaimarketplace as Webportal Plugin

[Openpaimarketplace](https://github.com/microsoft/openpaimarketplace) is a place which stores examples and job templates of openpai. Users could use openpaimarketplace to share their jobs or run-and-learn others' sharing job.
[Openpaimarketplace](https://github.com/microsoft/openpaimarketplace) is a place that stores examples and job templates of OpenPAI. Users could use openpaimarketplace to share their jobs or run-and-learn others' sharing jobs.

To deploy openpaimarketplace, please refer to [the project doc](https://github.com/microsoft/openpaimarketplace) about how to deploy the marketplace service and webportal plugin.

After deployment, follow the [previous part](#how-to-install-a-webportal-plugin) to change the webportal configuration with marketplace plugin url and restart webportal. Then you could use marketplace from the sidebar.
After deployment, follow the [previous part](#how-to-install-a-webportal-plugin) to change the webportal configuration with marketplace plugin URL and restart webportal. Then you could use marketplace from the sidebar.

<img src="./imgs/marketplace.png" width="100%" height="100%" />
@ -2,11 +2,11 @@
## Users and Groups in Basic Authentication Mode

OpenPAI is deployed in basic authentication mode by default. Groups in basic authentication mode are bound to virtual clusters (please refer to [how to set up virtual clusters](./how-to-set-up-virtual-clusters.md) to configure virtual clusters). Two groups, `default` and `admingroup` will be created once OpenPAI is deployed. All users belong to `default` group, and have access to the `default` virtual cluster. All administrators belong to `admingroup`, and have access to all virtual clusters. If there is another virtual cluster named `test-vc`, and an administrator grants it to a user, the user will be in group `test-vc` and have access to the corresponding virtual cluster.
OpenPAI is deployed in basic authentication mode by default. Groups in basic authentication mode are bound to virtual clusters (please refer to [how to set up virtual clusters](./how-to-set-up-virtual-clusters.md) to configure virtual clusters). Two groups, `default` and `admingroup` will be created once OpenPAI is deployed. All users belong to `default` group and have access to the `default` virtual cluster. All administrators belong to `admingroup`, and have access to all virtual clusters. If there is another virtual cluster named `test-vc`, and an administrator grants it to a user, the user will be in group `test-vc` and have access to the corresponding virtual cluster.

For example, if you create an admin user [on the webportal](./basic-management-operations.md#user-management), he will be in `default` and `admingroup`. A non-admin user will be only in `default` group once created. If administrator gives the non-admin user access to `new-vc`, he will be in `default` and `new-vc` group.

A user can see his groups in the profile page. First click `View my profile` in the right-top corner.
A user can see his groups on the profile page. First click `View my profile` in the right-top corner.

<img src="./imgs/view-profile.png" width="100%" height="100%" />

@ -25,7 +25,7 @@ In this section, we will cover how to set up the integration step by step.
#### Note

Previous user data in webportal is required to be mapping/migrate to AAD. Once the integration is enabled, instead of using basic user authentication, OpenPAI will switch to use (and only use) AAD as user authentication mechanism. To set up AAD, please follow the instructions [here](./basic-management-operations.md#how-to-set-up-https) to set up HTTPS access for OpenPAI first.
Previous user data in webportal is required to be mapping/migrate to AAD. Once the integration is enabled, instead of using basic user authentication, OpenPAI will switch to use (and only use) AAD as the user authentication mechanism. To set up AAD, please follow the instructions [here](./basic-management-operations.md#how-to-set-up-https) to set up HTTPS access for OpenPAI first.

#### [Rest-server] Configuration AAD

@ -108,7 +108,7 @@ authentication:
# Admin group name and its user list
admin-group:
# The group named showed in OpenPAI system.
# The group named showed in the OpenPAI system.
groupname: admingroup
description: "admin's group"
# The group alias (groupname) in Azure Active directory

@ -117,15 +117,15 @@ authentication:
# Group for default vc.
# For yarn default queue hack.
default-group:
# The group named showed in OpenPAI system.
# The group named showed in the OpenPAI system.
groupname: default
description: "group for default vc"
# The group alias (groupname) in Azure Active directory
externalName: "team_alias_b"

# If you cluster you have configured several yarn vc, except default vc (it has been created in the default-group), you should configure group for each vc in the following list
# If you cluster you have configured several yarn VC, except default VC (it has been created in the default-group), you should configure group for each VC in the following list
grouplist:
# The group named showed in OpenPAI system.
# The group named showed in the OpenPAI system.
- groupname: forexample1
description: forexample1
# The group alias (groupname) in Azure Active directory

@ -139,13 +139,13 @@ authentication:
##### Clean Previous Data

Please clean all users' data. Because in this mode, user's permission will be managed by azure active directory. The local data is useless.
Please clean all users' data. Because in this mode, the user's permission will be managed by the Azure active directory. The local data is useless.

```bash
./paictl.py service delete -n rest-server
```

##### After all the steps above, push the configuration, and restart all OpenPAI services.
##### After all the steps above, push the configuration and restart all OpenPAI services.

```bash
./paictl.py service stop

@ -157,7 +157,7 @@ Please clean all users' data. Because in this mode, user's permission will be ma
##### Start Service stage

After start rest-server, please ensure that the following task is successfully executed.
After start the rest-server, please ensure that the following task is successfully executed.

- namespace named ```pai-group``` and ```pai-user-v2```are created
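One way to verify those namespaces and the group secrets from the command line is sketched below; this is illustrative `kubectl` usage, not a step from the original document:

```bash
kubectl get ns pai-group pai-user-v2    # both namespaces should exist
kubectl get secrets -n pai-group         # expect one secret per group, e.g. admingroup and default
kubectl get secrets -n pai-user-v2       # expect one secret per user after the first login
```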
@ -171,34 +171,34 @@ After start rest-server, please ensure that the following task is successfully e
<img src="./imgs/aad/group-created.png" alt="paictl overview picture" style="float: center; margin-right: 10px;" />
</div>

- Every group have an `acls` in extension field.
- Every group has an `acls` in the extension field.

<div align="center">
<img src="./imgs/aad/admin_group_detail.png" alt="paictl overview picture" style="float: center; margin-right: 10px;" />
<img src="./imgs/aad/default_group_detail.png" alt="paictl overview picture" style="float: center; margin-right: 10px;" />
</div>

- Please Login through OpenPAI's webportal, then please check whether your user's data is created in the secret of ```pai-user-v2``` namespace.
- Please login through OpenPAI's webportal, then please check whether your user's data is created in the secret of ```pai-user-v2``` namespace.

<div align="center">
<img src="./imgs/aad/user_created.png" alt="paictl overview picture" style="float: center; margin-right: 10px;" />
</div>

- please check the created user data. There should be an empty extension and a non-empty grouplist.
- please check the created user data. There should be an empty extension and a non-empty group list.

<div align="center">
<img src="./imgs/aad/user_detail.png" alt="paictl overview picture" style="float: center; margin-right: 10px;" />
</div>

- please submit a test job in default vc, and then submit the same job to another vc.
- please submit a test job in default VC, and then submit the same job to another VC.

- please check whether admin user can access to the administration tab.
- please check whether the admin user can access the administration tab.

<div align="center">
<img src="./imgs/aad/admin_view.png" alt="paictl overview picture" style="float: center; margin-right: 10px;" />
</div>

- please create a vc, then check whether a corresponding group is created.
- please create a VC, then check whether a corresponding group is created.

<div align="center">
<img src="./imgs/aad/add_vc.png" alt="paictl overview picture" style="float: center; margin-right: 10px;" />

@ -208,19 +208,19 @@ After start rest-server, please ensure that the following task is successfully e
<img src="./imgs/aad/test_group_detail.png" alt="paictl overview picture" style="float: center; margin-right: 10px;" />
</div>

- After creating the new vc, please check whether the new vc is available for admin at home page.
- After creating the new VC, please check whether the new VC is available for the admin on the home page.

<div align="center">
<img src="./imgs/aad/admin_home.png" alt="paictl overview picture" style="float: center; margin-right: 10px;" />
</div>

- Delete the test vc, then please check whether the corresponding group is deleted.
- Delete the test VC, then please check whether the corresponding group is deleted.

<div align="center">
<img src="./imgs/aad/vc_delete.png" alt="paictl overview picture" style="float: center; margin-right: 10px;" />
</div>

- After deleting the vc, please check whether the group is removed from `pai-group` secrets.
- After deleting the VC, please check whether the group is removed from `pai-group` secrets.

<div align="center">
<img src="./imgs/aad/group_delete.png" alt="paictl overview picture" style="float: center; margin-right: 10px;" />

@ -228,4 +228,4 @@ After start rest-server, please ensure that the following task is successfully e
##### If test failed

Please try to delete the rest-server, and then try to start it again. If fail again, please provide detail information and create issue ticket in github.
Please try to delete the rest-server, and then try to start it again. If it fails again, please provide detailed information and create an issue ticket in Github.
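The delete-and-restart step mentioned above, expressed with the `paictl` commands that appear elsewhere in this change:

```bash
# Run inside the dev-box container, from /pai
./paictl.py service delete -n rest-server
./paictl.py service start -n rest-server
```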
@ -1,16 +1,16 @@
# How to Set Up Storage

This document describes how to use Kubernetes Persistent Volumes (PV) as storage on PAI. To set up existing storage (nfs, samba, Azure blob, etc.), you need:
This document describes how to use Kubernetes Persistent Volumes (PV) as storage on PAI. To set up existing storage (NFS, Samba, Azure blob, etc.), you need:

1. Create PV and PVC as PAI storage on Kubernetes.
2. Confirm the worker nodes have proper package to mount the PVC. For example, the `NFS` PVC requires package `nfs-common` to work on Ubuntu.
2. Confirm the worker nodes have the proper package to mount the PVC. For example, the `NFS` PVC requires package `nfs-common` to work on Ubuntu.
3. Assign PVC to specific user groups.

Users could mount those PV/PVC into their jobs after you set up the storage properly. The name of PVC is used to onboard on PAI.

## Create PV/PVC on Kubernetes

There're many approches to create PV/PVC, you could refer to [Kubernetes docs](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) if you are not familiar yet. Followings are some commonly used PV/PVC examples.
There're many approaches to create PV/PVC, you could refer to [Kubernetes docs](https://kubernetes.io/docs/concepts/storage/persistent-volumes/) if you are not familiar yet. The followings are some commonly used PV/PVC examples.

### NFS

@ -56,9 +56,9 @@ spec:
Save the above file as `nfs-storage.yaml` and run `kubectl apply -f nfs-storage.yaml` to create a PV named `nfs-storage-pv` and a PVC named `nfs-storage` for nfs server `nfs://10.0.0.1:/data`. The PVC will be bound to specific PV through label selector, using label `name: nfs-storage`.

Users could use PVC name `nfs-storage` as storage name to mount this nfs storage in their jobs.
Users could use the PVC name `nfs-storage` as the storage name to mount this NFS storage in their jobs.

If you want to configure the above nfs as personal storage so that each user could only visit their own directory on PAI like Linux home directory, for example, Alice can only mount `/data/Alice` while Bob can only mount `/data/Bob`, you could add a `share: "false"` label to PVC. In this case, PAI will use `${PAI_USER_NAME}` as sub path when mounting to job containers.
If you want to configure the above NFS as personal storage so that each user could only visit their directory on PAI like Linux home directory, for example, Alice can only mount `/data/Alice` while Bob can only mount `/data/Bob`, you could add a `share: "false"` label to PVC. In this case, PAI will use `${PAI_USER_NAME}` as the subpath when mounting to job containers.
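A quick way to confirm that the PV/PVC pair created above is bound, using the names from the example (illustrative `kubectl` usage):

```bash
kubectl get pv nfs-storage-pv   # STATUS should be Bound
kubectl get pvc nfs-storage     # should report that it is bound to nfs-storage-pv
```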
### Samba

@ -70,7 +70,7 @@ Please refer to [this document](https://github.com/Azure/kubernetes-volume-drive
#### Tips

If you cannot mount blobfuse PVC into containers and the corresponding job in OpenPAI sticks in `WAITING` status, please double check the following requirements:
If you cannot mount blobfuse PVC into containers and the corresponding job in OpenPAI sticks in `WAITING` status, please double-check the following requirements:

**requirement 1.** Every worker node should have `blobfuse` installed. Try the following commands to ensure:

@ -90,13 +90,13 @@ curl -s https://raw.githubusercontent.com/Azure/kubernetes-volume-drivers/master
| kubectl apply -f -
```

> NOTE: There is a known issue [#4637](https://github.com/microsoft/pai/issues/4637) to mount same PV multiple times on same node, please either:
> NOTE: There is a known issue [#4637](https://github.com/microsoft/pai/issues/4637) to mount the same PV multiple times on the same node, please either:
> * use the [patched blobfuse flexvolume installer](https://github.com/microsoft/pai/issues/4637#issuecomment-647434815) instead.
> * use the [earlier version 1.1.1](https://github.com/Azure/kubernetes-volume-drivers/issues/66#issuecomment-649188681) instead.

### Azure File

First create a Kubernetes secret to access the Azure file share.
First, create a Kubernetes secret to access the Azure file share.

```sh
kubectl create secret generic azure-secret --from-literal=azurestorageaccountname=$AKS_PERS_STORAGE_ACCOUNT_NAME --from-literal=azurestorageaccountkey=$STORAGE_KEY

@ -185,17 +185,17 @@ spec:
.......
```

Please notice, `PersistentVolume.Spec.AccessModes` and `PersistentVolumeClaim.Spec.AccessModes` doesn't affect whether a storage is writable in PAI. They only take effect during binding time between PV and PVC.
Please notice, `PersistentVolume.Spec.AccessModes` and `PersistentVolumeClaim.Spec.AccessModes` doesn't affect whether a storage is writable in PAI. They only take effect during the binding time between PV and PVC.

## Confirm Environment on Worker Nodes

The [notice in Kubernetes' document](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistent-volumes) mentions: helper program may be required to consume certain type of PersistentVolume. For example, all worker nodes should have `nfs-common` installed if you want to use `NFS` PV. You can confirm it using the command `apt install nfs-common` on every worker node.
The [notice in Kubernetes' document](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistent-volumes) mentions: helper program may be required to consume a certain type of PersistentVolume. For example, all worker nodes should have `nfs-common` installed if you want to use `NFS` PV. You can confirm it using the command `apt install nfs-common` on every worker node.

Since different PVs have different requirements, you should check the environment according to document of the PV.
Since different PVs have different requirements, you should check the environment according to the document of the PV.

## Assign Storage to PAI Groups

The PVC name is used as storage name in OpenPAI. After you have set up the PV/PVC and checked the environment, you need to assign storage to users. In OpenPAI, the name of the PVC is used as the storage name, and the access of different storages is managed by [user groups](./how-to-manage-users-and-groups.md). To assign storage to a user, please use RESTful API to assign storage to the groups of the user.
The PVC name is used as the storage name in OpenPAI. After you have set up the PV/PVC and checked the environment, you need to assign storage to users. In OpenPAI, the name of the PVC is used as the storage name, and the access of different storage is managed by [user groups](./how-to-manage-users-and-groups.md). To assign storage to a user, please use RESTful API to assign storage to the groups of the user.

Before querying the API, you should get an access token for the API. Go to your profile page and copy one:

@ -208,7 +208,7 @@ For example, if you want to assign `nfs-storage` PVC to `default` group. First,
```json
{
"groupname": "default",
"description": "group for default vc",
"description": "group for default vc",
"externalName": "",
"extension": {
"acls": {

@ -220,7 +220,7 @@ For example, if you want to assign `nfs-storage` PVC to `default` group. First,
}
```

The GET request must use header `Authorization: Bearer <token>` for authorization. This remains the same for all API calls. You may notice the `storageConfigs` in the return body. In fact it controls which storage a group can use. To add a `nfs-storage` to it, PUT `http(s)://<pai-master-ip>/rest-server/api/v2/groups`. Request body is:
The GET request must use the header `Authorization: Bearer <token>` for authorization. This remains the same for all API calls. You may notice the `storageConfigs` in the return body. It controls which storage a group can use. To add a `nfs-storage` to it, PUT `http(s)://<pai-master-ip>/rest-server/api/v2/groups`. The request body is:

```json
{

@ -242,7 +242,7 @@ Do not omit any fields in `extension` or it will change the `virtualClusters` se
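As an illustration of the GET-then-PUT sequence described above, a hedged sketch follows. The exact GET path and the PUT request-body format are assumptions; verify them against the rest-server API reference before use:

```bash
TOKEN="<your-access-token>"     # copied from your profile page
PAI_MASTER="<pai-master-ip>"

# Read the current `default` group and save it (GET path assumed from the API described above)
curl -H "Authorization: Bearer ${TOKEN}" \
  "http://${PAI_MASTER}/rest-server/api/v2/groups/default" -o group.json

# Edit group.json: add "nfs-storage" to storageConfigs and keep every other field,
# including virtualClusters, exactly as returned.
vim group.json

# Push the edited group back (the body format expected by PUT is an assumption here)
curl -X PUT "http://${PAI_MASTER}/rest-server/api/v2/groups" \
  -H "Authorization: Bearer ${TOKEN}" -H "Content-Type: application/json" \
  --data @group.json
```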
## Example: Use Storage Manager to Create an NFS + SAMBA Server

To help you set up the storage, OpenPAI provides a storage manager, which can set up an NFS + SAMBA server. In the cluster, the NFS storage can be accessed in OpenPAI containers. Out of the cluster, users can mount the storage on Unix-like system, or access it in File Explorer on Windows.
To help you set up the storage, OpenPAI provides a storage manager, which can set up an NFS + SAMBA server. In the cluster, the NFS storage can be accessed in OpenPAI containers. Out of the cluster, users can mount the storage on a Unix-like system, or access it in File Explorer on Windows.

Please read the document about [service management and paictl](./basic-management-operations.md#pai-service-management-and-paictl) first, and start a dev box container. Then, in the dev box container, pull the configuration by:

@ -250,7 +250,7 @@ Please read the document about [service management and paictl](./basic-managemen
./paictl config pull -o /cluster-configuration
```

To use storage manager, you should first decide a machine in PAI system to be the storage server. The machine **must** be one of PAI workers, not PAI master. Please open `/cluster-configuration/layout.yaml`, choose a worker machine, then add a `pai-storage: "true"` field to it. Here is an example of the edited `layout.yaml`:
To use storage manager, you should first decide on a machine in the PAI system to be the storage server. The machine **must** be one of PAI workers, not PAI master. Please open `/cluster-configuration/layout.yaml`, choose a worker machine, then add a `pai-storage: "true"` field to it. Here is an example of the edited `layout.yaml`:

```yaml
......

@ -287,7 +287,7 @@ storage-manager:
smbpwd: smbpwd
```

The `localpath` determines the root data dir for NFS on the storage server. The `smbuser` and `smbpwd` determines the username and password when you access the storage in File Explorer on Windows.
The `localpath` determines the root data dir for NFS on the storage server. The `smbuser` and `smbpwd` determine the username and password when you access the storage in File Explorer on Windows.

Follow these commands to start the storage manager:

@ -352,11 +352,11 @@ spec:
Use `kubectl create -f nfs-storage.yaml` to create the PV and PVC.

Since the Kuberentes PV requires the node using it has the corresponding driver, we should use `apt install nfs-common` to install the `nfs-common` package on every worker node.
Since the Kubernetes PV requires the node using it has the corresponding driver, we should use `apt install nfs-common` to install the `nfs-common` package on every worker node.

Finally, [assign storage to PAI groups](#assign-storage-to-pai-groups) by rest-server API. Then you can mount it into job containers.

How to upload data to the storage server? On Windows, open the File Explorer, type in `\\10.0.0.1` (please change `10.0.0.1` to your storage server IP), and press ENTER. The File Explorer will ask you for authorization. Please use `smbuser` and `smbpwd` as username and password to login. On a Unix-like system, you can mount the NFS folder to the file system. For example, on Ubuntu, use the following command to mount it:
How to upload data to the storage server? On Windows, open the File Explorer, type in `\\10.0.0.1` (please change `10.0.0.1` to your storage server IP), and press ENTER. The File Explorer will ask you for authorization. Please use `smbuser` and `smbpwd` as the username and password to log in. On a Unix-like system, you can mount the NFS folder to the file system. For example, on Ubuntu, use the following command to mount it:

```bash
# replace 10.0.0.1 with your storage server IP
@ -2,7 +2,7 @@
## What is Hived Scheduler and How to Configure it

OpenPAI supports two kinds of scheduler: the Kubernetes scheduler, and [hivedscheduler](https://github.com/microsoft/hivedscheduler). [Hivedscheduler](https://github.com/microsoft/hivedscheduler) is a Kubernetes Scheduler for Deep Learning. It supports virtual cluster division, topology-aware resource guarantee and optimized gang scheduling, which are not supported in k8s default scheduler. If you didn't specify `enable_hived_scheduler: false` during installation, hived scheduler is enabled by default. Please notice only hivedscheduler supports virtual cluster setup, k8s default scheduler doesn't support it.
OpenPAI supports two kinds of schedulers: the Kubernetes scheduler, and [hivedscheduler](https://github.com/microsoft/hivedscheduler). [Hivedscheduler](https://github.com/microsoft/hivedscheduler) is a Kubernetes Scheduler for Deep Learning. It supports virtual cluster division, topology-aware resource guarantee, and optimized gang scheduling, which are not supported in the k8s default scheduler. If you didn't specify `enable_hived_scheduler: false` during installation, hived scheduler is enabled by default. Please notice the only hived scheduler supports virtual cluster setup, k8s default scheduler doesn't support it.

## Set Up Virtual Clusters

@ -45,7 +45,7 @@ hivedscheduler:
...
```

If you have followed the [installation guide](./installation-guide.md), you would find similar setting in your [`services-configuration.yaml`](./basic-management-operations.md#pai-service-management-and-paictl). The detailed explanation of these fields are in the [hived scheduler document](https://github.com/microsoft/hivedscheduler/blob/master/doc/user-manual.md). You can update the configuration and set up virtual clusters. For example, in the above settings, we have 3 nodes, `worker1`, `worker2` and `worker3`. They are all in the `default` virtual cluster. If we want to create two VCs, one is called `default` and has 2 nodes, the other is called `new` and has 1 node, we can first modify `services-configuration.yaml`:
If you have followed the [installation guide](./installation-guide.md), you would find similar setting in your [`services-configuration.yaml`](./basic-management-operations.md#pai-service-management-and-paictl). The detailed explanation of these fields are in the [hived scheduler document](https://github.com/microsoft/hivedscheduler/blob/master/doc/user-manual.md). You can update the configuration and set up virtual clusters. For example, in the above settings, we have 3 nodes, `worker1`, `worker2`, and `worker3`. They are all in the `default` virtual cluster. If we want to create two VCs, one is called `default` and has 2 nodes, the other is called `new` and has 1 node, we can first modify `services-configuration.yaml`:

```yaml
# services-configuration.yaml

@ -198,7 +198,7 @@ This should be self-explanatory. The `virtualClusters` field is used to manage V
./paictl.py service start -n rest-server
```

## Different Hardwares in Worker Nodes
## Different Hardware in Worker Nodes

We recommend one VC should have the same hardware, which leads to one `skuType` of one VC in the hived scheduler setting. If you have different types of worker nodes (e.g. different GPU types on different nodes), please configure them in different VCs. Here is an example of 2 kinds of nodes:

@ -251,7 +251,7 @@ hivedscheduler:
cellNumber: 3
```

In the above example, we set up 2 VCs: `default` and `v100`. The `default` VC has 2 K80 nodes, and `V100` VC has 3 V100 nodes. Every K80 node has 4 K80 GPUs and Every V100 nodes has 4 V100 GPUs.
In the above example, we set up 2 VCs: `default` and `v100`. The `default` VC has 2 K80 nodes, and `V100` VC has 3 V100 nodes. Every K80 node has 4 K80 GPUs and every V100 node has 4 V100 GPUs.
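Once a virtual-cluster change has been pushed and `rest-server` restarted, one way to double-check the resulting VCs is the rest-server API; the endpoint path below is an assumption, so adjust it if your version differs:

```bash
curl -H "Authorization: Bearer <your-access-token>" \
  "http://<pai-master-ip>/rest-server/api/v2/virtual-clusters"   # expect default and v100 to be listed
```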
## Configure CPU and GPU SKU on the Same Node

@ -305,14 +305,14 @@ hivedscheduler:
cellNumber: 2
```

Currently we only support mixing CPU and GPU types on one NVIDIA GPU node or one AMD GPU node,
Currently, we only support mixing CPU and GPU types on one NVIDIA GPU node or one AMD GPU node,
rare cases including NVIDIA cards and AMD cards on one node are not supported.

## Use Pinned Cell to Reserve Certain Node in a Virtual Cluster

In some cases, you might want to reserve a certain node in a virtual cluster, and submit job to this node explicitly for debugging or quick testing. OpenPAI provides you with a way to "pin" a node to a virtual cluster.
In some cases, you might want to reserve a certain node in a virtual cluster, and submit jobs to this node explicitly for debugging or quick testing. OpenPAI provides you with a way to "pin" a node to a virtual cluster.

For example, assuming you have three worker nodes: `worker1`, `worker2`, and `worker3`, and 2 virtual clusters: `default` and `new`. The `default` VC has 2 workers, and `new` VC only has one worker. The following is an example for the configuration:
For example, assuming you have three worker nodes: `worker1`, `worker2`, and `worker3`, and 2 virtual clusters: `default` and `new`. The `default` VC has 2 workers, and the `new` VC only has one worker. The following is an example of the configuration:

```yaml
# services-configuration.yaml
@ -2,7 +2,7 @@
## <div id="gte100-uninstallation">Uninstallation Guide for OpenPAI >= v1.0.0</div>

The uninstallation of OpenPAI >= `v1.0.0` is irreversible: all the data will be removed and you cannot find them back. If you need a backup, do it before uninstallation.
The uninstallation of OpenPAI >= `v1.0.0` is irreversible: all the data will be removed and you cannot find them back. If you need a backup, do it before the uninstallation.

First, log in to the dev box machine and delete all PAI services with [dev box container](./basic-management-operations.md#pai-service-management-and-paictl).:

@ -16,17 +16,17 @@ Now all PAI services and data are deleted. If you want to destroy the Kubernetes
ansible-playbook -i inventory/pai/hosts.yml reset.yml --become --become-user=root -e "@inventory/pai/openpai.yml"
```

We recommend you to keep the folder `~/pai-deploy` for re-installation.
We recommend you keep the folder `~/pai-deploy` for re-installation.

## <div id="lt100-uninstallation">Uninstallation Guide for OpenPAI < v1.0.0<div>

### Save your Data to a Different Place

During the uninstallation of OpenPAI < `v1.0.0`, you cannot preserve any useful data: all jobs, user information, dataset will be lost inevitably and irreversibly. Thus, if you have any useful data in previous deployment, please make sure you have saved them to a different place.
During the uninstallation of OpenPAI < `v1.0.0`, you cannot preserve any useful data: all jobs, user information, the dataset will be lost inevitably and irreversibly. Thus, if you have any useful data in the previous deployment, please make sure you have saved them to a different place.

#### HDFS Data

Before `v1.0.0`, PAI will deploy an HDFS server for you. After `v1.0.0`, the HDFS server won't be deployed and previous data will be removed in upgrade. The following commands could be used to transfer your HDFS data:
Before `v1.0.0`, PAI will deploy an HDFS server for you. After `v1.0.0`, the HDFS server won't be deployed and previous data will be removed in the upgrade. The following commands could be used to transfer your HDFS data:

``` bash
# check data structure

@ -35,11 +35,11 @@ hdfs dfs -ls hdfs://<hdfs-namenode-ip>:<hdfs-namenode-port>/
hdfs dfs -copyToLocal hdfs://<hdfs-namenode-ip>:<hdfs-namenode-port>/ <local-folder>
```

`<hdfs-namenode-ip>` and `<hdfs-namenode-port>` is the ip of PAI master and `9000` if you did't modify the default setting. Please make sure your local folder has enough capacity to hold the data you want to save.
`<hdfs-namenode-ip>` and `<hdfs-namenode-port>` are the IP of PAI master and `9000` if you didn't modify the default setting. Please make sure your local folder has enough capacity to hold the data you want to save.

#### Metadata of Jobs and Users

Metadata of jobs and users will also be lost, including job records, job log, user name, user password, etc. We do not have an automatical tool for you to backup these data. Please transfer the data manually if you find some are valuable.
Metadata of jobs and users will also be lost, including job records, job log, user name, user password, etc. We do not have an automatic tool for you to backup these data. Please transfer the data manually if you find some are valuable.

#### Other Resources on Kubernetes

@ -55,7 +55,7 @@ cd pai
# checkout to a different branch if you have a different version
git checkout pai-0.14.y

# delete all pai service and remove all service data
# delete all PAI service and remove all service data
./paictl.py service delete

# delete k8s cluster

@ -86,7 +86,7 @@ lsmod | grep -qE "^nvidia" &&
done
rmmod nvidia ||
{
echo "The driver nvidia is still in use, can't unload it."
echo "The driver NVIDIA is still in use, can't unload it."
exit 1
}
}
@ -1,6 +1,6 @@
# How to Use Alert System

OpenPAI has a built-in alert system. The alert system has some existing alert rules and actions. It can also let admin customize them. In this document, we will have a detailed introduction to this topic.
OpenPAI has a built-in alert system. The alert system has some existing alert rules and actions. It can also let the admin customize them. In this document, we will have a detailed introduction to this topic.

## Alert Rules

@ -13,12 +13,12 @@ alert: GpuUsedByExternalProcess
expr: gpu_used_by_external_process_count > 0
for: 5m
annotations:
summary: found nvidia used by external process in {{$labels.instance}}
summary: found NVIDIA used by external process in {{$labels.instance}}
```

For the detailed syntax of alert rules, please refer to [here](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/).

All alerts fired by the alert rules, including the pre-defined rules and the customized rules, will be shown on the home page of Webportal (on the top-right corner).
All alerts fired by the alert rules, including the pre-defined rules and the customized rules, will be shown on the home page of webportal (on the top-right corner).

### Existing Alert Rules

@ -53,9 +53,9 @@ prometheus:
```

The `PAIJobGpuPercentLowerThan0_3For1h` alert will be fired when the job on virtual cluster `default` has a task level average GPU percent lower than `30%` for more than `1 hour`.
The alert severity can be defined as `info`, `warn`, `error` or `fatal` by adding a label.
The alert severity can be defined as `info`, `warn`, `error`, or `fatal` by adding a label.
Here we use `warn`.
Here the metric `task_gpu_percent` is used, which describes the GPU utilization at task level.
Here the metric `task_gpu_percent` is used, which describes the GPU utilization at the task level.

Remember to push service config to the cluster and restart the `prometheus` service after your modification with the following commands [in the dev-box container](./basic-management-operations.md#pai-service-management-and-paictl):
```bash

@ -68,7 +68,7 @@ Please refer to [Prometheus Alerting Rules](https://prometheus.io/docs/prometheu
## Alert Actions and Routes

Admin can choose how to handle the alerts by different alert actions. We provide some basic alert actions and you can also customize your own actions. In this section, we will first introduce the existing actions and the matching rules between these actions and alerts. Then we will let you know how to add new alert actions. The actions and matching rules are both handled by [`alert-manager`](https://prometheus.io/docs/alerting/latest/alertmanager/).
Admin can choose how to handle the alerts by different alert actions. We provide some basic alert actions and you can also customize your actions. In this section, we will first introduce the existing actions and the matching rules between these actions and alerts. Then we will let you know how to add new alert actions. The actions and matching rules are both handled by [`alert-manager`](https://prometheus.io/docs/alerting/latest/alertmanager/).

### Existing Actions and Matching Rules
@ -4,20 +4,20 @@
#### Which version of NVIDIA driver should I install?

First, check out the [NVIDIA site](https://www.nvidia.com/Download/index.aspx) to verify the newest driver version of your GPU card. Then, check out [this table](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver) to see the CUDA requirement of driver version.
First, check out the [NVIDIA site](https://www.nvidia.com/Download/index.aspx) to verify the newest driver version of your GPU card. Then, check out [this table](https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver) to see the CUDA requirement of the driver version.

Please note that, some docker images with new CUDA version cannot be used on machine with old driver. As for now, we recommend to install the NVIDIA driver 418 as it supports CUDA 9.0 to CUDA 10.1, which is used by most deep learning frameworks.
Please note that some docker images with new CUDA versions cannot be used on machines with old drivers. As for now, we recommend installing the NVIDIA driver 418 as it supports CUDA 9.0 to CUDA 10.1, which is used by most deep learning frameworks.

#### How to fasten deploy speed on large cluster?
#### How to fasten deploy speed on a large cluster?

By default, `Ansible` uses 5 forks to execute commands parallelly on all hosts. If your cluster is a large one, it may be slow for you.

To fasten the deploy speed, you can add `-f <parallel-number>` to all commands using `ansible` or `ansible-playbook`. See [ansible doc](https://docs.ansible.com/ansible/latest/cli/ansible.html#cmdoption-ansible-f) for reference.
To fasten the deploying speed, you can add `-f <parallel-number>` to all commands using `ansible` or `ansible-playbook`. See [ansible doc](https://docs.ansible.com/ansible/latest/cli/ansible.html#cmdoption-ansible-f) for reference.
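For example, a simple connectivity check across all hosts with 25 forks, and the same flag appended to any playbook run (the inventory path follows the one used elsewhere in these docs; the playbook name is a placeholder):

```bash
ansible all -i inventory/pai/hosts.yml -m ping -f 25
ansible-playbook -i inventory/pai/hosts.yml <playbook>.yml -f 25
```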
#### How to remove k8s network plugin
#### How to remove the k8s network plugin

After installation, if you use [weave](https://github.com/weaveworks/weave) as k8s network plugin and you encounter some errors about the network, such as some pods failed to connect internet, you could remove network plugin to solve this issue.
After installation, if you use [weave](https://github.com/weaveworks/weave) as a k8s network plugin and you encounter some errors about the network, such as some pods failed to connect internet, you could remove the network plugin to solve this issue.

Please run `kubectl delete ds weave-net -n kube-system` to remove `weave-net` daemon set first

@ -71,18 +71,18 @@ To remove the network plugin, you could use following `ansible-playbook`:
executable: /bin/bash
```

After these steps you need to change the `coredns` to fix dns resolution issue.
After these steps, you need to change the `coredns` to fix the DNS resolution issue.
Please run `kubectl edit cm coredns -n kube-system`, change `.:53` to `.:9053`
Please run `kubectl edit service coredns -n kube-system`, change `targetPort: 53` to `targetPort: 9053`
Please run `kubectl edit deployment coredns -n kube-system`, change `containerPort: 53` to `containerPort: 9053`. Add `hostNetwork: true` in pod spec.
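After the `coredns` changes, cluster DNS can be sanity-checked as sketched below; the label selector and busybox image tag are common defaults, not values from the original document:

```bash
kubectl -n kube-system get pods -l k8s-app=kube-dns            # coredns pods should be Running
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.28 \
  -- nslookup kubernetes.default                                # should resolve via cluster DNS
```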
#### How to check whether the GPU driver is installed?

For Nvidia GPU, use command `nvidia-smi` to check.
For NVIDIA GPU, use the command `nvidia-smi` to check.

#### How to install GPU driver?

For Nvidia GPU, please first determine which version of driver you want to install (see [this question](#which-version-of-nvidia-driver-should-i-install) for details). Then follow these commands:
For NVIDIA GPU, please first determine which version of the driver you want to install (see [this question](#which-version-of-nvidia-driver-should-i-install) for details). Then follow these commands:

```bash
sudo add-apt-repository ppa:graphics-drivers/ppa

@ -91,9 +91,9 @@ sudo apt install nvidia-418
sudo reboot
```

Here we use nvidia driver version 418 as an example. Please modify `nvidia-418` if you want to install a different version, and refer to the Nvidia community for help if encounter any problem.
Here we use NVIDIA driver version 418 as an example. Please modify `nvidia-418` if you want to install a different version, and refer to the NVIDIA community for help if encounter any problem.

#### How to install nvidia-container-runtime?
#### How to install NVIDIA-container-runtime?

Please refer to the [official document](https://github.com/NVIDIA/nvidia-container-runtime#installation). Don't forget to set it as docker' default runtime in [docker-config-file](https://docs.docker.com/config/daemon/#configure-the-docker-daemon). Here is an example of `/etc/docker/daemon.json`:

@ -117,7 +117,7 @@ Please refer to [this document](https://github.com/microsoft/pai/tree/master/con
#### Ansible reports `Failed to update apt cache` or `Apt install <some package>` fails

Please first check if there is any network-related issues. Besides network, another reason for this problem is: `ansible` sometimes runs a `apt update` to update the cache before the package installation. If `apt update` exits with a non-zero code, the whole command will be considered to be failed.
Please first check if there are any network-related issues. Besides network, another reason for this problem is: `ansible` sometimes runs an `apt update` to update the cache before the package installation. If `apt update` exits with a non-zero code, the whole command will be considered to be failed.

You can check this by running `sudo apt update; echo $?` on the corresponding machine. If the exit code is not 0, please fix it. Here are 2 normal causes of this problem:

@ -137,7 +137,7 @@ sudo apt update
#### Ansible playbook exits because of timeout.

Sometimes, if you assign a different hostname for a certain machine, any commands with `sudo` will be very slow on that machine. Because the system DNS try to find the new hostname, but it will fail due to a timeout.
Sometimes, if you assign a different hostname for a certain machine, any commands with `sudo` will be very slow on that machine. Because the system DNS tries to find the new hostname, but it will fail due to a timeout.

To fix this problem, on each machine, you can add the new hostname to its `/etc/hosts` by:

@ -149,7 +149,7 @@ sudo chmod 644 /etc/hosts
#### Ansible exits because `sudo` is timed out.

The same as `1. Ansible playbook exits because of timeout.` .
The same as `1. Ansible playbook exits because of timeout.`

#### Ansible reports `Could not import python modules: apt, apt_pkg. Please install python3-apt package.`

@ -166,12 +166,12 @@ During installation, the script will download kubeadm and hyperkube from `storag
- `kubeadm`: `https://storage.googleapis.com/kubernetes-release/release/v1.15.11/bin/linux/amd64/kubeadm`
- `hyperkube`: `https://storage.googleapis.com/kubernetes-release/release/v1.15.11/bin/linux/amd64/hyperkube`

Please find alternative urls for downloading this two files and modify `kubeadm_download_url` and `hyperkube_download_url` in your `config` file.
Please find alternative URLs for downloading these two files and modify `kubeadm_download_url` and `hyperkube_download_url` in your `config` file.

**Cannot download image**

Please first check the log to see which image blocks the installation process, and modify `gcr_image_repo`, `kube_image_repo`, `quay_image_repo`, or `docker_image_repo` to a mirror repository correspondingly in `config` file.

For example, if you cannot pull images from `gcr.io`, you should fisrt find a mirror repository (We recommend you to use `gcr.azk8s.cn` if you are in China). Then, modify `gcr_image_repo` and `kube_image_repo`.
For example, if you cannot pull images from `gcr.io`, you should first find a mirror repository (We recommend you to use `gcr.azk8s.cn` if you are in China). Then, modify `gcr_image_repo` and `kube_image_repo`.

Especially for `gcr.io`, we find some image links in kubespray which do not adopt `gcr_image_repo` and `kube_image_repo`. You should modify them manually in `~/pai-deploy/kubespray`. Command `grep -r --color gcr.io ~/pai-deploy/kubespray` will be helpful to you.
@@ -66,17 +66,17 @@ We recommend you to use CPU-only machines for dev box and master. The detailed r

The worker machines are used to run jobs. You can use multiple workers during installation.

We support various types of workers: CPU workers, GPU workers, and workers with other computing devices (e.g. TPU, NPU).

At the same time, we also support two schedulers: the Kubernetes default scheduler, and [hivedscheduler](https://github.com/microsoft/hivedscheduler).

Hivedscheduler is the default for OpenPAI. It supports virtual cluster division, topology-aware resource guarantee, and optimized gang scheduling, which are not supported in the k8s default scheduler.

For now, the support for CPU/NVIDIA GPU workers and workers with other computing devices is different:

- For CPU workers and NVIDIA GPU workers, both the k8s default scheduler and the hived scheduler can be used.
- For workers with other types of computing devices (e.g. TPU, NPU), currently we only support the k8s default scheduler. You can only include workers with the same computing device in the cluster. For example, you can use TPU workers, but then all workers should be TPU workers; you cannot use TPU workers together with GPU workers in one cluster.

Please check the following requirements for different types of worker machines:
@@ -167,10 +167,10 @@ git checkout v1.5.0

```

Please edit the `layout.yaml` and `config.yaml` files under the `<pai-code-dir>/contrib/kubespray/config` folder.
These two files specify the cluster layout and the customized configuration, respectively.
The following are the format and examples of these two files.

**Tips for Chinese Users**: If you are in Mainland China, please refer to [here](./configuration-for-china.md) first before you edit these files.

#### `layout.yaml` format
@@ -189,7 +189,7 @@ machine-sku:

vcore: 24
gpu-machine:
computing-device:
# For `type`, please follow the same format specified in the device plugin.
# For example, `nvidia.com/gpu` is for NVIDIA GPU, `amd.com/gpu` is for AMD GPU,
# and `enflame.com/dtu` is for Enflame DTU.
# Reference: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/
@@ -306,9 +306,9 @@ The `user` and `password` is the SSH username and password from dev box machine

**For Azure Users**: If you are deploying OpenPAI in Azure, please uncomment `openpai_kube_network_plugin: calico` in the config file above, and change it to `openpai_kube_network_plugin: weave`. This is because Azure doesn't support calico. See [here](https://docs.projectcalico.org/reference/public-cloud/azure#why-doesnt-azure-support-calico-networking) for details.

**For those who use workers other than CPU workers and NVIDIA GPU workers**: Now we only support the Kubernetes default scheduler (not Hivedscheduler) for devices other than NVIDIA GPU and CPU. Please uncomment `# enable_hived_scheduler: true` and set it to `enable_hived_scheduler: false`.

**If qos-switch is enabled**: OpenPAI daemons will request additional resources on each node. Please check the following table and reserve sufficient resources for OpenPAI daemons.

| Service Name | Memory Request | CPU Request |
| :-----------: | :------------: | :---------: |
@@ -331,7 +331,7 @@ Please run the following script to deploy Kubernetes first. As the name explains

/bin/bash quick-start-kubespray.sh
```

If there is any problem, please double-check the environment requirements first. Here we provide a requirement checker to help you verify:

```bash
/bin/bash requirement.sh -l config/layout.yaml -c config/config.yaml
@@ -342,16 +342,16 @@ You can also refer to [the installation troubleshooting](./installation-faqs-and

The `quick-start-kubespray.sh` will output the following information if k8s is successfully installed:

```
You can run the following commands to set up kubectl on your localhost:
ansible-playbook -i ${HOME}/pai-deploy/kubespray/inventory/pai/hosts.yml set-kubectl.yml --ask-become-pass
```

By default, we don't set up `kubeconfig` or install the `kubectl` client on the dev box machine, but we put the Kubernetes config file in `~/pai-deploy/kube/config`. You can use the config with any Kubernetes client to verify the installation.

Also, you can use the command `ansible-playbook -i ${HOME}/pai-deploy/kubespray/inventory/pai/hosts.yml set-kubectl.yml --ask-become-pass` to set up `kubeconfig` and `kubectl` on the dev box machine. It will copy the config to `~/.kube/config` and set up the `kubectl` client. After it is executed, you can use `kubectl` on the dev box machine directly.
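For example, a quick sanity check against the generated config, assuming the default path, might be:

```bash
# Verify the cluster is reachable without touching ~/.kube/config.
KUBECONFIG="${HOME}/pai-deploy/kube/config" kubectl get nodes -o wide
```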
#### Tips for Network-related Issues

If you are facing network issues, such as a machine being unable to download some file or to connect to some docker registry, please combine the prompted error log and "kubespray" as keywords and search for a solution. You can also refer to the [installation troubleshooting](./installation-faqs-and-troubleshooting.md#troubleshooting) and [this issue](https://github.com/microsoft/pai/issues/4516).

## Start OpenPAI Services
@@ -375,16 +375,16 @@ You can go to http://<your-master-ip>, then use the default username and passwor

As the message says, you can use `admin` and `admin-password` to log in to the webportal, then submit a job to validate your installation. We have generated the configuration files of OpenPAI in the folder `~/pai-deploy/cluster-cfg`. If you need further customization, they will be used in the future.

**For those who use workers other than CPU workers, NVIDIA GPU workers, AMD GPU workers, and Enflame DTU workers**: Please manually deploy the device's device plugin in Kubernetes. Otherwise, the Kubernetes default scheduler won't work. Supported device plugins are listed [in this file](https://github.com/microsoft/pai/blob/master/src/device-plugin/deploy/start.sh.template). PRs are welcome.
## Keep a Folder

We highly recommend you keep the folder `~/pai-deploy` for future operations such as upgrade, maintenance, and uninstallation. The most important contents in this folder are:

+ Kubernetes cluster config (the default is `~/pai-deploy/kube/config`): the Kubernetes config file. It is used by `kubectl` to connect to the k8s API server.
+ OpenPAI cluster config (the default is `~/pai-deploy/cluster-cfg`): a folder containing the machine layout and OpenPAI service configurations.

If possible, make a backup of `~/pai-deploy` in case it is deleted unexpectedly.

Apart from the folder, you should remember your OpenPAI cluster ID, which is used to indicate your OpenPAI cluster.
The default value is `pai`. Some management operations need a confirmation of this cluster ID.
@@ -4,19 +4,19 @@ Managing one or more clusters is not an easy task. In most cases, the administra

## Team Shared Practice

There are mainly two kinds of resources in OpenPAI, namely virtual clusters and storage, and they are managed by groups. All users are in the `default` group, so everyone has access to the `default` virtual cluster and to one or more storages that are set for the `default` group. Besides the `default` group, you can set up different groups for different users. **In practice, we recommend you assign each team a single group.**

For example, if you have two teams, team A working on project A and team B working on project B, you can set up group A and group B for team A and team B, correspondingly. Thus each team can share computing and storage resources internally. If a user joins a team, just add them to the corresponding group.

By default, OpenPAI uses basic authentication mode. In this mode, virtual clusters are exactly bound to groups, which means setting up a group means setting up a virtual cluster. In AAD mode, group and virtual cluster are different concepts. Please refer to [How to Set Up Virtual Clusters](./how-to-set-up-virtual-clusters.md) for details.
## Onboarding Practice

For new user onboarding in basic authentication mode, the PAI admin should create the user manually in the backend, and notify the user of some instructions and guidelines. In our practice, we send an e-mail to the new user. Besides account information, we also include the following content in the e-mail:

- Let the user read the [user manual](../cluster-user/) to learn how to submit jobs, debug jobs, and use the client tool.
- Let the user know their completed jobs may be deleted after 30 days.
- Let the user know he/she shouldn't always run low-efficiency jobs (e.g. sleep for several days in the container). Otherwise, the administrator may kill the job.
- Let the user know how to contact the administrator in case they find any problem or have any questions.

## DRI Practice
|
|||
- Severity 2: Some jobs fail consistently, or some users hit problems frequently.
|
||||
- Severity 3: Random job failures with low probability.
|
||||
|
||||
In addition, if there are multiple clusters, he/she should also be aware of the different priorities of different clusters.
|
||||
Besides, if there are multiple clusters, he/she should also be aware of the different priorities of different clusters.
|
||||
|
||||
Based on severity and priority, we can make an SLA for the cluster management. Here is an example:
|
||||
|
||||
|
@ -48,6 +48,6 @@ Based on severity and priority, we can make an SLA for the cluster management. H
|
|||
|
||||
If an issue is raised, the DRI should follow these steps to address it:
|
||||
|
||||
1. All questions or notification sent to DRIs should be updated by the DRI owner proactively.
|
||||
2. DRI owner should send ACK to each incident of PAI alerts. As there are many duplicated alerts, so that it doesn't need to ACK on each one.
|
||||
3. Besides ACK, the DRI owner should reply to the questions/notification/alerts, if there are updates or it is resolved. Further more, the DRI owner should think about why this incident happens, how to avoid it in the next time, after it is resolved. If it's applicable, create issues on Github.
|
||||
1. All questions or notifications sent to DRIs should be updated by the DRI owner proactively.
|
||||
2. DRI owner should send ACK to each incident of PAI alerts. As there are many duplicated alerts so that it doesn't need to ACK on each one.
|
||||
3. Besides ACK, the DRI owner should reply to the questions/notification/alerts, if there are updates or it is resolved. Furthermore, the DRI owner should think about why this incident happens, how to avoid it the next time, after it is resolved. If it's applicable, create issues on Github.
|
|
@@ -1,22 +1,22 @@

# Troubleshooting

This document includes some troubleshooting cases in practice.
### PaiServicePodNotReady Alert

This is a kind of alert from the alert manager and is usually caused by the container being killed by the operator or the OOM killer. To check if it was killed by the OOM killer, you can check the node's free memory via Prometheus:

1. Visit the Prometheus web page, which is usually `http://<your-pai-master-ip>:9091`.
2. Enter the query `node_memory_MemFree_bytes`.
3. If the free memory drops to near 0, the container was probably killed by the OOM killer.
4. You can double-check this by logging into the node, running the command `dmesg`, and looking for the phrase `oom`. Or you can run `docker inspect <stopped_docker_id>` to get more detailed information (see the example commands after this list).
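For example, the check in step 4 could be done with standard Linux and Docker commands like these, where `<stopped_docker_id>` is the ID of the killed container:

```bash
# Look for OOM killer activity and confirm whether the container was OOM-killed.
dmesg -T | grep -i -E 'oom|killed process'
docker inspect <stopped_docker_id> --format '{{.State.OOMKilled}} {{.State.ExitCode}}'
```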
Solutions:

1. Force remove unhealthy containers with this command in the terminal:
   `kubectl delete pod pod-name --grace-period=0 --force`
2. Recreate the pod in Kubernetes. This operation may block indefinitely because dockerd may not function correctly after an OOM. If the recreation is blocked for too long, you can log into the node and restart dockerd via `/etc/init.d/docker restart`.
3. If restarting doesn't solve it, you can increase the pod's memory limit.
### NodeNotReady Alert

@@ -26,13 +26,13 @@ This is a kind of alert from alert manager, and is reported by watchdog service.

pai_node_count{disk_pressure="false",instance="10.0.0.1:9101",job="pai_serivce_exporter",memory_pressure="false",host_ip="10.0.0.2",out_of_disk="false",pai_service_name="watchdog",ready="true",scraped_from="watchdog-5ddd945975-kwhpr"}
```

The name label indicates what node this metric represents.

If the node's ready label has the value "unknown", the node may have disconnected from the Kubernetes master. This may be due to several reasons:

- The node is down
- Kubelet is down
- Network partition between the node and the Kubernetes master

You can first try to log into the node. If you cannot, and there is no ping response, the node may be down, and you should boot it up.
@@ -50,17 +50,17 @@ You should check what caused this connectivity problem.

### NodeFilesystemUsage Alert

This is a kind of alert from the alert manager and is used to monitor the disk space of each server. If the usage of disk space is greater than 80%, this alert will be triggered. OpenPAI has two services that may use a lot of disk space: the storage manager and the docker image cache. If the OpenPAI servers are used for anything else, that usage should be checked as well, to rule out disk consumption from outside OpenPAI.

Solutions:

1. Check the user files on the NFS storage server launched by the storage manager. If you didn't set up a storage manager, ignore this step.
2. Check the docker cache. Docker may use too much disk space for caching; it's worth a check (see the example commands after this list).
3. Check the PAI log folder size. The path is `/var/log/pai`.
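The following standard commands, run on the affected server, can help with steps 2 and 3:

```bash
df -h                     # overall usage of each filesystem
sudo docker system df     # space taken by images, containers, volumes, and the build cache
sudo du -sh /var/log/pai  # size of the PAI log folder
```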
### NodeGpuCountChanged Alert

This is an alert from the alert manager and is used to monitor the GPU count of each node.
This alert will be triggered when the detected GPU count is different from the GPU count specified in `layout.yaml`.

If you find that the real GPU count is correct but the alerts still keep being fired, it is possibly caused by a wrong specification in `layout.yaml`.
|
|||
If the GPU number shown in webportal is wrong, check the [hivedscheduler and VC configuration](./how-to-set-up-virtual-clusters.md).
|
||||
|
||||
### NvidiaSmiDoubleEccError
|
||||
This is a kind of alert from alert manager.
|
||||
It means that nvidia cards from the related nodes have double ecc error.
|
||||
When this alert occurs, the nodes related will be automatically cordoned by alert manager.
|
||||
This is a kind of alert from the alert manager.
|
||||
It means that NVIDIA cards from the related nodes have double ecc errors.
|
||||
When this alert occurs, the nodes related will be automatically cordoned by the alert manager.
|
||||
After the problem is resolved, you can uncordon the node manually with the following command:
|
||||
```bash
|
||||
kubectl uncordon <node name>
|
||||
```
|
||||
|
||||
### NodeGpuLowPerfState

This is a kind of alert from the alert manager.
It means the NVIDIA cards on the related node have unexpectedly downgraded into a low-performance state.
To fix this, please run the following commands:

```bash
sudo nvidia-smi -pm ENABLED -i <gpu-card-id>
@@ -106,16 +106,16 @@ You can get the supported clock by `sudo nvidia-smi -q -d SUPPORTED_CLOCKS`

### Cannot See Utilization Information.

If you cannot see utilization information (e.g. GPU, CPU, and network usage) in the cluster, please check whether the services `prometheus`, `grafana`, `job-exporter`, and `node-exporter` are working.

In detail, you can [exec into a dev box container](./basic-management-operations.md#pai-service-management-and-paictl), then check the service status by `kubectl get pod`. You can see a pod's log by `kubectl logs <pod-name>`. After you fix the problem, you can [restart the whole cluster using paictl](./basic-management-operations.md#pai-service-management-and-paictl).
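As a sketch, assuming the PAI service pods run in the `default` namespace (the standard OpenPAI deployment), the check might look like:

```bash
# List the monitoring-related service pods, then inspect any that are not Running.
kubectl get pod | grep -E 'prometheus|grafana|job-exporter|node-exporter'
kubectl logs <pod-name>
```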
### Node is De-allocated and doesn't Appear in Kubernetes System when it Comes Back

Working nodes can be de-allocated if you are using a cloud service and set up PAI on low-priority machines. Usually, if the node is lost temporarily, you can wait until the node comes back. It doesn't need any special care.

However, some cloud service providers not only de-allocate nodes but also remove all disk contents on certain nodes. Thus the node cannot connect to Kubernetes automatically when it comes back. If this is your case, we recommend you set up a crontab job on the dev box node to bring back these nodes periodically.

In [How to Add and Remove Nodes](how-to-add-and-remove-nodes.md), we have described how to add a node. The crontab job doesn't need to do all of those things. It only needs to add the node back to Kubernetes: it figures out which nodes have come back but are still considered `NotReady` in Kubernetes, and then runs the following command to bring them back:
@@ -123,11 +123,11 @@ In [How to Add and Remove Nodes](how-to-add-and-remove-nodes.md), we have descri

ansible-playbook -i inventory/mycluster/hosts.yml upgrade-cluster.yml --become --become-user=root --limit=${limit_list} -e "@inventory/mycluster/openpai.yml"
```

`${limit_list}` stands for the names of these de-allocated nodes. For example, if the crontab job finds that node `a` and node `b` are available now, but they are still in `NotReady` status in Kubernetes, then it can set `limit_list=a,b`.
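A minimal sketch of such a periodic job is shown below. It assumes the kubespray checkout and inventory paths used above, that the kubeconfig sits at `~/pai-deploy/kube/config`, and that node names resolve from the dev box; `bring-back-nodes.sh` is a hypothetical helper name, not part of OpenPAI.

```bash
#!/bin/bash
# bring-back-nodes.sh (hypothetical) -- run from cron on the dev box, e.g. every 10 minutes.
# Collect nodes that Kubernetes still reports as NotReady but that answer ping again,
# then re-add them with the kubespray playbook shown above.
export KUBECONFIG="${HOME}/pai-deploy/kube/config"

limit_list=$(kubectl get nodes --no-headers \
  | awk '$2 == "NotReady" {print $1}' \
  | while read -r node; do ping -c 1 -W 2 "$node" > /dev/null 2>&1 && echo "$node"; done \
  | paste -sd, -)

if [ -n "$limit_list" ]; then
  cd "${HOME}/pai-deploy/kubespray" || exit 1
  ansible-playbook -i inventory/mycluster/hosts.yml upgrade-cluster.yml \
    --become --become-user=root --limit="${limit_list}" -e "@inventory/mycluster/openpai.yml"
fi
```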
### How to Enlarge Internal Storage Size

Currently, OpenPAI uses [internal storage](https://github.com/microsoft/pai/tree/master/src/internal-storage) to hold its database. Internal storage is a storage of limited size. It leverages loop devices in Linux to provide storage with a strictly limited quota. The default quota is 30 GB (or 10 GB for OpenPAI <= `v1.1.0`), which can hold about 1,000,000 jobs. If you want a larger space to hold more jobs, please follow these steps to enlarge the internal storage:

Step 1. [Exec into a dev box container.](./basic-management-operations.md#pai-service-management-and-paictl)
@@ -6,7 +6,7 @@ The upgrade process is mainly about modifying `services-configuration.yaml` and

## Stop All Services and Previous Dev Box Container

First, launch a dev box container of the current PAI version, then stop all services by:

```bash
./paictl.py service stop
@@ -28,7 +28,7 @@ sudo docker rm dev-box

## Modify `services-configuration.yaml`

Now, launch a dev box container of the new version. For example, if you want to upgrade to `v1.1.0`, you should use the docker image `openpai/dev-box:v1.1.0`.

Then, retrieve your configuration by:
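As a hedged sketch (the command below is an assumption based on common `paictl` usage, with `<config-folder>` a local directory of your choice), the retrieval typically looks like this inside the dev box container:

```bash
# Pull the current cluster configuration into a local folder.
./paictl.py config pull -o <config-folder>
```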
@@ -41,7 +41,7 @@ Find the following section in `<config-folder>/services-configuration.yaml`:

```yaml
cluster:
# the docker registry to store docker images that contain system services like Frameworklauncher, Hadoop, etc.
docker-registry:
......
@@ -72,4 +72,4 @@ If you didn't stop `storage-manager`, start other services by:

./paictl.py service start --skip-service-list storage-manager
```

After all the services are started, your OpenPAI cluster is successfully upgraded.
@@ -46,7 +46,7 @@ You could access `NFS` data by `Windows File Explorer` directly if:

To access it, use the file location `\\NFS_SERVER_ADDRESS` in `Windows File Explorer`. It will prompt you to type in a username and a password:

- If OpenPAI is in basic authentication mode (this mode means you use a basic username/password to log in to the OpenPAI webportal), you can access NFS data through its configured username and password. Please note it is different from the one you use to log in to OpenPAI. If the administrator uses `storage-manager`, the default username/password for NFS is `smbuser` and `smbpwd`.

- If OpenPAI is in AAD authentication mode, you can access NFS data through the user domain name and password.
@@ -14,7 +14,7 @@ Now your first OpenPAI job has been kicked off!

## Browse Stdout, Stderr, Full logs, and Metrics

The hello world job is implemented in TensorFlow. It trains a simple model on the CIFAR-10 dataset for 1,000 steps with downloaded data. You can monitor the job by checking its logs and running metrics on the webportal.

Click the `Stdout` and `Stderr` buttons to see the stdout and stderr logs for a job on the job detail page. If you want to see a merged log, you can click `...` on the right and then select `Stdout + Stderr`.
@@ -32,7 +32,7 @@ On the job detail page, you can also see metrics by clicking `Go to Job Metrics

Instead of importing a job configuration file, you can submit the hello world job directly through the web page. The following is a step-by-step guide:

**Step 1.** Log in to the OpenPAI webportal.

**Step 2.** Click **Submit Job** on the left pane, then click `Single` to reach this page.
@@ -4,7 +4,7 @@

## Entrance

If your administrator enables the marketplace plugin, you will find a link in the `Plugin` section on the webportal, like:

> If you are a PAI admin, you can check the [deployment doc](https://github.com/microsoft/openpaimarketplace/blob/master/docs/deployment.md) to see how to deploy and enable the marketplace plugin.
@@ -34,6 +34,7 @@ If there are multiple OpenPAI clusters, you can follow the above steps again to

![add cluster host](https://raw.githubusercontent.com/Microsoft/openpaivscode/0.3.0/assets/add_cluster_host.png)

4. If the `authn_type` of the cluster is `OIDC`, a website will be opened and will ask you to log in. If your login is successful, the username and token fields are auto-filled, and you can change them if needed. Once it completes, click the *Finish* button at the bottom right corner. Note that the settings will not take effect if you save and close the file directly.

![add cluster configuration](https://raw.githubusercontent.com/Microsoft/openpaivscode/0.3.0/assets/add_aad_cluster.gif)
@@ -46,12 +47,14 @@ After added a cluster configuration, you can find the cluster in the *PAI CLUSTE

![pai cluster explorer](https://raw.githubusercontent.com/Microsoft/openpaivscode/0.3.0/assets/pai_cluster_explorer.png)

To submit a job config YAML file, please follow the steps below:

1. Double-click `Create Job Config...` in OpenPAI cluster Explorer, and then specify a file name and location to create a job configuration file.
2. Update the job configuration as needed.
3. Right-click on the created job configuration file, then click on `Submit Job to PAI Cluster`. The client will then upload files to OpenPAI and create a job. Once it's done, there is a notification at the bottom right corner; you can click it to open the job detail page.

If there are multiple OpenPAI clusters, you need to choose one.

This animation shows the above steps.
@@ -13,7 +13,7 @@ alert: GpuUsedByExternalProcess

expr: gpu_used_by_external_process_count > 0
for: 5m
annotations:
  summary: found NVIDIA used by external process in {{$labels.instance}}
```

For the detailed syntax of alerting rules, please refer to [here](https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/).
@@ -78,11 +78,11 @@

#### <div id="how-to-check-whether-the-gpu-driver-is-installed">How to check whether the GPU driver is installed correctly?</div>

For NVIDIA GPUs, you can check with the command `nvidia-smi`.

#### <div id="how-to-install-gpu-driver">How to install the GPU driver?</div>

For NVIDIA GPUs, please first confirm which driver version you want to install (you can refer to [this question](#which-version-of-nvidia-driver-should-i-install)). Then follow the steps below:

```bash
sudo add-apt-repository ppa:graphics-drivers/ppa
@@ -84,7 +84,7 @@ pai_node_count{disk_pressure="false",instance="10.0.0.1:9101",job="pai_serivce_e

If you cannot use GPUs in your job, you can follow the steps below on the worker node to check:

1. The GPU driver is installed correctly. For NVIDIA cards, check with `nvidia-smi`.
2. [nvidia-container-runtime](https://github.com/NVIDIA/nvidia-container-runtime) is installed correctly and is set as Docker's default runtime. You can check with `docker info -f "{{json .DefaultRuntime}}"`.

If the GPU count shown in the webportal is inconsistent, please refer to [this document](./how-to-set-up-virtual-clusters.md) to reconfigure the cluster.