[alert-handler] auto-fix NVIDIA GPU low performance issue with temp k8s jobs (#5383)

This commit is contained in:
Guoxin 2021-03-31 17:12:39 +08:00 коммит произвёл GitHub
Родитель 7a370de646
Коммит f559e978a6
Не найден ключ, соответствующий данной подписи
Идентификатор ключа GPG: 4AEE18F83AFDEB23
15 изменённых файлов: 325 добавлений и 98 удалений

Просмотреть файл

@ -232,6 +232,9 @@ authentication:
# - receiver: pai-email-admin-user-and-stop-job
# match:
# alertname: PAIJobGpuPercentLowerThan0_3For1h
# - receiver: pai-email-admin-and-fix-nvidia-gpu-low-perf
# match:
# alertname: NodeGpuLowPerfState
# customized-receivers: # receivers are combination of several actions
# - name: "pai-email-admin-user-and-stop-job"
# actions:
@ -244,6 +247,11 @@ authentication:
# tag-jobs:
# tags:
# - 'stopped-by-alert-manager'
# - name: "pai-email-admin-and-fix-nvidia-gpu-low-perf"
# actions:
# email-admin:
# fix-nvidia-gpu-low-perf:
# uncomment following if you want to customize prometheus
# prometheus:

Просмотреть файл

@ -92,6 +92,9 @@ rest-server:
# - receiver: pai-email-admin-user-and-stop-job
# match:
# alertname: PAIJobGpuPercentLowerThan0_3For1h
# - receiver: pai-email-admin-and-fix-nvidia-gpu-low-perf
# match:
# alertname: NodeGpuLowPerfState
# customized-receivers: # receivers are combination of several actions
# - name: "pai-email-admin-user-and-stop-job"
# actions:
@ -104,6 +107,11 @@ rest-server:
# tag-jobs:
# tags:
# - 'stopped-by-alert-manager'
# - name: "pai-email-admin-and-fix-nvidia-gpu-low-perf"
# actions:
# email-admin:
# fix-nvidia-gpu-low-perf:
# uncomment following if you want to customize prometheus
# prometheus:

Просмотреть файл

@ -114,26 +114,29 @@ We have provided so far these following actions:
- `stop-jobs`: Stop jobs by calling OpenPAI REST API. **Be careful about this action because it stops jobs without notifying related users.**
- `tag-jobs`: Add a tag to jobs by calling OpenPAI REST API.
- `cordon-nodes`: Call Kubernetes API to cordon the corresponding nodes.
- `fix-nvidia-gpu-low-perf`: Start a privileged container to fix NVIDIA GPU Low Performance State issue.
But before you use them, you have to add proper configuration in the `alert-handler` field. For example, `email-admin` needs you to set up an SMTP account to send the email and an admin email address to receive the email. Also, the `tag-jobs` and `stop-jobs` action calls OpenPAI REST API, so you should set a rest server token for them. To get the token, you should go to your profile page (in the top-right corner on Webporal, click `View my profile`), and use `Create application token` to create one. Generally speaking, there are two parts of the configuration in the `alert-handler` field. One is `email-configs`. The other is `pai-bearer-token`. The requirements for different actions are shown in the following table:
| | email-configs | pai-bearer-token |
| :-----------:| :-----------: | :--------------: |
| cordon-nodes | - | - |
| email-admin | required | - |
| email-user | required | required |
| stop-jobs | - | required |
| tag-jobs | - | required |
| | email-configs | pai-bearer-token |
| :-------------------------: | :-----------: | :--------------: |
| cordon-nodes | - | - |
| email-admin | required | - |
| email-user | required | required |
| stop-jobs | - | required |
| tag-jobs | - | required |
| fix-nvidia-gpu-low-perf | - | - |
In addition, some actions may depend on certain fields in the `labels` of alert instances. The labels of the `alert instance` are generated based on the expression in the alert rule. For example, the expression of the `PAIJobGpuPercentLowerThan0_3For1h` alert we mentioned in previous section is `avg(task_gpu_percent{virtual_cluster=~"default"}) by (job_name) < 0.3`. This expression returns a list, the element in which contains the `job_name` field. So there will be also a `job_name` field in the labels of the alert instance. `stop-jobs` action depends on the `job_name` field, and it will stop the corresponding job based on it. To inspect the labels of an alert, you can visit `http(s)://<your master IP>/prometheus/alerts`. If the alert is firing, you can see its labels on this page. For the depended fields of each pre-defined action, please refer to the following table:
| | depended on label field |
| :-----------:| :------------------: |
| cordon-nodes | node_name |
| email-admin | - |
| email-user | - |
| stop-jobs | job_name |
| tag-jobs | job_name |
| | depended on label field |
| :-------------------------: | :---------------------: |
| cordon-nodes | node_name |
| email-admin | - |
| email-user | - |
| stop-jobs | job_name |
| tag-jobs | job_name |
| fix-nvidia-gpu-low-perf | node_name, minor_number |
The matching rules between alerts and actions are defined using `receivers` and `routes`.

Просмотреть файл

@ -82,7 +82,6 @@ rest-server:
#github-path: marketplace
# Job Debugging Reservation Seconds.
#debugging-reservation-seconds: 604800
# uncomment following section if you want to customize the port of web portal
# webportal:
# server-port: 9286
@ -125,6 +124,9 @@ rest-server:
# - receiver: pai-email-admin-user-and-stop-job
# match:
# alertname: PAIJobGpuPercentLowerThan0_3For1h
# - receiver: pai-email-admin-and-fix-nvidia-gpu-low-perf
# match:
# alertname: NodeGpuLowPerfState
# customized-receivers: # receivers are combination of several actions
# - name: "pai-email-admin-user-and-stop-job"
# actions:
@ -137,6 +139,10 @@ rest-server:
# tag-jobs:
# tags:
# - 'stopped-by-alert-manager'
# - name: "pai-email-admin-and-fix-nvidia-gpu-low-perf"
# actions:
# email-admin:
# fix-nvidia-gpu-low-perf:
# uncomment following if you want to customize prometheus
# prometheus:
@ -172,8 +178,6 @@ rest-server:
# # key_name: yyyyyy
# # key_path: /path/to/yyyyyy
# uncomment following section if you want to customize the threshold of cleaner
# cleaner:
# threshold: 90
@ -185,65 +189,65 @@ rest-server:
# uncomment following section, if you want to customize the authentication solution.
#authentication:
#OIDC: false
#OIDC: false
# If OIDC is set as the value true, you will have to configure the following properties.
#OIDC-type: AAD
#
#AAD:
# # If you wanna configure AAD-OIDC for OpenPAI, the following configuration is mandatory.
# # National Clouds endpoint list https://docs.microsoft.com/en-us/azure/active-directory/develop/authentication-national-cloud
# # AZURE: https://login.microsoftonline.com/{tenantID}/v2.0/.well-known/openid-configuration
# # China: https://login.partner.microsoftonline.cn/{tenantID}/v2.0/.well-known/openid-configuration
# # Germany: https://login.microsoftonline.de/{tenantID}/v2.0/.well-known/openid-configuration
# wellKnownURL: https://login.microsoftonline.com/{tenantID}/v2.0/.well-known/openid-configuration
#
# # If you wanna configure AAD-OIDC for OpenPAI, the following configuration is mandatory.
# tenantID: ${tenat_id}
#
# # Required, the client ID of your app in AAD
# clientID: ${your_client_id}
#
# # Required if `responseType` is 'code', 'id_token code' or 'code id_token'.
# # If app key contains '\', replace it with '\\'.
# clientSecret: '${your_client_secret}'
#
# # Optional. The lifetime of nonce in session or cookie, the default value is 3600 (seconds).
# nonceLifetime: null
#
# # Optional. The max amount of nonce saved in session or cookie, the default value is 10.
# nonceMaxAmount: 5
#
# # Optional. The clock skew allowed in token validation, the default value is 300 seconds.
# clockSkew: null
#
#group-manager:
# # basic: If you set group-data-source as the value basic, admin should manually modify user's grouplist.
# # winbind: If you set group-data-source as the value winbind, the user's grouplist will get from winbind server based on your configuration.
# group-data-source: basic
#
# # If you set winbind as your data source, you should configure this configuration.
# # winbind-server-address: xxxxxxx
#
# # Admin group name and its user list
# admin-group:
# groupname: admingroup
# description: "admin's group"
# externalName: ""
#
# # Group for default vc.
# # For yarn default queue hack.
# default-group:
# groupname: default
# description: "group for default vc"
# externalName: ""
#
# # If the following groups are not in the data store, it will be created by default.
# grouplist:
# - groupname: forexample
# # internal name
# description: forexample
# # description of the group
# externalName: ""
# # external name, it should be set if your group-data-source is winbind. And the name will be used to query and match the group from
# # the result of winbind. If the group-data-source is basic, this field is useless.
# If OIDC is set as the value true, you will have to configure the following properties.
#OIDC-type: AAD
#
#AAD:
# # If you wanna configure AAD-OIDC for OpenPAI, the following configuration is mandatory.
# # National Clouds endpoint list https://docs.microsoft.com/en-us/azure/active-directory/develop/authentication-national-cloud
# # AZURE: https://login.microsoftonline.com/{tenantID}/v2.0/.well-known/openid-configuration
# # China: https://login.partner.microsoftonline.cn/{tenantID}/v2.0/.well-known/openid-configuration
# # Germany: https://login.microsoftonline.de/{tenantID}/v2.0/.well-known/openid-configuration
# wellKnownURL: https://login.microsoftonline.com/{tenantID}/v2.0/.well-known/openid-configuration
#
# # If you wanna configure AAD-OIDC for OpenPAI, the following configuration is mandatory.
# tenantID: ${tenat_id}
#
# # Required, the client ID of your app in AAD
# clientID: ${your_client_id}
#
# # Required if `responseType` is 'code', 'id_token code' or 'code id_token'.
# # If app key contains '\', replace it with '\\'.
# clientSecret: '${your_client_secret}'
#
# # Optional. The lifetime of nonce in session or cookie, the default value is 3600 (seconds).
# nonceLifetime: null
#
# # Optional. The max amount of nonce saved in session or cookie, the default value is 10.
# nonceMaxAmount: 5
#
# # Optional. The clock skew allowed in token validation, the default value is 300 seconds.
# clockSkew: null
#
#group-manager:
# # basic: If you set group-data-source as the value basic, admin should manually modify user's grouplist.
# # winbind: If you set group-data-source as the value winbind, the user's grouplist will get from winbind server based on your configuration.
# group-data-source: basic
#
# # If you set winbind as your data source, you should configure this configuration.
# # winbind-server-address: xxxxxxx
#
# # Admin group name and its user list
# admin-group:
# groupname: admingroup
# description: "admin's group"
# externalName: ""
#
# # Group for default vc.
# # For yarn default queue hack.
# default-group:
# groupname: default
# description: "group for default vc"
# externalName: ""
#
# # If the following groups are not in the data store, it will be created by default.
# grouplist:
# - groupname: forexample
# # internal name
# description: forexample
# # description of the group
# externalName: ""
# # external name, it should be set if your group-data-source is winbind. And the name will be used to query and match the group from
# # the result of winbind. If the group-data-source is basic, this field is useless.

Просмотреть файл

@ -0,0 +1,22 @@
# Copyright (c) Microsoft Corporation
# All rights reserved.
#
# MIT License
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
# documentation files (the "Software"), to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
# to permit persons to whom the Software is furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
# BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
# DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
FROM nvidia/cuda:11.2.2-base-ubuntu16.04
COPY ./src/nvidia-gpu-low-perf-fixer .
ENTRYPOINT /bin/bash nvidia-gpu-low-perf-fixer.sh

Просмотреть файл

@ -74,17 +74,14 @@ class AlertManager(object):
else:
token_configured = False
result["alert-handler"]["configured"] = True
result["actions-available"] = ["fix-nvidia-gpu-low-perf"]
if email_configured and token_configured:
result["alert-handler"]["configured"] = True
result["actions-available"].extend(["email-admin", "email-user", "stop-jobs", "tag-jobs"])
elif email_configured:
result["alert-handler"]["configured"] = True
result["actions-available"].append("email-admin")
elif token_configured:
result["alert-handler"]["configured"] = True
result["actions-available"].extend(["stop-jobs", "tag-jobs"])
else:
result["alert-handler"]["configured"] = False
if result.get("cluster-utilization") is not None and \
result["cluster-utilization"].get("schedule") is not None and \

Просмотреть файл

@ -122,6 +122,11 @@ data:
- url: 'http://localhost:{{ cluster_cfg["alert-manager"]["alert-handler"]["port"] }}/alert-handler/cordon-nodes'
send_resolved: false
{% endif %}
{% if (receiver["actions"]["fix-nvidia-gpu-low-perf"] is defined) and ('fix-nvidia-gpu-low-perf' in cluster_cfg["alert-manager"]["actions-available"]) %}
- url: 'http://localhost:{{ cluster_cfg["alert-manager"]["alert-handler"]["port"] }}/alert-handler/fix-nvidia-gpu-low-perf'
send_resolved: false
{% endif %}
{% endfor %}

Просмотреть файл

@ -67,6 +67,10 @@ spec:
value: {{ cluster_cfg["cluster"]["common"]["cluster-id"] }}
- name: REST_SERVER_URI
value: {{ cluster_cfg['rest-server']['uri'] }}
- name: DOCKER_REGISTRY_PREFIX
value: {{ cluster_cfg['cluster']['docker-registry']['prefix'] }}
- name: DOCKER_REGISTRY_TAG
value: {{ cluster_cfg['cluster']['docker-registry']['tag'] }}
- name: WEBPORTAL_URI
{%- if "ssl" in cluster_cfg["pylon"] and cluster_cfg["pylon"]["ssl"] %}
value: "{{ cluster_cfg['pylon']['uri-https']}}"

Просмотреть файл

@ -15,6 +15,9 @@ rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["patch"]
- apiGroups: ["batch"]
resources: ["jobs"]
verbs: ["create", "list", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding

Просмотреть файл

@ -0,0 +1,63 @@
// Copyright (c) Microsoft Corporation
// All rights reserved.
//
// MIT License
//
// Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
// documentation files (the "Software"), to deal in the Software without restriction, including without limitation
// the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and
// to permit persons to whom the Software is furnished to do so, subject to the following conditions:
// The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING
// BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
// NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
// DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
const k8s = require('@kubernetes/client-node');
const kc = new k8s.KubeConfig();
const logger = require('@alert-handler/common/logger');
// clean TTL 24 hours jobs created by alert-handler
const cleanTTL24HJobs = () => {
logger.info('Cleaning completed TTL 24h jobs...');
const k8sApi = kc.makeApiClient(k8s.BatchV1Api);
k8sApi
.listNamespacedJob(
'default',
undefined,
undefined,
undefined,
undefined,
'created-by=alert-handler,time-to-live=24h', // labelSelector
)
.then((response) => {
logger.info(`Successfully get job list.`);
const jobs = response.body.items;
jobs.forEach((job) => {
const jobName = job.metadata.name;
if (
(job.status.succeeded === 1 || jobs.status.failed === 1) && // check if the job has completed
new Date() - new Date(job.status.completionTime) > 24 * 60 * 60 * 1000 // completed for more than 24h
)
k8sApi
.deleteNamespacedJob(jobName, 'default')
.then((response) => {
logger.info(`Successfully deleted job ${jobName}`);
})
.catch((error) => {
logger.info(`Failed to delete job ${jobName}`, error);
});
});
})
.catch((error) => {
logger.error('Failed to list jobs:', error);
});
};
// module exports
module.exports = {
cleanTTL24HJobs,
};

Просмотреть файл

@ -88,19 +88,6 @@ const sendEmailToAdmin = (req, res) => {
});
};
const getUserNameByJobName = async (jobName, token) => {
return axios
.get(`${process.env.REST_SERVER_URI}/api/v2/jobs/${jobName}`, {
headers: {
Authorization: `Bearer ${token}`,
'Content-Type': 'application/json',
},
})
.then((response) => {
return response.data.jobStatus.username;
});
};
const getUserEmail = async (username, token) => {
return axios
.get(`${process.env.REST_SERVER_URI}/api/v2/users/${username}`, {
@ -132,7 +119,7 @@ const sendEmailToUser = async (req, res) => {
// group alerts by username
const alertsGrouped = {};
alerts.map((alert, index) => {
let userName = alert.labels.job_name.split('~')[0];
const userName = alert.labels.job_name.split('~')[0];
if (userName in alertsGrouped) {
alertsGrouped[userName].push(alerts[index]);
} else {

Просмотреть файл

@ -18,15 +18,16 @@
const k8s = require('@kubernetes/client-node');
const kc = new k8s.KubeConfig();
const logger = require('@alert-handler/common/logger');
const crypto = require('crypto');
kc.loadFromDefault();
const k8sApi = kc.makeApiClient(k8s.CoreV1Api);
const cordonNode = async (nodeName) => {
const headers = {
'content-type': 'application/strategic-merge-patch+json',
};
// set the node unschedulable
const k8sApi = kc.makeApiClient(k8s.CoreV1Api);
return k8sApi.patchNode(
nodeName,
{ spec: { unschedulable: true } },
@ -72,7 +73,108 @@ const cordonNodes = (req, res) => {
});
};
const getK8sV1Job = (jobName, nodeName, minorNumber) => {
const DOCKER_REGISTRY_PREFIX = process.env.DOCKER_REGISTRY_PREFIX;
const DOCKER_REGISTRY_TAG = process.env.DOCKER_REGISTRY_TAG;
const job = {
apiVersion: 'batch/v1',
kind: 'Job',
metadata: {
name: jobName,
labels: {
'created-by': 'alert-handler',
'time-to-live': '24h',
},
},
spec: {
// TTL feature is currently alpha[Kubernetes 1.15]
// To avoid using this fearure, jobs with label `time-to-live=24h` & `created-by=alert-handler` will be cleaned with function `cleanTTL24HJobs` regularlly
// ttlSecondsAfterFinished: 86400,
template: {
spec: {
containers: [
{
name: 'nvidia-gpu-low-perf-fixer',
image: `${DOCKER_REGISTRY_PREFIX}nvidia-gpu-low-perf-fixer:${DOCKER_REGISTRY_TAG}`,
imagePullPolicy: 'Always',
env: [
{
name: 'MINOR_NUMBER',
value: `${minorNumber}`,
},
],
securityContext: {
privileged: true,
},
},
],
restartPolicy: 'Never',
nodeSelector: {
'kubernetes.io/hostname': nodeName,
},
},
},
},
};
return job;
};
// start a k8s job for each GPU card to fix NvidiaGPULowPerf issue
const fixNvidiaGPULowPerf = (req, res) => {
logger.info(
'Received `fixNvidiaGPULowPerf` post request from alert-manager.',
);
// filter alerts which are firing and contain `node_name` & `minor_number` as label
const jobsInfo = req.body.alerts
.filter(
(alert) =>
alert.status === 'firing' &&
'node_name' in alert.labels &&
'minor_number' in alert.labels,
)
// map each alert to a job
.map((alert) => ({
jobName: `nvidia-gpu-low-perf-fixer-${crypto
.createHash('md5')
.update(alert.labels.node_name + alert.labels.minor_number)
.digest('hex')}`, // unique job by GPU card
nodeName: alert.labels.node_name,
minorNumber: alert.labels.minor_number,
DOCKER_REGISTRY_PREFIX: process.env.DOCKER_REGISTRY_PREFIX,
DOCKER_REGISTRY_TAG: process.env.DOCKER_REGISTRY_TAG,
}));
const k8sApi = kc.makeApiClient(k8s.BatchV1Api);
jobsInfo.forEach(async (jobInfo) => {
// get k8s V1Job
const job = getK8sV1Job(
jobInfo.jobName,
jobInfo.nodeName,
jobInfo.minorNumber,
);
k8sApi
.createNamespacedJob('default', job)
.then((response) => {
logger.info(
`Successfully start job ${jobInfo.jobName} for GPU Low Performance issue in node: ${jobInfo.nodeName}, minor number: ${jobInfo.minorNumber}`,
);
})
.catch((error) => {
// ignore the job creation if already exists
if (error.response && error.response.statusCode === 409) {
logger.warn(`Kubernetes job ${jobInfo.jobName} already exists.`);
} else {
logger.error(error);
res.status(500).json({
message: `Failed to start job to fix NvidiaGPULowPerf`,
});
}
});
});
};
// module exports
module.exports = {
cordonNodes,
fixNvidiaGPULowPerf,
};

Просмотреть файл

@ -23,6 +23,7 @@ require('module-alias/register');
const express = require('express');
const bearerToken = require('express-bearer-token');
const actions = require('@alert-handler/routes/actions');
const k8sController = require('@alert-handler/controllers/kubernetes');
const logger = require('@alert-handler/common/logger');
const app = express();
@ -36,3 +37,6 @@ const port = parseInt(process.env.SERVER_PORT);
app.listen(port, () => {
logger.info(`alert-handler listening at http://localhost:${port}`);
});
// check completed jobs which were used to fix NvidiaGPULowPerf issue every 1 hour
setInterval(k8sController.cleanTTL24HJobs, 60 * 60 * 1000);

Просмотреть файл

@ -50,4 +50,9 @@ router
/** POST /alert-handler/cordon-nodes */
.post(nodeController.cordonNodes);
router
.route('/alert-handler/fix-nvidia-gpu-low-perf')
/** POST /alert-handler/fix-nvidia-gpu-low-perf */
.post(nodeController.fixNvidiaGPULowPerf);
module.exports = router;

Просмотреть файл

@ -0,0 +1,12 @@
#!/bin/bash
set -ex
echo "MINOR_NUMBER: ${MINOR_NUMBER}"
nvidia-smi -pm ENABLED -i ${MINOR_NUMBER}
MAX_MEMORY_CLOCK=$(nvidia-smi -q -d SUPPORTED_CLOCKS | grep Memory | awk -v max=0 '{if($3>max){max=$3}}END{print max}')
MAX_GRAPHICS_CLOCK=$(nvidia-smi -q -d SUPPORTED_CLOCKS | grep Graphics | awk -v max=0 '{if($3>max){max=$3}}END{print max}')
echo "MAX_MEMORY_CLOCK: ${MAX_MEMORY_CLOCK}, MAX_GRAPHICS_CLOCK: ${MAX_GRAPHICS_CLOCK}"
nvidia-smi -ac ${MAX_MEMORY_CLOCK},${MAX_GRAPHICS_CLOCK} -i ${MINOR_NUMBER}