Commit graph

151 commits

Author SHA1 Message Date
Guoxin c6e44085d0
add username label to PAI job low gpu utilization alert (#5499) 2021-06-03 14:06:48 +08:00
Starmie@Choice Specs fb67aea7f7
Change logs about add remove node (#5427)
* warning -> error in k8s generator

* verbose -> silence
2021-04-15 21:23:12 +08:00
Starmie@Choice Specs 32898e6858
Update docs about config.yaml (#5421)
* warn if config.yaml does not exist on the cluster
* remind the user to check config.yaml when add / remove nodes
2021-04-13 16:05:46 +08:00
Starmie@Choice Specs 3ca79e2d9f
Add / remove nodes using paictl.py (#5400) 2021-04-01 10:30:48 +08:00
Guoxin f559e978a6
[alert-handler] auto-fix NVIDIA GPU low performance issue with temp k8s jobs (#5383) 2021-03-31 17:12:39 +08:00
Starmie@Choice Specs 9a7a60a5cf
add config.yaml into push list (#5394)
* add config.yaml into push list

* print messages if config file not found

* check all folders

* kube_config_path not included.

* modified error messages

* remove redundant blank lines
2021-03-25 14:00:35 +08:00
Guoxin 23d41e37c9
add cluster-utilization report doc (#5331) 2021-03-02 10:45:10 +08:00
Guoxin 59146fa6fa
[alert-manager] move pai-bearer-token config field under alert-manager (#5324) 2021-02-28 20:17:25 +08:00
Guoxin edc67c2b10
send regular GPU utilization report with CronJob (#5281) 2021-02-07 13:22:30 +08:00
yiyione b8fa58782a
[Marketplace] Simplify marketplace setting (#5283)
* add prerequisite in marketplace-webportal and marketplace-restserver, auto use config from their prerequisite services

* Fix template

* Fix template

* Fix template

* auto use https in marketplace if pylon.ssl

* auto use https in marketplace if pylon.ssl

* Add deploy marketplace to quickstart

* fix template

* Add config to enable/disable the auto-added marketplace-plugin

* Set marketplace: false in quick-start default services-configuration

* Add enable_marketplace option to config.yaml
2021-02-05 11:12:37 +08:00
Guoxin 864622bde6
Refine pai-version / cluster-id checking log (#5212)
* refine pai-version check log
* refine cluster-id check log
* refine pylint
2020-12-30 11:27:56 +08:00
Guoxin b5f9be333c
add skip-service-list arg to paictl service command (#5193) 2020-12-23 15:49:01 +08:00
Guoxin 448179bde7
[installation] use layout.yaml for installation from scratch and update schema (#5154)
* use layout.yaml instead of master.csv, worker.csv;
* move config.yaml, layout.yaml under contrib/kubespray folder; remove all the argument parse logic;
* update layout.yaml schema
* hard-code default cluster_cfg.layout.kubernetes;
* fix gpu-configuration files generation issues due to layout schema change
* change default branch / tag to 1.5 in config.yaml
* refine installation doc
2020-12-09 19:42:24 +08:00
Guoxin 97641692c2
Alert Email Template Refine (#5064)
* add kill user job alert template
* add troubleshooting link in general email template
* allow customized templates
* redefine actions schema in customized-receivers to facilitate parameters passing
2020-11-12 20:21:10 +08:00
Binyang2014 2ccdc1b958
Add permission and new API in log-manager (#5046)
Add permission check in log-manager
Add log-manager API to retrieve log
2020-11-11 18:56:01 +08:00
Guoxin 6cb7f8d242
Alert Severity (#5055)
* define alerts severity

* show severity in email

* introduce alert severity in doc and examples

* show severity in web-portal
2020-11-09 15:33:05 +08:00
Guoxin 0df8edf8c4
alert doc refine: remind admin users of the stop-job action usage (#5030) 2020-10-29 19:21:31 +08:00
shaiic-pai cf4e6a83c1
Alert-manager: Kill low-gpu-utilization jobs, tag abnormal jobs (#4940)
* alert manager based gpu utilization enhancement
2020-09-29 16:44:19 +08:00
Yuqi Wang a5cc5b482c
Fix transient Pod MatchNodeSelector failed as kubelet label missing (#4907) 2020-09-15 17:59:24 +08:00
Yuqi Wang 630cf383b6
Tune FC, ControllerManager and ApiServer to serve large concurrent active frameworks (>10k) (#4864) 2020-09-01 17:22:58 +08:00
YundongYe 972109f418
[Doc] Remove yarn content from deployment doc (#4447) 2020-04-26 10:22:48 +08:00
YundongYe 14016f8be1
[cluster-object-model] Disable yarn value in cluster-type (#4445) 2020-04-24 17:07:59 +08:00
Yifan Xiong aaf3ac80d4
Fix Azure File issues in storage (#4438)
Fix Azure File issues in storage.
2020-04-24 10:34:14 +08:00
Zhiyuan He 51a6e1fa16
Add a notice if user still wants to use yarn version (#4340)
* init

* fix

* fix
2020-03-31 10:10:40 +08:00
YundongYe 7aa68d5dff
[paictl] Notify user to deploy k8s with kubespray. (#4180) 2020-02-24 20:25:25 +08:00
YundongYe d659ffd97d
[Cluster Object Model] Generate necessary service config based on cluster type (#4036)
* init commit

* [Cluster Object Model] Check launcher-type and log-type based on cluster-type. (#4038)

* Remove .k8s.yaml logic. And remove launcher-type & log-type

* [Cluster Object Model] Don't check hadoop rm uri when in pure k8s cluster. (#4052)

* [Rest-server] Add env, volume based on the cluster-type (#4058)

* Remove yarn config from pylon when cluster-type is k8s (#4063)

* [cluster object model] Document update. (#4066)

* From openpai_parser_type to service_type
2020-01-02 11:00:14 +08:00
Binyang2014 a80763ff3a
[k8s-dashboard] Enable https for k8s-dashboard (#4032)
* enable https

* fix ut

* remove dashboard roles

* bug fix
2019-12-25 10:32:30 +08:00
Yifan Xiong 141301d4ad
Fix typos (#3980)
* Fix typos

Fix typos by [misspell](https://github.com/client9/misspell).

* Fix typos in api

Fix typos in api.

* Add spelling check in GitHub Actions

Add spelling check using misspell in GitHub Actions.
2019-12-10 11:37:07 +08:00
Yuqi Wang 6522eb5f7d
Explicitly config and tune FIFO (#3977) 2019-12-06 17:27:26 +08:00
WangDian 19dc616285
K8S managed NFS+SMB storage (#3826)
Add service storage-manager
Storage-manager is a tool that helps admin to deploy NFS+SMB service on configured nodes.
2019-11-13 16:04:07 +08:00
Yifan Xiong 308b62f54c
Fix kubelet.service in add machines (#3807)
Fix kubelet.service in add machines.
2019-11-07 20:05:26 +08:00
Yuqi Wang 4db57ca4fc
By default ensure eviction can be still processed slowly instead of totally stopped (#3795) 2019-11-05 13:24:34 +08:00
Double Young 2f060ca59d
Deployment for EFK to support job history (#3626) 2019-10-29 16:32:17 +08:00
Yifan Xiong d55ff4b793
Move GPU device plugin to service (#3744)
* Move GPU device plugin to service

Move GPU device plugin to service.

* Move stop script to delete

Move stop script to delete.

* Add configuration for device plugin

Add configuration for device plugin.

* Fix unit test

Fix unit test.
2019-10-19 19:34:48 +08:00
Yifan Xiong 302450d53b
Support k8s services deployment in RBAC (#3709)
* Support k8s services deployment in RBAC

Support framework controller and hived scheduler deployment in RBAC.
2019-10-16 11:03:22 +08:00
Binyang2014 4f951282c0
Add more logs for deploy (#3734) 2019-10-15 21:24:45 +08:00
Yuqi Wang 5b5c684939
Tune default scheduler to support Job FIFO (#3731) 2019-10-14 20:30:20 +08:00
Yuqi Wang 502ca8ce47
Tune default scheduler to support Job FIFO (#3726) 2019-10-14 17:18:03 +08:00
Yuqi Wang 74b92690e2
Tune ApiServer for larger workload (#3715) 2019-10-11 15:40:55 +08:00
Yuqi Wang 1016a98336
Increase container log max size and keep single log file
Because K8S does not recognize multiple rotated files yet
2019-10-08 17:43:26 +08:00
Yifan Xiong 3812b5fd7f
Fix paictl issues during deployment (#3622)
* Use `sudo -S` to read password from stdin
* Update deprecated `yaml.load()`
2019-09-16 17:41:31 +08:00
Yifan Xiong 55a848b462
Remove password from stdin in paictl (#3577)
Remove password from stdin in paictl.
2019-09-06 10:19:24 +08:00
Yifan Xiong 6e3e3ff27f
Add service deployment for hived scheduler (#3495)
* Add service deployment for hived scheduler

Add service deployment for hived scheduler.

* Add service config for hived scheduler

Add service config for hived scheduler.

* Update

Update.

* Add cluster type for k8s services

Add cluster type for k8s services.

* Remove config validation temporarily

Remove hived scheduler config validation temporarily,
because the paictl validation checks all services.

* Rename configmap.yaml to hivedscheduler-config.yaml

Rename configmap.yaml to hivedscheduler-config.yaml.
2019-09-05 15:27:08 +08:00
Yifan Xiong a1332d3612
[Deployment] Choose services for different cluster type (#3528)
* Choose services for different cluster type

Choose services for different cluster type in deployment.

* Update cluster type in service.yaml

Update cluster type in service.yaml.

* Change the initial cluster type to None

Change the initial cluster type to None.
2019-09-03 16:46:27 +08:00
Yifan Xiong 18414d9cc6
Fix error StatefulSet kind in template generation (#3544)
Fix error StatefulSet kind in template generation.
2019-09-03 15:49:55 +08:00
WangDian 7366dab626
Add cluster-type and related config (#3437)
* Add cluster-type and related config

Use [service name].[cluster type].yaml file for default config for cluster type.
If the cluster-type config file is not found, use [service name].yaml instead.
2019-08-27 03:16:06 -07:00
Yifan Xiong 1e63ed3fb2
[Deployment] Support to manage a list of services in paictl (#3432)
* Support to manage a list of services in paictl

Support to manage a list of services in paictl.

Fixes #2734.

* Update related docs

Update related docs.
2019-08-24 15:25:58 +08:00
YundongYe dad368d11f
[rbac-dashboard] Issue Fix. (#3422) 2019-08-22 14:59:57 +08:00
YundongYe c8bc94f449
[config] Allow to only upload layout and service config (#3419) 2019-08-22 10:05:23 +08:00
Yifan Xiong 25fdc8b374
Remove deprecated flag for kubelet (#3409)
Remove deprecated `--require-kubeconfig` flag for kubelet.
2019-08-21 13:35:31 +08:00