Commit Graph

34 commits

Author SHA1 Message Date
Yuqi Wang a5cc5b482c
Fix transient Pod MatchNodeSelector failed as kubelet label missing (#4907) 2020-09-15 17:59:24 +08:00
Yuqi Wang 630cf383b6
Tune FC, ControllerManager and ApiServer to serve large concurrent active frameworks (>10k) (#4864) 2020-09-01 17:22:58 +08:00
Yifan Xiong aaf3ac80d4
Fix Azure File issues in storage (#4438)
Fix Azure File issues in storage.
2020-04-24 10:34:14 +08:00
Yuqi Wang 6522eb5f7d
Explicitly config and tune FIFO (#3977) 2019-12-06 17:27:26 +08:00
Yifan Xiong 308b62f54c
Fix kubelet.service in add machines (#3807)
Fix kubelet.service in add machines.
2019-11-07 20:05:26 +08:00
Yuqi Wang 4db57ca4fc
By default ensure eviction can be still processed slowly instead of totally stopped (#3795) 2019-11-05 13:24:34 +08:00
Double Young 2f060ca59d Deployment for EFK to support job history (#3626) 2019-10-29 16:32:17 +08:00
Yuqi Wang 5b5c684939
Tune default scheduler to support Job FIFO (#3731) 2019-10-14 20:30:20 +08:00
Yuqi Wang 502ca8ce47
Tune default scheduler to support Job FIFO (#3726) 2019-10-14 17:18:03 +08:00
Yuqi Wang 74b92690e2
Tune ApiServer for larger workload (#3715) 2019-10-11 15:40:55 +08:00
Yuqi Wang 1016a98336
Increase container log max size and keep single log file
Because K8S does not recognize multiple rotated files yet
2019-10-08 17:43:26 +08:00
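The change above is, in effect, a log-rotation setting. A minimal sketch of the idea at the kubelet level follows; the log does not show which mechanism the commit actually used (with the Docker runtime of that era it may instead have been `json-file` log-opts in the Docker daemon config), and the values are illustrative:

```yaml
# Illustrative KubeletConfiguration fragment, not the commit's actual values.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 1Gi  # raise the per-container log cap before rotation kicks in
containerLogMaxFiles: 2   # kubelet's minimum; `kubectl logs` reads only the newest file
```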
Yifan Xiong 25fdc8b374
Remove deprecated flag for kubelet (#3409)
Remove deprecated `--require-kubeconfig` flag for kubelet.
2019-08-21 13:35:31 +08:00
YundongYe d019b5af35
[kubelet] Start kubelet through systemd (#3307) 2019-08-05 11:11:20 +08:00
Di Xu 8afa0eb76a
add script to upgrade etcd aiding monitoring etcd (#2159) 2019-02-20 17:11:11 +08:00
YundongYe 6b2842f616
[Pod Eviction] Disable kubernetes's pod eviction (#2124)
* According to https://github.com/kubernetes/kubernetes/issues/71661

* add imagegc threshold
2019-02-11 17:27:19 +08:00
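Per the linked issue, the kubelet's own eviction manager was being switched off while image garbage-collection thresholds were added. A hedged sketch of what such a kubelet configuration can look like (values are illustrative, not taken from the commit):

```yaml
# Illustrative KubeletConfiguration fragment, not the commit's actual values.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard: {}                 # clear hard-eviction thresholds so the kubelet stops evicting pods
imageGCHighThresholdPercent: 90  # start image GC once disk usage passes 90%
imageGCLowThresholdPercent: 80   # free images until usage is back under 80%
```

Note that a later commit in this log (4db57ca4fc) walks this back so that eviction still proceeds slowly rather than stopping entirely.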
Hao Yuan 2d915ebc1c
Support AKS deployment (#1980)
* Support AKS deployment

passing POD_IP for hadoop-data-node

Add kubernetes to services-configuration.yaml

It looks like the following:
cluster:
  kubernetes:
    api-server-url: http://master_ip:8080
    dashboard-host: master_ip

generate restserver ip from machine_list

TODO: should avoid using machine_list

remove post check for "kubernetes" config

TODO: will need to make sure the kubernetes config entry is correct

add layout.yaml for PAI service deployment layout

layout.yaml looks like below:

machine-sku:
  GENERIC:
    mem: 1
    gpu:
      type: generic
      count: 1
    cpu:
      vcore: 1
    os: ubuntu16.04

machine-list:
  - hostname: openm-a-gpu-1000
    hostip: 10.151.41.69
    machine-type: GENERIC
    zkid: "1"
    pai-master: "true"
    pai-worker: "true"

prototype of "layout" cmd

fix jenkins pipeline

hack the ci

fix machine-sku format

hack the ci

demonstrate "paictl.py check" command

fix machine-sku format

change True to 'true' for compatibility

separate config for kubernetes and PAI services

fix jenkins

fix jenkins

fix "machine" cmd

fix k8s_set_environment

remove class Main

simplify sub-command creation

refine service_management_configuration

warn instead of quitting when yaml is not found!

separate k8s operations from pai

fix layout

generate kubernetes info in services-configuration.yaml

quick-start would generate k8s info

refine Jenkinsfile

refine, and fix error

workaround for ci

nodename is different on premise and on AKS

refine layout generator

generate kubernetes info in services-configuration.yaml

move kubernetes config to layout.yaml

double quote the strings in layout.yaml

fix layout config validator

fix load-balance-ip missing

Add a Kubernetes proxy to REST server, make webportal to request it (#1856)

* REST server: k8s apiserver proxy

* Object model changes

* Update rest_server.py

REST server: leverage kubernetes service account (#1911)

add auth to visit k8s secret (#1951)

[AKS] api-server RBAC access for watchdog (#1710)

* [aks] prometheus leverage service account to access api-server

* add tls config for prometheus over aks deployment

* mv config of tls to api-server domain as other services may use it

* add watchdog auth for api-server

* mv path to config

* fix bug

* refine code

* refine the kubelet check

* refine the note

* refine the bearer file path read location

* refactor the watchdog parameter

* refine and update the param

* make this flag

* add default value at object model

* fix code

* refactor the object model and comment

* add the return

* remove prometheus and tls config, only keep watchdog change

* rm the indent

* rm the indent 2

* remove hosts config

* make api-server-scheme config

[aks] prometheus attach token for access api-server (#1939)

* prometheus accesses api-server by attaching the service account token

* rm duplicated

[hot fix] prometheus & watchdog yaml change use cluster_cfg object model

fix config entry change

pass the UT, fix it later

fix test_template_generate.py

[aks] workaround k8s config not found

fix generated layout.yaml

fix: don't override nodename

fix authorization token in user secret

fix typo "cleaining-image" -> "cleaning-image"

fix: travis lint

allow generating "Service" template

add pylon option "deploy-service"

align PR prom and watchdog auth to AKS cluster to new deploy branch

reset the src/webportal/build/build-pre.sh

add aks draft doc

refactoring service_management_configuration

refactoring get_service_list()

refactoring "get_dependency()"

refactoring service_start

refactoring service_stop

refactoring service_delete

refactoring service_refresh

add disable rbac param

clean up deprecated code

Make kubernetes dashboard config as url (#2028)

* Make kubernetes dashboard config as url

* generate dashboard-url

* comment out post checkout of dashboard-url

fix breaking import

refine method name

clean TODOs

delete debug code

remove k8s dashboard part

add generate_layout.md

add prepare_env.md

refine aks doc

add --force option for layout command

more details for install kubectl

reformat

refine the num

fix dashboard_url missing

refine config command

refine service command

refine cluster command

refine machine command

refine layout command

refine ConfigCmd

refine LayoutCmd

fix broken import

refine ClusterCmd, MachineCmd and ServiceCmd

explain how cmd is handled

refine CheckCmd, it's a dummy impl now

fix broken link

no longer needed, remove changes

resolve duplicated code

remove useless pylon config entry

"check" command is not ready

fix paictl.py

remove useless file

fix typo

fix mismatch config

* add switch for tls_config

* fix comments

* refine

* using logger

* update layout doc

* put Imports at the top of the file

* fix broken link
2019-02-01 10:03:50 +08:00
YundongYe 5ea44bf8e0
[QoS] qos switch to disable the resource requirement. (#1934) 2018-12-24 14:30:59 +08:00
YundongYe cb8edfaf5d
Make api-server-url configurable (#1880) 2018-12-20 09:45:01 +08:00
Di Xu 17b49164c9
add TaintBasedEvictions (#1905) 2018-12-19 11:34:26 +08:00
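TaintBasedEvictions was a feature gate at the time (it later graduated and the gate was removed); it makes the node lifecycle controller evict pods through `NoExecute` taints with per-pod toleration timeouts instead of deleting them directly. A hypothetical static-pod fragment showing where such a gate is typically switched on (not taken from the commit):

```yaml
# Illustrative kube-controller-manager manifest fragment.
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --feature-gates=TaintBasedEvictions=true  # evict via NoExecute taints rather than direct deletion
```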
YundongYe 182f38b524
Enable Cluster object model (#1824)
* [Cluster Object Model] main module development (#1683)

* [Cluster Object Model] Modify [ paictl cluster ] command, based on com. (#1712)

* Modify class names, PEP8 (#1720)

* [cluster object model] Change paictl machine's code, based on com. (#1723)

* Add a tutorial on how to add new service configuration into com (#1701)

* [Cluster Object Model] Modify [paictl service] command, based on new cluster-object-model (#1735)
2018-12-05 17:31:49 +08:00
Hao Yuan 5017d411f4
fix memory limits (#1707)
* [hotfix] remove memory limits for k8s services and increase it for rm, zookeeper (#1619)

* increase etcd memory limits to 8Gi

* set the "requests" instead of "limits" for k8s services

* increase memory limits for rm

* increase memory limit for zookeeper

* add "requests" for rm

* fix typo

* [hot-fix] increase data-node memory limits to 4Gi (#1689)

* increase data-node memory limits to 4Gi

* increase data-node max java heapsize to 2G

* reserve more memory on PAI worker
2018-11-16 11:57:30 +08:00
YundongYe 0b9f0083bd
Remove etcd data path hard code (#1539)
* Bug fix.

* Add some guide message
2018-10-18 10:03:27 +08:00
Di Xu f4e934bcd3
add nvidia gpu support for pai-lite (#1437) 2018-10-11 13:08:22 +08:00
ZhaoYu Dong 0b6229ea90
Disk config (#1450)
* add daemon config temp

* change config

* remove old daemon json
2018-10-09 14:13:31 +08:00
YundongYe b8ec568788
Enable kubelet healthz port and address. (#1455)
* Update update library.

* Update update library.
2018-10-08 10:46:26 +08:00
Hao Yuan d9cf1d5f89
Add memory limit for all PAI services, make it 'Burstable' Qos class (#1384)
* set kubernetes memory eviction threshold

To reach that capacity, either some Pod is using more than its request,
or the system is using more than 3Gi - 1Gi = 2Gi.

* set those pods as 'Guaranteed' QoS:

node-exporter
hadoop-node-manager
hadoop-data-node
drivers-one-shot

* Set '--oom-score-adj=1000' for job container

so it would be OOM-killed first

* set those pods as 'Burstable' QoS:

prometheus
grafana

* set those pods as 'Guaranteed' QoS:

frameworklauncher
hadoop-jobhistory
hadoop-name-node
hadoop-resource-manager
pylon
rest-server
webportal
zookeeper

* adjust services memory limits

* add k8s services resource limit

* seems 1G is not enough for launcher

* adjust hadoop-resource-manager limit

* adjust webportal memory limit

* adjust cpu limits

* rm yarn-exporter resource limits

* adjust prometheus limits

* adjust limits

* frameworklauncher: set JAVA_OPTS="-server -Xmx512m"

zookeeper: set JAVA_OPTS="-server -Xmx512m"

fix env name to JAVA_OPTS

fix zookeeper

* add heapsize limit for hadoop-data-node hadoop-jobhistory

* add xmx for hadoop

* modify memory limits

* reserve 40g for singlebox, else reserve 12g

* using LAUNCHER_OPTS

* revert zookeeper dockerfile

* adjust node manager memory limit

* drivers would take more memory when install

* increase memory for zookeeper and launcher

* set requests to a lower value

* comment it out, using the container env "YARN_RESOURCEMANAGER_HEAPSIZE"

* add comments
2018-09-27 15:06:46 +08:00
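The QoS classes named in this commit follow directly from how requests and limits are set: requests equal to limits in every container gives Guaranteed, a request below the limit (or only one of the two set) gives Burstable, and under memory pressure Burstable pods are reclaimed before Guaranteed ones. A hypothetical pair of pod specs illustrating the distinction (names and values are invented for the example):

```yaml
# Guaranteed: requests == limits for every container.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
spec:
  containers:
  - name: app
    image: example-image
    resources:
      requests: {cpu: "1", memory: 2Gi}
      limits:   {cpu: "1", memory: 2Gi}
---
# Burstable: request below the limit; evicted before Guaranteed pods
# when the node crosses its memory eviction threshold.
apiVersion: v1
kind: Pod
metadata:
  name: burstable-example
spec:
  containers:
  - name: app
    image: example-image
    resources:
      requests: {memory: 1Gi}
      limits:   {memory: 2Gi}
```

The eviction arithmetic in the commit body follows the same scheme: with 3Gi of capacity and a 1Gi eviction threshold, pressure begins once usage exceeds 3Gi - 1Gi = 2Gi.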
Yitong Feng 4d2011321d
move label machine step from kubelet start into service deployment (#1403) 2018-09-26 14:18:51 +08:00
Di Xu bf9f361c4e
collect network metrics for containers (#1418) 2018-09-25 14:40:08 +08:00
FAREAST\canwan 6e29174edd Merge remote-tracking branch 'origin/yuye/bug_fix' into canwan/resolve-conflict
# Conflicts:
#	deployment/k8sPaiLibrary/maintainconf/deploy.yaml
2018-09-11 16:23:36 +08:00
FAREAST\canwan 482178010f Merge branch 'master' into canwan/resolve-conflict 2018-09-11 10:26:56 +08:00
FAREAST\canwan edd23e703e fix kubernetes-cleanup 2018-09-10 17:48:18 +08:00
FAREAST\canwan c5153d02a8 fix bug in dev-box.dockerfile 2018-09-10 13:54:43 +08:00
FAREAST\canwan e430930267 Merge remote-tracking branch 'origin/master' into canwan/update-jenkins
# Conflicts:
#	Jenkinsfile
#	deployment/k8sPaiLibrary/maintainconf/clean.yaml
#	deployment/k8sPaiLibrary/maintaintool/kubernetes-cleanup.sh
#	docs/pai-management/doc/add-service.md
#	docs/pai-management/doc/cluster-bootup.md
#	docs/pai-management/doc/how-to-write-pai-configuration.md
#	pai-management/bootstrap/alert-manager/delete.sh
#	pai-management/bootstrap/alert-manager/refresh.sh.template
#	pai-management/bootstrap/alert-manager/service.yaml
#	pai-management/bootstrap/alert-manager/stop.sh
#	pai-management/bootstrap/drivers/node-label.sh.template
#	pai-management/bootstrap/hadoop-jobhistory/node-label.sh.template
#	pai-management/bootstrap/hadoop-node-manager/hadoop-node-manager-delete/delete-data.sh
#	pai-management/container-setup.sh
#	pai-management/k8sPaiLibrary/maintaintool/kubernetes-cleanup.sh
#	pai-management/k8sPaiLibrary/template/kubernetes-cleanup.sh.template
#	pai-management/paiLibrary/paiBuild/build_center.py
#	pai-management/paiLibrary/paiBuild/hadoop_ai_build.py
#	paictl.py
#	prometheus/doc/exporter-for-other-services.md
#	src/dev-box/build/container-setup.sh
#	src/dev-box/build/dev-box.dockerfile
#	src/drivers/deploy/node-label.sh.template
#	src/hadoop-ai/build/build-pre.sh
#	src/hadoop-jobhistory/deploy/node-label.sh.template
#	src/hadoop-name-node/deploy/node-label.sh.template
#	src/hadoop-node-manager/deploy/hadoop-node-manager-delete/delete-data.sh
#	src/hadoop-node-manager/deploy/node-label.sh.template
#	src/hadoop-resource-manager/deploy/node-label.sh.template
#	src/zookeeper/deploy/node-label.sh.template
2018-09-10 13:07:56 +08:00
YundongYe 9612c7dd0c
Refactor kubernetes deployment (#1064) 2018-08-13 13:19:07 +08:00