Commit Graph

34 commits

Author SHA1 Message Date
Yuqi Wang a5cc5b482c
Fix transient Pod MatchNodeSelector failed as kubelet label missing (#4907) 2020-09-15 17:59:24 +08:00
Yuqi Wang 630cf383b6
Tune FC, ControllerManager and ApiServer to serve large concurrent active frameworks (>10k) (#4864) 2020-09-01 17:22:58 +08:00
Yifan Xiong aaf3ac80d4
Fix Azure File issues in storage (#4438)
Fix Azure File issues in storage.
2020-04-24 10:34:14 +08:00
Yuqi Wang 6522eb5f7d
Explicitly config and tune FIFO (#3977) 2019-12-06 17:27:26 +08:00
Yifan Xiong 308b62f54c
Fix kubelet.service in add machines (#3807)
Fix kubelet.service in add machines.
2019-11-07 20:05:26 +08:00
Yuqi Wang 4db57ca4fc
By default ensure eviction can be still processed slowly instead of totally stopped (#3795) 2019-11-05 13:24:34 +08:00
Double Young 2f060ca59d Deployment for EFK to support job history (#3626) 2019-10-29 16:32:17 +08:00
Yuqi Wang 5b5c684939
Tune default scheduler to support Job FIFO (#3731) 2019-10-14 20:30:20 +08:00
Yuqi Wang 502ca8ce47
Tune default scheduler to support Job FIFO (#3726) 2019-10-14 17:18:03 +08:00
Yuqi Wang 74b92690e2
Tune ApiServer for larger workload (#3715) 2019-10-11 15:40:55 +08:00
Yuqi Wang 1016a98336
Increase container log max size and keep single log file
Because K8S does not recognize multiple rotated files yet
2019-10-08 17:43:26 +08:00
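The change above is, in effect, a log-rotation setting. A minimal sketch of the idea at the kubelet level follows; the log does not show which mechanism the commit actually used (with the Docker runtime of that era it may instead have been `json-file` log-opts in the Docker daemon config), and the values are illustrative:

```yaml
# Illustrative KubeletConfiguration fragment, not the commit's actual values.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 1Gi  # raise the per-container log cap before rotation kicks in
containerLogMaxFiles: 2   # kubelet's minimum; `kubectl logs` reads only the newest file
```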
Yifan Xiong 25fdc8b374
Remove deprecated flag for kubelet (#3409)
Remove deprecated `--require-kubeconfig` flag for kubelet.
2019-08-21 13:35:31 +08:00
YundongYe d019b5af35
[kubelet] Start kubelet through systemd (#3307) 2019-08-05 11:11:20 +08:00
Di Xu 8afa0eb76a
add script to upgrade etcd aiding monitoring etcd (#2159) 2019-02-20 17:11:11 +08:00
YundongYe 6b2842f616
[Pod Eviction] Disable kubernetes's pod eviction (#2124)
* According to https://github.com/kubernetes/kubernetes/issues/71661

* add imagegc threshold
2019-02-11 17:27:19 +08:00
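Per the linked issue, the kubelet's own eviction manager was being switched off while image garbage-collection thresholds were added. A hedged sketch of what such a kubelet configuration can look like (values are illustrative, not taken from the commit):

```yaml
# Illustrative KubeletConfiguration fragment, not the commit's actual values.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard: {}                 # clear hard-eviction thresholds so the kubelet stops evicting pods
imageGCHighThresholdPercent: 90  # start image GC once disk usage passes 90%
imageGCLowThresholdPercent: 80   # free images until usage is back under 80%
```

Note that a later commit in this log (4db57ca4fc) walks this back so that eviction still proceeds slowly rather than stopping entirely.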
Hao Yuan 2d915ebc1c
Support AKS deployment (#1980)
* Support AKS deployment

passing POD_IP for hadoop-data-node

Add kubernetes to services-configuration.yaml

It looks like the following:
cluster:
  kubernetes:
    api-server-url: http://master_ip:8080
    dashboard-host: master_ip

generate restserver ip from machine_list

TODO: should avoid using machine_list

remove post check for "kubernetes" config

TODO: will need to make sure the kubernetes config entry is correct

add layout.yaml for PAI service deployment layout

layout.yaml looks like below:

machine-sku:
  GENERIC:
    mem: 1
    gpu:
      type: generic
      count: 1
    cpu:
      vcore: 1
    os: ubuntu16.04

machine-list:
  - hostname: openm-a-gpu-1000
    hostip: 10.151.41.69
    machine-type: GENERIC
    zkid: "1"
    pai-master: "true"
    pai-worker: "true"

prototype of "layout" cmd

fix jenkins pipeline

hack the ci

fix machine-sku format

hack the ci

demonstrate "paictl.py check" command

fix machine-sku format

change True to 'true' for compatibility

separate config for kubernetes and PAI services

fix jenkins

fix jenkins

fix "machine" cmd

fix k8s_set_environment

remove class Main

simplify sub-command creation

refine service_management_configuration

warn instead of quitting when yaml is not found!

separate k8s operations from pai

fix layout

generate kubernetes info in services-configuration.yaml

quick-start would generate k8s info

refine Jenkinsfile

refine, and fix error

workaround for ci

nodename is different on premise and on AKS

refine layout generator

generate kubernetes info in services-configuration.yaml

move kubernetes config to layout.yaml

double quote the strings in layout.yaml

fix layout config validator

fix load-balance-ip missing

Add a Kubernetes proxy to REST server, make webportal to request it (#1856)

* REST server: k8s apiserver proxy

* Object model changes

* Update rest_server.py

REST server: leverage kubernetes service account (#1911)

add auth to visit k8s secret (#1951)

[AKS] api-server RBAC access for watchdog (#1710)

* [aks] prometheus leverage service account to access api-server

* add tls config for prometheus over aks deployment

* mv config of tls to api-server domain as other services may use it

* add watchdog auth for api-server

* mv path to config

* fix bug

* refine code

* refine the kubelet check

* refine the note

* refine the bearer file path read location

* refactor the watchdog parameter

* refine and update the param

* make this flag

* add default value at object model

* fix code

* refactor the object model and comment

* add the return

* remove prometheus and tls config, only keep watchdog change

* rm the indent

* rm the indent 2

* remove hosts config

* make api-server-scheme config

[aks] prometheus attach token for access api-server (#1939)

* prometheus accesses api-server by attaching the service account token

* rm duplicated

[hot fix] prometheus & watchdog yaml change use cluster_cfg object model

fix config entry change

pass the UT, fix it later

fix test_template_generate.py

[aks] workaround k8s config not found

fix generated layout.yaml

fix: don't override nodename

fix authorization token in user secret

fix typo "cleaining-image" -> "cleaning-image"

fix: travis lint

allow generating "Service" template

add pylon option "deploy-service"

align PR prom and watchdog auth to AKS cluster to new deploy branch

reset the src/webportal/build/build-pre.sh

add aks draft doc

refactoring service_management_configuration

refactoring get_service_list()

refactoring "get_dependency()"

refactoring service_start

refactoring service_stop

refactoring service_delete

refactoring service_refresh

add disable rbac param

clean up deprecated code

Make kubernetes dashboard config as url (#2028)

* Make kubernetes dashboard config as url

* generate dashboard-url

* comment out post checkout of dashboard-url

fix breaking import

refine method name

clean TODOs

delete debug code

remove k8s dashboard part

add generate_layout.md

add prepare_env.md

refine aks doc

add --force option for layout command

more details for install kubectl

reformat

refine the num

fix dashboard_url missing

refine config command

refine service command

refine cluster command

refine machine command

refine layout command

refine ConfigCmd

refine LayoutCmd

fix broken import

refine ClusterCmd, MachineCmd and ServiceCmd

explain how cmd is handled

refine CheckCmd, it's a dummy impl now

fix broken link

no longer needed, remove changes

resolve duplicated code

remove useless pylon config entry

"check" command is not ready

fix paictl.py

remove useless file

fix typo

fix mismatch config

* add switch for tls_config

* fix comments

* refine

* using logger

* update layout doc

* put Imports at the top of the file

* fix broken link
2019-02-01 10:03:50 +08:00
YundongYe 5ea44bf8e0
[QoS] qos switch to disable the resource requirement. (#1934) 2018-12-24 14:30:59 +08:00
YundongYe cb8edfaf5d
Make api-server-url configurable (#1880) 2018-12-20 09:45:01 +08:00
Di Xu 17b49164c9
add TaintBasedEvictions (#1905) 2018-12-19 11:34:26 +08:00
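TaintBasedEvictions was a feature gate at the time (it later graduated and the gate was removed); it makes the node lifecycle controller evict pods through `NoExecute` taints with per-pod toleration timeouts instead of deleting them directly. A hypothetical static-pod fragment showing where such a gate is typically switched on (not taken from the commit):

```yaml
# Illustrative kube-controller-manager manifest fragment.
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --feature-gates=TaintBasedEvictions=true  # evict via NoExecute taints rather than direct deletion
```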
YundongYe 182f38b524
Enable Cluster object model (#1824)
* [Cluster Object Model] main module development (#1683)

* [Cluster Object Model] Modify [ paictl cluster ] command, based on com. (#1712)

* Modify class names, PEP8 (#1720)

* [cluster object model] Change paictl machine's code, based on com. (#1723)

* Add a tutorial on how to add new service configuration into com (#1701)

* [Cluster Object Model] Modify [paictl service] command, based on new cluster-object-model (#1735)
2018-12-05 17:31:49 +08:00
Hao Yuan 5017d411f4
fix memory limits (#1707)
* [hotfix] remove memory limits for k8s services and increase it for rm, zookeeper (#1619)

* increase etcd memory limits to 8Gi

* set the "requests" instead of "limits" for k8s services

* increase memory limits for rm

* increase memory limit for zookeeper

* add "requests" for rm

* fix typo

* [hot-fix] increase data-node memory limits to 4Gi (#1689)

* increase data-node memory limits to 4Gi

* increase data-node max java heapsize to 2G

* reserve more memory on PAI worker
2018-11-16 11:57:30 +08:00
YundongYe 0b9f0083bd
Remove etcd data path hard code (#1539)
* Bug fix.

* Add some guide message
2018-10-18 10:03:27 +08:00
Di Xu f4e934bcd3
add nvidia gpu support for pai-lite (#1437) 2018-10-11 13:08:22 +08:00
ZhaoYu Dong 0b6229ea90
Disk config (#1450)
* add daemon config temp

* change config

* remove old daemon json
2018-10-09 14:13:31 +08:00
YundongYe b8ec568788
Enable kubelet healthz port and address. (#1455)
* Update update library.

* Update update library.
2018-10-08 10:46:26 +08:00
Hao Yuan d9cf1d5f89
Add memory limit for all PAI services, make it 'Burstable' Qos class (#1384)
* set kubernetes memory eviction threshold

To reach that capacity, either some Pod is using more than its request,
or the system is using more than 3Gi - 1Gi = 2Gi.

* set those pods as 'Guaranteed' QoS:

node-exporter
hadoop-node-manager
hadoop-data-node
drivers-one-shot

* Set '--oom-score-adj=1000' for job container

so it would be OOM-killed first

* set those pods as 'Burstable' QoS:

prometheus
grafana

* set those pods as 'Guaranteed' QoS:

frameworklauncher
hadoop-jobhistory
hadoop-name-node
hadoop-resource-manager
pylon
rest-server
webportal
zookeeper

* adjust services memory limits

* add k8s services resource limit

* seems 1G is not enough for launcher

* adjust hadoop-resource-manager limit

* adjust webportal memory limit

* adjust cpu limits

* rm yarn-exporter resource limits

* adjust prometheus limits

* adjust limits

* frameworklauncher: set JAVA_OPTS="-server -Xmx512m"

zookeeper: set JAVA_OPTS="-server -Xmx512m"

fix env name to JAVA_OPTS

fix zookeeper

* add heapsize limit for hadoop-data-node hadoop-jobhistory

* add xmx for hadoop

* modify memory limits

* reserve 40g for singlebox, else reserve 12g

* using LAUNCHER_OPTS

* revert zookeeper dockerfile

* adjust node manager memory limit

* drivers would take more memory when install

* increase memory for zookeeper and launcher

* set requests to a lower value

* comment it out, using the container env "YARN_RESOURCEMANAGER_HEAPSIZE"

* add comments
2018-09-27 15:06:46 +08:00
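The QoS classes named in this commit follow directly from how requests and limits are set: requests equal to limits in every container gives Guaranteed, a request below the limit (or only one of the two set) gives Burstable, and under memory pressure Burstable pods are reclaimed before Guaranteed ones. A hypothetical pair of pod specs illustrating the distinction (names and values are invented for the example):

```yaml
# Guaranteed: requests == limits for every container.
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-example
spec:
  containers:
  - name: app
    image: example-image
    resources:
      requests: {cpu: "1", memory: 2Gi}
      limits:   {cpu: "1", memory: 2Gi}
---
# Burstable: request below the limit; evicted before Guaranteed pods
# when the node crosses its memory eviction threshold.
apiVersion: v1
kind: Pod
metadata:
  name: burstable-example
spec:
  containers:
  - name: app
    image: example-image
    resources:
      requests: {memory: 1Gi}
      limits:   {memory: 2Gi}
```

The eviction arithmetic in the commit body follows the same scheme: with 3Gi of capacity and a 1Gi eviction threshold, pressure begins once usage exceeds 3Gi - 1Gi = 2Gi.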
Yitong Feng 4d2011321d
move label machine step from kubelet start into service deployment (#1403) 2018-09-26 14:18:51 +08:00
Di Xu bf9f361c4e
collect network metrics for containers (#1418) 2018-09-25 14:40:08 +08:00
FAREAST\canwan 6e29174edd Merge remote-tracking branch 'origin/yuye/bug_fix' into canwan/resolve-conflict
# Conflicts:
#	deployment/k8sPaiLibrary/maintainconf/deploy.yaml
2018-09-11 16:23:36 +08:00
FAREAST\canwan 482178010f Merge branch 'master' into canwan/resolve-conflict 2018-09-11 10:26:56 +08:00
FAREAST\canwan edd23e703e fix kubernetes-cleanup 2018-09-10 17:48:18 +08:00
FAREAST\canwan c5153d02a8 fix bug in dev-box.dockerfile 2018-09-10 13:54:43 +08:00
FAREAST\canwan e430930267 Merge remote-tracking branch 'origin/master' into canwan/update-jenkins
# Conflicts:
#	Jenkinsfile
#	deployment/k8sPaiLibrary/maintainconf/clean.yaml
#	deployment/k8sPaiLibrary/maintaintool/kubernetes-cleanup.sh
#	docs/pai-management/doc/add-service.md
#	docs/pai-management/doc/cluster-bootup.md
#	docs/pai-management/doc/how-to-write-pai-configuration.md
#	pai-management/bootstrap/alert-manager/delete.sh
#	pai-management/bootstrap/alert-manager/refresh.sh.template
#	pai-management/bootstrap/alert-manager/service.yaml
#	pai-management/bootstrap/alert-manager/stop.sh
#	pai-management/bootstrap/drivers/node-label.sh.template
#	pai-management/bootstrap/hadoop-jobhistory/node-label.sh.template
#	pai-management/bootstrap/hadoop-node-manager/hadoop-node-manager-delete/delete-data.sh
#	pai-management/container-setup.sh
#	pai-management/k8sPaiLibrary/maintaintool/kubernetes-cleanup.sh
#	pai-management/k8sPaiLibrary/template/kubernetes-cleanup.sh.template
#	pai-management/paiLibrary/paiBuild/build_center.py
#	pai-management/paiLibrary/paiBuild/hadoop_ai_build.py
#	paictl.py
#	prometheus/doc/exporter-for-other-services.md
#	src/dev-box/build/container-setup.sh
#	src/dev-box/build/dev-box.dockerfile
#	src/drivers/deploy/node-label.sh.template
#	src/hadoop-ai/build/build-pre.sh
#	src/hadoop-jobhistory/deploy/node-label.sh.template
#	src/hadoop-name-node/deploy/node-label.sh.template
#	src/hadoop-node-manager/deploy/hadoop-node-manager-delete/delete-data.sh
#	src/hadoop-node-manager/deploy/node-label.sh.template
#	src/hadoop-resource-manager/deploy/node-label.sh.template
#	src/zookeeper/deploy/node-label.sh.template
2018-09-10 13:07:56 +08:00
YundongYe 9612c7dd0c
Refactor kubernetes deployment (#1064) 2018-08-13 13:19:07 +08:00