* Support AKS deployment
passing POD_IP for hadoop-data-node
Add kubernetes to services-configuration.yaml
It looks like the following:
cluster:
  kubernetes:
    api-server-url: http://master_ip:8080
    dashboard-host: master_ip
generate restserver ip from machine_list
TODO: should avoid using machine_list
remove post check for "kubernetes" config
TODO: need to make sure the kubernetes config entry is correct
add layout.yaml for PAI service deployment layout
layout.yaml looks like the following:
machine-sku:
  GENERIC:
    mem: 1
    gpu:
      type: generic
      count: 1
    cpu:
      vcore: 1
    os: ubuntu16.04
machine-list:
  - hostname: openm-a-gpu-1000
    hostip: 10.151.41.69
    machine-type: GENERIC
    zkid: "1"
    pai-master: "true"
    pai-worker: "true"
prototype of "layout" cmd
fix jenkins pipeline
hack the ci
fix machine-sku format
hack the ci
demostrate "paictl.py check" command
fix machine-sku format
change True to 'true' for compatibility
separate config for kubernetes and PAI services
fix jenkins
fix jenkins
fix "machine" cmd
fix k8s_set_environment
remove class Main
simplify sub-command creation
refine service_management_configuration
warn instead of quitting when the yaml is not found
separate k8s operations from pai
fix layout
generate kubernetes info in services-configuration.yaml
quick-start would generate k8s info
refine Jenkinsfile
refine, and fix error
workaround for ci
nodename is different on-premise vs. on AKS
refine layout generator
generate kubernetes info in services-configuration.yaml
move kubernetes config to layout.yaml
double-quote the strings in layout.yaml
fix layout config validator
fix load-balance-ip missing
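For illustration, a rough sketch of how the kubernetes section could look once it moves into layout.yaml; only the field names (api-server-url, load-balance-ip) come from the commits above, the surrounding structure is an assumption:
# hypothetical layout.yaml fragment; structure is assumed, not copied from the repo
kubernetes:
  api-server-url: http://master_ip:8080   # same value as the services-configuration.yaml example above
  load-balance-ip: master_ip              # the entry referenced by "fix load-balance-ip missing"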
Add a Kubernetes proxy to REST server, make webportal request it (#1856)
* REST server: k8s apiserver proxy
* Object model changes
* Update rest_server.py
REST server: leverage kubernetes service account (#1911)
add auth to visit k8s secret (#1951)
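As a rough illustration of the secret access this needs, a minimal RBAC sketch that lets the rest-server service account read secrets in one namespace; the account and namespace names are assumptions, not the actual manifests:
# hypothetical RBAC sketch; names and namespaces are assumed
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: secret-reader
  namespace: pai-user
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: rest-server-secret-reader
  namespace: pai-user
subjects:
  - kind: ServiceAccount
    name: rest-server
    namespace: default
roleRef:
  kind: Role
  name: secret-reader
  apiGroup: rbac.authorization.k8s.io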
[AKS] api-server RBAC access for watchdog (#1710)
* [aks] prometheus leverage service account to access api-server
* add tls config for prometheus over aks deployment
* mv tls config to api-server domain as other services may use it
* add watchdog auth for api-server (see the RBAC sketch after this list)
* mv path to config
* fix bug
* refine code
* refine the kubelet check
* refine the note
* refine the bearer file path read location
* refactor the watchdog parameter
* refine and update the param
* make this a flag
* add default value at object model
* fix code
* refactor the object model and comments
* add the return
* remove prometheus and tls config, only keep watchdog change
* rm the indent
* rm the indent 2
* remove hosts config
* make api-server-scheme configurable
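A minimal sketch of the cluster-scoped, read-only access the watchdog needs when it talks to the api-server with a service account; the resource list and names are assumptions based on the commits above, not the actual deployment files:
# hypothetical ClusterRole/ClusterRoleBinding for the watchdog service account
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: watchdog-reader
rules:
  - apiGroups: [""]
    resources: ["nodes", "pods"]
    verbs: ["get", "list", "watch"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: watchdog-reader
subjects:
  - kind: ServiceAccount
    name: watchdog
    namespace: default
roleRef:
  kind: ClusterRole
  name: watchdog-reader
  apiGroup: rbac.authorization.k8s.io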
[aks] prometheus attach token for access api-server (#1939)
* prometheus authenticates to api-server by attaching the service account token
* rm duplicated
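For reference, attaching the in-cluster service account token in a Prometheus scrape config is usually done with bearer_token_file plus the mounted CA; the job name below is illustrative, while the token/CA paths are the defaults Kubernetes mounts into every pod:
# illustrative scrape job; only the token/CA paths are standard, the rest is assumed
scrape_configs:
  - job_name: 'kubernetes-apiserver'
    scheme: https
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    kubernetes_sd_configs:
      - role: endpoints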
[hot fix] prometheus & watchdog yaml changed to use the cluster_cfg object model
fix config entry change
make the UT pass; fix it properly later
fix test_template_generate.py
[aks] workaround k8s config not found
fix generated layout.yaml
fix: don't override nodename
fix authorization token in user secret
fix typo "cleaining-image" -> "cleaning-image"
fix: travis lint
allow to generate "Service" template
add pylon option "deploy-service"
align the prometheus and watchdog AKS auth PR to the new deploy branch
reset the src/webportal/build/build-pre.sh
add aks draft doc
refactoring service_management_configuration
refactoring get_service_list()
refactoring "get_dependency()"
refactoring service_start
refactoring service_stop
refactoring service_delete
refactoring service_refresh
add disable rbac param
clean up deprecated code
Make kubernetes dashboard config as url (#2028)
* Make kubernetes dashboard config as url
* generate dashboard-url
* comment out post check of dashboard-url
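A hedged sketch of what the resulting services-configuration.yaml entry might look like once the dashboard config is a url; the key placement and the port placeholder are assumptions:
# hypothetical services-configuration.yaml fragment; port and placement are assumed
cluster:
  kubernetes:
    api-server-url: http://master_ip:8080
    dashboard-url: http://master_ip:<dashboard-port>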
fix breaking import
refine method name
clean TODOs
delete debug code
remove k8s dashboard part
add generate_layout.md
add prepare_env.md
refine aks doc
add --force option for layout command
more details for install kubectl
reformat
refine the num
fix dashboard_url missing
refine config command
refine service command
refine cluster command
refine machine command
refine layout command
refine ConfigCmd
refine LayoutCmd
fix broken import
refine ClusterCmd, MachineCmd and ServiceCmd
explain how cmds are handled
refine CheckCmd, it's a dummy impl now
fix broken link
no longer needed, remove changes
resolve duplicated code
remove useless pylon config entry
"check" command is not ready
fix paictl.py
remove useless file
fix typo
fix mismatch config
* add switch for tls_config
* fix comments
* refine
* using logger
* update layout doc
* put Imports at the top of the file
* fix broken link
* [Cluster Object Model] main module development (#1683)
* [Cluster Object Model] Modify [ paictl cluster ] command, based on com. (#1712)
* Class name modify. Pep8 (#1720)
* [cluster object model] Change paictl machine's code, based on com. (#1723)
* Add a tutorial to guide how to add new service configuration into com (#1701)
* [Cluster Object Model] Modify [paictl service] command, based on new cluster-object-model (#1735)
* [hotfix] remove memory limits for k8s services and increase it for rm, zookeeper (#1619)
* increase etcd memory limits to 8Gi
* set the "requests" instead of "limits" for k8s services
* increase memory limits for rm
* increase memory limit for zookeeper
* add "requests" for rm
* fix typo
* [hot-fix] increase data-node memory limits to 4Gi (#1689)
* increase data-node memory limits to 4Gi
* increase data-node max java heapsize to 2G
* reserve more memory on PAI worker
* set kubernetes memory eviction threshold
To reach that capacity, either some Pod is using more than its request,
or the system is using more than 3Gi - 1Gi = 2Gi.
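One way to express such an eviction threshold is through the kubelet configuration (or the equivalent --eviction-hard flag); the 1Gi value below only mirrors the arithmetic in this note, the value actually deployed is an assumption:
# illustrative kubelet config; assumes a 1Gi hard eviction threshold
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "1Gi"   # start evicting pods when free node memory drops below 1Gi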
* set those pods as 'Guaranteed' QoS:
node-exporter
hadoop-node-manager
hadoop-data-node
drivers-one-shot
* Set '--oom-score-adj=1000' for job container
so it would be OOM-killed first
* set those pods as 'Burstable' QoS:
prometheus
grafana
* set those pods as 'Guaranteed' QoS:
frameworklauncher
hadoop-jobhistory
hadoop-name-node
hadoop-resource-manager
pylon
rest-server
webportal
zookeeper
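For context, Kubernetes derives the QoS class from the resources block: requests equal to limits gives 'Guaranteed', requests below limits gives 'Burstable'. A minimal sketch with placeholder values (not the actual per-service settings):
# Guaranteed QoS: requests == limits for every container (values are illustrative only)
resources:
  requests:
    memory: "2Gi"
    cpu: "1"
  limits:
    memory: "2Gi"
    cpu: "1"
# Burstable QoS (e.g. prometheus, grafana): set requests lower than limits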
* adjust services memory limits
* add k8s services resource limit
* seems 1g is not enough for launcher
* adjust hadoop-resource-manager limit
* adjust webportal memory limit
* adjust cpu limits
* rm yarn-exporter resource limits
* adjust prometheus limits
* adjust limits
* frameworklauncher: set JAVA_OPTS="-server -Xmx512m"
zookeeper: set JAVA_OPTS="-server -Xmx512m"
fix env name to JAVA_OPTS
fix zookeeper
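A minimal sketch of how that heap cap could be injected into the container spec; only the JAVA_OPTS value comes from the commit above, the surrounding fields are assumed:
# illustrative container env; only the JAVA_OPTS value is taken from the commit
containers:
  - name: frameworklauncher
    env:
      - name: JAVA_OPTS
        value: "-server -Xmx512m"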
* add heapsize limit for hadoop-data-node hadoop-jobhistory
* add xmx for hadoop
* modify memory limits
* reserve 40g for singlebox, else reserve 12g
* using LAUNCHER_OPTS
* revert zookeeper dockerfile
* adjust node manager memory limit
* drivers would take more memory during install
* increase memory for zookeeper and launcher
* set requests to a lower value
* comment it out, using the container env "YARN_RESOURCEMANAGER_HEAPSIZE"
* add comments