Zhiyuan He
060684fcf6
delete elasticsearch in rest-server ( #4302 )
...
* fix rest-server part
* fix dependency for postgres and internal-storage
* fix
* enable db by default
2020-03-20 11:30:08 +08:00
Yifan Xiong
b02f16b9cd
[Rest Server] Add launched time for job ( #4301 )
...
* Add launched time for job
* Add docs for duration
2020-03-20 01:11:11 +08:00
Mingliang Tao
5ec04d9700
Support submit job from local storage ( #4296 )
2020-03-19 17:38:19 +08:00
Yuqi Wang
7785008710
Remove ES backend parts ( #4299 )
2020-03-19 16:38:16 +08:00
Zhiyuan He
2cb5a3cd6a
Use Postgresql as an alternative storage for job history ( #4164 )
...
* fix table schema & fluentd
* fix fluentd
* fix fluentd
* fix fluentd
* fix rest-server
* lint
* fix rest-server
* fix rest-server
* fix index setting
* lint
* rm header
* add postgresql init container
* fix
* fix
* add readiness probe
* add time period for probe
* fix for new build
* fix
* fix
* fix bad connection problem in fluent-plugin-pgjson
* add if for elasticsearch
* fix
* add NOTICE.txt
* comment out debug stdout
* add comment
* change config way
* fix
* copyright header
2020-03-19 14:05:02 +08:00
YundongYe
311981edc3
[deploy] kube-system secret for image pull ( #4294 )
2020-03-19 12:55:30 +08:00
Binyang2014
3063735224
[webportal] show account name for Azure-file and Azure-blob in webportal ( #4289 )
...
* show account name for Azure-file and Azure-blob in webportal
* Use <AZURE_STORAGE_DNS_SUFFIX> as suffix
2020-03-19 10:51:02 +08:00
Yifan Xiong
2b1e9c15bc
Update RBAC rules for storage ( #4292 )
...
Update RBAC rules for storage.
2020-03-18 15:34:58 +08:00
Binyang2014
338f5a3e01
[k8s-dashboard] change dashboard token ttl to 7d ( #4293 )
2020-03-18 15:09:47 +08:00
Hanyu Zhao
3388bb9990
[HiveD] add log for suggested nodes ( #4262 )
...
* add log for suggested nodes
* add log for suggested nodes
* add log for suggested nodes
* refine log
* refine log
2020-03-17 14:24:26 +08:00
Hanyu Zhao
9d69ebca4a
[HiveD] expose vc capacity ( #4153 )
...
* expose vc capacity
* fix failure of running cells
* json representation of cluster status
* add oppor cells to vc
* minor fixes
* add node lister to watch nodes
* minor fixes
* minor fixes
* minor
* refine cell addressing
* quota headroom track (wip)
* safety check
* fix bug in mapNonPreassignedCellToVirtual
* track bad cells
* minor refinements
* resolve comments
* resolve comments
* resolve comments & fix track bad cells (only need to track bad free cells)
* resolve comments
2020-03-16 21:26:30 +08:00
Hanyu Zhao
283668e504
[HiveD] VC aware of suggested nodes ( #4268 )
...
* vc aware of suggested nodes
* vc aware of suggested nodes
* suggested
* add UT
* fix comments
* minor fix
* revert vc aware suggested nodes
* fix UT
2020-03-16 13:07:36 +08:00
dependabot[bot]
e633511722
Bump acorn from 6.1.1 to 6.4.1 in /contrib/submit-job-v2 ( #4286 )
2020-03-16 02:44:29 +00:00
Pengfei Wu
4768244820
Fix typo ( #4287 )
...
Fix typo.
2020-03-16 00:33:35 +08:00
dependabot[bot]
606e0e27bb
Bump acorn from 6.1.1 to 6.4.1 in /contrib/marketplace ( #4285 )
2020-03-15 16:30:48 +00:00
dependabot[bot]
2ebf2dbb4e
Bump acorn from 5.7.3 to 5.7.4 in /contrib/submit-simple-job ( #4284 )
2020-03-15 16:30:26 +00:00
dependabot[bot]
5dc296d66a
Bump acorn from 6.1.1 to 6.4.1 in /src/webportal ( #4283 )
2020-03-14 08:34:15 +00:00
Yifan Xiong
855f3808b0
Support cpu job on gpu node in default scheduler ( #4280 )
...
Support cpu job on gpu node in default scheduler.
2020-03-13 16:27:33 +08:00
yiyione
499f36d519
[Rest Server] Remove list pods from getting job detail ( #4279 )
...
* remove list pods from getting job detail
* set default value to affinityGroupName
Co-authored-by: yiyione <yiyi_one@foxmail.com>
2020-03-13 14:10:41 +08:00
Binyang2014
7b4e863e1f
[doc] add doc for remove cni ( #4278 )
2020-03-12 18:22:15 +08:00
YundongYe
26eda6fd2a
[quick-start] Advanced docker configuration ( #4264 )
2020-03-12 18:09:23 +08:00
Binyang2014
9645ff0b6c
[watchdog] Remove go dependency source code ( #4275 )
...
* remove dependency source code
2020-03-12 17:16:49 +08:00
Binyang2014
c8c4774bc6
[runtime] Make ssh barrier timeout configurable ( #4272 )
...
Make the ssh barrier timeout configurable to avoid unexpected user job failures. The default timeout is 30 minutes.
Also remove `sshbarriertaskroles`, since it is too complex.
Fixes #4239
2020-03-11 22:16:20 +08:00
yiyione
910fe78339
[Rest Server] Use list pods with label selector in getting job detail ( #4270 )
...
* change get each pod to list pod with label selector
* fix eslint
* update getPods
* Add accept header to list pods (#4274 )
* add accept header
* fix mock gpu
Co-authored-by: yiyione <yiyi_one@foxmail.com>
2020-03-11 10:41:14 +08:00
Zhiyuan He
ebcf5585db
[Kubespray] Inform user of pai cluster id, configuration, username, password ( #4267 )
...
* fix
* fix
2020-03-11 10:05:40 +08:00
Binyang2014
c73454f5e8
[runtime] fix failurePolicy inconsistent ( #4273 )
2020-03-11 09:46:13 +08:00
Yuqi Wang
b7e5d0db29
[Hived]: Filter on full SuggestedNodes ( #4271 )
2020-03-10 18:36:48 +08:00
Binyang2014
f2475b882d
Add amd metrics support ( #4258 )
...
Add AMD GPU metrics support.
Currently, we export `GPU usage`, `GPU memory usage`, and `GPU temperature` for AMD GPUs.
2020-03-10 14:44:20 +08:00
YundongYe
c0623c3734
[dashboard][hived] Add dashboard and kubescheduler's image ( #4265 )
...
* Add dashboard and kubescheduler's image
* Add dashboard and kubescheduler's image
2020-03-10 11:46:33 +08:00
Zhiyuan He
b1212e92ed
Concatenate SSH command for user ( #4250 )
...
* fix
* fix
* lint
* save
* fix user ssh
* fix
* fix
* fix
* fix
2020-03-10 10:23:27 +08:00
Hanyu Zhao
095a1e7ff9
[HiveD] check suggested nodes for all pods ( #4251 )
...
* check suggested nodes for all pods
* check suggested nodes for all pods
* fix log message
* minor fix
2020-03-09 17:30:26 +08:00
Binyang2014
1d333b1219
support labelSelector in restAPI ( #4260 )
...
* support labelSelector in restAPI
* fix UT
2020-03-09 16:11:54 +08:00
YundongYe
a3b92ffdc7
[quick-start] Fix issue. ( #4261 )
2020-03-09 10:13:31 +08:00
YundongYe
f2194c5ead
[quick-start] Remove docker config path change in kubespray. ( #4259 )
2020-03-08 16:34:13 +08:00
Yifan Xiong
f9bf182fd4
[Rest Server] Refactor storage by leveraging persistent volumes ( #4157 )
...
* Refactor storage by leveraging kubernetes pv and pvc.
* Remove unused code
* Backward compatibility for GET /storage/config and GET /storage/server
* Update api docs
* Add document on setting up storage
2020-03-08 15:02:18 +08:00
YundongYe
4d9caf2d0f
[quick-start] ssh-key-file configuration ( #4256 )
2020-03-07 14:08:17 +08:00
YundongYe
a278fb5d42
[quick-start] hived config generator's log update and support amd device ( #4253 )
2020-03-07 14:07:39 +08:00
Binyang2014
2386c93313
[Runtime] Skip teamwise-storage and fix UT ( #4243 )
...
* Skip teamwise-storage and fix UT
* fix typo
2020-03-06 23:23:23 +08:00
YundongYe
4047b23293
[quick-start] change default version to v0.17.0 ( #4255 )
2020-03-06 22:13:06 +08:00
yiyione
90604120df
[VS Code] 0.3.0 document update ( #4234 )
...
* change version to 0.3.0
* update create job config file
* add document for yaml edit
* update
* update
* fix line break
* add submit job doc
* update
* add storage explorer document
* update readme
* update
* update readme_zh_CN
* update change log
* fix typo in README
* fix NFS
* update
Co-authored-by: yiyione <yiyi_one@foxmail.com>
2020-03-06 15:06:17 +08:00
Zhiyuan He
2a6e249890
Update docs for v0.18.0 ( #4238 )
...
* doc init
* Update upgrade_to_v0.18.0.md
* Update upgrade_to_v0.18.0.md
* fix
* fix
* fix
2020-03-06 14:32:18 +08:00
Zhiyuan He
0cbe9c25e9
Add FAQs and Troubleshooting for kubespray ( #4249 )
...
* init
* Update faqs-and-troubleshooting.md
* Update faqs-and-troubleshooting.md
2020-03-06 14:11:17 +08:00
Zhiyuan He
a84fbe98cc
Delete quick start docker images when error happens ( #4248 )
2020-03-06 14:11:01 +08:00
Yifan Xiong
985cd4ef14
[CI/CD] Fix Jenkins issue caused by downloading MNIST 403 ( #4257 )
...
* Fix Jenkins issue caused by downloading MNIST 403
Fix the Jenkins test job failure caused by upstream pytorch and torchvision.
* Fix autoremove issue
Fix autoremove issue.
2020-03-05 20:43:29 +08:00
YundongYe
dcc81fa18f
[quick-start] setup kubectl env on localhost. ( #4242 )
2020-03-03 20:56:25 +08:00
Mingliang Tao
ac67669762
Fix log url in job retry page ( #4237 )
2020-03-03 10:58:23 +08:00
Zhiyuan He
131ea0000a
Fix GPU driver installation on Ubuntu 16.04 ( #4236 )
2020-03-02 14:05:27 +08:00
YundongYe
87ba350c9a
[kubespray] Refactor quick start based on ansible role. ( #4230 )
2020-02-26 20:38:17 +08:00
YundongYe
3597d02124
[kubespray] configuration for docker and image-repo ( #4229 )
2020-02-26 10:41:01 +08:00
Zhiyuan He
c4b7e87860
Add offline apt package cache in kube runtime ( #4226 )
...
* squash commit
* Update package_cache.md
* fix & lint
* add fix for nfs, samba, azurefile
* fix for 32-bit os
* fix
* use dpkg -V instead of dpkg -l
* fix
* fix
* skip some uts
2020-02-25 14:33:15 +08:00