Commit graph

3995 commits

Author SHA1 Message Date
Zhiyuan He 060684fcf6
delete elasticsearch in rest-server (#4302)
* fix rest-server part

* fix dependency for postgres and internal-storage

* fix

* enable db by default
2020-03-20 11:30:08 +08:00
Yifan Xiong b02f16b9cd
[Rest Server] Add launched time for job (#4301)
* Add launched time for job
* Add docs for duration
2020-03-20 01:11:11 +08:00
Mingliang Tao 5ec04d9700
Support submit job from local storage (#4296) 2020-03-19 17:38:19 +08:00
Yuqi Wang 7785008710
Remove ES backend parts (#4299) 2020-03-19 16:38:16 +08:00
Zhiyuan He 2cb5a3cd6a
Use Postgresql as an alternative storage for job history (#4164)
* fix table schema & fluentd

* fix fluentd

* fix fluentd

* fix fluentd

* fix rest-server

* lint

* fix rest-server

* fix rest-server

* fix index setting

* lint

* rm header

* add postgresql init container

* fix

* fix

* add readiness probe

* add time period for probe

* fix for new build

* fix

* fix

* fix bad connection problem in fluent-plugin-pgjson

* add if for elasticsearch

* fix

* add NOTICE.txt

* comment out debug stdout

* add comment

* change config way

* fix

* copyright header
2020-03-19 14:05:02 +08:00
YundongYe 311981edc3
[deploy] kube-system secret for image pull (#4294) 2020-03-19 12:55:30 +08:00
Binyang2014 3063735224
[webportal] show account name for Azure-file and Azure-blob in webportal (#4289)
* show account name for Azure-file and Azure-blob in webportal

* Use <AZURE_STORAGE_DNS_SUFFIX> as suffix
2020-03-19 10:51:02 +08:00
Yifan Xiong 2b1e9c15bc
Update RBAC rules for storage (#4292)
Update RBAC rules for storage.
2020-03-18 15:34:58 +08:00
Binyang2014 338f5a3e01
[k8s-dashboard] change dashboard token ttl to 7d (#4293) 2020-03-18 15:09:47 +08:00
Hanyu Zhao 3388bb9990
[HiveD] add log for suggested nodes (#4262)
* add log for suggested nodes

* add log for suggested nodes

* add log for suggested nodes

* refine log

* refine log
2020-03-17 14:24:26 +08:00
Hanyu Zhao 9d69ebca4a
[HiveD] expose vc capacity (#4153)
* expose vc capacity

* fix failure of running cells

* json representation of cluster status

* add oppor cells to vc

* minor fixes

* add node lister to watch nodes

* minor fixes

* minor fixes

* minor

* refine cell addressing

* quota headroom track (wip)

* safety check

* fix bug in mapNonPreassignedCellToVirtual

* track bad cells

* minor refinements

* resolve comments

* resolve comments

* resolve comments & fix track bad cells (only need to track bad free cells)

* resolve comments
2020-03-16 21:26:30 +08:00
Hanyu Zhao 283668e504
[HiveD] VC aware of suggested nodes (#4268)
* vc aware of suggested nodes

* vc aware of suggested nodes

* suggested

* add UT

* fix comments

* minor fix

* revert vc aware suggested nodes

* fix UT
2020-03-16 13:07:36 +08:00
dependabot[bot] e633511722
Bump acorn from 6.1.1 to 6.4.1 in /contrib/submit-job-v2 (#4286) 2020-03-16 02:44:29 +00:00
Pengfei Wu 4768244820
Fix typo (#4287)
Fix typo.
2020-03-16 00:33:35 +08:00
dependabot[bot] 606e0e27bb
Bump acorn from 6.1.1 to 6.4.1 in /contrib/marketplace (#4285) 2020-03-15 16:30:48 +00:00
dependabot[bot] 2ebf2dbb4e
Bump acorn from 5.7.3 to 5.7.4 in /contrib/submit-simple-job (#4284) 2020-03-15 16:30:26 +00:00
dependabot[bot] 5dc296d66a
Bump acorn from 6.1.1 to 6.4.1 in /src/webportal (#4283) 2020-03-14 08:34:15 +00:00
Yifan Xiong 855f3808b0
Support cpu job on gpu node in default scheduler (#4280)
Support cpu job on gpu node in default scheduler.
2020-03-13 16:27:33 +08:00
yiyione 499f36d519
[Rest Server] Remove list pods from getting job detail (#4279)
* remove list pods from getting job detail

* set default value to affinityGroupName

Co-authored-by: yiyione <yiyi_one@foxmail.com>
2020-03-13 14:10:41 +08:00
Binyang2014 7b4e863e1f
[doc] add doc for remove cni (#4278) 2020-03-12 18:22:15 +08:00
YundongYe 26eda6fd2a
[quick-start] Advanced docker configuration (#4264) 2020-03-12 18:09:23 +08:00
Binyang2014 9645ff0b6c
[watchdog] Remove go dependency source code (#4275)
* remove dependency source code
2020-03-12 17:16:49 +08:00
Binyang2014 c8c4774bc6
[runtime] Make ssh barrier timeout configurable (#4272)
Make ssh barrier timeout configurable to avoid unexpected user job failure. The default timeout is 30 mins.

BTW: remove `sshbarriertaskroles`, since it's too complex

Fixes #4239
2020-03-11 22:16:20 +08:00
yiyione 910fe78339
[Rest Server] Use list pods with label selector in getting job detail (#4270)
* change get each pod to list pod with label selector

* fix eslint

* update getPods

* Add accept header to list pods (#4274)

* add accept header

* fix mock gpu

Co-authored-by: yiyione <yiyi_one@foxmail.com>
2020-03-11 10:41:14 +08:00
Zhiyuan He ebcf5585db
[Kubespray] Inform user of pai cluster id, configuration, username, password (#4267)
* fix

* fix
2020-03-11 10:05:40 +08:00
Binyang2014 c73454f5e8
[runtime] fix failurePolicy inconsistent (#4273) 2020-03-11 09:46:13 +08:00
Yuqi Wang b7e5d0db29
[Hived]: Filter on full SuggestedNodes (#4271) 2020-03-10 18:36:48 +08:00
Binyang2014 f2475b882d
Add amd metrics support (#4258)
Add AMD GPU metrics support.
Currently, we will export `GPU usage`, `GPU memory usage`, `GPU temperature` for AMD GPU
2020-03-10 14:44:20 +08:00
YundongYe c0623c3734
[dashboard][hived] Add dashboard and kubescheduler's image (#4265)
* Add dashboard and kubescheduler's image

* Add dashboard and kubescheduler's image
2020-03-10 11:46:33 +08:00
Zhiyuan He b1212e92ed
Concatenate SSH command for user (#4250)
* fix

* fix

* lint

* save

* fix user ssh

* fix

* fix

* fix

* fix
2020-03-10 10:23:27 +08:00
Hanyu Zhao 095a1e7ff9
[HiveD] check suggested nodes for all pods (#4251)
* check suggested nodes for all pods

* check suggested nodes for all pods

* fix log message

* minor fix
2020-03-09 17:30:26 +08:00
Binyang2014 1d333b1219
support labelSelector in restAPI (#4260)
* support labelSelector in restAPI

* fix UT
2020-03-09 16:11:54 +08:00
YundongYe a3b92ffdc7
[quick-start] Fix issue. (#4261) 2020-03-09 10:13:31 +08:00
YundongYe f2194c5ead
[quick-start] Remove docker config path change in kubespray. (#4259) 2020-03-08 16:34:13 +08:00
Yifan Xiong f9bf182fd4
[Rest Server] Refactor storage by leveraging persistent volumes (#4157)
* Refactor storage by leveraging kubernetes pv and pvc.
* Remove unused code
* Backward compatibility for
  * GET /storage/config
  * GET /storage/server
* Update api docs
* Add document on setting up storage
2020-03-08 15:02:18 +08:00
YundongYe 4d9caf2d0f
[quick-start] ssh-key-file configuration (#4256) 2020-03-07 14:08:17 +08:00
YundongYe a278fb5d42
[quick-start] hived config generator's log update and support amd device (#4253) 2020-03-07 14:07:39 +08:00
Binyang2014 2386c93313
[Runtime] Skip teamwise-storage and fix UT (#4243)
* Skip teamwise-storage and fix UT

* fix typo
2020-03-06 23:23:23 +08:00
YundongYe 4047b23293
[quick-start] change default version to v0.17.0 (#4255) 2020-03-06 22:13:06 +08:00
yiyione 90604120df
[VS Code] 0.3.0 document update (#4234)
* change version to 0.3.0

* update create job config file

* add document for yaml edit

* update

* update

* fix line break

* add submit job doc

* update

* add storage explorer document

* update readme

* update

* update readme_zh_CN

* update change log

* fix typo in README

* fix NFS

* update

Co-authored-by: yiyione <yiyi_one@foxmail.com>
2020-03-06 15:06:17 +08:00
Zhiyuan He 2a6e249890
Update docs for v0.18.0 (#4238)
* doc init

* Update upgrade_to_v0.18.0.md

* Update upgrade_to_v0.18.0.md

* fix

* fix

* fix
2020-03-06 14:32:18 +08:00
Zhiyuan He 0cbe9c25e9
Add FAQs and Troubleshooting for kubespray (#4249)
* init

* Update faqs-and-troubleshooting.md

* Update faqs-and-troubleshooting.md
2020-03-06 14:11:17 +08:00
Zhiyuan He a84fbe98cc
Delete quick start docker images when error happens (#4248) 2020-03-06 14:11:01 +08:00
Yifan Xiong 985cd4ef14
[CI/CD] Fix Jenkins issue caused by downloading MNIST 403 (#4257)
* Fix Jenkins issue caused by downloading MNIST 403

Fix Jenkins test job failed issue caused by
upstream pytorch and torchvision.

* Fix autoremove issue

Fix autoremove issue.
2020-03-05 20:43:29 +08:00
YundongYe dcc81fa18f
[quick-start] setup kubectl env on localhost. (#4242) 2020-03-03 20:56:25 +08:00
Mingliang Tao ac67669762
Fix log url in job retry page (#4237) 2020-03-03 10:58:23 +08:00
Zhiyuan He 131ea0000a
Fix GPU driver installation on Ubuntu 16.04 (#4236) 2020-03-02 14:05:27 +08:00
YundongYe 87ba350c9a
[kubespray] Refactor quick start based on ansible role. (#4230) 2020-02-26 20:38:17 +08:00
YundongYe 3597d02124
[kubespray] configuration for docker and image-repo (#4229) 2020-02-26 10:41:01 +08:00
Zhiyuan He c4b7e87860
Add offline apt package cache in kube runtime (#4226)
* squash commit

* Update package_cache.md

* fix & lint

* add fix for nfs, samba, azurefile

* fix for 32-bit os

* fix

* use dpkg -V instead of dpkg -l

* fix

* fix

* skip some uts
2020-02-25 14:33:15 +08:00