Scarlett Li
b0d2e09110
Update README.md
2019-12-25 12:18:19 +08:00
Scarlett Li
fedc424a2a
Update README.md
2019-12-25 12:16:18 +08:00
Binyang2014
a80763ff3a
[k8s-dashboard] Enable https for k8s-dashboard ( #4032 )
...
* enable https
* fix ut
* remove dashboard roles
* bug fix
2019-12-25 10:32:30 +08:00
Binyang2014
44506d372a
[Grafana] remove grafana chart ( #4055 )
...
* remove grafana chart
* fix
* move dashboard to admin section
2019-12-25 10:23:14 +08:00
Qinzheng Sun
af98e312f3
[Web Portal] Remove admin-lte layout dependencies ( #4039 )
...
* miscs
* layout
* responsive layout & sidebar items
* plugins
* fix
* 709
* notification button font size
* 235
* 253
* coin
* spelling
2019-12-23 18:56:10 +08:00
Zhiyuan He
56a6f6bbab
fix ( #4049 )
2019-12-23 17:18:15 +08:00
Yifan Xiong
927cda4886
[CI/CD] Update Jenkins ( #4035 )
...
Update Jenkins.
2019-12-23 16:50:52 +08:00
Binyang2014
4fc30d798b
[ErrorSpec] make pattern case insensitive ( #4047 )
2019-12-20 19:25:45 +08:00
Binyang2014
2aa4ec6da6
[runtime] fix image check ( #4015 ) ( #4017 )
2019-12-20 18:28:38 +08:00
Yuqi Wang
44f27bb666
[Hived]: Per VC queuing to avoid cross VC starvation ( #4041 )
2019-12-20 12:22:14 +08:00
Binyang2014
317d0c4dfe
[Prometheus] fix prometheus high io issue ( #4019 ) ( #4033 )
2019-12-20 09:56:19 +08:00
Yifan Xiong
01facedc05
[Rest Server] Expose preemption status in scheduler ( #4007 )
...
* Expose preemption status in scheduler
Expose preemption status in scheduler.
* Add error handling for hived webservice call
Add error handling for hived webservice call.
* Update hived scheduler service configuration
Update hived scheduler service configuration.
* Update
Update.
* Update
Update.
2019-12-19 19:42:37 +08:00
yiyione
e64da1d439
update yaml job config schema ( #4011 )
2019-12-19 14:45:20 +08:00
Hanyu Zhao
dbb55ae270
[HiveD] fix pod hanging after reconfiguration ( #4003 )
2019-12-19 09:47:28 +08:00
Yifan Xiong
ae388103b0
[Docs] OpenAPI docs for RESTful API ( #3996 )
...
Add OpenAPI docs for RESTful API.
2019-12-17 15:43:00 +08:00
Mingliang Tao
439d4e6965
Fix clone loading misbehavior ( #3923 )
2019-12-16 11:10:38 +08:00
yiyione
51e72bb030
[VS Code] use bash in simulation ( #3991 )
...
* use /bin/bash in simulation
* update
2019-12-12 15:06:38 +08:00
Binyang2014
013eecd116
[runtime] Change to call k8s api to get storage config ( #3944 )
...
Change to use k8s api to finish runtime storage plugin
2019-12-10 16:38:43 +08:00
Yifan Xiong
141301d4ad
Fix typos ( #3980 )
...
* Fix typos
Fix typos by [misspell](https://github.com/client9/misspell).
* Fix typos in api
Fix typos in api.
* Add spelling check in GitHub Actions
Add spelling check using misspell in GitHub Actions.
2019-12-10 11:37:07 +08:00
Binyang2014
f45849f50e
Binyli/job ssh ( #3986 ) ( #3992 )
...
Remove ssh plugin when gangAllocation is set to false.
Tested in int bed with the following job config:
```yaml
extras:
  gangAllocation: false
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
```
2019-12-10 09:52:42 +08:00
Binyang2014
b0ff180888
[runtime] add docker image checker ( #3974 ) ( #3993 )
...
* add docker image checker
2019-12-10 09:51:34 +08:00
yiyione
d56432fa9c
[VS Code] job V2 local simulation ( #3969 )
...
* add simulate v2 job
* fix username in replace variables
* update
2019-12-09 14:45:43 +08:00
Yuqi Wang
8d2cbf1c51
[Hived]: Update Image ( #3987 )
2019-12-09 14:31:51 +08:00
Hanyu Zhao
bff8b9d67e
[HiveD] downgrade pod when PreassignedCellTypes is empty ( #3983 )
2019-12-09 14:14:35 +08:00
YundongYe
c7698316b1
Fix typo in kubespray's tutorial ( #3975 )
2019-12-09 11:24:21 +08:00
Yuqi Wang
1bc971fa8a
[Hived]: Update Image ( #3981 )
2019-12-06 19:50:41 +08:00
Hanyu Zhao
6dcbca600e
[HiveD] support partial release of affinity group ( #3978 )
2019-12-06 19:25:41 +08:00
Yuqi Wang
6522eb5f7d
Explicitly config and tune FIFO ( #3977 )
2019-12-06 17:27:26 +08:00
Yifan Xiong
1b9ee8134d
[Rest Server] Update default completion policy ( #3972 )
...
* Update default completion policy
Update default completion policy,
set minSucceededInstances to task instances by default.
* Update document
Update document.
2019-12-06 17:07:56 +08:00
Yuqi Wang
de98a9a352
[Hived]: Support to tune FIFO ( #3976 )
2019-12-06 17:02:38 +08:00
Binyang2014
900546c3e6
[Error Spec] Add more error patterns ( #3959 )
...
* add new error patterns
2019-12-06 12:52:10 +08:00
Yifan Xiong
4b0346bf3c
[Rest Server] Fix duplicate affinity group name issue ( #3971 )
...
* Fix duplicate affinity group name issue
Fix duplicate affinity group name issue.
2019-12-06 11:16:01 +08:00
Yifan Xiong
b3baf46743
[Rest Server] Add available resources for low priority cluster ( #3940 )
...
* Add available resources for low priority cluster
Add available resources for low priority cluster.
* Update virtual cluster statistics on webportal
Update virtual cluster statistics on webportal.
2019-12-06 10:50:45 +08:00
Hanyu Zhao
965ae38e18
HiveD: record cell type in pod annotation ( #3962 )
...
* record cell type (instead of level) in pod annotation
* fix selectedNode
* refine failed reason
* minor fixes
* minor fixes
2019-12-06 09:09:29 +08:00
Yifan Xiong
c3341596b2
Update examples ( #3968 )
...
Update examples:
* Upgrade built cuda version in pytorch image
* Change to python3 in pytorch examples
* Make NCCL options as a parameter in mpi
2019-12-05 18:56:13 +08:00
Binyang2014
a60bb44ff1
[Runtime] fix typo error ( #3964 )
2019-12-05 15:15:41 +08:00
Yifan Xiong
0ce68684fd
Shorten name to fit backend restriction ( #3958 )
...
Shorten name to fit backend restriction.
2019-12-04 12:42:00 +08:00
Yuqi Wang
f22085b159
[Hived]: Expose and Refine Pod Waiting Reason ( #3931 )
2019-12-04 11:48:20 +08:00
Yifan Xiong
1dcb1198d9
Fix vulnerabilities in npm dependencies ( #3954 )
...
* Fix vulnerabilities in npm dependencies
Fix vulnerabilities in npm dependencies.
* Bump set-value and union-value
Bump set-value and union-value to fix vulnerabilities.
2019-12-04 11:02:49 +08:00
Yifan Xiong
7a035fa87f
Add critical priority in hived ( #3949 )
...
Add critical priority in hived.
2019-12-04 11:02:37 +08:00
AosChen
9e93908e7b
Add the Example, show the process and fix it fit the PAI. ( #3934 )
...
* Add the Example, show the process and fix it fit the PAI.
* Change the output directory.
* Word fix.
2019-12-04 10:03:26 +08:00
Binyang2014
663b06db13
[Job-exporter] Fix gpu matrix not match task-role issue ( #3951 )
...
Tested in vnext bed. After this fix, the task role metrics show correctly.
Refer: NVIDIA/nvidia-docker#376
The GPU minor number sometimes does not match NVIDIA_VISIBLE_DEVICES.
2019-12-03 17:45:09 +08:00
YundongYe
b75f66aa06
[kubespray] some script to backup and recover data ( #3955 )
2019-12-03 16:48:03 +08:00
Yifan Xiong
b34177b12d
[Rest Server] Add image pull secrets for private registry ( #3943 )
...
* Add image pull secrets for private registry
Add image pull secrets for private registry.
2019-12-03 10:39:20 +08:00
Qinzheng Sun
bced6e43b9
[Web Portal] fix react hook warning ( #3920 )
2019-12-02 15:47:51 +08:00
yiyione
3d1488324d
[VS Code] Fix add cluster bug when connection timeout ( #3929 )
...
* add try catch to json parse error message
* add some description for cluster config
2019-12-02 14:56:26 +08:00
Mingliang Tao
9bda534aea
Remove max gpu count limit ( #3927 )
2019-12-02 10:39:28 +08:00
Yifan Xiong
d05a8e53a9
[Rest Server] Check resource quota for gang scheduling only ( #3936 )
...
* Check resource quota for gang scheduling only
Check resource quota for gang scheduling only.
* Skip check for oppo
Skip check for oppo priority.
2019-11-30 17:08:37 +08:00
Yifan Xiong
c47ca4dc01
Specify image pull policy for app container ( #3933 )
...
Specify image pull policy for app container.
2019-11-29 19:22:47 +08:00
YundongYe
ee12c0f37e
[Kubespray] Disable netchecker ( #3932 )
2019-11-29 16:50:09 +08:00