Commit graph

3966 commits

Author SHA1 Message Date
Scarlett Li b0d2e09110
Update README.md 2019-12-25 12:18:19 +08:00
Scarlett Li fedc424a2a
Update README.md 2019-12-25 12:16:18 +08:00
Binyang2014 a80763ff3a
[k8s-dashboard] Enable https for k8s-dashboard (#4032)
* enable https

* fix ut

* remove dashboard roles

* bug fix
2019-12-25 10:32:30 +08:00
Binyang2014 44506d372a
[Grafana] remove grafana chart (#4055)
* remove grafana chart

* fix

* move dashboard to admin section
2019-12-25 10:23:14 +08:00
Qinzheng Sun af98e312f3
[Web Portal] Remove admin-lte layout dependencies (#4039)
* miscs

* layout

* responsive layout & sidebar items

* plugins

* fix

* 709

* notification button font size

* 235

* 253

* coin

* spelling
2019-12-23 18:56:10 +08:00
Zhiyuan He 56a6f6bbab
fix (#4049) 2019-12-23 17:18:15 +08:00
Yifan Xiong 927cda4886
[CI/CD] Update Jenkins (#4035)
Update Jenkins.
2019-12-23 16:50:52 +08:00
Binyang2014 4fc30d798b
[ErrorSpec] make pattern case insensitive (#4047) 2019-12-20 19:25:45 +08:00
Binyang2014 2aa4ec6da6
[runtime] fix image check (#4015) (#4017) 2019-12-20 18:28:38 +08:00
Yuqi Wang 44f27bb666
[Hived]: Per VC queuing to avoid cross VC starvation (#4041) 2019-12-20 12:22:14 +08:00
Binyang2014 317d0c4dfe
[Prometheus ] fix prometheus high io issue (#4019) (#4033) 2019-12-20 09:56:19 +08:00
Yifan Xiong 01facedc05
[Rest Server] Expose preemption status in scheduler (#4007)
* Expose preemption status in scheduler

Expose preemption status in scheduler.

* Add error handling for hived webservice call

Add error handling for hived webservice call.

* Update hived scheduler service configuration

Update hived scheduler service configuration.

* Update

Update.

* Update

Update.
2019-12-19 19:42:37 +08:00
yiyione e64da1d439
update yaml job config schema (#4011) 2019-12-19 14:45:20 +08:00
Hanyu Zhao dbb55ae270
[HiveD] fix pod hanging after reconfiguration (#4003) 2019-12-19 09:47:28 +08:00
Yifan Xiong ae388103b0
[Docs] OpenAPI docs for RESTful API (#3996)
Add OpenAPI docs for RESTful API.
2019-12-17 15:43:00 +08:00
Mingliang Tao 439d4e6965
Fix clone loading misbehavior (#3923) 2019-12-16 11:10:38 +08:00
yiyione 51e72bb030
[VS Code] use bash in simulation (#3991)
* use /bin/bash in simulation

* update
2019-12-12 15:06:38 +08:00
Binyang2014 013eecd116
[runtime] Change to call k8s api to get storage config (#3944)
Change to use k8s api to finish runtime storage plugin
2019-12-10 16:38:43 +08:00
Yifan Xiong 141301d4ad
Fix typos (#3980)
* Fix typos

Fix typos by [misspell](https://github.com/client9/misspell).

* Fix typos in api

Fix typos in api.

* Add spelling check in GitHub Actions

Add spelling check using misspell in GitHub Actions.
2019-12-10 11:37:07 +08:00
Binyang2014 f45849f50e
Binyli/job ssh (#3986) (#3992)
Remove ssh plugin when gangAllocation is set to false.
Tested in int bed with following job config
```yaml
extras:
  gangAllocation: false
  com.microsoft.pai.runtimeplugin:
    - plugin: ssh
      parameters:
        jobssh: true
```
2019-12-10 09:52:42 +08:00
Binyang2014 b0ff180888
[runtime] add docker image checker (#3974) (#3993)
* add docker image checker
2019-12-10 09:51:34 +08:00
yiyione d56432fa9c
[VS Code] job V2 local simulation (#3969)
* add simulate v2 job

* fix username in replace variables

* update
2019-12-09 14:45:43 +08:00
Yuqi Wang 8d2cbf1c51
[Hived]: Update Image (#3987) 2019-12-09 14:31:51 +08:00
Hanyu Zhao bff8b9d67e
[HiveD] downgrade pod when PreassignedCellTypes is empty (#3983) 2019-12-09 14:14:35 +08:00
YundongYe c7698316b1
Fix typo in kubespray's tutorial (#3975) 2019-12-09 11:24:21 +08:00
Yuqi Wang 1bc971fa8a
[Hived]: Update Image (#3981) 2019-12-06 19:50:41 +08:00
Hanyu Zhao 6dcbca600e
[HiveD] support partial release of affinity group (#3978) 2019-12-06 19:25:41 +08:00
Yuqi Wang 6522eb5f7d
Explicitly config and tune FIFO (#3977) 2019-12-06 17:27:26 +08:00
Yifan Xiong 1b9ee8134d
[Rest Server] Update default completion policy (#3972)
* Update default completion policy

Update default completion policy,
set minSucceededInstances to task instances by default.

* Update document

Update document.
2019-12-06 17:07:56 +08:00
Yuqi Wang de98a9a352
[Hived]: Support to tune FIFO (#3976) 2019-12-06 17:02:38 +08:00
Binyang2014 900546c3e6
[Error Spec] Add more error patterns (#3959)
* add new error patterns
2019-12-06 12:52:10 +08:00
Yifan Xiong 4b0346bf3c
[Rest Server] Fix duplicate affinity group name issue (#3971)
* Fix duplicate affinity group name issue

Fix duplicate affinity group name issue.
2019-12-06 11:16:01 +08:00
Yifan Xiong b3baf46743
[Rest Server] Add available resources for low priotity cluster (#3940)
* Add available resources for low priotity cluster

Add available resources for low priotity cluster.

* Update virtual cluster statistics on webportal

Update virtual cluster statistics on webportal.
2019-12-06 10:50:45 +08:00
Hanyu Zhao 965ae38e18
HiveD: record cell type in pod annotation (#3962)
* record cell type (instead of level) in pod annotation

* fix selectedNode

* refine failed reason

* minor fixes

* minor fixes
2019-12-06 09:09:29 +08:00
Yifan Xiong c3341596b2
Update examples (#3968)
Update examples:
* Upgrade built cuda version in pytorch image
* Change to python3 in pytorch examples
* Make NCCL options as a parameter in mpi
2019-12-05 18:56:13 +08:00
Binyang2014 a60bb44ff1
[Runtime] fix typo error (#3964) 2019-12-05 15:15:41 +08:00
Yifan Xiong 0ce68684fd
Shorten name to fit backend restrict (#3958)
Shorten name to fit backend restrict.
2019-12-04 12:42:00 +08:00
Yuqi Wang f22085b159
[Hived]: Expose and Refine Pod Waiting Reason (#3931) 2019-12-04 11:48:20 +08:00
Yifan Xiong 1dcb1198d9
Fix vulnerabilities in npm dependencies (#3954)
* Fix vulnerabilities in npm dependencies

Fix vulnerabilities in npm dependencies.

* Bump set-value and union-value

Bump set-value and union-value to fix vulnerabilities.
2019-12-04 11:02:49 +08:00
Yifan Xiong 7a035fa87f
Add critical priority in hived (#3949)
Add critical priority in hived.
2019-12-04 11:02:37 +08:00
AosChen 9e93908e7b
Add the Example, show the process and fix it fit the PAI. (#3934)
* Add the Example, show the process and fix it fit the PAI.

* Change the output directory.

* Word fix.
2019-12-04 10:03:26 +08:00
Binyang2014 663b06db13
[Job-exporter] Fix gpu matrix not match task-role issue (#3951)
Tested in vnext bed. After this fix, the task role metrics show correctly
Refer: NVIDIA/nvidia-docker#376
GPU minor number not match NVIDIA_VISIBLE_DEVICES sometimes
2019-12-03 17:45:09 +08:00
YundongYe b75f66aa06
[kubespray] some script to backup and recover data (#3955) 2019-12-03 16:48:03 +08:00
Yifan Xiong b34177b12d
[Rest Server] Add image pull secrets for private registry (#3943)
* Add image pull secrets for private registry

Add image pull secrets for private registry.
2019-12-03 10:39:20 +08:00
Qinzheng Sun bced6e43b9
[Web Portal] fix react hook warning (#3920) 2019-12-02 15:47:51 +08:00
yiyione 3d1488324d
[VS Code] Fix add cluster bug when connection timeout (#3929)
* add try catch to json parse error message

* add some description for cluster config
2019-12-02 14:56:26 +08:00
Mingliang Tao 9bda534aea
Remove max gpu count limit (#3927) 2019-12-02 10:39:28 +08:00
Yifan Xiong d05a8e53a9
[Rest Server] Check resource quota for gang scheduling only (#3936)
* Check resource quota for gang scheduling only

Check resource quota for gang scheduling only.

* Skip check for oppo

Skip check for oppo priority.
2019-11-30 17:08:37 +08:00
Yifan Xiong c47ca4dc01
Specify image pull policy for app container (#3933)
Specify image pull policy for app container.
2019-11-29 19:22:47 +08:00
YundongYe ee12c0f37e
[Kubespray] Disable netchecker (#3932) 2019-11-29 16:50:09 +08:00