Commit Graph

58 Commits

Author SHA1 Message Date
Yuqi Wang c09b27b53a
Update deploy.yaml 2020-07-02 13:15:17 +08:00
Yuqi Wang c2e19c3af6
Fix deploy.yaml (#21) 2020-07-02 13:07:00 +08:00
Yuqi Wang a99db09587
Fix deploy.yaml (#20) 2020-07-02 12:33:53 +08:00
Yifan Xiong 4372c8c470
[CI/CD] Add unit test and coverage in GitHub Actions (#19)
Add unit test and coverage in GitHub Actions.
2020-06-29 11:13:10 +08:00
Hanyu Zhao a0361c07d8
Fix doomed bad cell check when allocating/releasing cells (#18)
* fix doomed bad cell check when allocating/releasing cells

* more UTs for bad nodes

* add comment for doomed bad cells

* refine comment
2020-06-24 15:10:22 +08:00
Hanyu Zhao ed0c0b802b
Bad/unsuggested node handling (#17)
* intra-VC scheduler aware of bad (non-suggested nodes)

* doomed bad cells

* backtrack cell search

* refine logging

* add unit tests

* rename free vc cells to preassigned cells

* resolve comments

* fix failedReason overwritten to empty when searching a GPU type a VC not have

* add log for deleteOpporVirtualCell

* fix logger for intra-vc scheduler

* fix getLowestPriorityVirtualCell returns a doomed bad cell

* resolve comments

* add UT for lazy preemption revert
2020-06-24 09:17:46 +08:00
Hanyu Zhao d71d7708e8
Add doc for state machine (#16)
* add doc for state machines

* fix

* add title

* fix pre-allocated and preempting

* resolve comments

* adjust figure location

* move images

* resolve comments

* resolve comments

* resolve comments
2020-06-04 15:25:31 +08:00
Hanyu Zhao cb9c73d47d
Rename reserved cells to pinned cells (#14)
* rename reserved cells to pinned cells

* rename acquired to reserved

* rename reserved to pinned in yaml files and docs

* update feature demo for pinned cells

* fix typo

* fix value receiver of deleteAllocatedAffinityGroup

* fix typo in readme

* fix ambiguous naming about reserved cells

* BeingReserved -> Reserved
2020-04-27 14:39:42 +08:00
Hanyu Zhao fad90121d1
refine chain search (#13)
* fix bug in early stop chain

* keep searching the chains until placement within suggested nodes

* refine logging

* refactor h.Schedule() which was too long

* fix virtual cell's healthiness

* refine suggested nodes related logic

* resolve comments

* readme
2020-04-07 10:51:38 +08:00
Hanyu Zhao be47ddb5ac
Stateful Preemption (#8)
* refactor package algorithm

* stateful preemption

* add unit tests

* fix bad cell tracking & cell state tracking

* deleteUnallocatedPod

* move cancellation of ongoing preemption out of mapVirtualPlacementToPhysical (should be clearer)

* early exit a chain when a VC does not have it

* refine comments

* resolve comments

* refine comment

* expose AG status & fix bad cell tracking when virtual cell is partially bound

* resolve comments

* two-phase scheduling

* resolve comments

* disable preemption in filtering phase & random in getFewestOpproCell

* fix UT

* refine log
2020-03-30 19:48:17 +08:00
Yuqi Wang 2b2f96a54e
Refine K8S events and docs (#12) 2020-03-30 19:37:18 +08:00
Yuqi Wang 2ba822ec37
Still schedule on preemptRoutine in case filterRoutine is not called by K8S (#11)
* Still schedule on preemptRoutine in case filterRoutine is not called by K8S

* Refine
2020-03-27 15:20:25 +08:00
Yuqi Wang 8c41c85184
Refine Config Doc (#10) 2020-03-25 14:08:16 +08:00
Yuqi Wang 7015e38995
Refine Doc (#9) 2020-03-24 11:17:38 +08:00
Yuqi Wang b101a1e86c
Deliver unallocated Pod events to SchedulerAlgorithm (#7) 2020-03-23 21:23:55 +08:00
Yuqi Wang 085063d400
Merge pull request #6 from microsoft/yqwang/syncPAI
Move Hived from OpenPAI to dedicated repo: Final Part
2020-03-23 20:02:16 +08:00
Yuqi Wang 0eb7e8d870 Resolve Conflicts 2020-03-23 10:03:14 +00:00
Yuqi Wang 964865de87 [Hived]: Add Feature Demo Doc (#4235)
* [Hived]: Add Feature Demo Doc

* [Hived]: Add Feature Demo Doc

* [HiveD] refine doc (#4241)

* refine readme.md

* refine readme.md

* minor

* resolve comments

* refine feature demo

* Refine Feature Demo Doc (#4309)

* Refine

Co-authored-by: Hanyu Zhao <zhaohanyu@pku.edu.cn>
2020-03-20 20:23:34 +08:00
Hanyu Zhao a18b49d736 [HiveD] add log for suggested nodes (#4262)
* add log for suggested nodes

* add log for suggested nodes

* add log for suggested nodes

* refine log

* refine log
2020-03-17 14:24:26 +08:00
Hanyu Zhao 521f871035 [HiveD] expose vc capacity (#4153)
* expose vc capacity

* fix failure of running cells

* json representation of cluster status

* add oppor cells to vc

* minor fixes

* add node lister to watch nodes

* minor fixes

* minor fixes

* minor

* refine cell addressing

* quota headroom track (wip)

* safety check

* fix bug in mapNonPreassignedCellToVirtual

* track bad cells

* minor refinements

* resolve comments

* resolve comments

* resolve comments & fix track bad cells (only need to track bad free cells)

* resolve comments
2020-03-16 21:26:30 +08:00
Hanyu Zhao 7e302c6c08 [HiveD] VC aware of suggested nodes (#4268)
* vc aware of suggested nodes

* vc aware of suggested nodes

* suggested

* add UT

* fix comments

* minor fix

* revert vc aware suggested nodes

* fix UT
2020-03-16 13:07:36 +08:00
Yuqi Wang 230f7471a4 [Hived]: Filter on full SuggestedNodes (#4271) 2020-03-10 18:36:48 +08:00
Hanyu Zhao 170f853a32 [HiveD] check suggested nodes for all pods (#4251)
* check suggested nodes for all pods

* check suggested nodes for all pods

* fix log message

* minor fix
2020-03-09 17:30:26 +08:00
Yuqi Wang 378ceec2a0
Add NOTICE file for third party pkgs (#4) 2020-03-04 17:10:08 +08:00
Yuqi Wang 4e97096a8f
Add Docs (#3) 2020-02-06 13:53:08 +08:00
Yuqi Wang e5c98dd763 [Hived]: Add overview doc (#4171) 2020-02-06 11:42:41 +08:00
Yuqi Wang f34577041b [Hived]: Aware Deletion and Add embedded inside one Update event (#4135) 2020-01-13 13:57:16 +08:00
Yuqi Wang e4f58e441b [Hived]: Refine config quick start docs (#4102) 2020-01-03 17:41:50 +08:00
Yuqi Wang c7236c6ca6 [Hived]: Add config quick start docs (#4099) 2020-01-03 12:48:29 +08:00
Yuqi Wang 689cf909bc
Init dedicated project (#2) 2019-12-30 20:33:17 +08:00
Yuqi Wang 5a0777039e
Move Hived from OpenPAI to dedicated repo
Move Hived from OpenPAI to dedicated repo
2019-12-30 20:05:51 +08:00
Yuqi Wang 2d7f8aaab3 Resolve Conflicts 2019-12-30 09:30:57 +00:00
Microsoft Open Source 7cfa7b6965 Initial SECURITY.md commit 2019-12-29 21:57:42 -08:00
Microsoft Open Source bceb4dcdb9 Updating README.md to template content 2019-12-29 21:57:41 -08:00
Microsoft Open Source ab01db4d52 Updating LICENSE to template content 2019-12-29 21:57:39 -08:00
Microsoft Open Source 535f1a77a3 Initial CODE_OF_CONDUCT.md commit 2019-12-29 21:57:38 -08:00
Yuqi Wang ef1784fd5d
Initial commit 2019-12-30 11:37:42 +08:00
Hanyu Zhao 8f094b4d8f [HiveD] fix pod hanging after reconfiguration (#4003) 2019-12-19 09:47:28 +08:00
Hanyu Zhao 0c59d5b296 [HiveD] downgrade pod when PreassignedCellTypes is empty (#3983) 2019-12-09 14:14:35 +08:00
Hanyu Zhao 88fa2bfbeb [HiveD] support partial release of affinity group (#3978) 2019-12-06 19:25:41 +08:00
Yuqi Wang d424cbcea2 Explicitly config and tune FIFO (#3977) 2019-12-06 17:27:26 +08:00
Yuqi Wang 4ea7c05643 [Hived]: Support to tune FIFO (#3976) 2019-12-06 17:02:38 +08:00
Hanyu Zhao f10d301966 HiveD: record cell type in pod annotation (#3962)
* record cell type (instead of level) in pod annotation

* fix selectedNode

* refine failed reason

* minor fixes

* minor fixes
2019-12-06 09:09:29 +08:00
Yuqi Wang b25f9256fe [Hived]: Expose and Refine Pod Waiting Reason (#3931) 2019-12-04 11:48:20 +08:00
Yuqi Wang 8107225cc1 [Hived]: Disable leader election (#3928) 2019-11-28 16:08:35 +08:00
Yuqi Wang f16f52ac52 [Hived]: Expose LazyPreemptionStatus (#3917) 2019-11-28 15:11:24 +08:00
Hanyu Zhao 619401f782 HiveD: fix bug in initial assignment validation (#3908)
* fix bug in initial assignment validation

* add tests for quota validation and lazy preemption
2019-11-27 09:53:04 +08:00
Yuqi Wang da6bce75e4 [Hived]: Fix user specified priority may conflict with internal reserved priority (#3893) 2019-11-25 17:25:49 +08:00
Hanyu Zhao 1c1a1f60b0 add flag to turn on/off lazy preemption (#3891) 2019-11-25 16:38:22 +08:00
Hanyu Zhao 1f5c1c4d2f HiveD intra-vc preemption for restart (#3861)
* intra-vc preemption when adding allocated pods (restart)
* fix overwriting pod numbers for group members with same gpu number
2019-11-18 19:15:36 +08:00