Yuqi Wang
c09b27b53a
Update deploy.yaml
2020-07-02 13:15:17 +08:00
Yuqi Wang
c2e19c3af6
Fix deploy.yaml (#21)
2020-07-02 13:07:00 +08:00
Yuqi Wang
a99db09587
Fix deploy.yaml (#20)
2020-07-02 12:33:53 +08:00
Yifan Xiong
4372c8c470
[CI/CD] Add unit test and coverage in GitHub Actions (#19)
Add unit test and coverage in GitHub Actions.
2020-06-29 11:13:10 +08:00
Hanyu Zhao
a0361c07d8
Fix doomed bad cell check when allocating/releasing cells (#18)
* fix doomed bad cell check when allocating/releasing cells
* more UTs for bad nodes
* add comment for doomed bad cells
* refine comment
2020-06-24 15:10:22 +08:00
Hanyu Zhao
ed0c0b802b
Bad/unsuggested node handling (#17)
* make intra-VC scheduler aware of bad (non-suggested) nodes
* doomed bad cells
* backtrack cell search
* refine logging
* add unit tests
* rename free vc cells to preassigned cells
* resolve comments
* fix failedReason being overwritten to empty when searching for a GPU type that a VC does not have
* add log for deleteOpporVirtualCell
* fix logger for intra-vc scheduler
* fix getLowestPriorityVirtualCell returning a doomed bad cell
* resolve comments
* add UT for lazy preemption revert
2020-06-24 09:17:46 +08:00
Hanyu Zhao
d71d7708e8
Add doc for state machine (#16)
* add doc for state machines
* fix
* add title
* fix pre-allocated and preempting
* resolve comments
* adjust figure location
* move images
* resolve comments
* resolve comments
* resolve comments
2020-06-04 15:25:31 +08:00
Hanyu Zhao
cb9c73d47d
Rename reserved cells to pinned cells (#14)
* rename reserved cells to pinned cells
* rename acquired to reserved
* rename reserved to pinned in yaml files and docs
* update feature demo for pinned cells
* fix typo
* fix value receiver of deleteAllocatedAffinityGroup
* fix typo in readme
* fix ambiguous naming about reserved cells
* BeingReserved -> Reserved
2020-04-27 14:39:42 +08:00
Hanyu Zhao
fad90121d1
refine chain search (#13)
* fix bug in early stop chain
* keep searching the chains until placement within suggested nodes
* refine logging
* refactor h.Schedule() which was too long
* fix virtual cell's healthiness
* refine suggested nodes related logic
* resolve comments
* readme
2020-04-07 10:51:38 +08:00
Hanyu Zhao
be47ddb5ac
Stateful Preemption (#8)
* refactor package algorithm
* stateful preemption
* add unit tests
* fix bad cell tracking & cell state tracking
* deleteUnallocatedPod
* move cancellation of ongoing preemption out of mapVirtualPlacementToPhysical (should be clearer)
* early exit a chain when a VC does not have it
* refine comments
* resolve comments
* refine comment
* expose AG status & fix bad cell tracking when virtual cell is partially bound
* resolve comments
* two-phase scheduling
* resolve comments
* disable preemption in filtering phase & random in getFewestOpproCell
* fix UT
* refine log
2020-03-30 19:48:17 +08:00
Yuqi Wang
2b2f96a54e
Refine K8S events and docs (#12)
2020-03-30 19:37:18 +08:00
Yuqi Wang
2ba822ec37
Still schedule on preemptRoutine in case filterRoutine is not called by K8S (#11)
* Still schedule on preemptRoutine in case filterRoutine is not called by K8S
* Refine
2020-03-27 15:20:25 +08:00
Yuqi Wang
8c41c85184
Refine Config Doc (#10)
2020-03-25 14:08:16 +08:00
Yuqi Wang
7015e38995
Refine Doc (#9)
2020-03-24 11:17:38 +08:00
Yuqi Wang
b101a1e86c
Deliver unallocated Pod events to SchedulerAlgorithm (#7)
2020-03-23 21:23:55 +08:00
Yuqi Wang
085063d400
Merge pull request #6 from microsoft/yqwang/syncPAI
Move Hived from OpenPAI to dedicated repo: Final Part
2020-03-23 20:02:16 +08:00
Yuqi Wang
0eb7e8d870
Resolve Conflicts
2020-03-23 10:03:14 +00:00
Yuqi Wang
964865de87
[Hived]: Add Feature Demo Doc (#4235)
* [Hived]: Add Feature Demo Doc
* [Hived]: Add Feature Demo Doc
* [HiveD] refine doc (#4241)
* refine readme.md
* refine readme.md
* minor
* resolve comments
* refine feature demo
* Refine Feature Demo Doc (#4309)
* Refine
Co-authored-by: Hanyu Zhao <zhaohanyu@pku.edu.cn>
2020-03-20 20:23:34 +08:00
Hanyu Zhao
a18b49d736
[HiveD] add log for suggested nodes (#4262)
* add log for suggested nodes
* add log for suggested nodes
* add log for suggested nodes
* refine log
* refine log
2020-03-17 14:24:26 +08:00
Hanyu Zhao
521f871035
[HiveD] expose vc capacity (#4153)
* expose vc capacity
* fix failure of running cells
* json representation of cluster status
* add oppor cells to vc
* minor fixes
* add node lister to watch nodes
* minor fixes
* minor fixes
* minor
* refine cell addressing
* quota headroom track (wip)
* safety check
* fix bug in mapNonPreassignedCellToVirtual
* track bad cells
* minor refinements
* resolve comments
* resolve comments
* resolve comments & fix track bad cells (only need to track bad free cells)
* resolve comments
2020-03-16 21:26:30 +08:00
Hanyu Zhao
7e302c6c08
[HiveD] VC aware of suggested nodes (#4268)
* vc aware of suggested nodes
* vc aware of suggested nodes
* suggested
* add UT
* fix comments
* minor fix
* revert vc aware suggested nodes
* fix UT
2020-03-16 13:07:36 +08:00
Yuqi Wang
230f7471a4
[Hived]: Filter on full SuggestedNodes (#4271)
2020-03-10 18:36:48 +08:00
Hanyu Zhao
170f853a32
[HiveD] check suggested nodes for all pods (#4251)
* check suggested nodes for all pods
* check suggested nodes for all pods
* fix log message
* minor fix
2020-03-09 17:30:26 +08:00
Yuqi Wang
378ceec2a0
Add NOTICE file for third party pkgs (#4)
2020-03-04 17:10:08 +08:00
Yuqi Wang
4e97096a8f
Add Docs (#3)
2020-02-06 13:53:08 +08:00
Yuqi Wang
e5c98dd763
[Hived]: Add overview doc (#4171)
2020-02-06 11:42:41 +08:00
Yuqi Wang
f34577041b
[Hived]: Aware Deletion and Add embedded inside one Update event (#4135)
2020-01-13 13:57:16 +08:00
Yuqi Wang
e4f58e441b
[Hived]: Refine config quick start docs (#4102)
2020-01-03 17:41:50 +08:00
Yuqi Wang
c7236c6ca6
[Hived]: Add config quick start docs (#4099)
2020-01-03 12:48:29 +08:00
Yuqi Wang
689cf909bc
Init dedicated project (#2)
2019-12-30 20:33:17 +08:00
Yuqi Wang
5a0777039e
Move Hived from OpenPAI to dedicated repo
2019-12-30 20:05:51 +08:00
Yuqi Wang
2d7f8aaab3
Resolve Conflicts
2019-12-30 09:30:57 +00:00
Microsoft Open Source
7cfa7b6965
Initial SECURITY.md commit
2019-12-29 21:57:42 -08:00
Microsoft Open Source
bceb4dcdb9
Updating README.md to template content
2019-12-29 21:57:41 -08:00
Microsoft Open Source
ab01db4d52
Updating LICENSE to template content
2019-12-29 21:57:39 -08:00
Microsoft Open Source
535f1a77a3
Initial CODE_OF_CONDUCT.md commit
2019-12-29 21:57:38 -08:00
Yuqi Wang
ef1784fd5d
Initial commit
2019-12-30 11:37:42 +08:00
Hanyu Zhao
8f094b4d8f
[HiveD] fix pod hanging after reconfiguration (#4003)
2019-12-19 09:47:28 +08:00
Hanyu Zhao
0c59d5b296
[HiveD] downgrade pod when PreassignedCellTypes is empty (#3983)
2019-12-09 14:14:35 +08:00
Hanyu Zhao
88fa2bfbeb
[HiveD] support partial release of affinity group (#3978)
2019-12-06 19:25:41 +08:00
Yuqi Wang
d424cbcea2
Explicitly config and tune FIFO (#3977)
2019-12-06 17:27:26 +08:00
Yuqi Wang
4ea7c05643
[Hived]: Support to tune FIFO (#3976)
2019-12-06 17:02:38 +08:00
Hanyu Zhao
f10d301966
HiveD: record cell type in pod annotation (#3962)
* record cell type (instead of level) in pod annotation
* fix selectedNode
* refine failed reason
* minor fixes
* minor fixes
2019-12-06 09:09:29 +08:00
Yuqi Wang
b25f9256fe
[Hived]: Expose and Refine Pod Waiting Reason (#3931)
2019-12-04 11:48:20 +08:00
Yuqi Wang
8107225cc1
[Hived]: Disable leader election (#3928)
2019-11-28 16:08:35 +08:00
Yuqi Wang
f16f52ac52
[Hived]: Expose LazyPreemptionStatus (#3917)
2019-11-28 15:11:24 +08:00
Hanyu Zhao
619401f782
HiveD: fix bug in initial assignment validation (#3908)
* fix bug in initial assignment validation
* add tests for quota validation and lazy preemption
2019-11-27 09:53:04 +08:00
Yuqi Wang
da6bce75e4
[Hived]: Fix user-specified priority possibly conflicting with internal reserved priority (#3893)
2019-11-25 17:25:49 +08:00
Hanyu Zhao
1c1a1f60b0
add flag to turn on/off lazy preemption (#3891)
2019-11-25 16:38:22 +08:00
Hanyu Zhao
1f5c1c4d2f
HiveD intra-vc preemption for restart (#3861)
* intra-vc preemption when adding allocated pods (restart)
* fix overwriting pod numbers for group members with same gpu number
2019-11-18 19:15:36 +08:00