Commit Graph

65 Commits

Author SHA1 Message Date
Yifan Xiong 66f26ac47b
Replace deprecated set-env (#32)
Replace deprecated set-env in GitHub Actions.
2020-11-25 17:20:22 +08:00
Zhenhua Han c68b48ca75
Add bibtex of the OSDI paper to README.md (#31)
* Add bibtex of the OSDI paper to README.md
2020-11-20 15:19:25 +08:00
Yifan Xiong df185eecde
Update feature demo examples (#30)
* Update feature demo examples.
* Add defaulting for `ignoreK8sSuggestedNodes`.
* Fix sort in `getUsablePhysicalCells`.
2020-08-20 15:07:58 +08:00
Yifan Xiong a8b0aa7d5b
Automatically reload changed config (#29)
Automatically reload changed config.
2020-08-12 14:06:47 +08:00
Yifan Xiong fbff5b09b4
Rename term gpu to leaf cell (#28)
* Rename gpuType/gpuNumber to skuType/skuNumber

Rename gpuType -> skuType, gpuNumber -> skuNumber.

* Rename gpu to device

Rename gpu to device when referring to affinity and index.

* Add explanation for sku type and device

Add explanation for sku type and device.

* Revert term sku and device to leaf cell

Revert term sku and device to leaf cell.

* Fix

Fix.

* Convert old spec annotations for compatibility

Convert old spec annotations for backward compatibility.

* Update README

Update README.

* Resolve comments

Resolve comments.

* Update

Update.
2020-07-27 15:54:33 +08:00
Yifan Xiong 76ed604419
Split higher level cell when allocated bad cells (#27)
* Split higher level cell when allocated bad cells

When buddy allocation failed due to bad cells,
try to split a higher level cell to get current level cells.

* Fix deletion errors

Fix deletion errors.

* Add allocate function to split higher level cells

Add allocate function to split higher level cells.

* Fix test

Fix test.

* Add test case

Add test case.

* Resolve comments

Resolve comments.

* Add free list in panic log when safety is broken

Add free list in panic log when safety is broken.

* Add test case when unable to split

Add test case when unable to split due to safety guarantee.

* Update

Update.

* Fix memory leak in removePickedGpus

Fix memory leak in `removePickedGpus`.

* Early stop safety check at current level

Early stop safety check at current level.
2020-07-15 11:06:51 +08:00
Yuqi Wang 406f379be2
Update deploy.yaml (#26) 2020-07-07 13:20:49 +08:00
Yuqi Wang 05da8ea798
Update deploy.yaml (#22) 2020-07-02 13:21:32 +08:00
Yuqi Wang c2e19c3af6
Fix deploy.yaml (#21) 2020-07-02 13:07:00 +08:00
Yuqi Wang a99db09587
Fix deploy.yaml (#20) 2020-07-02 12:33:53 +08:00
Yifan Xiong 4372c8c470
[CI/CD] Add unit test and coverage in GitHub Actions (#19)
Add unit test and coverage in GitHub Actions.
2020-06-29 11:13:10 +08:00
Hanyu Zhao a0361c07d8
Fix doomed bad cell check when allocating/releasing cells (#18)
* fix doomed bad cell check when allocating/releasing cells

* more UTs for bad nodes

* add comment for doomed bad cells

* refine comment
2020-06-24 15:10:22 +08:00
Hanyu Zhao ed0c0b802b
Bad/unsuggested node handling (#17)
* intra-VC scheduler aware of bad (non-suggested) nodes

* doomed bad cells

* backtrack cell search

* refine logging

* add unit tests

* rename free vc cells to preassigned cells

* resolve comments

* fix failedReason overwritten to empty when searching for a GPU type that a VC does not have

* add log for deleteOpporVirtualCell

* fix logger for intra-vc scheduler

* fix getLowestPriorityVirtualCell returns a doomed bad cell

* resolve comments

* add UT for lazy preemption revert
2020-06-24 09:17:46 +08:00
Hanyu Zhao d71d7708e8
Add doc for state machine (#16)
* add doc for state machines

* fix

* add title

* fix pre-allocated and preempting

* resolve comments

* adjust figure location

* move images

* resolve comments

* resolve comments

* resolve comments
2020-06-04 15:25:31 +08:00
Hanyu Zhao cb9c73d47d
Rename reserved cells to pinned cells (#14)
* rename reserved cells to pinned cells

* rename acquired to reserved

* rename reserved to pinned in yaml files and docs

* update feature demo for pinned cells

* fix typo

* fix value receiver of deleteAllocatedAffinityGroup

* fix typo in readme

* fix ambiguous naming about reserved cells

* BeingReserved -> Reserved
2020-04-27 14:39:42 +08:00
Hanyu Zhao fad90121d1
refine chain search (#13)
* fix bug in early stop chain

* keep searching the chains until placement within suggested nodes

* refine logging

* refactor h.Schedule() which was too long

* fix virtual cell's healthiness

* refine suggested nodes related logic

* resolve comments

* readme
2020-04-07 10:51:38 +08:00
Hanyu Zhao be47ddb5ac
Stateful Preemption (#8)
* refactor package algorithm

* stateful preemption

* add unit tests

* fix bad cell tracking & cell state tracking

* deleteUnallocatedPod

* move cancellation of ongoing preemption out of mapVirtualPlacementToPhysical (should be clearer)

* early exit a chain when a VC does not have it

* refine comments

* resolve comments

* refine comment

* expose AG status & fix bad cell tracking when virtual cell is partially bound

* resolve comments

* two-phase scheduling

* resolve comments

* disable preemption in filtering phase & random in getFewestOpproCell

* fix UT

* refine log
2020-03-30 19:48:17 +08:00
Yuqi Wang 2b2f96a54e
Refine K8S events and docs (#12) 2020-03-30 19:37:18 +08:00
Yuqi Wang 2ba822ec37
Still schedule on preemptRoutine in case filterRoutine is not called by K8S (#11)
* Still schedule on preemptRoutine in case filterRoutine is not called by K8S

* Refine
2020-03-27 15:20:25 +08:00
Yuqi Wang 8c41c85184
Refine Config Doc (#10) 2020-03-25 14:08:16 +08:00
Yuqi Wang 7015e38995
Refine Doc (#9) 2020-03-24 11:17:38 +08:00
Yuqi Wang b101a1e86c
Deliver unallocated Pod events to SchedulerAlgorithm (#7) 2020-03-23 21:23:55 +08:00
Yuqi Wang 085063d400
Merge pull request #6 from microsoft/yqwang/syncPAI
Move Hived from OpenPAI to dedicated repo: Final Part
2020-03-23 20:02:16 +08:00
Yuqi Wang 0eb7e8d870 Resolve Conflicts 2020-03-23 10:03:14 +00:00
Yuqi Wang 964865de87 [Hived]: Add Feature Demo Doc (#4235)
* [Hived]: Add Feature Demo Doc

* [Hived]: Add Feature Demo Doc

* [HiveD] refine doc (#4241)

* refine readme.md

* refine readme.md

* minor

* resolve comments

* refine feature demo

* Refine Feature Demo Doc (#4309)

* Refine

Co-authored-by: Hanyu Zhao <zhaohanyu@pku.edu.cn>
2020-03-20 20:23:34 +08:00
Hanyu Zhao a18b49d736 [HiveD] add log for suggested nodes (#4262)
* add log for suggested nodes

* add log for suggested nodes

* add log for suggested nodes

* refine log

* refine log
2020-03-17 14:24:26 +08:00
Hanyu Zhao 521f871035 [HiveD] expose vc capacity (#4153)
* expose vc capacity

* fix failure of running cells

* json representation of cluster status

* add oppor cells to vc

* minor fixes

* add node lister to watch nodes

* minor fixes

* minor fixes

* minor

* refine cell addressing

* quota headroom track (wip)

* safety check

* fix bug in mapNonPreassignedCellToVirtual

* track bad cells

* minor refinements

* resolve comments

* resolve comments

* resolve comments & fix track bad cells (only need to track bad free cells)

* resolve comments
2020-03-16 21:26:30 +08:00
Hanyu Zhao 7e302c6c08 [HiveD] VC aware of suggested nodes (#4268)
* vc aware of suggested nodes

* vc aware of suggested nodes

* suggested

* add UT

* fix comments

* minor fix

* revert vc aware suggested nodes

* fix UT
2020-03-16 13:07:36 +08:00
Yuqi Wang 230f7471a4 [Hived]: Filter on full SuggestedNodes (#4271) 2020-03-10 18:36:48 +08:00
Hanyu Zhao 170f853a32 [HiveD] check suggested nodes for all pods (#4251)
* check suggested nodes for all pods

* check suggested nodes for all pods

* fix log message

* minor fix
2020-03-09 17:30:26 +08:00
Yuqi Wang 378ceec2a0
Add NOTICE file for third party pkgs (#4) 2020-03-04 17:10:08 +08:00
Yuqi Wang 4e97096a8f
Add Docs (#3) 2020-02-06 13:53:08 +08:00
Yuqi Wang e5c98dd763 [Hived]: Add overview doc (#4171) 2020-02-06 11:42:41 +08:00
Yuqi Wang f34577041b [Hived]: Aware Deletion and Add embedded inside one Update event (#4135) 2020-01-13 13:57:16 +08:00
Yuqi Wang e4f58e441b [Hived]: Refine config quick start docs (#4102) 2020-01-03 17:41:50 +08:00
Yuqi Wang c7236c6ca6 [Hived]: Add config quick start docs (#4099) 2020-01-03 12:48:29 +08:00
Yuqi Wang 689cf909bc
Init dedicated project (#2) 2019-12-30 20:33:17 +08:00
Yuqi Wang 5a0777039e
Move Hived from OpenPAI to dedicated repo
Move Hived from OpenPAI to dedicated repo
2019-12-30 20:05:51 +08:00
Yuqi Wang 2d7f8aaab3 Resolve Conflicts 2019-12-30 09:30:57 +00:00
Microsoft Open Source 7cfa7b6965 Initial SECURITY.md commit 2019-12-29 21:57:42 -08:00
Microsoft Open Source bceb4dcdb9 Updating README.md to template content 2019-12-29 21:57:41 -08:00
Microsoft Open Source ab01db4d52 Updating LICENSE to template content 2019-12-29 21:57:39 -08:00
Microsoft Open Source 535f1a77a3 Initial CODE_OF_CONDUCT.md commit 2019-12-29 21:57:38 -08:00
Yuqi Wang ef1784fd5d
Initial commit 2019-12-30 11:37:42 +08:00
Hanyu Zhao 8f094b4d8f [HiveD] fix pod hanging after reconfiguration (#4003) 2019-12-19 09:47:28 +08:00
Hanyu Zhao 0c59d5b296 [HiveD] downgrade pod when PreassignedCellTypes is empty (#3983) 2019-12-09 14:14:35 +08:00
Hanyu Zhao 88fa2bfbeb [HiveD] support partial release of affinity group (#3978) 2019-12-06 19:25:41 +08:00
Yuqi Wang d424cbcea2 Explicitly config and tune FIFO (#3977) 2019-12-06 17:27:26 +08:00
Yuqi Wang 4ea7c05643 [Hived]: Support to tune FIFO (#3976) 2019-12-06 17:02:38 +08:00
Hanyu Zhao f10d301966 HiveD: record cell type in pod annotation (#3962)
* record cell type (instead of level) in pod annotation

* fix selectedNode

* refine failed reason

* minor fixes

* minor fixes
2019-12-06 09:09:29 +08:00