* Rename gpuType/gpuNumber to skuType/skuNumber

Rename gpuType -> skuType, gpuNumber -> skuNumber.

* Rename gpu to device

Rename gpu to device when referring to affinity and index.

* Add explanation for sku type and device

Add explanation for sku type and device.

* Revert terms sku and device to leaf cell

Revert terms sku and device to leaf cell.

* Fix

Fix.

* Convert old spec annotations for compatibility

Convert old spec annotations for backward compatibility.

* Update README

Update README.

* Resolve comments

Resolve comments.

* Update

Update.
This commit is contained in:
Yifan Xiong 2020-07-27 15:54:33 +08:00 committed by GitHub
Parent 76ed604419
Commit fbff5b09b4
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
55 changed files: 1047 additions and 996 deletions

View file

@ -20,7 +20,7 @@ The killer feature that distinguishes HiveD is that it provides resource guarant
HiveD protects VCs' resources in terms of **cell**, a user-defined resource type that encodes both the quantity and other kinds of information, such as topology and hardware type. In the above example, a user can define a cell type of 8-GPU node, and the VC can be assigned one such cell. Then, HiveD will ensure that *there is always one 8-GPU node available for the VC*, regardless of the other workloads in the cluster.
HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different GPU models, or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them.
HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different device models (e.g., NVIDIA V100 GPU, AMD Radeon MI100 GPU, Cloud TPU v3), or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them.
### [Gang Scheduling](example/feature/README.md#Gang-Scheduling)
@ -34,8 +34,8 @@ HiveD supports multiple job **priorities**. Higher-priority jobs can **[preempt]
## Feature
1. [Multi-Tenancy: Virtual Cluster (VC)](example/feature/README.md#VC-Safety)
2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible GPU Types](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.
2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible Hardware Types](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.
4. Optimized Resource Fragmentation and Less Starvation
5. [Priorities](example/feature/README.md#Guaranteed-Job), [Overuse with Low Priority](example/feature/README.md#Opportunistic-Job), and [Inter-](example/feature/README.md#Inter-VC-Preemption)/[Intra-VC Preemption](example/feature/README.md#Intra-VC-Preemption)
6. [Job (Full/Partial) Gang Scheduling/Preemption](example/feature/README.md#Gang-Scheduling)

View file

@ -68,7 +68,7 @@ For all cells currently associated with other AGs:
`Used` (by other AGs) -> `Reserving` (by this AG) (e<sub>2</sub> in cell state machine);
`Reserving`/`Reserved` (by other AGs) -> `Reserving`/`Reserved` (by this AG) (e<sub>3</sub>/e<sub>6</sub> in cell state machine);
`Reserving`/`Reserved` (by other AGs) -> `Reserving`/`Reserved` (by this AG) (e<sub>3</sub>/e<sub>6</sub> in cell state machine);
For free cells:
@ -94,7 +94,7 @@ __e<sub>4</sub>__:
Condition: all pods of this AG are deleted.
Operation:
Operation:
all cells `Used` (by this AG) -> `Free` (e<sub>1</sub> in cell state machine).
__e<sub>5</sub>__:
@ -118,7 +118,7 @@ __e<sub>7</sub>__:
Condition: all pods of this AG are deleted.
Operation:
Operation:
All the `Reserving` cells (by this AG) -> `Used` (by the `Being preempted` AG currently associated with the cell) (e<sub>4</sub> in cell state machine).
@ -132,7 +132,7 @@ Operation: none.
## Cell State Machine
Cell is the resource unit in HiveD. The figure below shows the state machine of cell. Note that here cells are _lowest-level physical cells_, e.g., single-GPU cells in typical configs (we record states only in these cells).
Cell is the resource unit in HiveD. The figure below shows the state machine of cell. Note that here cells are _lowest-level physical cells_, e.g., leaf cells in typical configs (we record states only in these cells).
<p style="text-align: center;">
<img src="img/cell-state-machine.png" title="cell" alt="cell" width="70%"/>
@ -188,7 +188,7 @@ __e<sub>2</sub>__:
Condition: triggered by another AG from `Pending` to `Preempting` (i.e., that AG is preempting the `Allocated` AG currently associated with this cell) (e<sub>1</sub> in AG state machine).
Operation:
Operation:
The `Allocated` AG on this cell -> `Being preempted` (e<sub>6</sub> in AG state machine);
@ -236,7 +236,7 @@ __e<sub>8</sub>__:
Condition: triggered by (i) there is currently a `Preempting` AG on this cell but another `Allocated` AG is now associated with the cell (e<sub>0</sub> in AG state machine); OR (ii) the `Preempting` AG currently associated with this cell transitions to `Allocated` (e<sub>2</sub> in AG state machine).
Operation:
Operation:
For (i): the `Preempting` AG on this cell -> `Pending` (e<sub>5</sub> in AG state machine); release the cell and then allocate it to the new `Allocated` AG.

View file

@ -2,6 +2,7 @@
## <a name="Index">Index</a>
- [Config](#Config)
- [Scheduling GPUs](#Scheduling-GPUs)
## <a name="Config">Config</a>
### <a name="ConfigQuickStart">Config QuickStart</a>
@ -14,7 +15,6 @@
Notes:
1. It is like the [Azure VM Series](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu) or [GCP Machine Types](https://cloud.google.com/compute/docs/machine-types).
2. Currently, `skuTypes` is not directly used by HivedScheduler, but it is used by [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server) to set up proportional Pod resource requests and limits. So, if you are not using [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server), you can skip configuring it.
3. It was previously known as `gpuTypes`, and we are in the process of renaming it to `skuTypes`, as HiveD is only aware of the abstract `cell` concept, not the concrete hardware that the `cell` represents. A minimal `skuTypes` entry is sketched below.
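For reference, a minimal `skuTypes` entry looks like the following sketch, which mirrors the hived-config examples appearing later in this diff (values are illustrative only):

```yaml
physicalCluster:
  skuTypes:
    K80:        # each K80 leaf cell is paired with 1 GPU and 4 CPUs
      gpu: 1
      cpu: 4
```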
**Example:**
@ -117,7 +117,7 @@
5. Put it together
**Example:**
Finally, after the above steps, your config would be:
```yaml
physicalCluster:
@ -155,3 +155,37 @@
### <a name="ConfigDetail">Config Detail</a>
[Detail Example](../example/config)
## <a name="Scheduling-GPUs">Scheduling GPUs</a>
To leverage this scheduler to schedule GPUs, a container in the Pod that wants to use the GPUs allocated to the whole Pod
should contain one of the environment variables below:
* NVIDIA GPUs
```yaml
env:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
```
The scheduler directly delivers the GPU isolation decision to [nvidia-container-runtime](https://github.com/NVIDIA/nvidia-container-runtime)
through the Pod env `NVIDIA_VISIBLE_DEVICES`.
* AMD GPUs
```yaml
env:
- name: AMD_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
```
The scheduler directly delivers the GPU isolation decision to [rocm-container-runtime](https://github.com/abuccts/rocm-container-runtime)
through the Pod env `AMD_VISIBLE_DEVICES`.
The annotation referred to by the env will be populated by the scheduler when it binds the pod.
If multiple containers in the Pod contain the env, the allocated GPUs are all visible to them,
so it is up to these containers to decide how to share the GPUs.
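For illustration, assuming the isolation annotation carries the allocated leaf cell (GPU) indices as a comma-separated list, a bound pod might look like the following sketch (the value format is an assumption, not taken from this diff):

```yaml
# Hypothetical bound-pod excerpt: the scheduler has populated the annotation.
metadata:
  annotations:
    hivedscheduler.microsoft.com/pod-leaf-cell-isolation: "0,1"
# Through the fieldRef above, the container then sees NVIDIA_VISIBLE_DEVICES=0,1
# (or AMD_VISIBLE_DEVICES=0,1), so the container runtime exposes only those GPUs.
```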

View file

@ -11,7 +11,7 @@ kubeApiServerAddress: http://10.10.10.10:8080
#
# Constrains:
# 1. All cellTypes should form a forest, i.e. a disjoint union of trees.
# 2. All physicalCells should contain at most one physical specific GPU.
# 2. All physicalCells should contain at most one physical specific device.
# 3. Each physicalCell should contain exactly one node level cellType.
# 4. Each physicalCell should specify full hierarchies defined by its cellType.
# 5. A pinnedCellId should be able to universally locate one physicalCell.
@ -24,7 +24,7 @@ kubeApiServerAddress: http://10.10.10.10:8080
################################################################################
physicalCluster:
# Define the cell structures.
# Each leaf cellType contains a single GPU and also defines a gpuType of the
# Each leaf cellType contains a single device and also defines a leafCellType of the
# same name.
cellTypes:
#######################################
@ -35,8 +35,8 @@ physicalCluster:
childCellType: CT1
# Specify how many child cells it contains.
childCellNumber: 2
# Specify whether it is a node level cellType, i.e. contains all GPUs of
# its corresponding gpuType within one node and only contains these GPUs.
# Specify whether it is a node level cellType, i.e. contains all leaf cells of
# its corresponding leafCellType within one node and only contains these leaf cells.
# Defaults to false.
isNodeLevel: true
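Putting the fields above together, a minimal node-level cellType sketch (reusing the CT1/CT1-NODE names from this config; CT1 itself is a leaf cellType and needs no entry) could be:

```yaml
cellTypes:
  CT1-NODE:             # a node grouping 2 CT1 leaf cells (devices)
    childCellType: CT1
    childCellNumber: 2
    isNodeLevel: true
```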
@ -149,15 +149,15 @@ physicalCluster:
cellAddress: 0.0.0.0
- cellType: CT1-NODE
cellAddress: 0.0.0.1
# One node has multiple gpu types and
# non-standard gpu indices (by explicitly specifying cell addresses)
# One node has multiple leaf cell types and
# non-standard leaf cell indices (by explicitly specifying cell addresses)
- cellType: CT1-NODE
cellAddress: 1.0.0.2 # NODE Name
cellChildren:
- cellAddress: 8 # GPU Index
pinnedCellId: VC1-YQW-CT1
- cellAddress: 9 # GPU Index
# One cell has non-standard gpu indices
# One cell has non-standard leaf cell indices
- cellType: 3-DGX1-P100-NODE
cellChildren:
# cellAddress can be omitted for non-node level cellType, which defaults to

View file

@ -9,7 +9,7 @@
HiveD guarantees **quota safety for all VCs**, in the sense that the requests to cells defined in each VC can always be satisfied.
VC's cells can be described by Hardware Quantity, [Topology](#VC-Safety), [Type](#GPU-Type), [Pinned Cells](#Pinned-Cells), etc. To guarantee safety, HiveD never allows a VC to "invade" other VCs' cells. For example, to guarantee all VCs' topology, one VC's [guaranteed jobs](#Guaranteed-Job) should never make fragmentation inside other VCs:
VC's cells can be described by Hardware Quantity, [Topology](#VC-Safety), [Type](#SKU-Type), [Pinned Cells](#Pinned-Cells), etc. To guarantee safety, HiveD never allows a VC to "invade" other VCs' cells. For example, to guarantee all VCs' topology, one VC's [guaranteed jobs](#Guaranteed-Job) should never make fragmentation inside other VCs:
Consider two DGX-2s and two VCs, each owning one DGX-2 node. For a traditional scheduler, this translates into two VCs each owning 16 GPUs. When a user submits 16 1-GPU jobs to VC1, the user in VC2 might not be able to run a 16-GPU job, due to a possible fragmentation issue caused by VC1. HiveD, in contrast, guarantees that each VC always has one entire node available for its dedicated use.
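As an illustration of the scenario above, each VC could be assigned one node-level cell rather than 16 loose GPUs. The field names below follow the hived-config examples referenced in this document, and `DGX2-NODE` is a hypothetical node-level cellType, so treat this as a sketch rather than a verbatim config:

```yaml
virtualClusters:
  VC1:
    virtualCells:
    - cellType: DGX2-NODE   # one whole DGX-2 node guaranteed to VC1
      cellNumber: 1
  VC2:
    virtualCells:
    - cellType: DGX2-NODE   # one whole DGX-2 node guaranteed to VC2
      cellNumber: 1
```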
@ -30,19 +30,21 @@ This is similar to [K8S Taints and Tolerations](https://kubernetes.io/docs/conce
2. Submit job [itc-pin](file/itc-pin.yaml) to VC1, all tasks in task role vc1pinned will be on node 10.151.41.25 (which is pinned), all tasks in task role vc1nopinned will NOT be on node 10.151.41.25.
<img src="file/itc-pin.png" width="900"/>
## GPU Type
## SKU Type
### Description
If `gpuType` is specified in the job, only that type of GPU will be allocated to the job, otherwise, any type of GPU can be allocated.
`skuType` is the leaf `cellType`, which has no further internal topology.
If `skuType` is specified in the job, only that type of leaf cell will be allocated to the job; otherwise, any type of leaf cell can be allocated.
This is similar to [K8S Labels and Selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels), but with [VC Safety](#VC-Safety) guaranteed.
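A minimal sketch of requesting a specific type in a job, mirroring the itc-k80-type example later in this diff (surrounding OpenPAI job-protocol fields are omitted):

```yaml
taskRoles:
  train:
    skuType: K80   # only K80 leaf cells will be allocated to this task role
```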
### Reproduce Steps
#### `gpuType` specified
#### `skuType` specified
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-k80-type](file/itc-k80-type.yaml), it will be partially running (some tasks waiting because all the specified K80 GPUs are used).
<img src="file/itc-k80-type.png" width="900"/>
#### `gpuType` not specified
#### `skuType` not specified
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-no-type](file/itc-no-type.yaml), it will be fully running, and some tasks are using K80 (10.151.41.18) while others are using M60 (10.151.41.26).
<img src="file/itc-no-type.png" width="900"/>
@ -135,7 +137,7 @@ One VC's [Guaranteed Job](#Guaranteed-Job) can preempt other VCs' [Opportunistic
## Topology-Aware Intra-VC Scheduling
### Description
Within one VC, HiveD chooses nearest GPUs for one `AffinityGroup` in best effort.
Within one VC, HiveD chooses the nearest leaf cells for an `AffinityGroup` on a best-effort basis.
### Reproduce Steps
1. Use [hived-config-2](file/hived-config-2.yaml).
@ -147,40 +149,40 @@ Within one VC, HiveD chooses nearest GPUs for one `AffinityGroup` in best effort
## Work-Preserving Reconfiguration
### Description
HiveD can be reconfigured without unnecessary user impacts, such as add/update/delete physical/virtual clusters, GPU types/topologies, etc.
HiveD can be reconfigured without unnecessary user impact, such as adding/updating/deleting physical/virtual clusters, device types/topologies, etc.
### Reproduce Steps
#### PhysicalCluster Reconfig - Delete PhysicalCell
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `gpuType`. Wait until it is running.
3. Delete all M60 `gpuType` related PhysicalCells and VirtualCells from [hived-config-2](file/hived-config-2.yaml), i.e. becomes [hived-config-33](file/hived-config-33.yaml).
4. Use [hived-config-33](file/hived-config-33.yaml), and restart HiveD.
2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `skuType`. Wait until it is running.
3. Delete all M60 `skuType` related PhysicalCells and VirtualCells from [hived-config-2](file/hived-config-2.yaml), i.e. becomes [hived-config-33](file/hived-config-33.yaml).
4. Use [hived-config-33](file/hived-config-33.yaml), and restart HiveD.
5. The job will still run without any impact, but its M60 usage is ignored by HiveD.
*However, normally, the job will still fail if the corresponding physical node is later deleted from K8S or unhealthy.*
<img src="file/itc-reconfig-1.png" width="900"/>
#### PhysicalCluster Reconfig - Add PhysicalCell
1. Use [hived-config-33](file/hived-config-33.yaml).
2. Submit job [itc-k80-type](file/itc-k80-type.yaml) which requests K80 `gpuType`. Wait until it is running.
3. Add all M60 `gpuType` related PhysicalCells and VirtualCells into [hived-config-33](file/hived-config-33.yaml), i.e. becomes [hived-config-2](file/hived-config-2.yaml).
4. Use [hived-config-2](file/hived-config-2.yaml), and restart HiveD.
2. Submit job [itc-k80-type](file/itc-k80-type.yaml) which requests K80 `skuType`. Wait until it is running.
3. Add all M60 `skuType` related PhysicalCells and VirtualCells into [hived-config-33](file/hived-config-33.yaml), i.e. becomes [hived-config-2](file/hived-config-2.yaml).
4. Use [hived-config-2](file/hived-config-2.yaml), and restart HiveD.
5. The job will still run without any impact, and its K80 usage is still accounted by HiveD.
<img src="file/itc-k80-type.png" width="900"/>
#### PhysicalCluster Reconfig - Update PhysicalCell - Add Node
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `gpuType`. Wait until it is running.
2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `skuType`. Wait until it is running.
3. Add one M60 node into a PhysicalCell, then becomes [hived-config-4](file/hived-config-4.yaml).
4. Use [hived-config-4](file/hived-config-4.yaml), and restart HiveD.
4. Use [hived-config-4](file/hived-config-4.yaml), and restart HiveD.
5. The job will still run without any impact, and its M60 usage is still accounted by HiveD.
6. To confirm the job is not impacted, e.g., not [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-2](file/itc-reconfig-2.yaml), which requests all M60 nodes and has the same priority as [itc-reconfig-1](file/itc-reconfig-1.yaml). The job will be waiting instead of preempting [itc-reconfig-1](file/itc-reconfig-1.yaml).
<img src="file/itc-reconfig-2.png" width="900"/>
#### PhysicalCluster Reconfig - Update PhysicalCell - Delete Node
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) which requests K80 `gpuType`. Wait until it is running.
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) which requests K80 `skuType`. Wait until it is running.
3. Delete one K80 node used by [itc-reconfig-3](file/itc-reconfig-3.yaml) from a PhysicalCell, then becomes [hived-config-7](file/hived-config-7.yaml).
4. Use [hived-config-7](file/hived-config-7.yaml), and restart HiveD.
4. Use [hived-config-7](file/hived-config-7.yaml), and restart HiveD.
5. The job will still run without any impact, but its deleted node usage is ignored by HiveD.
*However, normally, the job will still fail if the corresponding physical node is later deleted from K8S or unhealthy.*
<img src="file/itc-reconfig-3-1.png" width="900"/>
@ -189,7 +191,7 @@ HiveD can be reconfigured without unnecessary user impacts, such as add/update/d
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) to default VC. Wait until it is running.
3. Delete the default VC and move its quota to VC1, then becomes [hived-config-5](file/hived-config-5.yaml).
4. Use [hived-config-5](file/hived-config-5.yaml), and restart HiveD.
4. Use [hived-config-5](file/hived-config-5.yaml), and restart HiveD.
5. The job will still run without any interruption but [lazy preempted](#Lazy-Preemption) by HiveD.
<img src="file/itc-reconfig-3.png" width="900"/>
6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-4](file/itc-reconfig-4.yaml) to VC1 which requests all K80 nodes. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).
@ -199,7 +201,7 @@ HiveD can be reconfigured without unnecessary user impacts, such as add/update/d
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) to default VC. Wait until it is running.
3. Move one K80-NODE cell from default VC to VC1, then becomes [hived-config-6](file/hived-config-6.yaml).
4. Use [hived-config-6](file/hived-config-6.yaml), and restart HiveD.
4. Use [hived-config-6](file/hived-config-6.yaml), and restart HiveD.
5. The job will still run without any interruption but [lazy preempted](#Lazy-Preemption) by HiveD.
6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-5](file/itc-reconfig-5.yaml) to VC1 which requests all K80 nodes. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).
<img src="file/itc-reconfig-5.png" width="900"/>

View file

@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080
physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4

View file

@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080
physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4

View file

@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080
physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4

View file

@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080
physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4

View file

@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080
physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4

View file

@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080
physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4

View file

@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080
physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4

View file

@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080
physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4

View file

@ -1,7 +1,7 @@
kubeApiServerAddress: http://10.151.41.16:8080
physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 4

View file

@ -29,5 +29,5 @@ extras:
jobPriorityClass: prod
taskRoles:
train:
gpuType: M60
skuType: M60
submitFrom: submit-job-v2

View file

@ -29,5 +29,5 @@ extras:
jobPriorityClass: prod
taskRoles:
train:
gpuType: K80
skuType: K80
submitFrom: submit-job-v2

View file

@ -28,5 +28,5 @@ extras:
jobPriorityClass: prod
taskRoles:
train:
gpuType: K80
skuType: K80
submitFrom: submit-job-v2

View file

@ -28,5 +28,5 @@ extras:
jobPriorityClass: prod
taskRoles:
train:
gpuType: K80
skuType: K80
submitFrom: submit-job-v2

View file

@ -28,5 +28,5 @@ extras:
jobPriorityClass: prod
taskRoles:
train:
gpuType: K80
skuType: K80
submitFrom: submit-job-v2

View file

@ -28,5 +28,5 @@ extras:
jobPriorityClass: oppo
taskRoles:
train:
gpuType: K80
skuType: K80
submitFrom: submit-job-v2

View file

@ -28,5 +28,5 @@ extras:
jobPriorityClass: prod
taskRoles:
train:
gpuType: K80
skuType: K80
submitFrom: submit-job-v2

View file

@ -28,5 +28,5 @@ extras:
jobPriorityClass: prod
taskRoles:
train:
gpuType: M60
skuType: M60
submitFrom: submit-job-v2

View file

@ -28,5 +28,5 @@ extras:
jobPriorityClass: test
taskRoles:
train:
gpuType: M60
skuType: M60
submitFrom: submit-job-v2

View file

@ -28,5 +28,5 @@ extras:
jobPriorityClass: prod
taskRoles:
train:
gpuType: K80
skuType: K80
submitFrom: submit-job-v2

View file

@ -28,5 +28,5 @@ extras:
jobPriorityClass: prod
taskRoles:
train:
gpuType: K80
skuType: K80
submitFrom: submit-job-v2

View file

@ -28,5 +28,5 @@ extras:
jobPriorityClass: test
taskRoles:
train:
gpuType: K80
skuType: K80
submitFrom: submit-job-v2

View file

@ -28,5 +28,5 @@ extras:
jobPriorityClass: prod
taskRoles:
train:
gpuType: K80
skuType: K80
submitFrom: submit-job-v2

View file

@ -29,5 +29,5 @@ extras:
jobPriorityClass: prod
taskRoles:
train:
gpuType: M60
skuType: M60
submitFrom: submit-job-v2

View file

@ -28,5 +28,5 @@ extras:
jobPriorityClass: oppo
taskRoles:
train:
gpuType: K80
skuType: K80
submitFrom: submit-job-v2

View file

@ -42,7 +42,7 @@ extras:
jobPriorityClass: prod
taskRoles:
vc1nopinned:
gpuType: K80
skuType: K80
affinityGroupName: vc1nopinned
vc1pinned:
pinnedCellId: VC1-K80

View file

@ -27,5 +27,5 @@ extras:
jobPriorityClass: test
taskRoles:
train:
gpuType: M60
skuType: M60
submitFrom: submit-job-v2

View file

@ -27,5 +27,5 @@ extras:
jobPriorityClass: test
taskRoles:
train:
gpuType: M60
skuType: M60
submitFrom: submit-job-v2

View file

@ -27,5 +27,5 @@ extras:
jobPriorityClass: test
taskRoles:
train:
gpuType: K80
skuType: K80
submitFrom: submit-job-v2

View file

@ -27,5 +27,5 @@ extras:
jobPriorityClass: test
taskRoles:
train:
gpuType: K80
skuType: K80
submitFrom: submit-job-v2

View file

@ -27,5 +27,5 @@ extras:
jobPriorityClass: test
taskRoles:
train:
gpuType: K80
skuType: K80
submitFrom: submit-job-v2

View file

@ -29,4 +29,4 @@ extras:
jobPriorityClass: prod
taskRoles:
train:
gpuType: K80
skuType: K80

View file

@ -29,4 +29,4 @@ extras:
jobPriorityClass: prod
taskRoles:
train:
gpuType: K80
skuType: K80

View file

@ -7,8 +7,8 @@ jobPriorityClass: PROD
taskRoles:
a:
taskNumber: 5
gpuType: K80
gpuNumber: 1
leafCellType: K80
leafCellNumber: 1
affinityGroupName: null
---
jobVC: VC2
@ -17,8 +17,8 @@ jobPriorityClass: PROD
taskRoles:
a:
taskNumber: 5
gpuType: K80
gpuNumber: 1
leafCellType: K80
leafCellNumber: 1
affinityGroupName: null
---
jobVC: VC2
@ -28,7 +28,7 @@ taskRoles:
a:
taskNumber: 5
pinnedCellId: VC2-K80
gpuNumber: 1
leafCellNumber: 1
affinityGroupName: null
---
@ -36,7 +36,7 @@ taskRoles:
# [Optional]: Cluster Admin -> RestServer Config -> PC
################################################################################
physicalCluster:
gpuTypes:
leafCellTypes:
K80:
gpu: 1
cpu: 4
@ -71,8 +71,8 @@ spec:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC1
priority: 1000
gpuType: K80
gpuNumber: 1
leafCellType: K80
leafCellNumber: 1
affinityGroup: null
spec:
schedulerName: hivedscheduler
@ -91,7 +91,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
---
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
@ -118,8 +118,8 @@ spec:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC2
priority: 1000
gpuType: K80
gpuNumber: 1
leafCellType: K80
leafCellNumber: 1
affinityGroup: null
spec:
schedulerName: hivedscheduler
@ -138,7 +138,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
---
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
@ -166,7 +166,7 @@ spec:
virtualCluster: VC2
priority: 1000
pinnedCellId: VC2-K80
gpuNumber: 1
leafCellNumber: 1
affinityGroup: null
spec:
schedulerName: hivedscheduler
@ -185,7 +185,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
---
################################################################################
@ -200,8 +200,8 @@ metadata:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC1
priority: 1000
gpuType: K80
gpuNumber: 1
leafCellType: K80
leafCellNumber: 1
affinityGroup: null
spec:
schedulerName: hivedscheduler
@ -220,7 +220,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
---
apiVersion: v1
kind: Pod
@ -231,8 +231,8 @@ metadata:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC2
priority: 1000
gpuType: K80
gpuNumber: 1
leafCellType: K80
leafCellNumber: 1
affinityGroup: null
spec:
schedulerName: hivedscheduler
@ -251,7 +251,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
---
apiVersion: v1
kind: Pod
@ -263,7 +263,7 @@ metadata:
virtualCluster: VC2
priority: 1000
pinnedCellId: VC2-K80
gpuNumber: 1
leafCellNumber: 1
affinityGroup: null
spec:
schedulerName: hivedscheduler
@ -282,4 +282,4 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']

View file

@ -2,8 +2,8 @@
# [Optional]: Job User -> RestServer Request
#
# Constrains:
# 1. For one task, only need to specify gpuType or pinnedCellId, not both.
# 2. All gpuTypes or pinnedCellIds under the same affinityGroup must be the same.
# 1. For one task, only need to specify leafCellType or pinnedCellId, not both.
# 2. All leafCellTypes or pinnedCellIds under the same affinityGroup must be the same.
#
# affinityGroupName:
# An affinityGroup forms a cell request and scheduler will try all candidate
@ -27,63 +27,63 @@ taskRoles:
# All tasks in role A, B, C should be within the same cell named PCN-ABC.
#
# Total request of PCN-ABC:
# gpuType: DGX2-V100
# gpuNumber: 1 * 16 + 3 * 8 + 1 * 4 = 44 GPUs = 2.75 DGX2 nodes
# leafCellType: DGX2-V100
# leafCellNumber: 1 * 16 + 3 * 8 + 1 * 4 = 44 GPUs = 2.75 DGX2 nodes
# Candidate cellTypes:
# 3-DGX2-NODE, 4-DGX2-NODE, 4-DIRECT-DGX2-NODE, 5-DGX2-NODE.
A:
taskNumber: 1
gpuType: DGX2-V100
gpuNumber: 16
leafCellType: DGX2-V100
leafCellNumber: 16
affinityGroupName: PCN-ABC
B:
taskNumber: 3
gpuType: DGX2-V100
gpuNumber: 8
leafCellType: DGX2-V100
leafCellNumber: 8
affinityGroupName: PCN-ABC
C:
taskNumber: 1
gpuType: DGX2-V100
gpuNumber: 4
leafCellType: DGX2-V100
leafCellNumber: 4
affinityGroupName: PCN-ABC
# All tasks in role D should be within the same cell named PCN-D.
#
# Total request of PCN-D:
# gpuType: null -> any gpuType
# gpuNumber: 2 * CP2(3) = 2 * 4 = 8 GPUs
# leafCellType: null -> any leafCellType
# leafCellNumber: 2 * CP2(3) = 2 * 4 = 8 GPUs
# Candidate cellTypes:
# DGX1-P100-NODE, DGX1-V100-NODE, DGX2-NODE-8-GPU, IB-DGX2-NODE-8-GPU.
D:
taskNumber: 2
gpuType: null # null, empty or not specified -> any gpuType
gpuNumber: 3
leafCellType: null # null, empty or not specified -> any leafCellType
leafCellNumber: 3
affinityGroupName: PCN-D
# Tasks in role E are not required to be within the same cell.
#
# Each task forms a cell request:
# gpuType: DGX2-V100
# gpuNumber: 1 * 16 = 16 GPUs = 1 DGX2 node
# leafCellType: DGX2-V100
# leafCellNumber: 1 * 16 = 16 GPUs = 1 DGX2 node
# Candidate cellTypes:
# DGX2-NODE.
E:
taskNumber: 2
gpuType: DGX2-V100
gpuNumber: 16
leafCellType: DGX2-V100
leafCellNumber: 16
affinityGroupName: null # null, empty or not specified -> no affinityGroup
# All tasks in role F should be within the same cell named PCN-F.
#
# Total request of PCN-F:
# pinnedCellId: VC1-YQW-IB-DGX2
# gpuNumber: 2 * CP2(3) = 2 * 4 = 8 GPUs
# leafCellNumber: 2 * CP2(3) = 2 * 4 = 8 GPUs
# Candidate physicalCells:
# VC1-YQW-IB-DGX2.
F:
taskNumber: 2
pinnedCellId: VC1-YQW-IB-DGX2
gpuNumber: 3
leafCellNumber: 3
affinityGroupName: PCN-F
---
@ -94,10 +94,10 @@ taskRoles:
# things in advance.
# For example:
# Pod Spec cpu, memory.
# 1. Given gpuType or pinnedCellId, just pick the corresponding cpu, memory unit.
# 2. No gpuType or pinnedCellId is given, choose the minimal cpu, memory unit.
# 1. Given leafCellType or pinnedCellId, just pick the corresponding cpu, memory unit.
# 2. No leafCellType or pinnedCellId is given, choose the minimal cpu, memory unit.
physicalCluster:
gpuTypes:
leafCellTypes:
# Check resource value format in
# k8s.io/apimachinery/pkg/api/resource/quantity.go
@ -146,17 +146,17 @@ spec:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC1
priority: 1000
gpuType: DGX2-V100
gpuNumber: 16
leafCellType: DGX2-V100
leafCellNumber: 16
affinityGroup:
name: JOBX/PCN-ABC
members:
- podNumber: 1
gpuNumber: 16
leafCellNumber: 16
- podNumber: 3
gpuNumber: 8
leafCellNumber: 8
- podNumber: 1
gpuNumber: 4
leafCellNumber: 4
spec:
# See ../../run/deploy.yaml for why and how to specify the schedulerName.
schedulerName: hivedscheduler
@ -185,7 +185,7 @@ spec:
valueFrom:
fieldRef:
# This annotation will be populated by the scheduler when it binds the pod.
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
# K8S port scheduling is incompatible with HiveD, so the job should detect
# port conflict by itself and fail with transient error, then controller
# should retry it with new port.
@ -200,17 +200,17 @@ spec:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC1
priority: 1000
gpuType: DGX2-V100
gpuNumber: 8
leafCellType: DGX2-V100
leafCellNumber: 8
affinityGroup:
name: JOBX/PCN-ABC
members:
- podNumber: 1
gpuNumber: 16
leafCellNumber: 16
- podNumber: 3
gpuNumber: 8
leafCellNumber: 8
- podNumber: 1
gpuNumber: 4
leafCellNumber: 4
spec:
schedulerName: hivedscheduler
priority: 1000
@ -224,7 +224,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
- name: C
taskNumber: 1
task:
@ -234,17 +234,17 @@ spec:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC1
priority: 1000
gpuType: DGX2-V100
gpuNumber: 4
leafCellType: DGX2-V100
leafCellNumber: 4
affinityGroup:
name: JOBX/PCN-ABC
members:
- podNumber: 1
gpuNumber: 16
leafCellNumber: 16
- podNumber: 3
gpuNumber: 8
leafCellNumber: 8
- podNumber: 1
gpuNumber: 4
leafCellNumber: 4
spec:
schedulerName: hivedscheduler
priority: 1000
@ -258,7 +258,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
- name: D
taskNumber: 2
task:
@ -268,13 +268,13 @@ spec:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC1
priority: 1000
gpuType: null
gpuNumber: 3
leafCellType: null
leafCellNumber: 3
affinityGroup:
name: JOBX/PCN-D
members:
- podNumber: 2
gpuNumber: 3
leafCellNumber: 3
spec:
schedulerName: hivedscheduler
priority: 1000
@ -290,7 +290,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
- name: E
taskNumber: 2
task:
@ -300,8 +300,8 @@ spec:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC1
priority: 1000
gpuType: DGX2-V100
gpuNumber: 16
leafCellType: DGX2-V100
leafCellNumber: 16
affinityGroup: null
spec:
schedulerName: hivedscheduler
@ -316,7 +316,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
- name: F
taskNumber: 2
task:
@ -327,12 +327,12 @@ spec:
virtualCluster: VC1
priority: 1000
pinnedCellId: VC1-YQW-IB-DGX2
gpuNumber: 3
leafCellNumber: 3
affinityGroup:
name: JOBX/PCN-F
members:
- podNumber: 2
gpuNumber: 3
leafCellNumber: 3
spec:
schedulerName: hivedscheduler
priority: 1000
@ -346,7 +346,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
---
################################################################################
@ -360,17 +360,17 @@ metadata:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC1
priority: 1000
gpuType: DGX2-V100
gpuNumber: 16
leafCellType: DGX2-V100
leafCellNumber: 16
affinityGroup:
name: JOBX/PCN-ABC
members:
- podNumber: 1
gpuNumber: 16
leafCellNumber: 16
- podNumber: 3
gpuNumber: 8
leafCellNumber: 8
- podNumber: 1
gpuNumber: 4
leafCellNumber: 4
spec:
schedulerName: hivedscheduler
priority: 1000
@ -384,7 +384,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
---
apiVersion: v1
kind: Pod
@ -395,17 +395,17 @@ metadata:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC1
priority: 1000
gpuType: DGX2-V100
gpuNumber: 8
leafCellType: DGX2-V100
leafCellNumber: 8
affinityGroup:
name: JOBX/PCN-ABC
members:
- podNumber: 1
gpuNumber: 16
leafCellNumber: 16
- podNumber: 3
gpuNumber: 8
leafCellNumber: 8
- podNumber: 1
gpuNumber: 4
leafCellNumber: 4
spec:
schedulerName: hivedscheduler
priority: 1000
@ -419,7 +419,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
---
apiVersion: v1
kind: Pod
@ -429,17 +429,17 @@ metadata:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC1
priority: 1000
gpuType: DGX2-V100
gpuNumber: 4
leafCellType: DGX2-V100
leafCellNumber: 4
affinityGroup:
name: JOBX/PCN-ABC
members:
- podNumber: 1
gpuNumber: 16
leafCellNumber: 16
- podNumber: 3
gpuNumber: 8
leafCellNumber: 8
- podNumber: 1
gpuNumber: 4
leafCellNumber: 4
spec:
schedulerName: hivedscheduler
priority: 1000
@ -453,7 +453,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
---
apiVersion: v1
kind: Pod
@ -464,13 +464,13 @@ metadata:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC1
priority: 1000
gpuType: null
gpuNumber: 3
leafCellType: null
leafCellNumber: 3
affinityGroup:
name: JOBX/PCN-D
members:
- podNumber: 2
gpuNumber: 3
leafCellNumber: 3
spec:
schedulerName: hivedscheduler
priority: 1000
@ -484,7 +484,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
---
apiVersion: v1
kind: Pod
@ -495,8 +495,8 @@ metadata:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC1
priority: 1000
gpuType: DGX2-V100
gpuNumber: 16
leafCellType: DGX2-V100
leafCellNumber: 16
affinityGroup: null
spec:
schedulerName: hivedscheduler
@ -511,7 +511,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
---
apiVersion: v1
kind: Pod
@ -523,12 +523,12 @@ metadata:
virtualCluster: VC1
priority: 1000
pinnedCellId: VC1-YQW-IB-DGX2
gpuNumber: 3
leafCellNumber: 3
affinityGroup:
name: JOBX/PCN-F
members:
- podNumber: 2
gpuNumber: 3
leafCellNumber: 3
spec:
schedulerName: hivedscheduler
priority: 1000
@ -542,4 +542,4 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']

View file

@ -26,8 +26,8 @@ spec:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC2
priority: 1000
gpuType: K80
gpuNumber: 1
leafCellType: K80
leafCellNumber: 1
affinityGroup: null
spec:
schedulerName: hivedscheduler
@ -58,7 +58,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
volumeMounts:
- name: frameworkbarrier-volume
mountPath: "/mnt/frameworkbarrier"
@ -95,8 +95,8 @@ spec:
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
virtualCluster: VC2
priority: 1000
gpuType: K80
gpuNumber: 1
leafCellType: K80
leafCellNumber: 1
affinityGroup: null
spec:
schedulerName: hivedscheduler
@ -127,7 +127,7 @@ spec:
- name: NVIDIA_VISIBLE_DEVICES
valueFrom:
fieldRef:
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
volumeMounts:
- name: frameworkbarrier-volume
mountPath: "/mnt/frameworkbarrier"

View file

@ -49,7 +49,7 @@ data:
webServerAddress: ":30096"
waitingPodSchedulingBlockMilliSec: 50
physicalCluster:
gpuTypes:
skuTypes:
K80:
gpu: 1
cpu: 5

View file

@ -24,11 +24,12 @@ package algorithm
import (
"fmt"
"github.com/microsoft/hivedscheduler/pkg/api"
"k8s.io/klog"
)
// A Cell represents a set of GPUs affinitized by their interconnection topology.
// A Cell represents a set of leaf cells affinitized by their interconnection topology.
// Cells are organized as a tree through pointers to their parents / children.
type Cell interface {
GetChain() CellChain
@ -41,9 +42,9 @@ type Cell interface {
AtOrHigherThanNode() bool
GetPriority() CellPriority
SetPriority(CellPriority)
GetTotalGpuNum() int32
GetUsedGpuNumAtPriorities() map[CellPriority]int32
IncreaseUsedGpuNumAtPriority(CellPriority, int32)
GetTotalLeafCellNum() int32
GetUsedLeafCellNumAtPriorities() map[CellPriority]int32
IncreaseUsedLeafCellNumAtPriority(CellPriority, int32)
}
func CellEqual(c1 Cell, c2 Cell) bool {
@ -65,9 +66,9 @@ type GenericCell struct {
state CellState
// A cell is healthy if all of the cell's children are healthy (bad if any child is bad).
// The healthy field is orthogonal to priority and state.
healthy bool
totalGpuNum int32 // total GPU number of a cell
usedGpuNumAtPriorities map[CellPriority]int32 // GPU number used by each priority
healthy bool
totalLeafCellNum int32 // total leaf cell number of a cell
usedLeafCellNumAtPriorities map[CellPriority]int32 // leaf cell number used by each priority
}
func (c *GenericCell) GetChain() CellChain {
@ -110,18 +111,18 @@ func (c *GenericCell) IsHealthy() bool {
return c.healthy
}
func (c *GenericCell) GetTotalGpuNum() int32 {
return c.totalGpuNum
func (c *GenericCell) GetTotalLeafCellNum() int32 {
return c.totalLeafCellNum
}
func (c *GenericCell) GetUsedGpuNumAtPriorities() map[CellPriority]int32 {
return c.usedGpuNumAtPriorities
func (c *GenericCell) GetUsedLeafCellNumAtPriorities() map[CellPriority]int32 {
return c.usedLeafCellNumAtPriorities
}
func (c *GenericCell) IncreaseUsedGpuNumAtPriority(p CellPriority, delta int32) {
c.usedGpuNumAtPriorities[p] += delta
if c.usedGpuNumAtPriorities[p] == 0 {
delete(c.usedGpuNumAtPriorities, p)
func (c *GenericCell) IncreaseUsedLeafCellNumAtPriority(p CellPriority, delta int32) {
c.usedLeafCellNumAtPriorities[p] += delta
if c.usedLeafCellNumAtPriorities[p] == 0 {
delete(c.usedLeafCellNumAtPriorities, p)
}
}
@ -129,7 +130,7 @@ func (c *GenericCell) IncreaseUsedGpuNumAtPriority(p CellPriority, delta int32)
type PhysicalCell struct {
GenericCell
nodes []string // node names inside the cell
gpuIndices []int32 // [-1] for cells at levels higher than node
leafCellIndices []int32 // [-1] for cells at levels higher than node
usingGroup *AlgoAffinityGroup // affinity group using this cell
reservingOrReservedGroup *AlgoAffinityGroup // affinity group that is reserving, or has reserved the cell (e.g., waiting for preemption)
virtualCell *VirtualCell // points to the bound virtual cell
@ -151,14 +152,14 @@ func NewPhysicalCell(
return &PhysicalCell{
GenericCell: GenericCell{
chain: c,
level: l,
priority: freePriority,
address: address,
atOrHigherThanNode: g,
totalGpuNum: n,
usedGpuNumAtPriorities: map[CellPriority]int32{},
state: cellFree,
chain: c,
level: l,
priority: freePriority,
address: address,
atOrHigherThanNode: g,
totalLeafCellNum: n,
usedLeafCellNumAtPriorities: map[CellPriority]int32{},
state: cellFree,
// cells are set to healthy initially, and will be all set to bad in HivedAlgorithm.initBadNodes
healthy: true,
},
@ -203,16 +204,16 @@ func (c *PhysicalCell) SetState(s CellState) {
}
func (c *PhysicalCell) GetPhysicalPlacement() ([]string, []int32) {
return c.nodes, c.gpuIndices
return c.nodes, c.leafCellIndices
}
func (c *PhysicalCell) GetPhysicalPlacementString() string {
return fmt.Sprintf("%v:%v", c.nodes, c.gpuIndices)
return fmt.Sprintf("%v:%v", c.nodes, c.leafCellIndices)
}
func (c *PhysicalCell) SetPhysicalResources(nodes []string, gpuIndices []int32) {
func (c *PhysicalCell) SetPhysicalResources(nodes []string, leafCellIndices []int32) {
c.nodes = nodes
c.gpuIndices = gpuIndices
c.leafCellIndices = leafCellIndices
}
func (c *PhysicalCell) AddUsingGroup(g *AlgoAffinityGroup) {
@ -335,14 +336,14 @@ func NewVirtualCell(
return &VirtualCell{
GenericCell: GenericCell{
chain: c,
level: l,
priority: freePriority,
address: address,
atOrHigherThanNode: g,
totalGpuNum: n,
usedGpuNumAtPriorities: map[CellPriority]int32{},
state: cellFree,
chain: c,
level: l,
priority: freePriority,
address: address,
atOrHigherThanNode: g,
totalLeafCellNum: n,
usedLeafCellNumAtPriorities: map[CellPriority]int32{},
state: cellFree,
// cells are set to healthy initially, and will be all set to bad in HivedAlgorithm.initBadNodes
healthy: true,
},

View file

@ -236,8 +236,8 @@ func getUsablePhysicalCells(
}
// prioritize the cells with fewer opportunistic pods (to reduce preemption)
sort.SliceStable(candidates, func(i, j int) bool {
return candidates[i].GetUsedGpuNumAtPriorities()[opportunisticPriority] <
candidates[j].GetUsedGpuNumAtPriorities()[opportunisticPriority]
return candidates[i].GetUsedLeafCellNumAtPriorities()[opportunisticPriority] <
candidates[j].GetUsedLeafCellNumAtPriorities()[opportunisticPriority]
})
return usableCandidates
}
@ -382,7 +382,7 @@ func getUnboundVirtualCell(cl CellList) *VirtualCell {
}
// bindCell binds a virtual cell to a physical cell and its parent recursively.
// bindCell always starts from the lowest level, i.e., GPU-level cells.
// bindCell always starts from the lowest level, i.e., leaf-level cells.
func bindCell(pc *PhysicalCell, vc *VirtualCell) {
for vc.GetPhysicalCell() == nil {
pc.SetVirtualCell(vc)
@ -397,7 +397,7 @@ func bindCell(pc *PhysicalCell, vc *VirtualCell) {
}
// unbindCell unbinds a virtual cell with a physical cell and its parent recursively.
// unbindCell always starts from the lowest level, i.e., GPU-level cells.
// unbindCell always starts from the lowest level, i.e., leaf-level cells.
func unbindCell(c *PhysicalCell) {
boundVirtual := c.GetVirtualCell()
for !boundVirtual.GetPhysicalCell().IsPinned() {
@ -421,7 +421,7 @@ func unbindCell(c *PhysicalCell) {
// setCellPriority sets priority for a cell and its parent recursively, guaranteeing that
// the priority of a cell is the max of those of its children.
// setCellPriority always starts from the lowest level, i.e., GPU-level cells.
// setCellPriority always starts from the lowest level, i.e., leaf-level cells.
func setCellPriority(c Cell, p CellPriority) {
originalPriority := c.GetPriority()
c.SetPriority(p)
@ -440,15 +440,15 @@ func setCellPriority(c Cell, p CellPriority) {
}
}
// updateUsedGpuNumAtPriority updates the number of used GPUs at a priority for a cell
// updateUsedLeafCellNumAtPriority updates the number of used leaf cells at a priority for a cell
// and its parent recursively.
func updateUsedGpuNumAtPriority(c Cell, p CellPriority, increase bool) {
func updateUsedLeafCellNumAtPriority(c Cell, p CellPriority, increase bool) {
for c != nil {
delta := int32(-1)
if increase {
delta = 1
}
c.IncreaseUsedGpuNumAtPriority(p, delta)
c.IncreaseUsedLeafCellNumAtPriority(p, delta)
c = c.GetParent()
}
}

View file

@ -24,21 +24,22 @@ package algorithm
import (
"fmt"
"strings"
"github.com/microsoft/hivedscheduler/pkg/api"
"github.com/microsoft/hivedscheduler/pkg/common"
"strings"
)
// internal wrapper for spec cellTypes
type cellChainElement struct {
cellType api.CellType // current cell type
level CellLevel // current cell level, leaf cell is 1
childCellType api.CellType // child cell type
childNumber int32 // child number
hasNode bool // current cell type is a node or above cell
isMultiNodes bool // current cell type is a multiple node cell
gpuType string // current cell gpu type
gpuNumber int32 // how many gpu in current cell
cellType api.CellType // current cell type
level CellLevel // current cell level, leaf cell is 1
childCellType api.CellType // child cell type
childNumber int32 // child number
hasNode bool // current cell type is a node or above cell
isMultiNodes bool // current cell type is a multiple node cell
leafCellType string // current cell leaf cell type
leafCellNumber int32 // how many leaf cells in current cell
}
type cellTypeConstructor struct {
@ -66,14 +67,14 @@ func (c *cellTypeConstructor) addCellChain(ct api.CellType) {
if !ok {
// not found in raw spec, it's leaf cell
c.cellChainElements[ct] = &cellChainElement{
cellType: ct,
level: lowestLevel,
childCellType: "",
childNumber: 0,
hasNode: false,
isMultiNodes: false,
gpuType: string(ct),
gpuNumber: 1,
cellType: ct,
level: lowestLevel,
childCellType: "",
childNumber: 0,
hasNode: false,
isMultiNodes: false,
leafCellType: string(ct),
leafCellNumber: 1,
}
return
}
@ -87,14 +88,14 @@ func (c *cellTypeConstructor) addCellChain(ct api.CellType) {
// child cell type has been added, added current element,
cct := c.cellChainElements[child]
c.cellChainElements[ct] = &cellChainElement{
cellType: ct,
level: cct.level + 1,
childCellType: cct.cellType,
childNumber: ctSpec.ChildCellNumber,
hasNode: cct.hasNode || ctSpec.IsNodeLevel,
isMultiNodes: cct.hasNode,
gpuType: cct.gpuType,
gpuNumber: cct.gpuNumber * ctSpec.ChildCellNumber,
cellType: ct,
level: cct.level + 1,
childCellType: cct.cellType,
childNumber: ctSpec.ChildCellNumber,
hasNode: cct.hasNode || ctSpec.IsNodeLevel,
isMultiNodes: cct.hasNode,
leafCellType: cct.leafCellType,
leafCellNumber: cct.leafCellNumber * ctSpec.ChildCellNumber,
}
return
}
@ -155,7 +156,7 @@ func (c *physicalCellConstructor) buildChildCell(
return cellInstance
}
var currentCellNodes []string
var currentCellGpuIndices []int32
var currentCellLeafCellIndices []int32
var currentCellChildren CellList
for _, childSpec := range spec.CellChildren {
childCellInstance := c.buildChildCell(childSpec, ce.childCellType, currentNode)
@ -165,18 +166,18 @@ func (c *physicalCellConstructor) buildChildCell(
// super-node cell merge child nodes
currentCellNodes = append(currentCellNodes, childCellInstance.nodes...)
} else {
// sub-node cell merge child node gpu indices
currentCellGpuIndices = append(currentCellGpuIndices, childCellInstance.gpuIndices...)
// sub-node cell merge child node leaf cell indices
currentCellLeafCellIndices = append(currentCellLeafCellIndices, childCellInstance.leafCellIndices...)
}
}
// update current cell children and resource
cellInstance.SetChildren(currentCellChildren)
if ce.isMultiNodes {
currentCellGpuIndices = []int32{-1}
currentCellLeafCellIndices = []int32{-1}
} else {
currentCellNodes = []string{currentNode}
}
cellInstance.SetPhysicalResources(currentCellNodes, currentCellGpuIndices)
cellInstance.SetPhysicalResources(currentCellNodes, currentCellLeafCellIndices)
return cellInstance
}
@ -188,7 +189,7 @@ func (c *physicalCellConstructor) addCell(
address api.CellAddress) *PhysicalCell {
cellInstance := NewPhysicalCell(
c.buildingChain, ce.level, ce.hasNode, ce.gpuNumber, ce.cellType, address, ce.hasNode && !ce.isMultiNodes)
c.buildingChain, ce.level, ce.hasNode, ce.leafCellNumber, ce.cellType, address, ce.hasNode && !ce.isMultiNodes)
if _, ok := c.fullCellList[chain]; !ok {
c.fullCellList[chain] = ChainCellList{}
}
@ -211,8 +212,8 @@ func (c *physicalCellConstructor) buildFullTree() *PhysicalCell {
panic(fmt.Sprintf("top cell must be node-level or above: %v", cc))
}
cellInstance := c.buildChildCell(c.buildingSpec, api.CellType(cc), "")
// set GPU type only for top-level cells (as a chain shares the same GPU type)
cellInstance.GetAPIStatus().GpuType = ce.gpuType
// set leaf cell type only for top-level cells (as a chain shares the same leaf cell type)
cellInstance.GetAPIStatus().LeafCellType = ce.leafCellType
return cellInstance
}
@ -289,7 +290,7 @@ func (c *virtualCellConstructor) addCell(
c.buildingChain,
ce.level,
ce.hasNode,
ce.gpuNumber,
ce.leafCellNumber,
nil,
ce.cellType,
address,
@ -345,8 +346,8 @@ func (c *virtualCellConstructor) buildFullTree(address api.CellAddress) *Virtual
panic(fmt.Sprintf("cellType %v in VirtualCells is not found in cell types definition", c.buildingChild))
}
cellInstance := c.buildChildCell(c.buildingChild, address)
// set GPU type only for top-level cells (as a chain shares the same GPU type)
cellInstance.GetAPIStatus().GpuType = ce.gpuType
// set leaf cell type only for top-level cells (as a chain shares the same leaf cell type)
cellInstance.GetAPIStatus().LeafCellType = ce.leafCellType
return cellInstance
}
@ -418,23 +419,23 @@ func parseCellChainInfo(
map[CellChain]map[CellLevel]api.CellType,
map[string][]CellChain) {
cellLevelToGpuNum := map[CellChain]map[CellLevel]int32{}
cellLevelToLeafCellNum := map[CellChain]map[CellLevel]int32{}
cellLevelToType := map[CellChain]map[CellLevel]api.CellType{}
gpuTypeToChain := map[string][]CellChain{}
leafCellTypeToChain := map[string][]CellChain{}
for _, chain := range chains {
ce := cellChainElements[api.CellType(chain)]
gpuTypeToChain[ce.gpuType] = append(gpuTypeToChain[ce.gpuType], chain)
leafCellTypeToChain[ce.leafCellType] = append(leafCellTypeToChain[ce.leafCellType], chain)
cellLevelToGpuNum[chain] = map[CellLevel]int32{}
cellLevelToLeafCellNum[chain] = map[CellLevel]int32{}
cellLevelToType[chain] = map[CellLevel]api.CellType{}
ce, ok := cellChainElements[api.CellType(chain)]
for ok {
cellLevelToGpuNum[chain][ce.level] = ce.gpuNumber
cellLevelToLeafCellNum[chain][ce.level] = ce.leafCellNumber
cellLevelToType[chain][ce.level] = ce.cellType
ce, ok = cellChainElements[ce.childCellType]
}
}
return cellLevelToGpuNum, cellLevelToType, gpuTypeToChain
return cellLevelToLeafCellNum, cellLevelToType, leafCellTypeToChain
}
@ -446,8 +447,8 @@ func ParseConfig(sConfig *api.Config) (
virtualNonPinnedFreeList map[api.VirtualClusterName]map[CellChain]ChainCellList, // vc:chain:level:[]virtualCell
virtualPinnedCells map[api.VirtualClusterName]map[api.PinnedCellId]ChainCellList, // vc:pinnedCellId:level:[]virtualCell
physicalPinnedCells map[api.VirtualClusterName]map[api.PinnedCellId]*PhysicalCell, // vc:pinnedCellId:PhysicalCell
cellLevelToGpuNum map[CellChain]map[CellLevel]int32, // chain:level:gpuNumber
gpuTypeToChain map[string][]CellChain, // gpuType:[]chain
cellLevelToLeafCellNum map[CellChain]map[CellLevel]int32, // chain:level:leafCellNumber
leafCellTypeToChain map[string][]CellChain, // leafCellType:[]chain
cellLevelToType map[CellChain]map[CellLevel]api.CellType, // chain:level:cellType
) {
@ -470,7 +471,7 @@ func ParseConfig(sConfig *api.Config) (
for k := range physicalFullList {
cellChains = append(cellChains, k)
}
cellLevelToGpuNum, cellLevelToType, gpuTypeToChain = parseCellChainInfo(cellChainElements, cellChains)
cellLevelToLeafCellNum, cellLevelToType, leafCellTypeToChain = parseCellChainInfo(cellChainElements, cellChains)
return
}

View file

@ -94,7 +94,7 @@ type HivedAlgorithm struct {
// bad nodes in the physical cluster
badNodes common.Set
// map each GPU type to all chains that contain this type
// map each leaf cell type to all chains that contain this type
cellChains map[string][]CellChain
// map each level in a chain to the specific cell type name
cellTypes map[CellChain]map[CellLevel]api.CellType
@ -107,7 +107,7 @@ type HivedAlgorithm struct {
// NewHivedAlgorithm initializes a HivedAlgorithm from the config file.
func NewHivedAlgorithm(sConfig *api.Config) *HivedAlgorithm {
fullPcl, freePcl, vcFreeCellNum, nonPinnedFullVcl, nonPinnedFreeVcl, pinnedVcl, pinnedPcl,
gpuNums, chains, cellTypes := ParseConfig(sConfig)
leafCellNums, chains, cellTypes := ParseConfig(sConfig)
h := &HivedAlgorithm{
vcSchedulers: map[api.VirtualClusterName]intraVCScheduler{},
@ -132,10 +132,10 @@ func NewHivedAlgorithm(sConfig *api.Config) *HivedAlgorithm {
for vcName := range nonPinnedFullVcl {
// TODO: Support per-VC configurable intra VC scheduling algo.
h.vcSchedulers[vcName] = newDefaultIntraVCScheduler(
nonPinnedFullVcl[vcName], nonPinnedFreeVcl[vcName], pinnedVcl[vcName], gpuNums)
nonPinnedFullVcl[vcName], nonPinnedFreeVcl[vcName], pinnedVcl[vcName], leafCellNums)
}
for chain, ccl := range h.fullCellList {
h.opportunisticSchedulers[chain] = NewTopologyAwareScheduler(ccl, gpuNums[chain], false)
h.opportunisticSchedulers[chain] = NewTopologyAwareScheduler(ccl, leafCellNums[chain], false)
}
h.initCellNums()
h.initAPIClusterStatus()
@ -192,11 +192,11 @@ func (h *HivedAlgorithm) Schedule(
suggestedNodeSet.Add(n)
}
var (
groupPhysicalPlacement groupPhysicalPlacement // GPU number -> a set of pods -> a set of GPUs of each pod
groupVirtualPlacement groupVirtualPlacement // GPU number -> a set of pods -> a set of GPUs of each pod
groupPhysicalPlacement groupPhysicalPlacement // leaf cell number -> a set of pods -> a set of leaf cells of each pod
groupVirtualPlacement groupVirtualPlacement // leaf cell number -> a set of pods -> a set of leaf cells of each pod
preemptionVictims map[string]common.Set // node -> pods
waitReason string
podIndex int32 // index of current pod among those of the same GPU number in the group, 0 by default
podIndex int32 // index of current pod among those of the same leaf cell number in the group, 0 by default
)
if g := h.affinityGroups[s.AffinityGroup.Name]; g != nil {
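To make the renamed placement comments above concrete, a small runnable sketch of the three-level indexing follows. The cell type below is a hypothetical stand-in; the real groupPhysicalPlacement and groupVirtualPlacement hold scheduler cells, but the index order is the same: leaf cell number, then pod, then leaf cell.

package main

import "fmt"

// cell is a hypothetical stand-in for the scheduler's physical/virtual cell type.
type cell string

// leaf cell number -> the pods requesting that number -> the leaf cells of each pod
type groupPlacement map[int32][][]cell

func main() {
	p := groupPlacement{
		// one pod requesting 2 leaf cells, placed on two leaf cells of the same node
		2: {{"node1/leafcell0", "node1/leafcell1"}},
	}
	leafCellNum, podIndex, leafCellIndex := int32(2), 0, 1
	fmt.Println(p[leafCellNum][podIndex][leafCellIndex]) // node1/leafcell1
}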
@ -215,7 +215,7 @@ func (h *HivedAlgorithm) Schedule(
preemptionVictims,
waitReason,
h.cellTypes,
s.GpuNumber,
s.LeafCellNumber,
podIndex,
h.affinityGroups[s.AffinityGroup.Name],
s.AffinityGroup.Name,
@ -251,22 +251,22 @@ func (h *HivedAlgorithm) AddAllocatedPod(pod *core.Pod) {
s := internal.ExtractPodSchedulingSpec(pod)
info := internal.ExtractPodBindInfo(pod)
klog.Infof("[%v]: Adding allocated pod to affinity group %v...", internal.Key(pod), s.AffinityGroup.Name)
klog.Infof("[%v]: Adding to node %v, GPUs %v", internal.Key(pod), info.Node, common.ToJson(info.GpuIsolation))
klog.Infof("[%v]: Adding to node %v, leaf cells %v", internal.Key(pod), info.Node, common.ToJson(info.LeafCellIsolation))
podIndex := int32(0)
if g := h.affinityGroups[s.AffinityGroup.Name]; g != nil {
if g.state == groupPreempting {
h.allocatePreemptingAffinityGroup(g, pod)
}
if podIndex = getAllocatedPodIndex(info, s.GpuNumber); podIndex == -1 {
klog.Errorf("[%v]: Pod placement not found in group %v: node %v, GPUs %v",
internal.Key(pod), s.AffinityGroup.Name, info.Node, info.GpuIsolation)
if podIndex = getAllocatedPodIndex(info, s.LeafCellNumber); podIndex == -1 {
klog.Errorf("[%v]: Pod placement not found in group %v: node %v, leaf cells %v",
internal.Key(pod), s.AffinityGroup.Name, info.Node, info.LeafCellIsolation)
return
}
} else {
h.createAllocatedAffinityGroup(s, info, pod)
}
h.affinityGroups[s.AffinityGroup.Name].allocatedPods[s.GpuNumber][podIndex] = pod
h.affinityGroups[s.AffinityGroup.Name].allocatedPods[s.LeafCellNumber][podIndex] = pod
}
func (h *HivedAlgorithm) DeleteAllocatedPod(pod *core.Pod) {
@ -276,18 +276,18 @@ func (h *HivedAlgorithm) DeleteAllocatedPod(pod *core.Pod) {
s := internal.ExtractPodSchedulingSpec(pod)
info := internal.ExtractPodBindInfo(pod)
klog.Infof("[%v]: Deleting allocated pod from affinity group %v...", internal.Key(pod), s.AffinityGroup.Name)
klog.Infof("[%v]: Deleting from node %v, GPUs %v", internal.Key(pod), info.Node, common.ToJson(info.GpuIsolation))
klog.Infof("[%v]: Deleting from node %v, leaf cells %v", internal.Key(pod), info.Node, common.ToJson(info.LeafCellIsolation))
if g := h.affinityGroups[s.AffinityGroup.Name]; g == nil {
klog.Errorf("[%v]: Group %v not found when deleting pod", internal.Key(pod), s.AffinityGroup.Name)
return
} else {
if podIndex := getAllocatedPodIndex(info, s.GpuNumber); podIndex == -1 {
klog.Errorf("[%v]: Pod placement not found in group %v: node %v, GPUs %v",
internal.Key(pod), s.AffinityGroup.Name, info.Node, info.GpuIsolation)
if podIndex := getAllocatedPodIndex(info, s.LeafCellNumber); podIndex == -1 {
klog.Errorf("[%v]: Pod placement not found in group %v: node %v, leaf cells %v",
internal.Key(pod), s.AffinityGroup.Name, info.Node, info.LeafCellIsolation)
return
} else {
g.allocatedPods[s.GpuNumber][podIndex] = nil
g.allocatedPods[s.LeafCellNumber][podIndex] = nil
}
if allPodsReleased(g.allocatedPods) {
h.deleteAllocatedAffinityGroup(g, pod)
@ -470,11 +470,11 @@ func (h *HivedAlgorithm) setBadNode(nodeName string) {
}
h.badNodes.Add(nodeName)
for _, ccl := range h.fullCellList {
for _, gpu := range ccl[1] {
pGpu := gpu.(*PhysicalCell)
nodes, _ := pGpu.GetPhysicalPlacement()
for _, leafCell := range ccl[1] {
pLeafCell := leafCell.(*PhysicalCell)
nodes, _ := pLeafCell.GetPhysicalPlacement()
if nodes[0] == nodeName {
h.setBadCell(pGpu)
h.setBadCell(pLeafCell)
}
}
}
@ -487,11 +487,11 @@ func (h *HivedAlgorithm) setHealthyNode(nodeName string) {
}
h.badNodes.Delete(nodeName)
for _, ccl := range h.fullCellList {
for _, gpu := range ccl[1] {
pGpu := gpu.(*PhysicalCell)
nodes, _ := pGpu.GetPhysicalPlacement()
for _, leafCell := range ccl[1] {
pLeafCell := leafCell.(*PhysicalCell)
nodes, _ := pLeafCell.GetPhysicalPlacement()
if nodes[0] == nodeName {
h.setHealthyCell(pGpu)
h.setHealthyCell(pLeafCell)
}
}
}
@ -499,7 +499,7 @@ func (h *HivedAlgorithm) setHealthyNode(nodeName string) {
// setBadCell marks a physical cell (and also the virtual cell it is bound to) as bad,
// and recursively for its parent, guaranteeing that a cell is bad if any of its children is bad.
// setBadCell always starts from the lowest level, i.e., GPU-level cells.
// setBadCell always starts from the lowest level, i.e., leaf-level cells.
func (h *HivedAlgorithm) setBadCell(c *PhysicalCell) {
if !c.IsHealthy() {
return
@ -522,7 +522,7 @@ func (h *HivedAlgorithm) setBadCell(c *PhysicalCell) {
// setHealthyCell marks a physical cell (and also the virtual cell it is bound to) as healthy,
// and recursively for its parent, guaranteeing that a cell is healthy if all of its children are healthy.
// setHealthy always starts from the lowest level, i.e., GPU-level cells.
// setHealthyCell always starts from the lowest level, i.e., leaf-level cells.
func (h *HivedAlgorithm) setHealthyCell(c *PhysicalCell) {
if c.IsHealthy() {
return
@ -667,23 +667,23 @@ func (h *HivedAlgorithm) schedulePodFromExistingGroup(
podIndex int32) {
badOrNonSuggestedNodes := collectBadOrNonSuggestedNodes(
g.physicalGpuPlacement, suggestedNodes, g.ignoreK8sSuggestedNodes)
g.physicalLeafCellPlacement, suggestedNodes, g.ignoreK8sSuggestedNodes)
// state of an existing group can be either Allocated or Preempting
if g.state == groupAllocated {
klog.Infof("[%v]: Pod is from an affinity group that is already allocated: %v",
internal.Key(pod), s.AffinityGroup.Name)
groupPhysicalPlacement = g.physicalGpuPlacement
groupVirtualPlacement = g.virtualGpuPlacement
groupPhysicalPlacement = g.physicalLeafCellPlacement
groupVirtualPlacement = g.virtualLeafCellPlacement
if !badOrNonSuggestedNodes.IsEmpty() {
// for an allocated group, we always insist on the previous scheduling decision
// even if some pods are now bad or not within suggested nodes
klog.Warningf("[%v]: Some nodes allocated to affinity group %v are no longer "+
"healthy and within K8s suggested nodes: %v", internal.Key(pod), g.name, badOrNonSuggestedNodes)
}
if podIndex = getNewPodIndex(g.allocatedPods[s.GpuNumber]); podIndex == -1 {
if podIndex = getNewPodIndex(g.allocatedPods[s.LeafCellNumber]); podIndex == -1 {
panic(internal.NewBadRequestError(fmt.Sprintf(
"Requesting more pods than the configured number for %v GPUs (%v pods) in affinity group %v",
s.GpuNumber, g.totalPodNums[s.GpuNumber], s.AffinityGroup.Name)))
"Requesting more pods than the configured number for %v leaf cells (%v pods) in affinity group %v",
s.LeafCellNumber, g.totalPodNums[s.LeafCellNumber], s.AffinityGroup.Name)))
}
} else { // groupPreempting
klog.Infof("[%v]: Pod is from an affinity group that is preempting others: %v",
@ -698,8 +698,8 @@ func (h *HivedAlgorithm) schedulePodFromExistingGroup(
internal.Key(pod), g.name, badOrNonSuggestedNodes)
h.deletePreemptingAffinityGroup(g, pod)
} else {
groupPhysicalPlacement = g.physicalGpuPlacement
groupVirtualPlacement = g.virtualGpuPlacement
groupPhysicalPlacement = g.physicalLeafCellPlacement
groupVirtualPlacement = g.virtualLeafCellPlacement
preemptionVictims, _ = collectPreemptionVictims(groupPhysicalPlacement)
if len(preemptionVictims) == 0 {
klog.Infof(
@ -751,7 +751,7 @@ func (h *HivedAlgorithm) schedulePodFromNewGroup(
return groupPhysicalPlacement, groupVirtualPlacement, preemptionVictims, waitReason
}
// scheduleNewAffinityGroup schedules each pod of a new affinity group to a set of GPUs
// scheduleNewAffinityGroup schedules each pod of a new affinity group to a set of leaf cells
// (in both the physical cluster and the VC). This is the entrance of a new scheduling attempt.
func (h *HivedAlgorithm) scheduleNewAffinityGroup(
pod *core.Pod,
@ -773,33 +773,33 @@ func (h *HivedAlgorithm) scheduleNewAffinityGroup(
ignoreSuggestedNodes: s.IgnoreK8sSuggestedNodes,
}
for _, m := range s.AffinityGroup.Members {
// we will merge group members with same GPU number
sr.affinityGroupPodNums[m.GpuNumber] += m.PodNumber
// we will merge group members with the same leaf cell number
sr.affinityGroupPodNums[m.LeafCellNumber] += m.PodNumber
}
h.validateSchedulingRequest(sr, pod)
if sr.pinnedCellId != "" {
klog.Infof("Using pinned cell %v", s.PinnedCellId)
physicalPlacement, virtualPlacement, failedReason = h.handleSchedulingRequest(sr)
} else if s.GpuType != "" {
if _, ok := h.cellChains[s.GpuType]; !ok {
} else if s.LeafCellType != "" {
if _, ok := h.cellChains[s.LeafCellType]; !ok {
panic(internal.NewBadRequestError(fmt.Sprintf(
"[%v]: Pod requesting GPU type %v which the whole cluster does not have",
internal.Key(pod), s.GpuType)))
"[%v]: Pod requesting leaf cell type %v which the whole cluster does not have",
internal.Key(pod), s.LeafCellType)))
}
klog.Infof("Using specified GPU type %v", s.GpuType)
physicalPlacement, virtualPlacement, failedReason = h.scheduleAffinityGroupForGpuType(
sr, s.GpuType, pod, true)
klog.Infof("Using specified leaf cell type %v", s.LeafCellType)
physicalPlacement, virtualPlacement, failedReason = h.scheduleAffinityGroupForLeafCellType(
sr, s.LeafCellType, pod, true)
} else {
physicalPlacement, virtualPlacement, failedReason = h.scheduleAffinityGroupForAnyGpuType(sr, pod)
physicalPlacement, virtualPlacement, failedReason = h.scheduleAffinityGroupForAnyLeafCellType(sr, pod)
}
return physicalPlacement, virtualPlacement, failedReason
}
// scheduleAffinityGroupForGpuType schedules an affinity group in a certain cell chain
// that matches the given GPU type.
func (h *HivedAlgorithm) scheduleAffinityGroupForGpuType(
// scheduleAffinityGroupForLeafCellType schedules an affinity group in a certain cell chain
// that matches the given leaf cell type.
func (h *HivedAlgorithm) scheduleAffinityGroupForLeafCellType(
sr schedulingRequest,
gpuType string,
leafCellType string,
pod *core.Pod,
typeSpecified bool) (
physicalPlacement groupPhysicalPlacement,
@ -807,7 +807,7 @@ func (h *HivedAlgorithm) scheduleAffinityGroupForGpuType(
failedReason string) {
vcHasType := false
for _, chain := range h.cellChains[gpuType] {
for _, chain := range h.cellChains[leafCellType] {
if sr.priority < minGuaranteedPriority ||
h.vcSchedulers[sr.vc].getNonPinnedPreassignedCells()[chain] != nil {
vcHasType = true
@ -822,15 +822,15 @@ func (h *HivedAlgorithm) scheduleAffinityGroupForGpuType(
}
if typeSpecified && sr.priority >= minGuaranteedPriority && !vcHasType {
panic(internal.NewBadRequestError(fmt.Sprintf(
"[%v]: Pod requesting GPU type %v which VC %v does not have",
internal.Key(pod), gpuType, sr.vc)))
"[%v]: Pod requesting leaf cell type %v which VC %v does not have",
internal.Key(pod), leafCellType, sr.vc)))
}
return nil, nil, failedReason
}
// scheduleAffinityGroupForAnyGpuType schedules an affinity group in every possible GPU type
// (when the user does not specify a GPU type).
func (h *HivedAlgorithm) scheduleAffinityGroupForAnyGpuType(
// scheduleAffinityGroupForAnyLeafCellType schedules an affinity group in every possible leaf cell type
// (when the user does not specify a leaf cell type).
func (h *HivedAlgorithm) scheduleAffinityGroupForAnyLeafCellType(
sr schedulingRequest,
pod *core.Pod) (
groupPhysicalPlacement,
@ -838,10 +838,10 @@ func (h *HivedAlgorithm) scheduleAffinityGroupForAnyGpuType(
string) {
var failedReason string
for gpuType := range h.cellChains {
klog.Infof("Searching GPU type %v", gpuType)
for leafCellType := range h.cellChains {
klog.Infof("Searching leaf cell type %v", leafCellType)
typePhysicalPlacement, typeVirtualPlacement, typeFailedReason :=
h.scheduleAffinityGroupForGpuType(sr, gpuType, pod, false)
h.scheduleAffinityGroupForLeafCellType(sr, leafCellType, pod, false)
if typePhysicalPlacement != nil {
return typePhysicalPlacement, typeVirtualPlacement, ""
}
@ -880,7 +880,7 @@ func (h *HivedAlgorithm) handleSchedulingRequest(
if sr.pinnedCellId != "" {
str = fmt.Sprintf("pinned cell %v", sr.pinnedCellId)
}
klog.Infof("Processing scheduling request: %v, GPU numbers %v, priority %v",
klog.Infof("Processing scheduling request: %v, leaf cell numbers %v, priority %v",
str, common.ToJson(sr.affinityGroupPodNums), sr.priority)
if sr.priority >= minGuaranteedPriority {
physicalPlacement, virtualPlacement, failedReason = h.scheduleGuaranteedAffinityGroup(sr)
@ -910,10 +910,10 @@ func (h *HivedAlgorithm) scheduleGuaranteedAffinityGroup(
}
// map the vc placement to the physical cluster
bindings := map[api.CellAddress]*PhysicalCell{}
gpuNums := common.Int32MapKeys(sr.affinityGroupPodNums)
common.SortInt32(gpuNums)
lazyPreemptedGroups := h.tryLazyPreempt(virtualPlacement, gpuNums, sr.affinityGroupName)
preassignedCells, nonPreassignedCells := virtualPlacement.toBindingPaths(gpuNums, bindings)
leafCellNums := common.Int32MapKeys(sr.affinityGroupPodNums)
common.SortInt32(leafCellNums)
lazyPreemptedGroups := h.tryLazyPreempt(virtualPlacement, leafCellNums, sr.affinityGroupName)
preassignedCells, nonPreassignedCells := virtualPlacement.toBindingPaths(leafCellNums, bindings)
// make a copy of freeCellNum, may change its values during allocation
freeCellNumCopy := map[CellLevel]int32{}
for k, v := range h.allVCFreeCellNum[sr.chain] {
@ -927,7 +927,7 @@ func (h *HivedAlgorithm) scheduleGuaranteedAffinityGroup(
sr.suggestedNodes,
sr.ignoreSuggestedNodes,
bindings); ok {
return virtualPlacement.toPhysicalPlacement(bindings, gpuNums), virtualPlacement, ""
return virtualPlacement.toPhysicalPlacement(bindings, leafCellNums), virtualPlacement, ""
}
for groupName, placement := range lazyPreemptedGroups {
h.revertLazyPreempt(h.affinityGroups[groupName], placement)
@ -944,18 +944,18 @@ func (h *HivedAlgorithm) scheduleGuaranteedAffinityGroup(
// tryLazyPreempt tries to lazy preempt the affinity groups found on a placement.
func (h *HivedAlgorithm) tryLazyPreempt(
p groupVirtualPlacement,
gpuNums []int32,
leafCellNums []int32,
groupName string) map[string]groupVirtualPlacement {
preemptedGroups := map[string]groupVirtualPlacement{}
for _, podGpuNum := range gpuNums {
podPlacements := p[podGpuNum]
for _, podLeafCellNum := range leafCellNums {
podPlacements := p[podLeafCellNum]
for _, pod := range podPlacements {
for _, gpu := range pod {
if pGpu := gpu.(*VirtualCell).GetPhysicalCell(); pGpu != nil {
if pGpu.GetState() == cellUsed && pGpu.GetUsingGroup().lazyPreemptionEnable {
preemptedGroups[pGpu.GetUsingGroup().name] = h.lazyPreemptAffinityGroup(
pGpu.GetUsingGroup(), groupName)
for _, leafCell := range pod {
if pLeafCell := leafCell.(*VirtualCell).GetPhysicalCell(); pLeafCell != nil {
if pLeafCell.GetState() == cellUsed && pLeafCell.GetUsingGroup().lazyPreemptionEnable {
preemptedGroups[pLeafCell.GetUsingGroup().name] = h.lazyPreemptAffinityGroup(
pLeafCell.GetUsingGroup(), groupName)
}
}
}
@ -985,46 +985,46 @@ func (h *HivedAlgorithm) createAllocatedAffinityGroup(s *api.PodSchedulingSpec,
s.AffinityGroup, s.VirtualCluster, s.LazyPreemptionEnable, s.Priority, groupAllocated)
shouldLazyPreempt := false
for _, gms := range info.AffinityGroupBindInfo {
gpuNumber := int32(len(gms.PodPlacements[0].PhysicalGpuIndices))
leafCellNumber := int32(len(gms.PodPlacements[0].PhysicalLeafCellIndices))
for podIndex := int32(0); podIndex < int32(len(gms.PodPlacements)); podIndex++ {
node := gms.PodPlacements[podIndex].PhysicalNode
for gpuIndex := int32(0); gpuIndex < int32(
len(gms.PodPlacements[podIndex].PhysicalGpuIndices)); gpuIndex++ {
pGpu, vGpu, lazyPreempt := h.findAllocatedGpu(
gpuIndex,
gms.PodPlacements[podIndex].PhysicalGpuIndices,
for leafCellIndex := int32(0); leafCellIndex < int32(
len(gms.PodPlacements[podIndex].PhysicalLeafCellIndices)); leafCellIndex++ {
pLeafCell, vLeafCell, lazyPreempt := h.findAllocatedLeafCell(
leafCellIndex,
gms.PodPlacements[podIndex].PhysicalLeafCellIndices,
gms.PodPlacements[podIndex].PreassignedCellTypes,
CellChain(info.CellChain), node, shouldLazyPreempt, s, newGroup, pod)
if pGpu == nil {
// pGpu not being found means that this GPU address does not exist in the spec.
// we simply ignore this GPU, and let the job run normally
// (but we cannot ignore the other GPUs of this pod that are still in the spec,
if pLeafCell == nil {
// pLeafCell not being found means that this leaf cell address does not exist in the spec.
// we simply ignore this leaf cell, and let the job run normally
// (but we cannot ignore the other leaf cells of this pod that are still in the spec,
// otherwise it may cause resource conflicts)
continue
} else {
newGroup.physicalGpuPlacement[gpuNumber][podIndex][gpuIndex] = pGpu
newGroup.physicalLeafCellPlacement[leafCellNumber][podIndex][leafCellIndex] = pLeafCell
if lazyPreempt == nil {
newGroup.virtualGpuPlacement = nil
} else if vGpu != nil {
newGroup.virtualGpuPlacement[gpuNumber][podIndex][gpuIndex] = vGpu
if inFreeCellList(pGpu) && vGpu.GetPreassignedCell().GetPriority() > freePriority {
newGroup.virtualLeafCellPlacement = nil
} else if vLeafCell != nil {
newGroup.virtualLeafCellPlacement[leafCellNumber][podIndex][leafCellIndex] = vLeafCell
if inFreeCellList(pLeafCell) && vLeafCell.GetPreassignedCell().GetPriority() > freePriority {
// This means we decide to bind this cell to a virtual cell whose preassigned cell
// has been bound (in cases like reconfiguration and the VC's cells are fewer than before).
// We need to destroy the previous binding, by lazy preempting all the groups
// in the preassigned cell
h.lazyPreemptCell(vGpu.GetPreassignedCell(), newGroup.name)
h.lazyPreemptCell(vLeafCell.GetPreassignedCell(), newGroup.name)
}
} else {
shouldLazyPreempt = shouldLazyPreempt || *lazyPreempt
}
// Even if we have successfully found the vGpu and pGpu, there is still one possibility
// Even if we have successfully found the vLeafCell and pLeafCell, there is still one possibility
// that we should not bind them: allocating the physical cell may lead to broken safety.
// Such case won't happen by design as buddy alloc guarantees safety; but this could
// happen due to inconsistency of VC assignments for reasons like reconfiguration.
// In this case, we will lazy preempt this affinity group.
safetyOk, reason := h.allocateGpu(pGpu, vGpu, CellPriority(s.Priority), newGroup.vc)
pGpu.AddUsingGroup(newGroup)
setCellState(pGpu, cellUsed)
safetyOk, reason := h.allocateLeafCell(pLeafCell, vLeafCell, CellPriority(s.Priority), newGroup.vc)
pLeafCell.AddUsingGroup(newGroup)
setCellState(pLeafCell, cellUsed)
if !safetyOk {
shouldLazyPreempt = true
klog.Warningf("[%v]: %v", internal.Key(pod), reason)
@ -1045,22 +1045,22 @@ func (h *HivedAlgorithm) createAllocatedAffinityGroup(s *api.PodSchedulingSpec,
func (h *HivedAlgorithm) deleteAllocatedAffinityGroup(g *AlgoAffinityGroup, pod *core.Pod) {
klog.Infof("[%v]: All pods complete, deleting allocated affinity group: %v",
internal.Key(pod), g.name)
for _, podPlacements := range g.physicalGpuPlacement {
for _, podPlacements := range g.physicalLeafCellPlacement {
for _, podPlacement := range podPlacements {
for _, gpu := range podPlacement {
if gpu == nil {
for _, leafCell := range podPlacement {
if leafCell == nil {
continue
}
pGpu := gpu.(*PhysicalCell)
pGpu.DeleteUsingGroup(g)
// state of pGpu can be either Used or Reserving
if pGpu.GetState() == cellUsed {
h.releaseGpu(pGpu, g.vc)
setCellState(pGpu, cellFree)
pLeafCell := leafCell.(*PhysicalCell)
pLeafCell.DeleteUsingGroup(g)
// state of pLeafCell can be either Used or Reserving
if pLeafCell.GetState() == cellUsed {
h.releaseLeafCell(pLeafCell, g.vc)
setCellState(pLeafCell, cellFree)
} else { // cellReserving
// When pGpu is in Reserving state, we shouldn't call h.releaseGpu
// When pLeafCell is in Reserving state, we shouldn't call h.releaseLeafCell
// because it must have been allocated to the reserving group before
setCellState(pGpu, cellReserved)
setCellState(pLeafCell, cellReserved)
}
}
}
@ -1082,26 +1082,26 @@ func (h *HivedAlgorithm) createPreemptingAffinityGroup(
klog.Infof("[%v]: Creating new preempting affinity group: %v", internal.Key(pod), s.AffinityGroup.Name)
newGroup := newAlgoAffinityGroup(
s.AffinityGroup, s.VirtualCluster, s.LazyPreemptionEnable, s.Priority, groupPreempting)
newGroup.physicalGpuPlacement = physicalPlacement
newGroup.virtualGpuPlacement = virtualPlacement
for gpuNum := range physicalPlacement {
for podIndex := range physicalPlacement[gpuNum] {
for gpuIndex, gpu := range physicalPlacement[gpuNum][podIndex] {
pGpu := gpu.(*PhysicalCell)
vGpu := virtualPlacement[gpuNum][podIndex][gpuIndex].(*VirtualCell)
if pGpu.GetState() == cellUsed {
usingGroup := pGpu.GetUsingGroup()
h.releaseGpu(pGpu, usingGroup.vc)
newGroup.physicalLeafCellPlacement = physicalPlacement
newGroup.virtualLeafCellPlacement = virtualPlacement
for leafCellNum := range physicalPlacement {
for podIndex := range physicalPlacement[leafCellNum] {
for leafCellIndex, leafCell := range physicalPlacement[leafCellNum][podIndex] {
pLeafCell := leafCell.(*PhysicalCell)
vLeafCell := virtualPlacement[leafCellNum][podIndex][leafCellIndex].(*VirtualCell)
if pLeafCell.GetState() == cellUsed {
usingGroup := pLeafCell.GetUsingGroup()
h.releaseLeafCell(pLeafCell, usingGroup.vc)
usingGroup.state = groupBeingPreempted
}
h.allocateGpu(pGpu, vGpu, CellPriority(s.Priority), newGroup.vc)
pGpu.AddReservingOrReservedGroup(newGroup)
// state of pGpu can be either Used or Free (if it was Reserving or Reserved,
h.allocateLeafCell(pLeafCell, vLeafCell, CellPriority(s.Priority), newGroup.vc)
pLeafCell.AddReservingOrReservedGroup(newGroup)
// state of pLeafCell can be either Used or Free (if it was Reserving or Reserved,
// we must have canceled the ongoing preemption before, in h.Schedule)
if pGpu.GetState() == cellUsed {
setCellState(pGpu, cellReserving)
if pLeafCell.GetState() == cellUsed {
setCellState(pLeafCell, cellReserving)
} else { // cellFree
setCellState(pGpu, cellReserved)
setCellState(pLeafCell, cellReserved)
}
}
}
@ -1114,27 +1114,27 @@ func (h *HivedAlgorithm) createPreemptingAffinityGroup(
// deletePreemptingAffinityGroup revokes a preemption and deletes the affinity group that is
// still waiting for the completion of the preemption.
func (h *HivedAlgorithm) deletePreemptingAffinityGroup(g *AlgoAffinityGroup, pod *core.Pod) {
for gpuNum := range g.physicalGpuPlacement {
for podIndex := range g.physicalGpuPlacement[gpuNum] {
for _, gpu := range g.physicalGpuPlacement[gpuNum][podIndex] {
pGpu := gpu.(*PhysicalCell)
h.releaseGpu(pGpu, g.vc)
pGpu.DeleteReservingOrReservedGroup(pGpu.GetReservingOrReservedGroup())
// state of pGpu can be either Reserving or Reserved
if pGpu.GetState() == cellReserving {
setCellState(pGpu, cellUsed)
for leafCellNum := range g.physicalLeafCellPlacement {
for podIndex := range g.physicalLeafCellPlacement[leafCellNum] {
for _, leafCell := range g.physicalLeafCellPlacement[leafCellNum][podIndex] {
pLeafCell := leafCell.(*PhysicalCell)
h.releaseLeafCell(pLeafCell, g.vc)
pLeafCell.DeleteReservingOrReservedGroup(pLeafCell.GetReservingOrReservedGroup())
// state of pLeafCell can be either Reserving or Reserved
if pLeafCell.GetState() == cellReserving {
setCellState(pLeafCell, cellUsed)
// return the cell to the group being preempted
beingPreemptedGroup := pGpu.GetUsingGroup()
var beingPreemptedVGpu *VirtualCell
if beingPreemptedGroup.virtualGpuPlacement != nil {
beingPreemptedVGpu = retrieveVirtualCell(
beingPreemptedGroup.physicalGpuPlacement,
beingPreemptedGroup.virtualGpuPlacement, pGpu)
beingPreemptedGroup := pLeafCell.GetUsingGroup()
var beingPreemptedVLeafCell *VirtualCell
if beingPreemptedGroup.virtualLeafCellPlacement != nil {
beingPreemptedVLeafCell = retrieveVirtualCell(
beingPreemptedGroup.physicalLeafCellPlacement,
beingPreemptedGroup.virtualLeafCellPlacement, pLeafCell)
}
h.allocateGpu(
pGpu, beingPreemptedVGpu, CellPriority(beingPreemptedGroup.priority), beingPreemptedGroup.vc)
h.allocateLeafCell(
pLeafCell, beingPreemptedVLeafCell, CellPriority(beingPreemptedGroup.priority), beingPreemptedGroup.vc)
} else { // cellReserved
setCellState(pGpu, cellFree)
setCellState(pLeafCell, cellFree)
}
}
}
@ -1146,13 +1146,13 @@ func (h *HivedAlgorithm) deletePreemptingAffinityGroup(g *AlgoAffinityGroup, pod
// allocatePreemptingAffinityGroup lets a preemptor affinity group whose preemption has completed
// transition to allocated state.
func (h *HivedAlgorithm) allocatePreemptingAffinityGroup(g *AlgoAffinityGroup, pod *core.Pod) {
for gpuNum := range g.physicalGpuPlacement {
for podIndex := range g.physicalGpuPlacement[gpuNum] {
for _, gpu := range g.physicalGpuPlacement[gpuNum][podIndex] {
pGpu := gpu.(*PhysicalCell)
pGpu.DeleteReservingOrReservedGroup(g)
pGpu.AddUsingGroup(g)
setCellState(pGpu, cellUsed)
for leafCellNum := range g.physicalLeafCellPlacement {
for podIndex := range g.physicalLeafCellPlacement[leafCellNum] {
for _, leafCell := range g.physicalLeafCellPlacement[leafCellNum][podIndex] {
pLeafCell := leafCell.(*PhysicalCell)
pLeafCell.DeleteReservingOrReservedGroup(g)
pLeafCell.AddUsingGroup(g)
setCellState(pLeafCell, cellUsed)
}
}
}
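For orientation, a hedged summary of the leaf-cell state transitions driven by the affinity-group lifecycle functions above; this table is reconstructed only from the hunks in this diff (state and function names match the code, the grouping is editorial).

package main

import "fmt"

func main() {
	// event -> observed transitions of a physical leaf cell's state
	transitions := map[string][]string{
		"createPreemptingAffinityGroup": {
			"cellUsed -> cellReserving (victim group still holds the cell)",
			"cellFree -> cellReserved",
		},
		"deletePreemptingAffinityGroup": {
			"cellReserving -> cellUsed (cell returned to the group being preempted)",
			"cellReserved -> cellFree",
		},
		"allocatePreemptingAffinityGroup": {
			"cellReserving/cellReserved -> cellUsed",
		},
		"deleteAllocatedAffinityGroup": {
			"cellUsed -> cellFree",
			"cellReserving -> cellReserved",
		},
	}
	for event, ts := range transitions {
		fmt.Println(event, ts)
	}
}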
@ -1166,20 +1166,20 @@ func (h *HivedAlgorithm) allocatePreemptingAffinityGroup(g *AlgoAffinityGroup, p
func (h *HivedAlgorithm) lazyPreemptAffinityGroup(
victim *AlgoAffinityGroup,
preemptor string) (originalVirtualPlacement groupVirtualPlacement) {
for _, podVirtualPlacements := range victim.virtualGpuPlacement {
for _, podVirtualPlacements := range victim.virtualLeafCellPlacement {
for _, podVirtualPlacement := range podVirtualPlacements {
for _, gpu := range podVirtualPlacement {
if gpu != nil {
vGpu := gpu.(*VirtualCell)
pGpu := vGpu.GetPhysicalCell()
h.releaseGpu(pGpu, victim.vc)
h.allocateGpu(pGpu, nil, opportunisticPriority, victim.vc)
for _, leafCell := range podVirtualPlacement {
if leafCell != nil {
vLeafCell := leafCell.(*VirtualCell)
pLeafCell := vLeafCell.GetPhysicalCell()
h.releaseLeafCell(pLeafCell, victim.vc)
h.allocateLeafCell(pLeafCell, nil, opportunisticPriority, victim.vc)
}
}
}
}
originalVirtualPlacement = victim.virtualGpuPlacement
victim.virtualGpuPlacement = nil
originalVirtualPlacement = victim.virtualLeafCellPlacement
victim.virtualLeafCellPlacement = nil
victim.lazyPreemptionStatus = &api.LazyPreemptionStatus{
Preemptor: preemptor,
PreemptionTime: meta.Now(),
@ -1200,30 +1200,30 @@ func (h *HivedAlgorithm) lazyPreemptCell(c *VirtualCell, preemptor string) {
// revertLazyPreempt reverts the lazy preemption of an affinity group.
func (h *HivedAlgorithm) revertLazyPreempt(g *AlgoAffinityGroup, virtualPlacement groupVirtualPlacement) {
for gpuNum := range g.physicalGpuPlacement {
for podIndex := range g.physicalGpuPlacement[gpuNum] {
for gpuIndex, gpu := range g.physicalGpuPlacement[gpuNum][podIndex] {
if gpu == nil {
for leafCellNum := range g.physicalLeafCellPlacement {
for podIndex := range g.physicalLeafCellPlacement[leafCellNum] {
for leafCellIndex, leafCell := range g.physicalLeafCellPlacement[leafCellNum][podIndex] {
if leafCell == nil {
continue
}
pGpu := gpu.(*PhysicalCell)
vGpu := virtualPlacement[gpuNum][podIndex][gpuIndex].(*VirtualCell)
h.releaseGpu(pGpu, g.vc)
h.allocateGpu(pGpu, vGpu, CellPriority(g.priority), g.vc)
pLeafCell := leafCell.(*PhysicalCell)
vLeafCell := virtualPlacement[leafCellNum][podIndex][leafCellIndex].(*VirtualCell)
h.releaseLeafCell(pLeafCell, g.vc)
h.allocateLeafCell(pLeafCell, vLeafCell, CellPriority(g.priority), g.vc)
}
}
}
g.virtualGpuPlacement = virtualPlacement
g.virtualLeafCellPlacement = virtualPlacement
g.lazyPreemptionStatus = nil
klog.Infof("Lazy preemption of affinity group %v is reverted", g.name)
}
// findAllocatedGpu finds the physical and virtual GPUs in the full cell lists for an allocate pod.
// findAllocatedLeafCell finds the physical and virtual leaf cells in the full cell lists for an allocated pod.
// The boolean return value indicates whether the affinity group should be lazy-preempted.
// The bool being nil means the group is OT and has no virtual placement.
func (h *HivedAlgorithm) findAllocatedGpu(
func (h *HivedAlgorithm) findAllocatedLeafCell(
index int32,
physicalGpuIndices []int32,
physicalLeafCellIndices []int32,
preassignedCellTypes []api.CellType,
chain CellChain,
node string,
@ -1233,24 +1233,24 @@ func (h *HivedAlgorithm) findAllocatedGpu(
pod *core.Pod) (*PhysicalCell, *VirtualCell, *bool) {
priority := CellPriority(s.Priority)
physicalGpuIndex := physicalGpuIndices[index]
if pGpu := findPhysicalGpu(h.fullCellList, chain, node, physicalGpuIndex); pGpu == nil {
physicalLeafCellIndex := physicalLeafCellIndices[index]
if pLeafCell := findPhysicalLeafCell(h.fullCellList, chain, node, physicalLeafCellIndex); pLeafCell == nil {
klog.Warningf(
"[%v]: Cannot find GPU %v on node %v: not found in the spec. Pod ignored",
internal.Key(pod), physicalGpuIndex, node)
"[%v]: Cannot find leaf cell %v on node %v: not found in the spec. Pod ignored",
internal.Key(pod), physicalLeafCellIndex, node)
return nil, nil, common.PtrBool(false)
} else {
var vGpu *VirtualCell
var vLeafCell *VirtualCell
if preassignedCellTypes == nil {
klog.Warningf("[%v]: Cannot find virtual cell: preassigned cell not found in pod bind info", internal.Key(pod))
return pGpu, nil, common.PtrBool(true)
return pLeafCell, nil, common.PtrBool(true)
}
if group.virtualGpuPlacement != nil && !lazyPreempted {
if group.virtualLeafCellPlacement != nil && !lazyPreempted {
preassignedType := preassignedCellTypes[index]
if preassignedType != "" {
var preassignedLevel CellLevel
typeFound := false
for l, t := range h.cellTypes[pGpu.GetChain()] {
for l, t := range h.cellTypes[pLeafCell.GetChain()] {
if t == preassignedType {
preassignedLevel = l
typeFound = true
@ -1258,12 +1258,12 @@ func (h *HivedAlgorithm) findAllocatedGpu(
}
var message string
if !typeFound {
message = fmt.Sprintf("Preassigned cell type %v not found in chain %v", preassignedType, pGpu.GetChain())
message = fmt.Sprintf("Preassigned cell type %v not found in chain %v", preassignedType, pLeafCell.GetChain())
} else if vcs := h.vcSchedulers[s.VirtualCluster]; vcs == nil {
message = fmt.Sprintf("VC %v not found", s.VirtualCluster)
} else {
vccl := vcs.getNonPinnedPreassignedCells()[pGpu.GetChain()]
str := string(pGpu.GetChain())
vccl := vcs.getNonPinnedPreassignedCells()[pLeafCell.GetChain()]
str := string(pLeafCell.GetChain())
if s.PinnedCellId != "" {
vccl = vcs.getPinnedCells()[s.PinnedCellId]
str = string(s.PinnedCellId)
@ -1271,84 +1271,84 @@ func (h *HivedAlgorithm) findAllocatedGpu(
if vccl == nil {
message = fmt.Sprintf("VC %v has no cell for %v", s.VirtualCluster, str)
} else {
vGpu, message = mapPhysicalCellToVirtual(pGpu, vccl, preassignedLevel, priority)
vLeafCell, message = mapPhysicalCellToVirtual(pLeafCell, vccl, preassignedLevel, priority)
}
}
if vGpu == nil {
if vLeafCell == nil {
klog.Warningf("[%v]: Cannot find virtual cell: %v", internal.Key(pod), message)
return pGpu, nil, common.PtrBool(true)
return pLeafCell, nil, common.PtrBool(true)
} else {
return pGpu, vGpu, common.PtrBool(false)
return pLeafCell, vLeafCell, common.PtrBool(false)
}
} else {
return pGpu, nil, nil
return pLeafCell, nil, nil
}
} else {
return pGpu, nil, common.PtrBool(false)
return pLeafCell, nil, common.PtrBool(false)
}
}
}
// allocateGpu creates the cell bindings, allocates the preassigned cell (if necessary),
// allocateLeafCell creates the cell bindings, allocates the preassigned cell (if necessary),
// and sets the priority.
func (h *HivedAlgorithm) allocateGpu(
pGpu *PhysicalCell,
vGpu *VirtualCell,
func (h *HivedAlgorithm) allocateLeafCell(
pLeafCell *PhysicalCell,
vLeafCell *VirtualCell,
p CellPriority,
vcn api.VirtualClusterName) (safetyOk bool, reason string) {
safetyOk = true
if vGpu != nil {
setCellPriority(vGpu, p)
updateUsedGpuNumAtPriority(vGpu, p, true)
setCellPriority(pGpu, p)
updateUsedGpuNumAtPriority(pGpu, p, true)
pac := vGpu.GetPreassignedCell()
if vLeafCell != nil {
setCellPriority(vLeafCell, p)
updateUsedLeafCellNumAtPriority(vLeafCell, p, true)
setCellPriority(pLeafCell, p)
updateUsedLeafCellNumAtPriority(pLeafCell, p, true)
pac := vLeafCell.GetPreassignedCell()
preassignedNewlyBound := pac.GetPhysicalCell() == nil
if pGpu.GetVirtualCell() == nil {
if pLeafCell.GetVirtualCell() == nil {
// the binding could have been created before (when the cell is bad)
bindCell(pGpu, vGpu)
bindCell(pLeafCell, vLeafCell)
}
if preassignedNewlyBound {
safetyOk, reason = h.allocatePreassignedCell(pac.GetPhysicalCell(), vcn, false)
}
} else {
setCellPriority(pGpu, opportunisticPriority)
updateUsedGpuNumAtPriority(pGpu, opportunisticPriority, true)
pGpu.GetAPIStatus().VC = vcn
setCellPriority(pLeafCell, opportunisticPriority)
updateUsedLeafCellNumAtPriority(pLeafCell, opportunisticPriority, true)
pLeafCell.GetAPIStatus().VC = vcn
h.apiClusterStatus.VirtualClusters[vcn] = append(
h.apiClusterStatus.VirtualClusters[vcn], generateOTVirtualCell(pGpu.GetAPIStatus()))
h.apiClusterStatus.VirtualClusters[vcn], generateOTVirtualCell(pLeafCell.GetAPIStatus()))
}
return safetyOk, reason
}
// releaseGpu destroys the cell bindings, release the preassigned cell (if necessary),
// releaseLeafCell destroys the cell bindings, releases the preassigned cell (if necessary),
// and resets the priority.
func (h *HivedAlgorithm) releaseGpu(pGpu *PhysicalCell, vcn api.VirtualClusterName) {
if vGpu := pGpu.GetVirtualCell(); vGpu != nil {
updateUsedGpuNumAtPriority(vGpu, vGpu.GetPriority(), false)
setCellPriority(vGpu, freePriority)
preassignedPhysical := vGpu.GetPreassignedCell().GetPhysicalCell()
if pGpu.IsHealthy() {
func (h *HivedAlgorithm) releaseLeafCell(pLeafCell *PhysicalCell, vcn api.VirtualClusterName) {
if vLeafCell := pLeafCell.GetVirtualCell(); vLeafCell != nil {
updateUsedLeafCellNumAtPriority(vLeafCell, vLeafCell.GetPriority(), false)
setCellPriority(vLeafCell, freePriority)
preassignedPhysical := vLeafCell.GetPreassignedCell().GetPhysicalCell()
if pLeafCell.IsHealthy() {
// we won't unbind the cell if it is bad
unbindCell(pGpu)
unbindCell(pLeafCell)
}
// To check if we should release the preassigned cell, we cannot simply check if the
// virtual cell is already unbound. It's possible that the cell is bad, then the binding
// won't be destroyed automatically (the cell is still bound only because it is bad).
// If the below condition is true, then the preassigned cell is not in real use and we can hence release it.
if !preassignedPhysical.IsPinned() && vGpu.GetPreassignedCell().GetPriority() < minGuaranteedPriority &&
if !preassignedPhysical.IsPinned() && vLeafCell.GetPreassignedCell().GetPriority() < minGuaranteedPriority &&
!h.vcDoomedBadCells[vcn][preassignedPhysical.GetChain()].contains(
preassignedPhysical, preassignedPhysical.GetLevel()) {
h.releasePreassignedCell(preassignedPhysical, vcn, false)
}
} else {
pGpu.GetAPIStatus().VC = ""
pLeafCell.GetAPIStatus().VC = ""
h.apiClusterStatus.VirtualClusters[vcn] = deleteOTVirtualCell(
h.apiClusterStatus.VirtualClusters[vcn], pGpu.GetAddress())
h.apiClusterStatus.VirtualClusters[vcn], pLeafCell.GetAddress())
}
updateUsedGpuNumAtPriority(pGpu, pGpu.GetPriority(), false)
setCellPriority(pGpu, freePriority)
updateUsedLeafCellNumAtPriority(pLeafCell, pLeafCell.GetPriority(), false)
setCellPriority(pLeafCell, freePriority)
}
// allocatePreassignedCell allocates a physical cell to a preassigned virtual cell, removes the physical cell
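A minimal sketch of the release/re-allocate pairing used by lazy preemption above, with hypothetical stand-in functions: the real allocateLeafCell/releaseLeafCell are methods on *HivedAlgorithm and operate on *PhysicalCell/*VirtualCell, and the opportunistic priority value below is an assumption for illustration only.

package main

import "fmt"

// Hypothetical stand-ins for the methods shown in the diff above.
func releaseLeafCell(leafCell, vc string) {
	fmt.Printf("release %s from %s\n", leafCell, vc)
}

func allocateLeafCell(leafCell string, virtual interface{}, priority int32, vc string) {
	fmt.Printf("allocate %s (virtual=%v) at priority %d in %s\n", leafCell, virtual, priority, vc)
}

func main() {
	const opportunisticPriority int32 = -1 // assumed sentinel; defined elsewhere in the scheduler
	// Lazy preemption demotes a victim: each of its leaf cells is released and then
	// re-allocated with no virtual cell at opportunistic priority (see lazyPreemptAffinityGroup).
	releaseLeafCell("node1/leafcell3", "VC1")
	allocateLeafCell("node1/leafcell3", nil, opportunisticPriority, "VC1")
}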


@ -67,106 +67,106 @@ var group1, group2, group3, group4, group5, group6, group7, group8, group9, grou
group15, group16, group17, group18, group19, group20, group21, group22, group23, group24, group25, group26, group27,
group28, group29, group30, group31, group32, group33, group34 = &api.AffinityGroupSpec{
Name: "group1",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 1}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 1}},
}, &api.AffinityGroupSpec{
Name: "group2",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 1}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 1}},
}, &api.AffinityGroupSpec{
Name: "group3",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 8}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 8}},
}, &api.AffinityGroupSpec{
Name: "group4",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 1}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 1}},
}, &api.AffinityGroupSpec{
Name: "group5",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group6",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 1}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 1}},
}, &api.AffinityGroupSpec{
Name: "group7",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 3, GpuNumber: 8}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 3, LeafCellNumber: 8}},
}, &api.AffinityGroupSpec{
Name: "group8",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 8}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 8}},
}, &api.AffinityGroupSpec{
Name: "group9",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 7}, {PodNumber: 1, GpuNumber: 5}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 7}, {PodNumber: 1, LeafCellNumber: 5}},
}, &api.AffinityGroupSpec{
Name: "group10",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 1}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 1}},
}, &api.AffinityGroupSpec{
Name: "group11",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group12",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group13",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group14",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group15",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 2}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 2}},
}, &api.AffinityGroupSpec{
Name: "group16",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 2}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 2}},
}, &api.AffinityGroupSpec{
Name: "group17",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 2}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 2}},
}, &api.AffinityGroupSpec{
Name: "group18",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group19",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group20",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group21",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group22",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group23",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group24",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group25",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group26",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group27",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group28",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group29",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 4, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 4, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group30",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group31",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group32",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group33",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group34",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}
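A small runnable sketch of how an affinity group's members are merged by leaf cell number when a scheduling request is built (mirroring sr.affinityGroupPodNums[m.LeafCellNumber] += m.PodNumber above); group9, the heterogeneous group with a 7-leaf-cell member and a 5-leaf-cell member, is used as the example.

package main

import "fmt"

type member struct {
	PodNumber      int32
	LeafCellNumber int32
}

func main() {
	// group9's members: one pod wanting 7 leaf cells and one pod wanting 5
	members := []member{{PodNumber: 1, LeafCellNumber: 7}, {PodNumber: 1, LeafCellNumber: 5}}
	affinityGroupPodNums := map[int32]int32{}
	for _, m := range members {
		affinityGroupPodNums[m.LeafCellNumber] += m.PodNumber
	}
	fmt.Println(affinityGroupPodNums) // map[5:1 7:1]
}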
var pss = map[types.UID]api.PodSchedulingSpec{
@ -175,368 +175,368 @@ var pss = map[types.UID]api.PodSchedulingSpec{
Priority: 0,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 1,
LeafCellType: "DGX2-V100",
LeafCellNumber: 1,
AffinityGroup: group1,
}, "pod2": { // buddy of pod1
VirtualCluster: "VC1",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 1,
LeafCellType: "DGX2-V100",
LeafCellNumber: 1,
AffinityGroup: group2,
}, "pod3": { // non-buddy of pod 1 & 2 (avoidance of preemption)
VirtualCluster: "VC1",
Priority: 2,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 8,
LeafCellType: "DGX2-V100",
LeafCellNumber: 8,
AffinityGroup: group3,
}, "pod4": { // opportunistic pod (will stay away from the guaranteed pods)
VirtualCluster: "VC1",
Priority: -1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 1,
LeafCellType: "DGX2-V100",
LeafCellNumber: 1,
AffinityGroup: group4,
}, "pod5": { // use pinned cell
VirtualCluster: "VC1",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "VC1-YQW-DGX2",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group5,
}, "pod6": { // use pinned cell
VirtualCluster: "VC1",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "VC1-YQW-DGX2",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group5,
}, "pod7": { // insufficient VC cells; should return PodWaitInfo
VirtualCluster: "VC2",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX1-P100",
GpuNumber: 8,
LeafCellType: "DGX1-P100",
LeafCellNumber: 8,
AffinityGroup: group7,
}, "pod8": { // any GPU type; heterogeneous affinity group
}, "pod8": { // any leaf cell type; heterogeneous affinity group
VirtualCluster: "VC2",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "",
GpuNumber: 7,
LeafCellType: "",
LeafCellNumber: 7,
AffinityGroup: group9,
}, "pod9": { // any GPU type; heterogeneous affinity group
}, "pod9": { // any leaf cell type; heterogeneous affinity group
VirtualCluster: "VC2",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "",
GpuNumber: 5,
LeafCellType: "",
LeafCellNumber: 5,
AffinityGroup: group9,
}, "pod10": { // use a GPU type that the VC does not have; should User Error Panic
}, "pod10": { // use a leaf cell type that the VC does not have; should User Error Panic
VirtualCluster: "VC2",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 1,
LeafCellType: "DGX2-V100",
LeafCellNumber: 1,
AffinityGroup: group6,
}, "pod11": { // invalid affinity group configuration
VirtualCluster: "VC2",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX1-P100",
GpuNumber: 2,
LeafCellType: "DGX1-P100",
LeafCellNumber: 2,
AffinityGroup: group8,
}, "pod12": { // invalid affinity group configuration
VirtualCluster: "VC2",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX1-P100",
GpuNumber: 2,
LeafCellType: "DGX1-P100",
LeafCellNumber: 2,
AffinityGroup: group8,
}, "pod13": { // invalid VC
VirtualCluster: "surprise!",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX1-P100",
GpuNumber: 1,
LeafCellType: "DGX1-P100",
LeafCellNumber: 1,
AffinityGroup: group10,
}, "pod14": { // invalid pinned cell
VirtualCluster: "VC2",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "surprise!",
GpuType: "DGX1-P100",
GpuNumber: 1,
LeafCellType: "DGX1-P100",
LeafCellNumber: 1,
AffinityGroup: group10,
}, "pod15": { // invalid priority
VirtualCluster: "VC2",
Priority: 1001,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX1-P100",
GpuNumber: 1,
LeafCellType: "DGX1-P100",
LeafCellNumber: 1,
AffinityGroup: group10,
}, "pod16": { // trigger preemption
VirtualCluster: "VC1",
Priority: 2,
LazyPreemptionEnable: true,
PinnedCellId: "VC1-YQW-DGX2",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group11,
}, "pod17": { // trigger preemption
VirtualCluster: "VC1",
Priority: 2,
LazyPreemptionEnable: true,
PinnedCellId: "VC1-YQW-DGX2",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group11,
}, "pod18": { // used for test splitting physical cell hierarchies in reconfiguration
VirtualCluster: "VC1",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group12,
}, "pod19": { // used for test splitting physical cell hierarchies in reconfiguration
VirtualCluster: "VC1",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group12,
}, "pod20": { // guaranteed pod in splitting physical cell hierarchies
VirtualCluster: "VC1",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group13,
}, "pod21": { // guaranteed pod in splitting physical cell hierarchies
VirtualCluster: "VC1",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group13,
}, "pod22": { // opportunistic pod in splitting physical cell hierarchies
VirtualCluster: "VC1",
Priority: -1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group14,
}, "pod23": { // opportunistic pod in splitting physical cell hierarchies
VirtualCluster: "VC1",
Priority: -1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group14,
}, "pod24": { // used for triggering intra-VC preemption
VirtualCluster: "VC2",
Priority: 0,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "CT1",
GpuNumber: 2,
LeafCellType: "CT1",
LeafCellNumber: 2,
AffinityGroup: group15,
}, "pod25": { // trigger intra-VC preemption
VirtualCluster: "VC2",
Priority: 1,
LazyPreemptionEnable: false,
PinnedCellId: "",
GpuType: "CT1",
GpuNumber: 2,
LeafCellType: "CT1",
LeafCellNumber: 2,
AffinityGroup: group16,
}, "pod26": { // will preempt pod25 immediately (as lazy preemption is not enabled)
VirtualCluster: "VC2",
Priority: 2,
LazyPreemptionEnable: false,
PinnedCellId: "",
GpuType: "CT1",
GpuNumber: 2,
LeafCellType: "CT1",
LeafCellNumber: 2,
AffinityGroup: group17,
}, "pod27": { // will be rejected because one of the pods in this group is allocated to a non-suggested node
VirtualCluster: "VC1",
Priority: 1,
LazyPreemptionEnable: false,
PinnedCellId: "VC1-YQW-DGX2",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group18,
}, "pod28": { // used for stateful preemption test
VirtualCluster: "VC1",
Priority: 1,
LazyPreemptionEnable: false,
PinnedCellId: "VC1-YQW-DGX2",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group19,
}, "pod29": { // will try to preempt pod28
VirtualCluster: "VC1",
Priority: 2,
LazyPreemptionEnable: true,
PinnedCellId: "VC1-YQW-DGX2",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group20,
}, "pod30": { // cannot get scheduled because pod28's still holding the resource
VirtualCluster: "VC1",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "VC1-YQW-DGX2",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group21,
}, "pod31": { // will try to preempt pod28, and will be scheduled to a different node from pod29
VirtualCluster: "VC1",
Priority: 2,
LazyPreemptionEnable: true,
PinnedCellId: "VC1-YQW-DGX2",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group22,
}, "pod32": { // cannot get scheduled because VC1-YQW-DGX2 has been used up by pod29 and pod31
VirtualCluster: "VC1",
Priority: 2,
LazyPreemptionEnable: true,
PinnedCellId: "VC1-YQW-DGX2",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group23,
}, "pod33": { // will cancel pod29 and pod31's preemption, and continue to preempt pod28
VirtualCluster: "VC1",
Priority: 3,
LazyPreemptionEnable: true,
PinnedCellId: "VC1-YQW-DGX2",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group24,
}, "pod34": { // will cancel pod33's preemption, and get scheduled immediately (because pod28 has been deleted)
VirtualCluster: "VC1",
Priority: 4,
LazyPreemptionEnable: false,
PinnedCellId: "VC1-YQW-DGX2",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group25,
}, "pod35": { // will preempt pod34, and will be deleted before the preemption is done (so the preemption will be canceled)
VirtualCluster: "VC1",
Priority: 5,
LazyPreemptionEnable: true,
PinnedCellId: "VC1-YQW-DGX2",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group26,
}, "pod36": { // will iterate the GPU types until find a placement within suggested nodes
}, "pod36": { // will iterate the leaf cell types until finding a placement within suggested nodes
VirtualCluster: "VC1",
Priority: -1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "",
GpuNumber: 1,
LeafCellType: "",
LeafCellNumber: 1,
AffinityGroup: group1,
}, "pod37": { // used for test aware of suggested nodes in VC
VirtualCluster: "VC1",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "VC1-YQW-DGX2",
GpuType: "DGX2-V100",
GpuNumber: 1,
LeafCellType: "DGX2-V100",
LeafCellNumber: 1,
AffinityGroup: group1,
}, "pod38": { // used for test aware of suggested nodes in VC
VirtualCluster: "VC1",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "VC1-YQW-DGX2",
GpuType: "DGX2-V100",
GpuNumber: 1,
LeafCellType: "DGX2-V100",
LeafCellNumber: 1,
AffinityGroup: group2,
}, "pod39": { // used for triggering backtrack cell search
VirtualCluster: "VC1",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group27,
}, "pod40": { // backtrack cell search
VirtualCluster: "VC1",
Priority: 1,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group28,
}, "pod41": { // revert lazy preemption in backtrack cell search
VirtualCluster: "VC1",
Priority: 2,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group29,
}, "pod42": { // doomed bad cell test
VirtualCluster: "VC1",
Priority: 0,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group30,
}, "pod43": { // doomed bad cell test
VirtualCluster: "VC2",
Priority: 0,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group31,
}, "pod44": { // safe relaxed buddy allocate for bad node test
VirtualCluster: "VC1",
Priority: 0,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group32,
}, "pod45": { // safe relaxed buddy allocate for bad node test
VirtualCluster: "VC1",
Priority: 0,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group33,
}, "pod46": { // safe relaxed buddy allocate safety test
VirtualCluster: "VC1",
Priority: 0,
LazyPreemptionEnable: true,
PinnedCellId: "",
GpuType: "DGX2-V100",
GpuNumber: 16,
LeafCellType: "DGX2-V100",
LeafCellNumber: 16,
AffinityGroup: group34,
},
}
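For quick reference, a self-contained sketch of one pod scheduling spec under the renamed fields, using pod1's values from the table above. The struct is a trimmed, hypothetical stand-in for api.PodSchedulingSpec; only the fields touched by the rename and a few context fields are kept.

package main

import "fmt"

// podSchedulingSpec is a trimmed stand-in for api.PodSchedulingSpec.
type podSchedulingSpec struct {
	VirtualCluster string
	Priority       int32
	PinnedCellId   string
	LeafCellType   string // formerly GpuType
	LeafCellNumber int32  // formerly GpuNumber
}

func main() {
	spec := podSchedulingSpec{
		VirtualCluster: "VC1",
		Priority:       0,
		PinnedCellId:   "",
		LeafCellType:   "DGX2-V100",
		LeafCellNumber: 1,
	}
	fmt.Printf("%+v\n", spec)
}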
@ -559,36 +559,36 @@ var casesForStatefulPreemption = []string{
}
type result struct {
node string
gpuIsolation []int32
node string
leafCellIsolation []int32
}
var expectedBindInfos = map[string]result{
"pod1": {node: "0.0.1.0", gpuIsolation: []int32{0}},
"pod2": {node: "0.0.1.0", gpuIsolation: []int32{1}},
"pod3": {node: "0.0.1.0", gpuIsolation: []int32{8, 9, 10, 11, 12, 13, 14, 15}},
"pod4": {node: "0.0.5.0", gpuIsolation: []int32{0}},
"pod5": {node: "0.0.3.0", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod6": {node: "0.0.3.1", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod8": {node: "1.0.0.0", gpuIsolation: []int32{1, 3, 4, 7, 0, 2, 6}},
"pod9": {node: "1.0.0.2", gpuIsolation: []int32{0, 1, 2, 3, 4}},
"pod18": {node: "0.0.3.2", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod19": {node: "0.0.3.3", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod20": {node: "0.0.4.0", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod21": {node: "0.0.4.1", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod22": {node: "0.0.4.2", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod23": {node: "0.0.4.3", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod24": {node: "0.0.0.1", gpuIsolation: []int32{0, 1}},
"pod25": {node: "0.0.0.0", gpuIsolation: []int32{0, 1}},
"pod28": {node: "0.0.3.0", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod34": {node: "0.0.3.0", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod36": {node: "0.0.1.0", gpuIsolation: []int32{0}},
"pod37": {node: "0.0.3.0", gpuIsolation: []int32{0}},
"pod38": {node: "0.0.3.1", gpuIsolation: []int32{0}},
"pod39": {node: "0.0.3.2", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod40": {node: "0.0.4.3", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod44": {node: "0.0.3.2", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod45": {node: "0.0.4.2", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod1": {node: "0.0.1.0", leafCellIsolation: []int32{0}},
"pod2": {node: "0.0.1.0", leafCellIsolation: []int32{1}},
"pod3": {node: "0.0.1.0", leafCellIsolation: []int32{8, 9, 10, 11, 12, 13, 14, 15}},
"pod4": {node: "0.0.5.0", leafCellIsolation: []int32{0}},
"pod5": {node: "0.0.3.0", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod6": {node: "0.0.3.1", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod8": {node: "1.0.0.0", leafCellIsolation: []int32{1, 3, 4, 7, 0, 2, 6}},
"pod9": {node: "1.0.0.2", leafCellIsolation: []int32{0, 1, 2, 3, 4}},
"pod18": {node: "0.0.3.2", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod19": {node: "0.0.3.3", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod20": {node: "0.0.4.0", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod21": {node: "0.0.4.1", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod22": {node: "0.0.4.2", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod23": {node: "0.0.4.3", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod24": {node: "0.0.0.1", leafCellIsolation: []int32{0, 1}},
"pod25": {node: "0.0.0.0", leafCellIsolation: []int32{0, 1}},
"pod28": {node: "0.0.3.0", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod34": {node: "0.0.3.0", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod36": {node: "0.0.1.0", leafCellIsolation: []int32{0}},
"pod37": {node: "0.0.3.0", leafCellIsolation: []int32{0}},
"pod38": {node: "0.0.3.1", leafCellIsolation: []int32{0}},
"pod39": {node: "0.0.3.2", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod40": {node: "0.0.4.3", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod44": {node: "0.0.3.2", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
"pod45": {node: "0.0.4.2", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
}
var expectedPreemptInfos = map[string]common.Set{
@ -615,7 +615,7 @@ func TestHivedAlgorithm(t *testing.T) {
sConfig := api.NewConfig(api.InitRawConfig(&configFilePath))
h := NewHivedAlgorithm(sConfig)
initNodes(h)
// sort chains of each GPU type for stability of the test
// sort chains of each leaf cell type for stability of the test
for _, chains := range h.cellChains {
sortChains(chains)
}
@ -670,8 +670,8 @@ func printConfig(t *testing.T, h *HivedAlgorithm) {
t.Logf("%v", ccl)
}
}
for gpuType, chains := range h.cellChains {
t.Logf("%v: %v", gpuType, chains)
for leafCellType, chains := range h.cellChains {
t.Logf("%v: %v", leafCellType, chains)
}
}
@ -876,21 +876,21 @@ func testStatefulPreemption(t *testing.T, configFilePath string) {
}
if podName == "pod35" {
p := &groupPhysicalPlacement{}
*p = h.affinityGroups[pss[pod.UID].AffinityGroup.Name].physicalGpuPlacement
*p = h.affinityGroups[pss[pod.UID].AffinityGroup.Name].physicalLeafCellPlacement
h.DeleteUnallocatedPod(pod)
// test correctness of preemption cancellation
for _, podPlacements := range *p {
for _, podGpus := range podPlacements {
for _, gpu := range podGpus {
pGpu := gpu.(*PhysicalCell)
if pGpu.GetState() == cellUsed {
if int32(pGpu.GetPriority()) != pss["pod34"].Priority {
for _, podLeafCells := range podPlacements {
for _, leafCell := range podLeafCells {
pLeafCell := leafCell.(*PhysicalCell)
if pLeafCell.GetState() == cellUsed {
if int32(pLeafCell.GetPriority()) != pss["pod34"].Priority {
t.Errorf("Cell %v's priority should be pod34's priority, but is %v",
pGpu.GetAddress(), pGpu.GetPriority())
pLeafCell.GetAddress(), pLeafCell.GetPriority())
}
} else if pGpu.GetState() != cellFree {
} else if pLeafCell.GetState() != cellFree {
t.Errorf("Cell %v should be in Free state, but is %v",
pGpu.GetAddress(), pGpu.GetState())
pLeafCell.GetAddress(), pLeafCell.GetState())
}
}
}
@ -1084,7 +1084,7 @@ func testReconfiguration(t *testing.T, configFilePath string) {
for _, podName := range casesThatShouldBeLazyPreempted {
pod := allPods[podName]
g := h.affinityGroups[pss[pod.UID].AffinityGroup.Name]
if g.virtualGpuPlacement != nil {
if g.virtualLeafCellPlacement != nil {
t.Errorf("Group %v is expected to be lazy preempted, but not", g.name)
}
}
@ -1105,7 +1105,7 @@ func testInvalidInitialAssignment(t *testing.T, sConfig *api.Config) {
NewHivedAlgorithm(sConfig)
}
func compareGpuIsolation(a []int32, b []int32) bool {
func compareLeafCellIsolation(a []int32, b []int32) bool {
if len(a) == len(b) {
for i := 0; i < len(a); i++ {
if a[i] != b[i] {
@ -1130,15 +1130,15 @@ func compareSchedulingResult(t *testing.T, pod *core.Pod, psr internal.PodSchedu
if expected, ok := expectedBindInfos[pod.Name]; !ok {
if psr.PodBindInfo != nil {
t.Errorf("[%v]: wrong pod scheduling result: expected empty, but got %v:%v",
internal.Key(pod), psr.PodBindInfo.Node, psr.PodBindInfo.GpuIsolation)
internal.Key(pod), psr.PodBindInfo.Node, psr.PodBindInfo.LeafCellIsolation)
}
if !expectedPreemptInfos[pod.Name].IsEmpty() && !containsPods(psr.PodPreemptInfo.VictimPods, expectedPreemptInfos[pod.Name]) {
t.Errorf("[%v]: wrong preempt victims: expected %v, but got %v",
internal.Key(pod), expectedPreemptInfos[pod.Name], psr.PodPreemptInfo.VictimPods)
}
} else if psr.PodBindInfo.Node != expected.node ||
!compareGpuIsolation(psr.PodBindInfo.GpuIsolation, expected.gpuIsolation) {
!compareLeafCellIsolation(psr.PodBindInfo.LeafCellIsolation, expected.leafCellIsolation) {
t.Errorf("[%v]: wrong pod bind info: expected %v:%v, but got %v:%v",
internal.Key(pod), expected.node, expected.gpuIsolation, psr.PodBindInfo.Node, psr.PodBindInfo.GpuIsolation)
internal.Key(pod), expected.node, expected.leafCellIsolation, psr.PodBindInfo.Node, psr.PodBindInfo.LeafCellIsolation)
}
}

View file

@ -24,6 +24,7 @@ package algorithm
import (
"fmt"
"github.com/microsoft/hivedscheduler/pkg/api"
"github.com/microsoft/hivedscheduler/pkg/common"
"k8s.io/klog"
@ -31,7 +32,7 @@ import (
// intraVCScheduler is an interface for scheduling pods inside a VC.
// It stores two maps of ChainCellList, one for pinned cells, the other for non-pinned ones.
// It should be able to return a set of GPU placements in the VC for a scheduling request.
// It should be able to return a set of leaf cell placements in the VC for a scheduling request.
type intraVCScheduler interface {
getNonPinnedFullCellList() map[CellChain]ChainCellList
getNonPinnedPreassignedCells() map[CellChain]ChainCellList
@ -57,15 +58,15 @@ func newDefaultIntraVCScheduler(
nonPinnedFullList map[CellChain]ChainCellList,
nonPinnedFreeList map[CellChain]ChainCellList,
pinnedList map[api.PinnedCellId]ChainCellList,
gpuNums map[CellChain]map[CellLevel]int32) *defaultIntraVCScheduler {
leafCellNums map[CellChain]map[CellLevel]int32) *defaultIntraVCScheduler {
snr := map[CellChain]*topologyAwareScheduler{}
sr := map[api.PinnedCellId]*topologyAwareScheduler{}
for chain, ccl := range nonPinnedFullList {
snr[chain] = NewTopologyAwareScheduler(ccl, gpuNums[chain], true)
snr[chain] = NewTopologyAwareScheduler(ccl, leafCellNums[chain], true)
}
for pid, ccl := range pinnedList {
sr[pid] = NewTopologyAwareScheduler(ccl, gpuNums[ccl[CellLevel(1)][0].GetChain()], true)
sr[pid] = NewTopologyAwareScheduler(ccl, leafCellNums[ccl[CellLevel(1)][0].GetChain()], true)
}
return &defaultIntraVCScheduler{
nonPinnedFullCellList: nonPinnedFullList,
@ -99,7 +100,7 @@ func (s *defaultIntraVCScheduler) schedule(
scheduler = s.pinnedCellSchedulers[sr.pinnedCellId]
str = fmt.Sprintf("pinned cell %v", sr.pinnedCellId)
}
klog.Infof("Processing scheduling request in VC %v: %v, GPU numbers %v, priority %v",
klog.Infof("Processing scheduling request in VC %v: %v, leaf cell numbers %v, priority %v",
sr.vc, str, common.ToJson(sr.affinityGroupPodNums), sr.priority)
if scheduler != nil {
placement, failedReason = scheduler.Schedule(

View file

@ -31,14 +31,14 @@ import (
)
// topologyAwareScheduler can schedule a set of pods on a cluster view.
// It first tries to place pods to nodes with fewer free GPUs (i.e., packing), while trying to avoid preemptions.
// Then inside each node, it tries to allocate GPUs with better affinity.
// It first tries to place pods to nodes with fewer free leaf cells (i.e., packing), while trying to avoid preemptions.
// Then inside each node, it tries to allocate leaf cells with better affinity.
type topologyAwareScheduler struct {
// a list of nodes (node-level cells or top-level cells that are lower than node level)
cv clusterView
// GPU number at each level in the cell hierarchy. we use this to
// calculate the optimal affinity for a given GPU number.
levelGpuNum map[CellLevel]int32
// leaf cell number at each level in the cell hierarchy. we use this to
// calculate the optimal affinity for a given leaf cell number.
levelLeafCellNum map[CellLevel]int32
// pack pods cross different priorities, or inside each priority. the former is for intra-VC scheduling,
// because high-priority can avoid preemption in the whole cluster view,
// and hence we can pack pods with different priorities.
@ -52,103 +52,103 @@ type topologyAwareScheduler struct {
// (lower-level if no node-level) from a free cell list.
func NewTopologyAwareScheduler(
ccl ChainCellList,
levelGpuNum map[CellLevel]int32,
levelLeafCellNum map[CellLevel]int32,
crossPriorityPack bool) *topologyAwareScheduler {
return &topologyAwareScheduler{
cv: newClusterView(ccl),
levelGpuNum: levelGpuNum,
levelLeafCellNum: levelLeafCellNum,
crossPriorityPack: crossPriorityPack,
}
}
func (t *topologyAwareScheduler) Schedule(
podGpuNumbers map[int32]int32,
podLeafCellNumbers map[int32]int32,
p CellPriority,
suggestedNodes common.Set,
ignoreSuggestedNodes bool) (
podPlacements map[int32][]CellList,
failedReason string) {
// GPU numbers of the pods to schedule
var sortedPodGpuNumbers []int32
for gpuNum, podNum := range podGpuNumbers {
// leaf cell numbers of the pods to schedule
var sortedPodLeafCellNumbers []int32
for leafCellNum, podNum := range podLeafCellNumbers {
for i := int32(0); i < podNum; i++ {
sortedPodGpuNumbers = append(sortedPodGpuNumbers, gpuNum)
sortedPodLeafCellNumbers = append(sortedPodLeafCellNumbers, leafCellNum)
}
}
common.SortInt32(sortedPodGpuNumbers)
common.SortInt32(sortedPodLeafCellNumbers)
// disable preemption first (reduce preemption)
priority := opportunisticPriority
t.updateClusterView(priority, suggestedNodes, ignoreSuggestedNodes)
// try to fit the pods to a set of nodes
selectedNodeIndices, failedReason := findNodesForPods(t.cv, sortedPodGpuNumbers)
selectedNodeIndices, failedReason := findNodesForPods(t.cv, sortedPodLeafCellNumbers)
// enable preemption if scheduling failed
if selectedNodeIndices == nil && p > opportunisticPriority {
priority = p
t.updateClusterView(priority, suggestedNodes, ignoreSuggestedNodes)
selectedNodeIndices, failedReason = findNodesForPods(t.cv, sortedPodGpuNumbers)
selectedNodeIndices, failedReason = findNodesForPods(t.cv, sortedPodLeafCellNumbers)
}
if selectedNodeIndices == nil {
return nil, failedReason
}
// find GPUs inside the selected node for each pod
selectedNodes := make(CellList, len(sortedPodGpuNumbers))
// find leaf cells inside the selected node for each pod
selectedNodes := make(CellList, len(sortedPodLeafCellNumbers))
for i := 0; i < len(selectedNodeIndices); i++ {
selectedNodes[i] = t.cv[selectedNodeIndices[i]].c
}
selectedGpus := CellList{}
nodeAvailableGpus := map[Cell]CellList{}
selectedLeafCells := CellList{}
nodeAvailableLeafCells := map[Cell]CellList{}
podPlacements = map[int32][]CellList{}
for podIndex := 0; podIndex < len(sortedPodGpuNumbers); podIndex++ {
gpuNumber := sortedPodGpuNumbers[podIndex]
for podIndex := 0; podIndex < len(sortedPodLeafCellNumbers); podIndex++ {
leafCellNumber := sortedPodLeafCellNumbers[podIndex]
n := selectedNodes[podIndex]
// TODO: Optimize findNodesForPods and findGpusInNode together to get a better placement,
// TODO: Optimize findNodesForPods and findLeafCellsInNode together to get a better placement,
// such as also being aware of intra-node topology in findNodesForPods.
selectedGpus, nodeAvailableGpus[n] = findGpusInNode(n, gpuNumber, priority, nodeAvailableGpus[n], t.levelGpuNum)
if podPlacements[gpuNumber] == nil {
podPlacements[gpuNumber] = []CellList{}
selectedLeafCells, nodeAvailableLeafCells[n] = findLeafCellsInNode(n, leafCellNumber, priority, nodeAvailableLeafCells[n], t.levelLeafCellNum)
if podPlacements[leafCellNumber] == nil {
podPlacements[leafCellNumber] = []CellList{}
}
podPlacements[gpuNumber] = append(podPlacements[gpuNumber], selectedGpus)
podPlacements[leafCellNumber] = append(podPlacements[leafCellNumber], selectedLeafCells)
}
return podPlacements, ""
}
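For orientation, a minimal sketch of how this Schedule entry point is driven (illustrative only: it assumes a `ChainCellList` named `ccl`, the per-level leaf cell counts `levelLeafCellNum` for its chain, and a `common.Set` of suggested node names, as `newDefaultIntraVCScheduler` builds):

```go
scheduler := NewTopologyAwareScheduler(ccl, levelLeafCellNum, true /* crossPriorityPack */)
placement, failedReason := scheduler.Schedule(
	map[int32]int32{16: 2, 1: 4}, // two 16-leaf-cell pods and four 1-leaf-cell pods
	CellPriority(1),
	suggestedNodes,
	false, // do not ignore K8s suggested nodes
)
if placement == nil {
	klog.Warningf("cannot place the group: %v", failedReason)
}
```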
type node struct {
c Cell // a node-level cell or a top-level cell that is lower than node level
freeGpuNumAtPriority int32 // free GPU number at the priority of the pod to be scheduled (lower priority considered as free)
usedGpuNumSamePriority int32 // GPU number used by the same priority as that of the pod to be scheduled
usedGpuNumHigherPriority int32 // GPU number used by higher priorities than that of the pod to be scheduled
healthy bool // if the node is healthy
suggested bool // if the node is within suggested nodes
nodeAddress api.CellAddress // used for logging the node address when bad or not suggested
c Cell // a node-level cell or a top-level cell that is lower than node level
freeLeafCellNumAtPriority int32 // free leaf cell number at the priority of the pod to be scheduled (lower priority considered as free)
usedLeafCellNumSamePriority int32 // leaf cell number used by the same priority as that of the pod to be scheduled
usedLeafCellNumHigherPriority int32 // leaf cell number used by higher priorities than that of the pod to be scheduled
healthy bool // if the node is healthy
suggested bool // if the node is within suggested nodes
nodeAddress api.CellAddress // used for logging the node address when bad or not suggested
}
// When cross-priority packing is not enabled, we count the GPU numbers used by the current
// priority (n.usedGpuNumSamePriority), and the higher priorities (n.usedGpuNumHigherPriority), respectively.
// When sorting the nodes, nodes with higher usedGpuNumSamePriority and lower usedGpuNumHigherPriority
// When cross-priority packing is not enabled, we count the leaf cell numbers used by the current
// priority (n.usedLeafCellNumSamePriority), and the higher priorities (n.usedLeafCellNumHigherPriority), respectively.
// When sorting the nodes, nodes with higher usedLeafCellNumSamePriority and lower usedLeafCellNumHigherPriority
// will be preferred (i.e., pack pods inside the same priority, and stay away from higher priorities).
// Note that in this case, the nodes may NOT be ordered in terms of total used GPU number,
// Note that in this case, the nodes may NOT be ordered in terms of total used leaf cell number,
// which may result in feasible pod placements not being found.
//
// Otherwise, n.usedGpuNumSamePriority is set to the total used GPU number,
// so that nodes with more used GPUs will be preferred (i.e., pack pods globally across priorities).
// Otherwise, n.usedLeafCellNumSamePriority is set to the total used leaf cell number,
// so that nodes with more used leaf cells will be preferred (i.e., pack pods globally across priorities).
// In this case a feasible pod placement is guaranteed to be found (as long as all nodes are in suggested nodes).
func (n *node) updateUsedGpuNumForPriority(p CellPriority, crossPriorityPack bool) {
n.usedGpuNumSamePriority = n.c.GetUsedGpuNumAtPriorities()[p]
n.usedGpuNumHigherPriority = 0
n.freeGpuNumAtPriority = n.c.GetTotalGpuNum()
for priority, num := range n.c.GetUsedGpuNumAtPriorities() {
func (n *node) updateUsedLeafCellNumForPriority(p CellPriority, crossPriorityPack bool) {
n.usedLeafCellNumSamePriority = n.c.GetUsedLeafCellNumAtPriorities()[p]
n.usedLeafCellNumHigherPriority = 0
n.freeLeafCellNumAtPriority = n.c.GetTotalLeafCellNum()
for priority, num := range n.c.GetUsedLeafCellNumAtPriorities() {
if crossPriorityPack {
if priority != p {
n.usedGpuNumSamePriority += num
n.usedLeafCellNumSamePriority += num
}
} else if priority > p {
n.usedGpuNumHigherPriority += num
n.usedLeafCellNumHigherPriority += num
}
if priority >= p {
n.freeGpuNumAtPriority -= num
n.freeLeafCellNumAtPriority -= num
}
}
}
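The two packing modes are easy to mix up; the following self-contained sketch (simplified stand-in types, not the scheduler's own) shows how the three counters diverge for a node with 16 leaf cells, 4 of them used at priority 0 and 8 at priority 2, when scheduling at priority 1:

```go
package main

import "fmt"

// usage maps priority -> used leaf cell number on one node (illustrative stand-in).
type usage map[int]int32

func counters(u usage, total int32, p int, crossPriorityPack bool) (same, higher, free int32) {
	same = u[p]
	free = total
	for priority, num := range u {
		if crossPriorityPack {
			if priority != p {
				same += num // pack across all priorities
			}
		} else if priority > p {
			higher += num // stay away from nodes busy with higher priorities
		}
		if priority >= p {
			free -= num // lower priorities are preemptible, hence still "free"
		}
	}
	return same, higher, free
}

func main() {
	node := usage{0: 4, 2: 8}
	fmt.Println(counters(node, 16, 1, false)) // 0 8 8: intra-priority packing
	fmt.Println(counters(node, 16, 1, true))  // 12 0 8: cross-priority packing
}
```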
@ -158,10 +158,10 @@ type clusterView []*node
func newClusterView(ccl ChainCellList) clusterView {
var l CellLevel
// TODO: currently if a top-level cell is lower than node level, it will be considered as a single node.
// For example, 2 single GPU-level cells are considered as 2 nodes each with 1 GPU.
// For example, 2 single leaf-level cells are considered as 2 nodes each with 1 leaf cell.
// We cannot merge them because the 2 cells might be mapped to different physical nodes.
// We plan to support using multiple cells in a best-effort manner (for example, schedule a 2-GPU pod
// on 2 1-GPU cells, if we can find 2 1-GPU cells that can be mapped to the same physical node).
// We plan to support using multiple cells in a best-effort manner (for example, schedule a 2-leaf-cell pod
// on 2 1-leaf-cell cells, if we can find 2 1-leaf-cell cells that can be mapped to the same physical node).
for l = CellLevel(1); l <= CellLevel(len(ccl)); l++ {
if ccl[l][0].AtOrHigherThanNode() {
break
@ -205,18 +205,18 @@ func (cv clusterView) Len() int {
// We sort the nodes in decreasing significance of:
// (1) if the node is healthy (avoid unhealthy),
// (2) if the node is suggested (avoid non-suggested),
// (3) usedGpuNumSamePriority (more is preferred),
// (4) usedGpuNumHigherPriority (less is preferred).
// (3) usedLeafCellNumSamePriority (more is preferred),
// (4) usedLeafCellNumHigherPriority (less is preferred).
func (cv clusterView) Less(i int, j int) bool {
if cv[i].healthy != cv[j].healthy {
return cv[i].healthy
} else if cv[i].suggested != cv[j].suggested {
return cv[i].suggested
} else if cv[i].usedGpuNumSamePriority > cv[j].usedGpuNumSamePriority {
} else if cv[i].usedLeafCellNumSamePriority > cv[j].usedLeafCellNumSamePriority {
return true
} else if cv[i].usedGpuNumSamePriority < cv[j].usedGpuNumSamePriority {
} else if cv[i].usedLeafCellNumSamePriority < cv[j].usedLeafCellNumSamePriority {
return false
} else if cv[i].usedGpuNumHigherPriority < cv[j].usedGpuNumHigherPriority {
} else if cv[i].usedLeafCellNumHigherPriority < cv[j].usedLeafCellNumHigherPriority {
return true
} else {
return false
@ -227,14 +227,14 @@ func (cv clusterView) Swap(i int, j int) {
cv[i], cv[j] = cv[j], cv[i]
}
// updateClusterView updates the GPU numbers of the nodes for the sorting.
// updateClusterView updates the leaf cell numbers of the nodes for the sorting.
func (t *topologyAwareScheduler) updateClusterView(
p CellPriority,
suggestedNodes common.Set,
ignoreSuggestedNodes bool) {
for _, n := range t.cv {
n.updateUsedGpuNumForPriority(p, t.crossPriorityPack)
n.updateUsedLeafCellNumForPriority(p, t.crossPriorityPack)
n.healthy, n.suggested, n.nodeAddress = nodeHealthyAndInSuggested(n, suggestedNodes, ignoreSuggestedNodes)
}
}
@ -264,24 +264,24 @@ func nodeHealthyAndInSuggested(
return true, true, ""
}
// findNodesForPods finds a set of nodes that can accommodate the GPU requirements of the pods.
func findNodesForPods(cv clusterView, gpuNums []int32) (pickedNodeIndices []int32, failedReason string) {
// sort the nodes according to gpu numbers in each node.
// findNodesForPods finds a set of nodes that can accommodate the leaf cell requirements of the pods.
func findNodesForPods(cv clusterView, leafCellNums []int32) (pickedNodeIndices []int32, failedReason string) {
// sort the nodes according to leaf cell numbers in each node.
// this is achieved through the Less method defined in type clusterView.
// TODO: Ensure Opportunistic Pods also can always find the solution, regardless of
// the iteration order.
// For example:
// 1. clusterView = 2GPU Node, 1GPU Node
// 2. gpuNums = 1GPU Pod, 2GPU Pod
// First 1GPU Pod may allocate to 2GPU Node, but the latter pod cannot be fitted anymore.
// 1. clusterView = 2-leaf-cell Node, 1-leaf-cell Node
// 2. leafCellNums = 1-leaf-cell Pod, 2-leaf-cell Pod
// First 1-leaf-cell Pod may allocate to 2-leaf-cell Node, but the latter pod cannot be fitted anymore.
sort.Stable(cv)
pickedNodeIndices = make([]int32, len(gpuNums)) // indices of the currently picked nodes
pickedNodeIndices = make([]int32, len(leafCellNums)) // indices of the currently picked nodes
podIndex := 0
pickedGpuNum := int32(0)
pickedLeafCellNum := int32(0)
var n *node
for nodeIndex := 0; nodeIndex < len(cv); {
n = cv[nodeIndex]
if n.freeGpuNumAtPriority-pickedGpuNum >= gpuNums[podIndex] {
if n.freeLeafCellNumAtPriority-pickedLeafCellNum >= leafCellNums[podIndex] {
// fail when encountering a node that is either bad or not within suggested nodes
if !n.healthy {
return nil, fmt.Sprintf(
@ -292,104 +292,104 @@ func findNodesForPods(cv clusterView, gpuNums []int32) (pickedNodeIndices []int3
"have to use at least one non-suggested node %v", n.nodeAddress)
}
pickedNodeIndices[podIndex] = int32(nodeIndex)
pickedGpuNum += gpuNums[podIndex]
pickedLeafCellNum += leafCellNums[podIndex]
podIndex++
if podIndex == len(gpuNums) {
if podIndex == len(leafCellNums) {
return pickedNodeIndices, ""
}
} else {
pickedGpuNum = 0
pickedLeafCellNum = 0
nodeIndex++
}
}
return nil, "insufficient capacity"
}
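A hypothetical trace of this loop (numbers made up): with the sorted view holding node X first (6 leaf cells free at this priority) and node Y second (16 free), and `leafCellNums = [2, 2, 4]` after sorting:

```go
// pod 0 (2 cells): X has 6-0 >= 2 -> pick X, pickedLeafCellNum = 2
// pod 1 (2 cells): X has 6-2 >= 2 -> pick X, pickedLeafCellNum = 4
// pod 2 (4 cells): X has 6-4 <  4 -> reset and advance; Y has 16-0 >= 4 -> pick Y
// result: pickedNodeIndices = [0, 0, 1] (pods 0 and 1 packed on X, pod 2 on Y)
```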
// findGpusInNode finds a set of GPUs with the best affinity in a node for a pod.
func findGpusInNode(
// findLeafCellsInNode finds a set of leaf cells with the best affinity in a node for a pod.
func findLeafCellsInNode(
n Cell,
gpuNum int32,
leafCellNum int32,
p CellPriority,
availableGpus CellList,
levelGpuNum map[CellLevel]int32) (CellList, CellList) {
availableLeafCells CellList,
levelLeafCellNum map[CellLevel]int32) (CellList, CellList) {
// indices of the currently picked GPUs
currentGpuIndices := make([]int32, gpuNum)
// affinity of the currently picked GPUs, defined as the lowest common ancestor
// of the GPUs in the cell hierarchy (lower level means better affinity)
currentAffinity := make(CellList, gpuNum)
// GPUs with the best affinity ever seen
bestAffinityGpus := make(CellList, gpuNum)
// indices of the GPUs with the best affinity ever seen
bestAffinityGpuIndices := make([]int32, gpuNum)
// the best affinity ever seen (i.e., lowest level of lowest common ancestor of a set of GPUs)
// indices of the currently picked leaf cells
currentLeafCellIndices := make([]int32, leafCellNum)
// affinity of the currently picked leaf cells, defined as the lowest common ancestor
// of the leaf cells in the cell hierarchy (lower level means better affinity)
currentAffinity := make(CellList, leafCellNum)
// leaf cells with the best affinity ever seen
bestAffinityLeafCells := make(CellList, leafCellNum)
// indices of the leaf cells with the best affinity ever seen
bestAffinityLeafCellIndices := make([]int32, leafCellNum)
// the best affinity ever seen (i.e., lowest level of lowest common ancestor of a set of leaf cells)
bestAffinity := highestLevel
// the optimal affinity for the GPU number, i.e., the lowest possible level of the lowest common ancestor of GPUs
optimalAffinity := getOptimalAffinity(gpuNum, levelGpuNum)
// the optimal affinity for the leaf cell number, i.e., the lowest possible level of the lowest common ancestor of leaf cells
optimalAffinity := getOptimalAffinity(leafCellNum, levelLeafCellNum)
if availableGpus == nil {
availableGpus = CellList{}
preemptibleGpus := CellList{}
availableGpus, preemptibleGpus = getGpusFromNode(n, p, availableGpus, preemptibleGpus)
// free GPUs will be used first (before preemptible GPUs)
availableGpus = append(availableGpus, preemptibleGpus...)
if availableLeafCells == nil {
availableLeafCells = CellList{}
preemptibleLeafCells := CellList{}
availableLeafCells, preemptibleLeafCells = getLeafCellsFromNode(n, p, availableLeafCells, preemptibleLeafCells)
// free leaf cells will be used first (before preemptible leaf cells)
availableLeafCells = append(availableLeafCells, preemptibleLeafCells...)
}
availableGpuIndex := int32(0)
searchGpuIndex := int32(0)
var gpu Cell
availableLeafCellIndex := int32(0)
searchLeafCellIndex := int32(0)
var leafCell Cell
for {
for availableGpuIndex < int32(len(availableGpus)) {
gpu = availableGpus[availableGpuIndex]
currentGpuIndices[searchGpuIndex] = availableGpuIndex
if searchGpuIndex == 0 {
currentAffinity[searchGpuIndex] = gpu
for availableLeafCellIndex < int32(len(availableLeafCells)) {
leafCell = availableLeafCells[availableLeafCellIndex]
currentLeafCellIndices[searchLeafCellIndex] = availableLeafCellIndex
if searchLeafCellIndex == 0 {
currentAffinity[searchLeafCellIndex] = leafCell
} else {
currentAffinity[searchGpuIndex] = findLCA(gpu, currentAffinity[searchGpuIndex-1])
currentAffinity[searchLeafCellIndex] = findLCA(leafCell, currentAffinity[searchLeafCellIndex-1])
// pruning: if the current LCA has been higher than the lowest ever,
// the node will be skipped
if (currentAffinity[searchGpuIndex] == nil && bestAffinity < highestLevel) ||
(currentAffinity[searchGpuIndex] != nil && currentAffinity[searchGpuIndex].GetLevel() > bestAffinity) {
availableGpuIndex++
if (currentAffinity[searchLeafCellIndex] == nil && bestAffinity < highestLevel) ||
(currentAffinity[searchLeafCellIndex] != nil && currentAffinity[searchLeafCellIndex].GetLevel() > bestAffinity) {
availableLeafCellIndex++
continue
}
}
if searchGpuIndex == gpuNum-1 {
if searchLeafCellIndex == leafCellNum-1 {
foundOptimalAffinity := false
bestAffinity, foundOptimalAffinity = checkCurrentGpus(
bestAffinity, foundOptimalAffinity = checkCurrentLeafCells(
currentAffinity[len(currentAffinity)-1].GetLevel(),
availableGpus,
currentGpuIndices,
availableLeafCells,
currentLeafCellIndices,
bestAffinity,
bestAffinityGpus,
bestAffinityGpuIndices,
bestAffinityLeafCells,
bestAffinityLeafCellIndices,
optimalAffinity)
if foundOptimalAffinity {
// early stop: return if the solution is optimal (i.e., all buddies)
availableGpus = removePickedGpus(availableGpus, bestAffinityGpuIndices)
return bestAffinityGpus, availableGpus
availableLeafCells = removePickedLeafCells(availableLeafCells, bestAffinityLeafCellIndices)
return bestAffinityLeafCells, availableLeafCells
}
} else {
searchGpuIndex++
searchLeafCellIndex++
}
availableGpuIndex++
availableLeafCellIndex++
}
searchGpuIndex--
if searchGpuIndex < 0 {
searchLeafCellIndex--
if searchLeafCellIndex < 0 {
if bestAffinity == highestLevel {
// Unreachable
panic(fmt.Sprintf("Assert Failure: failed to allocate %v GPUs in picked node %v", gpuNum, n.GetAddress()))
panic(fmt.Sprintf("Assert Failure: failed to allocate %v leaf cells in picked node %v", leafCellNum, n.GetAddress()))
}
availableGpus = removePickedGpus(availableGpus, bestAffinityGpuIndices)
return bestAffinityGpus, availableGpus
availableLeafCells = removePickedLeafCells(availableLeafCells, bestAffinityLeafCellIndices)
return bestAffinityLeafCells, availableLeafCells
}
availableGpuIndex = currentGpuIndices[searchGpuIndex] + 1
availableLeafCellIndex = currentLeafCellIndices[searchLeafCellIndex] + 1
}
}
// getOptimalAffinity calculates the optimal affinity for a given GPU number.
func getOptimalAffinity(gpuNum int32, levelGpuNum map[CellLevel]int32) CellLevel {
for l := CellLevel(1); l <= CellLevel(len(levelGpuNum)); l++ {
if levelGpuNum[l] >= gpuNum {
// getOptimalAffinity calculates the optimal affinity for a given leaf cell number.
func getOptimalAffinity(leafCellNum int32, levelLeafCellNum map[CellLevel]int32) CellLevel {
for l := CellLevel(1); l <= CellLevel(len(levelLeafCellNum)); l++ {
if levelLeafCellNum[l] >= leafCellNum {
return l
}
}
@ -398,21 +398,21 @@ func getOptimalAffinity(gpuNum int32, levelGpuNum map[CellLevel]int32) CellLevel
panic(fmt.Sprintf("Assert Failure: pod allocated a node but exceeds the capacity of the current chain"))
}
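As a hedged illustration (the hierarchy and the demo function name are made up, assuming the snippet sits in package algorithm next to getOptimalAffinity with fmt imported): with leaf cells at level 1, 2-cell PCI-e switches at level 2, 4-cell CPU sockets at level 3, and 8-cell nodes at level 4, a 3-leaf-cell pod gets optimal affinity 3 and an 8-leaf-cell pod needs the whole node:

```go
func optimalAffinityDemo() {
	// level 1 = leaf cell, 2 = PCI-e switch (2 cells), 3 = CPU socket (4), 4 = node (8)
	levelLeafCellNum := map[CellLevel]int32{1: 1, 2: 2, 3: 4, 4: 8}
	fmt.Println(int32(getOptimalAffinity(3, levelLeafCellNum))) // 3: a socket is the smallest fit
	fmt.Println(int32(getOptimalAffinity(8, levelLeafCellNum))) // 4: needs the whole node
}
```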
// checkCurrentGpus checks if the currently picked GPUs have the lowest LCA. It also checks if the solution
// is optimal (if the GPUs are all buddies).
func checkCurrentGpus(
// checkCurrentLeafCells checks if the currently picked leaf cells have the lowest LCA. It also checks if the solution
// is optimal (if the leaf cells are all buddies).
func checkCurrentLeafCells(
affinity CellLevel,
gpus CellList,
leafCells CellList,
currentIndices []int32,
bestAffinity CellLevel,
bestAffinityGpus CellList,
bestAffinityGpuIndices []int32,
bestAffinityLeafCells CellList,
bestAffinityLeafCellIndices []int32,
optimalAffinity CellLevel) (CellLevel, bool) {
if affinity < bestAffinity {
copy(bestAffinityGpuIndices, currentIndices)
copy(bestAffinityLeafCellIndices, currentIndices)
for i := 0; i < len(currentIndices); i++ {
bestAffinityGpus[i] = gpus[currentIndices[i]]
bestAffinityLeafCells[i] = leafCells[currentIndices[i]]
}
if affinity == optimalAffinity {
return affinity, true
@ -423,21 +423,21 @@ func checkCurrentGpus(
return bestAffinity, false
}
// removePickedGpus removes picked GPUs from the available GPU list.
func removePickedGpus(gpus CellList, indices []int32) CellList {
// removePickedLeafCells removes picked leaf cells from the available leaf cell list.
func removePickedLeafCells(leafCells CellList, indices []int32) CellList {
for i, index := range indices {
offset := int32(i)
if i < len(indices)-1 {
nextIndex := indices[i+1]
copy(gpus[index-offset:nextIndex-offset-1], gpus[index+1:nextIndex])
copy(leafCells[index-offset:nextIndex-offset-1], leafCells[index+1:nextIndex])
} else {
copy(gpus[index-offset:], gpus[index+1:])
copy(leafCells[index-offset:], leafCells[index+1:])
}
}
for i := len(gpus) - len(indices); i < len(gpus); i++ {
gpus[i] = nil
for i := len(leafCells) - len(indices); i < len(leafCells); i++ {
leafCells[i] = nil
}
return gpus[:len(gpus)-len(indices)]
return leafCells[:len(leafCells)-len(indices)]
}
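The in-place copy logic is compact; a hypothetical trace of removing indices [1, 3] from a 5-element list:

```go
// leafCells = [a b c d e], indices = [1 3]
// i = 0 (index 1): copy the survivors between 1 and 3 -> [a c c d e]
// i = 1 (index 3): copy the tail after index 3        -> [a c e d e]
// nil the last len(indices) slots and truncate        -> [a c e]
```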
// findLCA finds the lowest common ancestor of two cells (nil if they have no LCA).
@ -461,16 +461,16 @@ func findLCA(lower Cell, higher Cell) Cell {
return lower.GetParent()
}
// getGpusFromNode collects free GPUs and preemptible GPUs according to the priority.
func getGpusFromNode(c Cell, p CellPriority, freeGpus CellList, preemptibleGpus CellList) (CellList, CellList) {
// getLeafCellsFromNode collects free leaf cells and preemptible leaf cells according to the priority.
func getLeafCellsFromNode(c Cell, p CellPriority, freeLeafCells CellList, preemptibleLeafCells CellList) (CellList, CellList) {
if c.GetLevel() > 1 {
for _, cc := range c.GetChildren() {
freeGpus, preemptibleGpus = getGpusFromNode(cc, p, freeGpus, preemptibleGpus)
freeLeafCells, preemptibleLeafCells = getLeafCellsFromNode(cc, p, freeLeafCells, preemptibleLeafCells)
}
} else if c.GetPriority() == freePriority {
freeGpus = append(freeGpus, c)
freeLeafCells = append(freeLeafCells, c)
} else if c.GetPriority() < p {
preemptibleGpus = append(preemptibleGpus, c)
preemptibleLeafCells = append(preemptibleLeafCells, c)
}
return freeGpus, preemptibleGpus
return freeLeafCells, preemptibleLeafCells
}

View file

@ -24,11 +24,12 @@ package algorithm
import (
"fmt"
"strings"
"github.com/microsoft/hivedscheduler/pkg/api"
"github.com/microsoft/hivedscheduler/pkg/common"
core "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/types"
"strings"
)
type (
@ -44,7 +45,7 @@ type schedulingRequest struct {
pinnedCellId api.PinnedCellId
chain CellChain
affinityGroupName string
affinityGroupPodNums map[int32]int32 // gpu number -> pod number
affinityGroupPodNums map[int32]int32 // leaf cell number -> pod number
priority CellPriority
suggestedNodes common.Set
ignoreSuggestedNodes bool
@ -135,15 +136,15 @@ type AlgoAffinityGroup struct {
lazyPreemptionEnable bool
// Whether we should ignore K8s suggested nodes. If false, we will avoid binding cells to non-suggested nodes.
// Note that we always avoid using bad nodes; avoiding non-suggested nodes is optional and best-effort.
ignoreK8sSuggestedNodes bool
priority int32
totalPodNums map[int32]int32 // GpuNum -> PodNum
allocatedPods map[int32][]*core.Pod // GpuNum -> a list of allocated pods
preemptingPods map[types.UID]*core.Pod
physicalGpuPlacement groupPhysicalPlacement
virtualGpuPlacement groupVirtualPlacement
state AffinityGroupState
lazyPreemptionStatus *api.LazyPreemptionStatus
ignoreK8sSuggestedNodes bool
priority int32
totalPodNums map[int32]int32 // LeafCellNum -> PodNum
allocatedPods map[int32][]*core.Pod // LeafCellNum -> a list of allocated pods
preemptingPods map[types.UID]*core.Pod
physicalLeafCellPlacement groupPhysicalPlacement
virtualLeafCellPlacement groupVirtualPlacement
state AffinityGroupState
lazyPreemptionStatus *api.LazyPreemptionStatus
}
func newAlgoAffinityGroup(
@ -155,29 +156,29 @@ func newAlgoAffinityGroup(
podNums := make(map[int32]int32)
for _, m := range g.Members {
podNums[m.GpuNumber] += m.PodNumber
podNums[m.LeafCellNumber] += m.PodNumber
}
group := &AlgoAffinityGroup{
name: g.Name,
vc: vc,
lazyPreemptionEnable: lazyPreemptionEnable,
priority: priority,
totalPodNums: podNums,
allocatedPods: map[int32][]*core.Pod{},
physicalGpuPlacement: groupPhysicalPlacement{},
virtualGpuPlacement: groupVirtualPlacement{},
state: state,
name: g.Name,
vc: vc,
lazyPreemptionEnable: lazyPreemptionEnable,
priority: priority,
totalPodNums: podNums,
allocatedPods: map[int32][]*core.Pod{},
physicalLeafCellPlacement: groupPhysicalPlacement{},
virtualLeafCellPlacement: groupVirtualPlacement{},
state: state,
}
if state == groupPreempting {
group.preemptingPods = map[types.UID]*core.Pod{}
}
for gpuNum, podNum := range podNums {
group.physicalGpuPlacement[gpuNum] = make([]CellList, podNum)
group.virtualGpuPlacement[gpuNum] = make([]CellList, podNum)
group.allocatedPods[gpuNum] = make([]*core.Pod, podNum)
for leafCellNum, podNum := range podNums {
group.physicalLeafCellPlacement[leafCellNum] = make([]CellList, podNum)
group.virtualLeafCellPlacement[leafCellNum] = make([]CellList, podNum)
group.allocatedPods[leafCellNum] = make([]*core.Pod, podNum)
for i := int32(0); i < podNum; i++ {
group.physicalGpuPlacement[gpuNum][i] = make(CellList, gpuNum)
group.virtualGpuPlacement[gpuNum][i] = make(CellList, gpuNum)
group.physicalLeafCellPlacement[leafCellNum][i] = make(CellList, leafCellNum)
group.virtualLeafCellPlacement[leafCellNum][i] = make(CellList, leafCellNum)
}
}
return group
@ -193,11 +194,11 @@ func (aag *AlgoAffinityGroup) ToAffinityGroup() api.AffinityGroup {
LazyPreemptionStatus: aag.lazyPreemptionStatus,
},
}
if aag.physicalGpuPlacement != nil {
ag.Status.PhysicalPlacement = aag.physicalGpuPlacement.nodeToGpuIndices()
if aag.physicalLeafCellPlacement != nil {
ag.Status.PhysicalPlacement = aag.physicalLeafCellPlacement.nodeToLeafCellIndices()
}
if aag.virtualGpuPlacement != nil {
ag.Status.VirtualPlacement = aag.virtualGpuPlacement.preassignedCellToLeafCells()
if aag.virtualLeafCellPlacement != nil {
ag.Status.VirtualPlacement = aag.virtualLeafCellPlacement.preassignedCellToLeafCells()
}
for _, pods := range aag.allocatedPods {
for _, p := range pods {
@ -212,28 +213,28 @@ func (aag *AlgoAffinityGroup) ToAffinityGroup() api.AffinityGroup {
return ag
}
type groupPhysicalPlacement map[int32][]CellList // GpuNum -> a list of pods -> a list of physical GPU cells of each pod
type groupVirtualPlacement map[int32][]CellList // GpuNum -> a list of pods -> a list of virtual GPU cells of each pod
type groupPhysicalPlacement map[int32][]CellList // LeafCellNum -> a list of pods -> a list of physical leaf cells of each pod
type groupVirtualPlacement map[int32][]CellList // LeafCellNum -> a list of pods -> a list of virtual leaf cells of each pod
func (p groupPhysicalPlacement) String() string {
return common.ToJson(p.nodeToGpuIndices())
return common.ToJson(p.nodeToLeafCellIndices())
}
func (p groupPhysicalPlacement) nodeToGpuIndices() map[string][]int32 {
nodeToGpuIndices := map[string][]int32{}
func (p groupPhysicalPlacement) nodeToLeafCellIndices() map[string][]int32 {
nodeToLeafCellIndices := map[string][]int32{}
for _, podPlacements := range p {
for _, podPlacement := range podPlacements {
for _, gpu := range podPlacement {
pGpu := gpu.(*PhysicalCell)
nodes, gpuIndices := pGpu.GetPhysicalPlacement()
if _, ok := nodeToGpuIndices[nodes[0]]; !ok {
nodeToGpuIndices[nodes[0]] = []int32{}
for _, leafCell := range podPlacement {
pLeafCell := leafCell.(*PhysicalCell)
nodes, leafCellIndices := pLeafCell.GetPhysicalPlacement()
if _, ok := nodeToLeafCellIndices[nodes[0]]; !ok {
nodeToLeafCellIndices[nodes[0]] = []int32{}
}
nodeToGpuIndices[nodes[0]] = append(nodeToGpuIndices[nodes[0]], gpuIndices[0])
nodeToLeafCellIndices[nodes[0]] = append(nodeToLeafCellIndices[nodes[0]], leafCellIndices[0])
}
}
}
return nodeToGpuIndices
return nodeToLeafCellIndices
}
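For a concrete (hypothetical) shape of the flattened view: a group of two 4-leaf-cell pods placed on two different nodes would produce something like

```go
map[string][]int32{
	"node-a": {0, 1, 2, 3},
	"node-b": {4, 5, 6, 7},
}
```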
func (p groupVirtualPlacement) String() string {
@ -244,10 +245,10 @@ func (p groupVirtualPlacement) preassignedCellToLeafCells() map[api.CellAddress]
preassignedCellToLeafCells := map[api.CellAddress][]api.CellAddress{}
for _, podPlacements := range p {
for _, podPlacement := range podPlacements {
for _, gpu := range podPlacement {
vGpu := gpu.(*VirtualCell)
address := vGpu.GetAddress()
preassignedAddress := vGpu.GetPreassignedCell().GetAddress()
for _, leafCell := range podPlacement {
vLeafCell := leafCell.(*VirtualCell)
address := vLeafCell.GetAddress()
preassignedAddress := vLeafCell.GetPreassignedCell().GetAddress()
if _, ok := preassignedCellToLeafCells[preassignedAddress]; !ok {
preassignedCellToLeafCells[preassignedAddress] = []api.CellAddress{}
}
@ -261,17 +262,17 @@ func (p groupVirtualPlacement) preassignedCellToLeafCells() map[api.CellAddress]
func (p groupVirtualPlacement) toPhysicalPlacement(
bindings map[api.CellAddress]*PhysicalCell,
gpuNums []int32) groupPhysicalPlacement {
leafCellNums []int32) groupPhysicalPlacement {
physicalPlacement := groupPhysicalPlacement{}
for _, podGpuNum := range gpuNums {
podPlacements := p[podGpuNum]
physicalPlacement[podGpuNum] = make([]CellList, len(podPlacements))
for _, podLeafCellNum := range leafCellNums {
podPlacements := p[podLeafCellNum]
physicalPlacement[podLeafCellNum] = make([]CellList, len(podPlacements))
for i, podPlacement := range podPlacements {
physicalPlacement[podGpuNum][i] = make(CellList, len(podPlacement))
for j, gpu := range podPlacement {
pGpu := bindings[gpu.GetAddress()]
physicalPlacement[podGpuNum][i][j] = pGpu
physicalPlacement[podLeafCellNum][i] = make(CellList, len(podPlacement))
for j, leafCell := range podPlacement {
pLeafCell := bindings[leafCell.GetAddress()]
physicalPlacement[podLeafCellNum][i][j] = pLeafCell
}
}
}
@ -282,22 +283,22 @@ func (p groupVirtualPlacement) toPhysicalPlacement(
// lowest-level cells in a physical placement. It is generated by collecting all the unbound
// ancestors for these cells and grouping them in a tree.
func (p groupVirtualPlacement) toBindingPaths(
gpuNums []int32,
leafCellNums []int32,
bindings map[api.CellAddress]*PhysicalCell) (
preassignedCells []*cellBindingPathVertex,
nonPreassignedCells [][]*cellBindingPathVertex) {
allBindingPathVertices := map[api.CellAddress]*cellBindingPathVertex{}
for _, podGpuNum := range gpuNums {
podPlacements := p[podGpuNum]
for _, podLeafCellNum := range leafCellNums {
podPlacements := p[podLeafCellNum]
for _, podPlacement := range podPlacements {
for _, gpu := range podPlacement {
if pGpu := gpu.(*VirtualCell).GetPhysicalCell(); pGpu != nil {
bindings[gpu.GetAddress()] = pGpu
for _, leafCell := range podPlacement {
if pLeafCell := leafCell.(*VirtualCell).GetPhysicalCell(); pLeafCell != nil {
bindings[leafCell.GetAddress()] = pLeafCell
continue
}
var bindingPath []*VirtualCell
for c := gpu; c != nil; c = c.GetParent() {
for c := leafCell; c != nil; c = c.GetParent() {
vc := c.(*VirtualCell)
if vc.GetPhysicalCell() != nil || allBindingPathVertices[vc.GetAddress()] != nil {
break

View file

@ -24,13 +24,14 @@ package algorithm
import (
"fmt"
"math/rand"
"github.com/microsoft/hivedscheduler/pkg/api"
"github.com/microsoft/hivedscheduler/pkg/common"
"github.com/microsoft/hivedscheduler/pkg/internal"
core "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/types"
"k8s.io/klog"
"math/rand"
)
// generatePodScheduleResult writes the scheduling result into a PodScheduleResult.
@ -40,7 +41,7 @@ func generatePodScheduleResult(
preemptionVictims map[string]common.Set,
waitReason string,
cellLevelToType map[CellChain]map[CellLevel]api.CellType,
currentGpuNum int32,
currentLeafCellNum int32,
currentPodIndex int32,
group *AlgoAffinityGroup,
groupName string,
@ -63,14 +64,14 @@ func generatePodScheduleResult(
}
// we find the selected node after the preemption is done, otherwise the preemption victims
// may cause the selected node to be excluded from the suggested nodes
affinityGroupBindInfo, selectedNode, selectedGpuIndices, cellChain := generateAffinityGroupBindInfo(
groupPhysicalPlacement, groupVirtualPlacement, cellLevelToType, currentGpuNum, currentPodIndex, group, groupName)
klog.Infof("[%v]: pod is decided to be scheduled to node %v, GPUs %v",
internal.Key(pod), selectedNode, common.ToJson(selectedGpuIndices))
affinityGroupBindInfo, selectedNode, selectedLeafCellIndices, cellChain := generateAffinityGroupBindInfo(
groupPhysicalPlacement, groupVirtualPlacement, cellLevelToType, currentLeafCellNum, currentPodIndex, group, groupName)
klog.Infof("[%v]: pod is decided to be scheduled to node %v, leaf cells %v",
internal.Key(pod), selectedNode, common.ToJson(selectedLeafCellIndices))
return internal.PodScheduleResult{
PodBindInfo: &api.PodBindInfo{
Node: selectedNode,
GpuIsolation: selectedGpuIndices,
LeafCellIsolation: selectedLeafCellIndices,
CellChain: cellChain,
AffinityGroupBindInfo: affinityGroupBindInfo,
},
@ -102,71 +103,71 @@ func generatePodPreemptInfo(preemptionVictims map[string]common.Set, pod *core.P
}
// generateAffinityGroupBindInfo translates the physical and virtual placements of an affinity group
// into a series of AffinityGroupMemberBindInfos, and also returns the allocated node and GPU addresses
// into a series of AffinityGroupMemberBindInfos, and also returns the allocated node and leaf cell addresses
// of the current pod.
func generateAffinityGroupBindInfo(
groupPhysicalPlacement groupPhysicalPlacement,
groupVirtualPlacement groupVirtualPlacement,
cellLevelToType map[CellChain]map[CellLevel]api.CellType,
currentGpuNum int32,
currentLeafCellNum int32,
currentPodIndex int32,
group *AlgoAffinityGroup,
groupName string) (
affinityGroupBindInfo []api.AffinityGroupMemberBindInfo,
selectedNode string,
selectedGpuIndices []int32,
selectedLeafCellIndices []int32,
chain string) {
affinityGroupBindInfo = make([]api.AffinityGroupMemberBindInfo, len(groupPhysicalPlacement))
groupMemberIndex := 0
for podGpuNum, podPhysicalPlacements := range groupPhysicalPlacement {
for podLeafCellNum, podPhysicalPlacements := range groupPhysicalPlacement {
mbi := api.AffinityGroupMemberBindInfo{
PodPlacements: make([]api.PodPlacementInfo, len(podPhysicalPlacements)),
}
for podIndex := int32(0); podIndex < int32(len(podPhysicalPlacements)); podIndex++ {
mbi.PodPlacements[podIndex].PhysicalGpuIndices = make([]int32, podGpuNum)
mbi.PodPlacements[podIndex].PreassignedCellTypes = make([]api.CellType, podGpuNum)
for gpuIndex := int32(0); gpuIndex < podGpuNum; gpuIndex++ {
pGpu := podPhysicalPlacements[podIndex][gpuIndex]
if pGpu == nil {
mbi.PodPlacements[podIndex].PhysicalLeafCellIndices = make([]int32, podLeafCellNum)
mbi.PodPlacements[podIndex].PreassignedCellTypes = make([]api.CellType, podLeafCellNum)
for leafCellIndex := int32(0); leafCellIndex < podLeafCellNum; leafCellIndex++ {
pLeafCell := podPhysicalPlacements[podIndex][leafCellIndex]
if pLeafCell == nil {
if group == nil || group.state == groupPreempting {
panic(fmt.Sprintf("The first pod in group %v was allocated invalid resource", groupName))
}
// if the physical placement of this pod is not found (e.g., removed due to reconfiguration),
// we will keep the original decision by retrieving it from other pods
mbi.PodPlacements[podIndex], chain = retrieveMissingPodPlacement(group, podGpuNum, podIndex)
mbi.PodPlacements[podIndex], chain = retrieveMissingPodPlacement(group, podLeafCellNum, podIndex)
klog.Warningf(
"pod placement has been invalid and is retrieved from annotation of other pods: node %v, GPU %v",
mbi.PodPlacements[podIndex].PhysicalNode, mbi.PodPlacements[podIndex].PhysicalGpuIndices[gpuIndex])
"pod placement has been invalid and is retrieved from annotation of other pods: node %v, leaf cell %v",
mbi.PodPlacements[podIndex].PhysicalNode, mbi.PodPlacements[podIndex].PhysicalLeafCellIndices[leafCellIndex])
} else {
nodes, gpuIndices := pGpu.(*PhysicalCell).GetPhysicalPlacement()
// here each cell (i.e., pGpu) is only one GPU, hence we take the first element
// in its "nodes" and "gpuIndices" as the node and GPU address
nodes, leafCellIndices := pLeafCell.(*PhysicalCell).GetPhysicalPlacement()
// here each cell (i.e., pLeafCell) is only one leaf cell, hence we take the first element
// in its "nodes" and "leafCellIndices" as the node and leaf cell address
if mbi.PodPlacements[podIndex].PhysicalNode == "" {
mbi.PodPlacements[podIndex].PhysicalNode = nodes[0]
}
mbi.PodPlacements[podIndex].PhysicalGpuIndices[gpuIndex] = gpuIndices[0]
mbi.PodPlacements[podIndex].PhysicalLeafCellIndices[leafCellIndex] = leafCellIndices[0]
if groupVirtualPlacement != nil {
vGpu := groupVirtualPlacement[podGpuNum][podIndex][gpuIndex].(*VirtualCell)
mbi.PodPlacements[podIndex].PreassignedCellTypes[gpuIndex] =
cellLevelToType[vGpu.GetChain()][vGpu.GetPreassignedCell().GetLevel()]
vLeafCell := groupVirtualPlacement[podLeafCellNum][podIndex][leafCellIndex].(*VirtualCell)
mbi.PodPlacements[podIndex].PreassignedCellTypes[leafCellIndex] =
cellLevelToType[vLeafCell.GetChain()][vLeafCell.GetPreassignedCell().GetLevel()]
} else {
mbi.PodPlacements[podIndex].PreassignedCellTypes[gpuIndex] = ""
mbi.PodPlacements[podIndex].PreassignedCellTypes[leafCellIndex] = ""
}
}
}
}
if podGpuNum == currentGpuNum {
if podLeafCellNum == currentLeafCellNum {
selectedNode = mbi.PodPlacements[currentPodIndex].PhysicalNode
selectedGpuIndices = mbi.PodPlacements[currentPodIndex].PhysicalGpuIndices
if pGpu := groupPhysicalPlacement[currentGpuNum][currentPodIndex][0]; pGpu != nil {
chain = string(pGpu.GetChain())
selectedLeafCellIndices = mbi.PodPlacements[currentPodIndex].PhysicalLeafCellIndices
if pLeafCell := groupPhysicalPlacement[currentLeafCellNum][currentPodIndex][0]; pLeafCell != nil {
chain = string(pLeafCell.GetChain())
}
}
affinityGroupBindInfo[groupMemberIndex] = mbi
groupMemberIndex++
}
return affinityGroupBindInfo, selectedNode, selectedGpuIndices, chain
return affinityGroupBindInfo, selectedNode, selectedLeafCellIndices, chain
}
// collectBadOrNonSuggestedNodes collects all the nodes that are not within the suggested nodes
@ -178,14 +179,14 @@ func collectBadOrNonSuggestedNodes(
badOrNonSuggestedNodes common.Set) {
badOrNonSuggestedNodes = common.NewSet()
for gpuNum := range placement {
for podIndex := range placement[gpuNum] {
for _, gpu := range placement[gpuNum][podIndex] {
if gpu == nil {
for leafCellNum := range placement {
for podIndex := range placement[leafCellNum] {
for _, leafCell := range placement[leafCellNum][podIndex] {
if leafCell == nil {
continue
}
nodes, _ := gpu.(*PhysicalCell).GetPhysicalPlacement()
if !gpu.(*PhysicalCell).IsHealthy() ||
nodes, _ := leafCell.(*PhysicalCell).GetPhysicalPlacement()
if !leafCell.(*PhysicalCell).IsHealthy() ||
(!ignoreSuggestedNodes && !suggestedNodes.Contains(nodes[0])) {
badOrNonSuggestedNodes.Add(nodes[0])
}
@ -196,24 +197,24 @@ func collectBadOrNonSuggestedNodes(
}
// collectPreemptionVictims collects preemption victims of an affinity group.
// If any of the GPUs allocated for the whole group is still used by a pod,
// If any of the leaf cells allocated for the whole group is still used by a pod,
// we will wait for the preemption, as a group is gang-scheduled.
func collectPreemptionVictims(placement groupPhysicalPlacement) (
victimPods map[string]common.Set, overlappingPreemptorGroups common.Set) {
victimPods = map[string]common.Set{} // node -> pods
overlappingPreemptorGroups = common.NewSet()
for gpuNum := range placement {
for podIndex := range placement[gpuNum] {
for _, gpu := range placement[gpuNum][podIndex] {
if gpu == nil {
for leafCellNum := range placement {
for podIndex := range placement[leafCellNum] {
for _, leafCell := range placement[leafCellNum][podIndex] {
if leafCell == nil {
continue
}
pGpu := gpu.(*PhysicalCell)
state := pGpu.GetState()
pLeafCell := leafCell.(*PhysicalCell)
state := pLeafCell.GetState()
if state == cellUsed || state == cellReserving {
// for any victim pod, gang-preempt all the other pods from the same affinity group
for _, pods := range pGpu.GetUsingGroup().allocatedPods {
for _, pods := range pLeafCell.GetUsingGroup().allocatedPods {
for _, v := range pods {
if v != nil {
if _, ok := victimPods[v.Spec.NodeName]; !ok {
@ -225,7 +226,7 @@ func collectPreemptionVictims(placement groupPhysicalPlacement) (
}
}
if state == cellReserving || state == cellReserved {
overlappingPreemptorGroups.Add(pGpu.GetReservingOrReservedGroup())
overlappingPreemptorGroups.Add(pLeafCell.GetReservingOrReservedGroup())
}
}
}
@ -246,13 +247,13 @@ func victimsToString(victimPods map[string]common.Set) string {
// retrieveMissingPodPlacement finds the placement of a pod from the annotation of other pods in the same group
// when the pod's placement has been invalid (i.e., not found in the spec).
func retrieveMissingPodPlacement(g *AlgoAffinityGroup, gpuNum int32, podIndex int32) (api.PodPlacementInfo, string) {
func retrieveMissingPodPlacement(g *AlgoAffinityGroup, leafCellNum int32, podIndex int32) (api.PodPlacementInfo, string) {
for _, pods := range g.allocatedPods {
for _, p := range pods {
if p != nil {
info := internal.ExtractPodBindInfo(p)
for _, mbi := range info.AffinityGroupBindInfo {
if gpuNum == int32(len(mbi.PodPlacements[0].PhysicalGpuIndices)) {
if leafCellNum == int32(len(mbi.PodPlacements[0].PhysicalLeafCellIndices)) {
return mbi.PodPlacements[podIndex], info.CellChain
}
}
@ -260,20 +261,20 @@ func retrieveMissingPodPlacement(g *AlgoAffinityGroup, gpuNum int32, podIndex in
}
}
panic(fmt.Sprintf(
"No allocated pod found in an allocated group %v when retrieving placement for pod %v with GPU number %v", g.name, podIndex, gpuNum))
"No allocated pod found in an allocated group %v when retrieving placement for pod %v with leaf cell number %v", g.name, podIndex, leafCellNum))
}
// retrieveVirtualCell finds the corresponding virtual cell for a physical cell in the placements of an affinity group.
func retrieveVirtualCell(
physicalPlacement groupPhysicalPlacement,
virtualPlacement groupVirtualPlacement,
pGpu *PhysicalCell) (vGpu *VirtualCell) {
pLeafCell *PhysicalCell) (vLeafCell *VirtualCell) {
for gpuNum := range physicalPlacement {
for podIndex := range physicalPlacement[gpuNum] {
for gpuIndex, gpu := range physicalPlacement[gpuNum][podIndex] {
if gpu != nil && CellEqual(gpu, pGpu) {
return virtualPlacement[gpuNum][podIndex][gpuIndex].(*VirtualCell)
for leafCellNum := range physicalPlacement {
for podIndex := range physicalPlacement[leafCellNum] {
for leafCellIndex, leafCell := range physicalPlacement[leafCellNum][podIndex] {
if leafCell != nil && CellEqual(leafCell, pLeafCell) {
return virtualPlacement[leafCellNum][podIndex][leafCellIndex].(*VirtualCell)
}
}
}
@ -294,12 +295,12 @@ func getNewPodIndex(pods []*core.Pod) int32 {
}
// getAllocatedPodIndex finds the index of an allocated pod in its group according to its placement.
func getAllocatedPodIndex(info *api.PodBindInfo, gpuNum int32) int32 {
func getAllocatedPodIndex(info *api.PodBindInfo, leafCellNum int32) int32 {
for _, gms := range info.AffinityGroupBindInfo {
if gpuNumber := int32(len(gms.PodPlacements[0].PhysicalGpuIndices)); gpuNumber == gpuNum {
if leafCellNumber := int32(len(gms.PodPlacements[0].PhysicalLeafCellIndices)); leafCellNumber == leafCellNum {
for podIndex, placement := range gms.PodPlacements {
if placement.PhysicalNode == info.Node && common.Int32SliceContains(
placement.PhysicalGpuIndices, info.GpuIsolation[0]) {
placement.PhysicalLeafCellIndices, info.LeafCellIsolation[0]) {
return int32(podIndex)
}
}
@ -320,19 +321,19 @@ func allPodsReleased(allocatedPods map[int32][]*core.Pod) bool {
return true
}
// findPhysicalGpu finds a physical GPU cell in the full list. If the GPU is not found in the chain specified
// findPhysicalLeafCell finds a physical leaf cell in the full list. If the leaf cell is not found in the chain specified
// in the PodBindInfo (due to reconfiguration), we will try to search in the other chains.
func findPhysicalGpu(
func findPhysicalLeafCell(
fullCellList map[CellChain]ChainCellList,
chain CellChain,
node string,
gpuIndex int32) *PhysicalCell {
leafCellIndex int32) *PhysicalCell {
if g := findPhysicalGpuInChain(fullCellList, chain, node, gpuIndex); g == nil {
if g := findPhysicalLeafCellInChain(fullCellList, chain, node, leafCellIndex); g == nil {
for c := range fullCellList {
if c != chain {
if g = findPhysicalGpuInChain(fullCellList, c, node, gpuIndex); g != nil {
klog.Warningf("GPU %v on node %v has been moved to chain %v", gpuIndex, node, c)
if g = findPhysicalLeafCellInChain(fullCellList, c, node, leafCellIndex); g != nil {
klog.Warningf("Leaf cell %v on node %v has been moved to chain %v", leafCellIndex, node, c)
return g
}
}
@ -343,18 +344,18 @@ func findPhysicalGpu(
}
}
// findPhysicalGpuInChain finds a physical GPU cell in the full list of a given chain. This search is based on
// *one* node and *one* GPU index, assuming there is no resource overlapping among cells at the same level.
func findPhysicalGpuInChain(
// findPhysicalLeafCellInChain finds a physical leaf cell in the full list of a given chain. This search is based on
// *one* node and *one* leaf cell index, assuming there is no resource overlapping among cells at the same level.
func findPhysicalLeafCellInChain(
fullCellList map[CellChain]ChainCellList,
chain CellChain,
node string,
gpuIndex int32) *PhysicalCell {
leafCellIndex int32) *PhysicalCell {
for _, c := range fullCellList[chain][1] {
success := false
cc := c.(*PhysicalCell)
nodes, gpuIndices := cc.GetPhysicalPlacement()
nodes, leafCellIndices := cc.GetPhysicalPlacement()
for _, n := range nodes {
if n == node {
success = true
@ -362,11 +363,11 @@ func findPhysicalGpuInChain(
}
}
if success {
if gpuIndex < 0 {
if leafCellIndex < 0 {
return cc
} else {
for _, g := range gpuIndices {
if g == gpuIndex {
for _, g := range leafCellIndices {
if g == leafCellIndex {
return cc
}
}
@ -392,7 +393,7 @@ func inFreeCellList(c *PhysicalCell) bool {
// setCellState sets state for a cell and its parent recursively. A parent cell will be in Used state
// if any of its children is in Used state. For the other states (Free, Reserving, Reserved),
// a parent will be in the state if all of its children are in the state.
// setCellState always starts from the lowest level, i.e., GPU-level cells.
// setCellState always starts from the lowest level, i.e., leaf-level cells.
func setCellState(c *PhysicalCell, s CellState) {
c.SetState(s)
if c.GetParent() != nil {
@ -418,7 +419,7 @@ func allChildrenSameState(c *PhysicalCell, s CellState) bool {
func generateOTVirtualCell(pc *api.PhysicalCellStatus) *api.VirtualCellStatus {
vc := &api.VirtualCellStatus{
CellStatus: api.CellStatus{
GpuType: pc.GpuType,
LeafCellType: pc.LeafCellType,
CellType: pc.CellType,
CellAddress: pc.CellAddress + "-opp",
CellState: api.CellState(cellUsed),

View file

@ -24,11 +24,12 @@ package api
import (
"fmt"
"github.com/microsoft/hivedscheduler/pkg/common"
"io/ioutil"
"os"
"github.com/microsoft/hivedscheduler/pkg/common"
"k8s.io/client-go/rest"
"k8s.io/client-go/tools/clientcmd"
"os"
)
type Config struct {
@ -72,7 +73,7 @@ type Config struct {
WaitingPodSchedulingBlockMilliSec *int64 `yaml:"waitingPodSchedulingBlockMilliSec"`
// Specify the whole physical cluster
// TODO: Automatically construct it based on node info from GPU and Network Device Plugins
// TODO: Automatically construct it based on node info from Device Plugins
PhysicalCluster *PhysicalClusterSpec `yaml:"physicalCluster"`
// Specify all the virtual clusters belonging to the physical cluster
@ -148,7 +149,7 @@ func inferPhysicalCellSpec(
return
}
if ct.IsNodeLevel {
// reset default address to 0 when a node-level cell is found; leaf cells will use it as gpu indices
// reset default address to 0 when a node-level cell is found; leaf cells will use it as indices
defaultAddress = 0
}
if ct.ChildCellNumber > 0 && len(spec.CellChildren) == 0 {

View file

@ -44,23 +44,10 @@ const (
// PodSchedulingSpec YAML format.
AnnotationKeyPodSchedulingSpec = GroupName + "/pod-scheduling-spec"
// To leverage this scheduler, if one container in the Pod want to use the
// allocated GPUs for the whole Pod, it should contain below env.
// env:
// - name: NVIDIA_VISIBLE_DEVICES
// valueFrom:
// fieldRef:
// fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
// The annotation referred by the env will be populated by scheduler when bind the pod.
//
// Notes:
// 1. The scheduler directly delivers GPU isolation decision to
// nvidia-container-runtime through Pod Env: NVIDIA_VISIBLE_DEVICES.
// 2. If multiple containers in the Pod contain the env, the allocated GPUs are
// all visible to them, so it is these containers' freedom to control how
// to share these GPUs.
EnvNameNvidiaVisibleDevices = "NVIDIA_VISIBLE_DEVICES"
AnnotationKeyPodGpuIsolation = GroupName + "/pod-gpu-isolation"
// To leverage this scheduler, the Pod can reference the annotation below to
// use the leaf cells allocated to the whole Pod.
AnnotationKeyPodLeafCellIsolation = GroupName + "/pod-leaf-cell-isolation"
DeprecatedAnnotationKeyPodGpuIsolation = GroupName + "/pod-gpu-isolation"
// Populated by this scheduler, used to track and recover allocated placement.
// It is in PodBindInfo YAML format.
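The usage pattern shown in the removed comment still applies under the new key. A minimal, hypothetical container snippet (container name, image, and the choice of NVIDIA_VISIBLE_DEVICES as the consuming env are assumptions; whether that env is appropriate depends on the device runtime in use):

containers:
- name: worker                     # hypothetical container
  image: example/worker:latest     # hypothetical image
  env:
  - name: NVIDIA_VISIBLE_DEVICES   # delivered to nvidia-container-runtime, as in the old comment
    valueFrom:
      fieldRef:
        fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']

As before, the annotation value is populated by the scheduler when it binds the Pod, so the container only sees the leaf cells allocated to that Pod.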


@ -24,6 +24,7 @@ package api
import (
"fmt"
meta "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/types"
)
@ -78,8 +79,8 @@ type PodSchedulingSpec struct {
VirtualCluster VirtualClusterName `yaml:"virtualCluster"`
Priority int32 `yaml:"priority"`
PinnedCellId PinnedCellId `yaml:"pinnedCellId"`
GpuType string `yaml:"gpuType"`
GpuNumber int32 `yaml:"gpuNumber"`
LeafCellType string `yaml:"leafCellType"`
LeafCellNumber int32 `yaml:"leafCellNumber"`
GangReleaseEnable bool `yaml:"gangReleaseEnable"`
LazyPreemptionEnable bool `yaml:"lazyPreemptionEnable"`
IgnoreK8sSuggestedNodes bool `yaml:"ignoreK8sSuggestedNodes"`
@ -92,15 +93,15 @@ type AffinityGroupSpec struct {
}
type AffinityGroupMemberSpec struct {
PodNumber int32 `yaml:"podNumber"`
GpuNumber int32 `yaml:"gpuNumber"`
PodNumber int32 `yaml:"podNumber"`
LeafCellNumber int32 `yaml:"leafCellNumber"`
}
// Used to recover scheduler allocated resource
type PodBindInfo struct {
Node string `yaml:"node"` // node to bind
GpuIsolation []int32 `yaml:"gpuIsolation"` // GPUs to bind
CellChain string `yaml:"cellChain"` // cell chain selected
Node string `yaml:"node"` // node to bind
LeafCellIsolation []int32 `yaml:"leafCellIsolation"` // leaf cells to bind
CellChain string `yaml:"cellChain"` // cell chain selected
AffinityGroupBindInfo []AffinityGroupMemberBindInfo `yaml:"affinityGroupBindInfo"`
}
@ -109,8 +110,8 @@ type AffinityGroupMemberBindInfo struct {
}
type PodPlacementInfo struct {
PhysicalNode string `yaml:"physicalNode"`
PhysicalGpuIndices []int32 `yaml:"physicalGpuIndices"`
PhysicalNode string `yaml:"physicalNode"`
PhysicalLeafCellIndices []int32 `yaml:"physicalLeafCellIndices"`
// preassigned cell types used by the pods. used to locate the virtual cells
// when adding an allocated pod
PreassignedCellTypes []CellType `yaml:"preassignedCellTypes"`
@ -156,7 +157,7 @@ type AffinityGroupStatus struct {
VC VirtualClusterName `json:"vc"`
Priority int32 `json:"priority"`
State AffinityGroupState `json:"state"`
PhysicalPlacement map[string][]int32 `json:"physicalPlacement,omitempty"` // node -> GPU indices
PhysicalPlacement map[string][]int32 `json:"physicalPlacement,omitempty"` // node -> leaf cell indices
VirtualPlacement map[CellAddress][]CellAddress `json:"virtualPlacement,omitempty"` // preassigned cell -> leaf cells
AllocatedPods []types.UID `json:"allocatedPods,omitempty"`
PreemptingPods []types.UID `json:"preemptingPods,omitempty"`
@ -181,9 +182,9 @@ const (
)
type CellStatus struct {
GpuType string `json:"gpuType,omitempty"`
CellType CellType `json:"cellType"`
IsNodeLevel bool `json:"isNodeLevel,omitempty"`
LeafCellType string `json:"leafCellType,omitempty"`
CellType CellType `json:"cellType"`
IsNodeLevel bool `json:"isNodeLevel,omitempty"`
// Address of a physical cell consists of its address (or index) in each level
// (e.g., node0/0/0/0 may represent node0, CPU socket 0, PCIe switch 0, GPU 0).
// Address of a virtual cell consists of its VC name, index of the preassigned cell,
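Putting the renamed yaml tags together, a pod-scheduling-spec annotation under the new naming could look like the sketch below. The VC name, group name, and the affinityGroup/name/members tag spellings are assumptions for illustration, not taken from this diff:

metadata:
  annotations:
    hivedscheduler.microsoft.com/pod-scheduling-spec: |-
      virtualCluster: vc1        # hypothetical VC
      priority: 0
      leafCellType: V100         # was gpuType
      leafCellNumber: 4          # was gpuNumber
      affinityGroup:             # tag spelling assumed
        name: default/job1       # hypothetical group name
        members:
        - podNumber: 2
          leafCellNumber: 4      # was gpuNumber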


@ -24,6 +24,9 @@ package internal
import (
"fmt"
"net/http"
"strings"
si "github.com/microsoft/hivedscheduler/pkg/api"
"github.com/microsoft/hivedscheduler/pkg/common"
core "k8s.io/api/core/v1"
@ -32,7 +35,6 @@ import (
"k8s.io/client-go/rest"
"k8s.io/client-go/tools/cache"
"k8s.io/klog"
"net/http"
)
func CreateClient(kConfig *rest.Config) kubeClient.Interface {
@ -175,19 +177,30 @@ func NewBindingPod(pod *core.Pod, podBindInfo *si.PodBindInfo) *core.Pod {
if bindingPod.Annotations == nil {
bindingPod.Annotations = map[string]string{}
}
bindingPod.Annotations[si.AnnotationKeyPodGpuIsolation] =
common.ToIndicesString(podBindInfo.GpuIsolation)
bindingPod.Annotations[si.AnnotationKeyPodLeafCellIsolation] =
common.ToIndicesString(podBindInfo.LeafCellIsolation)
bindingPod.Annotations[si.AnnotationKeyPodBindInfo] =
common.ToYaml(podBindInfo)
return bindingPod
}
// converts old spec annotations for backward compatibility
func convertOldAnnotation(annotation string) string {
r := strings.NewReplacer(
"gpuType", "leafCellType",
"gpuNumber", "leafCellNumber",
"gpuIsolation", "leafCellIsolation",
"physicalGpuIndices", "physicalLeafCellIndices",
)
return r.Replace(annotation)
}
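As a concrete (hypothetical) example of the plain substring replacement this performs, an old-style scheduling-spec annotation body such as

gpuType: V100
gpuNumber: 4

becomes

leafCellType: V100
leafCellNumber: 4

before it is unmarshalled. The same replacer also rewrites gpuIsolation and physicalGpuIndices inside old PodBindInfo annotations, which is what ExtractPodBindInfo below relies on for backward compatibility.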
// PodBindInfo comes from internal, so just need to assert when deserialization.
func ExtractPodBindInfo(allocatedPod *core.Pod) *si.PodBindInfo {
podBindInfo := si.PodBindInfo{}
annotation := allocatedPod.Annotations[si.AnnotationKeyPodBindInfo]
annotation := convertOldAnnotation(allocatedPod.Annotations[si.AnnotationKeyPodBindInfo])
if annotation == "" {
panic(fmt.Errorf(
"Pod does not contain or contains empty annotation: %v",
@ -199,9 +212,16 @@ func ExtractPodBindInfo(allocatedPod *core.Pod) *si.PodBindInfo {
}
func ExtractPodBindAnnotations(allocatedPod *core.Pod) map[string]string {
return map[string]string{
si.AnnotationKeyPodGpuIsolation: allocatedPod.Annotations[si.AnnotationKeyPodGpuIsolation],
si.AnnotationKeyPodBindInfo: allocatedPod.Annotations[si.AnnotationKeyPodBindInfo],
if _, ok := allocatedPod.Annotations[si.AnnotationKeyPodLeafCellIsolation]; ok {
return map[string]string{
si.AnnotationKeyPodLeafCellIsolation: allocatedPod.Annotations[si.AnnotationKeyPodLeafCellIsolation],
si.AnnotationKeyPodBindInfo: allocatedPod.Annotations[si.AnnotationKeyPodBindInfo],
}
} else {
return map[string]string{
si.AnnotationKeyPodLeafCellIsolation: allocatedPod.Annotations[si.DeprecatedAnnotationKeyPodGpuIsolation],
si.AnnotationKeyPodBindInfo: convertOldAnnotation(allocatedPod.Annotations[si.AnnotationKeyPodBindInfo]),
}
}
}
@ -214,7 +234,7 @@ func ExtractPodSchedulingSpec(pod *core.Pod) *si.PodSchedulingSpec {
podSchedulingSpec := si.PodSchedulingSpec{}
annotation := pod.Annotations[si.AnnotationKeyPodSchedulingSpec]
annotation := convertOldAnnotation(pod.Annotations[si.AnnotationKeyPodSchedulingSpec])
if annotation == "" {
panic(fmt.Errorf(errPfx + "Annotation does not exist or is empty"))
}
@ -226,8 +246,8 @@ func ExtractPodSchedulingSpec(pod *core.Pod) *si.PodSchedulingSpec {
podSchedulingSpec.AffinityGroup = &si.AffinityGroupSpec{
Name: fmt.Sprintf("%v/%v", pod.Namespace, pod.Name),
Members: []si.AffinityGroupMemberSpec{{
PodNumber: 1,
GpuNumber: podSchedulingSpec.GpuNumber},
PodNumber: 1,
LeafCellNumber: podSchedulingSpec.LeafCellNumber},
},
}
}
@ -242,8 +262,8 @@ func ExtractPodSchedulingSpec(pod *core.Pod) *si.PodSchedulingSpec {
if podSchedulingSpec.Priority > si.MaxGuaranteedPriority {
panic(fmt.Errorf(errPfx+"Priority is greater than %v", si.MaxGuaranteedPriority))
}
if podSchedulingSpec.GpuNumber <= 0 {
panic(fmt.Errorf(errPfx + "GpuNumber is non-positive"))
if podSchedulingSpec.LeafCellNumber <= 0 {
panic(fmt.Errorf(errPfx + "LeafCellNumber is non-positive"))
}
if podSchedulingSpec.AffinityGroup.Name == "" {
panic(fmt.Errorf(errPfx + "AffinityGroup.Name is empty"))
@ -254,10 +274,10 @@ func ExtractPodSchedulingSpec(pod *core.Pod) *si.PodSchedulingSpec {
if member.PodNumber <= 0 {
panic(fmt.Errorf(errPfx + "AffinityGroup.Members has non-positive PodNumber"))
}
if member.GpuNumber <= 0 {
panic(fmt.Errorf(errPfx + "AffinityGroup.Members has non-positive GpuNumber"))
if member.LeafCellNumber <= 0 {
panic(fmt.Errorf(errPfx + "AffinityGroup.Members has non-positive LeafCellNumber"))
}
if member.GpuNumber == podSchedulingSpec.GpuNumber {
if member.LeafCellNumber == podSchedulingSpec.LeafCellNumber {
isPodInGroup = true
}
}
@ -287,10 +307,10 @@ func BindPod(kClient kubeClient.Interface, bindingPod *core.Pod) {
panic(fmt.Errorf("Failed to bind Pod: %v", err))
}
klog.Infof("[%v]: Succeeded to bind Pod on node %v, gpus %v",
klog.Infof("[%v]: Succeeded to bind Pod on node %v, leaf cells %v",
Key(bindingPod),
bindingPod.Spec.NodeName,
bindingPod.Annotations[si.AnnotationKeyPodGpuIsolation])
bindingPod.Annotations[si.AnnotationKeyPodLeafCellIsolation])
}
func NewBadRequestError(message string) *si.WebServerError {


@ -24,6 +24,9 @@ package scheduler
import (
"fmt"
"sync"
"time"
"github.com/microsoft/hivedscheduler/pkg/algorithm"
si "github.com/microsoft/hivedscheduler/pkg/api"
"github.com/microsoft/hivedscheduler/pkg/common"
@ -39,8 +42,6 @@ import (
"k8s.io/client-go/tools/cache"
"k8s.io/klog"
ei "k8s.io/kubernetes/pkg/scheduler/api"
"sync"
"time"
)
// HivedScheduler is the scheduling framework which serves as the bridge between
@ -437,9 +438,9 @@ func (s *HivedScheduler) shouldForceBind(
// based on current status, the retried Pod should eventually be schedulable on a
// suitable placement decision.
// Thus, the problematic decision can only be a stale decision, i.e. only newly
// bad GPUs or newly deleted Nodes will lead Pod retried.
// For newly bad GPUs, it is like the normal behaviour that a pod will fail
// after the GPUs it runs on become unhealthy.
// bad devices or newly deleted Nodes will lead to the Pod being retried.
// For newly bad devices, it is like the normal behaviour that a pod will fail
// after the devices it runs on become unhealthy.
// For newly deleted Nodes, it is like the normal behaviour that a pod will
// be deleted by the GarbageCollectionController after the node it runs on is
// deleted.