Rename term gpu to leaf cell (#28)
* Rename gpuType -> skuType and gpuNumber -> skuNumber.
* Rename gpu to device when referring to affinity and index.
* Add explanation for sku type and device.
* Revert the terms sku and device to leaf cell.
* Fix.
* Convert old spec annotations for backward compatibility.
* Update README.
* Resolve comments.
* Update.
This commit is contained in:
Parent: 76ed604419
Commit: fbff5b09b4
@@ -20,7 +20,7 @@ The killer feature that distinguishes HiveD is that it provides resource guarant
 HiveD protects VCs' resources in terms of **cell**, a user-defined resource type that encodes both the quantity and other kinds of information, such as topology and hardware type. In the above example, a user can define a cell type of 8-GPU node, and the VC can be assigned one such cell. Then, HiveD will ensure that *there is always one 8-GPU node available for the VC*, regardless of the other workloads in the cluster.
-HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different GPU models, or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them.
+HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different device models (e.g., NVIDIA V100 GPU, AMD Radeon MI100 GPU, Cloud TPU v3), or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them.

 ### [Gang Scheduling](example/feature/README.md#Gang-Scheduling)

@@ -34,8 +34,8 @@ HiveD supports multiple job **priorities**. Higher-priority jobs can **[preempt]
 ## Feature
 1. [Multi-Tenancy: Virtual Cluster (VC)](example/feature/README.md#VC-Safety)
-2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
-3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible GPU Types](example/feature/README.md#GPU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.
+2. [Fine-Grained VC Resource Guarantee](example/feature/README.md#VC-Safety): Quantity, [Topology](example/feature/README.md#VC-Safety), [Type](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), etc.
+3. Flexible Intra-VC Scheduling: [Topology-Awareness](example/feature/README.md#Topology-Aware-Intra-VC-Scheduling), [Flexible Hardware Types](example/feature/README.md#SKU-Type), [Pinned VC Resource](example/feature/README.md#Pinned-Cells), Scheduling Policy Customization, etc.
 4. Optimized Resource Fragmentation and Less Starvation
 5. [Priorities](example/feature/README.md#Guaranteed-Job), [Overuse with Low Priority](example/feature/README.md#Opportunistic-Job), and [Inter-](example/feature/README.md#Inter-VC-Preemption)/[Intra-VC Preemption](example/feature/README.md#Intra-VC-Preemption)
 6. [Job (Full/Partial) Gang Scheduling/Preemption](example/feature/README.md#Gang-Scheduling)
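The cell and cell type described above are declared in HiveD's `physicalCluster` configuration. A minimal sketch of an 8-GPU-node cell type, following the `cellTypes` schema that appears later in this change (the names `V100` and `V100-NODE` are illustrative):

```yaml
physicalCluster:
  cellTypes:
    V100-NODE:              # an 8-GPU node as one cell type
      childCellType: V100   # leaf cell: a single device
      childCellNumber: 8
      isNodeLevel: true
```

A VC assigned one `V100-NODE` cell is then guaranteed a whole 8-GPU node, not just 8 devices scattered across nodes.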
@@ -68,7 +68,7 @@ For all cells currently associated with other AGs:
 `Used` (by other AGs) -> `Reserving` (by this AG) (e<sub>2</sub> in cell state machine);

 `Reserving`/`Reserved` (by other AGs) -> `Reserving`/`Reserved` (by this AG) (e<sub>3</sub>/e<sub>6</sub> in cell state machine);

 For free cells:

@@ -94,7 +94,7 @@ __e<sub>4</sub>__:
 Condition: all pods of this AG are deleted.

 Operation:

 all cells `Used` (by this AG) -> `Free` (e<sub>1</sub> in cell state machine).

 __e<sub>5</sub>__:

@@ -118,7 +118,7 @@ __e<sub>7</sub>__:
 Condition: all pods of this AG are deleted.

 Operation:

 All the `Reserving` cells (by this AG) -> `Used` (by the `Being preempted` AG currently associated with the cell) (e<sub>4</sub> in cell state machine).

@@ -132,7 +132,7 @@ Operation: none.
 ## Cell State Machine

-Cell is the resource unit in HiveD. The figure below shows the state machine of cell. Note that here cells are _lowest-level physical cells_, e.g., single-GPU cells in typical configs (we record states only in these cells).
+Cell is the resource unit in HiveD. The figure below shows the state machine of cell. Note that here cells are _lowest-level physical cells_, e.g., leaf cells in typical configs (we record states only in these cells).

 <p style="text-align: center;">
   <img src="img/cell-state-machine.png" title="cell" alt="cell" width="70%"/>

@@ -188,7 +188,7 @@ __e<sub>2</sub>__:
 Condition: triggered by another AG from `Pending` to `Preempting` (i.e., that AG is preempting the `Allocated` AG currently associated with this cell) (e<sub>1</sub> in AG state machine).

 Operation:

 The `Allocated` AG on this cell -> `Being preempted` (e<sub>6</sub> in AG state machine);

@@ -236,7 +236,7 @@ __e<sub>8</sub>__:
 Condition: triggered by (i) there is currently a `Preempting` AG on this cell but another `Allocated` AG is now associated with the cell (e<sub>0</sub> in AG state machine); OR (ii) the `Preempting` AG currently associated with this cell transitions to `Allocated` (e<sub>2</sub> in AG state machine).

 Operation:

 For (i): the `Preempting` AG on this cell -> `Pending` (e<sub>5</sub> in AG state machine); release the cell and then allocate it to the new `Allocated` AG.
@@ -2,6 +2,7 @@
 ## <a name="Index">Index</a>
 - [Config](#Config)
+- [Scheduling GPUs](#Scheduling-GPUs)

 ## <a name="Config">Config</a>
 ### <a name="ConfigQuickStart">Config QuickStart</a>

@@ -14,7 +15,6 @@
 Notes:
 1. It is like the [Azure VM Series](https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu) or [GCP Machine Types](https://cloud.google.com/compute/docs/machine-types).
 2. Currently, `skuTypes` is not directly used by HivedScheduler, but it is used by [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server) to set up proportional Pod resource requests and limits. So, if you are not using it with [OpenPAI RestServer](https://github.com/microsoft/pai/tree/master/src/rest-server), you can skip configuring it.
 3. It was previously known as `gpuTypes`, and we are in the process of renaming it to `skuTypes`, as HiveD is only aware of the abstract `cell` concept, not the concrete hardware that the `cell` represents.

 **Example:**

@@ -117,7 +117,7 @@
 5. Put it together

 **Example:**

 Finally, after the above steps, your config would be:
 ```yaml
 physicalCluster:

@@ -155,3 +155,37 @@
 ### <a name="ConfigDetail">Config Detail</a>
 [Detail Example](../example/config)

 ## <a name="Scheduling-GPUs">Scheduling GPUs</a>

 To leverage this scheduler to schedule GPUs, a container in the Pod that wants to use the GPUs allocated to the whole Pod
 can include the environment variables below:

 * NVIDIA GPUs

   ```yaml
   env:
   - name: NVIDIA_VISIBLE_DEVICES
     valueFrom:
       fieldRef:
         fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
   ```
   The scheduler directly delivers the GPU isolation decision to [nvidia-container-runtime](https://github.com/NVIDIA/nvidia-container-runtime)
   through the Pod Env `NVIDIA_VISIBLE_DEVICES`.

 * AMD GPUs

   ```yaml
   env:
   - name: AMD_VISIBLE_DEVICES
     valueFrom:
       fieldRef:
         fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
   ```
   The scheduler directly delivers the GPU isolation decision to [rocm-container-runtime](https://github.com/abuccts/rocm-container-runtime)
   through the Pod Env `AMD_VISIBLE_DEVICES`.

 The annotation referenced by the env will be populated by the scheduler when it binds the pod.

 If multiple containers in the Pod contain the env, the allocated GPUs are all visible to them,
 so it is up to these containers to control how to share these GPUs.
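Putting this section together, a Pod that requests one leaf cell and consumes the isolation decision could look roughly like the sketch below; the annotation fields follow the request examples elsewhere in this change, while the Pod name, image, and command are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: leaf-cell-demo
  annotations:
    hivedscheduler.microsoft.com/pod-scheduling-spec: |-
      virtualCluster: VC1
      priority: 1000
      leafCellType: K80
      leafCellNumber: 1
      affinityGroup: null
spec:
  schedulerName: hivedscheduler
  containers:
  - name: worker
    image: nvidia/cuda:10.0-base
    command: ["sh", "-c", "nvidia-smi && sleep 3600"]
    env:
    - name: NVIDIA_VISIBLE_DEVICES
      valueFrom:
        fieldRef:
          fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
```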
@@ -11,7 +11,7 @@ kubeApiServerAddress: http://10.10.10.10:8080
 #
 # Constraints:
 # 1. All cellTypes should form a forest, i.e. a disjoint union of trees.
-# 2. All physicalCells should contain at most one physical specific GPU.
+# 2. All physicalCells should contain at most one physical specific device.
 # 3. Each physicalCell should contain exactly one node level cellType.
 # 4. Each physicalCell should specify full hierarchies defined by its cellType.
 # 5. A pinnedCellId should be able to universally locate one physicalCell.

@@ -24,7 +24,7 @@ kubeApiServerAddress: http://10.10.10.10:8080
 ################################################################################
 physicalCluster:
   # Define the cell structures.
-  # Each leaf cellType contains a single GPU and also defines a gpuType of the
+  # Each leaf cellType contains a single device and also defines a leafCellType of the
   # same name.
   cellTypes:
     #######################################

@@ -35,8 +35,8 @@ physicalCluster:
       childCellType: CT1
       # Specify how many child cells it contains.
      childCellNumber: 2
-      # Specify whether it is a node level cellType, i.e. contains all GPUs of
-      # its corresponding gpuType within one node and only contains these GPUs.
+      # Specify whether it is a node level cellType, i.e. contains all leaf cells of
+      # its corresponding leafCellType within one node and only contains these leaf cells.
       # Defaults to false.
       isNodeLevel: true

@@ -149,15 +149,15 @@ physicalCluster:
     cellAddress: 0.0.0.0
   - cellType: CT1-NODE
     cellAddress: 0.0.0.1
-  # One node has multiple gpu types and
-  # non-standard gpu indices (by explicitly specifying cell addresses)
+  # One node has multiple leaf cell types and
+  # non-standard leaf cell indices (by explicitly specifying cell addresses)
   - cellType: CT1-NODE
     cellAddress: 1.0.0.2 # NODE Name
     cellChildren:
     - cellAddress: 8 # GPU Index
       pinnedCellId: VC1-YQW-CT1
     - cellAddress: 9 # GPU Index
-  # One cell has non-standard gpu indices
+  # One cell has non-standard leaf cell indices
   - cellType: 3-DGX1-P100-NODE
     cellChildren:
     # cellAddress can be omitted for non-node level cellType, which defaults to
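Constraint 1 above means every chain of cell types has its own root. A minimal sketch of a two-chain forest with one physical cell per chain, reusing the schema shown in this file (type names and addresses are illustrative):

```yaml
physicalCluster:
  cellTypes:
    K80-NODE:
      childCellType: K80    # leaf cell type of the first chain
      childCellNumber: 4
      isNodeLevel: true
    M60-NODE:
      childCellType: M60    # leaf cell type of the second chain
      childCellNumber: 4
      isNodeLevel: true
  physicalCells:
  - cellType: K80-NODE      # exactly one node-level cellType per physicalCell
    cellAddress: 10.151.41.18
  - cellType: M60-NODE
    cellAddress: 10.151.41.26
```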
@@ -9,7 +9,7 @@
 HiveD guarantees **quota safety for all VCs**, in the sense that the requests to cells defined in each VC can always be satisfied.

-VC's cells can be described by Hardware Quantity, [Topology](#VC-Safety), [Type](#GPU-Type), [Pinned Cells](#Pinned-Cells), etc. To guarantee safety, HiveD never allows a VC to "invade" other VCs' cells. For example, to guarantee all VCs' topology, one VC's [guaranteed jobs](#Guaranteed-Job) should never make fragmentation inside other VCs:
+VC's cells can be described by Hardware Quantity, [Topology](#VC-Safety), [Type](#SKU-Type), [Pinned Cells](#Pinned-Cells), etc. To guarantee safety, HiveD never allows a VC to "invade" other VCs' cells. For example, to guarantee all VCs' topology, one VC's [guaranteed jobs](#Guaranteed-Job) should never make fragmentation inside other VCs:

 Two DGX-2s, two VCs, each owning one DGX-2 node. For a traditional scheduler, this translates into two VCs each owning 16 GPUs. When a user submits 16 1-GPU jobs to VC1, the user in VC2 might not be able to run a 16-GPU job, due to a possible fragmentation issue caused by VC1. HiveD, in contrast, can guarantee that each VC always has one entire node available for its dedicated use.

@@ -30,19 +30,21 @@ This is similar to [K8S Taints and Tolerations](https://kubernetes.io/docs/conce
 2. Submit job [itc-pin](file/itc-pin.yaml) to VC1: all tasks in task role vc1pinned will be on node 10.151.41.25 (which is pinned), and all tasks in task role vc1nopinned will NOT be on node 10.151.41.25.
 <img src="file/itc-pin.png" width="900"/>
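The pinned task role requests its cell by pin id rather than by type. A minimal sketch of the two task roles in this test, using only the fields that appear in the itc-pin hunk later in this change:

```yaml
taskRoles:
  vc1pinned:
    pinnedCellId: VC1-K80   # always placed inside the pinned cell
  vc1nopinned:
    skuType: K80            # any K80 leaf cell in the VC
```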
-## GPU Type
+## SKU Type
 ### Description
-If `gpuType` is specified in the job, only that type of GPU will be allocated to the job, otherwise, any type of GPU can be allocated.
+`skuType` is the leaf `cellType`, which does not have any internal topology.
+
+If `skuType` is specified in the job, only that type of leaf cell will be allocated to the job; otherwise, any type of leaf cell can be allocated.

 This is similar to [K8S Labels and Selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels), but with [VC Safety](#VC-Safety) guaranteed.

 ### Reproduce Steps
-#### `gpuType` specified
+#### `skuType` specified
 1. Use [hived-config-2](file/hived-config-2.yaml).
 2. Submit job [itc-k80-type](file/itc-k80-type.yaml); it will be partially running (some tasks waiting because all the specified K80 GPUs are used).
 <img src="file/itc-k80-type.png" width="900"/>

-#### `gpuType` not specified
+#### `skuType` not specified
 1. Use [hived-config-2](file/hived-config-2.yaml).
 2. Submit job [itc-no-type](file/itc-no-type.yaml); it will be fully running, and some tasks are using K80 (10.151.41.18) while others are using M60 (10.151.41.26).
 <img src="file/itc-no-type.png" width="900"/>
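In the OpenPAI job protocol used by these examples, the SKU is selected per task role. The relevant fragment, as it appears in the itc-* job hunks later in this change (the priority class value is illustrative):

```yaml
jobPriorityClass: test
taskRoles:
  train:
    skuType: K80   # only K80 leaf cells will be allocated to this role
```

Omitting `skuType` lets the scheduler pick any leaf cell type available to the VC, as the "not specified" case above shows.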
@@ -135,7 +137,7 @@ One VC's [Guaranteed Job](#Guaranteed-Job) can preempt other VCs' [Opportunistic
 ## Topology-Aware Intra-VC Scheduling
 ### Description
-Within one VC, HiveD chooses nearest GPUs for one `AffinityGroup` in best effort.
+Within one VC, HiveD chooses the nearest leaf cells for one `AffinityGroup` on a best-effort basis.

 ### Reproduce Steps
 1. Use [hived-config-2](file/hived-config-2.yaml).
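An `AffinityGroup` is declared in the pod-scheduling-spec annotation, and HiveD packs all of its member pods onto nearby leaf cells. A minimal sketch (group name and member counts are illustrative; the fields follow the request examples later in this change):

```yaml
hivedscheduler.microsoft.com/pod-scheduling-spec: |-
  virtualCluster: VC1
  priority: 1000
  leafCellType: K80
  leafCellNumber: 1
  affinityGroup:
    name: JOBX/TRAIN        # all pods naming this group are scheduled as one gang
    members:
    - podNumber: 4
      leafCellNumber: 1
```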
@@ -147,40 +149,40 @@ Within one VC, HiveD chooses nearest GPUs for one `AffinityGroup` in best effort
 ## Work-Preserving Reconfiguration
 ### Description
-HiveD can be reconfigured without unnecessary user impacts, such as add/update/delete physical/virtual clusters, GPU types/topologies, etc.
+HiveD can be reconfigured without unnecessary user impacts, such as add/update/delete physical/virtual clusters, different device types/topologies, etc.

 ### Reproduce Steps
 #### PhysicalCluster Reconfig - Delete PhysicalCell
 1. Use [hived-config-2](file/hived-config-2.yaml).
-2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `gpuType`. Wait until it is running.
-3. Delete all M60 `gpuType` related PhysicalCells and VirtualCells from [hived-config-2](file/hived-config-2.yaml), i.e. becomes [hived-config-33](file/hived-config-33.yaml).
+2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `skuType`. Wait until it is running.
+3. Delete all M60 `skuType` related PhysicalCells and VirtualCells from [hived-config-2](file/hived-config-2.yaml), i.e. becomes [hived-config-33](file/hived-config-33.yaml).
 4. Use [hived-config-33](file/hived-config-33.yaml), and restart HiveD.
 5. The job will still run without any impact, but its M60 usage is ignored by HiveD.
 *However, normally, the job will still fail if the corresponding physical node is later deleted from K8S or unhealthy.*
 <img src="file/itc-reconfig-1.png" width="900"/>

 #### PhysicalCluster Reconfig - Add PhysicalCell
 1. Use [hived-config-33](file/hived-config-33.yaml).
-2. Submit job [itc-k80-type](file/itc-k80-type.yaml) which requests K80 `gpuType`. Wait until it is running.
-3. Add all M60 `gpuType` related PhysicalCells and VirtualCells into [hived-config-33](file/hived-config-33.yaml), i.e. becomes [hived-config-2](file/hived-config-2.yaml).
+2. Submit job [itc-k80-type](file/itc-k80-type.yaml) which requests K80 `skuType`. Wait until it is running.
+3. Add all M60 `skuType` related PhysicalCells and VirtualCells into [hived-config-33](file/hived-config-33.yaml), i.e. becomes [hived-config-2](file/hived-config-2.yaml).
 4. Use [hived-config-2](file/hived-config-2.yaml), and restart HiveD.
 5. The job will still run without any impact, and its K80 usage is still accounted by HiveD.
 <img src="file/itc-k80-type.png" width="900"/>

 #### PhysicalCluster Reconfig - Update PhysicalCell - Add Node
 1. Use [hived-config-2](file/hived-config-2.yaml).
-2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `gpuType`. Wait until it is running.
+2. Submit job [itc-reconfig-1](file/itc-reconfig-1.yaml) which requests M60 `skuType`. Wait until it is running.
 3. Add one M60 node into a PhysicalCell, then becomes [hived-config-4](file/hived-config-4.yaml).
 4. Use [hived-config-4](file/hived-config-4.yaml), and restart HiveD.
 5. The job will still run without any impact, and its M60 usage is still accounted by HiveD.
 6. To confirm the job is not impacted, e.g. [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-2](file/itc-reconfig-2.yaml) which requests all M60 nodes and has the same priority as [itc-reconfig-1](file/itc-reconfig-1.yaml). The job will be waiting instead of preempting [itc-reconfig-1](file/itc-reconfig-1.yaml).
 <img src="file/itc-reconfig-2.png" width="900"/>

 #### PhysicalCluster Reconfig - Update PhysicalCell - Delete Node
 1. Use [hived-config-2](file/hived-config-2.yaml).
-2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) which requests K80 `gpuType`. Wait until it is running.
+2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) which requests K80 `skuType`. Wait until it is running.
 3. Delete one K80 node used by [itc-reconfig-3](file/itc-reconfig-3.yaml) from a PhysicalCell, then becomes [hived-config-7](file/hived-config-7.yaml).
 4. Use [hived-config-7](file/hived-config-7.yaml), and restart HiveD.
 5. The job will still run without any impact, but its deleted node usage is ignored by HiveD.
 *However, normally, the job will still fail if the corresponding physical node is later deleted from K8S or unhealthy.*
 <img src="file/itc-reconfig-3-1.png" width="900"/>

@@ -189,7 +191,7 @@ HiveD can be reconfigured without unnecessary user impacts, such as add/update/d
 1. Use [hived-config-2](file/hived-config-2.yaml).
 2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) to the default VC. Wait until it is running.
 3. Delete the default VC and move its quota to VC1, then becomes [hived-config-5](file/hived-config-5.yaml).
 4. Use [hived-config-5](file/hived-config-5.yaml), and restart HiveD.
 5. The job will still run without any interruption but will be [lazy preempted](#Lazy-Preemption) by HiveD.
 <img src="file/itc-reconfig-3.png" width="900"/>
 6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-4](file/itc-reconfig-4.yaml) to VC1 which requests all K80 nodes. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).

@@ -199,7 +201,7 @@ HiveD can be reconfigured without unnecessary user impacts, such as add/update/d
 1. Use [hived-config-2](file/hived-config-2.yaml).
 2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) to the default VC. Wait until it is running.
 3. Move one K80-NODE cell from the default VC to VC1, then becomes [hived-config-6](file/hived-config-6.yaml).
 4. Use [hived-config-6](file/hived-config-6.yaml), and restart HiveD.
 5. The job will still run without any interruption but will be [lazy preempted](#Lazy-Preemption) by HiveD.
 6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-5](file/itc-reconfig-5.yaml) to VC1 which requests all K80 nodes. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).
 <img src="file/itc-reconfig-5.png" width="900"/>
@@ -1,7 +1,7 @@
 kubeApiServerAddress: http://10.151.41.16:8080

 physicalCluster:
-  gpuTypes:
+  skuTypes:
     K80:
       gpu: 1
       cpu: 4

(The identical hunk is applied in each of the hived-config-*.yaml example files touched by this commit.)
@@ -29,5 +29,5 @@ extras:
   jobPriorityClass: prod
   taskRoles:
     train:
-      gpuType: M60
+      skuType: M60
   submitFrom: submit-job-v2

@@ -42,7 +42,7 @@ extras:
   jobPriorityClass: prod
   taskRoles:
     vc1nopinned:
-      gpuType: K80
+      skuType: K80
       affinityGroupName: vc1nopinned
     vc1pinned:
       pinnedCellId: VC1-K80

The same one-line rename (`gpuType` -> `skuType`) under the `train` task role is applied in each of the remaining itc-*.yaml example jobs; those hunks differ only in their position (@ -27,5 to @ -29,5, and @ -29,4 for the two jobs without `submitFrom`), in `jobPriorityClass` (prod, test, or oppo), and in the SKU (K80 or M60).
@@ -7,8 +7,8 @@ jobPriorityClass: PROD
 taskRoles:
   a:
     taskNumber: 5
-    gpuType: K80
-    gpuNumber: 1
+    leafCellType: K80
+    leafCellNumber: 1
     affinityGroupName: null
 ---
 jobVC: VC2

@@ -17,8 +17,8 @@ jobPriorityClass: PROD
 taskRoles:
   a:
     taskNumber: 5
-    gpuType: K80
-    gpuNumber: 1
+    leafCellType: K80
+    leafCellNumber: 1
     affinityGroupName: null
 ---
 jobVC: VC2

@@ -28,7 +28,7 @@ taskRoles:
   a:
     taskNumber: 5
     pinnedCellId: VC2-K80
-    gpuNumber: 1
+    leafCellNumber: 1
     affinityGroupName: null

 ---

@@ -36,7 +36,7 @@ taskRoles:
 # [Optional]: Cluster Admin -> RestServer Config -> PC
 ################################################################################
 physicalCluster:
-  gpuTypes:
+  leafCellTypes:
     K80:
       gpu: 1
       cpu: 4

@@ -71,8 +71,8 @@ spec:
       hivedscheduler.microsoft.com/pod-scheduling-spec: |-
         virtualCluster: VC1
         priority: 1000
-        gpuType: K80
-        gpuNumber: 1
+        leafCellType: K80
+        leafCellNumber: 1
         affinityGroup: null
     spec:
       schedulerName: hivedscheduler

@@ -91,7 +91,7 @@ spec:
       - name: NVIDIA_VISIBLE_DEVICES
         valueFrom:
           fieldRef:
-            fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
+            fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
 ---
 apiVersion: frameworkcontroller.microsoft.com/v1
 kind: Framework

The remaining hunks of this file (@ -118,8 through @ -282,4) repeat the same two renames for the VC2 Framework and for the standalone Pod examples (VC1, VC2, and the pinned VC2 Pod that uses `pinnedCellId: VC2-K80`): `gpuType`/`gpuNumber` become `leafCellType`/`leafCellNumber` in each pod-scheduling-spec annotation, and each `NVIDIA_VISIBLE_DEVICES` `fieldPath` switches from the `pod-gpu-isolation` annotation to `pod-leaf-cell-isolation`.
@@ -2,8 +2,8 @@
 # [Optional]: Job User -> RestServer Request
 #
 # Constraints:
-# 1. For one task, only need to specify gpuType or pinnedCellId, not both.
-# 2. All gpuTypes or pinnedCellIds under the same affinityGroup must be the same.
+# 1. For one task, only need to specify leafCellType or pinnedCellId, not both.
+# 2. All leafCellTypes or pinnedCellIds under the same affinityGroup must be the same.
 #
 # affinityGroupName:
 #   An affinityGroup forms a cell request and scheduler will try all candidate

@@ -27,63 +27,63 @@ taskRoles:
   # All tasks in role A, B, C should be within the same cell named PCN-ABC.
   #
   # Total request of PCN-ABC:
-  #   gpuType: DGX2-V100
-  #   gpuNumber: 1 * 16 + 3 * 8 + 1 * 4 = 44 GPUs = 2.75 DGX2 nodes
+  #   leafCellType: DGX2-V100
+  #   leafCellNumber: 1 * 16 + 3 * 8 + 1 * 4 = 44 GPUs = 2.75 DGX2 nodes
   # Candidate cellTypes:
   #   3-DGX2-NODE, 4-DGX2-NODE, 4-DIRECT-DGX2-NODE, 5-DGX2-NODE.
   A:
     taskNumber: 1
-    gpuType: DGX2-V100
-    gpuNumber: 16
+    leafCellType: DGX2-V100
+    leafCellNumber: 16
     affinityGroupName: PCN-ABC
   B:
     taskNumber: 3
-    gpuType: DGX2-V100
-    gpuNumber: 8
+    leafCellType: DGX2-V100
+    leafCellNumber: 8
     affinityGroupName: PCN-ABC
   C:
     taskNumber: 1
-    gpuType: DGX2-V100
-    gpuNumber: 4
+    leafCellType: DGX2-V100
+    leafCellNumber: 4
     affinityGroupName: PCN-ABC

   # All tasks in role D should be within the same cell named PCN-D.
   #
   # Total request of PCN-D:
-  #   gpuType: null -> any gpuType
-  #   gpuNumber: 2 * CP2(3) = 2 * 4 = 8 GPUs
+  #   leafCellType: null -> any leafCellType
+  #   leafCellNumber: 2 * CP2(3) = 2 * 4 = 8 GPUs
   # Candidate cellTypes:
   #   DGX1-P100-NODE, DGX1-V100-NODE, DGX2-NODE-8-GPU, IB-DGX2-NODE-8-GPU.
   D:
     taskNumber: 2
-    gpuType: null # null, empty or not specified -> any gpuType
-    gpuNumber: 3
+    leafCellType: null # null, empty or not specified -> any leafCellType
+    leafCellNumber: 3
     affinityGroupName: PCN-D

   # Tasks in role E are not required to be within the same cell.
   #
   # Each task forms a cell request:
-  #   gpuType: DGX2-V100
-  #   gpuNumber: 1 * 16 = 16 GPUs = 1 DGX2 node
+  #   leafCellType: DGX2-V100
+  #   leafCellNumber: 1 * 16 = 16 GPUs = 1 DGX2 node
   # Candidate cellTypes:
   #   DGX2-NODE.
   E:
     taskNumber: 2
-    gpuType: DGX2-V100
-    gpuNumber: 16
+    leafCellType: DGX2-V100
+    leafCellNumber: 16
     affinityGroupName: null # null, empty or not specified -> no affinityGroup

   # All tasks in role F should be within the same cell named PCN-F.
   #
   # Total request of PCN-F:
   #   pinnedCellId: VC1-YQW-IB-DGX2
-  #   gpuNumber: 2 * CP2(3) = 2 * 4 = 8 GPUs
+  #   leafCellNumber: 2 * CP2(3) = 2 * 4 = 8 GPUs
   # Candidate physicalCells:
   #   VC1-YQW-IB-DGX2.
   F:
     taskNumber: 2
     pinnedCellId: VC1-YQW-IB-DGX2
-    gpuNumber: 3
+    leafCellNumber: 3
     affinityGroupName: PCN-F

 ---

@@ -94,10 +94,10 @@ taskRoles:
 #   things in advance.
 #   For example:
 #   Pod Spec cpu, memory.
-#   1. Given gpuType or pinnedCellId, just pick the corresponding cpu, memory unit.
-#   2. No gpuType or pinnedCellId is given, choose the minimal cpu, memory unit.
+#   1. Given leafCellType or pinnedCellId, just pick the corresponding cpu, memory unit.
+#   2. No leafCellType or pinnedCellId is given, choose the minimal cpu, memory unit.
 physicalCluster:
-  gpuTypes:
+  leafCellTypes:
     # Check resource value format in
     #   k8s.io/apimachinery/pkg/api/resource/quantity.go

@@ -146,17 +146,17 @@ spec:
       hivedscheduler.microsoft.com/pod-scheduling-spec: |-
         virtualCluster: VC1
         priority: 1000
-        gpuType: DGX2-V100
-        gpuNumber: 16
+        leafCellType: DGX2-V100
+        leafCellNumber: 16
         affinityGroup:
           name: JOBX/PCN-ABC
           members:
           - podNumber: 1
-            gpuNumber: 16
+            leafCellNumber: 16
           - podNumber: 3
-            gpuNumber: 8
+            leafCellNumber: 8
           - podNumber: 1
-            gpuNumber: 4
+            leafCellNumber: 4
     spec:
       # See ../../run/deploy.yaml for why and how to specify the schedulerName.
       schedulerName: hivedscheduler

@@ -185,7 +185,7 @@ spec:
         valueFrom:
           fieldRef:
             # This annotation will be populated by the scheduler when it binds the pod.
-            fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
+            fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
       # K8S port scheduling is incompatible with HiveD, so the job should detect
       # port conflict by itself and fail with transient error, then controller
      # should retry it with new port.
The remaining hunks of this file (@ -200,17 through @ -542,4) apply the same two renames to every other task role (B, C, D, E, F) and to the corresponding standalone Pod examples: in each `hivedscheduler.microsoft.com/pod-scheduling-spec` annotation, `gpuType`/`gpuNumber` become `leafCellType`/`leafCellNumber` and every `gpuNumber` under `affinityGroup.members` becomes `leafCellNumber` (role F keeps `pinnedCellId: VC1-YQW-IB-DGX2` and only renames the numbers), and in every container env the `NVIDIA_VISIBLE_DEVICES` `fieldPath` switches from the `pod-gpu-isolation` annotation to `pod-leaf-cell-isolation`.
@@ -26,8 +26,8 @@ spec:
       hivedscheduler.microsoft.com/pod-scheduling-spec: |-
         virtualCluster: VC2
         priority: 1000
-        gpuType: K80
-        gpuNumber: 1
+        leafCellType: K80
+        leafCellNumber: 1
         affinityGroup: null
     spec:
       schedulerName: hivedscheduler

@@ -58,7 +58,7 @@ spec:
       - name: NVIDIA_VISIBLE_DEVICES
         valueFrom:
           fieldRef:
-            fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
+            fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']
       volumeMounts:
       - name: frameworkbarrier-volume
         mountPath: "/mnt/frameworkbarrier"

The same two hunks repeat at @ -95,8 and @ -127,7 for the other task role in this example.
@@ -49,7 +49,7 @@ data:
     webServerAddress: ":30096"
     waitingPodSchedulingBlockMilliSec: 50
     physicalCluster:
-      gpuTypes:
+      skuTypes:
        K80:
          gpu: 1
          cpu: 5
@@ -24,11 +24,12 @@ package algorithm
 import (
     "fmt"

     "github.com/microsoft/hivedscheduler/pkg/api"
     "k8s.io/klog"
 )

-// A Cell represents a set of GPUs affinitized by their interconnection topology.
+// A Cell represents a set of leaf cells affinitized by their interconnection topology.
 // Cells are organized as a tree through pointers to their parents / children.
 type Cell interface {
     GetChain() CellChain

@@ -41,9 +42,9 @@ type Cell interface {
     AtOrHigherThanNode() bool
     GetPriority() CellPriority
     SetPriority(CellPriority)
-    GetTotalGpuNum() int32
-    GetUsedGpuNumAtPriorities() map[CellPriority]int32
-    IncreaseUsedGpuNumAtPriority(CellPriority, int32)
+    GetTotalLeafCellNum() int32
+    GetUsedLeafCellNumAtPriorities() map[CellPriority]int32
+    IncreaseUsedLeafCellNumAtPriority(CellPriority, int32)
 }

 func CellEqual(c1 Cell, c2 Cell) bool {

@@ -65,9 +66,9 @@ type GenericCell struct {
     state CellState
     // A cell is healthy if all of the cell's children are healthy (bad if any child is bad).
     // The healthy field is orthogonal to priority and state.
     healthy bool
-    totalGpuNum            int32                  // total GPU number of a cell
-    usedGpuNumAtPriorities map[CellPriority]int32 // GPU number used by each priority
+    totalLeafCellNum            int32                  // total leaf cell number of a cell
+    usedLeafCellNumAtPriorities map[CellPriority]int32 // leaf cell number used by each priority
 }

 func (c *GenericCell) GetChain() CellChain {

@@ -110,18 +111,18 @@ func (c *GenericCell) IsHealthy() bool {
     return c.healthy
 }

-func (c *GenericCell) GetTotalGpuNum() int32 {
-    return c.totalGpuNum
+func (c *GenericCell) GetTotalLeafCellNum() int32 {
+    return c.totalLeafCellNum
 }

-func (c *GenericCell) GetUsedGpuNumAtPriorities() map[CellPriority]int32 {
-    return c.usedGpuNumAtPriorities
+func (c *GenericCell) GetUsedLeafCellNumAtPriorities() map[CellPriority]int32 {
+    return c.usedLeafCellNumAtPriorities
 }

-func (c *GenericCell) IncreaseUsedGpuNumAtPriority(p CellPriority, delta int32) {
-    c.usedGpuNumAtPriorities[p] += delta
-    if c.usedGpuNumAtPriorities[p] == 0 {
-        delete(c.usedGpuNumAtPriorities, p)
+func (c *GenericCell) IncreaseUsedLeafCellNumAtPriority(p CellPriority, delta int32) {
+    c.usedLeafCellNumAtPriorities[p] += delta
+    if c.usedLeafCellNumAtPriorities[p] == 0 {
+        delete(c.usedLeafCellNumAtPriorities, p)
     }
 }

@@ -129,7 +130,7 @@ func (c *GenericCell) IncreaseUsedGpuNumAtPriority(p CellPriority, delta int32)
 type PhysicalCell struct {
     GenericCell
     nodes []string // node names inside the cell
-    gpuIndices []int32 // [-1] for cells at levels higher than node
+    leafCellIndices []int32 // [-1] for cells at levels higher than node
     usingGroup *AlgoAffinityGroup // affinity group using this cell
     reservingOrReservedGroup *AlgoAffinityGroup // affinity group that is reserving, or has reserved the cell (e.g., waiting for preemption)
     virtualCell *VirtualCell // points to the bound virtual cell

@@ -151,14 +152,14 @@ func NewPhysicalCell(
     return &PhysicalCell{
         GenericCell: GenericCell{
             chain:              c,
             level:              l,
             priority:           freePriority,
             address:            address,
             atOrHigherThanNode: g,
-            totalGpuNum:            n,
-            usedGpuNumAtPriorities: map[CellPriority]int32{},
+            totalLeafCellNum:            n,
+            usedLeafCellNumAtPriorities: map[CellPriority]int32{},
             state:              cellFree,
             // cells are set to healthy initially, and will be all set to bad in HivedAlgorithm.initBadNodes
             healthy: true,
         },

@@ -203,16 +204,16 @@ func (c *PhysicalCell) SetState(s CellState) {
 }

 func (c *PhysicalCell) GetPhysicalPlacement() ([]string, []int32) {
-    return c.nodes, c.gpuIndices
+    return c.nodes, c.leafCellIndices
 }

 func (c *PhysicalCell) GetPhysicalPlacementString() string {
-    return fmt.Sprintf("%v:%v", c.nodes, c.gpuIndices)
+    return fmt.Sprintf("%v:%v", c.nodes, c.leafCellIndices)
 }

-func (c *PhysicalCell) SetPhysicalResources(nodes []string, gpuIndices []int32) {
+func (c *PhysicalCell) SetPhysicalResources(nodes []string, leafCellIndices []int32) {
     c.nodes = nodes
-    c.gpuIndices = gpuIndices
+    c.leafCellIndices = leafCellIndices
 }

 func (c *PhysicalCell) AddUsingGroup(g *AlgoAffinityGroup) {

@@ -335,14 +336,14 @@ func NewVirtualCell(
     return &VirtualCell{
         GenericCell: GenericCell{
             chain:              c,
             level:              l,
             priority:           freePriority,
             address:            address,
             atOrHigherThanNode: g,
-            totalGpuNum:            n,
-            usedGpuNumAtPriorities: map[CellPriority]int32{},
+            totalLeafCellNum:            n,
+            usedLeafCellNumAtPriorities: map[CellPriority]int32{},
             state:              cellFree,
             // cells are set to healthy initially, and will be all set to bad in HivedAlgorithm.initBadNodes
             healthy: true,
         },
@@ -236,8 +236,8 @@ func getUsablePhysicalCells(
     }
     // prioritize the cells with fewer opportunistic pods (to reduce preemption)
     sort.SliceStable(candidates, func(i, j int) bool {
-        return candidates[i].GetUsedGpuNumAtPriorities()[opportunisticPriority] <
-            candidates[j].GetUsedGpuNumAtPriorities()[opportunisticPriority]
+        return candidates[i].GetUsedLeafCellNumAtPriorities()[opportunisticPriority] <
+            candidates[j].GetUsedLeafCellNumAtPriorities()[opportunisticPriority]
     })
     return usableCandidates
 }

@@ -382,7 +382,7 @@ func getUnboundVirtualCell(cl CellList) *VirtualCell {
 }

 // bindCell binds a virtual cell to a physical cell and its parent recursively.
-// bindCell always starts from the lowest level, i.e., GPU-level cells.
+// bindCell always starts from the lowest level, i.e., leaf-level cells.
 func bindCell(pc *PhysicalCell, vc *VirtualCell) {
     for vc.GetPhysicalCell() == nil {
         pc.SetVirtualCell(vc)

@@ -397,7 +397,7 @@ func bindCell(pc *PhysicalCell, vc *VirtualCell) {
 }

 // unbindCell unbinds a virtual cell with a physical cell and its parent recursively.
-// unbindCell always starts from the lowest level, i.e., GPU-level cells.
+// unbindCell always starts from the lowest level, i.e., leaf-level cells.
 func unbindCell(c *PhysicalCell) {
     boundVirtual := c.GetVirtualCell()
     for !boundVirtual.GetPhysicalCell().IsPinned() {

@@ -421,7 +421,7 @@ func unbindCell(c *PhysicalCell) {
 // setCellPriority sets priority for a cell and its parent recursively, guaranteeing that
 // the priority of a cell is the max of those of its children.
-// setCellPriority always starts from the lowest level, i.e., GPU-level cells.
+// setCellPriority always starts from the lowest level, i.e., leaf-level cells.
 func setCellPriority(c Cell, p CellPriority) {
     originalPriority := c.GetPriority()
     c.SetPriority(p)

@@ -440,15 +440,15 @@ func setCellPriority(c Cell, p CellPriority) {
     }
 }

-// updateUsedGpuNumAtPriority updates the number of used GPUs at a priority for a cell
+// updateUsedLeafCellNumAtPriority updates the number of used leaf cells at a priority for a cell
 // and its parent recursively.
-func updateUsedGpuNumAtPriority(c Cell, p CellPriority, increase bool) {
+func updateUsedLeafCellNumAtPriority(c Cell, p CellPriority, increase bool) {
     for c != nil {
         delta := int32(-1)
         if increase {
             delta = 1
         }
-        c.IncreaseUsedGpuNumAtPriority(p, delta)
+        c.IncreaseUsedLeafCellNumAtPriority(p, delta)
         c = c.GetParent()
     }
 }
@@ -24,21 +24,22 @@ package algorithm

import (
"fmt"
"strings"

"github.com/microsoft/hivedscheduler/pkg/api"
"github.com/microsoft/hivedscheduler/pkg/common"
"strings"
)

// internal wrapper for spec cellTypes
type cellChainElement struct {
cellType api.CellType // current cell type
level CellLevel // current cell level, leaf cell is 1
childCellType api.CellType // child cell type
childNumber int32 // child number
hasNode bool // current cell type is a node or above cell
isMultiNodes bool // current cell type is a multiple node cell
gpuType string // current cell gpu type
gpuNumber int32 // how many gpu in current cell
cellType api.CellType // current cell type
level CellLevel // current cell level, leaf cell is 1
childCellType api.CellType // child cell type
childNumber int32 // child number
hasNode bool // current cell type is a node or above cell
isMultiNodes bool // current cell type is a multiple node cell
leafCellType string // current cell leaf cell type
leafCellNumber int32 // how many leaf cell in current cell
}

type cellTypeConstructor struct {

@@ -66,14 +67,14 @@ func (c *cellTypeConstructor) addCellChain(ct api.CellType) {
if !ok {
// not found in raw spec, it's leaf cell
c.cellChainElements[ct] = &cellChainElement{
cellType: ct,
level: lowestLevel,
childCellType: "",
childNumber: 0,
hasNode: false,
isMultiNodes: false,
gpuType: string(ct),
gpuNumber: 1,
cellType: ct,
level: lowestLevel,
childCellType: "",
childNumber: 0,
hasNode: false,
isMultiNodes: false,
leafCellType: string(ct),
leafCellNumber: 1,
}
return
}

@@ -87,14 +88,14 @@ func (c *cellTypeConstructor) addCellChain(ct api.CellType) {
// child cell type has been added, added current element,
cct := c.cellChainElements[child]
c.cellChainElements[ct] = &cellChainElement{
cellType: ct,
level: cct.level + 1,
childCellType: cct.cellType,
childNumber: ctSpec.ChildCellNumber,
hasNode: cct.hasNode || ctSpec.IsNodeLevel,
isMultiNodes: cct.hasNode,
gpuType: cct.gpuType,
gpuNumber: cct.gpuNumber * ctSpec.ChildCellNumber,
cellType: ct,
level: cct.level + 1,
childCellType: cct.cellType,
childNumber: ctSpec.ChildCellNumber,
hasNode: cct.hasNode || ctSpec.IsNodeLevel,
isMultiNodes: cct.hasNode,
leafCellType: cct.leafCellType,
leafCellNumber: cct.leafCellNumber * ctSpec.ChildCellNumber,
}
return
}
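Each non-leaf cellChainElement multiplies its child's leafCellNumber by ChildCellNumber, so the leaf cell count at any level is the product of the child counts below it. A small standalone sketch of that accumulation over a hypothetical chain (the numbers are illustrative, not from any real spec):

```go
package main

import "fmt"

// Hypothetical chain: leaf cell -> PCI-e switch (2 leaves) -> node (2 switches)
// -> rack (4 nodes). leafCellNumber at each level is the child's leafCellNumber
// times ChildCellNumber, exactly as addCellChain computes it.
func main() {
	childCellNumbers := []int32{2, 2, 4} // ChildCellNumber at each level above the leaf
	leafCellNumber := int32(1)           // a leaf cell always counts as 1
	for i, n := range childCellNumbers {
		leafCellNumber *= n
		fmt.Printf("level %d holds %d leaf cells\n", i+2, leafCellNumber)
	}
}
```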
@@ -155,7 +156,7 @@ func (c *physicalCellConstructor) buildChildCell(
return cellInstance
}
var currentCellNodes []string
var currentCellGpuIndices []int32
var currentCellLeafCellIndices []int32
var currentCellChildren CellList
for _, childSpec := range spec.CellChildren {
childCellInstance := c.buildChildCell(childSpec, ce.childCellType, currentNode)

@@ -165,18 +166,18 @@ func (c *physicalCellConstructor) buildChildCell(
// super-node cell merge child nodes
currentCellNodes = append(currentCellNodes, childCellInstance.nodes...)
} else {
// sub-node cell merge child node gpu indices
currentCellGpuIndices = append(currentCellGpuIndices, childCellInstance.gpuIndices...)
// sub-node cell merge child node leaf cell indices
currentCellLeafCellIndices = append(currentCellLeafCellIndices, childCellInstance.leafCellIndices...)
}
}
// update current cell children and resource
cellInstance.SetChildren(currentCellChildren)
if ce.isMultiNodes {
currentCellGpuIndices = []int32{-1}
currentCellLeafCellIndices = []int32{-1}
} else {
currentCellNodes = []string{currentNode}
}
cellInstance.SetPhysicalResources(currentCellNodes, currentCellGpuIndices)
cellInstance.SetPhysicalResources(currentCellNodes, currentCellLeafCellIndices)

return cellInstance
}

@@ -188,7 +189,7 @@ func (c *physicalCellConstructor) addCell(
address api.CellAddress) *PhysicalCell {

cellInstance := NewPhysicalCell(
c.buildingChain, ce.level, ce.hasNode, ce.gpuNumber, ce.cellType, address, ce.hasNode && !ce.isMultiNodes)
c.buildingChain, ce.level, ce.hasNode, ce.leafCellNumber, ce.cellType, address, ce.hasNode && !ce.isMultiNodes)
if _, ok := c.fullCellList[chain]; !ok {
c.fullCellList[chain] = ChainCellList{}
}

@@ -211,8 +212,8 @@ func (c *physicalCellConstructor) buildFullTree() *PhysicalCell {
panic(fmt.Sprintf("top cell must be node-level or above: %v", cc))
}
cellInstance := c.buildChildCell(c.buildingSpec, api.CellType(cc), "")
// set GPU type only for top-level cells (as a chain shares the same GPU type)
cellInstance.GetAPIStatus().GpuType = ce.gpuType
// set leaf cell type only for top-level cells (as a chain shares the same leaf cell type)
cellInstance.GetAPIStatus().LeafCellType = ce.leafCellType
return cellInstance
}

@@ -289,7 +290,7 @@ func (c *virtualCellConstructor) addCell(
c.buildingChain,
ce.level,
ce.hasNode,
ce.gpuNumber,
ce.leafCellNumber,
nil,
ce.cellType,
address,

@@ -345,8 +346,8 @@ func (c *virtualCellConstructor) buildFullTree(address api.CellAddress) *Virtual
panic(fmt.Sprintf("cellType %v in VirtualCells is not found in cell types definition", c.buildingChild))
}
cellInstance := c.buildChildCell(c.buildingChild, address)
// set GPU type only for top-level cells (as a chain shares the same GPU type)
cellInstance.GetAPIStatus().GpuType = ce.gpuType
// set leaf cell type only for top-level cells (as a chain shares the same leaf cell type)
cellInstance.GetAPIStatus().LeafCellType = ce.leafCellType
return cellInstance
}

@@ -418,23 +419,23 @@ func parseCellChainInfo(
map[CellChain]map[CellLevel]api.CellType,
map[string][]CellChain) {

cellLevelToGpuNum := map[CellChain]map[CellLevel]int32{}
cellLevelToLeafCellNum := map[CellChain]map[CellLevel]int32{}
cellLevelToType := map[CellChain]map[CellLevel]api.CellType{}
gpuTypeToChain := map[string][]CellChain{}
leafCellTypeToChain := map[string][]CellChain{}
for _, chain := range chains {
ce := cellChainElements[api.CellType(chain)]
gpuTypeToChain[ce.gpuType] = append(gpuTypeToChain[ce.gpuType], chain)
leafCellTypeToChain[ce.leafCellType] = append(leafCellTypeToChain[ce.leafCellType], chain)

cellLevelToGpuNum[chain] = map[CellLevel]int32{}
cellLevelToLeafCellNum[chain] = map[CellLevel]int32{}
cellLevelToType[chain] = map[CellLevel]api.CellType{}
ce, ok := cellChainElements[api.CellType(chain)]
for ok {
cellLevelToGpuNum[chain][ce.level] = ce.gpuNumber
cellLevelToLeafCellNum[chain][ce.level] = ce.leafCellNumber
cellLevelToType[chain][ce.level] = ce.cellType
ce, ok = cellChainElements[ce.childCellType]
}
}
return cellLevelToGpuNum, cellLevelToType, gpuTypeToChain
return cellLevelToLeafCellNum, cellLevelToType, leafCellTypeToChain

}

@@ -446,8 +447,8 @@ func ParseConfig(sConfig *api.Config) (
virtualNonPinnedFreeList map[api.VirtualClusterName]map[CellChain]ChainCellList, // vc:chain:level:[]virtualCell
virtualPinnedCells map[api.VirtualClusterName]map[api.PinnedCellId]ChainCellList, // vc:pinnedCellId:level:[]virtualCell
physicalPinnedCells map[api.VirtualClusterName]map[api.PinnedCellId]*PhysicalCell, // vc:pinnedCellId:PhysicalCell
cellLevelToGpuNum map[CellChain]map[CellLevel]int32, // chain:level:gpuNumber
gpuTypeToChain map[string][]CellChain, // gpuType:[]chain
cellLevelToLeafCellNum map[CellChain]map[CellLevel]int32, // chain:level:leafCellNumber
leafCellTypeToChain map[string][]CellChain, // leafCellType:[]chain
cellLevelToType map[CellChain]map[CellLevel]api.CellType, // chain:level:cellType
) {

@@ -470,7 +471,7 @@ func ParseConfig(sConfig *api.Config) (
for k := range physicalFullList {
cellChains = append(cellChains, k)
}
cellLevelToGpuNum, cellLevelToType, gpuTypeToChain = parseCellChainInfo(cellChainElements, cellChains)
cellLevelToLeafCellNum, cellLevelToType, leafCellTypeToChain = parseCellChainInfo(cellChainElements, cellChains)

return
}
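parseCellChainInfo flattens the chain elements into per-chain lookup tables; cellLevelToLeafCellNum in particular answers "how many leaf cells does a level-k cell of this chain contain". A sketch of what such a table might look like and how it could be consulted (hypothetical chain and counts, not taken from any real config):

```go
package main

import "fmt"

type CellChain string
type CellLevel int32

func main() {
	// Hypothetical numbers for a DGX2-like chain:
	// level 1 = leaf cell, level 2 = PCI-e switch, level 3 = node.
	cellLevelToLeafCellNum := map[CellChain]map[CellLevel]int32{
		"DGX2-V100": {1: 1, 2: 8, 3: 16},
	}
	chain, level := CellChain("DGX2-V100"), CellLevel(3)
	fmt.Printf("a level-%d cell in chain %s holds %d leaf cells\n",
		level, chain, cellLevelToLeafCellNum[chain][level])
}
```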
@@ -94,7 +94,7 @@ type HivedAlgorithm struct {

// bad nodes in the physical cluster
badNodes common.Set
// map each GPU type to all chains that contain this type
// map each leaf cell type to all chains that contain this type
cellChains map[string][]CellChain
// map each level in a chain to the specific cell type name
cellTypes map[CellChain]map[CellLevel]api.CellType

@@ -107,7 +107,7 @@ type HivedAlgorithm struct {
// NewHivedAlgorithm initializes a HivedAlgorithm from the config file.
func NewHivedAlgorithm(sConfig *api.Config) *HivedAlgorithm {
fullPcl, freePcl, vcFreeCellNum, nonPinnedFullVcl, nonPinnedFreeVcl, pinnedVcl, pinnedPcl,
gpuNums, chains, cellTypes := ParseConfig(sConfig)
leafCellNums, chains, cellTypes := ParseConfig(sConfig)

h := &HivedAlgorithm{
vcSchedulers: map[api.VirtualClusterName]intraVCScheduler{},

@@ -132,10 +132,10 @@ func NewHivedAlgorithm(sConfig *api.Config) *HivedAlgorithm {
for vcName := range nonPinnedFullVcl {
// TODO: Support per-VC configurable intra VC scheduling algo.
h.vcSchedulers[vcName] = newDefaultIntraVCScheduler(
nonPinnedFullVcl[vcName], nonPinnedFreeVcl[vcName], pinnedVcl[vcName], gpuNums)
nonPinnedFullVcl[vcName], nonPinnedFreeVcl[vcName], pinnedVcl[vcName], leafCellNums)
}
for chain, ccl := range h.fullCellList {
h.opportunisticSchedulers[chain] = NewTopologyAwareScheduler(ccl, gpuNums[chain], false)
h.opportunisticSchedulers[chain] = NewTopologyAwareScheduler(ccl, leafCellNums[chain], false)
}
h.initCellNums()
h.initAPIClusterStatus()

@@ -192,11 +192,11 @@ func (h *HivedAlgorithm) Schedule(
suggestedNodeSet.Add(n)
}
var (
groupPhysicalPlacement groupPhysicalPlacement // GPU number -> a set of pods -> a set of GPUs of each pod
groupVirtualPlacement groupVirtualPlacement // GPU number -> a set of pods -> a set of GPUs of each pod
groupPhysicalPlacement groupPhysicalPlacement // leaf cell number -> a set of pods -> a set of leaf cells of each pod
groupVirtualPlacement groupVirtualPlacement // leaf cell number -> a set of pods -> a set of leaf cells of each pod
preemptionVictims map[string]common.Set // node -> pods
waitReason string
podIndex int32 // index of current pod among those of the same GPU number in the group, 0 by default
podIndex int32 // index of current pod among those of the same leaf cell number in the group, 0 by default
)

if g := h.affinityGroups[s.AffinityGroup.Name]; g != nil {

@@ -215,7 +215,7 @@ func (h *HivedAlgorithm) Schedule(
preemptionVictims,
waitReason,
h.cellTypes,
s.GpuNumber,
s.LeafCellNumber,
podIndex,
h.affinityGroups[s.AffinityGroup.Name],
s.AffinityGroup.Name,

@@ -251,22 +251,22 @@ func (h *HivedAlgorithm) AddAllocatedPod(pod *core.Pod) {
s := internal.ExtractPodSchedulingSpec(pod)
info := internal.ExtractPodBindInfo(pod)
klog.Infof("[%v]: Adding allocated pod to affinity group %v...", internal.Key(pod), s.AffinityGroup.Name)
klog.Infof("[%v]: Adding to node %v, GPUs %v", internal.Key(pod), info.Node, common.ToJson(info.GpuIsolation))
klog.Infof("[%v]: Adding to node %v, leaf cells %v", internal.Key(pod), info.Node, common.ToJson(info.LeafCellIsolation))

podIndex := int32(0)
if g := h.affinityGroups[s.AffinityGroup.Name]; g != nil {
if g.state == groupPreempting {
h.allocatePreemptingAffinityGroup(g, pod)
}
if podIndex = getAllocatedPodIndex(info, s.GpuNumber); podIndex == -1 {
klog.Errorf("[%v]: Pod placement not found in group %v: node %v, GPUs %v",
internal.Key(pod), s.AffinityGroup.Name, info.Node, info.GpuIsolation)
if podIndex = getAllocatedPodIndex(info, s.LeafCellNumber); podIndex == -1 {
klog.Errorf("[%v]: Pod placement not found in group %v: node %v, leaf cells %v",
internal.Key(pod), s.AffinityGroup.Name, info.Node, info.LeafCellIsolation)
return
}
} else {
h.createAllocatedAffinityGroup(s, info, pod)
}
h.affinityGroups[s.AffinityGroup.Name].allocatedPods[s.GpuNumber][podIndex] = pod
h.affinityGroups[s.AffinityGroup.Name].allocatedPods[s.LeafCellNumber][podIndex] = pod
}

func (h *HivedAlgorithm) DeleteAllocatedPod(pod *core.Pod) {

@@ -276,18 +276,18 @@ func (h *HivedAlgorithm) DeleteAllocatedPod(pod *core.Pod) {
s := internal.ExtractPodSchedulingSpec(pod)
info := internal.ExtractPodBindInfo(pod)
klog.Infof("[%v]: Deleting allocated pod from affinity group %v...", internal.Key(pod), s.AffinityGroup.Name)
klog.Infof("[%v]: Deleting from node %v, GPUs %v", internal.Key(pod), info.Node, common.ToJson(info.GpuIsolation))
klog.Infof("[%v]: Deleting from node %v, leaf cells %v", internal.Key(pod), info.Node, common.ToJson(info.LeafCellIsolation))

if g := h.affinityGroups[s.AffinityGroup.Name]; g == nil {
klog.Errorf("[%v]: Group %v not found when deleting pod", internal.Key(pod), s.AffinityGroup.Name)
return
} else {
if podIndex := getAllocatedPodIndex(info, s.GpuNumber); podIndex == -1 {
klog.Errorf("[%v]: Pod placement not found in group %v: node %v, GPUs %v",
internal.Key(pod), s.AffinityGroup.Name, info.Node, info.GpuIsolation)
if podIndex := getAllocatedPodIndex(info, s.LeafCellNumber); podIndex == -1 {
klog.Errorf("[%v]: Pod placement not found in group %v: node %v, leaf cells %v",
internal.Key(pod), s.AffinityGroup.Name, info.Node, info.LeafCellIsolation)
return
} else {
g.allocatedPods[s.GpuNumber][podIndex] = nil
g.allocatedPods[s.LeafCellNumber][podIndex] = nil
}
if allPodsReleased(g.allocatedPods) {
h.deleteAllocatedAffinityGroup(g, pod)
@@ -470,11 +470,11 @@ func (h *HivedAlgorithm) setBadNode(nodeName string) {
}
h.badNodes.Add(nodeName)
for _, ccl := range h.fullCellList {
for _, gpu := range ccl[1] {
pGpu := gpu.(*PhysicalCell)
nodes, _ := pGpu.GetPhysicalPlacement()
for _, leafCell := range ccl[1] {
pLeafCell := leafCell.(*PhysicalCell)
nodes, _ := pLeafCell.GetPhysicalPlacement()
if nodes[0] == nodeName {
h.setBadCell(pGpu)
h.setBadCell(pLeafCell)
}
}
}

@@ -487,11 +487,11 @@ func (h *HivedAlgorithm) setHealthyNode(nodeName string) {
}
h.badNodes.Delete(nodeName)
for _, ccl := range h.fullCellList {
for _, gpu := range ccl[1] {
pGpu := gpu.(*PhysicalCell)
nodes, _ := pGpu.GetPhysicalPlacement()
for _, leafCell := range ccl[1] {
pLeafCell := leafCell.(*PhysicalCell)
nodes, _ := pLeafCell.GetPhysicalPlacement()
if nodes[0] == nodeName {
h.setHealthyCell(pGpu)
h.setHealthyCell(pLeafCell)
}
}
}

@@ -499,7 +499,7 @@ func (h *HivedAlgorithm) setHealthyNode(nodeName string) {

// setBadCell marks a physical cell (and also the virtual cell it is bound to) as bad,
// and recursively for its parent, guaranteeing that a cell is bad if any of its children is bad.
// setBadCell always starts from the lowest level, i.e., GPU-level cells.
// setBadCell always starts from the lowest level, i.e., leaf-level cells.
func (h *HivedAlgorithm) setBadCell(c *PhysicalCell) {
if !c.IsHealthy() {
return

@@ -522,7 +522,7 @@ func (h *HivedAlgorithm) setBadCell(c *PhysicalCell) {

// setHealthyCell marks a physical cell (and also the virtual cell it is bound to) as healthy,
// and recursively for its parent, guaranteeing that a cell is healthy if all of its children are healthy.
// setHealthy always starts from the lowest level, i.e., GPU-level cells.
// setHealthy always starts from the lowest level, i.e., leaf-level cells.
func (h *HivedAlgorithm) setHealthyCell(c *PhysicalCell) {
if c.IsHealthy() {
return
@@ -667,23 +667,23 @@ func (h *HivedAlgorithm) schedulePodFromExistingGroup(
podIndex int32) {

badOrNonSuggestedNodes := collectBadOrNonSuggestedNodes(
g.physicalGpuPlacement, suggestedNodes, g.ignoreK8sSuggestedNodes)
g.physicalLeafCellPlacement, suggestedNodes, g.ignoreK8sSuggestedNodes)
// state of an existing group can be either Allocated or Preempting
if g.state == groupAllocated {
klog.Infof("[%v]: Pod is from an affinity group that is already allocated: %v",
internal.Key(pod), s.AffinityGroup.Name)
groupPhysicalPlacement = g.physicalGpuPlacement
groupVirtualPlacement = g.virtualGpuPlacement
groupPhysicalPlacement = g.physicalLeafCellPlacement
groupVirtualPlacement = g.virtualLeafCellPlacement
if !badOrNonSuggestedNodes.IsEmpty() {
// for an allocated group, we always insist the previous scheduling decision
// even if some pods are now bad or not within suggested nodes
klog.Warningf("[%v]: Some nodes allocated to affinity group %v are no longer "+
"healthy and within K8s suggested nodes: %v", internal.Key(pod), g.name, badOrNonSuggestedNodes)
}
if podIndex = getNewPodIndex(g.allocatedPods[s.GpuNumber]); podIndex == -1 {
if podIndex = getNewPodIndex(g.allocatedPods[s.LeafCellNumber]); podIndex == -1 {
panic(internal.NewBadRequestError(fmt.Sprintf(
"Requesting more pods than the configured number for %v GPUs (%v pods) in affinity group %v",
s.GpuNumber, g.totalPodNums[s.GpuNumber], s.AffinityGroup.Name)))
"Requesting more pods than the configured number for %v leaf cells (%v pods) in affinity group %v",
s.LeafCellNumber, g.totalPodNums[s.LeafCellNumber], s.AffinityGroup.Name)))
}
} else { // groupPreempting
klog.Infof("[%v]: Pod is from an affinity group that is preempting others: %v",

@@ -698,8 +698,8 @@ func (h *HivedAlgorithm) schedulePodFromExistingGroup(
internal.Key(pod), g.name, badOrNonSuggestedNodes)
h.deletePreemptingAffinityGroup(g, pod)
} else {
groupPhysicalPlacement = g.physicalGpuPlacement
groupVirtualPlacement = g.virtualGpuPlacement
groupPhysicalPlacement = g.physicalLeafCellPlacement
groupVirtualPlacement = g.virtualLeafCellPlacement
preemptionVictims, _ = collectPreemptionVictims(groupPhysicalPlacement)
if len(preemptionVictims) == 0 {
klog.Infof(

@@ -751,7 +751,7 @@ func (h *HivedAlgorithm) schedulePodFromNewGroup(
return groupPhysicalPlacement, groupVirtualPlacement, preemptionVictims, waitReason
}

// scheduleNewAffinityGroup schedules each pod of a new affinity group to a set of GPUs
// scheduleNewAffinityGroup schedules each pod of a new affinity group to a set of leaf cells
// (in both the physical cluster and the VC). This is the entrance of a new scheduling attempt.
func (h *HivedAlgorithm) scheduleNewAffinityGroup(
pod *core.Pod,

@@ -773,33 +773,33 @@ func (h *HivedAlgorithm) scheduleNewAffinityGroup(
ignoreSuggestedNodes: s.IgnoreK8sSuggestedNodes,
}
for _, m := range s.AffinityGroup.Members {
// we will merge group members with same GPU number
sr.affinityGroupPodNums[m.GpuNumber] += m.PodNumber
// we will merge group members with same leaf cell number
sr.affinityGroupPodNums[m.LeafCellNumber] += m.PodNumber
}
h.validateSchedulingRequest(sr, pod)
if sr.pinnedCellId != "" {
klog.Infof("Using pinned cell %v", s.PinnedCellId)
physicalPlacement, virtualPlacement, failedReason = h.handleSchedulingRequest(sr)
} else if s.GpuType != "" {
if _, ok := h.cellChains[s.GpuType]; !ok {
} else if s.LeafCellType != "" {
if _, ok := h.cellChains[s.LeafCellType]; !ok {
panic(internal.NewBadRequestError(fmt.Sprintf(
"[%v]: Pod requesting GPU type %v which the whole cluster does not have",
internal.Key(pod), s.GpuType)))
"[%v]: Pod requesting leaf cell type %v which the whole cluster does not have",
internal.Key(pod), s.LeafCellType)))
}
klog.Infof("Using specified GPU type %v", s.GpuType)
physicalPlacement, virtualPlacement, failedReason = h.scheduleAffinityGroupForGpuType(
sr, s.GpuType, pod, true)
klog.Infof("Using specified leaf cell type %v", s.LeafCellType)
physicalPlacement, virtualPlacement, failedReason = h.scheduleAffinityGroupForLeafCellType(
sr, s.LeafCellType, pod, true)
} else {
physicalPlacement, virtualPlacement, failedReason = h.scheduleAffinityGroupForAnyGpuType(sr, pod)
physicalPlacement, virtualPlacement, failedReason = h.scheduleAffinityGroupForAnyLeafCellType(sr, pod)
}
return physicalPlacement, virtualPlacement, failedReason
}

// scheduleAffinityGroupForGpuType schedules an affinity group in a certain cell chain
// that matches the given GPU type.
func (h *HivedAlgorithm) scheduleAffinityGroupForGpuType(
// scheduleAffinityGroupForLeafCellType schedules an affinity group in a certain cell chain
// that matches the given leaf cell type.
func (h *HivedAlgorithm) scheduleAffinityGroupForLeafCellType(
sr schedulingRequest,
gpuType string,
leafCellType string,
pod *core.Pod,
typeSpecified bool) (
physicalPlacement groupPhysicalPlacement,

@@ -807,7 +807,7 @@ func (h *HivedAlgorithm) scheduleAffinityGroupForGpuType(
failedReason string) {

vcHasType := false
for _, chain := range h.cellChains[gpuType] {
for _, chain := range h.cellChains[leafCellType] {
if sr.priority < minGuaranteedPriority ||
h.vcSchedulers[sr.vc].getNonPinnedPreassignedCells()[chain] != nil {
vcHasType = true

@@ -822,15 +822,15 @@ func (h *HivedAlgorithm) scheduleAffinityGroupForGpuType(
}
if typeSpecified && sr.priority >= minGuaranteedPriority && !vcHasType {
panic(internal.NewBadRequestError(fmt.Sprintf(
"[%v]: Pod requesting GPU type %v which VC %v does not have",
internal.Key(pod), gpuType, sr.vc)))
"[%v]: Pod requesting leaf cell type %v which VC %v does not have",
internal.Key(pod), leafCellType, sr.vc)))
}
return nil, nil, failedReason
}

// scheduleAffinityGroupForAnyGpuType schedules an affinity group in every possible GPU type
// (when the user does not specify a GPU type).
func (h *HivedAlgorithm) scheduleAffinityGroupForAnyGpuType(
// scheduleAffinityGroupForAnyLeafCellType schedules an affinity group in every possible leaf cell type
// (when the user does not specify a leaf cell type).
func (h *HivedAlgorithm) scheduleAffinityGroupForAnyLeafCellType(
sr schedulingRequest,
pod *core.Pod) (
groupPhysicalPlacement,

@@ -838,10 +838,10 @@ func (h *HivedAlgorithm) scheduleAffinityGroupForAnyGpuType(
string) {

var failedReason string
for gpuType := range h.cellChains {
klog.Infof("Searching GPU type %v", gpuType)
for leafCellType := range h.cellChains {
klog.Infof("Searching leaf cell type %v", leafCellType)
typePhysicalPlacement, typeVirtualPlacement, typeFailedReason :=
h.scheduleAffinityGroupForGpuType(sr, gpuType, pod, false)
h.scheduleAffinityGroupForLeafCellType(sr, leafCellType, pod, false)
if typePhysicalPlacement != nil {
return typePhysicalPlacement, typeVirtualPlacement, ""
}
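When a pod specifies no leaf cell type, scheduleAffinityGroupForAnyLeafCellType simply tries each known type until one yields a placement. A stripped-down sketch of that fallback loop, with a stubbed per-type scheduler standing in for the real call (placeholder names and data only):

```go
package main

import "fmt"

// tryType stands in for scheduleAffinityGroupForLeafCellType: it returns a
// placement description, or "" when the given type cannot host the group.
func tryType(leafCellType string, leafCellNum int32) string {
	if leafCellType == "V100" && leafCellNum <= 8 {
		return "placed on a V100 node"
	}
	return ""
}

func main() {
	leafCellTypes := []string{"MI100", "V100"} // leaf cell types known to the cluster
	for _, t := range leafCellTypes {
		if placement := tryType(t, 8); placement != "" {
			fmt.Println("type", t, "->", placement)
			return
		}
		fmt.Println("type", t, "-> no placement, trying next type")
	}
}
```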
@@ -880,7 +880,7 @@ func (h *HivedAlgorithm) handleSchedulingRequest(
if sr.pinnedCellId != "" {
str = fmt.Sprintf("pinned cell %v", sr.pinnedCellId)
}
klog.Infof("Processing scheduling request: %v, GPU numbers %v, priority %v",
klog.Infof("Processing scheduling request: %v, leaf cell numbers %v, priority %v",
str, common.ToJson(sr.affinityGroupPodNums), sr.priority)
if sr.priority >= minGuaranteedPriority {
physicalPlacement, virtualPlacement, failedReason = h.scheduleGuaranteedAffinityGroup(sr)

@@ -910,10 +910,10 @@ func (h *HivedAlgorithm) scheduleGuaranteedAffinityGroup(
}
// map the vc placement to the physical cluster
bindings := map[api.CellAddress]*PhysicalCell{}
gpuNums := common.Int32MapKeys(sr.affinityGroupPodNums)
common.SortInt32(gpuNums)
lazyPreemptedGroups := h.tryLazyPreempt(virtualPlacement, gpuNums, sr.affinityGroupName)
preassignedCells, nonPreassignedCells := virtualPlacement.toBindingPaths(gpuNums, bindings)
leafCellNums := common.Int32MapKeys(sr.affinityGroupPodNums)
common.SortInt32(leafCellNums)
lazyPreemptedGroups := h.tryLazyPreempt(virtualPlacement, leafCellNums, sr.affinityGroupName)
preassignedCells, nonPreassignedCells := virtualPlacement.toBindingPaths(leafCellNums, bindings)
// make a copy of freeCellNum, may change its values during allocation
freeCellNumCopy := map[CellLevel]int32{}
for k, v := range h.allVCFreeCellNum[sr.chain] {

@@ -927,7 +927,7 @@ func (h *HivedAlgorithm) scheduleGuaranteedAffinityGroup(
sr.suggestedNodes,
sr.ignoreSuggestedNodes,
bindings); ok {
return virtualPlacement.toPhysicalPlacement(bindings, gpuNums), virtualPlacement, ""
return virtualPlacement.toPhysicalPlacement(bindings, leafCellNums), virtualPlacement, ""
}
for groupName, placement := range lazyPreemptedGroups {
h.revertLazyPreempt(h.affinityGroups[groupName], placement)

@@ -944,18 +944,18 @@ func (h *HivedAlgorithm) scheduleGuaranteedAffinityGroup(
// tryLazyPreempt tries to lazy preempt the affinity groups found on a placement.
func (h *HivedAlgorithm) tryLazyPreempt(
p groupVirtualPlacement,
gpuNums []int32,
leafCellNums []int32,
groupName string) map[string]groupVirtualPlacement {

preemptedGroups := map[string]groupVirtualPlacement{}
for _, podGpuNum := range gpuNums {
podPlacements := p[podGpuNum]
for _, podLeafCellNum := range leafCellNums {
podPlacements := p[podLeafCellNum]
for _, pod := range podPlacements {
for _, gpu := range pod {
if pGpu := gpu.(*VirtualCell).GetPhysicalCell(); pGpu != nil {
if pGpu.GetState() == cellUsed && pGpu.GetUsingGroup().lazyPreemptionEnable {
preemptedGroups[pGpu.GetUsingGroup().name] = h.lazyPreemptAffinityGroup(
pGpu.GetUsingGroup(), groupName)
for _, leafCell := range pod {
if pLeafCell := leafCell.(*VirtualCell).GetPhysicalCell(); pLeafCell != nil {
if pLeafCell.GetState() == cellUsed && pLeafCell.GetUsingGroup().lazyPreemptionEnable {
preemptedGroups[pLeafCell.GetUsingGroup().name] = h.lazyPreemptAffinityGroup(
pLeafCell.GetUsingGroup(), groupName)
}
}
}
@@ -985,46 +985,46 @@ func (h *HivedAlgorithm) createAllocatedAffinityGroup(s *api.PodSchedulingSpec,
s.AffinityGroup, s.VirtualCluster, s.LazyPreemptionEnable, s.Priority, groupAllocated)
shouldLazyPreempt := false
for _, gms := range info.AffinityGroupBindInfo {
gpuNumber := int32(len(gms.PodPlacements[0].PhysicalGpuIndices))
leafCellNumber := int32(len(gms.PodPlacements[0].PhysicalLeafCellIndices))
for podIndex := int32(0); podIndex < int32(len(gms.PodPlacements)); podIndex++ {
node := gms.PodPlacements[podIndex].PhysicalNode
for gpuIndex := int32(0); gpuIndex < int32(
len(gms.PodPlacements[podIndex].PhysicalGpuIndices)); gpuIndex++ {
pGpu, vGpu, lazyPreempt := h.findAllocatedGpu(
gpuIndex,
gms.PodPlacements[podIndex].PhysicalGpuIndices,
for leafCellIndex := int32(0); leafCellIndex < int32(
len(gms.PodPlacements[podIndex].PhysicalLeafCellIndices)); leafCellIndex++ {
pLeafCell, vLeafCell, lazyPreempt := h.findAllocatedLeafCell(
leafCellIndex,
gms.PodPlacements[podIndex].PhysicalLeafCellIndices,
gms.PodPlacements[podIndex].PreassignedCellTypes,
CellChain(info.CellChain), node, shouldLazyPreempt, s, newGroup, pod)
if pGpu == nil {
// pGpu not being found means that this GPU address does not exist in the spec.
// we simply ignore this GPU, and let the job run normally
// (but we cannot ignore the other GPUs of this pod that are still in the spec,
if pLeafCell == nil {
// pLeafCell not being found means that this leaf cell address does not exist in the spec.
// we simply ignore this leaf cell, and let the job run normally
// (but we cannot ignore the other leaf cells of this pod that are still in the spec,
// otherwise it may cause resource conflicts)
continue
} else {
newGroup.physicalGpuPlacement[gpuNumber][podIndex][gpuIndex] = pGpu
newGroup.physicalLeafCellPlacement[leafCellNumber][podIndex][leafCellIndex] = pLeafCell
if lazyPreempt == nil {
newGroup.virtualGpuPlacement = nil
} else if vGpu != nil {
newGroup.virtualGpuPlacement[gpuNumber][podIndex][gpuIndex] = vGpu
if inFreeCellList(pGpu) && vGpu.GetPreassignedCell().GetPriority() > freePriority {
newGroup.virtualLeafCellPlacement = nil
} else if vLeafCell != nil {
newGroup.virtualLeafCellPlacement[leafCellNumber][podIndex][leafCellIndex] = vLeafCell
if inFreeCellList(pLeafCell) && vLeafCell.GetPreassignedCell().GetPriority() > freePriority {
// This means we decide to bind this cell to a virtual cell whose preassigned cell
// has been bound (in cases like reconfiguration and the VC's cells are fewer than before).
// We need to destroy the previous binding, by lazy preempting all the groups
// in the preassigned cell
h.lazyPreemptCell(vGpu.GetPreassignedCell(), newGroup.name)
h.lazyPreemptCell(vLeafCell.GetPreassignedCell(), newGroup.name)
}
} else {
shouldLazyPreempt = shouldLazyPreempt || *lazyPreempt
}
// Even if we have successfully found the vGpu and pGpu, there is still one possibility
// Even if we have successfully found the vLeafCell and pLeafCell, there is still one possibility
// that we should not bind them: allocating the physical cell may lead to broken safety.
// Such case won't happen by design as buddy alloc guarantees safety; but this could
// happen due to inconsistency of VC assignments for reasons like reconfiguration.
// In this case, we will lazy preempt this affinity group.
safetyOk, reason := h.allocateGpu(pGpu, vGpu, CellPriority(s.Priority), newGroup.vc)
pGpu.AddUsingGroup(newGroup)
setCellState(pGpu, cellUsed)
safetyOk, reason := h.allocateLeafCell(pLeafCell, vLeafCell, CellPriority(s.Priority), newGroup.vc)
pLeafCell.AddUsingGroup(newGroup)
setCellState(pLeafCell, cellUsed)
if !safetyOk {
shouldLazyPreempt = true
klog.Warningf("[%v]: %v", internal.Key(pod), reason)

@@ -1045,22 +1045,22 @@ func (h *HivedAlgorithm) createAllocatedAffinityGroup(s *api.PodSchedulingSpec,
func (h *HivedAlgorithm) deleteAllocatedAffinityGroup(g *AlgoAffinityGroup, pod *core.Pod) {
klog.Infof("[%v]: All pods complete, deleting allocated affinity group: %v",
internal.Key(pod), g.name)
for _, podPlacements := range g.physicalGpuPlacement {
for _, podPlacements := range g.physicalLeafCellPlacement {
for _, podPlacement := range podPlacements {
for _, gpu := range podPlacement {
if gpu == nil {
for _, leafCell := range podPlacement {
if leafCell == nil {
continue
}
pGpu := gpu.(*PhysicalCell)
pGpu.DeleteUsingGroup(g)
// state of pGpu can be either Used or Reserving
if pGpu.GetState() == cellUsed {
h.releaseGpu(pGpu, g.vc)
setCellState(pGpu, cellFree)
pLeafCell := leafCell.(*PhysicalCell)
pLeafCell.DeleteUsingGroup(g)
// state of pLeafCell can be either Used or Reserving
if pLeafCell.GetState() == cellUsed {
h.releaseLeafCell(pLeafCell, g.vc)
setCellState(pLeafCell, cellFree)
} else { // cellReserving
// When pGpu is in Reserving state, we shouldn't call h.releaseGpu
// When pLeafCell is in Reserving state, we shouldn't call h.releaseLeafCell
// because it must have been allocated to the reserving group before
setCellState(pGpu, cellReserved)
setCellState(pLeafCell, cellReserved)
}
}
}
@@ -1082,26 +1082,26 @@ func (h *HivedAlgorithm) createPreemptingAffinityGroup(
klog.Infof("[%v]: Creating new preempting affinity group: %v", internal.Key(pod), s.AffinityGroup.Name)
newGroup := newAlgoAffinityGroup(
s.AffinityGroup, s.VirtualCluster, s.LazyPreemptionEnable, s.Priority, groupPreempting)
newGroup.physicalGpuPlacement = physicalPlacement
newGroup.virtualGpuPlacement = virtualPlacement
for gpuNum := range physicalPlacement {
for podIndex := range physicalPlacement[gpuNum] {
for gpuIndex, gpu := range physicalPlacement[gpuNum][podIndex] {
pGpu := gpu.(*PhysicalCell)
vGpu := virtualPlacement[gpuNum][podIndex][gpuIndex].(*VirtualCell)
if pGpu.GetState() == cellUsed {
usingGroup := pGpu.GetUsingGroup()
h.releaseGpu(pGpu, usingGroup.vc)
newGroup.physicalLeafCellPlacement = physicalPlacement
newGroup.virtualLeafCellPlacement = virtualPlacement
for leafCellNum := range physicalPlacement {
for podIndex := range physicalPlacement[leafCellNum] {
for leafCellIndex, leafCell := range physicalPlacement[leafCellNum][podIndex] {
pLeafCell := leafCell.(*PhysicalCell)
vLeafCell := virtualPlacement[leafCellNum][podIndex][leafCellIndex].(*VirtualCell)
if pLeafCell.GetState() == cellUsed {
usingGroup := pLeafCell.GetUsingGroup()
h.releaseLeafCell(pLeafCell, usingGroup.vc)
usingGroup.state = groupBeingPreempted
}
h.allocateGpu(pGpu, vGpu, CellPriority(s.Priority), newGroup.vc)
pGpu.AddReservingOrReservedGroup(newGroup)
// state of pGpu can be either Used or Free (if it was Reserving or Reserved,
h.allocateLeafCell(pLeafCell, vLeafCell, CellPriority(s.Priority), newGroup.vc)
pLeafCell.AddReservingOrReservedGroup(newGroup)
// state of pLeafCell can be either Used or Free (if it was Reserving or Reserved,
// we must have canceled the ongoing preemption before, in h.Schedule)
if pGpu.GetState() == cellUsed {
setCellState(pGpu, cellReserving)
if pLeafCell.GetState() == cellUsed {
setCellState(pLeafCell, cellReserving)
} else { // cellFree
setCellState(pGpu, cellReserved)
setCellState(pLeafCell, cellReserved)
}
}
}

@@ -1114,27 +1114,27 @@ func (h *HivedAlgorithm) createPreemptingAffinityGroup(
// deletePreemptingAffinityGroup revokes a preemption and deletes the affinity group that is
// still waiting for the completion of the preemption.
func (h *HivedAlgorithm) deletePreemptingAffinityGroup(g *AlgoAffinityGroup, pod *core.Pod) {
for gpuNum := range g.physicalGpuPlacement {
for podIndex := range g.physicalGpuPlacement[gpuNum] {
for _, gpu := range g.physicalGpuPlacement[gpuNum][podIndex] {
pGpu := gpu.(*PhysicalCell)
h.releaseGpu(pGpu, g.vc)
pGpu.DeleteReservingOrReservedGroup(pGpu.GetReservingOrReservedGroup())
// state of pGpu can be either Reserving or Reserved
if pGpu.GetState() == cellReserving {
setCellState(pGpu, cellUsed)
for leafCellNum := range g.physicalLeafCellPlacement {
for podIndex := range g.physicalLeafCellPlacement[leafCellNum] {
for _, leafCell := range g.physicalLeafCellPlacement[leafCellNum][podIndex] {
pLeafCell := leafCell.(*PhysicalCell)
h.releaseLeafCell(pLeafCell, g.vc)
pLeafCell.DeleteReservingOrReservedGroup(pLeafCell.GetReservingOrReservedGroup())
// state of pLeafCell can be either Reserving or Reserved
if pLeafCell.GetState() == cellReserving {
setCellState(pLeafCell, cellUsed)
// return the cell to the group being preempted
beingPreemptedGroup := pGpu.GetUsingGroup()
var beingPreemptedVGpu *VirtualCell
if beingPreemptedGroup.virtualGpuPlacement != nil {
beingPreemptedVGpu = retrieveVirtualCell(
beingPreemptedGroup.physicalGpuPlacement,
beingPreemptedGroup.virtualGpuPlacement, pGpu)
beingPreemptedGroup := pLeafCell.GetUsingGroup()
var beingPreemptedVLeafCell *VirtualCell
if beingPreemptedGroup.virtualLeafCellPlacement != nil {
beingPreemptedVLeafCell = retrieveVirtualCell(
beingPreemptedGroup.physicalLeafCellPlacement,
beingPreemptedGroup.virtualLeafCellPlacement, pLeafCell)
}
h.allocateGpu(
pGpu, beingPreemptedVGpu, CellPriority(beingPreemptedGroup.priority), beingPreemptedGroup.vc)
h.allocateLeafCell(
pLeafCell, beingPreemptedVLeafCell, CellPriority(beingPreemptedGroup.priority), beingPreemptedGroup.vc)
} else { // cellReserved
setCellState(pGpu, cellFree)
setCellState(pLeafCell, cellFree)
}
}
}
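The preemption paths in this file move a physical leaf cell through four states: Free, Used, Reserving (a preemptor waiting on a used cell) and Reserved (a preemptor holding a freed cell). A toy sketch of those transitions as lookup tables (illustrative only; the real scheduler keeps this state inside PhysicalCell):

```go
package main

import "fmt"

type cellState string

const (
	cellFree      cellState = "Free"
	cellUsed      cellState = "Used"
	cellReserving cellState = "Reserving"
	cellReserved  cellState = "Reserved"
)

// Transitions sketched from the preemption paths in this file: a preemptor
// reserves a used cell (Used -> Reserving) or takes a free one (Free -> Reserved);
// when the victim's pods are gone, Reserving becomes Reserved; allocating the
// preemptor turns either reserved state into Used.
var (
	onPreempt    = map[cellState]cellState{cellUsed: cellReserving, cellFree: cellReserved}
	onVictimGone = map[cellState]cellState{cellReserving: cellReserved}
	onAllocate   = map[cellState]cellState{cellReserving: cellUsed, cellReserved: cellUsed}
)

func main() {
	s := cellUsed
	s = onPreempt[s]
	s = onVictimGone[s]
	s = onAllocate[s]
	fmt.Println("final state:", s) // final state: Used
}
```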
@@ -1146,13 +1146,13 @@ func (h *HivedAlgorithm) deletePreemptingAffinityGroup(g *AlgoAffinityGroup, pod
// allocatePreemptingAffinityGroup lets a preemptor affinity group whose preemption has completed
// transition to allocated state.
func (h *HivedAlgorithm) allocatePreemptingAffinityGroup(g *AlgoAffinityGroup, pod *core.Pod) {
for gpuNum := range g.physicalGpuPlacement {
for podIndex := range g.physicalGpuPlacement[gpuNum] {
for _, gpu := range g.physicalGpuPlacement[gpuNum][podIndex] {
pGpu := gpu.(*PhysicalCell)
pGpu.DeleteReservingOrReservedGroup(g)
pGpu.AddUsingGroup(g)
setCellState(pGpu, cellUsed)
for leafCellNum := range g.physicalLeafCellPlacement {
for podIndex := range g.physicalLeafCellPlacement[leafCellNum] {
for _, leafCell := range g.physicalLeafCellPlacement[leafCellNum][podIndex] {
pLeafCell := leafCell.(*PhysicalCell)
pLeafCell.DeleteReservingOrReservedGroup(g)
pLeafCell.AddUsingGroup(g)
setCellState(pLeafCell, cellUsed)
}
}
}

@@ -1166,20 +1166,20 @@ func (h *HivedAlgorithm) allocatePreemptingAffinityGroup(g *AlgoAffinityGroup, p
func (h *HivedAlgorithm) lazyPreemptAffinityGroup(
victim *AlgoAffinityGroup,
preemptor string) (originalVirtualPlacement groupVirtualPlacement) {
for _, podVirtualPlacements := range victim.virtualGpuPlacement {
for _, podVirtualPlacements := range victim.virtualLeafCellPlacement {
for _, podVirtualPlacement := range podVirtualPlacements {
for _, gpu := range podVirtualPlacement {
if gpu != nil {
vGpu := gpu.(*VirtualCell)
pGpu := vGpu.GetPhysicalCell()
h.releaseGpu(pGpu, victim.vc)
h.allocateGpu(pGpu, nil, opportunisticPriority, victim.vc)
for _, leafCell := range podVirtualPlacement {
if leafCell != nil {
vLeafCell := leafCell.(*VirtualCell)
pLeafCell := vLeafCell.GetPhysicalCell()
h.releaseLeafCell(pLeafCell, victim.vc)
h.allocateLeafCell(pLeafCell, nil, opportunisticPriority, victim.vc)
}
}
}
}
originalVirtualPlacement = victim.virtualGpuPlacement
victim.virtualGpuPlacement = nil
originalVirtualPlacement = victim.virtualLeafCellPlacement
victim.virtualLeafCellPlacement = nil
victim.lazyPreemptionStatus = &api.LazyPreemptionStatus{
Preemptor: preemptor,
PreemptionTime: meta.Now(),

@@ -1200,30 +1200,30 @@ func (h *HivedAlgorithm) lazyPreemptCell(c *VirtualCell, preemptor string) {

// revertLazyPreempt reverts the lazy preemption of an affinity group.
func (h *HivedAlgorithm) revertLazyPreempt(g *AlgoAffinityGroup, virtualPlacement groupVirtualPlacement) {
for gpuNum := range g.physicalGpuPlacement {
for podIndex := range g.physicalGpuPlacement[gpuNum] {
for gpuIndex, gpu := range g.physicalGpuPlacement[gpuNum][podIndex] {
if gpu == nil {
for leafCellNum := range g.physicalLeafCellPlacement {
for podIndex := range g.physicalLeafCellPlacement[leafCellNum] {
for leafCellIndex, leafCell := range g.physicalLeafCellPlacement[leafCellNum][podIndex] {
if leafCell == nil {
continue
}
pGpu := gpu.(*PhysicalCell)
vGpu := virtualPlacement[gpuNum][podIndex][gpuIndex].(*VirtualCell)
h.releaseGpu(pGpu, g.vc)
h.allocateGpu(pGpu, vGpu, CellPriority(g.priority), g.vc)
pLeafCell := leafCell.(*PhysicalCell)
vLeafCell := virtualPlacement[leafCellNum][podIndex][leafCellIndex].(*VirtualCell)
h.releaseLeafCell(pLeafCell, g.vc)
h.allocateLeafCell(pLeafCell, vLeafCell, CellPriority(g.priority), g.vc)
}
}
}
g.virtualGpuPlacement = virtualPlacement
g.virtualLeafCellPlacement = virtualPlacement
g.lazyPreemptionStatus = nil
klog.Infof("Lazy preemption of affinity group %v is reverted", g.name)
}
// findAllocatedGpu finds the physical and virtual GPUs in the full cell lists for an allocate pod.
// findAllocatedLeafCell finds the physical and virtual leaf cells in the full cell lists for an allocate pod.
// The boolean return value indicates whether the affinity group should be lazy-preempted.
// The bool being nil means the group is OT and has no virtual placement.
func (h *HivedAlgorithm) findAllocatedGpu(
func (h *HivedAlgorithm) findAllocatedLeafCell(
index int32,
physicalGpuIndices []int32,
physicalLeafCellIndices []int32,
preassignedCellTypes []api.CellType,
chain CellChain,
node string,

@@ -1233,24 +1233,24 @@ func (h *HivedAlgorithm) findAllocatedGpu(
pod *core.Pod) (*PhysicalCell, *VirtualCell, *bool) {

priority := CellPriority(s.Priority)
physicalGpuIndex := physicalGpuIndices[index]
if pGpu := findPhysicalGpu(h.fullCellList, chain, node, physicalGpuIndex); pGpu == nil {
physicalLeafCellIndex := physicalLeafCellIndices[index]
if pLeafCell := findPhysicalLeafCell(h.fullCellList, chain, node, physicalLeafCellIndex); pLeafCell == nil {
klog.Warningf(
"[%v]: Cannot find GPU %v on node %v: not found in the spec. Pod ignored",
internal.Key(pod), physicalGpuIndex, node)
"[%v]: Cannot find leaf cell %v on node %v: not found in the spec. Pod ignored",
internal.Key(pod), physicalLeafCellIndex, node)
return nil, nil, common.PtrBool(false)
} else {
var vGpu *VirtualCell
var vLeafCell *VirtualCell
if preassignedCellTypes == nil {
klog.Warningf("[%v]: Cannot find virtual cell: preassigned cell not found in pod bind info", internal.Key(pod))
return pGpu, nil, common.PtrBool(true)
return pLeafCell, nil, common.PtrBool(true)
}
if group.virtualGpuPlacement != nil && !lazyPreempted {
if group.virtualLeafCellPlacement != nil && !lazyPreempted {
preassignedType := preassignedCellTypes[index]
if preassignedType != "" {
var preassignedLevel CellLevel
typeFound := false
for l, t := range h.cellTypes[pGpu.GetChain()] {
for l, t := range h.cellTypes[pLeafCell.GetChain()] {
if t == preassignedType {
preassignedLevel = l
typeFound = true

@@ -1258,12 +1258,12 @@ func (h *HivedAlgorithm) findAllocatedGpu(
}
var message string
if !typeFound {
message = fmt.Sprintf("Preassigned cell type %v not found in chain %v", preassignedType, pGpu.GetChain())
message = fmt.Sprintf("Preassigned cell type %v not found in chain %v", preassignedType, pLeafCell.GetChain())
} else if vcs := h.vcSchedulers[s.VirtualCluster]; vcs == nil {
message = fmt.Sprintf("VC %v not found", s.VirtualCluster)
} else {
vccl := vcs.getNonPinnedPreassignedCells()[pGpu.GetChain()]
str := string(pGpu.GetChain())
vccl := vcs.getNonPinnedPreassignedCells()[pLeafCell.GetChain()]
str := string(pLeafCell.GetChain())
if s.PinnedCellId != "" {
vccl = vcs.getPinnedCells()[s.PinnedCellId]
str = string(s.PinnedCellId)

@@ -1271,84 +1271,84 @@ func (h *HivedAlgorithm) findAllocatedGpu(
if vccl == nil {
message = fmt.Sprintf("VC %v has no cell for %v", s.VirtualCluster, str)
} else {
vGpu, message = mapPhysicalCellToVirtual(pGpu, vccl, preassignedLevel, priority)
vLeafCell, message = mapPhysicalCellToVirtual(pLeafCell, vccl, preassignedLevel, priority)
}
}
if vGpu == nil {
if vLeafCell == nil {
klog.Warningf("[%v]: Cannot find virtual cell: %v", internal.Key(pod), message)
return pGpu, nil, common.PtrBool(true)
return pLeafCell, nil, common.PtrBool(true)
} else {
return pGpu, vGpu, common.PtrBool(false)
return pLeafCell, vLeafCell, common.PtrBool(false)
}
} else {
return pGpu, nil, nil
return pLeafCell, nil, nil
}
} else {
return pGpu, nil, common.PtrBool(false)
return pLeafCell, nil, common.PtrBool(false)
}
}
}
// allocateGpu creates the cell bindings, allocates the preassigned cell (if necessary),
// allocateLeafCell creates the cell bindings, allocates the preassigned cell (if necessary),
// and sets the priority.
func (h *HivedAlgorithm) allocateGpu(
pGpu *PhysicalCell,
vGpu *VirtualCell,
func (h *HivedAlgorithm) allocateLeafCell(
pLeafCell *PhysicalCell,
vLeafCell *VirtualCell,
p CellPriority,
vcn api.VirtualClusterName) (safetyOk bool, reason string) {

safetyOk = true
if vGpu != nil {
setCellPriority(vGpu, p)
updateUsedGpuNumAtPriority(vGpu, p, true)
setCellPriority(pGpu, p)
updateUsedGpuNumAtPriority(pGpu, p, true)
pac := vGpu.GetPreassignedCell()
if vLeafCell != nil {
setCellPriority(vLeafCell, p)
updateUsedLeafCellNumAtPriority(vLeafCell, p, true)
setCellPriority(pLeafCell, p)
updateUsedLeafCellNumAtPriority(pLeafCell, p, true)
pac := vLeafCell.GetPreassignedCell()
preassignedNewlyBound := pac.GetPhysicalCell() == nil
if pGpu.GetVirtualCell() == nil {
if pLeafCell.GetVirtualCell() == nil {
// the binding could have been created before (when the cell is bad)
bindCell(pGpu, vGpu)
bindCell(pLeafCell, vLeafCell)
}
if preassignedNewlyBound {
safetyOk, reason = h.allocatePreassignedCell(pac.GetPhysicalCell(), vcn, false)
}
} else {
setCellPriority(pGpu, opportunisticPriority)
updateUsedGpuNumAtPriority(pGpu, opportunisticPriority, true)
pGpu.GetAPIStatus().VC = vcn
setCellPriority(pLeafCell, opportunisticPriority)
updateUsedLeafCellNumAtPriority(pLeafCell, opportunisticPriority, true)
pLeafCell.GetAPIStatus().VC = vcn
h.apiClusterStatus.VirtualClusters[vcn] = append(
h.apiClusterStatus.VirtualClusters[vcn], generateOTVirtualCell(pGpu.GetAPIStatus()))
h.apiClusterStatus.VirtualClusters[vcn], generateOTVirtualCell(pLeafCell.GetAPIStatus()))
}
return safetyOk, reason
}

// releaseGpu destroys the cell bindings, release the preassigned cell (if necessary),
// releaseLeafCell destroys the cell bindings, release the preassigned cell (if necessary),
// and resets the priority.
func (h *HivedAlgorithm) releaseGpu(pGpu *PhysicalCell, vcn api.VirtualClusterName) {
if vGpu := pGpu.GetVirtualCell(); vGpu != nil {
updateUsedGpuNumAtPriority(vGpu, vGpu.GetPriority(), false)
setCellPriority(vGpu, freePriority)
preassignedPhysical := vGpu.GetPreassignedCell().GetPhysicalCell()
if pGpu.IsHealthy() {
func (h *HivedAlgorithm) releaseLeafCell(pLeafCell *PhysicalCell, vcn api.VirtualClusterName) {
if vLeafCell := pLeafCell.GetVirtualCell(); vLeafCell != nil {
updateUsedLeafCellNumAtPriority(vLeafCell, vLeafCell.GetPriority(), false)
setCellPriority(vLeafCell, freePriority)
preassignedPhysical := vLeafCell.GetPreassignedCell().GetPhysicalCell()
if pLeafCell.IsHealthy() {
// we won't unbind the cell if it is bad
unbindCell(pGpu)
unbindCell(pLeafCell)
}
// To check if we should release the preassigned cell, we cannot simply check if the
// virtual cell is already unbound. It's possible that the cell is bad, then the binding
// won't be destroyed automatically (the cell is still bound only because it is bad).
// If the below condition is true, then the preassigned cell is not in real use and we can hence release it.
if !preassignedPhysical.IsPinned() && vGpu.GetPreassignedCell().GetPriority() < minGuaranteedPriority &&
if !preassignedPhysical.IsPinned() && vLeafCell.GetPreassignedCell().GetPriority() < minGuaranteedPriority &&
!h.vcDoomedBadCells[vcn][preassignedPhysical.GetChain()].contains(
preassignedPhysical, preassignedPhysical.GetLevel()) {
h.releasePreassignedCell(preassignedPhysical, vcn, false)
}
} else {
pGpu.GetAPIStatus().VC = ""
pLeafCell.GetAPIStatus().VC = ""
h.apiClusterStatus.VirtualClusters[vcn] = deleteOTVirtualCell(
h.apiClusterStatus.VirtualClusters[vcn], pGpu.GetAddress())
h.apiClusterStatus.VirtualClusters[vcn], pLeafCell.GetAddress())
}
updateUsedGpuNumAtPriority(pGpu, pGpu.GetPriority(), false)
setCellPriority(pGpu, freePriority)
updateUsedLeafCellNumAtPriority(pLeafCell, pLeafCell.GetPriority(), false)
setCellPriority(pLeafCell, freePriority)
}

// allocatePreassignedCell allocates a physical cell to a preassigned virtual cell, removes the physical cell
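The test fixtures below construct api.AffinityGroupSpec values with the renamed LeafCellNumber member field. A tiny standalone sketch of the same shape, using local stand-in types instead of the real api package:

```go
package main

import "fmt"

// Local stand-in types mirroring the shape of api.AffinityGroupSpec and
// api.AffinityGroupMemberSpec after the rename (illustrative only).
type AffinityGroupMemberSpec struct {
	PodNumber      int32
	LeafCellNumber int32
}

type AffinityGroupSpec struct {
	Name    string
	Members []AffinityGroupMemberSpec
}

func main() {
	group := AffinityGroupSpec{
		Name:    "example-group",
		Members: []AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
	}
	fmt.Printf("%s wants %d pod(s) with %d leaf cells each\n",
		group.Name, group.Members[0].PodNumber, group.Members[0].LeafCellNumber)
}
```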
@@ -67,106 +67,106 @@ var group1, group2, group3, group4, group5, group6, group7, group8, group9, grou
group15, group16, group17, group18, group19, group20, group21, group22, group23, group24, group25, group26, group27,
group28, group29, group30, group31, group32, group33, group34 = &api.AffinityGroupSpec{
Name: "group1",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 1}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 1}},
}, &api.AffinityGroupSpec{
Name: "group2",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 1}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 1}},
}, &api.AffinityGroupSpec{
Name: "group3",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 8}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 8}},
}, &api.AffinityGroupSpec{
Name: "group4",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 1}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 1}},
}, &api.AffinityGroupSpec{
Name: "group5",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group6",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 1}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 1}},
}, &api.AffinityGroupSpec{
Name: "group7",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 3, GpuNumber: 8}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 3, LeafCellNumber: 8}},
}, &api.AffinityGroupSpec{
Name: "group8",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 8}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 8}},
}, &api.AffinityGroupSpec{
Name: "group9",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 7}, {PodNumber: 1, GpuNumber: 5}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 7}, {PodNumber: 1, LeafCellNumber: 5}},
}, &api.AffinityGroupSpec{
Name: "group10",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 1}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 1}},
}, &api.AffinityGroupSpec{
Name: "group11",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group12",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group13",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group14",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group15",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 2}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 2}},
}, &api.AffinityGroupSpec{
Name: "group16",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 2}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 2}},
}, &api.AffinityGroupSpec{
Name: "group17",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 2}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 2}},
}, &api.AffinityGroupSpec{
Name: "group18",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group19",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group20",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group21",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group22",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group23",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group24",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group25",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group26",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group27",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 2, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group28",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group29",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 4, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 4, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group30",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group31",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group32",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group33",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}, &api.AffinityGroupSpec{
Name: "group34",
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, GpuNumber: 16}},
Members: []api.AffinityGroupMemberSpec{{PodNumber: 1, LeafCellNumber: 16}},
}

var pss = map[types.UID]api.PodSchedulingSpec{
@ -175,368 +175,368 @@ var pss = map[types.UID]api.PodSchedulingSpec{
|
|||
Priority: 0,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 1,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 1,
|
||||
AffinityGroup: group1,
|
||||
}, "pod2": { // buddy of pod1
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 1,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 1,
|
||||
AffinityGroup: group2,
|
||||
}, "pod3": { // non-buddy of pod 1 & 2 (avoidance of preemption)
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 2,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 8,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 8,
|
||||
AffinityGroup: group3,
|
||||
}, "pod4": { // opportunistic pod (will stay away from the guaranteed pods)
|
||||
VirtualCluster: "VC1",
|
||||
Priority: -1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 1,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 1,
|
||||
AffinityGroup: group4,
|
||||
}, "pod5": { // use pinned cell
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "VC1-YQW-DGX2",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group5,
|
||||
}, "pod6": { // use pinned cell
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "VC1-YQW-DGX2",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group5,
|
||||
}, "pod7": { // insufficient VC cells; should return PodWaitInfo
|
||||
VirtualCluster: "VC2",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX1-P100",
|
||||
GpuNumber: 8,
|
||||
LeafCellType: "DGX1-P100",
|
||||
LeafCellNumber: 8,
|
||||
AffinityGroup: group7,
|
||||
}, "pod8": { // any GPU type; heterogeneous affinity group
|
||||
}, "pod8": { // any leaf cell type; heterogeneous affinity group
|
||||
VirtualCluster: "VC2",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "",
|
||||
GpuNumber: 7,
|
||||
LeafCellType: "",
|
||||
LeafCellNumber: 7,
|
||||
AffinityGroup: group9,
|
||||
}, "pod9": { // any GPU type; heterogeneous affinity group
|
||||
}, "pod9": { // any leaf cell type; heterogeneous affinity group
|
||||
VirtualCluster: "VC2",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "",
|
||||
GpuNumber: 5,
|
||||
LeafCellType: "",
|
||||
LeafCellNumber: 5,
|
||||
AffinityGroup: group9,
|
||||
}, "pod10": { // use a GPU type that the VC does not have; should User Error Panic
|
||||
}, "pod10": { // use a leaf cell type that the VC does not have; should User Error Panic
|
||||
VirtualCluster: "VC2",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 1,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 1,
|
||||
AffinityGroup: group6,
|
||||
}, "pod11": { // invalid affinity group configuration
|
||||
VirtualCluster: "VC2",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX1-P100",
|
||||
GpuNumber: 2,
|
||||
LeafCellType: "DGX1-P100",
|
||||
LeafCellNumber: 2,
|
||||
AffinityGroup: group8,
|
||||
}, "pod12": { // invalid affinity group configuration
|
||||
VirtualCluster: "VC2",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX1-P100",
|
||||
GpuNumber: 2,
|
||||
LeafCellType: "DGX1-P100",
|
||||
LeafCellNumber: 2,
|
||||
AffinityGroup: group8,
|
||||
}, "pod13": { // invalid VC
|
||||
VirtualCluster: "surprise!",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX1-P100",
|
||||
GpuNumber: 1,
|
||||
LeafCellType: "DGX1-P100",
|
||||
LeafCellNumber: 1,
|
||||
AffinityGroup: group10,
|
||||
}, "pod14": { // invalid pinned cell
|
||||
VirtualCluster: "VC2",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "surprise!",
|
||||
GpuType: "DGX1-P100",
|
||||
GpuNumber: 1,
|
||||
LeafCellType: "DGX1-P100",
|
||||
LeafCellNumber: 1,
|
||||
AffinityGroup: group10,
|
||||
}, "pod15": { // invalid priority
|
||||
VirtualCluster: "VC2",
|
||||
Priority: 1001,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX1-P100",
|
||||
GpuNumber: 1,
|
||||
LeafCellType: "DGX1-P100",
|
||||
LeafCellNumber: 1,
|
||||
AffinityGroup: group10,
|
||||
}, "pod16": { // trigger preemption
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 2,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "VC1-YQW-DGX2",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group11,
|
||||
}, "pod17": { // trigger preemption
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 2,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "VC1-YQW-DGX2",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group11,
|
||||
}, "pod18": { // used for test splitting physical cell hierarchies in reconfiguration
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group12,
|
||||
}, "pod19": { // used for test splitting physical cell hierarchies in reconfiguration
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group12,
|
||||
}, "pod20": { // guaranteed pod in splitting physical cell hierarchies
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group13,
|
||||
}, "pod21": { // guaranteed pod in splitting physical cell hierarchies
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group13,
|
||||
}, "pod22": { // opportunistic pod in splitting physical cell hierarchies
|
||||
VirtualCluster: "VC1",
|
||||
Priority: -1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group14,
|
||||
}, "pod23": { // opportunistic pod in splitting physical cell hierarchies
|
||||
VirtualCluster: "VC1",
|
||||
Priority: -1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group14,
|
||||
}, "pod24": { // used for triggering intra-VC preemption
|
||||
VirtualCluster: "VC2",
|
||||
Priority: 0,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "CT1",
|
||||
GpuNumber: 2,
|
||||
LeafCellType: "CT1",
|
||||
LeafCellNumber: 2,
|
||||
AffinityGroup: group15,
|
||||
}, "pod25": { // trigger intra-VC preemption
|
||||
VirtualCluster: "VC2",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: false,
|
||||
PinnedCellId: "",
|
||||
GpuType: "CT1",
|
||||
GpuNumber: 2,
|
||||
LeafCellType: "CT1",
|
||||
LeafCellNumber: 2,
|
||||
AffinityGroup: group16,
|
||||
}, "pod26": { // will preempt pod25 immediately (as lazy preemption is not enabled)
|
||||
VirtualCluster: "VC2",
|
||||
Priority: 2,
|
||||
LazyPreemptionEnable: false,
|
||||
PinnedCellId: "",
|
||||
GpuType: "CT1",
|
||||
GpuNumber: 2,
|
||||
LeafCellType: "CT1",
|
||||
LeafCellNumber: 2,
|
||||
AffinityGroup: group17,
|
||||
}, "pod27": { // will be rejected because one of the pod in this group is allocated a non-suggested node
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: false,
|
||||
PinnedCellId: "VC1-YQW-DGX2",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group18,
|
||||
}, "pod28": { // used for stateful preemption test
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: false,
|
||||
PinnedCellId: "VC1-YQW-DGX2",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group19,
|
||||
}, "pod29": { // will try to preempt pod28
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 2,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "VC1-YQW-DGX2",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group20,
|
||||
}, "pod30": { // cannot get scheduled because pod28's still holding the resource
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "VC1-YQW-DGX2",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group21,
|
||||
}, "pod31": { // will try to preempt pod28, and will be scheduled to a different node from pod29
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 2,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "VC1-YQW-DGX2",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group22,
|
||||
}, "pod32": { // cannot get scheduled because VC1-YQW-DGX2 has been used up by pod29 and pod31
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 2,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "VC1-YQW-DGX2",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group23,
|
||||
}, "pod33": { // will cancel pod29 and pod31's preemption, and continue to preempt pod28
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 3,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "VC1-YQW-DGX2",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group24,
|
||||
}, "pod34": { // will cancel pod33's preemption, and get scheduled immediately (because pod28 has been deleted)
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 4,
|
||||
LazyPreemptionEnable: false,
|
||||
PinnedCellId: "VC1-YQW-DGX2",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group25,
|
||||
}, "pod35": { // will preempt pod34, and will be deleted before the preemption is done (so the preemption will be canceled)
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 5,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "VC1-YQW-DGX2",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group26,
|
||||
}, "pod36": { // will iterate the GPU types until find a placement within suggested nodes
|
||||
}, "pod36": { // will iterate the leaf cell types until find a placement within suggested nodes
|
||||
VirtualCluster: "VC1",
|
||||
Priority: -1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "",
|
||||
GpuNumber: 1,
|
||||
LeafCellType: "",
|
||||
LeafCellNumber: 1,
|
||||
AffinityGroup: group1,
|
||||
}, "pod37": { // used for test aware of suggested nodes in VC
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "VC1-YQW-DGX2",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 1,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 1,
|
||||
AffinityGroup: group1,
|
||||
}, "pod38": { // used for test aware of suggested nodes in VC
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "VC1-YQW-DGX2",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 1,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 1,
|
||||
AffinityGroup: group2,
|
||||
}, "pod39": { // used for triggering backtrack cell search
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group27,
|
||||
}, "pod40": { // backtrack cell search
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 1,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group28,
|
||||
}, "pod41": { // revert lazy preemption in backtrack cell search
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 2,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group29,
|
||||
}, "pod42": { // doomed bad cell test
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 0,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group30,
|
||||
}, "pod43": { // doomed bad cell test
|
||||
VirtualCluster: "VC2",
|
||||
Priority: 0,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group31,
|
||||
}, "pod44": { // safe relaxed buddy allocate for bad node test
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 0,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group32,
|
||||
}, "pod45": { // safe relaxed buddy allocate for bad node test
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 0,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group33,
|
||||
}, "pod46": { // safe relaxed buddy allocate safety test
|
||||
VirtualCluster: "VC1",
|
||||
Priority: 0,
|
||||
LazyPreemptionEnable: true,
|
||||
PinnedCellId: "",
|
||||
GpuType: "DGX2-V100",
|
||||
GpuNumber: 16,
|
||||
LeafCellType: "DGX2-V100",
|
||||
LeafCellNumber: 16,
|
||||
AffinityGroup: group34,
|
||||
},
|
||||
}
|
||||
|
@@ -559,36 +559,36 @@ var casesForStatefulPreemption = []string{
 }
 
 type result struct {
-	node         string
-	gpuIsolation []int32
+	node              string
+	leafCellIsolation []int32
 }
 
var expectedBindInfos = map[string]result{
|
||||
"pod1": {node: "0.0.1.0", gpuIsolation: []int32{0}},
|
||||
"pod2": {node: "0.0.1.0", gpuIsolation: []int32{1}},
|
||||
"pod3": {node: "0.0.1.0", gpuIsolation: []int32{8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod4": {node: "0.0.5.0", gpuIsolation: []int32{0}},
|
||||
"pod5": {node: "0.0.3.0", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod6": {node: "0.0.3.1", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod8": {node: "1.0.0.0", gpuIsolation: []int32{1, 3, 4, 7, 0, 2, 6}},
|
||||
"pod9": {node: "1.0.0.2", gpuIsolation: []int32{0, 1, 2, 3, 4}},
|
||||
"pod18": {node: "0.0.3.2", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod19": {node: "0.0.3.3", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod20": {node: "0.0.4.0", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod21": {node: "0.0.4.1", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod22": {node: "0.0.4.2", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod23": {node: "0.0.4.3", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod24": {node: "0.0.0.1", gpuIsolation: []int32{0, 1}},
|
||||
"pod25": {node: "0.0.0.0", gpuIsolation: []int32{0, 1}},
|
||||
"pod28": {node: "0.0.3.0", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod34": {node: "0.0.3.0", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod36": {node: "0.0.1.0", gpuIsolation: []int32{0}},
|
||||
"pod37": {node: "0.0.3.0", gpuIsolation: []int32{0}},
|
||||
"pod38": {node: "0.0.3.1", gpuIsolation: []int32{0}},
|
||||
"pod39": {node: "0.0.3.2", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod40": {node: "0.0.4.3", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod44": {node: "0.0.3.2", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod45": {node: "0.0.4.2", gpuIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod1": {node: "0.0.1.0", leafCellIsolation: []int32{0}},
|
||||
"pod2": {node: "0.0.1.0", leafCellIsolation: []int32{1}},
|
||||
"pod3": {node: "0.0.1.0", leafCellIsolation: []int32{8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod4": {node: "0.0.5.0", leafCellIsolation: []int32{0}},
|
||||
"pod5": {node: "0.0.3.0", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod6": {node: "0.0.3.1", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod8": {node: "1.0.0.0", leafCellIsolation: []int32{1, 3, 4, 7, 0, 2, 6}},
|
||||
"pod9": {node: "1.0.0.2", leafCellIsolation: []int32{0, 1, 2, 3, 4}},
|
||||
"pod18": {node: "0.0.3.2", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod19": {node: "0.0.3.3", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod20": {node: "0.0.4.0", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod21": {node: "0.0.4.1", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod22": {node: "0.0.4.2", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod23": {node: "0.0.4.3", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod24": {node: "0.0.0.1", leafCellIsolation: []int32{0, 1}},
|
||||
"pod25": {node: "0.0.0.0", leafCellIsolation: []int32{0, 1}},
|
||||
"pod28": {node: "0.0.3.0", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod34": {node: "0.0.3.0", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod36": {node: "0.0.1.0", leafCellIsolation: []int32{0}},
|
||||
"pod37": {node: "0.0.3.0", leafCellIsolation: []int32{0}},
|
||||
"pod38": {node: "0.0.3.1", leafCellIsolation: []int32{0}},
|
||||
"pod39": {node: "0.0.3.2", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod40": {node: "0.0.4.3", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod44": {node: "0.0.3.2", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
"pod45": {node: "0.0.4.2", leafCellIsolation: []int32{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15}},
|
||||
}
|
||||
|
||||
var expectedPreemptInfos = map[string]common.Set{
|
||||
|
@@ -615,7 +615,7 @@ func TestHivedAlgorithm(t *testing.T) {
 	sConfig := api.NewConfig(api.InitRawConfig(&configFilePath))
 	h := NewHivedAlgorithm(sConfig)
 	initNodes(h)
-	// sort chains of each GPU type for stability of the test
+	// sort chains of each leaf cell type for stability of the test
 	for _, chains := range h.cellChains {
 		sortChains(chains)
 	}
@@ -670,8 +670,8 @@ func printConfig(t *testing.T, h *HivedAlgorithm) {
 			t.Logf("%v", ccl)
 		}
 	}
-	for gpuType, chains := range h.cellChains {
-		t.Logf("%v: %v", gpuType, chains)
+	for leafCellType, chains := range h.cellChains {
+		t.Logf("%v: %v", leafCellType, chains)
 	}
 }
@@ -876,21 +876,21 @@ func testStatefulPreemption(t *testing.T, configFilePath string) {
 		}
 		if podName == "pod35" {
 			p := &groupPhysicalPlacement{}
-			*p = h.affinityGroups[pss[pod.UID].AffinityGroup.Name].physicalGpuPlacement
+			*p = h.affinityGroups[pss[pod.UID].AffinityGroup.Name].physicalLeafCellPlacement
 			h.DeleteUnallocatedPod(pod)
 			// test correctness of preemption cancellation
 			for _, podPlacements := range *p {
-				for _, podGpus := range podPlacements {
-					for _, gpu := range podGpus {
-						pGpu := gpu.(*PhysicalCell)
-						if pGpu.GetState() == cellUsed {
-							if int32(pGpu.GetPriority()) != pss["pod34"].Priority {
+				for _, podLeafCells := range podPlacements {
+					for _, leafCell := range podLeafCells {
+						pLeafCell := leafCell.(*PhysicalCell)
+						if pLeafCell.GetState() == cellUsed {
+							if int32(pLeafCell.GetPriority()) != pss["pod34"].Priority {
 								t.Errorf("Cell %v's priority should be pod34's priority, but is %v",
-									pGpu.GetAddress(), pGpu.GetPriority())
+									pLeafCell.GetAddress(), pLeafCell.GetPriority())
 							}
-						} else if pGpu.GetState() != cellFree {
+						} else if pLeafCell.GetState() != cellFree {
 							t.Errorf("Cell %v should be in Free state, but is %v",
-								pGpu.GetAddress(), pGpu.GetState())
+								pLeafCell.GetAddress(), pLeafCell.GetState())
 						}
 					}
 				}
@@ -1084,7 +1084,7 @@ func testReconfiguration(t *testing.T, configFilePath string) {
 	for _, podName := range casesThatShouldBeLazyPreempted {
 		pod := allPods[podName]
 		g := h.affinityGroups[pss[pod.UID].AffinityGroup.Name]
-		if g.virtualGpuPlacement != nil {
+		if g.virtualLeafCellPlacement != nil {
 			t.Errorf("Group %v is expected to be lazy preempted, but not", g.name)
 		}
 	}
@@ -1105,7 +1105,7 @@ func testInvalidInitialAssignment(t *testing.T, sConfig *api.Config) {
 	NewHivedAlgorithm(sConfig)
 }
 
-func compareGpuIsolation(a []int32, b []int32) bool {
+func compareLeafCellIsolation(a []int32, b []int32) bool {
 	if len(a) == len(b) {
 		for i := 0; i < len(a); i++ {
 			if a[i] != b[i] {
@@ -1130,15 +1130,15 @@ func compareSchedulingResult(t *testing.T, pod *core.Pod, psr internal.PodSchedu
 	if expected, ok := expectedBindInfos[pod.Name]; !ok {
 		if psr.PodBindInfo != nil {
 			t.Errorf("[%v]: wrong pod scheduling result: expected empty, but got %v:%v",
-				internal.Key(pod), psr.PodBindInfo.Node, psr.PodBindInfo.GpuIsolation)
+				internal.Key(pod), psr.PodBindInfo.Node, psr.PodBindInfo.LeafCellIsolation)
 		}
 		if !expectedPreemptInfos[pod.Name].IsEmpty() && !containsPods(psr.PodPreemptInfo.VictimPods, expectedPreemptInfos[pod.Name]) {
 			t.Errorf("[%v]: wrong preempt victims: expected %v, but got %v",
 				internal.Key(pod), expectedPreemptInfos[pod.Name], psr.PodPreemptInfo.VictimPods)
 		}
 	} else if psr.PodBindInfo.Node != expected.node ||
-		!compareGpuIsolation(psr.PodBindInfo.GpuIsolation, expected.gpuIsolation) {
+		!compareLeafCellIsolation(psr.PodBindInfo.LeafCellIsolation, expected.leafCellIsolation) {
 		t.Errorf("[%v]: wrong pod bind info: expected %v:%v, but got %v:%v",
-			internal.Key(pod), expected.node, expected.gpuIsolation, psr.PodBindInfo.Node, psr.PodBindInfo.GpuIsolation)
+			internal.Key(pod), expected.node, expected.leafCellIsolation, psr.PodBindInfo.Node, psr.PodBindInfo.LeafCellIsolation)
 	}
 }
@@ -24,6 +24,7 @@ package algorithm
 
 import (
 	"fmt"
+
 	"github.com/microsoft/hivedscheduler/pkg/api"
 	"github.com/microsoft/hivedscheduler/pkg/common"
 	"k8s.io/klog"
@@ -31,7 +32,7 @@ import (
 
 // intraVCScheduler is an interface for scheduling pods inside a VC.
 // It stores two maps of ChainCellList, one for pinned cells, the other for non-pinned ones.
-// It should be able to return a set of GPU placements in the VC for a scheduling request.
+// It should be able to return a set of leaf cell placements in the VC for a scheduling request.
 type intraVCScheduler interface {
 	getNonPinnedFullCellList() map[CellChain]ChainCellList
 	getNonPinnedPreassignedCells() map[CellChain]ChainCellList
@@ -57,15 +58,15 @@ func newDefaultIntraVCScheduler(
 	nonPinnedFullList map[CellChain]ChainCellList,
 	nonPinnedFreeList map[CellChain]ChainCellList,
 	pinnedList map[api.PinnedCellId]ChainCellList,
-	gpuNums map[CellChain]map[CellLevel]int32) *defaultIntraVCScheduler {
+	leafCellNums map[CellChain]map[CellLevel]int32) *defaultIntraVCScheduler {
 
 	snr := map[CellChain]*topologyAwareScheduler{}
 	sr := map[api.PinnedCellId]*topologyAwareScheduler{}
 	for chain, ccl := range nonPinnedFullList {
-		snr[chain] = NewTopologyAwareScheduler(ccl, gpuNums[chain], true)
+		snr[chain] = NewTopologyAwareScheduler(ccl, leafCellNums[chain], true)
 	}
 	for pid, ccl := range pinnedList {
-		sr[pid] = NewTopologyAwareScheduler(ccl, gpuNums[ccl[CellLevel(1)][0].GetChain()], true)
+		sr[pid] = NewTopologyAwareScheduler(ccl, leafCellNums[ccl[CellLevel(1)][0].GetChain()], true)
 	}
 	return &defaultIntraVCScheduler{
 		nonPinnedFullCellList: nonPinnedFullList,
@@ -99,7 +100,7 @@ func (s *defaultIntraVCScheduler) schedule(
 		scheduler = s.pinnedCellSchedulers[sr.pinnedCellId]
 		str = fmt.Sprintf("pinned cell %v", sr.pinnedCellId)
 	}
-	klog.Infof("Processing scheduling request in VC %v: %v, GPU numbers %v, priority %v",
+	klog.Infof("Processing scheduling request in VC %v: %v, leaf cell numbers %v, priority %v",
 		sr.vc, str, common.ToJson(sr.affinityGroupPodNums), sr.priority)
 	if scheduler != nil {
 		placement, failedReason = scheduler.Schedule(
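For reference, the affinityGroupPodNums map logged above (and passed to Schedule below) is keyed by per-pod leaf cell demand. A minimal sketch with assumed values, not taken from the diff:

// Illustration only: an affinity group with two 16-leaf-cell pods and one 8-leaf-cell pod
// is represented as a map from leaf cell number per pod to pod count.
podLeafCellNumbers := map[int32]int32{16: 2, 8: 1}
// The topology-aware scheduler expands this into a sorted per-pod demand list: [8, 16, 16].
_ = podLeafCellNumbers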
@@ -31,14 +31,14 @@ import (
 )
 
 // topologyAwareScheduler can schedule a set of pods on a cluster view.
-// It first tries to place pods to nodes with fewer free GPUs (i.e., packing), while trying to avoid preemptions.
-// Then inside each node, it tries to allocate GPUs with better affinity.
+// It first tries to place pods to nodes with fewer free leaf cells (i.e., packing), while trying to avoid preemptions.
+// Then inside each node, it tries to allocate leaf cells with better affinity.
 type topologyAwareScheduler struct {
 	// a list of nodes (node-level cells or top-level cells that are lower than node level)
 	cv clusterView
-	// GPU number at each level in the cell hierarchy. we use this to
-	// calculate the optimal affinity for a given GPU number.
-	levelGpuNum map[CellLevel]int32
+	// leaf cell number at each level in the cell hierarchy. we use this to
+	// calculate the optimal affinity for a given leaf cell number.
+	levelLeafCellNum map[CellLevel]int32
 	// pack pods cross different priorities, or inside each priority. the former is for intra-VC scheduling,
 	// because high-priority can avoid preemption in the whole cluster view,
 	// and hence we can pack pods with different priorities.
@@ -52,103 +52,103 @@ type topologyAwareScheduler struct {
 // (lower-level if no node-level) from a free cell list.
 func NewTopologyAwareScheduler(
 	ccl ChainCellList,
-	levelGpuNum map[CellLevel]int32,
+	levelLeafCellNum map[CellLevel]int32,
 	crossPriorityPack bool) *topologyAwareScheduler {
 
 	return &topologyAwareScheduler{
 		cv:                newClusterView(ccl),
-		levelGpuNum:       levelGpuNum,
+		levelLeafCellNum:  levelLeafCellNum,
 		crossPriorityPack: crossPriorityPack,
 	}
 }
 
 func (t *topologyAwareScheduler) Schedule(
-	podGpuNumbers map[int32]int32,
+	podLeafCellNumbers map[int32]int32,
 	p CellPriority,
 	suggestedNodes common.Set,
 	ignoreSuggestedNodes bool) (
 	podPlacements map[int32][]CellList,
 	failedReason string) {
 
-	// GPU numbers of the pods to schedule
-	var sortedPodGpuNumbers []int32
-	for gpuNum, podNum := range podGpuNumbers {
+	// leaf cell numbers of the pods to schedule
+	var sortedPodLeafCellNumbers []int32
+	for leafCellNum, podNum := range podLeafCellNumbers {
 		for i := int32(0); i < podNum; i++ {
-			sortedPodGpuNumbers = append(sortedPodGpuNumbers, gpuNum)
+			sortedPodLeafCellNumbers = append(sortedPodLeafCellNumbers, leafCellNum)
 		}
 	}
-	common.SortInt32(sortedPodGpuNumbers)
+	common.SortInt32(sortedPodLeafCellNumbers)
 
 	// disable preemption first (reduce preemption)
 	priority := opportunisticPriority
 	t.updateClusterView(priority, suggestedNodes, ignoreSuggestedNodes)
 	// try to fit the pods to a set of nodes
-	selectedNodeIndices, failedReason := findNodesForPods(t.cv, sortedPodGpuNumbers)
+	selectedNodeIndices, failedReason := findNodesForPods(t.cv, sortedPodLeafCellNumbers)
 	// enable preemption if scheduling failed
 	if selectedNodeIndices == nil && p > opportunisticPriority {
 		priority = p
 		t.updateClusterView(priority, suggestedNodes, ignoreSuggestedNodes)
-		selectedNodeIndices, failedReason = findNodesForPods(t.cv, sortedPodGpuNumbers)
+		selectedNodeIndices, failedReason = findNodesForPods(t.cv, sortedPodLeafCellNumbers)
 	}
 	if selectedNodeIndices == nil {
 		return nil, failedReason
 	}
-	// find GPUs inside the selected node for each pod
-	selectedNodes := make(CellList, len(sortedPodGpuNumbers))
+	// find leaf cells inside the selected node for each pod
+	selectedNodes := make(CellList, len(sortedPodLeafCellNumbers))
 	for i := 0; i < len(selectedNodeIndices); i++ {
 		selectedNodes[i] = t.cv[selectedNodeIndices[i]].c
 	}
-	selectedGpus := CellList{}
-	nodeAvailableGpus := map[Cell]CellList{}
+	selectedLeafCells := CellList{}
+	nodeAvailableLeafCells := map[Cell]CellList{}
 	podPlacements = map[int32][]CellList{}
-	for podIndex := 0; podIndex < len(sortedPodGpuNumbers); podIndex++ {
-		gpuNumber := sortedPodGpuNumbers[podIndex]
+	for podIndex := 0; podIndex < len(sortedPodLeafCellNumbers); podIndex++ {
+		leafCellNumber := sortedPodLeafCellNumbers[podIndex]
 		n := selectedNodes[podIndex]
-		// TODO: Optimize findNodesForPods and findGpusInNode together to get a better placement,
+		// TODO: Optimize findNodesForPods and findLeafCellsInNode together to get a better placement,
 		// such as also aware intra node topology when findNodesForPods.
-		selectedGpus, nodeAvailableGpus[n] = findGpusInNode(n, gpuNumber, priority, nodeAvailableGpus[n], t.levelGpuNum)
-		if podPlacements[gpuNumber] == nil {
-			podPlacements[gpuNumber] = []CellList{}
+		selectedLeafCells, nodeAvailableLeafCells[n] = findLeafCellsInNode(n, leafCellNumber, priority, nodeAvailableLeafCells[n], t.levelLeafCellNum)
+		if podPlacements[leafCellNumber] == nil {
+			podPlacements[leafCellNumber] = []CellList{}
 		}
-		podPlacements[gpuNumber] = append(podPlacements[gpuNumber], selectedGpus)
+		podPlacements[leafCellNumber] = append(podPlacements[leafCellNumber], selectedLeafCells)
 	}
 	return podPlacements, ""
 }
 
 type node struct {
-	c                        Cell            // a node-level cell or a top-level cell that is lower than node level
-	freeGpuNumAtPriority     int32           // free GPU number at the priority of the pod to be scheduled (lower priority considered as free)
-	usedGpuNumSamePriority   int32           // GPU number used by the same priority as that of the pod to be scheduled
-	usedGpuNumHigherPriority int32           // GPU number used by higher priorities than that of the pod to be scheduled
-	healthy                  bool            // if the node is healthy
-	suggested                bool            // if the node is within suggested nodes
-	nodeAddress              api.CellAddress // used for logging the node address when bad or not suggested
+	c                             Cell            // a node-level cell or a top-level cell that is lower than node level
+	freeLeafCellNumAtPriority     int32           // free leaf cell number at the priority of the pod to be scheduled (lower priority considered as free)
+	usedLeafCellNumSamePriority   int32           // leaf cell number used by the same priority as that of the pod to be scheduled
+	usedLeafCellNumHigherPriority int32           // leaf cell number used by higher priorities than that of the pod to be scheduled
+	healthy                       bool            // if the node is healthy
+	suggested                     bool            // if the node is within suggested nodes
+	nodeAddress                   api.CellAddress // used for logging the node address when bad or not suggested
 }
 
-// When cross-priority packing is not enabled, we count the GPU numbers used by the current
-// priority (n.usedGpuNumSamePriority), and the higher priorities (n.usedGpuNumHigherPriority), respectively.
-// When sorting the nodes, nodes with higher usedGpuNumSamePriority and lower usedGpuNumHigherPriority
+// When cross-priority packing is not enabled, we count the leaf cell numbers used by the current
+// priority (n.usedLeafCellNumSamePriority), and the higher priorities (n.usedLeafCellNumHigherPriority), respectively.
+// When sorting the nodes, nodes with higher usedLeafCellNumSamePriority and lower usedLeafCellNumHigherPriority
 // will be preferred (i.e., pack pods inside the same priority, and stay from higher priorities).
-// Note that in this case, the nodes may NOT be ordered in term of total used GPU number,
+// Note that in this case, the nodes may NOT be ordered in term of total used leaf cell number,
 // which may result in feasible pod placements being not found.
 //
-// Otherwise, n.usedGpuNumSamePriority is set to the total used GPU number,
-// so that nodes with more used GPUs will be preferred (i.e., pack pods globally across priorities).
+// Otherwise, n.usedLeafCellNumSamePriority is set to the total used leaf cell number,
+// so that nodes with more used leaf cells will be preferred (i.e., pack pods globally across priorities).
 // In this case a feasible pod placement is guaranteed to be found (as long as all nodes are in suggested nodes).
-func (n *node) updateUsedGpuNumForPriority(p CellPriority, crossPriorityPack bool) {
-	n.usedGpuNumSamePriority = n.c.GetUsedGpuNumAtPriorities()[p]
-	n.usedGpuNumHigherPriority = 0
-	n.freeGpuNumAtPriority = n.c.GetTotalGpuNum()
-	for priority, num := range n.c.GetUsedGpuNumAtPriorities() {
+func (n *node) updateUsedLeafCellNumForPriority(p CellPriority, crossPriorityPack bool) {
+	n.usedLeafCellNumSamePriority = n.c.GetUsedLeafCellNumAtPriorities()[p]
+	n.usedLeafCellNumHigherPriority = 0
+	n.freeLeafCellNumAtPriority = n.c.GetTotalLeafCellNum()
+	for priority, num := range n.c.GetUsedLeafCellNumAtPriorities() {
 		if crossPriorityPack {
 			if priority != p {
-				n.usedGpuNumSamePriority += num
+				n.usedLeafCellNumSamePriority += num
 			}
 		} else if priority > p {
-			n.usedGpuNumHigherPriority += num
+			n.usedLeafCellNumHigherPriority += num
 		}
 		if priority >= p {
-			n.freeGpuNumAtPriority -= num
+			n.freeLeafCellNumAtPriority -= num
 		}
 	}
 }
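The counting above is easiest to see on a toy example. A standalone sketch (plain ints, not HiveD's Cell types) that mirrors the loop in updateUsedLeafCellNumForPriority:

// With crossPriorityPack, usedSame becomes the total used leaf cell number;
// without it, higher priorities are tracked separately so the sort can stay away from them.
func countForPriority(total int32, usedAtPriority map[int32]int32, p int32, crossPriorityPack bool) (usedSame, usedHigher, free int32) {
	usedSame = usedAtPriority[p]
	free = total
	for priority, num := range usedAtPriority {
		if crossPriorityPack {
			if priority != p {
				usedSame += num
			}
		} else if priority > p {
			usedHigher += num
		}
		if priority >= p {
			free -= num
		}
	}
	return usedSame, usedHigher, free
}

// Example (assumed numbers): 16 leaf cells, usedAtPriority = {-1: 4, 1: 6, 2: 2}, scheduling at priority 1:
// crossPriorityPack=true  -> usedSame=12, usedHigher=0, free=8
// crossPriorityPack=false -> usedSame=6,  usedHigher=2, free=8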
@@ -158,10 +158,10 @@ type clusterView []*node
 func newClusterView(ccl ChainCellList) clusterView {
 	var l CellLevel
 	// TODO: currently if a top-level cell is lower than node level, it will be considered as a single node.
-	// For example, 2 single GPU-level cells are considered as 2 nodes each with 1 GPU.
+	// For example, 2 single leaf-level cells are considered as 2 nodes each with 1 leaf cell.
 	// We cannot merge them because the 2 cells might be mapped to different physical nodes.
-	// We plan to support using multiple cells in a best-effort manner (for example, schedule a 2-GPU pod
-	// on 2 1-GPU cells, if we can find 2 1-GPU cells that can be mapped to the same physical node).
+	// We plan to support using multiple cells in a best-effort manner (for example, schedule a 2-leaf-cell pod
+	// on 2 1-leaf-cell cells, if we can find 2 1-leaf-cell cells that can be mapped to the same physical node).
 	for l = CellLevel(1); l <= CellLevel(len(ccl)); l++ {
 		if ccl[l][0].AtOrHigherThanNode() {
 			break
@@ -205,18 +205,18 @@ func (cv clusterView) Len() int {
 // We sort the nodes in decreasing significance of:
 // (1) if the node is healthy (avoid unhealthy),
 // (2) if the node is suggested (avoid non-suggested),
-// (3) usedGpuNumSamePriority (more is preferred),
-// (4) usedGpuNumHigherPriority (less is preferred).
+// (3) usedLeafCellNumSamePriority (more is preferred),
+// (4) usedLeafCellNumHigherPriority (less is preferred).
 func (cv clusterView) Less(i int, j int) bool {
 	if cv[i].healthy != cv[j].healthy {
 		return cv[i].healthy
 	} else if cv[i].suggested != cv[j].suggested {
 		return cv[i].suggested
-	} else if cv[i].usedGpuNumSamePriority > cv[j].usedGpuNumSamePriority {
+	} else if cv[i].usedLeafCellNumSamePriority > cv[j].usedLeafCellNumSamePriority {
 		return true
-	} else if cv[i].usedGpuNumSamePriority < cv[j].usedGpuNumSamePriority {
+	} else if cv[i].usedLeafCellNumSamePriority < cv[j].usedLeafCellNumSamePriority {
 		return false
-	} else if cv[i].usedGpuNumHigherPriority < cv[j].usedGpuNumHigherPriority {
+	} else if cv[i].usedLeafCellNumHigherPriority < cv[j].usedLeafCellNumHigherPriority {
 		return true
 	} else {
 		return false
|
|||
cv[i], cv[j] = cv[j], cv[i]
|
||||
}
|
||||
|
||||
// updateClusterView updates the GPU numbers of the nodes for the sorting.
|
||||
// updateClusterView updates the leaf cell numbers of the nodes for the sorting.
|
||||
func (t *topologyAwareScheduler) updateClusterView(
|
||||
p CellPriority,
|
||||
suggestedNodes common.Set,
|
||||
ignoreSuggestedNodes bool) {
|
||||
|
||||
for _, n := range t.cv {
|
||||
n.updateUsedGpuNumForPriority(p, t.crossPriorityPack)
|
||||
n.updateUsedLeafCellNumForPriority(p, t.crossPriorityPack)
|
||||
n.healthy, n.suggested, n.nodeAddress = nodeHealthyAndInSuggested(n, suggestedNodes, ignoreSuggestedNodes)
|
||||
}
|
||||
}
|
||||
|
@@ -264,24 +264,24 @@ func nodeHealthyAndInSuggested(
 	return true, true, ""
 }
 
-// findNodesForPods finds a set of nodes that can accommodate the GPU requirements of the pods.
-func findNodesForPods(cv clusterView, gpuNums []int32) (pickedNodeIndices []int32, failedReason string) {
-	// sort the nodes according to gpu numbers in each node.
+// findNodesForPods finds a set of nodes that can accommodate the leaf cell requirements of the pods.
+func findNodesForPods(cv clusterView, leafCellNums []int32) (pickedNodeIndices []int32, failedReason string) {
+	// sort the nodes according to leaf cell numbers in each node.
 	// this is achieved through the Less method defined in type clusterView.
 	// TODO: Ensure Opportunistic Pods also can always can find the solution, regardless of
 	// the iteration order.
 	// For example:
-	// 1. clusterView = 2GPU Node, 1GPU Node
-	// 2. gpuNums = 1GPU Pod, 2GPU Pod
-	// First 1GPU Pod may allocate to 2GPU Node, but the latter pod cannot be fitted anymore.
+	// 1. clusterView = 2-leaf-cell Node, 1-leaf-cell Node
+	// 2. leafCellNums = 1-leaf-cell Pod, 2-leaf-cell Pod
+	// First 1-leaf-cell Pod may allocate to 2-leaf-cell Node, but the latter pod cannot be fitted anymore.
 	sort.Stable(cv)
-	pickedNodeIndices = make([]int32, len(gpuNums)) // indices of the currently picked nodes
+	pickedNodeIndices = make([]int32, len(leafCellNums)) // indices of the currently picked nodes
 	podIndex := 0
-	pickedGpuNum := int32(0)
+	pickedLeafCellNum := int32(0)
 	var n *node
 	for nodeIndex := 0; nodeIndex < len(cv); {
 		n = cv[nodeIndex]
-		if n.freeGpuNumAtPriority-pickedGpuNum >= gpuNums[podIndex] {
+		if n.freeLeafCellNumAtPriority-pickedLeafCellNum >= leafCellNums[podIndex] {
 			// fail when encountering a node that is either bad or not within suggested nodes
 			if !n.healthy {
 				return nil, fmt.Sprintf(
@ -292,104 +292,104 @@ func findNodesForPods(cv clusterView, gpuNums []int32) (pickedNodeIndices []int3
|
|||
"have to use at least one non-suggested node %v", n.nodeAddress)
|
||||
}
|
||||
pickedNodeIndices[podIndex] = int32(nodeIndex)
|
||||
pickedGpuNum += gpuNums[podIndex]
|
||||
pickedLeafCellNum += leafCellNums[podIndex]
|
||||
podIndex++
|
||||
if podIndex == len(gpuNums) {
|
||||
if podIndex == len(leafCellNums) {
|
||||
return pickedNodeIndices, ""
|
||||
}
|
||||
} else {
|
||||
pickedGpuNum = 0
|
||||
pickedLeafCellNum = 0
|
||||
nodeIndex++
|
||||
}
|
||||
}
|
||||
return nil, "insufficient capacity"
|
||||
}
|
||||
|
||||
// findGpusInNode finds a set of GPUs with the best affinity in a node for a pod.
|
||||
func findGpusInNode(
|
||||
// findLeafCellsInNode finds a set of leaf cells with the best affinity in a node for a pod.
|
||||
func findLeafCellsInNode(
|
||||
n Cell,
|
||||
gpuNum int32,
|
||||
leafCellNum int32,
|
||||
p CellPriority,
|
||||
availableGpus CellList,
|
||||
levelGpuNum map[CellLevel]int32) (CellList, CellList) {
|
||||
availableLeafCells CellList,
|
||||
levelLeafCellNum map[CellLevel]int32) (CellList, CellList) {
|
||||
|
||||
// indices of the currently picked GPUs
|
||||
currentGpuIndices := make([]int32, gpuNum)
|
||||
// affinity of the currently picked GPUs, defined as the lowest common ancestor
|
||||
// of the GPUs in the cell hierarchy (lower level means better affinity)
|
||||
currentAffinity := make(CellList, gpuNum)
|
||||
// GPUs with the best affinity ever seen
|
||||
bestAffinityGpus := make(CellList, gpuNum)
|
||||
// indices of the GPUs with the best affinity ever seen
|
||||
bestAffinityGpuIndices := make([]int32, gpuNum)
|
||||
// the best affinity ever seen (i.e., lowest level of lowest common ancestor of a set of GPUs)
|
||||
// indices of the currently picked leaf cells
|
||||
currentLeafCellIndices := make([]int32, leafCellNum)
|
||||
// affinity of the currently picked leaf cells, defined as the lowest common ancestor
|
||||
// of the leaf cells in the cell hierarchy (lower level means better affinity)
|
||||
currentAffinity := make(CellList, leafCellNum)
|
||||
// leaf cells with the best affinity ever seen
|
||||
bestAffinityLeafCells := make(CellList, leafCellNum)
|
||||
// indices of the leaf cells with the best affinity ever seen
|
||||
bestAffinityLeafCellIndices := make([]int32, leafCellNum)
|
||||
// the best affinity ever seen (i.e., lowest level of lowest common ancestor of a set of leaf cells)
|
||||
bestAffinity := highestLevel
|
||||
// the optimal affinity for the GPU number, i.e., the lowest possible of the lowest common ancestor of GPUs
|
||||
optimalAffinity := getOptimalAffinity(gpuNum, levelGpuNum)
|
||||
// the optimal affinity for the leaf cell number, i.e., the lowest possible of the lowest common ancestor of leaf cells
|
||||
optimalAffinity := getOptimalAffinity(leafCellNum, levelLeafCellNum)
|
||||
|
||||
if availableGpus == nil {
|
||||
availableGpus = CellList{}
|
||||
preemptibleGpus := CellList{}
|
||||
availableGpus, preemptibleGpus = getGpusFromNode(n, p, availableGpus, preemptibleGpus)
|
||||
// free GPUs will be used first (before preemptible GPUs)
|
||||
availableGpus = append(availableGpus, preemptibleGpus...)
|
||||
if availableLeafCells == nil {
|
||||
availableLeafCells = CellList{}
|
||||
preemptibleLeafCells := CellList{}
|
||||
availableLeafCells, preemptibleLeafCells = getLeafCellsFromNode(n, p, availableLeafCells, preemptibleLeafCells)
|
||||
// free leaf cells will be used first (before preemptible leaf cells)
|
||||
availableLeafCells = append(availableLeafCells, preemptibleLeafCells...)
|
||||
}
|
||||
availableGpuIndex := int32(0)
|
||||
searchGpuIndex := int32(0)
|
||||
var gpu Cell
|
||||
availableLeafCellIndex := int32(0)
|
||||
searchLeafCellIndex := int32(0)
|
||||
var leafCell Cell
|
||||
for {
|
||||
for availableGpuIndex < int32(len(availableGpus)) {
|
||||
gpu = availableGpus[availableGpuIndex]
|
||||
currentGpuIndices[searchGpuIndex] = availableGpuIndex
|
||||
if searchGpuIndex == 0 {
|
||||
currentAffinity[searchGpuIndex] = gpu
|
||||
for availableLeafCellIndex < int32(len(availableLeafCells)) {
|
||||
leafCell = availableLeafCells[availableLeafCellIndex]
|
||||
currentLeafCellIndices[searchLeafCellIndex] = availableLeafCellIndex
|
||||
if searchLeafCellIndex == 0 {
|
||||
currentAffinity[searchLeafCellIndex] = leafCell
|
||||
} else {
|
||||
currentAffinity[searchGpuIndex] = findLCA(gpu, currentAffinity[searchGpuIndex-1])
|
||||
currentAffinity[searchLeafCellIndex] = findLCA(leafCell, currentAffinity[searchLeafCellIndex-1])
|
||||
// pruning: if the current LCA has been higher than the lowest ever,
|
||||
// the node will be skipped
|
||||
if (currentAffinity[searchGpuIndex] == nil && bestAffinity < highestLevel) ||
|
||||
(currentAffinity[searchGpuIndex] != nil && currentAffinity[searchGpuIndex].GetLevel() > bestAffinity) {
|
||||
availableGpuIndex++
|
||||
if (currentAffinity[searchLeafCellIndex] == nil && bestAffinity < highestLevel) ||
|
||||
(currentAffinity[searchLeafCellIndex] != nil && currentAffinity[searchLeafCellIndex].GetLevel() > bestAffinity) {
|
||||
availableLeafCellIndex++
|
||||
continue
|
||||
}
|
||||
}
|
||||
if searchGpuIndex == gpuNum-1 {
|
||||
if searchLeafCellIndex == leafCellNum-1 {
|
||||
foundOptimalAffinity := false
|
||||
bestAffinity, foundOptimalAffinity = checkCurrentGpus(
|
||||
bestAffinity, foundOptimalAffinity = checkCurrentLeafCells(
|
||||
currentAffinity[len(currentAffinity)-1].GetLevel(),
|
||||
availableGpus,
|
||||
currentGpuIndices,
|
||||
availableLeafCells,
|
||||
currentLeafCellIndices,
|
||||
bestAffinity,
|
||||
bestAffinityGpus,
|
||||
bestAffinityGpuIndices,
|
||||
bestAffinityLeafCells,
|
||||
bestAffinityLeafCellIndices,
|
||||
optimalAffinity)
|
||||
if foundOptimalAffinity {
|
||||
// early stop: return if the solution is optimal (i.e., all buddies)
|
||||
availableGpus = removePickedGpus(availableGpus, bestAffinityGpuIndices)
|
||||
return bestAffinityGpus, availableGpus
|
||||
availableLeafCells = removePickedLeafCells(availableLeafCells, bestAffinityLeafCellIndices)
|
||||
return bestAffinityLeafCells, availableLeafCells
|
||||
}
|
||||
} else {
|
||||
searchGpuIndex++
|
||||
searchLeafCellIndex++
|
||||
}
|
||||
availableGpuIndex++
|
||||
availableLeafCellIndex++
|
||||
}
|
||||
searchGpuIndex--
|
||||
if searchGpuIndex < 0 {
|
||||
searchLeafCellIndex--
|
||||
if searchLeafCellIndex < 0 {
|
||||
if bestAffinity == highestLevel {
|
||||
// Unreachable
|
||||
panic(fmt.Sprintf("Assert Failure: failed to allocate %v GPUs in picked node %v", gpuNum, n.GetAddress()))
|
||||
panic(fmt.Sprintf("Assert Failure: failed to allocate %v leaf cells in picked node %v", leafCellNum, n.GetAddress()))
|
||||
}
|
||||
availableGpus = removePickedGpus(availableGpus, bestAffinityGpuIndices)
|
||||
return bestAffinityGpus, availableGpus
|
||||
availableLeafCells = removePickedLeafCells(availableLeafCells, bestAffinityLeafCellIndices)
|
||||
return bestAffinityLeafCells, availableLeafCells
|
||||
}
|
||||
availableGpuIndex = currentGpuIndices[searchGpuIndex] + 1
|
||||
availableLeafCellIndex = currentLeafCellIndices[searchLeafCellIndex] + 1
|
||||
}
|
||||
}
|
||||
|
||||
// getOptimalAffinity calculates the optimal affinity for a given GPU number.
|
||||
func getOptimalAffinity(gpuNum int32, levelGpuNum map[CellLevel]int32) CellLevel {
|
||||
for l := CellLevel(1); l <= CellLevel(len(levelGpuNum)); l++ {
|
||||
if levelGpuNum[l] >= gpuNum {
|
||||
// getOptimalAffinity calculates the optimal affinity for a given leaf cell number.
|
||||
func getOptimalAffinity(leafCellNum int32, levelLeafCellNum map[CellLevel]int32) CellLevel {
|
||||
for l := CellLevel(1); l <= CellLevel(len(levelLeafCellNum)); l++ {
|
||||
if levelLeafCellNum[l] >= leafCellNum {
|
||||
return l
|
||||
}
|
||||
}
|
||||
|
@@ -398,21 +398,21 @@ func getOptimalAffinity(gpuNum int32, levelGpuNum map[CellLevel]int32) CellLevel
 	panic(fmt.Sprintf("Assert Failure: pod allocated a node but exceeds the capacity of the current chain"))
 }
 
-// checkCurrentGpus checks if the currently picked GPUs have the lowest LCA. It also checks if the solution
-// is optimal (if the GPUs are all buddies).
-func checkCurrentGpus(
+// checkCurrentLeafCells checks if the currently picked leaf cells have the lowest LCA. It also checks if the solution
+// is optimal (if the leaf cells are all buddies).
+func checkCurrentLeafCells(
 	affinity CellLevel,
-	gpus CellList,
+	leafCells CellList,
 	currentIndices []int32,
 	bestAffinity CellLevel,
-	bestAffinityGpus CellList,
-	bestAffinityGpuIndices []int32,
+	bestAffinityLeafCells CellList,
+	bestAffinityLeafCellIndices []int32,
 	optimalAffinity CellLevel) (CellLevel, bool) {
 
 	if affinity < bestAffinity {
-		copy(bestAffinityGpuIndices, currentIndices)
+		copy(bestAffinityLeafCellIndices, currentIndices)
 		for i := 0; i < len(currentIndices); i++ {
-			bestAffinityGpus[i] = gpus[currentIndices[i]]
+			bestAffinityLeafCells[i] = leafCells[currentIndices[i]]
 		}
 		if affinity == optimalAffinity {
 			return affinity, true
@@ -423,21 +423,21 @@ func checkCurrentGpus(
 	return bestAffinity, false
 }
 
-// removePickedGpus remove picked GPUs from the available GPU list.
-func removePickedGpus(gpus CellList, indices []int32) CellList {
+// removePickedLeafCells remove picked leaf cells from the available leaf cell list.
+func removePickedLeafCells(leafCells CellList, indices []int32) CellList {
 	for i, index := range indices {
 		offset := int32(i)
 		if i < len(indices)-1 {
 			nextIndex := indices[i+1]
-			copy(gpus[index-offset:nextIndex-offset-1], gpus[index+1:nextIndex])
+			copy(leafCells[index-offset:nextIndex-offset-1], leafCells[index+1:nextIndex])
 		} else {
-			copy(gpus[index-offset:], gpus[index+1:])
+			copy(leafCells[index-offset:], leafCells[index+1:])
 		}
 	}
-	for i := len(gpus) - len(indices); i < len(gpus); i++ {
-		gpus[i] = nil
+	for i := len(leafCells) - len(indices); i < len(leafCells); i++ {
+		leafCells[i] = nil
 	}
-	return gpus[:len(gpus)-len(indices)]
+	return leafCells[:len(leafCells)-len(indices)]
 }
 
 // findLCA finds the lowest common ancestor of two cells (nil if they have no LCA).
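A small worked example of the compaction above, using ints in place of Cells (assumed values, ascending pick indices):

// available = [10, 20, 30, 40, 50], indices = [1, 3]
// after removePickedLeafCells-style compaction: [10, 30, 50]
// (remaining items keep their order; the slice shrinks by len(indices))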
@@ -461,16 +461,16 @@ func findLCA(lower Cell, higher Cell) Cell {
 	return lower.GetParent()
 }
 
-// getGpusFromNode collects free GPUs and preemptible GPUs according to the priority.
-func getGpusFromNode(c Cell, p CellPriority, freeGpus CellList, preemptibleGpus CellList) (CellList, CellList) {
+// getLeafCellsFromNode collects free leaf cells and preemptible leaf cells according to the priority.
+func getLeafCellsFromNode(c Cell, p CellPriority, freeLeafCells CellList, preemptibleLeafCells CellList) (CellList, CellList) {
 	if c.GetLevel() > 1 {
 		for _, cc := range c.GetChildren() {
-			freeGpus, preemptibleGpus = getGpusFromNode(cc, p, freeGpus, preemptibleGpus)
+			freeLeafCells, preemptibleLeafCells = getLeafCellsFromNode(cc, p, freeLeafCells, preemptibleLeafCells)
 		}
 	} else if c.GetPriority() == freePriority {
-		freeGpus = append(freeGpus, c)
+		freeLeafCells = append(freeLeafCells, c)
 	} else if c.GetPriority() < p {
-		preemptibleGpus = append(preemptibleGpus, c)
+		preemptibleLeafCells = append(preemptibleLeafCells, c)
 	}
-	return freeGpus, preemptibleGpus
+	return freeLeafCells, preemptibleLeafCells
}
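To make the split concrete, a hedged example with assumed priorities: for a node whose four leaf cells carry priorities {free, free, -1, 2} and a request at priority 1, the two free cells are returned in freeLeafCells, the priority -1 cell is preemptible, and the priority 2 cell is returned in neither list (it cannot be preempted by a lower priority).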
@@ -24,11 +24,12 @@ package algorithm
 
 import (
 	"fmt"
+	"strings"
 
 	"github.com/microsoft/hivedscheduler/pkg/api"
 	"github.com/microsoft/hivedscheduler/pkg/common"
 	core "k8s.io/api/core/v1"
 	"k8s.io/apimachinery/pkg/types"
-	"strings"
 )
 
 type (
@@ -44,7 +45,7 @@ type schedulingRequest struct {
 	pinnedCellId         api.PinnedCellId
 	chain                CellChain
 	affinityGroupName    string
-	affinityGroupPodNums map[int32]int32 // gpu number -> pod number
+	affinityGroupPodNums map[int32]int32 // leaf cell number -> pod number
 	priority             CellPriority
 	suggestedNodes       common.Set
 	ignoreSuggestedNodes bool
@@ -135,15 +136,15 @@ type AlgoAffinityGroup struct {
 	lazyPreemptionEnable bool
 	// Whether we should ignore K8s suggested nodes. If false, we will avoid binding cells to non-suggested nodes.
 	// Note that we always avoid using bad nodes; avoiding non-suggested nodes is optional and best-effort.
-	ignoreK8sSuggestedNodes bool
-	priority                int32
-	totalPodNums            map[int32]int32       // GpuNum -> PodNum
-	allocatedPods           map[int32][]*core.Pod // GpuNum -> a list of allocated pods
-	preemptingPods          map[types.UID]*core.Pod
-	physicalGpuPlacement    groupPhysicalPlacement
-	virtualGpuPlacement     groupVirtualPlacement
-	state                   AffinityGroupState
-	lazyPreemptionStatus    *api.LazyPreemptionStatus
+	ignoreK8sSuggestedNodes   bool
+	priority                  int32
+	totalPodNums              map[int32]int32       // LeafCellNum -> PodNum
+	allocatedPods             map[int32][]*core.Pod // LeafCellNum -> a list of allocated pods
+	preemptingPods            map[types.UID]*core.Pod
+	physicalLeafCellPlacement groupPhysicalPlacement
+	virtualLeafCellPlacement  groupVirtualPlacement
+	state                     AffinityGroupState
+	lazyPreemptionStatus      *api.LazyPreemptionStatus
 }
 
 func newAlgoAffinityGroup(
@@ -155,29 +156,29 @@ func newAlgoAffinityGroup(

podNums := make(map[int32]int32)
for _, m := range g.Members {
podNums[m.GpuNumber] += m.PodNumber
podNums[m.LeafCellNumber] += m.PodNumber
}
group := &AlgoAffinityGroup{
name: g.Name,
vc: vc,
lazyPreemptionEnable: lazyPreemptionEnable,
priority: priority,
totalPodNums: podNums,
allocatedPods: map[int32][]*core.Pod{},
physicalGpuPlacement: groupPhysicalPlacement{},
virtualGpuPlacement: groupVirtualPlacement{},
state: state,
name: g.Name,
vc: vc,
lazyPreemptionEnable: lazyPreemptionEnable,
priority: priority,
totalPodNums: podNums,
allocatedPods: map[int32][]*core.Pod{},
physicalLeafCellPlacement: groupPhysicalPlacement{},
virtualLeafCellPlacement: groupVirtualPlacement{},
state: state,
}
if state == groupPreempting {
group.preemptingPods = map[types.UID]*core.Pod{}
}
for gpuNum, podNum := range podNums {
group.physicalGpuPlacement[gpuNum] = make([]CellList, podNum)
group.virtualGpuPlacement[gpuNum] = make([]CellList, podNum)
group.allocatedPods[gpuNum] = make([]*core.Pod, podNum)
for leafCellNum, podNum := range podNums {
group.physicalLeafCellPlacement[leafCellNum] = make([]CellList, podNum)
group.virtualLeafCellPlacement[leafCellNum] = make([]CellList, podNum)
group.allocatedPods[leafCellNum] = make([]*core.Pod, podNum)
for i := int32(0); i < podNum; i++ {
group.physicalGpuPlacement[gpuNum][i] = make(CellList, gpuNum)
group.virtualGpuPlacement[gpuNum][i] = make(CellList, gpuNum)
group.physicalLeafCellPlacement[leafCellNum][i] = make(CellList, leafCellNum)
group.virtualLeafCellPlacement[leafCellNum][i] = make(CellList, leafCellNum)
}
}
return group
@@ -193,11 +194,11 @@ func (aag *AlgoAffinityGroup) ToAffinityGroup() api.AffinityGroup {
LazyPreemptionStatus: aag.lazyPreemptionStatus,
},
}
if aag.physicalGpuPlacement != nil {
ag.Status.PhysicalPlacement = aag.physicalGpuPlacement.nodeToGpuIndices()
if aag.physicalLeafCellPlacement != nil {
ag.Status.PhysicalPlacement = aag.physicalLeafCellPlacement.nodeToLeafCellIndices()
}
if aag.virtualGpuPlacement != nil {
ag.Status.VirtualPlacement = aag.virtualGpuPlacement.preassignedCellToLeafCells()
if aag.virtualLeafCellPlacement != nil {
ag.Status.VirtualPlacement = aag.virtualLeafCellPlacement.preassignedCellToLeafCells()
}
for _, pods := range aag.allocatedPods {
for _, p := range pods {

@@ -212,28 +213,28 @@ func (aag *AlgoAffinityGroup) ToAffinityGroup() api.AffinityGroup {
return ag
}

type groupPhysicalPlacement map[int32][]CellList // GpuNum -> a list of pods -> a list of physical GPU cells of each pod
type groupVirtualPlacement map[int32][]CellList // GpuNum -> a list of pods -> a list of virtual GPU cells of each pod
type groupPhysicalPlacement map[int32][]CellList // LeafCellNum -> a list of pods -> a list of physical leaf cells of each pod
type groupVirtualPlacement map[int32][]CellList // LeafCellNum -> a list of pods -> a list of virtual leaf cells of each pod
func (p groupPhysicalPlacement) String() string {
return common.ToJson(p.nodeToGpuIndices())
return common.ToJson(p.nodeToLeafCellIndices())
}

func (p groupPhysicalPlacement) nodeToGpuIndices() map[string][]int32 {
nodeToGpuIndices := map[string][]int32{}
func (p groupPhysicalPlacement) nodeToLeafCellIndices() map[string][]int32 {
nodeToLeafCellIndices := map[string][]int32{}
for _, podPlacements := range p {
for _, podPlacement := range podPlacements {
for _, gpu := range podPlacement {
pGpu := gpu.(*PhysicalCell)
nodes, gpuIndices := pGpu.GetPhysicalPlacement()
if _, ok := nodeToGpuIndices[nodes[0]]; !ok {
nodeToGpuIndices[nodes[0]] = []int32{}
for _, leafCell := range podPlacement {
pLeafCell := leafCell.(*PhysicalCell)
nodes, leafCellIndices := pLeafCell.GetPhysicalPlacement()
if _, ok := nodeToLeafCellIndices[nodes[0]]; !ok {
nodeToLeafCellIndices[nodes[0]] = []int32{}
}
nodeToGpuIndices[nodes[0]] = append(nodeToGpuIndices[nodes[0]], gpuIndices[0])
nodeToLeafCellIndices[nodes[0]] = append(nodeToLeafCellIndices[nodes[0]], leafCellIndices[0])
}
}
}
return nodeToGpuIndices
return nodeToLeafCellIndices
}
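As a concrete illustration of what these placement maps hold, here is a minimal, self-contained sketch (plain structs instead of the Cell types, hypothetical node names) of the LeafCellNum -> pods -> leaf cells shape and of the node-to-leaf-cell-index view that the method above flattens it into.

package main

import "fmt"

func main() {
    // LeafCellNum -> pods -> leaf cells of each pod, mirrored here with (node, index) pairs.
    type leafCell struct {
        node  string
        index int32
    }
    placement := map[int32][][]leafCell{
        4: { // two pods, each using 4 leaf cells
            {{"node1", 0}, {"node1", 1}, {"node1", 2}, {"node1", 3}},
            {{"node2", 0}, {"node2", 1}, {"node2", 2}, {"node2", 3}},
        },
    }

    // Flatten to node -> leaf cell indices, the same view the scheduler reports.
    nodeToIndices := map[string][]int32{}
    for _, pods := range placement {
        for _, pod := range pods {
            for _, lc := range pod {
                nodeToIndices[lc.node] = append(nodeToIndices[lc.node], lc.index)
            }
        }
    }
    fmt.Println(nodeToIndices) // map[node1:[0 1 2 3] node2:[0 1 2 3]]
}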
func (p groupVirtualPlacement) String() string {
@@ -244,10 +245,10 @@ func (p groupVirtualPlacement) preassignedCellToLeafCells() map[api.CellAddress]
preassignedCellToLeafCells := map[api.CellAddress][]api.CellAddress{}
for _, podPlacements := range p {
for _, podPlacement := range podPlacements {
for _, gpu := range podPlacement {
vGpu := gpu.(*VirtualCell)
address := vGpu.GetAddress()
preassignedAddress := vGpu.GetPreassignedCell().GetAddress()
for _, leafCell := range podPlacement {
vLeafCell := leafCell.(*VirtualCell)
address := vLeafCell.GetAddress()
preassignedAddress := vLeafCell.GetPreassignedCell().GetAddress()
if _, ok := preassignedCellToLeafCells[preassignedAddress]; !ok {
preassignedCellToLeafCells[preassignedAddress] = []api.CellAddress{}
}

@@ -261,17 +262,17 @@ func (p groupVirtualPlacement) preassignedCellToLeafCells() map[api.CellAddress]

func (p groupVirtualPlacement) toPhysicalPlacement(
bindings map[api.CellAddress]*PhysicalCell,
gpuNums []int32) groupPhysicalPlacement {
leafCellNums []int32) groupPhysicalPlacement {

physicalPlacement := groupPhysicalPlacement{}
for _, podGpuNum := range gpuNums {
podPlacements := p[podGpuNum]
physicalPlacement[podGpuNum] = make([]CellList, len(podPlacements))
for _, podLeafCellNum := range leafCellNums {
podPlacements := p[podLeafCellNum]
physicalPlacement[podLeafCellNum] = make([]CellList, len(podPlacements))
for i, podPlacement := range podPlacements {
physicalPlacement[podGpuNum][i] = make(CellList, len(podPlacement))
for j, gpu := range podPlacement {
pGpu := bindings[gpu.GetAddress()]
physicalPlacement[podGpuNum][i][j] = pGpu
physicalPlacement[podLeafCellNum][i] = make(CellList, len(podPlacement))
for j, leafCell := range podPlacement {
pLeafCell := bindings[leafCell.GetAddress()]
physicalPlacement[podLeafCellNum][i][j] = pLeafCell
}
}
}

@@ -282,22 +283,22 @@ func (p groupVirtualPlacement) toPhysicalPlacement(
// lowest-level cells in a physical placement. It is generated by collecting all the unbound
// ancestors for these cells and group them in a tree.
func (p groupVirtualPlacement) toBindingPaths(
gpuNums []int32,
leafCellNums []int32,
bindings map[api.CellAddress]*PhysicalCell) (
preassignedCells []*cellBindingPathVertex,
nonPreassignedCells [][]*cellBindingPathVertex) {

allBindingPathVertices := map[api.CellAddress]*cellBindingPathVertex{}
for _, podGpuNum := range gpuNums {
podPlacements := p[podGpuNum]
for _, podLeafCellNum := range leafCellNums {
podPlacements := p[podLeafCellNum]
for _, podPlacement := range podPlacements {
for _, gpu := range podPlacement {
if pGpu := gpu.(*VirtualCell).GetPhysicalCell(); pGpu != nil {
bindings[gpu.GetAddress()] = pGpu
for _, leafCell := range podPlacement {
if pLeafCell := leafCell.(*VirtualCell).GetPhysicalCell(); pLeafCell != nil {
bindings[leafCell.GetAddress()] = pLeafCell
continue
}
var bindingPath []*VirtualCell
for c := gpu; c != nil; c = c.GetParent() {
for c := leafCell; c != nil; c = c.GetParent() {
vc := c.(*VirtualCell)
if vc.GetPhysicalCell() != nil || allBindingPathVertices[vc.GetAddress()] != nil {
break
@@ -24,13 +24,14 @@ package algorithm

import (
"fmt"
"math/rand"

"github.com/microsoft/hivedscheduler/pkg/api"
"github.com/microsoft/hivedscheduler/pkg/common"
"github.com/microsoft/hivedscheduler/pkg/internal"
core "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/types"
"k8s.io/klog"
"math/rand"
)

// generatePodScheduleResult writes the scheduling result into a PodScheduleResult.
@@ -40,7 +41,7 @@ func generatePodScheduleResult(
preemptionVictims map[string]common.Set,
waitReason string,
cellLevelToType map[CellChain]map[CellLevel]api.CellType,
currentGpuNum int32,
currentLeafCellNum int32,
currentPodIndex int32,
group *AlgoAffinityGroup,
groupName string,

@@ -63,14 +64,14 @@ func generatePodScheduleResult(
}
// we find the selected node after the preemption is done, otherwise the preemption victims
// may cause the selected node to be excluded from the suggested nodes
affinityGroupBindInfo, selectedNode, selectedGpuIndices, cellChain := generateAffinityGroupBindInfo(
groupPhysicalPlacement, groupVirtualPlacement, cellLevelToType, currentGpuNum, currentPodIndex, group, groupName)
klog.Infof("[%v]: pod is decided to be scheduled to node %v, GPUs %v",
internal.Key(pod), selectedNode, common.ToJson(selectedGpuIndices))
affinityGroupBindInfo, selectedNode, selectedLeafCellIndices, cellChain := generateAffinityGroupBindInfo(
groupPhysicalPlacement, groupVirtualPlacement, cellLevelToType, currentLeafCellNum, currentPodIndex, group, groupName)
klog.Infof("[%v]: pod is decided to be scheduled to node %v, leaf cells %v",
internal.Key(pod), selectedNode, common.ToJson(selectedLeafCellIndices))
return internal.PodScheduleResult{
PodBindInfo: &api.PodBindInfo{
Node: selectedNode,
GpuIsolation: selectedGpuIndices,
LeafCellIsolation: selectedLeafCellIndices,
CellChain: cellChain,
AffinityGroupBindInfo: affinityGroupBindInfo,
},
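For reference, the PodBindInfo built here is persisted as a pod annotation in YAML form. A hypothetical example of how the renamed fields might serialize is sketched below; the keys come from the PodBindInfo yaml tags shown later in this diff, while the node and cell chain values are made up.

package main

import "fmt"

func main() {
    // Hypothetical bind info annotation body in the renamed form (illustrative only).
    bindInfo := `node: node1
leafCellIsolation: [0, 1, 2, 3]
cellChain: V100-NODE-CHAIN
`
    fmt.Print(bindInfo)
}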
@@ -102,71 +103,71 @@ func generatePodPreemptInfo(preemptionVictims map[string]common.Set, pod *core.P
}

// generateAffinityGroupBindInfo translates the physical and virtual placements of an affinity group
// into a a series of AffinityGroupMemberBindInfos, and also returns the allocated node and GPU addresses
// into a series of AffinityGroupMemberBindInfos, and also returns the allocated node and leaf cell addresses
// of the current pod.
func generateAffinityGroupBindInfo(
groupPhysicalPlacement groupPhysicalPlacement,
groupVirtualPlacement groupVirtualPlacement,
cellLevelToType map[CellChain]map[CellLevel]api.CellType,
currentGpuNum int32,
currentLeafCellNum int32,
currentPodIndex int32,
group *AlgoAffinityGroup,
groupName string) (
affinityGroupBindInfo []api.AffinityGroupMemberBindInfo,
selectedNode string,
selectedGpuIndices []int32,
selectedLeafCellIndices []int32,
chain string) {

affinityGroupBindInfo = make([]api.AffinityGroupMemberBindInfo, len(groupPhysicalPlacement))
groupMemberIndex := 0
for podGpuNum, podPhysicalPlacements := range groupPhysicalPlacement {
for podLeafCellNum, podPhysicalPlacements := range groupPhysicalPlacement {
mbi := api.AffinityGroupMemberBindInfo{
PodPlacements: make([]api.PodPlacementInfo, len(podPhysicalPlacements)),
}
for podIndex := int32(0); podIndex < int32(len(podPhysicalPlacements)); podIndex++ {
mbi.PodPlacements[podIndex].PhysicalGpuIndices = make([]int32, podGpuNum)
mbi.PodPlacements[podIndex].PreassignedCellTypes = make([]api.CellType, podGpuNum)
for gpuIndex := int32(0); gpuIndex < podGpuNum; gpuIndex++ {
pGpu := podPhysicalPlacements[podIndex][gpuIndex]
if pGpu == nil {
mbi.PodPlacements[podIndex].PhysicalLeafCellIndices = make([]int32, podLeafCellNum)
mbi.PodPlacements[podIndex].PreassignedCellTypes = make([]api.CellType, podLeafCellNum)
for leafCellIndex := int32(0); leafCellIndex < podLeafCellNum; leafCellIndex++ {
pLeafCell := podPhysicalPlacements[podIndex][leafCellIndex]
if pLeafCell == nil {
if group == nil || group.state == groupPreempting {
panic(fmt.Sprintf("The first pod in group %v was allocated invalid resource", groupName))
}
// if the physical placement of this pod is not found (e.g., removed due to reconfiguration),
// we will insist the decision by retrieving it from other pods
mbi.PodPlacements[podIndex], chain = retrieveMissingPodPlacement(group, podGpuNum, podIndex)
mbi.PodPlacements[podIndex], chain = retrieveMissingPodPlacement(group, podLeafCellNum, podIndex)
klog.Warningf(
"pod placement has been invalid and is retrieved from annotation of other pods: node %v, GPU %v",
mbi.PodPlacements[podIndex].PhysicalNode, mbi.PodPlacements[podIndex].PhysicalGpuIndices[gpuIndex])
"pod placement has been invalid and is retrieved from annotation of other pods: node %v, leaf cell %v",
mbi.PodPlacements[podIndex].PhysicalNode, mbi.PodPlacements[podIndex].PhysicalLeafCellIndices[leafCellIndex])
} else {
nodes, gpuIndices := pGpu.(*PhysicalCell).GetPhysicalPlacement()
// here each cell (i.e., pGpu) is only one GPU, hence we takes the first element
// in its "nodes" and "gpuIndices" as the node and GPU address
nodes, leafCellIndices := pLeafCell.(*PhysicalCell).GetPhysicalPlacement()
// here each cell (i.e., pLeafCell) is only one leaf cell, hence we take the first element
// in its "nodes" and "leafCellIndices" as the node and leaf cell address
if mbi.PodPlacements[podIndex].PhysicalNode == "" {
mbi.PodPlacements[podIndex].PhysicalNode = nodes[0]
}
mbi.PodPlacements[podIndex].PhysicalGpuIndices[gpuIndex] = gpuIndices[0]
mbi.PodPlacements[podIndex].PhysicalLeafCellIndices[leafCellIndex] = leafCellIndices[0]
if groupVirtualPlacement != nil {
vGpu := groupVirtualPlacement[podGpuNum][podIndex][gpuIndex].(*VirtualCell)
mbi.PodPlacements[podIndex].PreassignedCellTypes[gpuIndex] =
cellLevelToType[vGpu.GetChain()][vGpu.GetPreassignedCell().GetLevel()]
vLeafCell := groupVirtualPlacement[podLeafCellNum][podIndex][leafCellIndex].(*VirtualCell)
mbi.PodPlacements[podIndex].PreassignedCellTypes[leafCellIndex] =
cellLevelToType[vLeafCell.GetChain()][vLeafCell.GetPreassignedCell().GetLevel()]
} else {
mbi.PodPlacements[podIndex].PreassignedCellTypes[gpuIndex] = ""
mbi.PodPlacements[podIndex].PreassignedCellTypes[leafCellIndex] = ""
}
}
}
}
if podGpuNum == currentGpuNum {
if podLeafCellNum == currentLeafCellNum {
selectedNode = mbi.PodPlacements[currentPodIndex].PhysicalNode
selectedGpuIndices = mbi.PodPlacements[currentPodIndex].PhysicalGpuIndices
if pGpu := groupPhysicalPlacement[currentGpuNum][currentPodIndex][0]; pGpu != nil {
chain = string(pGpu.GetChain())
selectedLeafCellIndices = mbi.PodPlacements[currentPodIndex].PhysicalLeafCellIndices
if pLeafCell := groupPhysicalPlacement[currentLeafCellNum][currentPodIndex][0]; pLeafCell != nil {
chain = string(pLeafCell.GetChain())
}
}
affinityGroupBindInfo[groupMemberIndex] = mbi
groupMemberIndex++
}
return affinityGroupBindInfo, selectedNode, selectedGpuIndices, chain
return affinityGroupBindInfo, selectedNode, selectedLeafCellIndices, chain
}

// collectBadOrNonSuggestedNodes collects all the nodes that are not within the suggested nodes
@@ -178,14 +179,14 @@ func collectBadOrNonSuggestedNodes(
badOrNonSuggestedNodes common.Set) {

badOrNonSuggestedNodes = common.NewSet()
for gpuNum := range placement {
for podIndex := range placement[gpuNum] {
for _, gpu := range placement[gpuNum][podIndex] {
if gpu == nil {
for leafCellNum := range placement {
for podIndex := range placement[leafCellNum] {
for _, leafCell := range placement[leafCellNum][podIndex] {
if leafCell == nil {
continue
}
nodes, _ := gpu.(*PhysicalCell).GetPhysicalPlacement()
if !gpu.(*PhysicalCell).IsHealthy() ||
nodes, _ := leafCell.(*PhysicalCell).GetPhysicalPlacement()
if !leafCell.(*PhysicalCell).IsHealthy() ||
(!ignoreSuggestedNodes && !suggestedNodes.Contains(nodes[0])) {
badOrNonSuggestedNodes.Add(nodes[0])
}

@@ -196,24 +197,24 @@ func collectBadOrNonSuggestedNodes(
}

// collectPreemptionVictims collects preemption victims of an affinity group.
// If any of the GPUs allocated for the whole group is still used by a pod,
// If any of the leaf cells allocated for the whole group is still used by a pod,
// we will wait for the preemption, as a group is gang-scheduled.
func collectPreemptionVictims(placement groupPhysicalPlacement) (
victimPods map[string]common.Set, overlappingPreemptorGroups common.Set) {

victimPods = map[string]common.Set{} // node -> pods
overlappingPreemptorGroups = common.NewSet()
for gpuNum := range placement {
for podIndex := range placement[gpuNum] {
for _, gpu := range placement[gpuNum][podIndex] {
if gpu == nil {
for leafCellNum := range placement {
for podIndex := range placement[leafCellNum] {
for _, leafCell := range placement[leafCellNum][podIndex] {
if leafCell == nil {
continue
}
pGpu := gpu.(*PhysicalCell)
state := pGpu.GetState()
pLeafCell := leafCell.(*PhysicalCell)
state := pLeafCell.GetState()
if state == cellUsed || state == cellReserving {
// for any victim pod, gang-preempt all the other pods from the same affinity group
for _, pods := range pGpu.GetUsingGroup().allocatedPods {
for _, pods := range pLeafCell.GetUsingGroup().allocatedPods {
for _, v := range pods {
if v != nil {
if _, ok := victimPods[v.Spec.NodeName]; !ok {

@@ -225,7 +226,7 @@ func collectPreemptionVictims(placement groupPhysicalPlacement) (
}
}
if state == cellReserving || state == cellReserved {
overlappingPreemptorGroups.Add(pGpu.GetReservingOrReservedGroup())
overlappingPreemptorGroups.Add(pLeafCell.GetReservingOrReservedGroup())
}
}
}
@@ -246,13 +247,13 @@ func victimsToString(victimPods map[string]common.Set) string {

// retrieveMissingPodPlacement finds the placement of a pod from the annotation of other pods in the same group
// when the pod's placement has been invalid (i.e., not found in the spec).
func retrieveMissingPodPlacement(g *AlgoAffinityGroup, gpuNum int32, podIndex int32) (api.PodPlacementInfo, string) {
func retrieveMissingPodPlacement(g *AlgoAffinityGroup, leafCellNum int32, podIndex int32) (api.PodPlacementInfo, string) {
for _, pods := range g.allocatedPods {
for _, p := range pods {
if p != nil {
info := internal.ExtractPodBindInfo(p)
for _, mbi := range info.AffinityGroupBindInfo {
if gpuNum == int32(len(mbi.PodPlacements[0].PhysicalGpuIndices)) {
if leafCellNum == int32(len(mbi.PodPlacements[0].PhysicalLeafCellIndices)) {
return mbi.PodPlacements[podIndex], info.CellChain
}
}

@@ -260,20 +261,20 @@ func retrieveMissingPodPlacement(g *AlgoAffinityGroup, gpuNum int32, podIndex in
}
}
panic(fmt.Sprintf(
"No allocated pod found in an allocated group %v when retrieving placement for pod %v with GPU number %v", g.name, podIndex, gpuNum))
"No allocated pod found in an allocated group %v when retrieving placement for pod %v with leaf cell number %v", g.name, podIndex, leafCellNum))
}

// retrieveVirtualCell finds the corresponding virtual cell for a physical cell in the placements of an affinity group.
func retrieveVirtualCell(
physicalPlacement groupPhysicalPlacement,
virtualPlacement groupVirtualPlacement,
pGpu *PhysicalCell) (vGpu *VirtualCell) {
pLeafCell *PhysicalCell) (vLeafCell *VirtualCell) {

for gpuNum := range physicalPlacement {
for podIndex := range physicalPlacement[gpuNum] {
for gpuIndex, gpu := range physicalPlacement[gpuNum][podIndex] {
if gpu != nil && CellEqual(gpu, pGpu) {
return virtualPlacement[gpuNum][podIndex][gpuIndex].(*VirtualCell)
for leafCellNum := range physicalPlacement {
for podIndex := range physicalPlacement[leafCellNum] {
for leafCellIndex, leafCell := range physicalPlacement[leafCellNum][podIndex] {
if leafCell != nil && CellEqual(leafCell, pLeafCell) {
return virtualPlacement[leafCellNum][podIndex][leafCellIndex].(*VirtualCell)
}
}
}
|
@ -294,12 +295,12 @@ func getNewPodIndex(pods []*core.Pod) int32 {
|
|||
}
|
||||
|
||||
// getAllocatedPodIndex finds the index of an allocated pod in its group according to its placement.
|
||||
func getAllocatedPodIndex(info *api.PodBindInfo, gpuNum int32) int32 {
|
||||
func getAllocatedPodIndex(info *api.PodBindInfo, leafCellNum int32) int32 {
|
||||
for _, gms := range info.AffinityGroupBindInfo {
|
||||
if gpuNumber := int32(len(gms.PodPlacements[0].PhysicalGpuIndices)); gpuNumber == gpuNum {
|
||||
if leafCellNumber := int32(len(gms.PodPlacements[0].PhysicalLeafCellIndices)); leafCellNumber == leafCellNum {
|
||||
for podIndex, placement := range gms.PodPlacements {
|
||||
if placement.PhysicalNode == info.Node && common.Int32SliceContains(
|
||||
placement.PhysicalGpuIndices, info.GpuIsolation[0]) {
|
||||
placement.PhysicalLeafCellIndices, info.LeafCellIsolation[0]) {
|
||||
return int32(podIndex)
|
||||
}
|
||||
}
|
||||
|
@ -320,19 +321,19 @@ func allPodsReleased(allocatedPods map[int32][]*core.Pod) bool {
|
|||
return true
|
||||
}
|
||||
|
||||
// findPhysicalGpu finds a physical GPU cell in the full list. If the GPU is not found in the chain specified
|
||||
// findPhysicalLeafCell finds a physical leaf cell in the full list. If the leaf cell is not found in the chain specified
|
||||
// in the PodBindInfo (due to reconfiguration), we will try to search in the other chains.
|
||||
func findPhysicalGpu(
|
||||
func findPhysicalLeafCell(
|
||||
fullCellList map[CellChain]ChainCellList,
|
||||
chain CellChain,
|
||||
node string,
|
||||
gpuIndex int32) *PhysicalCell {
|
||||
leafCellIndex int32) *PhysicalCell {
|
||||
|
||||
if g := findPhysicalGpuInChain(fullCellList, chain, node, gpuIndex); g == nil {
|
||||
if g := findPhysicalLeafCellInChain(fullCellList, chain, node, leafCellIndex); g == nil {
|
||||
for c := range fullCellList {
|
||||
if c != chain {
|
||||
if g = findPhysicalGpuInChain(fullCellList, c, node, gpuIndex); g != nil {
|
||||
klog.Warningf("GPU %v on node %v has been moved to chain %v", gpuIndex, node, c)
|
||||
if g = findPhysicalLeafCellInChain(fullCellList, c, node, leafCellIndex); g != nil {
|
||||
klog.Warningf("Leaf cell %v on node %v has been moved to chain %v", leafCellIndex, node, c)
|
||||
return g
|
||||
}
|
||||
}
|
||||
|
@ -343,18 +344,18 @@ func findPhysicalGpu(
|
|||
}
|
||||
}
|
||||
|
||||
// findPhysicalGpuInChain finds a physical GPU cell in the full list of a given chain. This search is based on
|
||||
// *one* node and *one* GPU index, assuming there is no resource overlapping among cells at the same level.
|
||||
func findPhysicalGpuInChain(
|
||||
// findPhysicalLeafCellInChain finds a physical leaf cell in the full list of a given chain. This search is based on
|
||||
// *one* node and *one* leaf cell index, assuming there is no resource overlapping among cells at the same level.
|
||||
func findPhysicalLeafCellInChain(
|
||||
fullCellList map[CellChain]ChainCellList,
|
||||
chain CellChain,
|
||||
node string,
|
||||
gpuIndex int32) *PhysicalCell {
|
||||
leafCellIndex int32) *PhysicalCell {
|
||||
|
||||
for _, c := range fullCellList[chain][1] {
|
||||
success := false
|
||||
cc := c.(*PhysicalCell)
|
||||
nodes, gpuIndices := cc.GetPhysicalPlacement()
|
||||
nodes, leafCellIndices := cc.GetPhysicalPlacement()
|
||||
for _, n := range nodes {
|
||||
if n == node {
|
||||
success = true
|
||||
|
@ -362,11 +363,11 @@ func findPhysicalGpuInChain(
|
|||
}
|
||||
}
|
||||
if success {
|
||||
if gpuIndex < 0 {
|
||||
if leafCellIndex < 0 {
|
||||
return cc
|
||||
} else {
|
||||
for _, g := range gpuIndices {
|
||||
if g == gpuIndex {
|
||||
for _, g := range leafCellIndices {
|
||||
if g == leafCellIndex {
|
||||
return cc
|
||||
}
|
||||
}
|
||||
|
@ -392,7 +393,7 @@ func inFreeCellList(c *PhysicalCell) bool {
|
|||
// setCellState sets state for a cell and its parent recursively. A parent cell will be in Used state
|
||||
// if any of its children is in Used state. For the other states (Free, Reserving, Reserved),
|
||||
// a parent will be in the state if all of this children are in the state.
|
||||
// setCellState always starts from the lowest level, i.e., GPU-level cells.
|
||||
// setCellState always starts from the lowest level, i.e., leaf-level cells.
|
||||
func setCellState(c *PhysicalCell, s CellState) {
|
||||
c.SetState(s)
|
||||
if c.GetParent() != nil {
|
||||
|
@ -418,7 +419,7 @@ func allChildrenSameState(c *PhysicalCell, s CellState) bool {
|
|||
func generateOTVirtualCell(pc *api.PhysicalCellStatus) *api.VirtualCellStatus {
|
||||
vc := &api.VirtualCellStatus{
|
||||
CellStatus: api.CellStatus{
|
||||
GpuType: pc.GpuType,
|
||||
LeafCellType: pc.LeafCellType,
|
||||
CellType: pc.CellType,
|
||||
CellAddress: pc.CellAddress + "-opp",
|
||||
CellState: api.CellState(cellUsed),
|
||||
|
|
|
@@ -24,11 +24,12 @@ package api

import (
"fmt"
"github.com/microsoft/hivedscheduler/pkg/common"
"io/ioutil"
"os"

"github.com/microsoft/hivedscheduler/pkg/common"
"k8s.io/client-go/rest"
"k8s.io/client-go/tools/clientcmd"
"os"
)

type Config struct {

@@ -72,7 +73,7 @@ type Config struct {
WaitingPodSchedulingBlockMilliSec *int64 `yaml:"waitingPodSchedulingBlockMilliSec"`

// Specify the whole physical cluster
// TODO: Automatically construct it based on node info from GPU and Network Device Plugins
// TODO: Automatically construct it based on node info from Device Plugins
PhysicalCluster *PhysicalClusterSpec `yaml:"physicalCluster"`

// Specify all the virtual clusters belonging to the physical cluster

@@ -148,7 +149,7 @@ func inferPhysicalCellSpec(
return
}
if ct.IsNodeLevel {
// reset default address to 0 when found a node level cell, leaf cell will use it as gpu indices
// reset default address to 0 when found a node level cell, leaf cell will use it as indices
defaultAddress = 0
}
if ct.ChildCellNumber > 0 && len(spec.CellChildren) == 0 {
@@ -44,23 +44,10 @@ const (
// PodSchedulingSpec YAML format.
AnnotationKeyPodSchedulingSpec = GroupName + "/pod-scheduling-spec"

// To leverage this scheduler, if one container in the Pod want to use the
// allocated GPUs for the whole Pod, it should contain below env.
// env:
// - name: NVIDIA_VISIBLE_DEVICES
//   valueFrom:
//     fieldRef:
//       fieldPath: metadata.annotations['hivedscheduler.microsoft.com/pod-gpu-isolation']
// The annotation referred by the env will be populated by scheduler when bind the pod.
//
// Notes:
// 1. The scheduler directly delivers GPU isolation decision to
//    nvidia-container-runtime through Pod Env: NVIDIA_VISIBLE_DEVICES.
// 2. If multiple containers in the Pod contain the env, the allocated GPUs are
//    all visible to them, so it is these containers' freedom to control how
//    to share these GPUs.
EnvNameNvidiaVisibleDevices = "NVIDIA_VISIBLE_DEVICES"
AnnotationKeyPodGpuIsolation = GroupName + "/pod-gpu-isolation"
// To leverage this scheduler, the Pod could reference below annotation to
// use the allocated leaf cells for the whole Pod.
AnnotationKeyPodLeafCellIsolation = GroupName + "/pod-leaf-cell-isolation"
DeprecatedAnnotationKeyPodGpuIsolation = GroupName + "/pod-gpu-isolation"

// Populated by this scheduler, used to track and recover allocated placement.
// It is in PodBindInfo YAML format.
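Since the NVIDIA_VISIBLE_DEVICES example was removed from the comment in this hunk, here is a minimal sketch (not part of this commit) of how a container could keep consuming the isolation decision through the renamed annotation via the Kubernetes downward API; the env construction below is an illustrative assumption.

package main

import (
    "fmt"

    core "k8s.io/api/core/v1"
)

func main() {
    // Expose the scheduler-populated leaf cell isolation annotation to the container,
    // replacing a reference to the old pod-gpu-isolation key.
    env := core.EnvVar{
        Name: "NVIDIA_VISIBLE_DEVICES",
        ValueFrom: &core.EnvVarSource{
            FieldRef: &core.ObjectFieldSelector{
                FieldPath: "metadata.annotations['hivedscheduler.microsoft.com/pod-leaf-cell-isolation']",
            },
        },
    }
    fmt.Println(env.Name, "<-", env.ValueFrom.FieldRef.FieldPath)
}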
@@ -24,6 +24,7 @@ package api

import (
"fmt"

meta "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/types"
)

@@ -78,8 +79,8 @@ type PodSchedulingSpec struct {
VirtualCluster VirtualClusterName `yaml:"virtualCluster"`
Priority int32 `yaml:"priority"`
PinnedCellId PinnedCellId `yaml:"pinnedCellId"`
GpuType string `yaml:"gpuType"`
GpuNumber int32 `yaml:"gpuNumber"`
LeafCellType string `yaml:"leafCellType"`
LeafCellNumber int32 `yaml:"leafCellNumber"`
GangReleaseEnable bool `yaml:"gangReleaseEnable"`
LazyPreemptionEnable bool `yaml:"lazyPreemptionEnable"`
IgnoreK8sSuggestedNodes bool `yaml:"ignoreK8sSuggestedNodes"`
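For reference, a pod-scheduling-spec annotation written against the renamed fields could look like the following hypothetical example; the keys come from the yaml tags above, while the cluster and cell type values are made up.

package main

import "fmt"

func main() {
    // Hypothetical hivedscheduler.microsoft.com/pod-scheduling-spec annotation body
    // using the renamed leafCellType/leafCellNumber keys.
    spec := `virtualCluster: vc1
priority: 0
leafCellType: K80
leafCellNumber: 4
gangReleaseEnable: false
lazyPreemptionEnable: true
`
    fmt.Print(spec)
}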
@@ -92,15 +93,15 @@ type AffinityGroupSpec struct {
}

type AffinityGroupMemberSpec struct {
PodNumber int32 `yaml:"podNumber"`
GpuNumber int32 `yaml:"gpuNumber"`
PodNumber int32 `yaml:"podNumber"`
LeafCellNumber int32 `yaml:"leafCellNumber"`
}

// Used to recover scheduler allocated resource
type PodBindInfo struct {
Node string `yaml:"node"` // node to bind
GpuIsolation []int32 `yaml:"gpuIsolation"` // GPUs to bind
CellChain string `yaml:"cellChain"` // cell chain selected
Node string `yaml:"node"` // node to bind
LeafCellIsolation []int32 `yaml:"leafCellIsolation"` // leaf cells to bind
CellChain string `yaml:"cellChain"` // cell chain selected
AffinityGroupBindInfo []AffinityGroupMemberBindInfo `yaml:"affinityGroupBindInfo"`
}

@@ -109,8 +110,8 @@ type AffinityGroupMemberBindInfo struct {
}

type PodPlacementInfo struct {
PhysicalNode string `yaml:"physicalNode"`
PhysicalGpuIndices []int32 `yaml:"physicalGpuIndices"`
PhysicalNode string `yaml:"physicalNode"`
PhysicalLeafCellIndices []int32 `yaml:"physicalLeafCellIndices"`
// preassigned cell types used by the pods. used to locate the virtual cells
// when adding an allocated pod
PreassignedCellTypes []CellType `yaml:"preassignedCellTypes"`

@@ -156,7 +157,7 @@ type AffinityGroupStatus struct {
VC VirtualClusterName `json:"vc"`
Priority int32 `json:"priority"`
State AffinityGroupState `json:"state"`
PhysicalPlacement map[string][]int32 `json:"physicalPlacement,omitempty"` // node -> GPU indices
PhysicalPlacement map[string][]int32 `json:"physicalPlacement,omitempty"` // node -> leaf cell indices
VirtualPlacement map[CellAddress][]CellAddress `json:"virtualPlacement,omitempty"` // preassigned cell -> leaf cells
AllocatedPods []types.UID `json:"allocatedPods,omitempty"`
PreemptingPods []types.UID `json:"preemptingPods,omitempty"`

@@ -181,9 +182,9 @@ const (
)

type CellStatus struct {
GpuType string `json:"gpuType,omitempty"`
CellType CellType `json:"cellType"`
IsNodeLevel bool `json:"isNodeLevel,omitempty"`
LeafCellType string `json:"leafCellType,omitempty"`
CellType CellType `json:"cellType"`
IsNodeLevel bool `json:"isNodeLevel,omitempty"`
// Address of a physical cell consists of its address (or index) in each level
// (e.g., node0/0/0/0 may represent node0, CPU socket 0, PCIe switch 0, GPU 0.
// Address of a virtual cell consists of its VC name, index of the preassigned cell,
@@ -24,6 +24,9 @@ package internal

import (
"fmt"
"net/http"
"strings"

si "github.com/microsoft/hivedscheduler/pkg/api"
"github.com/microsoft/hivedscheduler/pkg/common"
core "k8s.io/api/core/v1"

@@ -32,7 +35,6 @@ import (
"k8s.io/client-go/rest"
"k8s.io/client-go/tools/cache"
"k8s.io/klog"
"net/http"
)

func CreateClient(kConfig *rest.Config) kubeClient.Interface {

@@ -175,19 +177,30 @@ func NewBindingPod(pod *core.Pod, podBindInfo *si.PodBindInfo) *core.Pod {
if bindingPod.Annotations == nil {
bindingPod.Annotations = map[string]string{}
}
bindingPod.Annotations[si.AnnotationKeyPodGpuIsolation] =
common.ToIndicesString(podBindInfo.GpuIsolation)
bindingPod.Annotations[si.AnnotationKeyPodLeafCellIsolation] =
common.ToIndicesString(podBindInfo.LeafCellIsolation)
bindingPod.Annotations[si.AnnotationKeyPodBindInfo] =
common.ToYaml(podBindInfo)

return bindingPod
}
// converts old spec annotations for backward compatibility
func convertOldAnnotation(annotation string) string {
r := strings.NewReplacer(
"gpuType", "leafCellType",
"gpuNumber", "leafCellNumber",
"gpuIsolation", "leafCellIsolation",
"physicalGpuIndices", "physicalLeafCellIndices",
)
return r.Replace(annotation)
}
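A quick illustration of what this replacer does to an old-style annotation. Because convertOldAnnotation is unexported, the standalone sketch below re-creates the same strings.NewReplacer; the YAML values are hypothetical.

package main

import (
    "fmt"
    "strings"
)

func main() {
    // Old-style pod scheduling spec keys, as written before this rename.
    old := "virtualCluster: vc1\npriority: 10\ngpuType: V100\ngpuNumber: 4\n"

    // Same key rewriting that convertOldAnnotation applies before deserialization.
    r := strings.NewReplacer(
        "gpuType", "leafCellType",
        "gpuNumber", "leafCellNumber",
        "gpuIsolation", "leafCellIsolation",
        "physicalGpuIndices", "physicalLeafCellIndices",
    )
    fmt.Print(r.Replace(old))
    // Output keeps the values and rewrites only the keys:
    // virtualCluster: vc1
    // priority: 10
    // leafCellType: V100
    // leafCellNumber: 4
}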
// PodBindInfo comes from internal, so we just need to assert it during deserialization.
func ExtractPodBindInfo(allocatedPod *core.Pod) *si.PodBindInfo {
podBindInfo := si.PodBindInfo{}

annotation := allocatedPod.Annotations[si.AnnotationKeyPodBindInfo]
annotation := convertOldAnnotation(allocatedPod.Annotations[si.AnnotationKeyPodBindInfo])
if annotation == "" {
panic(fmt.Errorf(
"Pod does not contain or contains empty annotation: %v",

@@ -199,9 +212,16 @@ func ExtractPodBindInfo(allocatedPod *core.Pod) *si.PodBindInfo {
}

func ExtractPodBindAnnotations(allocatedPod *core.Pod) map[string]string {
return map[string]string{
si.AnnotationKeyPodGpuIsolation: allocatedPod.Annotations[si.AnnotationKeyPodGpuIsolation],
si.AnnotationKeyPodBindInfo: allocatedPod.Annotations[si.AnnotationKeyPodBindInfo],
if _, ok := allocatedPod.Annotations[si.AnnotationKeyPodLeafCellIsolation]; ok {
return map[string]string{
si.AnnotationKeyPodLeafCellIsolation: allocatedPod.Annotations[si.AnnotationKeyPodLeafCellIsolation],
si.AnnotationKeyPodBindInfo: allocatedPod.Annotations[si.AnnotationKeyPodBindInfo],
}
} else {
return map[string]string{
si.AnnotationKeyPodLeafCellIsolation: allocatedPod.Annotations[si.DeprecatedAnnotationKeyPodGpuIsolation],
si.AnnotationKeyPodBindInfo: convertOldAnnotation(allocatedPod.Annotations[si.AnnotationKeyPodBindInfo]),
}
}
}
@@ -214,7 +234,7 @@ func ExtractPodSchedulingSpec(pod *core.Pod) *si.PodSchedulingSpec {

podSchedulingSpec := si.PodSchedulingSpec{}

annotation := pod.Annotations[si.AnnotationKeyPodSchedulingSpec]
annotation := convertOldAnnotation(pod.Annotations[si.AnnotationKeyPodSchedulingSpec])
if annotation == "" {
panic(fmt.Errorf(errPfx + "Annotation does not exist or is empty"))
}

@@ -226,8 +246,8 @@ func ExtractPodSchedulingSpec(pod *core.Pod) *si.PodSchedulingSpec {
podSchedulingSpec.AffinityGroup = &si.AffinityGroupSpec{
Name: fmt.Sprintf("%v/%v", pod.Namespace, pod.Name),
Members: []si.AffinityGroupMemberSpec{{
PodNumber: 1,
GpuNumber: podSchedulingSpec.GpuNumber},
PodNumber: 1,
LeafCellNumber: podSchedulingSpec.LeafCellNumber},
},
}
}

@@ -242,8 +262,8 @@ func ExtractPodSchedulingSpec(pod *core.Pod) *si.PodSchedulingSpec {
if podSchedulingSpec.Priority > si.MaxGuaranteedPriority {
panic(fmt.Errorf(errPfx+"Priority is greater than %v", si.MaxGuaranteedPriority))
}
if podSchedulingSpec.GpuNumber <= 0 {
panic(fmt.Errorf(errPfx + "GpuNumber is non-positive"))
if podSchedulingSpec.LeafCellNumber <= 0 {
panic(fmt.Errorf(errPfx + "LeafCellNumber is non-positive"))
}
if podSchedulingSpec.AffinityGroup.Name == "" {
panic(fmt.Errorf(errPfx + "AffinityGroup.Name is empty"))

@@ -254,10 +274,10 @@ func ExtractPodSchedulingSpec(pod *core.Pod) *si.PodSchedulingSpec {
if member.PodNumber <= 0 {
panic(fmt.Errorf(errPfx + "AffinityGroup.Members has non-positive PodNumber"))
}
if member.GpuNumber <= 0 {
panic(fmt.Errorf(errPfx + "AffinityGroup.Members has non-positive GpuNumber"))
if member.LeafCellNumber <= 0 {
panic(fmt.Errorf(errPfx + "AffinityGroup.Members has non-positive LeafCellNumber"))
}
if member.GpuNumber == podSchedulingSpec.GpuNumber {
if member.LeafCellNumber == podSchedulingSpec.LeafCellNumber {
isPodInGroup = true
}
}

@@ -287,10 +307,10 @@ func BindPod(kClient kubeClient.Interface, bindingPod *core.Pod) {
panic(fmt.Errorf("Failed to bind Pod: %v", err))
}

klog.Infof("[%v]: Succeeded to bind Pod on node %v, gpus %v",
klog.Infof("[%v]: Succeeded to bind Pod on node %v, leaf cells %v",
Key(bindingPod),
bindingPod.Spec.NodeName,
bindingPod.Annotations[si.AnnotationKeyPodGpuIsolation])
bindingPod.Annotations[si.AnnotationKeyPodLeafCellIsolation])
}

func NewBadRequestError(message string) *si.WebServerError {
|
|||
|
||||
import (
|
||||
"fmt"
|
||||
"sync"
|
||||
"time"
|
||||
|
||||
"github.com/microsoft/hivedscheduler/pkg/algorithm"
|
||||
si "github.com/microsoft/hivedscheduler/pkg/api"
|
||||
"github.com/microsoft/hivedscheduler/pkg/common"
|
||||
|
@ -39,8 +42,6 @@ import (
|
|||
"k8s.io/client-go/tools/cache"
|
||||
"k8s.io/klog"
|
||||
ei "k8s.io/kubernetes/pkg/scheduler/api"
|
||||
"sync"
|
||||
"time"
|
||||
)
|
||||
|
||||
// HivedScheduler is the scheduling framework which serves as the bridge between
|
||||
|
@ -437,9 +438,9 @@ func (s *HivedScheduler) shouldForceBind(
|
|||
// based on current status, the retried Pod should can be scheduled on suitable
|
||||
// placement decision eventually.
|
||||
// Thus, the problematic decision can only be stale decision, i.e. only newly
|
||||
// bad GPUs or newly deleted Nodes will lead Pod retried.
|
||||
// For newly bad GPUs, it is like the normal behaviour that a pod will fail
|
||||
// after the GPUs it runs on become unhealthy.
|
||||
// bad devices or newly deleted Nodes will lead Pod retried.
|
||||
// For newly bad devices, it is like the normal behaviour that a pod will fail
|
||||
// after the devices it runs on become unhealthy.
|
||||
// For newly deleted Nodes, it is like the normal behaviour that a pod will
|
||||
// be deleted by the GarbageCollectionController after the node it runs on is
|
||||
// deleted.
|
||||
|
|