* Update feature demo examples.
* Add defaulting for `ignoreK8sSuggestedNodes`.
* Fix sort in `getUsablePhysicalCells`.
This commit is contained in:
Yifan Xiong 2020-08-20 15:07:58 +08:00 committed by GitHub
Parent a8b0aa7d5b
Commit df185eecde
No key found matching this signature
GPG key ID: 4AEE18F83AFDEB23
41 changed files with 70 additions and 136 deletions

View File

@ -40,7 +40,6 @@ HiveD supports multiple job **priorities**. Higher-priority jobs can **[preempt]
5. [Priorities](example/feature/README.md#Guaranteed-Job), [Overuse with Low Priority](example/feature/README.md#Opportunistic-Job), and [Inter-](example/feature/README.md#Inter-VC-Preemption)/[Intra-VC Preemption](example/feature/README.md#Intra-VC-Preemption)
6. [Job (Full/Partial) Gang Scheduling/Preemption](example/feature/README.md#Gang-Scheduling)
7. Fault-Tolerance, [Bad Hardware Awareness](example/feature/README.md#Bad-Hardware-Awareness), [Work-Preserving Reconfiguration](example/feature/README.md#Work-Preserving-Reconfiguration)
8. [Leverage K8S Default Scheduler](example/feature/README.md#Leverage-K8S-Default-Scheduler)
## Prerequisite
1. A Kubernetes cluster, v1.14.2 or above, on-cloud or on-premise.

View File

@ -11,7 +11,7 @@ HiveD guarantees **quota safety for all VCs**, in the sense that the requests to
A VC's cells can be described by Hardware Quantity, [Topology](#VC-Safety), [Type](#SKU-Type), [Pinned Cells](#Pinned-Cells), etc. To guarantee safety, HiveD never allows a VC to "invade" other VCs' cells. For example, to guarantee all VCs' topology, one VC's [guaranteed jobs](#Guaranteed-Job) should never cause fragmentation inside other VCs:
Two DGX-2s, two VCs each owns one DGX-2 node. For a traditional scheduler, this will translate into two VCs each owning 16 GPUs. When a user submits 16 1-GPU jobs to VC1, the user in VC2 might not be able to run a 16-GPU job, due to possible fragmentation issue caused by VC1. While HiveD can guarantee each VC always has one entire node available for its dedicated use.
Consider two DGX-2s and two VCs, each owning one DGX-2 node. For a traditional scheduler, this translates into two VCs each owning 16 GPUs. When a user submits 16 1-GPU jobs to vc1, the user in vc2 might not be able to run a 16-GPU job, due to possible fragmentation caused by vc1. HiveD, in contrast, guarantees that each VC always has one entire node available for its dedicated use.
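For illustration, a minimal `virtualClusters` sketch for this two-DGX-2 setup could look like the following (the DGX-2 cell type names are hypothetical and not taken from the example configs):

```yaml
# Hypothetical sketch: each VC is guaranteed one whole DGX-2 node cell,
# so jobs in one VC cannot fragment the node owned by the other VC.
virtualClusters:
  vc1:
    virtualCells:
      - cellType: DGX2-NODE-POOL.DGX2-NODE
        cellNumber: 1
  vc2:
    virtualCells:
      - cellType: DGX2-NODE-POOL.DGX2-NODE
        cellNumber: 1
```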
### Reproduce Steps
1. Use [hived-config-1](file/hived-config-1.yaml).
@ -27,7 +27,7 @@ This is similar to [K8S Taints and Tolerations](https://kubernetes.io/docs/conce
### Reproduce Steps
1. Use [hived-config-8](file/hived-config-8.yaml).
2. Submit job [itc-pin](file/itc-pin.yaml) to VC1, all tasks in task role vc1pinned will be on node 10.151.41.25 (which is pinned), all tasks in task role vc1nopinned will NOT be on node 10.151.41.25.
2. Submit job [itc-pin](file/itc-pin.yaml) to vc1; all tasks in task role vc1pinned will be placed on node 10.151.41.25 (which is pinned), while all tasks in task role vc1nopinned will NOT be placed on node 10.151.41.25. A sketch of the pinned-cell configuration follows the screenshot below.
<img src="file/itc-pin.png" width="900"/>
## SKU Type
@ -68,8 +68,8 @@ This is useful for jobs that cannot perform any useful work, such as making prog
<img src="file/itc-gang4.png" width="900"/>
#### TensorFlow Distributed Training
1. Use [hived-config-1](file/hived-config-1.yaml).
2. Submit job [itc-dtf](file/itc-dtf.yaml) to VC2, it will success.
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-dtf](file/itc-dtf.yaml) to the default VC; it will succeed. A sketch of the gang-scheduling job `extras` follows the screenshot below.
<img src="file/itc-dtf.png" width="900"/>
## Incremental Scheduling
@ -110,27 +110,28 @@ Within one VC, a high-priority job can preempt low-priority jobs.
### Reproduce Steps
#### Immediate Preemption
1. Use [hived-config-3](file/hived-config-3.yaml).
2. Submit [itc-intra-imd-preempt-test](file/itc-intra-imd-preempt-test.yaml), which requests for 4 M60 GPUs for VC1 with test (0) priority.
3. Submit [itc-intra-imd-preempt-prod](file/itc-intra-imd-preempt-prod.yaml), which also requests for 4 M60 GPUs for VC1 with prod (100) priority. The job will preempt the test job immediately, so the test job is retried and waiting for resource.
2. Submit [itc-intra-imd-preempt-test](file/itc-intra-imd-preempt-test.yaml), which requests 4 M60 GPUs in vc1 with test (0) priority.
3. Submit [itc-intra-imd-preempt-prod](file/itc-intra-imd-preempt-prod.yaml), which also requests 4 M60 GPUs in vc1, but with prod (100) priority. The prod job will preempt the test job immediately, so the test job is retried and waits for resources.
<img src="file/itc-intra-imd-preempt-test.png" width="900"/>
<img src="file/itc-intra-imd-preempt-prod.png" width="900"/>
#### Lazy Preemption
1. Use [hived-config-3](file/hived-config-3.yaml).
2. Submit [itc-intra-lazy-preempt-test](file/itc-intra-lazy-preempt-test.yaml), which requests for 4 K80 GPUs for VC1 with test (0) priority.
3. Submit [itc-intra-lazy-preempt-prod](file/itc-intra-lazy-preempt-prod.yaml), which also requests for 4 K80 GPUs for VC1 with prod (100) priority. The job will just downgrade the test job to be [Opportunistic Job](#Opportunistic-Job), instead of preempting it immediately, because all jobs can still fit into the whole physical cluster.
2. Submit [itc-intra-lazy-preempt-test](file/itc-intra-lazy-preempt-test.yaml), which requests 4 K80 GPUs in vc1 with test (0) priority.
3. Submit [itc-intra-lazy-preempt-prod](file/itc-intra-lazy-preempt-prod.yaml), which also requests 4 K80 GPUs in vc1, but with prod (100) priority. The prod job will just downgrade the test job to an [Opportunistic Job](#Opportunistic-Job) instead of preempting it immediately, because all jobs can still fit into the whole physical cluster.
4. Submit [itc-intra-lazy-preempt-prod2](file/itc-intra-lazy-preempt-prod2.yaml), which requests 3 * 4 K80 GPUs in the default VC with prod (100) priority. This job will preempt the test job immediately, because now not all jobs can fit into the whole physical cluster.
<img src="file/itc-intra-lazy-preempt-test.png" width="900"/>
<img src="file/itc-intra-lazy-preempt-prod.png" width="900"/>
<img src="file/itc-intra-lazy-preempt-prod2.png" width="900"/>
> NOTE: The `lazyPreemptionEnable` option is disabled by default, because an earlier job may be downgraded to a low-priority job and then get preempted by later jobs, which can be confusing.
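For reference, both the priority class and lazy preemption are set in the job's `extras`; here is a rough sketch, assuming `lazyPreemptionEnable` sits alongside `jobPriorityClass` under `hivedScheduler` as in the updated example jobs (values are illustrative):

```yaml
# Sketch: a low-priority job that opts in to lazy preemption.
extras:
  hivedScheduler:
    jobPriorityClass: test        # oppo (-1), test (0), or prod (100)
    lazyPreemptionEnable: true    # disabled by default; see the note above
    taskRoles:
      train:
        skuType: K80
```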
## Inter-VC Preemption
### Description
One VC's [Guaranteed Job](#Guaranteed-Job) can preempt other VCs' [Opportunistic Jobs](#Opportunistic-Job).
### Reproduce Steps
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit [itc-inter-preempt-oppo](file/itc-inter-preempt-oppo.yaml), which requests for 2 * 4 K80 GPUs for VC1 with oppo (-1) priority.
1. Use [hived-config-3](file/hived-config-3.yaml).
2. Submit [itc-inter-preempt-oppo](file/itc-inter-preempt-oppo.yaml), which requests 2 * 4 K80 GPUs in vc1 with oppo (-1) priority.
3. Submit [itc-inter-preempt-prod](file/itc-inter-preempt-prod.yaml), which requests 3 * 4 K80 GPUs in the default VC with prod (100) priority. The prod job will preempt the oppo job immediately.
<img src="file/itc-inter-preempt-oppo.png" width="900"/>
<img src="file/itc-inter-preempt-prod.png" width="900"/>
@ -190,20 +191,20 @@ HiveD can be reconfigured without unnecessary user impacts, such as add/update/d
#### VirtualCluster Reconfig - Delete VirtualCluster
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) to default VC. Wait until it is running.
3. Delete the default VC and move its quota to VC1, then becomes [hived-config-5](file/hived-config-5.yaml).
3. Delete the default VC and move its quota to vc1, so the config becomes [hived-config-5](file/hived-config-5.yaml).
4. Use [hived-config-5](file/hived-config-5.yaml), and restart HiveD.
5. The job will still run without any interruption, but it is [lazy preempted](#Lazy-Preemption) by HiveD.
<img src="file/itc-reconfig-3.png" width="900"/>
6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-4](file/itc-reconfig-4.yaml) to VC1 which requests all K80 nodes. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).
6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-4](file/itc-reconfig-4.yaml), which requests all K80 nodes, to vc1. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).
<img src="file/itc-reconfig-4.png" width="900"/>
#### VirtualCluster Reconfig - Update VirtualCluster
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) to default VC. Wait until it is running.
3. Move one K80-NODE cell from default VC to VC1, then becomes [hived-config-6](file/hived-config-6.yaml).
3. Move one K80-NODE cell from the default VC to vc1, so the config becomes [hived-config-6](file/hived-config-6.yaml).
4. Use [hived-config-6](file/hived-config-6.yaml), and restart HiveD.
5. The job will still run without any interruption, but it is [lazy preempted](#Lazy-Preemption) by HiveD.
6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-5](file/itc-reconfig-5.yaml) to VC1 which requests all K80 nodes. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).
6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-5](file/itc-reconfig-5.yaml), which requests all K80 nodes, to vc1. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).
<img src="file/itc-reconfig-5.png" width="900"/>
## Bad Hardware Awareness
@ -219,17 +220,3 @@ Avoid scheduling pods to bad hardware.
4. Bring back 10.151.41.26 by `sudo systemctl start kubelet`. Wait until this is detected by K8S.
5. The waiting job will start running, without any retries.
<img src="file/itc-badnode50-3.png" width="900"/>
## Leverage K8S Default Scheduler
### Description
You can still leverage almost all scheduling features provided by your existing [K8S Default Scheduler](https://kubernetes.io/docs/concepts/scheduling/kube-scheduler) with HiveD, such as these [Filtering Policies](https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/#filtering).
### Reproduce Steps
#### Leverage [Labels and Selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels)
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Remove PAI worker label for 10.151.41.26 (the only M60 node).
3. Submit job [itc-no-worker-label](file/itc-no-worker-label.yaml), which requests M60 node, it will be waiting without IP associated.
<img src="file/itc-no-worker-label-1.png" width="900"/>
4. Add back PAI worker label for 10.151.41.26.
5. The waiting job will start running, without any retries.
<img src="file/itc-no-worker-label-2.png" width="900"/>

View File

@ -34,11 +34,11 @@ physicalCluster:
- cellAddress: 10.151.41.24
virtualClusters:
VC1:
vc1:
virtualCells:
- cellType: 3-K80-NODE.K80-NODE
cellNumber: 1
VC2:
vc2:
virtualCells:
- cellType: 3-K80-NODE.K80-NODE
cellNumber: 1

View File

@ -45,7 +45,7 @@ physicalCluster:
- cellAddress: 10.151.41.26
virtualClusters:
VC1:
vc1:
virtualCells:
- cellType: K80-NODE-POOL.K80-NODE
cellNumber: 1

View File

@ -45,7 +45,7 @@ physicalCluster:
- cellAddress: 10.151.41.26
virtualClusters:
VC1:
vc1:
virtualCells:
- cellType: K80-NODE-POOL.K80-NODE
cellNumber: 1

View File

@ -42,7 +42,7 @@ physicalCluster:
# - cellAddress: 10.151.41.25
virtualClusters:
VC1:
vc1:
virtualCells:
- cellType: K80-NODE-POOL.K80-NODE
cellNumber: 1

View File

@ -45,7 +45,7 @@ physicalCluster:
- cellAddress: 10.151.41.25
virtualClusters:
VC1:
vc1:
virtualCells:
- cellType: K80-NODE-POOL.K80-NODE
cellNumber: 1

View File

@ -45,7 +45,7 @@ physicalCluster:
- cellAddress: 10.151.41.26
virtualClusters:
VC1:
vc1:
virtualCells:
- cellType: K80-NODE-POOL.K80-NODE
cellNumber: 4

View File

@ -45,7 +45,7 @@ physicalCluster:
- cellAddress: 10.151.41.26
virtualClusters:
VC1:
vc1:
virtualCells:
- cellType: K80-NODE-POOL.K80-NODE
cellNumber: 3

View File

@ -44,7 +44,7 @@ physicalCluster:
- cellAddress: 10.151.41.26
virtualClusters:
VC1:
vc1:
virtualCells:
- cellType: K80-NODE-POOL.K80-NODE
cellNumber: 1

View File

@ -34,13 +34,13 @@ physicalCluster:
- cellAddress: 10.151.41.24
virtualClusters:
VC1:
vc1:
virtualCells:
- cellType: 3-K80-NODE.K80-NODE
cellNumber: 1
pinnedCells:
- pinnedCellId: VC1-K80
VC2:
vc2:
virtualCells:
- cellType: 3-K80-NODE.K80-NODE
cellNumber: 1

View File

@ -11,18 +11,17 @@ taskRoles:
instances: 1
completion:
minFailedInstances: 1
minSucceededInstances: 6
minSucceededInstances: 1
dockerImage: keras_tensorflow_example
resourcePerInstance:
cpu: 4
memoryMB: 8192
gpu: 1
commands:
- nvidia-smi -L
- printenv
- sleep 10000
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
gangAllocation: true
hivedScheduler:
@ -30,4 +29,3 @@ extras:
taskRoles:
train:
skuType: M60
submitFrom: submit-job-v2

View File

@ -20,9 +20,9 @@ taskRoles:
commands:
- nvidia-smi -L
- printenv
- sleep 10000
- sleep 10m
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
gangAllocation: true
hivedScheduler:
@ -30,4 +30,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

View File

@ -100,5 +100,3 @@ deployments:
- echo "Uploading data ..."
defaults:
deployment: tf_example
extras:
submitFrom: submit-job-v2

View File

@ -21,7 +21,7 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
gangAllocation: false
hivedScheduler:
@ -29,4 +29,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

View File

@ -21,7 +21,7 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
gangAllocation: true
hivedScheduler:
@ -29,4 +29,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

View File

@ -11,7 +11,7 @@ taskRoles:
instances: 4
completion:
minFailedInstances: 1
minSucceededInstances: 6
minSucceededInstances: 4
dockerImage: keras_tensorflow_example
resourcePerInstance:
cpu: 4
@ -21,7 +21,7 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
gangAllocation: true
hivedScheduler:
@ -29,4 +29,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

View File

@ -20,13 +20,12 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
- sleep 10000
- sleep 10m
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: oppo
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

View File

@ -20,7 +20,7 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
- sleep 10000
- sleep 10m
defaults:
virtualCluster: default
extras:
@ -29,4 +29,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

View File

@ -20,13 +20,12 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
- sleep 10000
- sleep 10m
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: prod
taskRoles:
train:
skuType: M60
submitFrom: submit-job-v2

View File

@ -20,13 +20,12 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
- sleep 10000
- sleep 10m
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: test
taskRoles:
train:
skuType: M60
submitFrom: submit-job-v2

View File

@ -20,13 +20,12 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
- sleep 10000
- sleep 10m
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: prod
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

View File

@ -20,7 +20,7 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
- sleep 10000
- sleep 10m
defaults:
virtualCluster: default
extras:
@ -29,4 +29,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

View File

@ -20,13 +20,13 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
- sleep 10000
- sleep 10m
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: test
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2
lazyPreemptionEnable: true

View File

@ -20,8 +20,9 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
- sleep 10m
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
gangAllocation: false
hivedScheduler:
@ -29,4 +30,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

View File

@ -1,5 +1,5 @@
protocolVersion: 2
name: itc-k80-type
name: itc-no-type
type: job
prerequisites:
- protocolVersion: 2
@ -21,9 +21,8 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
gangAllocation: false
hivedScheduler:
jobPriorityClass: prod
submitFrom: submit-job-v2

Binary data
example/feature/file/itc-no-worker-label-1.png

Binary file not shown.


Binary data
example/feature/file/itc-no-worker-label-2.png

Binary file not shown.


View File

@ -1,33 +0,0 @@
protocolVersion: 2
name: itc-no-worker-label
type: job
prerequisites:
- protocolVersion: 2
name: keras_tensorflow_example
type: dockerimage
uri: openpai/pai.example.keras.tensorflow
taskRoles:
train:
instances: 1
completion:
minFailedInstances: 1
minSucceededInstances: 6
dockerImage: keras_tensorflow_example
resourcePerInstance:
cpu: 4
memoryMB: 8192
gpu: 1
commands:
- nvidia-smi -L
- printenv
- sleep 10000
defaults:
virtualCluster: VC1
extras:
gangAllocation: true
hivedScheduler:
jobPriorityClass: prod
taskRoles:
train:
skuType: M60
submitFrom: submit-job-v2

View File

@ -21,7 +21,7 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
gangAllocation: false
hivedScheduler:
@ -29,4 +29,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

View File

@ -35,7 +35,7 @@ taskRoles:
- python mnist_cnn.py
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
hivedScheduler:

View File

@ -20,12 +20,12 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
- sleep 10m
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: test
taskRoles:
train:
skuType: M60
submitFrom: submit-job-v2

View File

@ -21,11 +21,10 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: test
taskRoles:
train:
skuType: M60
submitFrom: submit-job-v2

View File

@ -20,6 +20,7 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
- sleep 10m
defaults:
virtualCluster: default
extras:
@ -28,4 +29,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

View File

@ -11,7 +11,7 @@ taskRoles:
instances: 4
completion:
minFailedInstances: 1
minSucceededInstances: 2
minSucceededInstances: 4
dockerImage: keras_tensorflow_example
resourcePerInstance:
cpu: 16
@ -21,11 +21,10 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: test
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

View File

@ -11,7 +11,7 @@ taskRoles:
instances: 3
completion:
minFailedInstances: 1
minSucceededInstances: 2
minSucceededInstances: 3
dockerImage: keras_tensorflow_example
resourcePerInstance:
cpu: 16
@ -21,11 +21,10 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: test
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

View File

@ -22,7 +22,7 @@ taskRoles:
- python mnist_cnn.py
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
hivedScheduler:

View File

@ -22,7 +22,7 @@ taskRoles:
- python mnist_cnn.py
defaults:
virtualCluster: VC1
virtualCluster: vc1
extras:
hivedScheduler:

View File

@ -235,9 +235,9 @@ func getUsablePhysicalCells(
return nil
}
// prioritize the cells with fewer opportunistic pods (to reduce preemption)
sort.SliceStable(candidates, func(i, j int) bool {
return candidates[i].GetUsedLeafCellNumAtPriorities()[opportunisticPriority] <
candidates[j].GetUsedLeafCellNumAtPriorities()[opportunisticPriority]
sort.SliceStable(usableCandidates, func(i, j int) bool {
return usableCandidates[i].GetUsedLeafCellNumAtPriorities()[opportunisticPriority] <
usableCandidates[j].GetUsedLeafCellNumAtPriorities()[opportunisticPriority]
})
return usableCandidates
}

View File

@ -83,7 +83,7 @@ type PodSchedulingSpec struct {
LeafCellNumber int32 `yaml:"leafCellNumber"`
GangReleaseEnable bool `yaml:"gangReleaseEnable"`
LazyPreemptionEnable bool `yaml:"lazyPreemptionEnable"`
IgnoreK8sSuggestedNodes bool `yaml:"ignoreK8sSuggestedNodes"`
IgnoreK8sSuggestedNodes bool `yaml:"ignoreK8sSuggestedNodes" default:"true"`
AffinityGroup *AffinityGroupSpec `yaml:"affinityGroup"`
}
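With the new `default:"true"` tag (and the matching defaulting in `ExtractPodSchedulingSpec` below), a pod scheduling spec that omits the field now ignores K8s suggested nodes. A hedged YAML sketch of such a spec, using only fields visible in this diff (values are illustrative):

```yaml
# ignoreK8sSuggestedNodes is omitted, so it now defaults to true;
# set it to false explicitly if the pod should follow K8s suggested nodes.
leafCellNumber: 1
gangReleaseEnable: false
lazyPreemptionEnable: false
affinityGroup: null
```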

View File

@ -232,7 +232,7 @@ func ExtractPodSchedulingSpec(pod *core.Pod) *si.PodSchedulingSpec {
defer AsBadRequestPanic()
errPfx := fmt.Sprintf("Pod annotation %v: ", si.AnnotationKeyPodSchedulingSpec)
podSchedulingSpec := si.PodSchedulingSpec{}
podSchedulingSpec := si.PodSchedulingSpec{IgnoreK8sSuggestedNodes: true}
annotation := convertOldAnnotation(pod.Annotations[si.AnnotationKeyPodSchedulingSpec])
if annotation == "" {