Update feature demo examples (#30)
* Update feature demo examples.
* Add defaulting for `ignoreK8sSuggestedNodes`.
* Fix sort in `getUsablePhysicalCells`.
Parent: a8b0aa7d5b
Commit: df185eecde

@@ -40,7 +40,6 @@ HiveD supports multiple job **priorities**. Higher-priority jobs can **[preempt]

5. [Priorities](example/feature/README.md#Guaranteed-Job), [Overuse with Low Priority](example/feature/README.md#Opportunistic-Job), and [Inter-](example/feature/README.md#Inter-VC-Preemption)/[Intra-VC Preemption](example/feature/README.md#Intra-VC-Preemption)
6. [Job (Full/Partial) Gang Scheduling/Preemption](example/feature/README.md#Gang-Scheduling)
7. Fault-Tolerance, [Bad Hardware Awareness](example/feature/README.md#Bad-Hardware-Awareness), [Work-Preserving Reconfiguration](example/feature/README.md#Work-Preserving-Reconfiguration)
8. [Leverage K8S Default Scheduler](example/feature/README.md#Leverage-K8S-Default-Scheduler)

## Prerequisite
1. A Kubernetes cluster, v1.14.2 or above, on-cloud or on-premise.

@@ -11,7 +11,7 @@ HiveD guarantees **quota safety for all VCs**, in the sense that the requests to

A VC's cells can be described by Hardware Quantity, [Topology](#VC-Safety), [Type](#SKU-Type), [Pinned Cells](#Pinned-Cells), etc. To guarantee safety, HiveD never allows a VC to "invade" other VCs' cells. For example, to guarantee all VCs' topology, one VC's [guaranteed jobs](#Guaranteed-Job) should never cause fragmentation inside other VCs:

-Consider two DGX-2s and two VCs, each owning one DGX-2 node. For a traditional scheduler, this translates into two VCs each owning 16 GPUs. When a user submits 16 1-GPU jobs to VC1, a user in VC2 might not be able to run a 16-GPU job, due to possible fragmentation caused by VC1. HiveD, in contrast, guarantees that each VC always has one entire node available for its dedicated use.
+Consider two DGX-2s and two VCs, each owning one DGX-2 node. For a traditional scheduler, this translates into two VCs each owning 16 GPUs. When a user submits 16 1-GPU jobs to vc1, a user in vc2 might not be able to run a 16-GPU job, due to possible fragmentation caused by vc1. HiveD, in contrast, guarantees that each VC always has one entire node available for its dedicated use.

### Reproduce Steps
1. Use [hived-config-1](file/hived-config-1.yaml).

@@ -27,7 +27,7 @@ This is similar to [K8S Taints and Tolerations](https://kubernetes.io/docs/conce

### Reproduce Steps
1. Use [hived-config-8](file/hived-config-8.yaml).
-2. Submit job [itc-pin](file/itc-pin.yaml) to VC1: all tasks in task role vc1pinned will be placed on node 10.151.41.25 (which is pinned), and all tasks in task role vc1nopinned will NOT be on node 10.151.41.25.
+2. Submit job [itc-pin](file/itc-pin.yaml) to vc1: all tasks in task role vc1pinned will be placed on node 10.151.41.25 (which is pinned), and all tasks in task role vc1nopinned will NOT be on node 10.151.41.25.
<img src="file/itc-pin.png" width="900"/>

## SKU Type

@@ -68,8 +68,8 @@ This is useful for jobs that cannot perform any useful work, such as making prog
<img src="file/itc-gang4.png" width="900"/>

#### TensorFlow Distributed Training
-1. Use [hived-config-1](file/hived-config-1.yaml).
-2. Submit job [itc-dtf](file/itc-dtf.yaml) to VC2; it will succeed.
+1. Use [hived-config-2](file/hived-config-2.yaml).
+2. Submit job [itc-dtf](file/itc-dtf.yaml) to the default VC; it will succeed.
<img src="file/itc-dtf.png" width="900"/>

## Incremental Scheduling

@@ -110,27 +110,28 @@ Within one VC, a high-priority job can preempt low-priority jobs.
### Reproduce Steps
#### Immediate Preemption
1. Use [hived-config-3](file/hived-config-3.yaml).
-2. Submit [itc-intra-imd-preempt-test](file/itc-intra-imd-preempt-test.yaml), which requests 4 M60 GPUs in VC1 with test (0) priority.
-3. Submit [itc-intra-imd-preempt-prod](file/itc-intra-imd-preempt-prod.yaml), which also requests 4 M60 GPUs in VC1 with prod (100) priority. The job will preempt the test job immediately, so the test job is retried and waits for resources.
+2. Submit [itc-intra-imd-preempt-test](file/itc-intra-imd-preempt-test.yaml), which requests 4 M60 GPUs in vc1 with test (0) priority.
+3. Submit [itc-intra-imd-preempt-prod](file/itc-intra-imd-preempt-prod.yaml), which also requests 4 M60 GPUs in vc1 with prod (100) priority. The job will preempt the test job immediately, so the test job is retried and waits for resources.
<img src="file/itc-intra-imd-preempt-test.png" width="900"/>
<img src="file/itc-intra-imd-preempt-prod.png" width="900"/>

#### Lazy Preemption
1. Use [hived-config-3](file/hived-config-3.yaml).
-2. Submit [itc-intra-lazy-preempt-test](file/itc-intra-lazy-preempt-test.yaml), which requests 4 K80 GPUs in VC1 with test (0) priority.
-3. Submit [itc-intra-lazy-preempt-prod](file/itc-intra-lazy-preempt-prod.yaml), which also requests 4 K80 GPUs in VC1 with prod (100) priority. The job will just downgrade the test job to an [Opportunistic Job](#Opportunistic-Job), instead of preempting it immediately, because all jobs can still fit into the whole physical cluster.
+2. Submit [itc-intra-lazy-preempt-test](file/itc-intra-lazy-preempt-test.yaml), which requests 4 K80 GPUs in vc1 with test (0) priority.
+3. Submit [itc-intra-lazy-preempt-prod](file/itc-intra-lazy-preempt-prod.yaml), which also requests 4 K80 GPUs in vc1 with prod (100) priority. The job will just downgrade the test job to an [Opportunistic Job](#Opportunistic-Job), instead of preempting it immediately, because all jobs can still fit into the whole physical cluster.
4. Submit [itc-intra-lazy-preempt-prod2](file/itc-intra-lazy-preempt-prod2.yaml), which requests 3 * 4 K80 GPUs in the default VC with prod (100) priority. The job will preempt the test job immediately, because all jobs can no longer fit into the whole physical cluster (a simplified sketch of this decision follows below).
<img src="file/itc-intra-lazy-preempt-test.png" width="900"/>
<img src="file/itc-intra-lazy-preempt-prod.png" width="900"/>
<img src="file/itc-intra-lazy-preempt-prod2.png" width="900"/>

> NOTE: the `lazyPreemptionEnable` option is disabled by default, because an earlier job may be downgraded to a low-priority job and then get preempted by later jobs, which may be confusing.
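
The lazy-vs-immediate choice in steps 3 and 4 boils down to a capacity check. The Go snippet below is a much-simplified, hypothetical sketch of that rule, not HiveD's actual implementation; the function name and the 16-GPU total are illustrative assumptions only.

```go
package main

import "fmt"

// lazyOrImmediate sketches the rule shown above: a higher-priority job only
// "lazy preempts" (downgrades the victim to an opportunistic job) when the
// victim could still fit into the physical cluster alongside all guaranteed
// demand; otherwise the victim is preempted immediately.
func lazyOrImmediate(totalGPUs, guaranteedDemandGPUs, victimGPUs int) string {
	if guaranteedDemandGPUs+victimGPUs <= totalGPUs {
		return "lazy preemption: downgrade victim to opportunistic"
	}
	return "immediate preemption: stop victim to free its GPUs"
}

func main() {
	// Assume 16 physical K80 GPUs for illustration.
	fmt.Println(lazyOrImmediate(16, 4, 4))    // step 3: prod (4) + test (4) still fit -> lazy
	fmt.Println(lazyOrImmediate(16, 4+12, 4)) // step 4: adding prod2 (3*4) no longer fits -> immediate
}
```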

## Inter-VC Preemption
### Description
One VC's [Guaranteed Job](#Guaranteed-Job) can preempt other VCs' [Opportunistic Jobs](#Opportunistic-Job).

### Reproduce Steps
-1. Use [hived-config-2](file/hived-config-2.yaml).
-2. Submit [itc-inter-preempt-oppo](file/itc-inter-preempt-oppo.yaml), which requests 2 * 4 K80 GPUs in VC1 with oppo (-1) priority.
+1. Use [hived-config-3](file/hived-config-3.yaml).
+2. Submit [itc-inter-preempt-oppo](file/itc-inter-preempt-oppo.yaml), which requests 2 * 4 K80 GPUs in vc1 with oppo (-1) priority.
3. Submit [itc-inter-preempt-prod](file/itc-inter-preempt-prod.yaml), which requests 3 * 4 K80 GPUs in the default VC with prod (100) priority. The job will preempt the oppo job immediately.
<img src="file/itc-inter-preempt-oppo.png" width="900"/>
<img src="file/itc-inter-preempt-prod.png" width="900"/>

@@ -190,20 +191,20 @@ HiveD can be reconfigured without unnecessary user impacts, such as add/update/d
#### VirtualCluster Reconfig - Delete VirtualCluster
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) to the default VC. Wait until it is running.
-3. Delete the default VC and move its quota to VC1; the config then becomes [hived-config-5](file/hived-config-5.yaml).
+3. Delete the default VC and move its quota to vc1; the config then becomes [hived-config-5](file/hived-config-5.yaml).
4. Use [hived-config-5](file/hived-config-5.yaml), and restart HiveD.
5. The job will still run without any interruption, but it will be [lazy preempted](#Lazy-Preemption) by HiveD.
<img src="file/itc-reconfig-3.png" width="900"/>
-6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-4](file/itc-reconfig-4.yaml) to VC1, which requests all K80 nodes. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).
+6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-4](file/itc-reconfig-4.yaml) to vc1, which requests all K80 nodes. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).
<img src="file/itc-reconfig-4.png" width="900"/>

#### VirtualCluster Reconfig - Update VirtualCluster
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Submit job [itc-reconfig-3](file/itc-reconfig-3.yaml) to the default VC. Wait until it is running.
-3. Move one K80-NODE cell from the default VC to VC1; the config then becomes [hived-config-6](file/hived-config-6.yaml).
+3. Move one K80-NODE cell from the default VC to vc1; the config then becomes [hived-config-6](file/hived-config-6.yaml).
4. Use [hived-config-6](file/hived-config-6.yaml), and restart HiveD.
5. The job will still run without any interruption, but it will be [lazy preempted](#Lazy-Preemption) by HiveD.
-6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-5](file/itc-reconfig-5.yaml) to VC1, which requests all K80 nodes. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).
+6. To confirm it is [lazy preempted](#Lazy-Preemption), submit job [itc-reconfig-5](file/itc-reconfig-5.yaml) to vc1, which requests all K80 nodes. The job will immediately preempt [itc-reconfig-3](file/itc-reconfig-3.yaml).
<img src="file/itc-reconfig-5.png" width="900"/>

## Bad Hardware Awareness

@@ -219,17 +220,3 @@ Avoid scheduling pods to bad hardware.
4. Bring back 10.151.41.26 by `sudo systemctl start kubelet`. Wait until this is detected by K8S.
5. The waiting job will start running, without any retries.
<img src="file/itc-badnode50-3.png" width="900"/>

## Leverage K8S Default Scheduler
### Description
You can still leverage almost all scheduling features provided by your existing [K8S Default Scheduler](https://kubernetes.io/docs/concepts/scheduling/kube-scheduler) with HiveD, such as these [Filtering Policies](https://kubernetes.io/docs/concepts/scheduling/kube-scheduler/#filtering).

### Reproduce Steps
#### Leverage [Labels and Selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels)
1. Use [hived-config-2](file/hived-config-2.yaml).
2. Remove the PAI worker label from 10.151.41.26 (the only M60 node).
3. Submit job [itc-no-worker-label](file/itc-no-worker-label.yaml), which requests an M60 node; it will be waiting without an IP associated.
<img src="file/itc-no-worker-label-1.png" width="900"/>
4. Add the PAI worker label back to 10.151.41.26.
5. The waiting job will start running, without any retries.
<img src="file/itc-no-worker-label-2.png" width="900"/>

@@ -34,11 +34,11 @@ physicalCluster:
- cellAddress: 10.151.41.24

virtualClusters:
-VC1:
+vc1:
virtualCells:
- cellType: 3-K80-NODE.K80-NODE
cellNumber: 1
-VC2:
+vc2:
virtualCells:
- cellType: 3-K80-NODE.K80-NODE
cellNumber: 1

@@ -45,7 +45,7 @@ physicalCluster:
- cellAddress: 10.151.41.26

virtualClusters:
-VC1:
+vc1:
virtualCells:
- cellType: K80-NODE-POOL.K80-NODE
cellNumber: 1

@@ -45,7 +45,7 @@ physicalCluster:
- cellAddress: 10.151.41.26

virtualClusters:
-VC1:
+vc1:
virtualCells:
- cellType: K80-NODE-POOL.K80-NODE
cellNumber: 1

@@ -42,7 +42,7 @@ physicalCluster:
# - cellAddress: 10.151.41.25

virtualClusters:
-VC1:
+vc1:
virtualCells:
- cellType: K80-NODE-POOL.K80-NODE
cellNumber: 1

@@ -45,7 +45,7 @@ physicalCluster:
- cellAddress: 10.151.41.25

virtualClusters:
-VC1:
+vc1:
virtualCells:
- cellType: K80-NODE-POOL.K80-NODE
cellNumber: 1

@@ -45,7 +45,7 @@ physicalCluster:
- cellAddress: 10.151.41.26

virtualClusters:
-VC1:
+vc1:
virtualCells:
- cellType: K80-NODE-POOL.K80-NODE
cellNumber: 4

@@ -45,7 +45,7 @@ physicalCluster:
- cellAddress: 10.151.41.26

virtualClusters:
-VC1:
+vc1:
virtualCells:
- cellType: K80-NODE-POOL.K80-NODE
cellNumber: 3

@@ -44,7 +44,7 @@ physicalCluster:
- cellAddress: 10.151.41.26

virtualClusters:
-VC1:
+vc1:
virtualCells:
- cellType: K80-NODE-POOL.K80-NODE
cellNumber: 1

@@ -34,13 +34,13 @@ physicalCluster:
- cellAddress: 10.151.41.24

virtualClusters:
-VC1:
+vc1:
virtualCells:
- cellType: 3-K80-NODE.K80-NODE
cellNumber: 1
pinnedCells:
- pinnedCellId: VC1-K80
-VC2:
+vc2:
virtualCells:
- cellType: 3-K80-NODE.K80-NODE
cellNumber: 1

@@ -11,18 +11,17 @@ taskRoles:
instances: 1
completion:
minFailedInstances: 1
-minSucceededInstances: 6
+minSucceededInstances: 1
dockerImage: keras_tensorflow_example
resourcePerInstance:
cpu: 4
memoryMB: 8192
gpu: 1
commands:
- nvidia-smi -L
- printenv
-- sleep 10000
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
gangAllocation: true
hivedScheduler:

@@ -30,4 +29,3 @@ extras:
taskRoles:
train:
skuType: M60
submitFrom: submit-job-v2

@@ -20,9 +20,9 @@ taskRoles:
commands:
- nvidia-smi -L
- printenv
-- sleep 10000
+- sleep 10m
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
gangAllocation: true
hivedScheduler:

@@ -30,4 +30,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

@@ -100,5 +100,3 @@ deployments:
- echo "Uploading data ..."
defaults:
deployment: tf_example
extras:
submitFrom: submit-job-v2

@@ -21,7 +21,7 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
gangAllocation: false
hivedScheduler:

@@ -29,4 +29,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

@@ -21,7 +21,7 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
gangAllocation: true
hivedScheduler:

@@ -29,4 +29,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

@@ -11,7 +11,7 @@ taskRoles:
instances: 4
completion:
minFailedInstances: 1
-minSucceededInstances: 6
+minSucceededInstances: 4
dockerImage: keras_tensorflow_example
resourcePerInstance:
cpu: 4

@@ -21,7 +21,7 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
gangAllocation: true
hivedScheduler:

@@ -29,4 +29,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

@@ -20,13 +20,12 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
-- sleep 10000
+- sleep 10m
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: oppo
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

@@ -20,7 +20,7 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
-- sleep 10000
+- sleep 10m
defaults:
virtualCluster: default
extras:

@@ -29,4 +29,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

@@ -20,13 +20,12 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
-- sleep 10000
+- sleep 10m
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: prod
taskRoles:
train:
skuType: M60
submitFrom: submit-job-v2

@@ -20,13 +20,12 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
-- sleep 10000
+- sleep 10m
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: test
taskRoles:
train:
skuType: M60
submitFrom: submit-job-v2

@@ -20,13 +20,12 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
-- sleep 10000
+- sleep 10m
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: prod
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

@@ -20,7 +20,7 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
-- sleep 10000
+- sleep 10m
defaults:
virtualCluster: default
extras:

@@ -29,4 +29,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

@@ -20,13 +20,13 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
-- sleep 10000
+- sleep 10m
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: test
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2
+lazyPreemptionEnable: true

@@ -20,8 +20,9 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
+- sleep 10m
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
gangAllocation: false
hivedScheduler:

@@ -29,4 +30,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

@@ -1,5 +1,5 @@
protocolVersion: 2
-name: itc-k80-type
+name: itc-no-type
type: job
prerequisites:
- protocolVersion: 2

@@ -21,9 +21,8 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
gangAllocation: false
hivedScheduler:
jobPriorityClass: prod
submitFrom: submit-job-v2

Binary data: example/feature/file/itc-no-worker-label-1.png (binary file not shown; before size: 19 KiB)
Binary data: example/feature/file/itc-no-worker-label-2.png (binary file not shown; before size: 21 KiB)

@@ -1,33 +0,0 @@
protocolVersion: 2
name: itc-no-worker-label
type: job
prerequisites:
- protocolVersion: 2
name: keras_tensorflow_example
type: dockerimage
uri: openpai/pai.example.keras.tensorflow
taskRoles:
train:
instances: 1
completion:
minFailedInstances: 1
minSucceededInstances: 6
dockerImage: keras_tensorflow_example
resourcePerInstance:
cpu: 4
memoryMB: 8192
gpu: 1
commands:
- nvidia-smi -L
- printenv
- sleep 10000
defaults:
virtualCluster: VC1
extras:
gangAllocation: true
hivedScheduler:
jobPriorityClass: prod
taskRoles:
train:
skuType: M60
submitFrom: submit-job-v2

@@ -21,7 +21,7 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
gangAllocation: false
hivedScheduler:

@@ -29,4 +29,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

@@ -35,7 +35,7 @@ taskRoles:
- python mnist_cnn.py

defaults:
-virtualCluster: VC1
+virtualCluster: vc1

extras:
hivedScheduler:

@@ -20,12 +20,12 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
- sleep 10m
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: test
taskRoles:
train:
skuType: M60
submitFrom: submit-job-v2

@@ -21,11 +21,10 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: test
taskRoles:
train:
skuType: M60
submitFrom: submit-job-v2

@@ -20,6 +20,7 @@ taskRoles:
commands:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
+- sleep 10m
defaults:
virtualCluster: default
extras:

@@ -28,4 +29,3 @@ extras:
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

@@ -11,7 +11,7 @@ taskRoles:
instances: 4
completion:
minFailedInstances: 1
-minSucceededInstances: 2
+minSucceededInstances: 4
dockerImage: keras_tensorflow_example
resourcePerInstance:
cpu: 16

@@ -21,11 +21,10 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: test
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

@@ -11,7 +11,7 @@ taskRoles:
instances: 3
completion:
minFailedInstances: 1
-minSucceededInstances: 2
+minSucceededInstances: 3
dockerImage: keras_tensorflow_example
resourcePerInstance:
cpu: 16

@@ -21,11 +21,10 @@ taskRoles:
- rm /usr/local/cuda/lib64/stubs/libcuda.so.1
- python mnist_cnn.py
defaults:
-virtualCluster: VC1
+virtualCluster: vc1
extras:
hivedScheduler:
jobPriorityClass: test
taskRoles:
train:
skuType: K80
submitFrom: submit-job-v2

@@ -22,7 +22,7 @@ taskRoles:
- python mnist_cnn.py

defaults:
-virtualCluster: VC1
+virtualCluster: vc1

extras:
hivedScheduler:

@@ -22,7 +22,7 @@ taskRoles:
- python mnist_cnn.py

defaults:
-virtualCluster: VC1
+virtualCluster: vc1

extras:
hivedScheduler:

@@ -235,9 +235,9 @@ func getUsablePhysicalCells(
return nil
}
// prioritize the cells with fewer opportunistic pods (to reduce preemption)
-sort.SliceStable(candidates, func(i, j int) bool {
-return candidates[i].GetUsedLeafCellNumAtPriorities()[opportunisticPriority] <
-candidates[j].GetUsedLeafCellNumAtPriorities()[opportunisticPriority]
+sort.SliceStable(usableCandidates, func(i, j int) bool {
+return usableCandidates[i].GetUsedLeafCellNumAtPriorities()[opportunisticPriority] <
+usableCandidates[j].GetUsedLeafCellNumAtPriorities()[opportunisticPriority]
})
return usableCandidates
}
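
The fix above addresses an easy-to-miss slip: the code sorted `candidates` but returned `usableCandidates`, so the preferred ordering never reached the caller. Below is a minimal, self-contained Go sketch of the corrected pattern; the `cell` type, its fields, and `usableCellsSorted` are hypothetical stand-ins (for example, `oppoPods` stands in for `GetUsedLeafCellNumAtPriorities()[opportunisticPriority]`), not HiveD's real API.

```go
package main

import (
	"fmt"
	"sort"
)

// cell is a stand-in for a physical cell; oppoPods mimics the number of
// opportunistic pods currently using the cell.
type cell struct {
	name     string
	usable   bool
	oppoPods int
}

// usableCellsSorted filters out unusable cells first, then sorts the
// *filtered* slice so that cells with fewer opportunistic pods come first
// (to reduce preemption). Sorting the original slice instead would have no
// effect on the returned result.
func usableCellsSorted(candidates []cell) []cell {
	usable := make([]cell, 0, len(candidates))
	for _, c := range candidates {
		if c.usable {
			usable = append(usable, c)
		}
	}
	sort.SliceStable(usable, func(i, j int) bool {
		return usable[i].oppoPods < usable[j].oppoPods
	})
	return usable
}

func main() {
	cells := []cell{{"a", true, 3}, {"b", false, 0}, {"c", true, 1}}
	fmt.Println(usableCellsSorted(cells)) // [{c true 1} {a true 3}]
}
```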

@@ -83,7 +83,7 @@ type PodSchedulingSpec struct {
LeafCellNumber int32 `yaml:"leafCellNumber"`
GangReleaseEnable bool `yaml:"gangReleaseEnable"`
LazyPreemptionEnable bool `yaml:"lazyPreemptionEnable"`
-IgnoreK8sSuggestedNodes bool `yaml:"ignoreK8sSuggestedNodes"`
+IgnoreK8sSuggestedNodes bool `yaml:"ignoreK8sSuggestedNodes" default:"true"`
AffinityGroup *AffinityGroupSpec `yaml:"affinityGroup"`
}

@@ -232,7 +232,7 @@ func ExtractPodSchedulingSpec(pod *core.Pod) *si.PodSchedulingSpec {
defer AsBadRequestPanic()
errPfx := fmt.Sprintf("Pod annotation %v: ", si.AnnotationKeyPodSchedulingSpec)

-podSchedulingSpec := si.PodSchedulingSpec{}
+podSchedulingSpec := si.PodSchedulingSpec{IgnoreK8sSuggestedNodes: true}

annotation := convertOldAnnotation(pod.Annotations[si.AnnotationKeyPodSchedulingSpec])
if annotation == "" {
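
The defaulting works because the spec is pre-filled before the annotation is unmarshaled into it: a pod annotation that omits `ignoreK8sSuggestedNodes` keeps the pre-set `true`, while an explicit `ignoreK8sSuggestedNodes: false` overrides it. The snippet below is a small, self-contained sketch of that pattern with a hypothetical `spec` struct; it is not HiveD's code.

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v2"
)

type spec struct {
	IgnoreK8sSuggestedNodes bool `yaml:"ignoreK8sSuggestedNodes"`
}

func parse(annotation string) spec {
	// Pre-fill the default; yaml.Unmarshal only touches fields that the
	// document actually sets.
	s := spec{IgnoreK8sSuggestedNodes: true}
	if err := yaml.Unmarshal([]byte(annotation), &s); err != nil {
		panic(err)
	}
	return s
}

func main() {
	fmt.Println(parse("{}"))                             // {true}  -> default kept
	fmt.Println(parse("ignoreK8sSuggestedNodes: false")) // {false} -> explicitly overridden
}
```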