Yuqing Yang 2020-02-19 17:51:03 +08:00
Parent 011fe2a96c
Commit 565aca9e3a
1 changed file with 14 additions and 13 deletions


@@ -9,17 +9,18 @@ Since the clusters are heterogeneous and the computing power may be significantl
## Describing a multi-cluster job
Suppose there are *n* clusters *C_1*, .., *C_n* as below, each of which provides a RESTful API to `submit`, `query` and `cancel` a job. A job running on a cluster moves through sequential states such as `WAITING`, `RUNNING`, `SUCCESS` and `FAILED`.
| Cluster | Alias * | Uri           | GPU type | Priority ** |
| ------- | ------- | ------------- | -------- | ----------- |
| *C_1*   | alias-1 | cluster.uri.1 | V100     | *p_1*       |
| ..      | ..      | ..            | ..       | ..          |
| *C_n*   | alias-n | cluster.uri.n | K80      | *p_n*       |
*\*: alias should be unique for each cluster (default is the URI)*<br/>
*\*\*: Priority is a number in the range of 0 to 100; more details are described later*

To describe a deep learning job with a multiple-cluster configuration, we need to implement *Multiple-Cluster-Configuration (MCC)* in the [job protocol](github.com/microsoft/openpai-protocol) and provide a specialized method in the [SDK](github.com/microsoft/openpaisdk).
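As a rough sketch of the building blocks such an SDK method would need (hypothetical names, not the actual openpaisdk API: `JobState`, `ClusterConfig` and `ClusterClient` are all illustrative), each cluster from the table above can be modeled as a client exposing the three REST operations:

```typescript
// Hypothetical SDK-side model of a cluster; illustrative, not the real openpaisdk API.
export type JobState = 'WAITING' | 'RUNNING' | 'SUCCESS' | 'FAILED';

export interface ClusterConfig {
  alias: string;    // unique per cluster, defaults to the URI
  uri: string;
  gpuType: string;
  priority: number; // 0..100, used by the FAC-with-priority policy below
}

export interface ClusterClient {
  readonly config: ClusterConfig;
  submit(jobConfig: object): Promise<string>; // returns the job name on that cluster
  query(jobName: string): Promise<JobState>;
  cancel(jobName: string): Promise<void>;
}
```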
Below is an example of defining a job for two clusters. Since users need to handle their data for each cluster (via the [teamwise storage configuration](github.com/microsoft/pai/src/kube-runtime/src/plugins/teamwise_storage/README.md)), we give an example of configuring data storage and parameters (e.g. `gpuNum`).
@@ -82,16 +83,16 @@ extra:
## First-Available-Cluster scheduling
The First-Available-Cluster (FAC) policy is the simplest and most straightforward policy. The scheduler submits the job to all the clusters and periodically queries their status. When one of the jobs is successfully scheduled (i.e. stays in the `RUNNING` state for a guard time), the scheduler cancels (kills) all other jobs and returns the scheduled job and its cluster to the user.
If more than one job is scheduled, the scheduler keeps only the one that entered the `RUNNING` state earliest and cancels the others.
To implement this within the existing RESTful APIs, we choose a *Submit-Then-Cancel* method to test the availability of each cluster.
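A minimal sketch of that Submit-Then-Cancel loop, assuming the hypothetical `ClusterClient` interface above (the polling interval and guard time are illustrative defaults):

```typescript
// FAC: submit everywhere, keep the first job that stays RUNNING for a
// guard time, cancel (kill) all the rest.
async function facSchedule(
  clusters: ClusterClient[],
  jobConfig: object,
  pollMs = 10_000,  // polling interval (illustrative)
  guardMs = 60_000, // guard time a job must stay RUNNING (illustrative)
): Promise<{ cluster: ClusterClient; jobName: string }> {
  // 1. Submit the job to every cluster.
  const jobs = await Promise.all(
    clusters.map(async (cluster) => ({ cluster, jobName: await cluster.submit(jobConfig) })),
  );
  // 2. Periodically query all jobs until one survives the guard time.
  const runningSince = new Map<ClusterClient, number>();
  for (;;) {
    for (const job of jobs) {
      const state = await job.cluster.query(job.jobName);
      if (state === 'RUNNING') {
        const since = runningSince.get(job.cluster) ?? Date.now();
        runningSince.set(job.cluster, since);
        if (Date.now() - since >= guardMs) {
          // 3. One job is scheduled: cancel all the others and return the winner.
          await Promise.all(
            jobs
              .filter((other) => other !== job)
              .map((other) => other.cluster.cancel(other.jobName)),
          );
          return job;
        }
      } else {
        runningSince.delete(job.cluster); // fell out of RUNNING: reset the guard
      }
    }
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```

Tracking when each job entered `RUNNING` also approximates the keep-earliest rule above: the job whose `RUNNING` start time is oldest is the first to pass the guard-time check.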
## FAC with priority
The problem with the FAC policy is that it treats all clusters equally, ignoring the differences in their computing power. However, users may have preferences among clusters; for example, most users want their jobs scheduled on the more powerful clusters.
To solve this, we let users express their preference by scoring every cluster with a number between 0 and 100 (denoted by `Priority`). When a job is successfully scheduled on cluster *C_i* with priority *p_i*, all jobs submitted to clusters whose priority is smaller than *p_i* are cancelled immediately, but jobs submitted to higher-priority clusters get an extra surviving time.
For example, if there is a cluster *C_j* with higher priority *p_j* (*p_j > p_i*), then *C_j*'s job will not be cancelled until *(p_j - p_i) × timeScale* later. If *C_j*'s job is successfully scheduled within this time, the scheduler chooses *C_j* as the execution cluster and cancels the job on the lower-priority cluster *C_i*.
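As a worked example, with a hypothetical `timeScale` of 60 seconds, a job scheduled on *C_i* with *p_i* = 50 gives a cluster with *p_j* = 80 an extra (80 - 50) × 60 s = 30 minutes to schedule its own job. A small sketch of that deadline computation (illustrative names, building on the sketch above):

```typescript
// Deadline after which the job on another cluster is cancelled, given that a
// job was scheduled at tScheduledMs on a cluster with priority pScheduled.
function cancelDeadlineMs(
  tScheduledMs: number,
  pScheduled: number,
  pOther: number,
  timeScaleMs: number,
): number {
  if (pOther <= pScheduled) return tScheduledMs; // lower priority: cancel immediately
  return tScheduledMs + (pOther - pScheduled) * timeScaleMs; // extra surviving time
}

// cancelDeadlineMs(t, 50, 80, 60_000) === t + 1_800_000, i.e. 30 minutes later.
```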