зеркало из https://github.com/github/vitess-gh.git
149 строки
4.3 KiB
Markdown
149 строки
4.3 KiB
Markdown
# Long Running Tasks
|
|
|
|
Currently, in Vitess, long running tasks are used in a couple places:
|
|
|
|
* `vtworker` runs tasks that can take a long time, as they can deal with a lot
|
|
of data.
|
|
|
|
* `vttablet` runs backups and restores that can also take a long time.
|
|
|
|
In both cases, we have a streaming RPC API that starts the job, and streams some
|
|
status back while it is running. If the streaming RPC is interrupted, the
|
|
behavior is different: `vtworker` and `vttablet backup` will interrupt their
|
|
jobs, whereas `vttablet restore` will try to keep going if it is past a point of
|
|
no return for restores.
|
|
|
|
This document proposes a different, but common design for both use cases.
|
|
|
|
## RPC Model
|
|
|
|
We introduce a model with three different RPCs:
|
|
|
|
* Start the job will be a synchronous non-streaming RPC, and just starts the
|
|
job. It returns an identifier for the job.
|
|
|
|
* Getting the job status is a streaming RPC. It takes the job identifier, and
|
|
streams the current status back. This job status can contain log entries
|
|
(regular messages) or progress (percentage, N/M completion, ...).
|
|
|
|
* Canceling a running job is its own synchronous non-streaming RPC, and takes
|
|
the job identifier.
|
|
|
|
For simplicity, let's try to make these the same API for 'vtworker' and
|
|
'vttablet'. Usually, a single destination can only run a single job, but let's
|
|
not assume that in the API. If a destination process cannot run a job, it should
|
|
return the usual `RESOURCE_EXHAUSTED` canonical error code.
|
|
|
|
These RPCs should be grouped in a new API service. Let's describe it as usual in
|
|
`jobdata.proto` and `jobservice.proto`. The current `vtworkerdata.proto` and
|
|
`vtworkerservice.proto` will eventually be removed and replaced by the new
|
|
service.
|
|
|
|
Let's use the usual `repeated string args` to describe the job. `vtworker`
|
|
already uses that.
|
|
|
|
So the proposed proto definitions:
|
|
|
|
``` proto
|
|
|
|
# in jobdata.proto
|
|
|
|
message StartRequest {
|
|
repeated string args = 1;
|
|
}
|
|
|
|
message StartResponse {
|
|
string uid = 1;
|
|
}
|
|
|
|
message StatusRequest {
|
|
string uid = 1;
|
|
}
|
|
|
|
// Progress describes the current progress of the task.
|
|
// Note the fields here match the Progress and ProgressMessage from the Node
|
|
// display of workflows.
|
|
message Progress {
|
|
// percentage can be 0-100 if known, or -1 if unknown.
|
|
int8 percentage = 1;
|
|
|
|
// message can be empty if percentage is set.
|
|
string message = 2;
|
|
}
|
|
|
|
// FinalStatus describes the end result of a job.
|
|
message FinalStatus {
|
|
// error is empty if the job was successful.
|
|
string error = 1;
|
|
}
|
|
|
|
// StatusResponse can have any of its fields set.
|
|
message StatusResponse {
|
|
// event is optional, used for logging.
|
|
logutil.Event event = 1;
|
|
|
|
// progress is optional, used to indicate progress.
|
|
Progress progress = 2;
|
|
|
|
// If final_status is set, this is the last StatusResponse for this job,
|
|
// it is terminated.
|
|
FinalStatus final_status = 3;
|
|
}
|
|
|
|
message CancelRequest {
|
|
string uid = 1;
|
|
}
|
|
|
|
message CancelResponse {
|
|
}
|
|
|
|
# in jobdata.service
|
|
|
|
service Job {
|
|
rpc Start (StartRequest) returns (StartResponse) {};
|
|
|
|
rpc Status (StatusRequest) returns (stream StatusResponse) {};
|
|
|
|
rpc Cancel (CancelRequest) returns (CancelResponse) {};
|
|
}
|
|
```
|
|
|
|
## Integration with Current Components
|
|
|
|
### vtworker
|
|
|
|
This design is very simple to implement within vtworker. At first, we don't need
|
|
to link the progress in, just the logging part.
|
|
|
|
vtworker will only support running one job like this at a time.
|
|
|
|
### vttablet
|
|
|
|
This is also somewhat easy to implement within vttablet. Only `Backup` and
|
|
`Restore` will be changed to use this.
|
|
|
|
vttablet will only support running one job like this at a time. It will also
|
|
take the ActionLock, so no other tablet actions can run at the same time (as we
|
|
do now).
|
|
|
|
### vtctld Workflows Integration
|
|
|
|
The link here is also very straightforward:
|
|
|
|
* When successfully starting a remote job, the address of the remote worker and
|
|
the UID of the job can be checkpointed.
|
|
|
|
* After that, the workflow can just connect and update its status and logs when
|
|
receiving an update.
|
|
|
|
* If the workflow is aborted and reloaded somewhere else (vtctld restart), it
|
|
can reconnect to the running job easily.
|
|
|
|
* Canceling the job is also easy, just call the RPC.
|
|
|
|
### Comments
|
|
|
|
Both vtworker and vttablet could remember the last N jobs they ran, and their
|
|
status. So when a workflow tries to reconnect to a finished job, they just
|
|
stream a single `StatusResponse` with a `final_status` field.
|