# Vitess Queues

This document describes a proposed design and implementation guidelines for a
Queue feature inside Vitess. Events can be added to a Queue, and processed
asynchronously by Receivers.

## Problem and Challenges

Queues are hard to implement correctly on MySQL. They need to scale, and
application-level implementations usually start running into issues at high QPS.

This document proposes a design to address the following requirements:

* Need to support sharding transparently (and scale linearly with the number of
shards).
* Need to not lose events.
* Extracting the events from the queue for processing needs to be scalable.
* Adding to the queue needs to be transaction-safe (so changing some content and
adding to the queue can be committed as a single transaction).
* Events have an optional timestamp of when they should fire. If none is
specified, they fire as soon as possible.
* May fire events more than once, if the event receiver is unresponsive, or in
case of topology events (resharding, reparent, ...).
* Events that are successfully acked should never fire again.
* Strict event processing ordering is not a requirement.

While it is possible to implement high QPS queues using a sharded table, it
would introduce the following limitations:

* The Queue consumers need to poll the database to find new work.
* The Queue consumers have to be somewhat shard-aware, to only consume events
from one shard.
* With a higher number of consumers on one shard, they step on each other's
toes, and introduce database latency.

## Proposed Design

At a high level, we propose to still use a table as the backend for a Queue. It
follows the same sharding as the Keyspace it is in, meaning it is present on all
shards. However, instead of having consumers query the table directly, we
introduce the Queue Manager:

* When a tablet becomes the primary of a shard (at startup, or at reparent time),
it creates a Queue Manager routine for each Queue Table in the schema.
* The Queue Manager is responsible for managing the data in the Queue Table.
* The Queue Manager dispatches events that are firing to a pool of
Listeners. Each Listener maintains a streaming connection to the Queue it is
listening on.
* The Queue Manager can then use an in-memory cache covering the next time
interval, and be efficient.
* All requests to the Queue table are then done by Primary Key, and are very
fast.

Each Queue Table has the following mandatory fields. At first, we can create
them using a regular schema change (see the sketch after this list). Eventually,
we'll support a `CREATE QUEUE name(...)` command. Each entry in the queue has a
sharding key (which could be a hash of the trigger timestamp). The Primary Key
for the table is a timestamp, with nanosecond granularity:

* Timestamp in nanoseconds since Epoch (PK): uint64
* State (enum): to_be_fired, fired, acked.
* Some sharding key: user-defined, specified at creation time, with vindex.
* Other columns: user-defined, specified at creation time.
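
As an illustration of such a schema change, here is a minimal sketch of what the
backing table could look like. The table name (`my_queue`), the column names
(`time_scheduled`, `state`, `user_id`, `payload`), and the exact types are
assumptions made for this example, not a final schema:

``` sql
-- Hypothetical backing table for a Queue named my_queue.
-- time_scheduled is the firing time, in nanoseconds since Epoch, and is the PK.
CREATE TABLE my_queue (
  time_scheduled BIGINT UNSIGNED NOT NULL,
  state ENUM('to_be_fired', 'fired', 'acked') NOT NULL DEFAULT 'to_be_fired',
  user_id BIGINT NOT NULL,   -- sharding key, has a vindex in the VSchema
  payload VARCHAR(256),      -- user-defined column
  PRIMARY KEY (time_scheduled)
) ENGINE=InnoDB;
```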

A Queue is linked to a Queue Manager on the primary vttablet, on each shard. The
Queue Manager is responsible for dispatching events to the Receivers. The Queue
Manager can impose the following limits:

* Maximum QPS of events being processed.
* Maximum concurrency of event processing.

*Note*: we still use the database to store the items, so any state change is
persisted. We'll explore scalability below, but the QPS for Queues will still be
limited by the insert / update QPS of each shard primary. To increase Queue
capacity, the usual horizontal sharding process can be used.

## Implementation Details

Queue Tables are marked in the schema by a comment, in a similar way to how we
detect Sequence Tables
[now](https://github.com/vitessio/vitess/blob/0b3de7c4a2de8daec545f040639b55a835361685/go/vt/vttablet/tabletserver/tabletserver.go#L138).
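
A minimal sketch of how such a marker could look, reusing the hypothetical table
from the previous example; the `vitess_queue` token is a placeholder, since the
exact comment format is not defined in this document:

``` sql
-- A table comment marks the table as a Queue Table, similar to how Sequence
-- Tables are detected today.
ALTER TABLE my_queue COMMENT='vitess_queue';
```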

When a tablet becomes a primary, and there are Queue tables, it creates a
QueueManager for each of them.

The Queue Manager has a window of ‘current or soon’ events (let’s say the next
30 seconds or 1 minute). Upon startup, it reads these events into memory. Then,
for every modification to an event in that window, it keeps an in-memory copy of
the event.
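
For illustration, the initial read of that window could be a single range query
over the Primary Key. The column names follow the hypothetical schema sketched
earlier, and `:now_ns` / `:window_ns` are placeholder bind variables:

``` sql
-- Load all upcoming events for the next window into memory.
SELECT time_scheduled, state, user_id, payload
FROM my_queue
WHERE state = 'to_be_fired'
  AND time_scheduled < :now_ns + :window_ns
ORDER BY time_scheduled;
```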

Receivers use streaming RPCs to ask for events, providing a KeyRange if
necessary to only target a subset of the Queues. That RPC is connected to the
corresponding Queue Managers on the targeted shards. It is recommended to have
multiple Receivers target the same shard, for redundancy and scalability.

Whenever an event becomes current and needs to be fired, the Queue Manager
changes the state of the event on disk to ‘fired’, and sends it to a random
Receiver (or the Receiver that is the least busy, or the most responsive,
whichever it is). The Receiver will then either:

* Ack the event after processing, and then the Manager can mark it ‘acked’ in
the table, or even delete it.
* Not respond in time, or not ack it: the Manager can mark it as ‘to be fired’
again. In that case, the Manager can back off and try again later. The logic
here can be as simple or complicated as necessary (the corresponding point
updates are sketched after this list).
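
The sketch below shows what the point updates in this flow could look like,
again using the hypothetical `my_queue` schema; each statement touches a single
row by Primary Key, with `:event_ts` as a placeholder bind variable:

``` sql
-- 1. Mark the event as fired just before sending it to a Receiver.
UPDATE my_queue SET state = 'fired' WHERE time_scheduled = :event_ts;

-- 2a. The Receiver acked the event: mark it acked (or delete it).
UPDATE my_queue SET state = 'acked' WHERE time_scheduled = :event_ts;

-- 2b. No ack in time: put the event back so it can be retried later.
UPDATE my_queue SET state = 'to_be_fired' WHERE time_scheduled = :event_ts;
```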

The easiest implementation is to not guarantee an execution order for the
events. Providing ordering would not be too difficult, but it has an impact on
performance (when a Receiver doesn't respond in time, it delays everybody else,
so we'd need to time out quickly and keep going).

For ‘old’ events (that have been fired but not acked), we propose to use an old
event collector that retries with some kind of longer threshold / retry
period. That way we don't use the Queue Manager for old events, and keep it
focused on firing recent events.
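
A sketch of the query such a collector could run periodically; the threshold is
a placeholder, and a real implementation might track the fire time in a
dedicated column rather than reusing the scheduled time as an approximation:

``` sql
-- Find events that were fired a while ago but never acked, so they can be
-- retried with a longer back-off.
SELECT time_scheduled, user_id, payload
FROM my_queue
WHERE state = 'fired'
  AND time_scheduled < :now_ns - :retry_threshold_ns;
```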

Event history can be managed in multiple ways. The Queue Manager should probably
have some kind of retention policy, and delete events that have been fired and
are older than N days. Deleting the table would delete the entire queue.
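
A sketch of such a retention job, assuming N days of retention and the same
hypothetical columns; removing only acked events is one possible policy:

``` sql
-- Delete acked events that are older than the retention period,
-- in chunks to avoid large transactions.
DELETE FROM my_queue
WHERE state = 'acked'
  AND time_scheduled < :now_ns - :retention_ns
LIMIT 10000;
```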

## Example / Code

We create the Queue:

``` sql
CREATE QUEUE my_queue(
  user_id BIGINT(10),
  payload VARCHAR(256)
);
```

Then we change the VSchema to have a hash vindex for `user_id`.

Let's insert something into the queue:

``` sql
BEGIN;
INSERT INTO my_queue(user_id, payload) VALUES (10, 'I want to send this');
COMMIT;
```

We add another streaming endpoint to vtgate for queue event consumption, and one
for acking:

``` protocol-buffer
message QueueReceiveRequest {
  // caller_id identifies the caller. This is the effective caller ID,
  // set by the application to further identify the caller.
  vtrpc.CallerID caller_id = 1;

  // keyspace to target the query to.
  string keyspace = 2;

  // shard to target the query to, for unsharded keyspaces.
  string shard = 3;

  // KeyRange to target the query to, for sharded keyspaces.
  topodata.KeyRange key_range = 4;
}

message QueueReceiveResponse {
  // First response has fields, next ones have data.
  // Note we might get multiple notifications in a single packet.
  QueryResult result = 1;
}

message QueueAckRequest {
  // TODO: find a good way to reference the event, maybe
  // have an ‘id’ in the query result.
  repeated string id = 1;
}

message QueueAckResponse {
}

service Vitess {
  ...
  rpc QueueReceive(vtgate.QueueReceiveRequest) returns (stream vtgate.QueueReceiveResponse) {};

  rpc QueueAck(vtgate.QueueAckRequest) returns (vtgate.QueueAckResponse) {};
  ...
}
```

Then the receiver code is just an endless loop:

* Call QueueReceive(). For each QueueReceiveResponse:
  * For each row:
    * Do what needs to be done.
  * Call QueueAck() for the processed events.

*Note*: acks may need to be inside a transaction too. In that case, we may want
to remove the QueueAck, and replace it with an `UPDATE my_queue SET
state='acked' WHERE <PK>=...;` statement (see the sketch below). It might be
re-written by vttablet.
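
A minimal sketch of what such a transactional ack could look like, combining the
application's own change with the state update in one transaction; `user_emails`
is a made-up application table, and the queue columns are the hypothetical ones
used above:

``` sql
BEGIN;
-- Application-side work that must be atomic with the ack.
UPDATE user_emails SET sent = 1 WHERE user_id = 10;
-- Ack the queue event by Primary Key.
UPDATE my_queue SET state = 'acked' WHERE time_scheduled = :event_ts;
COMMIT;
```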

## Scalability

How does this scale? The answer should be: very well. There is no expensive
query done to the database. Only a range query (might need locking) every N
seconds. Everything else is point updates.

This approach scales with the number of shards too. Resharding will multiply the
queue processing power.

Since we have a process on the primary, we piggy-back on our regular tablet
primary election.

*Event creation*: each creation is one INSERT into the database. Creation of
events that are supposed to execute as soon as possible may create a hotspot in
the table. If so, we might be better off using the keyspace id as the primary
key, and having an index on timestamps. If the event creation is in the
execution window, we also keep it in memory. Multiple events can even be created
in the same transaction. *Optimization*: if the event is supposed to be fired
right away, we might even create it as ‘fired’ directly (see the sketch below).
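
As a sketch of that optimization, an event that should fire immediately could be
inserted directly in the ‘fired’ state, using the hypothetical columns from the
earlier examples:

``` sql
-- The event will be handed to a Receiver right away, so skip the
-- 'to_be_fired' state and the extra UPDATE.
INSERT INTO my_queue(time_scheduled, state, user_id, payload)
VALUES (:now_ns, 'fired', 10, 'I want to send this');
```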

*Event dispatching*: mostly done by the Queue Manager. Each dispatch is one
update (by PK) for firing, and one update (by PK again) for acking. The RPC
overhead for sending these events and receiving the acks from the receivers is
negligible (as it has no DB impact).

*Queue Manager*: it does one scan of the upcoming events for the next N seconds,
every N seconds. Hopefully that query is fast. Note we could add a ‘LIMIT’ to
that query and only consider the next N seconds, or the period that has 10k
events (value TBD), whichever is smaller.
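
A sketch of that bounded scan, as a variant of the window query shown earlier;
the 10k limit stands in for the TBD value:

``` sql
-- Scan at most the next window of events, or the first 10000 of them,
-- whichever is smaller.
SELECT time_scheduled, state, user_id, payload
FROM my_queue
WHERE state = 'to_be_fired'
  AND time_scheduled < :now_ns + :window_ns
ORDER BY time_scheduled
LIMIT 10000;
```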

*Event Processing*: a pool of Receivers is consuming the events. Scale that up
to as many as is needed, with any desired redundancy. The shard primary will
send the event to one receiver only, until it is acked, or times out.

*Hot Spots in the Queue table*: with very high QPS, we will end up having a hot
spot on the window of events that are coming up. A proposed solution is to use
an 'event_name' that has a unique value, and use it as the primary key (see the
sketch below). When the Queue Manager gets new events, however, it will still
need to sort them by timestamp, and will therefore require an index. It is
unclear which solution will work better; we'd need to experiment. For instance,
if most events are only present in the table while they are being processed, the
entire table will have high QPS anyway.
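
A sketch of that alternative layout; `event_name` stands for whatever unique
value is chosen, and the secondary index provides the timestamp ordering the
Queue Manager needs:

``` sql
-- Alternative layout: a unique event_name as PK to spread the writes,
-- plus an index on the firing timestamp for the manager's range scans.
CREATE TABLE my_queue (
  event_name VARBINARY(128) NOT NULL,
  time_scheduled BIGINT UNSIGNED NOT NULL,
  state ENUM('to_be_fired', 'fired', 'acked') NOT NULL DEFAULT 'to_be_fired',
  user_id BIGINT NOT NULL,
  payload VARCHAR(256),
  PRIMARY KEY (event_name),
  KEY ts_idx (time_scheduled)
) ENGINE=InnoDB;
```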

## Caveats

This is not meant to replace UpdateStream. UpdateStream is meant to send one
notification for each database event, without extra database cost, and without
any filtering. Queues have a database cost.

Events can be fired multiple times, in corner cases. However, we always do a
database access before firing the event, so in case of a reparent or resharding,
that database access should fail. We should therefore stop dispatching events as
soon as the primary is read-only. Also, if we receive an ack when the database
is read-only, we can’t store it. With the Query Buffering feature, vtgate may be
able to retry later on the new primary.

## Extensions

The following extensions can be implemented, but won't be at first:

* *Queue Ordering*: only fire one event at a time, in order.

* *Event Groups*: group events together using a common key, and when they're
all done, fire another event.