Mirror of https://github.com/github/vitess-gh.git
Commit dcf8ff682c
@@ -18,6 +18,7 @@ and a more [detailed presentation from @Scale '14](http://youtu.be/5yDO-tmIoXY).

### Intro

* [Helicopter overview](http://vitess.io):
  high level overview of Vitess that should tell you whether Vitess is for you.
* [Sharding in Vitess](http://vitess.io/doc/Sharding)
* [Frequently Asked Questions](http://vitess.io/doc/FAQ).

### Using Vitess
@@ -30,19 +30,19 @@ eventual consistency guarantees), run data analysis tools that take a long time

### Tablet

A tablet is a single server that runs:

* a MySQL instance
* a vttablet instance
* a local row cache instance
* any other per-db process that is necessary for operational purposes

It can be idle (not assigned to any keyspace), or assigned to a keyspace/shard. If it becomes unhealthy, it is usually changed to scrap.

It has a type. The commonly used types are:

* master: for the MySQL master, a read-write database.
* replica: for a MySQL slave that serves read-only traffic, with guaranteed low replication latency.
* rdonly: for a MySQL slave that serves read-only traffic for backend processing jobs (like map-reduce type jobs). It has no real guaranteed replication latency.
* spare: for a MySQL slave not used at the moment (hot spare).
* experimental, schema, lag, backup, restore, checker, ...: various types for specific purposes.

Only master, replica and rdonly are advertised in the Serving Graph.

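As a quick illustration of the type rules above, here is a minimal Go sketch (the type names are made up for this example, not the actual Vitess code):

```go
package main

import "fmt"

// TabletType is an illustrative stand-in for the tablet types listed above.
type TabletType string

const (
	Master  TabletType = "master"  // RW MySQL master
	Replica TabletType = "replica" // low-latency read-only slave
	Rdonly  TabletType = "rdonly"  // read-only slave for batch/map-reduce jobs
	Spare   TabletType = "spare"   // hot spare, not serving
	Scrap   TabletType = "scrap"   // unhealthy / decommissioned
)

// isServed reports whether a tablet of this type is advertised in the
// Serving Graph: only master, replica and rdonly are.
func isServed(t TabletType) bool {
	switch t {
	case Master, Replica, Rdonly:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(isServed(Replica), isServed(Spare)) // true false
}
```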
@@ -107,11 +107,11 @@ There is one local instance of that service per Cell (Data Center). The goal is
using the remaining Cells. (A ZooKeeper instance running on 3 or 5 hosts locally is a good configuration.)

The data is partitioned as follows:

* Keyspaces: global instance
* Shards: global instance
* Tablets: local instances
* Serving Graph: local instances
* Replication Graph: the master alias is in the global instance, the master-slave map is in the local cells.

Clients are designed to just read the local Serving Graph, therefore they only need the local instance to be up.

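For reference, the same partitioning expressed as a tiny Go sketch (illustrative names only; the Replication Graph is the one record split across both instances):

```go
package main

import "fmt"

// scopeOf maps each topology record type described above to the instance
// (global or local/per-cell) that stores it. The Replication Graph is split:
// the master alias is global, the master-slave map is per cell.
func scopeOf(record string) string {
	switch record {
	case "Keyspace", "Shard":
		return "global"
	case "Tablet", "ServingGraph":
		return "local"
	case "ReplicationGraph":
		return "global (master alias) + local (master-slave map)"
	default:
		return "unknown"
	}
}

func main() {
	for _, r := range []string{"Keyspace", "Tablet", "ReplicationGraph"} {
		fmt.Println(r, "->", scopeOf(r))
	}
}
```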
@@ -6,30 +6,30 @@ and what kind of effort it would require for you to start using Vitess.

## When do you need Vitess?

* You store all your data in a MySQL database, and have a significant number of
  clients. At some point, you start getting "Too many connections" errors from
  MySQL, so you have to change the max_connections system variable. Every MySQL
  connection has a memory overhead, which is just below 3 MB in the default
  configuration. If you want 1500 additional connections, you will need over 4 GB
  of additional RAM – and this is not going to be contributing to faster queries.

* From time to time, your developers make mistakes. For example, they make your
  app issue a query without setting a LIMIT, which makes the database slow for
  all users. Or maybe they issue updates that break statement-based replication.
  Whenever you see such a query, you react, but it usually takes some time and
  effort to get the story straight.

* You store your data in a MySQL database, and your database has grown
  uncomfortably big. You are planning to do some horizontal sharding. MySQL
  doesn’t support sharding, so you will have to write the code to perform the
  sharding, and then bake all the sharding logic into your app.

* You run a MySQL cluster, and use replication for availability: you have a master
  database and a few replicas, and in the case of a master failure some replica
  should become the new master. You have to manage the lifecycle of the databases,
  and communicate the current state of the system to the application.

* You run a MySQL cluster, and have custom database configurations for different
  workloads. There’s the master where all the writes go, fast read-only replicas
  for web clients, slower read-only replicas for batch jobs, and another kind of
  slower replicas for backups. If you have horizontal sharding, this setup is
@@ -122,7 +122,7 @@ vtctl MigrateServedTypes -reverse test_keyspace/0 replica

## Scrap the source shard

If all the above steps were successful, it’s safe to remove the source shard (which should no longer be in use):

* For each tablet in the source shard: `vtctl ScrapTablet <source tablet alias>`
* For each tablet in the source shard: `vtctl DeleteTablet <source tablet alias>`
* Rebuild the serving graph: `vtctl RebuildKeyspaceGraph test_keyspace`
* Delete the source shard: `vtctl DeleteShard test_keyspace/0`
@@ -54,14 +54,14 @@ live system, it errs on the side of safety, and will abort if any
tablet is not responding right.

The actions performed are:

* any existing tablet replication is stopped. If any tablet fails
  (because it is not available or not succeeding), we abort.
* the master-elect is initialized as a master.
* in parallel for each tablet, we do:
  * on the master-elect, we insert an entry in a test table.
  * on the slaves, we set the master, and wait for the entry in the test table.
* if any tablet fails, we error out.
* we then rebuild the serving graph for the shard.

### Planned Reparents: vtctl PlannedReparentShard
@@ -69,14 +69,14 @@ This command is used when both the current master and the new master
are alive and functioning properly.

The actions performed are:

* we tell the old master to go read-only. It then shuts down its query
  service. We get its replication position back.
* we tell the master-elect to wait for that replication data, and then
  start being the master.
* in parallel for each tablet, we do:
  * on the master-elect, we insert an entry in a test table. If that
    works, we update the MasterAlias record of the global Shard object.
  * on the slaves (including the old master), we set the master, and
    wait for the entry in the test table. (if a slave wasn't
    replicating, we don't change its state and don't start replication
    after reparent)

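Condensed into Go, the control flow described above could look roughly like this; every type and helper below (Tablet, setReadOnly, promoteAfterCatchup, insertTestEntry, reparentAndWait) is a hypothetical placeholder for this sketch, not the real Vitess API:

```go
package main

import "fmt"

// Hypothetical stand-ins; none of this is real Vitess code.
type Tablet struct{ Alias string }
type Position string

func setReadOnly(t Tablet) (Position, error)         { fmt.Println("read-only:", t.Alias); return "pos-1", nil }
func promoteAfterCatchup(t Tablet, p Position) error { fmt.Println("promote:", t.Alias, "at", p); return nil }
func insertTestEntry(master Tablet) error            { fmt.Println("test entry on:", master.Alias); return nil }
func reparentAndWait(s, master Tablet) error         { fmt.Println("reparent", s.Alias, "under", master.Alias); return nil }

// plannedReparent sketches the flow described above: demote the old master,
// promote the master-elect at the returned position, then in parallel insert
// the test-table entry on the master-elect and wait for it on every slave
// (the slaves here include the old master).
func plannedReparent(oldMaster, masterElect Tablet, slaves []Tablet) error {
	pos, err := setReadOnly(oldMaster)
	if err != nil {
		return err
	}
	if err := promoteAfterCatchup(masterElect, pos); err != nil {
		return err
	}
	errs := make(chan error, len(slaves)+1)
	go func() { errs <- insertTestEntry(masterElect) }()
	for _, s := range slaves {
		go func(s Tablet) { errs <- reparentAndWait(s, masterElect) }(s)
	}
	for i := 0; i < len(slaves)+1; i++ {
		if err := <-errs; err != nil {
			return err
		}
	}
	return nil
}

func main() {
	err := plannedReparent(Tablet{"cell1-100"}, Tablet{"cell1-101"},
		[]Tablet{{"cell1-100"}, {"cell1-102"}})
	fmt.Println("err:", err)
}
```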
@@ -96,15 +96,15 @@ just make sure the master-elect is the most advanced in replication
within all the available slaves, and reparent everybody.

The actions performed are:

* if the current master is still alive, we scrap it. That will make it
  stop what it's doing, stop its query service, and be unusable.
* we gather the current replication position on all slaves.
* we make sure the master-elect has the most advanced position.
* we promote the master-elect.
* in parallel for each tablet, we do:
  * on the master-elect, we insert an entry in a test table. If that
    works, we update the MasterAlias record of the global Shard object.
  * on the slaves (excluding the old master), we set the master, and
    wait for the entry in the test table. (if a slave wasn't
    replicating, we don't change its state and don't start replication
    after reparent)
@@ -122,33 +122,33 @@ servers. We then trigger the 'vtctl TabletExternallyReparented'
command.

The flow for that command is as follows:

* the shard is locked in the global topology server.
* we read the Shard object from the global topology server.
* we read all the tablets in the replication graph for the shard. Note
  we allow partial reads here, so if a data center is down, as long as
  the data center containing the new master is up, we keep going.
* the new master performs a 'SlaveWasPromoted' action. This remote
  action makes sure the new master is not a MySQL slave of another
  server (the 'show slave status' command should not return anything,
  meaning 'reset slave' should have been called).
* for every host in the replication graph, we call the
  'SlaveWasRestarted' action. It takes as parameter the address of the
  new master. On each slave, we update the topology server record for
  that tablet with the new master, and the replication graph for that
  tablet as well.
* for the old master, if it doesn't successfully return from
  'SlaveWasRestarted', we change its type to 'spare' (so a dead old
  master doesn't interfere).
* we then update the Shard object with the new master.
* we rebuild the serving graph for that shard. This will update the
  'master' record for sure, and also keep all the tablets that have
  successfully reparented.

Failure cases:

* The global topology server has to be available for locking and
  modification during this operation. If not, the operation will just
  fail.
* If a single topology server is down in one data center (and it's not
  the master data center), the tablets in that data center will be
  ignored by the reparent. When the topology server comes back up,
  just re-run 'vtctl InitTablet' on the tablets, and that will fix
@@ -1,35 +1,40 @@
# Resharding

In Vitess, resharding describes the process of re-organizing data
dynamically, with very minimal downtime (we manage to completely
perform most data transitions with less than 5 seconds of read-only
downtime - new data cannot be written, existing data can still be
read).

See the description of [how Sharding works in Vitess](Sharding.md) for
higher level concepts on Sharding.

## Process

To follow a step-by-step guide for how to shard a keyspace, you can see [this page](HorizontalReshardingGuide.md).

In general, the process to achieve this goal is composed of the following steps:

* pick the original shard(s)
* pick the destination shard(s) coverage
* create the destination shard(s) tablets (in a mode where they are not used to serve traffic yet)
* bring up the destination shard(s) tablets, with read-only masters.
* backup and split the data from the original shard(s)
* merge and import the data on the destination shard(s)
* start and run filtered replication from original to destination shard(s), catch up
* move the read-only traffic to the destination shard(s), stop serving read-only traffic from original shard(s). This transition can take a few hours. We might want to move rdonly separately from replica traffic.
* in quick succession:
  * make original master(s) read-only
  * flush filtered replication on all filtered replication source servers (after making sure they were caught up with their masters)
  * wait until replication is caught up on all destination shard(s) masters
  * move the write traffic to the destination shard(s)
  * make destination master(s) read-write
* scrap the original shard(s)

## Applications

The main applications we currently support:

* in a sharded keyspace, split or merge shards (horizontal sharding)
* in a non-sharded keyspace, break out some tables into a different keyspace (vertical sharding)

With these supported features, it is very easy to start with a single keyspace containing all the data (multiple tables),
and then as the data grows, move tables to different keyspaces, start sharding some keyspaces, ... without any real
@@ -38,17 +43,17 @@ downtime for the application.

## Scaling Up and Down

Here is a quick guide to what to do with Vitess when a change is required:

* uniformly increase read capacity: add replicas, or split shards
* uniformly increase write capacity: split shards
* reclaim free space: merge shards / keyspaces
* increase geo-diversity: add new cells and new replicas
* cool a hot tablet: if read access, add replicas or split shards; if write access, split shards.

## Filtered Replication

The cornerstone of Resharding is being able to replicate the right data. MySQL doesn't support any filtering, so the
Vitess project implements it entirely:

* the tablet server tags transactions with comments that describe the scope of the statements (which keyspace_id,
  which table, ...). That way the MySQL binlogs contain all filtering data.
* a server process can filter and stream the MySQL binlogs (using the comments).
* a client process can apply the filtered logs locally (they are just regular SQL statements at this point).

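A toy Go sketch of the mechanism (purely illustrative — the comment format and helper names here are invented for this example, and the real logic lives in the Vitess binlog code): the tablet server tags each statement with its keyspace_id, and a filtering process keeps only statements whose keyspace_id falls inside the destination shard's key range.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// tagStatement appends a comment carrying the keyspace_id scope, mimicking how
// the tablet server annotates transactions so the binlogs contain filtering data.
// The exact comment syntax is made up for this sketch.
func tagStatement(sql string, keyspaceID uint64) string {
	return fmt.Sprintf("%s /* keyspace_id:%016x */", sql, keyspaceID)
}

var keyspaceIDRe = regexp.MustCompile(`keyspace_id:([0-9a-f]{16})`)

// keepForShard decides whether a tagged statement belongs to a destination
// shard covering [start, end) of the keyspace_id space (end == 0 means "no
// upper bound" in this simplified sketch).
func keepForShard(taggedSQL string, start, end uint64) bool {
	m := keyspaceIDRe.FindStringSubmatch(taggedSQL)
	if m == nil {
		return false
	}
	kid, _ := strconv.ParseUint(m[1], 16, 64)
	return kid >= start && (end == 0 || kid < end)
}

func main() {
	stmt := tagStatement("UPDATE user SET name='x' WHERE user_id=123", 0x9f00000000000000)
	fmt.Println(stmt)
	fmt.Println(keepForShard(stmt, 0x8000000000000000, 0xc000000000000000)) // true: shard 80-c0
}
```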
@@ -42,11 +42,11 @@ $ vtctl -wait-time=30s ValidateSchemaKeyspace user

## Changing the Schema

Goals:

* simplify schema updates on the fleet
* minimize human actions / errors
* guarantee no or very little downtime for most schema updates
* do not store any permanent schema data in Topology Server, just use it for actions.
* only look at tables for now (not stored procedures or grants for instance, although they both could be added fairly easily in the same manner)

We’re trying to get reasonable confidence that a schema update is going to work before applying it. Since we cannot really apply a change to live tables without potentially causing trouble, we have implemented a Preflight operation: it copies the current schema into a temporary database, applies the change there to validate it, and gathers the resulting schema. After this Preflight, we have a good idea of what to expect, and we can apply the change to any database and make sure it worked.
@@ -71,35 +71,37 @@ type SchemaChange struct {
```

And the associated ApplySchema remote action for a tablet. Then the performed steps are:

* The database to use is either derived from the tablet dbName if UseVt is false, or is the _vt database. A ‘use dbname’ is prepended to the Sql.
* (if BeforeSchema is not nil) read the schema, make sure it is equal to BeforeSchema. If not equal: if Force is not set, we will abort; if Force is set, we’ll issue a warning and keep going.
* if AllowReplication is false, we’ll disable replication (adding SET sql_log_bin=0 before the Sql).
* We will then apply the Sql command.
* (if AfterSchema is not nil) read the schema again, make sure it is equal to AfterSchema. If not equal: if Force is not set, we will issue an error; if Force is set, we’ll issue a warning.

We will return the following information:

* whether it worked or not (doh!)
* BeforeSchema
* AfterSchema

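The SchemaChange struct definition itself is cut off by the hunk above; based on the fields the steps reference (Sql, Force, AllowReplication, BeforeSchema, AfterSchema, UseVt), an illustrative Go sketch of the shape and of the Before/AfterSchema check could look like this (field types and helpers are assumptions, not the verbatim Vitess definitions):

```go
package main

import "fmt"

// SchemaDefinition is a stand-in for the schema snapshot Vitess gathers;
// here it is just a string for illustration.
type SchemaDefinition string

// SchemaChange mirrors the fields referenced in the steps above
// (illustrative types, not the verbatim Vitess struct).
type SchemaChange struct {
	Sql              string
	Force            bool
	AllowReplication bool
	BeforeSchema     *SchemaDefinition
	AfterSchema      *SchemaDefinition
	UseVt            bool
}

// checkSchema implements the comparison rule described above: on mismatch,
// abort unless Force is set, in which case we only warn and keep going.
func checkSchema(want *SchemaDefinition, got SchemaDefinition, force bool, phase string) error {
	if want == nil || *want == got {
		return nil
	}
	if force {
		fmt.Printf("warning: %s schema mismatch, continuing because Force is set\n", phase)
		return nil
	}
	return fmt.Errorf("%s schema mismatch", phase)
}

func main() {
	before := SchemaDefinition("CREATE TABLE t (id bigint)")
	sc := SchemaChange{Sql: "ALTER TABLE t ADD COLUMN name varchar(128)", BeforeSchema: &before}
	fmt.Println(checkSchema(sc.BeforeSchema, "CREATE TABLE t (id bigint)", sc.Force, "before")) // <nil>
}
```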
### Use case 1: Single tablet update

* we first do a Preflight (to know what BeforeSchema and AfterSchema will be). This can be disabled, but is not recommended.
* we then do the schema upgrade. We will check BeforeSchema before the upgrade, and AfterSchema after the upgrade.

### Use case 2: Single Shard update

* need to figure out (or be told) if it’s a simple or complex schema update (does it require the shell game?). For now we'll use a command line flag.
* in any case, do a Preflight on the master, to get the BeforeSchema and AfterSchema values.
* in any case, gather the schema on all databases, to see which ones have been upgraded already or not. This guarantees we can interrupt and restart a schema change. Also, this makes sure no action is currently running on the databases we're about to change.
* if simple:
  * nobody has it: apply to master, very similar to a single tablet update.
  * some tablets have it but not others: error out
* if complex: do the shell game while disabling replication. Skip the tablets that already have it. Have an option to re-parent at the end.
* Note the Backup and Lag servers won't apply a complex schema change. Only the servers actively in the replication graph will.
* the process can be interrupted at any time; restarting it as a complex schema upgrade should just work.

### Use case 3: Keyspace update

* Similar to Single Shard, but the BeforeSchema and AfterSchema values are taken from the first shard, and used in all shards after that.
* We don't know the new masters to use on each shard, so we just skip re-parenting altogether.

This translates into the following vtctl commands:
@@ -0,0 +1,165 @@
# Sharding in Vitess

This document describes the various options for sharding in Vitess.

## Range-based Sharding

This is the out-of-the-box sharding solution for Vitess. Each record
in the database has a Sharding Key value associated with it (and
stored in that row). Two records with the same Sharding Key are always
collocated on the same shard (allowing joins for instance). Two
records with different sharding keys are not necessarily on the same
shard.

In a Keyspace, each Shard then contains a range of Sharding Key
values. The full set of Shards covers the entire range.

In this environment, query routing needs to figure out the Sharding
Key or Keys for each query, and then route it properly to the
appropriate shard(s). We achieve this by providing either a sharding
key value (known as KeyspaceID in the API), or a sharding key range
(KeyRange). We are also developing more ways to route queries
automatically with [version 3 of our API](VTGateV3.md), where we store
more metadata in our topology to understand where to route queries.

### Sharding Key

Sharding Keys need to be compared to each other to enable range-based
sharding, as we need to figure out if a value is within a Shard's
range.

Vitess was designed to allow two types of sharding keys:

* Binary data: just an array of bytes. We use regular byte array
  comparison here. Can be used for strings. MySQL representation is a
  VARBINARY field.
* 64 bits unsigned integer: we first convert the 64 bits integer into
  a byte array (by just copying the bytes, most significant byte
  first, into 8 bytes). Then we apply the same byte array
  comparison. MySQL representation is a bigint(20) UNSIGNED.

A sharded keyspace contains information about the type of sharding key
(binary data or 64 bits unsigned integer), and the column name for the
Sharding Key (that is in every table).

To guarantee a balanced use of the shards, the Sharding Keys should be
evenly distributed in their space (if half the records of a Keyspace
map to a single sharding key, all these records will always be on the
same shard, making it impossible to split).

A common example of a sharding key in a web site that serves millions
of users is the 64-bit hash of a User Id. By using a hashing function
the Sharding Keys are evenly distributed in the space. By using a very
granular sharding key, one could shard all the way to one user per
shard.

Comparison of Sharding Keys can be a bit tricky, but if one remembers
that they are always converted to byte arrays and then compared, it
becomes easier. For instance, the value [ 0x80 ] is the mid value for
Sharding Keys. All byte sharding keys that start with numbers strictly
lower than 0x80 are lower, and all byte sharding keys that start with
a number equal to or greater than 0x80 are higher. Note that [ 0x80 ]
can also be compared to 64 bits unsigned integer values: any value
that is smaller than 0x8000000000000000 will be smaller, any value
equal or greater will be greater.

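A small runnable Go sketch of that comparison rule: uint64 sharding keys are serialized most-significant-byte first, and everything is then compared as byte arrays, which is why [ 0x80 ] is the midpoint of the space.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// uint64Key converts a 64-bit unsigned sharding key into its byte-array form,
// most significant byte first, as described above.
func uint64Key(v uint64) []byte {
	var b [8]byte
	binary.BigEndian.PutUint64(b[:], v)
	return b[:]
}

func main() {
	mid := []byte{0x80} // the mid value of the sharding key space

	// A binary sharding key starting with a byte < 0x80 sorts below the midpoint.
	fmt.Println(bytes.Compare([]byte("name"), mid)) // -1: 'n' is 0x6e, below 0x80

	// The same midpoint works for uint64 keys: anything below 0x8000000000000000
	// is in the lower half, anything equal or greater is in the upper half.
	fmt.Println(bytes.Compare(uint64Key(0x7fffffffffffffff), mid)) // -1
	fmt.Println(bytes.Compare(uint64Key(0x8000000000000000), mid)) // 1
}
```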
### Key Range

A Key Range has a Start and an End value. A value is inside the Key
Range if it is greater or equal to the Start, and strictly less than
the End.

Two Key Ranges are consecutive if the End of the first one is equal to
the Start of the next one.

Two special values exist:

* if a Start is empty, it represents the lowest value, and all values
  are greater than it.
* if an End is empty, it represents the biggest value, and all values
  are strictly lower than it.

Examples:

* Start=[], End=[]: full Key Range
* Start=[], End=[0x80]: Lower half of the Key Range.
* Start=[0x80], End=[]: Upper half of the Key Range.
* Start=[0x40], End=[0x80]: Second quarter of the Key Range.

As noted previously, this can be used for both binary and uint64
Sharding Keys. For uint64 Sharding Keys, the single byte number
represents the most significant 8 bits of the value.

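These rules map directly to code. A hedged Go sketch of key-range containment (the KeyRange type here is illustrative, not the actual Vitess key package):

```go
package main

import (
	"bytes"
	"fmt"
)

// KeyRange is [Start, End): Start is inclusive, End is exclusive.
// An empty Start means "lowest possible value"; an empty End means "highest".
type KeyRange struct {
	Start, End []byte
}

// Contains reports whether a sharding key value falls inside the range,
// following the rules above.
func (kr KeyRange) Contains(key []byte) bool {
	if len(kr.Start) > 0 && bytes.Compare(key, kr.Start) < 0 {
		return false
	}
	if len(kr.End) > 0 && bytes.Compare(key, kr.End) >= 0 {
		return false
	}
	return true
}

func main() {
	lowerHalf := KeyRange{End: []byte{0x80}}                      // Start=[], End=[0x80]
	secondQtr := KeyRange{Start: []byte{0x40}, End: []byte{0x80}} // Start=[0x40], End=[0x80]
	fmt.Println(lowerHalf.Contains([]byte{0x10})) // true
	fmt.Println(secondQtr.Contains([]byte{0x80})) // false: End is exclusive
}
```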
### Range-based Shard Name

The Name of a Shard when it is part of a Range-based sharded keyspace
is the Start and End of its keyrange, printed in hexadecimal, and
separated by a hyphen.

For instance, if Start is the array of bytes [ 0x80 ] and End is the
array of bytes [ 0xc0 ], then the name of the Shard will be: 80-c0

We will use this convention in the rest of this document.

### Sharding Key Partition

A partition represents a set of Key Ranges that cover the entire space. For instance, the following four shards are a valid full partition:

* -40
* 40-80
* 80-c0
* c0-

When we build the serving graph for a given Range-based Sharded
Keyspace, we ensure the Shards are valid and cover the full space.

During resharding, we can split or merge consecutive Shards with very
minimal downtime.

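A short Go sketch (illustrative types only) of that coverage check: ordered by Start, the shards must begin at the empty Start, be consecutive, and finish at the empty End.

```go
package main

import (
	"bytes"
	"fmt"
)

// shard pairs a name like "40-80" with its key range (Start inclusive, End
// exclusive, empty slice = open end). Illustrative only.
type shard struct {
	name       string
	start, end []byte
}

// isFullPartition checks that the shards, in order, start at the empty Start,
// are consecutive (each End equals the next Start), and end at the empty End.
func isFullPartition(shards []shard) bool {
	if len(shards) == 0 || len(shards[0].start) != 0 || len(shards[len(shards)-1].end) != 0 {
		return false
	}
	for i := 1; i < len(shards); i++ {
		if !bytes.Equal(shards[i-1].end, shards[i].start) {
			return false
		}
	}
	return true
}

func main() {
	full := []shard{
		{"-40", nil, []byte{0x40}},
		{"40-80", []byte{0x40}, []byte{0x80}},
		{"80-c0", []byte{0x80}, []byte{0xc0}},
		{"c0-", []byte{0xc0}, nil},
	}
	fmt.Println(isFullPartition(full))     // true
	fmt.Println(isFullPartition(full[:3])) // false: the c0- quarter is missing
}
```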
### Resharding

Vitess provides a set of tools and processes to deal with Range Based Shards:

* [Dynamic resharding](Resharding.md) allows splitting or merging of shards with no
  read downtime, and very minimal master unavailability (<5s).
* Client APIs are designed to take sharding into account.
* [Map-reduce framework](https://github.com/youtube/vitess/blob/master/java/vtgate-client/src/main/java/com/youtube/vitess/vtgate/hadoop/README.md) fully utilizes the Key Ranges to read data as
  fast as possible, concurrently from all shards and all replicas.

### Cross-Shard Indexes

The 'primary key' for sharded data is the Sharding Key value. In order
to look up data with another index, it is very straightforward to
create a lookup table in a different Keyspace that maps that other
index to the Sharding Key.

For instance, if User ID is hashed as the Sharding Key in a User keyspace,
adding a User Name to User Id lookup table in a different Lookup
keyspace allows the application to also route those queries the right way.

With the current version of the API, Cross-Shard indexes have to be
handled at the application layer. However, [version 3 of our API](VTGateV3.md)
will provide multiple ways to solve this without application layer changes.

## Custom Sharding

This is designed to be used if your application already has support
for sharding, or if you want to control exactly which shard the data
goes to. In that use case, each Keyspace just has a collection of
shards, and the client code always specifies which shard to talk to.

The shards can just use any name, and they are always addressed by
name. The API calls to vtgate are ExecuteShard, ExecuteBatchShard and
StreamExecuteShard. None of the *KeyspaceIds, *KeyRanges or *EntityIds
API calls can be used.

The Map-Reduce framework can still iterate over the data across multiple shards.

Also, none of the automated resharding tools and processes that Vitess
provides for Range-Based sharding can be used here.

Note: the *Shard API calls are not exposed by all clients at the
moment. This is going to be fixed soon.

### Custom Sharding Example: Lookup-Based Sharding

One example of Custom Sharding is Lookup-based: one keyspace is used
as a lookup keyspace, and contains the mapping between the identifying
key of a record, and the shard name it is on. When accessing the
records, the client first needs to find the shard name by looking it
up in the lookup table, and then knows where to route queries.

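A minimal Go sketch of that lookup flow, with a plain map standing in for the lookup keyspace and a hypothetical execOnShard callback standing in for an ExecuteShard-style call (this is not the actual Vitess client API):

```go
package main

import "fmt"

// lookupShard is a stand-in for the lookup keyspace: it maps a record's
// identifying key to the name of the shard that holds the record.
var lookupShard = map[string]string{
	"user:123": "shard-2",
	"user:456": "shard-7",
}

// routeQuery finds the shard name for a key, then hands the query to an
// ExecuteShard-style call (execOnShard is a hypothetical callback).
func routeQuery(key, sql string, execOnShard func(shard, sql string) error) error {
	shard, ok := lookupShard[key]
	if !ok {
		return fmt.Errorf("no shard mapping for %q", key)
	}
	return execOnShard(shard, sql)
}

func main() {
	err := routeQuery("user:123", "SELECT * FROM user WHERE user_id=123",
		func(shard, sql string) error {
			fmt.Printf("executing on %s: %s\n", shard, sql)
			return nil
		})
	fmt.Println("err:", err)
}
```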
@@ -57,8 +57,8 @@ level picture of all the servers and their current state.

### vtworker

vtworker is meant to host long-running processes. It supports a plugin infrastructure, and offers libraries to easily pick tablets to use. We have developed:

* resharding differ jobs: meant to check data integrity during shard splits and joins.
* vertical split differ jobs: meant to check data integrity during vertical splits and joins.

It is very easy to add other checker processes for in-tablet integrity checks (verifying foreign-key-like relationships), and cross-shard data integrity (for instance, if a keyspace contains an index table referencing data in another keyspace).
@@ -41,12 +41,12 @@ An entire Keyspace can be locked. We use this during resharding for instance, wh

### Shard

A Shard contains a subset of the data for a Keyspace. The Shard record in the global topology contains:

* the MySQL Master tablet alias for this shard
* the sharding key range covered by this Shard inside the Keyspace
* the tablet types this Shard is serving (master, replica, batch, …), per cell if necessary.
* if the shard is undergoing filtered replication, the source shards this shard is replicating from
* the list of cells that have tablets in this shard
* shard-global tablet controls, like blacklisted tables no tablet should serve in this shard

A Shard can be locked. We use this during operations that affect either the Shard record, or multiple tablets within a Shard (like reparenting), so multiple jobs don’t concurrently alter the data.

|
|||
### Tablets
|
||||
|
||||
The Tablet record has a lot of information about a single vttablet process running inside a tablet (along with the MySQL process):
|
||||
- the Tablet Alias (cell+unique id) that uniquely identifies the Tablet
|
||||
- the Hostname, IP address and port map of the Tablet
|
||||
- the current Tablet type (master, replica, batch, spare, …)
|
||||
- which Keyspace / Shard the tablet is part of
|
||||
- the health map for the Tablet (if in degraded mode)
|
||||
- the sharding Key Range served by this Tablet
|
||||
- user-specified tag map (to store per installation data for instance)
|
||||
* the Tablet Alias (cell+unique id) that uniquely identifies the Tablet
|
||||
* the Hostname, IP address and port map of the Tablet
|
||||
* the current Tablet type (master, replica, batch, spare, …)
|
||||
* which Keyspace / Shard the tablet is part of
|
||||
* the health map for the Tablet (if in degraded mode)
|
||||
* the sharding Key Range served by this Tablet
|
||||
* user-specified tag map (to store per installation data for instance)
|
||||
|
||||
A Tablet record is created before a tablet can be running (either by `vtctl InitTablet` or by passing the `init_*` parameters to vttablet). The only way a Tablet record will be updated is one of:
|
||||
- The vttablet process itself owns the record while it is running, and can change it.
|
||||
- At init time, before the tablet starts
|
||||
- After shutdown, when the tablet gets scrapped or deleted.
|
||||
- If a tablet becomes unresponsive, it may be forced to spare to remove it from the serving graph (such as when reparenting away from a dead master, by the `vtctl ReparentShard` action).
|
||||
* The vttablet process itself owns the record while it is running, and can change it.
|
||||
* At init time, before the tablet starts
|
||||
* After shutdown, when the tablet gets scrapped or deleted.
|
||||
* If a tablet becomes unresponsive, it may be forced to spare to remove it from the serving graph (such as when reparenting away from a dead master, by the `vtctl ReparentShard` action).
|
||||
|
||||
### Replication Graph
|
||||
|
||||
|
@@ -86,16 +86,16 @@ The Serving Graph is what the clients use to find which EndPoints to send querie

#### SrvKeyspace

It is the local representation of a Keyspace. It contains information on what shard to use for getting to the data (but not information about each individual shard):

* the partitions map is keyed by the tablet type (master, replica, batch, …) and the values are lists of shards to use for serving.
* it also contains the global Keyspace fields, copied for fast access.

It can be rebuilt by running `vtctl RebuildKeyspaceGraph`. It is not automatically rebuilt when adding new tablets in a cell, as this would cause too much overhead and is only needed once per cell/keyspace. It may also be changed during horizontal and vertical splits.

#### SrvShard

It is the local representation of a Shard. It contains information on details internal to this Shard only, but not to any tablet running in this shard:

* the name and sharding Key Range for this Shard.
* the cell that has the master for this Shard.

It is possible to lock a SrvShard object, to massively update all EndPoints in it.
@@ -104,10 +104,10 @@ It can be rebuilt (along with all the EndPoints in this Shard) by running `vtctl

#### EndPoints

For each possible serving type (master, replica, batch), in each Cell / Keyspace / Shard, we maintain a rolled-up EndPoint list. Each entry in the list has information about one Tablet:

* the Tablet Uid
* the Host on which the Tablet resides
* the port map for that Tablet
* the health map for that Tablet

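Putting SrvKeyspace, SrvShard and EndPoints together, serving-graph resolution in a cell conceptually works like the following Go sketch (all types and the data layout are simplified illustrations, not the real structures): pick the partition for the requested tablet type, find the shard whose key range contains the keyspace_id, then use that shard's EndPoints.

```go
package main

import "fmt"

// Illustrative serving-graph shapes, not the real Vitess structures.
type shardRef struct {
	Name       string
	Start, End byte // single-byte key range bounds; End == 0 means open-ended (simplification)
}

type srvKeyspace struct {
	// Partitions is keyed by tablet type; each value lists the shards to use for serving.
	Partitions map[string][]shardRef
}

// endPoints maps "shard/tabletType" to the host:port entries to query.
var endPoints = map[string][]string{
	"80-c0/replica": {"host1:15002", "host2:15002"},
}

// pickEndPoints resolves the EndPoints for one keyspace_id byte and tablet type.
func pickEndPoints(ks srvKeyspace, tabletType string, keyspaceID byte) []string {
	for _, s := range ks.Partitions[tabletType] {
		if keyspaceID >= s.Start && (s.End == 0 || keyspaceID < s.End) {
			return endPoints[s.Name+"/"+tabletType]
		}
	}
	return nil
}

func main() {
	ks := srvKeyspace{Partitions: map[string][]shardRef{
		"replica": {{"-80", 0, 0x80}, {"80-c0", 0x80, 0xc0}, {"c0-", 0xc0, 0}},
	}}
	fmt.Println(pickEndPoints(ks, "replica", 0x9f)) // [host1:15002 host2:15002]
}
```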
## Workflows Involving the Topology Server
@@ -144,13 +144,13 @@ For locking, we use an auto-incrementing file name in the `/action` subdirectory

Note the paths used to store global and per-cell data do not overlap, so a single ZK can be used for both global and local ZKs. This is however not recommended, for reliability reasons.

* Keyspace: `/zk/global/vt/keyspaces/<keyspace>`
* Shard: `/zk/global/vt/keyspaces/<keyspace>/shards/<shard>`
* Tablet: `/zk/<cell>/vt/tablets/<uid>`
* Replication Graph: `/zk/<cell>/vt/replication/<keyspace>/<shard>`
* SrvKeyspace: `/zk/<cell>/vt/ns/<keyspace>`
* SrvShard: `/zk/<cell>/vt/ns/<keyspace>/<shard>`
* EndPoints: `/zk/<cell>/vt/ns/<keyspace>/<shard>/<tablet type>`

We provide the 'zk' utility for easy access to the topology data in ZooKeeper. For instance:

```
@@ -171,11 +171,11 @@ We use the `_Data` filename to store the data, JSON encoded.

For locking, we store a `_Lock` file with various contents in the directory that contains the object to lock.

We use the following paths:

* Keyspace: `/vt/keyspaces/<keyspace>/_Data`
* Shard: `/vt/keyspaces/<keyspace>/<shard>/_Data`
* Tablet: `/vt/tablets/<cell>-<uid>/_Data`
* Replication Graph: `/vt/replication/<keyspace>/<shard>/_Data`
* SrvKeyspace: `/vt/ns/<keyspace>/_Data`
* SrvShard: `/vt/ns/<keyspace>/<shard>/_Data`
* EndPoints: `/vt/ns/<keyspace>/<shard>/<tablet type>`