Merge pull request #685 from youtube/log

Log
Alain Jobart 2015-05-11 10:52:33 -07:00
Parents ecc7e2863f fea04f1503
Commit dcf8ff682c
10 changed files: 320 additions and 147 deletions

View File

@ -18,6 +18,7 @@ and a more [detailed presentation from @Scale '14](http://youtu.be/5yDO-tmIoXY).
### Intro
* [Helicopter overview](http://vitess.io):
high level overview of Vitess that should tell you whether Vitess is for you.
* [Sharding in Vitess](http://vitess.io/doc/Sharding)
* [Frequently Asked Questions](http://vitess.io/doc/FAQ).
### Using Vitess

View File

@ -30,19 +30,19 @@ eventual consistency guarantees), run data analysis tools that take a long time
### Tablet
A tablet is a single server that runs:
* a MySQL instance
* a vttablet instance
* a local row cache instance
* other per-db processes that are necessary for operational purposes
It can be idle (not assigned to any keyspace), or assigned to a keyspace/shard. If it becomes unhealthy, it is usually changed to scrap.
It has a type. The commonly used types are:
* master: for the mysql master, RW database.
* replica: for a mysql slave that serves read-only traffic, with guaranteed low replication latency.
* rdonly: for a mysql slave that serves read-only traffic for backend processing jobs (like map-reduce type jobs). It has no real guaranteed replication latency.
* spare: for a mysql slave not used at the moment (hot spare).
* experimental, schema, lag, backup, restore, checker, ... : various types for specific purposes.
Only master, replica and rdonly are advertised in the Serving Graph.
@ -107,11 +107,11 @@ There is one local instance of that service per Cell (Data Center). The goal is
using the remaining Cells. (a Zookeeper instance running on 3 or 5 hosts locally is a good configuration).
The data is partitioned as follows:
* Keyspaces: global instance
* Shards: global instance
* Tablets: local instances
* Serving Graph: local instances
* Replication Graph: the master alias is in the global instance, the master-slave map is in the local cells.
Clients are designed to just read the local Serving Graph, therefore they only need the local instance to be up.

View File

@ -6,30 +6,30 @@ and what kind of effort it would require for you to start using Vitess.
## When do you need Vitess?
* You store all your data in a MySQL database, and have a significant number of
clients. At some point, you start getting Too many connections errors from
MySQL, so you have to change the max_connections system variable. Every MySQL
connection has a memory overhead, which is just below 3 MB in the default
configuration. If you want 1500 additional connections, you will need over 4 GB
of additional RAM – and this is not going to be contributing to faster queries.
* From time to time, your developers make mistakes. For example, they make your
app issue a query without setting a LIMIT, which makes the database slow for
all users. Or maybe they issue updates that break statement based replication.
Whenever you see such a query, you react, but it usually takes some time and
effort to get the story straight.
* You store your data in a MySQL database, and your database has grown
uncomfortably big. You are planning to do some horizontal sharding. MySQL
doesn't support sharding, so you will have to write the code to perform the
sharding, and then bake all the sharding logic into your app.
* You run a MySQL cluster, and use replication for availability: you have a master
database and a few replicas, and in the case of a master failure some replica
should become the new master. You have to manage the lifecycle of the databases,
and communicate the current state of the system to the application.
* You run a MySQL cluster, and have custom database configurations for different
workloads. There's the master where all the writes go, fast read-only replicas
for web clients, slower read-only replicas for batch jobs, and another kind of
slower replicas for backups. If you have horizontal sharding, this setup is

View File

@ -122,7 +122,7 @@ vtctl MigrateServedTypes -reverse test_keyspace/0 replica
## Scrap the source shard
If all the above steps were successful, it's safe to remove the source shard (which should no longer be in use):
* For each tablet in the source shard: `vtctl ScrapTablet <source tablet alias>`
* For each tablet in the source shard: `vtctl DeleteTablet <source tablet alias>`
* Rebuild the serving graph: `vtctl RebuildKeyspaceGraph test_keyspace`
* Delete the source shard: `vtctl DeleteShard test_keyspace/0`

View File

@ -54,14 +54,14 @@ live system, it errs on the side of safety, and will abort if any
tablet is not responding right.
The actions performed are:
* any existing tablet replication is stopped. If any tablet fails
(because it is not available or not succeeding), we abort.
* the master-elect is initialized as a master.
* in parallel for each tablet, we do:
* on the master-elect, we insert an entry in a test table (see the sketch after this list).
* on the slaves, we set the master, and wait for the entry in the test table.
* if any tablet fails, we error out.
* we then rebuild the serving graph for the shard.
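The test-table check in the flow above boils down to "write a marker row on the master-elect, then wait for it to show up on every slave". Below is a minimal Go sketch of that idea; the `_vt.reparent_test` table name and its columns are illustrative, not the exact ones vtctl uses.
```
package reparentcheck

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL driver for database/sql
)

// insertMarker writes a marker row on the master-elect.
func insertMarker(masterElect *sql.DB, id int64) error {
	_, err := masterElect.Exec(
		"INSERT INTO _vt.reparent_test (id, time_created_ns) VALUES (?, ?)",
		id, time.Now().UnixNano())
	return err
}

// waitForMarker polls one slave until the marker row has replicated,
// or errors out after the timeout (which would abort the reparent).
func waitForMarker(slave *sql.DB, id int64, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		var count int
		err := slave.QueryRow(
			"SELECT COUNT(*) FROM _vt.reparent_test WHERE id = ?", id).Scan(&count)
		if err == nil && count > 0 {
			return nil
		}
		time.Sleep(100 * time.Millisecond)
	}
	return fmt.Errorf("marker %d did not replicate within %v", id, timeout)
}
```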
### Planned Reparents: vtctl PlannedReparentShard
@ -69,14 +69,14 @@ This command is used when both the current master and the new master
are alive and functioning properly.
The actions performed are:
* we tell the old master to go read-only. It then shuts down its query
service. We get its replication position back.
* we tell the master-elect to wait for that replication data, and then
start being the master.
* in parallel for each tablet, we do:
* on the master-elect, we insert an entry in a test table. If that
works, we update the MasterAlias record of the global Shard object.
* on the slaves (including the old master), we set the master, and
wait for the entry in the test table. (if a slave wasn't
replicating, we don't change its state and don't start replication
after reparent)
@ -96,15 +96,15 @@ just make sure the master-elect is the most advanced in replication
within all the available slaves, and reparent everybody.
The actions performed are:
* if the current master is still alive, we scrap it. That will make it
stop what it's doing, stop its query service, and be unusable.
* we gather the current replication position on all slaves.
* we make sure the master-elect has the most advanced position.
* we promote the master-elect.
* in parallel for each tablet, we do:
* on the master-elect, we insert an entry in a test table. If that
works, we update the MasterAlias record of the global Shard object.
* on the slaves (excluding the old master), we set the master, and
wait for the entry in the test table. (if a slave wasn't
replicating, we don't change its state and don't start replication
after reparent)
@ -122,33 +122,33 @@ servers. We then trigger the 'vtctl TabletExternallyReparented'
command.
The flow for that command is as follows:
* the shard is locked in the global topology server.
* we read the Shard object from the global topology server.
* we read all the tablets in the replication graph for the shard. Note
we allow partial reads here, so if a data center is down, as long as
the data center containing the new master is up, we keep going.
* the new master performs a 'SlaveWasPromoted' action. This remote
action makes sure the new master is not a MySQL slave of another
server (the 'show slave status' command should not return anything,
meaning 'reset slave' should have been called).
* for every host in the replication graph, we call the
'SlaveWasRestarted' action. It takes as parameter the address of the
new master. On each slave, we update the topology server record for
that tablet with the new master, and the replication graph for that
tablet as well.
* for the old master, if it doesn't successfully return from
'SlaveWasRestarted', we change its type to 'spare' (so a dead old
master doesn't interfere).
* we then update the Shard object with the new master.
* we rebuild the serving graph for that shard. This will update the
'master' record for sure, and also keep all the tablets that have
successfully reparented.
Failure cases:
* The global topology server has to be available for locking and
modification during this operation. If not, the operation will just
fail.
* If a single topology server is down in one data center (and it's not
the master data center), the tablets in that data center will be
ignored by the reparent. When the topology server comes back up,
just re-run 'vtctl InitTablet' on the tablets, and that will fix

View File

@ -1,35 +1,40 @@
# Resharding
In Vitess, resharding describes the process of re-organizing data
dynamically, with very minimal downtime (we manage to completely
perform most data transitions with less than 5 seconds of read-only
downtime - new data cannot be written, existing data can still be
read).
See the description of [how Sharding works in Vitess](Sharding.md) for
higher level concepts on Sharding.
## Process
To follow a step-by-step guide for how to shard a keyspace, you can see [this page](HorizontalReshardingGuide.md).
In general, the process to achieve this goal is composed of the following steps:
* pick the original shard(s)
* pick the destination shard(s) coverage
* create the destination shard(s) tablets (in a mode where they are not used to serve traffic yet)
* bring up the destination shard(s) tablets, with read-only masters.
* backup and split the data from the original shard(s)
* merge and import the data on the destination shard(s)
* start and run filtered replication from original to destination shard(s), catch up
* move the read-only traffic to the destination shard(s), stop serving read-only traffic from original shard(s). This transition can take a few hours. We might want to move rdonly separately from replica traffic.
* in quick succession:
* make original master(s) read-only
* flush filtered replication on all filtered replication source servers (after making sure they were caught up with their masters)
* wait until replication is caught up on all destination shard(s) masters
* move the write traffic to the destination shard(s)
* make destination master(s) read-write
* scrap the original shard(s)
## Applications
The main applications we currently support:
* in a sharded keyspace, split or merge shards (horizontal sharding)
* in a non-sharded keyspace, break out some tables into a different keyspace (vertical sharding)
With these supported features, it is very easy to start with a single keyspace containing all the data (multiple tables),
and then as the data grows, move tables to different keyspaces, start sharding some keyspaces, ... without any real
@ -38,17 +43,17 @@ downtime for the application.
## Scaling Up and Down
Here is a quick table of what to do with Vitess when a change is required:
* uniformly increase read capacity: add replicas, or split shards
* uniformly increase write capacity: split shards
* reclaim free space: merge shards / keyspaces
* increase geo-diversity: add new cells and new replicas
* cool a hot tablet: if read access, add replicas or split shards, if write access, split shards.
## Filtered Replication
The cornerstone of Resharding is being able to replicate the right data. MySQL doesn't support any filtering, so the
Vitess project implements it entirely (a simplified sketch follows the list below):
* the tablet server tags transactions with comments that describe what the scope of the statements are (which keyspace_id,
which table, ...). That way the MySQL binlogs contain all filtering data.
* a server process can filter and stream the MySQL binlogs (using the comments).
* a client process can apply the filtered logs locally (they are just regular SQL statements at this point).
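Below is a minimal Go sketch of the comment-based filtering idea, assuming a simplified annotation of the form `/* keyspace_id:<hex> */` appended to each statement; the actual annotation format used by vttablet is an internal detail and differs.
```
package filteredrepl

import (
	"encoding/hex"
	"regexp"
)

// keyspaceIDComment matches the illustrative annotation carried by a statement.
var keyspaceIDComment = regexp.MustCompile(`/\* keyspace_id:([0-9a-fA-F]+) \*/`)

// statementKeyspaceID extracts the keyspace_id a binlog statement was tagged with.
func statementKeyspaceID(sql string) ([]byte, bool) {
	m := keyspaceIDComment.FindStringSubmatch(sql)
	if m == nil {
		return nil, false
	}
	kid, err := hex.DecodeString(m[1])
	return kid, err == nil
}

// wantStatement decides whether a statement belongs to a destination shard,
// given that shard's key-range membership test.
func wantStatement(sql string, inShardRange func(keyspaceID []byte) bool) bool {
	kid, ok := statementKeyspaceID(sql)
	return ok && inShardRange(kid)
}
```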

View File

@ -42,11 +42,11 @@ $ vtctl -wait-time=30s ValidateSchemaKeyspace user
## Changing the Schema
Goals:
* simplify schema updates on the fleet
* minimize human actions / errors
* guarantee no or very little downtime for most schema updates
* do not store any permanent schema data in Topology Server, just use it for actions.
* only look at tables for now (not stored procedures or grants for instance, although they both could be added fairly easily in the same manner)
We're trying to get reasonable confidence that a schema update is going to work before applying it. Since we cannot really apply a change to live tables without potentially causing trouble, we have implemented a Preflight operation: it copies the current schema into a temporary database, applies the change there to validate it, and gathers the resulting schema. After this Preflight, we have a good idea of what to expect, and we can apply the change to any database and make sure it worked.
@ -71,35 +71,37 @@ type SchemaChange struct {
```
And the associated ApplySchema remote action for a tablet. Then the performed steps are (a Go sketch follows the two lists below):
* The database to use is either derived from the tablet dbName if UseVt is false, or is the _vt database. A `use dbname` is prepended to the Sql.
* (if BeforeSchema is not nil) read the schema, make sure it is equal to BeforeSchema. If not equal: if Force is not set, we will abort; if Force is set, we'll issue a warning and keep going.
* if AllowReplication is false, we'll disable replication (adding SET sql_log_bin=0 before the Sql).
* We will then apply the Sql command.
* (if AfterSchema is not nil) read the schema again, make sure it is equal to AfterSchema. If not equal: if Force is not set, we will issue an error; if Force is set, we'll issue a warning.
We will return the following information:
* whether it worked or not (doh!)
* BeforeSchema
* AfterSchema
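Here is a Go sketch of that apply flow. The field names mirror the ones used in the prose (Sql, Force, AllowReplication, UseVt, BeforeSchema, AfterSchema); the real SchemaChange struct and the schema-reading helpers in vttablet differ in detail.
```
package schemamanager

import "fmt"

// SchemaChange carries one schema change request (illustrative shape only).
type SchemaChange struct {
	Sql              string
	Force            bool
	AllowReplication bool
	UseVt            bool
	BeforeSchema     string // expected schema before the change ("" skips the check)
	AfterSchema      string // expected schema after the change ("" skips the check)
}

// buildStatements assembles the SQL actually sent to MySQL for one tablet.
func buildStatements(sc SchemaChange, tabletDbName string) []string {
	db := tabletDbName
	if sc.UseVt {
		db = "_vt"
	}
	stmts := []string{"USE " + db}
	if !sc.AllowReplication {
		// keep the change out of the binlogs so replicas are untouched
		stmts = append(stmts, "SET sql_log_bin = 0")
	}
	return append(stmts, sc.Sql)
}

// checkSchema compares an observed schema with an expected one, honoring Force.
func checkSchema(got, want string, force bool) error {
	if want == "" || got == want {
		return nil
	}
	if force {
		fmt.Println("warning: schema mismatch, continuing because Force is set")
		return nil
	}
	return fmt.Errorf("schema mismatch, aborting")
}
```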
### Use case 1: Single tablet update:
* we first do a Preflight (to know what BeforeSchema and AfterSchema will be). This can be disabled, but is not recommended.
* we then do the schema upgrade. We will check BeforeSchema before the upgrade, and AfterSchema after the upgrade.
### Use case 2: Single Shard update:
* need to figure out (or be told) if it's a simple or complex schema update (does it require the shell game?). For now we'll use a command line flag.
* in any case, do a Preflight on the master, to get the BeforeSchema and AfterSchema values.
* in any case, gather the schema on all databases, to see which ones have been upgraded already or not. This guarantees we can interrupt and restart a schema change. Also, this makes sure no action is currently running on the databases we're about to change.
* if simple:
* nobody has it: apply to master, very similar to a single tablet update.
* some tablets have it but not others: error out
* if complex: do the shell game while disabling replication. Skip the tablets that already have it. Have an option to re-parent at the end.
* Note the Backup, and Lag servers won't apply a complex schema change. Only the servers actively in the replication graph will.
* the process can be interrupted at any time, restarting it as a complex schema upgrade should just work.
### Use case 3: Keyspace update:
* Similar to Single Shard, but the BeforeSchema and AfterSchema values are taken from the first shard, and used in all shards after that.
* We don't know the new masters to use on each shard, so just skip re-parenting altogether.
This translates into the following vtctl commands:

doc/Sharding.md (new file, 165 lines)
View File

@ -0,0 +1,165 @@
# Sharding in Vitess
This document describes the various options for sharding in Vitess.
## Range-based Sharding
This is the out-of-the-box sharding solution for Vitess. Each record
in the database has a Sharding Key value associated with it (and
stored in that row). Two records with the same Sharding Key are always
collocated on the same shard (allowing joins for instance). Two
records with different sharding keys are not necessarily on the same
shard.
In a Keyspace, each Shard then contains a range of Sharding Key
values. The full set of Shards covers the entire range.
In this environment, query routing needs to figure out the Sharding
Key or Keys for each query, and then route it properly to the
appropriate shard(s). We achieve this by providing either a sharding
key value (known as KeyspaceID in the API), or a sharding key range
(KeyRange). We are also developing more ways to route queries
automatically with [version 3 of our API](VTGateV3.md), where we store
more metadata in our topology to understand where to route queries.
### Sharding Key
Sharding Keys need to be compared to each other to enable range-based
sharding, as we need to figure out if a value is within a Shard's
range.
Vitess was designed to allow two types of sharding keys:
* Binary data: just an array of bytes. We use regular byte array
comparison here. Can be used for strings. MySQL representation is a
VARBINARY field.
* 64-bit unsigned integer: we first convert the 64-bit integer into
a byte array (by just copying the bytes, most significant byte
first, into 8 bytes). Then we apply the same byte array
comparison. MySQL representation is a bigint(20) UNSIGNED (see the sketch below).
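A minimal Go sketch of these two representations, using only the standard library: a 64-bit key is serialized big-endian into 8 bytes, and both kinds of keys are then compared as plain byte arrays.
```
package shardingkey

import (
	"bytes"
	"encoding/binary"
)

// Uint64Key converts a 64-bit unsigned sharding key into its byte-array form
// (most significant byte first), so it can be compared like a binary key.
func Uint64Key(v uint64) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, v)
	return b
}

// Less reports whether sharding key a sorts before sharding key b,
// using regular byte-array comparison.
func Less(a, b []byte) bool {
	return bytes.Compare(a, b) < 0
}
```
For example, Uint64Key(0x8000000000000000) yields [ 0x80 0x00 ... 0x00 ], which sorts in the upper half of the key space, consistent with the [ 0x80 ] mid value discussed below.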
A sharded keyspace contains information about the type of sharding key
(binary data or 64-bit unsigned integer), and the column name for the
Sharding Key (that is in every table).
To guarantee a balanced use of the shards, the Sharding Keys should be
evenly distributed in their space (if half the records of a Keyspace
map to a single sharding key, all these records will always be on the
same shard, making it impossible to split).
A common example of a sharding key in a web site that serves millions
of users is the 64-bit hash of a User Id. By using a hashing function
the Sharding Keys are evenly distributed in the space. By using a very
granular sharding key, one could shard all the way to one user per
shard.
Comparison of Sharding Keys can be a bit tricky, but if one remembers
they are always converted to byte arrays, and then compared as byte
arrays, it becomes easier. For instance, the value [ 0x80 ] is the mid
value for Sharding Keys: all byte sharding keys that start with a byte
strictly lower than 0x80 are lower, and all byte sharding keys that
start with a byte equal to or greater than 0x80 are higher. Note that [ 0x80 ] can
also be compared to 64-bit unsigned integer values: any value that is
smaller than 0x8000000000000000 will be smaller, any value equal or
greater will be greater.
### Key Range
A Key Range has a Start and an End value. A value is inside the Key
Range if it is greater or equal to the Start, and strictly less than
the End.
Two Key Ranges are consecutive if the End of the first one is equal to
the Start of the next one.
Two special values exist:
* if a Start is empty, it represents the lowest value, and all values
are greater than it.
* if an End is empty, it represents the biggest value, and all values
are strictly lower than it.
Examples:
* Start=[], End=[]: full Key Range
* Start=[], End=[0x80]: Lower half of the Key Range.
* Start=[0x80], End=[]: Upper half of the Key Range.
* Start=[0x40], End=[0x80]: Second quarter of the Key Range.
As noted previously, this can be used for both binary and uint64
Sharding Keys. For uint64 Sharding Keys, the single byte number
represents the most significant 8 bits of the value.
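A Go sketch of Key Range membership as defined above (the type here is illustrative; Vitess has its own key range type):
```
package shardingkey

import "bytes"

// KeyRange is a [Start, End) interval over sharding key byte arrays.
type KeyRange struct {
	Start []byte // empty means "the lowest value"
	End   []byte // empty means "past the highest value"
}

// Contains reports whether value falls inside the Key Range: it must be
// greater than or equal to Start and strictly lower than End.
func (kr KeyRange) Contains(value []byte) bool {
	afterStart := len(kr.Start) == 0 || bytes.Compare(value, kr.Start) >= 0
	beforeEnd := len(kr.End) == 0 || bytes.Compare(value, kr.End) < 0
	return afterStart && beforeEnd
}

// Consecutive reports whether b starts exactly where a ends.
func Consecutive(a, b KeyRange) bool {
	return len(a.End) != 0 && bytes.Equal(a.End, b.Start)
}
```
With this definition, an all-empty KeyRange contains every value (the full Key Range), and KeyRange{End: []byte{0x80}} is the lower half from the examples above.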
### Range-based Shard Name
The Name of a Shard when it is part of a Range-based sharded keyspace
is the Start and End of its keyrange, printed in hexadecimal, and
separated by a hyphen.
For instance, if Start is the array of bytes [ 0x80 ] and End is the
array of bytes [ 0xc0 ], then the name of the Shard will be: 80-c0
We will use this convention in the rest of this document.
### Sharding Key Partition
A partition represents a set of Key Ranges that cover the entire space. For instance, the following four shards are a valid full partition:
* -40
* 40-80
* 80-c0
* c0-
When we build the serving graph for a given Range-based Sharded
Keyspace, we ensure the Shards are valid and cover the full space.
During resharding, we can split or merge consecutive Shards with very
minimal downtime.
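A Go sketch, assuming shard names follow the start-end hex convention above, of checking that an ordered list of shards forms a valid full partition:
```
package shardingkey

import (
	"bytes"
	"encoding/hex"
	"fmt"
	"strings"
)

// parseShardName splits a range-based shard name such as "80-c0" into its
// Start and End bounds; an empty side decodes to an empty (open) bound.
func parseShardName(name string) (start, end []byte, err error) {
	parts := strings.SplitN(name, "-", 2)
	if len(parts) != 2 {
		return nil, nil, fmt.Errorf("shard %q is not of the form start-end", name)
	}
	if start, err = hex.DecodeString(parts[0]); err != nil {
		return nil, nil, err
	}
	if end, err = hex.DecodeString(parts[1]); err != nil {
		return nil, nil, err
	}
	return start, end, nil
}

// validPartition checks that the shards, in order, start at the lowest value,
// are consecutive, and extend to the end of the sharding key space.
func validPartition(shards []string) error {
	prevEnd := []byte{} // the first shard must have an empty Start
	for i, name := range shards {
		start, end, err := parseShardName(name)
		if err != nil {
			return err
		}
		if !bytes.Equal(start, prevEnd) {
			return fmt.Errorf("shard %s does not start where the previous shard ends", name)
		}
		if last := i == len(shards)-1; last != (len(end) == 0) {
			return fmt.Errorf("shard %s: only the last shard may have an empty End", name)
		}
		prevEnd = end
	}
	return nil
}
```
validPartition([]string{"-40", "40-80", "80-c0", "c0-"}) returns nil, matching the four-shard example above.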
### Resharding
Vitess provides a set of tools and processes to deal with Range Based Shards:
* [Dynamic resharding](Resharding.md) allows splitting or merging of shards with no
read downtime, and very minimal master unavailability (<5s).
* Client APIs are designed to take sharding into account.
* [Map-reduce framework](https://github.com/youtube/vitess/blob/master/java/vtgate-client/src/main/java/com/youtube/vitess/vtgate/hadoop/README.md) fully utilizes the Key Ranges to read data as
fast as possible, concurrently from all shards and all replicas.
### Cross-Shard Indexes
The 'primary key' for sharded data is the Sharding Key value. In order
to look up data with another index, it is very straightforward to
create a lookup table in a different Keyspace that maps that other
index to the Sharding Key.
For instance, if User ID is hashed as Sharding Key in a User keyspace,
adding a User Name to User Id lookup table in a different Lookup
keyspace also allows the application to route those queries the right way.
With the current version of the API, Cross-Shard indexes have to be
handled at the application layer. However, [version 3 of our API](VTGateV3.md)
will provide multiple ways to solve this without application layer changes.
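A Go sketch of that application-layer pattern; the `user_name_idx` table, its columns, and the hash function are hypothetical illustrations, not part of Vitess.
```
package crossshard

import (
	"database/sql"
	"encoding/binary"
	"hash/fnv"
)

// lookupUserID resolves a user name to its User Id through the Lookup
// keyspace (which only stores the user_name -> user_id mapping).
func lookupUserID(lookupDB *sql.DB, userName string) (uint64, error) {
	var id uint64
	err := lookupDB.QueryRow(
		"SELECT user_id FROM user_name_idx WHERE user_name = ?", userName).Scan(&id)
	return id, err
}

// userKeyspaceID hashes a User Id into a 64-bit sharding key and serializes
// it most-significant-byte first, ready for routing to the User keyspace.
func userKeyspaceID(userID uint64) []byte {
	h := fnv.New64a()
	var in [8]byte
	binary.BigEndian.PutUint64(in[:], userID)
	h.Write(in[:]) // hashing spreads user ids evenly across the key space
	out := make([]byte, 8)
	binary.BigEndian.PutUint64(out, h.Sum64())
	return out
}
```
The resulting keyspace ID is what the application would then pass to a *KeyspaceIds call to reach the right User shard.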
## Custom Sharding
This is designed to be used if your application already has support
for sharding, or if you want to control exactly which shard the data
goes to. In that use case, each Keyspace just has a collection of
shards, and the client code always specifies which shard to talk to.
The shards can just use any name, and they are always addressed by
name. The API calls to vtgate are ExecuteShard, ExecuteBatchShard and
StreamExecuteShard. None of the *KeyspaceIds, *KeyRanges or *EntityIds
API calls can be used.
The Map-Reduce framework can still iterate over the data across multiple shards.
Also, none of the automated resharding tools and processes that Vitess
provides for Range-Based sharding can be used here.
Note: the *Shard API calls are not exposed by all clients at the
moment. This is going to be fixed soon.
### Custom Sharding Example: Lookup-Based Sharding
One example of Custom Sharding is Lookup based: one keyspace is used
as a lookup keyspace, and contains the mapping between the identifying
key of a record, and the shard name it is on. When accessing the
records, the client first needs to find the shard name by looking it
up in the lookup table, and then knows where to route queries.
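A minimal sketch of that flow, with a hypothetical `record_shard` lookup table and an `executeOnShard` callback standing in for an ExecuteShard-style call:
```
package lookupsharding

import "database/sql"

// shardForRecord finds which shard holds a record, using the lookup keyspace.
func shardForRecord(lookupDB *sql.DB, recordID int64) (string, error) {
	var shard string
	err := lookupDB.QueryRow(
		"SELECT shard_name FROM record_shard WHERE record_id = ?", recordID).Scan(&shard)
	return shard, err
}

// fetchRecord routes the real query to the shard returned by the lookup.
func fetchRecord(lookupDB *sql.DB, recordID int64,
	executeOnShard func(shard, query string, args ...interface{}) (*sql.Rows, error)) (*sql.Rows, error) {
	shard, err := shardForRecord(lookupDB, recordID)
	if err != nil {
		return nil, err
	}
	return executeOnShard(shard, "SELECT * FROM records WHERE record_id = ?", recordID)
}
```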

View File

@ -57,8 +57,8 @@ level picture of all the servers and their current state.
### vtworker
vtworker is meant to host long-running processes. It supports a plugin infrastructure, and offers libraries to easily pick tablets to use. We have developed:
* resharding differ jobs: meant to check data integrity during shard splits and joins.
* vertical split differ jobs: meant to check data integrity during vertical splits and joins.
It is very easy to add other checker processes for in-tablet integrity checks (verifying foreign key-like relationships), and cross shard data integrity (for instance, if a keyspace contains an index table referencing data in another keyspace).

View File

@ -41,12 +41,12 @@ An entire Keyspace can be locked. We use this during resharding for instance, wh
### Shard
A Shard contains a subset of the data for a Keyspace. The Shard record in the global topology contains:
* the MySQL Master tablet alias for this shard
* the sharding key range covered by this Shard inside the Keyspace
* the tablet types this Shard is serving (master, replica, batch, …), per cell if necessary.
* the source shards this shard is replicating from during filtered replication (if any)
* the list of cells that have tablets in this shard
* shard-global tablet controls, like blacklisted tables no tablet should serve in this shard
A Shard can be locked. We use this during operations that affect either the Shard record, or multiple tablets within a Shard (like reparenting), so multiple jobs don't concurrently alter the data.
@ -61,19 +61,19 @@ This section describes the data structures stored in the local instance (per cel
### Tablets
The Tablet record has a lot of information about a single vttablet process running inside a tablet (along with the MySQL process):
* the Tablet Alias (cell+unique id) that uniquely identifies the Tablet
* the Hostname, IP address and port map of the Tablet
* the current Tablet type (master, replica, batch, spare, …)
* which Keyspace / Shard the tablet is part of
* the health map for the Tablet (if in degraded mode)
* the sharding Key Range served by this Tablet
* user-specified tag map (to store per installation data for instance)
A Tablet record is created before a tablet can be running (either by `vtctl InitTablet` or by passing the `init_*` parameters to vttablet). The only way a Tablet record will be updated is one of:
* The vttablet process itself owns the record while it is running, and can change it.
* At init time, before the tablet starts
* After shutdown, when the tablet gets scrapped or deleted.
* If a tablet becomes unresponsive, it may be forced to spare to remove it from the serving graph (such as when reparenting away from a dead master, by the `vtctl ReparentShard` action).
### Replication Graph
@ -86,16 +86,16 @@ The Serving Graph is what the clients use to find which EndPoints to send querie
#### SrvKeyspace
It is the local representation of a Keyspace. It contains information on what shard to use for getting to the data (but not information about each individual shard):
* the partitions map is keyed by the tablet type (master, replica, batch, …) and the values are lists of shards to use for serving.
* it also contains the global Keyspace fields, copied for fast access.
It can be rebuilt by running `vtctl RebuildKeyspaceGraph`. It is not automatically rebuilt when adding new tablets in a cell, as this would cause too much overhead and is only needed once per cell/keyspace. It may also be changed during horizontal and vertical splits.
#### SrvShard
It is the local representation of a Shard. It contains information on details internal to this Shard only, but not to any tablet running in this shard:
* the name and sharding Key Range for this Shard.
* the cell that has the master for this Shard.
It is possible to lock a SrvShard object, to massively update all EndPoints in it.
@ -104,10 +104,10 @@ It can be rebuilt (along with all the EndPoints in this Shard) by running `vtctl
#### EndPoints
For each possible serving type (master, replica, batch), in each Cell / Keyspace / Shard, we maintain a rolled-up EndPoint list. Each entry in the list has information about one Tablet:
* the Tablet Uid
* the Host on which the Tablet resides
* the port map for that Tablet
* the health map for that Tablet
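Illustrative Go shapes (not the actual Vitess structs) for the three serving-graph records described in this section:
```
package servinggraph

// SrvKeyspace: per-cell view of a Keyspace, telling clients which shards
// serve each tablet type.
type SrvKeyspace struct {
	// Partitions is keyed by tablet type (master, replica, batch, ...);
	// the values are the lists of shard names to use for serving that type.
	Partitions map[string][]string
}

// SrvShard: per-cell details of one Shard.
type SrvShard struct {
	Name          string // e.g. "80-c0"
	KeyRangeStart string // hex bound, "" for the lowest value
	KeyRangeEnd   string // hex bound, "" for the highest value
	MasterCell    string // cell that currently hosts the master
}

// EndPoint: one tablet entry in a rolled-up EndPoints list.
type EndPoint struct {
	Uid       uint32
	Host      string
	PortMap   map[string]int
	HealthMap map[string]string
}
```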
## Workflows Involving the Topology Server
@ -144,13 +144,13 @@ For locking, we use an auto-incrementing file name in the `/action` subdirectory
Note the paths used to store global and per-cell data do not overlap, so a single ZK can be used for both global and local ZKs. This is however not recommended, for reliability reasons.
* Keyspace: `/zk/global/vt/keyspaces/<keyspace>`
* Shard: `/zk/global/vt/keyspaces/<keyspace>/shards/<shard>`
* Tablet: `/zk/<cell>/vt/tablets/<uid>`
* Replication Graph: `/zk/<cell>/vt/replication/<keyspace>/<shard>`
* SrvKeyspace: `/zk/<cell>/vt/ns/<keyspace>`
* SrvShard: `/zk/<cell>/vt/ns/<keyspace>/<shard>`
* EndPoints: `/zk/<cell>/vt/ns/<keyspace>/<shard>/<tablet type>`
We provide the 'zk' utility for easy access to the topology data in ZooKeeper. For instance:
```
@ -171,11 +171,11 @@ We use the `_Data` filename to store the data, JSON encoded.
For locking, we store a `_Lock` file with various contents in the directory that contains the object to lock.
We use the following paths:
* Keyspace: `/vt/keyspaces/<keyspace>/_Data`
* Shard: `/vt/keyspaces/<keyspace>/<shard>/_Data`
* Tablet: `/vt/tablets/<cell>-<uid>/_Data`
* Replication Graph: `/vt/replication/<keyspace>/<shard>/_Data`
* SrvKeyspace: `/vt/ns/<keyspace>/_Data`
* SrvShard: `/vt/ns/<keyspace>/<shard>/_Data`
* EndPoints: `/vt/ns/<keyspace>/<shard>/<tablet type>`