Merge branch 'master' into replication

2015-08-07 12:15:21 -07:00 · 2015-08-07 12:15:21 -07:00 · 7d1225e19c
--- a/doc/BackupAndRestore.md
+++ b/doc/BackupAndRestore.md
@ -1,57 +1,182 @@
-# Backup and Restore
+This document explains how to create and restore data backups with
+Vitess. Vitess uses backups for two purposes:

-This document describes Vitess Backup and Restore strategy.
+* Provide a point-in-time backup of the data on a tablet
+* Bootstrap new tablets in an existing shard

-### Overview
+**Contents:**

-Backups are used in Vitess for two purposes: to provide a point-in-time backup for the data, and to bootstrap new instances.
+* [Prerequisites](#prerequisites)
+* [Creating a backup](#creating-a-backup)
+* [Restoring a backup](#restoring-a-backup)
+* [Managing backups](#managing-backups)
+* [Bootstrapping a new tablet](#bootstrapping-a-new-tablet)
+* [Concurrency](#concurrency)

-### Backup Storage
+## Prerequisites

-Backups are stored on a Backup Storage service. Vitess core software contains an implementation that uses a local filesystem to store the files. Any network-mounted drive can then be used as the repository for backups.
+Vitess stores data backups on a Backup Storage service. Currently,
+Vitess only supports backups to an NFS directory and can use any
+network-mounted drive as the backup repository. The core Vitess software's
+[interface.go](https://github.com/youtube/vitess/blob/master/go/vt/mysqlctl/backupstorage/interface.go)
+file defines an interface for the Backup Storage service. The interface
+defines methods for creating, listing, and removing backups.

-We have plans to implement a version of the Backup Storage service for Google Cloud Storage (contact us if you are interested).
+Before you can back up or restore a tablet, you need to ensure that the
+tablet is aware of the Backup Storage system that you are using. To do so,
+set the following command-line flags whenever you start a vttablet that
+has access to the file system (for example, an NFS mount) where backups
+are stored.

-(The interface definition for the Backup Storage service is in [interface.go](https://github.com/youtube/vitess/blob/master/go/vt/mysqlctl/backupstorage/interface.go), see comments there for more details).
+| Flag | Description
+| ---- | -----------
+| <nobr><code>--backup_storage_implementation</code></nobr> | Specifies the implementation of the Backup Storage interface to use.<br><br>If you run Vitess on a machine that has access to an NFS directory where you store backups, set the flag's value to <code>file</code>. Otherwise, do not set this flag or either of the other two flags.
+| <code>--file_backup_storage_root</code> | Identifies the root directory for backups. Set this flag if the backup_storage_implementation flag is set to file.
+| <code>--restore_from_backup</code> | Indicates that, when started, the tablet should restore the most recent backup from the file_backup_storage_root directory. This flag is only relevant if the other two flags listed above are also set.

-Concretely, the following command line flags are used for Backup Storage:
+## Creating a backup

-* -backup\_storage\_implementation: which implementation of the Backup Storage interface to use.
-* -file\_backup\_storage\_root: the root of the backups if 'file' is used as a Backup Storage.
+Run the following vtctl command to create a backup:

-### Taking a Backup
+``` sh
+vtctl Backup <tablet-alias>
+```

-To take a backup is very straightforward: just run the 'vtctl Backup <tablet-alias>' command. The designated tablet will take itself out of the healthy serving tablets, shutdown its mysqld process, copy the necessary files to the Backup Storage, restart mysql, restart replication, and join the cluster back.
+In response to this command, the designated tablet performs the following sequence of actions:

-With health-check enabled (the recommended default), the tablet goes back to spare state. Once it catches up on replication, it will go back to a serving state.
+1. Removes itself from the healthy serving tablets in the serving graph.

-Note for this to work correctly, the tablet must be started with the right parameters to point it to the Backup Storage system (see previous section).
+1. Shuts down its mysqld process.

-### Life of a Shard
+1. Copies the necessary files to the Backup Storage implementation
+    that was specified when the tablet was started.

-To illustrate how backups are used in Vitess to bootstrap new instances, let's go through the creation and life of a Shard:
+1. Restarts mysqld.

-* A shard is initially brought up with no existing backup. All instances are started as replicas. With health-check enabled (the recommended default), each instance will realize replication is not running, and just stay unhealthy as spare.
-* Once a few replicas are up, InitShardMaster is run, one host becomes the master, the others replicas. Master becomes healthy, replicas are not as no database exists.
-* Initial schema can then be applied to the Master. Either use the usual Schema change tools, or use CopySchemaShard for shards created as targets for resharding.
-* After replicating the schema creation, all replicas become healthy. At this point, we have a working and functionnal shard.
-* The initial backup is taken (that stores the data and the current replication position), and backup data is copied to a network storage.
-* When a replica comes up (either a new replica, or one whose instance was just restarted), it restores the latest backup, resets its master to the current shard master, and starts replicating.
-* A Cronjob to backup the data on a regular basis should then be run. The frequency of the backups should be high enough (compared to MySQL binlog retention), so we can always have a backup to fall back upon.
+1. Restarts replication. By default, the tablet changes its status to
+    <code>spare</code> until it catches up on replication and proceeds
+    to the next step.
+    If you override the default, recommended behavior by setting the
+    ____ flag to ____ when starting the tablet,
+    the tablet will restart replication and then proceed to the following
+    step even if it has not caught up on the replication process.

-Restoring a backup is enabled by the --restore\_from\_backup command line option in vttablet. It can be enabled all the time for all the tablets in a shard, as it doesn't prevent vttablet from starting if no backup can be found.
+1. Updates the serving graph to rejoin the cluster as a healthy, serving tablet.

-### Backup Management
+## Restoring a backup

-Two vtctl commands exist to manage the backups:
+When a tablet starts, Vitess checks the value of the
+<code>--restore_from_backup</code> command-line flag to determine whether
+to restore a backup to that tablet.

-* 'vtctl ListBackups <keyspace/shard>' will display the existing backups for a keyspace/shard in the order they were taken (oldest first).
-* 'vtctl RemoveBackup <keyspace/shard> <backup name>' will remove a backup from Backup Storage.
+* If the flag is present, Vitess tries to restore the most recent backup
+    from the Backup Storage system when starting the tablet.
+* If the flag is absent, Vitess does not try to restore a backup to the
+    tablet. This is the equivalent of starting a new tablet in a new shard.

-### Details
+As noted in the [Prerequisites](#prerequisites) section, the flag is
+generally enabled all of the time for all of the tablets in a shard.
+If Vitess cannot find a backup in the Backup Storage system, it just
+starts the vttablet as a new tablet.

-Both Backup and Restore copy and compress / decompress multiple files simultaneously to increase throughput. The concurrency can be controlled by command-line flags (-concurrency for 'vtctl Backup', and -restore\_concurrency for vttablet). If the network link is fast enough, the concurrency will match the CPU usage of the process during backup / restore.
+``` sh
+vttablet ... -backup_storage_implementation=file \
+             -file_backup_storage_root=/nfs/XXX \
+             -restore_from_backup
+```

+## Managing backups

+**vtctl** provides two commands for managing backups:

+* [ListBackups](/reference/vtctl.html#listbackups) displays the
+    existing backups for a keyspace/shard in chronological order.

+    ``` sh
+    vtctl ListBackups <keyspace/shard>
+    ```
+
+* [RemoveBackup](/reference/vtctl.html#removebackup) deletes a
+    specified backup for a keyspace/shard.
+
+    ``` sh
+    vtctl RemoveBackup <keyspace/shard> <backup name>
+    ```
+
+## Bootstrapping a new tablet
+
+The following steps explain how the backup process is used to bootstrap
+new tablets as part of the normal lifecycle of a shard:
+
+1. A shard is initially created without an existing backup, and all
+    of the shard's tablets are started as spares.
+
+1. By default, Vitess enables health checks on each tablet. As long as
+    these default checks are used, each tablet recognizes that replication
+    is not running and remains in an unhealthy state as a spare tablet.
+
+1. After the requisite number of spare tablets is running, vtctl's
+    [InitShardMaster](/reference/vtctl.html#initshardmaster) command
+    designates one tablet as the master. The remaining tablets are
+    slaves of the master tablet. In the serving graph, the master
+    tablet is in a healthy state, but the slaves remain unhealthy
+    because no database exists.
+
+1. The initial schema is applied to the master using either usual schema
+    change tools or vtctl's
+    [CopySchemaShard](/reference/vtctl.html#copyschemashard) command.
+    That command is typically used during resharding to copy the schema
+    to a destination shard. After being applied to the master tablet, the
+    schema propagates to the slave tablets.
+
+1. The slave tablets all transition to a healthy state like
+   <code>rdonly</code> or <code>replica</code>. At this point,
+   the shard is working and functional.
+
+1. Once the shard is accumulating data, a cron job runs regularly to
+    create new backups. Backups are created frequently enough to ensure
+    that one is always available if needed.<br><br>
+
+    To determine the proper frequency for creating backups, consider
+    the amount of time that you keep replication logs and allow enough
+    time to investigate and fix problems in the event that a backup
+    operation fails.<br><br>
+
+    For example, suppose you typically keep four days of replication logs
+    and you create daily backups. In that case, even if a backup fails,
+    you have at least a couple of days from the time of the failure to
+    investigate and fix the problem.
+
+1. When a spare tablet comes up, it restores the latest backup, which
+    contains data as well as the backup's replication position. The
+    tablet then resets its master to the current shard master and starts
+    replicating.<br><br>
+
+    This process is the same for new slave tablets and slave tablets that
+    are being restarted. For example, to add a new rdonly tablet to your
+    existing implementation, you would complete the following steps:
+
+    1. Run the vtctl [InitTablet](/reference/vtctl.html#inittablet)
+        command to create the new tablet as a spare. Specify the
+        appropriate values for the <nobr><code>-keyspace</code></nobr>
+        and <nobr><code>-shard</code></nobr> flags, enabling Vitess to
+        identify the master tablet associated with the new spare.
+
+    1. Start the tablet using the flags specified in the
+        [Prerequisites](#prerequisites) section. As described earlier in
+        this step, the new tablet will load the latest backup, reset its
+        master tablet, and start replicating.
+
+## Concurrency
+
+The backup and restore processes simultaneously copy and either
+compress or decompress multiple files to increase throughput. You
+can control the concurrency using command-line flags:
+
+* The vtctl [Backup](/reference/vtctl.html#backup) command uses the
+    <code>-concurrency</code> flag.
+* vttablet uses the <code>-restore_concurrency</code> flag.
+
+If the network link is fast enough, the process becomes CPU-bound, and
+its CPU usage during the backup or restore tracks the concurrency setting.
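Note on the Prerequisites table in doc/BackupAndRestore.md above: the flag rules it describes can be sketched as a small helper. This is only an illustration under stated assumptions, not part of the commit. The function name `vttablet_backup_flags` and the path `/nfs/backups` are hypothetical; the three flag names are the ones listed in the table.

```python
def vttablet_backup_flags(storage_root=None, restore=True):
    """Return the backup-related vttablet flags for one tablet.

    storage_root: directory (for example, an NFS mount) that holds backups,
    or None when the tablet has no access to backup storage.
    restore: whether the tablet should restore the latest backup on startup.
    """
    if storage_root is None:
        # Per the table: without access to backup storage, set none of the flags.
        return []
    flags = [
        "--backup_storage_implementation=file",
        "--file_backup_storage_root=" + storage_root,
    ]
    if restore:
        # Only relevant when the two flags above are also set.
        flags.append("--restore_from_backup")
    return flags


# Hypothetical path, used only for illustration.
print(" ".join(vttablet_backup_flags("/nfs/backups")))
```

Keeping the three flags in one place makes it harder to set `--restore_from_backup` on a tablet that was never told where its backups live.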
--- a/go/vt/tabletserver/grpcqueryservice/server.go
+++ b/go/vt/tabletserver/grpcqueryservice/server.go
@ -8,6 +8,7 @@ import (
 	"sync"

 	"google.golang.org/grpc"
+	"google.golang.org/grpc/codes"

 	mproto "github.com/youtube/vitess/go/mysql/proto"
 	"github.com/youtube/vitess/go/vt/callerid"
@ -37,7 +38,7 @@ func (q *query) GetSessionId(ctx context.Context, request *pb.GetSessionIdReques
 		Keyspace: request.Keyspace,
 		Shard:    request.Shard,
 	}, sessionInfo); err != nil {
-		return nil, err
+		return nil, grpc.Errorf(codes.Internal, "%v", err)
 	}

 	return &pb.GetSessionIdResponse{
@ -59,7 +60,7 @@ func (q *query) Execute(ctx context.Context, request *pb.ExecuteRequest) (respon
 		SessionId:     request.SessionId,
 		TransactionId: request.TransactionId,
 	}, reply); err != nil {
-		return nil, err
+		return nil, grpc.Errorf(codes.Internal, "%v", err)
 	}
 	return &pb.ExecuteResponse{
 		Result: mproto.QueryResultToProto3(reply),
@ -80,7 +81,7 @@ func (q *query) ExecuteBatch(ctx context.Context, request *pb.ExecuteBatchReques
 		AsTransaction: request.AsTransaction,
 		TransactionId: request.TransactionId,
 	}, reply); err != nil {
-		return nil, err
+		return nil, grpc.Errorf(codes.Internal, "%v", err)
 	}
 	return &pb.ExecuteBatchResponse{
 		Results: proto.QueryResultListToProto3(reply.List),
@ -116,7 +117,7 @@ func (q *query) Begin(ctx context.Context, request *pb.BeginRequest) (response *
 	if err := q.server.Begin(ctx, request.Target, &proto.Session{
 		SessionId: request.SessionId,
 	}, txInfo); err != nil {
-		return nil, err
+		return nil, grpc.Errorf(codes.Internal, "%v", err)
 	}

 	return &pb.BeginResponse{
@ -135,7 +136,7 @@ func (q *query) Commit(ctx context.Context, request *pb.CommitRequest) (response
 		SessionId:     request.SessionId,
 		TransactionId: request.TransactionId,
 	}); err != nil {
-		return nil, err
+		return nil, grpc.Errorf(codes.Internal, "%v", err)
 	}
 	return &pb.CommitResponse{}, nil
 }
@ -151,7 +152,7 @@ func (q *query) Rollback(ctx context.Context, request *pb.RollbackRequest) (resp
 		SessionId:     request.SessionId,
 		TransactionId: request.TransactionId,
 	}); err != nil {
-		return nil, err
+		return nil, grpc.Errorf(codes.Internal, "%v", err)
 	}

 	return &pb.RollbackResponse{}, nil
@ -171,7 +172,7 @@ func (q *query) SplitQuery(ctx context.Context, request *pb.SplitQueryRequest) (
 		SplitCount:  int(request.SplitCount),
 		SessionID:   request.SessionId,
 	}, reply); err != nil {
-		return nil, err
+		return nil, grpc.Errorf(codes.Internal, "%v", err)
 	}
 	return &pb.SplitQueryResponse{
 		Queries: proto.QuerySplitsToProto3(reply.Queries),
--- a/go/vt/tabletserver/grpctabletconn/conn.go
+++ b/go/vt/tabletserver/grpctabletconn/conn.go
@ -7,6 +7,7 @@ package grpctabletconn
 import (
 	"fmt"
 	"io"
+	"strings"
 	"sync"
 	"time"

@ -17,6 +18,7 @@ import (
 	"github.com/youtube/vitess/go/vt/tabletserver/tabletconn"
 	"golang.org/x/net/context"
 	"google.golang.org/grpc"
+	"google.golang.org/grpc/codes"

 	pb "github.com/youtube/vitess/go/vt/proto/query"
 	pbs "github.com/youtube/vitess/go/vt/proto/queryservice"
@ -362,5 +364,23 @@ func (conn *gRPCQueryClient) EndPoint() *pbt.EndPoint {
 // tabletErrorFromGRPC returns a tabletconn.OperationalError from the
 // gRPC error.
 func tabletErrorFromGRPC(err error) error {
+	if grpc.Code(err) == codes.Internal {
+		// server side error, convert it
+		var code int
+		errStr := err.Error()
+		switch {
+		case strings.Contains(errStr, "fatal: "):
+			code = tabletconn.ERR_FATAL
+		case strings.Contains(errStr, "retry: "):
+			code = tabletconn.ERR_RETRY
+		case strings.Contains(errStr, "tx_pool_full: "):
+			code = tabletconn.ERR_TX_POOL_FULL
+		case strings.Contains(errStr, "not_in_tx: "):
+			code = tabletconn.ERR_NOT_IN_TX
+		default:
+			code = tabletconn.ERR_NORMAL
+		}
+		return &tabletconn.ServerError{Code: code, Err: fmt.Sprintf("vttablet: %v", err)}
+	}
 	return tabletconn.OperationalError(fmt.Sprintf("vttablet: %v", err))
 }
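Note on `tabletErrorFromGRPC` above: the substring-based classification can be sketched outside of Go as follows. This is a minimal illustration, not the project's code. The numeric values of the `ERR_*` constants are hypothetical stand-ins for the `tabletconn` codes named in the diff, and `classify` is a hypothetical name; only the marker strings and their order come from the switch statement.

```python
# Hypothetical stand-ins for the tabletconn error codes named in the diff.
ERR_NORMAL, ERR_RETRY, ERR_FATAL, ERR_TX_POOL_FULL, ERR_NOT_IN_TX = range(5)

# Markers are checked in the same order as the cases of the Go switch.
_MARKERS = [
    ("fatal: ", ERR_FATAL),
    ("retry: ", ERR_RETRY),
    ("tx_pool_full: ", ERR_TX_POOL_FULL),
    ("not_in_tx: ", ERR_NOT_IN_TX),
]


def classify(err_str):
    """Map a server-side error string to an error code by substring match."""
    for marker, code in _MARKERS:
        if marker in err_str:
            return code
    # No known marker: treat it as a normal (non-retryable) server error.
    return ERR_NORMAL


print(classify("vttablet: fatal: connection lost") == ERR_FATAL)
```

Because the match is by substring and the first matching case wins, a message that contains more than one marker is classified by the earliest case in the list, not by the marker that appears first in the message.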
--- a/misc/git/hooks/govet
+++ b/misc/git/hooks/govet
@ -21,9 +21,13 @@ errors=
 # with multiple files requires the files to all be in one package.
 for gofile in $gofiles
 do
-	if ! go tool vet $vetflags $gofile 2>&1; then
-	   errors=YES
-   fi
+    if [ "$gofile" = "go/vt/tabletserver/grpcqueryservice/server.go" ]; then
+      echo "skipping go/vt/tabletserver/grpcqueryservice/server.go as Errorf is different"
+    else
+        if ! go tool vet $vetflags $gofile 2>&1; then
+            errors=YES
+        fi
+    fi
 done

 [ -z  "$errors" ] && exit 0
--- a/test/vtgatev2_test.py
+++ b/test/vtgatev2_test.py
@ -18,6 +18,7 @@ from multiprocessing.pool import ThreadPool
 import environment
 import tablet
 import utils
+from protocols_flavor import protocols_flavor

 from net import gorpc
 from vtdb import keyrange
@ -1021,12 +1022,15 @@ class TestFailures(unittest.TestCase):
    self.replica_tablet.wait_for_vttablet_state('SERVING')
    # TODO: expect to fail until we can detect vttablet shuts down gracefully
    # while VTGate is idle.
+    # NOTE: with grpc, it will reconnect, and not trigger an error.
+    if protocols_flavor().tabletconn_protocol() == 'grpc':
+      return
    try:
-      vtgate_conn._execute(
+      result = vtgate_conn._execute(
        "select 1 from vt_insert_test", {},
        KEYSPACE_NAME, 'replica',
        keyranges=[self.keyrange])
-      self.fail("DatabaseError should have been raised")
+      self.fail("DatabaseError should have been raised, but got %s" % str(result))
    except Exception, e:
      self.assertIsInstance(e, dbexceptions.DatabaseError)
      self.assertNotIsInstance(e, dbexceptions.IntegrityError)
--- a/vitess.io/_layouts/base.liquid
+++ b/vitess.io/_layouts/base.liquid
@ -56,6 +56,7 @@
                     </ul>
                     <div class="aside-nav-header">User Guide</div>
                     <ul>
+                        <li><a href="/user-guide/backup-and-restore.html">Backing Up Data</a></li>
                        <li><a href="/user-guide/sharding.html">Sharding</a>
                          <ul>
                            <li><a href="/user-guide/horizontal-sharding.html">Horizontal Sharding (Codelab)</a></li>
@ -90,19 +91,19 @@
                        <li><a href="/reference/client-libraries/">Client Libraries</a></li>
 -->
                     </ul>
-<!--
                     <div class="aside-nav-header">Other Resources</div>
                     <ul>
                        <li><a href="/resources/presentations.html">Presentations</a></li>
-                        <li class="last"><a href="/resources/roadmap.html">Roadmap</a></li>
+                        <!--<li class="last"><a href="/resources/roadmap.html">Roadmap</a></li>-->
                     </ul>
                  </div>
-->
               </aside>
               </nav>
               <!-- CONTENT -->
               <section class="s-12 l-9">
-                 {{ content }}
+                 <article class="main-content-container">
+                   {{ content }}
+                 </article>
               </section>
            </div>
         </div>
--- a/vitess.io/css/extra.css
+++ b/vitess.io/css/extra.css
@ -70,6 +70,10 @@ table.comparison td {
  border-bottom: solid 1px #555;
 }

+table tbody td {
+  vertical-align: top;
+}
+
@media screen and (min-width: 620px) {
  .article-nav-link::before,
  .article-nav-link:visited::before {
@ -597,6 +601,13 @@ h5 {
  padding-top: 1.3em;
 }

+/* h2/h3/h4 scroll targets need to accommodate the navbar */
+article h2[id],
+article h3[id],
+article h4[id] {
+  padding-top: 80px;
+  margin-top: -40px;
+}

 /*** fonts ***/

--- a/vitess.io/user-guide/backup-and-restore.md
+++ b/vitess.io/user-guide/backup-and-restore.md
@ -0,0 +1,16 @@
+---
+layout: default
+title: "Backing Up Data"
+description:
+modified:
+excerpt:
+tags: []
+image:
+  feature:
+  teaser:
+  thumb:
+toc: true
+share: false
+---
+
+{% include doc/BackupAndRestore.md %}