Only TestReparentDuringWorkerCopy requires a large number of rows: vtworker SplitClone must stay busy long enough that we can time it and run a planned reparent just after it has started.
With this change, only TestReparentDuringWorkerCopy will use the new flag, which should usually have a high value (default in open-source: 3k).
The existing flag --num_insert_rows covers all other tests and requires no minimum number of rows. Therefore, I've reduced its value from 3k to 100.
This way we are consistent and all healthcheck related flags are properly grouped.
vttablet: Split up flag "binlog_player_retry_delay" into the existing flag and "binlog_player_healthcheck_retry_delay". Now the two flags are used for two different purposes. Similar to the worker code, all healthcheck flags are properly grouped now.
Renamed strategy flag from "-populate_blp_checkpoint" to
"-skip_populate_blp_checkpoint".
Since it's on by default now, I've removed all occurrences where the old
flag was set explicitly.
Before this, we forced rdonly tablets back to "rdonly" using ChangeSlaveType. However, they would go back to their target tablet type eventually anyway: when taken out by vtworker, their type becomes "worker". When the work is done, their type goes back to "spare". If the healthcheck sees that the spare has caught up with replication, it goes back to the target tablet type, i.e. "rdonly".
By forcing the healthcheck, we shorten the test time and don't have to wait for the next healthcheck interval.
This way, it's clearer that this test is independent of the vertical_split test itself.
Improved the test of the expected list of db types in tablet_control.
Although the copy was successful, we have to verify it to catch the case where the database already existed on the destination, but with different options e.g. a different character set.
In that case, MySQL would have skipped our CREATE DATABASE IF NOT EXISTS statement. We want to fail early in this case because vtworker SplitDiff fails in case of such an inconsistency as well.
Specified the dependency for the Maven exec plugin (Travis tried to use an old location which no longer works).
Also added the missing check of the error code.
The tabletmanager grace period is the amount of time it pauses after
broadcasting to vtgate that it's going to stop serving a particular
target type (e.g. when going spare, or when being promoted to master).
During this period, we expect vtgate to gracefully redirect traffic
elsewhere, before the tablet actually starts rejecting queries for that
target type.
The -serving_state_grace_period flag defaults to 0, which corresponds
roughly to the old behavior, with an additional health broadcast.
However, after manually running the tests with value 0, I've set the
default value in test/tablet.py to 1s. There are no branches in the
code for the 0 value; it just means the Sleep() returns immediately.
So it isn't worth running the entire test suite twice from now on.
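The sequence described above can be sketched roughly as follows. This is a minimal Python illustration and not the tabletmanager code: leave_serving_state, broadcast_health, stop_serving and the sleep parameter are hypothetical names used only for this example.

```python
import time


def leave_serving_state(target_type, grace_period_s, broadcast_health,
                        stop_serving, sleep=time.sleep):
    """Hypothetical sketch of the grace-period sequence.

    broadcast_health and stop_serving are stand-in callbacks, not real
    Vitess functions.
    """
    # 1. Tell vtgate that we are about to stop serving this target type.
    broadcast_health(target_type, serving=False)
    # 2. Give vtgate time to redirect traffic. With a period of 0 the
    #    sleep returns immediately, so no separate code path is needed.
    sleep(grace_period_s)
    # 3. Only now start rejecting queries for the target type.
    stop_serving(target_type)
```

Because the 0 case goes through the same three steps, a single non-zero value in the tests should exercise the same code as the default.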
Some changes of sqltypes were not backward compatible:
- Sometimes we get nil bytes for valid string types.
- bsonrpc and grpc were different for Result.Rows.
If the machine a tablet is on loses contact with the cluster, a human or
automated cluster manager may launch a replacement tablet on another
machine. If the original tablet later regains contact, we need to
prevent its healthcheck from modifying the tablet record, which is now
owned by a new instance.
To do this, we rely on the fact that each tablet will set its IP and
primary port in topology on startup. The IP:port should never change for
the life of a process, and any new process that runs simultaneously with
the old one should always have a different IP:port tuple.
Thus, we say that whichever tablet has its IP:port in the tablet record
is the true owner of that record. The healthcheck of any other tablet is
not allowed to modify the record. Note that this check for ownership
applies ONLY to the healthcheck (which includes going SPARE on
shutdown). All other updates to tablet records are unaffected.
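A minimal sketch of the ownership check, written in Python for illustration only. The dict-based record and the function names are assumptions made for this example; they are not the actual tablet record type or healthcheck code.

```python
def owns_tablet_record(record, my_ip, my_port):
    """Return True if this process's IP:port matches the tablet record.

    `record` is a hypothetical dict with "ip" and "port" keys standing in
    for the tablet record stored in topology.
    """
    return (record.get("ip"), record.get("port")) == (my_ip, my_port)


def healthcheck_update(record, my_ip, my_port, new_type):
    """Apply a healthcheck-driven type change only if we own the record."""
    if not owns_tablet_record(record, my_ip, my_port):
        # A replacement tablet has taken over the record; leave it alone.
        return False
    record["type"] = new_type
    return True
```

In this sketch, only the healthcheck path goes through healthcheck_update; other writers of the record would not call it, matching the note that other updates are unaffected.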
The new Value implementation is now based on the vitess types.
* The inner interface has now been replaced by typ and val.
* All Values are expected to be consistent with their types.
For example, an Int64 type must contain a number.
* The functions that build values generally ensure consistency.
* There is a set of 'Trusted' functions that can bypass this
consistency check. They should be used with care.
* The proto3 conversion functions build the correct Value types
based on the field types.
* The bson conversion function provides a Repair function that
allows you to fix up the types after the fact. This should be
deleted after bson is deprecated.
* The building of Values from a QueryResult is non-trivial because
the field info is not part of the QueryResult for streaming
queries. So, the API requires fields to be explicitly passed in.
* Functions that encode or convert to native types expect Value
to be consistent. If not, they panic.
* proto3.QueryResult is considered to be trusted. If it contains
inconsistent data, it will cause panics.
* The EventStreamer has been fixed to ensure that the fields and
rows it publishes are trustable: They can be used as parameters
to the Trusted API.
* The Raw() function usage has been minimized. We should see if
it can be deprecated. This way, we can make Result truly read-only.
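To illustrate the checked versus 'Trusted' split described in the list above, here is a small sketch. It is written in Python rather than Go and is not the sqltypes API: the type tags, build_value, make_trusted and to_native are hypothetical names, and a Python exception stands in for a Go panic.

```python
INT64 = "INT64"      # Hypothetical type tags, not the real vitess enum.
VARCHAR = "VARCHAR"


class Value:
    """A value stored as a type tag plus raw bytes."""

    def __init__(self, typ, val):
        self.typ = typ
        self.val = val


def build_value(typ, val):
    """Checked constructor: rejects bytes that do not match the type."""
    if typ == INT64:
        int(val.decode("ascii"))  # Raises ValueError if not a number.
    return Value(typ, val)


def make_trusted(typ, val):
    """Unchecked constructor: the caller vouches for consistency."""
    return Value(typ, val)


def to_native(value):
    """Convert to a native type; assumes the value is consistent."""
    if value.typ == INT64:
        return int(value.val.decode("ascii"))
    return value.val
```

In this sketch, build_value(INT64, b"abc") fails at construction time, while make_trusted(INT64, b"abc") succeeds and only fails later inside to_native. That delayed failure is why the trusted path should be used with care.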
There are a few more tweaks that need to be done:
* The Proto3ToResult call plumbing was hacked in to make everything
work. That part needs cleaning.
* The bind vars don't need to be converted to their native types
any more.
These tests no longer use our test runner and therefore fail when
test.go supplies --skip-build by default.
Specifying an explicit command in the test config overrides the default
assumption that the test would use our custom test runner.
There were two almost identical methods in utils.py and in tabletmanager.py.
For the tablet type I'm using strings instead of the proto constants now because it's easier to read and shorter. The proto function which converts the string into an enum value will still check if the type is valid.
While porting tests from java_vtgate_test_helper to vtcombo, I found
that streaming queries that fetched more than ~300 rows would indicate
success, but receive fewer rows than expected. The number of rows actually
returned was always in the same ballpark, but would fluctuate from run
to run.
It turned out that since vtcombo returns results in-memory from vttablet
to vtgate, we were returning a *Result once the rows filled up the
buffer size, and then concurrently modifying the already-returned struct
to fill in the next set of rows.
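The failure mode can be reproduced in miniature. The sketch below is a single-threaded Python analogy, not the vtcombo code: the function names are made up for illustration, and a consumer that simply keeps a reference to each batch stands in for the concurrent reader.

```python
def stream_batches_buggy(rows, batch_size):
    """Yields the same list object for every batch, then mutates it."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch.clear()  # Bug: the consumer may still hold this list.
    if batch:
        yield batch


def stream_batches_fixed(rows, batch_size):
    """Hands over each batch and starts a fresh list for the next one."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []  # Fix: never touch a batch after returning it.
    if batch:
        yield batch
```

Collecting all batches from the buggy version and counting rows gives fewer rows than were sent, because earlier batches are overwritten after they were handed out. The fixed version allocates a new container per batch, so every row is counted.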
These blocks tear down the environment when setUp fails. When the exception does not get re-raised, it is effectively swallowed and the test runs despite the failed setUp.
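The pattern looks roughly like this. It is a minimal sketch, and run_with_setup is a hypothetical helper written for this example, not a function from the test suite.

```python
def run_with_setup(set_up, tear_down, test):
    """Hypothetical helper showing the cleanup-and-re-raise pattern."""
    try:
        set_up()
    except BaseException:
        # Clean up the partially initialized environment...
        tear_down()
        # ...and re-raise so the failure is not swallowed.
        raise
    test()
```

Without the bare raise, execution would fall through to test() even though set_up() failed.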
Track the state of each Zookeeper connection in the "ZkCachedConn" variable instead.
Remove "ZkMetaConn" variable because "ZkCachedConn" has the same information now.
Update tabletmanager.py end-to-end test.
I'm removing states.go because it caused a deadlock in the worker.py test.
Its removal is no loss because the original intent was to detect flip-flopping with that code. However, that's no longer necessary.
The observed deadlock occurred when a) somebody polls /debug/vars while b) we create a new Zookeeper connection and publish a "ZkCachedConn" variable for it.
When the connection is created, the "states" object is created as well. The same call eventually calls expvar.Publish(), and this is where the deadlock occurs. (It's a deadlock between a mutex in the "expvar" package and a mutex in the Zookeeper connection package.)
The test requires that some steps occur before shard_2 comes up,
so shard_2 has a separate setup function. However, the test was trying
to combine teardown of shard_2 with the rest, which was confusing and
resulted in attempting to init mysql twice without tearing it down.
This splits out teardown of shard_2 so it behaves correctly and is
easier to understand.
This gets rid of the opaque mysql-db-dir.tbz archive, replacing it with
a .sql file. The .sql file approach makes it clear what state the DB is
initialized with, and also makes it easy to customize.
The test does a planned reparent, but then later steps assume the old
tablet is still the master. After the reparent, we should update the
test's expectation of who the master is.
SrvKeyspace object doesn't exist. Fixing an issue with zktopo
that was not returning the correct error for a missing SrvKeyspace.
Also fixing some (but not all) pylint errors in the files
I touched.
- vttablet now uses the query service to talk to other tablets for health checks.
- making all retries and timeouts configurable (using short values in tests).
- doing a single manual health check on source tablets so their health is good.
The Makefile previously listed tests explicitly for groups like
site_test and worker_test. These lists got out of date when tests were
removed from test/config.json, and the make rules broke. Now the groups
are defined in config.json itself, so there is one place to update
everything.
This commit changes the following protocols:
- binlog_player_protocol
- vtctl_client_protocol
The only BSON protocol left is vtgate, pending the implementation of the
gRPC vtgate client.
Note that we originally added this change in
https://github.com/youtube/vitess/pull/1230
However, we reverted it because the Kubernetes tutorial and images were
out of sync. Therefore, this commit technically is the revert of the
revert.
Revert "Revert "Change protocol defaults to grpc.""
This reverts commit 5e5f40a04e.
We assume the latest backup is the one that's at the end when the list
of directories is sorted. This assumption is violated if we put the
tablet alias first in the name.
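A short sketch of why the ordering of name components matters. The directory names below are invented for illustration and the real backup naming scheme may differ. latest_backup is a hypothetical helper that mirrors the "last entry after sorting" assumption.

```python
def latest_backup(dir_names):
    """Return the last name in sorted order, or None if there are none."""
    return sorted(dir_names)[-1] if dir_names else None


# Hypothetical names; the real backup naming scheme may differ.
timestamp_first = [
    "2015-10-02.120000.cell-0000000101",
    "2015-10-01.120000.cell-0000000102",
]
alias_first = [
    "cell-0000000101.2015-10-02.120000",
    "cell-0000000102.2015-10-01.120000",
]
```

With the timestamp first, latest_backup(timestamp_first) picks the 2015-10-02 backup. With the alias first, latest_backup(alias_first) picks the backup from the tablet with the higher alias, which here is the older 2015-10-01 one.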