With the upcoming API changes, the rowcache invalidator
life cycle is going to differ from the query engine's.
So, it's better that SqlQuery owns and manages it separately.
Since go/bson no longer encodes the non-standard uint64 type,
it's not safe to send uint64 over bsonrpc for a field that is
unmarshaled into interface{}, since it will be unmarshaled as int64.
The only case where this happens is bind vars. We could fix bsonrpc
by switching it to use bson-encoded proto3 structs, since those
use concrete field types for bind vars rather than interface{}.
However, bsonrpc is deprecated anyway so instead of fixing it,
we will just switch to grpc for all clients written in Go that
talk to queryservice.
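A minimal, illustrative Go sketch of the hazard (the decoder below is a stand-in, not the real go/bson code): any codec that lacks an unsigned wire type hands the value back as int64, so large uint64 bind vars wrap negative.

```go
package main

import (
	"fmt"
	"math"
)

// decodeAsInt64 stands in for a codec (like the old bsonrpc path) that has
// no unsigned wire type and therefore returns every integer as int64.
// The names here are illustrative, not from the Vitess codebase.
func decodeAsInt64(v uint64) interface{} {
	return int64(v) // large uint64 values wrap to negative int64
}

func main() {
	bindVar := uint64(math.MaxUint64) // e.g. a keyspace-id-like bind variable
	got := decodeAsInt64(bindVar)
	fmt.Printf("sent %d, received %v (%T)\n", bindVar, got, got)
}
```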
Unfortunately, "Target" is not set in SqlQuery for the destination master during resharding because its QueryService state is "not serving".
Therefore, the health check does not include the "Target".
I've removed the "Target" check for now to unblock the automation task.
However, I think we should fix this and untangle the "Target" in SqlQuery from the serving state and the health check.
This change will cause conversion tests to fail because they verify that all fields get properly converted. To fix the tests, we'll have to update the internal conversion.
This feature is needed for improving the usability of
ExecuteBatch. People may want a dup key error from insert
to be ignored so that they can follow it with other statements.
In the implementation, there's one caveat I had to handle, which
is that the ignore should be removed if it's an upsert. Otherwise,
we never get the dup key error.
1. bind variables in SplitQuery should not contain leading ':'.
2. fix custom_sharding.py test to pass through returned bind variables from VTGate.
3. fix SplitQuery tests in Java.
1. SplitQuery will split a VARBINARY column as a hex string. That is, it assumes
the key range is [0x00000000, 0xFFFFFFFF] and then divides the range into
intervals based on split count.
2. Introduce splitBoundariesStringColumn to handle string column case.
3. Refactor existing SplitQuery in sqlquery.go: create two helper functions
getColumnType and getColumnMinMax.
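A rough sketch of point 1, under the assumption stated above (the range [0x00000000, 0xFFFFFFFF] divided evenly by split count); splitHexBoundaries is a hypothetical helper, not the real SplitQuery code.

```go
package main

import "fmt"

// splitHexBoundaries sketches how a VARBINARY column could be split as a hex
// string: assume the full [0x00000000, 0xFFFFFFFF] range and emit the
// interior boundaries between splitCount equal intervals.
func splitHexBoundaries(splitCount int) []string {
	const max = uint64(0xFFFFFFFF)
	width := max / uint64(splitCount)
	var bounds []string
	for i := 1; i < splitCount; i++ {
		bounds = append(bounds, fmt.Sprintf("%08X", uint64(i)*width))
	}
	return bounds
}

func main() {
	// 4 splits -> 3 interior boundaries
	fmt.Println(splitHexBoundaries(4))
}
```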
Found a good way to know which index caused a dup key error.
The error message itself contains that info. This is better than
looking at RowsAffected because there are other conditions that cause
that field to be 0.
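An illustrative sketch of pulling the index name out of MySQL's duplicate-key message (error 1062 has the form `Duplicate entry '…' for key '…'`); dupKeyIndex is a hypothetical helper, not the Vitess implementation.

```go
package main

import (
	"fmt"
	"regexp"
)

// dupKeyRE matches the key name in MySQL's 1062 error message, e.g.
//   Duplicate entry '42' for key 'PRIMARY'
var dupKeyRE = regexp.MustCompile(`for key '([^']+)'`)

// dupKeyIndex returns the index named in a duplicate-key error message,
// or "" if the message doesn't match.
func dupKeyIndex(errMsg string) string {
	m := dupKeyRE.FindStringSubmatch(errMsg)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	fmt.Println(dupKeyIndex("Duplicate entry '42' for key 'PRIMARY'"))
}
```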
Upsert is mostly feature complete. There are still two
issues to resolve:
1. RowsAffected is returned as 0 if no rows were modified. So, we cannot
assume that no rows were matched. So, the value cannot be used to verify
if the dup key matched a pk or a unique key.
2. I mistook VALUES to be VALUE, and coded defense against using that
construct in the parser. It turns out that the parser already allows use
of VALUES. So, I'll have to find a different way to prevent that usage
for upserts.
This fixes the problem where the wrapped integer value could be used without going through the interface.
Fixed all unit tests where this problem occurred.
The last commit broke one unit test because it was using
upper case INDEX names, which is not possible in vitess.
Only table names can contain upper case letters.
Fix for issue https://github.com/youtube/vitess/issues/797.
Names that used keywords were not always getting back-quoted
correctly during codegen. This is a comprehensive fix that
covers all such possible cases.
1. Deny any table access by default if tableacl does not find an ACL entry.
2. Fix queryservice integration test config file.
3. Log tableacl errors only when the 'queryserver-config-strict-table-acl' flag
is turned on (default: off).
1. Add three varzs TableACLAllowed, TableACLDenied and TableACLPseudoDenied.
2. Each has labels: TableName, TableGroup, PlanID and Username
3. TableACLPseudoDenied varz will be set when a query would have been denied
but wasn't, because the system is in dry-run mode or because the caller is a
superuser.
1. an exempt user will skip tableacl checking and have access to all Vitess tables.
2. add queryservice flag "queryserver-config-acl-exempt-acl" to specify exempt acl name.
1. In the non-tx Plan_Other execution workflow, we should not use ":=" to create a local "err"
variable, since "err" is already defined as a named return variable of qre.Execute(). The old
code caused a problem: the named "err" remained nil after the switch clause,
so qre.Execute() returned (nil, nil) in that case.
2. execDmlAutoCommit should return "err" instead of "nil" in the last return statement.
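A self-contained illustration of bug 1; the function names are illustrative, not the real qre.Execute.

```go
package main

import (
	"errors"
	"fmt"
)

var errQueryFailed = errors.New("query failed")

func doWork() (string, error) {
	return "", errQueryFailed
}

// executeBuggy mirrors the bug shape: a named return value "err" shadowed by
// ":=" inside a switch, so the caller sees (nil, nil) even though doWork failed.
func executeBuggy() (result string, err error) {
	switch {
	default:
		result, err := doWork() // BUG: ":=" declares new locals that shadow the named returns
		_, _ = result, err
	}
	return // returns "", nil: the shadowed values are discarded
}

func executeFixed() (result string, err error) {
	switch {
	default:
		result, err = doWork() // "=" assigns to the named return values
	}
	return result, err
}

func main() {
	_, err := executeBuggy()
	fmt.Println("buggy err:", err)
	_, err = executeFixed()
	fmt.Println("fixed err:", err)
}
```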
1. Remove panics out of QueryExecutor and return error instead.
2. Some places like TxPool, CachePool, SchemaInfo still panics and will be captured
and handled in SqlQuery.
In short, the tableacl configuration will be represented as a
tableacl.proto. The tableacl module holds a global "currentACL" instance
which maps a table group to its ACLs.
1. Introduce tableacl.proto which defines acl per table group.
A table group either contains only one table or many tables
that share the same name prefix.
2. Remove "All" and "AllString" funcs from acl.Factory interface.
Looks like there are no integration transaction tests for
ExecuteBatch vttablet. I'm not going to add them since
vttablet will be an internal API. I'll instead add them
for vtgate when that implementation completes.
fakesqldb returns an empty QueryResult if a query is not recognized.
This behavior reduces typing in unit tests but also causes confusion.
In general, if caller does not set the mock result for a particular
query, fakesqldb should return an error.
1. Add SplitColumn field to SplitQuery so that caller could hint SplitQuery endpoint
which column is best for splitting query.
2. The split column must be indexed and this will be verified on the server side.
3. coding style fixes suggested by golint.
1. Add EnableAutoCommit in queryservice, false by default.
2. If EnableAutoCommit is true, a DML outside a transaction will be accepted
by queryservice, which will start a transaction to execute the DML.
3. Turn on this flag in integration tests to make existing tests pass.
The test was failing in a Docker image whose timezone setting was
different.
It was using time.Unix(), which adds local timezone info after
converting.
For a single dml that does not run inside a transaction, queryservice
will wrap it with a Begin and a Commit. In a case that query plan is
not recognized, Rollback will be called.
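A hedged sketch of the wrapping described above; txConn and execDMLWrapped are illustrative stand-ins, not the real queryservice types.

```go
package main

import (
	"errors"
	"fmt"
)

var errPlanNotRecognized = errors.New("query plan not recognized")

// txConn is an illustrative stand-in for a transactional connection.
type txConn struct{ inTx bool }

func (c *txConn) Begin()    { c.inTx = true }
func (c *txConn) Commit()   { c.inTx = false }
func (c *txConn) Rollback() { c.inTx = false }

// execDMLWrapped sketches the behavior described above: a single DML outside
// a transaction is wrapped with Begin/Commit, and Rollback runs if execution
// fails (e.g. the query plan is not recognized).
func execDMLWrapped(c *txConn, run func() error) (err error) {
	c.Begin()
	defer func() {
		if err != nil {
			c.Rollback()
			return
		}
		c.Commit()
	}()
	return run()
}

func main() {
	c := &txConn{}
	err := execDMLWrapped(c, func() error { return errPlanNotRecognized })
	fmt.Println("err:", err, "inTx:", c.inTx)
}
```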
Go's select will randomly pick a case if one or more of the communications can
proceed. In querylogz_test.go, the timeout is set to zero and then a
message is inserted into the channel. This sometimes triggers the timeout, and thus
the message is not retrieved and rendered.
1. double check the timeout condition when there is a message in the channel.
2. add a large timeout in the unit test to avoid running into the race condition.
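A sketch of fix 1, the double-check pattern; receiveWithTimeout is illustrative, not the querylogz code.

```go
package main

import (
	"fmt"
	"time"
)

// receiveWithTimeout re-checks the channel when the timeout case fires,
// because Go's select picks randomly among ready cases: with a zero timeout,
// the timer case can win even though a message is already queued.
func receiveWithTimeout(ch chan string, timeout time.Duration) (string, bool) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case msg := <-ch:
		return msg, true
	case <-timer.C:
		// Double-check: a message may have been ready at the same instant.
		select {
		case msg := <-ch:
			return msg, true
		default:
			return "", false
		}
	}
}

func main() {
	ch := make(chan string, 1)
	ch <- "log line"
	msg, ok := receiveWithTimeout(ch, 0)
	fmt.Println(msg, ok)
}
```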
codex_test.go uses createTableInfo func to create a TableInfo instance
for testing; however, it uses a map to store columns and then iterates
that map to add columns into TableInfo instance. This is not deterministic
because map iteration order is not guaranteed to be the insertion order.
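A sketch of the usual fix, iterating map keys in sorted order so the fixture is stable across runs; orderedColumns is an illustrative helper.

```go
package main

import (
	"fmt"
	"sort"
)

// orderedColumns returns the column names of a fixture map in a
// deterministic order. Map iteration order is unspecified in Go,
// so sorting the keys first makes the test reproducible.
func orderedColumns(cols map[string]string) []string {
	names := make([]string, 0, len(cols))
	for name := range cols {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	cols := map[string]string{"pk": "int", "name": "varchar", "addr": "varchar"}
	fmt.Println(orderedColumns(cols))
}
```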
1. remove global stats variables defined in query_engine.go.
2. introduce QueryServiceStats struct to be a holder of these stats.
3. use a dependency injection way to push QueryServiceStats instance from
QueryEngine to ConnPool, DBConn, TxPool, etc.
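A sketch of the injection pattern described above; type, field, and constructor names are illustrative, not the exact Vitess ones.

```go
package main

import (
	"expvar"
	"fmt"
)

// queryServiceStats holds the counters that used to be package-level vars.
type queryServiceStats struct {
	QueryCount *expvar.Int
	ErrorCount *expvar.Int
}

func newQueryServiceStats() *queryServiceStats {
	// Unexported expvar values avoid registering global names, which is
	// what lets tests create many instances safely.
	return &queryServiceStats{QueryCount: new(expvar.Int), ErrorCount: new(expvar.Int)}
}

// connPool receives the stats holder by injection instead of touching globals.
type connPool struct{ stats *queryServiceStats }

func newConnPool(stats *queryServiceStats) *connPool { return &connPool{stats: stats} }

func (p *connPool) exec() { p.stats.QueryCount.Add(1) }

func main() {
	stats := newQueryServiceStats() // constructed once by the engine
	pool := newConnPool(stats)      // injected into each component
	pool.exec()
	fmt.Println(stats.QueryCount.Value())
}
```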
1. queryservice should not publish stats if the flag EnablePublishStats
is disabled or the var name is empty.
2. stats.NewInt should call Publish only if the name is not empty.
3. mysqlctl.NewMysqld publishes dba stats only if dbaName is not empty.
4. mysqlctl.NewMysqld publishes app stats only if appName is not empty.
5. QueryEngine publishes stats only if flag EnablePublishStats is enabled.
6. RowCacheInvalidator publishes stats only if flag EnablePublishStats is enabled.
CachePool launches a memcache process; if the launch fails, it sleeps 100ms
before the next attempt. This retry delay slows down the unit tests unnecessarily,
since a fake memcache service is guaranteed to succeed.
go/streamlog/streamlog.go starts a separate goroutine for each new logger instance.
querylogz_test.go calls logger.Send(logStats) and then assumes this
logStats has been delivered; it then calls querylogzHandler and verifies the result.
However, delivery is not guaranteed right after the Send call, because Send
just puts the message into dataQueue and only the goroutine delivers it to each consumer.
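A minimal stand-in showing why Send alone is not a synchronization point; the logger type here is illustrative, not the real streamlog.

```go
package main

import "fmt"

// logger mimics streamlog's shape: Send only enqueues, and a separate
// goroutine delivers to consumers, so "Send returned" does not mean
// "message delivered".
type logger struct{ dataQueue chan string }

func (l *logger) Send(msg string) { l.dataQueue <- msg }

func main() {
	l := &logger{dataQueue: make(chan string, 16)}
	delivered := make(chan string)
	go func() { // the per-logger delivery goroutine
		for msg := range l.dataQueue {
			delivered <- msg
		}
	}()
	l.Send("logStats")
	// Receiving from the consumer channel is the real synchronization point;
	// a test that checks consumer state right after Send is racy.
	fmt.Println(<-delivered)
}
```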
Under load, Begin gets occasionally called just when the context
is about to expire. In such cases, the query gets killed as soon
as it's issued. However, the Begin call itself almost never fails
because it's too fast. So, this just results in the connection
getting closed, which isn't seen until the next statement executes,
and results in a confusing 2006 error.
So, we give ourselves a 10ms grace period to make sure that we
give Begin enough time to succeed. This way, we won't see as many
confusing error messages.
There's a race condition between context and resource pool.
If a context is already expired, there's still a chance that
a Get will succeed because both channels are ready to communicate.
This bug also masked another bug in dbconnpool and some other
bugs in our tests.
All fixed now.
1. rename logzcss.go to logz_utils.go and move two util funcs from txlogz.go to it.
2. let querylogzHandler accept a streamlog.StreamLogger instance instead of always
using SqlQueryLogger.
3. add two params timeout and limit to /querylogz, this allows one to control how many
queries the querylogz page shows.
1. add newQueryServiceConfig, newDBConfigs and newMysqld to testutils_test.go.
2. group these funcs under the testUtils struct for better code readability.
transaction.go only has a single func Commit, it might make more sense
to move Commit to QueryEngine since it uses the QueryEngine's tx pool
and schemaInfo.
This change effectively brings sqlquery.go test coverage to 99%.
1. add "queryCalled" in fakesqldb.DB so that tests can know the number of
times a query hits the database.
2. add testUtils struct in testutils_test and this struct contains common
test funcs that could be used in tabletserver package.
3. add sqlquery tests to test SqlQuery.allowQueries failure modes.
4. add sqlquery tests to test SqlQuery.checkMySQL failure modes.
5. add sqlquery tests to test transaction failure modes.
6. add sqlquery tests to test SqlQuery.ExecuteBatch failure modes.
7. add sqlquery tests to test SqlQuery.SplitQuery failure modes.
1. use panic in query rule info when registering an already-registered query rule.
2. fix a bug in query_rules.go: it should return an error if "op" is < QR_IN or > QR_NOTIN.
3. add more test cases to test query rules failure cases.
4. fix a comment: the returned error should be "want bool for OnMismatch" when OnMismatch is not a boolean (instead of "want bool for OnAbsent").
1. Refactor stream_queryz.go, move registerStreamQueryzHandlers to queryctl.go.
2. Add two funcs: streamQueryzHandler and streamQueryzTerminateHandler.
1. Rename streamlogger.go to sqlstats_stats.go since this matches file content better.
2. Rename QUERY_SOURCE_* contants to use camel case as suggested by golint.
3. Create a testutils_test.go to contain common test utilities that can be shared
among all queryservice unit tests.
1. move simpleacl.go from go/vt/tableacl to go/vt/simpleacl
2. add an AclFactory interface.
3. move both ACL and AclFactory interfaces to go/vt/tableacl/acl
4. define a package level variable called "acls" in tableacl which
contains all registered AclFactory.
5. enable simpleacl as the default tableacl mechanism.
To plug in a tableacl implementation other than simpleacl, one
could modify the binaries (go/cmd/vtocc/vtocc.go and go/cmd/vttablet/vttablet.go)
to call tableacl.Register(...) and then point tableacl.DefaultACL to the
specific tableacl. Alternatively, one could put the tableacl
implementation under the go/vt/tableacl dir and use the init function to finish
the registration.
1. remove cache pool dependency from MemcacheStats; instead, NewMemcacheStats func takes
a func param to return memcache stats. This change makes MemcacheStats more lightweight
and easy to unit test.
2. replace the bool params of the original NewMemcacheStats func with flags: enableMain, enableSlabs
and enableItems.
3. Pre-allocate memory for MemcacheStats.main, slabs and items. This removes several null checks.
4. expose the refresh frequency to let the caller decide how frequently to refresh the stats.
There is only one rowcache per queryservice and cache pool only creates
connections to the underlying cache service. Therefore, the fakecacheservice
should only have a single place for cached data and share it across multiple
connections.
This change allows code to create multiple SqlQuery instances by assigning different
values to the newly added configs. This avoids using a global SqlQuery instance in unit
tests and may also benefit other use cases.
fakesqldb.Conn will not init proto.QueryResult.Rows if RowsAffected is zero.
However, according to go/mysql/mysql.go, the ideal response should be an empty
proto.QueryResult.Rows.
1. merge insert.go and select.go into dml.go
2. merge plan_type.go into plan.go
3. add some comments and rename variables to follow golint's suggestions
fakesqldb.Register currently takes two params: one is a set of queries it supports
and the other is whether it returns a connection error. This makes fakesqldb not very
intuitive to use in unit tests.
1. Add a fakesqldb.DB struct that stores all queries supported by a fake sqldbconn.
   Let fakesqldb.Register return a *fakesqldb.DB.
2. Unit tests are able to add or remove queries via the returned *DB.
1. add statsPrefix to allow the caller to add a prefix for schema info stats
2. add an endpoints param in NewSchemaInfo to allow the caller to provide different
   http endpoints to serve
1. Add a connFail param to the fakesqldb.Register function so that it is possible to
   fail the dbconnpool.DBConnection intentionally in order to test error handling
   logic.
2. Make fakesqldb.Conn return an empty result if a request is unknown instead of
   returning an error. This is mainly because schema_info and planbuilder send various
   sql queries to mysql to get some information, and adding all those queries to
   fakesqldb would make the tests hard to understand.
NewTxPool always creates a new txStats instance with a hard-coded name.
This is fine until some unit tests need to create two TxPool instances:
the second NewTxPool call will fail because the hard-coded stats variable
was already exported by the first call.
This happens when there are unit tests for both TxPool and SqlQuery, since SqlQuery
implicitly creates a TxPool instance.
One possible fix is to add an extra txStatsPrefix param to the NewTxPool function
so that each unit test can register its own txStats under a unique name.
Given that the NewTxPool function is only called once in QueryEngine, this
fix might be simpler than the alternatives.
In addition, it also allows unit tests to create different TxPool instances
and makes the test functions independent of each other.
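A sketch of the prefix fix; expvar panics when the same name is published twice, which is the failure mode described above. newTxStats is illustrative, not the real Vitess constructor.

```go
package main

import (
	"expvar"
	"fmt"
)

// newTxStats takes a prefix so each TxPool (and each unit test) registers
// its counters under a unique name. Calling expvar.NewInt twice with the
// same name panics, which is exactly the collision described above.
func newTxStats(prefix string) *expvar.Int {
	return expvar.NewInt(prefix + "Transactions")
}

func main() {
	a := newTxStats("Test1")
	b := newTxStats("Test2") // would panic without distinct prefixes
	a.Add(1)
	fmt.Println(a.Value(), b.Value())
}
```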
1. add sqldb.Conn and move mysql.ConnectionParams to sqldb package.
2. change mysql.Connect function to return sqldb.Conn.
3. update places using *mysql.Connection to sqldb.Conn.
4. update places using *mysql.ConnectionParams to *sqldb.ConnParams.
5. change places that use mysql.Connect to sqldb.Get().
6. Some random go style fixes (suggested by golint).
7. define package level variable DefaultDB in sqldb.
8. go/mysql will register its Connect func in init function.
It turns out that ioutil.TempDir generates names that are
longer than the grte socket file name limit of 108. So,
we're instead using a socket filename hint passed in from
the command line.
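An illustrative check against the 108-byte Unix socket path limit mentioned above; socketPathOK is hypothetical, and 108 is the typical Linux sun_path size (including the trailing NUL).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// maxSocketPath is the typical Linux sun_path limit for Unix domain sockets.
const maxSocketPath = 108

// socketPathOK reports whether a socket path fits under the limit.
func socketPathOK(path string) bool {
	return len(path) < maxSocketPath
}

func main() {
	// A generated temp dir can push the socket path toward the limit,
	// while a short, caller-supplied hint stays well under it.
	tmp, err := os.MkdirTemp("", "memcache-socket-with-a-long-generated-name-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(tmp)
	long := filepath.Join(tmp, "memcache.sock")
	fmt.Println(len(long), socketPathOK(long))
	fmt.Println(socketPathOK("/tmp/mc.sock"))
}
```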
1. Add SessionId to SplitQueryRequest
2. Do not panic if a table contains no rows, just return empty set of
boundaries.
3. Remove panic style error handling and explicitly return errors
Reusing socket files seems to cause trouble occasionally.
The new scheme generates a unique socket file name every time
we launch memcache. This makes the memcache socket & port
obsolete, which will be removed later.
It's risky to set an expiry for cache invalidations. There are
unlikely, but possible race conditions that could cause an
untimely query to populate the rowcache with stale data.
If disallowQueries hangs, things generally go bad because
the server remains hung and there's no opportunity for it
to come back up. So, it's better to crash in such situations.
This should generally not happen, but it's a good failsafe.
The older state management code was in a confused state
with no clarity on how locks, atomic updates, and
wait groups interacted. It's all cleaned up now.
The rules are simpler now, the code is more readable,
and everything is documented inline.
Change default retryDelay to 2ms.
Only retry on tx_pool_full, retry, and fatal errors, and only when not in a transaction, not cancelled, and not past the deadline.
Add integration tests to test retry logic with vttablet shutting down.
1. timeout: txlogz will keep dumping transactions until timeout,
default value: 10s
2. limit: txlogz will keep dumping transactions until it hits the limit,
default value: 300
Implementing a new pool and associated connection to
better manage connection problems. This new combination
addresses the following problems:
- On connection error, retry if allowed.
- CheckMySQL if necessary.
- New Exec API with deadline.
- Improved connection killer that kills the connection on
both sides.
This is just the framework for now. The next CL will
actually change the tabletserver to use this.
The test is manual for now. So, it's disabled by default.
e.g. if we're accessing the status page at:
http://my.proxy/vttablet-123/debug/status
then the JS on that page needs to request vars from:
/vttablet-123/debug/vars
instead of just /debug/vars.
This CL has a few coordinated changes:
- Any connection failures trigger a CheckMySQL. The check tries
to connect to MySQL. If it fails, the query service is shut down.
This can be triggered by the rowcache invalidator also.
This check is rate-limited to once/second.
- To counter this behavior, vtocc tries restarting the query service
every 30s. This also means that there's no need for the waitForMySQL
flag in AllowQueries.
- Connection errors are now classified as FATAL. This will allow clients
to more proactively go elsewhere.