* Update dependencies
* Update Taskfile to ensure `task --force` works
* Fix wait-for-operator-ready script
Previously it waited only 5s, which was less than the time needed to
list all CRDs. As a result the script failed if it happened to run
after all CRDs had been added (the list takes longer) but succeeded if
it ran while CRDs were still being added (the list is faster because
not all CRDs exist yet).
* Fix failing ASOv1 test
* Support granting ALL privileges in a database to a user
If ALL is specified, we don't delete any privileges.
* Add validation webhooks for MySQL[AAD]User to prevent server-ALL
Since the ASO mysql admin user doesn't have privileges to grant global
ALL privileges, prevent this from being set in the resources.
* Update the docs for mysql users and aad users
* Test the MySQL[AAD]User webhook prevents ALL at the server level
* Update KeyVault SecretClient to recover soft delete
* Includes a test ensuring that this works
* Add Azure SQL Combined test to ensure create+delete+recreate works
* Update CI to not fail on stderr
* Add OperatorMode config value and use it from main
It's specified as AZURE_OPERATOR_MODE, with possible values `webhooks`, `watchers`, and `watchers-and-webhooks`. main() uses this setting to decide whether watchers and webhooks should be started.
* Move reconciler and webhook registration out of main
Move it to controllers.RegisterReconcilers and controllers.RegisterWebhooks so that it can be shared between main and the controller tests.
* Test the watchers/webhooks behaviour of different operator modes
* Include operator logging when TEST_EMIT_ASO_LOGS is set
This can be very useful when trying to understand why a test is failing, but it's far too noisy to include all the time.
* Add tests for OperatorMode
* Add make targets and pipeline jobs for webhooks + watchers modes
* Remove envtest job timeout for now
It seems like there's a problem in the job that's causing it to be killed by the timeout, but the way the job is killed prevents us from seeing the output which would let us fix the underlying problem in whichever test is failing.
* Use require rather than assert in Azure SQL
Also in the Ensure* helpers.
The assert library doesn't stop the current test when an assertion fails, so the run always ends in a timeout waiting for something that can't happen because a prerequisite failed.
In general the require model is better for tests. The downside is that you might need to run a test several times to see all the problems, but it avoids nonsensical situations where you time out deleting a resource that you failed to create.
* Increase test FailoverGracePeriod to 60
Previous runs were failing with an error saying that 60 is the minimum. (Not sure whether this is a new constraint?)
* Rewrite Retry as a non-recursive function
It was producing very annoying stack traces if a test timed out.
* Support user specified MySQLServer secrets
- The specified secret must be a Kubernetes secret.
- The specified secret must contain a "username" and "password" field.
- The specified secret must be in the same namespace as the MySQLServer.
- If the specified secret doesn't exist, reconciliation will be blocked
until the secret does exist. Once the secret is created, reconciliation
will continue as normal.
- The operator does not make the user specified secret owned by
the MySQLServer.
- The operator still creates a secret containing connection string details
and username/password for the server. This secret is named as it was
before. This means that the customer-specified username and password
are consumed to create this secret, but other resources such as MySQLUser
still consume the generated secret.
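The requirements above could be checked with a small validation helper like this hedged sketch; the function name and error wording are invented for illustration:

```go
package main

import "fmt"

// validateUserSecret checks the constraints listed above: a
// user-supplied secret must carry non-empty "username" and "password"
// entries. (Illustrative sketch; the operator's real validation lives
// in the reconciler.)
func validateUserSecret(data map[string][]byte) error {
	for _, key := range []string{"username", "password"} {
		if len(data[key]) == 0 {
			return fmt.Errorf("secret is missing required key %q", key)
		}
	}
	return nil
}
```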
* postgresqluser: Close DBs to prevent leaking connections
This is much simpler than a cache of DBs. I started down that
path, but as always it's removing no-longer-needed DBs from the cache
that makes it complicated. For now this fixes the leak, and if
opening and closing connections turns out to be too slow
we can fix that then.
* mysql[aad]user: Close DB when we're finished with it
This is much simpler than trying to use a DB cache for it.
* azuresql: Close DBs when finished with them
In the azuresqlaction, azuresqlmanageduser and azuresqluser
reconcilers.
* Change region used for MySQL test
The ASO CI subscription can't create servers in eastus2 but westus2 is
allowed at the moment.
* Add a target namespaces config, only watch resources therein
* Initial work on target namespace test
* Get target namespace test working in both cases
* More useful logging when creating test RG fails
* Run the no-target-namespaces test in the CI pipeline
This is handled in the same way as the secret naming version setting,
but the more settings we add (some more are on the way), the more
unwieldy it's going to be. We need to come up with a better way of
making different settings testable.
* Rework install- targets so they don't trample go.mod & .sum
Renamed them to install-tools and install-test-tools, since they're
installing binaries used in the build process rather than code
dependencies.
Run the `go get` commands in a temp directory and dummy module so that
they don't update the ASO go.mod and .sum files with dependencies that
our code doesn't actually depend on.
* Use the unfiltered API reader when looking for AAD identities
When target namespaces are set, there's no guarantee that the
operator's namespace is included. The identity finder always needs to
look in the operator namespace so pass it the API reader which
bypasses the filtered cache.
* Review tweaks, thanks @matthchr!
* Update CosmosDB SDK version
* Refactor CosmosDB folder structure
- This is in preparation for adding new CosmosDB resources.
* Rename HandleEnsureError to a clearer name
- Also add documentation.
* Add new CosmosDBSQLDatabase resource
* Add PollURLKind to status
- Use it to differentiate between Create/Delete polling.
* Update webhook/CA injection patches to apply to CRD v1
* Updated go.mod/.sum with changed deps from upgrading controller-tools
* Run go mod tidy in hack/*
And remove a dead file.
Co-authored-by: George Pollard <gpollard@microsoft.com>
* Keep track of the polling URL when creating a MySQL server
Previously this was being dropped on the floor unless the create call
returned an AsyncOpIncompleteError. This works fine except in the
case of creating a replica at the same time as the leader server (or
very quickly after). In that case the create server call returns no
error, but querying the poll URL shows that the creation failed
because the SourceServerID didn't match anything. Without capturing
the poll URL on the initial creation the reconciliation gets stuck
because it thinks that the 404 it gets when checking for the
server indicates that it's just not ready yet - provisioning never goes
back to false and it doesn't retry the creation.
* Review tweaks, thanks @matthchr!
* Ensure FailedProvisioning=false when Provisioned=true
A user reported a case where a RedisCache was provisioned and working
but was marked as FailedProvisioning=true. This seems to have happened
because it initially failed (possibly a temporary error that we aren't
checking for), and then on a subsequent periodic reconciliation the
ARM query reported that it was deployed successfully. Use the
SetProvisioned helper method to prevent this state.
* Fix v1 secret naming
- Fix issue where namespace was mistakenly included in v1 secret
naming key generation. Some resources are not expected to have
namespace prefix in certain KeyVault scenarios.
* Increase build timeout a bit
* Don't create many different randoms in test
* SecretClient should not be modified
* Change region VM tests are run in
- Due to capacity constraints. We can move back later.
* Some fixes
* Add timeout for all reconciles
- Also ensure that connection timeout is specified for
AzureSQL when connecting to the server.
* Randomize KV name
* Fix typo in readme
* Improve secrets documentation
* Return proper error if we cannot deserialize secret
* Add new AZURE_SECRET_NAMING_VERSION mode
The new mode allows us to fix inconsistencies in how secrets
were named without making a breaking change.
- AppInsights created secrets in the same namespace
as the resource but with name:
"appinsights-<resourceGroup>-<resourceName>"
- Storage created secrets in the same namespace
as the resource but with name:
"storage-<resourceGroup>-<resourceName>"
- AzureSQL resources created secrets with
a different naming scheme as well.
- Other resources created a secret in the same
namespace with the secret name being the
resource name.
The new V2 mode ensures that all resources create secrets
in KeyVault and/or Kubernetes with a consistent naming pattern.
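The consistency goal can be illustrated with a sketch of a single naming helper shared by all resource kinds. The exact `<kind>-<resourceGroup>-<name>` pattern here is an illustrative assumption, not necessarily the operator's real V2 format:

```go
package main

import "fmt"

// secretNameV2 sketches the idea behind the V2 mode: every resource
// kind composes its secret name the same way, instead of each kind
// using its own ad-hoc scheme.
func secretNameV2(kind, resourceGroup, name string) string {
	return fmt.Sprintf("%s-%s-%s", kind, resourceGroup, name)
}
```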
* Update Helm chart (but don't generate new package)
* Fix bug where SQLManagedUser Namespace could be empty
- This would prevent secrets from being created in Kubernetes
* Enable V2 secrets for EnvTest tests
* Use v1beta1 explicitly with controller-gen
* PR feedback
* PR feedback
* Better testing
* Add v1alpha2 MySQLUser
This removes DbName from MySQLUser and adds DatabaseRoles to store
per-database permissions. Roles will now only store server-wide
permissions.
Add conversions between v1alpha1 and v1alpha2 versions.
* Add v1alpha2 MySQLAADUser
This removes DBName from MySQLAADUser and adds DatabaseRoles to store
per-database permissions. Roles will now only store server-wide
permissions.
Add conversions between v1alpha1 and v1alpha2 versions.
* Set up conversion webhooks for MySQLUser and MySQLAADUser
* Review feedback, thanks @matthchr!
* Ensure `preserveUnknownFields: false` is set in all webhook patches
These were set for all types with version conversions but not the
others (which aren't in use since they are still commented out in
kustomization.yaml). Turning them on in the rest to remove one step in
the process of adding conversion webhooks to types in the future.
This setting is required for conversion to work - it seems like the
only reason it's not set in the patches is that they were generated by
kubebuilder before the setting was mandatory.
* Add provisioning state methods to v1alpha2 ASOStatus
* Reimplement MySQLUser and MySQLAADUser reconciliation with v1alpha2
They now check server-level (in USER_PRIVILEGES) and
database-level (SCHEMA_PRIVILEGES) permissions.
* Update controller tests to work with v1alpha2 MySQLUser
* Move system database constant to mysql.SystemDatabase
Also rename the ServerPort and DriverName constants so they don't
repeat the name of the package.
* Change EnsureUserDatabaseRoles to return just an error
And change the reconciliation code for user and aad user to just treat
that as a provisioning failure, rather than saying that it had
succeeded but there were some errors which is just confusing. We still
try to apply changes to all databases even if there is an error for
one of them.
Also other review changes, thanks @matthchr!
* Azure SQL FailoverGroup improvements
- Fix bug preventing reconciliation of updates after a FailoverGroup
was created.
- Fix bug where status of long running operation was not properly
monitored.
* Add unit tests to CI
The operator was originally reconciling AccessPolicies after the rest of the KV
had been created (see: #1158). This isn't actually required, and even doing it
there are still plenty of ways this can fail. I've filed #1351 to track
removing ClientID from the CRD in a future API version as there are a ton of
obscure ways that we can fail to translate that ID into an ObjectID.
* Fix README to refer to current version of helm chart
- Don't specify using :latest for the controller image
in the README either as that defeats the purpose of
hardcoding the specific image in the Helm chart.
* Fix 0.1.0 Helm chart to refer to correct container image version
- Using :latest is not correct as that means it will always be
pulling the latest version of the docker container.
* Fix possible null reference error
These could be setup issues like the AAD admin not being set on the
server - giving up immediately means the only way to get the user
created successfully is to edit the resource to remove the finalizer,
delete it, then re-add it. The reconcile loop for regular users doesn't
use this pattern of returning true, nil when it can't reach the server.
This can mean that if there's a communication error when checking with
Azure then we just remove the k8s database resource without trying to
delete the Azure one.
* Clarify that Sid should be client ID for managed identities
* Add more detail to logging for errors connecting to MySQL
Co-authored-by: Bevan Arps <bevan.arps@microsoft.com>