# Deployment model

For better or worse, the ARO-RP codebase has four different deployment models.

## 1. Production deployment (PROD)

Running in production. At a given commit, PROD deployments are intended to be identical (bar configuration) across all regions, regardless of whether the region is a designated canary region (westcentralus / eastus2euap).

Subscription feature flags are used to prevent end users from accessing the ARO service in canary regions, or in regions or api-versions which are still being built out. The subscription used for regular E2E service health checking has the relevant feature flags set.
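
As a minimal sketch of what such a gate looks like (the types and the feature flag name below are simplified stand-ins for illustration, not the RP's real API), a component might check the flags registered on a subscription document as follows:

```go
package main

import "fmt"

// feature and subscription are illustrative stand-ins for the shape of an
// AFEC feature flag as registered on a subscription document.
type feature struct {
	Name  string
	State string
}

type subscription struct {
	RegisteredFeatures []feature
}

// hasFeature reports whether the named feature flag is registered on the
// subscription.
func hasFeature(sub subscription, name string) bool {
	for _, f := range sub.RegisteredFeatures {
		if f.Name == name && f.State == "Registered" {
			return true
		}
	}
	return false
}

func main() {
	sub := subscription{RegisteredFeatures: []feature{
		// Hypothetical flag name, for illustration only.
		{Name: "Microsoft.RedHatOpenShift/CanaryAccess", State: "Registered"},
	}}

	// A request in a canary region would be rejected unless the
	// subscription carries the required flag.
	fmt.Println(hasFeature(sub, "Microsoft.RedHatOpenShift/CanaryAccess"))
}
```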

The RP configures deny assignments on cluster resource groups only when running in PROD, because Azure policy permits deny assignments to be set only by first-party RPs running in PROD. The deny assignment functionality is gated by the `DisableDenyAssignments` RP feature flag, which must be set in all non-PROD deployments.
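
Deployment-level RP feature flags follow the same spirit but live in the RP's own configuration. A sketch of the gating just described, with the environment interface reduced to a map (the real plumbing lives in the RP's config/env packages):

```go
package main

import "fmt"

// env is a simplified stand-in for the RP's configuration surface.
type env struct{ features map[string]bool }

func (e env) FeatureIsSet(name string) bool { return e.features[name] }

// shouldCreateDenyAssignment reports whether the RP should configure a deny
// assignment on a cluster resource group. All non-PROD deployments must set
// DisableDenyAssignments, since Azure only permits first-party RPs running
// in PROD to create deny assignments.
func shouldCreateDenyAssignment(e env) bool {
	return !e.FeatureIsSet("DisableDenyAssignments")
}

func main() {
	dev := env{features: map[string]bool{"DisableDenyAssignments": true}}
	fmt.Println(shouldCreateDenyAssignment(dev)) // false: non-PROD
}
```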

## 2. Pre-production deployment (INT)

The INT deployment is intended to be as close to PROD as possible, although some differences are inevitable.

A subscription feature flag is used to selectively redirect requests to the INT RP.

Here is a non-exhaustive list of differences between INT and PROD:

* INT is deployed entirely separately from PROD, in the MSIT tenant, which does not carry the access overheads of production.

* The INT ACR is entirely separate from PROD's.

* INT uses different subdomains for hosting the RP service and clusters.

* INT does not use the production first-party AAD application. Instead, it uses a multitenant AAD application which must be manually patched and granted permissions in any subscription where the RP will deploy clusters.

* There is standing access (i.e. no JIT) to the INT environment, INT elevated Geneva actions and the INT SRE portal.

* INT uses the Test instances of Geneva for RP and cluster logging and monitoring. Geneva actions use separate credentials to authenticate to the INT RP.

* Monitoring of the INT environment does not match PROD monitoring.

* As previously mentioned, deny assignments are not enabled in INT.

## 3. Development deployment

A developer is able to deploy the entire ARO service stack in Azure in a way that is intended to be as representative as possible of PROD/INT, and many ARO service components can also be meaningfully run and debugged without being run on Azure infrastructure at all. This latter "local development mode" is also currently used by our pull request E2E testing.

Some magic is needed to make all of this work, and this translates into a larger delta from PROD/INT in some cases:

* Development deployment is entirely separate from INT and PROD and may in principle use any AAD tenant.

* Development again uses a different set of subdomains for hosting the RP service and clusters.

* No inbound ARM layer

  In PROD/INT, service REST API requests are made to PROD ARM, which proxies them to the RP service. Thus the PROD/INT RPs are configured to authorize only incoming service REST API requests from ARM.

  In development, ARM does not front the RP service, so different authorizers are used. In development mode, the authorizer used for ARM is also used for Geneva actions, so a developer can test Geneva actions manually.

  The ARO Go and Python client libraries in this repo carry patches such that, when the environment variable `RP_MODE=development` is set, they dial the RP on localhost with no authentication instead of dialling ARM (see the sketch after this list).

  In addition, any HTTP headers injected by ARM via its proxying are unavailable in development mode. For instance, the RP frontend fakes up the `Referer` header in this case, in order for client polling code to work correctly in development mode.

* No first-party application

  In PROD, ARM is configured to automagically grant the RP first-party application Owner on any resource group it creates in a customer subscription.

  In INT, the INT multitenant application which fakes the first-party application is granted Owner on every subscription which is INT-enabled. This is simple, but has the disadvantage that the RP has more permissions in INT than it does in PROD.

  In development, `pkg/env/armhelper.go` fakes up ARM's automagic behaviour using a completely separate helper AAD application. This makes setting up the development environment more onerous, but has the advantage that the RP's permissions in development match those in PROD.

* No cluster signed certificates

  Integration with Digicert is disabled in development mode. This is controlled by the `DisableSignedCertificates` RP feature flag.

* No readiness delay

  In PROD/INT, the RP waits 2 minutes before indicating health to its load balancer, helping us to detect if the RP crash loops. Similarly, it waits for frontend and backend tasks to complete before exiting. To make the feature development/test cycle faster, these behaviours are disabled in development mode via the `DisableReadinessDelay` feature flag.

* Standard_D2s_v3 workers required

  In development mode, use of `Standard_D2s_v3` workers is required as a cost-saving measure. This is controlled by the `RequireD2sV3Workers` feature flag.

* There is standing access to development infrastructure using shared development credentials.

* Test instances of Geneva, matching INT, are used in development mode for cluster logging and monitoring (and RP logging and monitoring as appropriate).

* Development environments are not monitored.

* As previously mentioned, deny assignments are not enabled in development.
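
As a minimal sketch of the endpoint selection described under "No inbound ARM layer" above (the localhost address is a placeholder, and the real patches live in the vendored client libraries):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// rpEndpoint returns the base URL a patched client would dial: the local RP
// with no authentication in development mode, ARM otherwise. The localhost
// port is a placeholder.
func rpEndpoint() string {
	if strings.EqualFold(os.Getenv("RP_MODE"), "development") {
		return "https://localhost:8443"
	}
	return "https://management.azure.com"
}

func main() {
	fmt.Println(rpEndpoint())
}
```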

See *Prepare a shared RP development environment* for the process to set up a development environment. The same development AAD applications and credentials are used regardless of whether the RP runs on Azure or locally.

## 3a. Development on Azure

When a developer deploys the entire ARO service stack in Azure, note the following differences in addition to those listed in section 3:

* Currently a separate ACR is created, which must be populated with the latest OpenShift release. TODO: this is inconvenient and adds expense.

* Service VMSS capacity is set to 1 instead of 3 (i.e. not highly available) to save time and money.

* Because the RP is internet-facing, TLS subject name and issuer authentication is required for all API access.

* `hack/tunnel` is used to forward RP API requests from a listener on localhost, wrapping them with the aforementioned TLS client authentication (sketched below).
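
A minimal sketch of what a tool like `hack/tunnel` does, assuming placeholder addresses and key pair file names: accept plaintext connections on localhost and forward them to the internet-facing RP over TLS with a client certificate.

```go
package main

import (
	"crypto/tls"
	"io"
	"log"
	"net"
)

func main() {
	// Client certificate used for TLS subject name and issuer
	// authentication against the RP; file names are placeholders.
	cert, err := tls.LoadX509KeyPair("client.crt", "client.key")
	if err != nil {
		log.Fatal(err)
	}

	l, err := net.Listen("tcp", "localhost:8443")
	if err != nil {
		log.Fatal(err)
	}

	for {
		c, err := l.Accept()
		if err != nil {
			log.Fatal(err)
		}

		go func(c net.Conn) {
			defer c.Close()

			// Dial the internet-facing RP (placeholder address),
			// presenting the client certificate.
			rp, err := tls.Dial("tcp", "rp.example.com:443", &tls.Config{
				Certificates: []tls.Certificate{cert},
			})
			if err != nil {
				log.Print(err)
				return
			}
			defer rp.Close()

			// Shuttle bytes in both directions until either side closes.
			go io.Copy(rp, c)
			io.Copy(c, rp)
		}(c)
	}
}
```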

## 3b. Local development mode / CI

Many ARO service components can be meaningfully run and debugged locally on a developer's laptop. A notable exception is the deployment tooling, including the custom script extension used to initialize the RP VMSS.

"Local development mode" is also currently used by our pull request E2E testing. This has the advantage of saving the time, money and flakiness that would be implied by setting up an entire service stack on every PR. However it is also disadvantageous in the sense that coverage is less and the testing is less representative.

When running in local development mode, in addition to the differences listed in section 3, note the following:

* Local development mode is enabled, regardless of component, by setting the environment variable `RP_MODE=development`. This enables code guarded by `env.IsLocalDevelopmentMode()` (see the sketch at the end of this section) and also automatically sets many of the RP feature flags listed in section 3.

* All services listen on localhost only and authentication is largely disabled.

  The ARO Go and Python client libraries in this repo carry patches such that, when the environment variable `RP_MODE=development` is set, they dial the RP on localhost with no authentication instead of dialling ARM.

* Generation of ACR tokens per cluster is disabled; the INT ACR is used to pull OpenShift container images.

* Production VM instance metadata and MSI authorizers obviously don't work. These are fixed up using environment variables. See `pkg/util/instancemetadata`.

* The INT/PROD mechanism of dialling a cluster API server whose private endpoint is on the RP vnet also obviously doesn't work. Local development RPs share a proxy VM deployed on the RP vnet which can proxy these connections. See `pkg/proxy`.

* As a cost-saving exercise, all local development RPs in a region share a single Cosmos DB account, containing a unique database per developer.
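
For reference, the `env.IsLocalDevelopmentMode()` check is conceptually as simple as the following sketch (the real implementation lives in `pkg/env`):

```go
package env

import (
	"os"
	"strings"
)

// IsLocalDevelopmentMode reports whether the component is running in local
// development mode, signalled by RP_MODE=development (case-insensitive).
func IsLocalDevelopmentMode() bool {
	return strings.EqualFold(os.Getenv("RP_MODE"), "development")
}
```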