add deployment model documentation

Jim Minter 2021-04-28 11:20:12 -05:00
Parent 60f2485cf1
Commit 645e5c9089
No key found matching this signature
GPG key ID: 0730CBDA10D1A2D3
1 changed file with 200 additions and 0 deletions

docs/deployment-model.md (new file, 200 additions)

@@ -0,0 +1,200 @@
# Deployment model
For better or worse, the ARO-RP codebase has four different deployment models.
## 1. Production deployment (PROD)
Running in production. PROD deployments at a given commit are intended to be
identical (bar configuration) across all regions, regardless of whether the
region is a designated canary region (westcentralus / eastus2euap) or not.
Subscription [feature flags](feature-flags.md) are used to prevent end users
from accessing the ARO service in canary regions, or in regions or for
api-versions which are in the process of being built out. The subscription used
for regular E2E service health checking has the relevant feature flags set.
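Conceptually, such a check amounts to scanning the feature registrations in the
subscription document that ARM forwards to the RP. The sketch below is
illustrative only; the type, helper and flag names are not the RP's actual
identifiers.

```go
package main

import "fmt"

// registeredFeature mirrors a feature registration entry in the subscription
// document forwarded by ARM (hypothetical shape, for illustration only).
type registeredFeature struct {
	Name  string
	State string
}

// isRegisteredForFeature reports whether the subscription has the named
// feature flag in the "Registered" state.
func isRegisteredForFeature(features []registeredFeature, name string) bool {
	for _, f := range features {
		if f.Name == name && f.State == "Registered" {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical flag name; the real canary/preview flag names live in the RP source.
	features := []registeredFeature{
		{Name: "Microsoft.RedHatOpenShift/preview", State: "Registered"},
	}

	if !isRegisteredForFeature(features, "Microsoft.RedHatOpenShift/preview") {
		fmt.Println("subscription not enrolled: reject the request")
		return
	}
	fmt.Println("subscription enrolled: allow the request")
}
```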
The RP configures deny assignments on cluster resource groups only when running
in PROD. This is because Azure policy only permits deny assignments to be set
by first party RPs when running in PROD. The deny assignment functionality is
gated by the DisableDenyAssignments RP feature flag, which must be set in all
non-PROD deployments.
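As a rough sketch of what gating on an RP feature flag looks like, assuming the
flags arrive as a comma-separated `RP_FEATURES`-style environment variable (the
real parsing and gating live in the RP's env package):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// featureIsSet reports whether the named RP feature flag appears in the
// comma-separated RP_FEATURES environment variable (an assumption made for
// this sketch; see pkg/env for the real mechanism).
func featureIsSet(name string) bool {
	for _, f := range strings.Split(os.Getenv("RP_FEATURES"), ",") {
		if strings.EqualFold(strings.TrimSpace(f), name) {
			return true
		}
	}
	return false
}

func main() {
	// In all non-PROD deployments DisableDenyAssignments is expected to be set,
	// so the deny assignment step is skipped.
	if featureIsSet("DisableDenyAssignments") {
		fmt.Println("skipping deny assignment on the cluster resource group")
		return
	}
	fmt.Println("creating deny assignment on the cluster resource group")
}
```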
## 2. Pre-production deployment (INT)
INT deployment is intended to be as identical as possible to PROD, although
inevitably there are some differences.
A subscription [feature flag](feature-flags.md) is used to selectively redirect
requests to the INT RP.
Here is a non-exhaustive list of differences between INT and PROD:
* INT is deployed entirely separately from PROD in the MSIT tenant, which does
not have production access overheads.
* The INT ACR is entirely separate from PROD.
* INT uses different subdomains for hosting the RP service and clusters.
* INT does not use the production first party AAD application. Instead it uses
a multitenant AAD application which must be manually patched and granted
permissions in any subscription where the RP will deploy clusters.
* There is standing access (i.e. no JIT) to the INT environment, INT elevated
Geneva actions and the INT SRE portal.
* INT uses the Test instances of Geneva for RP and cluster logging and
monitoring. Geneva actions use separate credentials to authenticate to the
INT RP.
* Monitoring of the INT environment does not match PROD monitoring.
* As previously mentioned, deny assignments are not enabled in INT.
## 3. Development deployment
A developer is able to deploy the entire ARO service stack in Azure in a way
that is intended to be as representative as possible of PROD/INT, and many ARO
service components can also be meaningfully run and debugged without being run
on Azure infrastructure at all. This latter "local development mode" is also
currently used by our pull request E2E testing.
Some magic is needed to make all of this work, and this translates into a larger
delta from PROD/INT in some cases:
* Development deployment is entirely separate from INT and PROD and may in
principle use any AAD tenant.
* Development uses yet another set of subdomains for hosting the RP service and
clusters.
* No inbound ARM layer
In PROD/INT, service REST API requests are made to PROD ARM, and this proxies
the requests to the RP service. Thus PROD/INT RPs are configured to authorize
only incoming service REST API requests from ARM.
In development, ARM does not front the RP service, thus different authorizers
are used. In development mode, the authorizer used for ARM is also used for
Geneva actions, so a developer can test Geneva actions manually.
The ARO Go and Python client libraries in this repo carry patches such that,
when the environment variable `RP_MODE=development` is set, they dial the RP
on localhost with no authentication instead of dialling ARM (sketched after
this list).
In addition, any HTTP headers injected by ARM via its proxying are unavailable
in development mode. For instance, the RP frontend fakes up the Referer
header in this case, in order for client polling code to work correctly in
development mode.
* No first party application
In PROD, ARM is configured to automagically grant the RP first party
application Owner on any resource group it creates in a customer subscription.
In INT, the INT multitenant application which fakes the first party
application is granted Owner on every subscription which is INT enabled. This
is simpler, but has the disadvantage that the RP has more permissions in INT
than it does in PROD.
In development, pkg/env/armhelper.go fakes up ARM's automagic behaviour using
a completely separate helper AAD application. This makes setting up the
development environment more onerous, but has the advantage that the RP's
permissions in development match those in PROD.
* No cluster signed certificates
Integration with Digicert is disabled in development mode. This is controlled
by the DisableSignedCertificates RP feature flag.
* No readiness delay
In PROD/INT, the RP waits 2 minutes before indicating health to its load
balancer, helping us to detect if the RP crash loops. Similarly, it waits for
frontend and backend tasks to complete before exiting. To make the feature
development/test cycle faster, these behaviours are disabled in development
mode via the DisableReadinessDelay feature flag.
* Standard_D2s_v3 workers required
In development mode, use of Standard_D2s_v3 workers is required as a
cost-saving measure. This is controlled by the RequireD2sV3Workers feature
flag.
* There is standing access to development infrastructure using shared
development credentials.
* Test instances of Geneva, matching INT, are used in development mode for
cluster logging and monitoring (and RP logging and monitoring as appropriate).
* Development environments are not monitored.
* As previously mentioned, deny assignments are not enabled in development.
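As a concrete illustration of the client library patches mentioned in the ARM
bullet above, the endpoint selection boils down to an environment variable
check. The URL, port and TLS handling below are assumptions made for the
sketch, not the actual patched code:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
)

// rpEndpoint picks where REST API requests should be sent. When
// RP_MODE=development, it targets a local RP instead of ARM.
func rpEndpoint() (string, *http.Client) {
	if os.Getenv("RP_MODE") == "development" {
		// The local RP serves a self-signed certificate in this sketch, so
		// certificate verification is skipped; port 8443 is an assumption.
		return "https://localhost:8443", &http.Client{
			Transport: &http.Transport{
				TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
			},
		}
	}
	// Otherwise go via ARM as usual.
	return "https://management.azure.com", http.DefaultClient
}

func main() {
	baseURL, _ := rpEndpoint()
	fmt.Println("dialling", baseURL)
}
```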
See [Prepare a shared RP development
environment](prepare-a-shared-rp-development-environment.md) for the process to
set up a development environment. The same development AAD applications and
credentials are used regardless of whether the RP runs on Azure or locally.
## 3a. Development on Azure
In the case that a developer deploys the entire ARO service stack in Azure, in
addition to the differences listed in section 3, note the following:
* Currently a separate ACR is created which must be populated with the latest
OpenShift release. TODO: this is inconvenient and adds expense.
* Service VMSS capacity is set to 1 instead of 3 (i.e. not highly available) to
save time and money.
* Because the RP is internet-facing, TLS subject name and issuer authentication
is required for all API accesses.
* hack/tunnel is used to forward RP API requests from a listener on localhost,
wrapping these with the aforementioned TLS client authentication.
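Conceptually, the tunnel accepts plaintext connections on localhost and
forwards them to the internet-facing RP over TLS, presenting a client
certificate. The following is a minimal sketch under assumed addresses and
certificate paths, not the real hack/tunnel implementation:

```go
package main

import (
	"crypto/tls"
	"io"
	"log"
	"net"
)

func main() {
	// Hypothetical client certificate paths; the real tunnel loads the RP
	// client certificate from the development environment configuration.
	cert, err := tls.LoadX509KeyPair("client.crt", "client.key")
	if err != nil {
		log.Fatal(err)
	}

	// Listen locally; clients (e.g. the patched SDKs) dial this address.
	l, err := net.Listen("tcp", "localhost:8443")
	if err != nil {
		log.Fatal(err)
	}

	for {
		local, err := l.Accept()
		if err != nil {
			log.Fatal(err)
		}

		go func() {
			defer local.Close()

			// Dial the internet-facing RP (placeholder address), presenting
			// the client certificate for TLS client authentication.
			remote, err := tls.Dial("tcp", "203.0.113.1:443", &tls.Config{
				Certificates: []tls.Certificate{cert},
			})
			if err != nil {
				log.Print(err)
				return
			}
			defer remote.Close()

			// Copy bytes in both directions until either side closes.
			go io.Copy(remote, local)
			io.Copy(local, remote)
		}()
	}
}
```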
## 3b. Local development mode / CI
Many ARO service components can be meaningfully run and debugged locally on a
developer's laptop. Notable exceptions include the deployment tooling, such as
the custom script extension which is used to initialize the RP VMSS.
"Local development mode" is also currently used by our pull request E2E testing.
This has the advantage of saving the time, money and flakiness that would be
implied by setting up an entire service stack on every PR. However, it has the
disadvantage that coverage is reduced and the testing is less representative.
When running in local development mode, in addition to the differences listed in
section 3, note the following:
* Local development mode is enabled, regardless of component, by setting the
environment variable `RP_MODE=development`. This enables code guarded by
`env.IsLocalDevelopmentMode()` (see the sketch at the end of this list) and
also automatically sets many of the RP feature flags listed in section 3.
* All services listen on localhost only and authentication is largely disabled.
The ARO Go and Python client libraries in this repo carry patches such that,
when the environment variable `RP_MODE=development` is set, they dial the RP
on localhost with no authentication instead of dialling ARM.
* Generation of ACR tokens per cluster is disabled; the INT ACR is used to pull
OpenShift container images.
* Production VM instance metadata and MSI authorizers obviously don't work.
These are fixed up using environment variables. See
pkg/util/instancemetadata.
* The INT/PROD mechanism of dialing a cluster API server whose private endpoint
is on the RP vnet also obviously doesn't work. Local development RPs share a
proxy VM deployed on the RP vnet which can proxy these connections. See
pkg/proxy.
* As a cost-saving exercise, all local development RPs in a region share a
single Cosmos DB account, containing a unique database per developer.
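For illustration, the local development mode guard mentioned at the top of this
list amounts to little more than an environment variable check. This is a
sketch of the idea only; the real helper is `env.IsLocalDevelopmentMode()`:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isLocalDevelopmentMode mirrors the idea behind env.IsLocalDevelopmentMode():
// local development mode is selected purely by the RP_MODE environment variable.
func isLocalDevelopmentMode() bool {
	return strings.EqualFold(os.Getenv("RP_MODE"), "development")
}

func main() {
	if isLocalDevelopmentMode() {
		// Dev-only behaviour (localhost listeners, relaxed authentication,
		// automatically enabled feature flags) is guarded like this.
		fmt.Println("running in local development mode")
		return
	}
	fmt.Println("running against real Azure infrastructure")
}
```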