Move Prometheus/Grafana config to separate file

- Move Grafana admin login info to credentials
- Update documentation for Prometheus/Grafana integration
- Resolves #205

Parent: 449e621a66
Commit: 612e2a50e5

Changed files: README.md (21 lines)
@@ -4,6 +4,8 @@
[![Image Layers](https://images.microbadger.com/badges/image/alfpark/batch-shipyard:latest-cli.svg)](http://microbadger.com/images/alfpark/batch-shipyard)

# Batch Shipyard
+<img src="https://azurebatchshipyard.blob.core.windows.net/github/README-dash.png" alt="dashboard" width="1024" />
+
[Batch Shipyard](https://github.com/Azure/batch-shipyard) is a tool to help
provision and execute container-based batch processing and HPC workloads on
[Azure Batch](https://azure.microsoft.com/services/batch/) compute
@@ -23,21 +25,17 @@ in Azure, independent of any integrated Azure Batch functionality.
Azure Batch compute nodes
* Automated deployment of required Docker and/or Singularity images to
compute nodes
* Accelerated Docker and Singularity image deployment at scale to compute
pools consisting of a large number of VMs via private peer-to-peer
distribution of container images among the compute nodes
* Mixed mode support for Docker and Singularity: run your Docker and
Singularity containers within the same job, side-by-side or even concurrently
* Comprehensive data movement support: move data easily between locally
accessible storage systems, remote filesystems, Azure Blob or File Storage,
and compute nodes
-* Support for Docker Registries including
-[Azure Container Registry](https://azure.microsoft.com/services/container-registry/)
-and other Internet-accessible public and private registries
-* Support for the [Singularity Hub](https://singularity-hub.org/) Container
-Registry
* Support for serverless execution binding with
[Azure Functions](http://batch-shipyard.readthedocs.io/en/latest/60-batch-shipyard-site-extension/)
+* Support for Docker Registries including
+[Azure Container Registry](https://azure.microsoft.com/services/container-registry/),
+other Internet-accessible public and private registries, and support for
+the [Singularity Hub](https://singularity-hub.org/) Container Registry
* [Standalone Remote Filesystem Provisioning](http://batch-shipyard.readthedocs.io/en/latest/65-batch-shipyard-remote-fs/)
with integration to auto-link these filesystems to compute nodes with
support for [NFS](https://en.wikipedia.org/wiki/Network_File_System) and
@@ -49,6 +47,10 @@ via [blobfuse](https://github.com/Azure/azure-storage-fuse),
[GlusterFS](https://www.gluster.org/) provisioned directly on compute nodes
(which can act as a distributed local file system/cache), and custom Linux
mount support (fstab)
+* Automated, integrated
+[resource monitoring](http://batch-shipyard.readthedocs.io/en/latest/66-batch-shipyard-resource-monitoring/)
+with [Prometheus](https://prometheus.io/) and [Grafana](https://grafana.com/)
+for Batch pools and RemoteFS storage clusters
* Seamless integration with Azure Batch job, task and file concepts along with
full pass-through of the
[Azure Batch API](https://azure.microsoft.com/documentation/articles/batch-api-basics/)
@@ -89,6 +91,9 @@ optional creation of SSH tunneling scripts to Docker Hosts on compute nodes
on compliant Windows compute node pools with the ability to activate
[Azure Hybrid Use Benefit](https://azure.microsoft.com/pricing/hybrid-benefit/)
if applicable
+* Accelerated Docker and Singularity image deployment at scale to compute
+pools consisting of a large number of VMs via private peer-to-peer
+distribution of container images among the compute nodes

## Installation
### Azure Cloud Shell
@@ -126,44 +126,3 @@ global_resources:
    include:
    - '*.bin'
    path: /another/local/path/dir
-monitoring:
-  location: <Azure region, e.g., eastus>
-  resource_group: my-prom-server-rg
-  hostname_prefix: prom
-  ssh:
-    username: shipyard
-    ssh_public_key: /path/to/rsa/publickey.pub
-    ssh_public_key_data: ssh-rsa ...
-    ssh_private_key: /path/to/rsa/privatekey
-    generated_file_export_path: null
-  public_ip:
-    enabled: true
-    static: false
-  virtual_network:
-    name: myvnet
-    resource_group: my-vnet-resource-group
-    existing_ok: false
-    address_space: 10.0.0.0/16
-    subnet:
-      name: my-server-subnet
-      address_prefix: 10.0.0.0/24
-  network_security:
-    ssh:
-    - '*'
-    grafana:
-    - '*'
-  vm_size: STANDARD_D2_V2
-  accelerated_networking: false
-  services:
-    resource_polling_interval: 15
-    lets_encrypt:
-      enabled: true
-      use_staging_environment: true
-    prometheus:
-      port: 9090
-      scrape_interval: 10s
-    grafana:
-      admin:
-        user: admin
-        password: admin
-      additional_dashboards: []
@@ -99,3 +99,9 @@ credentials:
      filename: some/path/token.cache
    credentials_secret_id: https://<vault_name>.vault.azure.net/secrets/<secret_id>
    uri: https://<vault_name>.vault.azure.net/
+  # monitoring credentials
+  monitoring:
+    grafana:
+      admin:
+        username: grafana_username
+        password: grafana_user_password
@@ -0,0 +1,48 @@
+monitoring:
+  location: <Azure region, e.g., eastus>
+  resource_group: my-prom-server-rg
+  hostname_prefix: prom
+  ssh:
+    username: shipyard
+    ssh_public_key: /path/to/rsa/publickey.pub
+    ssh_public_key_data: ssh-rsa ...
+    ssh_private_key: /path/to/rsa/privatekey
+    generated_file_export_path: null
+  public_ip:
+    enabled: true
+    static: false
+  virtual_network:
+    name: myvnet
+    resource_group: my-vnet-resource-group
+    existing_ok: false
+    address_space: 10.0.0.0/16
+    subnet:
+      name: my-server-subnet
+      address_prefix: 10.0.0.0/24
+  network_security:
+    ssh:
+    - '*'
+    grafana:
+    - 1.2.3.0/24
+    - 2.3.4.5
+    prometheus:
+    - 2.3.4.5
+    custom_inbound_rules:
+      myrule:
+        destination_port_range: 5000-5001
+        protocol: '*'
+        source_address_prefix:
+        - 1.2.3.4
+        - 5.6.7.0/24
+  vm_size: STANDARD_D2_V2
+  accelerated_networking: false
+  services:
+    resource_polling_interval: 15
+    lets_encrypt:
+      enabled: true
+      use_staging_environment: true
+    prometheus:
+      port: 9090
+      scrape_interval: 10s
+    grafana:
+      additional_dashboards: null
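The `network_security` entries above take either the wildcard `'*'` or IP address/CIDR prefixes such as `1.2.3.0/24` and `2.3.4.5`. Batch Shipyard performs its own validation of these values; purely as an illustrative sketch (the `valid_source_prefix` helper below is hypothetical, not part of Batch Shipyard), the stdlib `ipaddress` module can perform the same kind of check:

```python
import ipaddress

def valid_source_prefix(prefix):
    """Return True if a network_security entry looks valid: either the
    wildcard '*' or an IP address/CIDR such as 1.2.3.0/24 or 2.3.4.5."""
    if prefix == '*':
        return True
    try:
        # a bare address like 2.3.4.5 parses as a /32 network
        ipaddress.ip_network(prefix, strict=False)
        return True
    except ValueError:
        return False

# the rules from the schema above
rules = {
    'ssh': ['*'],
    'grafana': ['1.2.3.0/24', '2.3.4.5'],
    'prometheus': ['2.3.4.5'],
}
for service, prefixes in rules.items():
    bad = [p for p in prefixes if not valid_source_prefix(p)]
    if bad:
        raise ValueError('{}: invalid prefixes {}'.format(service, bad))
```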
@@ -4152,7 +4152,7 @@ def action_monitor_add(table_client, config, poolid, fscluster):
    :param list fscluster: list of fs clusters to monitor
    """
    if util.is_none_or_empty(poolid) and util.is_none_or_empty(fscluster):
-        logger.error('no resources specified')
+        logger.error('no monitoring resources specified to add')
        return
    # ensure that we are operating in AAD mode for batch
    if util.is_not_empty(poolid):
@@ -4205,6 +4205,14 @@ def action_monitor_remove(table_client, config, all, poolid, fscluster):
    if not all and util.is_not_empty(poolid):
        bc = settings.credentials_batch(config)
        _check_for_batch_aad(bc, 'remove pool monitors')
+    if (not all and util.is_none_or_empty(poolid) and
+            util.is_none_or_empty(fscluster)):
+        logger.error('no monitoring resources specified to remove')
+        return
+    if all and (util.is_not_empty(poolid) or util.is_not_empty(fscluster)):
+        raise ValueError(
+            'cannot specify --all with specific monitoring resources to '
+            'remove')
    storage.remove_resources_from_monitoring(
        table_client, config, all, poolid, fscluster)
@@ -256,3 +256,13 @@ def parse_secret_ids(client, config):
                    'invalid'.format(secid))
            settings.set_credentials_registry_password(
                config, reg, False, password)
+    # monitoring passwords
+    secid = settings.credentials_grafana_admin_password_secret_id(config)
+    if secid is not None:
+        logger.debug('fetching Grafana admin password from keyvault')
+        password = get_secret(client, secid)
+        if util.is_none_or_empty(password):
+            raise ValueError(
+                'Grafana admin password retrieved for secret id {} is '
+                'invalid'.format(secid))
+        settings.set_credentials_grafana_admin_password(config, password)
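The block above follows a fetch-and-replace pattern: read a secret id from the config, fetch the secret, reject empty values, and write the result back into the config. A minimal standalone sketch of that pattern (the `resolve_secret` helper and the lambda stand-ins for the KeyVault client are hypothetical, not Batch Shipyard code):

```python
def resolve_secret(config, get_secret, secret_id, setter, what):
    # If a keyvault secret id is configured, fetch the secret, reject
    # empty values, and write the result back into the config.
    if secret_id is None:
        return
    value = get_secret(secret_id)
    if not value:
        raise ValueError(
            '{} retrieved for secret id {} is invalid'.format(
                what, secret_id))
    setter(config, value)

config = {'credentials': {'monitoring': {'grafana': {'admin': {}}}}}
resolve_secret(
    config,
    get_secret=lambda sid: 's3cret',  # stands in for a real KeyVault client
    secret_id='https://myvault.vault.azure.net/secrets/grafana-pw',
    setter=lambda cfg, pw: cfg['credentials']['monitoring']['grafana'][
        'admin'].__setitem__('password', pw),
    what='Grafana admin password')
```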
@@ -4164,7 +4164,7 @@ def monitoring_prometheus_settings(config):
        conf = {}
        port = None
    else:
-        port = str(_kv_read(conf, 'port', default=9090))
+        port = str(_kv_read(conf, 'port'))
    return PrometheusMonitoringSettings(
        port=port,
        scrape_interval=_kv_read_checked(
@@ -4172,6 +4172,33 @@
    )


+def credentials_grafana_admin_password_secret_id(config):
+    # type: (dict) -> str
+    """Get Grafana admin password KeyVault Secret Id
+    :param dict config: configuration object
+    :rtype: str
+    :return: keyvault secret id
+    """
+    try:
+        secid = config[
+            'credentials']['monitoring']['grafana']['admin'][
+                'password_keyvault_secret_id']
+        if util.is_none_or_empty(secid):
+            raise KeyError()
+    except KeyError:
+        return None
+    return secid
+
+
+def set_credentials_grafana_admin_password(config, pw):
+    # type: (dict, str) -> None
+    """Set Grafana admin password
+    :param dict config: configuration object
+    :param str pw: password
+    """
+    config['credentials']['monitoring']['grafana']['admin']['password'] = pw
+
+
def monitoring_grafana_settings(config):
    # type: (dict) -> GrafanaMonitoringSettings
    """Get grafana monitoring settings
@@ -4183,10 +4210,20 @@ def monitoring_grafana_settings(config):
        conf = config['monitoring']['services']['grafana']
    except KeyError:
        conf = {}
-    admin = _kv_read_checked(conf, 'admin', default={})
+    try:
+        gaconf = config['credentials']['monitoring']['grafana']
+    except KeyError:
+        gaconf = {}
+    admin = _kv_read_checked(gaconf, 'admin', default={})
+    admin_user = _kv_read_checked(admin, 'username')
+    if util.is_none_or_empty(admin_user):
+        raise ValueError('Grafana admin user is invalid')
+    admin_password = _kv_read_checked(admin, 'password')
+    if util.is_none_or_empty(admin_password):
+        raise ValueError('Grafana admin password is invalid')
    return GrafanaMonitoringSettings(
-        admin_user=_kv_read_checked(admin, 'user', default='admin'),
-        admin_password=_kv_read_checked(admin, 'password', default='admin'),
+        admin_user=admin_user,
+        admin_password=admin_password,
        additional_dashboards=_kv_read_checked(conf, 'additional_dashboards'),
    )
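The settings functions above lean on helpers such as `_kv_read` and `_kv_read_checked`, whose implementations are not shown in this diff. The sketch below is therefore an assumption based on the call sites: a read-with-default helper plus the required-field checks that replace the old `default='admin'` behavior:

```python
def kv_read(conf, key, default=None):
    # Read a key from a config dict, falling back to a default; a rough
    # stand-in for the _kv_read/_kv_read_checked helpers referenced above.
    value = conf.get(key, default) if isinstance(conf, dict) else default
    return default if value is None else value

def grafana_admin_settings(config):
    # Mirror of the new flow: admin credentials come from the credentials
    # config, and both fields are required (no more 'admin'/'admin' default).
    try:
        gaconf = config['credentials']['monitoring']['grafana']
    except KeyError:
        gaconf = {}
    admin = kv_read(gaconf, 'admin', default={})
    user = kv_read(admin, 'username')
    if not user:
        raise ValueError('Grafana admin user is invalid')
    password = kv_read(admin, 'password')
    if not password:
        raise ValueError('Grafana admin password is invalid')
    return user, password

cfg = {'credentials': {'monitoring': {'grafana': {
    'admin': {'username': 'ops', 'password': 'pw'}}}}}
```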
@@ -57,6 +57,7 @@ class ConfigType(enum.Enum):
    Pool = 3,
    Jobs = 4,
    RemoteFS = 5,
+    Monitor = 6,


# global defines
@@ -82,6 +83,10 @@ _SCHEMAS = {
        'name': 'RemoteFS',
        'schema': pathlib.Path(_ROOT_PATH, 'schemas/fs.yaml'),
    },
+    ConfigType.Monitor: {
+        'name': 'Monitor',
+        'schema': pathlib.Path(_ROOT_PATH, 'schemas/monitor.yaml'),
+    },
}

# configure loggers
@@ -94,8 +94,8 @@ remove them with the following commands:
## <a name="ludicrous"></a>Ludicrous Speed Quickstart
Pre-jump checklist:

-* Fresh Linux machine with network access
-* `git` is installed
+* Linux, Mac or WSL machine with network access
+* `git` and Python3 are installed
* Comfortable with Linux commandline
* Have an active Azure subscription
* Understand how to use the Azure Portal
@@ -112,9 +112,9 @@ Execute jump:
git clone https://github.com/Azure/batch-shipyard.git
cd batch-shipyard
./install.sh
-nano recipes/TensorFlow-CPU/config/credentials.yaml
-# edit required properties in file and save
export SHIPYARD_CONFIGDIR=recipes/TensorFlow-CPU/config
+nano $SHIPYARD_CONFIGDIR/credentials.yaml
+# edit required properties in file and save
./shipyard pool add
./shipyard jobs add --tail stdout.txt
```
@@ -19,6 +19,10 @@ Batch Shipyard jobs and tasks configuration
   Batch Shipyard remote filesystem configuration. This configuration is
   entirely optional unless using the remote filesystem capabilities of
   Batch Shipyard.
+6. [Monitoring](16-batch-shipyard-configuration-monitor.md) -
+   Batch Shipyard resource monitoring configuration. This configuration is
+   entirely optional unless using the resource monitoring capabilities of
+   Batch Shipyard.

Note that all potential properties are described here and that specifying
all such properties may result in invalid configuration as some properties
@@ -99,6 +99,12 @@ credentials:
      filename: some/path/token.cache
    credentials_secret_id: https://<vault_name>.vault.azure.net/secrets/<secret_id>
    uri: https://<vault_name>.vault.azure.net/
+  monitoring:
+    grafana:
+      admin:
+        username: grafana_username
+        password: grafana_user_password
+        password_keyvault_secret_id: https://<vault_name>.vault.azure.net/secrets/<secret_id>
```

## Details
@@ -231,13 +237,15 @@ public repositories on Docker Hub or Singularity Hub. However, this is
required if pulling from authenticated private registries such as a secured
Azure Container Registry or private repositories on Docker Hub.
* (optional) `hub` defines the login property to Docker Hub. This is only
-  required for private repos on Docker Hub.
-    * (optional) `username` username to log in to Docker Hub
-    * (optional) `password` password associated with the username
-    * (optional) `password_keyvault_secret_id` property can be used to
-      reference an Azure KeyVault secret id. Batch Shipyard will contact the
-      specified KeyVault and replace the `password` value as returned by
-      Azure KeyVault.
+  required for private repos on Docker Hub.
+    * (required) `username` username to log in to Docker Hub
+    * (required unless `password_keyvault_secret_id` is specified)
+      `password` password associated with the username
+    * (required unless `password` is specified)
+      `password_keyvault_secret_id` property can be used to
+      reference an Azure KeyVault secret id. Batch Shipyard will contact
+      the specified KeyVault and replace the `password` value as returned
+      by Azure KeyVault.
* (optional) `myserver-myorg.azurecr.io` is an example property that
  defines a private container registry to connect to. This is an example to
  connect to the [Azure Container Registry service](https://azure.microsoft.com/services/container-registry/).
@@ -247,12 +255,14 @@ Azure Container Registry or private repositories on Docker Hub.
  `global_resources`:`additional_registries`:`docker`,
  `global_resources`:`additional_registries`:`singularity` in the global
  configuration.
-    * (optional) `username` username to log in to this registry
-    * (optional) `password` password associated with this username
-    * (optional) `password_keyvault_secret_id` property can be used to
-      reference an Azure KeyVault secret id. Batch Shipyard will contact the
-      specified KeyVault and replace the `password` value as returned by
-      Azure KeyVault.
+    * (required) `username` username to log in to this registry
+    * (required unless `password_keyvault_secret_id` is specified)
+      `password` password associated with the username
+    * (required unless `password` is specified)
+      `password_keyvault_secret_id` property can be used to
+      reference an Azure KeyVault secret id. Batch Shipyard will contact
+      the specified KeyVault and replace the `password` value as returned
+      by Azure KeyVault.

### Management: `management`
* (optional) The `management` property defines the required members for
@@ -284,6 +294,19 @@ Please refer to the
for more information regarding `*_keyvault_secret_id` properties and how
they are used for credential management with Azure KeyVault.

+### Resource Monitoring: `monitoring`
+* (optional) `grafana` configures the Grafana login for the resource
+  monitoring virtual machine
+    * (required) `admin` is the administrator login
+        * (required) `username` is the administrator login username
+        * (required unless `password_keyvault_secret_id` is specified)
+          `password` is the administrator login password
+        * (required unless `password` is specified)
+          `password_keyvault_secret_id` property can be used to
+          reference an Azure KeyVault secret id. Batch Shipyard will contact
+          the specified KeyVault and replace the `password` value as returned
+          by Azure KeyVault.
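The required-unless rules above reduce to a simple precedence: a configured `password_keyvault_secret_id` wins and is resolved through KeyVault, otherwise a literal `password` must be present. A hypothetical sketch of that rule (not Batch Shipyard code; `fetch_secret` stands in for a real KeyVault lookup):

```python
def resolve_admin_password(admin, fetch_secret):
    # If a keyvault secret id is specified, the fetched secret replaces
    # the password; otherwise a literal password is required.
    secid = admin.get('password_keyvault_secret_id')
    if secid:
        return fetch_secret(secid)
    password = admin.get('password')
    if not password:
        raise ValueError(
            'either password or password_keyvault_secret_id is required')
    return password

inline = resolve_admin_password(
    {'username': 'grafana_username', 'password': 'grafana_user_password'},
    fetch_secret=lambda sid: 'from-vault')
vaulted = resolve_admin_password(
    {'username': 'grafana_username',
     'password_keyvault_secret_id': 'https://v.vault.azure.net/secrets/x'},
    fetch_secret=lambda sid: 'from-vault')
```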

## <a name="non-public"></a>Non-Public Azure Regions
To connect to non-public Azure regions, you will need to ensure that your
credentials configuration is populated with the correct `authority_url` and
@@ -128,8 +128,7 @@ pool_specification:
      cadvisor:
        enabled: false
        port: 8080
-        options:
-        - -docker_only
+        options: []
```

The `pool_specification` property has the following members:
@@ -102,6 +102,11 @@ remote_fs:
      - p10-disk1b
      filesystem: btrfs
      raid_level: 0
+    prometheus:
+      node_exporter:
+        enabled: false
+        port: 9100
+        options: []
```

## Details
@@ -370,13 +375,32 @@ The number of entries in this map must match the `vm_count`.
    to expand the number of disks in the array in the future, you must
    use `btrfs` as the filesystem. At least two disks per virtual
    machine are required for RAID-0.
+* (optional) `prometheus` properties control whether collectors for metrics
+  to export to [Prometheus](https://prometheus.io/) monitoring are enabled.
+  Note that no exporters have their ports exposed to the internet by
+  default. This means that the Prometheus instance itself must reside
+  on, or be peered with, the virtual network that the storage cluster is
+  in. This ensures that external parties cannot scrape exporter metrics
+  from storage cluster VMs.
+    * (optional) `node_exporter` contains options for the
+      [Node Exporter](https://github.com/prometheus/node_exporter) metrics
+      exporter.
+        * (optional) `enabled` property enables or disables this exporter.
+          Default is `false`.
+        * (optional) `port` is the port for Prometheus to connect to and
+          scrape. This is the internal port on the storage cluster VM.
+        * (optional) `options` is a list of options to pass to the
+          node exporter instance running on all nodes. The following
+          collectors are force disabled, in addition to others disabled by
+          default: textfile, wifi, xfs, zfs. The nfs collector is enabled
+          automatically if the file server is NFS.
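As a rough sketch of how these settings could translate into a Node Exporter invocation: the flag names below follow the real `node_exporter` CLI, but the assembly logic is an assumption for illustration, not Batch Shipyard's actual provisioning code:

```python
def node_exporter_cmdline(conf):
    # Build an illustrative node_exporter command line from the
    # prometheus.node_exporter settings described above.
    if not conf.get('enabled', False):
        return None
    args = ['node_exporter',
            '--web.listen-address=:{}'.format(conf.get('port', 9100))]
    # collectors force disabled per the documentation above
    for collector in ('textfile', 'wifi', 'xfs', 'zfs'):
        args.append('--no-collector.{}'.format(collector))
    # user-supplied options are passed through verbatim
    args.extend(conf.get('options', []))
    return args

cmd = node_exporter_cmdline({'enabled': True, 'port': 9100, 'options': []})
```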

## Remote Filesystems with Batch Shipyard Guide
Please see the [full guide](65-batch-shipyard-remote-fs.md) for information
on how this feature works in Batch Shipyard.

## Full template
-A full template of a credentials file can be found
+A full template of a RemoteFS configuration file can be found
[here](https://github.com/Azure/batch-shipyard/tree/master/config_templates).
Note that these templates cannot be used as-is and must be modified to fit
your scenario.
@@ -0,0 +1,187 @@
# Batch Shipyard Resource Monitoring Configuration
This page contains in-depth details on how to configure the resource
monitoring configuration file for Batch Shipyard.

## Schema
The monitoring schema is as follows:

```yaml
monitoring:
  location: <Azure region, e.g., eastus>
  resource_group: my-prom-server-rg
  hostname_prefix: prom
  ssh:
    username: shipyard
    ssh_public_key: /path/to/rsa/publickey.pub
    ssh_public_key_data: ssh-rsa ...
    ssh_private_key: /path/to/rsa/privatekey
    generated_file_export_path: null
  public_ip:
    enabled: true
    static: false
  virtual_network:
    name: myvnet
    resource_group: my-vnet-resource-group
    existing_ok: false
    address_space: 10.0.0.0/16
    subnet:
      name: my-server-subnet
      address_prefix: 10.0.0.0/24
  network_security:
    ssh:
    - '*'
    grafana:
    - 1.2.3.0/24
    - 2.3.4.5
    prometheus:
    - 2.3.4.5
  vm_size: STANDARD_D2_V2
  accelerated_networking: false
  services:
    resource_polling_interval: 15
    lets_encrypt:
      enabled: true
      use_staging_environment: true
    prometheus:
      port: 9090
      scrape_interval: 10s
    grafana:
      additional_dashboards: null
```

The `monitoring` property has the following members:

* (required) `location` is the Azure region name for the resources, e.g.,
  `eastus` or `northeurope`. The `location` specified must match the same
  region as your Azure Batch account if monitoring compute pools and/or be
  within the same region if monitoring storage clusters.
* (required) `resource_group` this is the resource group to use for the
  monitoring resource.
* (required) `hostname_prefix` is the DNS label prefix to apply to each
  virtual machine and resource allocated for the monitoring resource. It
  should be unique.
* (required) `ssh` is the SSH admin user to create on the machine. This is
  not optional in this configuration as it is in the pool specification. If
  you are running Batch Shipyard on Windows, please refer to
  [these instructions](85-batch-shipyard-ssh-docker-tunnel.md#ssh-keygen)
  on how to generate an SSH keypair for use with Batch Shipyard.
    * (required) `username` is the admin user to create on all virtual
      machines
    * (optional) `ssh_public_key` is the path to a pre-existing ssh public
      key to use. If this is not specified, an RSA public/private key pair
      will be generated for use in your current working directory (with a
      non-colliding name for auto-generated SSH keys for compute pools,
      i.e., `id_rsa_shipyard_remotefs`). On Windows only, if this option is
      not specified, the SSH keys are not auto-generated (unless
      `ssh-keygen.exe` can be invoked in the current working directory or
      is in `%PATH%`). This option cannot be specified with
      `ssh_public_key_data`.
    * (optional) `ssh_public_key_data` is the raw RSA public key data in
      OpenSSH format, e.g., a string starting with `ssh-rsa ...`. Only one
      key may be specified. This option cannot be specified with
      `ssh_public_key`.
    * (optional) `ssh_private_key` is the path to an existing SSH private
      key to use against either `ssh_public_key` or `ssh_public_key_data`
      for connecting to storage nodes and performing operations that
      require SSH such as cluster resize and detail status. This option
      should only be specified if either `ssh_public_key` or
      `ssh_public_key_data` are specified.
    * (optional) `generated_file_export_path` is an optional path to
      specify for where to create the RSA public/private key pair.
* (optional) `public_ip` are public IP properties for the virtual machine.
    * (optional) `enabled` designates if public IPs should be assigned. The
      default is `true`. Note that if public IP is disabled, then you must
      create an alternate means for accessing the resource monitor virtual
      machine through a "jumpbox" on the virtual network. If this property
      is set to `false` (disabled), then any action requiring SSH, or the
      SSH command itself, will occur against the private IP address of the
      virtual machine.
    * (optional) `static` is to specify if static public IPs should be
      assigned to each virtual machine allocated. The default is `false`,
      which results in dynamic public IP addresses. A "static" FQDN will be
      provided per virtual machine, regardless of this setting, if public
      IPs are enabled.
* (required) `virtual_network` is the virtual network to use for the
  resource monitor.
    * (required) `name` is the virtual network name
    * (optional) `resource_group` is the resource group for the virtual
      network. If this is not specified, the resource group name falls back
      to the resource group specified in the resource monitor.
    * (optional) `existing_ok` allows use of a pre-existing virtual
      network. The default is `false`.
    * (required if creating, optional otherwise) `address_space` is the
      allowed address space for the virtual network.
    * (required) `subnet` specifies the subnet properties. This subnet
      should be exclusive to the resource monitor and cannot be shared with
      other resources, including Batch compute nodes. Batch compute nodes
      and storage clusters can co-exist on the same virtual network, but
      should be in separate subnets.
        * (required) `name` is the subnet name.
        * (required) `address_prefix` is the subnet address prefix to use
          for allocation of the resource monitor virtual machine to.
* (required) `network_security` defines the network security rules to apply
  to the resource monitoring virtual machine.
    * (required) `ssh` is the rule for which address prefixes to allow for
      connecting to sshd port 22 on the virtual machine. In the example,
      `"*"` allows any IP address to connect. This is an array property
      which allows multiple address prefixes to be specified.
    * (optional) `grafana` rule allows the Grafana HTTPS (443) server port
      to be exposed to the specified address prefix. Multiple address
      prefixes can be specified.
    * (optional) `prometheus` rule allows the Prometheus server port to be
      exposed to the specified address prefix. Multiple address prefixes
      can be specified.
    * (optional) `custom_inbound_rules` are custom inbound rules for other
      services that you need to expose.
        * (required) `<rule name>` is the name of the rule; the example
          uses `myrule`. Each rule name should be unique.
            * (required) `destination_port_range` is the ports on each
              virtual machine that will be exposed. This can be a single
              port and should be a string.
            * (required) `source_address_prefix` is an array of address
              prefixes to allow.
            * (required) `protocol` is the protocol to allow. Valid values
              are `tcp`, `udp` and `*` (which means any protocol).
* (required) `vm_size` is the virtual machine instance size to use.
* (optional) `accelerated_networking` enables or disables
  [accelerated networking](https://docs.microsoft.com/azure/virtual-network/create-vm-accelerated-networking-cli).
  The default is `false` if not specified.
* (required) `services` defines the behavior of the services that run on
  the monitoring resource virtual machine.
    * (optional) `resource_polling_interval` is the polling interval in
      seconds for monitored resource discovery. The default is `15`
      seconds.
    * (optional) `lets_encrypt` defines options for enabling
      [Let's Encrypt](https://letsencrypt.org/) on the
      [nginx](https://www.nginx.com/) reverse proxy for TLS encryption.
      This can only be enabled if `public_ip` is enabled.
        * (required) `enabled` controls if Let's Encrypt is enabled or not.
          The default is `true`.
        * (optional) `use_staging_environment` forces the certificate
          request to happen against Let's Encrypt's staging servers.
          Although this will enable encryption over HTTP, since the CA is
          fake, warnings will appear with most browsers when attempting to
          connect to the service endpoints on the resource monitoring VM.
          This is useful to ensure your configuration is correct before
          switching to a production certificate. The default is `true`.
    * (optional) `prometheus` configures the Prometheus server endpoint on
      the resource monitoring VM. Note that it is not required to define
      this section. If it is omitted, then the Prometheus server is not
      exposed.
        * (optional) `port` is the port to use. If this value is omitted,
          the Prometheus server is not exposed.
        * (optional) `scrape_interval` is the collector scrape interval to
          use. The default is `10s`. Note that valid values are Prometheus
          [duration strings](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cduration%3E).
    * (optional) `grafana` configures the Grafana endpoint on the resource
      monitoring VM.
        * (optional) `additional_dashboards` is a dictionary of additional
          Grafana dashboards to provision. The format of the dictionary is
          `filename.json: URL`. For example,
          `my_custom_dash.json: https://some.url`.
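For illustration, a simple single-unit Prometheus duration string such as the `10s` default for `scrape_interval` can be parsed as follows (Prometheus itself also accepts compound forms like `1h30m`, which this sketch does not handle):

```python
import re

# single-unit Prometheus durations: <number><unit>
_DURATION_RE = re.compile(r'^(\d+)(ms|s|m|h|d|w|y)$')
_UNIT_SECONDS = {'ms': 0.001, 's': 1, 'm': 60, 'h': 3600,
                 'd': 86400, 'w': 604800, 'y': 31536000}

def duration_to_seconds(value):
    """Convert a simple Prometheus duration string such as '10s' or '1m'
    to seconds; raise ValueError for anything that does not match."""
    m = _DURATION_RE.match(value)
    if m is None:
        raise ValueError('invalid duration: {}'.format(value))
    return int(m.group(1)) * _UNIT_SECONDS[m.group(2)]
```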

## Resource Monitoring with Batch Shipyard Guide
Please see the [full guide](66-batch-shipyard-resource-monitoring.md) for
information on how this feature works in Batch Shipyard.

## Full template
A full template of a resource monitoring configuration file can be found
[here](https://github.com/Azure/batch-shipyard/tree/master/config_templates).
Note that these templates cannot be used as-is and must be modified to fit
your scenario.
@@ -20,9 +20,9 @@ you can invoke as:
shipyard.cmd
```

-If you installed manually (i.e., did not use the installer scripts), then
-you will need to invoke the Python interpreter and pass the script as an
-argument. For example:
+If you installed manually (i.e., took the non-recommended installation path
+and did not use the installer scripts), then you will need to invoke the
+Python interpreter and pass the script as an argument. For example:
```
python3 shipyard.py
```
@@ -55,6 +55,8 @@ shipyard <command> <subcommand> <options>
For instance:
```shell
shipyard pool add --configdir config
+# or equivalent in Linux for this particular command
+SHIPYARD_CONFIGDIR=config shipyard pool add
```
Would create a pool on the Batch account as specified in the config files
found in the `config` directory. Please note that `<options>` must be
@@ -90,6 +92,7 @@ These options must be specified after the command and sub-command. These are:
  --fs TEXT                       RemoteFS config file
  --pool TEXT                     Pool config file
  --jobs TEXT                     Jobs config file
+  --monitor TEXT                  Resource monitoring config file
  --subscription-id TEXT          Azure Subscription ID
  --keyvault-uri TEXT             Azure KeyVault URI
  --keyvault-credentials-secret-id TEXT
@@ -148,6 +151,8 @@ current working directory (i.e., `.`).
* `--jobs path/to/jobs.yaml` is required for job-related actions.
* `--fs path/to/fs.yaml` is required for fs-related actions and some pool
  actions.
+* `--monitor path/to/monitor.yaml` is required for resource monitoring
+  actions.
* `--subscription-id` is the Azure Subscription Id associated with the
  Batch account or Remote file system resources. This is only required for
  creating pools with a virtual network specification or with `fs` commands.
@@ -183,6 +188,7 @@ instead:
* `SHIPYARD_POOL_CONF` in lieu of `--pool`
* `SHIPYARD_JOBS_CONF` in lieu of `--jobs`
* `SHIPYARD_FS_CONF` in lieu of `--fs`
+* `SHIPYARD_MONITOR_CONF` in lieu of `--monitor`
* `SHIPYARD_SUBSCRIPTION_ID` in lieu of `--subscription-id`
* `SHIPYARD_KEYVAULT_URI` in lieu of `--keyvault-uri`
* `SHIPYARD_KEYVAULT_CREDENTIALS_SECRET_ID` in lieu of
@ -198,8 +204,7 @@ instead:
|
|||
* `SHIPYARD_AAD_CERT_THUMBPRINT` in lieu of `--aad-cert-thumbprint`
|
||||
|
||||
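The option/environment-variable pairing above follows the usual precedence rule for CLI tools: an explicitly passed option wins over its environment fallback. The following is an illustrative sketch of that resolution logic (it is not Batch Shipyard's actual implementation; `resolve_conf` is a hypothetical helper):

```python
import os

def resolve_conf(cli_value, env_name):
    """Return the CLI option value if given, else the environment fallback."""
    if cli_value is not None:
        return cli_value
    return os.environ.get(env_name)

# the environment variable acts only as a fallback
os.environ["SHIPYARD_POOL_CONF"] = "config/pool.yaml"
print(resolve_conf("other/pool.yaml", "SHIPYARD_POOL_CONF"))  # other/pool.yaml
print(resolve_conf(None, "SHIPYARD_POOL_CONF"))               # config/pool.yaml
```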
## Commands
-`shipyard` (and `shipyard.py`) script contains the following top-level
-commands:
+`shipyard` has the following top-level commands:
```
account     Batch account actions
cert        Certificate actions
@@ -209,6 +214,7 @@ commands:
jobs        Jobs actions
keyvault    KeyVault actions
misc        Miscellaneous actions
+monitor     Monitoring actions
pool        Pool actions
storage     Storage actions
```
@@ -384,8 +390,8 @@ storage cluster to perform actions against.
      its subnets
    * `--generate-from-prefix` will attempt to generate all resource names
      using conventions used. This is helpful when there was an issue with
-     cluster deletion and the original virtual machine(s) resources can no
-     longer by enumerated. Note that OS disks and data disks cannot be
+     cluster creation/deletion and the original virtual machine(s) resources
+     cannot be enumerated. Note that OS disks and data disks cannot be
      deleted with this option. Please use `fs disks del` to delete disks
      that may have been used in the storage cluster.
    * `--no-wait` does not wait for deletion completion. It is not recommended
@@ -580,6 +586,55 @@ or has run the specified task
attempt to find a suitable TensorFlow image from Docker images in the
global resource list or will acquire one on demand for this command.

## `monitor` Command
The `monitor` command has the following sub-commands:
```
add      Add a resource to monitor
create   Create a monitoring resource
destroy  Destroy a monitoring resource
list     List all monitored resources
remove   Remove a resource from monitoring
ssh      Interactively login via SSH to monitoring...
start    Starts a previously suspended monitoring...
suspend  Suspend a monitoring resource
```

* `add` will add a resource to monitor to an existing monitoring VM
    * `--poolid` will add the specified Batch pool to monitor
    * `--remote-fs` will add the specified RemoteFS cluster to monitor
* `create` will create a monitoring resource VM
* `destroy` will destroy a monitoring resource VM
    * `--delete-resource-group` will delete the entire resource group that
      contains the monitoring resource. Please take care when using this
      option as any resource in the resource group is deleted, which may
      include other resources that are not Batch Shipyard related.
    * `--delete-virtual-network` will delete the virtual network and all of
      its subnets
    * `--generate-from-prefix` will attempt to generate all resource names
      using conventions used. This is helpful when there was an issue with
      monitoring creation/deletion and the original virtual machine resources
      cannot be enumerated. Note that OS disks cannot be deleted with this
      option. Please use an alternate means (i.e., the Azure Portal) to
      delete disks that may have been used by the monitoring VM.
    * `--no-wait` does not wait for deletion completion. It is not recommended
      to use this parameter.
* `list` will list all monitored resources
* `remove` will remove a monitored resource from an existing monitoring VM
    * `--all` will remove all resources that are currently monitored
    * `--poolid` will remove the specified Batch pool from monitoring
    * `--remote-fs` will remove the specified RemoteFS cluster from monitoring
* `ssh` will interactively log into the monitoring VM via SSH.
    * `COMMAND` is an optional argument to specify the command to run. If your
      command has switches, preface `COMMAND` with a double dash as per POSIX
      convention, e.g., `monitor ssh -- sudo docker ps -a`.
    * `--tty` allocates a pseudo-terminal
* `start` will start a previously suspended monitoring VM
    * `--no-wait` does not wait for the restart to complete. It is not
      recommended to use this parameter.
* `suspend` suspends a monitoring VM
    * `--no-wait` does not wait for the suspension to complete. It is not
      recommended to use this parameter.

## `pool` Command
The `pool` command has the following sub-commands:
```

@@ -242,3 +242,8 @@ following:
`batch.node.ubuntu 16.04` as the `node_agent` value. You can view a
complete list of supported node agent sku ids with the `pool listskus`
command.

+### ARM Image Retention Requirements
+Ensure that the ARM image exists for the lifetimes of any pool referencing
+the custom image. Failure to do so can result in pool allocation failures
+and/or resize failures.
@@ -65,6 +65,13 @@ to fit the desired number of target dedicated and low priority compute nodes.
Note that this calculation does not consider autoscale where the number of
nodes can exceed the specified targets.

+### Forced Tunneling and User-Defined Routes
+If you are redirecting Internet-bound traffic from the subnet back to
+on-premises, then you may have to add
+[user-defined routes](https://docs.microsoft.com/azure/virtual-network/virtual-networks-udr-overview)
+to that subnet. Please follow the instructions in this
+[document](https://docs.microsoft.com/azure/batch/batch-virtual-network#user-defined-routes-for-forced-tunneling).

## Network Security
Azure provides a resource called a Network Security Group that allows you
to define security rules to restrict inbound and outbound network traffic
@@ -69,7 +69,7 @@ fault domains of the GlusterFS servers
* Automatic volume mounting of remote filesystems into a Docker container
  executed through Batch Shipyard

-## Overview and Mental Model
+## Mental Model
A Batch Shipyard provisioned remote filesystem is built on top of different
resources in Azure. These resources are from networking, storage and
compute. To more readily explain the concepts that form a Batch Shipyard
@@ -174,10 +174,6 @@ explanation of each remote filesystem and storage cluster configuration
option. Please see [this page](20-batch-shipyard-usage.md) for documentation
on `fs` command usage.

-You can find information regarding User Subscription Batch accounts and how
-to create them at this
-[blog post](https://docs.microsoft.com/azure/batch/batch-account-create-portal#user-subscription-mode).

## Sample Recipes
Sample recipes for RemoteFS storage clusters of NFS and GlusterFS types can
be found in the
@@ -0,0 +1,260 @@
# Resource Monitoring with Batch Shipyard
The focus of this article is to explain how to provision a resource monitor
for monitoring Batch pools and RemoteFS clusters.

<img src="https://azurebatchshipyard.blob.core.windows.net/github/66-container_metrics.png" alt="dashboard" width="1024" />

## Overview
For many scenarios, it is often desirable to have visibility into a set of
machines to gain insights through certain metrics over time. A global
monitoring resource is valuable to peer into per-machine and aggregate
metrics for Batch processing workloads as jobs are processed for measurements
such as CPU, memory and network usage. As Batch Shipyard's execution model
is based on containers, insights into container behavior are also desirable
in addition to host-level metrics.

Creating a monitoring system that can monitor ephemeral resources such
as Batch nodes that may autoscale up or down at any moment and across
disparate resources such as Batch pools and RemoteFS clusters can be
challenging. Securing these resources adds additional complexity.
Fortunately, Batch Shipyard has commands that can help set up such monitoring
resources quickly.

## Major Features
* Supports monitoring Azure Batch Pools and Batch Shipyard provisioned
  storage clusters
* Automatic service discovery of compute nodes and RemoteFS VMs capable of
  adding and removing monitored resources even through Batch pool
  autoscale/resize and storage cluster resizes
* Automated installs of all required collectors and services on supported
  resources, including Batch pools and RemoteFS VMs
* Fully automated setup of nginx reverse proxy to Grafana (and optionally
  the Prometheus server) with automatic provisioning of Let's Encrypt TLS
  certificates for encrypted HTTP access
* Automatic setup of network security rules for exposed services
* Rich default dashboard for monitoring Batch Shipyard resources
  out-of-the-box
* Support for monitoring resource VM suspension (deallocation) and restart
* Support for accelerated networking, boot diagnostics and serial console
  access
* Automatic SSH keypair provisioning and setup

## Mental Model
A Batch Shipyard provisioned monitoring resource is built on top of different
resources in Azure. To more readily explain the concepts that form a Batch
Shipyard monitoring resource, let's start with a high-level conceptual
layout of all of the components and possible interacting actors.

```
+-------------+ +------------------------+
| | | |
| Azure Batch | | Azure Resource Manager |
| | | |
+---------^---+ +----^-------------------+
| |
| |
+-------------------------------------------------------------------------------------+
| | | |
| |-----------------------------------------------------| |
| | | | | |
| | --------------------------------------------------- | |
| | | | | | | +---------------------+ |
+---------+ | | | +-----------+ | MSI | MSI | | | +-----------------+ | |
| | | | | | | | | | | | | | | |
| Let's | | | | | Let's | +-+-----------+--+ | | | | Batch Shipyard | | |
| Encrypt <----------+ Encrypt | | | | | | | RemoteFS VM Y | | |
| CA | | | | | TLS Certs | | Batch Shipyard | | | | | | | |
| | | | | | | | Heimdall | | | | +---------------+ | | |
+---------+ | | | +----+------+ | | +------------> Node Exporter | | | |
| | | | +-------+--------+ | | | | +---------------+ | | |
| | | | | | | | | | | | |
| | | +-----v--+ | | | | | +------------+ | | |
| | | | | | | | | | | Private IP | | | |
| | | | nginx | +-----------+ | Automated | | | | | 10.2.0.4 | | | |
| | | | | | | | Service | | | | +------------+----+ | |
| | | +------+ | | Grafana | | Discovery | | | | Subnet C | |
+---------+ | | | | Port +-----> | | | | | | 10.2.0.0/24 | |
| +---------> 443 | | +--------+--+ | | | | +---------------------+ |
| Web | | | | +------+ | | | | | | |
| Browser | | | | | Port | | +--v-------v-----+ | | | +---------------------+ |
| +---------> 9090 +-----------> | | | | | +-----------------+ | |
+---------+ | | | +------+ | | Prometheus +--------+ | | | | | | |
| | | | | | | | | | | Azure Batch | | |
| | | | | | | | | | Compute Node X | | |
| | | | | | | | | | |
| | | | | | | +---------------+ | | |
| | +-----------+------------+ | | | +----> Node Exporter | | | |
| | | Public IP | Private IP | +-----------------------+ | +----------+----+ | | |
| | | 1.2.3.4 | 10.0.0.4 | | | +----> cAdvisor | | | |
| | +-----------+------------+------------------------+ | | +----------+ | | |
| | Subnet A | | | | | |
| | 10.0.0.0/24 | | +------------+ | | |
| +-----------------------------------------------------+ | | Private IP | | | |
| | | 10.1.0.4 | | | |
| | +------------+----+ | |
| | Subnet B | |
| Virtual Network | 10.1.0.0/24 | |
| 10.0.0.0/8 +---------------------+ |
+-------------------------------------------------------------------------------------+
```

The base layer for all of the resources within a monitoring resource is
an Azure Virtual Network. This virtual network can be shared
amongst other network-level resources such as network interfaces. The virtual
network can be "partitioned" into sub-address spaces through the use of
subnets. In the example above, we have three subnets where
`Subnet A 10.0.0.0/24` hosts the resource monitor,
`Subnet B 10.1.0.0/24` contains a pool of Azure Batch compute nodes to
monitor, and `Subnet C 10.2.0.0/24` contains a Batch Shipyard RemoteFS
cluster to monitor. No resource in `Subnet B` or `Subnet C` is strictly
required for the Batch Shipyard monitoring resource to work, although you
will want either one or the other at the minimum so you have some resource
to monitor.

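The subnet layout above can be sanity-checked programmatically: each subnet must fall within the virtual network's address space and must not overlap its siblings. A small sketch using Python's standard `ipaddress` module, with the address spaces taken from the conceptual layout:

```python
import ipaddress

# address spaces from the conceptual layout above
vnet = ipaddress.ip_network("10.0.0.0/8")
subnets = {
    "Subnet A (resource monitor)": ipaddress.ip_network("10.0.0.0/24"),
    "Subnet B (Batch pool)": ipaddress.ip_network("10.1.0.0/24"),
    "Subnet C (RemoteFS cluster)": ipaddress.ip_network("10.2.0.0/24"),
}
for name, net in subnets.items():
    # every subnet must be contained within the virtual network address space
    assert net.subnet_of(vnet), name

# subnets must not overlap one another
nets = list(subnets.values())
assert all(not a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])
print("subnet layout is valid")
```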
When provisioning Batch pools or RemoteFS storage clusters, you are able
to specify `prometheus`-compatible collectors to install. If configured,
Batch Shipyard takes care of installing these packages on the resources,
which are then immediately ready to be scraped by the Prometheus server.

When the resource monitor virtual machine is created, the bootstrap
process automatically contacts the Let's Encrypt CA to provision TLS
certificates for nginx. Nginx is configured to reverse proxy requests to
Grafana over the standard HTTPS port (443) and, optionally, to the Prometheus
server on the specified port. Grafana is automatically provisioned with
the correct data source and a rich default dashboard for monitoring Batch
Shipyard resources. Internally, a Batch Shipyard process runs alongside
Grafana and the Prometheus server to enumerate any resources that have
been specified to monitor. The "Batch Shipyard Heimdall" container
encapsulates this functionality by either querying the Azure Batch service
or Azure Resource Manager endpoints for the requested resources to monitor.
No sensitive credentials are passed to the resource monitoring virtual
machine. Instead, Batch Shipyard Heimdall uses Azure MSI to authenticate
with Azure Active Directory with least user privilege (LUP) to enumerate the
specified resources to monitor. This information is then used to populate
Prometheus service discovery. Once the Prometheus server begins to scrape
metrics, this data is available for visualization in Grafana.
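As a rough illustration of the service-discovery mechanism described above, Prometheus supports file-based service discovery, where a sidecar process writes target lists to files that the server re-reads on an interval. A minimal hypothetical `prometheus.yml` fragment of this style (job name, paths, and interval are assumptions for illustration, not Batch Shipyard's actual configuration):

```yaml
scrape_configs:
  - job_name: batch-shipyard
    file_sd_configs:
      - files:
          # hypothetical path populated by a discovery sidecar
          - /var/lib/prometheus/sd/*.json
        refresh_interval: 30s
```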

## Configuration
In order to enable resource monitoring, a few configuration changes
must be made. You must enable a resource or set of resources to be
monitored and then create the monitoring resource.

### Monitored Resource Configuration
Batch pools and RemoteFS storage clusters can be monitored. Below explains
the configuration required to enable each.

#### Pool Configuration
The following is a sample snippet for a Batch pool to be monitored. Note that
this configuration must be applied prior to creation.

```yaml
pool_specification:
  # ... other settings
  virtual_network:
    # virtual network settings must be set
  prometheus:
    node_exporter:
      enabled: true
    cadvisor:
      enabled: true
```

A `virtual_network` must be specified so the resource monitor can connect
to the compute nodes in the Batch pool. Please see the
[virtual network guide](64-batch-shipyard-byovnet.md) for more information.

The `prometheus` section enables the Prometheus-compatible collectors to
be automatically installed and configured. For Batch pools, two collectors
are available:

1. [Node Exporter](https://github.com/prometheus/node_exporter)
2. [cAdvisor](https://github.com/google/cadvisor)

It is recommended to enable both of these collectors if utilizing
resource monitoring with Batch pool targets. Other `prometheus` options and
more information can be found in the
[Pool configuration doc](13-batch-shipyard-configuration-pool.md).

#### RemoteFS Configuration
The following is a sample snippet for a RemoteFS storage cluster to be
monitored. Note that this configuration must be applied prior to creation.

```yaml
remote_fs:
  # ... other settings
  virtual_network:
    # virtual network settings must be set
  prometheus:
    node_exporter:
      enabled: true
```

The `prometheus` section enables the Prometheus-compatible collectors to
be automatically installed and configured. Only the
[Node Exporter](https://github.com/prometheus/node_exporter) collector is
currently available for RemoteFS clusters. Other `prometheus` options and
more information can be found in the
[RemoteFS configuration doc](15-batch-shipyard-configuration-fs.md).

### Resource Monitor Configuration
The resource monitoring virtual machine requires configuration to provision.

#### Credentials Configuration
Specifying the Grafana admin credentials is required in the credentials
configuration. Below is a sample:

```yaml
credentials:
  # ... other settings
  monitoring:
    grafana:
      admin:
        username: admin
        password: admin
```

Note that you can also use a KeyVault secret id for the `password` or store
the credentials entirely within KeyVault. Please see the
[credentials](11-batch-shipyard-configuration-credentials.md) configuration
guide for more information.

#### Monitor Configuration
The resource monitor must be configured according to the
[monitor configuration doc](16-batch-shipyard-configuration-monitor.md).
Please refer to that guide for a full explanation of each monitoring
configuration option.

## Usage Documentation
The workflow for standing up a monitoring resource is creation followed by
adding the applicable resources to monitor. Below is an example, assuming
monitoring has been properly configured as per the prior section guidance.

```shell
# create a resource monitor
shipyard monitor create
# note the FQDN emitted in the log at the end of the provisioning process

# create a Batch pool where work is to be performed
# this hypothetical pool id is mybatchpool
shipyard pool add

# add the Batch pool above as a resource to monitor
shipyard monitor add --poolid mybatchpool
```

After the monitor is added, you can point your web browser at the
monitoring resource FQDN emitted above. You can remove individual
monitored resources with the command `shipyard monitor remove`.
Once you have no need for your monitoring resource, you can either suspend
it or remove it altogether.

```shell
# remove the prior Batch pool monitor
shipyard monitor remove --poolid mybatchpool

# destroy the monitoring resource entirely
shipyard monitor destroy
```

Please see [this page](20-batch-shipyard-usage.md) for in-depth documentation
on `monitor` command usage.

@@ -63,3 +63,7 @@ underlying VM and host drivers.
* Adding tasks to the same job across multiple, concurrent Batch Shipyard
  invocations may result in failure if task ids for these jobs are
  auto-generated.

+### Monitoring Limitations
+* Only Linux Batch pools and RemoteFS clusters can be monitored. Windows
+  Batch pools are not supported.
@@ -18,6 +18,7 @@ pages:
  - Pool: 13-batch-shipyard-configuration-pool.md
  - Jobs: 14-batch-shipyard-configuration-jobs.md
  - RemoteFS: 15-batch-shipyard-configuration-fs.md
+ - Monitoring: 16-batch-shipyard-configuration-monitor.md
  - CLI Commands and Usage: 20-batch-shipyard-usage.md
  - Platform Image support: 25-batch-shipyard-platform-image-support.md
  - In-Depth Feature Guides:

@@ -27,6 +28,7 @@ pages:
  - Custom Images for Host Compute Nodes: 63-batch-shipyard-custom-images.md
  - Virtual Networks: 64-batch-shipyard-byovnet.md
  - Remote Filesystems: 65-batch-shipyard-remote-fs.md
+ - Resource Monitoring: 66-batch-shipyard-resource-monitoring.md
  - Data Movement: 70-batch-shipyard-data-movement.md
  - Azure KeyVault for Credential Management: 74-batch-shipyard-azure-keyvault.md
  - Credential Encryption: 75-batch-shipyard-credential-encryption.md
@@ -194,139 +194,3 @@ mapping:
          path:
            type: str
            required: true

  monitoring:
    type: map
    mapping:
      location:
        type: str
        required: true
      resource_group:
        type: str
        required: true
      hostname_prefix:
        type: str
        required: true
      ssh:
        type: map
        required: true
        mapping:
          username:
            type: str
            required: true
          ssh_public_key:
            type: str
          ssh_public_key_data:
            type: str
          ssh_private_key:
            type: str
          generated_file_export_path:
            type: str
      public_ip:
        type: map
        mapping:
          enabled:
            type: bool
          static:
            type: bool
      virtual_network:
        type: map
        required: true
        mapping:
          name:
            type: str
            required: true
          resource_group:
            type: str
          existing_ok:
            type: bool
          address_space:
            type: str
          subnet:
            type: map
            mapping:
              name:
                type: str
                required: true
              address_prefix:
                type: str
                required: true
      network_security:
        type: map
        required: true
        mapping:
          ssh:
            type: seq
            required: true
            sequence:
              - type: str
          grafana:
            type: seq
            required: true
            sequence:
              - type: str
          prometheus:
            type: seq
            sequence:
              - type: str
          custom_inbound_rules:
            type: map
            mapping:
              regex;([a-zA-Z0-9]+):
                type: map
                mapping:
                  destination_port_range:
                    type: str
                    required: true
                  protocol:
                    type: str
                    enum: ['*', 'tcp', 'udp']
                  source_address_prefix:
                    type: seq
                    required: true
                    sequence:
                      - type: str
      vm_size:
        type: str
        required: true
      accelerated_networking:
        type: bool
      services:
        type: map
        mapping:
          resource_polling_interval:
            type: int
          lets_encrypt:
            type: map
            mapping:
              enabled:
                type: bool
                required: true
              use_staging_environment:
                type: bool
          prometheus:
            type: map
            mapping:
              port:
                type: int
                required: true
              scrape_interval:
                type: str
          grafana:
            type: map
            mapping:
              admin:
                type: map
                mapping:
                  user:
                    type: str
                    required: true
                  password:
                    type: str
                    required: true
              additional_dashboards:
                type: map
                mapping:
                  regex;([a-zA-Z0-9]+\.json):
                    type: str
                    required: true
@@ -205,3 +205,20 @@ mapping:
            type: str
          uri:
            type: str
  monitoring:
    type: map
    mapping:
      grafana:
        type: map
        mapping:
          admin:
            type: map
            required: true
            mapping:
              username:
                type: str
                required: true
              password:
                type: str
              password_keyvault_secret_id:
                type: str
@@ -0,0 +1,130 @@
desc: Monitoring Configuration Schema

type: map
mapping:
  monitoring:
    type: map
    mapping:
      location:
        type: str
        required: true
      resource_group:
        type: str
        required: true
      hostname_prefix:
        type: str
        required: true
      ssh:
        type: map
        required: true
        mapping:
          username:
            type: str
            required: true
          ssh_public_key:
            type: str
          ssh_public_key_data:
            type: str
          ssh_private_key:
            type: str
          generated_file_export_path:
            type: str
      public_ip:
        type: map
        mapping:
          enabled:
            type: bool
          static:
            type: bool
      virtual_network:
        type: map
        required: true
        mapping:
          name:
            type: str
            required: true
          resource_group:
            type: str
          existing_ok:
            type: bool
          address_space:
            type: str
          subnet:
            type: map
            mapping:
              name:
                type: str
                required: true
              address_prefix:
                type: str
                required: true
      network_security:
        type: map
        required: true
        mapping:
          ssh:
            type: seq
            required: true
            sequence:
              - type: str
          grafana:
            type: seq
            required: true
            sequence:
              - type: str
          prometheus:
            type: seq
            sequence:
              - type: str
          custom_inbound_rules:
            type: map
            mapping:
              regex;([a-zA-Z0-9]+):
                type: map
                mapping:
                  destination_port_range:
                    type: str
                    required: true
                  protocol:
                    type: str
                    enum: ['*', 'tcp', 'udp']
                  source_address_prefix:
                    type: seq
                    required: true
                    sequence:
                      - type: str
      vm_size:
        type: str
        required: true
      accelerated_networking:
        type: bool
      services:
        type: map
        required: true
        mapping:
          resource_polling_interval:
            type: int
          lets_encrypt:
            type: map
            mapping:
              enabled:
                type: bool
                required: true
              use_staging_environment:
                type: bool
          prometheus:
            type: map
            mapping:
              port:
                type: int
              scrape_interval:
                type: str
          grafana:
            type: map
            mapping:
              additional_dashboards:
                type: map
                mapping:
                  regex;([a-zA-Z0-9]+\.json):
                    type: str
                    required: true
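
To make the schema above concrete, the following is a hypothetical minimal monitor configuration instance that would satisfy it (all values — names, region, VM size, and address prefixes — are illustrative assumptions, not defaults):

```yaml
monitoring:
  location: eastus            # hypothetical Azure region
  resource_group: my-monitor-rg
  hostname_prefix: shipyardmon
  ssh:
    username: shipyard
  virtual_network:
    name: myvnet
    address_space: 10.0.0.0/8
    subnet:
      name: monitor-subnet
      address_prefix: 10.0.0.0/24
  network_security:
    ssh:
      - '*'
    grafana:
      - '*'
  vm_size: STANDARD_D2_V2     # hypothetical VM size
  services:
    lets_encrypt:
      enabled: true
```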
78
shipyard.py

@@ -61,8 +61,11 @@ class CliContext(object):
        self.yes = False
        self.raw = None
        self.config = None
        self.conf_config = None
        self.conf_pool = None
        self.conf_jobs = None
        self.conf_fs = None
+       self.conf_monitor = None
        # clients
        self.batch_mgmt_client = None
        self.batch_client = None
@@ -122,7 +125,8 @@ class CliContext(object):
        self._set_global_cli_options()
        self._init_keyvault_client()
        self._init_config(
-           skip_global_config=False, skip_pool_config=True, fs_storage=True)
+           skip_global_config=False, skip_pool_config=True,
+           skip_monitor_config=True, fs_storage=True)
        _, self.resource_client, self.compute_client, self.network_client, \
            self.storage_mgmt_client, _, _ = \
            convoy.clients.create_all_clients(self)
@@ -130,8 +134,7 @@ class CliContext(object):
        convoy.fleet.fetch_storage_account_keys_from_aad(
            self.storage_mgmt_client, self.config, fs_storage=True)
        self.blob_client, _ = convoy.clients.create_storage_clients()
-       self._cleanup_after_initialize(
-           skip_global_config=False, skip_pool_config=True)
+       self._cleanup_after_initialize()

    def initialize_for_monitor(self):
        # type: (CliContext) -> None
@@ -142,7 +145,8 @@ class CliContext(object):
        self._set_global_cli_options()
        self._init_keyvault_client()
        self._init_config(
-           skip_global_config=False, skip_pool_config=True, fs_storage=True)
+           skip_global_config=False, skip_pool_config=True,
+           skip_monitor_config=False, fs_storage=True)
        self.auth_client, self.resource_client, self.compute_client, \
            self.network_client, self.storage_mgmt_client, _, _ = \
            convoy.clients.create_all_clients(self)
@@ -151,8 +155,7 @@ class CliContext(object):
            self.storage_mgmt_client, self.config, fs_storage=True)
        self.blob_client, self.table_client = \
            convoy.clients.create_storage_clients()
-       self._cleanup_after_initialize(
-           skip_global_config=False, skip_pool_config=True)
+       self._cleanup_after_initialize()

    def initialize_for_keyvault(self):
        # type: (CliContext) -> None
@@ -163,9 +166,9 @@ class CliContext(object):
        self._set_global_cli_options()
        self._init_keyvault_client()
        self._init_config(
-           skip_global_config=True, skip_pool_config=True, fs_storage=False)
-       self._cleanup_after_initialize(
-           skip_global_config=True, skip_pool_config=True)
+           skip_global_config=True, skip_pool_config=True,
+           skip_monitor_config=True, fs_storage=False)
+       self._cleanup_after_initialize()

    def initialize_for_batch(self):
        # type: (CliContext) -> None
|
|||
self._set_global_cli_options()
|
||||
self._init_keyvault_client()
|
||||
self._init_config(
|
||||
skip_global_config=False, skip_pool_config=False, fs_storage=False)
|
||||
skip_global_config=False, skip_pool_config=False,
|
||||
skip_monitor_config=True, fs_storage=False)
|
||||
_, self.resource_client, self.compute_client, self.network_client, \
|
||||
self.storage_mgmt_client, self.batch_mgmt_client, \
|
||||
self.batch_client = \
|
||||
|
@@ -186,8 +190,7 @@ class CliContext(object):
            self.storage_mgmt_client, self.config, fs_storage=False)
        self.blob_client, self.table_client = \
            convoy.clients.create_storage_clients()
-       self._cleanup_after_initialize(
-           skip_global_config=False, skip_pool_config=False)
+       self._cleanup_after_initialize()

    def initialize_for_storage(self):
        # type: (CliContext) -> None
@@ -198,7 +201,8 @@ class CliContext(object):
        self._set_global_cli_options()
        self._init_keyvault_client()
        self._init_config(
-           skip_global_config=False, skip_pool_config=False, fs_storage=False)
+           skip_global_config=False, skip_pool_config=False,
+           skip_monitor_config=True, fs_storage=False)
        # inject storage account keys if via aad
        _, _, _, _, self.storage_mgmt_client, _, _ = \
            convoy.clients.create_all_clients(self)
@ -206,8 +210,7 @@ class CliContext(object):
|
|||
self.storage_mgmt_client, self.config, fs_storage=False)
|
||||
self.blob_client, self.table_client = \
|
||||
convoy.clients.create_storage_clients()
|
||||
self._cleanup_after_initialize(
|
||||
skip_global_config=False, skip_pool_config=False)
|
||||
self._cleanup_after_initialize()
|
||||
|
||||
def _set_global_cli_options(self):
|
||||
# type: (CliContext) -> None
|
||||
|
@ -224,22 +227,18 @@ class CliContext(object):
|
|||
if self.verbose:
|
||||
convoy.util.set_verbose_logger_handlers()
|
||||
|
||||
def _cleanup_after_initialize(
|
||||
self, skip_global_config, skip_pool_config):
|
||||
def _cleanup_after_initialize(self):
|
||||
# type: (CliContext) -> None
|
||||
"""Cleanup after initialize_for_* funcs
|
||||
:param CliContext self: this
|
||||
:param bool skip_global_config: skip global config
|
||||
:param bool skip_pool_config: skip pool config
|
||||
"""
|
||||
# free conf objects
|
||||
del self.conf_credentials
|
||||
del self.conf_fs
|
||||
if not skip_global_config:
|
||||
del self.conf_config
|
||||
if not skip_pool_config:
|
||||
del self.conf_pool
|
||||
del self.conf_jobs
|
||||
del self.conf_config
|
||||
del self.conf_pool
|
||||
del self.conf_jobs
|
||||
del self.conf_monitor
|
||||
# free cli options
|
||||
del self.verbose
|
||||
del self.yes
|
||||
|
@ -312,12 +311,13 @@ class CliContext(object):
|
|||
|
||||
def _init_config(
|
||||
self, skip_global_config=False, skip_pool_config=False,
|
||||
fs_storage=False):
|
||||
# type: (CliContext, bool, bool, bool) -> None
|
||||
skip_monitor_config=True, fs_storage=False):
|
||||
# type: (CliContext, bool, bool, bool, bool) -> None
|
||||
"""Initializes configuration of the context
|
||||
:param CliContext self: this
|
||||
:param bool skip_global_config: skip global config
|
||||
:param bool skip_pool_config: skip pool config
|
||||
:param bool skip_monitor_config: skip monitoring config
|
||||
:param bool fs_storage: adjust storage settings for fs
|
||||
"""
|
||||
# reset config
|
||||
|
@ -357,6 +357,16 @@ class CliContext(object):
|
|||
self.conf_fs = CliContext.ensure_pathlib_conf(self.conf_fs)
|
||||
convoy.validator.validate_config(
|
||||
convoy.validator.ConfigType.RemoteFS, self.conf_fs)
|
||||
# set/validate monitoring config
|
||||
if not skip_monitor_config:
|
||||
self.conf_monitor = self._form_conf_path(
|
||||
self.conf_monitor, 'monitor')
|
||||
if self.conf_monitor is None:
|
||||
raise ValueError('monitor conf file was not specified')
|
||||
self.conf_monitor = CliContext.ensure_pathlib_conf(
|
||||
self.conf_monitor)
|
||||
convoy.validator.validate_config(
|
||||
convoy.validator.ConfigType.Monitor, self.conf_monitor)
|
||||
# fetch credentials from keyvault, if conf file is missing
|
||||
kvcreds = None
|
||||
if self.conf_credentials is None or not self.conf_credentials.exists():
|
||||
|
@ -405,6 +415,8 @@ class CliContext(object):
|
|||
self.conf_jobs = CliContext.ensure_pathlib_conf(self.conf_jobs)
|
||||
if self.conf_jobs.exists():
|
||||
self._read_config_file(self.conf_jobs)
|
||||
if not skip_monitor_config:
|
||||
self._read_config_file(self.conf_monitor)
|
||||
# adjust settings
|
||||
convoy.fleet.initialize_globals(convoy.settings.verbose(self.config))
|
||||
if not skip_global_config:
|
||||
|
@ -728,6 +740,19 @@ def fs_option(f):
|
|||
callback=callback)(f)
|
||||
|
||||
|
||||
def monitor_option(f):
|
||||
def callback(ctx, param, value):
|
||||
clictx = ctx.ensure_object(CliContext)
|
||||
clictx.conf_monitor = value
|
||||
return value
|
||||
return click.option(
|
||||
'--monitor',
|
||||
expose_value=False,
|
||||
envvar='SHIPYARD_MONITOR_CONF',
|
||||
help='Resource monitoring config file',
|
||||
callback=callback)(f)
|
||||
|
||||
|
||||
def _storage_cluster_id_argument(f):
|
||||
def callback(ctx, param, value):
|
||||
return value
|
||||
|
@ -787,6 +812,7 @@ def fs_cluster_options(f):
|
|||
|
||||
|
||||
def monitor_options(f):
|
||||
f = monitor_option(f)
|
||||
f = _azure_subscription_id_option(f)
|
||||
return f
|
||||
|
||||
|
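The new `monitor_option`/`monitor_options` helpers follow the decorator-factory pattern used throughout this module: each `*_option` helper attaches a single click option to a command function, and a `*_options` wrapper composes several of them. A minimal self-contained sketch of that composition pattern, with no click dependency and illustrative names (the real helpers wrap `click.option` and stash the value on a `CliContext` via the callback):

```python
# Sketch of the decorator-factory composition used by monitor_option and
# monitor_options above. Option names/envvars below are illustrative only.

def make_option(name, envvar):
    """Return a decorator that records one CLI option on the wrapped function."""
    def decorator(f):
        opts = getattr(f, '_options', [])
        opts.append((name, envvar))
        f._options = opts
        return f
    return decorator

def monitor_option(f):
    # analogous to the click-based helper in the diff above
    return make_option('--monitor', 'SHIPYARD_MONITOR_CONF')(f)

def azure_subscription_id_option(f):
    # hypothetical envvar name, for illustration
    return make_option('--subscription-id', 'SHIPYARD_SUBSCRIPTION_ID')(f)

def monitor_options(f):
    # compose single-option decorators, mirroring monitor_options in the diff
    f = monitor_option(f)
    f = azure_subscription_id_option(f)
    return f

@monitor_options
def monitor_add(**kwargs):
    """Stand-in for a CLI command that needs the monitoring config."""
    return kwargs

print([name for name, _ in monitor_add._options])
# → ['--monitor', '--subscription-id']
```

Because each helper both registers an option and returns the function, arbitrary commands can opt in to the monitoring config with a single `@monitor_options` decoration.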