iotedge/doc/troubleshoot-checks.md

12 KiB

Built-in troubleshooting functionality

The built-in troubleshooting functionality in the iotedge CLI, "iotedge check", performs configuration and connectivity checks for commonly encountered issues.

iotedge help check displays detailed usage information.

Scope

The troubleshooting tool is focused on

  • Surfacing potential problems that prevent the edge device from connecting to the cloud/upstream.

  • Surfacing potential configuration deviations from recommended production best-practices.

By design, it does not check for errors in the edge workload deployment. For example, it does not check that the device can access any private container registries, errors in module create options, etc. Deployment validation is best performed in the facility where it is authored.

Checks that would involve parsing IoT Edge module logs or metrics are also out of scope.

Result types

Results from checks are characterized as either errors or warnings.

Errors have a high likelihood of preventing the IoT Edge runtime or the modules from connecting to the cloud/upstream.

Warnings might not affect immediate connectivity but are potential deviations from best practices, and may affect long term stability, offline operation or supportability of the edge device.

If there are warnings but no errors, the tool will exit successfully with code 0. Use --warnings-as-errors to treat warnings as errors.

Configuration checks details

config.yaml is well-formed

This check validates that IoT Edge's config.yaml is valid and free of any syntax (e.g. whitespace) errors.

If the check fails with an error, the line number and position reported in the error may not be the exact location of the problem.

config.yaml has well-formed connection string

If the config.yaml uses manual provisioning with a connection string, this check validates that the connection string is well-formed and contains the required Hostname, DeviceId and SharedAccessKey parameters.

container engine is installed and functional

This check validates that a container engine is installed and running, and is accessible at the endpoint specified in the moby_runtime.uri field.

host OS is supported

If the device is running Windows and set to use Windows containers, this check validates that the Windows version is supported.

While the Windows installer script prevents installing on an unsupported OS version, it is possible to install on a supported OS version that then gets updated to a newer version that isn't supported.

config.yaml has correct hostname

This check validates that the value of the hostname field in the config.yaml is the same as the device's actual hostname, or that it's a fully-qualified domain name with the device hostname as the first component.

It also validates that the value complies with RFC 1035, since some modules and downstream devices have difficulty connecting to a domain name that doesn't comply with that RFC.

If the hostname if longer than 64 characters it issues a warning. Hostname longer than 64 charaters cannot be used as local issuer in certificates.

config.yaml has correct parent hostname

This check validates if the parent hostname exist. Parent hostname is only used when the IoT Edge device is nested.

If it exists: It validates that the value complies with RFC 1035, since some modules and downstream devices have difficulty connecting to a domain name that doesn't comply with that RFC. It validates that parent hostname is not longer than 64 characters.

Resolve parent hostname inside container

When in nested configuration, this check validates that parent hostname can be resolved fom inside a container. The extra hosts property added to edge Agent are added to the diagnostic image for name resolution.

config.yaml has correct URIs for daemon mgmt endpoint

This check validates that the value of the connect.management_uri field in the config.yaml is valid, and that the IoT Edge daemon's management endpoint can be queried through it.

latest security daemon

This check validates that the version of the IoT Edge daemon is the same as the value specified in https://aka.ms/latest-iotedge-stable

You can override the expected version using the --expected-aziot-edged-version switch, in which case the tool will not query that URL.

Note that the tool does not validate the versions of the Edge Agent and Edge Hub modules.

host time is close to real time

This check validates that the device's local time is close to the time reported by an NTP server. pool.ntp.org:123 is used by default, and can be overridden with the --ntp-server parameter.

When in nested configuration pool.ntp.org:123 might not be available and IoTedge will connect to a parent IoTedge and not to IoT Hub. The time is the checked directly against the parent IoT edge.

container time is close to host time

This check validates that a container sees a local time that is close to the host device's local time.

DNS server (warning)

This check validates that a DNS server has been specified in the container engine's daemon.json file. DNS best practices are documented at https://aka.ms/iotedge-prod-checklist-dns

It is possible to specify a DNS server in the Edge device's deployment instead of in the container engine's daemon.json, and the tool does not detect this. If you have done so, you should ignore this warning.

IPv6 network configuration

This check validates that if IPv6 container network configuration is enabled in config.yaml (by setting the value of moby_runtime.network.ipv6 field to true), the container engine's daemon.json file also has IPv6 support enabled. To enable IPv6 support for the container runtime, please refer to this guide https://aka.ms/iotedge-docker-ipv6.

IPv6 container runtime network configuration is currently not supported for the Windows operating system and this check fails if IPv6 support is enabled in the container enginer's daemon.json file.

production readiness: certificates (warning)

This check validates that device CA and trusted CA certificates have been defined in the certificates section of the config.yaml. If these certificates are not specified, the device operates in quickstart mode and is not supported in production. Certificate management best practices are documented at https://aka.ms/iotedge-prod-checklist-certs

production readiness: certificates expiry

This check validates that the device CA certificate is valid for at least seven more days.

If the certificate has already expired, it is reported as an error. If the certificate will expire in less than seven days, it is reported as a warning.

production readiness: container engine (warning)

This check validates that the container engine is the Moby container engine. Any other container engine, such as Docker CE, is not supported in production. See https://aka.ms/iotedge-prod-checklist-moby for details.

EdgeAgent module can be pulled from upstream

Try to download edge agent image using image name specified in config.yaml

production readiness: logs policy (warning)

This check validates that the container engine is configured to rotate module logs, by specifying log options and limits in the container engine's daemon.json. Log management best practices are documented at https://aka.ms/iotedge-prod-checklist-logs

By setting these properties in daemon.json, the settings are automatically propagated to all module containers. It is also possible to specify this in the Edge device's deployment instead, and the tool does not detect this. If you have done so, you should ignore this warning.

production readiness: Edge Agent's / Edge Hub's storage directory is persisted on the host filesystem

The tool checks the Edge Agent and Edge Hub containers to validate that their respective storage directories are mounted from the host. If this is not done, it is possible that some state is lost if the containers are deleted or updated, such as Edge Agent's cache of module state or Edge Hub's unsent messages.

These checks require the Edge Agent and Edge Hub containers to have been created.

Connectivity check details

Note: When in nested configuration, tests try to connect to parent instead of IoThub.

host can connect to and perform TLS handshake with DPS endpoint

If the device is set up to use DPS provisioning, the tool connects to the DPS endpoint and completes a TLS handshake with it.

host can connect to and perform TLS handshake with IoT Hub/Upstream AMQP / HTTPS / MQTT port

The tool connects to the IoT Hub/upstream's AMQP port (5671), HTTPS port (443) and MQTT port (8883), and completes a TLS handshake for each. This verifies that the IoT Hub/upstream is reachable from the device, and that the device is configured to accept its TLS certificate.

For nested edge scenario, the FQDN of the upstream is taken from parent hostname. When using manual provisioning, the FQDN of the IoT Hub is taken from the connection string. For DPS provisioning, you must specify the FQDN of the IoT Hub using the --iothub-hostname parameter.

The IoT Edge daemon only uses the HTTPS protocol to connect to the IoT Hub/upstream, but connectivity from the host for the AMQP and MQTT protocols can be useful when investigating issues.

container on the default network can connect to IoT Hub AMQP / HTTPS / MQTT port

The tool launches a diagnostics container on the default (bridge) container network. This container connects to the IoT Hub/upstream's AMQP port (5671), HTTPS port (443) and MQTT port (8883). This verifies that the IoT Hub is reachable from containers on the default container network.

For nested edge scenario, the FQDN of the upstream is taken from parent hostname. When using manual provisioning, the FQDN of the IoT Hub is taken from the connection string. For DPS provisioning, you must specify the FQDN of the IoT Hub using the --iothub-hostname parameter.

Note that these checks do not perform a TLS handshake with the IoT Hub/upstream. They only test that a TCP connection can be established to the respective port.

Note that these checks do not run for Windows containers since they are redundant with the following checks.

container on the IoT Edge module network can connect to IoT Hub AMQP / HTTPS / MQTT port

The tool launches a diagnostics container on the IoT Edge container network specified by the moby_runtime.network field (defaults to azure-iot-edge on Linux and nat on Windows). This container connects to the IoT Hub/upstream's AMQP port (5671), HTTPS port (443) and MQTT port (8883). This verifies that the IoT Hub is reachable from containers on the IoT Edge container network.

For nested edge scenario, the FQDN of the upstream is taken from parent hostname. When using manual provisioning, the FQDN of the IoT Hub is taken from the connection string. For DPS provisioning, you must specify the FQDN of the IoT Hub using the --iothub-hostname parameter.

Note that these checks do not perform a TLS handshake with the IoT Hub. They only test that a TCP connection can be established to the respective port.

Edge Hub can bind to ports on host

Edge Hub can bind to ports on the host so that it can be used as a gateway for leaf devices. For example, the default createOptions for Edge Hub set it to bind to ports 443, 5671 and 8883. If any of these ports are already in use on the host device by other services, the Edge Hub container will be unable to start up. The tool validates that Edge Hub is already running (in which case it has successfully bound to any ports it wanted to bind to), or that the ports are available for it to bind to when it does start.

On a new device, the IoT Edge daemon doesn't try to start the Edge Hub container until a deployment is applied to that device. Until then, this check will return an error because the tool can only detect which ports to test for if the IoT Edge daemon has tried to start the Edge Hub container at least once.