diff --git a/CHANGELOG.md b/CHANGELOG.md
index b1472a999e..357362fc5d 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -21,6 +21,10 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
 
 - Fixed issue with recovery of large ledger entries (#3986).
 
+### Documentation
+
+- The "Node Output" page has been relabelled as "Troubleshooting" in the documentation and CLI commands for troubleshooting have been added to it.
+
 ## [3.0.0-dev0]
 
 ### Added
diff --git a/doc/operations/index.rst b/doc/operations/index.rst
index 642591573f..def73f93f9 100644
--- a/doc/operations/index.rst
+++ b/doc/operations/index.rst
@@ -68,10 +68,10 @@ This section describes how :term:`Operators` manage the different nodes constitu
 
     ---
 
-    :fa:`file-alt` :doc:`node_output`
+    :fa:`wrench` :doc:`troubleshooting`
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-    Monitor node health and events using logs.
+    Troubleshooting tips for unexpected events.
 
     ---
 
@@ -99,6 +99,6 @@ This section describes how :term:`Operators` manage the different nodes constitu
     certificates
     recovery
     network
-    node_output
+    troubleshooting
     resource_usage
     operator_rpc_api
diff --git a/doc/operations/node_output.rst b/doc/operations/node_output.rst
deleted file mode 100644
index 7ebbb34373..0000000000
--- a/doc/operations/node_output.rst
+++ /dev/null
@@ -1,24 +0,0 @@
-Node Output
-===========
-
-By default node output is written to ``stdout`` and to ``stderr`` and can be handled accordingly.
-
-There is an option to generate machine-readable logs for monitoring. To enable this, set the ``logging.format`` configuration entry to ``"Json"``. The generated logs will be in JSON format as displayed below:
-
-.. code-block:: json
-
-    {
-        "e_ts": "2019-09-02T14:47:24.589386Z",
-        "file": "../src/consensus/aft/raft.h",
-        "h_ts": "2019-09-02T14:47:24.589384Z",
-        "level": "info",
-        "msg": "Deserialising signature at 24\n",
-        "number": 651
-    }
-
-- ``e_ts`` is the ISO 8601 UTC timestamp of the log if logged inside the enclave (field will be missing if line was logged on the host side)
-- ``h_ts`` is the ISO 8601 UTC timestamp of the log when logged on the host side
-- ``file`` is the file the log originated from
-- ``number`` is the line number in the file the log originated from
-- ``level`` is the level of the log message [info, debug, trace, fail, fatal]
-- ``msg`` is the log message
\ No newline at end of file
diff --git a/doc/operations/troubleshooting.rst b/doc/operations/troubleshooting.rst
new file mode 100644
index 0000000000..8782f129aa
--- /dev/null
+++ b/doc/operations/troubleshooting.rst
@@ -0,0 +1,89 @@
+Troubleshooting CCF
+===================
+
+This page contains troubleshooting tips for CCF.
+
+Tips for interacting with CCF to diagnose issues
+------------------------------------------------
+.. note:: In the examples below this documentation uses ``example-ccf-domain.com`` as an example CCF domain; you will need to replace that with your own CCF domain when using these commands. You will also need to add authentication parameters such as ``--cacert`` to the curl commands; see :doc:`Issuing commands ` for an example.
+
+.. note:: CCF may be deployed with a load balancer which may cache the node which last responded to a query from an IP address. Until the cache clears, the load balancer will direct any subsequent queries from that IP address to the same node. As an example, if the cache clears after one minute, then in order to get a response from a different node, an operator must wait one minute between queries.
+
+Below are descriptions of CLI commands and how they are useful for diagnosing CCF issues:
+
+**“What node is handling my requests?”**
+
+.. code-block:: bash
+
+    curl https://example-ccf-domain.com/node/network/nodes/self -i
+
+This is useful to identify which node is handling queries. The node ID can be found in the ``location`` header as shown in the example command output below:
+
+.. code-block:: bash
+
+    HTTP/1.1 308 Permanent Redirect
+    content-length: 0
+    location: https://example-ccf-domain/node/network/nodes/
+
+**“What CCF version is running?”**
+
+.. code-block:: bash
+
+    curl https://example-ccf-domain.com/node/version
+
+This is useful to confirm the version that is running.
+
+**“What nodes are part of the current network?”**
+
+.. code-block:: bash
+
+    curl https://example-ccf-domain.com/node/network/nodes
+
+This will show information for all nodes in the network. In a healthy network all nodes will show ``"status": "Trusted"``, and only one node will show ``"primary": true``. This is the healthy state of the network.
+Around upgrades/restarts/migrations nodes will transition through unhealthy states temporarily. If the network remains in an unhealthy state for a long time, this indicates there is an issue.
+
+You can obtain this information for a single node by querying the :http:GET:`/node/network/nodes/{node_id}` endpoint, where ``{node_id}`` can be obtained from the :http:GET:`/node/network/nodes/self` endpoint described above. Take note of the ``node_data`` field in the response which contains useful correlation IDs.
+
+**“Is the network in the middle of a reconfiguration?”**
+
+.. code-block:: bash
+
+    curl https://example-ccf-domain.com/node/consensus
+
+This has a few bits of data that might help us diagnose a partitioned/faulty network. In particular, most of the time there should be a single entry in the ``configs`` list. During an upgrade/restart/migration, there may be multiple values. If multiple values persist for a long time, it suggests something went wrong during the reconfiguration.
+
+**“Is the CCF network stable?”**
+
+.. code-block:: bash
+
+    curl https://example-ccf-domain.com/node/commit
+
+This is a good endpoint to query to check if the CCF service is reachable. Additionally, a large and increasing difference between the ``View`` in the :term:`Transaction ID` in this response, and the ``current_view`` from the :http:GET:`/node/consensus` response, indicates a partitioned node. For example, if the response from :http:GET:`/node/commit` shows the ``View`` is ``15``, and the response from :http:GET:`/node/consensus` states the ``current_view`` is ``78967`` and that number is constantly increasing, then this indicates the node is unable to make consensus progress, which likely indicates it is unable to contact other nodes.
+
+.. tip:: See :ccf_repo:`tests/infra/health_watcher.py` for a detailed technical example of how the health of the network can be monitored.
+
+
+Node Output
+===========
+
+By default node output is written to ``stdout`` and to ``stderr`` and can be handled accordingly.
+
+There is an option to generate machine-readable logs for monitoring. To enable this, set the ``logging.format`` configuration entry to ``"Json"``. The generated logs will be in JSON format as displayed below:
+
+.. code-block:: json
+
+    {
+        "e_ts": "2019-09-02T14:47:24.589386Z",
+        "file": "../src/consensus/aft/raft.h",
+        "h_ts": "2019-09-02T14:47:24.589384Z",
+        "level": "info",
+        "msg": "Deserialising signature at 24\n",
+        "number": 651
+    }
+
+- ``e_ts`` is the ISO 8601 UTC timestamp of the log if logged inside the enclave (field will be missing if line was logged on the host side)
+- ``h_ts`` is the ISO 8601 UTC timestamp of the log when logged on the host side
+- ``file`` is the file the log originated from
+- ``number`` is the line number in the file the log originated from
+- ``level`` is the level of the log message [info, debug, trace, fail, fatal]
+- ``msg`` is the log message
\ No newline at end of file
diff --git a/doc/overview/glossary.rst b/doc/overview/glossary.rst
index 8136b1457c..d9ac2ec957 100644
--- a/doc/overview/glossary.rst
+++ b/doc/overview/glossary.rst
@@ -85,7 +85,7 @@ Glossary
     `Transport Layer Security `_ is an IETF cryptographic protocol standard designed to secure communications between a client and a server over a computer network.
 
   Transaction ID
-    Unique transaction identifier in CCF, composed of a View and a Sequence Number. Sequence Numbers start from 1, and are contiguous. Views are monotonic.
+    Unique transaction identifier in CCF, composed of a View and a Sequence Number separated by a period. Sequence Numbers start from 1, and are contiguous. Views are monotonic. For example, the transaction ID ``2.15`` indicates the View is ``2`` and the Sequence Number is ``15``.
 
   Users
     Directly interact with the application running in CCF. Their public identity should be voted in by members before they are allowed to issue requests.
diff --git a/livehtml.sh b/livehtml.sh
index 884bf178e3..a1ed25a42b 100755
--- a/livehtml.sh
+++ b/livehtml.sh
@@ -4,6 +4,16 @@
 
 set -e
 
+echo "Generate version.py if it doesn't already exist"
+if [ ! -f "python/version.py" ]
+    then
+        mkdir tmp_build
+        cd tmp_build
+        cmake -L -GNinja ..
+        cd ..
+        rm -rf tmp_build
+fi
+
 echo "Setting up Python environment..."
 if [ ! -f "env/bin/activate" ]
     then