Lars Kuhtz edited this page Dec 8, 2019 · 2 revisions

Failure handling during initialization

Components of a chainweb-node should check during initialization for failure conditions that would prevent a node from performing its task. If a node detects such a condition it should,

  1. try to fix the issue, and if that isn't possible,
  2. emit an error log that describes the problem and, where possible, provides hints on how to resolve the issue, and
  3. throw an exception, which will cause the node to terminate.
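
The three steps above can be sketched as follows. This is a minimal illustration, not code from chainweb-node: the names InitCheck, InitFailure, runInitCheck, logError, and demoInit are hypothetical, and logError merely writes to stderr as a stand-in for the node's real logger.

```haskell
import Control.Exception (Exception, throwIO, try)
import System.IO (hPutStrLn, stderr)

-- | Hypothetical exception type for a failed initialization check.
newtype InitFailure = InitFailure String
    deriving (Show)

instance Exception InitFailure

-- | Hypothetical description of a single initialization check.
data InitCheck = InitCheck
    { checkName :: String   -- ^ name used in log messages
    , checkOk :: IO Bool    -- ^ True if the component can do its job
    , checkFix :: IO Bool   -- ^ attempt a repair; True if it succeeded
    , checkHint :: String   -- ^ hint for the operator
    }

-- | Stand-in for the node's logger (assumption: the real logger differs).
logError :: String -> IO ()
logError msg = hPutStrLn stderr ("[Error] " <> msg)

-- | Step 1: try to fix the issue. Step 2: emit an error log with a hint.
-- Step 3: throw an exception, which terminates the node if left uncaught.
runInitCheck :: InitCheck -> IO ()
runInitCheck c = do
    ok <- checkOk c
    if ok
        then pure ()
        else do
            fixed <- checkFix c
            if fixed
                then pure ()
                else do
                    logError (checkName c <> " failed. Hint: " <> checkHint c)
                    throwIO (InitFailure (checkName c))

-- | Usage example: the first check can be repaired, the second cannot,
-- so 'try' observes the 'InitFailure' that would otherwise end the process.
demoInit :: IO (Either InitFailure ())
demoInit = try $ mapM_ runInitCheck
    [ InitCheck "log directory" (pure False) (pure True) "create the directory"
    , InitCheck "disk space" (pure False) (pure False) "free up disk space"
    ]
```

In a real node the exception would not be caught as in demoInit; it would propagate to the top level so that the runtime prints it to stderr.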

The exception message will show up on stderr of the node. On systems with systemd, the exception message will be recorded in the journal. On the testnet nodes, journalctl -b -u chainweb can be used to view the journal.

In addition, any failure in the logging system itself and any log messages that are emitted before the logging system is initialized are written to stderr and therefore also show up in the journal.

Failure handling during operations

Once the node is initialized and API servers and the P2P clients are started, components should try hard to avoid failing. Components should

  1. catch all synchronous exceptions,
  2. emit an error log or warning log that describes the problem, including possible actions that must be taken to address the issue,
  3. restart the component, subject to backoff or throttling logic as needed.

Most components do this by being wrapped in runForever or runForeverThrottled from Chainweb.Utils.

Components must not catch asynchronous exceptions that don't originate from the component itself. The functions catchSynchronous and catchAllSynchronous (and their variants) from Chainweb.Utils can be used to catch synchronous exceptions but ignore asynchronous exceptions.
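
The following sketch illustrates the general idea of a restart loop that handles only synchronous exceptions. It is not the implementation found in Chainweb.Utils: trySynchronousSketch, runForeverSketch, and demoRestarts are hypothetical names, the exponential backoff with a fixed cap is an assumed policy, and the classification is done by exception type (anything wrapped in SomeAsyncException is rethrown).

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception
    ( AsyncException (UserInterrupt)
    , SomeAsyncException (..)
    , SomeException
    , fromException
    , throwIO
    , try
    )
import Data.IORef (atomicModifyIORef', newIORef, readIORef)
import System.IO (hPutStrLn, stderr)

-- | Run an action and report synchronous exceptions as 'Left'.
-- Exceptions wrapped in 'SomeAsyncException' are rethrown untouched.
trySynchronousSketch :: IO a -> IO (Either SomeException a)
trySynchronousSketch action = do
    r <- try action
    case r of
        Left e
            | Just (SomeAsyncException _) <- fromException e -> throwIO e
            | otherwise -> pure (Left e)
        Right a -> pure (Right a)

-- | Restart a component whenever it exits or fails synchronously.
-- The delay (in microseconds) doubles after each restart, up to a cap.
runForeverSketch :: (String -> IO ()) -> String -> Int -> IO () -> IO a
runForeverSketch logWarn name initialDelay action = go initialDelay
  where
    maxDelay = 10000000 -- assumed cap of 10 seconds
    go delay = do
        r <- trySynchronousSketch action
        case r of
            Left e -> logWarn (name <> " failed: " <> show e <> "; restarting")
            Right () -> logWarn (name <> " exited unexpectedly; restarting")
        threadDelay delay
        go (min maxDelay (2 * delay))

-- | Usage example: the component fails synchronously twice and is
-- restarted; on the third run it throws an exception of an asynchronous
-- type, which escapes the loop. Returns the number of runs.
demoRestarts :: IO Int
demoRestarts = do
    attempts <- newIORef (0 :: Int)
    let component = do
            n <- atomicModifyIORef' attempts (\k -> (k + 1, k + 1))
            if n < 3
                then throwIO (userError "transient failure")
                else throwIO UserInterrupt
    r <- try (runForeverSketch (hPutStrLn stderr) "demo" 1000 component)
    case r of
        Left UserInterrupt -> readIORef attempts
        Left e -> throwIO e
        Right () -> readIORef attempts
```

The loop never returns normally, which is why its result type is left polymorphic; the only way out is an exception that the handler refuses to catch.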

Exceptions that terminate the node

There are a few fatal conditions that a node can't recover from by itself. In those cases an asynchronous exception should be thrown that terminates the node.

One example of such a condition is when the node receives a KILL signal from the environment. Another example is when a kill-switch triggers.

Here is an example of how an asynchronous exception can be defined:

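import Control.Exception
    (Exception(..), asyncExceptionFromException, asyncExceptionToException)
import qualified Data.Text as T

newtype KillSwitch = KillSwitch T.Text
instance Show KillSwitch where
    show (KillSwitch t) = "kill switch triggered: " <> T.unpack t
-- Routing both conversions through SomeAsyncException marks the
-- exception as asynchronous.
instance Exception KillSwitch where
    fromException = asyncExceptionFromException
    toException = asyncExceptionToException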

When such an exception is thrown, it will escape from the exception handlers used in runForever and terminate the chainweb node. When this happens, the exception value is printed by the runtime to stderr.

The code should also write a meaningful Error log message before throwing such an exception.
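
A minimal sketch of the "log first, then throw" pattern is shown below. It uses a String-based variant of KillSwitch to stay within the base library; triggerKillSwitch, isAsyncException, demoKillSwitch, and logError are hypothetical names, and logError only writes to stderr in place of the node's real logger.

```haskell
import Control.Exception
    ( Exception (..)
    , SomeAsyncException (..)
    , SomeException
    , asyncExceptionFromException
    , asyncExceptionToException
    , throwIO
    , try
    )
import System.IO (hPutStrLn, stderr)

-- | String-based variant of the KillSwitch exception (assumption: the
-- real type carries a Text value).
newtype KillSwitch = KillSwitch String

instance Show KillSwitch where
    show (KillSwitch t) = "kill switch triggered: " <> t

instance Exception KillSwitch where
    fromException = asyncExceptionFromException
    toException = asyncExceptionToException

-- | Stand-in for the node's logger (assumption: the real logger differs).
logError :: String -> IO ()
logError msg = hPutStrLn stderr ("[Error] " <> msg)

-- | Write the error log first, then throw the fatal exception.
triggerKillSwitch :: String -> IO a
triggerKillSwitch reason = do
    logError ("fatal condition, terminating node: " <> reason)
    throwIO (KillSwitch reason)

-- | True if the exception is wrapped in 'SomeAsyncException', i.e. if a
-- handler for synchronous exceptions only would let it pass.
isAsyncException :: SomeException -> Bool
isAsyncException e = case fromException e of
    Just (SomeAsyncException _) -> True
    Nothing -> False

-- | Usage example: confirm that the thrown value is classified as
-- asynchronous.
demoKillSwitch :: IO Bool
demoKillSwitch = do
    r <- try (triggerKillSwitch "example reason")
    pure $ case r of
        Left e -> isAsyncException e
        Right () -> False
```

Logging before the throw matters because the runtime prints only the exception value to stderr; the surrounding context that an operator needs has to be in the log.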

List of Initialization Failures

  • Configuration:

    • parsing of configuration fails
    • validation of configuration fails
  • Logging system:

    • Elasticsearch index can't be created
    • Log files can't be opened
  • Databases:

    • RocksDb database can't be opened
    • sqlite database can't be opened
    • not enough disk space available
  • Networking:

    • Certificate generation fails
    • Certificate or Key can't be read
    • Certificate is invalid (e.g. expired)
  • Chain Resources:

    • pruning of block header database files (detects inconsistencies)
  • BlockHeaderDb / Consensus:

    • Hashes of genesis headers don't match expected hashes for the given chainweb version
    • Missing dependencies in BlockHeaderDb (in part checked by db pruning)
  • Pact Service:

    • Hashes of genesis payloads don't match the expected hashes for the given chainweb version
  • Mempool:

  • P2P Networking:

    • No bootstrap nodes configured (should this be a failure?)
    • Synchronization with all bootstrap nodes fails
    • No network link available
    • DNS lookup not available (is this a failure? most peers are known by IP)
    • All HTTP connections fail with 502
  • HTTP Server:

    • port can't be allocated
  • Miner:

List of Unrecoverable Operation Failures

  • Ctrl-C / kill
  • KillSwitch
  • ReorgLimitExceeded