Failure Conditions

Failure handling during initialization

Components of a chainweb-node should check during initialization for failure conditions that would prevent the node from performing its task. If a component detects such a condition, it should

try to fix the issue, and if that isn't possible,
emit an error log that describes the problem and, where possible, provides hints on how to resolve it, and
throw an exception, which will cause the node to terminate.
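The three steps above can be sketched as follows. This is a minimal illustration that uses only base; InitFailure, checkOrDie, and the logger argument are hypothetical names for this sketch and are not taken from the chainweb-node code base.

```haskell
import Control.Exception (Exception, throwIO, try)
import Data.IORef (modifyIORef, newIORef, readIORef)

-- Hypothetical exception type for a failed initialization check.
newtype InitFailure = InitFailure String
    deriving (Show, Eq)

instance Exception InitFailure

-- | Run a check. If it fails, attempt a repair and check again. If the
-- problem persists, emit an error log and throw, which terminates the
-- caller unless someone catches the exception.
checkOrDie
    :: (String -> IO ())  -- ^ error logger (stand-in for the node's logger)
    -> IO (Maybe String)  -- ^ check: Nothing on success, Just problem on failure
    -> IO Bool            -- ^ repair: True if it believes it fixed the problem
    -> IO ()
checkOrDie logError check repair = do
    r <- check
    case r of
        Nothing -> return ()
        Just problem -> do
            fixed <- repair
            -- Re-run the check only if the repair claims success.
            remaining <- if fixed then check else return (Just problem)
            case remaining of
                Nothing -> return ()
                Just p -> do
                    logError ("initialization failed: " ++ p)
                    throwIO (InitFailure p)

-- Usage: a check that always fails and a repair that cannot fix it.
-- Log lines are collected in an IORef instead of going to a real logger.
demo :: IO (Either InitFailure (), [String])
demo = do
    logs <- newIORef []
    r <- try $ checkOrDie
        (\m -> modifyIORef logs (m :))
        (return (Just "database directory is not writable"))
        (return False)
    ls <- readIORef logs
    return (r, reverse ls)
```

In this sketch the error log is always written before the exception is thrown, so the log and the message on stderr describe the same problem.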

The exception message will show up on stderr of the node. On systems with systemd the exception message will be recorded in the journal. On the testnet nodes journalctl -b -u chainweb can be used to view the journal.

Failures in the logging system itself, and any log messages emitted before the logging system is initialized, are also written to stderr and therefore show up in the journal.

Failure handling during operations

Once the node is initialized and API servers and the P2P clients are started, components should try hard to avoid failing. Components should

catch all synchronous exceptions,
emit an error log or warning log that describes the problem, including possible actions that must be taken to address the issue,
restart the component, subject to backoff or throttling logic as needed.

Most components do this by being wrapped in runForever or runForeverThrottled from Chainweb.Utils.
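The following sketch shows what a wrapper of this kind might look like. It is not the Chainweb.Utils implementation; superviseForever, its parameters, and the doubling backoff policy are assumptions made for illustration.

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception
    (SomeAsyncException(..), SomeException, fromException, throwIO, try)
import Data.IORef (modifyIORef, newIORef, readIORef)
import System.Timeout (timeout)

-- | Hypothetical runForever-style wrapper: restart the component whenever
-- it fails with a synchronous exception or returns, doubling the delay
-- between restarts up to a maximum. The delay is never reset in this sketch.
superviseForever
    :: (String -> IO ())  -- ^ warning logger (stand-in for the node's logger)
    -> String             -- ^ component name used in log messages
    -> Int                -- ^ initial restart delay in microseconds
    -> Int                -- ^ maximum restart delay in microseconds
    -> IO ()              -- ^ the component
    -> IO ()
superviseForever logWarn name initialDelay maxDelay action = go initialDelay
  where
    go delay = do
        r <- try action
        case r of
            Left e
                -- Asynchronous exceptions are not ours to handle: rethrow.
                | isAsync e -> throwIO e
                | otherwise -> logWarn (name ++ " failed: " ++ show e)
            Right () -> logWarn (name ++ " exited unexpectedly")
        threadDelay delay
        go (min maxDelay (delay * 2))

-- | True if the exception is wrapped in SomeAsyncException.
isAsync :: SomeException -> Bool
isAsync e = case fromException e of
    Just (SomeAsyncException _) -> True
    Nothing -> False

-- Usage: a component that always fails is restarted repeatedly until an
-- asynchronous exception (here the one raised by 'timeout') stops the loop.
demo :: IO (Maybe (), Int)
demo = do
    runs <- newIORef (0 :: Int)
    let component = modifyIORef runs (+ 1) >> ioError (userError "boom")
    r <- timeout 50000 $
        superviseForever (\_ -> return ()) "demo" 1000 5000 component
    n <- readIORef runs
    return (r, n)
```

The demo relies on the exception used by timeout being asynchronous, so it passes through the wrapper and ends the otherwise infinite loop.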

Components must not catch asynchronous exceptions that don't originate from the component itself. The functions catchSynchronous and catchAllSynchronous (and their variants) from Chainweb.Utils can be used to catch synchronous exceptions while letting asynchronous exceptions pass through.
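To illustrate the distinction, here is a minimal sketch of a handler that catches only synchronous exceptions. catchSyncOnly is a hypothetical name; it is written in the spirit of the Chainweb.Utils functions but is not their implementation, and their actual signatures may differ.

```haskell
import Control.Exception
    ( AsyncException(UserInterrupt), SomeAsyncException(..), SomeException
    , catch, fromException, throwIO, try )

-- | Hypothetical helper: run the handler for synchronous exceptions and
-- rethrow anything that is wrapped in SomeAsyncException.
catchSyncOnly :: IO a -> (SomeException -> IO a) -> IO a
catchSyncOnly action handler = action `catch` \e ->
    case fromException e of
        Just (SomeAsyncException _) -> throwIO e
        Nothing -> handler e

-- Usage: an IOError is handled, whereas UserInterrupt (the exception GHC
-- raises on Ctrl-C) bypasses the handler and is only seen by the outer 'try'.
demo :: IO (String, Either AsyncException String)
demo = do
    a <- ioError (userError "boom") `catchSyncOnly` \_ -> return "handled"
    b <- try (throwIO UserInterrupt `catchSyncOnly` \_ -> return "handled")
    return (a, b)
```

The check is based on how the exception is wrapped, not on how it was delivered, which is why the Exception instance of a type decides whether handlers of this kind let it through.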

Exceptions that terminate the node

There are a few fatal conditions that a node can't recover from by itself. In those cases an asynchronous exception should be thrown that terminates the node.

One example of such a condition is when the node receives a termination signal from the environment, for instance via Ctrl-C or kill. Another example is when a kill-switch triggers.

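Here is an example of how an asynchronous exception can be defined:

import Control.Exception (Exception(..), asyncExceptionFromException, asyncExceptionToException)
import qualified Data.Text as T

-- Routing toException/fromException through SomeAsyncException marks
-- KillSwitch as an asynchronous exception.
newtype KillSwitch = KillSwitch T.Text
instance Show KillSwitch where
    show (KillSwitch t) = "kill switch triggered: " <> T.unpack t
instance Exception KillSwitch where
    fromException = asyncExceptionFromException
    toException = asyncExceptionToException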

When such an exception is thrown, it will escape from the exception handlers used in runForever and terminate the chainweb node. When this happens, the exception value is printed by the runtime to stderr.

The code should also write a meaningful Error log message before throwing such an exception.
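A minimal sketch of the log-then-throw pattern is given below. KillSwitch is redefined over String here only so that the example needs nothing beyond base; triggerKillSwitch and the logger argument are hypothetical names for this sketch.

```haskell
import Control.Exception
    ( Exception(..), SomeAsyncException, asyncExceptionFromException
    , asyncExceptionToException, throwIO, try )
import Data.IORef (modifyIORef, newIORef, readIORef)

-- Same shape as the KillSwitch defined earlier, but over String (assumption
-- made to keep the sketch self-contained).
newtype KillSwitch = KillSwitch String

instance Show KillSwitch where
    show (KillSwitch t) = "kill switch triggered: " <> t

instance Exception KillSwitch where
    fromException = asyncExceptionFromException
    toException = asyncExceptionToException

-- | Hypothetical helper: write an error log first, then throw the
-- asynchronous exception.
triggerKillSwitch :: (String -> IO ()) -> String -> IO a
triggerKillSwitch logError reason = do
    logError ("terminating node: " <> show (KillSwitch reason))
    throwIO (KillSwitch reason)

-- Usage: the thrown value can be caught as SomeAsyncException, which shows
-- that handlers filtering on that wrapper will treat it as asynchronous.
demo :: IO (Bool, [String])
demo = do
    logs <- newIORef []
    r <- try (triggerKillSwitch (\m -> modifyIORef logs (m :)) "deadline reached")
    ls <- readIORef logs
    let caughtAsAsync = case r of
            Left e -> const True (e :: SomeAsyncException)
            Right () -> False
    return (caughtAsAsync, ls)
```

Logging before the throw matters because, once the exception escapes, the only other trace is the single line the runtime prints to stderr.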

List of Initialization Failures

List of Unrecoverable Operation Failures

Ctrl-C / kill
KillSwitch
ReorgLimitExceeded
