make signal handler less greedy: only handle signals from expected memory ranges #23
EOS VM uses page protection for guarding memory accesses and interrupting execution. Currently, when EOS VM starts execution it prepares its signal handler to treat any fault that occurs until execution completes as an `access violation` WASM error. This means faults that occur inside WASM execution and faults in any host function that WASM calls are all reported and treated as a recoverable `access violation`.

Because EOS VM captures `SIGBUS` (wholly unnecessary on Linux, but needed on macOS), a substantial number of unrecoverable system errors occurring in host functions (rare corner cases, but still very real) will instead be treated as a recoverable `access violation`, as if the WASM simply accessed out-of-bounds memory in its sandbox. This can include an IO error on the DB file, an IO error when swapping, running out of disk space, an unrecoverable ECC error, running out of free huge pages (in `heap` mode w/ huge pages enabled), and maybe more. These unrecoverable system errors should not be handled as a recoverable WASM memory violation.

Removing `SIGBUS` from being handled on Linux would generally resolve this problem, though if a host function had a defect causing a `SIGSEGV` it would fall into the same improper handling. So for a more thorough solution, the signal handler will now only handle `SIGSEGV`/`SIGBUS`/`SIGFPE` on given memory ranges -- the WASM code & WASM memory. Faults that occur outside these ranges are forwarded to the next handler (or kill the application if EOS VM's handler is the last in the chain). This behavior is similar to how EOS VM OC's handler operates. I've also removed `SIGBUS` from being handled on Linux entirely to resolve the exceptionally unlikely scenario of catching an ECC failure inside of WASM memory.

Of course, this means if one of the above system errors occurs, nodeos will now simply be killed, whereas before it'd potentially get stuck in some wedged state that was still cleanly stoppable. While that might sound bad, it's a good thing: we should only be recovering from errors we know we can properly recover from.
This behavior is a theory on AntelopeIO/leap#2242: some fault is masquerading as an `access violation` due to the current greediness of the handlers.
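
For anyone skimming, here is a minimal sketch of the "only handle faults in expected ranges, otherwise forward" idea described above. It is not the code in this PR: `fault_range`, `code_range`, `memory_range`, `recovery_point`, and `install_handler` are hypothetical names for illustration, only `SIGSEGV` is wired up (`SIGFPE`, and `SIGBUS` on macOS, would follow the same pattern), and it assumes the VM recovers by jumping back to a `sigsetjmp` point.

```cpp
#include <cstdint>
#include <setjmp.h>
#include <signal.h>

// Hypothetical: one address range the handler is allowed to recover from.
struct fault_range {
   std::uintptr_t begin; // inclusive
   std::uintptr_t end;   // exclusive
};

// Hypothetical state, assumed to be filled in before WASM execution starts.
static fault_range code_range{0, 0};    // WASM code
static fault_range memory_range{0, 0};  // WASM linear memory
static thread_local sigjmp_buf* recovery_point = nullptr;
static struct sigaction prev_segv_action;

inline bool in_range(const fault_range& r, const void* addr) {
   const auto a = reinterpret_cast<std::uintptr_t>(addr);
   return a >= r.begin && a < r.end;
}

// True only for faults that can be reported as a WASM access violation.
inline bool is_expected_fault(const void* addr) {
   return in_range(code_range, addr) || in_range(memory_range, addr);
}

void segv_handler(int sig, siginfo_t* info, void* uap) {
   if (recovery_point && is_expected_fault(info->si_addr)) {
      // Expected range: unwind to the VM entry point, which reports the error.
      siglongjmp(*recovery_point, sig);
   }
   // Anything else is not ours: forward to whatever was installed before us.
   if (prev_segv_action.sa_flags & SA_SIGINFO) {
      prev_segv_action.sa_sigaction(sig, info, uap);
   } else if (prev_segv_action.sa_handler == SIG_DFL ||
              prev_segv_action.sa_handler == SIG_IGN) {
      // Restore the old disposition and return; the faulting instruction
      // re-executes and the default action terminates the process.
      sigaction(sig, &prev_segv_action, nullptr);
   } else {
      prev_segv_action.sa_handler(sig);
   }
}

void install_handler() {
   struct sigaction sa{};
   sa.sa_sigaction = &segv_handler;
   sigemptyset(&sa.sa_mask);
   sa.sa_flags = SA_SIGINFO | SA_NODEFER;
   sigaction(SIGSEGV, &sa, &prev_segv_action); // remember the next handler
}
```

The point of the sketch is the ordering: the range check comes first, and the recoverable path is taken only when the fault address is inside WASM code or WASM memory. A fault raised from a host function touching some unrelated address falls through to the previous handler or the default action instead of being reported as an `access violation`.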