make signal handler less greedy: only handle signals from expected memory ranges #23

spoonincode · 2024-03-18T20:38:08Z

EOS VM uses page protection to guard memory accesses and to interrupt execution. Currently, when EOS VM starts execution, it arms its signal handler to treat any fault that occurs before execution completes as a WASM access violation error. This means that faults inside WASM execution and faults inside any host function the WASM calls are all reported and treated as a recoverable access violation.
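
To make the current "greedy" behavior concrete, here is a minimal sketch of the general technique: a protected page, a fault handler, and a `sigsetjmp`/`siglongjmp` recovery point. It is not EOS VM's code; every name in it (`recover_point`, `armed`, `last_signal`, `on_fault`, `try_read`, `map_guard_page`) is hypothetical. Note that the handler recovers from any fault while armed, without checking where the fault happened, which is exactly the problem described in this PR.

```cpp
#include <cassert>
#include <csetjmp>
#include <csignal>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical illustration only; not EOS VM's implementation.
static sigjmp_buf            recover_point;    // where execution resumes after a fault
static volatile sig_atomic_t armed       = 0;  // nonzero while "guest" code is running
static volatile sig_atomic_t last_signal = 0;  // signal number of the recovered fault

extern "C" void on_fault(int sig, siginfo_t*, void*) {
    if (armed) {
        // Greedy: ANY fault while armed is treated as recoverable,
        // regardless of the faulting address.
        armed       = 0;
        last_signal = sig;
        siglongjmp(recover_point, 1);
    }
    // Not armed: restore the default action and let the fault kill the process.
    signal(sig, SIG_DFL);
    raise(sig);
}

inline void install_fault_handler() {
    struct sigaction sa {};
    sa.sa_sigaction = on_fault;
    sa.sa_flags     = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, nullptr);
    sigaction(SIGBUS, &sa, nullptr);  // some platforms report protection faults as SIGBUS
}

// Map one inaccessible page; any read or write to it faults.
inline void* map_guard_page() {
    const long page = sysconf(_SC_PAGESIZE);
    void* p = mmap(nullptr, static_cast<std::size_t>(page), PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// Returns 0 if the read succeeded, otherwise the signal that interrupted it.
inline int try_read(const volatile char* p) {
    if (sigsetjmp(recover_point, 1) == 0) {  // 1: save and later restore the signal mask
        armed  = 1;
        char c = *p;  // faults if p points into the guard page
        (void)c;
        armed = 0;
        return 0;
    }
    return last_signal;
}
```

After `install_fault_handler()`, calling `try_read` on a pointer into the page returned by `map_guard_page()` comes back with a nonzero signal number instead of crashing, while a read of ordinary memory returns 0. The same recovery would also swallow a fault raised by unrelated code that happened to run while `armed` was set.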

Because EOS VM captures SIGBUS (wholly unnecessary on Linux, but needed on macOS), a substantial number of unrecoverable system errors occurring in host functions (rare corner cases, but still very real) will instead be treated as a recoverable access violation, as if the WASM had simply accessed out-of-bounds memory in its sandbox. These can include an IO error on the DB file, an IO error when swapping, running out of disk space, an unrecoverable ECC error, running out of free huge pages (in heap mode w/ huge pages enabled), and maybe more. These unrecoverable system errors should not be handled as a recoverable WASM memory violation.

Removing SIGBUS from being handled on Linux would generally resolve this problem, though a host function with a defect causing a SIGSEGV would still fall into the same improper handling. So for a more thorough solution, the signal handler will now only handle SIGSEGV/SIGBUS/SIGFPE on given memory ranges: the WASM code & WASM memory. Faults that occur outside these ranges are forwarded to the next handler (or kill the application if EOS VM's handler is the last in the chain). This behavior is similar to how EOS VM OC's handler operates. I've also removed SIGBUS from being handled on Linux entirely to resolve the exceptionally unlikely scenario of catching an ECC failure inside of WASM memory.
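
The range check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `mem_range`, `make_range`, `in_range`, `is_handled_signal`, `fault_action`, and `classify_fault` are all hypothetical names, and a plain struct stands in for whatever range type the real handler uses (the commits in this PR mention `<span>`). A real handler would pass `info->si_addr` from its `siginfo_t` as `fault_addr`, then either `siglongjmp` out (recover) or call the previously installed handler (forward).

```cpp
#include <cassert>
#include <csignal>
#include <cstddef>
#include <cstdint>

// Hypothetical types and names for illustration; not EOS VM's actual code.
struct mem_range {
    std::uintptr_t base;
    std::size_t    size;
};

// Build a range outside the handler, where assert() is safe to use.
inline mem_range make_range(const void* base, std::size_t size) {
    assert(size == 0 || base != nullptr);
    return {reinterpret_cast<std::uintptr_t>(base), size};
}

// True if addr lies in [base, base + size); written to avoid overflow.
inline bool in_range(mem_range r, const void* addr) {
    const auto p = reinterpret_cast<std::uintptr_t>(addr);
    return p >= r.base && p - r.base < r.size;
}

// Signals the handler is willing to treat as a WASM fault at all.
inline bool is_handled_signal(int sig) {
#ifdef __linux__
    return sig == SIGSEGV || sig == SIGFPE;  // SIGBUS is not handled on Linux
#else
    return sig == SIGSEGV || sig == SIGFPE || sig == SIGBUS;
#endif
}

enum class fault_action {
    recover,  // fault came from the sandbox: report a WASM access violation
    forward   // fault came from elsewhere: pass to the next handler
};

// Recover only when the faulting address is inside WASM code or WASM memory.
inline fault_action classify_fault(int sig, const void* fault_addr,
                                   mem_range wasm_code, mem_range wasm_memory) {
    if (is_handled_signal(sig) &&
        (in_range(wasm_code, fault_addr) || in_range(wasm_memory, fault_addr)))
        return fault_action::recover;
    return fault_action::forward;
}
```

Keeping the decision in a small pure function like this makes it easy to test without raising real signals, and keeps the handler itself trivial.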

Of course, this means that if one of the above system errors occurs, nodeos will now simply be killed, whereas before it would potentially get stuck in some wedged state that was still cleanly stoppable. While that might sound bad, it's a good thing: we should only be recovering from errors we know we can properly recover from.

The current greedy behavior is one theory for AntelopeIO/leap#2242: some fault is masquerading as an access violation due to the greediness of the handlers.

only handle signals from expected memory ranges

5996e48

spoonincode mentioned this pull request Mar 18, 2024

make EOS VM signal handler less greedy: only handle signals from expected memory range AntelopeIO/leap#2322

Merged

spoonincode requested a review from linh2931 March 18, 2024 20:41

need <span> include

9962c37

greg7mdp approved these changes Mar 18, 2024

linh2931 approved these changes Mar 18, 2024

and another <span>

4f9ef0e

greg7mdp approved these changes Mar 19, 2024

replace lambda with function

f725536

since we're going to longjmp out of this function, probably best to stay as trivial as possible

linh2931 approved these changes Mar 21, 2024

spoonincode merged commit aa8bd0a into main Mar 22, 2024
10 checks passed

spoonincode deleted the limit_signal_handler branch March 22, 2024 15:03

swatanabe mentioned this pull request Apr 9, 2024

Merge from AntelopeIO gofractally/eos-vm#14

Merged
