Disable sigaltstack overriding in asan builds

This commit is an attempt to fix a number of flaky crashes that we've been seeing on OSS-Fuzz for some time now. These crashes only reproduce under ASAN and even then have been spotty to reproduce. The current thinking is that a test with threads (e.g. only `wast_tests` using some of the threads spec tests) is required to run some wasm which will register a `sigaltstack`. Destruction of this `sigaltstack` happens with TLS destructors which seems to have a bad interaction with ASAN state additionally being destroyed around that time. This whole interaction means that no one test case is enough to reproduce the corruption. Many crashes on OSS-Fuzz are likely due to "some historical test case spawned a thread" which corrupted something to crash later. The test case that I can reproduce with locally requires rerunning it in the same process a few thousand times to get a reproduction. The purpose of the `sigaltstack` is to ensure that we have a big enough stack, primarily in debug mode, for testing if a trap is wasm. The hope is that this extra size of the Rust-standard-library-default's stack size is not necessary in release mode with ASAN. In the end time will tell with OSS-Fuzz to see if we can keep this or if we need to both install a bigger sigaltstack in addition to managing them differently in ASAN builds.
alexcrichton · Jan 15, 2025 · 97b79d8 · 97b79d8
1 parent 48f4621
commit 97b79d8
Showing 1 changed file with 53 additions and 0 deletions.
diff --git a/crates/wasmtime/src/runtime/vm/sys/unix/signals.rs b/crates/wasmtime/src/runtime/vm/sys/unix/signals.rs
@@ -393,8 +393,61 @@ unsafe fn set_pc(cx: *mut libc::c_void, pc: usize, arg1: usize) {
 /// always large enough for our signal handling code. Override it by creating
 /// and registering our own alternate stack that is large enough and has a guard
 /// page.
+///
+/// Note that one might reasonably ask why do this at all? Why not remove
+/// `SA_ONSTACK` from our signal handlers entirely? The basic reason for that is
+/// because we want to print a message on stack overflow. The Rust standard
+/// library will print this message by default and by us overriding the
+/// `SIGSEGV` handler above we're now sharing responsibility for that as well.
+/// We must have `SA_ONSTACK` to even attempt to being able to printing this
+/// message, and so we leave it turned on. Wasmtime will determine a stack
+/// overflow fault isn't caused by wasm and then forward to libstd's signal
+/// handler which will actually print-and-abort.
+///
+/// Another reasonable question might be why we need to increase the size of the
+/// sigaltstack at all? This is something which we may want to reconsider in the
+/// future. For now it helps keep debug builds working which consume more stack
+/// when handling normal wasm out-of-bounds and faults. Perhaps in the future we
+/// could optimize this more or maybe even do something clever like lazily
+/// allocate the sigaltstack on the fault itself. (e.g. trampoline from a tiny
+/// stack to the "big stack" during a wasm fault or something like that)
 #[cold]
 pub fn lazy_per_thread_init() {
+    // This is a load-bearing requirement to keep address-sanitizer working and
+    // prevent crashes during fuzzing. The general idea here is that we skip the
+    // sigaltstack setup below entirely on asan builds, aka fuzzing. The exact
+    // reason for this is not entirely known, but the closest guess we have at
+    // this time is something like:
+    //
+    // * ASAN builds intercept mmap/munmap to keep track of what's going on.
+    // * The sigaltstack below registers a TLS destructor for when the current
+    //   thread exits to deallocate the stack.
+    // * ASAN looks to also have TLS destructors for its own internal state.
+    // * The current assumption is that the order of these TLS destructors can
+    //   cause corruption in ASAN state where if we run after asan's destructor
+    //   it may intercept munmap and then asan doesn't know it's been
+    //   de-initialized yet.
+    //
+    // The reproduction of this involved a standalone project built with
+    // `-Zsanitizer=address` where internally it would spawn two threads. Each
+    // thread would build a "hello world" module and then one of the threads
+    // would execute a noop exported function. If this was run thousands of
+    // times in a loop in the same process it would eventually crash under asan.
+    //
+    // It's notably not quite so simple as frobbing TLS destructors. There's
+    // clearly something else going on with ASAN state internally which we don't
+    // fully understand at this time. An attempt to make a standalone C++
+    // reproduction, for example, was not successful. In lieu of that the best
+    // we have for now is to disable our custom and larger sigaltstack in asan
+    // builds.
+    //
+    // The exact source was
+    // https://gist.github.com/alexcrichton/6815a5d57a3c5ca94a8d816a9fcc91af for
+    // future reference if necessary.
+    if cfg!(asan) {
+        return;
+    }
+
     // This thread local is purely used to register a `Stack` to get deallocated
     // when the thread exists. Otherwise this function is only ever called at
     // most once per-thread.