Fix hanging issues in integration tests #558

lionel- · 2024-10-02T11:16:43Z

Closes #551.
Branched from #547.

I've been facing issues locally with instability which in turned caused integration tests to hang indefinitely. And on CI we now also test on Windows, which is causing similar hang issues in the Amalthea integration tests (but due to other instability causes). This PR addresses my local instability issue and attempts to better propagate panics and problems in server-side threads so that the test threads do not hang waiting for messages.

It turns out that the instability was being caused by this top-level callback installed by the cli package: https://github.com/r-lib/cli/blob/1220ed092c03e167ff0062e9839c81d7258a4600/R/onload.R#L33-L40. This caused extra output containing an ANSI escape sequence to be streamed on IOpub at random times. To avoid this, we now pass --vanilla to R on startup. (Note that Positron will not see these "show cursor" escapes because we redirect stdout in production builds, which causes istty(stdout()) to be FALSE and cli does not register that task callback.)

We now also use assert_matches() to get better information about actual incoming messages when they don't match what is expected. Unlike assert_match(), this shows the contents of the failing input which helps a lot.

Regarding tests hanging, the key point is that when a background thread panics, a panic hook is run to record information about the panic, the thread is unwound, but then nothing else happens if the thread is not joined. So when a test thread is waiting for an expected message and the server thread has panicked, the panic is not propagated to the client-side thread, which waits forever.

What did not work is setting a panic hook to propagate background panics to the main thread:

Notifying other threads from the panic hook is feasible but we need to specifically wait for notifications at the other end.
Another approach is to flip a global flag that other threads can check periodically, but that's manual and error-prone.
Calling the original panic hook and then exiting the process with exit() or abort() does not work out because the hard exit prevents the test runner from printing a summary of the test results.

Things that this PR does instead to (partially) address this:

We really need the client test thread to panic when something goes wrong in the server threads. Propagating panics from the server threads is hard, so instead we now poll for expected incoming messages with a timeout. After one second the client thread panics.
In the zmq/crossbeam merging threads, we now panic instead of logging if we're running tests and something unexpected happens, e.g. when an expected message doesn't come in or when a socket connection fails. This allows the test runner's panic hook to take note of the panic.

Whether to panic is implemented by checking for cfg(debug_assertions). This way we remain lenient (but verbose) in production builds.
Sometimes the panic occurs even before the kernel starts up, e.g. because a zmq socket fails to connect. This is not a problem for the main Ark binary because this all happens on the main thread, where panics cause an exit. However in integration tests we run R in a background thread so it can be reused across multiple test threads. An early panic would then be silently ignored and the tests would be waiting indefinitely for R initialization.

To fix this, I split RMain::setup() from RMain::start(). This allows the test init to run the former (via start_kernel()) independently on its own thread, before starting the R REPL on a background thread. Most init problems can then be propagated correctly to the test runner.

This task callback was causing intermittent unexpected output: https://github.com/r-lib/cli/blob/1220ed092c03e167ff0062e9839c81d7258a4600/R/onload.R#L33-L40 Note that `istty(stdout())` is `TRUE` in these tests even though it's `FALSE` in Positron. That's likely because we don't redirect stdout, see https://github.com/posit-dev/ark/blob/main/crates/amalthea/src/sys/unix/stream_capture.rs#L39-L42 For clarity and consistency with the regular path, we also set `--interactive`

This way the former can be run in a background thread whereas the latter can still run in the main thread and propagate panics

Should help getting better reports

crates/ark/src/fixtures/dummy_frontend.rs

crates/ark/src/interface.rs

crates/ark/src/start.rs

crates/ark/src/interface.rs

DavisVaughan · 2024-10-03T14:05:48Z

crates/ark/src/interface.rs

+        // Update the main thread ID in case the REPL is started in a background
+        // thread (e.g. in integration tests)
+        unsafe { R_MAIN_THREAD_ID = Some(std::thread::current().id()) };


Should we even set R_MAIN_THREAD_ID from setup() then? It feels weird to do so twice to me.

I feel the same but we have to because write_console(), called during startup when the banner is emitted, calls RMain::get_mut().

I've now:

Split write_console() into pre-init and post-init handlers, as you suggested yesterday. The pre-init handler pushes output to a new global variable R_BANNER so it doesn't need to call RMain::get_mut().

Added the RMain::positron_ns field. The modules watchers now call RMain::get_mut() to access the namespace. This solves issues having to do with the creation of an RThreadSafe object which requires RMain::get() which would fail if the R thread ID is not initialized.

This makes it possible to set up the thread ID only once, in RMain::start().

It was not possible to split write_console() because it doesn't look like the Windows API for setting interface handlers supports updating them after the initial setup. So I've reverted this, but we still use the same strategy for setting R_MAIN_THREAD_ID only once.

DavisVaughan · 2024-10-03T14:13:00Z

crates/ark/src/fixtures/dummy_frontend.rs

-        // Wait for startup to complete
-        RMain::wait_r_initialized();


Remove wait_r_initialized() now?

DavisVaughan · 2024-10-03T14:22:44Z

crates/ark/src/fixtures/dummy_frontend.rs

-                SessionMode::Console,
-                false,
-            );
+            RMain::start();


I am mildly concerned about the fact that R_INIT sets "set" in setup() at the very end.

But we typically think of R_INIT as "R is ready to receive input", but that really isn't true because we haven't started the REPL yet (not until we call start() here).

I feel like its probably ok? Maybe R_INIT should be R_STARTING or something, I'm not sure. And the remaining helper would be is_r_starting(). If you look at the 1 place is_r_initialized() is currently being used, it is to capture output that is part of the startup banner. Switching the name to is_r_starting() seems like it would still make sense there.

I was wondering about that too, but I couldn't move it to the start() method.

But I think that's ok because R is ready to receive input or be accessed with well protected (top-level exec) API calls even if the REPL has not been started yet.

crates/ark/src/fixtures/dummy_frontend.rs

This reverts commit d2a0ce4.

lionel- added 6 commits October 2, 2024 11:53

Use assert_matches()

006059c

Receive messages with timeout

f11cc92

Simpler approach to implementing timeouts

801c801

Split RMain::setup() from RMain::start()

749f940

This way the former can be run in a background thread whereas the latter can still run in the main thread and propagate panics

Panic in tests when socket fails

2a93960

Should help getting better reports

lionel- requested a review from DavisVaughan October 2, 2024 12:00

lionel- mentioned this pull request Oct 2, 2024

Test multiple expressions in console and notebook modes #559

Merged

Base automatically changed from feature/error-tests to main October 3, 2024 14:22

DavisVaughan approved these changes Oct 3, 2024

View reviewed changes

lionel- added 8 commits October 4, 2024 10:44

Finish renaming start_repl() to start()

40f2486

Specify which panics will propagate

c56f6b7

Merge init variables

a64a25e

Split write_console() between pre-init and post-init handlers

d2a0ce4

Initialize R thread ID only once

1b54e66

Import r_write_console_init on Windows too

edcc0d7

Revert "Split write_console() between pre-init and post-init handlers"

de31833

This reverts commit d2a0ce4.

Use R_BANNER in unified write_console()

83b8e6d

lionel- force-pushed the bugfix/test-hangs branch from cd5fc43 to 83b8e6d Compare October 4, 2024 09:47

lionel- merged commit bef038b into main Oct 4, 2024
5 of 6 checks passed

lionel- deleted the bugfix/test-hangs branch October 4, 2024 10:12

github-actions bot locked and limited conversation to collaborators Oct 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix hanging issues in integration tests #558

Fix hanging issues in integration tests #558

Uh oh!

lionel- commented Oct 2, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

DavisVaughan Oct 3, 2024

Uh oh!

lionel- Oct 3, 2024

Uh oh!

lionel- Oct 4, 2024

Uh oh!

lionel- Oct 4, 2024

Uh oh!

DavisVaughan Oct 3, 2024

Uh oh!

DavisVaughan Oct 3, 2024

Uh oh!

lionel- Oct 3, 2024 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Fix hanging issues in integration tests #558

Fix hanging issues in integration tests #558

Uh oh!

Conversation

lionel- commented Oct 2, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

DavisVaughan Oct 3, 2024

Choose a reason for hiding this comment

Uh oh!

lionel- Oct 3, 2024

Choose a reason for hiding this comment

Uh oh!

lionel- Oct 4, 2024

Choose a reason for hiding this comment

Uh oh!

lionel- Oct 4, 2024

Choose a reason for hiding this comment

Uh oh!

DavisVaughan Oct 3, 2024

Choose a reason for hiding this comment

Uh oh!

DavisVaughan Oct 3, 2024

Choose a reason for hiding this comment

Uh oh!

lionel- Oct 3, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

lionel- Oct 3, 2024 •

edited

Loading