Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix hanging issues in integration tests #558

Merged
merged 14 commits into from
Oct 4, 2024
Merged

Fix hanging issues in integration tests #558

merged 14 commits into from
Oct 4, 2024

Conversation

lionel-
Copy link
Contributor

@lionel- lionel- commented Oct 2, 2024

Closes #551.
Branched from #547.

I've been facing issues locally with instability which in turned caused integration tests to hang indefinitely. And on CI we now also test on Windows, which is causing similar hang issues in the Amalthea integration tests (but due to other instability causes). This PR addresses my local instability issue and attempts to better propagate panics and problems in server-side threads so that the test threads do not hang waiting for messages.

It turns out that the instability was being caused by this top-level callback installed by the cli package: https://github.com/r-lib/cli/blob/1220ed092c03e167ff0062e9839c81d7258a4600/R/onload.R#L33-L40. This caused extra output containing an ANSI escape sequence to be streamed on IOpub at random times. To avoid this, we now pass --vanilla to R on startup. (Note that Positron will not see these "show cursor" escapes because we redirect stdout in production builds, which causes istty(stdout()) to be FALSE and cli does not register that task callback.)

We now also use assert_matches() to get better information about actual incoming messages when they don't match what is expected. Unlike assert_match(), this shows the contents of the failing input which helps a lot.

Regarding tests hanging, the key point is that when a background thread panics, a panic hook is run to record information about the panic, the thread is unwound, but then nothing else happens if the thread is not joined. So when a test thread is waiting for an expected message and the server thread has panicked, the panic is not propagated to the client-side thread, which waits forever.

What did not work is setting a panic hook to propagate background panics to the main thread:

  • Notifying other threads from the panic hook is feasible but we need to specifically wait for notifications at the other end.
  • Another approach is to flip a global flag that other threads can check periodically, but that's manual and error-prone.
  • Calling the original panic hook and then exiting the process with exit() or abort() does not work out because the hard exit prevents the test runner from printing a summary of the test results.

Things that this PR does instead to (partially) address this:

  • We really need the client test thread to panic when something goes wrong in the server threads. Propagating panics from the server threads is hard, so instead we now poll for expected incoming messages with a timeout. After one second the client thread panics.

  • In the zmq/crossbeam merging threads, we now panic instead of logging if we're running tests and something unexpected happens, e.g. when an expected message doesn't come in or when a socket connection fails. This allows the test runner's panic hook to take note of the panic.

    Whether to panic is implemented by checking for cfg(debug_assertions). This way we remain lenient (but verbose) in production builds.

  • Sometimes the panic occurs even before the kernel starts up, e.g. because a zmq socket fails to connect. This is not a problem for the main Ark binary because this all happens on the main thread, where panics cause an exit. However in integration tests we run R in a background thread so it can be reused across multiple test threads. An early panic would then be silently ignored and the tests would be waiting indefinitely for R initialization.

    To fix this, I split RMain::setup() from RMain::start(). This allows the test init to run the former (via start_kernel()) independently on its own thread, before starting the R REPL on a background thread. Most init problems can then be propagated correctly to the test runner.

This task callback was causing intermittent unexpected output:

https://github.com/r-lib/cli/blob/1220ed092c03e167ff0062e9839c81d7258a4600/R/onload.R#L33-L40

Note that `istty(stdout())` is `TRUE` in these tests even though it's
`FALSE` in Positron. That's likely because we don't redirect stdout, see
https://github.com/posit-dev/ark/blob/main/crates/amalthea/src/sys/unix/stream_capture.rs#L39-L42

For clarity and consistency with the regular path, we also set
`--interactive`
This way the former can be run in a background thread whereas the latter
can still run in the main thread and propagate panics
Should help getting better reports
@lionel- lionel- requested a review from DavisVaughan October 2, 2024 12:00
Base automatically changed from feature/error-tests to main October 3, 2024 14:22
crates/ark/src/fixtures/dummy_frontend.rs Show resolved Hide resolved
crates/ark/src/interface.rs Outdated Show resolved Hide resolved
crates/ark/src/start.rs Outdated Show resolved Hide resolved
crates/ark/src/interface.rs Outdated Show resolved Hide resolved
Comment on lines +435 to +437
// Update the main thread ID in case the REPL is started in a background
// thread (e.g. in integration tests)
unsafe { R_MAIN_THREAD_ID = Some(std::thread::current().id()) };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we even set R_MAIN_THREAD_ID from setup() then? It feels weird to do so twice to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel the same but we have to because write_console(), called during startup when the banner is emitted, calls RMain::get_mut().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now:

  • Split write_console() into pre-init and post-init handlers, as you suggested yesterday. The pre-init handler pushes output to a new global variable R_BANNER so it doesn't need to call RMain::get_mut().

  • Added the RMain::positron_ns field. The modules watchers now call RMain::get_mut() to access the namespace. This solves issues having to do with the creation of an RThreadSafe object which requires RMain::get() which would fail if the R thread ID is not initialized.

This makes it possible to set up the thread ID only once, in RMain::start().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was not possible to split write_console() because it doesn't look like the Windows API for setting interface handlers supports updating them after the initial setup. So I've reverted this, but we still use the same strategy for setting R_MAIN_THREAD_ID only once.

Comment on lines -55 to -56
// Wait for startup to complete
RMain::wait_r_initialized();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove wait_r_initialized() now?

SessionMode::Console,
false,
);
RMain::start();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am mildly concerned about the fact that R_INIT sets "set" in setup() at the very end.

But we typically think of R_INIT as "R is ready to receive input", but that really isn't true because we haven't started the REPL yet (not until we call start() here).

I feel like its probably ok? Maybe R_INIT should be R_STARTING or something, I'm not sure. And the remaining helper would be is_r_starting(). If you look at the 1 place is_r_initialized() is currently being used, it is to capture output that is part of the startup banner. Switching the name to is_r_starting() seems like it would still make sense there.

Copy link
Contributor Author

@lionel- lionel- Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering about that too, but I couldn't move it to the start() method.

But I think that's ok because R is ready to receive input or be accessed with well protected (top-level exec) API calls even if the REPL has not been started yet.

crates/ark/src/fixtures/dummy_frontend.rs Show resolved Hide resolved
@lionel- lionel- force-pushed the bugfix/test-hangs branch from cd5fc43 to 83b8e6d Compare October 4, 2024 09:47
@lionel- lionel- merged commit bef038b into main Oct 4, 2024
5 of 6 checks passed
@lionel- lionel- deleted the bugfix/test-hangs branch October 4, 2024 10:12
@github-actions github-actions bot locked and limited conversation to collaborators Oct 4, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ark: Panics in dummy kernel threads should be propagated to test harness
2 participants