-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix hanging issues in integration tests #558
Conversation
This task callback was causing intermittent unexpected output: https://github.com/r-lib/cli/blob/1220ed092c03e167ff0062e9839c81d7258a4600/R/onload.R#L33-L40 Note that `istty(stdout())` is `TRUE` in these tests even though it's `FALSE` in Positron. That's likely because we don't redirect stdout, see https://github.com/posit-dev/ark/blob/main/crates/amalthea/src/sys/unix/stream_capture.rs#L39-L42 For clarity and consistency with the regular path, we also set `--interactive`
This way the former can be run in a background thread whereas the latter can still run in the main thread and propagate panics
Should help getting better reports
// Update the main thread ID in case the REPL is started in a background | ||
// thread (e.g. in integration tests) | ||
unsafe { R_MAIN_THREAD_ID = Some(std::thread::current().id()) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we even set R_MAIN_THREAD_ID
from setup()
then? It feels weird to do so twice to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel the same but we have to because write_console()
, called during startup when the banner is emitted, calls RMain::get_mut()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've now:
-
Split
write_console()
into pre-init and post-init handlers, as you suggested yesterday. The pre-init handler pushes output to a new global variableR_BANNER
so it doesn't need to callRMain::get_mut()
. -
Added the
RMain::positron_ns
field. The modules watchers now callRMain::get_mut()
to access the namespace. This solves issues having to do with the creation of anRThreadSafe
object which requiresRMain::get()
which would fail if the R thread ID is not initialized.
This makes it possible to set up the thread ID only once, in RMain::start()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was not possible to split write_console()
because it doesn't look like the Windows API for setting interface handlers supports updating them after the initial setup. So I've reverted this, but we still use the same strategy for setting R_MAIN_THREAD_ID
only once.
// Wait for startup to complete | ||
RMain::wait_r_initialized(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove wait_r_initialized()
now?
SessionMode::Console, | ||
false, | ||
); | ||
RMain::start(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am mildly concerned about the fact that R_INIT
sets "set" in setup()
at the very end.
But we typically think of R_INIT
as "R is ready to receive input", but that really isn't true because we haven't started the REPL yet (not until we call start()
here).
I feel like its probably ok? Maybe R_INIT
should be R_STARTING
or something, I'm not sure. And the remaining helper would be is_r_starting()
. If you look at the 1 place is_r_initialized()
is currently being used, it is to capture output that is part of the startup banner. Switching the name to is_r_starting()
seems like it would still make sense there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering about that too, but I couldn't move it to the start()
method.
But I think that's ok because R is ready to receive input or be accessed with well protected (top-level exec) API calls even if the REPL has not been started yet.
This reverts commit d2a0ce4.
cd5fc43
to
83b8e6d
Compare
Closes #551.
Branched from #547.
I've been facing issues locally with instability which in turned caused integration tests to hang indefinitely. And on CI we now also test on Windows, which is causing similar hang issues in the Amalthea integration tests (but due to other instability causes). This PR addresses my local instability issue and attempts to better propagate panics and problems in server-side threads so that the test threads do not hang waiting for messages.
It turns out that the instability was being caused by this top-level callback installed by the cli package: https://github.com/r-lib/cli/blob/1220ed092c03e167ff0062e9839c81d7258a4600/R/onload.R#L33-L40. This caused extra output containing an ANSI escape sequence to be streamed on IOpub at random times. To avoid this, we now pass
--vanilla
to R on startup. (Note that Positron will not see these "show cursor" escapes because we redirect stdout in production builds, which causesistty(stdout())
to beFALSE
and cli does not register that task callback.)We now also use
assert_matches()
to get better information about actual incoming messages when they don't match what is expected. Unlikeassert_match()
, this shows the contents of the failing input which helps a lot.Regarding tests hanging, the key point is that when a background thread panics, a panic hook is run to record information about the panic, the thread is unwound, but then nothing else happens if the thread is not joined. So when a test thread is waiting for an expected message and the server thread has panicked, the panic is not propagated to the client-side thread, which waits forever.
What did not work is setting a panic hook to propagate background panics to the main thread:
exit()
orabort()
does not work out because the hard exit prevents the test runner from printing a summary of the test results.Things that this PR does instead to (partially) address this:
We really need the client test thread to panic when something goes wrong in the server threads. Propagating panics from the server threads is hard, so instead we now poll for expected incoming messages with a timeout. After one second the client thread panics.
In the zmq/crossbeam merging threads, we now panic instead of logging if we're running tests and something unexpected happens, e.g. when an expected message doesn't come in or when a socket connection fails. This allows the test runner's panic hook to take note of the panic.
Whether to panic is implemented by checking for
cfg(debug_assertions)
. This way we remain lenient (but verbose) in production builds.Sometimes the panic occurs even before the kernel starts up, e.g. because a zmq socket fails to connect. This is not a problem for the main Ark binary because this all happens on the main thread, where panics cause an exit. However in integration tests we run R in a background thread so it can be reused across multiple test threads. An early panic would then be silently ignored and the tests would be waiting indefinitely for R initialization.
To fix this, I split
RMain::setup()
fromRMain::start()
. This allows the test init to run the former (viastart_kernel()
) independently on its own thread, before starting the R REPL on a background thread. Most init problems can then be propagated correctly to the test runner.