Skip to content

rt: overhaul task hooks #7197

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
Open

rt: overhaul task hooks #7197

wants to merge 9 commits into from

Conversation

Noah-Kennedy
Copy link
Contributor

@Noah-Kennedy Noah-Kennedy commented Mar 6, 2025

This change overhauls the entire task hooks system so that users can propagate arbitrary information between task hook invocations and pass context data between the hook "harnesses" for parent and child tasks at time of spawn.

This is intended to be significantly more extensible and long-term maintainable than the current task hooks system, and should ultimately be much easier to stabilize.

See #7306 for motivation.

@github-actions github-actions bot added R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR labels Mar 6, 2025
@Noah-Kennedy
Copy link
Contributor Author

I'll fix CI tomorrow.

@rcoh
Copy link
Contributor

rcoh commented Mar 6, 2025

Just reading the API, I really like this. I think its an API that allows us to expand in the future (especially if there isn't a huge perf hit by virtue of open-ended XYZContext objects).

I might consider keeping the current hook APIs (but implementing them behind the scenes with the new API). I think, ultimately, they may still be simpler for some use cases.

@Noah-Kennedy
Copy link
Contributor Author

Just reading the API, I really like this. I think its an API that allows us to expand in the future (especially if there isn't a huge perf hit by virtue of open-ended XYZContext objects).

I might consider keeping the current hook APIs (but implementing them behind the scenes with the new API). I think, ultimately, they may still be simpler for some use cases.

Thank you for the feedback! Expandability was one of my main priorities with this changeset, and I'm delighted to hear that others are also excited about this.

Regarding keeping both sets of hooks around, I don't think this is a great idea due to the complexity of the systems involved, and I honestly just wanna get rid of the old APIs as they were never very good to begin with. That being said, I don't think it will be much work for folks to migrate over, and my hope is that we can leave this API completely unchanged even while we let it bake prior to stabilization, so hopefully no more migrations will be needed.

@Noah-Kennedy
Copy link
Contributor Author

I'm running into some issues with loom, and I'm currently unsure if this is because of an actual issue or just false positives.

@Darksonn do you have any idea what might be causing the current failures with the multi-threaded scheduler?

@Noah-Kennedy Noah-Kennedy requested a review from Darksonn March 6, 2025 21:33
@mox692
Copy link
Member

mox692 commented Mar 7, 2025

Hmm, it looks like the loom failure log is not showing up on ci? ... I captured this locally anyway:

logs
running 1 test
test runtime::tests::loom_multi_thread::group_c::pool_shutdown has been running for over 60 seconds

thread 'runtime::tests::loom_multi_thread::group_c::pool_shutdown' panicked at /home/runner/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/loom-0.7.2/src/rt/location.rs:115:9:
Causality violation: Concurrent write accesses to `UnsafeCell`.

stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: loom::rt::location::PanicBuilder::fire
   3: loom::rt::cell::State::track_write
   4: scoped_tls::ScopedKey<T>::with
   5: loom::rt::cell::Cell::start_write
   6: loom::cell::unsafe_cell::UnsafeCell<T>::with_mut
   7: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   8: tokio::runtime::scheduler::multi_thread::worker::Context::run
   9: tokio::runtime::context::scoped::Scoped<T>::set
  10: loom::thread::LocalKey<T>::try_with
  11: tokio::runtime::context::runtime::enter_runtime
  12: tokio::runtime::scheduler::multi_thread::worker::run
  13: loom::cell::unsafe_cell::UnsafeCell<T>::with_mut
  14: tokio::runtime::task::core::Core<T,S>::poll
  15: tokio::runtime::task::harness::Harness<T,S>::poll
  16: loom::cell::unsafe_cell::UnsafeCell<T>::with_mut
  17: tokio::runtime::task::UnownedTask<S>::run
  18: tokio::runtime::blocking::pool::Inner::run
  19: core::ops::function::FnOnce::call_once{{vtable.shim}}
  20: generator::stack::StackBox<F>::call_once
  21: generator::detail::gen::gen_init_impl
  22: generator::detail::asm::gen_init
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
worker thread panicking; aborting process
error: test failed, to rerun pass `--lib`

I haven't looked the entire code, but I suspect there's a data race caused by a concurrent write to an UnsafeCell somewhere.

@Noah-Kennedy
Copy link
Contributor Author

Hmm, it looks like the loom failure log is not showing up on ci? ... I captured this locally anyway:
logs

I haven't looked the entire code, but I suspect there's a data race caused by a concurrent write to an UnsafeCell somewhere.

One of the loom tests (pool_shutdown) is timing out because it goes on too long, probably because somehow the number of branches is exploding.

@Noah-Kennedy Noah-Kennedy force-pushed the noah/the-hookening branch 2 times, most recently from 75c21b5 to 8963b5d Compare March 7, 2025 19:40
@LloydW93
Copy link

LloydW93 commented Apr 29, 2025

Hmm, it looks like the loom failure log is not showing up on ci? ... I captured this locally anyway:
logs
I haven't looked the entire code, but I suspect there's a data race caused by a concurrent write to an UnsafeCell somewhere.

One of the loom tests (pool_shutdown) is timing out because it goes on too long, probably because somehow the number of branches is exploding.

This part of the problem I think is because the loom-compile CI workflow doesn't build with --release. I also see stack overflows in spawn_blocking_when_paused locally without the release target.

However, test runtime::tests::loom_multi_thread::group_a::only_blocking_with_pending is failing because two concurrent threads - one blocking and one non-blocking, are both running with the same task hooks, as they are inherited by default. As the inner type must be Send this implies that the operation is safe as long as we don't modify the Option itself. I think it's the possibility that we do that which loom is warning about, though we don't actually do this and so the usage is, I think, safe. Here's poll() for quick reference:

    /// Safety: mutual exclusion is required to call this function.
    pub(crate) fn poll(self) {
        #[cfg(tokio_unstable)]
        self.trailer().hooks.with_mut(|ptr| unsafe {
            let _guard = ptr.as_mut().and_then(|x| {
                x.as_mut().map(|x| {
                    let _ = panic::catch_unwind(panic::AssertUnwindSafe(|| {
                        x.before_poll(&mut BeforeTaskPollContext {
                            _phantom: Default::default(),
                        })
                    }));

                    set_task_hooks(NonNull::new(
                        (&mut **x) as *mut (dyn TaskHookHarness + Send + Sync + 'static),
                    ))
                })
            });

            let vtable = self.header().vtable;
            (vtable.poll)(self.ptr);
        });
        #[cfg(not(tokio_unstable))]
        unsafe {
            let vtable = self.header().vtable;
            (vtable.poll)(self.ptr);
        }
    }

As I think we only care about the mutable reference to the Option, that would mean we can take the poll out of the with_mut closure, and that seems to make the loom tests pass.

I've pushed that change, a ci.yml update for release builds in loom tests, and a rebase, in my fork at https://github.com/LloydW93/tokio/tree/lloyd/the-hookening - let me know if you just want to pull the commits into your branch or would rather a new PR.

LloydW93 added a commit to LloydW93/tokio that referenced this pull request Apr 29, 2025
In both CONTRIBUTING.md, and in loom's own documentation, it is
recommended to run loom's tests with the --release profile, as they
execute many permutations of the same test and therefore compilation
time benefits are worthwhile.

In tokio-rs#7197 we see this with some individual tests taking over 60s to
complete.
@Noah-Kennedy Noah-Kennedy force-pushed the noah/the-hookening branch from 8963b5d to de227f8 Compare May 2, 2025 20:01
This change overhauls the entire task hooks system so that users can propagate arbitrary information between task hook invocations and pass context data between the hook "harnesses" for parent and child tasks at time of spawn.

This is intended to be significantly more extensible and long-term maintainable than the current task hooks system, and should ultimately be much easier to stabilize.
@Noah-Kennedy Noah-Kennedy force-pushed the noah/the-hookening branch from de227f8 to 0095c7f Compare May 2, 2025 20:01
@Noah-Kennedy
Copy link
Contributor Author

Hmm, it looks like the loom failure log is not showing up on ci? ... I captured this locally anyway:
logs
I haven't looked the entire code, but I suspect there's a data race caused by a concurrent write to an UnsafeCell somewhere.

One of the loom tests (pool_shutdown) is timing out because it goes on too long, probably because somehow the number of branches is exploding.

This part of the problem I think is because the loom-compile CI workflow doesn't build with --release. I also see stack overflows in spawn_blocking_when_paused locally without the release target.

However, test runtime::tests::loom_multi_thread::group_a::only_blocking_with_pending is failing because two concurrent threads - one blocking and one non-blocking, are both running with the same task hooks, as they are inherited by default. As the inner type must be Send this implies that the operation is safe as long as we don't modify the Option itself. I think it's the possibility that we do that which loom is warning about, though we don't actually do this and so the usage is, I think, safe. Here's poll() for quick reference:

    /// Safety: mutual exclusion is required to call this function.
    pub(crate) fn poll(self) {
        #[cfg(tokio_unstable)]
        self.trailer().hooks.with_mut(|ptr| unsafe {
            let _guard = ptr.as_mut().and_then(|x| {
                x.as_mut().map(|x| {
                    let _ = panic::catch_unwind(panic::AssertUnwindSafe(|| {
                        x.before_poll(&mut BeforeTaskPollContext {
                            _phantom: Default::default(),
                        })
                    }));

                    set_task_hooks(NonNull::new(
                        (&mut **x) as *mut (dyn TaskHookHarness + Send + Sync + 'static),
                    ))
                })
            });

            let vtable = self.header().vtable;
            (vtable.poll)(self.ptr);
        });
        #[cfg(not(tokio_unstable))]
        unsafe {
            let vtable = self.header().vtable;
            (vtable.poll)(self.ptr);
        }
    }

As I think we only care about the mutable reference to the Option, that would mean we can take the poll out of the with_mut closure, and that seems to make the loom tests pass.

I've pushed that change, a ci.yml update for release builds in loom tests, and a rebase, in my fork at https://github.com/LloydW93/tokio/tree/lloyd/the-hookening - let me know if you just want to pull the commits into your branch or would rather a new PR.

Thanks Lloyd for digging into this!

I've applied your suggestions and that fixed the loom issues.

@Noah-Kennedy Noah-Kennedy enabled auto-merge (squash) May 4, 2025 17:23
@Darksonn
Copy link
Contributor

Darksonn commented May 5, 2025

Can you make a module for all the hook types and traits so we don't pollute tokio::runtime?

@carllerche
Copy link
Member

This is a very involved change, including non-trivial changes to runtime internals and complex APIs. Before moving forward with this PR, we need to start with a proposal that explains the motivation, problem being solved, etc. I don't know what problem this is trying to solve.

@Noah-Kennedy
Copy link
Contributor Author

Documented the motivation in #7306

@Noah-Kennedy Noah-Kennedy disabled auto-merge May 5, 2025 16:57
@Noah-Kennedy Noah-Kennedy enabled auto-merge (squash) May 5, 2025 16:58
@Noah-Kennedy
Copy link
Contributor Author

Can you make a module for all the hook types and traits so we don't pollute tokio::runtime?

Sure!

@jlizen
Copy link
Member

jlizen commented May 6, 2025

Documented the motivation in #7306

Left my comments on the RFC, as they were around goals / high-level API rather than impl.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
R-loom-current-thread Run loom current-thread tests on this PR R-loom-multi-thread Run loom multi-thread tests on this PR R-loom-multi-thread-alt Run loom multi-thread alt tests on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants