Add a shutdown watchdog to subspace binaries #3170

Closed
wants to merge 6 commits into from

Conversation

teor2345
Member

@teor2345 teor2345 commented Oct 25, 2024

Sometimes, subspace binaries don't exit, because an async or spawn_blocking() task is stuck (or is taking a very long time). I've seen both the node and farmer do this recently.

This happens because tokio's runtime gets dropped at the end of tokio::main, and the drop waits for async tasks to yield and for blocking tasks to finish, however long that takes.

Having a node or farmer stuck forever is bad user experience. It could also make scripted nodes hang, rather than shutting down and restarting.

This PR adds a shutdown watchdog which makes shutdown more user friendly:

  • if the user presses Ctrl-C a second time, the process exits immediately
  • if shutdown takes more than a minute, the process exits
  • if tokio's taskdump config is enabled, the watchdog logs stack traces of stuck tasks
  • before shutting down, the watchdog logs the process ID, so the user can run a command to collect stack traces of all threads
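
A minimal sketch of the timeout part of such a watchdog, using only the standard library. This is not the code from this PR: the names `WatchdogOutcome`, `wait_for_shutdown`, and `spawn_shutdown_watchdog` are hypothetical, and the channel-based "disarm" signal is an assumption made for illustration. The watchdog runs on its own OS thread, so it does not depend on the async runtime making progress.

```rust
use std::process;
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

/// Result of waiting for shutdown (hypothetical type, not from the PR).
#[derive(Debug, PartialEq, Eq)]
enum WatchdogOutcome {
    /// Shutdown was reported, or the notifier was dropped, before the deadline.
    Finished,
    /// The deadline passed while shutdown was still in progress.
    TimedOut,
}

/// Blocks until shutdown is reported on `done` or `timeout` elapses.
fn wait_for_shutdown(done: &Receiver<()>, timeout: Duration) -> WatchdogOutcome {
    match done.recv_timeout(timeout) {
        Ok(()) | Err(RecvTimeoutError::Disconnected) => WatchdogOutcome::Finished,
        Err(RecvTimeoutError::Timeout) => WatchdogOutcome::TimedOut,
    }
}

/// Starts the watchdog on a dedicated OS thread.
/// Send `()` on the returned sender (or drop it) to disarm the watchdog.
fn spawn_shutdown_watchdog(timeout: Duration) -> Sender<()> {
    let (done_tx, done_rx) = mpsc::channel();
    thread::spawn(move || {
        // Log the process ID so the user can collect thread stack traces externally.
        eprintln!("Shutting down, process ID: {}", process::id());
        if wait_for_shutdown(&done_rx, timeout) == WatchdogOutcome::TimedOut {
            eprintln!("Shutdown took longer than {:?}, exiting", timeout);
            process::exit(1);
        }
    });
    done_tx
}
```

In this sketch, the binary would call `spawn_shutdown_watchdog(Duration::from_secs(60))` once it starts exiting and keep the returned sender alive until shutdown completes. The second Ctrl-C and taskdump features from the list above are not covered here.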

I'm open to changing the shutdown timeout, or any of these other features.

We could replace process::exit() with Runtime::shutdown_timeout(), but that would be a bigger refactor, because we'd need to replace tokio::main with runtime::Builder. And we'd still need code for a second Ctrl-C and taskdumps.

scopeguard is already in our dependencies as a dependency of parking-lot.

Code contributor checklist:

@teor2345
Member Author

The rustsec action failed with:

Unexpected end of JSON input

https://github.com/autonomys/subspace/actions/runs/11511038141/job/32043824086#step:4:29

So I restarted it.

Member

@nazar-pc nazar-pc left a comment

I have not seen farmer or node hanging for a long time, but we do block process exit to allow all background operations to finish what they are doing and exit gracefully.

If we have bugs related to this (which I'm not 100% convinced we do) then we need to fix them.

Can you provide more details when and how this happened to you and how I can hopefully reproduce it?

> The rustsec action failed with:

This was fixed yesterday in #3168. Just ignore it for now; it doesn't prevent merging.

@teor2345
Member Author

> I have not seen farmer or node hanging for a long time, but we do block process exit to allow all background operations to finish what they are doing and exit gracefully.
>
> If we have bugs related to this (which I'm not 100% convinced we do) then we need to fix them.
>
> Can you provide more details when and how this happened to you and how I can hopefully reproduce it?

Sure! Part of my reason for writing this code was to get the async stack traces that were blocking shutdown, and try and diagnose it myself. My next step was to try and reproduce it myself, and open a ticket with detailed instructions.

I was running a devnet build on devnet after we’d shut some parts of it down, and getting a whole bunch of failures. Both the node and farmer hung. So it’s also possible that these kinds of bugs happen when there’s no network, or malfunctioning or missing nodes in the network.

I’ll open a ticket later today.

@teor2345
Member Author

I don’t think this kind of implementation will work, because we have to install the shutdown watchdog after we start exiting, but before the first shutdown or drop operation that could hang. That point is tricky to find, and hard to test for.

I think we’d be better off putting our effort into fixing specific hangs like #3175 and #3178.

It might be a good idea to install our Ctrl-C handler on a thread or separate runtime, so it keeps working even during shutdown. But that’s already an edge case; shutdown should just work.
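
A rough sketch of the "handler on its own thread" idea, using only the standard library. The names `ShutdownAction`, `action_for_interrupt`, and `spawn_interrupt_handler` are hypothetical. Feeding interrupts through a channel is also an assumption: in a real binary the events would come from an actual signal handler, and `ShutdownAction::Immediate` would presumably map to `process::exit()`.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread::{self, JoinHandle};

/// What the handler thread asks the process to do (hypothetical type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ShutdownAction {
    /// First interrupt: start a normal, graceful shutdown.
    Graceful,
    /// Second or later interrupt: stop waiting and exit right away.
    Immediate,
}

/// Maps the number of interrupts seen so far (starting at 1) to an action.
fn action_for_interrupt(count: usize) -> ShutdownAction {
    if count <= 1 {
        ShutdownAction::Graceful
    } else {
        ShutdownAction::Immediate
    }
}

/// Runs on a dedicated OS thread, so it keeps reacting to interrupts even if
/// the main runtime is stuck while shutting down.
fn spawn_interrupt_handler(
    interrupts: Receiver<()>,
    actions: Sender<ShutdownAction>,
) -> JoinHandle<()> {
    thread::spawn(move || {
        let mut count = 0;
        for () in interrupts {
            count += 1;
            let action = action_for_interrupt(count);
            // Stop once nobody is listening, or once an immediate exit was requested.
            if actions.send(action).is_err() || action == ShutdownAction::Immediate {
                break;
            }
        }
    })
}
```

Because the thread only blocks on its own channel, a hang in the rest of the process does not stop it from seeing a second interrupt.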

@teor2345 teor2345 closed this Oct 31, 2024
@teor2345 teor2345 deleted the better-shutdown branch December 10, 2024 04:39
2 participants