
Replace panic with error in snap sync #3176

Merged: 2 commits from fix-snap-sync-panic into main, Oct 28, 2024
Conversation

@teor2345 (Member) commented Oct 28, 2024

Panicking in async code causes a race condition that hangs the node roughly half the time on my machine. (The other half of the time it exits with an error.)

As part of this race, the Ctrl-C handler is dropped, because it is an async future running on the same executor, so the user can't use Ctrl-C to exit.

The only way I found to exit the process was `kill -KILL`; a SIGTERM was ignored.
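A minimal sketch of the failure mode and the fix, assuming a tokio-based runtime (the task and error here are hypothetical stand-ins, not the actual Subspace code):

```rust
use tokio::signal;

// Hypothetical stand-in for the snap sync task.
async fn sync_task() -> Result<(), anyhow::Error> {
    // Before: a panic!(...) here unwound inside the executor and raced
    // with the teardown of other tasks, including the Ctrl-C handler.
    // After: return an error so the caller controls shutdown.
    Err(anyhow::anyhow!("no segment headers received from the network"))
}

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    tokio::select! {
        // The sync error propagates and the process exits deterministically.
        res = sync_task() => res?,
        // The Ctrl-C future stays alive because nothing panics around it.
        _ = signal::ctrl_c() => {}
    }
    Ok(())
}
```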

Closes #3175.


@teor2345 added the bug (Something isn't working) and node (Node: service library/node app) labels on Oct 28, 2024
@teor2345 self-assigned this on Oct 28, 2024
@nazar-pc (Member) left a comment:

Thanks!

panic!("Gossibsub protocol is disabled.");
panic!("Gossipsub protocol is disabled.");
@teor2345 (Member, Author):
This is the other place where we panic in async code. Is it a logic error, or should we replace it with returning an error to the (async) caller?
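For reference, a hedged sketch of the second option, returning an error to the (async) caller instead of panicking; the error type and function name here are hypothetical:

```rust
// Hypothetical error for the disabled-protocol case.
#[derive(Debug, thiserror::Error)]
enum GossipsubError {
    #[error("Gossipsub protocol is disabled")]
    Disabled,
}

// Surface the condition to the async caller instead of panicking.
async fn subscribe_to_topic(gossipsub_enabled: bool) -> Result<(), GossipsubError> {
    if !gossipsub_enabled {
        // Previously: panic!("Gossipsub protocol is disabled.");
        return Err(GossipsubError::Disabled);
    }
    Ok(())
}
```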

@nazar-pc (Member):

This is a hard one. On one hand, this is not a local decision that can be made in this context; on the other hand, if we get to this point, it is such a fundamental application-composition error that I would say it justifies crashing the whole app, since a lot of things will not work properly if this is actually the case.

Ultimately, though, we should probably simply remove gossipsub until we actually use it (it can be enabled or disabled when the networking is instantiated, and it is currently disabled in 100% of cases, and has been for a very long time).

@teor2345 teor2345 added this pull request to the merge queue Oct 28, 2024
Merged via the queue into main with commit ffa02b1 Oct 28, 2024
8 checks passed
@teor2345 teor2345 deleted the fix-snap-sync-panic branch October 28, 2024 05:13
@shamil-gadelshin (Contributor):

This change contradicts the previous PR: #3044. Should we exit the process instead?

@nazar-pc (Member):

It doesn't, really. It is still an error; the previous version just assumed that receiving no segments means they don't exist, while in practice that might be caused by network issues.
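Illustratively (hypothetical types, not the actual Subspace API), an empty reply and a failed request are different outcomes, and the old code treated them the same:

```rust
// Stand-in for the real segment header type.
struct SegmentHeader;

enum SegmentHeadersOutcome {
    /// Peers answered and headers were received.
    Received(Vec<SegmentHeader>),
    /// Peers answered: no segments exist yet.
    Empty,
    /// The request failed: existence is unknown; retry or return an error.
    NetworkError(String),
}
```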

@shamil-gadelshin (Contributor):

We don't treat the new error as critical on the upper level, do we?

@nazar-pc (Member):

It'll still get stuck, I think, so it shouldn't make a lot of difference in practice.

@shamil-gadelshin (Contributor):

Not sure I follow. What I meant by my initial remark is that we have rolled back the PR that improved UX; if we want to remove the panic, then we should propagate and/or handle the error correctly.

@nazar-pc (Member):

It ended up being worse UX due to the hanging. I think what we should do is not swallow the error in snap_sync and instead return it up the stack, so the whole DSN sync task can exit; it is an essential task and will bring the whole node down with it.
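A sketch of that propagation, with hypothetical signatures rather than the actual Subspace code: snap_sync returns its error, and the DSN sync task forwards it so an essential-task runner can shut the node down.

```rust
// Hypothetical snap sync entry point that returns instead of panicking.
async fn snap_sync() -> Result<(), anyhow::Error> {
    Err(anyhow::anyhow!("snap sync failed: no segment headers"))
}

// The DSN sync task propagates the error instead of swallowing it.
async fn dsn_sync_task() -> Result<(), anyhow::Error> {
    // Not `let _ = snap_sync().await;`: the `?` forwards the failure,
    // so a runner that treats this task as essential can bring the
    // whole node down cleanly.
    snap_sync().await?;
    Ok(())
}
```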

@shamil-gadelshin (Contributor):

I mean the same thing when I suggest error propagation.

Labels: bug (Something isn't working), node (Node: service library/node app)
Projects: none yet
Development: merging this pull request may close: "Node hangs when started on broken network, after worker tries to shut down"
3 participants