
Replace panic with error in snap sync #3176

Merged: 2 commits from fix-snap-sync-panic into main, Oct 28, 2024
Conversation

@teor2345 (Member) commented Oct 28, 2024

Panicking in async code causes a race condition that hangs the node roughly half the time on my machine. (The other half of the time it exits with an error.)

As part of this race, the Ctrl-C handler is dropped, because it is an async future running on the same executor, so the user can't use Ctrl-C to exit.

The only way I found to exit the process was `kill -KILL`; a SIGTERM was ignored.
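A minimal sketch of the failure mode and the fix, assuming a tokio-based runtime (the task and error here are hypothetical stand-ins, not the actual Subspace code):

```rust
use tokio::signal;

// Hypothetical stand-in for the snap sync task.
async fn sync_task() -> Result<(), anyhow::Error> {
    // Before: a panic!(...) here unwound inside the executor and raced
    // with the teardown of other tasks, including the Ctrl-C handler.
    // After: return an error so the caller controls shutdown.
    Err(anyhow::anyhow!("no segment headers received from the network"))
}

#[tokio::main]
async fn main() -> Result<(), anyhow::Error> {
    tokio::select! {
        // The sync error propagates and the process exits deterministically.
        res = sync_task() => res?,
        // The Ctrl-C future stays alive because nothing panics around it.
        _ = signal::ctrl_c() => {}
    }
    Ok(())
}
```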

Closes #3175.


@teor2345 added the bug (Something isn't working) and node (Node: service library/node app) labels on Oct 28, 2024
@teor2345 self-assigned this on Oct 28, 2024
@nazar-pc (Member) left a comment:

Thanks!

panic!("Gossibsub protocol is disabled.");
panic!("Gossipsub protocol is disabled.");
@teor2345 (Member, Author):
This is the other place where we panic in async code. Is it a logic error, or should we replace it with returning an error to the (async) caller?
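For reference, a hedged sketch of the second option, returning an error to the (async) caller instead of panicking; the error type and function name here are hypothetical:

```rust
// Hypothetical error for the disabled-protocol case.
#[derive(Debug, thiserror::Error)]
enum GossipsubError {
    #[error("Gossipsub protocol is disabled")]
    Disabled,
}

// Surface the condition to the async caller instead of panicking.
async fn subscribe_to_topic(gossipsub_enabled: bool) -> Result<(), GossipsubError> {
    if !gossipsub_enabled {
        // Previously: panic!("Gossipsub protocol is disabled.");
        return Err(GossipsubError::Disabled);
    }
    Ok(())
}
```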

@nazar-pc (Member):

This is a hard one. On one hand, this is not a local decision that can be made in this context; on the other hand, if we get to this point, it is such a fundamental application-composition error that I would say it justifies crashing the whole app, since a lot of things will not work properly if this is actually the case.

Ultimately, though, we should probably simply remove gossipsub until we actually use it (it can be enabled or disabled when the networking is instantiated, and it is currently disabled in 100% of cases, and has been for a very long time).

@teor2345 teor2345 added this pull request to the merge queue Oct 28, 2024
Merged via the queue into main with commit ffa02b1 Oct 28, 2024
8 checks passed
@teor2345 teor2345 deleted the fix-snap-sync-panic branch October 28, 2024 05:13
@shamil-gadelshin (Contributor):

This change contradicts the previous PR: #3044. Should we exit the process instead?

@nazar-pc (Member):

It doesn't, really. It is still an error; the previous version just assumed that receiving no segments means they don't exist, while in practice that might be caused by network issues.
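Illustratively (hypothetical types, not the actual Subspace API), an empty reply and a failed request are different outcomes, and the old code treated them the same:

```rust
// Stand-in for the real segment header type.
struct SegmentHeader;

enum SegmentHeadersOutcome {
    /// Peers answered and headers were received.
    Received(Vec<SegmentHeader>),
    /// Peers answered: no segments exist yet.
    Empty,
    /// The request failed: existence is unknown; retry or return an error.
    NetworkError(String),
}
```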

@shamil-gadelshin (Contributor):

We don't treat the new error as critical on the upper level, do we?

@nazar-pc (Member):

It'll still get stuck, I think, so it shouldn't make a lot of difference in practice.

@shamil-gadelshin (Contributor):

Not sure I follow. What I meant by my initial remark is that we have rolled back the PR that improved UX; if we want to remove the panic, then we should propagate and/or handle the error correctly.

@nazar-pc (Member):

It ended up being worse UX due to the hanging. I think what we should do is not swallow the error in snap_sync and instead return it up the stack, so the whole DSN sync task can exit; it is an essential task and will bring the whole node down with it.
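A sketch of that propagation, with hypothetical signatures rather than the actual Subspace code: snap_sync returns its error, and the DSN sync task forwards it so an essential-task runner can shut the node down.

```rust
// Hypothetical snap sync entry point that returns instead of panicking.
async fn snap_sync() -> Result<(), anyhow::Error> {
    Err(anyhow::anyhow!("snap sync failed: no segment headers"))
}

// The DSN sync task propagates the error instead of swallowing it.
async fn dsn_sync_task() -> Result<(), anyhow::Error> {
    // Not `let _ = snap_sync().await;`: the `?` forwards the failure,
    // so a runner that treats this task as essential can bring the
    // whole node down cleanly.
    snap_sync().await?;
    Ok(())
}
```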

@shamil-gadelshin (Contributor):

I mean the same thing when I suggest error propagation.

Labels: bug (Something isn't working), node (Node: service library/node app)
Projects: none yet
Development: merging this pull request may close: "Node hangs when started on broken network, after worker tries to shut down"
3 participants