-
Notifications
You must be signed in to change notification settings - Fork 145
Decouple blocking content validation from main Overlay Service loop #342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hahaha, yup definitely seems like this is going to be an issue. Thanks for typing this up. I'm almost done with purely infura based validation, which I'll try to wrap up quickly then I'll spend some time on this issue |
@mrferris Digging into this one now. Feels like a separate "validate and store content" thread is a suitable solution (for this issue and #340). But want to point out that in #345 @ogenev introduced an |
Yeah, this tokio::select loop pattern may just be making our life difficult. Trin's event-loop handling here should probably be spawning all of its handling code as non-blocking right? If tokio::select isn't good at that, maybe we should look at a different approach. Rather than figure out one-by-one how to offload it into a different thread. |
This thought relates to content validation code that was merged in #331. Let me know if I'm wrong, or if this is a duplicate/already discussed.
I think that we should remove any blocking async code in the main
overlay_service.rs
loop, which currently occurs here. This line'sawait
was added in order to call the chain ofprocess_content
->process_received_content
->validate_content
, wherevalidate_content
is a blocking async function that makes this entire chain of functions need to be async.The reasoning for avoiding this is that as long as this function is being
await
ed on, theselect!
loop cannot proceed. This means that no queries can progress, requests from the overlay cannot be acted upon, our responses to other nodes cannot happen (leading to peer requests timing out / us being marked disconnected), bucket refreshes can't happen, etc.Under nominal conditions this shouldn't block for long, but an infura request taking longer than expected would turn our node into a brick for that time. And later it gets even worse if we need to request a new accumulator snapshot in order to validate data: the content query to do that won't be able to progress, leaving us deadlocked unable to get the content that we need in order to validate.
An alternative would be to send our unvalidated content to a dedicated content validation thread, which will then store it if valid.
The text was updated successfully, but these errors were encountered: