
services err handling with backoff when overloaded #39

Closed

Conversation

@tzemanovic (Contributor) commented Nov 8, 2023

This PR adds:

  • service error handling with exponential backoff when overloaded, for services that require a response
  • dropping of info queries on errors, as those can be shed (returning an Exception response)

To achieve this, the socket's stream and sink must be held across await points to keep them alive, so I ended up wrapping them in std::sync::Mutex, even though access should always be exclusive anyway. Additionally, the request stream is made peekable so that requests that must be handled are not removed prematurely: a request is only popped once its service call has succeeded.
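For illustration, here is a rough sketch of the peek-then-retry idea, written against generic tower/futures traits rather than the actual tower-abci internals; the function name run_one, the backoff constants, and the generic request/service types are placeholders, not the PR's code:

```rust
use std::pin::Pin;
use std::time::Duration;

use futures::stream::{Peekable, Stream, StreamExt};
use tower::{Service, ServiceExt};

/// Drive one request from a peekable stream to completion, retrying with
/// exponential backoff while the service call fails (e.g. it is overloaded).
/// The request is only popped from the stream once a call has succeeded.
async fn run_one<St, S>(
    requests: &mut Peekable<St>,
    service: &mut S,
) -> Result<Option<S::Response>, S::Error>
where
    St: Stream + Unpin,
    St::Item: Clone,
    S: Service<St::Item>,
{
    // Peek rather than pop: if the call fails we still hold the request
    // and can retry it after backing off.
    let Some(request) = Pin::new(&mut *requests).peek().await.cloned() else {
        return Ok(None);
    };

    let mut delay = Duration::from_millis(10);
    loop {
        // Wait until the service reports readiness, then issue the call.
        match service.ready().await?.call(request.clone()).await {
            Ok(response) => {
                // The call succeeded: only now remove the request.
                let _ = requests.next().await;
                return Ok(Some(response));
            }
            Err(_overloaded) => {
                // Back off and retry the same (still-peeked) request.
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(1));
            }
        }
    }
}
```

The point of peeking is that next() is only called after a successful call, so an overloaded service never loses the request it failed to serve.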

@tzemanovic marked this pull request as ready for review on November 8, 2023, 09:44
@hdevalence (Member) commented

Why does this need to be added to the socket handling in tower-abci? How does tower-abci know what the correct behavior for the application should be? The original design was to allow applications to layer in additional behaviors with Tower middleware:

Because each category of requests is handled by a different service, request behavior can be customized on a per-category basis using Tower Layers. For instance, load-shedding can be added to InfoRequests but not ConsensusRequests, or different categories can have different timeout policies, or different types of instrumentation.

Does that not work? Should we document it better?
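For reference, the per-category layering described above looks roughly like the following sketch; it assumes the application already has consensus, mempool, info, and snapshot tower Services, that tower's load-shed and timeout features are enabled, and that the Server builder path matches the tower-abci version in use (all names here are illustrative):

```rust
use std::time::Duration;

use tower::ServiceBuilder;

// In the application's setup code (e.g. inside main), before starting the
// ABCI server. Load-shedding and a timeout are layered onto the info
// service only; the consensus service is used as-is.
let info = ServiceBuilder::new()
    .load_shed()                     // fail fast instead of queueing when busy
    .timeout(Duration::from_secs(1)) // bound the latency of info queries
    .service(info);

// The builder path and method names depend on the tower-abci version.
let server = Server::builder()
    .consensus(consensus)
    .mempool(mempool)
    .info(info)
    .snapshot(snapshot)
    .finish()
    .unwrap();
```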

@tzemanovic (Contributor, Author) commented

@hdevalence unless I'm missing something, I don't think there is any way to handle this in an app. When a service call fails, the error ends up panicking on the unwrap in tokio::spawn(async move { conn.run(read, write).await.unwrap() });. For the mempool and consensus services, there is no other option than to retry later. If there is no response, or the response is an exception, CometBFT crashes, so the failure never even reaches the app; that's why I think it makes sense to handle overload here. For the info service, there is more flexibility, so maybe we could provide some way to choose how to handle its errors.

Unrelated: I realized that the changes here force a flush after every request, which is not ideal, so I'll try to improve that.

@xla (Contributor) commented Nov 10, 2023

Surprised by this change as well. Besides @hdevalence's point that this is probably not the right point of integration, I'd like to understand the motivation better. Could you outline a scenario where backoff/retry improves QoS between the app and CometBFT while not violating the protocol?

@tzemanovic (Contributor, Author) commented

@xla Say you, for example, set .load_shed() on the Info service in an app. The call to the Service is issued by tower-abci, and when it fails with tower::load_shed::error::Overloaded, Connection::run returns the error and subsequently panics on the unwrap in conn.run(read, write).await.unwrap(). As this happens internally, there is no way for an app to handle it (even if it were catching panics, the sockets get closed, so there is no way to recover).
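To make that failure path concrete, here is a hedged sketch with placeholder names (call_with_shed, info_service, InfoRequest): the load-shed wrapper turns overload into a BoxError containing Overloaded, and it is that same error which Connection::run returns and the spawned connection task unwraps on.

```rust
use tower::load_shed::error::Overloaded;
use tower::{BoxError, Service, ServiceBuilder, ServiceExt};

// Wrap a (placeholder) info service in load_shed and look at the error
// that comes back under overload. Inside tower-abci the equivalent BoxError
// is returned from Connection::run and then hit by the `.unwrap()` in the
// spawned connection task, so the application never observes it.
async fn call_with_shed<S, InfoRequest>(
    info_service: S,
    request: InfoRequest,
) -> Result<S::Response, BoxError>
where
    S: Service<InfoRequest>,
    S::Error: Into<BoxError>,
{
    let mut info = ServiceBuilder::new().load_shed().service(info_service);
    match info.ready().await?.call(request).await {
        Ok(response) => Ok(response),
        Err(err) if err.is::<Overloaded>() => {
            // This is tower::load_shed::error::Overloaded; in tower-abci it
            // currently turns into a panic rather than something the app
            // gets a chance to handle.
            Err(err)
        }
        Err(other) => Err(other),
    }
}
```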

@tzemanovic (Contributor, Author) commented

Looking back at the code, I think the most reasonable thing would probably be to only shed info requests (and maybe also snapshot requests) on overload, by returning an exception response without any backoff. For the mempool service, load_shed doesn't seem to be of much use, since not responding to its requests makes CometBFT crash.
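A rough sketch of that direction (shed by returning an exception, no backoff), using tendermint-rs response types; the module paths vary between tendermint-rs versions, and handle_info/exception are illustrative helpers, not tower-abci's actual code:

```rust
use tendermint::abci::response;
use tendermint::v0_37::abci::Response;
use tower::{BoxError, Service, ServiceExt};

// Call the Info service; on any error (e.g. load-shed overload), answer
// CometBFT with an Exception response instead of propagating the error and
// tearing the connection down. `InfoRequest` is a placeholder type.
async fn handle_info<S, InfoRequest>(info: &mut S, request: InfoRequest) -> Response
where
    S: Service<InfoRequest>,
    S::Response: Into<Response>,
    S::Error: Into<BoxError>,
{
    match info.ready().await {
        Ok(svc) => match svc.call(request).await {
            Ok(rsp) => rsp.into(),
            Err(err) => exception(err.into()),
        },
        Err(err) => exception(err.into()),
    }
}

// Turn a service error into an ABCI Exception response, keeping the
// connection alive instead of panicking on it.
fn exception(err: BoxError) -> Response {
    Response::Exception(response::Exception {
        error: err.to_string(),
    })
}
```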

@tzemanovic (Contributor, Author) commented

closing in favor of #40

@tzemanovic closed this on Nov 11, 2023