
services err handling with backoff when overloaded #39

Closed

Conversation

@tzemanovic (Contributor) commented Nov 8, 2023

This PR adds:

  • service error handling with exponential backoff when overloaded, for services that require a response
  • dropping of info queries on errors, as those can be shed (returning an Exception response)

To achieve this, the socket's stream and sink must be held across await points to keep them alive, so I ended up wrapping them in std::sync::Mutex, even though access should always be exclusive anyway. Additionally, the request stream is made peekable so that requests that must be handled are not removed prematurely: a request is only popped once its service call has succeeded.
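For illustration, here is a rough sketch of the peek-then-retry idea, written against generic tower/futures traits rather than the actual tower-abci internals; the function name run_one, the backoff constants, and the generic request/service types are placeholders, not the PR's code:

```rust
use std::pin::Pin;
use std::time::Duration;

use futures::stream::{Peekable, Stream, StreamExt};
use tower::{Service, ServiceExt};

/// Drive one request from a peekable stream to completion, retrying with
/// exponential backoff while the service call fails (e.g. it is overloaded).
/// The request is only popped from the stream once a call has succeeded.
async fn run_one<St, S>(
    requests: &mut Peekable<St>,
    service: &mut S,
) -> Result<Option<S::Response>, S::Error>
where
    St: Stream + Unpin,
    St::Item: Clone,
    S: Service<St::Item>,
{
    // Peek rather than pop: if the call fails we still hold the request
    // and can retry it after backing off.
    let Some(request) = Pin::new(&mut *requests).peek().await.cloned() else {
        return Ok(None);
    };

    let mut delay = Duration::from_millis(10);
    loop {
        // Wait until the service reports readiness, then issue the call.
        match service.ready().await?.call(request.clone()).await {
            Ok(response) => {
                // The call succeeded: only now remove the request.
                let _ = requests.next().await;
                return Ok(Some(response));
            }
            Err(_overloaded) => {
                // Back off and retry the same (still-peeked) request.
                tokio::time::sleep(delay).await;
                delay = (delay * 2).min(Duration::from_secs(1));
            }
        }
    }
}
```

The point of peeking is that next() is only called after a successful call, so an overloaded service never loses the request it failed to serve.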

@tzemanovic marked this pull request as ready for review on November 8, 2023, 09:44
@hdevalence (Member) commented

Why does this need to be added to the socket handling in tower-abci? How does tower-abci know what the correct behavior for the application should be? The original design was to allow applications to layer in additional behaviors with Tower middleware:

Because each category of requests is handled by a different service, request behavior can be customized on a per-category basis using Tower Layers. For instance, load-shedding can be added to InfoRequests but not ConsensusRequests, or different categories can have different timeout policies, or different types of instrumentation.

Does that not work? Should we document it better?
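For reference, the per-category layering described above looks roughly like the following sketch; it assumes the application already has consensus, mempool, info, and snapshot tower Services, that tower's load-shed and timeout features are enabled, and that the Server builder path matches the tower-abci version in use (all names here are illustrative):

```rust
use std::time::Duration;

use tower::ServiceBuilder;

// In the application's setup code (e.g. inside main), before starting the
// ABCI server. Load-shedding and a timeout are layered onto the info
// service only; the consensus service is used as-is.
let info = ServiceBuilder::new()
    .load_shed()                     // fail fast instead of queueing when busy
    .timeout(Duration::from_secs(1)) // bound the latency of info queries
    .service(info);

// The builder path and method names depend on the tower-abci version.
let server = Server::builder()
    .consensus(consensus)
    .mempool(mempool)
    .info(info)
    .snapshot(snapshot)
    .finish()
    .unwrap();
```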

@tzemanovic (Contributor, Author) commented

@hdevalence unless I'm missing something, I don't think there is any way to handle this in an app. When a service call fails, the error ends up panicking on the unwrap in tokio::spawn(async move { conn.run(read, write).await.unwrap() });. For the mempool and consensus services, there is no other option than to retry later. If there is no response, or the response is an exception, CometBFT crashes, so the failure never even reaches the app; that's why I think it makes sense to handle overload here. For the info service, there is more flexibility, so maybe we could provide some way to choose how to handle its errors.

Unrelated: I realized that the changes here force a flush after every request, which is not ideal, so I'll try to improve that.

@xla (Contributor) commented Nov 10, 2023

Surprised by this change as well. Besides @hdevalence's point that this is probably not the right point of integration, I'd like to understand the motivation better. Could you outline a scenario where backoff/retry improves QoS between the app and CometBFT while not violating the protocol?

@tzemanovic (Contributor, Author) commented

@xla Say you, for example, set .load_shed() on the Info service in an app. The call to the Service is issued by tower-abci, and when it fails with tower::load_shed::error::Overloaded, Connection::run returns the error and subsequently panics on the unwrap in conn.run(read, write).await.unwrap(). As this happens internally, there is no way for an app to handle it (even if it were catching panics, the sockets get closed, so there is no way to recover).
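To make that failure path concrete, here is a hedged sketch with placeholder names (call_with_shed, info_service, InfoRequest): the load-shed wrapper turns overload into a BoxError containing Overloaded, and it is that same error which Connection::run returns and the spawned connection task unwraps on.

```rust
use tower::load_shed::error::Overloaded;
use tower::{BoxError, Service, ServiceBuilder, ServiceExt};

// Wrap a (placeholder) info service in load_shed and look at the error
// that comes back under overload. Inside tower-abci the equivalent BoxError
// is returned from Connection::run and then hit by the `.unwrap()` in the
// spawned connection task, so the application never observes it.
async fn call_with_shed<S, InfoRequest>(
    info_service: S,
    request: InfoRequest,
) -> Result<S::Response, BoxError>
where
    S: Service<InfoRequest>,
    S::Error: Into<BoxError>,
{
    let mut info = ServiceBuilder::new().load_shed().service(info_service);
    match info.ready().await?.call(request).await {
        Ok(response) => Ok(response),
        Err(err) if err.is::<Overloaded>() => {
            // This is tower::load_shed::error::Overloaded; in tower-abci it
            // currently turns into a panic rather than something the app
            // gets a chance to handle.
            Err(err)
        }
        Err(other) => Err(other),
    }
}
```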

@tzemanovic (Contributor, Author) commented

Looking back at the code, I think the most reasonable thing would probably be to only shed info requests (and maybe also snapshot requests) on overload, by returning an exception response without any backoff. For the mempool service, load_shed doesn't seem to be of much use, since not responding to its requests makes CometBFT crash.
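A rough sketch of that direction (shed by returning an exception, no backoff), using tendermint-rs response types; the module paths vary between tendermint-rs versions, and handle_info/exception are illustrative helpers, not tower-abci's actual code:

```rust
use tendermint::abci::response;
use tendermint::v0_37::abci::Response;
use tower::{BoxError, Service, ServiceExt};

// Call the Info service; on any error (e.g. load-shed overload), answer
// CometBFT with an Exception response instead of propagating the error and
// tearing the connection down. `InfoRequest` is a placeholder type.
async fn handle_info<S, InfoRequest>(info: &mut S, request: InfoRequest) -> Response
where
    S: Service<InfoRequest>,
    S::Response: Into<Response>,
    S::Error: Into<BoxError>,
{
    match info.ready().await {
        Ok(svc) => match svc.call(request).await {
            Ok(rsp) => rsp.into(),
            Err(err) => exception(err.into()),
        },
        Err(err) => exception(err.into()),
    }
}

// Turn a service error into an ABCI Exception response, keeping the
// connection alive instead of panicking on it.
fn exception(err: BoxError) -> Response {
    Response::Exception(response::Exception {
        error: err.to_string(),
    })
}
```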

@tzemanovic (Contributor, Author) commented

closing in favor of #40

@tzemanovic closed this on Nov 11, 2023