
object store: retry / recover after partially reading a streaming response ( fix timeout errors / error decoding response body ) #15


Open
alamb opened this issue Mar 6, 2025 · 5 comments
Labels
enhancement New feature or request



alamb commented Mar 6, 2025

Problem Description

If a request fails mid-stream (after we have begun to read data), it is not retried; instead an error is returned. The error message, somewhat confusingly, often says "error decoding response body".

Some examples:

ExternalError(General("ParquetObjectReader::get_byte_ranges error: Generic MicrosoftAzure error: error decoding response body"))

Generic S3 error: error decoding response body

Workaround

You can often work around this error by increasing the network timeout to something longer than the 30 second default.
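As a configuration sketch (assuming the `object_store` crate's `ClientOptions` API with an S3 store; the bucket name and timeout value are illustrative, not recommendations):

```rust
use std::time::Duration;

use object_store::aws::AmazonS3Builder;
use object_store::ClientOptions;

fn build_store() -> Result<object_store::aws::AmazonS3, Box<dyn std::error::Error>> {
    // Raise the overall request timeout well above the 30s default so that
    // large streaming reads have time to complete.
    let options = ClientOptions::new().with_timeout(Duration::from_secs(300));

    let store = AmazonS3Builder::from_env()
        .with_bucket_name("my-bucket") // hypothetical bucket
        .with_client_options(options)
        .build()?;
    Ok(store)
}
```

Note this only papers over the problem: a sufficiently slow network or large request will still hit whatever timeout you pick.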

Related Tickets

As @crepererum says on #272 :

So long story short: People agree that this would be a good feature to have, but it requires a proper implementation.

Background

Streaming ✅

Some APIs like ObjectStore::get are "streaming" in the sense that they begin returning data as soon as it arrives from the network (as opposed to buffering the entire response before returning to the caller).

This is great for performance: response processing can begin immediately, and memory usage stays bounded for large payloads 🏆

Retries ✅

In order to deal with the intermittent errors that occur while processing object store requests, most ObjectStore implementations retry the request if they encounter an error (see retry.rs)

Retries + Streaming ❌

However, there is a problem when streaming is mixed with the existing retries. Specifically, if a request fails mid-stream (after some, but not all, of the data has been returned to the client), simply retrying the entire request isn't enough: the client could be handed data from the start of the response that it has already received.

Solution

Describe the solution you'd like

Implementing retries for streaming reads needs something more sophisticated, such as re-issuing the request only for the bytes that have not already been read.

Any solution for this I think needs:

  1. Very good tests / clear documentation
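To make the shape of such a retry concrete, here is a minimal sketch of the core bookkeeping (the function name and signature are hypothetical, not part of the crate's API):

```rust
/// Hypothetical helper: given the half-open byte range `[start, end)` of the
/// original request and the number of bytes already delivered to the caller,
/// return the range a resuming retry should request, or `None` if the
/// request already completed.
fn resume_range(start: u64, end: u64, delivered: u64) -> Option<(u64, u64)> {
    let resume_at = start + delivered;
    if resume_at >= end {
        None // everything was delivered; nothing left to retry
    } else {
        Some((resume_at, end))
    }
}
```

For example, a request for bytes `[0, 200)` that fails after delivering 10 bytes would be resumed as a request for `[10, 200)`.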

Describe alternatives you've considered

@crepererum suggests on #272 :

retrying would need to make a new request with a new range starting after the last received byte and ideally also an ETAG/version check to ensure that the object that is returned by the retry is the one that was already "in flight". This retry mechanic is obviously chaining/nested, i.e. if the retry fails mid-stream, you wanna have yet another retry that picks up where the previous one ended.
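A sketch of the state such a nested retry would have to carry, and the headers the follow-up request might send (names and shapes are hypothetical, not the crate's API):

```rust
/// Hypothetical resume state: where to pick up, and which object version the
/// partially consumed stream came from.
struct ResumeState {
    next_byte: u64, // first byte not yet delivered to the caller
    etag: String,   // ETag of the object already partially streamed
}

impl ResumeState {
    /// Headers for the follow-up request: an open-ended `Range` starting at
    /// the first missing byte, plus `If-Match` so the server fails the retry
    /// if the object changed since the original request.
    fn retry_headers(&self) -> Vec<(String, String)> {
        vec![
            ("Range".to_string(), format!("bytes={}-", self.next_byte)),
            ("If-Match".to_string(), self.etag.clone()),
        ]
    }
}
```

Because each resumed request builds a fresh `ResumeState` from the bytes delivered so far, the retries chain naturally: a retry that itself fails mid-stream simply produces the next `ResumeState`.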

@alamb alamb added the enhancement New feature or request label Mar 6, 2025
@alamb alamb changed the title object store: retry / recover after partially reading a streaming response object store: retry / recover after partially reading a streaming response ( fix timeout errors/error decoding response body ) Mar 18, 2025
@alamb alamb changed the title object store: retry / recover after partially reading a streaming response ( fix timeout errors/error decoding response body ) object store: retry / recover after partially reading a streaming response ( fix timeout errors / error decoding response body ) Mar 18, 2025
@alamb alamb self-assigned this Mar 18, 2025

alamb commented Mar 18, 2025

@crepererum had a good point that automatically requesting the remainder of a request that timed out would have the nice property of automatically adjusting to network conditions.

For example, if the network was fast the request would just proceed as normal. If the network was not as fast, that is ok too: the timeout would be hit and the request retried. And we wouldn't have to make any up-front determination about the network or request size.


alamb commented Mar 19, 2025

Copying some discussion with @ryzhyk from #14:

In terms of retry, the idea is that the retry doesn't re-fetch the entire request. Instead it would only retry the remaining bytes that had not yet been returned.

So let's say you had a 200 MB request but the network can only retrieve 10MB in 30s

  • The first request would fetch the first 10MB but time out
  • Then the retry would request the remaining 190MB
  • The second request would fetch the next 10MB and time out
  • Then the retry would request the remaining 180MB
  • .. and so on
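The sequence above can be modeled with a small simulation (a hypothetical helper, not crate code; units are MB for readability, and each attempt is assumed to deliver a fixed amount before timing out):

```rust
/// Simulate the resume loop: each attempt delivers at most `per_attempt`
/// units before timing out, and the follow-up request asks only for what
/// remains. Returns the (offset, remaining) pair of each request issued.
fn simulate_resumes(total: u64, per_attempt: u64) -> Vec<(u64, u64)> {
    let mut requests = Vec::new();
    let mut delivered = 0;
    while delivered < total {
        requests.push((delivered, total - delivered));
        delivered += per_attempt.min(total - delivered);
    }
    requests
}
```

For the 200MB example this issues 20 requests: `(0, 200)`, `(10, 190)`, `(20, 180)`, and so on, until the last request fetches the final 10MB.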


alamb commented Mar 20, 2025

Migrating from arrow-rs issue #7242

@alamb alamb transferred this issue from apache/arrow-rs Mar 20, 2025
@ion-elgreco
Contributor

Is someone working on this? I would like to take a jab at this

@crepererum crepererum assigned ion-elgreco and unassigned alamb Apr 14, 2025
@crepererum
Contributor

@ion-elgreco thank you :)
