
object store: retry / recover after partially reading a streaming response ( fix timeout errors / error decoding response body ) #15


Open
alamb opened this issue Mar 6, 2025 · 5 comments
Labels
enhancement New feature or request



alamb commented Mar 6, 2025

Problem Description

If a request fails mid-stream (after we have begun to read data), it is not retried; instead an error is returned. The error message, somewhat confusingly, often says "error decoding response body".

Some examples:

ExternalError(General("ParquetObjectReader::get_byte_ranges error: Generic MicrosoftAzure error: error decoding response body"))

Generic S3 error: error decoding response body

Workaround

You can often work around this error by increasing the network timeout to something longer than the 30 second default.
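As a configuration sketch (assuming the `object_store` crate's `ClientOptions` API with an S3 store; the bucket name and timeout value are illustrative, not recommendations):

```rust
use std::time::Duration;

use object_store::aws::AmazonS3Builder;
use object_store::ClientOptions;

fn build_store() -> Result<object_store::aws::AmazonS3, Box<dyn std::error::Error>> {
    // Raise the overall request timeout well above the 30s default so that
    // large streaming reads have time to complete.
    let options = ClientOptions::new().with_timeout(Duration::from_secs(300));

    let store = AmazonS3Builder::from_env()
        .with_bucket_name("my-bucket") // hypothetical bucket
        .with_client_options(options)
        .build()?;
    Ok(store)
}
```

Note this only papers over the problem: a sufficiently slow network or large request will still hit whatever timeout you pick.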

Related Tickets

As @crepererum says on #272 :

So long story short: People agree that this would be a good feature to have, but it requires a proper implementation.

Background

Streaming ✅

Some APIs like ObjectStore::get are "streaming" in the sense that they begin returning data as soon as it arrives from the network (as opposed to buffering the entire response before returning to the caller).

This is great for performance: response processing can begin immediately, and memory usage stays bounded for large payloads 🏆

Retries ✅

In order to deal with the intermittent errors that occur while processing object store requests, most ObjectStore implementations retry the request if they encounter an error (see retry.rs)

Retries + Streaming ❌

However, there is a problem when streaming is mixed with the existing retries. Specifically, if a request fails mid-stream (after some, but not all, of the data has been returned to the client), simply retrying the entire request isn't enough: the client could be handed data from the start of the response that it has already received.

Solution

Describe the solution you'd like

Implementing retries for streaming reads needs something more sophisticated, such as re-issuing the request only for the bytes that have not already been read.

Any solution for this I think needs:

  1. Very good tests / clear documentation
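To make the shape of such a retry concrete, here is a minimal sketch of the core bookkeeping (the function name and signature are hypothetical, not part of the crate's API):

```rust
/// Hypothetical helper: given the half-open byte range `[start, end)` of the
/// original request and the number of bytes already delivered to the caller,
/// return the range a resuming retry should request, or `None` if the
/// request already completed.
fn resume_range(start: u64, end: u64, delivered: u64) -> Option<(u64, u64)> {
    let resume_at = start + delivered;
    if resume_at >= end {
        None // everything was delivered; nothing left to retry
    } else {
        Some((resume_at, end))
    }
}
```

For example, a request for bytes `[0, 200)` that fails after delivering 10 bytes would be resumed as a request for `[10, 200)`.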

Describe alternatives you've considered

@crepererum suggests on #272 :

retrying would need to make a new request with a new range starting after the last received byte and ideally also an ETAG/version check to ensure that the object that is returned by the retry is the one that was already "in flight". This retry mechanic is obviously chaining/nested, i.e. if the retry fails mid-stream, you wanna have yet another retry that picks up where the previous one ended.
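A sketch of the state such a nested retry would have to carry, and the headers the follow-up request might send (names and shapes are hypothetical, not the crate's API):

```rust
/// Hypothetical resume state: where to pick up, and which object version the
/// partially consumed stream came from.
struct ResumeState {
    next_byte: u64, // first byte not yet delivered to the caller
    etag: String,   // ETag of the object already partially streamed
}

impl ResumeState {
    /// Headers for the follow-up request: an open-ended `Range` starting at
    /// the first missing byte, plus `If-Match` so the server fails the retry
    /// if the object changed since the original request.
    fn retry_headers(&self) -> Vec<(String, String)> {
        vec![
            ("Range".to_string(), format!("bytes={}-", self.next_byte)),
            ("If-Match".to_string(), self.etag.clone()),
        ]
    }
}
```

Because each resumed request builds a fresh `ResumeState` from the bytes delivered so far, the retries chain naturally: a retry that itself fails mid-stream simply produces the next `ResumeState`.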

@alamb alamb added the enhancement New feature or request label Mar 6, 2025
@alamb alamb changed the title object store: retry / recover after partially reading a streaming response object store: retry / recover after partially reading a streaming response ( fix timeout errors/error decoding response body ) Mar 18, 2025
@alamb alamb changed the title object store: retry / recover after partially reading a streaming response ( fix timeout errors/error decoding response body ) object store: retry / recover after partially reading a streaming response ( fix timeout errors / error decoding response body ) Mar 18, 2025
@alamb alamb self-assigned this Mar 18, 2025

alamb commented Mar 18, 2025

@crepererum had a good point that automatically requesting the remainder of a request that timed out would have the nice property of automatically adjusting to network conditions.

For example, if the network was fast the request would just proceed as normal. If the network was not as fast, that is ok too: the timeout would be hit and the request retried. And we wouldn't have to make any up-front determination about the network or request size.


alamb commented Mar 19, 2025

Copying some discussion with @ryzhyk from #14:

In terms of retry, the idea is that the retry doesn't re-fetch the entire request. Instead it would only retry the remaining bytes that had not yet been returned.

So let's say you had a 200 MB request but the network can only retrieve 10MB in 30s

  • The first request would fetch the first 10MB but time out
  • Then the retry would request the remaining 190MB
  • The second request would fetch the next 10MB and time out
  • Then the retry would request the remaining 180MB
  • .. and so on
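The sequence above can be modeled with a small simulation (a hypothetical helper, not crate code; units are MB for readability, and each attempt is assumed to deliver a fixed amount before timing out):

```rust
/// Simulate the resume loop: each attempt delivers at most `per_attempt`
/// units before timing out, and the follow-up request asks only for what
/// remains. Returns the (offset, remaining) pair of each request issued.
fn simulate_resumes(total: u64, per_attempt: u64) -> Vec<(u64, u64)> {
    let mut requests = Vec::new();
    let mut delivered = 0;
    while delivered < total {
        requests.push((delivered, total - delivered));
        delivered += per_attempt.min(total - delivered);
    }
    requests
}
```

For the 200MB example this issues 20 requests: `(0, 200)`, `(10, 190)`, `(20, 180)`, and so on, until the last request fetches the final 10MB.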


alamb commented Mar 20, 2025

Migrating from arrow-rs issue #7242

@alamb alamb transferred this issue from apache/arrow-rs Mar 20, 2025
@ion-elgreco
Contributor

Is someone working on this? I would like to take a jab at this

@crepererum crepererum assigned ion-elgreco and unassigned alamb Apr 14, 2025
@crepererum
Contributor

@ion-elgreco thank you :)
