error decoding response body after upgrade to object store 0.10 #272
I think we would need a reproducer to action this, the linked issues aren't even clearly implicating object_store |
Please also print the source of the error via |
@thomasfrederikhoeck @k-ye can you guys provide additional details please |
@Xuanwo I would love to be of more help but I don't know how to do this in delta-rs (and in turn object_store). It didn't help to set the timeout to 300s. @ion-elgreco Can you point me in the direction of how I can provide better logs? |
Hi, if you can consistently reproduce this issue, please change the following places:
fn object_store_to_py(err: ObjectStoreError) -> PyErr {
match err {
ObjectStoreError::NotFound { .. } => PyFileNotFoundError::new_err(err.to_string()),
ObjectStoreError::Generic { source, .. }
if source.to_string().contains("AWS_S3_ALLOW_UNSAFE_RENAME") =>
{
DeltaProtocolError::new_err(source.to_string())
}
_ => PyIOError::new_err(err.to_string()),
}
} Don't use |
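For context, a minimal illustrative sketch (not the delta-rs code, just plain std::error::Error) of surfacing the full error chain so the underlying cause of "error decoding response body" is printed rather than only the top-level message:

use std::error::Error;

// Illustrative helper (not part of delta-rs): walk the source chain so the
// underlying reqwest/hyper cause is printed, not just the outer message.
fn format_error_chain(err: &dyn Error) -> String {
    let mut out = err.to_string();
    let mut source = err.source();
    while let Some(cause) = source {
        out.push_str(&format!("\n  caused by: {cause}"));
        source = cause.source();
    }
    out
}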
@Xuanwo Ah thanks!! I get the following consistently:
I also tried bumping the timeout to
|
I suspect there's an issue with the network connection between your environment and Azure. Could you provide more details about your setup?
|
@Xuanwo It might be network related, but I have some feeling that it is related to how
The benchmark took 1+ hours with no failure while the delta-rs call fails within a few minutes. |
@Xuanwo @tustvold I can concur this also happens to us in v0.18.1/2. I can see logs in LakeFS which say "context canceled". Somewhere in object store the connection is getting dropped constantly with large files. Can you guys give suggestions on how to debug? For me, I am connecting within a VNET in EU Amsterdam |
Are you mixing IO with CPU-bound work? I wonder if you are stalling out the tokio runtime |
Hmm, I am not sure. I started working on delta-rs a year ago and most of this FileSystem handling code was already there. We essentially create a DeltaFileSystemHandler which we expose to Python. In Python we create a DeltaStorageHandler which inherits the pyarrow FileSystemHandler methods, which we implement to call the Rust DeltaFileSystemHandler. I think pyarrow just calls read on an ObjectInputFile, which in Rust calls
Here |
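As a rough sketch of that read path, assuming the object_store 0.10 API (the function name read_range is illustrative, not the actual delta-rs code):

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

// Hypothetical sketch of the read path described above: each pyarrow
// ObjectInputFile.read(nbytes) ultimately becomes one ranged GET.
async fn read_range(
    store: &dyn ObjectStore,
    location: &Path,
    offset: usize,
    nbytes: usize,
) -> object_store::Result<Bytes> {
    // object_store 0.10 takes a Range<usize>; later releases may differ.
    store.get_range(location, offset..offset + nbytes).await
}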
That at least looks plausible, how big are the ranges we're fetching and how long are we fetching for? I wonder if we're running into some Azure limit, it sounds like they're hanging up for some reason |
What do you mean by "looks plausible"? :) @tustvold I'll put some print statements in the ranges, to see what is being requested! Will get back to you on that! |
I can't see anything obviously wrong, but also don't know much about pyarrow so can't say definitively if it is correct |
@tustvold It seems pyarrow fetches 4 files in parallel and then reads around 30MB each time: https://gist.github.com/ion-elgreco/e2339990843755b40475dbd6e72e4697 |
That's on the chonkier end of optimal, but not ludicrous. How long do the fetches take? |
Hmm what would you suggest is more optimal? Like 10MB? So now my VPN connection throughput is working fine, so it seems each fetch takes around 4-8 secs.
[python/src/filesystem.rs:501:17] elapsed.as_secs() = 6
[python/src/filesystem.rs:473:9] (&self.path, &nbytes) = (
    Path {
        raw: "product_line_code=DUMMY/100-f1cafe66-476f-4818-8199-5c5a4a6eb4ef-0.parquet",
    },
    Some(
        33231426,
    ),
) |
Oh... This is almost certainly what is causing this issue, very few VPNs will support large volume data transfer. It is almost certainly dropping the connections in the interest of preserving QoS for other users. Shuttling data through a VPN box is not only likely to be the cause of your issue, it is also likely very expensive. |
I see, that could explain it for me; however, my colleague saw timeouts on his Azure Compute instance, so that was an Azure <-> Azure connection within our VNET |
I'm afraid I don't have any other ideas, something outside of object_store is dropping the connection. This could be Azure itself, Azure blob storage definitely gives off the impression of being an MVP that somehow got shipped, but it is more likely to be some middleware network appliance, like a VPN, NAT gateway or similar. AWS has private gateway endpoints that must be configured for S3, I am not sure if Azure needs something similar. |
@tustvold The weird thing is that I can run some rather large data operations (taking 1+ hour) with
I can maybe add: Before this PR in polars we sometimes saw similar issues, but I'm very far from knowledgeable on networking. |
This is why I asked about starving the tokio threadpool; this does not appear to be the issue @ion-elgreco is running into, from what he has shared.
Azcopy will be using multipart uploads, which uses smaller requests that are therefore less susceptible to dropped connections |
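For comparison, a minimal sketch of a multipart upload with object_store 0.10's put_multipart and WriteMultipart helper (the in-memory store, path and part size here are made up):

use object_store::{memory::InMemory, path::Path, upload::WriteMultipart, ObjectStore};

// Many smaller part requests instead of one large transfer, so a single
// dropped connection only costs a part, not the whole upload.
async fn upload_in_parts(data: &[u8]) -> object_store::Result<()> {
    let store = InMemory::new(); // stand-in for an Azure or S3 store
    let path = Path::from("example/large-file.bin");

    let upload = store.put_multipart(&path).await?;
    let mut writer = WriteMultipart::new(upload);
    for chunk in data.chunks(8 * 1024 * 1024) {
        writer.write(chunk); // buffers and uploads parts in the background
    }
    writer.finish().await?;
    Ok(())
}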
I got a bit confused myself here, but @thomasfrederikhoeck you have issues during Optimize where the data is being read differently. @tustvold, here it seems we read a Parquet object within a tokio task; should this be a rayon threadpool instead?
let stream = match operations {
OptimizeOperations::Compact(bins) => futures::stream::iter(bins)
.flat_map(|(_, (partition, bins))| {
futures::stream::iter(bins).map(move |bin| (partition.clone(), bin))
})
.map(|(partition, files)| {
debug!(
"merging a group of {} files in partition {:?}",
files.len(),
partition,
);
for file in files.iter() {
debug!(" file {}", file.location);
}
let object_store_ref = log_store.object_store();
let batch_stream = futures::stream::iter(files.clone())
.then(move |file| {
let object_store_ref = object_store_ref.clone();
async move {
let file_reader = ParquetObjectReader::new(object_store_ref, file);
ParquetRecordBatchStreamBuilder::new(file_reader)
.await?
.build()
}
})
.try_flatten()
.boxed();
let rewrite_result = tokio::task::spawn(Self::rewrite_files(
self.task_parameters.clone(),
partition,
files,
log_store.object_store().clone(),
futures::future::ready(Ok(batch_stream)),
));
util::flatten_join_error(rewrite_result)
})
.boxed(),
Later down in the code we read that stream from above one by one and cast each RecordBatch, which is CPU bound I guess? And then we write it as a parquet again:
while let Some(maybe_batch) = read_stream.next().await {
let mut batch = maybe_batch?;
batch = super::cast::cast_record_batch(
&batch,
task_parameters.file_schema.clone(),
false,
true,
)?;
partial_metrics.num_batches += 1;
writer.write(&batch).await.map_err(DeltaTableError::from)?;
} |
You should avoid doing any non-trivial CPU-bound work on the tokio threadpool that you use for IO. The way I've seen this done successfully is running DF in one tokio threadpool, and then spawning IO from it into a different one. There was some work in the past to make this easier, see apache/arrow-rs#4040, but I never got it over the line. I'll file a ticket. Edit: Filed apache/arrow-rs#6248 |
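A minimal sketch of that pattern, independent of the delta-rs internals (the helper name spawn_io and the thread counts are illustrative): CPU-bound work runs on one runtime, while IO futures are handed to a second runtime through its Handle.

use std::time::Duration;
use tokio::runtime::{Builder, Handle, Runtime};

// Dedicated runtime for IO so CPU-heavy work cannot starve in-flight requests.
fn io_runtime() -> Runtime {
    Builder::new_multi_thread()
        .worker_threads(2)
        .thread_name("io")
        .enable_all()
        .build()
        .expect("failed to build IO runtime")
}

// Run a future on the IO runtime and await its result from the caller's runtime.
async fn spawn_io<F, T>(io: &Handle, fut: F) -> T
where
    F: std::future::Future<Output = T> + Send + 'static,
    T: Send + 'static,
{
    io.spawn(fut).await.expect("IO task panicked")
}

fn main() {
    let io_rt = io_runtime();
    let io_handle = io_rt.handle().clone();

    // The "CPU" runtime, where decoding or casting record batches would run.
    let cpu_rt = Builder::new_multi_thread().enable_all().build().unwrap();
    cpu_rt.block_on(async move {
        // Stand-in for a network fetch; a real integration would call the
        // object store here instead of sleeping.
        let bytes = spawn_io(&io_handle, async {
            tokio::time::sleep(Duration::from_millis(10)).await;
            vec![0u8; 1024]
        })
        .await;

        // CPU-bound work stays on this runtime and cannot stall the IO pool.
        println!("fetched {} bytes", bytes.len());
    });
}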
@tustvold thanks for the support! And insights 😄, I'm not an expert on Rust async yet, so I might need to look into how I could allocate or split up these pools. How fast do you think apache/arrow-rs#6248 could land? |
The issue pertains to how CPU-bound work is starving IO; this side channel will not be reflected in stack traces. Additionally, there is something in between that is connecting pyarrow to object_store; we don't provide such an integration. The delta-rs people will likely be best placed to comment on what this is. |
Let's leave this ticket open until we sort out the next steps (though I agree with @tustvold that I don't predict any code changes in arrow-rs) |
My question is "would you be willing to summarize this ticket / write up a blog post (perhaps on the DataFusion blog) explaining how to spawn IO related tasks on a different thread pool?" I am 🎣 for help as I would like to write this blog too (so we can distill down this ticket and others for future discussion) but I am struggling to find time |
I think there's something else going on that is independent of the tokio runtime issues. I was able to reproduce this locally with a somewhat broken
So I think we should extend the retry logic to capture this case. However this might be difficult in the streaming case (i.e. when the error occurs mid-stream), see
At least we should try to improve the error message. It seems that the |
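For reference, a caller-side sketch of the simple (non-streaming) case, which just re-issues the whole ranged GET on failure; this is not object_store's built-in retry, and the names and attempt count are illustrative:

use std::ops::Range;

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

// Re-issue the entire ranged GET if it fails. Wasteful for large ranges,
// but enough to paper over transient connection drops.
async fn get_range_retrying(
    store: &dyn ObjectStore,
    location: &Path,
    range: Range<usize>,
    max_attempts: usize,
) -> object_store::Result<Bytes> {
    let mut last_err = None;
    for _ in 0..max_attempts {
        match store.get_range(location, range.clone()).await {
            Ok(bytes) => return Ok(bytes),
            Err(err) => last_err = Some(err),
        }
    }
    Err(last_err.expect("max_attempts must be > 0"))
}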
Retrying interrupted streaming requests is tracked by #53. I'm a bit wary of this ticket just becoming a general dumping ground for any networking-related issue, which is part of why I closed it... |
Error display improvements tracked by #48. |
Ok so here is my summary of this ticket, and the action items going forward.
Problem
The
Causes
There are two related causes of this:
Outcome
As for the follow on work:
Please let me know if I have missed anything, otherwise I will look to close this issue in favour of the linked issues in the next few days. I think this issue has been very helpful, and I'm grateful for everyone who has participated, but I am keen to put this on a more actionable footing. |
I could but not soon, I recently started a new job so that's keeping me quite busy |
What I still don't quite get is that we are still seeing errors, but only on the reading side through the DeltaFileSystemHandler which is exposed as a pyarrow filesystem; there should be zero CPU-bound tasks on that tokio runtime |
Perhaps it is related to just some network errors and a retry of streaming; #53 tracks retrying interrupted streaming requests |
@alamb delta-io/delta-rs#2595 (comment) I asked them to add a timeout increase and that resolved it. I guess it would already help a lot if the true error surfaced; it might be that all those folks have low network throughput |
I believe @itsjunetime may be able to take a look at improving the errors and retries. We'll keep the tickets updated |
I've opened a PR apache/arrow-rs#6519 that will retry on |
I've got the same error when using ParquetRecordBatchStream with ParquetObjectReader. I found that it may be related to file size. When I try to read a file bigger than 335MB, the read always fails; files up to 335MB do not fail. I used Ceph with the S3 API. |
@Tangeroooo I think we're open to contributions to fix that. The essence is described in apache/arrow-rs#6519 (comment) or, in my words: The |
@crepererum Thank you for your kind explanation! |
Since this keeps coming up, I filed a separate ticket to track |
I am convinced that this issue / error would be solved by retrying the stream on errors |
Describe the bug
We bumped the object store to 0.10 in delta-rs, and now we are already seeing a couple of reports of the following error:
error decoding response body
Happens on Azure and S3. See delta-io/delta-rs#2595 and delta-io/delta-rs#2592
To Reproduce
Seems to occur when reading tables or doing operations on them.
Expected behavior
Don't have an issue decoding the response body
Additional context
@thomasfrederikhoeck @k-ye