[DISCUSSION] [object_store] New crate with object store combinators / utilities #14

Open
alamb opened this issue Mar 8, 2025 · 13 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Mar 8, 2025

Please describe what you are trying to do.

TLDR: let's combine forces rather than all reimplementing caching / chunking / etc in object_store!

The ObjectStore trait is flexible, and it is common to compose a stack of ObjectStores, with one store wrapping an underlying store

For example, the ThrottledStore and LimitStore provided with the object_store crate do exactly this (a small composition sketch follows the diagram below):

┌──────────────────────────────┐
│        ThrottledStore        │
│(adds user configured delays) │
└──────────────────────────────┘
                ▲               
                │               
                │               
┌──────────────────────────────┐
│      Inner ObjectStore       │
│   (for example, AmazonS3)    │
└──────────────────────────────┘
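
For illustration (not from the original post), a minimal sketch of stacking the built-in combinators, assuming object_store's ThrottledStore / LimitStore APIs (field names may vary between versions) and an InMemory store standing in for AmazonS3:

    use std::{sync::Arc, time::Duration};

    use object_store::limit::LimitStore;
    use object_store::memory::InMemory;
    use object_store::throttle::{ThrottleConfig, ThrottledStore};
    use object_store::ObjectStore;

    fn build_store() -> Arc<dyn ObjectStore> {
        // Inner store (InMemory here; AmazonS3 or similar in production)
        let inner = InMemory::new();

        // Add a user-configured delay to every GET call
        let throttled = ThrottledStore::new(
            inner,
            ThrottleConfig {
                wait_get_per_call: Duration::from_millis(10),
                ..Default::default()
            },
        );

        // Cap the number of concurrent requests hitting the wrapped store
        Arc::new(LimitStore::new(throttled, 16))
    }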

Many Different Behaviors

There are many types of behaviors that can be implemented this way. Some examples I am aware of:

  1. The ThrottledStore and LimitStore provided with the object store crate
  2. Runs on a different tokio runtime (such as the DeltaIOStorageBackend in delta-rs from @ion-elgreco)
  3. Limit the total size of any individual request (e.g. the LimitedRequestSizeObjectStore from Timeouts reading "large" files from object stores over "slow" connections datafusion#15067)
  4. Break single large requests into multiple concurrent small requests ("chunking") - I think @crepererum is working on this in influx
  5. Caches results of requests locally using memory / disk (see ObjectStoreMemCache in influxdb3_core), and this one in slatedb @criccomini (thanks @ion-elgreco for the pointer)
  6. Collect statistics / traces and report metrics (see ObjectStoreMetrics in influxdb3_core)
  7. Visualization of object store requests over time

Desired behavior is varied and application specific

Also, depending on the needs of the particular app, the ideal behavior / policy is likely different.

For example,

  1. In the case of Timeouts reading "large" files from object stores over "slow" connections datafusion#15067, splitting one large request into several small requests made in series is likely the desired approach (maximize the chance they succeed)
  2. If you are trying to maximize read bandwidth in a cloud server setting, splitting up ("Chunking") large requests into several parallel ones may be desired
  3. If you are trying to minimize costs (for example doing bulk reorganizations / compactions on historical data that are not latency sensitive), using a single request for large objects (what is done today) might be desired
  4. Maybe you want to adapt more dynamically to network and object store conditions as described in Exploiting Cloud Object Storage for High-Performance Analytics

So the point is that I don't think any one individual policy will work for all use cases (though we can certainly discuss changing the default policy)

Since ObjectStore is already composable, I already see projects implementing these types of things independently (for example, delta-rs and influxdb_iox both have a cross runtime object stores, and @mildbyte from splitgraph implemented some sort of visualization of object store requests over time)

I believe this is similar to the OpenDAL concept of layers but @Xuanwo please correct me if I am wrong

Desired Solution

I would like it to be easier for users of object_store to access such features without having to implement custom wrappers independently and in parallel

Alternatives

New object_store_util crate

One alternative is to make a new crate named object_store_util or similar, mirroring futures-util and tokio-util, that has a bunch of these ObjectStore combinators

This could be housed outside of the apache organization, but I think it would be most valuable for the community if it was inside

Add additional policies to provided implementations

An alternative is to implement more sophisticated default implementations (for example, adding more options to the AmazonS3 implementation).

One upside of this approach is it could take advantage of implementation specific features

One downside is additional code and configuration complexity, especially as the different strategies are all applicable to multiple stores (e.g. GCP, S3 and Azure). Another downside is that specifying the policy might be complex (for example, specifying concurrency along with chunking and under what circumstances each should be used).

Additional context

@tustvold
Contributor

tustvold commented Mar 8, 2025

Thank you for starting this discussion, I think we should definitely provide more utilities/primitives in this space.

The ThrottledStore and LimitStore provided with the object store crate

FWIW these should probably be deprecated and re-implemented at the HttpClient level.

Collect statistics / traces and report metrics (see ObjectStoreMetrics in influxdb3_core)
Runs on a different tokio runtime (such as the DeltaIOStorageBackend in delta-rs from @ion-elgreco)
Visualization of object store requests over time

Now that we have the HttpClient abstraction, I think this is the level at which I would encourage implementing most of these.

Limit the total size of any individual request (e.g. the LimitedRequestSizeObjectStore from apache/datafusion#15067)
Break single large requests into multiple concurrent small requests ("chunking") - I think @crepererum is working on this in influx

This feels like something better built into some sort of TransferManager that sits on top of the ObjectStore API, as opposed to baking it in at the ObjectStore level. Perhaps in a similar vein to BufWriter.

This would, for example, allow registering a single ObjectStore, but then having different IO configurations for different areas of the stack. It would also potentially allow for greater concurrency, as the ObjectStore API has no mechanism by which chunks fetched in parallel could be returned out of order. This would be especially useful when downloading files to disk, as it avoids needing to hold chunks in memory unnecessarily.
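
To make the idea concrete, here is one purely hypothetical shape such a TransferManager could take; none of these names exist in object_store today, this is just a sketch of the concept:

    use std::ops::Range;
    use std::path::PathBuf;

    /// Hypothetical policy options owned by the manager rather than the store
    struct TransferOptions {
        /// split large GETs into chunks of this many bytes
        chunk_size: u64,
        /// how many chunks to fetch in parallel
        max_concurrency: usize,
    }

    /// Hypothetical manager sitting on top of the ObjectStore API,
    /// in a similar vein to BufWriter
    trait TransferManager {
        /// Download `range` of the object at `location` to a local file,
        /// fetching chunks concurrently and writing each at its byte offset,
        /// so out-of-order completion never requires buffering whole chunks
        /// in memory.
        fn download_to_file(
            &self,
            location: &str,
            range: Range<u64>,
            dest: PathBuf,
            opts: &TransferOptions,
        ) -> std::io::Result<()>;
    }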

See #267 for some prior discussion.

Add additional policies to provided implementations

FWIW all the first-party implementations share a lot of the same underlying logic, e.g. with things like GetClient, and so it may actually not be all that bad

@alamb
Contributor Author

alamb commented Mar 8, 2025

This feels like something better built into some sort of TransferManager that sits on top of the ObjectStore API, as opposed to baking it in at the ObjectStore level. Perhaps in a similar vein to BufWriter.

I think there is room for both some lower-level ObjectStore wrappers and a more full-featured transfer manager or higher abstraction, depending on the needs and resources of the underlying application

@tustvold
Contributor

tustvold commented Mar 8, 2025

I've created apache/arrow-rs#7253 as an example of how the HttpClient abstraction can be used for more fine-grained control of requests, including spawning IO to a separate tokio runtime.
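
As a rough illustration of the separate-runtime idea (this is not the code in apache/arrow-rs#7253, just the general tokio pattern it builds on): spawn the request future onto a dedicated runtime's Handle and await only the JoinHandle from the caller's runtime.

    use tokio::runtime::Handle;

    // Run the (placeholder) request body on a dedicated IO runtime, so the
    // caller's runtime only polls the JoinHandle
    async fn get_on_io_runtime(io: Handle) -> Vec<u8> {
        io.spawn(async {
            // ... perform the actual HTTP request here ...
            Vec::new()
        })
        .await
        .expect("IO task panicked")
    }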

@flaneur2020

flaneur2020 commented Mar 9, 2025

I believe this is similar to the OpenDAL concept of layers but @Xuanwo please correct me if I am wrong

I suppose this might also be related to the "Operator" concept in OpenDAL, which can help handle the chunking & concurrency parameters in a builder pattern like this:

    let s = op
        .reader_with("hello.txt")
        .concurrent(8)
        .chunk(256)
        .await?
        .into_stream(1024..2048)
        .await?;

Layer is a bit lower level than the Operator imo; we can wrap cache & metrics around S3 operations like read and write, while chunking & concurrency are handled at the Operator level 🤔.

@alamb
Contributor Author

alamb commented Mar 18, 2025

@crepererum and I spoke about this issue today.

In the case of apache/datafusion#15067, splitting one large request into several small requests made in series is likely the desired approach (maximize the chance they succeed)

@crepererum rightly pointed out that implementing retries (aka #15) would be better than splitting into smaller requests to fit within a timeout, as the retry mechanism automatically adjusts to current network conditions

However, we otherwise have a few more potential items we may propose upstreaming:

  • RacingReads (reduce latency of first (and complete) fetch by running multiple requests in parallel)
  • Chunking (reduce latency of completing requests on time)
  • MemoryCache (cache in memory, but handle teeing responses, implement streaming logic)
  • DiskCache (takes advantage of io_uring)
  • Testing framework

Roughly speaking, what we are thinking is:

@alamb
Contributor Author

alamb commented Mar 18, 2025

@criccomini I am curious if you would have a use for RacingReads. This basically would reduce the overall latency for object store requests by running multiple requests in parallel and returning the one that completed first. The tradeoff is that this strategy increases $$$ linearly as it makes more requests
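
As a rough sketch of the idea (hypothetical, not an existing object_store combinator), the race itself can be expressed with futures::future::select_ok, which resolves with the first request that succeeds and drops the rest:

    use futures::future::{select_ok, BoxFuture};

    // Issue the same read several times and keep whichever finishes first.
    // Cost grows roughly linearly with the number of racers.
    async fn racing_get(
        racers: Vec<BoxFuture<'static, std::io::Result<Vec<u8>>>>,
    ) -> std::io::Result<Vec<u8>> {
        // select_ok resolves with the first Ok result and cancels the
        // remaining futures by dropping them; it errors only if every
        // racer fails
        let (bytes, _rest) = select_ok(racers).await?;
        Ok(bytes)
    }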

@ryzhyk

ryzhyk commented Mar 19, 2025

@crepererum rightly pointed out that implementing retries (aka #7242) would be better than splitting into smaller requests to fit within a timeout, as the retry mechanism automatically adjusts to current network conditions

Isn't there an upper bound on the timeout (30s by default)? And if the bound isn't large enough to push that 200MiB row group through a slow connection, won't the request fail anyway? And even if the request succeeds eventually, relying on retries to dynamically adjust the timeout seems wasteful compared to bounding request size, improving the chances the request will succeed the first time.

@alamb
Contributor Author

alamb commented Mar 19, 2025

@crepererum rightly pointed out that implementing retries (aka #7242) would be better than splitting into smaller requests to fit within a timeout, as the retry mechanism automatically adjusts to current network conditions

Isn't there an upper bound on the timeout (30s by default)? And if the bound isn't large enough to push that 200MiB row group through a slow connection, won't the request fail anyway?

I think the idea is that you don't re-request the entire object, only the remaining bytes

So let's say you had a 200 MB request but the network can only retrieve 10 MB in 30s (a minimal sketch of this resume-only-the-remainder loop follows the list below):

  • The first request would fetch the first 10MB but timeout
  • Then the retry would request the remaining 190MB
  • The second request would fetch the second 10MB and timeout
  • Then the retry would request the remaining 180MB
  • .. and so on
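
A minimal sketch of that loop (hypothetical names, generic over whatever performs a single range request; the real logic would live inside the retry machinery):

    use std::ops::Range;

    // What one attempt produced: the bytes that arrived before it either
    // completed or hit the timeout (hypothetical type, for illustration)
    struct Attempt {
        data: Vec<u8>,
        timed_out: bool,
    }

    // Fetch `range`, re-requesting only the bytes not yet received after
    // each timeout, e.g. 0..200MB, then 10MB..200MB, then 20MB..200MB, ...
    fn fetch_with_resume(
        mut do_request: impl FnMut(Range<u64>) -> Attempt,
        range: Range<u64>,
        max_attempts: usize,
    ) -> Option<Vec<u8>> {
        let mut buf = Vec::new();
        let mut next = range.start;
        for _ in 0..max_attempts {
            let attempt = do_request(next..range.end);
            next += attempt.data.len() as u64;
            buf.extend_from_slice(&attempt.data);
            if next >= range.end && !attempt.timed_out {
                return Some(buf);
            }
            // Timed out: loop again, asking only for the remaining bytes
        }
        None // gave up after max_attempts
    }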

I agree this is not clear -- I will post the same on #15

@ryzhyk

ryzhyk commented Mar 19, 2025

That makes a lot of sense, thanks for clarifying! So this means that the same data won't get fetched multiple times, which is nice. Does the user still need to configure a large enough retry_timeout and max number of retries, or do those bounds not apply in this scenario where every retry fetches some data?

@alamb
Contributor Author

alamb commented Mar 20, 2025

Does the user still need to configure large enough retry_timeout and max number of retries or those bounds won't apply in this scenario where every retry fetches some data?

I am not sure yet -- it will depend on how the feature is implemented. It is interesting to think about what to do when the process is making (very) slow progress.

@alamb
Contributor Author

alamb commented Mar 20, 2025

Migrating from arrow-rs issue #7251

@criccomini
Contributor

@criccomini I am curious if you would have a use for RacingReads. This basically would reduce the overall latency for object store requests by running multiple requests in parallel and returning the one that completed first. The tradeoff is that this strategy increases $$$ linearly as it makes more requests

This is a nice to have for us. It's certainly crossed my mind, but we haven't implemented it yet. In some cases, I suspect SlateDB users will want low latency at all costs. In other cases, cost is the main thing. :)

@alamb
Contributor Author

alamb commented Apr 17, 2025

Chunked Reads as requested in this ticket is similar

ryzhyk pushed a commit to feldera/feldera that referenced this issue Apr 18, 2025
Pipelines with many delta connectors hit timeout errors, likely due to
apache/arrow-rs-object-store#14

Until that is fixed, we introduce a mechanism to restrict the number of
concurrent readers across all delta connectors.

From the docs:

--

Maximum number of concurrent object store reads performed by all Delta
Lake connectors.

This setting is used to limit the number of concurrent reads of the
object store in a pipeline with a large number of Delta Lake connectors.
When multiple connectors are simultaneously reading from the object
store, this can lead to transport timeouts.

When enabled, this setting limits the number of concurrent reads across
all connectors. This is a global setting that affects all Delta Lake
connectors, and not just the connector where it is specified. It should
therefore be used at most once in a pipeline.  If multiple connectors
specify this setting, they must all use the same value.

The default value is 6.

Signed-off-by: Leonid Ryzhyk <[email protected]>
github-merge-queue bot pushed a commit to feldera/feldera that referenced this issue Apr 19, 2025