Skip to content

feat!: update storage configuration system #3383

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Apr 16, 2025
Merged

Conversation

roeap
Copy link
Collaborator

@roeap roeap commented Apr 13, 2025

Description

Along with the addition of a caching layer, this PR also attempts to formalise how we handle configuration a bit more. With the caching, we can now configure four layers that can be applied to object stores.

  • limiting
  • io runtime
  • retry
  • cache

Additionally, we currently need to wrap stores in a PrefixStore pointing at the table root.

Each of these layers has its own configuration that is parsed from the same global map of configuration keys. The configuration for all of these is now aggregated in a new struct StorageConfig.

Our previous abstraction StorageOptions had over time lost its utility since no implementations lived in the core crate anymore and we were just wrapping a HashMap.

The same configuration is also passed again to LogStore and ObjectStore factories where again some parsing happens. With this PR we now present the already parsed configuration (StorageConfig) which exposes utility methods for integrations to decorate / augment the storages the produce.

The new caching store is fairly contained, and I'd be happy to carve that out if it helps reviews.

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Apr 13, 2025
Copy link

codecov bot commented Apr 13, 2025

Codecov Report

Attention: Patch coverage is 62.17617% with 219 lines in your changes missing coverage. Please review.

Project coverage is 71.90%. Comparing base (7538c58) to head (e35c9ca).
Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
crates/core/src/logstore/config.rs 65.16% 42 Missing and 12 partials ⚠️
crates/core/src/logstore/storage/runtime.rs 9.30% 34 Missing and 5 partials ⚠️
crates/aws/src/storage.rs 61.90% 16 Missing ⚠️
crates/mount/src/file.rs 0.00% 15 Missing ⚠️
crates/azure/src/lib.rs 0.00% 14 Missing ⚠️
crates/mount/src/lib.rs 0.00% 14 Missing ⚠️
crates/core/src/logstore/mod.rs 84.21% 4 Missing and 8 partials ⚠️
crates/catalog-unity/src/lib.rs 0.00% 10 Missing ⚠️
crates/gcp/src/lib.rs 0.00% 9 Missing ⚠️
crates/hdfs/src/lib.rs 0.00% 8 Missing ⚠️
... and 9 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3383      +/-   ##
==========================================
- Coverage   71.97%   71.90%   -0.07%     
==========================================
  Files         145      147       +2     
  Lines       45774    45848      +74     
  Branches    45774    45848      +74     
==========================================
+ Hits        32944    32966      +22     
- Misses      10746    10786      +40     
- Partials     2084     2096      +12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Collaborator

@ion-elgreco ion-elgreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good, happy to see the simplification of StorageConfig, makes it also easier in the future to extend it!

Just left some minor comments

@roeap roeap force-pushed the cached-store branch 2 times, most recently from 2e8c4fe to dba8d52 Compare April 13, 2025 18:22
@roeap roeap mentioned this pull request Apr 14, 2025
@github-actions github-actions bot added the binding/python Issues for the Python package label Apr 14, 2025
@roeap roeap force-pushed the cached-store branch 3 times, most recently from 85e9e93 to f70d4d8 Compare April 14, 2025 14:39
@rtyler rtyler self-assigned this Apr 14, 2025
@roeap roeap marked this pull request as ready for review April 14, 2025 21:47
@roeap roeap requested a review from ion-elgreco April 14, 2025 21:47
Comment on lines +148 to +152
let prefix = match ObjectStoreScheme::parse(table_root) {
Ok((ObjectStoreScheme::AmazonS3, _)) => Path::parse(table_root.path())?,
Ok((_, path)) => path,
_ => Path::parse(table_root.path())?,
};
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is concerning me a bit - why does AWS need special treatment? However as we move to working with URLs, I hope we can create a very narrow code path and convert urls to paths in exactly one way ...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems not necessary both return the same

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed! but if we don't do this, tests start to fail .. i was also wondering if maybe the aws cli behaves differently encoding paths when uploading files?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@roeap @ion-elgreco Why do we wrap with a PrefixStore() always? This totally breaks the DeltaTableBuilder:
/// Set the storage backend.
///
/// If a backend is not provided then it is derived from table_uri.
///
/// # Arguments
///
/// * storage - A shared reference to an ObjectStore with
/// "/" pointing at delta table root (i.e. where _delta_log is located).
/// * location - A url corresponding to the storagle location of storage.
pub fn with_storage_backend(mut self, storage: Arc, location: Url) -> Self {
self.storage_backend = Some((storage, location));
self
}

this api is now broken, the url paths will be all wrong since the passed location is now prefixed and not relative to the assumed root of the storage backend

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to prepare for support for absolute paths, but if you are seeing something wrong. Please create a repro :)

Comment on lines +17 to +31
pub trait ObjectStoreFactory: Send + Sync {
/// Parse URL options and create an object store instance.
///
/// The object store instance returned by this method must point at the root of the storage location.
/// Root in this case means scheme, authority/host and maybe port.
/// The path segment is returned as second element of the tuple. It must point at the path
/// corresponding to the path segment of the URL.
///
/// The store should __NOT__ apply the decorations via the passed `StorageConfig`
fn parse_url_opts(
&self,
url: &Url,
options: &HashMap<String, String>,
retry: &RetryConfig,
) -> DeltaResult<(ObjectStoreRef, Path)>;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we document the new expected behavior of this method.

Comment on lines +95 to +116
trait LogStoreFactoryExt {
fn with_options_internal(
&self,
store: ObjectStoreRef,
location: &Url,
options: &StorageConfig,
io_runtime: Option<IORuntime>,
) -> DeltaResult<Arc<dyn LogStore>>;
}

impl<T: LogStoreFactory + ?Sized> LogStoreFactoryExt for T {
fn with_options_internal(
&self,
store: ObjectStoreRef,
location: &Url,
options: &StorageConfig,
io_runtime: Option<IORuntime>,
) -> DeltaResult<Arc<dyn LogStore>> {
let store = options.decorate_store(store, location, io_runtime.map(|r| r.get_handle()))?;
self.with_options(Arc::new(store), location, options)
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now we use the internal trait to make sure we are injection ourselves into the flow to wrap our layers around the object store. I feel this can be improved, bit hoping to do that in a follou up where we can have a more targeted discussion around this.

Comment on lines +247 to 261
impl LogStoreConfig {
pub fn decorate_store<T: ObjectStore + Clone>(
&self,
store: T,
table_root: Option<&url::Url>,
handle: Option<Handle>,
) -> DeltaResult<Box<dyn ObjectStore>> {
let table_url = table_root.unwrap_or(&self.location);
self.options.decorate_store(store, table_url, handle)
}

pub fn object_store_factory(&self) -> ObjectStoreFactoryRegistry {
self::factories::object_store_factories()
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logstore config is a good way for us to inject functionality into the logstore, since the API dictates log stores also expose it again, but we control the implementation... this should alos gove us a way to just inject a registry taken from - let's say datafusion session - and wire it through.

Comment on lines -219 to -223
/// Storage option keys to use when creating [ObjectStore].
///
/// The same key should be used whether passing a key in the hashmap or setting it as an environment variable.
/// Must be implemented for a given storage provider
pub mod storage_constants {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i could not find any way we actually inspect the environment to pichk up this variable. that said, wit the updated configuration parsing we should now have a clear way how we can enable much more config via then environment in a manageable way.

Comment on lines -13 to -39
use super::DeltaTable;
use crate::errors::{DeltaResult, DeltaTableError};
use crate::logstore::storage::{factories, IORuntime, StorageOptions};
use crate::logstore::LogStoreRef;

#[allow(dead_code)]
#[derive(Debug, thiserror::Error)]
enum BuilderError {
#[error("Store {backend} requires host in storage url, got: {url}")]
MissingHost { backend: String, url: String },
#[error("Missing configuration {0}")]
Required(String),
#[error("Failed to find valid credential.")]
MissingCredential,
#[error("Failed to decode SAS key: {0}\nSAS keys must be percent-encoded. They come encoded in the Azure portal and Azure Storage Explorer.")]
Decode(String),
#[error("Delta-rs must be build with feature '{feature}' to support url: {url}.")]
MissingFeature { feature: &'static str, url: String },
#[error("Failed to parse table uri")]
TableUri(#[from] url::ParseError),
}

impl From<BuilderError> for DeltaTableError {
fn from(err: BuilderError) -> Self {
DeltaTableError::Generic(err.to_string())
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we never used any of these 😄

Comment on lines -57 to -63
/// Indicates whether our use case requires tracking tombstones.
/// This defaults to `true`
///
/// Read-only applications never require tombstones. Tombstones
/// are only required when writing checkpoints, so even many writers
/// may want to skip them.
pub require_tombstones: bool,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we never reference this config ... reading tomstones has been lazy for a while now.

Comment on lines -1337 to -1343
pub fn get_py_storage_backend(&self) -> PyResult<filesystem::DeltaFileSystemHandler> {
Ok(filesystem::DeltaFileSystemHandler {
inner: self.object_store()?,
config: self._config.clone(),
known_sizes: None,
})
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also a relic ...

Copy link
Collaborator

@ion-elgreco ion-elgreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you resolve these open comments? Then we are good to go

Signed-off-by: Robert Pack <[email protected]>
@roeap
Copy link
Collaborator Author

roeap commented Apr 15, 2025

@ion-elgreco - your comments should be addressed now.

let path = std::fs::canonicalize(dir)?;
(path, None)
} else {
let tmp_dir = tempfile::tempdir()?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tested this? I feel like this will still get destroyed before actually access those tmp files

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not hooked up yet, the plan is to actually wire this up in the next PR, then I'll also add tests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

happy to also just remove the cache impl from this PR, and put it all in hte next one.a follow up.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

went with just removing it ... better then introducing dead code. should hopefully make for a cleaner review in the next one 😄.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah because the docs say, once TempDir goes out of scope, it's deconstructed

Signed-off-by: Robert Pack <[email protected]>
@roeap roeap changed the title feat!: add caching layer and update storage configuration system feat!: update storage configuration system Apr 15, 2025
@rtyler rtyler added this pull request to the merge queue Apr 16, 2025
Merged via the queue into delta-io:main with commit b2dd8fa Apr 16, 2025
31 checks passed
@roeap roeap deleted the cached-store branch April 16, 2025 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants