2025-01-16 kernel meeting notes #647
OussamaSaoudi
started this conversation in
General
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
summary
pull requests/ongoing threads
attendees
@nicklan @hntd187 @zachschuermann @OussamaSaoudi-db @roeap @dennyglee @scovich @rtyler @scovich @mightyshazam
leftovers
notes stream
nick: How to wrangle the feature flags. (arrow versions)
tyler: Good news: nearly done, just need to update the ffi/acceptance crate. Bad news: they've had problems maintaining arrow in different supcrates in a workspace. Workspace doesn't hold the arrow spec. Instead feature flags and package renaming needs to be used in the kernel toml. Need to reexport the arrow symbols we're using. Need to make kernel use its own reexport. What's nice is that the addition of new versions is easy.
nick: Why import it under arrow_53_name::*
tyler: Arrow 53 needs to show up in the toml, which shows up in the module space. Hence there can be conflicts
Ryan: is it possible to have a helper module/crate that is one source for the crates?
tyler: This is done in the kernel crate. Adding another crate doesn't help, because the crate would have to push feature flags through. Symbols are pushed out publicly, so test can reimport the version of arrow that the kernel crate is reexporting.
Zach: You can have one place where we reexport the names arrow_53, arrow_54 as arrow (based on configuration flags?).
nick: We can move this to its own crate if we need to. But now everything that depends on arrow and using kernel would have to use kernel's arrow.
ryan: Parsing column of json strings bug that was filed to arrow. Hopefully ppl aren't using kernel just for arrow, and instead using arrow.
Zach: TODO: 0.7.0 release after feature flags. Do we want to include the OSS PR from Tim saucer
Zach: TODO: ExpressionTransform for record batch creation for new
create_one
APIryan: Concerning UC OSS. question: What opinions on catalog creation do consumers have? Does delta-rs have a notion of catalog that kernel should know about? What about other consumers like polars?
tyler: Currently, delta-rs has a catalog concept. It's not great. It does table resolution, so insufficient for what UC provides. It is a catalog + other stuff. Access control, lineage, etc. What ppl use UC for, it's table resolution + other stuff. stephen has been looking into credential provider/vendoring. AWS SDK has its own concept, also object store has its own. There's no interfaces for lineage presently. Permissions and credentials is a lot of added complexity.
ryan: There are two use cases that delta-rs wants. 1) UC OSS api exposed as a rust construct, so that everything UC provides is exposed to rust ecosystem. Would that be useful?
tyler: this isn't necessarily the ask. Kernel should not have UC integration. Need interface definition that we can provide to kernel. The
dyn Catalog
trait should let you specify all the functionality of a Catalog without UC dependency. Should allow multiple catalog.Stephen: Just do what http (?) does.
ryan: makes sense. Thing that makes it tricky is that ideally kernel shouldn't know about catalogs. Kernel only needs to know how to get a table. One way is to look at FS. Another way is catalog packaging enough info for kernel to do snapshots and other functionality. There needs to be a way for kernel to identify the table is not FS. There needs to be an opaque interface that kernel can rely on to perform commits. 2 things needed: 1) Some way to get info from catalog for snapshot. 2) Other is to have a way to commit.
nick: These things could be part of the kernel ecosystem. They don't need to be a part of the core kernel crate.
ryan: Interested to hear tyler's opinion
robert: There's a PR that has a kernel-backed Snapshot in delta-kernel-rs. How far do we want to push the default engine, and should that be a separate crate in delta-rs? We're starting to do a bunch of caching that lives in the FS/engine. This functionality should be a part of the engine. Either delta-rs builds an elaborate engine, and the default engine stays lean, or default engine starts doing the fancier strategies.
robert: Is there a notion of a stateful session when talking to catalog apis?
tyler: This is the wrench that Coordinated Commits (CC) throws in. kernel is designed so that either 1) the engine has control, or 2) the kernel. It's a passing back and forth. LakeFS and CC makes it a poor user/developer experience when passing around control. Need to manage a lot of possible callbacks in response to different responses. API surface blows up with things like CC. For credential vending, kernel needs to occaisionally callback into the engine to fetch a new one when it inevitably fails. CC pushes us in a direction of more complex api surface. Much more complex than an FS api where a lot more failures and branching can happen. There's a credential API in delta-rs, but it isn't executed very well. Becomes very specific to the catalog that is being used too. Would need a better trait/callback system to handle all these possible responses.
nick: Is a notion of a context a solution to all this? C would have a
void *
that holds needed context. Maybe kernel needs to pass around context belonging to the engine. Is the concern that there's gonna be way too many handlers to handle everything?tyler: yes. A handler for every failure case in CC/conflict resultion would be way too many handler.
ryan: 2 questions: 1) For credential vending, the credential is never stored. Every call takes an engine interface, so the credential could be created for each new call that lives for that duration. (could be stack allocated in the engine). The handle in the read_parquet handler should be handling these credentials. Is there something wrong in this model that doesn't capture these problems.
tyler: Need to match on the result of kernel and handle each scenario.
Ryan: Is the engine immutable? It's supposed to be mutable (internally). The engine should internally handle the context. Credential handling should be transparent to the kernel.
Robert: I think that can be done today. The object store impl can retry for credentials if it expires. Token based approaches have internal refresh mechanism.
Stephen: +1. It's completely hidden. Tho it doesn't have to be
Robert: Passing context thorugh is interesting. We talked about caching, and I built a prototype so we can do updates for snapshot and pass in scan data. Now we're losing info about json parsing. Hard to do without internal changes. Need to iterate over 2 sets of data to match things up. Passing in a context would make this easier. Ryan's proposal is to handle everything behind the engine handle.
Ryan: Maybe we should talk about that. Maybe we should reevaluate whether that's a mut reference.
Tyler: CC path is tricky and we should talk about it. Let's talk.
nick: Re mutability, we probably want interior mutability. Use a RefCell/Mutex internally. From kernel perspective, we don't mutate, so this is a good use case for interior mutability.
Beta Was this translation helpful? Give feedback.
All reactions