-
Notifications
You must be signed in to change notification settings - Fork 77
feat: provide basic columnar batch API #41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Nice start, I had the same in a branch. Shame on me for not getting the pull request sent previously 🤦 |
@rtyler feel free to move your PR forward this week if you want. I don't think I'll get back to this until Sunday. |
b4896a4
to
632b4e5
Compare
// TODO: Should all these methods perform bounds checking? Should they return | ||
// errors for out of bounds? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open question for discussion.
/// Check if the element at the specified index is null. | ||
fn is_null(&self, i: usize) -> bool; | ||
|
||
// TODO: should these methods type check? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open question for discussion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coming from the "don't pessimize performance" angle, I would advocate that the getters be defined to not type check and also to return values directly rather than Option
:
- The column vector is typed already so presumably clients should validate once when receiving the vector/batch and be done with it.
- If nulls are involved, there will be a branch no matter what, and the
is_null
method that anyway needs to exist seems good enough for that?
Note that the above does not preclude assertions -- perhaps debug-only -- to catch obvious bad behaviors in testing.
As an alternative, we could define vectors as strongly typed, and allow implementations to specialize (e.g. ColumnVectorImpl<i32>
would define a get_i32
that is hard-wired to return a result (no check needed) and all other getters are hard-wired to throw. I suppose some base implementation could provide the "hard-wired to throw" aspect and then each concrete vector just has to define the getter it actually supports?
Nulls are trickier... ideally a columnar format would track nullable columns as (column,bitvector) pairs, to allow bitwise operations for e.g. AND and OR predicates. If we took that approach then nothing in the interface itself is nullable and we eliminate some sources of "fun" while also allowing (but not requiring) branchless evaluation strategies that columnar engines tend to have very strong opinions about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually... I was chatting with @zachschuermann and @nicklan, and it came up that ColumnBatch
should probably be an opaque type -- it advertises its schema and length, but doesn't expose a ColumnVector
concept at all. Instead, code in kernel that needs to consume a specific column can request either an iterator or an array of primitive values for each such column. And then co-iterate over the resulting flat schema by ordinals -- the schema and ordinals should both be known at compile time.
Note: This idea presumes that ColumnBatch
is the interface kernel uses to communicate with the engine -- NOT the interface engine uses internally nor exposes to the outside world. The engine is free to consume column batches however it wants, because the engine defines and creates them in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it came up that ColumnBatch should probably be an opaque type -- it advertises its schema and length, but doesn't expose a ColumnVector concept at all. Instead, code in kernel that needs to consume a specific column can request either an iterator or an array of primitive values for each such column.
I'm a fan of this idea. The current interface to ColumnarBatch is very row-oriented and doesn't seem like it would be an efficient way to access values in something columnar.
However, the one wrinkle I see is I don't know how struct, array, and map values would work here. And I think those are critical for reading from the log. We could prototype something and see if there is a good solution for those. LMK if you have any initial ideas on how to approach that.
/// Get the string value at the specified index. | ||
fn get_string(&self, i: usize) -> DeltaResult<Option<&str>>; | ||
|
||
// TODO: add other primitive types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO captured in #52
arrow-array = { version = "^47.0" } | ||
arrow-arith = { version = "^47.0" } | ||
arrow-json = { version = "^47.0" } | ||
arrow-ord = { version = "^47.0" } | ||
arrow-schema = { version = "^47.0" } | ||
arrow-select = { version = "^47.0" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrow upgrade is necessary due to a bug in MapArray which would cause our tests to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropping some wild ideas/comments on this PR for discussion.
Hopefully it's a helpful pot-stirring exercise.
/// sub-module. Engines may provide their own implementations optimized for their | ||
/// in-memory format. | ||
pub trait ColumnarBatch { | ||
type Column: ColumnVector; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using typedefs to make arbitrary type rename is an anti-pattern in most languages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a typedef. It's an associated type.
You can read this as, "every implementor of ColumnarBatch has an type associated with it called Column. Column must implement ColumnVector."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah. that would be my ignorance of rust, sorry!
Self: Sized; | ||
|
||
/// Iterate over the rows in the batch. | ||
fn rows(&self) -> Box<dyn Iterator<Item = Box<dyn Row<Column = Self::Column>>>>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this be better as a stand-alone RowIterator
class that can "mount" a columnar batch?
(might be a cleaner memory lifetime story?)
Or do we expect different columnar batch implementations to have some internal magic that means their row iterators shouldn't have the same basic implementation?
e.g. Arrow doesn't have any row-oriented interface, so there's no internal magic there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the only special case I can think of is constant columns, where arrow has a Datum
type that behaves like an arrary but internally just stores the constant. However I do not think that this would impact a generic implementation for a RowIterator
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel agnostic on this. If it weren't for struct columns, I'd feel tempted to remove the whole Row concept entirely until we felt we really needed it.
/// Check if the element at the specified index is null. | ||
fn is_null(&self, i: usize) -> bool; | ||
|
||
// TODO: should these methods type check? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coming from the "don't pessimize performance" angle, I would advocate that the getters be defined to not type check and also to return values directly rather than Option
:
- The column vector is typed already so presumably clients should validate once when receiving the vector/batch and be done with it.
- If nulls are involved, there will be a branch no matter what, and the
is_null
method that anyway needs to exist seems good enough for that?
Note that the above does not preclude assertions -- perhaps debug-only -- to catch obvious bad behaviors in testing.
As an alternative, we could define vectors as strongly typed, and allow implementations to specialize (e.g. ColumnVectorImpl<i32>
would define a get_i32
that is hard-wired to return a result (no check needed) and all other getters are hard-wired to throw. I suppose some base implementation could provide the "hard-wired to throw" aspect and then each concrete vector just has to define the getter it actually supports?
Nulls are trickier... ideally a columnar format would track nullable columns as (column,bitvector) pairs, to allow bitwise operations for e.g. AND and OR predicates. If we took that approach then nothing in the interface itself is nullable and we eliminate some sources of "fun" while also allowing (but not requiring) branchless evaluation strategies that columnar engines tend to have very strong opinions about.
/// Check if the element at the specified index is null. | ||
fn is_null(&self, i: usize) -> bool; | ||
|
||
// TODO: should these methods type check? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually... I was chatting with @zachschuermann and @nicklan, and it came up that ColumnBatch
should probably be an opaque type -- it advertises its schema and length, but doesn't expose a ColumnVector
concept at all. Instead, code in kernel that needs to consume a specific column can request either an iterator or an array of primitive values for each such column. And then co-iterate over the resulting flat schema by ordinals -- the schema and ordinals should both be known at compile time.
Note: This idea presumes that ColumnBatch
is the interface kernel uses to communicate with the engine -- NOT the interface engine uses internally nor exposes to the outside world. The engine is free to consume column batches however it wants, because the engine defines and creates them in the first place.
fn size(&self) -> usize; | ||
|
||
/// Check if the element at the specified index is null. | ||
fn is_null(&self, i: usize) -> bool; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I would favor tracking nulls as separate columns if possible.
Looking at arrow -- it exposes a similar is_null+get pair for each column, which on its face would interfere with branchless evaluation strategies. I would expect that under the hood arrow compute internally uses null+value column pairs rather than using the public interface.
Looking at DuckDB -- its isNullLoop uses a UnifiedVectorFormat with three internal columns: an optional index shuffling column (.sel
), a validity column (.validity
) and the actual data column (.data
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at arrow -- it exposes a similar is_null+get pair for each column, which on its face would interfere with branchless evaluation strategies. I would expect that under the hood arrow compute internally uses null+value column pairs rather than using the public interface.
Yeah you are correct. Most performant sensitive users get the underlying null and value buffers and process them in a vectorized fashion.
This provides a starting implementation for
ColumnarBatch
. There is a default implementation for Arrow.I've only implemented support for a few types to help focus the discussion. The remaining data types will be implemented as a follow up in #52.
To keep the PR focused, I've also left out integrating columnar batch with other parts. That is tracked in #53. It's likely that will be blocked on #40.
Closes #21.