feat: provide basic columnar batch API #41

wjones127 · 2023-09-18T03:47:03Z

This provides a starting implementation for ColumnarBatch. There is a default implementation for Arrow.

I've only implemented support for a few types to help focus the discussion. The remaining data types will be implemented as a follow-up in #52.

To keep the PR focused, I've also left out integrating columnar batch with other parts. That is tracked in #53 and is likely to be blocked on #40.

Closes #21.

rtyler · 2023-09-19T22:26:35Z

Nice start, I had the same in a branch. Shame on me for not getting the pull request sent previously 🤦

wjones127 · 2023-09-20T00:16:35Z

Nice start, I had the same in a branch. Shame on me for not getting the pull request sent previously

@rtyler feel free to move your PR forward this week if you want. I don't think I'll get back to this until Sunday.

wjones127 · 2023-10-19T03:37:31Z

kernel/src/columnar_batch.rs

+// TODO: Should all these methods perform bounds checking? Should they return
+// errors for out of bounds?


Open question for discussion.
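To make the two options concrete, here is a minimal sketch (not the code in this PR). `IntColumn`, `AccessError`, `get`, and `try_get` are hypothetical names invented for illustration, and the PR's `DeltaResult` is replaced by a plain `Result` so the snippet stands alone.

```rust
/// Hypothetical error type for this sketch; the PR's `DeltaResult` is not reproduced here.
#[derive(Debug, PartialEq)]
pub enum AccessError {
    OutOfBounds { index: usize, len: usize },
}

/// Hypothetical column of non-null i32 values.
pub struct IntColumn {
    values: Vec<i32>,
}

impl IntColumn {
    pub fn new(values: Vec<i32>) -> Self {
        Self { values }
    }

    /// Option 1: rely on slice indexing, which panics when `i` is out of bounds.
    pub fn get(&self, i: usize) -> i32 {
        self.values[i]
    }

    /// Option 2: check bounds explicitly and return an error instead of panicking.
    pub fn try_get(&self, i: usize) -> Result<i32, AccessError> {
        self.values.get(i).copied().ok_or(AccessError::OutOfBounds {
            index: i,
            len: self.values.len(),
        })
    }
}
```

The first form keeps the signature simple and leaves validation to the caller; the second makes every call site handle an error, which costs some ergonomics in hot loops.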

wjones127 · 2023-10-19T03:37:43Z

kernel/src/columnar_batch.rs

+    /// Check if the element at the specified index is null.
+    fn is_null(&self, i: usize) -> bool;
+
+    // TODO: should these methods type check?


Open question for discussion?

Coming from the "don't pessimize performance" angle, I would advocate that the getters be defined to not type check and also to return values directly rather than Option:

The column vector is typed already so presumably clients should validate once when receiving the vector/batch and be done with it.

If nulls are involved, there will be a branch no matter what, and the is_null method, which needs to exist anyway, seems good enough for that?

Note that the above does not preclude assertions -- perhaps debug-only -- to catch obvious bad behaviors in testing.

As an alternative, we could define vectors as strongly typed, and allow implementations to specialize (e.g. ColumnVectorImpl<i32> would define a get_i32 that is hard-wired to return a result (no check needed) and all other getters are hard-wired to throw). I suppose some base implementation could provide the "hard-wired to throw" aspect and then each concrete vector just has to define the getter it actually supports?

Nulls are trickier... ideally a columnar format would track nullable columns as (column,bitvector) pairs, to allow bitwise operations for e.g. AND and OR predicates. If we took that approach then nothing in the interface itself is nullable and we eliminate some sources of "fun" while also allowing (but not requiring) branchless evaluation strategies that columnar engines tend to have very strong opinions about.
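As a rough illustration of the (column, bitvector) idea, the sketch below ANDs two nullable boolean columns word by word. Everything here is an assumption made for illustration: the `NullableBools` type, the `and_kleene` name, the choice of `u64` words, and the choice of SQL-style three-valued semantics (null AND false = false).

```rust
/// Hypothetical nullable boolean column: one bit per row, packed into u64 words.
/// A set bit in `validity` means the row is non-null.
#[derive(Debug, Clone, PartialEq)]
pub struct NullableBools {
    pub values: Vec<u64>,
    pub validity: Vec<u64>,
}

/// SQL-style three-valued AND, computed 64 rows at a time with no per-row branch.
pub fn and_kleene(a: &NullableBools, b: &NullableBools) -> NullableBools {
    assert_eq!(a.values.len(), a.validity.len());
    assert_eq!(b.values.len(), b.validity.len());
    assert_eq!(a.values.len(), b.values.len());

    let mut values = Vec::with_capacity(a.values.len());
    let mut validity = Vec::with_capacity(a.values.len());
    for i in 0..a.values.len() {
        let (av, am) = (a.values[i], a.validity[i]);
        let (bv, bm) = (b.values[i], b.validity[i]);
        // A row is known to be false if either side is a non-null false.
        let known_false = (am & !av) | (bm & !bv);
        // The result is non-null if both inputs are non-null or the row is known false.
        let valid = (am & bm) | known_false;
        validity.push(valid);
        // Mask the value bits so null rows carry a deterministic 0.
        values.push(av & bv & valid);
    }
    NullableBools { values, validity }
}
```

Because nulls live in their own bitvector, the getter-style interface never has to return `Option`, and an engine remains free to evaluate predicates in bulk.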

Actually... I was chatting with @zachschuermann and @nicklan, and it came up that ColumnBatch should probably be an opaque type -- it advertises its schema and length, but doesn't expose a ColumnVector concept at all. Instead, code in kernel that needs to consume a specific column can request either an iterator or an array of primitive values for each such column. And then co-iterate over the resulting flat schema by ordinals -- the schema and ordinals should both be known at compile time.

Note: This idea presumes that ColumnBatch is the interface kernel uses to communicate with the engine -- NOT the interface engine uses internally nor exposes to the outside world. The engine is free to consume column batches however it wants, because the engine defines and creates them in the first place.

it came up that ColumnBatch should probably be an opaque type -- it advertises its schema and length, but doesn't expose a ColumnVector concept at all. Instead, code in kernel that needs to consume a specific column can request either an iterator or an array of primitive values for each such column.

I'm a fan of this idea. The current interface to ColumnarBatch is very row-oriented and doesn't seem like it would be an efficient way to access values in something columnar.

However, the one wrinkle I see is I don't know how struct, array, and map values would work here. And I think those are critical for reading from the log. We could prototype something and see if there is a good solution for those. LMK if you have any initial ideas on how to approach that.
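For discussion, here is one possible shape of the opaque-batch idea for flat primitive columns only. All names (`OpaqueBatch`, `i64_column`, `str_column`, `VecBatch`, `total_size_with_path`) are hypothetical, the schema accessor is omitted for brevity, and the sketch deliberately does not address the struct, array, and map question raised above.

```rust
/// Hypothetical opaque batch: exposes its length and per-column extraction only.
pub trait OpaqueBatch {
    fn len(&self) -> usize;
    /// Iterate the i64 column at `ordinal`; `None` marks a null.
    fn i64_column(&self, ordinal: usize) -> Box<dyn Iterator<Item = Option<i64>> + '_>;
    /// Iterate the string column at `ordinal`; `None` marks a null.
    fn str_column(&self, ordinal: usize) -> Box<dyn Iterator<Item = Option<&str>> + '_>;
}

/// Toy engine-side batch with a fixed schema: (path: string, size: i64).
pub struct VecBatch {
    pub paths: Vec<Option<String>>, // ordinal 0
    pub sizes: Vec<Option<i64>>,    // ordinal 1
}

impl OpaqueBatch for VecBatch {
    fn len(&self) -> usize {
        self.paths.len()
    }

    fn i64_column(&self, ordinal: usize) -> Box<dyn Iterator<Item = Option<i64>> + '_> {
        match ordinal {
            1 => Box::new(self.sizes.iter().copied()),
            _ => panic!("column {} is not an i64 column", ordinal),
        }
    }

    fn str_column(&self, ordinal: usize) -> Box<dyn Iterator<Item = Option<&str>> + '_> {
        match ordinal {
            0 => Box::new(self.paths.iter().map(|p| p.as_deref())),
            _ => panic!("column {} is not a string column", ordinal),
        }
    }
}

/// Kernel-side consumer: the ordinals are known up front, so it co-iterates
/// the two columns and sums the sizes of rows where both values are non-null.
pub fn total_size_with_path(batch: &dyn OpaqueBatch) -> i64 {
    batch
        .str_column(0)
        .zip(batch.i64_column(1))
        .filter_map(|(path, size)| path.and(size))
        .sum()
}
```

The kernel code never sees a column vector type; it only sees iterators of primitives, which is the property the proposal is after.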

wjones127 · 2023-10-19T03:41:03Z

kernel/src/columnar_batch.rs

+    /// Get the string value at the specified index.
+    fn get_string(&self, i: usize) -> DeltaResult<Option<&str>>;
+
+    // TODO: add other primitive types


TODO captured in #52

wjones127 · 2023-10-19T03:46:34Z

kernel/Cargo.toml

+arrow-array = { version = "^47.0" }
+arrow-arith = { version = "^47.0" }
+arrow-json = { version = "^47.0" }
+arrow-ord = { version = "^47.0" }
+arrow-schema = { version = "^47.0" }
+arrow-select = { version = "^47.0" }


The Arrow upgrade is necessary because of a bug in MapArray that would cause our tests to fail.

ryan-johnson-databricks

Dropping some wild ideas/comments on this PR for discussion.
Hopefully it's a helpful pot-stirring exercise.

ryan-johnson-databricks · 2023-11-16T17:10:16Z

kernel/src/columnar_batch.rs

+/// sub-module. Engines may provide their own implementations optimized for their
+/// in-memory format.
+pub trait ColumnarBatch {
+    type Column: ColumnVector;


Isn't using typedefs to arbitrarily rename types an anti-pattern in most languages?

This is not a typedef. It's an associated type.

You can read this as, "every implementor of ColumnarBatch has a type associated with it called Column. Column must implement ColumnVector."

Ah, that would be my ignorance of Rust, sorry!
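For readers less familiar with Rust, a small sketch of how an associated type behaves. Only `type Column: ColumnVector;` comes from the diff; the `column` accessor, `VecColumn`, `VecBackedBatch`, and `first_column_size` are hypothetical names added so the example compiles on its own.

```rust
pub trait ColumnVector {
    fn size(&self) -> usize;
}

pub trait ColumnarBatch {
    /// Each implementor picks its own concrete column type.
    type Column: ColumnVector;

    /// Hypothetical accessor, used here only to show the associated type in a signature.
    fn column(&self, i: usize) -> &Self::Column;
}

/// Toy column backed by a Vec.
pub struct VecColumn(pub Vec<i32>);

impl ColumnVector for VecColumn {
    fn size(&self) -> usize {
        self.0.len()
    }
}

/// Toy batch whose associated `Column` type is `VecColumn`.
pub struct VecBackedBatch {
    pub columns: Vec<VecColumn>,
}

impl ColumnarBatch for VecBackedBatch {
    type Column = VecColumn;

    fn column(&self, i: usize) -> &VecColumn {
        &self.columns[i]
    }
}

/// Generic code works for any batch and reaches the column type through the trait.
pub fn first_column_size<B: ColumnarBatch>(batch: &B) -> usize {
    batch.column(0).size()
}
```

Unlike a typedef, `Column` is not a rename of one fixed type: an Arrow-backed implementation and an engine-specific one can each bind it to a different concrete type.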

ryan-johnson-databricks · 2023-11-16T17:31:43Z

kernel/src/columnar_batch.rs

+        Self: Sized;
+
+    /// Iterate over the rows in the batch.
+    fn rows(&self) -> Box<dyn Iterator<Item = Box<dyn Row<Column = Self::Column>>>>;


Would this be better as a stand-alone RowIterator class that can "mount" a columnar batch?
(might be a cleaner memory lifetime story?)

Or do we expect different columnar batch implementations to have some internal magic that means their row iterators shouldn't have the same basic implementation?

e.g. Arrow doesn't have any row-oriented interface, so there's no internal magic there.

The only special case I can think of is constant columns, where Arrow has a Datum type that behaves like an array but internally just stores the constant. However, I do not think that this would impact a generic implementation for a RowIterator.

I feel agnostic on this. If it weren't for struct columns, I'd feel tempted to remove the whole Row concept entirely until we felt we really needed it.
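If we did go the stand-alone route, a generic iterator could look roughly like the sketch below. The `Batch` trait, `RowRef`, `RowIterator::mount`, and the `Columns` toy type are all hypothetical, and only `i64` values are handled to keep it short.

```rust
/// Hypothetical minimal batch interface needed by the iterator.
pub trait Batch {
    fn num_rows(&self) -> usize;
    fn get_i64(&self, column: usize, row: usize) -> i64;
}

/// A borrowed view of one row; it cannot outlive the batch it points into.
pub struct RowRef<'a, B: Batch> {
    batch: &'a B,
    row: usize,
}

impl<'a, B: Batch> RowRef<'a, B> {
    pub fn get_i64(&self, column: usize) -> i64 {
        self.batch.get_i64(column, self.row)
    }
}

/// Stand-alone iterator that "mounts" any batch by reference.
pub struct RowIterator<'a, B: Batch> {
    batch: &'a B,
    cursor: usize,
}

impl<'a, B: Batch> RowIterator<'a, B> {
    pub fn mount(batch: &'a B) -> Self {
        Self { batch, cursor: 0 }
    }
}

impl<'a, B: Batch> Iterator for RowIterator<'a, B> {
    type Item = RowRef<'a, B>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.cursor >= self.batch.num_rows() {
            return None;
        }
        let row = RowRef { batch: self.batch, row: self.cursor };
        self.cursor += 1;
        Some(row)
    }
}

/// Toy batch: a list of equally sized i64 columns.
pub struct Columns(pub Vec<Vec<i64>>);

impl Batch for Columns {
    fn num_rows(&self) -> usize {
        self.0.first().map_or(0, |c| c.len())
    }

    fn get_i64(&self, column: usize, row: usize) -> i64 {
        self.0[column][row]
    }
}
```

The explicit `'a` lifetime ties every row to the borrowed batch, and no boxing or trait objects are needed, in contrast to the `Box<dyn Iterator<...>>` signature in the diff.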

ryan-johnson-databricks · 2023-11-16T17:34:59Z

kernel/src/columnar_batch.rs

+    /// Check if the element at the specified index is null.
+    fn is_null(&self, i: usize) -> bool;
+
+    // TODO: should these methods type check?


Coming from the "don't pessimize performance" angle, I would advocate that the getters be defined to not type check and also to return values directly rather than Option:

The column vector is typed already so presumably clients should validate once when receiving the vector/batch and be done with it.

If nulls are involved, there will be a branch no matter what, and the is_null method, which needs to exist anyway, seems good enough for that?

Note that the above does not preclude assertions -- perhaps debug-only -- to catch obvious bad behaviors in testing.

As an alternative, we could define vectors as strongly typed, and allow implementations to specialize (e.g. ColumnVectorImpl<i32> would define a get_i32 that is hard-wired to return a result (no check needed) and all other getters are hard-wired to throw). I suppose some base implementation could provide the "hard-wired to throw" aspect and then each concrete vector just has to define the getter it actually supports?

Nulls are trickier... ideally a columnar format would track nullable columns as (column,bitvector) pairs, to allow bitwise operations for e.g. AND and OR predicates. If we took that approach then nothing in the interface itself is nullable and we eliminate some sources of "fun" while also allowing (but not requiring) branchless evaluation strategies that columnar engines tend to have very strong opinions about.
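The strongly typed alternative maps fairly naturally onto default trait methods in Rust. The following is a sketch under assumptions, not the PR's trait: `GetError`, `Int32Vector`, and `StringVector` are hypothetical, and errors are returned rather than thrown since Rust has no exceptions.

```rust
/// Hypothetical error for calling a getter the vector does not support.
#[derive(Debug, PartialEq)]
pub enum GetError {
    WrongType { requested: &'static str },
}

pub trait ColumnVector {
    fn size(&self) -> usize;

    // The defaults play the role of the "hard-wired to throw" base implementation.
    fn get_i32(&self, _i: usize) -> Result<i32, GetError> {
        Err(GetError::WrongType { requested: "i32" })
    }

    fn get_string(&self, _i: usize) -> Result<&str, GetError> {
        Err(GetError::WrongType { requested: "string" })
    }
}

/// Concrete vector that overrides only the getter it supports.
pub struct Int32Vector(pub Vec<i32>);

impl ColumnVector for Int32Vector {
    fn size(&self) -> usize {
        self.0.len()
    }

    fn get_i32(&self, i: usize) -> Result<i32, GetError> {
        Ok(self.0[i]) // no type check needed: the vector is i32 by construction
    }
}

pub struct StringVector(pub Vec<String>);

impl ColumnVector for StringVector {
    fn size(&self) -> usize {
        self.0.len()
    }

    fn get_string(&self, i: usize) -> Result<&str, GetError> {
        Ok(self.0[i].as_str())
    }
}
```

Each concrete vector pays no per-call type check for its own getter; a mismatched call falls through to the default and fails.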

ryan-johnson-databricks · 2023-11-16T18:36:02Z

kernel/src/columnar_batch.rs

+    /// Check if the element at the specified index is null.
+    fn is_null(&self, i: usize) -> bool;
+
+    // TODO: should these methods type check?


Actually... I was chatting with @zachschuermann and @nicklan, and it came up that ColumnBatch should probably be an opaque type -- it advertises its schema and length, but doesn't expose a ColumnVector concept at all. Instead, code in kernel that needs to consume a specific column can request either an iterator or an array of primitive values for each such column. And then co-iterate over the resulting flat schema by ordinals -- the schema and ordinals should both be known at compile time.

Note: This idea presumes that ColumnBatch is the interface kernel uses to communicate with the engine -- NOT the interface engine uses internally nor exposes to the outside world. The engine is free to consume column batches however it wants, because the engine defines and creates them in the first place.

ryan-johnson-databricks · 2023-11-16T19:22:48Z

kernel/src/columnar_batch.rs

+    fn size(&self) -> usize;
+
+    /// Check if the element at the specified index is null.
+    fn is_null(&self, i: usize) -> bool;


Overall, I would favor tracking nulls as separate columns if possible.

Looking at arrow -- it exposes a similar is_null+get pair for each column, which on its face would interfere with branchless evaluation strategies. I would expect that under the hood arrow compute internally uses null+value column pairs rather than using the public interface.

Looking at DuckDB -- its isNullLoop uses a UnifiedVectorFormat with three internal columns: an optional index shuffling column (.sel), a validity column (.validity) and the actual data column (.data).

Looking at arrow -- it exposes a similar is_null+get pair for each column, which on its face would interfere with branchless evaluation strategies. I would expect that under the hood arrow compute internally uses null+value column pairs rather than using the public interface.

Yeah, you are correct. Most performance-sensitive users get the underlying null and value buffers and process them in a vectorized fashion.
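A toy comparison of the two access styles, using plain slices rather than Arrow's API. The function names are hypothetical, and the `&[bool]` validity slice is a simplification: Arrow packs validity into a bitmap.

```rust
/// Row-at-a-time style: one null check (a branch) per element.
pub fn sum_row_at_a_time(rows: &[Option<i64>]) -> i64 {
    let mut total = 0;
    for row in rows {
        if let Some(v) = row {
            total += v;
        }
    }
    total
}

/// Buffer style: separate value and validity buffers processed together.
/// `validity[i] == true` means row `i` is non-null.
pub fn sum_valid(values: &[i64], validity: &[bool]) -> i64 {
    debug_assert_eq!(values.len(), validity.len());
    values
        .iter()
        .zip(validity)
        // Multiplying by 0 or 1 masks out nulls without a per-row branch.
        .map(|(&v, &valid)| v * i64::from(valid))
        .sum()
}
```

Both functions compute the same result; the second form is the kind of loop a compiler can more readily vectorize, which is the motivation for exposing the underlying buffers.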

wjones127 added 3 commits September 25, 2023 21:56

wip: implement columnar batch API

bd7ce3b

wip: pursue slice approach

dc715a9

flesh out columnar batch

632b4e5

wjones127 force-pushed the columnar-batch branch from b4896a4 to 632b4e5 Compare October 11, 2023 04:31

wjones127 added 2 commits October 10, 2023 22:28

a little more progress

f17fbd5

fill out tests

07d1702

wjones127 commented Oct 19, 2023

wjones127 mentioned this pull request Oct 19, 2023

Implement remaining types for ColumnarBatch #52

Closed

wjones127 commented Oct 19, 2023

wjones127 marked this pull request as ready for review October 19, 2023 03:45

wjones127 changed the title from "wip: implement columnar batch API" to "feat: provide basic columnar batch API" Oct 19, 2023

wjones127 requested review from nicklan and roeap October 19, 2023 03:46

wjones127 commented Oct 19, 2023

remove irrelevant change

5f6193f

ryan-johnson-databricks reviewed Nov 16, 2023

nicklan closed this Mar 19, 2024
