-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid unnecessary null buffer when decoding a primitive array with no null #13
Avoid unnecessary null buffer when decoding a primitive array with no null #13
Conversation
… null This undoes a change in behavior caused by datafusion-contrib/datafusion-orc@e3b0be4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any idea how to proceed?
What about trying to roundtrip ourselves using the write functionality from this crate itself to generate the test file (or test bytes since we can keep it in memory).
I believe current write functionality will always have a present stream as long as the input array has a null buffer (doesn't check the amount of nulls):
Lines 106 to 127 in f934c8e
// Handling case where if encoding across RecordBatch boundaries, arrays | |
// might introduce a NullBuffer | |
match (array.nulls(), &mut self.present) { | |
// Need to copy only the valid values as indicated by null_buffer | |
(Some(null_buffer), Some(present)) => { | |
present.extend(null_buffer); | |
for index in null_buffer.valid_indices() { | |
let v = array.value(index); | |
self.encoder.write_one(v); | |
} | |
} | |
(Some(null_buffer), None) => { | |
// Lazily initiate present buffer and ensure backfill the already encoded values | |
let mut present = BooleanEncoder::new(); | |
present.extend_present(self.encoded_count); | |
present.extend(null_buffer); | |
self.present = Some(present); | |
for index in null_buffer.valid_indices() { | |
let v = array.value(index); | |
self.encoder.write_one(v); | |
} | |
} |
So could rely on that to craft the test file you require instead of fiddling with PyORC perhaps?
src/array_decoder/mod.rs
Outdated
Some(present) if present.null_count() > 0 => { | ||
self.iter.decode_spaced(data.as_mut_slice(), &present)?; | ||
let array = PrimitiveArray::<T>::new(data.into(), Some(present)); | ||
Ok(array) | ||
} | ||
None => { | ||
_ => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on putting the check in derive_present_vec
to share this functionality for non-primitive arrays?
orc-rust/src/array_decoder/mod.rs
Lines 211 to 226 in f934c8e
fn derive_present_vec( | |
present: &mut Option<PresentDecoder>, | |
parent_present: Option<&NullBuffer>, | |
batch_size: usize, | |
) -> Option<Result<NullBuffer>> { | |
match (present, parent_present) { | |
(Some(present), Some(parent_present)) => { | |
let element_count = parent_present.len() - parent_present.null_count(); | |
let present = present.next_buffer(element_count); | |
Some(merge_parent_present(parent_present, present)) | |
} | |
(Some(present), None) => Some(present.next_buffer(batch_size)), | |
(None, Some(parent_present)) => Some(Ok(parent_present.clone())), | |
(None, None) => None, | |
} | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, done.
This reverts commit 260d641. orc-rust v0.4.0 and v0.5.0 break ar_row: * datafusion-contrib/orc-rust#13 * https://github.com/datafusion-contrib/datafusion-orc/issues/137
So non-primitive vecs benefit from it too
I've raised progval#1 with my proposed test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Thanks for this |
This undoes a change in behavior caused by datafusion-contrib/datafusion-orc@e3b0be4, and is consistent with what pyorc does for non-primitive arrays.
I'm afraid the test does not actually work. For a reason I don't understand, the test passes even without the change to
src/array_decoder/mod.rs
.However, I can reproduce the failure by replacing
no_nulls.orc
with this file (and removing all the assertions, and addingassert!(batch.column_by_name("perms").unwrap().nulls().is_none());
)I wrote
tests/basic/data/no_nulls.py
using the same code as what was used to producedirectory_entry-all.orc
, so I am guessing this is a change in pyorc version: I generatedno_nulls.orc
with pyorc 0.8.0 (which bundles ORC C++ 1.7.7), whiledirectory_entry-all.orc
was (probably) generated with pyorc 0.7.0 (which bundles ORC C++ 1.7.5). Unfortunately, pyorc 0.7.0 did not provide a wheel, and fails to build ORC on my system (because of a compiler warning and-Werror
), so I can't check.Any idea how to proceed?