Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid unnecessary null buffer when decoding a primitive array with no null #13

Merged
merged 5 commits into from
Jan 3, 2025

Conversation

progval
Copy link
Collaborator

@progval progval commented Oct 29, 2024

This undoes a change in behavior caused by datafusion-contrib/datafusion-orc@e3b0be4, and is consistent with what pyorc does for non-primitive arrays.

I'm afraid the test does not actually work. For a reason I don't understand, the test passes even without the change to src/array_decoder/mod.rs.

However, I can reproduce the failure by replacing no_nulls.orc with this file (and removing all the assertions, and adding assert!(batch.column_by_name("perms").unwrap().nulls().is_none());)

I wrote tests/basic/data/no_nulls.py using the same code as what was used to produce directory_entry-all.orc, so I am guessing this is a change in pyorc version: I generated no_nulls.orc with pyorc 0.8.0 (which bundles ORC C++ 1.7.7), while directory_entry-all.orc was (probably) generated with pyorc 0.7.0 (which bundles ORC C++ 1.7.5). Unfortunately, pyorc 0.7.0 did not provide a wheel, and fails to build ORC on my system (because of a compiler warning and -Werror), so I can't check.

Any idea how to proceed?

Copy link
Collaborator

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea how to proceed?

What about trying to roundtrip ourselves using the write functionality from this crate itself to generate the test file (or test bytes since we can keep it in memory).

I believe current write functionality will always have a present stream as long as the input array has a null buffer (doesn't check the amount of nulls):

// Handling case where if encoding across RecordBatch boundaries, arrays
// might introduce a NullBuffer
match (array.nulls(), &mut self.present) {
// Need to copy only the valid values as indicated by null_buffer
(Some(null_buffer), Some(present)) => {
present.extend(null_buffer);
for index in null_buffer.valid_indices() {
let v = array.value(index);
self.encoder.write_one(v);
}
}
(Some(null_buffer), None) => {
// Lazily initiate present buffer and ensure backfill the already encoded values
let mut present = BooleanEncoder::new();
present.extend_present(self.encoded_count);
present.extend(null_buffer);
self.present = Some(present);
for index in null_buffer.valid_indices() {
let v = array.value(index);
self.encoder.write_one(v);
}
}

So could rely on that to craft the test file you require instead of fiddling with PyORC perhaps?

Comment on lines 101 to 106
Some(present) if present.null_count() > 0 => {
self.iter.decode_spaced(data.as_mut_slice(), &present)?;
let array = PrimitiveArray::<T>::new(data.into(), Some(present));
Ok(array)
}
None => {
_ => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on putting the check in derive_present_vec to share this functionality for non-primitive arrays?

fn derive_present_vec(
present: &mut Option<PresentDecoder>,
parent_present: Option<&NullBuffer>,
batch_size: usize,
) -> Option<Result<NullBuffer>> {
match (present, parent_present) {
(Some(present), Some(parent_present)) => {
let element_count = parent_present.len() - parent_present.null_count();
let present = present.next_buffer(element_count);
Some(merge_parent_present(parent_present, present))
}
(Some(present), None) => Some(present.next_buffer(batch_size)),
(None, Some(parent_present)) => Some(Ok(parent_present.clone())),
(None, None) => None,
}
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, done.

swhmirror pushed a commit to SoftwareHeritage/swh-graph that referenced this pull request Dec 10, 2024
@Jefffrey
Copy link
Collaborator

Any idea how to proceed?

What about trying to roundtrip ourselves using the write functionality from this crate itself to generate the test file (or test bytes since we can keep it in memory).

I believe current write functionality will always have a present stream as long as the input array has a null buffer (doesn't check the amount of nulls):

// Handling case where if encoding across RecordBatch boundaries, arrays
// might introduce a NullBuffer
match (array.nulls(), &mut self.present) {
// Need to copy only the valid values as indicated by null_buffer
(Some(null_buffer), Some(present)) => {
present.extend(null_buffer);
for index in null_buffer.valid_indices() {
let v = array.value(index);
self.encoder.write_one(v);
}
}
(Some(null_buffer), None) => {
// Lazily initiate present buffer and ensure backfill the already encoded values
let mut present = BooleanEncoder::new();
present.extend_present(self.encoded_count);
present.extend(null_buffer);
self.present = Some(present);
for index in null_buffer.valid_indices() {
let v = array.value(index);
self.encoder.write_one(v);
}
}

So could rely on that to craft the test file you require instead of fiddling with PyORC perhaps?

I've raised progval#1 with my proposed test

Copy link
Collaborator

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@Jefffrey Jefffrey merged commit 7618add into datafusion-contrib:main Jan 3, 2025
11 of 12 checks passed
@Jefffrey
Copy link
Collaborator

Jefffrey commented Jan 3, 2025

Thanks for this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants