-
Notifications
You must be signed in to change notification settings - Fork 924
Safe API to replace NullBuffers
for Arrays
#6528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I think this makes sense, the only thing to be careful with is arrays where null values may have undefined contents, e.g. dictionaries. In such cases, allowing users to go from a null to not null, could have safety implications |
FWIW the nullif kernel is very similar to this, but with the caveat that nulls can only remain null, avoiding the above issue Edit: in fact the operation you describe is the nullif kernel I think... |
I think the nullif kernel provides this, so perhaps this can be closed? |
Sounds good -- the current documentation on nullif is pretty sparse (and thus perhaps we can make it easier to discover / more likely people can find it) with some better docs https://docs.rs/arrow/latest/arrow/compute/kernels/nullif/fn.nullif.html I'll try and find some time |
PR to improve the docs: #6658 After doing that it is somewhat of the inverse of what I was looking for (it sets the element to I will attempt a PR with a proposed API as well for consideration |
I made #6659 to add The actual usecase I have is not to turn elements unnull, but instead make them null if a boolean array is not true (the inverse of what Maybe a better API would be something like I will ponder |
You could either negate the boolean array, or construct a BooleanArray with a null buffer of the buffer you want to be null if false, and a values buffer of entirely false. Neither is exactly ideal, but the performance, especially of the latter, is likely going to be hard to beat. |
🤔 this is now we do it today in DataFusion: https://github.com/apache/datafusion/blob/f23360f1e0f90abcf92a067de55a9053da545ed2/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/nulls.rs#L53-L102 (basically call |
This request has come up again on #7119 (comment) I think adding a function along the lines of (possibly with Result and some error handling)
Would be a valuable addition. This could also be added as a |
I think the bigger advantage would be it would be more discoverable |
FWIW, this is the variant I ended up using, since I needed the null mask unions to propagate recursively through the whole struct: /// Splits a StructArray into its parts, unions in the parent null mask, and uses the result to
/// recursively update the children as well before putting everything back together.
fn compute_nested_null_masks(sa: StructArray, parent_nulls: Option<&NullBuffer>) -> StructArray {
let (fields, columns, nulls) = sa.into_parts();
let nulls = NullBuffer::union(parent_nulls, nulls.as_ref());
let columns = columns
.into_iter()
.map(|column| match column.as_struct_opt() {
Some(sa) => Arc::new(compute_nested_null_masks(sa.clone(), nulls.as_ref())) as _,
None => {
let data = column.to_data();
let nulls = NullBuffer::union(nulls.as_ref(), data.nulls());
let builder = data.into_builder().nulls(nulls);
// Use an unchecked build to avoid paying a redundant O(k) validation cost for a
// `RecordBatch` with k leaf columns.
//
// SAFETY: The builder was constructed from an `ArrayData` we extracted from the
// column. The change we make is the null buffer, via `NullBuffer::union` with input
// null buffers that were _also_ extracted from the column and its parent. A union
// can only _grow_ the set of NULL rows, so data validity is preserved. Even if the
// `parent_nulls` somehow had a length mismatch --- which it never should, having
// also been extracted from our grandparent --- the mismatch would have already
// caused `NullBuffer::union` to panic.
let data = unsafe { builder.build_unchecked() };
make_array(data)
}
})
.collect();
// Use an unchecked constructor to avoid paying O(n*k) a redundant null buffer validation cost
// for a `RecordBatch` with n rows and k leaf columns.
//
// SAFETY: We are simply reassembling the input `StructArray` we previously broke apart, with
// updated null buffers. See above for details about null buffer safety.
unsafe { StructArray::new_unchecked(fields, columns, nulls) } NOTE: A side advantage of the above approach is that the builder has fully recursive checking while the struct array only validates that its (presumably already correct) children fit together nicely. So if the above were changed to use fallible |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
While implementing apache/datafusion#12792 and various other things in DataFusion I find myself often wanting to combine a filter and a null mask
The relevant code is like
Describe the solution you'd like
I would like an API like
with_nulls
that returns a new array with the same data but a new null mask so my code would look likeDescribe alternatives you've considered
I can keep using the unsafe APIs
Additional context
The text was updated successfully, but these errors were encountered: