Skip to content

Safe API to replace NullBuffers for Arrays #6528

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
alamb opened this issue Oct 8, 2024 · 11 comments
Open

Safe API to replace NullBuffers for Arrays #6528

alamb opened this issue Oct 8, 2024 · 11 comments
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted

Comments

@alamb
Copy link
Contributor

alamb commented Oct 8, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
While implementing apache/datafusion#12792 and various other things in DataFusion I find myself often wanting to combine a filter and a null mask

The relevant code is like

        // combine existing nulls, if any, and a filter:
        let nulls = filtered_null_mask(opt_filter, input);

        // make a new array with a new null buffer
        let output: ArrayRef = match input.data_type() {
            // TODO it would be nice to have safe apis in arrow-rs to update the null buffers in the arrays
            DataType::Utf8 => {
                let input = input.as_string::<i32>();
                // safety: values / offsets came from a valid string array, so are valid utf8
                // and we checked nulls has the same length as values
                unsafe {
                    Arc::new(StringArray::new_unchecked(
                        input.offsets().clone(),
                        input.values().clone(),
                        nulls,
                    ))
                }
            }

Describe the solution you'd like
I would like an API like with_nulls that returns a new array with the same data but a new null mask so my code would look like

        // combine existing nulls, if any, and a filter:
        let nulls = filtered_null_mask(opt_filter, input);

        // make a new array, with the same data but a new null buffer
        // panics if the nulls length doesn't match the array's length
        let output = input.with_nulls(nulls);

Describe alternatives you've considered
I can keep using the unsafe APIs

Additional context

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Oct 8, 2024
@tustvold
Copy link
Contributor

tustvold commented Oct 9, 2024

I think this makes sense, the only thing to be careful with is arrays where null values may have undefined contents, e.g. dictionaries. In such cases, allowing users to go from a null to not null, could have safety implications

@tustvold
Copy link
Contributor

tustvold commented Oct 9, 2024

FWIW the nullif kernel is very similar to this, but with the caveat that nulls can only remain null, avoiding the above issue

Edit: in fact the operation you describe is the nullif kernel I think...

@tustvold
Copy link
Contributor

I think the nullif kernel provides this, so perhaps this can be closed?

@alamb
Copy link
Contributor Author

alamb commented Oct 30, 2024

Sounds good -- the current documentation on nullif is pretty sparse (and thus perhaps we can make it easier to discover / more likely people can find it) with some better docs

https://docs.rs/arrow/latest/arrow/compute/kernels/nullif/fn.nullif.html

I'll try and find some time

@alamb
Copy link
Contributor Author

alamb commented Oct 31, 2024

PR to improve the docs: #6658

After doing that it is somewhat of the inverse of what I was looking for (it sets the element to null when the mask is true, rather than setting the element to null when the mask is not true).

I will attempt a PR with a proposed API as well for consideration

@alamb
Copy link
Contributor Author

alamb commented Oct 31, 2024

I made #6659 to add Array::with_nulls but it has the issues @tustvold describes

The actual usecase I have is not to turn elements unnull, but instead make them null if a boolean array is not true (the inverse of what nullif does).

Maybe a better API would be something like nullifnot or similar.

I will ponder

@tustvold
Copy link
Contributor

You could either negate the boolean array, or construct a BooleanArray with a null buffer of the buffer you want to be null if false, and a values buffer of entirely false. Neither is exactly ideal, but the performance, especially of the latter, is likely going to be hard to beat.

@alamb
Copy link
Contributor Author

alamb commented Oct 31, 2024

🤔

this is now we do it today in DataFusion: https://github.com/apache/datafusion/blob/f23360f1e0f90abcf92a067de55a9053da545ed2/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator/nulls.rs#L53-L102

(basically call NullBuffer::Union) maybe that is good enough 🤔

@tustvold
Copy link
Contributor

tustvold commented Feb 12, 2025

This request has come up again on #7119 (comment)

I think adding a function along the lines of (possibly with Result and some error handling)

pub fn mask(a: &dyn Array, mask: &BooleanBuffer) -> ArrayRef {
    let nulls = a.nulls().clone();
    let builder = a.to_data().into_builder();
    
    let builder = match nulls {
        None => builder.nulls(Some(mask.clone().into())),
        Some(nulls) => builder.nulls((mask & nulls.inner()).into())
    };
    
    make_array(unsafe {builder.build_unchecked()})
}

Would be a valuable addition.

This could also be added as a mask method to Array, which might have marginal performance advantages

@alamb
Copy link
Contributor Author

alamb commented Feb 12, 2025

This could also be added as a mask method to Array, which might have marginal performance advantages

I think the bigger advantage would be it would be more discoverable

@scovich
Copy link
Contributor

scovich commented Feb 13, 2025

FWIW, this is the variant I ended up using, since I needed the null mask unions to propagate recursively through the whole struct:

/// Splits a StructArray into its parts, unions in the parent null mask, and uses the result to
/// recursively update the children as well before putting everything back together.
fn compute_nested_null_masks(sa: StructArray, parent_nulls: Option<&NullBuffer>) -> StructArray {
    let (fields, columns, nulls) = sa.into_parts();
    let nulls = NullBuffer::union(parent_nulls, nulls.as_ref());
    let columns = columns
        .into_iter()
        .map(|column| match column.as_struct_opt() {
            Some(sa) => Arc::new(compute_nested_null_masks(sa.clone(), nulls.as_ref())) as _,
            None => {
                let data = column.to_data();
                let nulls = NullBuffer::union(nulls.as_ref(), data.nulls());
                let builder = data.into_builder().nulls(nulls);
                // Use an unchecked build to avoid paying a redundant O(k) validation cost for a
                // `RecordBatch` with k leaf columns.
                //
                // SAFETY: The builder was constructed from an `ArrayData` we extracted from the
                // column. The change we make is the null buffer, via `NullBuffer::union` with input
                // null buffers that were _also_ extracted from the column and its parent. A union
                // can only _grow_ the set of NULL rows, so data validity is preserved. Even if the
                // `parent_nulls` somehow had a length mismatch --- which it never should, having
                // also been extracted from our grandparent --- the mismatch would have already
                // caused `NullBuffer::union` to panic.
                let data = unsafe { builder.build_unchecked() };
                make_array(data)
            }
        })
        .collect();

    // Use an unchecked constructor to avoid paying O(n*k) a redundant null buffer validation cost
    // for a `RecordBatch` with n rows and k leaf columns.
    //
    // SAFETY: We are simply reassembling the input `StructArray` we previously broke apart, with
    // updated null buffers. See above for details about null buffer safety.
    unsafe { StructArray::new_unchecked(fields, columns, nulls) }

NOTE: A side advantage of the above approach is that the builder has fully recursive checking while the struct array only validates that its (presumably already correct) children fit together nicely. So if the above were changed to use fallible ArrayDataBuilder::build and StructArray::new, the overhead would be relatively low because the builder is only used at leaf level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants