Improve performance of series.ext_array.replace_with_mask() #52

Open
hombit opened this issue May 3, 2024 · 1 comment
Labels
bug Something isn't working

Comments

hombit commented May 3, 2024

Currently, Arrow does not support pyarrow.compute.replace_with_mask for struct arrays:
apache/arrow#29558

That's why we have our own implementation, used by NestedExtensionArray.__setitem__(). It has the overhead of creating a len(self)-sized struct array to perform the replacement. This works well when we replace many elements, but when we replace just a few, it produces a large memory footprint and will probably be slow.

An alternative approach would be to copy the original array into an np.ndarray[pa.StructScalar], replace the elements in place, and convert it back:

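import numpy as np
import pyarrow as pa

def replace_with_mask(array: pa.ChunkedArray, mask: pa.BooleanArray, value: pa.Array) -> pa.ChunkedArray:
    """Replace the elements of the array with the value where the mask is True"""
    # Copy every element into a NumPy object array of pa.StructScalar
    np_array = np.fromiter(array, dtype=object)
    # Overwrite only the masked positions, then convert back to Arrow
    np_array[mask] = value
    new_pa_array = pa.array(np_array)
    return pa.chunked_array([new_pa_array])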

We should create a benchmark to see which approach is faster and which has the smaller memory footprint.

hombit added the enhancement and investigation labels on May 3, 2024
hombit commented May 7, 2024

Benchmarks reveal a problem with single-element assignment performance. The run time rose after we switched from ArrowExtensionArray to our custom NestedExtensionArray implementation:

https://lincc-frameworks.github.io/nested-pandas/#benchmarks.AssignSingleDfToNestedSeries.time_run

hombit added the bug label and removed the enhancement and investigation labels on May 7, 2024