Skip to content

support merging primitive dictionaries in interleave and concat #7468

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

davidhewitt
Copy link

@davidhewitt davidhewitt commented May 5, 2025

Which issue does this PR close?

Rationale for this change

Fixes a panic :)

What changes are included in this PR?

Introduces a strategy for merging primitive dictionaries to deduplicate values when naive concatentation would overflow the key type.

Tests for both interleave and concat are included.

Are there any user-facing changes?

Just the bugfix.

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 5, 2025
merge_concat_byte_dictionaries(&dictionaries, output_len)
},
other => Err(ArrowError::NotYetImplemented(format!(
"interleave does not yet support merging dictionaries with value type {other:?}"
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"interleave does not yet support merging dictionaries with value type {other:?}"
"concat does not yet support merging dictionaries with value type {other:?}"

LargeUtf8 => bytes_ptr_eq::<LargeUtf8Type>,
Binary => bytes_ptr_eq::<BinaryType>,
LargeBinary => bytes_ptr_eq::<LargeBinaryType>,
_ => |_, _| false,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing this line allows should_merge_dictionary_values to return true for the primitive data types, but it also introduces a regression for weird types like dictionaries-of-unions. This heuristic might ask for these to merge if the values exceed the expected output length, which will now fail. Before they might be able to concat / interleave via the fallback methods even if merging is not supported (however they could easily hit the panic case).

I think probably the solution is to adjust this heuristic so it outputs why it needs to merge (e.g. must merge to attempt to avoid overflow) - and in callers, for overflow we would fail gracefully for types which can't yet be merged.

@davidhewitt
Copy link
Author

I'll rework to fixup, please don't bother reviewing for the moment.

@davidhewitt davidhewitt marked this pull request as draft May 5, 2025 15:45
@davidhewitt davidhewitt marked this pull request as ready for review May 5, 2025 16:56
@davidhewitt
Copy link
Author

Ok, this should be ready for review.

@tustvold
Copy link
Contributor

tustvold commented May 5, 2025

If I am not mistaken this will generate non-trivial code for all possible values of dictionary key types and primitive key type, I don't think we can really do this. Perhaps you could look at how the arithmetic kernels, comparison kernels of other selection kernels avoid this.

For reference on my computer before this change a clean build of arrow-select and all dependents takes

$ time cargo build --release
Executed in    7.86 secs    fish           external
   usr time   40.93 secs    0.00 micros   40.93 secs
   sys time    1.55 secs  935.00 micros    1.54 secs

-- Second Run --
________________________________________________________
Executed in    7.80 secs    fish           external
   usr time   40.79 secs    1.10 millis   40.79 secs
   sys time    1.51 secs    0.03 millis    1.51 secs

Whereas this PR

________________________________________________________
Executed in   11.58 secs    fish           external
   usr time   57.66 secs    0.00 micros   57.66 secs
   sys time    1.72 secs  923.00 micros    1.71 secs

-- Second Run --
________________________________________________________
Executed in   15.21 secs    fish           external
   usr time   71.19 secs    1.07 millis   71.19 secs
   sys time    2.41 secs    0.00 millis    2.41 secs

Given how niche primitive dictionaries are, and how fundamental arrow-select is, we need to be very careful here.

@davidhewitt
Copy link
Author

Great point, I have a few other tasks to handle in the coming days so I will seek to return to this soon, though it might be the end of the week or longer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

interleave panics with dictionary key overflows if combined values arrays are longer than key size
2 participants