-
Notifications
You must be signed in to change notification settings - Fork 916
support merging primitive dictionaries in interleave and concat #7468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
arrow-select/src/concat.rs
Outdated
merge_concat_byte_dictionaries(&dictionaries, output_len) | ||
}, | ||
other => Err(ArrowError::NotYetImplemented(format!( | ||
"interleave does not yet support merging dictionaries with value type {other:?}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"interleave does not yet support merging dictionaries with value type {other:?}" | |
"concat does not yet support merging dictionaries with value type {other:?}" |
LargeUtf8 => bytes_ptr_eq::<LargeUtf8Type>, | ||
Binary => bytes_ptr_eq::<BinaryType>, | ||
LargeBinary => bytes_ptr_eq::<LargeBinaryType>, | ||
_ => |_, _| false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing this line allows should_merge_dictionary_values
to return true
for the primitive data types, but it also introduces a regression for weird types like dictionaries-of-unions. This heuristic might ask for these to merge if the values exceed the expected output length, which will now fail. Before they might be able to concat / interleave via the fallback methods even if merging is not supported (however they could easily hit the panic case).
I think probably the solution is to adjust this heuristic so it outputs why it needs to merge (e.g. must merge to attempt to avoid overflow) - and in callers, for overflow we would fail gracefully for types which can't yet be merged.
I'll rework to fixup, please don't bother reviewing for the moment. |
Ok, this should be ready for review. |
If I am not mistaken this will generate non-trivial code for all possible values of dictionary key types and primitive key type, I don't think we can really do this. Perhaps you could look at how the arithmetic kernels, comparison kernels of other selection kernels avoid this. For reference on my computer before this change a clean build of arrow-select and all dependents takes
Whereas this PR
Given how niche primitive dictionaries are, and how fundamental arrow-select is, we need to be very careful here. |
Great point, I have a few other tasks to handle in the coming days so I will seek to return to this soon, though it might be the end of the week or longer. |
Which issue does this PR close?
interleave
panics with dictionary key overflows if combined values arrays are longer than key size #7466Rationale for this change
Fixes a panic :)
What changes are included in this PR?
Introduces a strategy for merging primitive dictionaries to deduplicate values when naive concatentation would overflow the key type.
Tests for both
interleave
andconcat
are included.Are there any user-facing changes?
Just the bugfix.