-
Notifications
You must be signed in to change notification settings - Fork 1.5k
ExternalSorter Fails to Spill Dictionaries #4658
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@tustvold we're hitting this issue in present-day at Pydantic. It's possibly been made more relevant by the recent fixes to spill heuristics. Is your preferred solution still to use row format? I assume by row format you mean We would benefit from this fairly urgently so I'm willing to put in the work to fix this. |
It looks like an alternative to using the row format in datafusion might be to support delta-encoded dictionaries in |
I suspect this is what I was going for, although it was 2 years ago so can't confess to really remembering and a lot has likely changed since then.
This is likely an approach, although it would involve re-encoding the dictionaries as there is no mechanism to remap an existing key.
IMO I'd be tempted to start a discussion on the mailing list about why this constraint exists, it seems somewhat nonsensical to me given that the footer identifies where the dictionary blocks are, and therefore it is trivial to determine which dictionary to use. If anything the support for delta dictionaries is more surprising... |
Actually the thought occurs to me that for the spilling use-case the random access provided by IPC files isn't useful, and so an even simpler option would be to just switch to using the IPC Stream format instead |
Ok great, I'll work on that today 👍 |
Describe the bug
ExternalSorter currently spills data using the Arrow IPC format, unfortunately the IPC file format does not support replacing a dictionary with the same ID. Consequently if
ExternalSorter
spills two batches with dictionary encoded columns, it will error withTo Reproduce
Expected behavior
Various options present themselves:
Of these I think the last is the most compelling, I plan to work on this in the coming month
Additional context
It is possible Ballista also runs into this issue
The text was updated successfully, but these errors were encountered: