ExternalSorter Fails to Spill Dictionaries #4658


Closed
tustvold opened this issue Dec 16, 2022 · 5 comments · Fixed by #14868
Labels
bug Something isn't working

Comments

@tustvold
Contributor

Describe the bug

ExternalSorter currently spills data using the Arrow IPC file format. Unfortunately, the IPC file format does not support replacing a dictionary with the same ID. Consequently, if ExternalSorter spills two batches with dictionary-encoded columns, it errors with:

Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches

To Reproduce

Expected behavior

Various options present themselves:

  • Number the dictionary IDs consistently across DataFusion (a monumental task)
  • Write batches to separate files
  • Spill to the row format instead of Arrow IPC

Of these, I think the last is the most compelling; I plan to work on it in the coming month.

Additional context

It is possible that Ballista also runs into this issue.

@tustvold tustvold added the bug Something isn't working label Dec 16, 2022
@tustvold tustvold self-assigned this Dec 16, 2022
@davidhewitt
Contributor

@tustvold we're currently hitting this issue at Pydantic. It has possibly been made more relevant by the recent fixes to the spill heuristics. Is your preferred solution still to use the row format?

I assume by row format you mean arrow-row; however, it's not clear to me whether there's a standard way to serialize these to a file. I could create something simple that just writes the stream of rows as binary (probably a length, then the bytes, repeated?).
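A minimal sketch of that length-prefixed framing (hypothetical format, stdlib Python only — opaque `bytes` values stand in for arrow-row rows):

```python
import io
import struct

def write_rows(sink, rows):
    """Write each row as a 4-byte little-endian length followed by its bytes."""
    for row in rows:
        sink.write(struct.pack("<I", len(row)))
        sink.write(row)

def read_rows(source):
    """Read length-prefixed rows back until EOF."""
    rows = []
    while (header := source.read(4)):
        (length,) = struct.unpack("<I", header)
        rows.append(source.read(length))
    return rows

# Round-trip a spill file held in memory.
spill = io.BytesIO()
write_rows(spill, [b"row-one", b"row-two-longer", b""])
spill.seek(0)
assert read_rows(spill) == [b"row-one", b"row-two-longer", b""]
```

This sketch ignores partial reads and corruption checks; a real spill file would likely also want a batch header (row count, schema fingerprint) so readers can stream batches back without guessing.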

We would benefit from this fairly urgently so I'm willing to put in the work to fix this.

@davidhewitt
Contributor

It looks like an alternative to using the row format in datafusion might be to support delta-encoded dictionaries in arrow-rs. apache/arrow-rs#6783

@tustvold
Contributor Author

tustvold commented Feb 24, 2025

I assume by row format you mean arrow-row; however, it's not clear to me whether there's a standard way to serialize these to a file. I could create something simple that just writes the stream of rows as binary (probably a length, then the bytes, repeated?).

I suspect this is what I was going for, although it was two years ago, so I can't honestly claim to remember, and a lot has likely changed since then.

It looks like an alternative to using the row format in datafusion might be to support delta-encoded dictionaries in arrow-rs. apache/arrow-rs#6783

This could work, although it would involve re-encoding the dictionaries, as there is no mechanism to remap an existing key.
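To illustrate the re-encoding that delta dictionaries would require (a plain-Python sketch with hypothetical helper names, not arrow-rs code): a delta may only append values, so a batch whose dictionary is not a suffix of the previous one must have its keys remapped into the combined dictionary.

```python
def to_delta(old, new_values, new_keys):
    """Re-encode a batch's dictionary as a delta over `old`: keys pointing at
    values already in `old` are remapped, and only genuinely new values are
    appended (the delta)."""
    index = {v: i for i, v in enumerate(old)}
    delta = []
    remapped = []
    for k in new_keys:
        v = new_values[k]
        if v not in index:
            index[v] = len(old) + len(delta)
            delta.append(v)
        remapped.append(index[v])
    return delta, remapped

# Batch 2 arrived with its own dictionary ["c", "a"] and keys [0, 1, 0],
# i.e. the values "c", "a", "c"; batch 1's dictionary was ["a", "b"].
delta, keys = to_delta(["a", "b"], ["c", "a"], [0, 1, 0])
assert delta == ["c"]     # only "c" is new relative to ["a", "b"]
assert keys == [2, 0, 2]  # keys remapped into the combined dictionary
```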

Is your preferred solution still to use row format?

IMO I'd be tempted to start a discussion on the mailing list about why this constraint exists; it seems somewhat nonsensical to me, given that the footer identifies where the dictionary blocks are, and it is therefore trivial to determine which dictionary to use. If anything, the support for delta dictionaries is more surprising...

@tustvold
Contributor Author

Actually, it occurs to me that for the spilling use case the random access provided by IPC files isn't useful, so an even simpler option would be to just switch to the IPC Stream format instead.

@davidhewitt
Contributor

Ok great, I'll work on that today 👍
