Improve spill performance: Disable re-validation of spilled files #15320
Labels
enhancement
New feature or request
Comments
This was referenced Mar 19, 2025
take |
After disabling the validation, here is the TPCH benchmark on my M1 Pro:
|
Since it's a prerequisite task for #15321, I made a pr here :) |
Have you configured the TPCH benchmark to spill (as in limit the memory used)? If not I wouldn't expect any performance improvement 🤔 |
Is your feature request related to a problem or challenge?
Today, when DataFusion spills files to disk, it uses the Arrow IPC format.
Here is the code:
datafusion/datafusion/physical-plan/src/spill.rs
Lines 60 to 88 in 988a535
The IPC reader currently re-validates that all the data written is valid Arrow data (for example, that the strings are valid UTF-8).
The 54.3.0 (Mar 2025) release (arrow-rs#7107) has the ability to disable this validation. Disabling the validation resulted in a 3x performance increase in the arrow benchmarks.
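To illustrate why re-validation costs time on data this process wrote itself, here is a minimal sketch using only the Rust standard library. It is an analogy, not arrow-rs or DataFusion code: `read_validated`, `read_trusted`, and `spill_round_trip` are hypothetical names, and an in-memory `Vec<u8>` stands in for the spill file.

```rust
/// Checked path: scans every byte to prove the buffer is valid UTF-8.
/// This is the kind of work a validating reader repeats on each read.
pub fn read_validated(bytes: &[u8]) -> Result<&str, std::str::Utf8Error> {
    std::str::from_utf8(bytes)
}

/// Unchecked path: reinterprets the bytes as a string without scanning them.
///
/// # Safety
/// The caller must guarantee that `bytes` is valid UTF-8, for example
/// because this process wrote them from a `&str` and nothing has modified
/// them since. Passing invalid bytes is undefined behavior.
pub unsafe fn read_trusted(bytes: &[u8]) -> &str {
    // SAFETY: upheld by the caller, see the function-level contract.
    unsafe { std::str::from_utf8_unchecked(bytes) }
}

/// Hypothetical stand-in for a spill round trip: "write" a string to a
/// byte buffer, then read it back without re-validating it.
pub fn spill_round_trip(value: &str) -> String {
    // The buffer is copied from a `&str`, so it is valid UTF-8 by construction.
    let spilled: Vec<u8> = value.as_bytes().to_vec();
    // SAFETY: `spilled` was produced from a `&str` just above and is unmodified.
    let restored = unsafe { read_trusted(&spilled) };
    restored.to_owned()
}
```

The same trade-off applies to spill files: skipping the check is only sound when the reader can trust that the bytes are exactly what the writer produced.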
Here are the relevant arrow-rs PRs / issues:
with_skip_validation flag to IPC StreamReader, FileReader and FileDecoder: arrow-rs#7120
Describe the solution you'd like
I would like to disable validation when reading the spill files back in.
Describe alternatives you've considered
Additional context
with_skip_validation flag to IPC StreamReader, FileReader and FileDecoder: arrow-rs#7120