[EPIC] A collection of tickets for improving sorting larger than memory datasets / spilling sorts #15271
I also think that by collecting the related items we may be able to find some more review capacity, as I think this is an important capability for Spark / Comet (FYI @comphead / @andygrove / @kazuyukitanimura). |
I filed some more tickets about ways to improve performance with external spill files |
I added #15323 |
@alamb Thank you for summarizing, I'm also interested in this topic and may have more time to join the game in May, but I will keep an eye on the progress. |
This should make sure that queries don't just use an unlimited amount of memory and eventually also spill to disk. At least for now it seems that spill to disk for sorting isn't quite working yet, see apache/datafusion#15028 apache/datafusion#15271 So the behaviour with this patch is that the query just aborts if you have a resource limit set in the pipeline config. Signed-off-by: Gerd Zellweger <[email protected]>
While debugging a performance issue, I saw that when merging we also read at least 1 batch from every spill file: datafusion/datafusion/physical-plan/src/aggregates/row_hash.rs Lines 1059 to 1062 in 7317198
So if I have a lot of spill files, or if every batch is really huge (for example, it contains very large lists, like the result of array_agg on a large dataset), all of these batches are held in memory at once. |
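To make the concern above concrete, here is a rough back-of-the-envelope sketch. It is not DataFusion code: `merge_memory_floor` and `max_fan_in` are hypothetical helper names, and the only assumption carried over from the comment is that a merge keeps one batch per spill file resident at the same time.

```rust
/// Lower bound on memory held by a merge that keeps one batch per spill
/// file resident. `next_batch_bytes[i]` is the size in bytes of the batch
/// currently loaded from spill file `i` (hypothetical input).
fn merge_memory_floor(next_batch_bytes: &[u64]) -> u64 {
    next_batch_bytes.iter().sum()
}

/// Largest number of spill files that can be merged in one pass under
/// `budget` bytes, assuming no batch exceeds `max_batch_bytes`.
fn max_fan_in(budget: u64, max_batch_bytes: u64) -> u64 {
    assert!(max_batch_bytes > 0, "batch size must be non-zero");
    budget / max_batch_bytes
}
```

For example, 200 spill files whose batches are 64 MiB each need about 12.5 GiB just to start merging, while a 1 GiB budget would allow a fan-in of only 16 files per pass under the same assumption.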
I think this is the same underlying cause as |
There is a lot of work to support plans like
SELECT * .. ORDER BY x,y,z
when that data doesn't all fit in RAM.
I believe the core use case is "compaction", where multiple small files are resorted/merged/written as larger files.
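For readers new to the topic, the general shape of a spilling sort can be sketched as follows. This is a minimal illustration of a textbook external merge sort, not DataFusion's implementation: the function names are hypothetical, the row `budget` stands in for a memory limit, and the sorted runs live in `Vec`s where a real engine would write spill files to disk.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Phase 1: cut the input into chunks of at most `budget` rows and sort
/// each chunk on its own. Each sorted run stands in for one spill file.
fn make_sorted_runs(input: &[i64], budget: usize) -> Vec<Vec<i64>> {
    assert!(budget > 0, "budget must be at least one row");
    input
        .chunks(budget)
        .map(|chunk| {
            let mut run = chunk.to_vec();
            run.sort_unstable();
            run
        })
        .collect()
}

/// Phase 2: k-way merge. The min-heap holds only the current head of
/// every run, tagged with the run index and the position inside that run.
fn merge_runs(runs: &[Vec<i64>]) -> Vec<i64> {
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }
    let mut out = Vec::with_capacity(runs.iter().map(Vec::len).sum());
    while let Some(Reverse((value, run_idx, pos))) = heap.pop() {
        out.push(value);
        // Refill the heap from the run that just produced the smallest value.
        if let Some(&next) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((next, run_idx, pos + 1)));
        }
    }
    out
}

/// Sort `input` while never sorting more than `budget` rows at once.
fn external_sort(input: &[i64], budget: usize) -> Vec<i64> {
    merge_runs(&make_sorted_runs(input, budget))
}
```

The compaction case maps onto the second phase: several small, already sorted files are the runs, and the merge writes them out as one larger sorted file.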
max_temp_directory_size to limit max disk usage for spilling queries #14975
--memory-limit for all benchmarking tools #14641
mmap the spill files #15321
TopK queries #15538
FYI @2010YOUY01 and @Kontinuation: I was trying to find some other organizing ticket for improvements in this area and I could not, so I filed this one.
Related:
DiskManagerBuilder to construct DiskManagers #15319