[EPIC] A collection of tickets for improving sorting larger than memory datasets / spilling sorts #15271


Open
8 of 18 tasks
Tracked by #15771 ...
alamb opened this issue Mar 17, 2025 · 6 comments
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Mar 17, 2025

There is a lot of work needed to support plans like SELECT * .. ORDER BY x, y, z when the data doesn't all fit in RAM.

I believe the core use case is "compaction", where multiple small files are re-sorted, merged, and written out as larger files.
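For orientation, here is a minimal sketch (not from the original thread) of how such a spilling sort is typically exercised in DataFusion: cap the runtime's memory pool so a large ORDER BY cannot buffer everything in RAM and is forced to spill sorted runs to temporary disk files. The builder and method names (RuntimeEnvBuilder, with_memory_limit, build_arc, new_with_config_rt) match recent DataFusion releases but may differ across versions; the table name and paths are hypothetical.

use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::error::Result;
use datafusion::execution::runtime_env::RuntimeEnvBuilder;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // Cap execution memory at 100 MB (fraction 1.0 = the whole pool),
    // so a large sort must spill instead of buffering all input in RAM.
    let runtime = RuntimeEnvBuilder::new()
        .with_memory_limit(100 * 1024 * 1024, 1.0)
        .build_arc()?;
    let ctx = SessionContext::new_with_config_rt(SessionConfig::new(), runtime);

    // Hypothetical input: many small sorted files being compacted.
    ctx.register_parquet("t", "data/small_files/", ParquetReadOptions::default())
        .await?;

    // With the memory limit in place, the sort spills intermediate sorted
    // runs to temp files and merges them back while writing the output.
    let df = ctx.sql("SELECT * FROM t ORDER BY x, y, z").await?;
    df.write_parquet("data/compacted/", DataFrameWriteOptions::new(), None)
        .await?;
    Ok(())
}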

FYI @2010YOUY01 and @Kontinuation: I was trying to find some other organizing ticket for improvements in this area and could not, so I filed this one.

Related:

@alamb alamb added the enhancement New feature or request label Mar 17, 2025
@alamb alamb changed the title [EPIC] A collection of tickets for improving sorting larger-than-ram datasets [EPIC] A collection of tickets for improving sorting larger-than-ram datasets / spilling sorts Mar 17, 2025
@alamb
Contributor Author

alamb commented Mar 17, 2025

I also think that by collecting the related items we may be able to find some more review capacity (I think this is an important capability for Spark / Comet; FYI @comphead / @andygrove / @kazuyukitanimura).

@alamb alamb changed the title [EPIC] A collection of tickets for improving sorting larger-than-ram datasets / spilling sorts [EPIC] A collection of tickets for improving sorting larger than memory datasets / spilling sorts Mar 17, 2025
@alamb
Contributor Author

alamb commented Mar 19, 2025

I filed some more tickets about ways to improve performance with external spill files

@andygrove
Member

I added #15323

@xudong963
Member

@alamb Thank you for summarizing. I'm also interested in this topic and may have more time to join the game in May; until then I will keep an eye on the progress.

gz added a commit to feldera/feldera that referenced this issue Mar 21, 2025
This should make sure that queries don't just use an
unlimited amount of memory and eventually also spill
to disk. At least for now it seems that spill to disk
for sorting isn't quite working yet, see

apache/datafusion#15028
apache/datafusion#15271

So the behaviour with this patch is that the query
just aborts if you have a resource limit set in the
pipeline config.

Signed-off-by: Gerd Zellweger <[email protected]>
gz added a commit to feldera/feldera that referenced this issue Mar 21, 2025 (same commit message as above)
github-merge-queue bot pushed a commit to feldera/feldera that referenced this issue Mar 21, 2025 (same commit message as above)
@rluvaton
Contributor

rluvaton commented Apr 2, 2025

While debugging a performance issue in AggregateExec, I saw that we keep all spill files open (RefCountedTempFile keeps a tempfile, which holds an open File).

Also, when merging, we read at least one batch from every spill file:

// A stream is opened over every spill file up front; during the merge
// each stream buffers at least one decoded batch in memory.
for spill in self.spill_state.spills.drain(..) {
    let stream = self.spill_state.spill_manager.read_spill_as_stream(spill)?;
    streams.push(stream);
}

So if I have a lot of spill files, or if every batch is really huge (contains very large lists, like the result of array_agg on a large dataset), we hold all of this in memory.
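To make the scaling concrete: with N spill files and batches of up to B bytes, the merge holds roughly N × B bytes at once; for example, 1,000 spill files with 8 MB batches is about 8 GB just to start merging. Below is a minimal sketch of one mitigation, deferring the file open until the merge first polls each stream, so idle inputs hold neither a file handle nor a batch. All names here (LazySpillStream included) are hypothetical illustrations, not DataFusion's actual API.

use std::fs::File;
use std::io::{self, BufReader};
use std::path::PathBuf;

// Hypothetical wrapper: stores only the spill file's path until the
// merge actually needs data from it.
struct LazySpillStream {
    path: PathBuf,
    reader: Option<BufReader<File>>,
}

impl LazySpillStream {
    fn new(path: PathBuf) -> Self {
        Self { path, reader: None }
    }

    // Open the underlying file on first use instead of at construction time.
    fn reader(&mut self) -> io::Result<&mut BufReader<File>> {
        if self.reader.is_none() {
            self.reader = Some(BufReader::new(File::open(&self.path)?));
        }
        Ok(self.reader.as_mut().unwrap())
    }
}

fn main() -> io::Result<()> {
    // Usage sketch: creating the streams is cheap; no file is opened
    // until a merge calls `reader()` to fetch data from that input.
    let mut streams: Vec<LazySpillStream> = (0..1000)
        .map(|i| LazySpillStream::new(PathBuf::from(format!("/tmp/spill_{i}"))))
        .collect();
    let _ = streams.first_mut();
    Ok(())
}

Lazy opening alone does not change the N × B merge footprint, though; bounding that needs a multi-pass (cascaded) merge that caps how many spill files are read simultaneously.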

@alamb
Contributor Author

alamb commented Apr 3, 2025

So if I have a lot of spill files, or if every batch is really huge (contains very large lists, like the result of array_agg on a large dataset), we hold all of this in memory.

I think this is the same underlying cause as
