
Reduce number of tokio blocking threads in SortExec spill #15323


Closed
Tracked by #15271
andygrove opened this issue Mar 19, 2025 · 11 comments · Fixed by #15654
Labels
enhancement New feature or request

Comments

@andygrove
Member

Is your feature request related to a problem or challenge?

In Comet, we see some queries "hang" when running with minimal memory. The issue appears to be that we have hundreds of spill files, each spill requires its own tokio blocking thread, and Comet does not have enough threads available.

See apache/datafusion-comet#1523 (comment) for more detail.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@alamb
Contributor

alamb commented Mar 21, 2025

Do you see too many threads when writing the spill files or when reading?

@andygrove
Member Author

> Do you see too many threads when writing the spill files or when reading?

This is when reading, during the merge operation.

In the merge phase, each spill file is wrapped by a stream backed by a blocking thread (see read_spill_as_stream), so we'll spawn at least 183 blocking threads when there are 183 spill files to merge.
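The one-thread-per-spill-file pattern described above can be sketched with plain std threads and channels (a toy illustration only; `open_spill_stream` is a made-up stand-in for the behavior of `read_spill_as_stream`, with `Vec<Vec<i32>>` standing in for on-disk record batches):

```rust
use std::sync::mpsc;
use std::thread;

/// Sketch of the "push" pattern: every spill file gets its own dedicated
/// thread that reads batches and pushes them over a channel. With N spill
/// files this ties up N threads for the whole duration of the merge.
fn open_spill_stream(batches: Vec<Vec<i32>>) -> mpsc::Receiver<Vec<i32>> {
    // Bounded channel: the reader thread blocks in send() until the
    // consumer drains a batch, which is where each thread sits parked.
    let (tx, rx) = mpsc::sync_channel(1);
    thread::spawn(move || {
        for batch in batches {
            if tx.send(batch).is_err() {
                break; // consumer dropped the stream
            }
        }
    });
    rx
}

fn main() {
    // Three "spill files" -> three dedicated reader threads.
    let streams = vec![
        open_spill_stream(vec![vec![1], vec![4]]),
        open_spill_stream(vec![vec![2]]),
        open_spill_stream(vec![vec![3]]),
    ];
    let mut all: Vec<i32> = streams
        .into_iter()
        .flat_map(|rx| rx.into_iter().flatten())
        .collect();
    all.sort();
    assert_eq!(all, vec![1, 2, 3, 4]);
}
```

With a capped thread pool, 183 such parked reader threads can exhaust the pool before any consumer makes progress, which matches the hang described above.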

@alamb
Contributor

alamb commented Mar 21, 2025

Makes sense -- with 183 spill files, we probably would need to merge in stages

For example starting with 183 spill files

  1. run 10 jobs, each merging about 10 files into one (results in 10 files)
  2. run the final merge of 10 files

This results in 2x the IO (each row has to be read/written twice), but at least the merges of the earlier step could be parallelized

I think @2010YOUY01 was starting to look into a SpillFileManager -- this is the kind of behavior I would imagine being part of such a thing
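The staged-merge idea above can be sketched as a cascaded k-way merge, where each pass keeps at most `fan_in` runs open at once (a minimal sketch with in-memory `Vec`s standing in for spill files; `merge_runs` and `cascaded_merge` are made-up names, not DataFusion APIs):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge several sorted runs into one sorted run with a min-heap,
/// analogous to a k-way streaming merge of spill files.
fn merge_runs(runs: Vec<Vec<i32>>) -> Vec<i32> {
    // Heap entries: (next value, run index, position within run).
    let mut heap = BinaryHeap::new();
    for (i, run) in runs.iter().enumerate() {
        if let Some(&v) = run.first() {
            heap.push(Reverse((v, i, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((v, i, pos))) = heap.pop() {
        out.push(v);
        if let Some(&next) = runs[i].get(pos + 1) {
            heap.push(Reverse((next, i, pos + 1)));
        }
    }
    out
}

/// Merge in stages so that no single pass opens more than `fan_in`
/// runs at once, mirroring the "merge in stages" idea above.
fn cascaded_merge(mut runs: Vec<Vec<i32>>, fan_in: usize) -> Vec<i32> {
    assert!(fan_in >= 2);
    while runs.len() > 1 {
        runs = runs
            .chunks(fan_in)
            .map(|group| merge_runs(group.to_vec()))
            .collect();
    }
    runs.pop().unwrap_or_default()
}

fn main() {
    // Five sorted runs, merged with at most two open at a time.
    let runs = vec![vec![1, 4], vec![2, 5], vec![0, 3], vec![6], vec![-1, 7]];
    let merged = cascaded_merge(runs, 2);
    assert_eq!(merged, vec![-1, 0, 1, 2, 3, 4, 5, 6, 7]);
}
```

With 183 files and a fan-in of 10 this would take two passes, bounding the number of files (and hence blocking threads) open per merge at the cost of the extra read/write pass noted above.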

@rluvaton
Contributor

rluvaton commented Apr 3, 2025

I think I have the same problem, but in AggregateExec when using row_hash, as it spills as well and uses SortPreservingMergeStream.

I think the solution should actually be in SortPreservingMergeStream rather than SpillFileManager, no? Although it does not spawn blocking threads itself, it should support multiple levels of merging

@alamb
Contributor

alamb commented Apr 3, 2025

> I think I have the same problem, but in AggregateExec when using row_hash, as it spills as well and uses SortPreservingMergeStream.
>
> I think the solution should actually be in SortPreservingMergeStream rather than SpillFileManager, no? Although it does not spawn blocking threads itself, it should support multiple levels of merging

I am not sure / familiar enough with the code to know off the top of my head.

I do think having hash and sort use the same codepath (that we can then go optimize a lot) sounds like a great idea

@rluvaton
Contributor

rluvaton commented Apr 6, 2025

I have a working version locally and will create a PR soon.

There is a problem, though: tokio doesn't expose the maximum number of blocking threads, and if you call spawn_blocking while no threads are available, no error is returned.

This is important because, for example, Comet sets this to 10 by default, while tokio's default is 512 IIRC.

The working version can be improved with optimizations like prefetching and more, but it will be good enough for now and we can iterate further
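Since tokio's blocking-pool limit is opaque, one way around it is to cap concurrency yourself, independently of the pool size. Below is a minimal sketch using a counting semaphore built from `Mutex` + `Condvar` (std has no counting semaphore; `Semaphore` and `read_spills_capped` are made-up names, and the per-file `thread::spawn` is only a stand-in for blocking reads, not tokio's mechanism):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore: at most `n` holders at a time.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(n: usize) -> Self {
        Semaphore { permits: Mutex::new(n), cv: Condvar::new() }
    }
    fn acquire(&self) {
        let mut p = self.permits.lock().unwrap();
        while *p == 0 {
            p = self.cv.wait(p).unwrap(); // park until a permit is released
        }
        *p -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cv.notify_one();
    }
}

/// Read `num_files` "spill files", allowing at most `max_concurrent`
/// blocking reads in flight at once, and sum the bytes read.
fn read_spills_capped(num_files: usize, max_concurrent: usize) -> usize {
    let sem = Arc::new(Semaphore::new(max_concurrent));
    let handles: Vec<_> = (0..num_files)
        .map(|i| {
            let sem = Arc::clone(&sem);
            thread::spawn(move || {
                sem.acquire();
                // Stand-in for a blocking read of spill file `i`.
                let bytes_read = i + 1;
                sem.release();
                bytes_read
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // Twenty "spill files", at most four blocking reads in flight.
    let total = read_spills_capped(20, 4);
    assert_eq!(total, (1..=20).sum::<usize>());
}
```

In a real tokio setup the same capping role would be played by an async semaphore acquired before each spawn_blocking call, so the number of spill files no longer dictates the number of simultaneously blocked threads.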

@andygrove
Member Author

> I have a working version locally and will create a PR soon, just one problem, I don't think we can know the number of blocking threads tokio is configured with.
>
> this is important as for example Comet set this by default to 10, and tokio default is 512 IIRC.
>
> the working version can be improved with some optimization like prefetch and more, but it will be good enough for now and we can iterate further

Comet currently creates a new tokio runtime per plan but there is a proposal to move to a global tokio runtime (per executor) instead.

apache/datafusion-comet#1590

@rluvaton
Contributor

rluvaton commented Apr 6, 2025

> Comet currently creates a new tokio runtime per plan but there is a proposal to move to a global tokio runtime (per executor) instead.
>
> apache/datafusion-comet#1590

Even if you use a global tokio runtime and set the number of blocking threads to 1000, for example, there can be 1001 spill files; the problem is the same

@alamb
Contributor

alamb commented Apr 6, 2025

> Even if you use a global tokio runtime and set the number of blocking threads to 1000, for example, there can be 1001 spill files; the problem is the same

At some point the system is going to be IO bound, so having more blocking threads doing IO isn't going to help, and will likely consume non-trivial time context switching between them

I think a better solution is to more carefully manage how many files are being spilled/read at any time. This will be more complicated (we'll likely have to do multiple merge phases, etc.), but I think it is a better approach in the long run

@rluvaton
Contributor

rluvaton commented Apr 6, 2025

I created a draft PR with a solution and would appreciate your opinion:

ashdnazg added a commit to ashdnazg/datafusion that referenced this issue Apr 9, 2025
Fixes apache#15323.

The previous design of reading spill files was a `push` design, spawning
long-lived blocking tasks which repeatedly read records, send them, and
wait until they are received. This design had an issue where progress
wasn't guaranteed (i.e., there was a deadlock) if there were more spill
files than threads in tokio's blocking pool, all of which were waited
on together.

To solve this, the design is changed to a `pull` design, where blocking
tasks are spawned for every read, removing waiting on the IO threads and
guaranteeing progress.

While there might be an added overhead for repeatedly calling
`spawn_blocking`, it's probably insignificant compared to the IO cost of
reading from the disk.
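The push-vs-pull distinction in the commit message can be sketched as follows (a toy illustration, not DataFusion's actual spill-reading code; `SpillReader` and `next_batch` are made-up names standing in for a cursor over one spill file, with `Vec<Vec<i32>>` in place of on-disk record batches):

```rust
/// Sketch of the `pull` design: each batch is read on demand when the
/// consumer asks for it, so no long-lived thread is parked per file.
struct SpillReader {
    batches: Vec<Vec<i32>>, // stand-in for on-disk record batches
    pos: usize,             // read cursor into the file
}

impl SpillReader {
    fn new(batches: Vec<Vec<i32>>) -> Self {
        SpillReader { batches, pos: 0 }
    }

    /// One short blocking read, invoked per poll. In the pull design this
    /// is the unit of work handed to `spawn_blocking` on each call, rather
    /// than a long-lived task that pushes batches and waits on a channel.
    fn next_batch(&mut self) -> Option<Vec<i32>> {
        let batch = self.batches.get(self.pos).cloned();
        self.pos += 1;
        batch
    }
}

fn main() {
    let mut reader = SpillReader::new(vec![vec![1, 2], vec![3]]);
    let mut out = Vec::new();
    while let Some(batch) = reader.next_batch() {
        out.extend(batch);
    }
    assert_eq!(out, vec![1, 2, 3]);
}
```

Because a thread is only occupied for the duration of a single read, the number of spill files no longer bounds the number of threads required, which is how the pull design guarantees progress.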
@rluvaton
Contributor

rluvaton commented Apr 9, 2025

Removed my PR in favor of @ashdnazg's better PR:

ashdnazg added a commit to ashdnazg/datafusion that referenced this issue Apr 9, 2025
ashdnazg added a commit to ashdnazg/datafusion that referenced this issue Apr 9, 2025
ashdnazg added a commit to ashdnazg/datafusion that referenced this issue Apr 10, 2025
ashdnazg added a commit to ashdnazg/datafusion that referenced this issue Apr 11, 2025
@alamb alamb closed this as completed in b6a5174 Apr 12, 2025
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this issue May 2, 2025