fix: Redundant files spilled during external sort + introduce SpillManager #15355


Merged

merged 2 commits into apache:main on Mar 24, 2025

Conversation

2010YOUY01
Contributor

@2010YOUY01 2010YOUY01 commented Mar 22, 2025

Which issue does this PR close?

Rationale for this change

What's the inefficiency

Let's walk through an example: there is one external sort query with 1 partition, and the sort exec has:

During the execution:

  1. SortExec will read in batches until the 10MB memory limit is reached.
  2. It will sort all buffered batches in place, and merge them at once. Note that only the 1MB buffer out of the 10MB total limit is pre-reserved for merging, so there is only 1MB available to store the merged output.
  3. After we have collected 1MB of merged batches, one spill will be triggered. This 1MB space will then be cleared, and the merging can continue.
    Inefficiency: ExternalSorter will now create a new spill file for each 1MB of merged batches; after spilling all intermediates, all spilled files will be merged at once, so there are too many files to merge.
    Ideal case: All batches in a single sorted run can be incrementally appended to a single file.
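As a standalone illustration of the difference (hypothetical names, using plain byte chunks and `std::fs` in place of DataFusion's actual batch and spill types):

```rust
use std::fs::{File, OpenOptions};
use std::io::Write;
use std::path::{Path, PathBuf};

// Old behavior: every time the merge buffer fills, open a brand-new spill file.
fn spill_per_flush(dir: &Path, flushes: &[&[u8]]) -> std::io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for (i, chunk) in flushes.iter().enumerate() {
        let path = dir.join(format!("spill_{i}.bin"));
        File::create(&path)?.write_all(chunk)?;
        files.push(path);
    }
    Ok(files)
}

// Fixed behavior: the whole sorted run is appended to one in-progress file.
fn spill_append_to_run(dir: &Path, flushes: &[&[u8]]) -> std::io::Result<PathBuf> {
    let path = dir.join("run_0.bin");
    let mut file = OpenOptions::new().create(true).append(true).open(&path)?;
    for chunk in flushes {
        file.write_all(chunk)?;
    }
    Ok(path)
}
```

With the 10MB limit and 1MB merge buffer above, the first pattern produces roughly one file per 1MB flush of a single sorted run, while the second produces exactly one file per run.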

Reproducer

Execute datafusion-cli with `cargo run --profile release-nonlto -- --mem-pool-type fair -m 10M`, then run:

```sql
set datafusion.execution.sort_spill_reservation_bytes = 1000000;
set datafusion.execution.target_partitions = 4;

explain analyze select * from generate_series(1, 1000000) as t1(v1) order by v1;
```

Main: 10 spills
PR: 2 spills

Rationale for the fix

Introduced a new spill interface, SpillManager, with the ability to incrementally append batches to an already-written file.

  1. SpillManager is designed to do RecordBatch <---> raw file conversion, and configurations can be put inside SpillManager to control how the serialization is physically done, for future optimizations.
    Example configurations:
  2. SpillManager is not responsible for holding the spilled files itself, because the logical representation of those files can vary; I think it's clearer to place those raw files inside the spilling operators.
    For example, a `vec<RefCountedTempFile>` is managed inside SortExec, with the implicit rule that within each file all entries are sorted by the sort keys; also, in Comet's ShuffleWriterExec, each partition should maintain one in-progress file. If we kept those temp files inside SpillManager, it would be hard to clearly define those implicit requirements.
  3. Additionally, SpillManager is responsible for updating the related statistics. The spill-related metrics should be the same across operators, so this part of the code can also be reused. A total disk usage limit for spilled files can also be easily implemented on top of it.
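To make the division of labor concrete, here is a minimal, self-contained sketch of the idea (the names mirror the PR, but `Batch` is a stand-in for Arrow's RecordBatch and the real DataFusion API differs): the manager only turns batches into raw files and tracks spill metrics; it does not own the finished files.

```rust
use std::io::Write;
use std::path::PathBuf;

// Stand-in for Arrow's RecordBatch.
struct Batch(Vec<u8>);

// Spill-related metrics, grouped in one struct so they stay
// uniform across all spilling operators.
#[derive(Default, Debug)]
struct SpillMetrics {
    spill_file_count: usize,
    spilled_bytes: usize,
}

struct SpillManager {
    dir: PathBuf,
    metrics: SpillMetrics,
    next_id: usize,
}

// A file that is still being written; the caller appends batches
// to it and then hands it back to the manager to finish.
struct InProgressSpillFile {
    path: PathBuf,
    file: std::fs::File,
    bytes: usize,
}

impl SpillManager {
    fn new(dir: PathBuf) -> Self {
        Self { dir, metrics: SpillMetrics::default(), next_id: 0 }
    }

    // Open a new in-progress file that batches can be appended to.
    fn create_in_progress_file(&mut self) -> std::io::Result<InProgressSpillFile> {
        let path = self.dir.join(format!("spill_{}.bin", self.next_id));
        self.next_id += 1;
        Ok(InProgressSpillFile {
            file: std::fs::File::create(&path)?,
            path,
            bytes: 0,
        })
    }

    // Incrementally append one batch to an in-progress file.
    fn append_batch(&mut self, file: &mut InProgressSpillFile, batch: &Batch) -> std::io::Result<()> {
        file.file.write_all(&batch.0)?;
        file.bytes += batch.0.len();
        Ok(())
    }

    // Finish the file, update metrics, and return the path; the *caller*
    // decides what the file means (e.g. SortExec keeps a Vec of sorted runs).
    fn finish_file(&mut self, file: InProgressSpillFile) -> PathBuf {
        self.metrics.spill_file_count += 1;
        self.metrics.spilled_bytes += file.bytes;
        file.path
    }
}
```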

Why refactor and introduce SpillManager

This fix can be implemented without a major refactor. However, this change is included to prepare for supporting disk limits for spilling queries, as described in #14975

What changes are included in this PR?

  1. Group spilling related metrics into one struct
  2. Introduce SpillManager
  3. Update SortExec to use the new SpillManager interface

TODO:

- [ ] There are two extra operators that can be changed to this new interface (Aggregate and SortMergeJoin); they were planned to be included in this PR after getting some review feedback.
  It will be done as a follow-on to minimize the current patch.

Are these changes tested?

For the too-many-spills issue: one test case is updated, and more comments are added above the assertion to prevent regression.
For SpillManager: unit tests are included.

Are there any user-facing changes?

No.

@alamb
Contributor

alamb commented Mar 22, 2025

Contributor

@alamb alamb left a comment

Thanks @2010YOUY01 -- this is looking good

I left a question about the change to external sorting

I really like the idea of the SpillManager -- maybe we could start this project with a single PR to add the SpillManager and pull the common code out (in datafusion/physical-plan/src/spill/manager.rs for example), and then do a follow-on ticket to add new features

@@ -65,23 +63,14 @@ struct ExternalSorterMetrics {
/// metrics
baseline: BaselineMetrics,

/// count of spills during the execution of the operator
Contributor

Nice

/// Arrow IPC format)
/// Within the same spill file, the data might be chunked into multiple batches,
/// and ordered by sort keys.
finished_spill_files: Vec<RefCountedTempFile>,
Contributor

It might make more sense to have the SpillManager own these files so there can't be different sets of references

Contributor Author

I think it will be hard to define the semantics of those temp files if we put them inside SpillManager, because different operators will interpret those files differently:

  • For SortExec, the `vec<RefCountedTempFile>` represents multiple sorted runs on the sort keys.
  • For ShuffleWriterExec in datafusion-comet, since Spark's shuffle operator is blocking (due to Spark's staged execution design), it might want to keep a `vec<InProgressSpillFile>` instead.
  • Similarly, if we want to spill Rows to accelerate SortExec, or implement a spilling hash join, the temp files will have very different logical meanings.

Overall, the SpillManager is designed only to do RecordBatch <-> raw file conversion, with different configurations and stat accounting. Operators have more flexibility to implement specific utilities for managing the raw files, which have diverse semantics.
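A rough sketch of that ownership split (hypothetical names, not the PR's actual code): the operator owns its finished files and their meaning, while the manager-style helper only produces raw files.

```rust
use std::path::PathBuf;

// Stand-in for the helper that only turns batches into raw files.
struct SpillHelper;

impl SpillHelper {
    // Writes bytes to a new raw file and hands the path back to the caller.
    fn write_raw_file(&self, path: PathBuf, bytes: &[u8]) -> std::io::Result<PathBuf> {
        std::fs::write(&path, bytes)?;
        Ok(path)
    }
}

// The operator owns the files and attaches its own semantics to them:
// here, the invariant is "each file is one run, sorted by the sort keys".
struct SortOperatorSketch {
    helper: SpillHelper,
    finished_sorted_runs: Vec<PathBuf>,
}

impl SortOperatorSketch {
    fn spill_sorted_run(&mut self, path: PathBuf, bytes: &[u8]) -> std::io::Result<()> {
        let file = self.helper.write_raw_file(path, bytes)?;
        self.finished_sorted_runs.push(file);
        Ok(())
    }
}
```

A shuffle-style operator could hold a different collection (e.g. one in-progress file per partition) while reusing the same helper, which is the flexibility argued for above.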

Do you see any potential issues or improvements?

Contributor

The different semantics for different operations make sense to me

I was thinking more mechanically, like just storing the `Vec` as a field on `SortManager` and allowing Sort and Hash, etc. to access / manipulate it as required. I think it is fine to consider this in a future PR as well

/// Returns the amount of memory freed.
async fn spill(&mut self) -> Result<usize> {
/// When calling, all `in_mem_batches` must be sorted, and then all of them will
/// be appended to the in-progress spill file.
Contributor

If they must all be sorted, then maybe you can put an assert/check that self.in_mem_batches_sorted is true
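For illustration, a hedged sketch of that suggestion (hypothetical field and method names): the spill routine checks its documented precondition instead of relying on the doc comment alone.

```rust
// Minimal sketch: spill_append refuses to run unless the sorted flag is set.
struct SorterSketch {
    in_mem_batches_sorted: bool,
}

impl SorterSketch {
    fn spill_append(&mut self) -> Result<(), String> {
        // Fail fast if the documented invariant is violated.
        if !self.in_mem_batches_sorted {
            return Err("in_mem_batches must be sorted before spilling".to_string());
        }
        // ... append the sorted batches to the in-progress spill file ...
        Ok(())
    }
}
```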

Contributor Author

Addressed in bf4ab62

internal_datafusion_err!("In-progress spill file should be initialized")
})?;

for batch in batches {
Contributor

I don't understand this logic -- I thought that each individual self.in_mem_batches was sorted, but they aren't sorted overall.

Thus if we write them back to back to the same spill file, the spill file itself won't be sorted.

Like if the two in-memory batches are

```
A B
1 10
2 10
2 10
```

```
A B
1 10
2 10
2 10
```

I think this code would produce a single spill file like

```
A B
1 10
2 10
2 10
1 10
2 10
2 10
```

Which is not sorted 🤔

On the other hand all the tests are passing so maybe I misunderstand what this is doing (or we have a testing gap)

Contributor Author

No, they are globally sorted. In different stages, in_mem_batches can represent either unordered input or a globally sorted run (but chunked into smaller batches).
I agree this approach has poor understandability and is error-prone; I'll try to improve it.

Contributor

Thanks -- maybe for this PR we could just add some comments

Contributor Author

Filed #15372

@@ -223,25 +229,182 @@ impl IPCStreamWriter {
}
}

/// The `SpillManager` is responsible for the following tasks:
Contributor

Love the spill manager 👍

@alamb
Contributor

alamb commented Mar 22, 2025

> There are two extra operators that can be changed to this new interface (Aggregate and SortMergeJoin); they're planned to be included in this PR. I plan to do it after getting some review feedback.

I request that we do this feature in multiple smaller PRs which will be easier to review / understand

BTW I think this PR may address some of this issue too:

Contributor

@alamb alamb left a comment

Thank you @2010YOUY01 -- I think this is a great step forward: the code is more nicely structured and I think the spilling works better.

I left several comments about potential future improvements, but most of them probably should be done as follow on PRs.

Perhaps @Kontinuation or @kazuyukitanimura has some time to review as well

}
}

pub(crate) struct InProgressSpillFile {
Contributor

I think it would help to add some high-level comments here about what an InProgressSpillFile is

Contributor Author

Addressed in bf4ab62


/// Finishes the in-progress spill file and moves it to the finished spill files.
async fn spill_finish(&mut self) -> Result<()> {
let mut in_progress_file =
Contributor

I am finding the various states of the ExternalSorter hard to track (specifically, what are the valid combinations of in_mem_batches, in_progress_spill_file, spill, and sorted_in_mem).

I wonder if we could move to some sort of state enum that would make this easier to understand.

Like

```rust
enum SortState {
    AllInMemory { /* ... */ },
    InProgressSpill { /* ... */ },
    AllOnDisk { /* ... */ },
    // ...
}
```
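A runnable version of that sketch (hypothetical payloads; the real ExternalSorter fields differ) could look like:

```rust
use std::path::PathBuf;

// Each variant bundles exactly the fields that are valid in that state,
// so invalid combinations become unrepresentable.
enum SortState {
    AllInMemory { in_mem_batches: Vec<Vec<i32>>, sorted: bool },
    InProgressSpill { in_progress_spill_file: PathBuf },
    AllOnDisk { finished_spill_files: Vec<PathBuf> },
}

// Matching on the enum forces each code path to say which state it handles.
fn describe(state: &SortState) -> &'static str {
    match state {
        SortState::AllInMemory { sorted: true, .. } => "in memory, sorted",
        SortState::AllInMemory { .. } => "in memory, unsorted",
        SortState::InProgressSpill { .. } => "spilling in progress",
        SortState::AllOnDisk { .. } => "fully spilled",
    }
}
```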

Contributor Author

@2010YOUY01 2010YOUY01 Mar 24, 2025

Filed #15372

/// Note: The caller (external operators such as `SortExec`) is responsible for interpreting the spilled files.
/// For example, all records within the same spill file are ordered according to a specific order.
#[derive(Debug, Clone)]
pub(crate) struct SpillManager {
Contributor

As a follow on PR, I suggest starting to break up this code into multiple modules (like spill/mod.rs, spill/spill_manager.rs, etc

Contributor Author

Filed #15373

@Kontinuation
Member

> 3. After we have collected 1MB of merged batches, one spill will be triggered. And this 1MB space will be cleared, so the merging can continue.
>    Inefficiency: Now ExternalSorter will create a new spill file for those 1MB merged batches; after spilling all intermediates, all spilled files will be merged at once, then there are too many files to merge.
>    Ideal case: All batches in a single sorted run can be incrementally appended to a single file.

It seems to be a regression introduced by #14823.

@2010YOUY01
Contributor Author

> 1. After we have collected 1MB of merged batches, one spill will be triggered. And this 1MB space will be cleared, so the merging can continue.
>    Inefficiency: Now ExternalSorter will create a new spill file for those 1MB merged batches; after spilling all intermediates, all spilled files will be merged at once, then there are too many files to merge.
>    Ideal case: All batches in a single sorted run can be incrementally appended to a single file.
>
> It seems to be a regression introduced by #14823.

That's true, so I feel obligated to fix it.


Thank you for the review @alamb and @Kontinuation , I have addressed the review comments.

Contributor

@comphead comphead left a comment

Thanks @2010YOUY01, love the tests. I think we need to move other ops like sort_merge_join or row hash to the SpillManager?

I'll create tickets if so:

#15401
#15400

@alamb
Contributor

alamb commented Mar 24, 2025

> 1. After we have collected 1MB of merged batches, one spill will be triggered. And this 1MB space will be cleared, so the merging can continue.
>    Inefficiency: Now ExternalSorter will create a new spill file for those 1MB merged batches; after spilling all intermediates, all spilled files will be merged at once, then there are too many files to merge.
>    Ideal case: All batches in a single sorted run can be incrementally appended to a single file.
>
> It seems to be a regression introduced by #14823.
>
> That's true, so I feel obligated to fix it.

@2010YOUY01 is this something that should be tracked with a follow-on ticket?

@alamb
Contributor

alamb commented Mar 24, 2025

It looks to me like there are 4 approvals of this PR and a bunch of potential work stacked up on it, so let's merge it to keep the code flowing

Thank you everyone for the reviews and work. It is so exciting to see this area of DataFusion get love

@alamb alamb merged commit 0c2aa0c into apache:main Mar 24, 2025
27 checks passed
@2010YOUY01
Contributor Author

> Thanks @2010YOUY01, love the tests. I think we need to move other ops like sort_merge_join or row hash to the SpillManager?
>
> I'll create tickets if so:
>
> #15401 #15400

Thank you, I already did it by filing #15374.
I'll get to those tasks soon.

@2010YOUY01
Contributor Author

2010YOUY01 commented Mar 25, 2025

> 1. After we have collected 1MB of merged batches, one spill will be triggered. And this 1MB space will be cleared, so the merging can continue.
>    Inefficiency: Now ExternalSorter will create a new spill file for those 1MB merged batches; after spilling all intermediates, all spilled files will be merged at once, then there are too many files to merge.
>    Ideal case: All batches in a single sorted run can be incrementally appended to a single file.
>
> It seems to be a regression introduced by #14823.
>
> That's true, so I feel obligated to fix it.
>
> @2010YOUY01 is this something that should be tracked with a follow-on ticket?

@alamb Ah no, I was referring to the fix in this PR.

qstommyshu pushed a commit to qstommyshu/datafusion that referenced this pull request Mar 27, 2025
nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025