POC: Optimize SortPreservingMergeExec to avoid merging non-overlapping partitions #4

xudong963 · 2025-03-27T09:43:52Z

This is a POC of an optimized SortPreservingMergeExec (SPM) that avoids merging non-overlapping partitions.

The PR combines three PRs:

The original PR: feat: Optimize SortPreservingMergeExec to avoid merging non-overlapping partitions apache/datafusion#13296, which optimizes SortPreservingMergeExec to avoid merging non-overlapping partitions.
The PR: Support computing statistics for FileGroup apache/datafusion#15432, which adds statistics for FileGroup, i.e. partition-level statistics.
The PR: feat: Add ProgressiveEval operator apache/datafusion#10490, which adds the ProgressiveEval operator.

With the three combined, we get the following result:

DataFusion CLI v46.0.1
> set datafusion.execution.collect_statistics = true;
0 row(s) fetched. 
Elapsed 0.003 seconds.

> CREATE EXTERNAL TABLE t2 (id INT not null, date DATE) STORED AS PARQUET LOCATION './data/' PARTITIONED BY (date) WITH ORDER (id ASC);
0 row(s) fetched. 
Elapsed 0.006 seconds.

> INSERT INTO t2 VALUES (4, '2025-03-01'), (3, '2025-3-02'), (2, '2025-03-03'), (1, '2025-03-04');
+-------+
| count |
+-------+
| 4     |
+-------+
1 row(s) fetched. 
Elapsed 0.022 seconds.

> EXPLAIN SELECT * FROM t2 ORDER BY id ASC;
[datafusion/core/src/datasource/listing/table.rs:859:9] state.config().collect_statistics() = true
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: t2.id ASC NULLS LAST                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|               |   TableScan: t2 projection=[id, date]                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| physical_plan | ProgressiveEvalExec: partition_groups=[[3, 2, 1, 0]]                                                                                                                                                                                                                                                                                                                                                                                                                        |
|               |   DataSourceExec: file_groups={4 groups: [[Users/xudong/opensource/datafusion/data/date=2025-03-01/GE7lpLxg3gu27zCY.parquet], [Users/xudong/opensource/datafusion/data/date=2025-03-02/GE7lpLxg3gu27zCY.parquet], [Users/xudong/opensource/datafusion/data/date=2025-03-03/GE7lpLxg3gu27zCY.parquet], [Users/xudong/opensource/datafusion/data/date=2025-03-04/GE7lpLxg3gu27zCY.parquet]]}, projection=[id, date], output_ordering=[id@0 ASC NULLS LAST], file_type=parquet |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.012 seconds.

> SELECT * FROM t2 ORDER BY id ASC;
[datafusion/core/src/datasource/listing/table.rs:859:9] state.config().collect_statistics() = true
+----+------------+
| id | date       |
+----+------------+
| 1  | 2025-03-04 |
| 2  | 2025-03-03 |
| 3  | 2025-03-02 |
| 4  | 2025-03-01 |
+----+------------+
4 row(s) fetched. 
Elapsed 0.012 seconds.

> EXPLAIN SELECT * FROM t2 ORDER BY id ASC limit 2;
[datafusion/core/src/datasource/listing/table.rs:859:9] state.config().collect_statistics() = true
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: t2.id ASC NULLS LAST, fetch=2                                                                                                                                                                                                                                                                                                                                                                                                                                         |
|               |   TableScan: t2 projection=[id, date]                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| physical_plan | ProgressiveEvalExec: fetch=2, partition_groups=[[3, 2, 1, 0]]                                                                                                                                                                                                                                                                                                                                                                                                               |
|               |   DataSourceExec: file_groups={4 groups: [[Users/xudong/opensource/datafusion/data/date=2025-03-01/GE7lpLxg3gu27zCY.parquet], [Users/xudong/opensource/datafusion/data/date=2025-03-02/GE7lpLxg3gu27zCY.parquet], [Users/xudong/opensource/datafusion/data/date=2025-03-03/GE7lpLxg3gu27zCY.parquet], [Users/xudong/opensource/datafusion/data/date=2025-03-04/GE7lpLxg3gu27zCY.parquet]]}, projection=[id, date], output_ordering=[id@0 ASC NULLS LAST], file_type=parquet |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched. 
Elapsed 0.008 seconds.

> SELECT * FROM t2 ORDER BY id ASC limit 2;
[datafusion/core/src/datasource/listing/table.rs:859:9] state.config().collect_statistics() = true
+----+------------+
| id | date       |
+----+------------+
| 1  | 2025-03-04 |
| 2  | 2025-03-03 |
+----+------------+
2 row(s) fetched. 
Elapsed 0.011 seconds.

>

xudong963 · 2025-03-27T09:48:23Z

datafusion/datasource/src/source.rs

@@ -174,6 +175,21 @@ impl ExecutionPlan for DataSourceExec {
        self.data_source.statistics()
    }

+    fn statistics_by_partition(&self) -> datafusion_common::Result<Vec<Statistics>> {


The partition-level statistics of DataSource will be passed to downstream nodes.

We need to implement this method for the other node types as well.

Yes, I think this may end up being a decent amount of work. For our use case at polygon.io we are probably going to want to impl it for FilterExec and UnionExec at a minimum, I'm probably missing some others too.

We have our own in-house version of ProgressiveEval, which we are currently iterating upon to make generalized. This part (extracting and ordering all partitions in the DAG below) is indeed the majority of the work thus far.

We've debugged and gotten unit tests for the majority of that code. I'll have a chat with @alamb about how much we can move upstream to the apache project.

I think this part will work well after this PR is merged: apache#15432 which introduces the partition-level statistics.

And after apache#15432 is merged, we can add the statistics_by_partition API to ExecutionPlan and implement it for all nodes.
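To make the "implement it for all nodes" idea concrete, here is a minimal sketch of how per-partition statistics could propagate through a plan. Everything in it is hypothetical: `PartitionStats`, `PlanNode`, `Source`, `Filter`, and `Union` are simplified stand-ins, not DataFusion's `Statistics`, `ExecutionPlan`, `FilterExec`, or `UnionExec`, and the `Result` wrapper from the real signature is dropped. It assumes a union exposes its children's partitions one after another.

```rust
/// Hypothetical min/max bounds of the sort column for one partition.
#[derive(Debug, Clone, PartialEq)]
pub struct PartitionStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
}

/// Hypothetical stand-in for a plan node that can report
/// one statistics entry per output partition.
pub trait PlanNode {
    fn statistics_by_partition(&self) -> Vec<PartitionStats>;
}

/// Leaf node: statistics come straight from the file groups.
pub struct Source {
    pub partitions: Vec<PartitionStats>,
}

/// Row filter over a single input.
pub struct Filter {
    pub input: Box<dyn PlanNode>,
}

/// Union of several inputs.
pub struct Union {
    pub inputs: Vec<Box<dyn PlanNode>>,
}

impl PlanNode for Source {
    fn statistics_by_partition(&self) -> Vec<PartitionStats> {
        self.partitions.clone()
    }
}

impl PlanNode for Filter {
    // A filter only removes rows, so each partition's input bounds
    // stay valid (though possibly no longer tight).
    fn statistics_by_partition(&self) -> Vec<PartitionStats> {
        self.input.statistics_by_partition()
    }
}

impl PlanNode for Union {
    // Assumption: output partitions are the children's partitions,
    // concatenated in child order.
    fn statistics_by_partition(&self) -> Vec<PartitionStats> {
        self.inputs
            .iter()
            .flat_map(|child| child.statistics_by_partition())
            .collect()
    }
}
```

A parent node can then call `statistics_by_partition()` on its input without knowing what sits below it, which is what the optimization needs from the DAG.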

suremarc

Looks mostly good to me, save for the issue about returning a SortMerge. It also seems there are some stray calls to unwrap that shouldn't be there.

datafusion/physical-plan/src/sorts/sort_preserving_merge.rs

datafusion/proto/src/physical_plan/mod.rs

wiedld

I did a really quick skim to be helpful; not sure if these comments may lead you astray (we have a lot of tech debt in our implementation which I'm still extracting).

Let me know if you want to have a chat and sync on this project. I'm on the datafusion discord at @holometabola.

wiedld · 2025-03-29T00:36:01Z

datafusion/physical-plan/src/sorts/sort_preserving_merge.rs

+        let partition_groups = input
+            .statistics_by_partition()


This partition group mapping (swapping ordering by statistics) is also the approach we are using.
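As a rough illustration of ordering partitions by their statistics, the sketch below sorts partition indices by minimum value and checks that no two ranges overlap. `MinMax` and `order_non_overlapping` are hypothetical names introduced for illustration; this is not the code in the PR.

```rust
/// Hypothetical, simplified min/max statistics of the sort column
/// for one partition (not DataFusion's `Statistics` type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MinMax {
    pub min: i64,
    pub max: i64,
}

/// Returns the partition indices ordered by ascending `min`, or `None`
/// if any two partitions overlap and a real merge is still required.
pub fn order_non_overlapping(stats: &[MinMax]) -> Option<Vec<usize>> {
    let mut order: Vec<usize> = (0..stats.len()).collect();
    order.sort_by_key(|&i| stats[i].min);

    // Once sorted by `min`, the ranges are pairwise disjoint exactly when
    // each one ends strictly before its successor begins.
    let disjoint = order
        .windows(2)
        .all(|pair| stats[pair[0]].max < stats[pair[1]].min);

    disjoint.then_some(order)
}
```

For the four single-row files in the example above (ids 4, 3, 2, 1 in partitions 0 to 3), this yields `[3, 2, 1, 0]`, the same order shown in `partition_groups` in the EXPLAIN output.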

wiedld · 2025-03-29T00:36:24Z

datafusion/physical-plan/src/sorts/sort_preserving_merge.rs

        Self {
            input,
            expr,
            metrics: ExecutionPlanMetricsSet::new(),
            fetch: None,
            cache,
            enable_round_robin_repartition: true,
+            progressive_eval_exec,


We are instead replacing the SPM node with a ProgressiveEval node during an optimizer run. I assume this is a WIP and that will occur eventually?

Yes, doing it in Optimizer makes sense to me
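A toy version of such an optimizer rewrite might look like the sketch below. The `Plan` enum and `replace_spm` function are hypothetical and model only a scan with per-partition `(min, max)` ranges; a real rule would work on DataFusion's physical plan and partition-level statistics instead.

```rust
/// Hypothetical miniature physical plan.
#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    /// Leaf with one `(min, max)` pair per output partition.
    Scan { ranges: Vec<(i64, i64)> },
    SortPreservingMerge { input: Box<Plan> },
    ProgressiveEval { input: Box<Plan>, order: Vec<usize> },
}

/// Per-partition ranges, if this toy model knows them for the node.
fn partition_ranges(plan: &Plan) -> Option<&[(i64, i64)]> {
    match plan {
        Plan::Scan { ranges } => Some(ranges.as_slice()),
        _ => None,
    }
}

/// Rewrites every SortPreservingMerge whose input partitions are
/// non-overlapping into a ProgressiveEval that reads them in order.
pub fn replace_spm(plan: Plan) -> Plan {
    match plan {
        Plan::SortPreservingMerge { input } => {
            // Rewrite the subtree first, then decide about this node.
            let input = Box::new(replace_spm(*input));
            let order = partition_ranges(&input).and_then(|ranges| {
                let mut order: Vec<usize> = (0..ranges.len()).collect();
                order.sort_by_key(|&i| ranges[i].0);
                let disjoint = order
                    .windows(2)
                    .all(|w| ranges[w[0]].1 < ranges[w[1]].0);
                disjoint.then_some(order)
            });
            match order {
                Some(order) => Plan::ProgressiveEval { input, order },
                // Overlap or unknown statistics: keep the merge.
                None => Plan::SortPreservingMerge { input },
            }
        }
        Plan::ProgressiveEval { input, order } => Plan::ProgressiveEval {
            input: Box::new(replace_spm(*input)),
            order,
        },
        scan @ Plan::Scan { .. } => scan,
    }
}
```

Keeping the decision in a rewrite pass leaves the SPM operator itself unchanged, and the fallback arm makes the rule a no-op whenever statistics are missing or the ranges overlap.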

wiedld · 2025-03-29T00:38:51Z

datafusion/physical-plan/src/sorts/progressive_eval.rs

+}
+
+/// Concat input streams until reaching the fetch limit
+struct ProgressiveEvalStream {


Both of your stream structures are a close mirror to how we do it too. 😆

apache#10490

Yes, from the PR
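The "concat input streams until reaching the fetch limit" behaviour can be sketched with plain iterators standing in for record batch streams. `progressive_eval` is a hypothetical helper, and it assumes the inputs are already sorted, non-overlapping, and passed in their final order.

```rust
/// Concatenates the inputs in the given order and stops after `fetch`
/// items. Inputs past the limit are never pulled from, because
/// `flatten` and `take` are both lazy.
pub fn progressive_eval<I>(
    inputs: Vec<I>,
    fetch: Option<usize>,
) -> impl Iterator<Item = I::Item>
where
    I: Iterator,
{
    // No fetch means "read everything".
    let limit = fetch.unwrap_or(usize::MAX);
    inputs.into_iter().flatten().take(limit)
}
```

With the four single-row partitions from the example ordered as `[3, 2, 1, 0]`, a fetch of 2 reads only the first two inputs, which is where the saving over a full merge comes from.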

wiedld · 2025-03-29T00:42:16Z

datafusion/datasource/src/source.rs

@@ -174,6 +175,21 @@ impl ExecutionPlan for DataSourceExec {
        self.data_source.statistics()
    }

+    fn statistics_by_partition(&self) -> datafusion_common::Result<Vec<Statistics>> {


We have our own in-house version of ProgressiveEval, which we are currently iterating upon to make generalized. This part (extracting and ordering all partitions in the DAG below) is indeed the majority of the work thus far.

We've debugged and gotten unit tests for the majority of that code. I'll have a chat with @alamb about how much we can move upstream to the apache project.

wiedld · 2025-03-29T00:49:30Z

datafusion/physical-plan/src/sorts/sort_preserving_merge.rs

+            .map(|min_max_stats| {
+                let res = min_max_stats.first_fit();


The binpacking into nonoverlapping chains occurs here.

We do something a bit different. We are ungrouping at the data source level such that we increase the number of non-overlapping partitions (and their min/max ranges). It works for the plans we usually build, but may not be the best solution for the general approach. 🤔
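For readers unfamiliar with the bin-packing step, a greedy first-fit over min/max ranges could look roughly like this. The `Range` type and this `first_fit` function are illustrative assumptions and may differ from the `first_fit` method in the PR.

```rust
/// Hypothetical min/max range of the sort column for one partition.
#[derive(Debug, Clone, Copy)]
pub struct Range {
    pub min: i64,
    pub max: i64,
}

/// Greedy first-fit: visit partitions by ascending `min` and append each
/// one to the first chain whose last range ends before it starts,
/// opening a new chain when none fits. Each returned chain lists
/// partition indices that can be concatenated without merging.
pub fn first_fit(ranges: &[Range]) -> Vec<Vec<usize>> {
    let mut order: Vec<usize> = (0..ranges.len()).collect();
    order.sort_by_key(|&i| ranges[i].min);

    let mut chains: Vec<Vec<usize>> = Vec::new();
    for i in order {
        // Index of the first chain this partition can extend, if any.
        let slot = chains.iter().position(|chain| {
            chain
                .last()
                .map_or(true, |&last| ranges[last].max < ranges[i].min)
        });
        match slot {
            Some(c) => chains[c].push(i),
            None => chains.push(vec![i]),
        }
    }
    chains
}
```

If everything fits into a single chain, as in the example in the PR description, no merge is needed at all; with several chains, only the chains need to be merged with each other.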

xudong963 · 2025-03-29T02:51:14Z

Let me know if you want to have a chat and sync on this project. I'm on the datafusion discord at @holometabola.

Thanks, @wiedld. After reading your review comments, I think we're very close. I'll create a subgroup in DF Discord, and we can talk more next Monday. And Happy Weekend!

alamb · 2025-03-29T10:04:20Z

We've debugged and gotten unit tests for the majority of that code. I'll have a chat with @alamb about how much we can move upstream to the apache project.

I agree it will be good to have a chat -- from my perspective, if we can work with @suremarc and @xudong963 to implement the more general analysis upstream, that is probably preferable to doing it ourselves internally. Once it is implemented upstream, we could perhaps backport it into our codebase temporarily as we work through the upgrades.

alamb · 2025-03-31T17:52:30Z

Let me know if you want to have a chat and sync on this project. I'm on the datafusion discord at @holometabola.

Thanks, @wiedld. After reading your review comments, I think we're very close. I'll create a subgroup in DF Discord, and we can talk more next Monday. And Happy Weekend!

In case anyone else is interested, the subgroup in discord is: https://discord.com/channels/885562378132000778/1356122416258220114/1356122427591098378

alamb · 2025-04-01T18:10:06Z

This is related to [EPIC] Avoid sort for already sorted Parquet files that do not overlap values on condition apache/datafusion#6672

xudong963 added 2 commits March 26, 2025 16:52

Support computing statistics for FileGroup

eb6463f

POC: Optimize SortPreservingMergeExec to avoid merging non-overlappin…

6a4e83d

…g partitions

github-actions bot added physical-expr common datasource optimizer proto labels Mar 27, 2025

xudong963 commented Mar 27, 2025

update format

11d7789

suremarc reviewed Mar 27, 2025

refine

ced059c

github-actions bot removed the proto label Mar 28, 2025

xudong963 force-pushed the spm_optimized_experimental branch from ed95de9 to ced059c Compare March 28, 2025 05:07

xudong963 mentioned this pull request Mar 28, 2025

Analysis to support SortPreservingMerge --> ProgressiveEval apache/datafusion#15191

Open

wiedld reviewed Mar 29, 2025

xudong963 force-pushed the file_group_stats branch 2 times, most recently from d8090fe to 04bb1d6 Compare March 31, 2025 01:32

xudong963 force-pushed the file_group_stats branch from 04bb1d6 to e775493 Compare April 1, 2025 05:11

alamb mentioned this pull request Apr 1, 2025

[EPIC] Avoid sort for already sorted Parquet files that do not overlap values on condition apache/datafusion#6672

Open

11 tasks
