Perf: Support Utf8View datatype single column comparisons for SortPreservingMergeStream #15348

zhuqi-lucas · 2025-03-21T09:54:25Z

Which issue does this PR close?

Part of #15096

Rationale for this change

Support Utf8View datatype single column comparisons for SortPreservingMergeStream

What changes are included in this PR?

Support Utf8View datatype single column comparisons for SortPreservingMergeStream

Are these changes tested?

Yes

Are there any user-facing changes?

Support Utf8View datatype single column comparisons for SortPreservingMergeStream

Omega359 · 2025-03-21T11:32:08Z

datafusion/physical-plan/src/sorts/cursor.rs

+    }
+
+    fn eq(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> bool {
+        unsafe { GenericByteViewArray::compare_unchecked(l, l_idx, r, r_idx).is_eq() }


Please add a 'safety:' note to say why it is ok to use unsafe here. An example

Thank you @Omega359 for the review, good example, I will address it.

2010YOUY01 · 2025-03-21T11:52:49Z

Thank you for the work on better Utf8View support. I tried one sort benchmark with sort-preserving merging on a single Utf8View column, but it gets slower:

Reproducer

cargo run --profile release-nonlto --bin dfbench -- sort-tpch -p /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 -q 3

main: 8s
pr: 10s

According to the flamegraph, extra overhead from libsystem_platform.dylib_platform_memcmp shows up inside SortPreservingMergeStream.
It's not obvious why; I'll try to help figure it out later.

flamegraphs.zip

zhuqi-lucas · 2025-03-21T12:35:50Z

Thank you @2010YOUY01 for the review. I think I know what causes the slowdown in your sort-tpch q3 reproducer (main: 8s, PR: 10s, with extra memcmp overhead inside SortPreservingMergeStream):

Q3 of the sort benchmark is a special case: it sorts by l_comment, which consists mostly of long strings (more than 12 bytes), and many of them share the same prefix. When the prefixes match, the 4-byte prefix stored in the view is also identical, so the comparison falls through to its last step and compares the full bytes in the data buffers. That is what causes the regression.
You can try sorting the more typical case, where strings are mostly short (12 bytes or fewer). Even when some strings are longer than 12 bytes, the comparison can often be decided from the 4-byte prefix in the view alone. For example, change q3 to the following SQL, which orders by a typical short string column:

SELECT l_shipmode, l_comment, l_partkey
        FROM lineitem
        ORDER BY l_shipmode;

This shows the performance improvement.

Finally, I think we need a follow-up ticket to investigate and improve the regression case. It would be valuable to fix. Thanks!
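
To make the explanation above concrete, here is a minimal, self-contained sketch of a prefix-first comparison. It is a simplified model, not the arrow-rs implementation: `View`, `ViewArray`, and `compare_traced` are hypothetical names, every string is stored in one shared buffer (real Utf8View arrays inline strings of 12 bytes or fewer), and the returned flag exists only to show when the comparison had to read the buffer.

```rust
use std::cmp::Ordering;

/// Hypothetical, simplified stand-in for a string view: the length, the
/// first 4 bytes (zero padded), and where the full bytes live.
struct View {
    len: usize,
    prefix: [u8; 4],
    offset: usize,
}

/// Hypothetical array of views over one shared data buffer.
struct ViewArray {
    views: Vec<View>,
    buffer: Vec<u8>,
}

impl ViewArray {
    fn from_strs(values: &[&str]) -> Self {
        let mut views = Vec::with_capacity(values.len());
        let mut buffer = Vec::new();
        for v in values {
            let bytes = v.as_bytes();
            let mut prefix = [0u8; 4];
            let n = bytes.len().min(4);
            prefix[..n].copy_from_slice(&bytes[..n]);
            views.push(View { len: bytes.len(), prefix, offset: buffer.len() });
            buffer.extend_from_slice(bytes);
        }
        Self { views, buffer }
    }

    fn bytes(&self, idx: usize) -> &[u8] {
        let v = &self.views[idx];
        &self.buffer[v.offset..v.offset + v.len]
    }

    /// Returns the ordering and whether the full buffer bytes had to be read.
    fn compare_traced(&self, l: usize, r: usize) -> (Ordering, bool) {
        // Fast path: differing prefixes decide the order without touching the buffer.
        match self.views[l].prefix.cmp(&self.views[r].prefix) {
            // Slow path: equal prefixes force a full byte comparison.
            Ordering::Equal => (self.bytes(l).cmp(self.bytes(r)), true),
            other => (other, false),
        }
    }
}
```

With short, distinct keys such as l_shipmode, most comparisons in this model end in the fast path. With long l_comment values that share a prefix, nearly every comparison falls through to the full byte comparison.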

zhuqi-lucas · 2025-03-22T09:54:34Z

Updated with results for sorting short strings, which benefit a lot from the StringView type. Here is the new Q11 added to the sort benchmark:

-    const SORT_QUERIES: [&'static str; 10] = [
+    const SORT_QUERIES: [&'static str; 11] = [
         // Q1: 1 sort key (type: INTEGER, cardinality: 7) + 1 payload column
         r#"
         SELECT l_linenumber, l_partkey
@@ -159,6 +159,12 @@ impl RunOpt {
         FROM lineitem
         ORDER BY l_orderkey, l_suppkey, l_linenumber, l_comment
         "#,
+        // Q11: 1 sort key (type: VARCHAR, cardinality: 4.5M) + 1 payload column
+        r#"
+        SELECT l_shipmode, l_comment, l_partkey
+        FROM lineitem
+        ORDER BY l_shipmode;
+        "#,
     ];

This PR:

Q11 iteration 0 took 5645.3 ms and returned 59986052 rows
Q11 iteration 1 took 5641.1 ms and returned 59986052 rows
Q11 iteration 2 took 5520.6 ms and returned 59986052 rows
Q11 avg time: 5602.33 ms

The main:

Q11 iteration 0 took 6687.5 ms and returned 59986052 rows
Q11 iteration 1 took 6504.5 ms and returned 59986052 rows
Q11 iteration 2 took 6544.6 ms and returned 59986052 rows
Q11 avg time: 6578.87 ms

About 15% less time on average (5602.33 ms vs 6578.87 ms).

alamb

Thank you @zhuqi-lucas -- this looks pretty sweet. I think we need to sort out nulls and safety comment and this will be good to go

alamb · 2025-03-22T14:30:48Z

datafusion/physical-plan/src/sorts/cursor.rs

+    }
+
+    fn eq(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> bool {
+        unsafe { GenericByteViewArray::compare_unchecked(l, l_idx, r, r_idx).is_eq() }


I agree it would be good to justify the use of unchecked (which I think is ok here)

The docs say https://docs.rs/arrow/latest/arrow/array/struct.GenericByteViewArray.html#method.compare_unchecked

So maybe the safety argument is mostly "The left/right_idx must be within range of each array"

It also seems like we need to be comparing the null masks too 🤔 like checking if the values are null before comparing

Given that this comparison is typically the hottest part of a merge operation, maybe we should try using unchecked comparisons elsewhere
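
As a generic illustration of the kind of safety note being requested, the sketch below uses plain slices rather than the arrow-rs API. `Values` and its methods are hypothetical; the point is only that the invariant (indices in range) is established by the caller and restated in a `// SAFETY:` comment next to each `unsafe` block.

```rust
use std::cmp::Ordering;

/// Hypothetical cursor over a batch of sort keys.
struct Values {
    keys: Vec<u64>,
}

impl Values {
    /// Compares two rows without bounds checks.
    ///
    /// # Safety
    /// `l_idx` must be less than `l.keys.len()` and `r_idx` less than `r.keys.len()`.
    unsafe fn compare_unchecked(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> Ordering {
        // SAFETY: the caller guarantees both indices are within range.
        let (a, b) = unsafe { (l.keys.get_unchecked(l_idx), r.keys.get_unchecked(r_idx)) };
        a.cmp(b)
    }

    /// Safe entry point: validates the indices once, then uses the unchecked path.
    fn compare(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> Ordering {
        assert!(l_idx < l.keys.len() && r_idx < r.keys.len(), "index out of range");
        // SAFETY: both indices were checked against their array lengths above.
        unsafe { Self::compare_unchecked(l, l_idx, r, r_idx) }
    }
}
```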

zhuqi-lucas · 2025-03-23T10:07:26Z

Thank you @alamb for the review and the good suggestion. I checked, and the null check is already done in the parent wrapper before the inner comparison is called, for example:

impl<T: CursorValues> CursorValues for ArrayValues<T> {
    fn len(&self) -> usize {
        self.values.len()
    }

    fn eq(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> bool {
        match (l.is_null(l_idx), r.is_null(r_idx)) {
            (true, true) => true,
            (false, false) => T::eq(&l.values, l_idx, &r.values, r_idx),
            _ => false,
        }
    }

    fn eq_to_previous(cursor: &Self, idx: usize) -> bool {
        assert!(idx > 0);
        match (cursor.is_null(idx), cursor.is_null(idx - 1)) {
            (true, true) => true,
            (false, false) => T::eq(&cursor.values, idx, &cursor.values, idx - 1),
            _ => false,
        }
    }

    fn compare(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> Ordering {
        match (l.is_null(l_idx), r.is_null(r_idx)) {
            (true, true) => Ordering::Equal,
            (true, false) => match l.options.nulls_first {
                true => Ordering::Less,
                false => Ordering::Greater,
            },
            (false, true) => match l.options.nulls_first {
                true => Ordering::Greater,
                false => Ordering::Less,
            },
            (false, false) => match l.options.descending {
                true => T::compare(&r.values, r_idx, &l.values, l_idx),
                false => T::compare(&l.values, l_idx, &r.values, r_idx),
            },
        }
    }
}
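
The wrapper logic above can be exercised in isolation. The following is a minimal sketch that models nullable values as `Option` instead of Arrow null masks. `SortOptions` here is a local stand-in with the same two flags, and `compare_nullable` is a hypothetical helper, not a DataFusion function.

```rust
use std::cmp::Ordering;

/// Local stand-in for the sort options used by the wrapper.
#[derive(Clone, Copy)]
struct SortOptions {
    descending: bool,
    nulls_first: bool,
}

/// Null handling happens here, so the inner comparison only sees non-null values.
fn compare_nullable<T: Ord>(l: Option<&T>, r: Option<&T>, opts: SortOptions) -> Ordering {
    match (l, r) {
        (None, None) => Ordering::Equal,
        // Null placement depends only on `nulls_first`, not on `descending`.
        (None, Some(_)) => {
            if opts.nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if opts.nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        // Descending order is implemented by swapping the operands.
        (Some(a), Some(b)) => {
            if opts.descending { b.cmp(a) } else { a.cmp(b) }
        }
    }
}
```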

I tried to address the comments and suggestions in the latest commits. As for the regression when comparing longer strings with StringView, #15348 (comment),
I still need more time to investigate. I am willing to create a new ticket to dig into it. Thanks.

zhuqi-lucas · 2025-03-24T05:21:34Z

Added some new benchmarks. We need to improve high cardinality performance for sorting with utf8_view; the largest regression is in sort partitioned.

Comparison: UTF8 vs UTF8_VIEW Sorting Performance

Based on the benchmark results, we compare utf8 and utf8_view across different sorting methods, including low cardinality and high cardinality cases.

Low Cardinality Performance

| Sorting Method | `utf8` Time | `utf8_view` Time | `utf8_view` Improvement |
| --- | --- | --- | --- |
| merge sorted | 3.8926 ms | 3.6713 ms | 5.7% faster |
| sort merge | 3.9152 ms | 3.6265 ms | 7.4% faster |
| sort | 6.0351 ms | 5.7904 ms | 4.1% faster |
| sort partitioned | 236.24 µs | 167.18 µs | 29.2% faster |

Observations

utf8_view is consistently faster across all sorting methods.
The most significant improvement is in sort partitioned (29.2% faster).
sort merge also benefits significantly (7.4% faster), likely due to utf8_view reducing memory allocations or copies.

High Cardinality Performance

| Sorting Method | `utf8` Time | `utf8_view` Time | `utf8_view` Improvement |
| --- | --- | --- | --- |
| merge sorted | 4.6662 ms | 5.0999 ms | -9.3% (slower) |
| sort merge | 4.7102 ms | 5.7224 ms | -21.5% (slower) |
| sort | 7.0020 ms | 6.3274 ms | 9.6% faster |
| sort partitioned | 242.99 µs | 679.86 µs | -180% (much slower) |

Observations

utf8_view performs worse for high cardinality cases:
- merge sorted is 9.3% slower.
- sort merge is 21.5% slower.
- sort partitioned is 180% slower, a drastic drop.
However, utf8_view still improves the sort method by 9.6%, likely due to reduced string operations.

Key Takeaways

For low cardinality, utf8_view is the better choice, especially for sort partitioned and sort merge, with 7.4% to 29.2% improvements.
For high cardinality, utf8_view underperforms in merge sorted, sort merge, and especially sort partitioned, making it a worse choice.
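
For reference, the percentages in the tables above appear to be computed relative to the utf8 time. A small sketch of that arithmetic follows; `improvement_pct` is a hypothetical helper, and `Duration` is used so that the ms and µs rows share one unit.

```rust
use std::time::Duration;

/// Relative improvement of `candidate` over `baseline`, in percent.
/// Positive means faster, negative means slower.
fn improvement_pct(baseline: Duration, candidate: Duration) -> f64 {
    let base = baseline.as_secs_f64();
    (base - candidate.as_secs_f64()) / base * 100.0
}
```

For example, the high cardinality sort partitioned row compares 242.99 µs against 679.86 µs, which this formula reports as roughly -180%, that is, utf8_view takes about 2.8x as long.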

zhuqi-lucas · 2025-03-24T07:25:15Z

I compared the flamegraphs of the sort partitioned benchmark for utf8 and utf8_view with high cardinality:

The utf8_view: [flamegraph image]

The utf8: [flamegraph image]

It looks like the utf8 sort partitioned case reserves less memory than utf8_view, so it falls under the threshold and takes the optimized concat_batches path:

// If less than sort_in_place_threshold_bytes, concatenate and sort in place
        if self.reservation.size() < self.sort_in_place_threshold_bytes {
            // Concatenate memory batches together and sort
            let batch = concat_batches(&self.schema, &self.in_mem_batches)?;
            self.in_mem_batches.clear();
            self.reservation
                .try_resize(get_reserved_byte_for_record_batch(&batch))?;
            let reservation = self.reservation.take();
            return self.sort_batch_stream(batch, metrics, reservation);
        }

So it is much faster. Why Utf8View reserves more memory for each partition is something I still need to dig into.
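
The effect of the threshold can be modeled with a small sketch. The enum and function names below are hypothetical, the description of the non-concat path (sort each batch, then merge) is my reading of the surrounding code rather than something shown in the snippet, and the 1.5 MiB reservation is an illustrative number, not a measured one.

```rust
/// Hypothetical names modeling the two in-memory sort paths.
#[derive(Debug, PartialEq)]
enum InMemSortPath {
    /// Concatenate all buffered batches and sort once.
    ConcatAndSort,
    /// Sort each batch separately, then merge the sorted streams.
    SortThenMerge,
}

/// Mirrors the `reservation.size() < sort_in_place_threshold_bytes` check.
fn choose_path(reserved_bytes: usize, sort_in_place_threshold_bytes: usize) -> InMemSortPath {
    if reserved_bytes < sort_in_place_threshold_bytes {
        InMemSortPath::ConcatAndSort
    } else {
        InMemSortPath::SortThenMerge
    }
}
```

In this model, a partition that reserves 1.5 MiB takes the merge path under a 1 MiB threshold but the concat path under a 2 MiB threshold.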

Update: when I change the sort_in_place_threshold_bytes default value from 1M to 2M, sort partitioned for utf8_view improves dramatically, from 679.86 µs to 179.79 µs:

sort partitioned utf8 view high cardinality
                        time:   [178.27 µs 179.79 µs 181.19 µs]

Created a follow-up ticket for this improvement:

#15375

zhuqi-lucas · 2025-03-24T09:48:06Z

I did a POC of automatic concat_batches, which is a separate improvement from this PR:

#15375 (comment)

I can see a very good performance improvement, but it needs more testing and investigation. It is not limited to utf8_view, and I did not apply this PR when producing the results in the linked comment.

alamb

Thanks @zhuqi-lucas -- this is quite cool and it is very neat it spawned a bunch of observations

alamb · 2025-03-24T20:20:54Z

datafusion/physical-plan/src/sorts/cursor.rs

@@ -294,16 +294,44 @@ impl CursorValues for StringViewArray {
    }

    fn eq(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> bool {
+        // SAFETY: Both l_idx and r_idx are guaranteed to be within bounds,
+        // and any null-checks are handled in the outer layers.
+        // Fast path: Compare the lengths (or a proxy of the lengths) before full byte comparison.


alamb · 2025-03-24T20:21:59Z

datafusion/physical-plan/src/sorts/cursor.rs

        unsafe {
            GenericByteViewArray::compare_unchecked(cursor, idx, cursor, idx - 1).is_eq()
        }
    }

    fn compare(l: &Self, l_idx: usize, r: &Self, r_idx: usize) -> Ordering {
+        // SAFETY: Prior assertions guarantee that l_idx and r_idx are valid indices.
+        // Null-checks are assumed to have been handled in the wrapper (e.g., ArrayValues).
+        assert!(l_idx < l.len());


Asserts are left in release builds of Rust.
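
For context on that remark: `assert!` is evaluated in every build profile, while `debug_assert!` is compiled out when debug assertions are disabled (the default for release builds). A small sketch with hypothetical helper names, using plain slices instead of the cursor types:

```rust
/// Bounds check that stays in release builds: `assert!` always runs.
fn get_always_checked(values: &[i32], idx: usize) -> i32 {
    assert!(idx < values.len(), "idx {} out of range", idx);
    // SAFETY: the assertion above guarantees `idx` is in bounds.
    unsafe { *values.get_unchecked(idx) }
}

/// Bounds check only in debug builds: `debug_assert!` is skipped when
/// debug assertions are off, so the caller must uphold the contract.
///
/// # Safety
/// `idx` must be less than `values.len()`.
unsafe fn get_debug_checked(values: &[i32], idx: usize) -> i32 {
    debug_assert!(idx < values.len(), "idx {} out of range", idx);
    // SAFETY: guaranteed by the caller, see the function-level contract.
    unsafe { *values.get_unchecked(idx) }
}
```

The first form costs a compare and branch on every call, which may matter on a hot path such as a merge comparison. The second form avoids that cost in release builds but moves the obligation to the caller.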

alamb · 2025-03-24T20:24:23Z

utf8_view performs worse for high cardinality cases:

I think it would be a great project to improve the performance of utf8_view for sorting high cardinality - maybe we can add a ticket to track that work too

zhuqi-lucas · 2025-03-25T01:40:38Z

Thank you @alamb for the review. I created a follow-up ticket for the high cardinality utf8_view performance as well:

#15402

Strangely, the latest testing can't reproduce the regression for some of the high cardinality cases. Only the sort partitioned regression, already described in the comments above, reproduces consistently:

Latest testing result:

| Benchmark | `utf8` Time | `utf8_view` Time | Change (%) | Performance |
| --- | --- | --- | --- | --- |
| merge sorted utf8 low cardinality | 3.0828 ms | 2.5903 ms | -16.0% | Improved |
| sort merge utf8 low cardinality | 3.1103 ms | 2.6354 ms | -15.3% | Improved |
| sort utf8 low cardinality | 4.5160 ms | 3.9261 ms | -13.1% | Improved |
| sort partitioned utf8 low cardinality | 193.02 µs | 190.20 µs | -1.5% | Improved |
| merge sorted utf8 high cardinality | 4.5441 ms | 3.9072 ms | -14.0% | Improved |
| sort merge utf8 high cardinality | 4.6552 ms | 4.2851 ms | -8.0% | Improved |
| sort utf8 high cardinality | 5.3901 ms | 4.5989 ms | -14.7% | Improved |
| sort partitioned utf8 high cardinality | 223.88 µs | 2.6200 ms | +1070.0% | Regressed |
| merge sorted utf8 tuple | 10.669 ms | 13.209 ms | +23.8% | Regressed |
| sort merge utf8 tuple | 13.204 ms | 15.009 ms | +13.7% | Regressed |
| sort utf8 tuple | 10.280 ms | 10.098 ms | -1.8% | Improved |
| sort partitioned utf8 tuple | 3.2097 ms | 3.5555 ms | +10.8% | Regressed |
| merge sorted mixed tuple | 9.2576 ms | 10.770 ms | +16.3% | Regressed |
| sort merge mixed tuple | 10.520 ms | 11.871 ms | +12.8% | Regressed |
| sort mixed tuple | 9.4906 ms | 8.7883 ms | -7.4% | Improved |
| sort partitioned mixed tuple | 2.4459 ms | 3.2249 ms | +31.9% | Regressed |

I will continue the investigation in the new ticket.

zhuqi-lucas · 2025-03-25T03:00:42Z

The sort-tpch q3 slowdown reported by @2010YOUY01 above (main: 8s, PR: 10s, with extra memcmp overhead inside SortPreservingMergeStream) also deserves its own ticket for investigation. Created one now:
#15403

cc @2010YOUY01 @alamb

alamb · 2025-03-25T15:44:17Z

Thanks again @zhuqi-lucas

Perf: Support Utf8View datatype single column comparisons for SortPreservingMergeStream

8343d5e

Omega359 reviewed Mar 21, 2025

alamb reviewed Mar 22, 2025

Add safety and bench sql

93e46fb

zhuqi-lucas added 3 commits March 23, 2025 18:09

fix

1a3857c

Fix

d3808c1

Add benchmark testing

ef32003

github-actions bot added the core Core DataFusion crate label Mar 24, 2025

zhuqi-lucas mentioned this pull request Mar 24, 2025

Perf: Support automatically concat_batches for sort which will improve performance #15375

Open

Weijun-H changed the title ~~Perf: Support Utf8View datatype single column comparisons for SortPre…~~ Perf: Support Utf8View datatype single column comparisons for SortPreservingMergeStream Mar 24, 2025

alamb approved these changes Mar 24, 2025

zhuqi-lucas mentioned this pull request Mar 25, 2025

Investigate why Utf8View performs worse for sort with high cardinality cases and improve it. #15402

Open

zhuqi-lucas mentioned this pull request Mar 25, 2025

Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving merging on a single Utf8View ) #15403

Closed

alamb merged commit a0a063d into apache:main Mar 25, 2025
29 checks passed

zhuqi-lucas mentioned this pull request Mar 27, 2025

Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi… #15447

Merged
