StringArrayView(Utf8View) slower cases compare to StringArray(Utf8) #7350

zhuqi-lucas · 2025-03-28T09:54:16Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This ticket collects the cases where Utf8View is slower than Utf8 and tracks the work to improve them.

The regressions mostly occur when the strings being compared share the same 4-byte prefix and at least one of them is longer than 12 bytes, so it is not stored inline and the comparison has to read the data buffer.
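A minimal sketch of why this case is slow, assuming a simplified single-buffer model of the view layout. The `View` type and the `make_view`, `prefix`, `bytes`, and `compare` helpers are hypothetical names for illustration, not the arrow-rs API. The comparison can be decided from the 4-byte prefix alone only when the prefixes differ; when they are equal and a string is longer than 12 bytes, the data buffer must be read.

```rust
use std::cmp::Ordering;

/// Simplified model of one view (hypothetical type, not the arrow-rs one).
#[derive(Debug, Clone, Copy)]
enum View {
    /// Strings of at most 12 bytes live entirely inside the view, zero padded.
    Inline { len: u32, data: [u8; 12] },
    /// Longer strings keep a 4-byte prefix plus an offset into a data buffer.
    /// (The real layout also stores a buffer index; one buffer is assumed here.)
    Buffered { len: u32, prefix: [u8; 4], offset: u32 },
}

/// Builds a view for `s`, appending to `buffer` only when `s` exceeds 12 bytes.
fn make_view(s: &str, buffer: &mut Vec<u8>) -> View {
    let bytes = s.as_bytes();
    if bytes.len() <= 12 {
        let mut data = [0u8; 12];
        data[..bytes.len()].copy_from_slice(bytes);
        View::Inline { len: bytes.len() as u32, data }
    } else {
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&bytes[..4]);
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(bytes);
        View::Buffered { len: bytes.len() as u32, prefix, offset }
    }
}

/// First four bytes of the string, zero padded for very short strings.
fn prefix(v: &View) -> [u8; 4] {
    match v {
        View::Inline { data, .. } => [data[0], data[1], data[2], data[3]],
        View::Buffered { prefix, .. } => *prefix,
    }
}

/// Full string bytes; for `Buffered` views this dereferences the data buffer.
fn bytes<'a>(v: &'a View, buffer: &'a [u8]) -> &'a [u8] {
    match v {
        View::Inline { len, data } => &data[..*len as usize],
        View::Buffered { len, offset, .. } => {
            let start = *offset as usize;
            &buffer[start..start + *len as usize]
        }
    }
}

/// Returns the ordering and whether the data buffer had to be consulted.
fn compare(a: &View, b: &View, buffer: &[u8]) -> (Ordering, bool) {
    let (pa, pb) = (prefix(a), prefix(b));
    if pa != pb {
        // Fast path: different prefixes decide the ordering without the buffer.
        return (pa.cmp(&pb), false);
    }
    // Slow path: equal prefixes force a full comparison.
    let touched = matches!(a, View::Buffered { .. }) || matches!(b, View::Buffered { .. });
    (bytes(a, buffer).cmp(bytes(b, buffer)), touched)
}
```

In this model, comparing two long strings that start with the same four bytes always takes the slow path, which is the extra indirection that a contiguous Utf8 layout avoids.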

Describe the solution you'd like
Make Utf8View regression cases faster.

- Add reproducer cases in which Utf8View is slower than Utf8
- Add an implementation that improves the Utf8View regression cases

Describe alternatives you've considered
Make Utf8View regression cases faster.

Additional context
Make Utf8View regression cases faster.

In the DataFusion sort TPC-H benchmarks, there are regressions in Utf8View comparison.

It would be best to improve this in arrow-rs, since DataFusion would benefit significantly.


alamb · 2025-03-28T15:48:13Z

FYI @XiangpengHao

zhuqi-lucas · 2025-03-31T04:27:16Z

I can't find a better way to optimize this other than adding a new ByteView that supports an 8-byte prefix. But I am not sure it is worth doing.

alamb · 2025-03-31T14:40:51Z

> I can't find a better way to optimize this other than adding a new ByteView that supports an 8-byte prefix. But I am not sure it is worth doing.

I don't understand what this means -- perhaps you can make a small PR to demonstrate?

XiangpengHao · 2025-03-31T14:41:46Z

> adding a new ByteView that supports an 8-byte prefix

I think the Arrow spec requires a 4-byte prefix: https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout

As you have pointed out, StringViewArray is not always better than StringArray, especially when the prefixes are the same.

But I do believe there are micro-architecture level optimizations we can do to improve performance, such as better compiler hints, prefetching, GC tuning, etc.

Another direction is probably to rewrite FilterExec/CoalesenceExec to emit StringArray rather than StringViewArray. The idea is to use StringView in the lower levels of the plan and String in the higher levels.
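A rough sketch of what "emit StringArray" could mean at the data level, assuming simplified stand-in types. `ViewColumn`, `OffsetColumn`, `to_offset_column`, and `value` are hypothetical names, not DataFusion or arrow-rs APIs. The selected rows are copied into one contiguous values buffer with `n + 1` offsets, so later comparisons read the bytes directly instead of going through a view.

```rust
/// Hypothetical stand-in for a view column: each view is (offset, len)
/// into a shared data buffer.
struct ViewColumn {
    buffer: Vec<u8>,
    views: Vec<(u32, u32)>,
}

/// Hypothetical stand-in for the Utf8 layout: n + 1 offsets into one
/// contiguous values buffer.
struct OffsetColumn {
    offsets: Vec<i32>,
    values: Vec<u8>,
}

/// Copies the selected rows (for example, rows that passed a filter)
/// into a contiguous offsets + values layout.
fn to_offset_column(col: &ViewColumn, selected: &[usize]) -> OffsetColumn {
    let mut offsets = Vec::with_capacity(selected.len() + 1);
    let mut values = Vec::new();
    offsets.push(0);
    for &row in selected {
        let (start, len) = col.views[row];
        let start = start as usize;
        values.extend_from_slice(&col.buffer[start..start + len as usize]);
        offsets.push(values.len() as i32);
    }
    OffsetColumn { offsets, values }
}

/// Reads row `i` with a single slice into the values buffer.
fn value(col: &OffsetColumn, i: usize) -> &[u8] {
    &col.values[col.offsets[i] as usize..col.offsets[i + 1] as usize]
}
```

The trade-off is an extra copy of the string bytes at the operator boundary in exchange for cheaper reads in the upper levels of the plan.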

alamb · 2025-03-31T14:42:52Z

I do think theoretically StringArray is likely to be faster than StringViewArray for larger strings in many cases as it is more efficient (it has fewer indirections)

> Another direction is probably to rewrite FilterExec/CoalesenceExec to emit StringArray rather than StringViewArray. The idea is to use StringView in the lower levels of the plan and String in the higher levels.

that is a very interesting idea 🤔

zhuqi-lucas · 2025-04-01T03:04:22Z

Thank you @XiangpengHao @alamb. I was thinking of supporting a longer inline prefix for StringView comparisons, but it looks like the prefix is fixed at 4 bytes and we can't change it easily.

> > adding a new ByteView that supports an 8-byte prefix
>
> I think the Arrow spec requires a 4-byte prefix: https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
>
> As you have pointed out, StringViewArray is not always better than StringArray, especially when the prefixes are the same.
>
> But I do believe there are micro-architecture level optimizations we can do to improve performance, such as better compiler hints, prefetching, GC tuning, etc.
>
> Another direction is probably to rewrite FilterExec/CoalesenceExec to emit StringArray rather than StringViewArray. The idea is to use StringView in the lower levels of the plan and String in the higher levels.

I agree. The linked PR uses GC as a workaround for the sort-merge comparison cases:

apache/datafusion#15447

> I do think theoretically StringArray is likely to be faster than StringViewArray for larger strings in many cases as it is more efficient (it has fewer indirections)
>
> > Another direction is probably to rewrite FilterExec/CoalesenceExec to emit StringArray rather than StringViewArray. The idea is to use StringView in the lower levels of the plan and String in the higher levels.
>
> that is a very interesting idea 🤔

For FilterExec/CoalesenceExec, interesting: this is using GC to reduce the overhead of FilterExec/CoalesenceExec. Maybe we can try rewriting FilterExec/CoalesenceExec to emit StringArray and compare the gains and losses.
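For reference, a minimal sketch of what GC means for a view column, assuming a simplified single-buffer model. `ViewColumn` and `gc` are hypothetical names here, not the arrow-rs implementation. After a filter, the surviving views may point into a large buffer that is mostly unused; compaction copies only the referenced bytes into a fresh buffer and rewrites the offsets, while the output stays in the view layout.

```rust
/// Hypothetical stand-in for a view column: each view is (offset, len)
/// into a shared data buffer.
struct ViewColumn {
    buffer: Vec<u8>,
    views: Vec<(u32, u32)>,
}

/// Copies only the bytes still referenced by `views` into a new, dense
/// buffer and rewrites each view's offset accordingly.
fn gc(col: &ViewColumn) -> ViewColumn {
    let needed: usize = col.views.iter().map(|&(_, len)| len as usize).sum();
    let mut buffer = Vec::with_capacity(needed);
    let mut views = Vec::with_capacity(col.views.len());
    for &(start, len) in &col.views {
        let start = start as usize;
        let new_offset = buffer.len() as u32;
        buffer.extend_from_slice(&col.buffer[start..start + len as usize]);
        views.push((new_offset, len));
    }
    ViewColumn { buffer, views }
}
```

Unlike emitting StringArray, this keeps the view indirection on reads, but it places the referenced bytes next to each other and lets the old buffer be released.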

zhuqi-lucas added the enhancement label Mar 28, 2025

This was referenced Mar 28, 2025

Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi… apache/datafusion#15447

Merged

Add additional benchmarks for utf8view comparison kernels #7351

Merged
