Skip to content

StringArrayView(Utf8View) slower cases compare to StringArray(Utf8) #7350

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
zhuqi-lucas opened this issue Mar 28, 2025 · 6 comments
Open
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@zhuqi-lucas
Copy link
Contributor

zhuqi-lucas commented Mar 28, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This ticket collected the Utf8View slower than Ut8 cases and try to improve it.

Mostly the cases happen when the string has same 4bytes prefix, but one of the string to compare is larger than 12 bytes, it will make it happen.

Describe the solution you'd like
Make Utf8View regression cases faster.

    • Add reproducer cases which the Utf8View will slower than Utf8
    • Add code implementation to improve the Utf8View regression cases

Describe alternatives you've considered
Make Utf8View regression cases faster.

Additional context
Make Utf8View regression cases faster.

From the benchmark testing from datafusion sort tpch, there are regressions about the Utf8View compare:

We'd better to improve it from arrow-rs, so we can benefit a lot for datafusion.

@alamb
Copy link
Contributor

alamb commented Mar 28, 2025

FYI @XiangpengHao

@zhuqi-lucas
Copy link
Contributor Author

Can't find a better solution to optimize it besides we add new new ByteView to support 8bytes prefix? But i am not sure if we deserve to do it.

@alamb
Copy link
Contributor

alamb commented Mar 31, 2025

Can't find a better solution to optimize it besides we add new new ByteView to support 8bytes prefix? But i am not sure if we deserve to do it.

I don't understand what this means -- perhaps you can make a small PR to demonstrate?

@XiangpengHao
Copy link
Contributor

we add new new ByteView to support 8bytes prefix

I think Arrow spec says we need to do 4 bytes prefix: https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout

As you have pointed out, StringViewArray is not always better than StringArray, especially when the prefixes are the same.

But I do believe there are micro-architecture level optimizations we can do to improve performance, like better compiler hint, prefetching, gc tuning etc.

Another direction is probably to rewrite the FilterExec/CoalesenceExec to emit StringArray rather than StringViewArray, the idea is to use StringView in lower levels of the plan and use String in higher levels of the plan

@alamb
Copy link
Contributor

alamb commented Mar 31, 2025

I do think theoretically StringArray is likely to be faster than StringViewArray for larger strings in many cases as it is more efficient (it has fewer indirections)

Another direction is probably to rewrite the FilterExec/CoalesenceExec to emit StringArray rather than StringViewArray, the idea is to use StringView in lower levels of the plan and use String in higher levels of the plan

that is a very interesting idea 🤔

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Apr 1, 2025

Thank you @XiangpengHao @alamb , i was thinking to support longer inline prefix for StringView to compare, but it looks like it's always fixed to 4 bytes, we can't change it easily.

we add new new ByteView to support 8bytes prefix

I think Arrow spec says we need to do 4 bytes prefix: https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout

As you have pointed out, StringViewArray is not always better than StringArray, especially when the prefixes are the same.

But I do believe there are micro-architecture level optimizations we can do to improve performance, like better compiler hint, prefetching, gc tuning etc.

Another direction is probably to rewrite the FilterExec/CoalesenceExec to emit StringArray rather than StringViewArray, the idea is to use StringView in lower levels of the plan and use String in higher levels of the plan

I agree, the linked PR using GC to as a workaround for sort merge compare cases.

apache/datafusion#15447

I do think theoretically StringArray is likely to be faster than StringViewArray for larger strings in many cases as it is more efficient (it has fewer indirections)

Another direction is probably to rewrite the FilterExec/CoalesenceExec to emit StringArray rather than StringViewArray, the idea is to use StringView in lower levels of the plan and use String in higher levels of the plan

that is a very interesting idea 🤔

For FilterExec/CoalesenceExec, interesting, this is using GC to reduce the overhead of FilterExec/CoalesenceExec. May be we can try rewrite the FilterExec/CoalesenceExec to emit StringArray and to compare the gain and loss.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

3 participants