-
Notifications
You must be signed in to change notification settings - Fork 923
StringArrayView(Utf8View) slower cases compare to StringArray(Utf8) #7350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
FYI @XiangpengHao |
Can't find a better solution to optimize it besides we add new new ByteView to support 8bytes prefix? But i am not sure if we deserve to do it. |
I don't understand what this means -- perhaps you can make a small PR to demonstrate? |
I think Arrow spec says we need to do 4 bytes prefix: https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout As you have pointed out, StringViewArray is not always better than StringArray, especially when the prefixes are the same. But I do believe there are micro-architecture level optimizations we can do to improve performance, like better compiler hint, prefetching, gc tuning etc. Another direction is probably to rewrite the FilterExec/CoalesenceExec to emit StringArray rather than StringViewArray, the idea is to use StringView in lower levels of the plan and use String in higher levels of the plan |
I do think theoretically StringArray is likely to be faster than StringViewArray for larger strings in many cases as it is more efficient (it has fewer indirections)
that is a very interesting idea 🤔 |
Thank you @XiangpengHao @alamb , i was thinking to support longer inline prefix for StringView to compare, but it looks like it's always fixed to 4 bytes, we can't change it easily.
I agree, the linked PR using GC to as a workaround for sort merge compare cases.
For FilterExec/CoalesenceExec, interesting, this is using GC to reduce the overhead of FilterExec/CoalesenceExec. May be we can try rewrite the FilterExec/CoalesenceExec to emit StringArray and to compare the gain and loss. |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This ticket collected the Utf8View slower than Ut8 cases and try to improve it.
Mostly the cases happen when the string has same 4bytes prefix, but one of the string to compare is larger than 12 bytes, it will make it happen.
Describe the solution you'd like
Make Utf8View regression cases faster.
Describe alternatives you've considered
Make Utf8View regression cases faster.
Additional context
Make Utf8View regression cases faster.
From the benchmark testing from datafusion sort tpch, there are regressions about the Utf8View compare:
Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving merging on a single
Utf8View
) datafusion#15403Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi… datafusion#15447
We'd better to improve it from arrow-rs, so we can benefit a lot for datafusion.
The text was updated successfully, but these errors were encountered: