Investigate why Utf8View performs worse for sort with high cardinality cases and improve it. #15402

Open
zhuqi-lucas opened this issue Mar 25, 2025 · 2 comments
Labels
enhancement New feature or request

zhuqi-lucas commented Mar 25, 2025

Is your feature request related to a problem or challenge?

This is a follow-up to this comment:
#15348 (comment)

Original result:

High Cardinality Performance

| Sorting Method | utf8 Time | utf8_view Time | utf8_view Improvement |
|---|---|---|---|
| merge sorted | 4.6662 ms | 5.0999 ms | -9.3% (slower) |
| sort merge | 4.7102 ms | 5.7224 ms | -21.5% (slower) |
| sort | 7.0020 ms | 6.3274 ms | 9.6% faster |
| sort partitioned | 242.99 µs | 679.86 µs | -180% (much slower) |

Observations

  • utf8_view performs worse for high cardinality cases:
    • merge sorted is 9.3% slower.
    • sort merge is 21.5% slower.
    • sort partitioned is 180% slower, a drastic drop.
  • However, utf8_view still improves the sort method by 9.6%, likely due to reduced string operations.
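For anyone re-checking the numbers, here is a minimal sketch of how the improvement column can be reproduced from the raw times. It assumes the percentage is taken relative to the utf8 time, which matches all four rows; the names `TIMES_MS` and `improvement_pct` are made up for this illustration and are not part of the benchmark code.

```python
# Times in milliseconds, copied from the "High Cardinality Performance" table.
# The "sort partitioned" row is reported in microseconds, so it is converted here.
TIMES_MS = {
    "merge sorted": (4.6662, 5.0999),
    "sort merge": (4.7102, 5.7224),
    "sort": (7.0020, 6.3274),
    "sort partitioned": (242.99 / 1000, 679.86 / 1000),
}


def improvement_pct(utf8_ms: float, utf8_view_ms: float) -> float:
    """Improvement of utf8_view relative to utf8; a negative value means slower."""
    return (utf8_ms - utf8_view_ms) / utf8_ms * 100.0


for name, (utf8_ms, view_ms) in TIMES_MS.items():
    print(f"{name}: {improvement_pct(utf8_ms, view_ms):+.1f}%")
```

With this definition, "sort partitioned" comes out at about -179.8%, which the table rounds to -180%: utf8_view takes roughly 2.8 times as long as utf8 in that case.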

We need to investigate why Utf8View performs worse for high cardinality cases and improve it.
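As background for the investigation, here is a rough sketch of the 16-byte view layout from the Arrow columnar format and of when comparing two views may have to read a separate data buffer. This is only an illustration in Python, not the arrow-rs or DataFusion code, and it does not claim that buffer lookups are the cause of the regression; `make_view` and `may_need_buffer_lookup` are hypothetical helper names.

```python
import struct

INLINE_MAX = 12  # strings of up to 12 bytes are stored entirely inside the view


def make_view(data: bytes, buffer_index: int = 0, offset: int = 0) -> bytes:
    """Pack one 16-byte view following the Arrow binary view layout."""
    if len(data) <= INLINE_MAX:
        # 4-byte length followed by the bytes themselves, zero-padded to 12.
        return struct.pack("<i12s", len(data), data)
    # 4-byte length, 4-byte prefix, 4-byte buffer index, 4-byte offset.
    return struct.pack("<i4sii", len(data), data[:4], buffer_index, offset)


def may_need_buffer_lookup(left: bytes, right: bytes) -> bool:
    """Conservative check: True when ordering may require reading a data buffer."""
    left_len, left_prefix = struct.unpack_from("<i4s", left)
    right_len, right_prefix = struct.unpack_from("<i4s", right)
    if left_prefix != right_prefix:
        return False  # the first four bytes already decide the order
    if left_len <= INLINE_MAX and right_len <= INLINE_MAX:
        return False  # both strings are fully inline in their views
    return True  # at least one string keeps its tail in a data buffer


# Long keys that share their first four bytes cannot be ordered by prefix alone.
print(may_need_buffer_lookup(make_view(b"user_0000000001"),
                             make_view(b"user_0000000002")))
```

If the benchmark's high cardinality keys are longer than 12 bytes and share a common prefix (an assumption, not something verified here), most comparisons would fall into the last branch, which might be one thing worth profiling.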

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@zhuqi-lucas zhuqi-lucas added the enhancement New feature or request label Mar 25, 2025
@zhuqi-lucas zhuqi-lucas changed the title Investigate why Utf8View performs worse for high cardinality cases and improve it. Investigate why Utf8View performs worse for sort with high cardinality cases and improve it. Mar 25, 2025
zhuqi-lucas (Contributor, Author) commented

Latest testing result:

| Benchmark | utf8 Time | utf8_view Time | Change (%) | Performance |
|---|---|---|---|---|
| merge sorted utf8 low cardinality | 3.0828 ms | 2.5903 ms | -16.0% | Improved |
| sort merge utf8 low cardinality | 3.1103 ms | 2.6354 ms | -15.3% | Improved |
| sort utf8 low cardinality | 4.5160 ms | 3.9261 ms | -13.1% | Improved |
| sort partitioned utf8 low cardinality | 193.02 µs | 190.20 µs | -1.5% | Improved |
| merge sorted utf8 high cardinality | 4.5441 ms | 3.9072 ms | -14.0% | Improved |
| sort merge utf8 high cardinality | 4.6552 ms | 4.2851 ms | -8.0% | Improved |
| sort utf8 high cardinality | 5.3901 ms | 4.5989 ms | -14.7% | Improved |
| sort partitioned utf8 high cardinality | 223.88 µs | 2.6200 ms | +1070.0% | Regressed |
| merge sorted utf8 tuple | 10.669 ms | 13.209 ms | +23.8% | Regressed |
| sort merge utf8 tuple | 13.204 ms | 15.009 ms | +13.7% | Regressed |
| sort utf8 tuple | 10.280 ms | 10.098 ms | -1.8% | Improved |
| sort partitioned utf8 tuple | 3.2097 ms | 3.5555 ms | +10.8% | Regressed |
| merge sorted mixed tuple | 9.2576 ms | 10.770 ms | +16.3% | Regressed |
| sort merge mixed tuple | 10.520 ms | 11.871 ms | +12.8% | Regressed |
| sort mixed tuple | 9.4906 ms | 8.7883 ms | -7.4% | Improved |
| sort partitioned mixed tuple | 2.4459 ms | 3.2249 ms | +31.9% | Regressed |

LSUDOKO commented Mar 28, 2025

take
