Skip to content

Add additional benchmarks for utf8view comparison kernels #7351

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Mar 30, 2025

Conversation

zhuqi-lucas
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas commented Mar 28, 2025

Which issue does this PR close?

Rationale for this change

Add reproducer cases which the Utf8View will slower than Utf8

What changes are included in this PR?

Add reproducer cases which the Utf8View will slower than Utf8

Are there any user-facing changes?

No

@zhuqi-lucas
Copy link
Contributor Author

Benchmark Utf8 Time (µs) Utf8View Time (µs) Slower (%)
eq custom 97.426 - 97.940 210.93 - 214.13 116%
neq custom 100.27 - 102.04 211.39 - 212.46 111%
gt custom 84.338 - 84.634 196.90 - 197.52 133%
gt custom scalar 99.384 - 99.817 181.69 - 182.25 83%
gt custom scalar long string 100.32 - 100.65 212.39 - 212.92 112%
like custom scalar 89.111 - 90.811 177.80 - 180.97 100%
nlike custom scalar 87.976 - 89.682 98.116 - 101.02 12%
ilike custom scalar 199.82 - 200.68 312.25 - 314.25 56%
nilike custom scalar 198.95 - 199.67 232.68 - 234.71 17%

@zhuqi-lucas zhuqi-lucas changed the title Slower utf8view reproducer Slower utf8view cases reproducer Mar 28, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @zhuqi-lucas --- this looks great. I can't wait to see the performance improvements.

I left some style comments. My biggest suggestion for this PR is to follow the pattern of the rest of the benchmarks in this function which are organized by operation rather than by data type I think

So for example, put the new like tests with longer strings next to the existing like tests in

// StringArray: LIKE benchmarks
c.bench_function("like_utf8 scalar equals", |b| {
b.iter(|| bench_like_utf8_scalar(&arr_string, "xxxx"))
});

criterion_group!(benches, add_benchmark);
// Generate a string array that meets the requirements:
// All strings start with "test", followed by a tail, ensuring the total length is greater than 12 bytes.
fn make_custom_string_array(size: usize, tail_len: usize, rng: &mut StdRng) -> Vec<Option<String>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could call this long_strings or something as it doesn't really make a string array

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added it to the original benchmark, thanks @alamb.

.collect()
}

fn add_custom_string_benchmarks(c: &mut Criterion) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found "custom" somewhat vague. Perhaps we could use the term long instead

    let long_strings = make_long_string_array(SIZE, 12, &mut rng);
    let long_string_array = StringArray::from(long_strings);
    let long_string_view = StringViewArray::from_iter(long_string_array.iter());

let custom_string_view = StringViewArray::from_iter(custom_string_array.iter());

// Benchmark the eq operation for utf8 (StringArray)
c.bench_function("eq custom utf8", |b| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a name that includes the types of arguments as well as the operator?

Maybe something like

Suggested change
c.bench_function("eq custom utf8", |b| {
c.bench_function("eq utf8 utf8view long", |b| {

And then similarly below

(0..size)
.map(|_| {
// Generate the tail: use visible ASCII characters (32 to 126) to ensure a valid UTF-8 string.
let tail: String = (0..tail_len)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably doesn't matter in this case, but you can avoid at least one allocation like this:

            // Generate the tail: use visible ASCII characters (32 to 126) to ensure a valid UTF-8 string.
            let s: String = "test".chars()
                .chain((0..tail_len).map(|_| rng.random_range(32u8..127u8) as char))
                .collect();
            Some(s)

@alamb alamb changed the title Slower utf8view cases reproducer Add additional benchmarks for utf8view comparison kernels Mar 28, 2025
@zhuqi-lucas
Copy link
Contributor Author

Thank you @zhuqi-lucas --- this looks great. I can't wait to see the performance improvements.

I left some style comments. My biggest suggestion for this PR is to follow the pattern of the rest of the benchmarks in this function which are organized by operation rather than by data type I think

So for example, put the new like tests with longer strings next to the existing like tests in

// StringArray: LIKE benchmarks
c.bench_function("like_utf8 scalar equals", |b| {
b.iter(|| bench_like_utf8_scalar(&arr_string, "xxxx"))
});

Thank you @alamb for review, it makes sense to keep the original format for new benchmark, i will address the comments.

@zhuqi-lucas
Copy link
Contributor Author

zhuqi-lucas commented Mar 29, 2025

Updated the result for the changed code, we need to optimize the compare operation first which is the most regression for same prefix long strings:

📌 UTF-8 vs UTF-8 View Benchmark Comparison

Benchmark StringArray (UTF-8) StringViewArray (UTF-8 View) Performance Change
eq long same prefix strings 294.95 µs 514.73 µs StringArray is ~1.7x faster
neq long same prefix strings 296.12 µs 517.70 µs StringArray is ~1.75x faster
lt long same prefix strings 331.46 µs 493.17 µs StringArray is ~1.5x faster
long same prefix strings like_utf8 scalar equals 181.78 µs 196.91 µs StringArray is ~8.3% faster
long same prefix strings like_utf8 scalar contains 1.7854 ms 1.8405 ms StringArray is ~3% faster
long same prefix strings like_utf8 scalar ends with 583.86 µs 594.73 µs StringArray is ~1.9% faster
long same prefix strings like_utf8 scalar starts with 664.69 µs 672.16 µs StringArray is ~1.1% faster
long same prefix strings like_utf8 scalar complex 590.86 µs 604.81 µs StringArray is ~2.3% faster
long same prefix strings like_utf8view scalar equals 181.78 µs 196.91 µs StringArray is ~8.3% faster
long same prefix strings like_utf8view scalar contains 1.7854 ms 1.8405 ms StringArray is ~3% faster
long same prefix strings like_utf8view scalar ends with 583.86 µs 594.73 µs StringArray is ~1.9% faster
long same prefix strings like_utf8view scalar starts with 664.69 µs 672.16 µs StringArray is ~1.1% faster
long same prefix strings like_utf8view scalar complex 590.86 µs 604.81 µs StringArray is ~2.3% faster

We can currently only adding those testing, it's enough for us to improve the code and testing again. Because the compare_unchecked function is which we want to improve, it mostly used by all compare function for stringview, other function such as like or regex related, we can improve and testing after it.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @zhuqi-lucas -- looks great to me

@alamb
Copy link
Contributor

alamb commented Mar 30, 2025

We can currently only adding those testing, it's enough for us to improve the code and testing again. Because the compare_unchecked function is which we want to improve, it mostly used by all compare function for stringview, other function such as like or regex related, we can improve and testing after it.

I agree

Thanks again for pushing this work forward

@alamb alamb merged commit 99b2b3b into apache:main Mar 30, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants