-
Notifications
You must be signed in to change notification settings - Fork 924
Add additional benchmarks for utf8view comparison kernels #7351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @zhuqi-lucas --- this looks great. I can't wait to see the performance improvements.
I left some style comments. My biggest suggestion for this PR is to follow the pattern of the rest of the benchmarks in this function which are organized by operation rather than by data type I think
So for example, put the new like
tests with longer strings next to the existing like
tests in
arrow-rs/arrow/benches/comparison_kernels.rs
Lines 228 to 232 in a2cc426
// StringArray: LIKE benchmarks | |
c.bench_function("like_utf8 scalar equals", |b| { | |
b.iter(|| bench_like_utf8_scalar(&arr_string, "xxxx")) | |
}); |
arrow/benches/comparison_kernels.rs
Outdated
criterion_group!(benches, add_benchmark); | ||
// Generate a string array that meets the requirements: | ||
// All strings start with "test", followed by a tail, ensuring the total length is greater than 12 bytes. | ||
fn make_custom_string_array(size: usize, tail_len: usize, rng: &mut StdRng) -> Vec<Option<String>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could call this long_strings
or something as it doesn't really make a string array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added it to the original benchmark, thanks @alamb.
arrow/benches/comparison_kernels.rs
Outdated
.collect() | ||
} | ||
|
||
fn add_custom_string_benchmarks(c: &mut Criterion) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found "custom" somewhat vague. Perhaps we could use the term long
instead
let long_strings = make_long_string_array(SIZE, 12, &mut rng);
let long_string_array = StringArray::from(long_strings);
let long_string_view = StringViewArray::from_iter(long_string_array.iter());
arrow/benches/comparison_kernels.rs
Outdated
let custom_string_view = StringViewArray::from_iter(custom_string_array.iter()); | ||
|
||
// Benchmark the eq operation for utf8 (StringArray) | ||
c.bench_function("eq custom utf8", |b| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use a name that includes the types of arguments as well as the operator?
Maybe something like
c.bench_function("eq custom utf8", |b| { | |
c.bench_function("eq utf8 utf8view long", |b| { |
And then similarly below
arrow/benches/comparison_kernels.rs
Outdated
(0..size) | ||
.map(|_| { | ||
// Generate the tail: use visible ASCII characters (32 to 126) to ensure a valid UTF-8 string. | ||
let tail: String = (0..tail_len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably doesn't matter in this case, but you can avoid at least one allocation like this:
// Generate the tail: use visible ASCII characters (32 to 126) to ensure a valid UTF-8 string.
let s: String = "test".chars()
.chain((0..tail_len).map(|_| rng.random_range(32u8..127u8) as char))
.collect();
Some(s)
Thank you @alamb for review, it makes sense to keep the original format for new benchmark, i will address the comments. |
Updated the result for the changed code, we need to optimize the compare operation first which is the most regression for same prefix long strings: 📌 UTF-8 vs UTF-8 View Benchmark Comparison
We can currently only adding those testing, it's enough for us to improve the code and testing again. Because the compare_unchecked function is which we want to improve, it mostly used by all compare function for stringview, other function such as like or regex related, we can improve and testing after it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @zhuqi-lucas -- looks great to me
I agree Thanks again for pushing this work forward |
Which issue does this PR close?
Rationale for this change
Add reproducer cases which the Utf8View will slower than Utf8
What changes are included in this PR?
Add reproducer cases which the Utf8View will slower than Utf8
Are there any user-facing changes?
No