Remove the branches from len_utf8
#125129
Conversation
This changes `len_utf8` to add all of the range comparisons together, rather than branching on each one. We should definitely test performance though, because it's possible that this will pessimize mostly-ascii inputs that would have had a short branch-predicted path before.
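The change described above can be sketched as a small standalone function. This is an illustrative sketch, not the PR's exact diff; the constant names mirror the stdlib's internal ones:

```rust
// Branchless len_utf8: sum the three range comparisons instead of
// branching on each one. Each comparison contributes 0 or 1, so the
// result is always 1..=4.
const MAX_ONE_B: u32 = 0x80;
const MAX_TWO_B: u32 = 0x800;
const MAX_THREE_B: u32 = 0x10000;

const fn len_utf8(code: u32) -> usize {
    1 + (code >= MAX_ONE_B) as usize
        + (code >= MAX_TWO_B) as usize
        + (code >= MAX_THREE_B) as usize
}

fn main() {
    assert_eq!(len_utf8('A' as u32), 1);
    assert_eq!(len_utf8('д' as u32), 2);
    assert_eq!(len_utf8('❗' as u32), 3);
    assert_eq!(len_utf8('🤨' as u32), 4);
    println!("ok");
}
```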
rustbot has assigned @Mark-Simulacrum.
Here's a godbolt comparison. For a single character, it looks like this:

Before:

    len_char:
            mov     eax, 1
            cmp     edi, 128
            jb      .LBB0_3
            mov     eax, 2
            cmp     edi, 2048
            jb      .LBB0_3
            cmp     edi, 65536
            mov     eax, 4
            sbb     rax, 0
    .LBB0_3:
            ret

After:

    len_char:
            xor     eax, eax
            cmp     edi, 127
            seta    al
            cmp     edi, 2048
            sbb     rax, -1
            cmp     edi, 65536
            sbb     rax, -1
            inc     rax
            ret

I also included an example of summing. cc @lincot @scottmcm -- this may be relevant to #124810 too.
@bors try @rust-timer queue
☀️ Try build successful - checks-actions
In a benchmark, current:

Branchless:

bench.rs:

#![feature(test)]
extern crate test;
use core::{array, mem::MaybeUninit};
use rand::seq::SliceRandom;
use rand_pcg::Pcg64Mcg;
use test::{black_box, Bencher};
const TAG_CONT: u8 = 0b1000_0000;
const TAG_TWO_B: u8 = 0b1100_0000;
const TAG_THREE_B: u8 = 0b1110_0000;
const TAG_FOUR_B: u8 = 0b1111_0000;
const MAX_ONE_B: u32 = 0x80;
const MAX_TWO_B: u32 = 0x800;
const MAX_THREE_B: u32 = 0x10000;
#[inline]
const fn len_utf8(code: u32) -> usize {
const BRANCHLESS: bool = true;
if BRANCHLESS {
1 + ((code >= MAX_ONE_B) as usize)
+ ((code >= MAX_TWO_B) as usize)
+ ((code >= MAX_THREE_B) as usize)
} else {
if code < MAX_ONE_B {
1
} else if code < MAX_TWO_B {
2
} else if code < MAX_THREE_B {
3
} else {
4
}
}
}
#[inline]
fn len_chars(cs: &[char]) -> usize {
cs.iter().map(|&c| len_utf8(c as u32)).sum()
}
#[inline]
pub fn push(s: &mut String, ch: char) {
let len = s.len();
let ch_len = len_utf8(ch as u32);
s.reserve(ch_len);
// SAFETY: at least the length needed to encode `ch`
// has been reserved in `self`
unsafe {
encode_utf8_raw_unchecked(ch as u32, s.as_mut_vec().spare_capacity_mut());
s.as_mut_vec().set_len(len + ch_len);
}
}
#[inline]
pub fn encode_utf8_raw(code: u32, dst: &mut [u8]) -> &mut [u8] {
let len = len_utf8(code);
if dst.len() < len {
panic!(
"encode_utf8: need {} bytes to encode U+{:X}, but the buffer has {}",
len,
code,
dst.len(),
);
}
// SAFETY: `encode_utf8_raw_unchecked` only writes initialized bytes to the slice,
// `dst` has been checked to be long enough to hold the encoded codepoint
unsafe { encode_utf8_raw_unchecked(code, &mut *(dst as *mut [u8] as *mut [MaybeUninit<u8>])) }
}
#[inline]
pub unsafe fn encode_utf8_raw_unchecked(code: u32, dst: &mut [MaybeUninit<u8>]) -> &mut [u8] {
let len = len_utf8(code);
// SAFETY: the caller must guarantee that `dst` is at least `len` bytes long
unsafe {
match len {
1 => {
dst.get_unchecked_mut(0).write(code as u8);
}
2 => {
dst.get_unchecked_mut(0)
.write((code >> 6 & 0x1F) as u8 | TAG_TWO_B);
dst.get_unchecked_mut(1)
.write((code & 0x3F) as u8 | TAG_CONT);
}
3 => {
dst.get_unchecked_mut(0)
.write((code >> 12 & 0x0F) as u8 | TAG_THREE_B);
dst.get_unchecked_mut(1)
.write((code >> 6 & 0x3F) as u8 | TAG_CONT);
dst.get_unchecked_mut(2)
.write((code & 0x3F) as u8 | TAG_CONT);
}
4 => {
dst.get_unchecked_mut(0)
.write((code >> 18 & 0x07) as u8 | TAG_FOUR_B);
dst.get_unchecked_mut(1)
.write((code >> 12 & 0x3F) as u8 | TAG_CONT);
dst.get_unchecked_mut(2)
.write((code >> 6 & 0x3F) as u8 | TAG_CONT);
dst.get_unchecked_mut(3)
.write((code & 0x3F) as u8 | TAG_CONT);
}
_ => unreachable!(),
}
}
// SAFETY: data has been written to the first `len` bytes
unsafe { &mut *(dst.get_unchecked_mut(..len) as *mut [MaybeUninit<u8>] as *mut [u8]) }
}
#[bench]
fn bench_len_chars(bencher: &mut Bencher) {
const BYTES: usize = 1024;
bencher.bytes = BYTES as _;
let mut rng = Pcg64Mcg::new(0xcafe_f00d_d15e_a5e5);
let cs: [_; BYTES] = array::from_fn(|_| *['0', 'д', '❗', '🤨'].choose(&mut rng).unwrap());
bencher.iter(|| len_chars(black_box(&cs)));
}
const ITERATIONS: u64 = if cfg!(miri) { 1 } else { 10_000 };
#[bench]
fn bench_push_1_byte(bencher: &mut Bencher) {
const CHAR: char = '0';
assert_eq!(CHAR.len_utf8(), 1);
bencher.bytes = ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity(ITERATIONS as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}
#[bench]
fn bench_push_2_bytes(bencher: &mut Bencher) {
const CHAR: char = 'д';
assert_eq!(CHAR.len_utf8(), 2);
bencher.bytes = 2 * ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity((2 * ITERATIONS) as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}
#[bench]
fn bench_push_3_bytes(bencher: &mut Bencher) {
const CHAR: char = '❗';
assert_eq!(CHAR.len_utf8(), 3);
bencher.bytes = 3 * ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity((3 * ITERATIONS) as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}
#[bench]
fn bench_push_4_bytes(bencher: &mut Bencher) {
const CHAR: char = '🤨';
assert_eq!(CHAR.len_utf8(), 4);
bencher.bytes = 4 * ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity((4 * ITERATIONS) as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}

So despite autovectorization, it appears to be slower for Also, curiously, if we hint the compiler that the branchless version equals
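The sentence above is cut off in the thread, so the exact hint meant is unclear. One plausible kind of hint, sketched here as my own guess, is an (unsafe) assumption that the branchless result stays within 1..=4, which lets the encoder's `match` drop its unreachable arm:

```rust
// Sketch of hinting the optimizer about the result range (an assumption
// about what the truncated comment meant, not necessarily the trick used).
const MAX_ONE_B: u32 = 0x80;
const MAX_TWO_B: u32 = 0x800;
const MAX_THREE_B: u32 = 0x10000;

fn len_utf8_hinted(code: u32) -> usize {
    let len = 1
        + (code >= MAX_ONE_B) as usize
        + (code >= MAX_TWO_B) as usize
        + (code >= MAX_THREE_B) as usize;
    // SAFETY: 1 plus three 0/1 terms is always in 1..=4.
    unsafe { core::hint::assert_unchecked((1..=4).contains(&len)) };
    len
}

fn main() {
    assert_eq!(len_utf8_hinted(0x41), 1);
    assert_eq!(len_utf8_hinted(0x10FFFF), 4);
    println!("ok");
}
```

`core::hint::assert_unchecked` is stable as of Rust 1.81; on older toolchains the same effect needs `unreachable_unchecked` in an explicit branch.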
Thanks for the benchmark! I agree that ASCII is taking a hit, like I originally suspected, but my results look more favorable on the vectorized sum.

AMD Ryzen 7 5800X, original:
branchless:
AMD Ryzen 7 7700X, original:
branchless:
Intel i7-1365U, original:
branchless:
All of these were using the current nightly on Fedora 40, with default target options.
I'll still wait for results from the perf server, but I'll be fine with closing this if there's no clear gain, which seems likely.
Can I suggest a hybrid?

    pub fn len_utf8_semibranchless(code: u32) -> usize {
        if code < MAX_ONE_B {
            1
        } else {
            2 + ((code >= MAX_TWO_B) as usize)
              + ((code >= MAX_THREE_B) as usize)
        }
    }
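The hybrid above can be checked against the stdlib's `char::len_utf8` for every scalar value (a sanity check of mine, not part of the suggestion):

```rust
// Exhaustively verify the hybrid: one branch for the ASCII fast path,
// branchless arithmetic for everything else.
const MAX_ONE_B: u32 = 0x80;
const MAX_TWO_B: u32 = 0x800;
const MAX_THREE_B: u32 = 0x10000;

pub fn len_utf8_semibranchless(code: u32) -> usize {
    if code < MAX_ONE_B {
        1
    } else {
        2 + (code >= MAX_TWO_B) as usize + (code >= MAX_THREE_B) as usize
    }
}

fn main() {
    for c in '\0'..='\u{10FFFF}' {
        assert_eq!(len_utf8_semibranchless(c as u32), c.len_utf8());
    }
    println!("ok");
}
```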
Finished benchmarking commit (1c3131c): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): Results (primary 0.8%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This benchmark run did not return any relevant results for this metric.

Binary size: Results (primary -0.0%, secondary 0.1%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Bootstrap: 679.638s -> 678.862s (-0.11%)
I do wonder if we should lean into the ASCII branch more. Letting the branch predictor assist runs of ASCII or non-ASCII, which are probably fairly common, might end up being a good idea, like we have branchy fast-paths for ASCII in the UTF-8 checks.
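As an illustration of that idea (my own sketch, not a proposed diff): keep a cheap, well-predicted branch for the ASCII case and fall back to the branchless arithmetic only for non-ASCII characters:

```rust
// "Lean into the ASCII branch": a predictable branch handles runs of
// ASCII, while non-ASCII code points use the branchless sum.
fn len_chars_ascii_biased(cs: &[char]) -> usize {
    let mut total = 0;
    for &c in cs {
        let code = c as u32;
        if code < 0x80 {
            // Fast path: runs of ASCII keep this branch predicted.
            total += 1;
        } else {
            total += 2 + (code >= 0x800) as usize + (code >= 0x10000) as usize;
        }
    }
    total
}

fn main() {
    assert_eq!(len_chars_ascii_biased(&['a', 'д', '❗', '🤨']), 1 + 2 + 3 + 4);
    println!("ok");
}
```

Whether this beats the fully branchless form depends on the input mix, which is exactly what the benchmarks in this thread are probing.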
Indeed, it was the

Original:
Branchless:
Semibranchless:
Trying @orlp's suggestion...

@bors try @rust-timer queue
☀️ Try build successful - checks-actions
Finished benchmarking commit (681b867): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): Results (primary -2.3%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This benchmark run did not return any relevant results for this metric.

Binary size: Results (primary 0.0%, secondary 0.1%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Bootstrap: 678.357s -> 680.336s (0.29%)
The semibranchless version also doesn't allow vectorization. I've added a benchmark for random chars.

Benchmark results:

Branchy, then branchy:
Branchless, then branchy:
Semibranchless, then branchy:
Branchless, then branchless:
bench.rs:

#![feature(test)]
extern crate test;
use core::{array, mem::MaybeUninit};
use rand::seq::SliceRandom;
use rand_pcg::Pcg64Mcg;
use test::{black_box, Bencher};
const TAG_CONT: u8 = 0b1000_0000;
const TAG_TWO_B: u8 = 0b1100_0000;
const TAG_THREE_B: u8 = 0b1110_0000;
const TAG_FOUR_B: u8 = 0b1111_0000;
const MAX_ONE_B: u32 = 0x80;
const MAX_TWO_B: u32 = 0x800;
const MAX_THREE_B: u32 = 0x10000;
#[inline]
const fn len_utf8_branchy(code: u32) -> usize {
if code < MAX_ONE_B {
1
} else if code < MAX_TWO_B {
2
} else if code < MAX_THREE_B {
3
} else {
4
}
}
#[inline]
const fn len_utf8_branchless(code: u32) -> usize {
1 + ((code >= MAX_ONE_B) as usize)
+ ((code >= MAX_TWO_B) as usize)
+ ((code >= MAX_THREE_B) as usize)
}
#[inline]
const fn len_utf8_semibranchless(code: u32) -> usize {
if code < MAX_ONE_B {
1
} else {
2 + ((code >= MAX_TWO_B) as usize) + ((code >= MAX_THREE_B) as usize)
}
}
#[inline]
pub fn push(s: &mut String, ch: char) {
let len = s.len();
let ch_len = len_utf8_branchless(ch as u32);
s.reserve(ch_len);
// SAFETY: at least the length needed to encode `ch`
// has been reserved in `self`
unsafe {
encode_utf8_raw_unchecked(ch as u32, s.as_mut_vec().spare_capacity_mut());
s.as_mut_vec().set_len(len + ch_len);
}
}
#[inline]
pub unsafe fn encode_utf8_raw_unchecked(code: u32, dst: &mut [MaybeUninit<u8>]) -> &mut [u8] {
let len = len_utf8_branchy(code);
// SAFETY: the caller must guarantee that `dst` is at least `len` bytes long
unsafe {
match len {
1 => {
dst.get_unchecked_mut(0).write(code as u8);
}
2 => {
dst.get_unchecked_mut(0)
.write((code >> 6 & 0x1F) as u8 | TAG_TWO_B);
dst.get_unchecked_mut(1)
.write((code & 0x3F) as u8 | TAG_CONT);
}
3 => {
dst.get_unchecked_mut(0)
.write((code >> 12 & 0x0F) as u8 | TAG_THREE_B);
dst.get_unchecked_mut(1)
.write((code >> 6 & 0x3F) as u8 | TAG_CONT);
dst.get_unchecked_mut(2)
.write((code & 0x3F) as u8 | TAG_CONT);
}
4 => {
dst.get_unchecked_mut(0)
.write((code >> 18 & 0x07) as u8 | TAG_FOUR_B);
dst.get_unchecked_mut(1)
.write((code >> 12 & 0x3F) as u8 | TAG_CONT);
dst.get_unchecked_mut(2)
.write((code >> 6 & 0x3F) as u8 | TAG_CONT);
dst.get_unchecked_mut(3)
.write((code & 0x3F) as u8 | TAG_CONT);
}
_ => unreachable!(),
}
}
// SAFETY: data has been written to the first `len` bytes
unsafe { &mut *(dst.get_unchecked_mut(..len) as *mut [MaybeUninit<u8>] as *mut [u8]) }
}
const ITERATIONS: u64 = if cfg!(miri) { 1 } else { 10_000 };
#[bench]
fn bench_push_1_byte(bencher: &mut Bencher) {
const CHAR: char = '0';
assert_eq!(CHAR.len_utf8(), 1);
bencher.bytes = ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity(ITERATIONS as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}
#[bench]
fn bench_push_2_bytes(bencher: &mut Bencher) {
const CHAR: char = 'д';
assert_eq!(CHAR.len_utf8(), 2);
bencher.bytes = 2 * ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity((2 * ITERATIONS) as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}
#[bench]
fn bench_push_3_bytes(bencher: &mut Bencher) {
const CHAR: char = '❗';
assert_eq!(CHAR.len_utf8(), 3);
bencher.bytes = 3 * ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity((3 * ITERATIONS) as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}
#[bench]
fn bench_push_4_bytes(bencher: &mut Bencher) {
const CHAR: char = '🤨';
assert_eq!(CHAR.len_utf8(), 4);
bencher.bytes = 4 * ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity((4 * ITERATIONS) as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}
#[bench]
fn bench_push_random_bytes(bencher: &mut Bencher) {
bencher.bytes = (2 + 3) * ITERATIONS / 2;
let mut rng = Pcg64Mcg::new(0xcafe_f00d_d15e_a5e5);
let input: [_; ITERATIONS as usize] =
array::from_fn(|_| *['0', 'д', '❗', '🤨'].choose(&mut rng).unwrap());
bencher.iter(|| {
let mut s = String::with_capacity((4 * ITERATIONS) as _);
for c in input {
push(&mut s, black_box(c));
}
s
});
}
The results look pretty neutral to me, @cuviper -- feel free to close or continue iterating.
I don't think I'll spend any more time here, but anyone else is welcome to continue with the idea if they like. |
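For anyone picking this up: one more branchless formulation worth trying (my own sketch, not from this PR) replaces the three comparisons with a tiny lookup table indexed by the bit length of the code point:

```rust
// len_utf8 via bit length: the UTF-8 length depends only on the position
// of the highest set bit. k = bit length of (code | 1), so k is 1..=32;
// k <= 7 -> 1 byte, k <= 11 -> 2, k <= 16 -> 3, otherwise 4.
fn len_utf8_bitlen(code: u32) -> usize {
    const LEN: [u8; 33] = [
        1, 1, 1, 1, 1, 1, 1, 1, // k = 0..=7 (k = 0 can't occur)
        2, 2, 2, 2, // k = 8..=11
        3, 3, 3, 3, 3, // k = 12..=16
        4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, // k = 17..=32
    ];
    LEN[(32 - (code | 1).leading_zeros()) as usize] as usize
}

fn main() {
    for c in '\0'..='\u{10FFFF}' {
        assert_eq!(len_utf8_bitlen(c as u32), c.len_utf8());
    }
    println!("ok");
}
```

Whether `lzcnt` plus a table load beats three `cmp`/`sbb` pairs would need the same kind of benchmarking as above.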