Remove the branches from len_utf8
#125129
Conversation
This changes `len_utf8` to add all of the range comparisons together, rather than branching on each one. We should definitely test performance though, because it's possible that this will pessimize mostly-ascii inputs that would have had a short branch-predicted path before.
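The change described above can be sketched as a small standalone function. This is an illustrative sketch, not the PR's exact diff; the constant names mirror the stdlib's internal ones:

```rust
// Branchless len_utf8: sum the three range comparisons instead of
// branching on each one. Each comparison contributes 0 or 1, so the
// result is always 1..=4.
const MAX_ONE_B: u32 = 0x80;
const MAX_TWO_B: u32 = 0x800;
const MAX_THREE_B: u32 = 0x10000;

const fn len_utf8(code: u32) -> usize {
    1 + (code >= MAX_ONE_B) as usize
        + (code >= MAX_TWO_B) as usize
        + (code >= MAX_THREE_B) as usize
}

fn main() {
    assert_eq!(len_utf8('A' as u32), 1);
    assert_eq!(len_utf8('д' as u32), 2);
    assert_eq!(len_utf8('❗' as u32), 3);
    assert_eq!(len_utf8('🤨' as u32), 4);
    println!("ok");
}
```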
rustbot has assigned @Mark-Simulacrum.
Here's a godbolt comparison. For a single character, it looks like this:

Before:

    len_char:
            mov     eax, 1
            cmp     edi, 128
            jb      .LBB0_3
            mov     eax, 2
            cmp     edi, 2048
            jb      .LBB0_3
            cmp     edi, 65536
            mov     eax, 4
            sbb     rax, 0
    .LBB0_3:
            ret

After:

    len_char:
            xor     eax, eax
            cmp     edi, 127
            seta    al
            cmp     edi, 2048
            sbb     rax, -1
            cmp     edi, 65536
            sbb     rax, -1
            inc     rax
            ret

I also included an example of summing. cc @lincot @scottmcm -- this may be relevant to #124810 too.
@bors try @rust-timer queue
☀️ Try build successful - checks-actions
In a benchmark, current:

Branchless:

bench.rs:

#![feature(test)]
extern crate test;
use core::{array, mem::MaybeUninit};
use rand::seq::SliceRandom;
use rand_pcg::Pcg64Mcg;
use test::{black_box, Bencher};
const TAG_CONT: u8 = 0b1000_0000;
const TAG_TWO_B: u8 = 0b1100_0000;
const TAG_THREE_B: u8 = 0b1110_0000;
const TAG_FOUR_B: u8 = 0b1111_0000;
const MAX_ONE_B: u32 = 0x80;
const MAX_TWO_B: u32 = 0x800;
const MAX_THREE_B: u32 = 0x10000;
#[inline]
const fn len_utf8(code: u32) -> usize {
const BRANCHLESS: bool = true;
if BRANCHLESS {
1 + ((code >= MAX_ONE_B) as usize)
+ ((code >= MAX_TWO_B) as usize)
+ ((code >= MAX_THREE_B) as usize)
} else {
if code < MAX_ONE_B {
1
} else if code < MAX_TWO_B {
2
} else if code < MAX_THREE_B {
3
} else {
4
}
}
}
#[inline]
fn len_chars(cs: &[char]) -> usize {
cs.iter().map(|&c| len_utf8(c as u32)).sum()
}
#[inline]
pub fn push(s: &mut String, ch: char) {
let len = s.len();
let ch_len = len_utf8(ch as u32);
s.reserve(ch_len);
// SAFETY: at least the length needed to encode `ch`
// has been reserved in `self`
unsafe {
encode_utf8_raw_unchecked(ch as u32, s.as_mut_vec().spare_capacity_mut());
s.as_mut_vec().set_len(len + ch_len);
}
}
#[inline]
pub fn encode_utf8_raw(code: u32, dst: &mut [u8]) -> &mut [u8] {
let len = len_utf8(code);
if dst.len() < len {
panic!(
"encode_utf8: need {} bytes to encode U+{:X}, but the buffer has {}",
len,
code,
dst.len(),
);
}
// SAFETY: `encode_utf8_raw_unchecked` only writes initialized bytes to the slice,
// `dst` has been checked to be long enough to hold the encoded codepoint
unsafe { encode_utf8_raw_unchecked(code, &mut *(dst as *mut [u8] as *mut [MaybeUninit<u8>])) }
}
#[inline]
pub unsafe fn encode_utf8_raw_unchecked(code: u32, dst: &mut [MaybeUninit<u8>]) -> &mut [u8] {
let len = len_utf8(code);
// SAFETY: the caller must guarantee that `dst` is at least `len` bytes long
unsafe {
match len {
1 => {
dst.get_unchecked_mut(0).write(code as u8);
}
2 => {
dst.get_unchecked_mut(0)
.write((code >> 6 & 0x1F) as u8 | TAG_TWO_B);
dst.get_unchecked_mut(1)
.write((code & 0x3F) as u8 | TAG_CONT);
}
3 => {
dst.get_unchecked_mut(0)
.write((code >> 12 & 0x0F) as u8 | TAG_THREE_B);
dst.get_unchecked_mut(1)
.write((code >> 6 & 0x3F) as u8 | TAG_CONT);
dst.get_unchecked_mut(2)
.write((code & 0x3F) as u8 | TAG_CONT);
}
4 => {
dst.get_unchecked_mut(0)
.write((code >> 18 & 0x07) as u8 | TAG_FOUR_B);
dst.get_unchecked_mut(1)
.write((code >> 12 & 0x3F) as u8 | TAG_CONT);
dst.get_unchecked_mut(2)
.write((code >> 6 & 0x3F) as u8 | TAG_CONT);
dst.get_unchecked_mut(3)
.write((code & 0x3F) as u8 | TAG_CONT);
}
_ => unreachable!(),
}
}
// SAFETY: data has been written to the first `len` bytes
unsafe { &mut *(dst.get_unchecked_mut(..len) as *mut [MaybeUninit<u8>] as *mut [u8]) }
}
#[bench]
fn bench_len_chars(bencher: &mut Bencher) {
const BYTES: usize = 1024;
bencher.bytes = BYTES as _;
let mut rng = Pcg64Mcg::new(0xcafe_f00d_d15e_a5e5);
let cs: [_; BYTES] = array::from_fn(|_| *['0', 'д', '❗', '🤨'].choose(&mut rng).unwrap());
bencher.iter(|| len_chars(black_box(&cs)));
}
const ITERATIONS: u64 = if cfg!(miri) { 1 } else { 10_000 };
#[bench]
fn bench_push_1_byte(bencher: &mut Bencher) {
const CHAR: char = '0';
assert_eq!(CHAR.len_utf8(), 1);
bencher.bytes = ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity(ITERATIONS as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}
#[bench]
fn bench_push_2_bytes(bencher: &mut Bencher) {
const CHAR: char = 'д';
assert_eq!(CHAR.len_utf8(), 2);
bencher.bytes = 2 * ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity((2 * ITERATIONS) as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}
#[bench]
fn bench_push_3_bytes(bencher: &mut Bencher) {
const CHAR: char = '❗';
assert_eq!(CHAR.len_utf8(), 3);
bencher.bytes = 3 * ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity((3 * ITERATIONS) as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}
#[bench]
fn bench_push_4_bytes(bencher: &mut Bencher) {
const CHAR: char = '🤨';
assert_eq!(CHAR.len_utf8(), 4);
bencher.bytes = 4 * ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity((4 * ITERATIONS) as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}

So despite autovectorization, it appears to be slower for Also, curiously, if we hint the compiler that the branchless version equals
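The sentence above is cut off in the thread, so the exact hint meant is unclear. One plausible kind of hint, sketched here as my own guess, is an (unsafe) assumption that the branchless result stays within 1..=4, which lets the encoder's `match` drop its unreachable arm:

```rust
// Sketch of hinting the optimizer about the result range (an assumption
// about what the truncated comment meant, not necessarily the trick used).
const MAX_ONE_B: u32 = 0x80;
const MAX_TWO_B: u32 = 0x800;
const MAX_THREE_B: u32 = 0x10000;

fn len_utf8_hinted(code: u32) -> usize {
    let len = 1
        + (code >= MAX_ONE_B) as usize
        + (code >= MAX_TWO_B) as usize
        + (code >= MAX_THREE_B) as usize;
    // SAFETY: 1 plus three 0/1 terms is always in 1..=4.
    unsafe { core::hint::assert_unchecked((1..=4).contains(&len)) };
    len
}

fn main() {
    assert_eq!(len_utf8_hinted(0x41), 1);
    assert_eq!(len_utf8_hinted(0x10FFFF), 4);
    println!("ok");
}
```

`core::hint::assert_unchecked` is stable as of Rust 1.81; on older toolchains the same effect needs `unreachable_unchecked` in an explicit branch.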
Thanks for the benchmark! I agree that ASCII is taking a hit, like I originally suspected, but my results look more favorable on the vectorized sum.

AMD Ryzen 7 5800X, original:
branchless:
AMD Ryzen 7 7700X, original:
branchless:
Intel i7-1365U, original:
branchless:
All of these were using the current nightly on Fedora 40, with default target options.
I'll still wait for results from the perf server, but I'll be fine with closing this if there's no clear gain, which seems likely.
Can I suggest a hybrid?

    pub fn len_utf8_semibranchless(code: u32) -> usize {
        if code < MAX_ONE_B {
            1
        } else {
            2 + ((code >= MAX_TWO_B) as usize)
              + ((code >= MAX_THREE_B) as usize)
        }
    }
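The hybrid above can be checked against the stdlib's `char::len_utf8` for every scalar value (a sanity check of mine, not part of the suggestion):

```rust
// Exhaustively verify the hybrid: one branch for the ASCII fast path,
// branchless arithmetic for everything else.
const MAX_ONE_B: u32 = 0x80;
const MAX_TWO_B: u32 = 0x800;
const MAX_THREE_B: u32 = 0x10000;

pub fn len_utf8_semibranchless(code: u32) -> usize {
    if code < MAX_ONE_B {
        1
    } else {
        2 + (code >= MAX_TWO_B) as usize + (code >= MAX_THREE_B) as usize
    }
}

fn main() {
    for c in '\0'..='\u{10FFFF}' {
        assert_eq!(len_utf8_semibranchless(c as u32), c.len_utf8());
    }
    println!("ok");
}
```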
Finished benchmarking commit (1c3131c): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): Results (primary 0.8%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This benchmark run did not return any relevant results for this metric.

Binary size: Results (primary -0.0%, secondary 0.1%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Bootstrap: 679.638s -> 678.862s (-0.11%)
I do wonder if we should lean into the ASCII branch more. Letting the branch predictor assist runs of ASCII or non-ASCII, which are probably fairly common, might end up being a good idea, like we have branchy fast-paths for ASCII in the UTF-8 checks.
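As an illustration of that idea (my own sketch, not a proposed diff): keep a cheap, well-predicted branch for the ASCII case and fall back to the branchless arithmetic only for non-ASCII characters:

```rust
// "Lean into the ASCII branch": a predictable branch handles runs of
// ASCII, while non-ASCII code points use the branchless sum.
fn len_chars_ascii_biased(cs: &[char]) -> usize {
    let mut total = 0;
    for &c in cs {
        let code = c as u32;
        if code < 0x80 {
            // Fast path: runs of ASCII keep this branch predicted.
            total += 1;
        } else {
            total += 2 + (code >= 0x800) as usize + (code >= 0x10000) as usize;
        }
    }
    total
}

fn main() {
    assert_eq!(len_chars_ascii_biased(&['a', 'д', '❗', '🤨']), 1 + 2 + 3 + 4);
    println!("ok");
}
```

Whether this beats the fully branchless form depends on the input mix, which is exactly what the benchmarks in this thread are probing.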
Indeed, it was the

Original:
Branchless:
Semibranchless:
Trying @orlp's suggestion...

@bors try @rust-timer queue
☀️ Try build successful - checks-actions
Finished benchmarking commit (681b867): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): Results (primary -2.3%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This benchmark run did not return any relevant results for this metric.

Binary size: Results (primary 0.0%, secondary 0.1%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Bootstrap: 678.357s -> 680.336s (0.29%)
The semibranchless version also doesn't allow vectorization. I've added a benchmark for random chars.

Benchmark results:

Branchy, then branchy:
Branchless, then branchy:
Semibranchless, then branchy:
Branchless, then branchless:
bench.rs:

#![feature(test)]
extern crate test;
use core::{array, mem::MaybeUninit};
use rand::seq::SliceRandom;
use rand_pcg::Pcg64Mcg;
use test::{black_box, Bencher};
const TAG_CONT: u8 = 0b1000_0000;
const TAG_TWO_B: u8 = 0b1100_0000;
const TAG_THREE_B: u8 = 0b1110_0000;
const TAG_FOUR_B: u8 = 0b1111_0000;
const MAX_ONE_B: u32 = 0x80;
const MAX_TWO_B: u32 = 0x800;
const MAX_THREE_B: u32 = 0x10000;
#[inline]
const fn len_utf8_branchy(code: u32) -> usize {
if code < MAX_ONE_B {
1
} else if code < MAX_TWO_B {
2
} else if code < MAX_THREE_B {
3
} else {
4
}
}
#[inline]
const fn len_utf8_branchless(code: u32) -> usize {
1 + ((code >= MAX_ONE_B) as usize)
+ ((code >= MAX_TWO_B) as usize)
+ ((code >= MAX_THREE_B) as usize)
}
#[inline]
const fn len_utf8_semibranchless(code: u32) -> usize {
if code < MAX_ONE_B {
1
} else {
2 + ((code >= MAX_TWO_B) as usize) + ((code >= MAX_THREE_B) as usize)
}
}
#[inline]
pub fn push(s: &mut String, ch: char) {
let len = s.len();
let ch_len = len_utf8_branchless(ch as u32);
s.reserve(ch_len);
// SAFETY: at least the length needed to encode `ch`
// has been reserved in `self`
unsafe {
encode_utf8_raw_unchecked(ch as u32, s.as_mut_vec().spare_capacity_mut());
s.as_mut_vec().set_len(len + ch_len);
}
}
#[inline]
pub unsafe fn encode_utf8_raw_unchecked(code: u32, dst: &mut [MaybeUninit<u8>]) -> &mut [u8] {
let len = len_utf8_branchy(code);
// SAFETY: the caller must guarantee that `dst` is at least `len` bytes long
unsafe {
match len {
1 => {
dst.get_unchecked_mut(0).write(code as u8);
}
2 => {
dst.get_unchecked_mut(0)
.write((code >> 6 & 0x1F) as u8 | TAG_TWO_B);
dst.get_unchecked_mut(1)
.write((code & 0x3F) as u8 | TAG_CONT);
}
3 => {
dst.get_unchecked_mut(0)
.write((code >> 12 & 0x0F) as u8 | TAG_THREE_B);
dst.get_unchecked_mut(1)
.write((code >> 6 & 0x3F) as u8 | TAG_CONT);
dst.get_unchecked_mut(2)
.write((code & 0x3F) as u8 | TAG_CONT);
}
4 => {
dst.get_unchecked_mut(0)
.write((code >> 18 & 0x07) as u8 | TAG_FOUR_B);
dst.get_unchecked_mut(1)
.write((code >> 12 & 0x3F) as u8 | TAG_CONT);
dst.get_unchecked_mut(2)
.write((code >> 6 & 0x3F) as u8 | TAG_CONT);
dst.get_unchecked_mut(3)
.write((code & 0x3F) as u8 | TAG_CONT);
}
_ => unreachable!(),
}
}
// SAFETY: data has been written to the first `len` bytes
unsafe { &mut *(dst.get_unchecked_mut(..len) as *mut [MaybeUninit<u8>] as *mut [u8]) }
}
const ITERATIONS: u64 = if cfg!(miri) { 1 } else { 10_000 };
#[bench]
fn bench_push_1_byte(bencher: &mut Bencher) {
const CHAR: char = '0';
assert_eq!(CHAR.len_utf8(), 1);
bencher.bytes = ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity(ITERATIONS as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}
#[bench]
fn bench_push_2_bytes(bencher: &mut Bencher) {
const CHAR: char = 'д';
assert_eq!(CHAR.len_utf8(), 2);
bencher.bytes = 2 * ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity((2 * ITERATIONS) as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}
#[bench]
fn bench_push_3_bytes(bencher: &mut Bencher) {
const CHAR: char = '❗';
assert_eq!(CHAR.len_utf8(), 3);
bencher.bytes = 3 * ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity((3 * ITERATIONS) as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}
#[bench]
fn bench_push_4_bytes(bencher: &mut Bencher) {
const CHAR: char = '🤨';
assert_eq!(CHAR.len_utf8(), 4);
bencher.bytes = 4 * ITERATIONS;
bencher.iter(|| {
let mut s = String::with_capacity((4 * ITERATIONS) as _);
for _ in 0..black_box(ITERATIONS) {
push(&mut s, black_box(CHAR));
}
s
});
}
#[bench]
fn bench_push_random_bytes(bencher: &mut Bencher) {
bencher.bytes = (2 + 3) * ITERATIONS / 2;
let mut rng = Pcg64Mcg::new(0xcafe_f00d_d15e_a5e5);
let input: [_; ITERATIONS as usize] =
array::from_fn(|_| *['0', 'д', '❗', '🤨'].choose(&mut rng).unwrap());
bencher.iter(|| {
let mut s = String::with_capacity((4 * ITERATIONS) as _);
for c in input {
push(&mut s, black_box(c));
}
s
});
}
The results look pretty neutral to me, @cuviper -- feel free to close or continue iterating.
I don't think I'll spend any more time here, but anyone else is welcome to continue with the idea if they like. |
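For anyone picking this up: one more branchless formulation worth trying (my own sketch, not from this PR) replaces the three comparisons with a tiny lookup table indexed by the bit length of the code point:

```rust
// len_utf8 via bit length: the UTF-8 length depends only on the position
// of the highest set bit. k = bit length of (code | 1), so k is 1..=32;
// k <= 7 -> 1 byte, k <= 11 -> 2, k <= 16 -> 3, otherwise 4.
fn len_utf8_bitlen(code: u32) -> usize {
    const LEN: [u8; 33] = [
        1, 1, 1, 1, 1, 1, 1, 1, // k = 0..=7 (k = 0 can't occur)
        2, 2, 2, 2, // k = 8..=11
        3, 3, 3, 3, 3, // k = 12..=16
        4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, // k = 17..=32
    ];
    LEN[(32 - (code | 1).leading_zeros()) as usize] as usize
}

fn main() {
    for c in '\0'..='\u{10FFFF}' {
        assert_eq!(len_utf8_bitlen(c as u32), c.len_utf8());
    }
    println!("ok");
}
```

Whether `lzcnt` plus a table load beats three `cmp`/`sbb` pairs would need the same kind of benchmarking as above.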