Add a fast path for the data state using SSE2 instructions #601

Merged
merged 1 commit into from
May 15, 2025

Conversation

simonwuelker
Contributor

@simonwuelker simonwuelker commented Apr 14, 2025

The data state is where the HTML tokenizer spends most of its time. It is also very simple: all it does is scan the input stream for the next character in a set. This can easily be optimized with SIMD instructions. The algorithm I used is described in https://lemire.me/blog/2024/06/08/scan-html-faster-with-simd-instructions-chrome-edition/.
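
For reference, the scalar version of that scan fits in a few lines. This is only a sketch: the function name `find_special_scalar` is made up for illustration, and the character set (`<`, `&`, `\r`, `\0`) is assumed to be the one checked by the scalar tail loop in this PR.

```rust
/// Returns the index of the first byte that the data state has to handle
/// specially, or `None` if the whole input is plain character data.
/// (Hypothetical helper, not part of html5ever.)
fn find_special_scalar(input: &[u8]) -> Option<usize> {
    input
        .iter()
        .position(|byte| matches!(*byte, b'<' | b'&' | b'\r' | b'\0'))
}
```

The SIMD fast path answers the same question, just 16 bytes at a time.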

This change significantly speeds up the tokenizer. Both lipsum.html and lipsum.zh.html see improvements of 75-80%, which is not surprising since they never leave the data state.

Very small inputs regress slightly. There is also a performance regression of ~5% for malicious input that consists only of tags (the strong.html benchmark). In that case the SIMD instructions are overkill because the target character (<) is always the first one in the input stream.

Note that the implementation could be made significantly faster by not keeping track of newlines. The only use in servo for the line number is for script elements, where the line number eventually ends up in https://github.com/servo/mozjs/blob/d1525dfaee22cc1ea9ee16c552cdeedaa9f20741/mozjs-sys/src/jsglue.cpp#L608.

Benchmark results
html tokenizing lipsum.html
                        time:   [1.8185 µs 1.8217 µs 1.8253 µs]
                        change: [-75.139% -75.046% -74.950%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing lipsum-zh.html
                        time:   [1.0855 µs 1.0918 µs 1.1002 µs]
                        change: [-80.345% -80.260% -80.165%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

html tokenizing medium-fragment.html
                        time:   [30.136 µs 30.179 µs 30.227 µs]
                        change: [-0.8699% -0.5956% -0.3121%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  8 (8.00%) high mild
  2 (2.00%) high severe

html tokenizing small-fragment.html
                        time:   [2.5780 µs 2.5908 µs 2.6047 µs]
                        change: [-11.670% -11.371% -11.033%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

html tokenizing tiny-fragment.html
                        time:   [316.23 ns 316.68 ns 317.18 ns]
                        change: [+2.2782% +3.0717% +3.7761%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe

html tokenizing strong.html
                        time:   [20.781 µs 20.887 µs 21.012 µs]
                        change: [+5.7845% +6.2623% +6.7381%] (p = 0.00 < 0.05)
                        Performance has regressed.

@nicoburns
Contributor

A note that if you're looking at html5ever performance, you might want to look at https://github.com/untitaker/html5gum which only has a tokenizer (no tree builder), but claims to be ~5x faster than html5ever at tokenizing.

Very small inputs regress slightly. There is also a performance regression of ~5% for malicious input that consists only of tags (the strong.html benchmark). In that case the SIMD instructions are overkill because the target character (<) is always the first one in the input stream.

Would it make sense to check the first character using scalar code and then jump into the SIMD code?

@simonwuelker
Contributor Author

simonwuelker commented Apr 15, 2025

Would it make sense to check the first character using scalar code and then jump into the SIMD code?

Maybe!

Chromium only enters the SIMD loop when there is a non-whitespace character: https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/core/html/parser/html_document_parser_fastpath.cc;l=781-796

@simonwuelker
Contributor Author

A note that if you're looking at html5ever performance, you might want to look at https://github.com/untitaker/html5gum which only has a tokenizer (no tree builder), but claims to be ~5x faster than html5ever at tokenizing.

I did a very rough benchmark which simply tokenizes https://html.spec.whatwg.org. html5gum is indeed faster than html5ever, but only by around 25-50%. It does not seem even close to 5x, but to get more conclusive results I would need to put some more effort into benchmarking. They also do not support features that are necessary for a browser engine, like incremental parsing(?) and parser reentrancy.

Member

@mrobinson mrobinson left a comment

Seems good to me and it's hard to argue with the performance benefits. I've done a quick check of this, but it would be nice to have a few more eyes on it.

@simonwuelker As this is still marked as a draft, I'm not sure if this is blocked on anything.

@simonwuelker
Contributor Author

@simonwuelker As this is still marked as a draft, I'm not sure if this is blocked on anything.

It's not blocked; I wanted to figure out a way to avoid the performance regressions first. The lipsum benchmarks are plain text, so while the speedups are very nice, they are not very representative of "normal" web content.

I tried checking the first character using scalar code and only entering SIMD when it is not in the set, as suggested by @nicoburns. It seems to fix all the regressions seen previously:

New Benchmark results
html tokenizing lipsum.html
                        time:   [1.8492 µs 1.8525 µs 1.8559 µs]
                        change: [-76.059% -75.876% -75.707%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

html tokenizing lipsum-zh.html
                        time:   [1.1155 µs 1.1176 µs 1.1199 µs]
                        change: [-80.801% -80.603% -80.433%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing medium-fragment.html
                        time:   [30.757 µs 30.831 µs 30.911 µs]
                        change: [-5.6858% -5.2866% -4.8994%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

html tokenizing small-fragment.html
                        time:   [2.6424 µs 2.6511 µs 2.6603 µs]
                        change: [-12.575% -12.225% -11.904%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild

html tokenizing tiny-fragment.html
                        time:   [305.13 ns 305.71 ns 306.32 ns]
                        change: [-0.0519% +0.1803% +0.4443%] (p = 0.15 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

html tokenizing strong.html
                        time:   [20.644 µs 20.701 µs 20.758 µs]
                        change: [+0.6038% +1.8223% +2.5950%] (p = 0.00 < 0.05)
                        Change within noise threshold.

     Running unittests lib.rs (target/release/deps/markup5ever-05fb43043a741506)

running 6 tests
test interface::tests::ns_macro ... ignored
test util::buffer_queue::test::can_eat ... ignored
test util::buffer_queue::test::can_pop_except_set ... ignored
test util::buffer_queue::test::can_unconsume ... ignored
test util::buffer_queue::test::smoke_test ... ignored
test util::smallcharset::test::nonmember_prefix ... ignored

test result: ok. 0 passed; 0 failed; 6 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running unittests lib.rs (target/release/deps/markup5ever_rcdom-d9caee62b3ede2ce)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running unittests src/lib.rs (target/release/deps/match_token-194a2652d3ef5200)

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running unittests src/lib.rs (target/release/deps/xml5ever-3013b7c65eebb4ce)

running 2 tests
test tokenizer::test::simple_namespace ... ignored
test tokenizer::test::wrong_namespaces ... ignored

test result: ok. 0 passed; 0 failed; 2 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/xml5ever.rs (target/release/deps/xml5ever-80c5a79428f77bac)
Gnuplot not found, using plotters backend
xml tokenizing strong.xml
                        time:   [21.767 µs 21.955 µs 22.223 µs]
                        change: [-0.2343% +0.5410% +1.6585%] (p = 0.28 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

The fact that performance on strong.html stays within the noise threshold is expected, since that input never reaches the SIMD code.
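
A rough sketch of the pre-check, to make the control flow concrete. All names here are hypothetical, and `scan_vectorized` is a scalar stand-in for the SSE2 routine so that the example stays self-contained:

```rust
/// The bytes that end a run of plain character data in the data state.
fn is_special(byte: u8) -> bool {
    matches!(byte, b'<' | b'&' | b'\r' | b'\0')
}

/// Stand-in for the SSE2 scan (hypothetical): returns the number of plain
/// bytes before the first special byte, or the input length if there is none.
fn scan_vectorized(input: &[u8]) -> usize {
    input
        .iter()
        .position(|&byte| is_special(byte))
        .unwrap_or(input.len())
}

/// Checks the first byte with scalar code and only falls through to the
/// vectorized scan when that byte is not in the set.
fn scan_data_state(input: &[u8]) -> usize {
    match input.first() {
        // Tag-heavy input such as strong.html exits here without any SIMD setup.
        Some(&byte) if is_special(byte) => 0,
        _ => scan_vectorized(input),
    }
}
```

With this shape, input that starts with `<` pays only for one scalar comparison, while longer text runs still reach the fast path.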

@simonwuelker
Contributor Author

I've done a quick check of this, but it would be nice to have a few more eyes on it.

Btw, I'm happy to explain what's happening instruction-by-instruction if reviewing this is otherwise too intricate.

Comment on lines +1913 to +1917
let quote_mask = _mm_set1_epi8('<' as i8);
let escape_mask = _mm_set1_epi8('&' as i8);
let carriage_return_mask = _mm_set1_epi8('\r' as i8);
let zero_mask = _mm_set1_epi8('\0' as i8);
let newline_mask = _mm_set1_epi8('\n' as i8);
Contributor Author

_mm_set1_epi8 creates a SIMD vector where each element has the same value.

For example, _mm_set1_epi8('<' as i8) returns a __m128i, a 128-bit vector consisting of sixteen 8-bit integers. Each of those integers has the value 60.
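
A small sketch that makes the broadcast visible by storing the vector back into a byte array. The helper name `splat_bytes` is an assumption for illustration, and the non-x86_64 branch is only there so the example compiles on other targets:

```rust
/// Broadcasts `value` into all 16 lanes and returns the lanes as bytes.
#[cfg(target_arch = "x86_64")]
fn splat_bytes(value: u8) -> [u8; 16] {
    use std::arch::x86_64::*;
    let mut out = [0u8; 16];
    // SAFETY: SSE2 is part of the x86_64 baseline, and `out` is 16 bytes long.
    unsafe {
        let vector = _mm_set1_epi8(value as i8);
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, vector);
    }
    out
}

/// Portable fallback with the same result, for targets without SSE2.
#[cfg(not(target_arch = "x86_64"))]
fn splat_bytes(value: u8) -> [u8; 16] {
    [value; 16]
}
```

Calling `splat_bytes(b'<')` yields sixteen copies of 60.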

Comment on lines +1926 to +1927
// Load a 16 byte chunk from the input
let data = _mm_loadu_si128(start.add(i) as *const __m128i);
Contributor Author

_mm_loadu_si128 takes a pointer and reads 128 bits at the given address into a SIMD register.

Comment on lines +1930 to +1934
let quotes = _mm_cmpeq_epi8(data, quote_mask);
let escapes = _mm_cmpeq_epi8(data, escape_mask);
let carriage_returns = _mm_cmpeq_epi8(data, carriage_return_mask);
let zeros = _mm_cmpeq_epi8(data, zero_mask);
let newlines = _mm_cmpeq_epi8(data, newline_mask);
Contributor Author

_mm_cmpeq_epi8 takes two SIMD vectors and compares them element-wise. Each byte in the result has all of its bits set (0xFF) if the two operands match at that position and is zero otherwise.

Therefore, quotes, escapes etc now contain the test results for the 16-byte input chunk that we just loaded.
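
To see what a comparison result looks like, here is a sketch that compares one chunk against a single needle and returns the raw lanes. The name `compare_lanes` is hypothetical, and the fallback branch only exists so the example builds on non-x86_64 targets:

```rust
/// Compares every byte of `chunk` with `needle`; matching lanes are 0xFF,
/// all other lanes are 0x00.
#[cfg(target_arch = "x86_64")]
fn compare_lanes(chunk: &[u8; 16], needle: u8) -> [u8; 16] {
    use std::arch::x86_64::*;
    let mut out = [0u8; 16];
    // SAFETY: SSE2 is part of the x86_64 baseline; both arrays are 16 bytes.
    unsafe {
        let data = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        let lanes = _mm_cmpeq_epi8(data, _mm_set1_epi8(needle as i8));
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, lanes);
    }
    out
}

/// Portable fallback that produces the same lanes without SSE2.
#[cfg(not(target_arch = "x86_64"))]
fn compare_lanes(chunk: &[u8; 16], needle: u8) -> [u8; 16] {
    chunk.map(|byte| if byte == needle { 0xFF } else { 0x00 })
}
```

For the chunk `a<b>cdefghijklmn` and the needle `<`, only lane 1 is set.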

Comment on lines +1938 to +1941
let test_result = _mm_or_si128(
    _mm_or_si128(quotes, zeros),
    _mm_or_si128(escapes, carriage_returns),
);
Contributor Author

@simonwuelker simonwuelker May 15, 2025

_mm_or_si128 does exactly what it sounds like: It computes the elementwise OR of two SIMD vectors.

Therefore, test_result now contains sixteen 8-bit integers that are either 0x00 or 0xFF: 0x00 if the character at that position did not match any element in the set and 0xFF otherwise.

let bitmask = _mm_movemask_epi8(test_result);
Contributor Author

_mm_movemask_epi8 throws away all the bits we don't need anymore. It creates a 16-bit integer consisting of the most significant bit of each byte in the SIMD vector, with byte 0 ending up in bit 0 (the least significant bit) of the result.

For example, the SIMD vector [0xFF, 0, 0xFF, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] becomes 0b0000000000000101.
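
The compare, OR and movemask steps can be put together in one small function. This is a sketch rather than the code from the PR: `special_bitmask` is a made-up name, and it relies on `_mm_movemask_epi8` placing the top bit of byte `i` into bit `i` of the result. The fallback branch is only there for non-x86_64 targets.

```rust
/// Returns a 16-bit mask where bit `i` is set if byte `i` of `chunk`
/// is one of `<`, `&`, `\r` or `\0`.
#[cfg(target_arch = "x86_64")]
fn special_bitmask(chunk: &[u8; 16]) -> u16 {
    use std::arch::x86_64::*;
    // SAFETY: SSE2 is part of the x86_64 baseline; `chunk` is 16 bytes long.
    unsafe {
        let data = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        let lt = _mm_cmpeq_epi8(data, _mm_set1_epi8(b'<' as i8));
        let amp = _mm_cmpeq_epi8(data, _mm_set1_epi8(b'&' as i8));
        let cr = _mm_cmpeq_epi8(data, _mm_set1_epi8(b'\r' as i8));
        let nul = _mm_cmpeq_epi8(data, _mm_set1_epi8(0));
        let any = _mm_or_si128(_mm_or_si128(lt, nul), _mm_or_si128(amp, cr));
        // Only the low 16 bits of the i32 result can be set.
        _mm_movemask_epi8(any) as u16
    }
}

/// Portable fallback that builds the same mask one byte at a time.
#[cfg(not(target_arch = "x86_64"))]
fn special_bitmask(chunk: &[u8; 16]) -> u16 {
    chunk.iter().enumerate().fold(0u16, |mask, (i, &byte)| {
        if matches!(byte, b'<' | b'&' | b'\r' | b'\0') {
            mask | (1u16 << i)
        } else {
            mask
        }
    })
}
```

For the chunk `<a&bcdefghijklmn`, bytes 0 and 2 are in the set, so bits 0 and 2 of the mask are set.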

let bitmask = _mm_movemask_epi8(test_result);
let newline_mask = _mm_movemask_epi8(newlines);

if (bitmask != 0) {
Contributor Author

If at least one bit in the mask is set, then one of the 16 input characters was in the set that we were looking for.

Comment on lines +1947 to +1951
let position = if cfg!(target_endian = "little") {
    bitmask.trailing_zeros() as usize
} else {
    bitmask.leading_zeros() as usize
};
Contributor Author

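To find the exact position of the character, we count the trailing zeros of the bitmask (that is, the number of characters before the first one that is in the set) and add them to the offset of the 16-byte chunk that we loaded. Byte 0 of the chunk maps to the least significant bit of the mask, and SSE2 only exists on x86, which is little-endian, so the trailing_zeros branch is the one that runs in practice.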
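
As a tiny worked example of that arithmetic (the helper `match_position` is hypothetical and assumes that bit `i` of the mask corresponds to byte `i` of the chunk):

```rust
/// Turns a match bitmask for the chunk starting at `chunk_start` into an
/// absolute offset into the input, or `None` if no bit is set.
fn match_position(chunk_start: usize, bitmask: u16) -> Option<usize> {
    if bitmask == 0 {
        return None;
    }
    // The lowest set bit belongs to the first matching byte in the chunk.
    Some(chunk_start + bitmask.trailing_zeros() as usize)
}
```

A mask with bit 5 set for the chunk starting at offset 32 therefore points at input offset 37.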

Comment on lines +1953 to +1957
    n_newlines += (newline_mask & ((1 << position) - 1)).count_ones() as u64;
    i += position;
    break;
} else {
    n_newlines += newline_mask.count_ones() as u64;
Contributor Author

Additionally, html5ever emits the line number for each token.
In SIMD this is implemented as follows:

  1. Compute a second bitmask like before, but only for \n characters.
  2. If the 16-byte chunk contains a character in the set:
    2.1 Keep only the newline-bitmask bits that come before the first character in the set, and add the number of remaining set bits to the newline count.
  3. Otherwise:
    3.1 Add the number of set bits in the whole newline-bitmask to the newline count.

Unfortunately, this makes the algorithm significantly slower than it could be.
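
The newline bookkeeping for a single chunk boils down to a mask and a popcount. A minimal sketch, with a made-up function name and both bitmasks passed in as plain `u16` values:

```rust
/// Counts the newlines in one 16-byte chunk that come before the first
/// special character, or all of them if the chunk has no special character.
fn newlines_in_chunk(newline_mask: u16, special_mask: u16) -> u32 {
    if special_mask == 0 {
        // No character from the set in this chunk: every newline counts.
        return newline_mask.count_ones();
    }
    // Index of the first special character within the chunk (0..=15).
    let position = special_mask.trailing_zeros();
    // Mask with only the bits below `position` set.
    let before = (1u16 << position) - 1;
    (newline_mask & before).count_ones()
}
```

For example, with newlines at bytes 1 and 9 and the first special character at byte 4, only the newline at byte 1 is counted.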

Comment on lines +1963 to +1973
// Process any remaining bytes (less than STRIDE)
while let Some(c) = raw_bytes.get(i) {
    if matches!(*c, b'<' | b'&' | b'\r' | b'\0') {
        break;
    }
    if *c == b'\n' {
        n_newlines += 1;
    }

    i += 1;
}
Contributor Author

This block takes care of any input chunks that are too small for SIMD.

///
/// [data state]: https://html.spec.whatwg.org/#data-state
/// [here]: https://lemire.me/blog/2024/06/08/scan-html-faster-with-simd-instructions-chrome-edition/
unsafe fn data_state_sse2_fast_path(&self, input: &mut StrTendril) -> Option<SetResult> {
Contributor Author

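An SSE2 vector holds exactly 16 bytes, so the SIMD code can only process full 16-byte chunks. Instead of scanning the whole input character by character, we split the loop into a vectorized main loop and a scalar tail, like this: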

// This is pseudocode
while remaining_input.len() >= SIMD_CHUNK_SIZE {
    // use SIMD algorithm
}

while !remaining_input.is_empty() {
    // use scalar algorithm
}
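
Fleshed out, the two loops could look roughly like the sketch below. It is not the code from this PR: the names `scan`, `chunk_masks` and `is_special` are assumptions, the function works on a plain byte slice instead of a `StrTendril`, and the non-x86_64 branch only exists so the example builds everywhere.

```rust
const STRIDE: usize = 16;

fn is_special(byte: u8) -> bool {
    matches!(byte, b'<' | b'&' | b'\r' | b'\0')
}

/// Returns (special-character bitmask, newline bitmask) for the first
/// STRIDE bytes of `chunk`.
#[cfg(target_arch = "x86_64")]
fn chunk_masks(chunk: &[u8]) -> (u16, u16) {
    use std::arch::x86_64::*;
    assert!(chunk.len() >= STRIDE);
    // SAFETY: SSE2 is part of the x86_64 baseline, and the assertion above
    // guarantees that 16 bytes are readable.
    unsafe {
        let data = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        let lt = _mm_cmpeq_epi8(data, _mm_set1_epi8(b'<' as i8));
        let amp = _mm_cmpeq_epi8(data, _mm_set1_epi8(b'&' as i8));
        let cr = _mm_cmpeq_epi8(data, _mm_set1_epi8(b'\r' as i8));
        let nul = _mm_cmpeq_epi8(data, _mm_set1_epi8(0));
        let nl = _mm_cmpeq_epi8(data, _mm_set1_epi8(b'\n' as i8));
        let special = _mm_or_si128(_mm_or_si128(lt, nul), _mm_or_si128(amp, cr));
        (_mm_movemask_epi8(special) as u16, _mm_movemask_epi8(nl) as u16)
    }
}

/// Portable fallback that computes the same masks without SSE2.
#[cfg(not(target_arch = "x86_64"))]
fn chunk_masks(chunk: &[u8]) -> (u16, u16) {
    let (mut special, mut newlines) = (0u16, 0u16);
    for (i, &byte) in chunk[..STRIDE].iter().enumerate() {
        if is_special(byte) {
            special |= 1u16 << i;
        }
        if byte == b'\n' {
            newlines |= 1u16 << i;
        }
    }
    (special, newlines)
}

/// Returns the offset of the first special byte (or `input.len()` if there
/// is none) together with the number of newlines seen before that offset.
fn scan(input: &[u8]) -> (usize, u64) {
    let mut i = 0;
    let mut n_newlines = 0u64;

    // Vectorized main loop: runs while a full 16-byte chunk remains.
    while i + STRIDE <= input.len() {
        let (special, newlines) = chunk_masks(&input[i..i + STRIDE]);
        if special != 0 {
            let position = special.trailing_zeros() as usize;
            let before = (1u16 << position) - 1;
            n_newlines += u64::from((newlines & before).count_ones());
            return (i + position, n_newlines);
        }
        n_newlines += u64::from(newlines.count_ones());
        i += STRIDE;
    }

    // Scalar tail: fewer than STRIDE bytes are left.
    while let Some(&byte) = input.get(i) {
        if is_special(byte) {
            break;
        }
        if byte == b'\n' {
            n_newlines += 1;
        }
        i += 1;
    }
    (i, n_newlines)
}
```

Inputs shorter than 16 bytes skip the first loop entirely and are handled by the scalar tail alone.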

Member

@mrobinson mrobinson left a comment

Seems good to me!

@simonwuelker simonwuelker enabled auto-merge May 15, 2025 17:07
@simonwuelker simonwuelker added this pull request to the merge queue May 15, 2025
Merged via the queue into servo:main with commit b6f9a7c May 15, 2025
6 checks passed