Add a fast path for the data state using SSE2 instructions #601

simonwuelker · 2025-04-14T04:24:07Z

The data state is where the HTML tokenizer spends most of its time. It is also very simple: all it does is scan the input stream for the next character in a set. This can easily be optimized with SIMD instructions. The algorithm I used is described in https://lemire.me/blog/2024/06/08/scan-html-faster-with-simd-instructions-chrome-edition/.
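
For readers unfamiliar with this kind of scan, here is a minimal sketch of the idea. It is not the code from this PR. It assumes the interesting bytes are `<`, `&`, `\r` and NUL, and it uses plain SSE2 compare-and-OR rather than the nibble table lookup from the linked post. The function name `find_special` is made up for illustration.

```rust
/// Returns the index of the first byte in `input` that is one of
/// `<`, `&`, `\r` or NUL, or `input.len()` if there is none.
/// (Hypothetical helper; the byte set is an assumption.)
fn find_special(input: &[u8]) -> usize {
    let mut i = 0;

    #[cfg(target_arch = "x86_64")]
    {
        use std::arch::x86_64::*;
        // SSE2 is part of the x86_64 baseline, so no runtime feature
        // detection is needed here.
        unsafe {
            // Broadcast each target byte into all 16 lanes.
            let lt = _mm_set1_epi8(b'<' as i8);
            let amp = _mm_set1_epi8(b'&' as i8);
            let cr = _mm_set1_epi8(b'\r' as i8);
            let nul = _mm_setzero_si128();

            while i + 16 <= input.len() {
                // Unaligned load of the next 16 bytes.
                let chunk = _mm_loadu_si128(input.as_ptr().add(i) as *const __m128i);
                // A lane becomes 0xFF if it matches any of the targets.
                let hits = _mm_or_si128(
                    _mm_or_si128(_mm_cmpeq_epi8(chunk, lt), _mm_cmpeq_epi8(chunk, amp)),
                    _mm_or_si128(_mm_cmpeq_epi8(chunk, cr), _mm_cmpeq_epi8(chunk, nul)),
                );
                // One bit per lane; the lowest set bit is the first match.
                let mask = _mm_movemask_epi8(hits);
                if mask != 0 {
                    return i + mask.trailing_zeros() as usize;
                }
                i += 16;
            }
        }
    }

    // Scalar loop for the tail (and for the whole input on other targets).
    while i < input.len() {
        if matches!(input[i], b'<' | b'&' | b'\r' | 0) {
            return i;
        }
        i += 1;
    }
    input.len()
}
```

Plain text is skipped 16 bytes at a time, which is why text-only inputs benefit the most.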

This change significantly speeds up the tokenizer. Both lipsum.html and lipsum.zh.html see improvements of 70-80%, which is not surprising since they never leave the data state.

Very small inputs regress slightly. There is also a performance regression of ~5% for malicious input that consists only of tags (the strong.html benchmark). In that case the SIMD instructions are overkill because the target character (<) is always the first one in the input stream.

Note that the implementation could be made significantly faster by not keeping track of newlines. In Servo, the line number is only used for script elements, where it eventually ends up in https://github.com/servo/mozjs/blob/d1525dfaee22cc1ea9ee16c552cdeedaa9f20741/mozjs-sys/src/jsglue.cpp#L608.

Benchmark results

html tokenizing lipsum.html
                        time:   [1.8185 µs 1.8217 µs 1.8253 µs]
                        change: [-75.139% -75.046% -74.950%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing lipsum-zh.html
                        time:   [1.0855 µs 1.0918 µs 1.1002 µs]
                        change: [-80.345% -80.260% -80.165%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

html tokenizing medium-fragment.html
                        time:   [30.136 µs 30.179 µs 30.227 µs]
                        change: [-0.8699% -0.5956% -0.3121%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  8 (8.00%) high mild
  2 (2.00%) high severe

html tokenizing small-fragment.html
                        time:   [2.5780 µs 2.5908 µs 2.6047 µs]
                        change: [-11.670% -11.371% -11.033%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  7 (7.00%) high mild
  3 (3.00%) high severe

html tokenizing tiny-fragment.html
                        time:   [316.23 ns 316.68 ns 317.18 ns]
                        change: [+2.2782% +3.0717% +3.7761%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe

html tokenizing strong.html
                        time:   [20.781 µs 20.887 µs 21.012 µs]
                        change: [+5.7845% +6.2623% +6.7381%] (p = 0.00 < 0.05)
                        Performance has regressed.

nicoburns · 2025-04-15T12:08:18Z

A note that if you're looking at html5ever performance, you might want to look at https://github.com/untitaker/html5gum which only has a tokenizer (no tree builder), but claims to be ~5x faster than html5ever at tokenizing.

> Very small inputs regress slightly. There is also a performance regression of ~5% for malicious input that consists only of tags (the strong.html benchmark). In that case the SIMD instructions are overkill because the target character (<) is always the first one in the input stream.

Would it make sense to check the first character using scalar code and then jump into the SIMD code?

simonwuelker · 2025-04-15T12:30:16Z

> Would it make sense to check the first character using scalar code and then jump into the SIMD code?

Maybe!

Chromium only enters the SIMD loop when there is a non-whitespace character: https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/core/html/parser/html_document_parser_fastpath.cc;l=781-796

simonwuelker · 2025-04-17T11:32:41Z

> A note that if you're looking at html5ever performance, you might want to look at https://github.com/untitaker/html5gum which only has a tokenizer (no tree builder), but claims to be ~5x faster than html5ever at tokenizing.

I did a very rough benchmark which simply tokenizes https://html.spec.whatwg.org. html5gum is indeed faster than html5ever, but only by around 25-50%. That is not even close to 5x, but I would need to put more effort into benchmarking to get conclusive results. It also does not support features that are necessary for a browser engine, like incremental parsing(?) and parser reentrancy.

mrobinson

Seems good to me and it's hard to argue with the performance benefits. I've done a quick check of this, but it would be nice to have a few more eyes on it.

@simonwuelker As this is still marked as a draft, I'm not sure if this is blocked on anything.

simonwuelker · 2025-04-25T12:05:29Z

> @simonwuelker As this is still marked as a draft, I'm not sure if this is blocked on anything.

It's not blocked; I wanted to figure out a way to avoid the performance regressions. The lipsum benchmarks are plain text, so while the speedups are very nice, they are not very representative of "normal" web content.

I tried checking the first character using scalar code and only entering the SIMD loop when it is not in the set, as suggested by @nicoburns. This seems to fix all the regressions seen previously:

New Benchmark results

html tokenizing lipsum.html
                        time:   [1.8492 µs 1.8525 µs 1.8559 µs]
                        change: [-76.059% -75.876% -75.707%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

html tokenizing lipsum-zh.html
                        time:   [1.1155 µs 1.1176 µs 1.1199 µs]
                        change: [-80.801% -80.603% -80.433%] (p = 0.00 < 0.05)
                        Performance has improved.

html tokenizing medium-fragment.html
                        time:   [30.757 µs 30.831 µs 30.911 µs]
                        change: [-5.6858% -5.2866% -4.8994%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

html tokenizing small-fragment.html
                        time:   [2.6424 µs 2.6511 µs 2.6603 µs]
                        change: [-12.575% -12.225% -11.904%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild

html tokenizing tiny-fragment.html
                        time:   [305.13 ns 305.71 ns 306.32 ns]
                        change: [-0.0519% +0.1803% +0.4443%] (p = 0.15 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

html tokenizing strong.html
                        time:   [20.644 µs 20.701 µs 20.758 µs]
                        change: [+0.6038% +1.8223% +2.5950%] (p = 0.00 < 0.05)
                        Change within noise threshold.

     Running benches/xml5ever.rs (target/release/deps/xml5ever-80c5a79428f77bac)
Gnuplot not found, using plotters backend
xml tokenizing strong.xml
                        time:   [21.767 µs 21.955 µs 22.223 µs]
                        change: [-0.2343% +0.5410% +1.6585%] (p = 0.28 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high severe

The fact that performance in strong.html does not change at all is expected, since it never reaches the SIMD code.
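
For reviewers, the scalar pre-check boils down to something like the sketch below. It is only an illustration, not the code in this PR: `is_special`, `bulk_scan` and `plain_text_len` are hypothetical names, the byte set is an assumption, and `bulk_scan` is a plain iterator scan standing in for the SIMD routine.

```rust
/// Bytes that end a run of plain text (assumed set, for illustration).
fn is_special(b: u8) -> bool {
    matches!(b, b'<' | b'&' | b'\r' | 0)
}

/// Stand-in for the SIMD scan: index of the first special byte,
/// or the input length if there is none.
fn bulk_scan(input: &[u8]) -> usize {
    input
        .iter()
        .position(|&b| is_special(b))
        .unwrap_or(input.len())
}

/// Length of the plain-text run at the start of `input`.
fn plain_text_len(input: &[u8]) -> usize {
    match input.first() {
        // Nothing to scan.
        None => 0,
        // Cheap scalar check: tag-heavy input returns here and never
        // pays the setup cost of the vectorized loop.
        Some(&b) if is_special(b) => 0,
        // The first byte is plain text, so a longer run is likely and
        // the bulk scan is worth entering.
        Some(_) => 1 + bulk_scan(&input[1..]),
    }
}
```

With input like strong.html, every call takes the early return, so the timing should match the old scalar path.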

simonwuelker · 2025-05-06T09:01:39Z

> I've done a quick check of this, but it would be nice to have a few more eyes on it.

Btw, I'm happy to explain what's happening instruction-by-instruction if reviewing this is otherwise too intricate.

simonwuelker · 2025-05-15T07:55:05Z