Add a fast path for the data state using SSE2 instructions #601
A note that if you're looking at html5ever performance, you might want to look at https://github.com/untitaker/html5gum which only has a tokenizer (no tree builder), but claims to be ~5x faster than html5ever at tokenizing.
Would it make sense to check the first character using scalar code and then jump into the SIMD code?
Maybe! Chromium only enters the SIMD loop when there is a non-whitespace character: https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/core/html/parser/html_document_parser_fastpath.cc;l=781-796
I did a very rough benchmark which simply tokenizes https://html.spec.whatwg.org.
Seems good to me and it's hard to argue with the performance benefits. I've done a quick check of this, but it would be nice to have a few more eyes on it.
@simonwuelker As this is still marked as a draft, I'm not sure if this is blocked on anything.
It's not blocked, I wanted to figure out a way to not have performance regressions anymore. I tried checking the first character using scalar code and only entering SIMD when it is not in the set, as suggested by @nicoburns. It seems to fix all the regressions seen previously.
New Benchmark results
The fact that performance in
Btw, I'm happy to explain what's happening instruction-by-instruction if reviewing this is otherwise too intricate.
let quote_mask = _mm_set1_epi8('<' as i8);
let escape_mask = _mm_set1_epi8('&' as i8);
let carriage_return_mask = _mm_set1_epi8('\r' as i8);
let zero_mask = _mm_set1_epi8('\0' as i8);
let newline_mask = _mm_set1_epi8('\n' as i8);
`_mm_set1_epi8` creates a SIMD vector where each element has the same value.
For example, `_mm_set1_epi8('<' as i8)` returns a `__m128i`, a 128-bit vector consisting of 16 8-bit integers. Each of those integers has the value `60`.
// Load a 16 byte chunk from the input
let data = _mm_loadu_si128(start.add(i) as *const __m128i);
`_mm_loadu_si128` takes a pointer and reads 128 bits at the given address into a SIMD register. The address does not need to be aligned.
let quotes = _mm_cmpeq_epi8(data, quote_mask);
let escapes = _mm_cmpeq_epi8(data, escape_mask);
let carriage_returns = _mm_cmpeq_epi8(data, carriage_return_mask);
let zeros = _mm_cmpeq_epi8(data, zero_mask);
let newlines = _mm_cmpeq_epi8(data, newline_mask);
`_mm_cmpeq_epi8` takes two SIMD vectors and compares them element-wise. Each entry in the result has all bits set (`0xFF`) if the two operands match and is zero otherwise.
Therefore, `quotes`, `escapes` etc. now contain the test results for the 16-byte input chunk that we just loaded.
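To make the three intrinsics so far concrete, here is a minimal sketch that loads one 16-byte chunk, compares it against a single needle byte and copies the resulting lanes into a plain array. The helper name `lanes_equal_to` and the scalar fallback for non-x86_64 targets are illustrative assumptions and not part of this PR.

```rust
/// Compares every byte of `chunk` against `needle` with SSE2 and returns
/// the 16 result lanes (0xFF where the byte matched, 0x00 otherwise).
#[cfg(target_arch = "x86_64")]
fn lanes_equal_to(chunk: &[u8; 16], needle: u8) -> [u8; 16] {
    use std::arch::x86_64::{
        __m128i, _mm_cmpeq_epi8, _mm_loadu_si128, _mm_set1_epi8, _mm_storeu_si128,
    };

    let mut lanes = [0u8; 16];
    // SAFETY: SSE2 is part of the x86_64 baseline, both buffers are exactly
    // 16 bytes long, and the unaligned load/store need no alignment.
    unsafe {
        let needle_vector = _mm_set1_epi8(needle as i8);
        let data = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        let equal = _mm_cmpeq_epi8(data, needle_vector);
        _mm_storeu_si128(lanes.as_mut_ptr() as *mut __m128i, equal);
    }
    lanes
}

/// Scalar stand-in (an assumption of this sketch) so that it also
/// compiles on architectures without SSE2.
#[cfg(not(target_arch = "x86_64"))]
fn lanes_equal_to(chunk: &[u8; 16], needle: u8) -> [u8; 16] {
    (*chunk).map(|byte| if byte == needle { 0xFF } else { 0x00 })
}
```

Calling `lanes_equal_to(b"a<b<cdefghijklmn", b'<')` yields `0xFF` in lanes 1 and 3 and zero everywhere else.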
let test_result = _mm_or_si128(
    _mm_or_si128(quotes, zeros),
    _mm_or_si128(escapes, carriage_returns),
);
`_mm_or_si128` does exactly what it sounds like: it computes the bitwise OR of two SIMD vectors.
Therefore, `test_result` now contains 16 8-bit integers that are either `0x00` or `0xFF`: zero if the character at that position did not match an element in the set and `0xFF` otherwise.
    _mm_or_si128(quotes, zeros),
    _mm_or_si128(escapes, carriage_returns),
);
let bitmask = _mm_movemask_epi8(test_result);
`_mm_movemask_epi8` throws away all the bits we don't need anymore. It creates a 16-bit integer consisting of the most significant bit of each SIMD entry, with entry 0 ending up in bit 0.
For example, the SIMD vector `[0xFF, 0, 0xFF, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]` becomes `0b0000000000000101`.
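The bit order is easy to get backwards, so here is a small sketch that builds the bitmask for one chunk and can be used to check which bit a given byte position maps to. It only looks for `<` and `&` to stay short; the function name `match_mask` and the scalar fallback for non-x86_64 targets are assumptions made for illustration.

```rust
/// Returns a 16-bit mask in which bit `n` is set when byte `n` of `chunk`
/// is `<` or `&`.
#[cfg(target_arch = "x86_64")]
fn match_mask(chunk: &[u8; 16]) -> u32 {
    use std::arch::x86_64::{
        __m128i, _mm_cmpeq_epi8, _mm_loadu_si128, _mm_movemask_epi8, _mm_or_si128,
        _mm_set1_epi8,
    };

    // SAFETY: SSE2 is part of the x86_64 baseline and `chunk` is exactly
    // 16 bytes long, so the unaligned load stays in bounds.
    unsafe {
        let data = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        let lt = _mm_cmpeq_epi8(data, _mm_set1_epi8(b'<' as i8));
        let amp = _mm_cmpeq_epi8(data, _mm_set1_epi8(b'&' as i8));
        // Every lane is 0x00 or 0xFF; movemask keeps the top bit of each lane.
        _mm_movemask_epi8(_mm_or_si128(lt, amp)) as u32
    }
}

/// Scalar stand-in (an assumption of this sketch) for other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn match_mask(chunk: &[u8; 16]) -> u32 {
    chunk
        .iter()
        .enumerate()
        .filter(|(_, byte)| matches!(**byte, b'<' | b'&'))
        .fold(0u32, |mask, (lane, _)| mask | (1u32 << lane))
}
```

With the input `b"<b&defghijklmnop"` the matches sit at byte positions 0 and 2, so the mask is `0b101`. A match in the last byte of the chunk sets bit 15.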
let bitmask = _mm_movemask_epi8(test_result);
let newline_mask = _mm_movemask_epi8(newlines);

if (bitmask != 0) {
If at least one bit in the mask is set, then one of the 16 input characters was in the set that we were looking for.
let position = if cfg!(target_endian = "little") {
    bitmask.trailing_zeros() as usize
} else {
    bitmask.leading_zeros() as usize
};
To find the exact position of the character we can just count the trailing zeros of the bitmask (that is, the leading characters of the chunk that were not in the set) and add them to the offset of the 16-byte chunk that we loaded. Bit 0 of the mask corresponds to the first byte of the chunk.
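As a quick illustration of that step, the following sketch turns a bitmask into an absolute input offset. It assumes the mask came from `_mm_movemask_epi8`, so bit `n` stands for byte `n` of the chunk; the function name `first_match_offset` is hypothetical.

```rust
/// Returns the absolute offset of the first matching byte, given the
/// offset at which the 16-byte chunk starts and the chunk's bitmask.
/// Returns `None` when no bit is set, i.e. the chunk contains no match.
fn first_match_offset(chunk_start: usize, bitmask: u32) -> Option<usize> {
    if bitmask == 0 {
        None
    } else {
        // Each trailing zero bit is one leading byte that did not match.
        Some(chunk_start + bitmask.trailing_zeros() as usize)
    }
}
```

For a chunk that starts at offset 32 and has the mask `0b101000`, the first match is at bit 3, so the result is `Some(35)`.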
    n_newlines += (newline_mask & ((1 << position) - 1)).count_ones() as u64;
    i += position;
    break;
} else {
    n_newlines += newline_mask.count_ones() as u64;
Additionally, `html5ever` emits the line number for each token.
In SIMD this is implemented as follows:
1. Compute the bitmask like before, but only for `\n` characters.
2. If the 16-byte chunk contains a character in the set:
   2.1 Keep only the newline bits that come before the first character in the set. The number of newlines is the number of set bits that remain.
3. Otherwise:
   3.1 The number of newlines is the number of set bits in the newline bitmask.
Unfortunately, this makes the algorithm significantly slower than it could be.
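The newline bookkeeping above only needs integer operations once both bitmasks exist. A minimal sketch, assuming both masks were produced by `_mm_movemask_epi8` for the same chunk (the function name `newlines_before_match` is made up for this example):

```rust
/// Counts the newlines of one chunk that come before the first character
/// in the set. Bit `n` of each mask corresponds to byte `n` of the chunk.
fn newlines_before_match(match_mask: u32, newline_mask: u32) -> u32 {
    if match_mask == 0 {
        // No character from the set in this chunk: every newline counts.
        newline_mask.count_ones()
    } else {
        let position = match_mask.trailing_zeros();
        // `(1 << position) - 1` has exactly the bits below `position` set,
        // so the AND drops newlines at or after the first match.
        (newline_mask & ((1u32 << position) - 1)).count_ones()
    }
}
```

With newlines at bytes 1 and 6 and the first match at byte 4, only the newline at byte 1 is counted.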
// Process any remaining bytes (less than STRIDE)
while let Some(c) = raw_bytes.get(i) {
    if matches!(*c, b'<' | b'&' | b'\r' | b'\0') {
        break;
    }
    if *c == b'\n' {
        n_newlines += 1;
    }

    i += 1;
}
This block takes care of any input chunks that are too small for SIMD.
///
/// [data state]: https://html.spec.whatwg.org/#data-state
/// [here]: https://lemire.me/blog/2024/06/08/scan-html-faster-with-simd-instructions-chrome-edition/
unsafe fn data_state_sse2_fast_path(&self, input: &mut StrTendril) -> Option<SetResult> {
SSE2 processes the input in 16-byte chunks. So instead of scanning the input character-by-character we split the loop into a SIMD part and a scalar tail, like this:
// This is pseudocode
while remaining_input.len() > SIMD_CHUNK_SIZE {
// use SIMD algorithm
}
while !remaining_input.is_empty() {
// use scalar algorithm
}
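Putting the pieces together, a self-contained version of that two-loop structure could look roughly like the sketch below. It is not the code from this PR: the names `scan_data_state` and `chunk_masks`, the plain `&[u8]` input and the `(offset, newline count)` return value are simplifications chosen for illustration, and the non-x86_64 fallback is an assumption so that the sketch compiles everywhere.

```rust
const STRIDE: usize = 16;

/// Returns (special-character mask, newline mask) for the first 16 bytes.
#[cfg(target_arch = "x86_64")]
fn chunk_masks(chunk: &[u8]) -> (u32, u32) {
    use std::arch::x86_64::{
        __m128i, _mm_cmpeq_epi8, _mm_loadu_si128, _mm_movemask_epi8, _mm_or_si128,
        _mm_set1_epi8,
    };
    assert!(chunk.len() >= STRIDE);
    // SAFETY: SSE2 is part of the x86_64 baseline and the assert above
    // guarantees that 16 bytes are readable.
    unsafe {
        let data = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        let lt = _mm_cmpeq_epi8(data, _mm_set1_epi8(b'<' as i8));
        let amp = _mm_cmpeq_epi8(data, _mm_set1_epi8(b'&' as i8));
        let cr = _mm_cmpeq_epi8(data, _mm_set1_epi8(b'\r' as i8));
        let nul = _mm_cmpeq_epi8(data, _mm_set1_epi8(0));
        let nl = _mm_cmpeq_epi8(data, _mm_set1_epi8(b'\n' as i8));
        let special = _mm_or_si128(_mm_or_si128(lt, nul), _mm_or_si128(amp, cr));
        (_mm_movemask_epi8(special) as u32, _mm_movemask_epi8(nl) as u32)
    }
}

/// Scalar stand-in (an assumption of this sketch) for other architectures.
#[cfg(not(target_arch = "x86_64"))]
fn chunk_masks(chunk: &[u8]) -> (u32, u32) {
    let mut masks = (0u32, 0u32);
    for (lane, byte) in chunk[..STRIDE].iter().enumerate() {
        match *byte {
            b'<' | b'&' | b'\r' | b'\0' => masks.0 |= 1u32 << lane,
            b'\n' => masks.1 |= 1u32 << lane,
            _ => {}
        }
    }
    masks
}

/// Returns the offset of the first special character (or the input length
/// if there is none) and the number of newlines before that offset.
fn scan_data_state(input: &[u8]) -> (usize, u64) {
    let mut i = 0;
    let mut n_newlines = 0u64;

    // SIMD part: only runs while a full 16-byte chunk is available.
    while i + STRIDE <= input.len() {
        let (special, newlines) = chunk_masks(&input[i..i + STRIDE]);
        if special != 0 {
            let position = special.trailing_zeros() as usize;
            n_newlines += (newlines & ((1u32 << position) - 1)).count_ones() as u64;
            return (i + position, n_newlines);
        }
        n_newlines += newlines.count_ones() as u64;
        i += STRIDE;
    }

    // Scalar tail: fewer than STRIDE bytes are left.
    while let Some(&byte) = input.get(i) {
        if matches!(byte, b'<' | b'&' | b'\r' | b'\0') {
            break;
        }
        if byte == b'\n' {
            n_newlines += 1;
        }
        i += 1;
    }
    (i, n_newlines)
}
```

Unlike the pseudocode, this sketch enters the SIMD loop when exactly 16 bytes remain (`<=` rather than `>`); both variants are correct, the difference only affects which loop handles a final full chunk.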
Seems good to me!
Signed-off-by: Simon Wülker <[email protected]>
The data state is where the HTML tokenizer spends most of its time. It is also very simple: all it does is scan the input stream for the next character in a set. This can easily be optimized with SIMD instructions. The algorithm I used is described in https://lemire.me/blog/2024/06/08/scan-html-faster-with-simd-instructions-chrome-edition/.
This change significantly speeds up the tokenizer. Both `lipsum.html` and `lipsum.zh.html` see improvements of 70-80%, which is not surprising since they never leave the data state.
Very small inputs regress slightly. There is also a performance regression of ~5% for malicious input that consists only of tags (the `strong.html` benchmark). In that case the SIMD instructions are overkill because the target character (`<`) is always the first one in the input stream.
Note that the implementation could be made significantly faster by not keeping track of newlines. The only use in servo for the line number is for script elements, where the line number eventually ends up in https://github.com/servo/mozjs/blob/d1525dfaee22cc1ea9ee16c552cdeedaa9f20741/mozjs-sys/src/jsglue.cpp#L608.
Benchmark results