Block-wise text processing and attribute normalization using memchr and avoiding UTF-8 verification. #134
base: master
Conversation
force-pushed from 1761229 to 30f4379
10% is a good speed-up. I would suggest moving all/most of the

Also note that attribute normalization is a slow path to begin with. We're skipping it in most cases. So the performance benefits would not be that drastic.

If you're looking for easy performance wins from

Either way, performance optimizations are always welcome, but it's a very slow path of trial and error.
This is a synthetic XML file which is basically made up of attribute values which need normalization. The other benchmarks are flat.
Will do.
I think it depends on how you look at it: with SIMD we can check whether blocks of characters need processing and copy them over as blocks of bytes if they do not. Of course, entity parsing and such will still work character by character. This might actually be the bottleneck here as well. Looking at flamegraphs,
I agree. The reason I was looking at attribute value normalization (and text processing) is that I think the byte-by-byte processing with post-hoc UTF-8 validation of
Yes, I also saw that. As there is quite a bit of such post-checking in the tokenizer, what would you think about a feature similar to `positions` which would skip these checks? Over at $DAYJOB, we mostly parse machine-generated API responses which we analyze using e.g. Python and lxml if anything goes wrong. So for us, optionally disabling those checks would be an easy win. (We also have positions disabled.)
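As a rough sketch of how such an opt-out could be shaped (the `relaxed-checks` feature name is hypothetical and not an existing flag of this crate, and it would have to be declared in Cargo.toml):

```rust
/// Sketch of gating optional well-formedness checks behind a Cargo feature,
/// analogous in spirit to the existing `positions` feature.
fn is_accepted_char(c: char) -> bool {
    if cfg!(feature = "relaxed-checks") {
        // Callers that trust their machine-generated input skip the check.
        true
    } else {
        // XML 1.0 `Char` production.
        matches!(c,
            '\u{9}' | '\u{A}' | '\u{D}'
            | '\u{20}'..='\u{D7FF}'
            | '\u{E000}'..='\u{FFFD}'
            | '\u{10000}'..='\u{10FFFF}')
    }
}
```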
What is "open-code"?
Got it. I know that
Which checks exactly? I doubt there are any checks that can be classified as optional.
Meaning something that is inlined directly instead of calling a function from
I am thinking of checks like these (lines 680 to 682 in bc7a816), which act strictly as filters on the set of accepted documents. They do not influence how a document is parsed; they only reject some documents that would otherwise be parsed even though they violate the XML specification.
I thought
I'm skeptical about it. Also, I doubt it has a serious performance hit.
Not

And there are abominations like
My bad. But aren't
I am copying those blocks of the attribute value which do not need normalization, i.e. which contain neither entities nor white space. My understanding of the problem is that calling into
Aren't we simply referencing those? Only attributes that need normalization are allocated.
I honestly doubt

In my mind, most of the time is spent on various minor checks, validation, string scanning, etc. Aka pretty high-level stuff.
Whole attribute values which do not need normalization are borrowed. But here I am talking about blocks of characters which can be copied in bulk from attribute values which, overall, do need normalization. I mean, this is what this PR basically does: it finds the runs of uninteresting characters in between whitespace and entities and copies them in bulk (using
All of this discussion is in the context of
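For concreteness, a minimal sketch of the block-wise copying described above might look as follows. The names and the `Cow`-based signature are illustrative rather than this crate's actual API, and entity expansion is elided:

```rust
use std::borrow::Cow;

use memchr::{memchr, memchr3};

/// Position of the next byte that needs individual handling: an entity
/// start or a whitespace character that attribute normalization rewrites.
fn next_special(haystack: &[u8]) -> Option<usize> {
    let ws = memchr3(b'\t', b'\n', b'\r', haystack);
    let amp = memchr(b'&', haystack);
    match (ws, amp) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (a, b) => a.or(b),
    }
}

fn normalize_attribute(value: &str) -> Cow<'_, str> {
    // Fast path: nothing to normalize, borrow the whole value.
    if next_special(value.as_bytes()).is_none() {
        return Cow::Borrowed(value);
    }

    let mut out = String::with_capacity(value.len());
    let mut rest = value;
    while let Some(pos) = next_special(rest.as_bytes()) {
        // Copy the uninteresting run in one bulk `push_str` ...
        out.push_str(&rest[..pos]);
        // ... and handle the single special byte individually.
        match rest.as_bytes()[pos] {
            b'&' => out.push('&'), // entity expansion elided in this sketch
            _ => out.push(' '),    // \t, \n and \r are normalized to a space
        }
        rest = &rest[pos + 1..];
    }
    out.push_str(rest);
    Cow::Owned(out)
}
```

The point is that `memchr` stops only at the few bytes that need individual handling, so the long runs in between are appended with a single bulk copy instead of being pushed character by character.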
force-pushed from 30f4379 to 2a1ebd7
force-pushed from 0bc5ca2 to 5a49525
I found a much larger performance issue: merging of text nodes is quadratic, and the newly added
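For illustration, here is the difference between quadratic and linear merging in a generic form; this is not the crate's actual merging code:

```rust
// Each merge below re-allocates and copies everything accumulated so far,
// so n chunks cost O(n^2) bytes copied in total.
fn merge_quadratic(chunks: &[&str]) -> String {
    let mut merged = String::new();
    for chunk in chunks {
        merged = format!("{merged}{chunk}");
    }
    merged
}

// Appending into one growing buffer copies each chunk only once (amortized).
fn merge_linear(chunks: &[&str]) -> String {
    let mut merged = String::new();
    for chunk in chunks {
        merged.push_str(chunk);
    }
    merged
}
```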
force-pushed from 5a49525 to 00695d3
Ok, if we stick to

For a real-world case I can think of SVG, the reason why

Here is an example of a real-world SVG with long paths, but unfortunately without new-lines: https://en.wikipedia.org/wiki/File:Jupiter_diagram.svg
force-pushed from 00695d3 to 9c5d8a3
force-pushed from 9c5d8a3 to 5a30a42
So I added the Jupiter diagram as a benchmark and got the following results for the relevant part of the suite:
meaning the synthetic benchmarks see significant improvements while the realistic files are mostly flat.

So in summary, to me it looks like the block-wise approach can help, but will not significantly improve real-world XML files. So in the end, I think the question is whether the code becomes more or less maintainable this way. I do like that the whole of

Of course, for now the code is still badly integrated with the
As an aside, note that for the Jupiter SVG, basically all of the time is spent in

I have been experimenting with some larger refactorings in a personal rewrite and among other things, it does not check the characters and just

So I guess, in this particular using
Indeed, the following diff

```diff
diff --git a/src/tokenizer.rs b/src/tokenizer.rs
index e1d8c53..a0597bd 100644
--- a/src/tokenizer.rs
+++ b/src/tokenizer.rs
@@ -1,6 +1,8 @@
 use core::ops::Range;
 use core::str;
 
+use memchr::memchr2;
+
 use crate::{Error, TextPos};
 
 type Result<T> = core::result::Result<T, Error>;
@@ -570,9 +572,27 @@ fn parse_element<'input>(s: &mut Stream<'input>, events: &mut impl XmlEvents<'in
         let quote_c = quote as char;
         // The attribute value must not contain the < character.
         let value_start = s.pos();
-        s.skip_chars(|_, c| c != quote_c && c != '<')?;
+        let Some(pos) = memchr2(quote, b'<', s.as_str().as_bytes()) else {
+            return Err(Error::UnexpectedEndOfStream);
+        };
+        s.advance(pos);
         let value = s.slice_back_span(value_start);
         s.consume_byte(quote)?;
+
+        if value.as_str().as_bytes().is_ascii() {
+            for &b in value.as_str().as_bytes() {
+                if !(b >= 0x20 || b.is_xml_space()) {
+                    return Err(Error::NonXmlChar(b as char, s.gen_text_pos()));
+                }
+            }
+        } else {
+            for ch in value.as_str().chars() {
+                if !ch.is_xml_char() {
+                    return Err(Error::NonXmlChar(ch, s.gen_text_pos()));
+                }
+            }
+        }
+
         let end = s.pos();
         events.token(Token::Attribute(start..end, qname_len, eq_len, prefix, local, value))?;
```

yields the following performance improvement
This is not quite ready as the reported error positions are not correct (the stream is already advanced to the closing quote before the validation runs), but I would say it does look promising.
But we do not validate UTF-8, we validate "XML char", which is different, no? Or are we talking about different things again? In fact, having an optimized
Yes, that was my original suggestion for
In my opinion,
Well, the two for loops above are basically it, I guess: check bytes if there is only ASCII, check code points otherwise, with some
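Factored out as a free-standing helper, that two-path check might look roughly like the following sketch; the function name and error shape are illustrative, not this crate's API:

```rust
/// Validate that every character of `s` is an XML 1.0 `Char`.
/// Returns the first offending character on failure.
fn validate_xml_chars(s: &str) -> Result<(), char> {
    if s.is_ascii() {
        // Fast path: only the raw bytes need checking.
        // ASCII XML chars are 0x20..=0x7F plus tab, LF and CR.
        for &b in s.as_bytes() {
            if b < 0x20 && b != b'\t' && b != b'\n' && b != b'\r' {
                return Err(b as char);
            }
        }
    } else {
        // Slow path: decode code points and apply the full `Char` production.
        for c in s.chars() {
            let ok = matches!(c,
                '\u{9}' | '\u{A}' | '\u{D}'
                | '\u{20}'..='\u{D7FF}'
                | '\u{E000}'..='\u{FFFD}'
                | '\u{10000}'..='\u{10FFFF}');
            if !ok {
                return Err(c);
            }
        }
    }
    Ok(())
}
```

The ASCII fast path never decodes code points, which is what makes it cheap; the full `Char` production only runs when the value actually contains non-ASCII bytes.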
I agree.
If you think the block-wise code is write-only, then I don't think this is a particularly generous assessment. It does not mesh well with how

I also don't think we have a valid number on the performance improvement yet. We just know that much more significant bottlenecks exist that need addressing first before this can even be measured. Please keep in mind the -90% benchmark comparison against a variant of this library which works this way exclusively.
Got it. Makes sense.
The current code looks fine to me. The original was way too low-level for my taste and all low-level code should be in Stream, ideally. Overall, the latest version of this patch looks fine to me. I'm fine with merging it.
I completely agree. This is why I made this an open to-do item in the cover letter and left the MR in draft status. It is not mergeable without better integration with

I'll probably look into the much more relevant
This applies the technique which significantly improved `process_cdata` to text processing and attribute value normalization as well. This allows dropping the byte-by-byte processing via `NextChunk`, `CharToBytes` and `TextBuffer`.

Still to be resolved: better integration with `Stream`, i.e. fewer calls to `Stream::as_str`.