Block-wise text processing and attribute normalization using memchr and avoiding UTF-8 verification. #134

Draft: adamreichold wants to merge 3 commits into master from block-wise-normalize-attribute

Conversation

adamreichold
Collaborator

@adamreichold commented Jan 14, 2025

This applies the same technique that significantly improved process_cdata to text processing and attribute value normalization. This allows dropping the byte-by-byte processing via NextChunk, CharToBytes and TextBuffer.
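For illustration, here is a minimal sketch of the block-wise idea. This is not the PR's actual code; entity expansion and '\r' handling are elided, and the helper name is made up:

use memchr::memchr3;

// Sketch only: find the next byte that needs special handling and copy
// the clean run before it in one bulk push_str instead of char by char.
fn normalize_block_wise(value: &str, out: &mut String) {
    let bytes = value.as_bytes();
    let mut pos = 0;
    while let Some(offset) = memchr3(b'&', b'\t', b'\n', &bytes[pos..]) {
        // Everything up to the match needs no normalization; copy it in bulk.
        // Slicing is safe: memchr3 matched an ASCII byte, i.e. a char boundary.
        out.push_str(&value[pos..pos + offset]);
        match bytes[pos + offset] {
            // White space is normalized to a single space.
            b'\t' | b'\n' => out.push(' '),
            // A real implementation would expand the entity reference here.
            b'&' => out.push('&'),
            _ => unreachable!(),
        }
        pos += offset + 1;
    }
    out.push_str(&value[pos..]);
}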

Still to be resolved:

  • Better integration with Stream, i.e. fewer calls to Stream::as_str.
  • Synthesize benchmark stressing attribute normalization.
  • Synthesize benchmark stressing text processing.
  • Rerun the benchmarks and compare the results to master.

@adamreichold force-pushed the block-wise-normalize-attribute branch 2 times, most recently from 1761229 to 30f4379 on January 14, 2025 23:05
@RazrFalcon
Owner

RazrFalcon commented Jan 15, 2025

10% is a good speed-up.

I would suggest moving all/most of the memchr code to the Stream itself. Right now you're basically avoiding it.
It should make the code much easier to read and will reduce code duplication.

Also note that attribute normalization is a slow path to begin with. We're skipping it in most cases, so the performance benefits would not be that drastic.
And the normalization logic isn't performance-friendly either: we still have to analyze each character after all.

If you're looking for easy performance wins from memchr, I would suggest starting with the tokenizer itself. This is where most of the string scanning is done: skip_spaces, consume_qname, gen_text_pos.
On the other hand, we have to call is_xml_char almost always... Maybe splitting token parsing and validation into two steps might help in some cases, i.e. quickly find the token end via memchr and then validate its content after. Might be more cache-friendly. But error processing should not suffer.

Either way, performance optimizations are always welcome, but it's a very slow path of trial and error.

@adamreichold
Collaborator Author

10% is a good speed-up.

This is a synthetic XML file which basically consists of attribute values that need normalization. The other benchmarks are flat.

I would suggest moving all/most of the memchr code to the Stream itself. Right now you're basically avoiding it. It should make the code much easier to read and will reduce code duplication.

Will do.

Also note that attribute normalization is a slow path to begin with. We're skipping it in most cases, so the performance benefits would not be that drastic. And the normalization logic isn't performance-friendly either: we still have to analyze each character after all.

I think it depends on how you look at it: with SIMD we can check blocks of characters for whether they need processing and copy them over as blocks of bytes if not. Of course, entity parsing and such will still work character by character. This might actually be the bottleneck here as well. Looking at flamegraphs, memcpy is really prominent in the implementation proposed here, and it might be that the blocks are too short to offset the cost of calling into libc. Maybe I should open-code a memcpy for small blocks?

If you're looking for easy performance wins from memchr, I would suggest starting with the tokenizer itself. This is where most of the string scanning is done: skip_spaces, consume_qname, gen_text_pos.

I agree. The reason I was looking at attribute value normalization (and text processing) is that I think the byte-by-byte processing with post-hoc UTF-8 validation of TextBuffer could be removed if those two are ported to a block-wise approach.

On the other hand, we have to call is_xml_char almost always... Maybe splitting token parsing and validation into two steps might help in some cases, i.e. quickly find the token end via memchr and then validate its content after. Might be more cache-friendly. But error processing should not suffer.

Yes, I also saw that. consume_chars in its current form does not benefit from memchr, but it might still be a win to split finding the delimiters from checking for validity.

As there is quite a bit of such post-checking in the tokenizer, what would you think about a feature similar to positions which would skip these checks? Over at $DAYJOB, we mostly parse machine-generated API responses, which we analyze using e.g. Python and lxml if anything goes wrong. So for us, optionally disabling those checks would be an easy win. (We also have positions disabled.)

@RazrFalcon
Owner

Maybe I should open-code a memcpy for small blocks?

What is "open-code"?

The reason I was looking at attribute value normalization...

Got it. I know that TextBuffer isn't super fast. I was struggling to make sure XML would be parsed correctly first. Also, it was written like 5-6 years ago. I would probably write it differently now.

what would you think about a feature similar to positions which would skip these?

Which checks exactly? I doubt there are any checks that can be classified as optional.
Also, a build feature would not work here, because some other dependency could enable it again. So it would have to be just a bool.
Overall, more options means more tests and more work. Not good. Also, if you want to bypass spec-defined behavior, you should simply fork this project.

@adamreichold
Collaborator Author

What is "open-code"?

Meaning something that is inlined directly instead of calling a function from libc.

Which checks exactly? I doubt there are any check that can be classified as optional.

I am thinking of checks like

roxmltree/src/tokenizer.rs

Lines 680 to 682 in bc7a816

if text.contains('>') && text.contains("]]>") {
return Err(Error::InvalidCharacterData(s.gen_text_pos()));
}

which act strictly as filters on the set of accepted documents. They do not influence how a document is parsed; they only reject some documents that would otherwise be parsed even though they violate the XML specification.

@RazrFalcon
Owner

Meaning something that is inlined directly instead of calling a function from libc.

I thought memchr doesn't use libc.

I am thinking of checks like

I'm skeptical about it. Also, I doubt it has a serious performance cost.

@adamreichold
Collaborator Author

adamreichold commented Jan 16, 2025

I thought memchr doesn't use libc.

Not memchr; memcpy, which is called via push_str, is hot in the profiles. One could try copying really small chunks below word size directly.

And there are abominations like wild_copy which however would actually need inline assembly. ;)

@RazrFalcon
Owner

My bad. But isn't memcpy effectively a Rust internal at this point? Can we even optimize it?
Also, what exactly are we copying here and why is it so slow?

@adamreichold
Collaborator Author

My bad. But isn't memcpy effectively a Rust internal at this point? Can we even optimize it?

memcpy is called by the standard library, so yes, it is an implementation detail of String::push_str (via ptr::copy_nonoverlapping). But we can still optimize: for example, we can try to open-code a short-copy loop which is then inlined, avoiding the overhead of calling into memcpy (which would just see that the block is short and dispatch into its corresponding variant). The above-linked wild_copy is an example of the most extreme form of open-coding a replacement for ptr::copy_nonoverlapping.
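To make "open-code" concrete, a hypothetical byte-oriented helper (illustrative only, not code from this PR; the real code works with String rather than Vec<u8>):

// Short runs at or below word size are copied in an inlined byte loop,
// avoiding the call into memcpy; longer runs take the bulk-copy path
// where the call overhead is amortized.
#[inline]
fn push_block(out: &mut Vec<u8>, block: &[u8]) {
    if block.len() <= core::mem::size_of::<usize>() {
        for &b in block {
            out.push(b);
        }
    } else {
        out.extend_from_slice(block);
    }
}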

Also, what exactly are we copying here and why is it so slow?

I am copying those blocks of the attribute value which do not need normalization, i.e. which contain neither entities nor white space. My understanding of the problem is that calling into memcpy implies function call overhead, and for attribute values the blocks can be too short to amortize that overhead. Hence the idea that avoiding the function call overhead by open-coding copies of a word or less might help.

@RazrFalcon
Owner

I am copying those blocks of the attribute value which do not need normalization, i.e. which contain neither entities nor white space.

Aren't we simply referencing those? Only attributes that need normalization are allocated.

My understanding of the problem is that calling into memcpy implies function call overhead, and for attribute values the blocks can be too short to amortize that overhead.

I honestly doubt roxmltree is so fast that function calls become expensive. Optimizing memcpy feels like an unnecessary, super low-level optimization.
But you're free to prove me wrong.

In my mind, most of the time is spent on various minor checks, validation, string scanning, etc. Aka pretty high-level stuff.
XML parsing is basically death by a thousand cuts: we don't have any really slow paths, but rather a lot of tiny slowdowns.
At the end of the day, 95% of the code is simply scanning a string for characters.

@adamreichold
Collaborator Author

Aren't we simply referencing those? Only attributes that need normalization are allocated.

Whole attribute values which do not need normalization are borrowed. But here I am talking about blocks of characters which can be copied in bulk from attribute values which, overall, do need normalization. This is what this PR basically does: it finds the runs of uninteresting characters in between whitespace and entities, copies those in bulk (using memcpy), and handles only the whitespace and entities character by character.

In my mind, most of the time is spent on various minor checks, validation, string scanning, etc. Aka pretty high-level stuff.
XML parsing is basically death by a thousand cuts: we don't have any really slow paths, but rather a lot of tiny slowdowns.
At the end of the day, 95% of the code is simply scanning a string for characters.

All of this discussion is in the context of _normalize_attribute, i.e. when limiting the profile to that function, memcpy takes about a third of the runtime on this branch. Whether attribute value normalization is relevant overall is certainly a valid question and depends on the document corpus in question, but memcpy definitely is a hotspot in attribute value normalization for this code. (The other hotspot is memchr3, but that is fine/to be expected.)

@adamreichold force-pushed the block-wise-normalize-attribute branch from 30f4379 to 2a1ebd7 on January 18, 2025 18:38
@adamreichold marked this pull request as draft January 18, 2025 19:34
@adamreichold changed the title from "Block-wise attribute normalization using memchr and avoiding UTF-8 verification." to "Block-wise text processing and attribute normalization using memchr and avoiding UTF-8 verification." Jan 18, 2025
@adamreichold force-pushed the block-wise-normalize-attribute branch 2 times, most recently from 0bc5ca2 to 5a49525 on January 19, 2025 11:40
@adamreichold
Collaborator Author

I found a much larger performance issue: merging of text nodes is quadratic, and the newly added text stress test does not even finish as long as text node merging is active. So I will have to look into that first before I can even run the benchmarks added here. I guess collecting the text buffers and joining them when the node text definitely finishes should avoid this...
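Schematically (not roxmltree code), the difference between the two merge strategies:

// If every merge rebuilds the accumulated node text, the total work
// over n chunks is O(n^2) because earlier chunks are re-copied each time.
fn merge_quadratic(chunks: &[&str]) -> String {
    let mut text = String::new();
    for chunk in chunks {
        text = format!("{text}{chunk}");
    }
    text
}

// Collecting the buffers and joining them once the node text is
// definitely finished keeps the work linear.
fn merge_linear(chunks: &[&str]) -> String {
    chunks.concat()
}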

@adamreichold force-pushed the block-wise-normalize-attribute branch from 5a49525 to 00695d3 on January 19, 2025 15:21
@RazrFalcon
Owner

Ok, if we stick to _normalize_attribute, then we're still optimizing an edge case, right? Most attributes either do not need normalization or are rather short, so the scanning and memcpy overhead doesn't matter.
Unless you have some bizarre, real-world XMLs that have absurdly long attributes.

For a real-world case I can think of SVG, the reason why roxmltree exists to begin with.
In SVG we can have absurdly long path attributes; they can have thousands of characters.
And there are SVG images stored as base64 and split into blocks of N characters, each ending with a newline.
Both can be used for testing.

Here is an example of real-world SVG with long paths, but unfortunately without new-lines, https://en.wikipedia.org/wiki/File:Jupiter_diagram.svg

@adamreichold force-pushed the block-wise-normalize-attribute branch from 00695d3 to 9c5d8a3 on January 20, 2025 16:01
@adamreichold force-pushed the block-wise-normalize-attribute branch from 9c5d8a3 to 5a30a42 on January 20, 2025 18:31
@adamreichold
Collaborator Author

For a real-world case I can think of SVG, the reason why roxmltree exists to begin with.
In SVG we can have absurdly long path attributes; they can have thousands of characters.
And there are SVG images stored as base64 and split into blocks of N characters, each ending with a newline.
Both can be used for testing.

So I added the Jupiter diagram as a benchmark and got the following results for the relevant part of the suite:

 name                  master ns/iter  block-wise ns/iter  diff ns/iter   diff %  speedup 
 attributes_roxmltree  5,306,651       3,824,009             -1,482,642  -27.94%   x 1.39 
 cdata_roxmltree       206,585         203,130                   -3,455   -1.67%   x 1.02 
 huge_roxmltree        4,909,342       4,895,234                -14,108   -0.29%   x 1.00 
 jupiter_roxmltree     2,887,725       2,885,374                 -2,351   -0.08%   x 1.00 
 large_roxmltree       2,116,274       1,967,409               -148,865   -7.03%   x 1.08 
 medium_roxmltree      575,326         565,669                   -9,657   -1.68%   x 1.02 
 text_roxmltree        5,464,843       3,062,220             -2,402,623  -43.97%   x 1.78 
 tiny_roxmltree        3,991           4,063                         72    1.80%   x 0.98 

meaning the synthetic benchmarks see significant improvements while the realistic files are mostly flat.

So in summary, to me it looks like the block-wise approach can help, but will not significantly improve real-world XML files. So in the end, I think the question is whether the code becomes more or less maintainable this way.

I do like that the whole of TextBuffer, CharToBytes and ParseNextChunk is gone. I also really like that the UTF-8 validation of already known-valid UTF-8 is gone. But from a bird's-eye point of view, block-wise is more complex than byte-by-byte.

Of course, for now the code is still badly integrated with the Stream type, so it might look better after that is done.

@adamreichold
Collaborator Author

As an aside, note that for the Jupiter SVG, basically all of the time is spent in char::is_xml_char and Chars::next:

[flamegraph screenshot]

I have been experimenting with some larger refactorings in a personal rewrite; among other things, it does not check the characters and just memchrs for the delimiters. The result is interesting:

 name     master ns/iter  rewrite ns/iter  diff ns/iter   diff %  speedup 
 jupiter  2,887,725       221,525            -2,666,200  -92.33%  x 13.04 

So I guess, in this particular case, using memchr to look for the closing quote would really be a boon. Of course, we need to be clever if we do want to unconditionally apply is_xml_char. Maybe calling [u8]::is_ascii first and then at least using the u8::is_xml_char variant?

@adamreichold
Collaborator Author

Indeed, the following diff

diff --git a/src/tokenizer.rs b/src/tokenizer.rs
index e1d8c53..a0597bd 100644
--- a/src/tokenizer.rs
+++ b/src/tokenizer.rs
@@ -1,6 +1,8 @@
 use core::ops::Range;
 use core::str;
 
+use memchr::memchr2;
+
 use crate::{Error, TextPos};
 
 type Result<T> = core::result::Result<T, Error>;
@@ -570,9 +572,27 @@ fn parse_element<'input>(s: &mut Stream<'input>, events: &mut impl XmlEvents<'in
                 let quote_c = quote as char;
                 // The attribute value must not contain the < character.
                 let value_start = s.pos();
-                s.skip_chars(|_, c| c != quote_c && c != '<')?;
+                let Some(pos) = memchr2(quote, b'<', s.as_str().as_bytes()) else {
+                    return Err(Error::UnexpectedEndOfStream);
+                };
+                s.advance(pos);
                 let value = s.slice_back_span(value_start);
                 s.consume_byte(quote)?;
+
+                if value.as_str().as_bytes().is_ascii() {
+                    for &b in value.as_str().as_bytes() {
+                        if !(b >= 0x20 || b.is_xml_space()) {
+                            return Err(Error::NonXmlChar(b as char, s.gen_text_pos()));
+                        }
+                    }
+                } else {
+                    for ch in value.as_str().chars() {
+                        if !ch.is_xml_char() {
+                            return Err(Error::NonXmlChar(ch, s.gen_text_pos()));
+                        }
+                    }
+                }
+
                 let end = s.pos();
                 events.token(Token::Attribute(start..end, qname_len, eq_len, prefix, local, value))?;
             }

yields the following performance improvement:

 name               before ns/iter  after ns/iter  diff ns/iter   diff %  speedup 
 jupiter_roxmltree  2,885,376       1,210,744        -1,674,632  -58.04%   x 2.38 

This is not quite ready as the reported error positions are not correct (the stream is already advanced to the closing quote before the validation runs), but I would say it does look promising.

@RazrFalcon
Owner

I also really like that the UTF-8 validation of already known valid UTF-8 is gone.

But we do not validate UTF-8, we validate "XML char", which is different, no? Or are we talking about different things again?

In fact, having an optimized is_xml_char would be nice. Checking each byte is slow, so maybe something like str::is_xml_str would work.
But we have to split scanning and validation first.

Indeed, the following diff

Yes, that was my original suggestion for memchr. Scan first, validate later. Having non-XML characters is an extreme edge case.

So in the end, I think the question is whether the code becomes more or less maintainable this way.

In my opinion, roxmltree's code is already very complicated. Making it nicer, easier to read and follow should be a priority.
And writing write-only code to get a 1% perf boost for an edge case is not ideal.

@adamreichold
Collaborator Author

But we do not validate UTF-8, we validate "XML char", which is different, no? Or are we talking about different things again?

TextBuffer builds the resulting string byte by byte and hence has to revalidate UTF-8 in its finish method. That goes away because the block-wise code proposed here copies only &str and char. (It is also possible to write a character-by-character variant of TextBuffer, but it is prohibitively slow. I tried.)

In fact, having an optimized is_xml_char would be nice. Checking each byte is slow, so maybe something like str::is_xml_str would work.

Well, the two for loops above are basically it, I guess: check bytes if there is only ASCII, check code points otherwise, with some #[cold] thrown in.
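Spelled out, the helper could look like this (is_xml_str is the hypothetical name from above; the ranges follow the XML 1.0 Char production):

// ASCII fast path checks bytes; the full code-point check is moved
// out of line via #[cold] since non-ASCII input is the rare case.
fn is_xml_str(text: &str) -> bool {
    if text.is_ascii() {
        // Valid ASCII XML chars are 0x20 and above, plus tab, LF and CR.
        text.bytes()
            .all(|b| b >= 0x20 || b == b'\t' || b == b'\n' || b == b'\r')
    } else {
        is_xml_str_slow(text)
    }
}

#[cold]
fn is_xml_str_slow(text: &str) -> bool {
    // XML 1.0 `Char` production.
    text.chars().all(|c| {
        matches!(c,
            '\u{9}' | '\u{A}' | '\u{D}'
            | '\u{20}'..='\u{D7FF}'
            | '\u{E000}'..='\u{FFFD}'
            | '\u{10000}'..='\u{10FFFF}')
    })
}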

Making it nicer, easier to read and follow should be a priority.

I agree.

And writing write-only code to get a 1% perf boost for an edge case is not ideal.

If you think the block-wise code is write-only, then I don't think that is a particularly generous assessment. It does not mesh well with how Stream is currently structured, but that can be changed now that Stream is an implementation detail of this crate. On the plus side, there are significantly fewer intermediate abstractions like TextBuffer, CharToBytes and NextChunk involved.

I also don't think we have a valid number on the performance improvement yet. We just know that much more significant bottlenecks exist that need addressing before this can even be measured. Please keep in mind the -90% benchmark comparison against a variant of this library which works this way exclusively.

@RazrFalcon
Owner

TextBuffer builds the resulting string byte by byte and hence has to revalidate UTF-8 in its finish method.

Got it. Makes sense.

If you think the block-wise code is write-only

The current code looks fine to me. The original was way too low-level for my taste, and all low-level code should ideally be in Stream.

Overall, the latest version of this patch looks fine to me. I'm fine with merging it.
But I would prefer to move more code, specifically the memchr calls, to Stream somehow.

roxmltree, in my mind, has 4 layers: low-level string processing (Stream), XML tokenizing, tree construction (parser.rs) and public API (lib.rs).
I don't want to mix them much. Each layer should do its own thing. Otherwise we would end up with spaghetti.
Basically, I want to avoid low-level string tokenizing (via memchr and friends) in the tree construction code. It's way too complicated already.
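As a sketch of what such a Stream-level helper could look like (hypothetical; it assumes the existing as_str/advance internals used in the diff above, and is not actual roxmltree code):

use memchr::memchr2;

// Hypothetical helper: the memchr call lives inside Stream, so the
// tokenizer and tree construction never touch raw bytes directly.
fn skip_until2<'input>(s: &mut Stream<'input>, a: u8, b: u8) -> Result<&'input str> {
    let tail = s.as_str();
    let pos = memchr2(a, b, tail.as_bytes()).ok_or(Error::UnexpectedEndOfStream)?;
    s.advance(pos);
    // ASCII needles guarantee that `pos` lies on a char boundary.
    Ok(&tail[..pos])
}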

@adamreichold
Collaborator Author

Overall, the latest version of this patch looks fine to me. I'm fine with merging it.
But I would prefer to move more code, specifically the memchr calls, to Stream somehow.

I completely agree. This is why I made this an open to-do item in the cover letter and left the PR in draft status. It is not mergeable without better integration with Stream.

I'll probably look into the much more relevant parse_element optimization first though, as a separate PR. This code is not going anywhere after all...
