Block-wise text processing and attribute normalization using memchr and avoiding UTF-8 verification. #134

Draft: adamreichold wants to merge 3 commits into master from block-wise-normalize-attribute

Conversation

adamreichold
Collaborator

@adamreichold commented Jan 14, 2025

This applies the same technique that significantly improved process_cdata to text processing and attribute value normalization. This allows dropping the byte-by-byte processing via NextChunk, CharToBytes and TextBuffer.
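For illustration, here is a minimal sketch of the block-wise idea. This is not the PR's actual code; entity expansion and '\r' handling are elided, and the helper name is made up:

use memchr::memchr3;

// Sketch only: find the next byte that needs special handling and copy
// the clean run before it in one bulk push_str instead of char by char.
fn normalize_block_wise(value: &str, out: &mut String) {
    let bytes = value.as_bytes();
    let mut pos = 0;
    while let Some(offset) = memchr3(b'&', b'\t', b'\n', &bytes[pos..]) {
        // Everything up to the match needs no normalization; copy it in bulk.
        // Slicing is safe: memchr3 matched an ASCII byte, i.e. a char boundary.
        out.push_str(&value[pos..pos + offset]);
        match bytes[pos + offset] {
            // White space is normalized to a single space.
            b'\t' | b'\n' => out.push(' '),
            // A real implementation would expand the entity reference here.
            b'&' => out.push('&'),
            _ => unreachable!(),
        }
        pos += offset + 1;
    }
    out.push_str(&value[pos..]);
}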

Still to be resolved:

  • Better integration with Stream, i.e. fewer calls to Stream::as_str.
  • Synthesize benchmark stressing attribute normalization.
  • Synthesize benchmark stressing text processing.
  • Rerun the benchmarks and compare the results to master.

@adamreichold force-pushed the block-wise-normalize-attribute branch 2 times, most recently from 1761229 to 30f4379 on January 14, 2025 23:05
@RazrFalcon
Owner

RazrFalcon commented Jan 15, 2025

10% is a good speed-up.

I would suggest moving all/most of the memchr code to the Stream itself. Right now you're basically avoiding it.
It should make the code much easier to read and will reduce code duplication.

Also note that attribute normalization is a slow path to begin with. We're skipping it in most cases, so the performance benefits would not be that drastic.
And the normalization logic isn't performance-friendly either: we still have to analyze each character after all.

If you're looking for easy performance wins from memchr, I would suggest starting with the tokenizer itself. This is where most of the string scanning is done: skip_spaces, consume_qname, gen_text_pos.
On the other hand, we have to call is_xml_char almost always... Maybe splitting token parsing and validation into two steps might help in some cases, i.e. quickly find the token end via memchr and then validate its content after. Might be more cache-friendly. But error processing should not suffer.

Either way, performance optimizations are always welcome, but it's a very slow path of trial and error.

@adamreichold
Collaborator Author

10% is a good speed-up.

This is a synthetic XML file which basically consists of attribute values that need normalization. The other benchmarks are flat.

I would suggest moving all/most of the memchr code to the Stream itself. Right now you're basically avoiding it. It should make the code much easier to read and will reduce code duplication.

Will do.

Also note that attribute normalization is a slow path to begin with. We're skipping it in most cases, so the performance benefits would not be that drastic. And the normalization logic isn't performance-friendly either: we still have to analyze each character after all.

I think it depends on how you look at it: with SIMD we can check blocks of characters for whether they need processing and copy them over as blocks of bytes if not. Of course, entity parsing and such will still work character by character. This might actually be the bottleneck here as well. Looking at flamegraphs, memcpy is really prominent in the implementation proposed here, and it might be that the blocks are too short to offset the cost of calling into libc. Maybe I should open-code a memcpy for small blocks?

If you're looking for easy performance wins from memchr, I would suggest starting with the tokenizer itself. This is where most of the string scanning is done: skip_spaces, consume_qname, gen_text_pos.

I agree. The reason I was looking at attribute value normalization (and text processing) is that I think the byte-by-byte processing with post-hoc UTF-8 validation of TextBuffer could be removed if those two are ported to a block-wise approach.

On the other hand, we have to call is_xml_char almost always... Maybe splitting token parsing and validation into two steps might help in some cases, i.e. quickly find the token end via memchr and then validate its content after. Might be more cache-friendly. But error processing should not suffer.

Yes, I also saw that. consume_chars in its current form does not benefit from memchr, but it might still be a win to split finding the delimiters from checking for validity.

As there is quite a bit of such post-checking in the tokenizer, what would you think about a feature similar to positions which would skip these checks? Over at $DAYJOB, we mostly parse machine-generated API responses, which we analyze using e.g. Python and lxml if anything goes wrong. So for us, optionally disabling those checks would be an easy win. (We also have positions disabled.)

@RazrFalcon
Owner

Maybe I should open-code a memcpy for small blocks?

What is "open-code"?

The reason I was looking at attribute value normalization...

Got it. I know that TextBuffer isn't super fast. I was struggling to make sure XML would be parsed correctly first. Also, it was written like 5-6 years ago. I would probably write it differently now.

what would you think about a feature similar to positions which would skip these?

Which checks exactly? I doubt there are any checks that can be classified as optional.
Also, a build feature would not work here, because some other dependency could enable it again. So it would have to be just a bool.
Overall, more options means more tests and more work. Not good. Also, if you want to bypass spec-defined behavior, you should simply fork this project.

@adamreichold
Collaborator Author

What is "open-code"?

Meaning something that is inlined directly instead of calling a function from libc.

Which checks exactly? I doubt there are any check that can be classified as optional.

I am thinking of checks like

roxmltree/src/tokenizer.rs

Lines 680 to 682 in bc7a816

if text.contains('>') && text.contains("]]>") {
return Err(Error::InvalidCharacterData(s.gen_text_pos()));
}

which act strictly as filters on the set of accepted documents. They do not influence how a document is parsed; they only reject some documents that would otherwise be parsed even though they violate the XML specification.

@RazrFalcon
Owner

Meaning something that is inlined directly instead of calling a function from libc.

I thought memchr doesn't use libc.

I am thinking of checks like

I'm skeptical about it. Also, I doubt it has a serious performance cost.

@adamreichold
Collaborator Author

adamreichold commented Jan 16, 2025

I thought memchr doesn't use libc.

Not memchr; memcpy, which is called via push_str, is hot in the profiles. One could try copying really small chunks below word size directly.

And there are abominations like wild_copy which however would actually need inline assembly. ;)

@RazrFalcon
Owner

My bad. But isn't memcpy effectively a Rust internal at this point? Can we even optimize it?
Also, what exactly are we copying here and why is it so slow?

@adamreichold
Collaborator Author

My bad. But isn't memcpy effectively a Rust internal at this point? Can we even optimize it?

memcpy is called by the standard library, so yes, it is an implementation detail of String::push_str (via ptr::copy_nonoverlapping). But we can still optimize: for example, we can try to open-code a short-copy loop which is then inlined, avoiding the overhead of calling into memcpy (which would just see that the block is short and dispatch into its corresponding variant). The above-linked wild_copy is an example of the most extreme form of open-coding a replacement for ptr::copy_nonoverlapping.
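To make "open-code" concrete, a hypothetical byte-oriented helper (illustrative only, not code from this PR; the real code works with String rather than Vec<u8>):

// Short runs at or below word size are copied in an inlined byte loop,
// avoiding the call into memcpy; longer runs take the bulk-copy path
// where the call overhead is amortized.
#[inline]
fn push_block(out: &mut Vec<u8>, block: &[u8]) {
    if block.len() <= core::mem::size_of::<usize>() {
        for &b in block {
            out.push(b);
        }
    } else {
        out.extend_from_slice(block);
    }
}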

Also, what exactly are we copying here and why is it so slow?

I am copying those blocks of the attribute value which do not need normalization, i.e. which contain neither entities nor white space. My understanding of the problem is that calling into memcpy implies function call overhead, and for attribute values the blocks can be too short to amortize that overhead. Hence the idea that avoiding the function call overhead by open-coding copies of a word or less might help.

@RazrFalcon
Owner

I am copying those blocks of the attribute value which do not need normalization, i.e. which contain neither entities nor white space.

Aren't we simply referencing those? Only attributes that need normalization are allocated.

My understanding of the problem is that calling into memcpy implies function call overhead, and for attribute values the blocks can be too short to amortize that overhead.

I honestly doubt roxmltree is so fast that function calls become expensive. Optimizing memcpy feels like an unnecessary, super low-level optimization.
But you're free to prove me wrong.

In my mind, most of the time is spent on various minor checks, validation, string scanning, etc. Aka pretty high-level stuff.
XML parsing is basically death by a thousand cuts: we don't have any really slow paths, but rather a lot of tiny slowdowns.
At the end of the day, 95% of the code is simply scanning a string for characters.

@adamreichold
Collaborator Author

Aren't we simply referencing those? Only attributes that need normalization are allocated.

Whole attribute values which do not need normalization are borrowed. But here I am talking about blocks of characters which can be copied in bulk from attribute values which, overall, do need normalization. This is what this PR basically does: it finds the runs of uninteresting characters in between whitespace and entities, copies those in bulk (using memcpy), and handles only the whitespace and entities character by character.

In my mind, most of the time is spent on various minor checks, validation, string scanning, etc. Aka pretty high-level stuff.
XML parsing is basically death by a thousand cuts: we don't have any really slow paths, but rather a lot of tiny slowdowns.
At the end of the day, 95% of the code is simply scanning a string for characters.

All of this discussion is in the context of _normalize_attribute, i.e. when limiting the profile to that function, memcpy takes about a third of the runtime on this branch. Whether attribute value normalization is relevant overall is certainly a valid question and depends on the document corpus in question, but memcpy definitely is a hotspot in attribute value normalization for this code. (The other hotspot is memchr3, but that is fine/to be expected.)

@adamreichold force-pushed the block-wise-normalize-attribute branch from 30f4379 to 2a1ebd7 on January 18, 2025 18:38
@adamreichold marked this pull request as draft January 18, 2025 19:34
@adamreichold changed the title from "Block-wise attribute normalization using memchr and avoiding UTF-8 verification." to "Block-wise text processing and attribute normalization using memchr and avoiding UTF-8 verification." Jan 18, 2025
@adamreichold force-pushed the block-wise-normalize-attribute branch 2 times, most recently from 0bc5ca2 to 5a49525 on January 19, 2025 11:40
@adamreichold
Collaborator Author

I found a much larger performance issue: merging of text nodes is quadratic, and the newly added text stress test does not even finish as long as text node merging is active. So I will have to look into that first before I can even run the benchmarks added here. I guess collecting the text buffers and joining them when the node text definitely finishes should avoid this...
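Schematically (not roxmltree code), the difference between the two merge strategies:

// If every merge rebuilds the accumulated node text, the total work
// over n chunks is O(n^2) because earlier chunks are re-copied each time.
fn merge_quadratic(chunks: &[&str]) -> String {
    let mut text = String::new();
    for chunk in chunks {
        text = format!("{text}{chunk}");
    }
    text
}

// Collecting the buffers and joining them once the node text is
// definitely finished keeps the work linear.
fn merge_linear(chunks: &[&str]) -> String {
    chunks.concat()
}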

@adamreichold force-pushed the block-wise-normalize-attribute branch from 5a49525 to 00695d3 on January 19, 2025 15:21
@RazrFalcon
Owner

Ok, if we stick to _normalize_attribute, then we're still optimizing an edge case, right? Most attributes either do not need normalization or are rather short, so the scanning and memcpy overhead doesn't matter.
Unless you have some bizarre, real-world XMLs that have absurdly long attributes.

For a real-world case I can think of SVG, the reason why roxmltree exists to begin with.
In SVG we can have absurdly long path attributes; they can have thousands of characters.
And there are SVG images stored as base64 and split into blocks of N characters, each ending with a newline.
Both can be used for testing.

Here is an example of real-world SVG with long paths, but unfortunately without new-lines, https://en.wikipedia.org/wiki/File:Jupiter_diagram.svg

@adamreichold force-pushed the block-wise-normalize-attribute branch from 00695d3 to 9c5d8a3 on January 20, 2025 16:01
@adamreichold force-pushed the block-wise-normalize-attribute branch from 9c5d8a3 to 5a30a42 on January 20, 2025 18:31
@adamreichold
Collaborator Author

For a real-world case I can think of SVG, the reason why roxmltree exists to begin with.
In SVG we can have absurdly long path attributes; they can have thousands of characters.
And there are SVG images stored as base64 and split into blocks of N characters, each ending with a newline.
Both can be used for testing.

So I added the Jupiter diagram as a benchmark and got the following results for the relevant part of the suite:

 name                  master ns/iter  block-wise ns/iter  diff ns/iter   diff %  speedup 
 attributes_roxmltree  5,306,651       3,824,009             -1,482,642  -27.94%   x 1.39 
 cdata_roxmltree       206,585         203,130                   -3,455   -1.67%   x 1.02 
 huge_roxmltree        4,909,342       4,895,234                -14,108   -0.29%   x 1.00 
 jupiter_roxmltree     2,887,725       2,885,374                 -2,351   -0.08%   x 1.00 
 large_roxmltree       2,116,274       1,967,409               -148,865   -7.03%   x 1.08 
 medium_roxmltree      575,326         565,669                   -9,657   -1.68%   x 1.02 
 text_roxmltree        5,464,843       3,062,220             -2,402,623  -43.97%   x 1.78 
 tiny_roxmltree        3,991           4,063                         72    1.80%   x 0.98 

meaning the synthetic benchmarks see significant improvements while the realistic files are mostly flat.

So in summary, to me it looks like the block-wise approach can help, but will not significantly improve real-world XML files. So in the end, I think the question is whether the code becomes more or less maintainable this way.

I do like that the whole of TextBuffer, CharToBytes and ParseNextChunk is gone. I also really like that the UTF-8 validation of already known-valid UTF-8 is gone. But from a bird's-eye point of view, block-wise is more complex than byte-by-byte.

Of course, for now the code is still badly integrated with the Stream type, so it might look better after that is done.

@adamreichold
Collaborator Author

As an aside, note that for the Jupiter SVG, basically all of the time is spent in char::is_xml_char and Chars::next:

[flamegraph screenshot]

I have been experimenting with some larger refactorings in a personal rewrite; among other things, it does not check the characters and just memchrs for the delimiters. The result is interesting:

 name     master ns/iter  rewrite ns/iter  diff ns/iter   diff %  speedup 
 jupiter  2,887,725       221,525            -2,666,200  -92.33%  x 13.04 

So I guess, in this particular case, using memchr to look for the closing quote would really be a boon. Of course, we need to be clever if we do want to unconditionally apply is_xml_char. Maybe calling [u8]::is_ascii first and then at least using the u8::is_xml_char variant?

@adamreichold
Collaborator Author

Indeed, the following diff

diff --git a/src/tokenizer.rs b/src/tokenizer.rs
index e1d8c53..a0597bd 100644
--- a/src/tokenizer.rs
+++ b/src/tokenizer.rs
@@ -1,6 +1,8 @@
 use core::ops::Range;
 use core::str;
 
+use memchr::memchr2;
+
 use crate::{Error, TextPos};
 
 type Result<T> = core::result::Result<T, Error>;
@@ -570,9 +572,27 @@ fn parse_element<'input>(s: &mut Stream<'input>, events: &mut impl XmlEvents<'in
                 let quote_c = quote as char;
                 // The attribute value must not contain the < character.
                 let value_start = s.pos();
-                s.skip_chars(|_, c| c != quote_c && c != '<')?;
+                let Some(pos) = memchr2(quote, b'<', s.as_str().as_bytes()) else {
+                    return Err(Error::UnexpectedEndOfStream);
+                };
+                s.advance(pos);
                 let value = s.slice_back_span(value_start);
                 s.consume_byte(quote)?;
+
+                if value.as_str().as_bytes().is_ascii() {
+                    for &b in value.as_str().as_bytes() {
+                        if !(b >= 0x20 || b.is_xml_space()) {
+                            return Err(Error::NonXmlChar(b as char, s.gen_text_pos()));
+                        }
+                    }
+                } else {
+                    for ch in value.as_str().chars() {
+                        if !ch.is_xml_char() {
+                            return Err(Error::NonXmlChar(ch, s.gen_text_pos()));
+                        }
+                    }
+                }
+
                 let end = s.pos();
                 events.token(Token::Attribute(start..end, qname_len, eq_len, prefix, local, value))?;
             }

yields the following performance improvement:

 name               before ns/iter  after ns/iter  diff ns/iter   diff %  speedup 
 jupiter_roxmltree  2,885,376       1,210,744        -1,674,632  -58.04%   x 2.38 

This is not quite ready as the reported error positions are not correct (the stream is already advanced to the closing quote before the validation runs), but I would say it does look promising.

@RazrFalcon
Owner

I also really like that the UTF-8 validation of already known valid UTF-8 is gone.

But we do not validate UTF-8, we validate "XML char", which is different, no? Or are we talking about different things again?

In fact, having an optimized is_xml_char would be nice. Checking each byte is slow, so maybe something like str::is_xml_str would work.
But we have to split scanning and validation first.

Indeed, the following diff

Yes, that was my original suggestion for memchr. Scan first, validate later. Having non-XML characters is an extreme edge case.

So in the end, I think the question is whether the code becomes more or less maintainable this way.

In my opinion, roxmltree's code is already very complicated. Making it nicer, easier to read and follow should be a priority.
And writing write-only code to get a 1% perf boost for an edge case is not ideal.

@adamreichold
Collaborator Author

But we do not validate UTF-8, we validate "XML char", which is different, no? Or are we talking about different things again?

TextBuffer builds the resulting string byte by byte and hence has to revalidate UTF-8 in its finish method. That goes away because the block-wise code proposed here copies only &str and char. (It is also possible to write a character-by-character variant of TextBuffer, but it is prohibitively slow. I tried.)

In fact, having an optimized is_xml_char would be nice. Checking each byte is slow, so maybe something like str::is_xml_str would work.

Well, the two for loops above are basically it, I guess: check bytes if there is only ASCII, check code points otherwise, with some #[cold] thrown in.
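Spelled out, the helper could look like this (is_xml_str is the hypothetical name from above; the ranges follow the XML 1.0 Char production):

// ASCII fast path checks bytes; the full code-point check is moved
// out of line via #[cold] since non-ASCII input is the rare case.
fn is_xml_str(text: &str) -> bool {
    if text.is_ascii() {
        // Valid ASCII XML chars are 0x20 and above, plus tab, LF and CR.
        text.bytes()
            .all(|b| b >= 0x20 || b == b'\t' || b == b'\n' || b == b'\r')
    } else {
        is_xml_str_slow(text)
    }
}

#[cold]
fn is_xml_str_slow(text: &str) -> bool {
    // XML 1.0 `Char` production.
    text.chars().all(|c| {
        matches!(c,
            '\u{9}' | '\u{A}' | '\u{D}'
            | '\u{20}'..='\u{D7FF}'
            | '\u{E000}'..='\u{FFFD}'
            | '\u{10000}'..='\u{10FFFF}')
    })
}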

Making it nicer, easier to read and follow should be a priority.

I agree.

And writing write-only code to get a 1% perf boost for an edge case is not ideal.

If you think the block-wise code is write-only, then I don't think that is a particularly generous assessment. It does not mesh well with how Stream is currently structured, but that can be changed now that Stream is an implementation detail of this crate. On the plus side, there are significantly fewer intermediate abstractions like TextBuffer, CharToBytes and NextChunk involved.

I also don't think we have a valid number on the performance improvement yet. We just know that much more significant bottlenecks exist that need addressing before this can even be measured. Please keep in mind the -90% benchmark comparison against a variant of this library which works this way exclusively.

@RazrFalcon
Owner

TextBuffer builds the resulting string byte by byte and hence has to revalidate UTF-8 in its finish method.

Got it. Makes sense.

If you think the block-wise code is write-only

The current code looks fine to me. The original was way too low-level for my taste, and all low-level code should ideally be in Stream.

Overall, the latest version of this patch looks fine to me. I'm fine with merging it.
But I would prefer to move more code, specifically the memchr calls, to Stream somehow.

roxmltree, in my mind, has 4 layers: low-level string processing (Stream), XML tokenizing, tree construction (parser.rs) and public API (lib.rs).
I don't want to mix them much. Each layer should do its own thing. Otherwise we would end up with spaghetti.
Basically, I want to avoid low-level string tokenizing (via memchr and friends) in the tree construction code. It's way too complicated already.
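As a sketch of what such a Stream-level helper could look like (hypothetical; it assumes the existing as_str/advance internals used in the diff above, and is not actual roxmltree code):

use memchr::memchr2;

// Hypothetical helper: the memchr call lives inside Stream, so the
// tokenizer and tree construction never touch raw bytes directly.
fn skip_until2<'input>(s: &mut Stream<'input>, a: u8, b: u8) -> Result<&'input str> {
    let tail = s.as_str();
    let pos = memchr2(a, b, tail.as_bytes()).ok_or(Error::UnexpectedEndOfStream)?;
    s.advance(pos);
    // ASCII needles guarantee that `pos` lies on a char boundary.
    Ok(&tail[..pos])
}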

@adamreichold
Collaborator Author

Overall, the latest version of this patch looks fine to me. I'm fine with merging it.
But I would prefer to move more code, specifically the memchr calls, to Stream somehow.

I completely agree. This is why I made this an open to-do item in the cover letter and left the PR in draft status. It is not mergeable without better integration with Stream.

I'll probably look into the much more relevant parse_element optimization first though, as a separate PR. This code is not going anywhere after all...
