Add normalize_line_end for unescape and test #807

yorkz1994 · 2024-09-25T02:59:53Z

Regarding #806
I added a normalize_line_end function in escape module and related tests.
If unescape function is called, then line end will be normalized.

…as input, e.g. &str

Mingun

I think we should add more tests for the actual usage of this function, i.e. test what would be returned when call public API methods of Attribute and events. Now it is hard to ensure that this will not break due to internal refactorings.

Test failures should be evaluated and fixed. Also, issue #670 seems related, it would be great if you can evaluate how this change is applicable to it.

In the end, could you add a changelog entry. Do not forgot that Changelog.md is a usual markdown file, so does not forgot to update section with links -- they are not linked automatically to the GH PRs/issues.

src/escape.rs

Co-authored-by: Mingun <[email protected]>

yorkz1994 · 2024-09-26T01:22:53Z

I think the failed test is because we did the line end normalization after the unescape, but unescape can contain \r, so these new \rs should not be normalized. We have to do the normalization only for data in the not unescape parts.

codecov-commenter · 2024-09-26T07:55:42Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 60.36%. Comparing base (3ebe221) to head (d3ee1ad).
Report is 3 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #807      +/-   ##
==========================================
+ Coverage   60.17%   60.36%   +0.18%     
==========================================
  Files          41       41              
  Lines       15958    15980      +22     
==========================================
+ Hits         9603     9646      +43     
+ Misses       6355     6334      -21

Flag	Coverage Δ
unittests	`60.36% <100.00%> (+0.18%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Mingun

Good. But I would also like to see the tests when use API of Attribute and BytesText, which are listed here.

Changelog.md

Co-authored-by: Mingun <[email protected]>

yorkz1994 · 2024-09-26T14:08:39Z

Good. But I would also like to see the tests when use API of Attribute and BytesText, which are listed here.

Sorry, don't know what do you want me to do. I am not familiar with XML specification on attributes. What I did is just do normalization of the line end before your further unescape work. So it removes \r before unescape means it is just like no such character in your input at all, so your previous logic should continue to work. Now all your previous cases passed and the added case for testing normalization also works. I don't see anything I can do.

Mingun · 2024-09-26T14:22:01Z

I like just see tests that will call unescape family of methods on Attribute and BytesText which is get from the reader and check that the result does not contain \r. Something like:

// XML with \n \r\n and \r style newlines in various places
const XML: &str = "...";

let mut reader = Reader::from_str(XML);

match reader.read_event().unwrap() {
  Event::Start(event) => {
    let iter = event.attributes();
    let a = iter.next().unwrap();
    #[cfg(not(feature = "encoding"))]
    assert_eq!(a.unescape_value(), "...");
    assert_eq!(a.decode_and_unescape_value(), "...");
  }
  event => panic!("Expected Start, found {:?}", event),
}

match reader.read_event().unwrap() {
  Event::Text(event) => assert_eq!(event.unescape(), "..."),
  event => panic!("Expected Text, found {:?}", event),
}

yorkz1994 · 2024-09-27T00:25:06Z

Like I said, I am not familiar with XML specification, so I don't know where is the proper places to put \r in it and what should be expected after unescape if one appears. For example is it valid to put a \r in a attribute key or value, in tag name? If you know better than me how about you try it yourself.

Mingun · 2024-09-27T09:10:35Z

Just add it everywhere where spaces can occur, we are not not talking about correctness for now (this is another question, we are definitely do not process everything according to the specification). I only want to have a starting point and be sure that this feature worked as assumed when you would use actual API.

yorkz1994 · 2024-09-27T09:19:55Z

Could you provide such input data. I don't want to create such invalid XML. Normally I can think is that the \r will only appear in BytesText due to line end diffs in OS.

Mingun · 2024-09-27T09:46:58Z

<root attribute="\r\r\n\nvalue1\r\r\n\nvalue2\r\r\n\n">\r\r\n\nvalue3\r\r\n\nvalue4\r\r\n\n</root>

yorkz1994 · 2024-09-27T13:10:19Z

f970370
This commit is due to forgot to normalize if input has nothing to unescape.

I add a case from your input, don't know the result is your expected:

#[test]
fn line_ends() {
    const XML: &str = "<root attribute=\"\r\r\n\nvalue1\r\r\n\nvalue2\r\r\n\n\">\r\r\n\nvalue3\r\r\n\nvalue4\r\r\n\n</root>";
    let mut reader = Reader::from_str(XML);
    match reader.read_event().unwrap() {
        Event::Start(event) => {
            let mut iter = event.attributes();
            let a = iter.next().unwrap().unwrap();
            #[cfg(not(feature = "encoding"))]
            assert_eq!(
                a.unescape_value().unwrap(),
                "\n\n\nvalue1\n\n\nvalue2\n\n\n"
            );
            assert_eq!(
                a.decode_and_unescape_value(reader.decoder()).unwrap(),
                "\n\n\nvalue1\n\n\nvalue2\n\n\n"
            );
        }
        event => panic!("Expected Start, found {:?}", event),
    }
    match reader.read_event().unwrap() {
        Event::Text(event) => {
            assert_eq!(event.unescape().unwrap(), "\n\n\nvalue3\n\n\nvalue4\n\n\n")
        }
        event => panic!("Expected Text, found {:?}", event),
    }
}

There is a case in serde-se.rs always fail. Don't know if we should modify serde implementation?

serialize_as!(tuple:
        // Use to_string() to get owned type that is required for deserialization
        ("<\"&'>".to_string(), "with\t\r\n spaces", 3usize)
        => "<root>&lt;\"&amp;'&gt;</root>\
            <root>with\t\r\n spaces</root>\
            <root>3</root>");

---- with_root::tuple stdout ----
thread 'with_root::tuple' panicked at tests/serde-se.rs:1956:5:
deserialization roundtrip: Custom("invalid type: string \"with\\t\\n spaces\", expected a borrowed string")

yorkz1994 · 2024-09-27T13:59:13Z

The failed test is because there is \r in the roundtrip test. It is impossible to have equal serialize and deserialize due to line end normalization. Therefore I have to remove the \r in roundtrip test to get the case pass.

Mingun

I add a case from your input, don't know the result is your expected:

Yes, I think, that it is expected results. According to the XML specification XML processor will never see raw \r, it could appeared in data only if would be escaped as a character reference (note: XML specification names each "XML source" (for example "XML file") as "entity". XML is composed from several "entities" -- main document itself and any other documents loaded via DTD definitions).

Adding a link to https://www.w3.org/TR/xml11/#sec-line-ends somewhere near to the code that handles normalization also would be useful. Well, actually, it describes the 5 combinations of LOF characters, that should be normalized to \n:

\r\n
\r\u0085 (UTF-8: \r\xC2\x85)
\r
\u0085 (UTF-8: \xC2\x85)
\u2028 (UTF-8: \xE2\x80\xA8)

Strictly speaking, that means that we should look for the 5 bytes if we would like to unescape and normalize line ends in one loop: &, ;, \r, \xC2 (or \x85), \xE2 (or \xA8). Unfortunately, memchr supports searching only 3 bytes at once. But at least we can (and should, otherwise we get two line ends instead of one if user tried to normalize other line ends that we do not process) process \r\xC2\x85 in the same way as \r\n.

If it turns out that unescape and line end normalization cannot be combined in one loop, then we should normalize all possible line ends because that require searching only 3 different bytes, which memchr is supporting.

Actually, maybe using some bit tricks, like describing in this article it would be possible to work around memchr constraint, but I'm not experienced enough right now to apply them. Maybe you?

Therefore I have to remove the \r in roundtrip test to get the case pass.

That's fine, but probably should add a comment, that \r excluded from whitespace characters because we performs line end normalization.

Mingun · 2024-09-27T14:23:28Z

src/escape.rs

@@ -266,7 +266,7 @@ where
                    unescaped = Some(String::with_capacity(raw.len()));
                }
                let unescaped = unescaped.as_mut().expect("initialized");
-                unescaped.push_str(&raw[last_end..start]);
+                unescaped.push_str(&normalize_line_end(&raw[last_end..start]));


Escaping may be on the hot path, so it is not fine that we can allocate new owned strings just to get a reference and put it to already allocated string. Would it possible instead use already managed Cow and pass it to the normalize_line_end? Then upgrading Cow from borrowed to owned could be performed either in unescape, or in normalize_line_end and that upgrade always will be performed only once.

Also, do we really need two loops? memchr can search for three bytes at once and we search for three bytes: &, ; and \r. So it could be possible to perform unescaping and line end normalization in one loop.

I don't know how to do it. What if there is nothing to unescape? You still need to call normalize_line_end for such input in the end, then what is the correct function parameter for normalize_line_end so that it can handle both situations?

Hm, don't understand what the trouble? unescape will always be called when we request data via API methods. If cycling search of &, ;, or \r will not find anything, than it is really nothing to change in the data (I assume that cycles for unescaping and normalization merged in one).

Even if did not merge loops, normalize_line_end can accept &mut Cow<str> and change it instead of creating an own Cow by taking reference to &str and returning its own Cow<str>

I am talking about the not merge loops, as you point out that there are more line end situations to consider, it is beyond my knowledge to do SIMD stuff myself to merge the loops.

Current function signature is: fn normalize_line_end(input: &str) -> Cow<str>
Change to what? How to call it?

Maybe your requirement is beyond my knowledge. In that case, I cannot help. We can close this PR.

Ah, read my &mut Cow<str> as &mut Option<String>. So the whole code will look like:

fn unescape_with(...) { let mut unescaped = None; for ... { ... if need to unescape { if unescaped.is_none() { unescaped = Some(String::with_capacity(raw.len())); } let unescaped = unescaped.as_mut().expect("initialized"); } ... normalize_line_end(&mut unescaped, raw.len()); ... } } fn normalize_line_end(input: &mut Option<String>, raw_len: usize) { if need to normalize { if input.is_none() { input = Some(String::with_capacity(raw_len)) } let normalized = input.as_mut().expect("initialized"); ... } }

it is beyond my knowledge to do SIMD stuff myself to merge the loops.

Actually, SIMD and merging loops are two separate conception, it is not necessary to explicitly apply SIMD techniques if you not aware of them, it's totally fine. I only tried to be helpful if you aware of it.

Maybe your requirement is beyond my knowledge. In that case, I cannot help. We can close this PR.

The requirement in only is not to make things worse. First of all we should not make it impossible to correctly process input (that is why we should at least process \r\u0085 because if we turn it into \n\u0085, the user will see two newlines in data instead of one, because specification allows \n, \r\u0085 and \u0085 be served as (one) newline characters). The second, we should to try that without much impact to performance, so if we can avoid unnecessary allocations we should do that.

If you're having a hard time figuring out how to get there, that's fine, it could be so. Thanks for opening PR anyway. I will consider it later when I return to active phase on the project (currently working on other project) and see how to improve it.

Thanks for showing the code.
Yes, it works if you have something to unescape. But what if there is nothing to unescape from the original input?
Then you still have to call normlize_line_end to normalize the original input. How to you put original input raw str as parameter to your current unescape function?

Yes, it works if you have something to unescape. But what if there is nothing to unescape from the original input?

Unescape function decides where it is need to unescape something or not. It itself called always (well, because user requests unescaped and normalized data by demand, it will be called by demand, but that "demand" means "unesсape [and normalize] if there is anything")

How to you put original input raw str as parameter to your current unescape function?

Cannot understand what confuses you here. Calls to unescape will not be changed and it is already called whenever necessary.

pub fn unescape_with<'input, 'entity, F>( raw: &'input str, mut resolve_entity: F, ) -> Result<Cow<'input, str>, EscapeError> where // the lifetime of the output comes from a capture or is `'static` F: FnMut(&str) -> Option<&'entity str>, {

Nothing to escape means there is no & and ; in raw input. It will not go inside while below. So normalize will not happen to input here.

while let Some(start) = iter.by_ref().find(|p| bytes[*p] == b'&') { match iter.next() { Some(end) if bytes[end] == b';' => { // append valid data if unescaped.is_none() { unescaped = Some(String::with_capacity(raw.len())); } let unescaped = unescaped.as_mut().expect("initialized"); unescaped.push_str(&normalize_line_end(&raw[last_end..start])); // search for character correctness let pat = &raw[start + 1..end]; if let Some(entity) = pat.strip_prefix('#') { let codepoint = parse_number(entity).map_err(EscapeError::InvalidCharRef)?; unescaped.push_str(codepoint.encode_utf8(&mut [0u8; 4])); } else if let Some(value) = resolve_entity(pat) { unescaped.push_str(value); } else { return Err(EscapeError::UnrecognizedEntity( start + 1..end, pat.to_string(), )); } last_end = end + 1; } _ => return Err(EscapeError::UnterminatedEntity(start..raw.len())), } }

Will not go inside here either because nothing has been unescaped

if let Some(mut unescaped) = unescaped { if let Some(raw) = raw.get(last_end..) { unescaped.push_str(&normalize_line_end(raw)); } Ok(Cow::Owned(unescaped)) } else {

Must run normalize here, so how to call your changed function fn normalize_line_end(input: &mut Option<String>, raw_len: usize) here with raw as parameter?

Ok(normalize_line_end(raw)) } }

tests/reader-attributes.rs

Mingun · 2024-09-27T14:39:00Z

tests/reader-attributes.rs

+use quick_xml::events::{
+    BytesEnd,
+    Event::{self, *},
+};


Can be reverted, because of changes below.

Co-authored-by: Mingun <[email protected]>

src/escape.rs

Co-authored-by: Mingun <[email protected]>

yorkz1994 · 2024-09-29T09:10:33Z

I did more line ends normalization.
Also changed the normalization implementation to use iterator to avoid extra allocation during normalization.

z0017k9m added 2 commits September 25, 2024 10:25

Add normalize_line_end for unescape and test

4f5ce0d

Change normalize_line_end function parameter type to allow more type …

45a63c5

…as input, e.g. &str

Mingun requested changes Sep 25, 2024

View reviewed changes

src/escape.rs Outdated Show resolved Hide resolved

yorkz1994 and others added 4 commits September 25, 2024 21:51

Update src/escape.rs

1c2ddc5

Co-authored-by: Mingun <[email protected]>

change parameter type from generic to Cow

139b2cb

Add issue 806 to change log.

6962ae3

Fix regression test failure due to line end normalization

11320ad

Mingun requested changes Sep 26, 2024

View reviewed changes

Changelog.md Outdated Show resolved Hide resolved

Update Changelog.md

0bb282e

Co-authored-by: Mingun <[email protected]>

z0017k9m added 2 commits September 27, 2024 20:56

Still need to normalize if nothing unescaped

f970370

Add line ends test for attribute and text

d11c756

roundtrip test cannot include \r

d3ee1ad

Mingun requested changes Sep 27, 2024

View reviewed changes

yorkz1994 and others added 2 commits September 28, 2024 11:05

Update tests/reader-attributes.rs

cbf200f

Co-authored-by: Mingun <[email protected]>

Update tests/reader-attributes.rs

e4f83a5

Co-authored-by: Mingun <[email protected]>

Mingun reviewed Sep 28, 2024

View reviewed changes

src/escape.rs Outdated Show resolved Hide resolved

yorkz1994 and others added 4 commits September 28, 2024 17:26

Update src/escape.rs

cf99a45

Co-authored-by: Mingun <[email protected]>

Add comment why \r is not include in roundtrip test.

628dce7

Remove unused import

44df9bf

Missing b in byte literal

4e1eefc

z0017k9m added 2 commits September 29, 2024 17:05

Add some \r in unescape_text/mixed

ecfeefc

Normalizing more line ends, no allocation

d880edc

Add normalize_line_end for unescape and test #807

Are you sure you want to change the base?

Add normalize_line_end for unescape and test #807

Uh oh!

Conversation

yorkz1994 commented Sep 25, 2024

Uh oh!

Mingun left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

yorkz1994 commented Sep 26, 2024

Uh oh!

codecov-commenter commented Sep 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Mingun left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

yorkz1994 commented Sep 26, 2024

Uh oh!

Mingun commented Sep 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yorkz1994 commented Sep 27, 2024

Uh oh!

Mingun commented Sep 27, 2024

Uh oh!

yorkz1994 commented Sep 27, 2024

Uh oh!

Mingun commented Sep 27, 2024

Uh oh!

yorkz1994 commented Sep 27, 2024 • edited by Mingun Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yorkz1994 commented Sep 27, 2024

Uh oh!

Mingun left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Mingun Sep 27, 2024

Choose a reason for hiding this comment

Uh oh!

yorkz1994 Sep 28, 2024

Choose a reason for hiding this comment

Uh oh!

Mingun Sep 28, 2024

Choose a reason for hiding this comment

Uh oh!

yorkz1994 Sep 28, 2024

Choose a reason for hiding this comment

Uh oh!

Mingun Sep 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yorkz1994 Sep 28, 2024

Choose a reason for hiding this comment

Uh oh!

Mingun Sep 28, 2024

Choose a reason for hiding this comment

Uh oh!

yorkz1994 Sep 29, 2024 • edited by Mingun Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Mingun Sep 27, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

yorkz1994 commented Sep 29, 2024

Uh oh!

Uh oh!

codecov-commenter commented Sep 26, 2024 •

edited

Loading

Mingun commented Sep 26, 2024 •

edited

Loading

yorkz1994 commented Sep 27, 2024 •

edited by Mingun

Loading

Mingun left a comment •

edited

Loading

Mingun Sep 28, 2024 •

edited

Loading

yorkz1994 Sep 29, 2024 •

edited by Mingun

Loading