-
Notifications
You must be signed in to change notification settings - Fork 237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add normalize_line_end for unescape and test #807
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should add more tests for the actual usage of this function, i.e. test what would be returned when call public API methods of Attribute
and events. Now it is hard to ensure that this will not break due to internal refactorings.
Test failures should be evaluated and fixed. Also, issue #670 seems related, it would be great if you can evaluate how this change is applicable to it.
In the end, could you add a changelog entry. Do not forgot that Changelog.md
is a usual markdown file, so does not forgot to update section with links -- they are not linked automatically to the GH PRs/issues.
I think the failed test is because we did the line end normalization after the unescape, but unescape can contain |
Codecov ReportAll modified and coverable lines are covered by tests ✅
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## master #807 +/- ##
==========================================
+ Coverage 60.17% 60.36% +0.18%
==========================================
Files 41 41
Lines 15958 15980 +22
==========================================
+ Hits 9603 9646 +43
+ Misses 6355 6334 -21
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good. But I would also like to see the tests when use API of Attribute
and BytesText
, which are listed here.
Co-authored-by: Mingun <[email protected]>
Sorry, don't know what do you want me to do. I am not familiar with XML specification on attributes. What I did is just do normalization of the line end before your further unescape work. So it removes |
I like just see tests that will call // XML with \n \r\n and \r style newlines in various places
const XML: &str = "...";
let mut reader = Reader::from_str(XML);
match reader.read_event().unwrap() {
Event::Start(event) => {
let iter = event.attributes();
let a = iter.next().unwrap();
#[cfg(not(feature = "encoding"))]
assert_eq!(a.unescape_value(), "...");
assert_eq!(a.decode_and_unescape_value(), "...");
}
event => panic!("Expected Start, found {:?}", event),
}
match reader.read_event().unwrap() {
Event::Text(event) => assert_eq!(event.unescape(), "..."),
event => panic!("Expected Text, found {:?}", event),
} |
Like I said, I am not familiar with XML specification, so I don't know where is the proper places to put |
Just add it everywhere where spaces can occur, we are not not talking about correctness for now (this is another question, we are definitely do not process everything according to the specification). I only want to have a starting point and be sure that this feature worked as assumed when you would use actual API. |
Could you provide such input data. I don't want to create such invalid XML. Normally I can think is that the |
<root attribute="\r\r\n\nvalue1\r\r\n\nvalue2\r\r\n\n">\r\r\n\nvalue3\r\r\n\nvalue4\r\r\n\n</root> |
f970370 I add a case from your input, don't know the result is your expected: #[test]
fn line_ends() {
const XML: &str = "<root attribute=\"\r\r\n\nvalue1\r\r\n\nvalue2\r\r\n\n\">\r\r\n\nvalue3\r\r\n\nvalue4\r\r\n\n</root>";
let mut reader = Reader::from_str(XML);
match reader.read_event().unwrap() {
Event::Start(event) => {
let mut iter = event.attributes();
let a = iter.next().unwrap().unwrap();
#[cfg(not(feature = "encoding"))]
assert_eq!(
a.unescape_value().unwrap(),
"\n\n\nvalue1\n\n\nvalue2\n\n\n"
);
assert_eq!(
a.decode_and_unescape_value(reader.decoder()).unwrap(),
"\n\n\nvalue1\n\n\nvalue2\n\n\n"
);
}
event => panic!("Expected Start, found {:?}", event),
}
match reader.read_event().unwrap() {
Event::Text(event) => {
assert_eq!(event.unescape().unwrap(), "\n\n\nvalue3\n\n\nvalue4\n\n\n")
}
event => panic!("Expected Text, found {:?}", event),
}
} There is a case in serde-se.rs always fail. Don't know if we should modify serde implementation? serialize_as!(tuple:
// Use to_string() to get owned type that is required for deserialization
("<\"&'>".to_string(), "with\t\r\n spaces", 3usize)
=> "<root><\"&'></root>\
<root>with\t\r\n spaces</root>\
<root>3</root>");
|
The failed test is because there is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add a case from your input, don't know the result is your expected:
Yes, I think, that it is expected results. According to the XML specification XML processor will never see raw \r
, it could appeared in data only if would be escaped as a character reference (note: XML specification names each "XML source" (for example "XML file") as "entity". XML is composed from several "entities" -- main document itself and any other documents loaded via DTD definitions).
Adding a link to https://www.w3.org/TR/xml11/#sec-line-ends somewhere near to the code that handles normalization also would be useful. Well, actually, it describes the 5 combinations of LOF characters, that should be normalized to \n
:
\r\n
\r\u0085
(UTF-8:\r\xC2\x85
)\r
\u0085
(UTF-8:\xC2\x85
)\u2028
(UTF-8:\xE2\x80\xA8
)
Strictly speaking, that means that we should look for the 5 bytes if we would like to unescape and normalize line ends in one loop: &
, ;
, \r
, \xC2
(or \x85
), \xE2
(or \xA8
). Unfortunately, memchr supports searching only 3 bytes at once. But at least we can (and should, otherwise we get two line ends instead of one if user tried to normalize other line ends that we do not process) process \r\xC2\x85
in the same way as \r\n
.
If it turns out that unescape and line end normalization cannot be combined in one loop, then we should normalize all possible line ends because that require searching only 3 different bytes, which memchr is supporting.
Actually, maybe using some bit tricks, like describing in this article it would be possible to work around memchr constraint, but I'm not experienced enough right now to apply them. Maybe you?
Therefore I have to remove the \r in roundtrip test to get the case pass.
That's fine, but probably should add a comment, that \r
excluded from whitespace characters because we performs line end normalization.
src/escape.rs
Outdated
@@ -266,7 +266,7 @@ where | |||
unescaped = Some(String::with_capacity(raw.len())); | |||
} | |||
let unescaped = unescaped.as_mut().expect("initialized"); | |||
unescaped.push_str(&raw[last_end..start]); | |||
unescaped.push_str(&normalize_line_end(&raw[last_end..start])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Escaping may be on the hot path, so it is not fine that we can allocate new owned strings just to get a reference and put it to already allocated string. Would it possible instead use already managed Cow
and pass it to the normalize_line_end
? Then upgrading Cow
from borrowed to owned could be performed either in unescape
, or in normalize_line_end
and that upgrade always will be performed only once.
Also, do we really need two loops? memchr
can search for three bytes at once and we search for three bytes: &
, ;
and \r
. So it could be possible to perform unescaping and line end normalization in one loop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how to do it. What if there is nothing to unescape? You still need to call normalize_line_end
for such input in the end, then what is the correct function parameter for normalize_line_end
so that it can handle both situations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, don't understand what the trouble? unescape
will always be called when we request data via API methods. If cycling search of &
, ;
, or \r
will not find anything, than it is really nothing to change in the data (I assume that cycles for unescaping and normalization merged in one).
Even if did not merge loops, normalize_line_end
can accept &mut Cow<str>
and change it instead of creating an own Cow
by taking reference to &str
and returning its own Cow<str>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am talking about the not merge loops, as you point out that there are more line end situations to consider, it is beyond my knowledge to do SIMD stuff myself to merge the loops.
Current function signature is: fn normalize_line_end(input: &str) -> Cow<str>
Change to what? How to call it?
Maybe your requirement is beyond my knowledge. In that case, I cannot help. We can close this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, read my &mut Cow<str>
as &mut Option<String>
. So the whole code will look like:
fn unescape_with(...) {
let mut unescaped = None;
for ... {
...
if need to unescape {
if unescaped.is_none() {
unescaped = Some(String::with_capacity(raw.len()));
}
let unescaped = unescaped.as_mut().expect("initialized");
}
...
normalize_line_end(&mut unescaped, raw.len());
...
}
}
fn normalize_line_end(input: &mut Option<String>, raw_len: usize) {
if need to normalize {
if input.is_none() {
input = Some(String::with_capacity(raw_len))
}
let normalized = input.as_mut().expect("initialized");
...
}
}
it is beyond my knowledge to do SIMD stuff myself to merge the loops.
Actually, SIMD and merging loops are two separate conception, it is not necessary to explicitly apply SIMD techniques if you not aware of them, it's totally fine. I only tried to be helpful if you aware of it.
Maybe your requirement is beyond my knowledge. In that case, I cannot help. We can close this PR.
The requirement in only is not to make things worse. First of all we should not make it impossible to correctly process input (that is why we should at least process \r\u0085
because if we turn it into \n\u0085
, the user will see two newlines in data instead of one, because specification allows \n
, \r\u0085
and \u0085
be served as (one) newline characters). The second, we should to try that without much impact to performance, so if we can avoid unnecessary allocations we should do that.
If you're having a hard time figuring out how to get there, that's fine, it could be so. Thanks for opening PR anyway. I will consider it later when I return to active phase on the project (currently working on other project) and see how to improve it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for showing the code.
Yes, it works if you have something to unescape. But what if there is nothing to unescape from the original input?
Then you still have to call normlize_line_end to normalize the original input. How to you put original input raw str as parameter to your current unescape function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it works if you have something to unescape. But what if there is nothing to unescape from the original input?
Unescape function decides where it is need to unescape something or not. It itself called always (well, because user requests unescaped and normalized data by demand, it will be called by demand, but that "demand" means "unesсape [and normalize] if there is anything")
How to you put original input raw str as parameter to your current unescape function?
Cannot understand what confuses you here. Calls to unescape
will not be changed and it is already called whenever necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn unescape_with<'input, 'entity, F>(
raw: &'input str,
mut resolve_entity: F,
) -> Result<Cow<'input, str>, EscapeError>
where
// the lifetime of the output comes from a capture or is `'static`
F: FnMut(&str) -> Option<&'entity str>,
{
Nothing to escape means there is no &
and ;
in raw input. It will not go inside while
below. So normalize will not happen to input here.
while let Some(start) = iter.by_ref().find(|p| bytes[*p] == b'&') {
match iter.next() {
Some(end) if bytes[end] == b';' => {
// append valid data
if unescaped.is_none() {
unescaped = Some(String::with_capacity(raw.len()));
}
let unescaped = unescaped.as_mut().expect("initialized");
unescaped.push_str(&normalize_line_end(&raw[last_end..start]));
// search for character correctness
let pat = &raw[start + 1..end];
if let Some(entity) = pat.strip_prefix('#') {
let codepoint = parse_number(entity).map_err(EscapeError::InvalidCharRef)?;
unescaped.push_str(codepoint.encode_utf8(&mut [0u8; 4]));
} else if let Some(value) = resolve_entity(pat) {
unescaped.push_str(value);
} else {
return Err(EscapeError::UnrecognizedEntity(
start + 1..end,
pat.to_string(),
));
}
last_end = end + 1;
}
_ => return Err(EscapeError::UnterminatedEntity(start..raw.len())),
}
}
Will not go inside here either because nothing has been unescaped
if let Some(mut unescaped) = unescaped {
if let Some(raw) = raw.get(last_end..) {
unescaped.push_str(&normalize_line_end(raw));
}
Ok(Cow::Owned(unescaped))
} else {
Must run normalize here, so how to call your changed function fn normalize_line_end(input: &mut Option<String>, raw_len: usize)
here with raw as parameter?
Ok(normalize_line_end(raw))
}
}
tests/reader-attributes.rs
Outdated
use quick_xml::events::{ | ||
BytesEnd, | ||
Event::{self, *}, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be reverted, because of changes below.
Co-authored-by: Mingun <[email protected]>
Co-authored-by: Mingun <[email protected]>
I did more line ends normalization. |
Regarding #806
I added a
normalize_line_end
function inescape
module and related tests.If
unescape
function is called, then line end will be normalized.