Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XALANJ-2617 Fixed serializer for high-surrogate UTF-16 characters #4

Open
wants to merge 3 commits into
base: trunk
Choose a base branch
from

Conversation

peterdemaeyer
Copy link

Fixed serializer such that it correctly deals with high-surrogate UTF-16 characters. This pull request replaces an earlier one from Daniel Kec, see comments on https://issues.apache.org/jira/browse/XALANJ-2617.

@matzesa
Copy link

matzesa commented Oct 2, 2018

Thanks for your pull request. I build serializer.jar with it and it works well with normal text. But I still have problems with text in attributes.

XML (content of element):
😇

XSL:
<xsl:value-of select="."/><input type="text" value="{.}"/>

Output (method xml; encoding utf-8):
&#128519;<input value="&#55357;&#56839;" type="text"/>

Output (method html, encoding utf-8):
&#128519;<input value="😇&#55357;" type="text">

With encoding utf-16 xml and html:
&#128519;<input value="😇" type="text"/>

@peterdemaeyer
Copy link
Author

I'll gladly investigate further and fix, but I'm afraid the description of your reproducible scenario is a bit too concise for me to understand it properly. I understand attributes in XML, but I don't understand what the XSL has to do with it. It would help if you could give me more detailed steps to reproduce.

@peterdemaeyer
Copy link
Author

Discard my previous comment. In the meantime I was able to reproduce the bug and I fixed it for attribute values as well. The fix for attribute values was a bit more involved than that for element text, because in general the method writeAttrString (which serializes attribute values) was not on par with method characters (which serializes element text). The implementation is similar yet different, so my fix is similar yet different as well ;-)

@matzesa
Copy link

matzesa commented Oct 22, 2018

Thanks for your response and fix. It now work well with attributes when the output is xml. When I change the OutputKeys.METHOD to 'html' text in attributes is still broken. The character is not a html entity and there is extra code '�' (&#55357;).

Output of file xalan_emoji.txt (changed from .java to .txt) with your fixed xalan version:

Character: 😀
EXPECTED XML: <?xml version="1.0" encoding="UTF-8"?><a test="&#128512;">&#128512;</a>
 ACTUAL XML: <?xml version="1.0" encoding="UTF-8"?><a test="&#128512;">&#128512;</a>
ACTUAL XML PARSED CHAR 😀
EXPECTED HTML: <a test="&#128512;">&#128512;</a>
 ACTUAL HTML: <a test="😀&#55357;">&#128512;</a>

[Fatal Error] :1:20: Character reference "&#

@matzesa
Copy link

matzesa commented Oct 22, 2018

I tried to investigate the problem and found writeAttrString in ToHTMLStream.java.
On line 1443 the character will be written (but not changed to an entity). On the next few lines the (current) character will be written again. Or do I miss something? So I changed the code that these lines are in an else block:

if (Encodings.isHighUTF16Surrogate(ch))
{

		writeUTF16Surrogate(ch, chars, i, end);
		i++; // two input characters processed
			 // this increments by one and the for()
			 // loop itself increments by another one.
}

// The next is kind of a hack to keep from escaping in the case
// of Shift_JIS and the like.

/*
else if ((ch < m_maxCharacter) && (m_maxCharacter == 0xFFFF)
&& (ch != 160))
{
writer.write(ch);  // no escaping in this case
}
*/

else
{
	String outputStringForChar = m_charInfo.getOutputStringForChar(ch);
	if (null != outputStringForChar)
	{
		writer.write(outputStringForChar);
	}
	else if (escapingNotNeeded(ch))
	{
		writer.write(ch); // no escaping in this case
	}
	else
	{
		writer.write("&#");
		writer.write(Integer.toString(ch));
		writer.write(';');
	}
}

This seems to work, the output of my test (see comment above) is now:

Character: 😀
EXPECTED XML: <?xml version="1.0" encoding="UTF-8"?><a test="&#128512;">&#128512;</a>
 ACTUAL XML: <?xml version="1.0" encoding="UTF-8"?><a test="&#128512;">&#128512;</a>
ACTUAL XML PARSED CHAR 😀
EXPECTED HTML: <a test="&#128512;">&#128512;</a>
 ACTUAL HTML: <a test="😀">&#128512;</a>

ACTUAL HTML PARSED CHAR 😀

The character is not changed to an entity, but for me it is better than an extra and wrong character.

@nicolas-albert
Copy link

Hi,
I try this patch to write some unicode correctly to XML and its works for xalan 1.7.2.

Can we hope a release with this bug fixed or their is a non-official maven repo available with this issue fixed ?

nicolas-albert added a commit to convertigo/convertigo that referenced this pull request Nov 30, 2018
@L3Chuck
Copy link

L3Chuck commented Jan 8, 2019

@peterdemaeyer thanks for this fix, do you know when this can be merged and released ?

@peterdemaeyer
Copy link
Author

My pull request has been ready for several months. I didn't closely examine @nicolas-albert's additional fixes but assuming they are good, this can be merged as far as I'm concerned. I don't think I can merge it myself, so I was just waiting for someone with sufficient privileges to merge it. I'm also quite eager to get this fix in an official release, so if I can/need to do something additional myself, let me know.

@nicolas-albert
Copy link

@peterdemaeyer I didn't customize the code, I only applied this patch to my project (and seems ok).
I hope someone from the XalanJ team will merge this PR for the next release (if any ...)

@plutext
Copy link

plutext commented Feb 11, 2019

Hi @peterdemaeyer with reference to https://issues.apache.org/jira/browse/XALANJ-2617 I have converted http://svn.apache.org/repos/asf/xalan/test/ to https://github.com/plutext/xalan-test

I see you ran minitest and smoketest; those work for me as well :-)

Do you recall whether you tried the conf and perf tests? With these I get some test failures, even with the published xalan and serializer jars from maven central. The same tests fail for me under Java 8 and 11, on both master and branch xalan-j_2_7_x.

FYI, I have converted Xalan to a maven multi module project; minitest and smoketest pass using Java 8 and Java 11 .

I have made module-info.java files for it. Once satisfied about the tests, I'll apply selected patches.

@peterdemaeyer
Copy link
Author

I don't think It tried the conf and perf tests.

@plutext
Copy link

plutext commented Feb 21, 2019

@nicolas-albert convertigo/convertigo@bd37cff is 7,493 additions. Would you consider re-doing your patch without the noise? thanks!

@nicolas-albert
Copy link

@plutext I just apply the PR of this conversation, there is your patch : https://github.com/apache/xalan-j/pull/4/commits
My commit was for my project to override classes of the latest Xalan release and should not be applied to another project.

@peterdemaeyer
Copy link
Author

Guys, you're losing me here. I only touched 1 file. I don't understand where all the rest of the 7493 additions come from, but I can't imagine that this patch is hard to merge. From the discussion above, it's not entirely clear to me what is happening, but I hope this fix can be merged soon.

@plutext
Copy link

plutext commented Feb 23, 2019

I think the confusion comes from your reference to "@nicolas-albert's additional fixes"; I didn't notice he later clarified that there are no additional fixes :-) so sorry for the noise. Regarding likelihood of merging, please see comments at https://issues.apache.org/jira/browse/XALANJ-2419

epilogic-ch added a commit to epilogic-ch/xalan-java that referenced this pull request Feb 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants