Illegal control characters in XML output #365
Comments
That's basically a bug in cmark's XML renderer. Markdown allows C0 control characters, but XML 1.0 doesn't (except tab, line feed and carriage return). The only option I see is to encode control characters with an extra element like:
But even this won't work with all XML workflows. XSLT stylesheets like
Would it be a good idea to introduce a new option, either for
Or alternatively, would you be willing to help me with some example C code that lets me substitute those characters in the XML before feeding it into
You'd have to strip all control codes between 0x00 and 0x1F, except 0x09, 0x0A and 0x0D. For completeness, you should also strip Unicode code points U+FFFE and U+FFFF: https://www.w3.org/TR/REC-xml/#charsets
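The stripping rule above could be sketched in C roughly as follows. This is only an illustration, not code from cmark: `strip_xml_invalid` is a hypothetical name, and the sketch assumes the input is a NUL-terminated, valid UTF-8 string, so U+FFFE and U+FFFF appear as the byte sequences EF BF BE and EF BF BF.

```c
#include <stddef.h>

/* Hypothetical helper (not part of cmark): removes, in place, the characters
 * that XML 1.0 forbids. Assumes `s` is NUL-terminated, valid UTF-8.
 * Returns the length of the stripped string in bytes. */
size_t strip_xml_invalid(char *s)
{
    const unsigned char *in = (const unsigned char *)s;
    unsigned char *out = (unsigned char *)s;

    while (*in) {
        if (*in < 0x20 && *in != 0x09 && *in != 0x0A && *in != 0x0D) {
            /* C0 control other than TAB, LF, CR: drop one byte */
            in++;
        } else if (in[0] == 0xEF && in[1] == 0xBF &&
                   (in[2] == 0xBE || in[2] == 0xBF)) {
            /* U+FFFE or U+FFFF encoded as UTF-8: drop three bytes */
            in += 3;
        } else {
            /* Everything else is copied through unchanged */
            *out++ = *in++;
        }
    }
    *out = '\0';
    return (size_t)(out - (unsigned char *)s);
}
```

Because the output is never longer than the input, the buffer can be rewritten in place; the caller only needs a writable copy of the text.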
I'm no XML expert: would encoding with numeric character references suffice? That would be an easy fix.
No, this doesn't work. From the spec:
If the point of the XML renderer is to faithfully represent the AST, then I see no real alternative to using
In practice, at least in my cases, control characters in markdown are very rare and usually end up in the text by accident. In the real cases that I encountered, this happened when capturing stdout from a command line tool. An option to simply strip or substitute those works for me, a
Well, for your purposes it might be fine to strip them. But we don't want to have two distinct ASTs represented by the same XML, if the XML purports to be the most accurate and direct representation of the AST. (There has been discussion of using the XML for tests, for example, and that wouldn't be possible if we stripped content.)
Well, yes and no. I think you can also argue that C0 control characters aren't part of the content in the first place. They have no textual meaning; they are just artifacts not meant for human consumption, like a BOM or EOF. For example, the JSON specification also states that a BOM character may be ignored by the parser. Either way, any solution that allows me to reliably parse
I think that would be a reasonable decision to make, but currently the spec doesn't say this.
Control characters, U+FFFE and U+FFFF aren't allowed in XML 1.0, so replace them with U+FFFD (replacement character). This doesn't solve the problem of how to round-trip these characters, but at least we don't produce invalid XML. See commonmark#365.
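For readers who want to apply the same idea outside the renderer, the replacement approach might look roughly like this. It is a sketch under stated assumptions, not the code from the pull request: `replace_xml_invalid` is a hypothetical name, and the input is assumed to be NUL-terminated, valid UTF-8. Since U+FFFD takes three bytes in UTF-8 (EF BF BD) and may replace a single control byte, the result is written to a new buffer.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not cmark's implementation): returns a newly
 * allocated copy of `s` in which every character forbidden by XML 1.0
 * (C0 controls except TAB/LF/CR, U+FFFE, U+FFFF) is replaced by U+FFFD.
 * Assumes `s` is NUL-terminated, valid UTF-8. The caller frees the
 * result; returns NULL if allocation fails. */
char *replace_xml_invalid(const char *s)
{
    static const char fffd[3] = { '\xEF', '\xBF', '\xBD' };
    const unsigned char *in = (const unsigned char *)s;
    /* Worst case: every input byte becomes a 3-byte replacement. */
    char *buf = malloc(strlen(s) * 3 + 1);
    char *out = buf;

    if (buf == NULL)
        return NULL;

    while (*in) {
        if (*in < 0x20 && *in != 0x09 && *in != 0x0A && *in != 0x0D) {
            memcpy(out, fffd, 3);      /* forbidden C0 control */
            out += 3;
            in++;
        } else if (in[0] == 0xEF && in[1] == 0xBF &&
                   (in[2] == 0xBE || in[2] == 0xBF)) {
            memcpy(out, fffd, 3);      /* U+FFFE or U+FFFF */
            out += 3;
            in += 3;
        } else {
            *out++ = (char)*in++;      /* allowed byte, copy as-is */
        }
    }
    *out = '\0';
    return buf;
}
```

As noted in the thread, this mapping is lossy: distinct inputs can produce the same output, so it keeps the XML well-formed but does not allow round-tripping.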
The pull request above should fix the issue for @jeroen. We could also use a textual escape mechanism like
Hi, I maintain the R bindings for cmark. One popular use case is converting CommonMark to XML for processing the AST.
We are running into a problem when the input markdown contains control characters (often captured from a tty), which makes the XML output invalid. For example, if the markdown text contains
\033
and we convert that to XML, we get:
However, trying to parse this with libxml2 fails:
A real world example is this readme file. This was done with the gfm fork, but I think the problem appears the same.
Is this a bug in cmark, or is markdown text not supposed to contain C0 control characters in the first place?
cc @nwellnhof