Illegal control characters in XML output #365

jeroen · 2020-10-19T13:01:11Z

Hi, I maintain the R bindings for cmark. One popular use case is converting commonmark to xml for processing the AST.

We are running into a problem when input markdown contains control characters (often captured from a tty), which makes xml output invalid. For example if the markdown text contains \033 and we convert that to xml, we get:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document xmlns="http://commonmark.org/xml/1.0">
  <paragraph>
    <text xml:space="preserve">�</text>
  </paragraph>
</document>

However, trying to parse this with libxml2 fails:

 Error in read_xml.raw(charToRaw(enc2utf8(x)), "UTF-8", ..., as_html = as_html,  : 
  PCDATA invalid Char value 27 [9]

A real world example is this readme file. This was done with the gfm fork, but I think the problem appears the same.

Is this a bug in cmark, or is markdown text not supposed to contain c0 characters in the first place?

cc @nwellnhof

nwellnhof · 2020-10-19T14:18:43Z

That's basically a bug in cmark's XML renderer. Markdown allows C0 control characters, but XML 1.0 doesn't (except some whitespace). The only option I see is to encode control characters with an extra element like:

<text xml:space="preserve">Text containing control character: <control code="27"/>. More text...</text>

But even this won't work with all XML workflows. XSLT stylesheets like tools/xml2md.xsl simply cannot output C0 control characters.

jeroen · 2020-10-19T14:30:03Z

Would it be a good idea to introduce a new option, either for cmark_parser_new or cmark_render_xml, to strip C0 characters so that we never end up with invalid XML?

jeroen · 2020-10-19T14:40:33Z

Or alternatively, would you be willing to help me with some example C code that removes or substitutes those characters in the XML before I feed it into xmlReadMemory? It is not completely clear to me which characters I should be removing: the ones that are not escaped by cmark, but are also not supported by the libxml2 parser.

nwellnhof · 2020-10-19T16:20:45Z

You'd have to strip all control codes between 0x00 and 0x1F, except 0x09, 0x0A and 0x0D. For completeness, you should also strip Unicode code points 0xFFFE and 0xFFFF: https://www.w3.org/TR/REC-xml/#charsets
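The stripping rule above could be sketched in C roughly as follows. This is only an illustration, not code from cmark or libxml2: the function name `strip_xml_invalid` is made up, and it assumes the buffer holds valid UTF-8, in which U+FFFE and U+FFFF are encoded as the byte sequences EF BF BE and EF BF BF.

```c
#include <stddef.h>

/* Hypothetical helper (not part of cmark or libxml2).
 * Removes, in place, C0 control bytes other than TAB, LF and CR, plus the
 * UTF-8 encodings of U+FFFE and U+FFFF. Assumes buf holds valid UTF-8.
 * Returns the new length; the result is not NUL-terminated by this function. */
static size_t strip_xml_invalid(char *buf, size_t len)
{
    unsigned char *s = (unsigned char *)buf;
    size_t in = 0, out = 0;

    while (in < len) {
        unsigned char c = s[in];

        /* 0x00-0x1F except 0x09 (TAB), 0x0A (LF) and 0x0D (CR). */
        if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D) {
            in += 1;
            continue;
        }
        /* U+FFFE is EF BF BE and U+FFFF is EF BF BF in UTF-8. */
        if (c == 0xEF && in + 2 < len && s[in + 1] == 0xBF &&
            (s[in + 2] == 0xBE || s[in + 2] == 0xBF)) {
            in += 3;
            continue;
        }
        /* Everything else is copied through unchanged. */
        s[out++] = s[in++];
    }
    return out;
}
```

The returned length, rather than the original one, would then be the size passed on to xmlReadMemory.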

jgm · 2020-10-19T16:38:26Z

I'm no XML expert: would encoding with numeric character references suffice? That would be an easy fix.

nwellnhof · 2020-10-19T16:50:08Z

No, this doesn't work. From the spec:

Characters referred to using character references MUST match the production for Char.
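For reference, the Char production in XML 1.0 is `#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]`. A small C predicate over code points, with the illustrative name `xml10_is_char`, shows why a reference such as `&#27;` is rejected just like the raw byte:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative predicate (the name is made up for this sketch).
 * Returns true if the code point matches the XML 1.0 Char production:
 * #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] */
static bool xml10_is_char(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}
```

Since 27 (0x1B) falls outside every range, neither the literal character nor a numeric reference to it is well-formed.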

jgm · 2020-10-19T16:57:10Z

If the point of the XML renderer is to faithfully represent the AST, then I see no real alternative to using <control> or something like that. (That isn't ideal, because it gives the impression that these characters are special nodes in the AST, but is there a better way? Dropping the characters seems less faithful.)

jeroen · 2020-10-19T18:04:31Z

In practice, at least in my cases, control characters in markdown are very rare and usually end up in the text by accident. In the real cases that I encountered, this happened when capturing stdout from a command line tool. An option to simply strip or substitute those works for me; a <control> tag introduces more complexity than it solves...

jgm · 2020-10-19T23:10:15Z

Well, for your purposes it might be fine to strip them. But we don't want to have two distinct ASTs represented by the same XML, if the XML purports to be the most accurate and direct representation of the AST. (There has been discussion of using the XML for tests, for example, and that wouldn't be possible if we stripped content.)

jeroen · 2020-10-20T13:22:34Z

Well, yes and no. I think you can also argue that C0 control characters aren't part of the content in the first place. They have no textual meaning; they are just artifacts not meant for human consumption, like a BOM or EOF. For example, the JSON specification also states that a BOM character may be ignored by the parser.

Either way, any solution that allows me to reliably parse cmark_render_xml output in libxml2 works for me :-)

jgm · 2020-10-21T16:18:10Z

I think you can also argue that control C0 characters aren't part of the content in the first place

I think that would be a reasonable decision to make, but currently the spec doesn't say this.
Perhaps a change to the spec would be wise.

Control characters, U+FFFE and U+FFFF aren't allowed in XML 1.0, so replace them with U+FFFD (replacement character). This doesn't solve the problem how to roundtrip these characters, but at least we don't produce invalid XML. See commonmark#365.

nwellnhof · 2021-02-03T18:46:50Z

The pull request above should fix the issue for @jeroen.

We could also use a textual escape mechanism like \uXXXXXX which would also work with XML attributes. But for most practical purposes, simply replacing control and non-characters shouldn't be a problem.
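As an illustration of the replacement approach (not the actual code from the pull request), a standalone pass over a UTF-8 buffer might look like the sketch below. The function name `replace_xml_invalid` is hypothetical. It assumes valid UTF-8 input and a destination buffer of at least 3 * len bytes, since each one-byte control character becomes the three-byte encoding of U+FFFD (EF BF BD).

```c
#include <stddef.h>

/* Hypothetical helper, not cmark's implementation.
 * Copies len bytes of valid UTF-8 from src to dst, writing U+FFFD
 * (EF BF BD) in place of each C0 control other than TAB/LF/CR and in
 * place of U+FFFE / U+FFFF. dst must have room for 3 * len bytes.
 * Returns the number of bytes written. */
static size_t replace_xml_invalid(const char *src, size_t len, char *dst)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
    size_t in = 0, out = 0;

    while (in < len) {
        unsigned char c = s[in];
        size_t skip = 0; /* input bytes to replace; 0 means copy as-is */

        if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D) {
            skip = 1;
        } else if (c == 0xEF && in + 2 < len && s[in + 1] == 0xBF &&
                   (s[in + 2] == 0xBE || s[in + 2] == 0xBF)) {
            skip = 3;
        }

        if (skip > 0) {
            d[out++] = 0xEF;
            d[out++] = 0xBF;
            d[out++] = 0xBD;
            in += skip;
        } else {
            d[out++] = s[in++];
        }
    }
    return out;
}
```

This pass only handles the disallowed characters; escaping of `&`, `<` and `>` is a separate concern and is left out here.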

nwellnhof mentioned this issue Feb 3, 2021

Replace invalid characters in XML output #376

Merged
