Avoid blindly re-encoding HTML files #373

inductiveload · 2021-07-28T04:23:53Z

Previously, HTML files were stripped of their XML
Processing Instruction headers and re-encoded from
UTF-8 to HTML-ENTITIES to be fed into the DOMDocument.

This caused problems for documents with CDATA blocks that
contained Unicode, as it's not correct to escape that as
HTML entities in the general case. For example, CSS and
binary data don't use that escaping system.

Instead, load it directly and then remove the PI nodes
after the fact.

Bug: https://phabricator.wikimedia.org/T271390
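
For illustration only, here is a minimal sketch of the "load directly, then remove the PI nodes" idea. It is not the code from this PR: the helper name `removeProcessingInstructions()` and the sample input are hypothetical, and it assumes well-formed XHTML loaded with `loadXML()` so the behaviour is predictable, whereas the PR itself deals with HTML input.

```php
<?php
// Hypothetical sample input: XHTML with an extra processing instruction and
// a CDATA block containing non-ASCII text that must not be entity-escaped.
$xhtml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="style.css"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><style><![CDATA[p::before { content: "é"; }]]></style></head>
<body><p>Text</p></body>
</html>
XML;

/**
 * Hypothetical helper (not part of the PR): remove every
 * processing-instruction node from a document and return how many were removed.
 */
function removeProcessingInstructions( DOMDocument $document ): int {
	$xpath = new DOMXPath( $document );
	// Copy the matches into an array before removing anything.
	$nodes = iterator_to_array( $xpath->query( '//processing-instruction()' ) );
	foreach ( $nodes as $node ) {
		$node->parentNode->removeChild( $node );
	}
	return count( $nodes );
}

// Load the markup as-is, without re-encoding it to HTML entities first.
$document = new DOMDocument();
$document->loadXML( $xhtml );

$removed = removeProcessingInstructions( $document );
$output = $document->saveXML();
```

The XML declaration is not a node in the tree, so only the `xml-stylesheet` instruction is removed here; the CDATA content is left untouched because nothing was entity-escaped before parsing.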

samwilson · 2021-07-28T06:17:13Z

src/Util/Util.php

 		$document->encoding = 'UTF-8';
+
+		// Dirty fix to strip out existing XML Processing Instruction nodes
+		// (we already have one from the creation of the DOMDocument)


Is it possible to loadHTML() without adding the PI? It feels a bit inefficient to loop through the whole document (which could be quite large) only to find a dupe of something we've just added here. (Sorry if I'm missing the obvious!)

Tpt · 2021-07-28T07:23:08Z

Maybe dumb idea: instead of hacking around the PHP XML parser, what about using RemexHtml? It is the HTML parser developed for Parsoid, so we might hope it properly handles all these cases.

inductiveload · 2021-07-28T07:30:48Z

@Tpt sounds like a good idea, because whatever I do with the above, it seems to trip up on various inputs.

samwilson · 2023-12-04T03:26:13Z

I think this is all sorted now in #479. Sorry we never resolved it years ago!

inductiveload force-pushed the css_escaping branch from d3fed6a to 8b6d4e5 Compare July 28, 2021 04:29

inductiveload force-pushed the css_escaping branch from 8b6d4e5 to 35c1857 Compare July 28, 2021 04:49

inductiveload changed the title ~~Avoid blindly re-encoding HTML files~~ WIP: Avoid blindly re-encoding HTML files Jul 28, 2021

inductiveload changed the title ~~WIP: Avoid blindly re-encoding HTML files~~ Avoid blindly re-encoding HTML files Jul 28, 2021

samwilson reviewed Jul 28, 2021

samwilson added the WIP Work in progress label Sep 21, 2021

samwilson closed this Dec 4, 2023
