Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid blindly re-encoding HTML files #373

Closed
wants to merge 1 commit into from

Conversation

inductiveload
Copy link
Contributor

Previously, HTML files werei stripped of their XML
Processing Instruction headers and re-encoded from
UTF-8 to HTML-ENTITIIES to be fed into the DomDocument.

This caused problems for documents with CDATA blocks that
contained Unicode, as it's not correct to escape that as
HTML entities in the general case. For example CSS or
binary data doesn't use that escaping system.

Instead, load it directly and then remove the PI nodes
after the fact.

Bug: https://phabricator.wikimedia.org/T271390

Previously, HTML files werei stripped of their XML
Processing Instruction headers and re-encoded from
UTF-8 to HTML-ENTITIIES to be fed into the DomDocument.

This caused problems for documents with CDATA blocks that
contained Unicode, as it's not correct to escape that as
HTML entities in the general case. For example CSS or
binary data doesn't use that escaping system.

Instead, load it directly and then remove the PI nodes
after the fact.

Bug: https://phabricator.wikimedia.org/T271390
@inductiveload inductiveload changed the title Avoid blindly re-encoding HTML files WIP: Avoid blindly re-encoding HTML files Jul 28, 2021
@inductiveload inductiveload changed the title WIP: Avoid blindly re-encoding HTML files Avoid blindly re-encoding HTML files Jul 28, 2021
$document->encoding = 'UTF-8';

// Dirty fix to strip out existing XML Processing Instruction nodes
// (we already have one from the creation of the DOMDocument)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to loadHTML() without adding the PI? II feels a bit inefficient to loop through the whole document (which could be quite large) only to find a dupe of something we've just added here. (Sorry if I'm missing the obvious!)

@Tpt
Copy link
Collaborator

Tpt commented Jul 28, 2021

Maybe dumb idea: instead of hacking around the PHP XML parser, what about using RemexHTML? It is the HTML parser developped for Parsoid so we might hope it properly handles all theses cases.

@inductiveload
Copy link
Contributor Author

@Tpt sounds like a good idea, because whatever I do with the above, it seems to trip over on various inputs.

@samwilson samwilson added the WIP Work in progress label Sep 21, 2021
@samwilson
Copy link
Member

I think this is all sorted now in #479. Sorry we never resolved it years ago!

@samwilson samwilson closed this Dec 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
WIP Work in progress
Development

Successfully merging this pull request may close these issues.

3 participants