Optimize raw HTML post-processor #1510

Open
pawamoy wants to merge 3 commits into master from optimize-rawhtml-postprocessor

pawamoy
Contributor

@pawamoy pawamoy commented Feb 21, 2025

Closes #1507

Using a set makes it faster to check whether a tag is one of the block-level elements.
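As a rough illustration of the list-versus-set difference, here is a minimal sketch. The tag names are a hypothetical subset chosen for the example, and the helper functions are not part of Python-Markdown.

```python
# Hypothetical subset of block-level tag names, for illustration only.
BLOCK_LEVEL_LIST = ["address", "article", "blockquote", "div", "p", "pre", "table", "ul"]
BLOCK_LEVEL_SET = set(BLOCK_LEVEL_LIST)


def is_block_level_list(tag: str) -> bool:
    # Linear scan: compares the tag against each element until one matches.
    return tag.lower() in BLOCK_LEVEL_LIST


def is_block_level_set(tag: str) -> bool:
    # Hash lookup: average constant time, whatever the collection size.
    return tag.lower() in BLOCK_LEVEL_SET
```

Both helpers return the same answers; only the cost of each lookup differs, which matters when the check runs once per substituted placeholder.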

Issue-1507: Python-Markdown#1507
Previously, the raw HTML post-processor would precompute all possible replacements for placeholders in a string, based on the HTML stash.

It would then apply a regular expression substitution using these replacements.

Finally, if the text changed, it would recurse, and do all that again.

This was inefficient because all replacements were recomputed on every recursion, even though only a few of them would actually be used.
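The previous flow described above can be sketched roughly as follows. This is a simplified stand-alone model, not the actual Python-Markdown code: the `[[html:N]]` placeholder format, the `BLOCK_TAGS` set, the plain-list stash and the `run_old` name are all assumptions made for the example.

```python
import re

# Hypothetical block-level tags and placeholder format, for illustration only.
BLOCK_TAGS = {"div", "p", "pre", "table"}


def is_block(html: str) -> bool:
    # Treat the HTML as block-level if its first tag name is in BLOCK_TAGS.
    m = re.match(r"<([a-zA-Z0-9]+)", html)
    return bool(m) and m.group(1).lower() in BLOCK_TAGS


def run_old(text: str, stash: list) -> str:
    # Step 1: precompute a replacement for every stashed block, used or not.
    replacements = {}
    for i, html in enumerate(stash):
        placeholder = f"[[html:{i}]]"
        wrapped_value = html if is_block(html) else f"<p>{html}</p>"
        replacements[f"<p>{placeholder}</p>"] = wrapped_value
        replacements[placeholder] = html
    if not replacements:
        return text
    # Step 2: one substitution pass over the whole text.
    pattern = re.compile("|".join(re.escape(k) for k in replacements))
    processed = pattern.sub(lambda m: replacements[m.group(0)], text)
    # Step 3: if anything changed, start over (rebuilding all replacements).
    if processed == text:
        return processed
    return run_old(processed, stash)
```

With a stash of `["<div>[[html:1]]</div>", "<b>x</b>"]`, resolving `<p>[[html:0]]</p>` takes three calls to `run_old`, and each call rebuilds the full replacement table and rescans the whole text.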

This change moves the recursion into the regular expression substitution, so that:

1. the regular expression does minimal work on the text (rather than rescanning text already scanned in previous frames);
2. more importantly, replacements are no longer computed ahead of time (let alone several times); they are fetched from the HTML stash only as placeholders are found in the text.

The substitution function relies on the ordering of the regular expression groups: we make sure to match `<p>PLACEHOLDER</p>` first, before `PLACEHOLDER`. The presence of a wrapping `p` tag determines whether the substitution result is wrapped again (this also depends on whether the substituted HTML is a block-level tag).
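A minimal stand-alone sketch of this approach, under stated assumptions: the `[[html:N]]` placeholder format, the `BLOCK_TAGS` set, the plain-list stash and the `run_new` name are hypothetical stand-ins, and the real post-processor works with the Markdown instance's HTML stash instead.

```python
import re

# Hypothetical block-level tags and placeholder format, for illustration only.
BLOCK_TAGS = {"div", "p", "pre", "table"}

# Group 1 matches a placeholder wrapped in <p>...</p>; group 2 matches a bare
# placeholder. The wrapped alternative comes first so it wins when both apply.
PLACEHOLDER_RE = re.compile(r"<p>\[\[html:(\d+)\]\]</p>|\[\[html:(\d+)\]\]")


def is_block(html: str) -> bool:
    # Treat the HTML as block-level if its first tag name is in BLOCK_TAGS.
    m = re.match(r"<([a-zA-Z0-9]+)", html)
    return bool(m) and m.group(1).lower() in BLOCK_TAGS


def run_new(text: str, stash: list) -> str:
    def substitute_match(m: re.Match) -> str:
        if m.group(1) is not None:
            key, wrapped = int(m.group(1)), True
        else:
            key, wrapped = int(m.group(2)), False
        if key >= len(stash):
            # Unknown placeholder: leave the matched text untouched.
            return m.group(0)
        html = stash[key]  # fetched only when the placeholder is found
        if not wrapped or is_block(html):
            # Recurse into the replacement only, not the whole text.
            return PLACEHOLDER_RE.sub(substitute_match, html)
        # Inline HTML that was wrapped in <p>: restore the wrapper.
        return PLACEHOLDER_RE.sub(substitute_match, f"<p>{html}</p>")

    if stash:
        return PLACEHOLDER_RE.sub(substitute_match, text)
    return text
```

In this sketch the recursion happens inside the substitution callback, so each nested placeholder is resolved by scanning only the HTML that was just fetched from the stash.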

Issue-1507: Python-Markdown#1507
@pawamoy pawamoy force-pushed the optimize-rawhtml-postprocessor branch from 6113aad to fc9acc0 Compare February 21, 2025 15:38
@pawamoy
Contributor Author

pawamoy commented Feb 21, 2025

Hmm, the list->set change could be seen as a breaking change. We could instead create a new `Markdown._block_level_elements` attribute and use that in `isblocklevel()`. Let me know if you think that's best.
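One way the non-breaking alternative might look, as a hedged sketch: the `MarkdownLike` class and its `is_block_level_tag` method are hypothetical stand-ins, not the real `Markdown` class, and the tag names are an arbitrary subset.

```python
class MarkdownLike:
    """Hypothetical stand-in for the Markdown class, for illustration only."""

    def __init__(self) -> None:
        # Public attribute stays a list, so existing callers keep working.
        self.block_level_elements = ["div", "p", "pre", "table"]
        # Private set built once, used for fast membership checks.
        self._block_level_elements = set(self.block_level_elements)

    def is_block_level_tag(self, tag: str) -> bool:
        # Look up in the private set instead of scanning the public list.
        return tag.lower() in self._block_level_elements
```

A trade-off of this sketch is that later changes to the public list, for example an extension appending a tag, are not reflected in the private set unless the two are kept in sync.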


else:
    key = m.group(2)
    wrapped = False
if (key := int(key)) >= len(self.md.htmlStash.rawHtmlBlocks):
Contributor Author

Could use html_counter instead:

Suggested change
- if (key := int(key)) >= len(self.md.htmlStash.rawHtmlBlocks):
+ if (key := int(key)) >= self.md.htmlStash.html_counter:

        return pattern.sub(substitute_match, html)
    return pattern.sub(substitute_match, f"<p>{html}</p>")

if self.md.htmlStash.html_counter:
Contributor Author

Or we could not use html_counter and only rely on the actual list, rawHtmlBlocks:

Suggested change
- if self.md.htmlStash.html_counter:
+ if self.md.htmlStash.rawHtmlBlocks:

if (key := int(key)) >= len(self.md.htmlStash.rawHtmlBlocks):
    return m.group(0)
html = self.stash_to_string(self.md.htmlStash.rawHtmlBlocks[key])
if self.isblocklevel(html) or not wrapped:
Contributor Author

Micro-optimization (assuming the list->set change is applied): make the `isblocklevel` check lazy.

Suggested change
- if self.isblocklevel(html) or not wrapped:
+ if not wrapped or self.isblocklevel(html):
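To show why the operand order matters, here is a small sketch relying on Python's short-circuit evaluation of `or`. The `is_block_level` stand-in and the `calls` list are hypothetical and exist only to count how often the check runs.

```python
calls = []


def is_block_level(html: str) -> bool:
    # Hypothetical stand-in for the real check; records each invocation.
    calls.append(html)
    return html.startswith("<div")


def skip_wrapping_eager(html: str, wrapped: bool) -> bool:
    # The block-level check always runs, even when `wrapped` is False.
    return is_block_level(html) or not wrapped


def skip_wrapping_lazy(html: str, wrapped: bool) -> bool:
    # `or` short-circuits: the check only runs when `wrapped` is True.
    return not wrapped or is_block_level(html)
```

Both functions return the same result for every input; the lazy variant simply avoids calling the check for placeholders that were not wrapped in a `p` tag.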

Development

Successfully merging this pull request may close these issues.

Optimizing the raw HTML post-processor
1 participant