Optimizing the raw HTML post-processor #1507

Open
pawamoy opened this issue Feb 3, 2025 · 10 comments · May be fixed by #1510

@pawamoy
Contributor

pawamoy commented Feb 3, 2025

Locally, I was able to shave 20 seconds off a MkDocs build (from ~50s to ~30s) by optimizing the raw HTML post-processor a bit. Would you be open to a PR that optimizes the post-processor code a bit? Any guidance on how to provide benchmarks and results to confirm the performance gain?

@waylan
Member

waylan commented Feb 3, 2025

You are certainly welcome to contribute in that way. As for benchmarks, if you can provide a reproducible command that can be run both before and after the changes and that shows a clear performance boost, that should be sufficient. Note that the command should be run directly on Markdown without any additional third-party code (such as MkDocs). Generally, Python's own built-in tools work fine.
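
For example, a reproducible run with Python's built-in timeit, executed before and after the change, might look like this (the sample file name is only a placeholder):

```python
# Time markdown.markdown() directly on a sample document, with no MkDocs or
# other third-party code involved. "sample.md" stands for whatever document
# you benchmark with.
import timeit

setup = "import markdown; text = open('sample.md', encoding='utf-8').read()"
print(timeit.timeit("markdown.markdown(text)", setup=setup, number=100))
```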

@pawamoy
Contributor Author

pawamoy commented Feb 20, 2025

I've started working on this, and will try to provide some context first.

I'm working on a new feature in mkdocstrings/autorefs and noticed that the build time went up by a lot, with much of that time spent in the RawHtmlPostprocessor.

On certain relatively big pages, the self.md.htmlStash.rawHtmlBlocks list grows quite large, for example to 4515 entries for Griffe's expressions API page. For the most part, the list is made of "<code>", "<span>", "<b>" and their closing variants, with or without HTML classes. The list also contains about 40 big strings (1k to 20k+ characters), which are mkdocstrings-python's Jinja templates rendered to HTML, one for each Python object injected into the page.

The processor iterates over this list several times: it recurses (iterating once more) as long as placeholders were found and replaced in the current text. On each pass it builds a dictionary of string replacements, which it uses in a re.sub call, and it searches the whole, expanded text again for placeholders (including parts that were already searched in previous passes).

This is done for the page content, but also for each ToC entry (in toc.render_inner_html()). In the page above, I have about 500 ToC entries. Each entry does two iterations over the blocks list (only one layer of placeholders, so a single recursion). That means a dict of length 4515 is built 500 * 2 times, for this page only. Each element in each dict is checked with a regex to see if it's a block-level element, and most of the time the first matched HTML tag is checked for membership in BLOCK_LEVEL_ELEMENTS, a list of length 59. So for this page, that's about 250+ million string comparisons.
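
(As a rough sanity check of that figure: 4515 entries × 500 ToC entries × 2 passes × up to 59 comparisons per linear scan comes to about 266 million comparisons in the worst case, which is consistent with the 250+ million estimate.)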


Now for possible optimizations.

  1. Since this is post-processing, maybe we can assume the HTML stash will not grow or change, so replacements could be computed only once and cached across the conversion process. Not sure about this one.
  2. At the very least, the HTML stash won't change between recursion calls, so replacements shouldn't be recomputed each time.
  3. Instead of comparing the original text to the modified one to know whether to recurse, which means the regular expression runs again on the same parts of the text it already ran against, we could move the recursion into the re.sub call: each time a placeholder is found, run re.sub on its replacement. That way the regex does minimal work.
  4. Change BLOCK_LEVEL_ELEMENTS to be a set. Low-hanging fruit and a drastic performance improvement, going from O(N) to O(1) for membership checks of HTML tags (a quick sketch of this and the micro-optimizations below follows the list).
  5. Micro-optimizations:
    • avoid repeated attribute access in the hot loop (assign variables before the loop)
    • use str() directly instead of self.stash_to_string(), which just calls str()
    • use return tag in self.md.block_level_elements instead of return self.md.is_block_level(tag), since Markdown.is_block_level only adds a check on whether the argument is a string, which we already know it is in this context
    • use HTML_PLACEHOLDER % i directly to get a placeholder instead of get_placeholder(i) (this method of the stash doesn't use self), at the cost of a bit of OOP encapsulation
    • use more modern constructs where possible, such as f-strings instead of %/{}-formatted strings
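
Here is a quick sketch of points 4 and 5 (illustrative only, not the actual Python-Markdown code; the helper names are made up):

```python
import re

# Point 4: a set instead of a list gives O(1) membership checks.
BLOCK_LEVEL_ELEMENTS = {
    'address', 'article', 'aside', 'blockquote', 'div', 'dl', 'figure',
    'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'hr', 'ol', 'p', 'pre', 'table', 'ul',
    # ... remaining block-level tag names ...
}

TAG_RE = re.compile(r'^</?([^ >]+)')  # first tag name of an HTML fragment

def is_block_level(html: str) -> bool:
    m = TAG_RE.match(html)
    return bool(m) and m.group(1).lower() in BLOCK_LEVEL_ELEMENTS

def classify_blocks(raw_html_blocks):
    # Point 5: hoist lookups out of the hot loop and call str() directly
    # instead of going through a wrapper method.
    check = is_block_level
    return [check(str(block)) for block in raw_html_blocks]
```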

One other, less clear optimization: since the stash list has lots of repeated elements, there might be more performant data structures or approaches. But that's a bigger refactor than I'm comfortable with, and from what I've seen the performance gain from the suggestions above is already sufficient.

@pawamoy
Contributor Author

pawamoy commented Feb 20, 2025

Actually, there's another approach that seems even more obvious: don't build replacements beforehand, and only fetch them when a placeholder to replace is found. This makes all the other optimizations irrelevant (except the list->set change, which I'd still recommend).
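
Roughly, the idea is something like this (a simplified sketch rather than the actual patch; it ignores the `<p>`-wrapping and block-level handling):

```python
import re
from markdown.util import HTML_PLACEHOLDER

def postprocess(md, text):
    # Sketch only: the real post-processor also handles <p>-wrapped
    # placeholders and block-level elements.
    pattern = re.compile(HTML_PLACEHOLDER % r'(\d+)')
    raw_blocks = md.htmlStash.rawHtmlBlocks  # hoist the attribute lookups

    def substitute(match):
        # Fetch only the stash entry whose placeholder was actually matched,
        # instead of precomputing a replacements dict for the whole stash.
        html = str(raw_blocks[int(match.group(1))])
        # Recurse into the fetched block so nested placeholders are resolved.
        return pattern.sub(substitute, html)

    return pattern.sub(substitute, text)
```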

@waylan
Member

waylan commented Feb 20, 2025

Just a word of caution: the existing implementation of running the regex on already-processed text exists because a raw block could contain a placeholder for another raw block. I don't recall if we have a test for that scenario, but that is why it works as it does. How do we get into that situation? I assume a user will have multiple third-party extensions, each of which adds placeholders to the text at different stages in some completely unpredictable way. Therefore, we could have a raw block which contains a placeholder, which points to another raw block which contains a placeholder, which points to another raw block which contains a placeholder, and so on. All of the placeholders must be swapped back out. The way I solved that was to keep checking the entire text until there are no more matches. I suppose an alternative approach would be to run the regex on the raw block before inserting it back into the document. But I don't think that necessarily reduces the number of times the regex is run.
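
For illustration, the nesting can arise like this (a made-up example; the stored HTML fragments are arbitrary):

```python
import markdown

md = markdown.Markdown()

# One extension stashes a fragment and gets a placeholder string back...
inner = md.htmlStash.store('<b>inner raw HTML</b>')
# ...and another extension later stashes HTML that contains that placeholder.
outer = md.htmlStash.store(f'<div>{inner}</div>')

# When the post-processor swaps the second block back in, the result still
# contains the first placeholder, so the text must be scanned again (or the
# substitution must recurse) until no placeholders remain.
```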

@pawamoy
Contributor Author

pawamoy commented Feb 20, 2025

Yes, that is what I understood 🙂 And your suggested alternative is what I suggested in point 3. I agree the regex will not run fewer times, but it will run on less text.


OK you can run a script in my fork:

```
git clone https://github.com/pawamoy/markdown pawamoy-markdown
cd pawamoy-markdown
git checkout optimize-raw-html-postprocessor
uvx --with matplotlib python benchmarks.py
```

This should display a chart where you'll see that the baseline (current implementation) depends on the size of the stash (as well as the number of placeholders in the text), while the new implementation only depends on the number of placeholders in the text. In this example the processor runs twice: once with an empty string, and a second time with a string containing 10 placeholders.

[Image: benchmark chart, baseline vs. new implementation as the stash size grows]

Let me generate a second chart where the number of placeholders increases.

@pawamoy
Contributor Author

pawamoy commented Feb 20, 2025

Growing stash, growing number of placeholders (equal to the stash size, so all entries are used). The difference will probably be much more pronounced if we start nesting placeholders.

[Image: benchmark chart, growing stash with an equal number of placeholders]

@pawamoy
Contributor Author

pawamoy commented Feb 20, 2025

With 5% nesting:

[Image: benchmark chart, same benchmark with 5% of placeholders nested]

@pawamoy
Contributor Author

pawamoy commented Feb 20, 2025

With a fixed stash size of 5000 and an increasing ratio of stash entries containing placeholders:

[Image: benchmark chart, fixed stash size of 5000 with an increasing ratio of entries containing placeholders]

Note that this is a single run of the post-processor. The difference is amplified by the content size (the number of pages in a MkDocs site, the number of ToC entries in each page).

@pawamoy
Contributor Author

pawamoy commented Feb 21, 2025

I also found out exactly which part of the new feature I'm working on was making the build time go up like this.

Previously, mkdocstrings injected "empty" headings (just an id and a data-toc-label) into toc's data (otherwise toc wouldn't see them, since mkdocstrings stashes the rendered HTML). The new feature requires that I keep the headings' text, so it now also sets the text attribute on the heading elements, which triggers toc into running post-processors on it.

I thought about bypassing that by setting a data-text attribute instead of text directly, but I cannot do that for subheadings (headings in docstrings), and I do need potential placeholders to be replaced anyway.

pawamoy added a commit to pawamoy/markdown that referenced this issue Feb 21, 2025
Using a set allows for better performance when checking whether a tag is a block-level element.

Issue-1507: Python-Markdown#1507
pawamoy added a commit to pawamoy/markdown that referenced this issue Feb 21, 2025
Previously, the raw HTML post-processor would precompute all possible replacements for placeholders in a string, based on the HTML stash.

It would then apply a regular expression substitution using these replacements.

Finally, if the text changed, it would recurse, and do all that again.

This was inefficient because the replacements were re-computed each time it recursed, and because only a few of them would be used anyway.

This change moves the recursion into the regular expression substitution, so that:

1. the regular expression does minimal work on the text (instead of re-scanning text already scanned in previous recursion frames);
2. more importantly, replacements aren't computed ahead of time anymore (let alone *several times*), and are only fetched from the HTML stash as placeholders are found in the text.

The substitution function relies on the ordering of the regular expression groups: we make sure to match `<p>PLACEHOLDER</p>` first, before `PLACEHOLDER`. The presence of a wrapping `p` tag indicates whether to wrap the substitution result again (also depending on whether the substituted HTML is a block-level element).

Issue-1507: Python-Markdown#1507
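
A minimal sketch of the group ordering described above (illustrative only, not the actual patch; `raw_blocks` and `is_block_level` stand in for the real stash list and block-level check):

```python
import re
from markdown.util import HTML_PLACEHOLDER

placeholder = HTML_PLACEHOLDER % r'(\d+)'
# The <p>-wrapped alternative comes first, so it wins when both could match.
pattern = re.compile(f'<p>{placeholder}</p>|{placeholder}')

def make_substitute(raw_blocks, is_block_level):
    def substitute(match):
        wrapped, bare = match.group(1), match.group(2)
        index = int(wrapped if wrapped is not None else bare)
        # Recurse into the fetched block so nested placeholders are resolved.
        html = pattern.sub(substitute, str(raw_blocks[index]))
        if wrapped is not None and not is_block_level(html):
            # The placeholder was wrapped in <p> but the stashed HTML is
            # inline, so keep the paragraph wrapper around the result.
            return f'<p>{html}</p>'
        return html
    return substitute
```
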
pawamoy linked a pull request Feb 21, 2025 that will close this issue