-
Notifications
You must be signed in to change notification settings - Fork 871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimizing the raw HTML post-processor #1507
Comments
You are certainly welcome to contribute in that way. As far as benchmarks, if you can provide a reproducible command that can be run both before and after the changes which shows a clear performance boost, then that should be sufficient. Note that the command should be run directly on Markdown without any additional third-party code (such as MkDocs). Generally, Python's own built-in tools work fine. |
I've started working on this, and will try to provide some context first. I'm working on a new feature in mkdocstrings/autorefs and noticed that build time went up by a lot. I noticed a lot of time was spent in the On certain relatively big pages, the The processor iterates on this list several times: once more (with recursion) as long as placeholders were found and replaced in the current text. Each time it builds a dictionary of string replacements, which it uses in a This is done for the page content, but also for each ToC entry (in Now for possible optimizations.
Other, less clear optimization: seeing how the stash list has lots of repeated elements, there might be more performant data structures / approaches. But that's more refactor than I'm comfortable with, and the perf gain from the suggestions above is already sufficient, from what I've seen. |
This comment has been minimized.
This comment has been minimized.
Actually there's another approach that seems even more obvious: don't build replacements before hand, and only fetch them when we found a placeholder to replace. This makes all the other optimizations irrelevant (except for list->set which I'd still recommend). |
Just a word of caution: the existing implementation of running the regex on already processed text exists because a raw block could contain a placeholder for another raw block. I don't recall if we have a test for that scenario, but that is why it works as it does. How do we get into that situation? I assume a user will have multiple third-party extensions, each of which add placeholders in the text at different stages in some completely unpredictable way. Therefore, we could have a raw block which contains a placeholder, which points to another raw block which contains a placeholder, which points to another raw block which contains a placeholder and so on. All of the placeholders must be swapped back out. The way I solved that was to keep checking the entire text until there are no more matches. I suppose an alternative approach would be to run the regex on the raw block before inserting it back into the document. But I don't think that necessarily reduces the number of times the regex is run. |
Yes, that is what I understood 🙂 And your suggested alternative is what I suggested in point 3. I agree the regex will not run fewer times, but it will run on less text. OK you can run a script in my fork: git clone https://github.com/pawamoy/markdown pawamoy-markdown
cd pawamoy-markdown
git checkout optimize-raw-html-postprocessor
uvx --with matplotlib python benchmarks.py This should display a chart where you'll see that the baseline (current implementation) depends on the size of the stash (as well as the number of placeholders in the text), while the new implementation only depends on the number of placeholders in the text. In this example the processor runs twice, once with an empty string and a second time with a string with 10 placeholders. Let me generate a second chart where the number of placeholders increases. |
I also found what exactly in the new feature I'm working on was making the build time go up like this. Previously, mkdocstrings injected "empty" headings (just an id and data-toc-label) in I thought about bypassing that by setting a |
Using a set allows for better performances when checking for membership of a tag within block level elements. Issue-1507: Python-Markdown#1507
Previously, the raw HTML post-processor would precompute all possible replacements for placeholders in a string, based on the HTML stash. It would then apply a regular expression substitution using these replacements. Finally, if the text changed, it would recurse, and do all that again. This was inefficient because placeholders were re-computed each time it recursed, and because only a few replacements would be used anyway. This change moves the recursion into the regular expression substitution, so that: 1. the regular expression does minimal work on the text (contrary to re-scanning text already scanned in previous frames); 2. but more importantly, replacements aren't computed ahead of time anymore (and even less *several times*), and only fetched from the HTML stash as placeholders are found in the text. The substitution function relies on the regular expression groups ordering: we make sure to match `<p>PLACEHOLDER</p>` first, before `PLACEHOLDER`. The presence of a wrapping `p` tag indicates whether to wrap again the substitution result, or not (also depending on whether the substituted HTML is a block-level tag). Issue-1507: Python-Markdown#1507
Previously, the raw HTML post-processor would precompute all possible replacements for placeholders in a string, based on the HTML stash. It would then apply a regular expression substitution using these replacements. Finally, if the text changed, it would recurse, and do all that again. This was inefficient because placeholders were re-computed each time it recursed, and because only a few replacements would be used anyway. This change moves the recursion into the regular expression substitution, so that: 1. the regular expression does minimal work on the text (contrary to re-scanning text already scanned in previous frames); 2. but more importantly, replacements aren't computed ahead of time anymore (and even less *several times*), and only fetched from the HTML stash as placeholders are found in the text. The substitution function relies on the regular expression groups ordering: we make sure to match `<p>PLACEHOLDER</p>` first, before `PLACEHOLDER`. The presence of a wrapping `p` tag indicates whether to wrap again the substitution result, or not (also depending on whether the substituted HTML is a block-level tag). Issue-1507: Python-Markdown#1507
Locally, I was able to shave 20 seconds off a MkDocs build (from ~50s to ~30s) by optimizing a bit the raw HTML post processor. Would you be open to a PR that optimizes the post-processor code a bit? Any guidance on how to provide benchmarks and results to confirm to performance gain?
The text was updated successfully, but these errors were encountered: