Simple regexes like \w{256} take tens of milliseconds to compile #1095
Comments
I've added a benchmark for this to rebar, including a brief analysis: https://github.com/BurntSushi/rebar/blob/9bac8479a550b229e93ea2a7ce71cb6067e05cb2/benchmarks/definitions/reported/i1095-word-repetition.toml

After building the engines (the hard part), collecting measurements for this benchmark is easy:
The short answer here is that
This crate is not only faster than RE2 (although they're still in the same ballpark), but it's also faster than it was about a year ago. See the analysis I linked above for why Go's engine compiles this regex so quickly despite being based on finite automata. As for the rest, most engines are backtracking-based. A backtracker does not need to convert large sets of codepoints into UTF-8 automata, so its compilation phase is usually quite quick. Indeed, there's often little to no difference between compiling the Unicode-aware
I'm not familiar enough with the
Yes, Unicode is hard. If you care about CPU time while running tests, it would probably be a good idea to compile the tests with optimizations enabled. Sadly, this is a
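For the "compile tests with optimizations" suggestion, one option (a sketch; check Cargo's profile-override documentation for your setup) is to optimize just the regex dependencies in the dev profile, which `cargo test` inherits by default, while leaving your own code unoptimized for fast incremental builds:

```toml
# Cargo.toml profile override: optimize the regex crates themselves so
# Regex::new runs at release speed even in debug/test builds, while your
# own code keeps fast, unoptimized incremental compiles.
[profile.dev.package.regex]
opt-level = 3

[profile.dev.package.regex-automata]
opt-level = 3

[profile.dev.package.regex-syntax]
opt-level = 3
```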
It's also worth adding, although perhaps this won't help you, that you can shrink
And note that almost this exact example is warned about in the crate docs: https://docs.rs/regex/latest/regex/#unicode-can-impact-memory-usage-and-search-speed and https://docs.rs/regex/latest/regex/#avoid-re-compiling-regexes-especially-in-a-loop
Thanks for the thorough analysis; I really appreciate that you took the time to document it.
Yes, sadly my coworkers have started doing just that. For the record, our regex is
Yes, I increased the length of the regex to make the issue more obvious, but note that it only takes 6 of those
I don't know if it's worth its own benchmark, but I guess
No, I don't think it's worth its own benchmark. They are essentially the same regex, with one just being bigger than the other.

Do you really need Unicode here? The PR you linked says it is matching a legacy ID. Is it really true that the IDs can contain any combination of Unicode word characters?

It also seems like each test runs in its own process? That's probably another thing to change. That model usually winds up not doing so well precisely because startup costs aren't amortized. Regexes are just one of those costs.

Either way, there really isn't much I can do on my end. Unicode makes stuff like this very difficult in a general-purpose finite automata regex engine that prioritizes match performance.
Well, it's a regex already used in production to validate data coming from users, and given that we make no assumptions about them, using
I realized an alternative is to use a regex like
I think I just understood this part. What you're saying is that when compiling
I guess many of those 21536 intermediate states can be trimmed, as some of them probably won't lead to any

Did I get that right? I don't know much about regex engines, but that would definitely explain the 1000x to 400000x time difference shown by your benchmark.

Anyway, thank you for your thoughtful replies :)
Yes, it is definitely sensible! I just mean that now that you've seen compile time is an issue, you might reevaluate whether Unicode is truly required. But converting the bounded repeat to an unbounded one is a standard workaround. Sorry, I should have mentioned that initially. Nice find.
For the most part, yes. The NFA compiler uses Daciuk's algorithm to build an approximately minimal FSM, but that only applies to the forward direction. It also needs to build a reverse FSM, and using Daciuk's algorithm there is not enabled by default because it would likely lead to much worse compile times. Instead, a heuristic is used for exploiting redundant structure.

Interestingly, in your case, the regex is fully anchored, and that perhaps means a reverse NFA isn't needed at all. That would be a nice optimization. It won't magically fix compile times, but it should reduce them by approximately half.
I created #1097 for tracking the anchored optimization I mentioned above.
What version of regex are you using?

1.9.6
Describe the bug at a high level.
Simple regexes like \w{256} take tens of milliseconds to compile in release mode, and hundreds of milliseconds in debug mode.

Consider a code base with a single regex like this and a hundred tests in debug mode. If each test runs in a different process, that's about 30 seconds of CPU time spent uniquely building this one regex for every run of the test suite.
What are the steps to reproduce the behavior?
What is the actual behavior?
About 40 milliseconds in release mode and 290 milliseconds in debug mode to build \w{256}:

What is the expected behavior?
Something similar to Python, where it takes about 50 microseconds to build and match \w{256}: