Skip to content
This repository has been archived by the owner on Nov 15, 2024. It is now read-only.

Latest commit

 

History

History
79 lines (64 loc) · 4.31 KB

README.md

File metadata and controls

79 lines (64 loc) · 4.31 KB

Humio JITrex - Fast JVM Bytecode Regex Engine

This project is based on work done in 2000 by Peter Sorotokin as part of the now defunct Jint Programming Language. Peter kindly agreed to relicense the original code under Apache License. See background.

A regex engine that has the following properties:

  • Compiling a regex with com.humio.jitrex.Pattern.compile(...) produces a custom Java class. This step is not as fast as the other regex engines.
  • The resulting regex however is fast, especially once HotSpot has jit'ed the code, so you should cache the compiled Pattern instance.
  • Matching (generally) only does allocation on the first invocation, and the execution behavior of the matcher has very good locality.
  • The regex engine does use backtracking, but inside the buffer allocated in the first application of a Matcher and without overrunning the stack as can happen in pathological cases with java.lang.util.Regex.
  • As another consequence of using backtracking, capture groups work faster than e.g. RE2J.
  • Very large regular expressions can fail to compile if the resuling byte code does not fit the JVM's 64k limit per method.
  • At present, it does not understand code points beyond 0xffff, so . does not match 💩, which is represented as two 16-bit chars. If however your input regex contains a 💩, then it will match just fine. So in most cases this is not an issue.

This is not a regex engine for everyone. It is (relatively) slow to compile, but fast at matching if you use the same Matcher instance many times.

Performance

For our use case, the Mark Twain samples provide a good indication of how jitrex performs. Here is a sample run comparing it with java.util.regex and com.google.re2j. It runs the following regexes over the 302,278 lines of text (15MB) in the collective Mark Twain texts from the Gutenberg project:

regex #found jitrex com.google.re2j java.util.regex
regex matches jitrex com.google.re2j java.util.regex
--- --- --- --- ---
/Twain/ 811 50ms 19ms (38%) 38ms (76%)
/(?i)Twain/ 965 81ms 1129ms (1393%) 115ms (141%)
/[a-z]shing/ 1540 155ms 920ms (593%) 183ms (118%)
/Huck[a-zA-Z]+|Saw[a-zA-Z]+/ 262 140ms 1281ms (915%) 241ms (172%)
/\b\w+nn\b/ 262 180ms 1322ms (734%) 326ms (181%)
/[a-q][^u-z]{13}x/ 3021 426ms 2435ms (571%) 454ms (106%)
/Tom|Sawyer|Huckleberry|Finn/ 2598 181ms 2407ms (1329%) 448ms (247%)
/(?i)(Tom|Sawyer|Huckleberry|Finn)/ 4152 286ms 4853ms (1696%) 572ms (200%)
/.{0,2}(?:Tom|Sawyer|Huckleberry|Finn)/ 2598 594ms 3547ms (597%) 1821ms (306%)
/.{2,4}(Tom|Sawyer|Huckleberry|Finn)/ 1976 995ms 4151ms (417%) 1784ms (179%)
/Tom.{10,25}river|river.{10,25}Tom/ 2 116ms 1308ms (1127%) 201ms (173%)
/[a-zA-Z]+ing/ 78424 165ms 1157ms (701%) 435ms (263%)
/\s[a-zA-Z]{0,12}ing\s/ 45393 271ms 1580ms (583%) 432ms (159%)
/([A-Za-z]awyer|[A-Za-z]inn)\s/ 194 278ms 1643ms (591%) 585ms (210%)
/["'][^"']{0,30}[?!\.]["']/ 8072 148ms 913ms (616%) 174ms (117%)
/∞|✓/ 2 111ms 759ms (683%) 399ms (359%)
/\p{Sm}/ 69 113ms 732ms (647%) 140ms (123%)

Just adding up these numbers provides an overall ~2x speedup relative to java.util.regex, and 7x speedup relative to com.google.re2j.

Background

This project is based on work done in 2000 by Peter Sorotokin as part of the now defunct Jint Programming Language. Peter kindly agreed to relicense the original code under Apache License.

This project also includes code developed by Google as part of the re2j project, made available under the Go License.

Based on Peter Sotorokin's work, we have

  • Added a Java-like API (com.humio.jitrex.Pattern and com.humio.jitrex.Matcher),
  • Removed some Perl-style regex features like < and >,
  • Added many RE2-style regex features such as (?i:...) group local flags, \p{Greek} unicode classes,
  • Made the code pass all relevant parts of the re2j junit tests.
  • Fixed numerous bugs in the process passing the tests.