-
Notifications
You must be signed in to change notification settings - Fork 7
re2r back on CRAN
Regular expressions are powerful tools for parsing loosely structured text data. Russ Cox’s ”Regular Expression Matching Can Be Simple And Fast” explains that due to backreference support, several common regular expression engines have a time complexity which is exponential in pattern size (including PCRE which is used by R). One way to achieve polynomial time complexity is to drop backreference support and use the re2 C++ library.
- Base R functions such as
regexpr
use PCRE when given theperl=TRUE
argument. PCRE includes many useful features, such as named capture, but has an exponential time complexity. - Base R functions such as
regexpr
use TRE when given theperl=FALSE
argument. TRE has a polynomial time complexity but does not include named capture groups. -
stringr::str_match
andstringi::stri_match
use the regex engine from the ICU library, which has an exponential time complexity. The stringi package does not support named capture yet as such a feature set is still considered as experimental in ICU.
The figures below show timings of these libraries for matching the pattern a?a?a?aaa
against the subject aaa
(N=3), for several different values of N. Source.
R package re2r was implemented by Qin Wenfeng in R-GSOC’16 but was archived (removed from CRAN) recently, https://cloud.r-project.org/web/packages/re2r/
TODO MORE DETAILS
The goal of this project is to get re2r back up on CRAN. That involves fixing the issues discussed here, https://github.com/qinwf/re2r/issues/22
Fixing some of the other open issues would be good as well, https://github.com/qinwf/re2r/issues
The re2 package provides a fast and secure way to extract data from text files using regex. More specifically, the polynomial time complexity of the re2 package will be useful for writing secure web applications that accept user-provided regular expressions. For example, say that you write a shiny app that accepts a user-provided regex. If R passes this regex to existing regex libraries such as PCRE or ICU, then this shiny app is vulnerable to a denial-of-service attack. If the user’s regex is malicious, then it could cause the R on the shiny server to stay busy computing a match for a long time (worst case exponential time complexity). In contrast, the re2 library has a worst case polynomial time complexity so would not suffer from such attacks.
Please get in touch with Toby Dylan Hocking <[email protected]> and Qin Wenfeng <[email protected]> after completing at least one of the tests below.
Do one or several — doing more hard tests makes you more likely to be selected.
- Easy: change the benchmark code to test a pattern and subject that you have encountered in one of your own data analyses, and re-compute the timings figures. Make a package with a vignette that has the benchmark timings figures.
- Medium: read Marek’s blog post about internationalized/unicode regular expressions, and write some R code that tests for unicode support in each of the 3 currently available regex engines (PCRE, TRE, ICU).
- Hard: demonstrate that you know how to interface C++ code with R by writing a package with a function that calls some custom C++ code (e.g., try to call
stri_enc_toutf8
from the latest stringi-devel C API from within C++ code).
Students, please post a link to your test results here.