Pattern Analysis #86
Adds functionality to analyze the minimum and maximum # of characters a regex may match.
I don't understand the motivation of the PR. What's the use case?
In my use case I'm using it to categorize regexes at compile time into two sets: ones which may consume characters when they succeed, and ones which always do. I'm using your library together with taocpp's PEGTL, a C++ library which also mimics regular expression rules but is used to construct more complex grammars. PEGTL can analyze a grammar for problems, for example whether you've created a grammar that contains an infinite loop, but to work that out it needs to classify every rule into one of four categories: the two I've described, plus a rule that behaves like an alternation and a rule that behaves like a sequence. The latter two eventually boil down to one of the first two.

The other use case would be rejecting input strings which are too short. If an input string is 10 characters long but we know the regex requires a bare minimum of 11, we can reject it at the start instead of processing all the rules. By rewriting the evaluation function a bit, with an extra control struct at the start, you could take these results and perform a size check on the start and end iterators. If we're looking for an exact match we can also ignore input strings which are longer than the maximum. You could also use it in the search function to terminate early, since you'll have what amounts to the window size of the regex.

Say for example we construct a regex [...] If we were searching and had a regex like [...] If we had a regex like [...]

With how you've structured the structs it becomes really quick to calculate the whole expression because it generalizes: any given regex could have been wrapped in a capture group, which is itself a regex, and then had one of the modifiers applied; equally, any given regex on its own is equivalent to having wrapped it in a capture group with [...] It might stand out a bit more with the equivalent regex [...]

template <typename Pattern>
static constexpr auto trampoline_analysis(Pattern) noexcept;
template <typename... Patterns>
static constexpr auto trampoline_analysis(ctll::list<Patterns...>) noexcept;
template <typename T, typename R>
static constexpr auto trampoline_analysis(T, R captures) noexcept;

// calling with pattern prepare stack and triplet of iterators
template <typename Iterator, typename EndIterator, typename Pattern>
constexpr inline auto match_re(const Iterator begin, const EndIterator end, Pattern pattern) noexcept {
    using return_type = decltype(regex_results(std::declval<Iterator>(), find_captures(pattern)));
    const analysis_results min_max_range = trampoline_analysis(ctll::list<Pattern>(), return_type{});
    const size_t input_size = std::distance(begin, end);
    // perform a single size check at the start
    if (input_size < min_max_range.first || input_size > min_max_range.second)
        return return_type{};
    else
        return evaluate(begin, begin, end, return_type{}, ctll::list<start_mark, Pattern, assert_end, end_mark, accept>());
}
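The min/max calculation described in this comment can be sketched roughly as follows. This is only an illustration of the idea, not the code in the PR: the node types `any_char`, `seq`, `alt` and `rep`, the `length_range` struct, and the `analyze` / `always_consumes` functions are all hypothetical names made up for the example and are not CTRE's real types. A sequence sums the ranges of its parts, an alternation takes the smallest minimum and the largest maximum, and a repetition multiplies the body's range by its bounds.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>

// Hypothetical pattern nodes; they only mimic the shape of a type-level regex AST.
struct any_char {};                                  // matches exactly one character
template <typename... Parts> struct seq {};          // parts matched one after another
template <typename First, typename... Rest> struct alt {};  // exactly one option matches
template <std::size_t Lo, std::size_t Hi, typename Body> struct rep {};  // Body repeated Lo..Hi times

struct length_range {
    std::size_t min;
    std::size_t max;
};

// Marker for "no upper limit", e.g. the upper bound of `*` and `+`.
inline constexpr std::size_t unbounded = std::numeric_limits<std::size_t>::max();

// Saturating arithmetic so that "unbounded" never wraps around.
constexpr std::size_t sat_add(std::size_t a, std::size_t b) {
    return (a > unbounded - b) ? unbounded : a + b;
}
constexpr std::size_t sat_mul(std::size_t a, std::size_t b) {
    if (a == 0 || b == 0) return 0;
    return (a > unbounded / b) ? unbounded : a * b;
}

// Forward declarations so the overloads can call each other recursively.
constexpr length_range analyze(any_char) { return {1, 1}; }
template <typename... Parts>
constexpr length_range analyze(seq<Parts...>);
template <typename First, typename... Rest>
constexpr length_range analyze(alt<First, Rest...>);
template <std::size_t Lo, std::size_t Hi, typename Body>
constexpr length_range analyze(rep<Lo, Hi, Body>);

// A sequence consumes the sum of what its parts consume.
template <typename... Parts>
constexpr length_range analyze(seq<Parts...>) {
    length_range total{0, 0};
    ((total = length_range{sat_add(total.min, analyze(Parts{}).min),
                           sat_add(total.max, analyze(Parts{}).max)}), ...);
    return total;
}

// An alternation consumes as little as its shortest option and as much as its longest.
template <typename First, typename... Rest>
constexpr length_range analyze(alt<First, Rest...>) {
    length_range best = analyze(First{});
    ((best = length_range{std::min(best.min, analyze(Rest{}).min),
                          std::max(best.max, analyze(Rest{}).max)}), ...);
    return best;
}

// A repetition scales the body's range by the repetition bounds.
template <std::size_t Lo, std::size_t Hi, typename Body>
constexpr length_range analyze(rep<Lo, Hi, Body>) {
    const length_range body = analyze(Body{});
    return {sat_mul(Lo, body.min), sat_mul(Hi, body.max)};
}

// The "always consumes" vs "may consume" split from the use case above.
template <typename Pattern>
constexpr bool always_consumes(Pattern) {
    return analyze(Pattern{}).min > 0;
}

// Roughly `..{2,4}`: one character followed by two to four more.
static_assert(analyze(seq<any_char, rep<2, 4, any_char>>{}).min == 3);
static_assert(analyze(seq<any_char, rep<2, 4, any_char>>{}).max == 5);
```

Because everything is `constexpr`, a caller could compare the resulting range against the input length before evaluating the pattern, which is the size check the comment describes.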
I didn't consider variable-length encodings before. Looking back, here's a better sketch of what would be possible:

// calling with pattern prepare stack and triplet of iterators
template <typename Iterator, typename EndIterator, typename Pattern>
constexpr inline auto match_re(const Iterator begin, const EndIterator end, Pattern pattern) noexcept {
    using return_type = decltype(regex_results(std::declval<Iterator>(), find_captures(pattern)));
    if constexpr (!ctre::is_variable_length_encoded<Iterator>() && std::is_same<typename std::iterator_traits<Iterator>::iterator_category, std::random_access_iterator_tag>::value) {
        constexpr auto lengths = ctre::pattern_match_minmax_characters(ctll::list<start_mark, Pattern, assert_end, end_mark, accept>(), return_type());
        // check the size of the input string
        auto length = std::distance(begin, end);
        if (length >= lengths.min && length <= lengths.max) // if not within bounds we can avoid this call
            return evaluate(begin, begin, end, return_type{}, ctll::list<start_mark, Pattern, assert_end, end_mark, accept>());
        else
            return return_type{};
    } else {
        // normal unchecked size call
        return evaluate(begin, begin, end, return_type{}, ctll::list<start_mark, Pattern, assert_end, end_mark, accept>());
    }
}

template <typename Iterator, typename EndIterator, typename Pattern>
constexpr inline auto starts_with_re(const Iterator begin, const EndIterator end, Pattern pattern) noexcept {
    using return_type = decltype(regex_results(std::declval<Iterator>(), find_captures(pattern)));
    if constexpr (!ctre::is_variable_length_encoded<Iterator>() && std::is_same<typename std::iterator_traits<Iterator>::iterator_category, std::random_access_iterator_tag>::value) {
        constexpr auto lengths = ctre::pattern_match_minmax_characters(ctll::list<start_mark, Pattern, end_mark, accept>(), return_type());
        // check the size of the input string
        auto length = std::distance(begin, end);
        if (length >= lengths.min) // only the minimum is checked: starts_with implicitly trails with (.*), so the max is infinite
            return evaluate(begin, begin, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>());
        else
            return return_type{};
    } else {
        // normal unchecked size call
        return evaluate(begin, begin, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>());
    }
}

template <typename Iterator, typename EndIterator, typename Pattern>
constexpr inline auto search_re(const Iterator begin, const EndIterator end, Pattern pattern) noexcept {
    using return_type = decltype(regex_results(std::declval<Iterator>(), find_captures(pattern)));
    constexpr bool fixed = starts_with_anchor(ctll::list<Pattern>{});
    if constexpr (!ctre::is_variable_length_encoded<Iterator>() && std::is_same<typename std::iterator_traits<Iterator>::iterator_category, std::random_access_iterator_tag>::value) {
        constexpr auto lengths = ctre::pattern_match_minmax_characters(ctll::list<start_mark, Pattern, end_mark, accept>(), return_type());
        // check the size of the input string
        auto length = std::distance(begin, end);
        auto it = begin;
        for (; end != it && !fixed && length >= lengths.min; ++it) { // similar to starts_with, but we loop
            if (auto out = evaluate(begin, it, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>())) {
                return out;
            }
            length--; // one fewer character remains for the next attempt
        }
        // in case the RE is empty or fixed
        return evaluate(begin, it, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>());
    } else {
        // normal unchecked size loop
        auto it = begin;
        for (; end != it && !fixed; ++it) {
            if (auto out = evaluate(begin, it, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>())) {
                return out;
            }
        }
        // in case the RE is empty or fixed
        return evaluate(begin, it, end, return_type{}, ctll::list<start_mark, Pattern, end_mark, accept>());
    }
}

I'm probably going to rebase this PR onto my rewrite, which does the above.
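Separately from that rewrite, the early-termination idea in the search loop can be shown in isolation. The sketch below is a minimal stand-alone illustration, assuming a toy "pattern" of exactly three decimal digits; `pattern_min`, `matches_at`, `bounded_search` and `npos` are hypothetical names for this example and have nothing to do with CTRE's API. The point is only that once fewer than the minimum number of characters remain, no further start position needs to be tried.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for a compiled pattern: exactly three decimal digits,
// so its minimum (and maximum) match length is 3.
inline constexpr std::size_t pattern_min = 3;
inline constexpr std::size_t npos = static_cast<std::size_t>(-1);

// Try to match the toy pattern starting at `pos`.
// The caller guarantees that at least `pattern_min` characters remain.
constexpr bool matches_at(std::string_view input, std::size_t pos) {
    for (std::size_t i = 0; i < pattern_min; ++i) {
        const char c = input[pos + i];
        if (c < '0' || c > '9') return false;
    }
    return true;
}

// Search for the first match, stopping as soon as the remaining input
// is shorter than the minimum length the pattern can match.
constexpr std::size_t bounded_search(std::string_view input) {
    for (std::size_t pos = 0; pos + pattern_min <= input.size(); ++pos) {
        if (matches_at(input, pos)) return pos;
    }
    return npos;
}

static_assert(bounded_search("ab123") == 2);
static_assert(bounded_search("12") == npos);  // shorter than the minimum: no attempt is made
```

For an input of length n and a minimum of m, this tries at most n - m + 1 start positions instead of n.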
Had a thought: pattern analysis can allow you to perform these transformations:

ctll::list<assert_begin, consuming sequence, assert_begin, Content...> -> ctll::list<reject>
ctll::list<assert_end, consuming sequence, assert_end, Content...> -> ctll::list<reject>

You'd only need to prove that some sequence between the anchors consumes a minimum of 1 character, which the algorithm can provide.
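As a rough sketch of the first transformation, assuming a single element between the two anchors and a precomputed minimum length per element: the `list`, `assert_begin` and `reject` types below are local stand-ins that only mirror the names used above, and `character`, `optional_character`, `min_length` and `simplify` are hypothetical helpers invented for this example. It also assumes `assert_begin` can only succeed at the very start of the input.

```cpp
#include <cstddef>
#include <type_traits>

// Local stand-ins; not CTRE's real definitions.
template <typename... Ts> struct list {};
struct assert_begin {};
struct reject {};
struct character {};           // always consumes exactly one character
struct optional_character {};  // consumes zero or one character

// Hypothetical result of the pattern analysis: minimum characters consumed.
template <typename T> inline constexpr std::size_t min_length = 0;
template <> inline constexpr std::size_t min_length<character> = 1;

// By default a pattern list is left unchanged.
template <typename List>
struct simplify {
    using type = List;
};

// assert_begin, X, assert_begin, ...: if X always consumes at least one
// character, the second anchor can never hold, so the whole list rejects.
template <typename Middle, typename... Rest>
struct simplify<list<assert_begin, Middle, assert_begin, Rest...>> {
    static constexpr bool middle_always_consumes = (min_length<Middle> != 0);
    using type = std::conditional_t<middle_always_consumes,
                                    list<reject>,
                                    list<assert_begin, Middle, assert_begin, Rest...>>;
};

template <typename List>
using simplify_t = typename simplify<List>::type;

static_assert(std::is_same_v<simplify_t<list<assert_begin, character, assert_begin>>,
                             list<reject>>);
```

When the element in the middle may match the empty string, as with `optional_character` here, the list is left unchanged, because the second anchor can still succeed.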
Mind that every such analysis / transformation instantiates a large number of templates and has a big impact on compile time. Also, optimizers can already see a lot of things:
I guess that optimization should be expected, since the compiler would likely track the value of the pointer and work out that it would be incremented later. It just strikes me how much compile-time cost you're paying for having to do this with templates when, for the most part, regexes are relatively straightforward and could probably be turned into bytecode, passed to a compiler, optimized, and so on, and would likely give you back the same results in almost no time. Say someone were particularly crazy and made a frontend specifically for regex to LLVM and had it build out a function per regex. That doesn't work out in terms of having something native for C++. But if there were an ability to generate bytecode that the compiler could take and use to define a function, a lot of this would probably be a lot less painful. I really wish you weren't fighting compile-time costs, because in terms of writing optimizations you've made it pretty easy, even if the compiler is showing me up by working those out in the end anyway. Like here: #72 you mention