Feature/regular expression metafunction #904

MaxSagebaum · 2023-12-21T19:29:55Z

I will update this overview such that it is easy to grasp the status of the implementation.

Example file: example.cpp2

example: @regex type = {
  regex := "ab*bd";
}
main: (args) = {
    r := example().regex.search("abbbbbdfoo");
    std::cout << "got: (r.group(0))$" << std::endl;
}

Current status and planned work

Modifiers

 - [x] i                Do case-insensitive pattern matching. For example, "A" will match "a" under /i.
 - [x] m                Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.
 - [x] s                Treat the string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
 - [x] x and xx         Extend your pattern's legibility by permitting whitespace and comments. Details in "/x and /xx"
 - [x] n                Prevent the grouping metacharacters () from capturing. This modifier, new in 5.22, will stop $1, $2, etc... from being filled in.
 - [ ] c                keep the current position during repeated matching

Escape sequences (Complete)

 - [x] \t          tab                   (HT, TAB)
 - [x] \n          newline               (LF, NL)
 - [x] \r          return                (CR)
 - [x] \f          form feed             (FF)
 - [x] \a          alarm (bell)          (BEL)
 - [x] \e          escape (think troff)  (ESC)
 - [x] \x{}, \x00  character whose ordinal is the given hexadecimal number
 - [x] \o{}, \000  character whose ordinal is the given octal number

Quantifiers (Complete)

 - [x] *           Match 0 or more times
 - [x] +           Match 1 or more times
 - [x] ?           Match 1 or 0 times
 - [x] {n}         Match exactly n times
 - [x] {n,}        Match at least n times
 - [x] {,n}        Match at most n times
 - [x] {n,m}       Match at least n but not more than m times
 - [x] *?        Match 0 or more times, not greedily
 - [x] +?        Match 1 or more times, not greedily
 - [x] ??        Match 0 or 1 time, not greedily
 - [x] {n}?      Match exactly n times, not greedily (redundant)
 - [x] {n,}?     Match at least n times, not greedily
 - [x] {,n}?     Match at most n times, not greedily
 - [x] {n,m}?    Match at least n but not more than m times, not greedily
 - [x] *+     Match 0 or more times and give nothing back
 - [x] ++     Match 1 or more times and give nothing back
 - [x] ?+     Match 0 or 1 time and give nothing back
 - [x] {n}+   Match exactly n times and give nothing back (redundant)
 - [x] {n,}+  Match at least n times and give nothing back
 - [x] {,n}+  Match at most n times and give nothing back
 - [x] {n,m}+ Match at least n but not more than m times and give nothing back

Character Classes and other Special Escapes (Complete)

 - [x] [...]     [1]  Match a character according to the rules of the
                    bracketed character class defined by the "...".
                    Example: [a-z] matches "a" or "b" or "c" ... or "z"
 - [x] [[:...:]] [2]  Match a character according to the rules of the POSIX
                    character class "..." within the outer bracketed
                    character class.  Example: [[:upper:]] matches any
                    uppercase character.
 - [x] \g1       [5]  Backreference to a specific or previous group,
 - [x] \g{-1}    [5]  The number may be negative indicating a relative
                  previous group and may optionally be wrapped in
                  curly brackets for safer parsing.
 - [x] \g{name}  [5]  Named backreference
 - [x] \k<name>  [5]  Named backreference
 - [x] \k'name'  [5]  Named backreference
 - [x] \k{name}  [5]  Named backreference
 - [x] \w        [3]  Match a "word" character (alphanumeric plus "_", plus
                    other connector punctuation chars plus Unicode
                    marks)
 - [x] \W        [3]  Match a non-"word" character
 - [x] \s        [3]  Match a whitespace character
 - [x] \S        [3]  Match a non-whitespace character
 - [x] \d        [3]  Match a decimal digit character
 - [x] \D        [3]  Match a non-digit character
 - [x] \v        [3]  Vertical whitespace
 - [x] \V        [3]  Not vertical whitespace
 - [x] \h        [3]  Horizontal whitespace
 - [x] \H        [3]  Not horizontal whitespace
 - [x] \1        [5]  Backreference to a specific capture group or buffer.
                    '1' may actually be any positive integer.
 - [x] \N        [7]  Any character but \n.  Not affected by /s modifier
 - [x] \K        [6]  Keep the stuff left of the \K, don't include it in $&

Assertions

 - [x] \b     Match a \w\W or \W\w boundary
 - [x] \B     Match except at a \w\W or \W\w boundary
 - [x] \A     Match only at beginning of string
 - [x] \Z     Match only at end of string, or before newline at the end
 - [x] \z     Match only at end of string
 - [ ] \G     Match only at pos() (e.g. at the end-of-match position
          of prior m//g)

Capture groups (Complete)

 - [x] (...)

Quoting metacharacters (Complete)

 - [x] For ^.[]$()*{}?+|\

Extended Patterns

 - [x] (?<NAME>pattern)            Named capture group
 - [x] (?#text)                    Comments
 - [x] (?adlupimnsx-imnsx)         Modification for surrounding context
 - [x] (?^alupimnsx)               Modification for surrounding context
 - [x] (?:pattern)                 Clustering, does not generate a group index.
 - [x] (?adluimnsx-imnsx:pattern)  Clustering, does not generate a group index and modifications for the cluster.
 - [x] (?^aluimnsx:pattern)        Clustering, does not generate a group index and modifications for the cluster.
 - [x] (?|pattern)                 Branch reset
 - [x] (?'NAME'pattern)            Named capture group
 - [ ] (?(condition)yes-pattern|no-pattern)  Conditional patterns.
 - [ ] (?(condition)yes-pattern)             Conditional patterns.
 - [ ] (?>pattern)                 Atomic patterns. (Disable backtrack.)
 - [ ] (*atomic:pattern)           Atomic patterns. (Disable backtrack.)

Lookaround Assertions

 - [x] (?=pattern)                     Positive look ahead.
 - [x] (*pla:pattern)                  Positive look ahead.
 - [x] (*positive_lookahead:pattern)   Positive look ahead.
 - [x] (?!pattern)                     Negative look ahead.
 - [x] (*nla:pattern)                  Negative look ahead.
 - [x] (*negative_lookahead:pattern)   Negative look ahead.
 - [ ] (?<=pattern)                    Positive look behind.
 - [ ] (*plb:pattern)                  Positive look behind.
 - [ ] (*positive_lookbehind:pattern)  Positive look behind.
 - [ ] (?<!pattern)                    Negative look behind.
 - [ ] (*nlb:pattern)                  Negative look behind.
 - [ ] (*negative_lookbehind:pattern)  Negative look behind.

Special Backtracking Control Verbs

 - [ ] (*PRUNE) (*PRUNE:NAME)   No backtracking over this point.
 - [ ] (*SKIP) (*SKIP:NAME)     Start next search here.
 - [ ] (*MARK:NAME) (*:NAME)    Place a named mark.
 - [ ] (*THEN) (*THEN:NAME)     Like PRUNE.
 - [ ] (*COMMIT) (*COMMIT:arg)  Stop searching.
 - [ ] (*FAIL) (*F) (*FAIL:arg) Fail the pattern/branch.
 - [ ] (*ACCEPT) (*ACCEPT:arg)  Accept the pattern/subpattern.

Not planned (mainly because of Unicode or Perl specifics)

Modifiers

 - [ ] p                Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} are available for use after matching.
 - [ ] a, d, l, and u   These modifiers, all new in 5.14, affect which character-set rules (Unicode, etc.) are used, as described below in "Character set modifiers".
 - [ ] g                globally match the pattern repeatedly in the string
 - [ ] e                evaluate the right-hand side as an expression
 - [ ] ee               evaluate the right side as a string then eval the result
 - [ ] o                pretend to optimize your code, but actually introduce bugs
 - [ ] r                perform non-destructive substitution and return the new value

Escape sequences

 - [ ] \cK         control char          (example: VT)
 - [ ] \N{name}    named Unicode character or character sequence
 - [ ] \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
 - [ ] \l          lowercase next char (think vi)
 - [ ] \u          uppercase next char (think vi)
 - [ ] \L          lowercase until \E (think vi)
 - [ ] \U          uppercase until \E (think vi)
 - [ ] \Q          quote (disable) pattern metacharacters until \E
 - [ ] \E          end either case modification or quoted section, think vi

Character Classes and other Special Escapes

 - [ ]  (?[...])  [8]  Extended bracketed character class
 - [ ] \pP       [3]  Match P, named property.  Use \p{Prop} for longer names
 - [ ] \PP       [3]  Match non-P
 - [ ] \X        [4]  Match Unicode "eXtended grapheme cluster"
 - [ ] \R        [4]  Linebreak

Assertions

 - [ ] \b{}   Match at Unicode boundary of specified type
 - [ ] \B{}   Match where corresponding \b{} doesn't match

Extended Patterns

 - [ ] (?{ code })                 Perl code execution.
 - [ ] (*{ code })                 Perl code execution.
 - [ ] (??{ code })                Perl code execution.
 - [ ] (?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)       Recursive subpattern.
 - [ ] (?&NAME)                   Recursive subpattern.

Script runs

 - [ ] (*script_run:pattern)         All chars in pattern need to be of the same script.
 - [ ] (*sr:pattern)                 All chars in pattern need to be of the same script.
 - [ ] (*atomic_script_run:pattern)  Without backtracking.
 - [ ] (*asr:pattern)                Without backtracking.

MaxSagebaum · 2023-12-21T19:37:01Z

I am aiming to implement the POSIX extended specification with a few extras from Perl. All in all, I am sticking to the Perl interpretation of regular expressions.

MaxSagebaum · 2023-12-28T12:33:40Z

The feature set is now complete as stated in https://en.wikipedia.org/wiki/Regular_expression. I grabbed the test suite from https://wiki.haskell.org/Regex_Posix and https://hackage.haskell.org/package/regex-posix-unittest. I am currently working my way through the tests and fixing all the corner cases. In particular, the greedy nature of * and the backtracking needed when it has grabbed too much might require a rework of the matching logic.

MaxSagebaum · 2024-01-03T14:31:10Z

I have now finished the basic implementation, and most of the test suite passes.

Some notable differences with respect to POSIX ERE:

 - Alternatives are not greedy: a|ab matches a and not ab in ab.
 - The match is case sensitive.

I am now looking into performance tests and I will clean up the code.

MaxSagebaum · 2024-07-09T09:45:26Z

@hsutter The regression tests are clean now except for one test where the source line of reflect.h2 has changed. I asked jarzec for help on this.

Otherwise, the branch is ready for review.

hsutter · 2024-07-10T01:59:34Z

Thanks! I just pushed commit 0e1fdd5 which adds support for concatenated string literals that don't use interpolation, so you should now be able to write "\x62" "blub".

MaxSagebaum · 2024-07-10T11:38:40Z

Ok, thanks. I changed it and it works now. Thanks for implementing it.

MaxSagebaum · 2024-07-15T07:21:44Z

Hurray. We are all green now.

hsutter · 2024-07-15T22:57:14Z

Thanks! I'll set aside a block of time to review, planning to start later this week.

If you'll pause making commits to this branch by, say, Wednesday, then I can adjust things via commits to this branch without colliding? (I find fixing merge conflicts difficult, so I avoid making concurrent revisions in the same branch, especially since I plan to rename/move a file or two.)

MaxSagebaum · 2024-07-16T06:06:58Z

Sure, have fun.

In this project I'm trying to build *.h2 files in the same directory as the *.h they generate, and to keep the same name. In /include, "cpp2util.h" is named that way because it really is the Cpp2 run-time library. For regex, we could name it regex.h(2) or cpp2regex.h(2); the argument for using "cpp2" is that it really does include additional run-time support for what will now be one of the Cpp2 built-in metafunctions. Anyway, we can always revisit that in the future.

hsutter · 2024-07-21T01:45:19Z

Looks very good, I think this great work is ready to move to main. Thanks again, @MaxSagebaum !

MaxSagebaum · 2024-07-22T06:31:09Z

Thanks. I am glad you found it OK. Now I know that curly braces have to be on the next line. Sorry for the formatting.

I will add some documentation next week in another pull request, so that users know how to apply this metafunction.

hsutter · 2024-07-22T19:15:22Z

Perfect, thanks!

MaxSagebaum added 3 commits December 21, 2023 13:38

Missing trailing '...' in variadic template arguments.

6496423

Regular expressions initial setup.

30098c6

Current working status.

32e76e3

MaxSagebaum added 8 commits December 27, 2023 14:03

Handling of groups.

d7abbf7

Handling of alternatives.

34be3d3

Refactor to position based matching.

54cb500

Added regular expression class.

d048884

Added class matching.

93c0511

Add line start and end match.

fc8031b

Compatibility fixes.

5d26075

Added test file for regular expressions.

f36bbbe

JohelEGP reviewed Dec 28, 2023

source/regex.h2 (outdated, resolved)

MaxSagebaum added 13 commits January 2, 2024 13:34

Basic state machine for range matchers.

d10faca

Proper range printing and group invalidation on alternatives.

4d58428

State management for ranges. No longer invalid groups if last match fails.

d149512

Range check in class_matcher and restore of groups in ranges_matcher.

1b3456c

Fix for list matcher and state for alternatives.

e30bce2

Improved handling of empty matches and ranges.

f16fcd2

Bugfix for missing semaphore in typed template parameters.

0383119

Support for posix character classes.

2ddeffa

Fix for missing escape of '\'.

654b2e3

Missing group clear in alternate matcher.

b87b08f

Whitespace errors from cppfront.

ed4fba9

Include regular expressions in cpp2utils.hpp.

4717892

Update for tests.

6d98bb3

MaxSagebaum added 2 commits January 4, 2024 09:06

Remove initialization from context.

e261ea3

Basic char matching.

fff7cf1

Updates for regression tests.

bb27c43

MaxSagebaum commented Jul 9, 2024

regression-tests/test-results/msvc-2022-c++20/pure2-assert-expected-not-null.cpp.output (outdated, resolved)

MaxSagebaum added 2 commits July 10, 2024 10:15

Merge remote-tracking branch 'origin/main' into feature/regular_expression_metafunction

5208006

Update for \e escape.

57f8e15

MaxSagebaum and others added 7 commits July 13, 2024 10:23

Merge remote-tracking branch 'origin/main' into feature/regular_expression_metafunction

4cfbe00

CI update tests

f0a76a9

Update for tests.

5942bdd

Merge branch 'main' into feature/regular_expression_metafunction

0172e35

Signed-off-by: Max Sagebaum

Merge remote-tracking branch 'temp/ci-update-tests' into feature/regular_expression_metafunction

8dedcc7

Update for test results.

b99d10e

Merge remote-tracking branch 'origin/main' into feature/regular_expression_metafunction

788cf3d

MaxSagebaum force-pushed the feature/regular_expression_metafunction branch from a3776b5 to 788cf3d Compare July 15, 2024 06:33

Update for regression tests.

b089f02

hsutter added 6 commits July 19, 2024 12:21

Reran regressions on my box - whitespace changes only

9577a0b

Probably line-ends Plus an MSVC minor version update Committing this just-whitespace update to clear the diff list before I make any review changes/renames...

Merge string_util.h into cpp2util.h

2a8994f

Minus a couple of functions that aren't used And minor touchups, mainly int_to_string using more if-constexpr

Review pass through cpp2regex.h2

25b1a26

Up to line ~1600 Looking good, mainly formatting tweaks to follow the repo's style

Finish tweaking pass through cpp2regex.h2

de8348f

From line 1600 onward

Merge branch 'main' into feature/regular_expression_metafunction

ae9fa61

Signed-off-by: Herb Sutter

hsutter merged commit 254d73e into hsutter:main Jul 21, 2024
20 of 29 checks passed

jarzec mentioned this pull request Jul 21, 2024

CI Update tests after recent changes #1173

Merged
