Feature/regular expression metafunction #904

Merged


MaxSagebaum
Contributor

@MaxSagebaum MaxSagebaum commented Dec 21, 2023

I will keep this overview up to date so that the status of the implementation is easy to grasp.

Example file: example.cpp2

example: @regex type = {
  regex := "ab*bd";
}
main: (args) = {
    r := example().regex.search("abbbbbdfoo");
    std::cout << "got: (r.group(0))$" << std::endl;
}

Current status and planned work

Modifiers

 - [x] i                Do case-insensitive pattern matching. For example, "A" will match "a" under /i.
 - [x] m                Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.
 - [x] s                Treat the string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
 - [x] x and xx         Extend your pattern's legibility by permitting whitespace and comments. Details in "/x and /xx"
 - [x] n                Prevent the grouping metacharacters () from capturing. This modifier, new in 5.22, will stop $1, $2, etc... from being filled in.
 - [ ] c                keep the current position during repeated matching

Escape sequences (Complete)

 - [x] \t          tab                   (HT, TAB)
 - [x] \n          newline               (LF, NL)
 - [x] \r          return                (CR)
 - [x] \f          form feed             (FF)
 - [x] \a          alarm (bell)          (BEL)
 - [x] \e          escape (think troff)  (ESC)
 - [x] \x{}, \x00  character whose ordinal is the given hexadecimal number
 - [x] \o{}, \000  character whose ordinal is the given octal number

Quantifiers (Complete)

 - [x] *           Match 0 or more times
 - [x] +           Match 1 or more times
 - [x] ?           Match 1 or 0 times
 - [x] {n}         Match exactly n times
 - [x] {n,}        Match at least n times
 - [x] {,n}        Match at most n times
 - [x] {n,m}       Match at least n but not more than m times
 - [x] *?        Match 0 or more times, not greedily
 - [x] +?        Match 1 or more times, not greedily
 - [x] ??        Match 0 or 1 time, not greedily
 - [x] {n}?      Match exactly n times, not greedily (redundant)
 - [x] {n,}?     Match at least n times, not greedily
 - [x] {,n}?     Match at most n times, not greedily
 - [x] {n,m}?    Match at least n but not more than m times, not greedily
 - [x] *+     Match 0 or more times and give nothing back
 - [x] ++     Match 1 or more times and give nothing back
 - [x] ?+     Match 0 or 1 time and give nothing back
 - [x] {n}+   Match exactly n times and give nothing back (redundant)
 - [x] {n,}+  Match at least n times and give nothing back
 - [x] {,n}+  Match at most n times and give nothing back
 - [x] {n,m}+ Match at least n but not more than m times and give nothing back

Character Classes and other Special Escapes (Complete)

 - [x] [...]     [1]  Match a character according to the rules of the
                    bracketed character class defined by the "...".
                    Example: [a-z] matches "a" or "b" or "c" ... or "z"
 - [x] [[:...:]] [2]  Match a character according to the rules of the POSIX
                    character class "..." within the outer bracketed
                    character class.  Example: [[:upper:]] matches any
                    uppercase character.
 - [x] \g1       [5]  Backreference to a specific or previous group,
 - [x] \g{-1}    [5]  The number may be negative indicating a relative
                  previous group and may optionally be wrapped in
                  curly brackets for safer parsing.
 - [x] \g{name}  [5]  Named backreference
 - [x] \k<name>  [5]  Named backreference
 - [x] \k'name'  [5]  Named backreference
 - [x] \k{name}  [5]  Named backreference
 - [x] \w        [3]  Match a "word" character (alphanumeric plus "_", plus
                    other connector punctuation chars plus Unicode
                    marks)
 - [x] \W        [3]  Match a non-"word" character
 - [x] \s        [3]  Match a whitespace character
 - [x] \S        [3]  Match a non-whitespace character
 - [x] \d        [3]  Match a decimal digit character
 - [x] \D        [3]  Match a non-digit character
 - [x] \v        [3]  Vertical whitespace
 - [x] \V        [3]  Not vertical whitespace
 - [x] \h        [3]  Horizontal whitespace
 - [x] \H        [3]  Not horizontal whitespace
 - [x] \1        [5]  Backreference to a specific capture group or buffer.
                    '1' may actually be any positive integer.
 - [x] \N        [7]  Any character but \n.  Not affected by /s modifier
 - [x] \K        [6]  Keep the stuff left of the \K, don't include it in $&

Assertions

 - [x] \b     Match a \w\W or \W\w boundary
 - [x] \B     Match except at a \w\W or \W\w boundary
 - [x] \A     Match only at beginning of string
 - [x] \Z     Match only at end of string, or before newline at the end
 - [x] \z     Match only at end of string
 - [ ] \G     Match only at pos() (e.g. at the end-of-match position
          of prior m//g)

Capture groups (Complete)

 - [x] (...)

Quoting metacharacters (Complete)

 - [x] For ^.[]$()*{}?+|\

Extended Patterns

 - [x] (?<NAME>pattern)            Named capture group
 - [x] (?#text)                    Comments
 - [x] (?adlupimnsx-imnsx)         Modification for surrounding context
 - [x] (?^alupimnsx)               Modification for surrounding context
 - [x] (?:pattern)                 Clustering, does not generate a group index.
 - [x] (?adluimnsx-imnsx:pattern)  Clustering, does not generate a group index and modifications for the cluster.
 - [x] (?^aluimnsx:pattern)        Clustering, does not generate a group index and modifications for the cluster.
 - [x] (?|pattern)                 Branch reset
 - [x] (?'NAME'pattern)            Named capture group
 - [ ] (?(condition)yes-pattern|no-pattern)  Conditional patterns.
 - [ ] (?(condition)yes-pattern)             Conditional patterns.
 - [ ] (?>pattern)                 Atomic patterns. (Disable backtrack.)
 - [ ] (*atomic:pattern)           Atomic patterns. (Disable backtrack.)

Lookaround Assertions

 - [x] (?=pattern)                     Positive look ahead.
 - [x] (*pla:pattern)                  Positive look ahead.
 - [x] (*positive_lookahead:pattern)   Positive look ahead.
 - [x] (?!pattern)                     Negative look ahead.
 - [x] (*nla:pattern)                  Negative look ahead.
 - [x] (*negative_lookahead:pattern)   Negative look ahead.
 - [ ] (?<=pattern)                    Positive look behind.
 - [ ] (*plb:pattern)                  Positive look behind.
 - [ ] (*positive_lookbehind:pattern)  Positive look behind.
 - [ ] (?<!pattern)                    Negative look behind.
 - [ ] (*nlb:pattern)                  Negative look behind.
 - [ ] (*negative_lookbehind:pattern)  Negative look behind.

Special Backtracking Control Verbs

 - [ ] (*PRUNE) (*PRUNE:NAME)   No backtracking over this point.
 - [ ] (*SKIP) (*SKIP:NAME)     Start next search here.
 - [ ] (*MARK:NAME) (*:NAME)    Place a named mark.
 - [ ] (*THEN) (*THEN:NAME)     Like PRUNE.
 - [ ] (*COMMIT) (*COMMIT:arg)  Stop searching.
 - [ ] (*FAIL) (*F) (*FAIL:arg) Fail the pattern/branch.
 - [ ] (*ACCEPT) (*ACCEPT:arg)  Accept the pattern/subpattern.

Not planned (Mainly because of Unicode or Perl specifics)

Modifiers

 - [ ] p                Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} are available for use after matching.
 - [ ] a, d, l, and u   These modifiers, all new in 5.14, affect which character-set rules (Unicode, etc.) are used, as described below in "Character set modifiers".
 - [ ] g                globally match the pattern repeatedly in the string
 - [ ] e                evaluate the right-hand side as an expression
 - [ ] ee               evaluate the right side as a string then eval the result
 - [ ] o                pretend to optimize your code, but actually introduce bugs
 - [ ] r                perform non-destructive substitution and return the new value

Escape sequences

 - [ ] \cK         control char          (example: VT)
 - [ ] \N{name}    named Unicode character or character sequence
 - [ ] \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
 - [ ] \l          lowercase next char (think vi)
 - [ ] \u          uppercase next char (think vi)
 - [ ] \L          lowercase until \E (think vi)
 - [ ] \U          uppercase until \E (think vi)
 - [ ] \Q          quote (disable) pattern metacharacters until \E
 - [ ] \E          end either case modification or quoted section, think vi

Character Classes and other Special Escapes

 - [ ]  (?[...])  [8]  Extended bracketed character class
 - [ ] \pP       [3]  Match P, named property.  Use \p{Prop} for longer names
 - [ ] \PP       [3]  Match non-P
 - [ ] \X        [4]  Match Unicode "eXtended grapheme cluster"
 - [ ] \R        [4]  Linebreak

Assertions

 - [ ] \b{}   Match at Unicode boundary of specified type
 - [ ] \B{}   Match where corresponding \b{} doesn't match

Extended Patterns

 - [ ] (?{ code })                 Perl code execution.
 - [ ] (*{ code })                 Perl code execution.
 - [ ] (??{ code })                Perl code execution.
 - [ ] (?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)       Recursive subpattern.
 - [ ] (?&NAME)                   Recursive subpattern.

Script runs

 - [ ] (*script_run:pattern)         All chars in pattern need to be of the same script.
 - [ ] (*sr:pattern)                 All chars in pattern need to be of the same script.
 - [ ] (*atomic_script_run:pattern)  Without backtracking.
 - [ ] (*asr:pattern)                Without backtracking.

@MaxSagebaum
Contributor Author

I am aiming to implement the POSIX extended specification with a few extras from Perl. All in all, I am sticking to the Perl interpretation of regular expressions.

@MaxSagebaum
Contributor Author

MaxSagebaum commented Dec 28, 2023

The feature set is now complete as stated in https://en.wikipedia.org/wiki/Regular_expression. I grabbed the test suite from https://wiki.haskell.org/Regex_Posix (https://hackage.haskell.org/package/regex-posix-unittest). I am currently working my way through the tests and fixing the corner cases. In particular, the greedy nature of * and the backtracking needed when it has grabbed too much might require a rework of the matching logic.

@MaxSagebaum
Contributor Author

I have now finished the basic implementation, and most of the test suite passes.

Some notable differences with respect to POSIX ERE:

  • Alternatives are not greedy: a|ab matches a and not ab in ab.
  • The match is case sensitive.

I am now looking into performance tests and I will clean up the code.

@MaxSagebaum
Contributor Author

@hsutter The regression tests are clean now except for one test where the source line of reflect.h2 has changed. I asked jarzec for help on this.

Otherwise the branch is ready for a review.

@hsutter
Owner

hsutter commented Jul 10, 2024

Thanks! I just pushed commit 0e1fdd5 which adds support for concatenated string literals that don't use interpolation, so you should now be able to write "\x62" "blub".

@MaxSagebaum
Contributor Author

Ok, thanks. I changed it and it works now. Thanks for implementing it.

@MaxSagebaum MaxSagebaum force-pushed the feature/regular_expression_metafunction branch from a3776b5 to 788cf3d Compare July 15, 2024 06:33
@MaxSagebaum
Contributor Author

Hurray. We are all green now.

@hsutter
Owner

hsutter commented Jul 15, 2024

Thanks! I'll set aside a block of time to review, planning to start later this week.

If you'll pause making commits to this branch by, say, Wednesday, then I can adjust things via commits to this branch without colliding? (I find fixing merge conflicts difficult, so I avoid making concurrent revisions in the same branch, especially since I plan to rename/move a file or two.)

@MaxSagebaum
Contributor Author

Sure, have fun.

hsutter added 6 commits July 19, 2024 12:21
Probably line-ends

Plus an MSVC minor version update

Committing this just-whitespace update to clear the diff list before I make any review changes/renames...
In this project I'm trying to build *.h2 files in the same directory as the *.h they generate, and keep the same name

In /include, "cpp2util.h" is named that way because it really is the Cpp2 run-time library... For regex, we could name it regex.h(2) or cpp2regex.h(2)... the argument for using "cpp2" is because it really does include additional run-time support for what will now be one of the Cpp2-built-in metafunctions... anyway we can always revisit that in the future...
Minus a couple of functions that aren't used

And minor touchups, mainly int_to_string using more if-constexpr
Up to line ~1600

Looking good, mainly formatting tweaks to follow the repo's style
@hsutter
Owner

hsutter commented Jul 21, 2024

Looks very good, I think this great work is ready to move to main. Thanks again, @MaxSagebaum !

@hsutter hsutter merged commit 254d73e into hsutter:main Jul 21, 2024
20 of 29 checks passed
@MaxSagebaum
Contributor Author

Thanks. I am glad you found it ok. Now I know that curly braces have to be on the next line. Sorry for the formatting.

I will add some documentation next week in another pull request, so that users know how to apply this metafunction.

@hsutter
Owner

hsutter commented Jul 22, 2024

Perfect, thanks!
