Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document that for "usual" regex behavior multiline is required #232

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
37 changes: 26 additions & 11 deletions src/Parsing/String.purs
Original file line number Diff line number Diff line change
Expand Up @@ -202,8 +202,11 @@ match p = do
-- | error message.
-- |
-- | The returned parser will try to match the regular expression pattern once,
-- | starting at the current parser position. On success, it will return
-- | the matched substring.
-- | starting at the current parser position. Note that this implies that an
-- | expression starting as `^…` (i.e. with the beginning of line) will match the
-- | current position even in absence of a preceding newline.
-- |
-- | On success, the parser will return the matched substring.
-- |
-- | If the RegExp `String` is constant then we can assume that compilation will
-- | always succeed and `unsafeCrashWith` if it doesn’t. If we dynamically
Expand Down Expand Up @@ -231,14 +234,24 @@ match p = do
-- |
-- | #### Example
-- |
-- | This example shows how to compile and run the `xMany` parser which will
-- | capture the regular expression pattern `x*`.
-- | Compiling and running different regex parsers:
-- |
-- | ```purescript
-- | case regex "x*" noFlags of
-- | Left compileError -> unsafeCrashWith $ "xMany failed to compile: " <> compileError
-- | Right xMany -> runParser "xxxZ" do
-- | xMany
-- | example re flags text =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first example should show the simplest basic usage. Adding this example helper function is an extra indirection.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the code has always been there, I just moved it to a separate function. I can move the function after example if you want.

-- | case regex re flags of
-- | Left compileError -> unsafeCrashWith $ "xMany failed to compile: " <> compileError
-- | Right xMany -> runParser text do
-- | xMany
-- |
-- | -- Capturing a string per `x*` regex.
Copy link
Member

@jamesdbrock jamesdbrock Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For separate examples I think would should have separate markdown code blocks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be a lot of duplicate code for a reader to dig through?

-- | exampleXMany = example "x*" noFlags "xxxZ"
-- |
-- | -- Capturing everything till end of line.
-- | exampleCharsTillEol = example ".*$" multiline "line1\nline2"
-- |
-- | -- Capturing everything till end of string. Note the distinction with
-- | -- `exampleCharsTillEol`.
-- | exampleCharsTillEos = example ".*$" dotAll "line1\nline2"
-- | ```
-- |
-- | #### Flags
Expand All @@ -249,9 +262,11 @@ match p = do
-- | regex "x*" (dotAll <> ignoreCase)
-- | ```
-- |
-- | The `dotAll`, `unicode`, and `ignoreCase` flags might make sense for
-- | a `regex` parser. The other flags will
-- | probably cause surprising behavior and you should avoid them.
-- | The `dotAll`, `multiline`, `unicode`, and `ignoreCase` flags might make
-- | sense for a `regex` parser. In fact, per JS RegExp semantics matching a
Copy link
Member

@jamesdbrock jamesdbrock Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that before we encourage people to use the multiline flag until we should add a bunch of multiline test cases to the test suite to be sure that multiline parsing works the way that we think it does.

Copy link
Author

@Hi-Angel Hi-Angel Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that makes whole PR a moot point. multiline is the behavior users would expect from regular expressions (as explained in the commit message), which is why after stumbling upon odd behavior in the parser and finding out that the usual behavior of regex isn't the one that noFlags gives, I went to document how it should work. Now you're pushing against. I don't get that, but you're the maintainer, so ok.

Copy link
Author

@Hi-Angel Hi-Angel Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

multiline is the behavior users would expect from regular expressions

to add to that: besides the explanation referred, there's also a point that explicitly matching newline characters is often frowned upon. This is because there exist at least 3 different styles of newlines. So the usual advice is not to match a \n for example, but to match a $ instead, which again implies multiline.

Copy link
Member

@garyb garyb Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think personally I just view this parser as serving a different purpose than for what enabling multiline would give it. I view it as similar to satisfy, it's just there to define a range or pattern of acceptable characters. If I wanted to be able to skip ahead over an arbitrary number of lines before finding something I'd be doing that explicitly using other parser combinators.

It's already a compromised regex interface too - you can't really use it the way you normally would since the groups in the pattern are inaccessible.

(I'm not saying I'm right in this opinion, just giving context for why I wouldn't miss multiline at all).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think personally I just view this parser as serving a different purpose than for what enabling multiline would give it. I view it as similar to satisfy, it's just there to define a range or pattern of acceptable characters. If I wanted to be able to skip ahead over an arbitrary number of lines before finding something I'd be doing that explicitly using other parser combinators.

FWIW, the current documentation encourages to prefer regex over other parsers. Quoting the relevant block:

This parser may be useful for quickly consuming a large section of the input String, because in a JavaScript runtime environment a compiled RegExp is a lot faster than a monadic parser built from parsing primitives.


It's already a compromised regex interface too - you can't really use it the way you normally would since the groups in the pattern are inaccessible.

Fair, although till now I haven't even noticed lack of groups, because the "combinators style" implies that when there's a pattern one wants to retrieve, you'd just consume it separately.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that makes whole PR a moot point. multiline is the behavior users would expect from regular expressions, which is why after stumbling upon odd behavior in the parser and finding out that the usual behavior of regex isn't the one that noFlags gives, I went to document how it should work.

Do we even understand how it works though? I checked @garyb ’s case and confirmed that

  runParser "some\nvarious\nlines" do
    m <- fromRight' unsafeCoerce $ regex "various$" multiline
    r <- rest
    pure $ Tuple m r

gives the result

(Right (Tuple "various" "rious\nlines"))

which is surprising and wrong.

So I think that the current advice in the docs

image

is pretty good advice until we decide what kind of behavior the multiline flag should have, fix it so that it behaves that way, and test it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for calling attention to this, though. It would be nice if the multiline flag just worked in an intuitively correct way, but it currently doesn't.

Copy link
Member

@garyb garyb Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just plain broken with multiline currently. It's advancing consumed by 7 for the "various" match, but should advance it an additional 5 as that's the position that "various" starts (I made sure to vary the line lengths for the example to reinforce how weird the behaviour is and have the resulting state start mid-line, as tried it with "foo\nbar\nbaz" at first and almost ended up confusing myself 😄)

It is fixable if we want to support it though, we'd have to do something like perform a search to find the offset of the match and include that when updating the consumed position of the parser. Or most optimally, offer another function in purescript-strings that can report the offset along with the match, since the info is there in the returned object in the underlying JS, and then use that in the implementation here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made an issue #233

-- | single line boundary in a multiline string requires passing `multiline`.
-- |
-- | Other flags will probably cause surprising behavior and should be avoided.
-- |
-- | [*MDN Advanced searching with flags*](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#advanced_searching_with_flags)
regex :: forall m. String -> RegexFlags -> Either String (ParserT String m String)
Expand Down