Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: add functionality to compile globs #85

Open
a-kbd opened this issue Mar 1, 2023 · 7 comments · May be fixed by #99
Open

Feature request: add functionality to compile globs #85

a-kbd opened this issue Mar 1, 2023 · 7 comments · May be fixed by #99

Comments

@a-kbd
Copy link

a-kbd commented Mar 1, 2023

It would be great to have functionality to compile globs similar to https://github.com/gobwas/glob (which is seemingly no longer actively maintained and has bugs causing issues)

This would have performance benefits as one would not be parsing the pattern each time

@lukemassa
Copy link
Contributor

lukemassa commented Jan 23, 2025

@bmatcuk I sketched out what this might look like here: #99, if there's interest I can start working through some implementation. Very open to feedback and/or assistance!

@bmatcuk
Copy link
Owner

bmatcuk commented Jan 24, 2025

Sorry I never replied to this - I spent some time thinking about it, but never came back to it.

I'm not overly convinced "compiling" would improve performance - in fact, it may have a negative impact and would certainly consume more memory.

I see two places where "compiled" data structures typically exist: in interpreted languages where the data structure is turned into dynamically evaluated code, and pre-processing of regular expressions. Go can't do the former, and globs aren't as complicated as the latter, so I'm not sure they'd benefit from it.

In terms of a "glob as a data structure", the only glob operation that is kind of expensive is alternatives (ie, {A,B,C}) because the code has to look ahead to find the end of each alternative, the end of the whole alternative, and whether or not anything follows the alternative. Beyond that, glob evaluation is very linear. There's never any "looking behind", and, except for alternatives and checking the next character after a star or doublestar, there's never any looking ahead either. Any "compiled" data structure would basically just be the exact same linear structure, but take up more memory. I suppose there may be the possibility of a very tiny performance increase for a long string of non-meta characters, given that a simple string comparison operation may be faster than comparing character-by-character, depending on compiler optimizations and processor shenanigans - I seem to remember reading an article about that a little bit ago, but it was talking about optimizing string comparisons in C so I'm not sure if it's relevant to Go.

Anyway, I digress... there's barely any "parsing" of globs at all. Globs are mostly evaluated character-by-character, with small sub-loops for character classes and alternatives. So a "compiled" version wouldn't be substantially different from an array of characters.

Contrast this with regular expressions which have much more complicated language constructs - parsing is not so linear. Repetition specifiers come to mind, as that means the regular expression parser must always look ahead for repetition specifiers. There's also more state to keep track of: capture groups, back-references, non-capture groups that change settings (such as case-sensitivity), etc. Every time a regular expression needs to backtrack, there's a lot to re-parse.

I dunno... I'd have to think about it some more - maybe I'll study some regex compilation to see if there are any ideas that would benefit globs...

@lukemassa
Copy link
Contributor

lukemassa commented Jan 24, 2025

Hm, very interesting! I hadn't given too much thought to how this kind of parsing works and I can definitely see there's a big difference from the "Programs" that regexes are compiled into to the more or less linear "matching" that's needed for globs.

I'll admit my personal interest in this feature is less from a performance perspective and more related to the API of the package itself. I recently added a setting in user configuration, that I first implemented as a regexp, then later as a glob using this package.

Regexp: runatlantis/atlantis#5254
Glob: runatlantis/atlantis#5267

(don't feel the need to understand what the code above is doing, suffice it to say the user specified a string that is used to match against other strings later).

We decided to go with glob because it fit the use case better, but one thing I did like about regexp was I was able to "compile" when loading the configuration and store the regex itself. Later in the "business logic" all we had to do was call Match() against it. For globs, we could Validate the input, but had to store the config as a string. Then later when we call Match() we would have to handle the error, even though it really cannot "actually" error: https://github.com/runatlantis/atlantis/pull/5267/files#diff-7311f0f6bed110419f722206cae72e578e8957a2ee34aee6470dd32695ae1ec3R121

So given that, ironically my naive implementation in #99 is actually sufficient for my needs, though calling what I wrote Compile() is a bit of a stretch :)

@bmatcuk
Copy link
Owner

bmatcuk commented Jan 24, 2025

You could always write a little helper function in your code to call Match but ignore the "error", or panic as you have in your PR.

@lukemassa
Copy link
Contributor

lukemassa commented Jan 24, 2025

Yeah that's true. The only downside would be that if there was a bug such that ValidatePattern and Match disagreed about what a valid pattern is, my code would stack trace, whereas the API of Pattern.Match() could be such that that could not occur.

Your earlier comment has definitely piqued my interest about how this kind of parsing works, I might go ahead and see if I can implement this as purely an academic exercise for myself. If I can get it to be at least as performant of Match() I'll have you take a look, but I'm under no expectations that that means it will be accepted. There are other considerations like broadening the API surface area, as well as maintaining consistency between (what I assume will be at least in part) separate implementations that we'll have to weigh. I mostly think it would be a fun exercise :)

@bmatcuk
Copy link
Owner

bmatcuk commented Jan 24, 2025

Right. I will add that validation during Match is essentially "free", if there's any concern about performance. There's no pre-processing validation step - errors are detected while the pattern runs. The only exception to this is that if the pattern is longer than the string we're trying to match against, then validation runs on the remainder of the pattern so that we don't silently ignore an error that wasn't reached during execution. For example, a pattern of abc[ matched against a string of abc would run validation on the pattern [ since the pattern was longer than the string.

Also, hopefully all the tests catch bugs in Validate vs Match behavior, but I suppose there's no guarantee that every corner case has been tested =(

@lukemassa
Copy link
Contributor

lukemassa commented Jan 24, 2025

Ah, right, that's the specific thing that MatchUnvalidated() avoids, right? Also I just realized I can use that function and it precisely solves my problem, I updated #99.

Given that, that PR is actually a working solution to this problem, right?

  • It allows for an mode of operation where you "upfront" some of the cost of validation
  • As confirmed by the tests, it has the same behavior as Match()
  • It allows for an API where Pattern.Match() cannot error
  • The Match itself has strictly better performance (due to the early return)

From here it's a question of "is it enough to justify the additional API surface area", which I'll leave up to you :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants