Potential features based on oniguruma-to-es #18

Open
slevithan opened this issue Jan 3, 2025 · 4 comments

Comments


slevithan commented Jan 3, 2025

Context: oniguruma-to-es is an advanced Oniguruma to JavaScript transpiler that's written in JS. It was first released recently, and has quickly improved. It's used by Shiki's JS engine and supports more than 97% of the TM grammars provided with Shiki. (It handles more than 99.9% of the regexes in these grammars, but a single unsupported or invalid regex removes support for the whole grammar.) Some details are here about supporting the few remaining grammars, if you're interested.

Do you think there might be opportunities to enhance TmLanguage-Syntax-Highlighter using oniguruma-to-es? For example:

  • You could inform users when a grammar won't be supported by Shiki's JS engine.
  • You could show what a particular Oniguruma regex looks like when transpiled to JS (so people more familiar with JS regexes can understand where there are differences in meaning).
  • The error messages given by oniguruma-to-es for invalid Oniguruma patterns could potentially be helpful when writing/debugging grammars.
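A rough sketch of how the first and third ideas could fit together: since one unsupported regex removes support for the whole grammar, a checker only has to try each pattern and collect the failures. The helper name `findUnsupportedPatterns` and the stand-in `stubTranspile` are hypothetical and exist only so the sketch runs on its own; in real use you would presumably pass `toRegExp` from oniguruma-to-es as the `transpile` argument.

```javascript
// Hypothetical helper: try to transpile every pattern in a grammar and
// collect the ones that throw. `transpile` is injected so this sketch runs
// without the library; in practice it would be oniguruma-to-es's `toRegExp`.
function findUnsupportedPatterns(patterns, transpile) {
  const failures = [];
  patterns.forEach((pattern, index) => {
    try {
      transpile(pattern);
    } catch (err) {
      failures.push({
        index,
        pattern,
        message: err instanceof Error ? err.message : String(err),
      });
    }
  });
  return failures;
}

// Stand-in transpiler for demonstration only (NOT the library's behavior):
// it rejects anything that looks like a conditional.
function stubTranspile(pattern) {
  if (pattern.includes('(?(')) {
    throw new Error('conditionals are not supported by this stub');
  }
  return new RegExp(pattern);
}

const report = findUnsupportedPatterns(['a+', '(<)?foo(?(1)>)'], stubTranspile);
const grammarIsSupported = report.length === 0;
console.log(grammarIsSupported, report);
```

Each entry keeps the pattern's index and the error message, so an editor extension could attach the message to the offending regex even without a position inside the pattern.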

Happy to answer any questions. But feel free to close this without comment if you don't think it's a good fit.

Owner

RedCMD commented Jan 4, 2025

I've been keeping an eye on oniguruma-to-es for a while now
is a very cool project indeed

I could add a feature to show what the onig regexes would look like in JS
using a hover/button/command
and extend the error reporting

do the error messages give the position of the error?

are there any plans for JS to onig?
then I could add a convert js regex into onig paste option

have you tried parsing the grammars in this repo?
I know I like to use conditionals and absents :)

would there be support for other versions of oniguruma?
cause VSCode uses oniguruma 6.9.8 and Apple's TextMate 2.0 uses v5.9.6 iirc

do you currently support all characters in group names? (as long as the first character is _a-zA-Z)
eg. (?<name@%_0-9>b)\g<name@%_0-9> is valid onig
but \k<name@%_0-9> is not valid

Author

slevithan commented Jan 4, 2025

> I've been keeping an eye on oniguruma-to-es for a while now
> is a very cool project indeed

Thanks, glad to hear it. 😊

> I could add a feature to show what the onig regexes would look like in JS using a hover/button/command

I think that would be very cool.

For this, it might help to be aware of the avoidSubclass option. For example, you get the following results for the pattern .++:

With default options:

toDetails('.++')
/* →
{ pattern: '(?:(?=($E$[^\\n]+))\\1)',
  flags: 'v',
  options: {
    useEmulationGroups: true,
  },
}
*/

toRegExp('.++')
/* →
new EmulatedRegExp('(?:(?=($E$[^\\n]+))\\1)', 'v', {
  useEmulationGroups: true,
})
*/

With avoidSubclass:

toDetails('.++', {avoidSubclass: true})
/* →
{ pattern: '(?:(?=([^\\n]+))\\1)',
  flags: 'v',
}
*/

toRegExp('.++', {avoidSubclass: true})
/* →
/(?:(?=([^\n]+))\1)/v
*/

// Alternatively, even when not using `avoidSubclass` you can do...
toRegExp('.++').toString()
/* →
'/(?:(?=([^\\n]+))\\1)/v'
...or read the regexp's `.source` and `.flags`
*/

The latter values don't include the $E$ marker (or sometimes $N$E$, where N is an integer 1 or greater) used for injected "emulation groups". All of these results match exactly the same strings. The difference is only in the properties of match results. EmulatedRegExp does some fancy things to hide emulation groups from results (and in some cases to transfer captured values between subpattern results to match Oniguruma's handling).
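To make the point about match results concrete, here is a small sketch that uses only plain `RegExp` and the marker-free pattern from the example above. It shows that the emulation of `.++` introduces a capturing group that the Oniguruma pattern doesn't have, and that the lookahead-plus-backreference construct does not give back characters once matched. The trailing `c` added to the second pattern is only there for the demonstration.

```javascript
// The JS pattern from the example above, without the $E$ marker.
const emulated = /(?:(?=([^\n]+))\1)/;
const match = emulated.exec('abc');

// The source pattern `.++` has no capturing groups, but the raw JS match
// exposes the injected group as capture 1.
console.log(match[0]);     // 'abc'
console.log(match.length); // 2

// Possessive behavior: the emulated group never gives characters back,
// so a following `c` can't match; a plain greedy quantifier backtracks.
const possessiveThenC = /(?:(?=([^\n]+))\1)c/;
const greedyThenC = /[^\n]+c/;
console.log(possessiveThenC.test('abc')); // false
console.log(greedyThenC.test('abc'));     // true
```

This extra capture is the kind of detail that `EmulatedRegExp` is described as hiding from results.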

Note that if you pass values from toDetails as arguments to the EmulatedRegExp constructor (or optionally to RegExp if there is no options property on the returned object), you get the same result as from toRegExp. There are some additional details in the docs for avoidSubclass and in shikijs/shiki#878.
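A minimal sketch of that note, assuming the object shape shown in the `toDetails` examples above (`pattern`, `flags`, and an optional `options`). The helper name `regexFromDetails` is hypothetical, and the `EmulatedRegExp` constructor is passed in as an argument so the sketch runs without the library. The demo uses flag `u` rather than `v` only so that it also runs on older engines.

```javascript
// Hypothetical helper: build a regex from a toDetails-style object.
// Without `options`, a plain RegExp is enough; with `options`, the
// EmulatedRegExp constructor (injected here as `EmulatedCtor`) is needed.
function regexFromDetails(details, EmulatedCtor) {
  const {pattern, flags, options} = details;
  if (options === undefined) {
    return new RegExp(pattern, flags);
  }
  if (typeof EmulatedCtor !== 'function') {
    throw new TypeError(
      'details.options is present, so an EmulatedRegExp-like constructor is required'
    );
  }
  return new EmulatedCtor(pattern, flags, options);
}

// Shape of the `avoidSubclass` result from the example above
// (flag changed from 'v' to 'u' for this demo only).
const plain = regexFromDetails({pattern: '(?:(?=([^\\n]+))\\1)', flags: 'u'});
console.log(plain instanceof RegExp, plain.test('abc'));
```

With the library installed, the second argument would be its `EmulatedRegExp` export, as named in the comment above.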

Author

slevithan commented Jan 4, 2025

Reporting error positions

> [...] and extend the error reporting
> do the error messages give the position of the error?

No. That would be nice, and I'd welcome contributions that enable this, but it would be difficult: in some cases (subroutines are an example) the generated results are fairly scrambled compared to the input, and errors can come from the tokenizer, parser, transformer, or code generator. But if you only wanted to know whether it's a valid Oniguruma regex (minus features that oniguruma-to-es doesn't yet support when generating its Oniguruma AST), then you only need to worry about toOnigurumaAst, which only calls the tokenizer and parser. The tokenizer already includes a .raw property on tokens, so it would probably be easy to add a position property. Adding raw and position properties to the parser's AST output would presumably be significantly more work, but doable.
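One way the position idea could work, as a sketch only: if every token carries the exact source text it came from in `.raw`, and the tokens are contiguous and cover the whole pattern (an assumption here, not something confirmed about the tokenizer), then start and end offsets fall out of a running sum of `raw` lengths. The function name `withPositions`, the `start`/`end` property names, and the token `type` values below are all hypothetical.

```javascript
// Hypothetical helper: derive offsets from each token's `raw` text.
// Assumes tokens are in source order and their `raw` values concatenate
// back to the original pattern with nothing skipped.
function withPositions(tokens) {
  let offset = 0;
  return tokens.map((token) => {
    const start = offset;
    offset += token.raw.length;
    return {...token, start, end: offset};
  });
}

// Made-up tokens for the pattern `a\d++` (the `type` names are invented).
const tokens = [
  {type: 'Character', raw: 'a'},
  {type: 'CharacterSet', raw: '\\d'},
  {type: 'Quantifier', raw: '++'},
];
const positioned = withPositions(tokens);
console.log(positioned.map((t) => `${t.raw} @ ${t.start}-${t.end}`));
```

An editor integration could then map `start` and `end` onto a range in the grammar file to underline the token an error refers to.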

That said, maybe the errors are still useful without a position? In general, oniguruma-to-es's errors are more specific and understandable than the errors that the actual Oniguruma gives, and further improvements to error specificity/messages are certainly possible/welcome.

JS RegExp → Oniguruma

> are there any plans for JS to onig? then I could add a convert js regex into onig paste option

Not currently. JS RegExp to Oniguruma would be a cool feature, but its use cases are more limited and aren't ones I personally have.

However, I would welcome it if you wanted to collaborate on this. Compared to going from an Oniguruma AST to a JS RegExp, going from a JS RegExp AST to an Oniguruma pattern would be dramatically simpler. So most of the complex work would be in building a JS RegExp AST. But then, there are of course existing JS RegExp AST builders. The best / most up to date one is probably eslint-community/regexpp. If you used that, going from JS RegExp to Oniguruma wouldn't need to be a huge project like oniguruma-to-es, at least for someone (like yourself) with preexisting in-depth knowledge of Oniguruma and JS RegExp syntax/behavior.

Aside: Eventually I'd love to create a lightweight AST builder for Regex+ syntax. Regex+ syntax is a strict superset of JS RegExp syntax with flag v, so by including support for Regex+'s syntax extensions via options in the parser (or some kind of a plugin system), you'd get a JS-RegExp-with-v parser for free. And it could be further simplified by only supporting RegExp syntax from the latest ES version.

Support for absence and conditionals

> have you tried parsing the grammars in this repo?

No, but I generally know the currently-missing features. They're documented in oniguruma-to-es's readme, or at least hinted at (e.g. for \G it mentions that common uses are supported, and gives some examples of supported cases). Of course, I'd love to learn about anything I'm missing.

> I know I like to use conditionals and absents :)

Absent repeaters and absent expressions can be emulated, and I plan to support them in future versions. See the tracking issue here: slevithan/oniguruma-to-es#13 😊

Some conditionals can be emulated. E.g. it would be pretty straightforward to change a basic case like (<)?foo(?(1)>) to (?:(<)foo>|foo). But this would be more complicated or break down in some other cases, sometimes for quite nuanced reasons. If only JS didn't make backreferences to nonparticipating groups match the empty string, there would be additional strategies for emulating conditionals (something I wrote about back in 2007). 😞 I don't currently plan to add support for conditionals, but contributions that add support for basic cases would be welcome.
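The basic rewrite and the backreference behavior mentioned above can both be checked with plain `RegExp`. This sketch covers only the one simple case quoted in the text and is not a general conversion.

```javascript
// Oniguruma: (<)?foo(?(1)>)   rewritten for JS as: (?:(<)foo>|foo)
const rewritten = /(?:(<)foo>|foo)/;

const withBrackets = rewritten.exec('<foo>');
const bare = rewritten.exec('foo');
const unclosed = rewritten.exec('<foo');

console.log(withBrackets[0], withBrackets[1]); // '<foo>' '<'
console.log(bare[0], bare[1]);                 // 'foo' undefined
console.log(unclosed[0], unclosed.index);      // 'foo' 1 (the '<' is skipped)

// In JS, a backreference to a group that did not participate in the match
// matches the empty string, so `\1` cannot be used as a "did group 1 match?"
// test.
const backrefToUnsetGroup = /(a)?\1b/;
console.log(backrefToUnsetGroup.test('b')); // true
```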

Aside: Oniguruma edge cases make the (?(…)…) structure relatively complex to deal with comprehensively, since the first … (the condition) can be any arbitrary regex, and the second … (the body) can be empty (turning the conditional into a backreference checker) or include any number of top-level | (other regex flavors restrict it to one | for then/else).

Emulating older versions of Oniguruma

> would there be support for other versions of oniguruma? cause VSCode uses oniguruma 6.9.8 and Apple's TextMate 2.0 uses v5.9.6 iirc

Supporting older versions is not currently planned but is possible. I'd welcome contributions that add this in a maintainable way.

Invalid JS identifiers as group names

> do you currently support all characters in group names? (as long as the first character is _a-zA-Z)
> eg. (?<name@%_0-9>b)\g<name@%_0-9> is valid onig [...]
> but \k<name@%_0-9> is not valid

oniguruma-to-es internally distinguishes between names that are valid in Oniguruma vs JS. Currently, it restricts to group/subroutine/backreference names that are valid in both Oniguruma and JS, which is noted in the readme under Supported features → Groups → Named capturing.

Supporting group/subroutine/backreference names that are invalid JS identifiers would require:

  1. Automatically changing or removing the names during transpilation.
  2. Special handling in EmulatedRegExp to add the original name to match results.
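The two steps above could be sketched roughly as follows. None of this is how oniguruma-to-es works today; the function names (`planGroupRenames`, `restoreGroupNames`) and the `_gN` naming scheme are hypothetical. The validity check follows the JS identifier rules for group names, ignoring Unicode escape sequences for brevity.

```javascript
// JS group names must be identifiers: ID_Start, `$` or `_` first, then
// ID_Continue, `$`, ZWNJ or ZWJ. (Escape sequences in names are ignored here.)
const JS_GROUP_NAME = /^[$_\p{ID_Start}][$\u200C\u200D\p{ID_Continue}]*$/u;

// Step 1 (hypothetical): choose a JS-safe replacement for each invalid name,
// avoiding collisions with names that are already valid.
function planGroupRenames(names) {
  const taken = new Set(names.filter((name) => JS_GROUP_NAME.test(name)));
  const renames = new Map();
  let counter = 0;
  for (const name of names) {
    if (JS_GROUP_NAME.test(name) || renames.has(name)) continue;
    let candidate;
    do {
      candidate = `_g${++counter}`;
    } while (taken.has(candidate));
    taken.add(candidate);
    renames.set(name, candidate);
  }
  return renames;
}

// Step 2 (hypothetical): put the original names back on a match's groups.
function restoreGroupNames(groups, renames) {
  const restored = {...groups};
  for (const [original, safe] of renames) {
    if (safe in restored) {
      restored[original] = restored[safe];
      delete restored[safe];
    }
  }
  return restored;
}

const renames = planGroupRenames(['name@%_0-9']);
const safeName = renames.get('name@%_0-9');
const re = new RegExp(`(?<${safeName}>b)\\k<${safeName}>`);
const groups = restoreGroupNames(re.exec('bb').groups, renames);
console.log(groups['name@%_0-9']);
```

A real implementation would also have to rewrite subroutine and backreference names in the pattern itself, which this sketch sidesteps by building the regex by hand.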

I'm not currently planning to support this since I consider it low priority (and I'd encourage TM grammar authors that use invalid JS identifiers as group names to update their regexes), but I'd welcome contributions that add support for this.

Author

slevithan commented Jan 4, 2025

Aside: It's obvious you have extremely in-depth and hard-won knowledge of Oniguruma's nuances and complexity. Even if you don't end up using oniguruma-to-es in this library, if you're ever interested in playing with it, I'd find your feedback extremely valuable. 😊 The demo page hopefully makes that easier.


Edit: Thanks for all the fantastic and detailed issues you've filed!! They've now all been addressed, with fixes published in v1.0.0.
