Highlight (some) regular expressions using another grammar #11

Open
sogaiu opened this issue May 29, 2023 · 15 comments · May be fixed by #84
Comments

@sogaiu

sogaiu commented May 29, 2023

I saw the following bit in the emacs-devel archives:

some files may consist of several parts requiring different tree-sitter
grammars. For example, a JavaScript file may have its documentation
written with jsdoc: JavaScript and jsdoc have a tree-sitter grammar
each.

Is there a way to use a tree-sitter grammar in parts of the file and
another one in other parts? There could be a main grammar and secondary
grammars would be activated on some kinds of nodes of the main one.

Yes, it should be possible, AFAIU. See the node "Multiple Languages"
in the ELisp manual, I believe it explains how to do what you want.

As an idea for "somewhere down the line", perhaps it would be interesting to consider the following...

Since tree-sitter-clojure can recognize regex literals, maybe one could apply an appropriate regular expression grammar to highlight the portions within the double quotes.

I don't know how close this grammar is to Clojure's flavor of regex, but maybe it, or some appropriate modification to it (or something that inherits from it), might be used for the task.

For reference, the part of the manual being referred to in the quote above can be seen in .texi form here. I didn't manage to find an HTML version. If you've got a recent enough Emacs from the emacs-29 branch, the info may be viewable from within Emacs. Worked for me anyway...


Ah sorry. Maybe I should have made this in the Discussions area?

@dannyfreeman
Contributor

Ah sorry. Maybe I should have made this in the Discussions area?

No, an issue is fine. I don't even get notifications from discussions lol.

This is a good idea. Clojure uses Java-flavored regular expressions. I'm not sure how much they differ from that grammar. If the dialects differ enough, it might be worth forking it and calling the fork tree-sitter-java-regex.

@dannyfreeman dannyfreeman added the enhancement Enhancement to existing functionality label May 29, 2023
@dannyfreeman dannyfreeman self-assigned this May 29, 2023
@sogaiu
Author

sogaiu commented May 29, 2023

I don't have the various flavors loaded into my head lately [1].

If I had to guess without looking too closely, I think this is likely to be some JavaScript flavor (or subset of one).

I also don't know / recall whether the various Clojure dialects all support the same regex syntax.

Perhaps this might come in handy eventually.


[1] Mostly working with PEGs in another language ;)

@sogaiu
Author

sogaiu commented Jun 20, 2023

Came across this content among Lapce's files:

((regex_lit) @injection.content
 (#set! injection.language "regex"))

dannyfreeman added a commit that referenced this issue Aug 24, 2023
This grammar is bundled in NixOS by default and seems good enough for
Java regular expressions (the grammar probably supports more features
than Java, idk).

Should address issue #11
dannyfreeman added a commit that referenced this issue Aug 24, 2023
This grammar is bundled in NixOS by default and seems good enough for
Java regular expressions. It is also maintained under the tree-sitter
GitHub org, so it is "official".

In order to properly identify the #" and closing " characters we have to
parse them with the Clojure grammar (in case the regex grammar is not
available) and again with the regex grammar as part of the actual
pattern. This could be avoided if either the Clojure grammar captured a
node for the inner contents of the regex literal, or the
treesit-range-settings supported some kind of offset argument like the
Neovim tree-sitter mechanisms do.

Should address issue #11
@dannyfreeman
Contributor

@sogaiu check this out 855cddd

Seems useful for other languages as well. Maybe even belongs in emacs core.

@sogaiu
Author

sogaiu commented Aug 25, 2023

Thanks for the heads up!

Hope to take a look soon.

@sogaiu
Author

sogaiu commented Aug 25, 2023

Ok, I gave it a try.

I see what you mean about capturing #" and ":

clojure-ts-mode-with-regex

@sogaiu
Author

sogaiu commented Aug 25, 2023

On a side note, maybe it's worth requesting that tree-sitter-regex get added to tree-sitter-module?

dannyfreeman added a commit that referenced this issue Aug 27, 2023
This grammar is bundled in NixOS by default and seems good enough for
Java regular expressions. It is also maintained under the tree-sitter
GitHub org, so it is "official".

In order to properly identify the #" and closing " characters we have to
parse them with the Clojure grammar (in case the regex grammar is not
available) and again with the regex grammar as part of the actual
pattern. This could be avoided if either the Clojure grammar captured a
node for the inner contents of the regex literal, or the
treesit-range-settings supported some kind of offset argument like the
Neovim tree-sitter mechanisms do.

Should address issue #11

I think that multiple parsers per buffer may be too buggy to use right
now. There are situations where no regex is present in a buffer,
but the entire buffer is highlighted as a regular expression. This
functionality probably needs upstream work in Emacs before we can merge
this into the main branch of clojure-ts-mode.
@bbatsov
Member

bbatsov commented Apr 15, 2025

@rrudakov Perhaps we can apply your learnings from the markdown-inline work here?

@rrudakov
Contributor

@rrudakov Perhaps we can apply your learnings from the markdown-inline work here?

I think the biggest issue here is finding a proper grammar. The grammar mentioned in the discussion supports PCRE2, POSIX and JavaScript regexps, and I'm not sure that any of those is fully compatible with Java regexps. One difference I can think of is the use of double backslashes in Java.

If we find a grammar, adding a new parser and syntax highlighting is pretty straightforward.

@bbatsov
Member

bbatsov commented Apr 15, 2025

I think PCRE2 will work well for our case since, if I recall correctly, Java's regular expressions were derived from Perl 5. We'll have to check this, though.

@rrudakov
Contributor

Image

It works pretty well. We need to decide what we want to highlight and which faces to use for the different elements (I'm not a designer and I'm not a regex expert :) ). The possibilities for syntax highlighting are endless (see the syntax tree in the buffer on the right).

@rrudakov
Contributor

Image

With dark color scheme.

@rrudakov
Contributor

There is also an issue in Emacs. When local parsers are used, the offset setting has no effect, so the hash sign and quotes are also included in the range (this also applies to our markdown-inline parser).

It's been reported to the Emacs bug tracker: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=77848

@bbatsov
Member

bbatsov commented Apr 16, 2025

Image

With dark color scheme.

This looks good to me. I was going to suggest focusing on match groups, character classes, escapes, anchors and modifiers, and I guess that's what you did.

rrudakov added a commit to rrudakov/clojure-ts-mode that referenced this issue Apr 18, 2025
@rrudakov rrudakov linked a pull request Apr 18, 2025 that will close this issue
@rrudakov
Contributor

It's been reported to the Emacs bug tracker: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=77848

The bug is fixed on Emacs master. On Emacs 30 the offset feature doesn't exist, which means that ranges for embedded parsers (markdown-inline and regex) will include the quotes and, for regex literals, the hash character.
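
For readers who want to see what the missing offset amounts to, here is a minimal sketch in Python. It is not Emacs or clojure-ts-mode code: `inner_range` and its `offset` parameter are hypothetical names, and positions are 0-based and end-exclusive like Python slices, whereas Emacs buffer positions start at 1. It assumes the host grammar reports a single range covering the whole `#"..."` literal, and shows how trimming two characters from the start and one from the end leaves only the pattern for the embedded regex parser.

```python
# Illustrative sketch only; none of these names come from Emacs or clojure-ts-mode.


def inner_range(node_start, node_end, offset=(2, -1)):
    """Shrink a regex-literal range so that it covers only the pattern.

    offset is (start_delta, end_delta): by default it skips the
    2-character `#"` prefix and the 1-character closing quote.
    """
    start = node_start + offset[0]
    end = node_end + offset[1]
    if start > end:
        raise ValueError("offset is larger than the node itself")
    return start, end


# A Clojure form containing one regex literal.
source = '(re-find #"\\d+" "abc123")'

# Stand-in for the range the host parser would report for the literal.
# A real parser supplies this; the naive search below is only for the demo
# and would not cope with escaped quotes inside the pattern.
node_start = source.index('#"')
node_end = source.index('"', node_start + 2) + 1  # one past the closing quote

start, end = inner_range(node_start, node_end)
print(source[node_start:node_end])  # whole literal, delimiters included
print(source[start:end])            # pattern only
```

Without the trimming step, the embedded parser would be handed `#"\d+"` rather than `\d+`, which corresponds to the behaviour described above for Emacs 30.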
