Skip to content

Replace tree-sitter bindings with tree-house #12972

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 13, 2025
Merged

Replace tree-sitter bindings with tree-house #12972

merged 5 commits into from
May 13, 2025

Conversation

the-mikedavis
Copy link
Member

This PR drops our dependency on the upstream tree-sitter Rust bindings and moves us to a new highlighter crate we've been working on: tree-house.

The main benefit of the new highlighter is that it fixes correctness issues, especially since the reverse query precedence PR. It solves longstanding issues #1151 and #6491, and updates us to the latest tree-sitter C source (v0.22 -> v0.25), which will become important as more grammars adopt the new ABI. It's also a slight performance boost in my benchmarking (5-10%). We've hoped to separate out our highlighting machinery for a while so that other applications could take advantage and the hope is that tree-house is the right way to expose it (\cc @hadronized).

The new highlighter crate makes some abstractions clearer. The types to care about are already established in the syntax module:

  • Loader, an ECS-style holder of all languages (pub struct Language(pub u32)). This stores the language configuration as well as any grammar and query information and allows looking up a language by file-type, language name, shebang, etc.. Instead of being stored on Syntax this is now stored separately on Document and passed to Syntax for updates. In the long run we can eventually store a single ArcSwap<syntax::Loader> on Editor instead of many clones of Arc<ArcSwap<syntax::Loader>>.
  • Syntax, a wrapper of the Syntax type from tree-house. It holds a tree of trees of the injection layers of a document and provides logarithmic lookups to determine an injection layer (including the layer's language, config, tree and queries) for a given byte range. It also has a query iterator that works across injections which can support indents, textobjects, etc. in the future.

There are some differences compared to master that are less significant:

  • The configuration types have been moved to another module, helix_core::syntax::config. This is not strictly necessary for this change but it seems like a good opportunity to move these largely unrelated types away from the syntax module.
  • The new bindings deal in u32s instead of usize. This causes some changes in helix-core and the commands modules but is otherwise an implementation detail.
  • The use of tree_sitter::Point in the indents module has been replaced with byte indices. Generally we can avoid the point types and functions. syntax::generate_edits for example now produces zero points.
  • The TreeCursor type that acts over injection layers is built into tree-house and is basically free to create - it doesn't require any extra sorting of layers. That type in tree-house uses the TreeCursor type from the C library and so it should be more efficient.
  • The Highlight type comes from tree-house now and is represented as a non-max u32 (enables the null pointer optimization for Option<Highlight>).

... and some other changes that are worth reviewing:

  • The abstraction for overlaying highlights onto the syntax highlights has been rewritten to avoid the dynamic dispatch from syntax::merge in favor of a cursor-like API that matches the highlighter. See the OverlayHighlights and OverlayHighlighter types in the rewritten syntax module.
  • Queries may now specify which captures and predicates/properties they use up-front and deny unknown ones. cargo xtask query-check can now point out unknown properties like (#set! priority ..) and unknown predicates like (#has-ancestor? ..) and has a nice visual output. This especially benefits the indent query which previously used runtime panics for the same effect, see IndentQuery.
Example query compilation errors...
Error: Failed to compile highlights for 'julia'

Caused by:
    unknown predicate #has-ancestor?
      --> 246:4
         |
     245 |   (#any-of? @variable.builtin "begin" "end")
     246 |   (#has-ancestor? @variable.builtin index_expression))
         |    ^^^^^^^^^^^^^^
         |

Error: Failed to compile highlights for 'dockerfile'

Caused by:
    unknown property 'priority'
      --> 54:11
        |
     53 |   (heredoc_line) @string)
     54 |   (#set! "priority" 90))
        |           ^^^^^^^^
        |
Error: Failed to compile highlights for 'glimmer'

Caused by:
    unknown predicate #offset!
      --> 21:16
        |
     20 |   arguments: ((template_string) @glimmer
     21 |               (#offset! @glimmer 0 1 0 -1)))
        |                ^^^^^^^^
        |

Error: Failed to compile indents.scm query for 'opencl'

Caused by:
    unknown predicate #not-knd-eq?
      --> 21:4
        |
     20 |   consequence: (_) @indent
     21 |   (#not-knd-eq? @indent "compound_statement")
        |    ^^^^^^^^^^^^
     22 |   (#set! "scope" "all"))
        |

Fixes #1151
Fixes #3391
Fixes #6491

@nik-rev
Copy link
Contributor

nik-rev commented Feb 27, 2025

This is fantastic!! It's great to have parameters retain their color even when the function definition goes out of the viewport.

I just rebased #12768 on top of your PR and it works flawlessly. We don't have to inject into every argument anymore, positional injections work!!

image

@pascalkuthe
Copy link
Member

pascalkuthe commented Feb 28, 2025

While writing the code for query precedence in the injection query I was assuming I was fixing niche bugs (and sometimes questioned neither it was worth writing that quite subtle code). Seeing an actual usecase for it so quickly is great!

@hadronized
Copy link
Contributor

Thanks a lot for doing this! I’m going to have a look at whether I can use it in kak-tree-sitter!

@nik-rev
Copy link
Contributor

nik-rev commented Mar 11, 2025

I've been testing this along with #12768 and found a reproducible crash

This panic happens if:

  • injection.include-children is set
  • the macro's documentation includes usage of the macro itself, where the injection happens inside of the markdown
  • auto-complete the macro

To reproduce, add this at the end of runtime/queries/rust/injections.scm:

((macro_invocation
   macro:
     [
       (scoped_identifier
         name: (_) @_macro_name)
       (identifier) @_macro_name
     ]
   (token_tree . (_) . (_) . (string_literal) @injection.content))
 (#any-of? @_macro_name
  ; std
  "some_macro")
 (#set! injection.language "rust")
  (#set! injection.include-children))

Open a Rust project with this structure with cursor at |:

/// ```
/// some_macro!((), (), "`rust` injection happens here");
/// ```
macro_rules! some_macro {
    ($a:expr, $b:expr, $c:literal) => {};
}

fn main() {
    some_ma|
}

When Rust analyzer auto-completes the some_macro and you go to the next completion, Helix crashes with this error:

thread 'main' panicked at helix-term/src/ui/markdown.rs:93:9:
assertion failed: pos > start
stack backtrace:
   0: rust_begin_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
   2: core::panicking::panic
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:144:5
   3: helix_term::ui::markdown::highlighted_code_block
             at ./helix-term/src/ui/markdown.rs:93:9
   4: helix_term::ui::markdown::Markdown::parse
             at ./helix-term/src/ui/markdown.rs:296:40
   5: <helix_term::ui::markdown::Markdown as helix_term::compositor::Component>::render
             at ./helix-term/src/ui/markdown.rs:366:20
   6: <helix_term::ui::completion::Completion as helix_term::compositor::Component>::render
             at ./helix-term/src/ui/completion.rs:579:9
   7: <helix_term::ui::editor::EditorView as helix_term::compositor::Component>::render
             at ./helix-term/src/ui/editor.rs:1588:13
   8: helix_term::compositor::Compositor::render
             at ./helix-term/src/compositor.rs:181:13
   9: helix_term::application::Application::render::{{closure}}
             at ./helix-term/src/application.rs:290:9
  10: helix_term::application::Application::handle_terminal_events::{{closure}}
             at ./helix-term/src/application.rs:664:27
  11: helix_term::application::Application::event_loop_until_idle::{{closure}}
             at ./helix-term/src/application.rs:332:56
  12: helix_term::application::Application::event_loop::{{closure}}
             at ./helix-term/src/application.rs:306:58
  13: helix_term::application::Application::run::{{closure}}
             at ./helix-term/src/application.rs:1140:39
  14: hx::main_impl::{{closure}}
             at ./helix-term/src/main.rs:160:54
  15: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/future/future.rs:124:9
  16: tokio::runtime::park::CachedParkThread::block_on::{{closure}}
             at /home/e/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/park.rs:284:63
  17: tokio::runtime::coop::with_budget
             at /home/e/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/coop.rs:107:5
  18: tokio::runtime::coop::budget
             at /home/e/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/coop.rs:73:5
  19: tokio::runtime::park::CachedParkThread::block_on
             at /home/e/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/park.rs:284:31
  20: tokio::runtime::context::blocking::BlockingRegionGuard::block_on
             at /home/e/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/context/blocking.rs:66:9
  21: tokio::runtime::scheduler::multi_thread::MultiThread::block_on::{{closure}}
             at /home/e/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/scheduler/multi_thread/mod.rs:87:13
  22: tokio::runtime::context::runtime::enter_runtime
             at /home/e/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/context/runtime.rs:65:16
  23: tokio::runtime::scheduler::multi_thread::MultiThread::block_on
             at /home/e/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/scheduler/multi_thread/mod.rs:86:9
  24: tokio::runtime::runtime::Runtime::block_on_inner
             at /home/e/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/runtime.rs:370:45
  25: tokio::runtime::runtime::Runtime::block_on
             at /home/e/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.43.0/src/runtime/runtime.rs:340:13
  26: hx::main_impl
             at ./helix-term/src/main.rs:162:5
  27: hx::main
             at ./helix-term/src/main.rs:44:21
  28: core::ops::function::FnOnce::call_once
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

If you remove the injection.include-children the panic does not happen

But in that case, my PR is no longer able to inject the rustfmt language

@the-mikedavis
Copy link
Member Author

Looks like it was a bug in some of the precedence handling for injections and it could cause injection layers to get out of order. Fixed in helix-editor/tree-house@9e2063c

@nik-rev
Copy link
Contributor

nik-rev commented Mar 14, 2025

Found a segfault

Open this Rust file, with cursor at the end of the comment, after the // in impl A:

//! L
//! - `a`

struct A {}

impl A {
    //
    pub fn a() {}
}

When you try to insert a / to get this file:

//! L
//! - `a`

struct A {}

impl A {
    ///
    pub fn a() {}
}

Instead we have a segfault:
image

Note: Removing the //! comments at the beginning and the segfault will not happen anymore

Tested with the latest version of this PR (does not happen on master)

What is bizarre is that if you copy paste the final version of the file then there is no segfault, and you can write doc comments just fine

I tried to run this with with lldb and gdb to get the stack trace but when I try to do that for some reason I don't get syntax highlighting (so the segfault can't happen):

image

@nik-rev
Copy link
Contributor

nik-rev commented Mar 14, 2025

Also, the markdown language is injected into Rust doc comments, but if the code blocks don't have an rs extension then there will be no syntax highlighting

image

It is fairly common not to have an rs extension in these code blocks. I was wondering if it's possible to inject the rust language into code blocks by default, if no language is specified (but only in rust doc)

At the moment, what I'm thinking is that there can be a markdown-rs tree-sitter query which just ; extends markdown but has an extra rust injection into code blocks which don't specify a language

Then we can use this markdown-rs in Rust doc. What are your thoughts on this?

Edit: I was able to implement this in #13116

image

@the-mikedavis
Copy link
Member Author

the-mikedavis commented Mar 15, 2025

W.r.t. the segfault: I pushed a workaround in helix-editor/tree-house@8b94086. What's happening is that adding a new doc comment to the root layer adds a new range to the markdown combined injection, so it starts out covering ranges [3..6, 9..16] and then with the edit [3..6, 9..16, 46..47]. When we pass the tree from before the edit (covering 3..16), it seems that we break some assumptions in either the C library (in lexer.c) or in tree-sitter-markdown, I'm not sure which. This will need further debugging. In the meantime we can avoid passing the old tree when we add a new range to a combined injection.

This doesn't happen on master because incremental injections are smarter in tree-house. On master we don't pass the old parse tree because we don't recognize that the new injection layer is the same as the old one.

Backtrace...
Thread 1 "hx" received signal SIGSEGV, Segmentation fault.
0x0000563fe0200b5b in ts_lexer_goto (self=0x563ff22645f0, position=...)
    at /home/michael/.cargo/git/checkouts/tree-house-db01805a04979375/abe1ae1/bindings/vendor/src/./lexer.c:186
186	     .bytes = last_included_range->end_byte,
(gdb) bt
#0  0x0000563fe0200b5b in ts_lexer_goto (self=0x563ff22645f0, position=...)
    at /home/michael/.cargo/git/checkouts/tree-house-db01805a04979375/abe1ae1/bindings/vendor/src/./lexer.c:186
#1  0x0000563fe0201494 in ts_lexer_reset (self=0x563ff22645f0, position=...)
    at /home/michael/.cargo/git/checkouts/tree-house-db01805a04979375/abe1ae1/bindings/vendor/src/./lexer.c:398
#2  0x0000563fe0205736 in ts_parser__lex (self=0x563ff22645f0, version=0, parse_state=0)
    at /home/michael/.cargo/git/checkouts/tree-house-db01805a04979375/abe1ae1/bindings/vendor/src/./parser.c:531
#3  0x0000563fe0209964 in ts_parser__advance (self=0x563ff22645f0, version=0, allow_node_reuse=true)
    at /home/michael/.cargo/git/checkouts/tree-house-db01805a04979375/abe1ae1/bindings/vendor/src/./parser.c:1598
#4  0x0000563fe020b77a in ts_parser_parse (self=0x563ff22645f0, old_tree=0x563ff2486e70, input=...)
    at /home/michael/.cargo/git/checkouts/tree-house-db01805a04979375/abe1ae1/bindings/vendor/src/./parser.c:2171
#5  0x0000563fe020bbbf in ts_parser_parse_with_options (self=0x563ff22645f0, old_tree=0x563ff2486e70, input=..., parse_options=...)
    at /home/michael/.cargo/git/checkouts/tree-house-db01805a04979375/abe1ae1/bindings/vendor/src/./parser.c:2240
#6  0x0000563fdfc19b93 in tree_house_bindings::parser::Parser::parse<tree_house_bindings::ropey::RopeInput, ropey::slice::RopeSlice> (self=0x7ffc146bb710, input=..., old_tree=...)
    at /home/michael/.cargo/git/checkouts/tree-house-db01805a04979375/abe1ae1/bindings/src/parser.rs:146
#7  0x0000563fdfbfd12a in tree_house::LayerData::parse<helix_core::syntax::Loader> (self=0x563ff225d6b0, parser=0x7ffc146bb710, source=..., loader=0x563ff216e570)
    at /home/michael/.cargo/git/checkouts/tree-house-db01805a04979375/abe1ae1/highlighter/src/parse.rs:98
#8  0x0000563fdfbfcd28 in tree_house::Syntax::update<helix_core::syntax::Loader> (self=0x563ff22bcf80, source=..., timeout=..., edits=..., loader=0x563ff216e570)
    at /home/michael/.cargo/git/checkouts/tree-house-db01805a04979375/abe1ae1/highlighter/src/parse.rs:57
#9  0x0000563fdfbf9bfc in helix_core::syntax::Syntax::update (self=0x563ff22bcf80, old_source=..., source=...,

last_included_range is not a valid memory location as self->included_range_count is 0 here and last_included_range = &self->included_ranges[self->included_range_count - 1].

@RoloEdits
Copy link
Contributor

Not sure if I am doing something wrong, and the upgrade process is now different than before, but I get the highlighting in the picker, but not the actual buffer.

Picker:
tree-sitter-picker

Buffer:
tree-sitter-buffer

@gabydd
Copy link
Member

gabydd commented Mar 17, 2025

I think I saw this as well it seems like there is something up with the timeout, at least that's what the logs where saying for me I believe

@daedroza
Copy link
Contributor

daedroza commented Mar 18, 2025

updates us to the latest tree-sitter C source (v0.22 -> v0.25), which will become important as more grammars adopt the new ABI.

What's stopping us from moving to a newer tree-sitter version now? I'm dependent on a newer ABI (15) but unable to use since we're stuck with 0.22 treesitter version. However seeing this newer tree-house library only deepens the problem as bunch of PRs might still use the old APIs of helix. It's already painful to keep rebasing the non-merged PRs and seeing such 2k+ changes makes me question my ability on rebasing further...

@the-mikedavis
Copy link
Member Author

Newer versions of the tree-sitter crate introduce a transitive dependency on streaming-iterator which we will not accept, see #11847

@infastin
Copy link

infastin commented Mar 20, 2025

Will directives #select-adjacent! and #strip! be implemented? They are listed in the Tree-sitter User Guide.

@RoloEdits
Copy link
Contributor

RoloEdits commented Mar 21, 2025

So my issue seems a little more interesting(?) than a timeout. When I have 1.76.0 in rust-toolchain.toml (master) I get highlighting properly. But when I put 1.85.1 I get the problem above. (except trying now only seem to get highlighting for scm files)

@RoloEdits
Copy link
Contributor

Actually, just changing lto = "thin" to lto = "fat" also seems to cause this. Can anyone else reproduce this behavior?

@RoloEdits
Copy link
Contributor

RoloEdits commented Mar 31, 2025

Starting to use this in my build and now I am not seeing the comment injections:
image
image

Not sure if this is what was alluded to in an earlier discussion or if this is unknown. Before I copied over the new injections it seemed to work, but I haven't copied over those in a while. So it may be from the precedence PR?

@nik-rev
Copy link
Contributor

nik-rev commented Mar 31, 2025

Starting to use this in my build and now I am not seeing the comment injections:
image
image

Not sure if this is what was alluded to in an earlier discussion or if this is unknown. Before I copied over the new injections it seemed to work, but I haven't copied over those in a while. So it may be from the precedence PR?

Yes, I think its the same as this:

#12972 (comment)

@nik-rev
Copy link
Contributor

nik-rev commented Apr 7, 2025

Here are a few extra injections to add for markdown-rustdoc, which complements 813f771

(fenced_code_block
  (info_string
    (language) @__language)
  (code_fence_content) @injection.content
  ; list of attributes for Rust syntax highlighting:
  ; https://doc.rust-lang.org/rustdoc/write-documentation/documentation-tests.html?highlight=no_run#attributes
  (#match? @__language
  "(ignore|should_panic|no_run|compile_fail|standalone_crate|custom|edition*)")
  (#set! injection.language "rust")
  (#set! injection.include-unnamed-children))

Since at the moment there is no syntax highlighting for ```no_run code blocks in Rust documentation and the like

@RoloEdits
Copy link
Contributor

Not sure if this is a rust tree-sitter parsing issue or some kind of precedence issue.

If a lifetime has the same name as a parameter then its highlighted as a parameter:
image
image

@SoraTenshi
Copy link
Contributor

I believe this PR will have an impact on Tree-sitter queries, so here my question:
How will this exactly be handled? Will there be an influx of new PRs containing tree-sitter query fixes for specific languages be the case, will all the fixes be collected here / should one open a PR for this PR? Or will there be a mass PR after this has been merged to deal with all the changes in the queries?

I am asking, because i'd be happy to help with some languages.

@the-mikedavis
Copy link
Member Author

This should actually only affect locals.scm queries: we will be diverging from the way that locals work in the tree-sitter-cli does locals highlighting (@local.scope, @local.definition, @local.reference). Instead we will run locals queries at parse-time (instead of highlight time) and in the locals.scm files the change is that the scopes we will use are @local.definition.<highlight-scope> (for example @local.definition.variable.parameter). I've updated all of the locals at the time already in this PR but I will take another pass before merging to update any languages that were merged since this was opened. Feel free to send a PR to migrate any new languages as well. It will take a bit of detective work to figure out which ones those are though as I don't remember what the original fork-point of this branch was. (It can probably be approximated based on the PR's publish date)

@RoloEdits
Copy link
Contributor

Did a sampling of helix to see if there is any change from what I saw before, with the final plateau being from the clib of tree-sitter(past the "edge" of what we can "do"), and pretty much saw the same as before. Nice job with the wrapper.
image

(curious enough, the largest block on the left is for rendering the workspace diagnostics on the statusline. Will check that out later this week)

archseer
archseer previously approved these changes May 13, 2025
Copy link
Member

@archseer archseer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get this change merged, I've been running it for a month now with no issues. Helix facing changes look good to me and I've done a preliminary pass through the tree-house code.

for matched_node in m.matched_nodes() {
let node_id = matched_node.node.id();
let capture = Some(matched_node.capture);
let capture_type = if capture == query.indent_capture {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can't be a match clause?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly not - I wish you could do Elixir-like pinning in Rust matches like ^query.indent_capture => { .. }

the-mikedavis and others added 5 commits May 13, 2025 18:30
This type also exists on `Editor`. This change brings it to the
`Document` as well because the replacement for `Syntax` in the child
commits will eliminate `Syntax`'s copy of `syn_loader`. `Syntax` will
also be responsible for returning the highlighter and query iterators
(which will borrow the loader), so the loader must be separated from
that type.

In the long run, when we make a larger refactor to have
`Document::apply` be a function of the `Editor` instead of the
`Document`, we will be able to drop this field on `Document` - it is
currently only necessary for `Document::apply`. Once we make that
refactor, we will be able to eliminate the surrounding `Arc` in
`Arc<ArcSwap<syntax::Loader>>` and use the `ArcSwap` directly instead.
@the-mikedavis the-mikedavis merged commit be1cf09 into master May 13, 2025
7 checks passed
@the-mikedavis the-mikedavis deleted the tree-house branch May 13, 2025 23:24
@David-Else
Copy link
Contributor

Since this was merged shebangs for Bash no longer work. If the files begins with #!/bin/bash and has no file extension then it loads as text.

@the-mikedavis
Copy link
Member Author

Ah yep I accidentally dropped some shebang-parsing code thinking it was redundant. Should be fixed now in 09bc67a

@Rudxain
Copy link
Contributor

Rudxain commented May 16, 2025

I've noticed that if no highlight is defined for a scope, the color is #ffffff regardless of language or theme (I can repro on default). I thought it was a bug in Everforest, but I can't unsee it, it's everywhere. Is this intentional?

@the-mikedavis
Copy link
Member Author

Ah good catch, that was a regression in the syntax highlighter. Instead of using the theme's ui.text key the syntax highlighter was basing its styles on Style::default() so text in markdown files for example would be styled with whatever colorscheme your terminal has set for foreground/background. Fixed now in 3ceae88

@SofusA
Copy link
Contributor

SofusA commented May 17, 2025

I don't know if this is the right to report, but after this, html syntax is no longer injected into html! rust macro:

Before:
image

After:
image

@the-mikedavis
Copy link
Member Author

Ah yeah the Slint and HTML injections need to come after the generic (macro_invocation) injections in Rust now since later patterns take precedence. Fixed in 05ae617

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-core Area: Helix core improvements A-tree-sitter Area: Tree-sitter
Projects
None yet