Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Tree-sitter] Create SQL grammar #1061

Open
1 task done
savetheclocktower opened this issue Jul 17, 2024 · 2 comments
Open
1 task done

[Tree-sitter] Create SQL grammar #1061

savetheclocktower opened this issue Jul 17, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@savetheclocktower
Copy link
Sponsor Contributor

Have you checked for existing feature requests?

  • Completed

Summary

This recent comment reminds me of the painful constraint that injections don't work across different types of grammar.

We don't experience this much because so many of the built-in packages now have modern Tree-sitter grammars. But there are a few notable ones that don't:

  • Less (CSS)
  • CoffeeScript (and CSON)
  • SQL
  • Objective-C
  • Perl
  • XML

As well as some oddballs:

  • The small grammars that are active for git commit messages, git config files, etc.
  • Mustache (why is Mustache a built-in package?)
  • Makefile
  • Property lists (macOS old-style)

Why do we have these gaps? In short: the most reliable Tree-sitter grammars out there (especially 18 months ago when I began this insane project) were the first-party grammars. If the tree-sitter GitHub organization provided a parser that matched one of our built-in languages, I used it. For more obscure languages, there weren't many obvious candidates for up-to-date, actively-maintained parsers from third parties.

I feel no urgency on the oddballs and not much urgency on the more notable ones. But with Tree-sitter adoption increasing over time (thanks, Neovim!) it's worth checking every few months to see if we can support more languages.

SQL is the best example; it looks like there are a few tree-sitter-sql repositories out there. We should investigate them and see about integrating one.

What benefits does this feature provide?

The main benefit (apart from theoretically improved syntax highlighting, code folding, etc.) is the ability to inject into other grammars. The commenter described above is justifiably annoyed that they can't get their SQL highlighted when they write Markdown and include a SQL code block.

Any alternatives?

The only real alternative is the theoretical ability to be able to inject TextMate grammars into Tree-sitter grammars (and maybe vice-versa). That's something I floated early on, but it'd be so hard that I don't think it's worth the effort. Better to put that work into modernizing more languages.

Other examples:

No response

@savetheclocktower savetheclocktower added the enhancement New feature or request label Jul 17, 2024
@savetheclocktower
Copy link
Sponsor Contributor Author

There are four SQL parsers linked from the Tree-sitter homepage:

  • SQL (SQLite) leads to a repo that is archived.
  • SQL (BigQuery) is optimized for Google's BigQuery SQL dialect, so it's not really up our alley.
  • SQL (PostgreSQL) is a candidate; it “initially focuses on PostgreSQL flavor,” but we'll give it a shot.
  • SQL (General) has the most activity of the four and declares itself to be “general/permissive,” so that's where we'll start.

I spent about 45 minutes this morning testing the SQL (General) parser with some SQL dumps I had lying around (mainly from a backup of a WordPress site), some code snippets for various SQL features, and so on.

Some observations about this parser:

  • Anything that uses standard SQL syntax (common to all major flavors of SQL) seems like it will get highlighted just fine.
  • Lots of custom syntax from various flavors of SQL goes unrecognized.
  • I can tell that it tries to recover when it encounters input it doesn’t understand. But when it gets thrown out of whack, it seems to regress to a mode in which it can still lex symbols (the obvious stuff like SELECT, FROM, etc.) even if it can't figure out how to assemble them into larger constructs. This is appreciated, since it means that syntax highlighting can theoretically survive in such a mode.
  • I was nonetheless surprised by how much stuff it didn’t seem to understand.

For instance:

  • It had trouble parsing my WordPress dumps cleanly. It wasn’t able to recognize some MySQL-specific statements, which is fine. But it also got really flummoxed on a very long INSERT statement with lots of data that (as far as I can tell) was properly delimited. The line of SQL was about 31,000 characters long, so I expected the syntax highlighting itself to fail for Pulsar-related reasons — but I didn’t expect that line to flummox the parser so badly that it couldn’t get back on track afterward.

  • It also had trouble with some of the samples from this repo I found when I was searching for example Postgres dump files. It did much better with this syntax in general, but it has a crucial inability to recognize COPY FROM STDIN.

    Here’s a usage example; in practice, it seems to operate like a heredoc string, interpreting everything in a special mode until it sees an end-of-data marker.

    Because the special mode activates after the ; statement terminator, the fact that the parser doesn’t understand this construct is painful. The end-of-data marker is \., after which a new SQL statement can start immediately, so. In this file, the parser wasn’t able to recover at all once the COPY FROM STDIN statements began.

    This is the only Postgres incompatibility I found in my quick research, but it’s just as catastrophic to syntax highlighting as the many errors I got from my WordPress dumps.

I think it’s time to put this one aside. I’ll try the SQL (PostgreSQL) grammar next.

But if that one doesn’t fare any better, I wonder if there’s a better approach: a Tree-sitter grammar that defaults to total permissiveness for SQL:

  • Parse obvious SQL keywords.
  • Parse things in delimited pairs (like quotation marks). Keep delimiters balanced.
  • Never throw errors; capture everything you don't understand in a special node type until you see something else that you recognize.
  • Preserve anything you don’t recognize up until the next statement terminator.

This sort of parser would function much more like a TextMate grammar: decreasing its ambitions and gaining some robustness in return. The only problem is that I'd have to write it.

@confused-Techie
Copy link
Member

@savetheclocktower Appreciate the write up you've done here.

Although I'll assume that taking on writing a SQL Tree-sitter parser is a bit more to your workload considering what you're already propping up, if you'd like I'd be more than happy to take on a parser for this purpose, as Tree-sitter parsing is where I got most in depth when you first started this, even if my acumen fell off somewhat getting this properly into Pulsar.

And considering the frequency in which I'm working with PostgreSQL for the backend, I'd be motivated to keep things in working order.

If you haven't already started something I'll see if there's anything I can do in this area, in respect to a parser, then ideally that just means I'd have some reliance on getting it integrated into Pulsar.

Obviously if you've already started I don't mean to step on any toes, just shouting out where I can try to fill in to assist

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants