* Parses the GHC codebase!
I'm using a trimmed set of the source directories of the compiler and most core libraries in
[this repo](https://github.com/tek/tsh-test-ghc).
This used to break horribly in many files because explicit brace layouts weren't supported very well.
* Faster in most cases!
Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test
codebases in `test/libs`:
Old:
```
effects: 32ms
postgrest: 91ms
ivory: 224ms
polysemy: 84ms
semantic: 1336ms
haskell-language-server: 532ms
flatparse: 45ms
```
New:
```
effects: 29ms
postgrest: 64ms
ivory: 178ms
polysemy: 70ms
semantic: 692ms
haskell-language-server: 390ms
flatparse: 36ms
```
GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times!
To get more detailed info (including new codebases I added, consisting mostly of core libraries), run
`test/parse-libs`.
I also added an interface for running `hyperfine`, exposed as a Nix app – execute
`nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or
`test/libs/tsh-test-ghc/libraries`.
* Smaller size of the shared object.
`tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one.
* Significantly faster time to generate, and slightly faster build.
On my machine, generation takes 9.34s with the old grammar vs 2.85s with the new one, and compiling takes 3.75s vs 3.33s.
* All terminals now have proper text nodes when possible, like the `.` in modules.
Fixes #102, #107, #115 (partially?).
* Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative
interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code.
Fixes #89, #105, #111.
* Comments aren't pulled into preceding layouts anymore.
Fixes #82, #109.
(This can probably still be improved with a few heuristics, e.g. for postfix Haddock comments.)
* Similarly, whitespace is kept out of layout-related nodes as much as possible.
Fixes #74.
* Hashes can now be operators in all situations, without sacrificing unboxed tuples.
Fixes #108.
* Expression quotes are now handled separately from quasiquotes and their contents parsed properly.
Fixes #116.
* Explicit brace layouts are now handled correctly.
Fixes #92.
* Function application with multiple block arguments is handled correctly.
* Unicode categories for identifiers now match GHC, and the full Unicode character set is supported for things like
prefix operator detection.
* Haddock comments have dedicated nodes now.
* The grammar now uses named precedences instead of closely replicating the GHC parser's productions.
* Different layouts are tracked and closed with their special cases considered.
In particular, multi-way if now has layout.
* Fixed a CPP bug where a mid-line `#endif` was detected as a false positive.
* CPP only matches legal directives now.
* Generally more lenient parsing than GHC, and in the presence of errors:
  * Missing closing tokens at EOF are tolerated for:
    * CPP
    * Comment
    * TH Quotation
  * Multiple semicolons in some positions like `if/then`
  * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions
* List comprehensions can have multiple sets of qualifiers (`ParallelListComp`).
* Deriving clauses after GADTs don't require layout anymore.
* Newtype instance heads are working properly now.
* Escaping newlines in comments and CPP works now.
Escaping newlines on regular lines won't be implemented.
* One remaining issue is that qualified left sections that contain infix ops are broken, e.g. `(a + a A.+)`.
I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse
application, infix and negation without lexing all qualified names in the scanner.
I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work.
For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing.
* Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of
Unicode categories, using bitmaps.
I might need to change this to write them all to a shared file, so the set of source files stays the same.
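
To illustrate the bitmap idea from the last item, here is a minimal sketch of how such a generator could look. It is not the program in the repo: the names `bitmap`, `lookupBit`, `renderC` and `isUpperId`, the small `maxCodePoint` cut-off, and the choice of categories are all assumptions made for the example.

```haskell
import Data.Bits (setBit, shiftR, testBit, (.&.))
import Data.Char (GeneralCategory (..), chr, generalCategory, ord)
import Data.List (foldl', intercalate)
import Data.Word (Word8)
import Numeric (showHex)

-- | Highest code point covered by the table.
-- Hypothetical cut-off to keep the output small; Unicode goes up to 0x10FFFF.
maxCodePoint :: Int
maxCodePoint = 0x2FF

-- | Example category set (an assumption, not taken from the repo):
-- uppercase and titlecase letters.
isUpperId :: Char -> Bool
isUpperId c = generalCategory c `elem` [UppercaseLetter, TitlecaseLetter]

-- | Pack set membership into bytes: eight code points per byte,
-- lowest code point in the least significant bit.
bitmap :: (Char -> Bool) -> [Word8]
bitmap member = map byte [0, 8 .. maxCodePoint]
  where
    byte base = foldl' (setIfMember base) 0 [0 .. 7]
    setIfMember base acc i
      | base + i <= maxCodePoint && member (chr (base + i)) = setBit acc i
      | otherwise = acc

-- | Look up a code point in a table built by 'bitmap'.
lookupBit :: [Word8] -> Int -> Bool
lookupBit table cp
  | cp < 0 || cp > maxCodePoint = False
  | otherwise = testBit (table !! (cp `shiftR` 3)) (cp .&. 7)

-- | Render the table as a C array together with a lookup function
-- that mirrors 'lookupBit'.
renderC :: String -> [Word8] -> String
renderC name table =
  unlines
    [ "static const unsigned char " ++ name ++ "_bits[] = {"
    , "  " ++ intercalate ", " (map hex table)
    , "};"
    , "static inline int " ++ name ++ "(unsigned c) {"
    , "  return c <= " ++ show maxCodePoint
        ++ " && ((" ++ name ++ "_bits[c >> 3] >> (c & 7)) & 1);"
    , "}"
    ]
  where
    hex b = "0x" ++ showHex b ""
```

With these definitions, `putStr (renderC "is_upper" (bitmap isUpperId))` prints a C table and a lookup function for that set. Emitting every set through one `renderC`-style function into a single file would be one way to keep the set of generated source files stable.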