Rewrite the grammar once again.
* Parses the GHC codebase!

  I'm using a trimmed set of the source directories of the compiler and most core libraries in
  [this repo](https://github.com/tek/tsh-test-ghc).

  This used to break horribly in many files because explicit brace layouts weren't supported very well.

* Faster in most cases!
  Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test
  codebases in `test/libs`:

  Old:
  ```
  effects: 32ms
  postgrest: 91ms
  ivory: 224ms
  polysemy: 84ms
  semantic: 1336ms
  haskell-language-server: 532ms
  flatparse: 45ms
  ```

  New:
  ```
  effects: 29ms
  postgrest: 64ms
  ivory: 178ms
  polysemy: 70ms
  semantic: 692ms
  haskell-language-server: 390ms
  flatparse: 36ms
  ```

  GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times!
  To get more detailed info (including new codebases I added, consisting mostly of core libraries), run
  `test/parse-libs`.
  I also added an interface for running `hyperfine`, exposed as a Nix app – execute
  `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or
  `test/libs/tsh-test-ghc/libraries`.

* Smaller size of the shared object.

  `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one.

* Significantly faster time to generate, and slightly faster build.

  On my machine, generation takes 2.85s, down from 9.34s, and compiling takes 3.33s, down from 3.75s.

* All terminals now have proper text nodes when possible, like the `.` in modules.
  Fixes #102, #107, #115 (partially?).

* Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative
  interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code.
  Fixes #89, #105, #111.

* Comments aren't pulled into preceding layouts anymore.
  Fixes #82, #109.
  (This can probably still be improved with a few heuristics, e.g. for postfix Haddock comments.)

* Similarly, whitespace is kept out of layout-related nodes as much as possible.
  Fixes #74.

* Hashes can now be operators in all situations, without sacrificing unboxed tuples.
  Fixes #108.

* Expression quotes are now handled separately from quasiquotes and their contents parsed properly.
  Fixes #116.

* Explicit brace layouts are now handled correctly.
  Fixes #92.

* Function application with multiple block arguments is handled correctly.

* Unicode categories for identifiers now match GHC, and the full Unicode character set is supported for things like
  prefix operator detection.

* Haddock comments have dedicated nodes now.

* The grammar now uses named precedences instead of closely replicating the GHC parser's productions.

* Different layouts are tracked and closed with their special cases considered.
  In particular, multi-way if now has layout.

* Fixed a CPP bug where a mid-line `#endif` would be a false positive.

* CPP only matches legal directives now.

* Parsing is generally more lenient than in GHC, also in the presence of errors:
  * Missing closing tokens at EOF are tolerated for:
    * CPP
    * Comments
    * TH quotations
  * Multiple semicolons are accepted in some positions, like `if/then`.
  * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions.

* List comprehensions can have multiple sets of qualifiers (`ParallelListComp`).

* Deriving clauses after GADTs don't require layout anymore.

* Newtype instance heads are working properly now.

* Escaping newlines in comments and CPP works now.
  Escaping newlines on regular lines won't be implemented.

* One remaining issue is that qualified left sections that contain infix ops are broken, e.g. `(a + a A.+)`.
  I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse
  application, infix and negation without lexing all qualified names in the scanner.
  I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work.
  For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing.

* The repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of
  Unicode categories, using bitmaps.
  I might need to change this to write them all to a shared file, so the set of source files stays the same.
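
For a rough sense of scale, the `test/libs` timings listed above under "Faster in most cases" can be turned into speedup ratios. The following is only an illustrative sketch in Python and not part of the repo; the names `old`, `new` and `speedups` are made up for this example, and the numbers are copied from the two listings.

```python
# Timings in milliseconds, copied from the "Old" and "New" listings in this
# commit message. All names below are hypothetical, for illustration only.
old = {
    "effects": 32,
    "postgrest": 91,
    "ivory": 224,
    "polysemy": 84,
    "semantic": 1336,
    "haskell-language-server": 532,
    "flatparse": 45,
}
new = {
    "effects": 29,
    "postgrest": 64,
    "ivory": 178,
    "polysemy": 70,
    "semantic": 692,
    "haskell-language-server": 390,
    "flatparse": 36,
}


def speedups(old, new):
    """Return the old/new ratio per codebase; above 1.0 means the new grammar is faster."""
    return {name: old[name] / new[name] for name in old}


ratios = speedups(old, new)

# Print the codebases ordered from largest to smallest improvement.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<26}{old[name]:>6}ms -> {new[name]:>5}ms  ({ratio:.2f}x)")

total_old, total_new = sum(old.values()), sum(new.values())
print(f"total: {total_old}ms -> {total_new}ms ({total_old / total_new:.2f}x)")
```

On these numbers, `semantic` improves the most (about 1.93x) and the summed time drops from 2344ms to 1459ms. As noted above, none of this should be taken too seriously.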
tek committed May 4, 2024
1 parent af32d88 commit 50a04bf
Showing 196 changed files with 80,632 additions and 941,252 deletions.
5 changes: 4 additions & 1 deletion .gitattributes
Original file line number Diff line number Diff line change
@@ -1,2 +1,5 @@
/src/** linguist-vendored
/examples/* linguist-vendored
/test/libs/* linguist-vendored
/src/parser.c -diff
/src/grammar.json -diff
/src/node-types.json -diff
33 changes: 33 additions & 0 deletions .github/workflows/assets.yml
@@ -0,0 +1,33 @@
name: Publish assets

on:
  workflow_run:
    workflows: [CI]
    types: [completed]
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    if: github.event.workflow_run.conclusion == 'success'
    permissions:
      contents: read
      id-token: write
    steps:
      - uses: actions/checkout@v4
      - uses: DeterminateSystems/nix-installer-action@main
      - uses: DeterminateSystems/magic-nix-cache-action@main

      - run: nix -L build .#parser-src
      - name: Upload parser sources
        uses: actions/upload-artifact@v4
        with:
          name: tree-sitter-haskell-src
          path: result/src

      - run: nix -L build .#parser-wasm
      - name: Upload wasm binary
        uses: actions/upload-artifact@v4
        with:
          name: tree-sitter-haskell-wasm
          path: result/tree-sitter-haskell.wasm
50 changes: 34 additions & 16 deletions .github/workflows/ci.yml
@@ -2,17 +2,16 @@ name: CI

on:
push:
branches:
- "**"
branches: [main]
tags: ['**']
pull_request:
types:
- opened
- synchronize
types: [opened, synchronize]

jobs:
test:
name: test / ${{ matrix.os }}
runs-on: ${{ matrix.os }}
name: test / ${{matrix.os}}
runs-on: ${{matrix.os}}
if: github.event.pull_request.merged == true || github.event.action != 'closed'
strategy:
fail-fast: false
matrix:
@@ -27,20 +26,39 @@ jobs:
with:
node-version: '18'

# - name: Install emscripten
# uses: mymindstorm/setup-emsdk@v10
# with:
# version: '2.0.24'
- name: Install emscripten
uses: mymindstorm/setup-emsdk@v14
with:
version: '3.1.47'

- name: Build tree-sitter-haskell
- name: Build dependencies
run: npm install

- name: Run tests
run: npm test

- name: Parse examples
run: npm run examples
- name: Parse libraries
run: npm run libs

- name: Parse libraries with wasm
run: npm run libs-wasm

- name: Run fuzzer
if: ${{matrix.os == 'ubuntu-latest'}}
uses: tree-sitter/fuzz-action@v4

# - name: Parse examples with web binding
# run: npm run examples-wasm
legacy:
permissions:
contents: write
id-token: write
needs: test
if: github.ref_type == 'tag'
uses: ./.github/workflows/legacy.yml

release:
permissions:
contents: read
id-token: write
needs: test
if: github.ref_type == 'tag'
uses: ./.github/workflows/release.yml
36 changes: 36 additions & 0 deletions .github/workflows/legacy.yml
@@ -0,0 +1,36 @@
name: Update legacy branch

on:
  workflow_call:

jobs:
  commit:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      id-token: write
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{github.ref}}
      - uses: actions/checkout@v4
        with:
          ref: master

      - name: Reset worktree to ${{github.ref_name}}
        run: |
          git restore --source=${{github.ref}} .
          git restore .gitignore
      - uses: DeterminateSystems/nix-installer-action@main
      - uses: DeterminateSystems/magic-nix-cache-action@main

      - name: Generate parser
        run: nix -L run .#gen-parser

      - name: Commit and push to legacy branch
        uses: actions-js/[email protected]
        with:
          github_token: ${{secrets.GITHUB_TOKEN}}
          message: "Legacy release ${{github.ref_name}}"
          branch: master
3 changes: 1 addition & 2 deletions .github/workflows/release.yml
@@ -1,8 +1,7 @@
name: Publish package

on:
push:
tags: ["*"]
workflow_call:

concurrency:
group: ${{github.workflow}}-${{github.ref}}
15 changes: 9 additions & 6 deletions .gitignore
@@ -1,10 +1,13 @@
node_modules
build
/src/parser.c
/dist-newstyle
/result
/test/libs/*
!/test/libs/.gitkeep
/build/
/target/
/.lib/
/node_modules/
*.log
repos
examples/*
!examples/.gitkeep
.gdb_history
*.o
*.so
/.build/
6 changes: 0 additions & 6 deletions .npmignore

This file was deleted.

38 changes: 13 additions & 25 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

23 changes: 20 additions & 3 deletions Cargo.toml
@@ -1,12 +1,11 @@
[package]
name = "tree-sitter-haskell"
description = "haskell grammar for the tree-sitter parsing library"
version = "0.16.0"
version = "1.0.0"
keywords = ["incremental", "parsing", "haskell"]
categories = ["parsing", "text-editors"]
repository = "https://github.com/tree-sitter/tree-sitter-haskell"
edition = "2018"
license = "MIT"
edition = "2021"

build = "bindings/rust/build.rs"
include = [
@@ -19,6 +18,24 @@ include = [
[lib]
path = "bindings/rust/lib.rs"

[[test]]
name = "parse-test"
path = "test/rust/parse-test.rs"

[[bin]]
name = "parse"
path = "test/rust/parse.rs"
test = false
bench = false
doc = false

[[bin]]
name = "show"
path = "test/rust/show.rs"
test = false
bench = false
doc = false

[dependencies]
tree-sitter = "0.20"
