Rewrite the grammar once again.

* Parses the GHC codebase!
  I'm using a trimmed set of the source directories of the compiler and most core libraries in [this repo](https://github.com/tek/tsh-test-ghc).
  This used to break horribly in many files because explicit brace layouts weren't supported very well.

* Faster in most cases!
  Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test codebases in `test/libs`:

  Old:
  ```
  effects: 32ms
  postgrest: 91ms
  ivory: 224ms
  polysemy: 84ms
  semantic: 1336ms
  haskell-language-server: 532ms
  flatparse: 45ms
  ```

  New:
  ```
  effects: 29ms
  postgrest: 64ms
  ivory: 178ms
  polysemy: 70ms
  semantic: 692ms
  haskell-language-server: 390ms
  flatparse: 36ms
  ```

  GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times!
  To get more detailed info (including new codebases I added, consisting mostly of core libraries), run `test/parse-libs`.
  I also added an interface for running `hyperfine`, exposed as a Nix app: execute `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or `test/libs/tsh-test-ghc/libraries`.

* Smaller size of the shared object.
  `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one.

* Significantly faster time to generate, and slightly faster build.
  On my machine, generation takes 9.34s (old) vs 2.85s (new), and compiling takes 3.75s (old) vs 3.33s (new).

* All terminals now have proper text nodes when possible, like the `.` in modules.
  Fixes #102, #107, #115 (partially?).

* Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code.
  Fixes #89, #105, #111.

* Comments aren't pulled into preceding layouts anymore.
  Fixes #82, #109.
  (Can probably still be improved with a few heuristics for e.g. postfix haddock)

* Similarly, whitespace is kept out of layout-related nodes as much as possible.
  Fixes #74.

* Hashes can now be operators in all situations, without sacrificing unboxed tuples.
  Fixes #108.

* Expression quotes are now handled separately from quasiquotes and their contents parsed properly.
  Fixes #116.

* Explicit brace layouts are now handled correctly.
  Fixes #92.

* Function application with multiple block arguments is handled correctly.

* Unicode categories for identifiers now match GHC, and the full Unicode character set is supported for things like prefix operator detection.

* Haddock comments have dedicated nodes now.

* Use named precedences instead of closely replicating the GHC parser's productions.

* Different layouts are tracked and closed with their special cases considered.
  In particular, multi-way if now has layout.

* Fixed a CPP bug where a mid-line `#endif` was a false positive.

* CPP only matches legal directives now.

* Generally more lenient parsing than GHC, and in the presence of errors:
  * Missing closing tokens at EOF are tolerated for:
    * CPP
    * Comment
    * TH Quotation
  * Multiple semicolons in some positions like `if/then`
  * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions

* List comprehensions can have multiple sets of qualifiers (`ParallelListComp`).

* Deriving clauses after GADTs don't require layout anymore.

* Newtype instance heads are working properly now.

* Escaping newlines in comments and CPP works now.
  Escaping newlines on regular lines won't be implemented.

* One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)`
  I haven't managed to figure out a good strategy for this; my suspicion is that it's impossible to correctly parse application, infix and negation without lexing all qualified names in the scanner.
  I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work.
  For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing.

* Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of Unicode categories, using bitmaps.
  I might need to change this to write them all to a shared file, so the set of source files stays the same.
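
To illustrate the bitmap idea from the last bullet, here is a minimal Python sketch; it is not the repo's Haskell generator. The function names (`build_bitmap`, `in_bitmap`, `emit_c_array`), the one-bit-per-code-point layout, and the example category set are all assumptions made for illustration. The categories come from Python's `unicodedata`, whose Unicode version may differ from GHC's.

```python
import unicodedata


def build_bitmap(categories, limit=0x110000):
    """Build a bitmap with one bit per code point below `limit`.

    A bit is set when the code point's general category is in `categories`.
    (Hypothetical layout: bit `cp & 7` of byte `cp >> 3`.)
    """
    bitmap = bytearray((limit + 7) // 8)
    for cp in range(limit):
        if unicodedata.category(chr(cp)) in categories:
            bitmap[cp >> 3] |= 1 << (cp & 7)
    return bitmap


def in_bitmap(bitmap, cp):
    """Test membership with one index and one mask; out-of-range is False."""
    byte_index = cp >> 3
    if byte_index >= len(bitmap):
        return False
    return bool(bitmap[byte_index] & (1 << (cp & 7)))


def emit_c_array(name, bitmap):
    """Render the bitmap as a C array definition (illustrative output format)."""
    body = ", ".join(f"0x{b:02x}" for b in bitmap)
    return f"static const uint8_t {name}[{len(bitmap)}] = {{{body}}};"


if __name__ == "__main__":
    # Example category set (an assumption): uppercase and titlecase letters.
    upper = build_bitmap({"Lu", "Lt"}, limit=0x400)
    print(in_bitmap(upper, ord("A")), in_bitmap(upper, ord("a")))
    print(emit_c_array("bitmap_upper", upper)[:60], "...")
```

A lookup in the emitted C table would then be a single array index plus a bit mask, which is what makes this representation attractive for a scanner that classifies characters on a hot path.
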
tree-sitter · May 4, 2024 · 50a04bf
1 parent af32d88
Showing 196 changed files with 80,632 additions and 941,252 deletions.
diff --git a/.gitattributes b/.gitattributes
@@ -1,2 +1,5 @@
 /src/** linguist-vendored
-/examples/* linguist-vendored
+/test/libs/* linguist-vendored
+/src/parser.c -diff
+/src/grammar.json -diff
+/src/node-types.json -diff
diff --git a/.github/workflows/assets.yml b/.github/workflows/assets.yml
@@ -0,0 +1,33 @@
+name: Publish assets
+
+on:
+  workflow_run:
+    workflows: [CI]
+    types: [completed]
+    branches: [main]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    if: github.event.workflow_run.conclusion == 'success'
+    permissions:
+      contents: read
+      id-token: write
+    steps:
+      - uses: actions/checkout@v4
+      - uses: DeterminateSystems/nix-installer-action@main
+      - uses: DeterminateSystems/magic-nix-cache-action@main
+
+      - run: nix -L build .#parser-src
+      - name: Upload parser sources
+        uses: actions/upload-artifact@v4
+        with:
+          name: tree-sitter-haskell-src
+          path: result/src
+
+      - run: nix -L build .#parser-wasm
+      - name: Upload wasm binary
+        uses: actions/upload-artifact@v4
+        with:
+          name: tree-sitter-haskell-wasm
+          path: result/tree-sitter-haskell.wasm
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -2,17 +2,16 @@ name: CI
 
 on:
   push:
-    branches:
-      - "**"
+    branches: [main]
+    tags: ['**']
   pull_request:
-    types:
-      - opened
-      - synchronize
+    types: [opened, synchronize]
 
 jobs:
   test:
-    name: test / ${{ matrix.os }}
-    runs-on: ${{ matrix.os }}
+    name: test / ${{matrix.os}}
+    runs-on: ${{matrix.os}}
+    if: github.event.pull_request.merged == true || github.event.action != 'closed'
     strategy:
       fail-fast: false
       matrix:
@@ -27,20 +26,39 @@ jobs:
         with:
           node-version: '18'
 
-      # - name: Install emscripten
-      #   uses: mymindstorm/setup-emsdk@v10
-      #   with:
-      #     version: '2.0.24'
+      - name: Install emscripten
+        uses: mymindstorm/setup-emsdk@v14
+        with:
+          version: '3.1.47'
 
-      - name: Build tree-sitter-haskell
+      - name: Build dependencies
         run: npm install
 
       - name: Run tests
         run: npm test
 
-      - name: Parse examples
-        run: npm run examples
+      - name: Parse libraries
+        run: npm run libs
+
+      - name: Parse libraries with wasm
+        run: npm run libs-wasm
+
+      - name: Run fuzzer
+        if: ${{matrix.os == 'ubuntu-latest'}}
+        uses: tree-sitter/fuzz-action@v4
 
-      # - name: Parse examples with web binding
-      #   run: npm run examples-wasm
+  legacy:
+    permissions:
+      contents: write
+      id-token: write
+    needs: test
+    if: github.ref_type == 'tag'
+    uses: ./.github/workflows/legacy.yml
 
+  release:
+    permissions:
+      contents: read
+      id-token: write
+    needs: test
+    if: github.ref_type == 'tag'
+    uses: ./.github/workflows/release.yml
diff --git a/.github/workflows/legacy.yml b/.github/workflows/legacy.yml
@@ -0,0 +1,36 @@
+name: Update legacy branch
+
+on:
+  workflow_call:
+
+jobs:
+  commit:
+    runs-on: ubuntu-latest
+    permissions:
+      contents: write
+      id-token: write
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          ref: ${{github.ref}}
+      - uses: actions/checkout@v4
+        with:
+          ref: master
+
+      - name: Reset worktree to ${{github.ref_name}}
+        run: |
+          git restore --source=${{github.ref}} .
+          git restore .gitignore
+
+      - uses: DeterminateSystems/nix-installer-action@main
+      - uses: DeterminateSystems/magic-nix-cache-action@main
+
+      - name: Generate parser
+        run: nix -L run .#gen-parser
+
+      - name: Commit and push to legacy branch
+        uses: actions-js/[email protected]
+        with:
+          github_token: ${{secrets.GITHUB_TOKEN}}
+          message: "Legacy release ${{github.ref_name}}"
+          branch: master
diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml
@@ -1,8 +1,7 @@
 name: Publish package
 
 on:
-  push:
-    tags: ["*"]
+  workflow_call:
 
 concurrency:
   group: ${{github.workflow}}-${{github.ref}}

diff --git a/.gitignore b/.gitignore
@@ -1,10 +1,13 @@
-node_modules
-build
+/src/parser.c
+/dist-newstyle
+/result
+/test/libs/*
+!/test/libs/.gitkeep
+/build/
+/target/
+/.lib/
+/node_modules/
 *.log
-repos
-examples/*
-!examples/.gitkeep
 .gdb_history
 *.o
 *.so
-/.build/
diff --git a/.npmignore b/.npmignore
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,12 +1,11 @@
 [package]
 name = "tree-sitter-haskell"
 description = "haskell grammar for the tree-sitter parsing library"
-version = "0.16.0"
+version = "1.0.0"
 keywords = ["incremental", "parsing", "haskell"]
 categories = ["parsing", "text-editors"]
 repository = "https://github.com/tree-sitter/tree-sitter-haskell"
-edition = "2018"
-license = "MIT"
+edition = "2021"
 
 build = "bindings/rust/build.rs"
 include = [
@@ -19,6 +18,24 @@ include = [
 [lib]
 path = "bindings/rust/lib.rs"
 
+[[test]]
+name = "parse-test"
+path = "test/rust/parse-test.rs"
+
+[[bin]]
+name = "parse"
+path = "test/rust/parse.rs"
+test = false
+bench = false
+doc = false
+
+[[bin]]
+name = "show"
+path = "test/rust/show.rs"
+test = false
+bench = false
+doc = false
+
 [dependencies]
 tree-sitter = "0.20"