Skip to content

Commit

Permalink
Rewrite the grammar once again.
Browse files Browse the repository at this point in the history
* Parses the GHC codebase!

  I'm using a trimmed set of the source directories of the compiler and most core libraries in
  [this repo](https://github.com/tek/tsh-test-ghc).

  This used to break horribly in many files because explicit brace layouts weren't supported very well.

* Faster in most cases!
  Here are a few simple benchmarks to illustrate the difference, not to be taken _too_ seriously, using the test
  codebases in `test/libs`:

  Old:
  ```
  effects: 32ms
  postgrest: 91ms
  ivory: 224ms
  polysemy: 84ms
  semantic: 1336ms
  haskell-language-server: 532ms
  flatparse: 45ms
  ```

  New:
  ```
  effects: 29ms
  postgrest: 64ms
  ivory: 178ms
  polysemy: 70ms
  semantic: 692ms
  haskell-language-server: 390ms
  flatparse: 36ms
  ```

  GHC's `compiler` directory takes 3000ms, but is among the fastest repos for per-line and per-character times!
  To get more detailed info (including new codebases I added, consisting mostly of core libraries), run
  `test/parse-libs`.
  I also added an interface for running `hyperfine`, exposed as a Nix app – execute
  `nix run .#bench-libs -- stm mtl transformers` with the desired set of libraries in `test/libs` or
  `test/libs/tsh-test-ghc/libraries`.

* Smaller size of the shared object.

  `tree-sitter generate` produces a `haskell.so` with a size of 4.4MB for the old grammar, and 3.0MB for the new one.

* Significantly faster time to generate, and slightly faster build.

  On my machine, generation takes 9.34s vs 2.85s, and compiling takes 3.75s vs 3.33s.

* All terminals now have proper text nodes when possible, like the `.` in modules.
  Fixes #102, #107, #115 (partially?).

* Semicolons are now forced after newlines even if the current parse state doesn't allow them, to fail alternative
  interpretations in GLR conflicts that sometimes produced top-level expression splices for valid (and invalid) code.
  Fixes #89, #105, #111.

* Comments aren't pulled into preceding layouts anymore.
  Fixes #82, #109.
  (Can probably still be improved with a few heuristics for e.g. postfix haddock)

* Similarly, whitespace is kept out of layout-related nodes as much as possible.
  Fixes #74.

* Hashes can now be operators in all situations, without sacrificing unboxed tuples.
  Fixes #108.

* Expression quotes are now handled separately from quasiquotes and their contents parsed properly.
  Fixes #116.

* Explicit brace layouts are now handled correctly.
  Fixes #92.

* Function application with multiple block arguments is handled correctly.

* Unicode categories for identifiers now match GHC, and the full unicode character set is supported for things like
  prefix operator detection.

* Haddock comments have dedicated nodes now.

* Use named precedences instead of closely replicating the GHC parser's productions.

* Different layouts are tracked and closed with their special cases considered.
  In particular, multi-way if now has layout.

* Fixed CPP bug where mid-line `#endif` would be false positive.

* CPP only matches legal directives now.

* Generally more lenient parsing than GHC, and in the presence of errors:
  * Missing closing tokens at EOF are tolerated for:
    * CPP
    * Comment
    * TH Quotation
  * Multiple semicolons in some positions like `if/then`
  * Unboxed tuples and sums are allowed to have arbitrary numbers of filled positions

* List comprehensions can have multiple sets of qualifiers (`ParallelListComp`).

* Deriving clauses after GADTs don't require layout anymore.

* Newtype instance heads are working properly now.

* Escaping newlines in comments and cpp works now.
  Escaping newlines on regular lines won't be implemented.

* One remaining issue is that qualified left sections that contain infix ops are broken: `(a + a A.+)`
  I haven't managed to figure out a good strategy for this – my suspicion is that it's impossible to correctly parse
  application, infix and negation without lexing all qualified names in the scanner.
  I will try that out at some point, but for now I'm planning to just accept that this one thing doesn't work.
  For what it's worth, none of the codebases I use for testing contain this construct in a way that breaks parsing.

* Repo now includes a Haskell program that generates C code for classifying characters as belonging to some sets of
  Unicode categories, using bitmaps.
  I might need to change this to write them all to a shared file, so the set of source files stays the same.
  • Loading branch information
tek authored and amaanq committed May 4, 2024
1 parent af32d88 commit 7a3fae1
Show file tree
Hide file tree
Showing 188 changed files with 70,624 additions and 26,829 deletions.
5 changes: 4 additions & 1 deletion .gitattributes
Original file line number Diff line number Diff line change
@@ -1,2 +1,5 @@
/src/** linguist-vendored
/examples/* linguist-vendored
/test/libs/* linguist-vendored
/src/parser.c -diff
/src/grammar.json -diff
/src/node-types.json -diff
33 changes: 33 additions & 0 deletions .github/workflows/assets.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,33 @@
name: Publish assets

on:
workflow_run:
workflows: [CI]
types: [completed]
branches: [main]

jobs:
build:
runs-on: ubuntu-latest
if: github.event.workflow_run.conclusion == 'success'
permissions:
contents: read
id-token: write
steps:
- uses: actions/checkout@v4
- uses: DeterminateSystems/nix-installer-action@main
- uses: DeterminateSystems/magic-nix-cache-action@main

- run: nix -L build .#parser-src
- name: Upload parser sources
uses: actions/upload-artifact@v4
with:
name: tree-sitter-haskell-src
path: result/src

- run: nix -L build .#parser-wasm
- name: Upload wasm binary
uses: actions/upload-artifact@v4
with:
name: tree-sitter-haskell-wasm
path: result/tree-sitter-haskell.wasm
50 changes: 34 additions & 16 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -2,17 +2,16 @@ name: CI

on:
push:
branches:
- "**"
branches: [main]
tags: ['**']
pull_request:
types:
- opened
- synchronize
types: [opened, synchronize]

jobs:
test:
name: test / ${{ matrix.os }}
runs-on: ${{ matrix.os }}
name: test / ${{matrix.os}}
runs-on: ${{matrix.os}}
if: github.event.pull_request.merged == true || github.event.action != 'closed'
strategy:
fail-fast: false
matrix:
Expand All @@ -27,20 +26,39 @@ jobs:
with:
node-version: '18'

# - name: Install emscripten
# uses: mymindstorm/setup-emsdk@v10
# with:
# version: '2.0.24'
- name: Install emscripten
uses: mymindstorm/setup-emsdk@v14
with:
version: '3.1.47'

- name: Build tree-sitter-haskell
- name: Build dependencies
run: npm install

- name: Run tests
run: npm test

- name: Parse examples
run: npm run examples
- name: Parse libraries
run: npm run libs

- name: Parse libraries with wasm
run: npm run libs-wasm

- name: Run fuzzer
if: ${{matrix.os == 'ubuntu-latest'}}
uses: tree-sitter/fuzz-action@v4

# - name: Parse examples with web binding
# run: npm run examples-wasm
legacy:
permissions:
contents: write
id-token: write
needs: test
if: github.ref_type == 'tag'
uses: ./.github/workflows/legacy.yml

release:
permissions:
contents: read
id-token: write
needs: test
if: github.ref_type == 'tag'
uses: ./.github/workflows/release.yml
36 changes: 36 additions & 0 deletions .github/workflows/legacy.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
name: Update legacy branch

on:
workflow_call:

jobs:
commit:
runs-on: ubuntu-latest
permissions:
contents: write
id-token: write
steps:
- uses: actions/checkout@v4
with:
ref: ${{github.ref}}
- uses: actions/checkout@v4
with:
ref: master

- name: Reset worktree to ${{github.ref_name}}
run: |
git restore --source=${{github.ref}} .
git restore .gitignore
- uses: DeterminateSystems/nix-installer-action@main
- uses: DeterminateSystems/magic-nix-cache-action@main

- name: Generate parser
run: nix -L run .#gen-parser

- name: Commit and push to legacy branch
uses: actions-js/[email protected]
with:
github_token: ${{secrets.GITHUB_TOKEN}}
message: "Legacy release ${{github.ref_name}}"
branch: master
3 changes: 1 addition & 2 deletions .github/workflows/release.yml
Original file line number Diff line number Diff line change
@@ -1,8 +1,7 @@
name: Publish package

on:
push:
tags: ["*"]
workflow_call:

concurrency:
group: ${{github.workflow}}-${{github.ref}}
Expand Down
6 changes: 0 additions & 6 deletions .npmignore

This file was deleted.

153 changes: 108 additions & 45 deletions Makefile
Original file line number Diff line number Diff line change
@@ -1,47 +1,110 @@
WEB_TREE_SITTER_FILES=README.md package.json tree-sitter-web.d.ts tree-sitter.js tree-sitter.wasm
TREE_SITTER_VERSION=v0.20.1

all: node_modules/web-tree-sitter tree-sitter-haskell.wasm

# build parser.c
src/parser.c: grammar.js
npx tree-sitter generate

# build patched version of web-tree-sitter
node_modules/web-tree-sitter:
@rm -rf tmp/tree-sitter
@git clone \
-c advice.detachedHead=false --quiet \
--depth=1 --branch=$(TREE_SITTER_VERSION) \
https://github.com/tree-sitter/tree-sitter.git \
tmp/tree-sitter
@cp tree-sitter.patch tmp/tree-sitter/
@>/dev/null \
&& cd tmp/tree-sitter \
&& git apply tree-sitter.patch \
&& ./script/build-wasm --debug
@mkdir -p node_modules/web-tree-sitter
@cp tmp/tree-sitter/LICENSE node_modules/web-tree-sitter
@cp $(addprefix tmp/tree-sitter/lib/binding_web/,$(WEB_TREE_SITTER_FILES)) node_modules/web-tree-sitter
@rm -rf tmp/tree-sitter

# build web version of tree-sitter-haskell
# NOTE: requires patched version of web-tree-sitter
tree-sitter-haskell.wasm: src/parser.c src/scanner.c
npx tree-sitter build-wasm

CC := cc
OURCFLAGS := -shared -fPIC -g -O0 -I src
VERSION := 0.0.1

LANGUAGE_NAME := tree-sitter-haskell

# repository
SRC_DIR := src

PARSER_REPO_URL := $(shell git -C $(SRC_DIR) remote get-url origin 2>/dev/null)

ifeq ($(PARSER_URL),)
PARSER_URL := $(subst .git,,$(PARSER_REPO_URL))
ifeq ($(shell echo $(PARSER_URL) | grep '^[a-z][-+.0-9a-z]*://'),)
PARSER_URL := $(subst :,/,$(PARSER_URL))
PARSER_URL := $(subst git@,https://,$(PARSER_URL))
endif
endif

TS ?= tree-sitter

# ABI versioning
SONAME_MAJOR := $(word 1,$(subst ., ,$(VERSION)))
SONAME_MINOR := $(word 2,$(subst ., ,$(VERSION)))

# install directory layout
PREFIX ?= /usr/local
INCLUDEDIR ?= $(PREFIX)/include
LIBDIR ?= $(PREFIX)/lib
PCLIBDIR ?= $(LIBDIR)/pkgconfig

# object files
OBJS := $(patsubst %.c,%.o,$(wildcard $(SRC_DIR)/*.c))

# flags
ARFLAGS := rcs
override CFLAGS += -I$(SRC_DIR) -std=c11 -fPIC

# OS-specific bits
ifeq ($(OS),Windows_NT)
$(error "Windows is not supported")
else ifeq ($(shell uname),Darwin)
SOEXT = dylib
SOEXTVER_MAJOR = $(SONAME_MAJOR).dylib
SOEXTVER = $(SONAME_MAJOR).$(SONAME_MINOR).dylib
LINKSHARED := $(LINKSHARED)-dynamiclib -Wl,
ifneq ($(ADDITIONAL_LIBS),)
LINKSHARED := $(LINKSHARED)$(ADDITIONAL_LIBS),
endif
LINKSHARED := $(LINKSHARED)-install_name,$(LIBDIR)/lib$(LANGUAGE_NAME).$(SONAME_MAJOR).dylib,-rpath,@executable_path/../Frameworks
else
SOEXT = so
SOEXTVER_MAJOR = so.$(SONAME_MAJOR)
SOEXTVER = so.$(SONAME_MAJOR).$(SONAME_MINOR)
LINKSHARED := $(LINKSHARED)-shared -Wl,
ifneq ($(ADDITIONAL_LIBS),)
LINKSHARED := $(LINKSHARED)$(ADDITIONAL_LIBS)
endif
LINKSHARED := $(LINKSHARED)-soname,lib$(LANGUAGE_NAME).so.$(SONAME_MAJOR)
endif
ifneq ($(filter $(shell uname),FreeBSD NetBSD DragonFly),)
PCLIBDIR := $(PREFIX)/libdata/pkgconfig
endif

all: lib$(LANGUAGE_NAME).a lib$(LANGUAGE_NAME).$(SOEXT) $(LANGUAGE_NAME).pc

lib$(LANGUAGE_NAME).a: $(OBJS)
$(AR) $(ARFLAGS) $@ $^

lib$(LANGUAGE_NAME).$(SOEXT): $(OBJS)
$(CC) $(LDFLAGS) $(LINKSHARED) $^ $(LDLIBS) -o $@
ifneq ($(STRIP),)
$(STRIP) $@
endif

$(LANGUAGE_NAME).pc: bindings/c/$(LANGUAGE_NAME).pc.in
sed -e 's|@URL@|$(PARSER_URL)|' \
-e 's|@VERSION@|$(VERSION)|' \
-e 's|@LIBDIR@|$(LIBDIR)|' \
-e 's|@INCLUDEDIR@|$(INCLUDEDIR)|' \
-e 's|@REQUIRES@|$(REQUIRES)|' \
-e 's|@ADDITIONAL_LIBS@|$(ADDITIONAL_LIBS)|' \
-e 's|=$(PREFIX)|=$${prefix}|' \
-e 's|@PREFIX@|$(PREFIX)|' $< > $@

$(SRC_DIR)/parser.c: grammar.js
$(TS) generate --no-bindings

install: all
install -d '$(DESTDIR)$(INCLUDEDIR)'/tree_sitter '$(DESTDIR)$(PCLIBDIR)' '$(DESTDIR)$(LIBDIR)'
install -m644 bindings/c/$(LANGUAGE_NAME).h '$(DESTDIR)$(INCLUDEDIR)'/tree_sitter/$(LANGUAGE_NAME).h
install -m644 $(LANGUAGE_NAME).pc '$(DESTDIR)$(PCLIBDIR)'/$(LANGUAGE_NAME).pc
install -m644 lib$(LANGUAGE_NAME).a '$(DESTDIR)$(LIBDIR)'/lib$(LANGUAGE_NAME).a
install -m755 lib$(LANGUAGE_NAME).$(SOEXT) '$(DESTDIR)$(LIBDIR)'/lib$(LANGUAGE_NAME).$(SOEXTVER)
ln -sf lib$(LANGUAGE_NAME).$(SOEXTVER) '$(DESTDIR)$(LIBDIR)'/lib$(LANGUAGE_NAME).$(SOEXTVER_MAJOR)
ln -sf lib$(LANGUAGE_NAME).$(SOEXTVER_MAJOR) '$(DESTDIR)$(LIBDIR)'/lib$(LANGUAGE_NAME).$(SOEXT)

uninstall:
$(RM) '$(DESTDIR)$(LIBDIR)'/lib$(LANGUAGE_NAME).a \
'$(DESTDIR)$(LIBDIR)'/lib$(LANGUAGE_NAME).$(SOEXTVER) \
'$(DESTDIR)$(LIBDIR)'/lib$(LANGUAGE_NAME).$(SOEXTVER_MAJOR) \
'$(DESTDIR)$(LIBDIR)'/lib$(LANGUAGE_NAME).$(SOEXT) \
'$(DESTDIR)$(INCLUDEDIR)'/tree_sitter/$(LANGUAGE_NAME).h \
'$(DESTDIR)$(PCLIBDIR)'/$(LANGUAGE_NAME).pc

clean:
rm -f debug *.o *.a

debug.so: src/parser.c src/scanner.c
$(CC) $(OURCFLAGS) $(CFLAGS) -o parser.o src/parser.c
$(CC) $(OURCFLAGS) $(CFLAGS) -o scanner.o src/scanner.c
$(CC) $(OURCFLAGS) $(CFLAGS) -o debug.so $(PWD)/scanner.o $(PWD)/parser.o
@echo ""
@echo "-----------"
@echo ""
@echo "To use the debug build with tree-sitter on linux, run:"
@echo "cp debug.so $HOME/.cache/tree-sitter/lib/haskell.so"
$(RM) $(OBJS) $(LANGUAGE_NAME).pc lib$(LANGUAGE_NAME).a lib$(LANGUAGE_NAME).$(SOEXT)

test:
$(TS) test

.PHONY: all install uninstall clean test
5 changes: 4 additions & 1 deletion Package.swift
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,10 @@ let package = Package(
sources: [
"src/parser.c",
"src/scanner.c",
"src/unicode.h",
"src/id.h",
"src/space.h",
"src/symop.h",
"src/varid-start.h",
],
resources: [
.copy("queries")
Expand Down
Loading

0 comments on commit 7a3fae1

Please sign in to comment.