RFC: HIR #273

Boshen · 2023-04-10T11:21:55Z

Boshen
Apr 10, 2023
Maintainer

Background:

Our current AST is cumbersome to work with for compiler authors (me and future collaborators).

I want an HIR that enables easier implementations of a type checker, linter, transpiler, and minifier. It should be:

- concise
  - no duplicates such as ArrowFunctionExpression and FunctionExpression
  - no redundancies such as ParenthesizedExpression
  - no TypeScript
  - no JSX
  - single IterationStatement instead of for, while, and do while statements
  - some syntactic sugar removed
  - ... many more
- have easier access to scopes and symbols ❗
- easy to construct CFG
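
To make the "single IterationStatement" item concrete, here is a minimal sketch of how the three loop forms could collapse into one node. All type and field names (`AstLoop`, `IterationStatement`, `test_after_body`, `lower_loop`) are hypothetical and not taken from oxc, and expressions are stood in for by plain strings. It leaves out for-in and for-of, which would need their own treatment.

```rust
/// Hypothetical surface-AST loop forms (illustrative names, not oxc's).
#[derive(Debug, Clone, PartialEq)]
enum AstLoop {
    While { test: String, body: String },
    DoWhile { body: String, test: String },
    For {
        init: Option<String>,
        test: Option<String>,
        update: Option<String>,
        body: String,
    },
}

/// One hypothetical HIR node covering all three loop forms.
#[derive(Debug, Clone, PartialEq)]
struct IterationStatement {
    init: Option<String>,
    test: Option<String>,
    update: Option<String>,
    body: String,
    /// True for `do { ... } while (test)`: the body runs once before the first test.
    test_after_body: bool,
}

/// Lower any of the three AST loop forms into the single HIR node.
fn lower_loop(ast: AstLoop) -> IterationStatement {
    match ast {
        AstLoop::While { test, body } => IterationStatement {
            init: None,
            test: Some(test),
            update: None,
            body,
            test_after_body: false,
        },
        AstLoop::DoWhile { body, test } => IterationStatement {
            init: None,
            test: Some(test),
            update: None,
            body,
            test_after_body: true,
        },
        AstLoop::For { init, test, update, body } => IterationStatement {
            init,
            test,
            update,
            body,
            test_after_body: false,
        },
    }
}
```

With a shape like this, a later pass such as CFG construction would only need to handle one loop node, at the cost of checking the `test_after_body` flag.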

Boshen · 2023-04-10T11:32:36Z

Boshen
Apr 10, 2023
Maintainer Author

Question: How do we get sourcemaps to work? Do we need to add a NodeId to the AST in the parsing phase, as seen in rustc_ast?
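
One possible shape for this, purely as a sketch: the parser hands out an id per node and records the node's source span in a side table, and each HIR node keeps the id of the AST node it came from. `NodeId` here only borrows the name from the question above; `Span`, `SpanTable`, `alloc`, and `span_of` are hypothetical and do not describe oxc or rustc.

```rust
use std::collections::HashMap;

/// Hypothetical id handed out by the parser, one per AST node.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct NodeId(u32);

/// Byte range of a node in the original source text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span {
    start: u32,
    end: u32,
}

/// Side table that remembers where each node came from.
#[derive(Default)]
struct SpanTable {
    next_id: u32,
    spans: HashMap<NodeId, Span>,
}

impl SpanTable {
    /// Called by the parser whenever it finishes a node.
    fn alloc(&mut self, span: Span) -> NodeId {
        let id = NodeId(self.next_id);
        self.next_id += 1;
        self.spans.insert(id, span);
        id
    }

    /// Called by codegen to emit a sourcemap entry for a lowered node.
    fn span_of(&self, id: NodeId) -> Option<Span> {
        self.spans.get(&id).copied()
    }
}
```

The trade-off is an extra lookup per mapping in exchange for HIR nodes that do not need to carry spans themselves.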

0 replies

Boshen · 2023-04-10T11:44:37Z

Boshen
Apr 10, 2023
Maintainer Author

Question: Do we need another allocator for the HIR?

0 replies

Boshen · 2023-04-10T11:49:38Z

Boshen
Apr 10, 2023
Maintainer Author

Question: At what point do we start resolving imports and exports and building module graphs? The memory arena gets in the way if we want to start sharing data.

0 replies

marvinhagemeister · 2023-04-21T07:47:25Z

marvinhagemeister
Apr 21, 2023

I'm really excited about this discussion, as it's a topic I'm very interested in, although I don't have much experience with Rust. I wanted to add a couple of thoughts and opinions gained from working on JS tooling over the years, in the hope that they might be helpful.

Symbol table

I've been dabbling a lot with various bundling and minification techniques for JS with some homegrown JS-based compilers. A limitation that I ran into with current tools is that they're mostly focused on file-based optimization techniques. Babel, for example, operates on every file individually and has no concept of the program as a whole. But things like tree shaking rely on knowledge of the whole program to be able to drop unused functions.

This matters for AST design, because in babel an Identifier node stores its name directly on the node:

{
   "type": "Identifier",
   "name": "fooBar",
  // ...
}

Now picture two files with clashing variable names. We want to bundle them into one file, as is typically done for frontend code.

// a.js
const a = "I'm file A";
console.log(a);

// a2.js
const a = "I'm the other A file";
console.log(a);

When merged together you have to rename one of the variables.

// merged.js
const a = "I'm file A";
console.log(a);

const a_2 = "I'm the other A file";
console.log(a_2);

But where does this renaming happen? Merging files is a very common operation in bundlers, and this is where a symbol table makes the process a lot easier. Taking inspiration from esbuild, an elegant solution is to think of identifier names as merely some sort of global ids. That global id can then be used to retrieve the symbol name, which is stored somewhere else. One way to create a global id would be to weave in the absolute file path, for example.

{
   "type": "Identifier",
   "name": SOME_GLOBAL_UNIQUE_ID,
  // ...
}

// Symbol table pseudo code (id -> actual name)
{
  SOME_GLOBAL_UNIQUE_ID: "a"
}

With this little change the process of bundling gets a lot easier, as the files can be concatenated as is. Once merged, we only have to walk the scope tree and look up names in the symbol table. If we come across a name that we've already used in the current scope, we can simply change it. We have to do that final pass anyway with a minifier, because reusing the same identifier names compresses better.

Basically instead of this:

Rename identifiers -> Merge files -> Rename identifiers (optionally minify)

we only need to do this with a symbol table:

Merge files -> Rename identifiers (optionally minify)

We save one rename pass which can be quite a bit of work in large projects with 20k modules.
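
The idea above can be sketched roughly as follows. This is only an illustration under simplifying assumptions: a single flat scope, ids that are plain indices rather than path-derived global ids, and hypothetical names throughout (`SymbolId`, `SymbolTable`, `declare`, `resolve_clashes`). It is not how esbuild or Oxc implement it.

```rust
use std::collections::HashSet;

/// Hypothetical globally unique id that identifier nodes store instead of a name.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct SymbolId(u32);

/// Maps each id to the name that will eventually be printed.
#[derive(Default)]
struct SymbolTable {
    names: Vec<String>, // index = SymbolId
}

impl SymbolTable {
    /// Register a declaration and get back its unique id.
    fn declare(&mut self, name: &str) -> SymbolId {
        let id = SymbolId(self.names.len() as u32);
        self.names.push(name.to_string());
        id
    }

    /// Look up the current name of a symbol.
    fn name(&self, id: SymbolId) -> &str {
        &self.names[id.0 as usize]
    }

    /// The single rename pass: walk the symbols of one merged scope and
    /// give every clashing name a numeric suffix.
    fn resolve_clashes(&mut self, scope: &[SymbolId]) {
        let mut used: HashSet<String> = HashSet::new();
        for &id in scope {
            let original = self.names[id.0 as usize].clone();
            let mut candidate = original.clone();
            let mut n = 2;
            // `insert` returns false when the name is already taken.
            while !used.insert(candidate.clone()) {
                candidate = format!("{}_{}", original, n);
                n += 1;
            }
            self.names[id.0 as usize] = candidate;
        }
    }
}
```

Because identifier nodes only hold ids, the merge step itself never touches names. Only this final pass does, and a minifier could swap in short names in the same place.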

`ArrowFunctionExpression` vs `FunctionExpression`

I'm assuming that you want to merge them and move the distinction to another field or something. Distinguishing the two is still important due to the differences in the this binding and if you're planning to support downtranspilation in the future.

`ParenthesizedExpression`

Agree, can be dropped as it's only for formatting purposes.

`TypeScript`

For code transformation it's fine to drop it. For a linter it might be worthwhile to support this as some projects have rules on the length of TS generic names and stuff like that.

`JSX`

It's usually the first thing that gets transformed, but might be useful to support in the case of linting.

Anyway, these are just some random thoughts of mine. I hope some bits are useful.

2 replies

Boshen Apr 21, 2023
Maintainer Author

Symbol table

Oxc has a symbol table, and name mangling is done on top of it with very little code: https://github.com/Boshen/oxc/pull/285/files#diff-b32d26ccbfc81c987fbaae7f45d56714844542a20523049d483358ebc1cada83

Which means the process Merge files -> Rename identifiers (optionally minify) is already built into the Oxc compiler. Except it's not merging files any more, it will be merging ASTs (the HIR in this case, to be precise).

TypeScript / JSX

If the HIR happens, we'll get two linter passes, one for the AST and one for the HIR. This is the same infrastructure that clippy uses.

marvinhagemeister Apr 21, 2023

Nice, that's perfect! Apologies, I should've looked at the code prior to writing my comment. Happy to see that those things are already considered 👍

HerringtonDarkholme · 2023-04-25T07:12:21Z

HerringtonDarkholme
Apr 25, 2023

I have concerns about how much we can abstract away when going from the AST to the HIR. Take the examples in the OP:

no duplicates such as ArrowFunctionExpression and FunctionExpression

ArrowFunctionExpression has neither its own this nor arguments. We need to encode that distinction in the HIR anyway.

no redundancies such as ParenthesizedExpression

Consider the typical output pattern that webpack uses extensively to manipulate this:

(0, eval)(actualObject) // to use indirect eval
(0, console.log)(a) // avoid this

This will force HIR nodes to encode their this value properly in their fields, which makes the HIR no simpler than the AST.

2 replies

Boshen Apr 25, 2023
Maintainer Author

ArrowFunctionExpression and FunctionExpression are two different enum types. They can be put into one type with a tag, so it's easier for people to ask for functions. I have so many places where I need to explicitly extract the function from these enums, which is tedious.
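
A rough sketch of what "one type with a tag" could look like. The names (`FunctionKind`, `Function`, `has_own_this`, `as_function`) are hypothetical rather than oxc's actual types, and parameters and bodies are simplified to strings.

```rust
/// Tag that keeps the arrow / regular distinction after merging the node types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FunctionKind {
    Regular,
    Arrow,
}

/// Single hypothetical function node for both syntactic forms.
#[derive(Debug, Clone, PartialEq)]
struct Function {
    kind: FunctionKind,
    params: Vec<String>,
    body: String,
}

impl Function {
    /// Arrow functions take `this` and `arguments` from the enclosing scope.
    fn has_own_this(&self) -> bool {
        self.kind == FunctionKind::Regular
    }
}

#[derive(Debug, Clone, PartialEq)]
enum Expression {
    Function(Function),
    Identifier(String),
}

impl Expression {
    /// One place to ask "is this a function?", whatever its surface syntax was.
    fn as_function(&self) -> Option<&Function> {
        match self {
            Expression::Function(f) => Some(f),
            _ => None,
        }
    }
}
```

Callers that only care about "a function" match one variant, while passes that care about the this binding can still branch on `kind`.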

Currently we have to continuously call expr.innerExpression to unwrap all parentheses.

Boshen Apr 25, 2023
Maintainer Author

(0, eval)(actualObject) // to use indirect eval
(0, console.log)(a) // avoid this

These are not ParenthesizedExpressions? The (x, y) part is a SequenceExpression.
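
To illustrate the point, here is a minimal sketch with a hypothetical `Expr` enum (not oxc's AST) in which stripping parentheses leaves the sequence expression in the callee position untouched:

```rust
/// Hypothetical, heavily simplified expression tree.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Number(f64),
    Identifier(String),
    Parenthesized(Box<Expr>),
    Sequence(Vec<Expr>),
    Call { callee: Box<Expr>, args: Vec<Expr> },
}

/// Remove every Parenthesized wrapper while keeping all other nodes.
fn strip_parens(expr: Expr) -> Expr {
    match expr {
        Expr::Parenthesized(inner) => strip_parens(*inner),
        Expr::Sequence(items) => {
            Expr::Sequence(items.into_iter().map(strip_parens).collect())
        }
        Expr::Call { callee, args } => Expr::Call {
            callee: Box::new(strip_parens(*callee)),
            args: args.into_iter().map(strip_parens).collect(),
        },
        other => other,
    }
}
```

For `(0, eval)(x)`, the callee after stripping is still `Sequence([0, eval])` rather than a bare `eval` identifier, so a later pass can still tell the indirect call apart from a direct one.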

Conaclos · 2023-04-25T11:39:30Z

Conaclos
Apr 25, 2023

no TypeScript

Does this mean that you remove all type info?

1 reply

Boshen Apr 25, 2023
Maintainer Author

A few points:

- The AST can still be used for linting TypeScript
- If we were ever going to implement some kind of type checking, a cool approach is to build another IR for type inference
- The end goal of the HIR is to simplify the implementation of the minifier

Boshen · 2023-09-15T15:50:31Z

Boshen
Sep 15, 2023
Maintainer Author

I decided to remove the HIR from oxc. #917

HIR is a wonderful idea for compiling to lower-level languages, but after sitting on it for a few months I found that it only adds confusion and uncertainty for both myself and future contributors.

It also adds too much burden to maintainers if we plan to support more downstream tools.

2 replies

hyf0 Sep 15, 2023

I think that HIR could be a very good idea for a type checker. It should be able to provide better DX than AST, because it could attach much more info than pure AST.

Type checking is very complex, so it's reasonable to have a special AST form, which is HIR, for it.

For things like the minifier and transpiler, I don't doubt that the HIR would provide a better DX than the AST, but the AST seems good enough to work with.

HerringtonDarkholme Sep 15, 2023

I'm glad to see the HIR removed, and I believe this is the right path.
The HIR is another layer of abstraction, so it is not convenient for source-code-aware tasks like linting or formatting.

Type checking probably also needs source knowledge, such as error ranges and ts-ignore comments.

The minifier might be the best use of the HIR. But I wonder whether it is worth it.
