New keywords #27

stedolan · 2021-07-07T09:06:13Z

[RFC text copied below]

New keywords for OCaml

New language features require new syntax, and often the best new syntax involves one or more new keywords. However, adding keywords to a language brings backwards-compatibility concerns, as programs that use the keyword as an identifier stop working.

The aim of this RFC is to make it easier to add new keywords to OCaml. Specifically, two small changes are proposed to the lexical syntax:

a new class of "optional keywords", initially empty, which can be disabled by a command-line option or lexer directive. (If disabled, the keywords remain usable as identifiers.)
a new syntax of "raw identifiers", which allow words to be used as identifiers regardless of whether or not they are keywords.

With these changes, new keywords can be added as optional keywords, disabled by default. Old code works as normal, new code can opt in to the keyword, and identifiers colliding with the keyword can still be used as raw identifiers. Eventually, the new keyword can be enabled by default, but old code continues to work with an explicit compiler flag.

Goals and proposal

The main goal is to allow OCaml to be extended with new keywords in a backwards-compatible way. Specifically:

Old code should still be usable, even if it uses identifiers that now collide with keywords.
New and old code should mix, even if the new code needs to refer to an old identifier which is now a keyword.

The proposal is in two parts:

1. Add compiler options -use-keyword foo, -no-keyword foo and lexer directives #use_keyword foo, #no_keyword foo to enable and disable an optional keyword foo.

   If foo is not recognised as an optional keyword, then -use-keyword foo and #use_keyword foo are errors, while -no-keyword foo and #no_keyword foo are silently ignored. (Silently ignoring these allows compatibility to be maintained before and after an optional keyword is introduced).

2. Add a new syntax \#foo called a raw identifier. This syntax is equivalent to a plain identifier foo, except that \#foo is always an identifier even if foo is a keyword.

   This syntax can be used anywhere that foo can be used: ~\#foo is a labelled argument, `\#Foo is a polymorphic variant, '\#foo is a type variable, \#Foo is a module or constructor name, etc.

(This feature is present in C# (@foo), Swift (`foo`) and Rust (r#foo). None of these syntaxes can be reused directly without conflicts, and the proposal here is closest to Rust's)

Part 1 of the proposal ensures that old code continues to work, although once a new keyword is enabled by default old code will need to use the -no-keyword foo flag to compile unmodified. Part 2 ensures that new code can refer to identifiers exported by old code, even if they are now keywords.
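
The two parts can be modelled in a few lines. The following is only a toy sketch under stated assumptions, not the compiler's lexer: the keyword lists, the flag type and the function names apply and classify are all invented for illustration. It shows the asymmetric handling of unknown names in Part 1 and the always-an-identifier rule of Part 2.

```ocaml
(* Toy model of the proposal; every name below is hypothetical. *)
module S = Set.Make (String)

let builtin_keywords = S.of_list [ "let"; "in"; "match"; "with"; "fun" ]

(* Assumed contents; the RFC proposes that this class starts out empty. *)
let optional_keywords = S.of_list [ "effect"; "macro" ]

type flag = Use_keyword of string | No_keyword of string

(* Apply one flag to the set of enabled optional keywords. An unknown name is
   an error for Use_keyword and is silently ignored for No_keyword. *)
let apply enabled = function
  | Use_keyword k when S.mem k optional_keywords -> Ok (S.add k enabled)
  | Use_keyword k -> Error ("unknown optional keyword: " ^ k)
  | No_keyword k -> Ok (S.remove k enabled)

type token = Keyword of string | Ident of string

(* Classify a word. A leading backslash-hash prefix forces the identifier
   reading, whatever the keyword configuration is. *)
let classify enabled word =
  let n = String.length word in
  if n > 2 && String.sub word 0 2 = "\\#" then Ident (String.sub word 2 (n - 2))
  else if S.mem word builtin_keywords || S.mem word enabled then Keyword word
  else Ident word
```

In this model, a word such as effect is an identifier until it is enabled, and its raw form stays an identifier afterwards, which is what lets old and new code refer to the same name.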

Alternative approaches

There are many ways to extend a language. Here are some possibilities, which seem less preferable to the proposal above:

Just add keywords

The traditional approach is to add keywords and accept some amount of breakage. The last time this occurred (adding nonrec in 4.02) it generated a long discussion on whether it is acceptable to add a new keyword, even one that will break no known code. For keyword proposals that are known to break code (e.g. effect, macro, implicit, unboxed), this seems unworkable.

Symbols instead of keywords

Instead of adding keywords, it is possible to introduce new syntax using entirely non-alphabetic characters. However, it's hard to read and look up unfamiliar symbolic syntax, and the space of remaining options is small as the OCaml lexical syntax is quite crowded. (For an example of this, see RFC #10, which further overloads the # symbol in types, trying to keep this separate from the various current meanings of #).

Attributes instead of keywords

The [@attribute] syntax can be used to add arbitrary annotations to the parse tree, and has already been used for several new language features, including immediate types, unboxed types and explicit tail calls.

It has a couple of disadvantages: first, the syntax is noisier than, and inconsistent with, existing keyword-based syntax. For example, a single-field record declaration may be annotated as any of mutable, private or unboxed. Two of these are keywords, while one is spelled [@@unboxed], where the distinction is based mostly on the date that the feature was introduced.

Secondly, since attributes are valid anywhere, subtle bugs are possible if they end up on the wrong parsetree node. For instance, type t [@@@immediate] is silently accepted and declares a non-immediate type: the extra @ in @@@immediate makes it a standalone annotation disconnected from type t, so that it gets parsed and ignored.

Contextual keywords

Some languages (notably, C#) allow words to be used as identifiers yet be recognised as keywords in certain contexts, which provides a high degree of backwards compatibility at the expense of more complex parsing.

However, there are two reasons why this approach is less effective in OCaml: first, OCaml distinguishes fewer contexts. In particular, the C# trick of making a word be a keyword in statement but not in expression context is not useful in a language that does not distinguish statements and expressions. Second, OCaml accepts a sequence of arbitrary space-separated identifiers as a function application, so it is harder to find a construction that does not already mean something.

Overloaded keywords

It is tempting to reuse an existing keyword, by giving it a new meaning in a context in which it cannot currently be used. While it does preserve compatibility, this is mostly a bad idea: for instance, see the various confusing meanings of static in C. In particular, this sort of keyword reuse removes the ability to easily look up the meaning of some syntax, which is one of the main reasons to use keywords in the first place.

Multiple parsers

Finally, we could ship multiple versions of the parser that accepted different editions of OCaml's syntax. This does have certain advantages, but has an unusually high maintenance cost, and seems undesirable on that basis alone.

Lupus · 2021-07-07T09:18:31Z

I can't find the description of how Reason wants to approach backward-incompatible syntax changes, but in short: it expects an attribute with the syntax version at the top of the file, parses files without this attribute at some fixed "current" version, and, when the attribute is present, parses the source according to the specified version. refmt can also upgrade a file from one version to another, automatically bumping the attribute to the latest version. /cc @jordwalke for details

Maybe OCaml can do something similar with the help of ocamlformat?

stedolan · 2021-07-07T14:30:40Z

I had a look at how Reason handles new keywords, which comes up e.g. when converting OCaml code to Reason syntax, as in OCaml switch is a valid variable name while it's a keyword in Reason.

Since reasonml/reason#1539, it's done by appending underscores: the OCaml code let switch = 1 is converted to the Reason code let switch_ = 1. This works fine, but I find it a bit more annoying than raw identifiers, as two different identifiers get used for the same thing in different files. (This is less of an issue if you convert everything at once, but an explicit goal here is interoperation)

let-def · 2021-07-12T11:27:10Z

@stedolan Minor point: I don't think we need multiple parsers... The keyword table of lexer.mll can be dynamically populated based on the configuration. That's how Merlin has dealt with a few variants of OCaml language with a single grammar (including Meta OCaml and a few camlp4 extensions).
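
A rough sketch of that idea, assuming a hash table that maps words to tokens and is refilled from the configuration before lexing starts. The token type and the names configure and token_of_word are hypothetical and much smaller than anything in the real lexer.mll.

```ocaml
(* Hypothetical token type; a real lexer has many more constructors. *)
type token = LET | IN | EFFECT | LIDENT of string

let keyword_table : (string, token) Hashtbl.t = Hashtbl.create 16

(* Fill the table from a configuration: the fixed keywords, plus whichever
   optional ones are switched on. *)
let configure ~enable_effect =
  Hashtbl.reset keyword_table;
  List.iter
    (fun (word, tok) -> Hashtbl.replace keyword_table word tok)
    [ ("let", LET); ("in", IN) ];
  if enable_effect then Hashtbl.replace keyword_table "effect" EFFECT

(* What a lexer action for a lowercase word would do: look the word up and
   fall back to a plain identifier. *)
let token_of_word word =
  match Hashtbl.find_opt keyword_table word with
  | Some tok -> tok
  | None -> LIDENT word
```

The grammar stays the same in both configurations; only the table contents differ, so a disabled keyword simply never reaches the parser as a keyword token.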

EduardoRFS · 2021-07-12T12:47:02Z

I have also discussed something like the syntax proposed here with @jordwalke in the past, for Reason. It would allow both syntaxes to reserve only the needed keywords while still making everything accessible.

gasche · 2022-04-05T09:26:40Z

I believe that this is a problem that we need to address, and I like the two parts of @stedolan's proposal. Count me in favor of accepting this RFC.

gasche · 2022-04-05T09:28:20Z

... but there has to be some bike-shedding. Is there a proposal for a "raw identifiers" syntax that uses delimiters, and not just a prefix marker? This could be a nice way to interoperate with other dialects or languages and reuse their identifiers, even if they use slightly different lexical conventions.

(Maybe \#{|...|} could work for this? If that's the best proposal we come up with, then it can be left for a later extension.)

nojb · 2022-04-05T09:36:47Z

> (This feature is present in C# (@foo), Swift (`foo`) and Rust (r#foo). None of these syntaxes can be reused directly without conflicts, and the proposal here is closest to Rust's)

What's the problem with `foo`? As far as I can see the only (potential) conflict would be with polymorphic variants, but the single backquote token vs the raw identifier choice would be done in the lexer...

stedolan · 2022-04-05T09:59:07Z

> What's the problem with `foo`? As far as I can see the only (potential) conflict would be with polymorphic variants, but the single backquote token vs the raw identifier choice would be done in the lexer...

It's unlikely to be used in the wild, but currently `Foo` Bar is valid syntax, and means the same as `Foo (`Bar)
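
A minimal check of that equivalence, assuming a compiler in which the backquote is still lexed as a token separate from the tag name (the behaviour described in this thread):

```ocaml
(* The backquote and the tag name are separate tokens, so whitespace between
   them is accepted, and the second tag becomes the argument of the first. *)
let a = `Foo` Bar
let b = `Foo (`Bar)

let () = assert (a = b)
```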

nojb · 2022-04-05T10:19:57Z

> > What's the problem with `foo`? As far as I can see the only (potential) conflict would be with polymorphic variants, but the single backquote token vs the raw identifier choice would be done in the lexer...
>
> It's unlikely to be used in the wild, but currently `Foo` Bar is valid syntax, and means the same as `Foo (`Bar)

Right, I had missed that. Thanks.

dra27 · 2022-04-05T11:07:20Z

Three thoughts:

We have the testing tools to see what the effect is, at least on opam-repository - `foo` looks much more raw-quotey to me than \#foo so maybe our last piece of breakage pain here would be to remove `Foo` Bar as valid syntax?!
-no-keyword could potentially need to be applied to a file multiple times (OCaml 6.0 adds foo and then 7.0 adds bar and it turns out my_amazing_module.ml written for OCaml 5.0 had both). I agree that shipping multiple parsers would be a pain, but really what's wanted is to say -keywords-of 5.0 as implying "add -no-keyword foo for every keyword added after 5.0"
A corollary of that second point is that it might be desirable to have a file level attribute to specify that, but I'm not sure that adding our own "shebang" or (lang ocaml) equivalent to every single OCaml source file is really the nicest way to go 🙂
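
The -keywords-of idea can be sketched as a lookup in a table recording when each keyword was introduced. Everything below is hypothetical: foo, bar and their versions are taken from the example above, and the names introduced and disabled_for are invented for illustration.

```ocaml
(* Hypothetical table: keyword name and the (major, minor) version that
   introduced it, following the foo/bar example. *)
let introduced = [ ("foo", (6, 0)); ("bar", (7, 0)) ]

(* Keywords to disable for a file written against [version]: every keyword
   introduced strictly later than that version. *)
let disabled_for version =
  List.filter_map
    (fun (kw, v) -> if compare v version > 0 then Some kw else None)
    introduced
```

With this table, disabled_for (5, 0) yields both foo and bar, so a single option covers a file that would otherwise need one -no-keyword flag per keyword.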

trefis · 2022-04-05T11:24:00Z

> so maybe our last piece of breakage pain here would be to remove `Foo` Bar as valid syntax

On that note, a quick scan of opam packages didn't show any use of that "feature"; only string quoting in strings/comments.
(Though I might have messed up the regexp)

fpottier · 2022-04-05T11:43:03Z

Overall, the proposal sounds very reasonable to me.

Regarding `Foo` Bar, I would suggest modifying the current lexer/parser so that ` Bar (with a space after the backtick) is no longer considered valid. This would open the way to allowing `foo` to mean something. (That said, `foo` in Haskell is used for infix function application. Maybe that is something we would like to have, too?)

trefis · 2022-04-05T11:52:19Z

> I would suggest modifying the current lexer/parser so that ` Bar (with a space after the backtick) is no longer considered valid.

That's not enough:

# `foo`bar;;
- : [> `foo of [> `bar ] ] = `foo `bar

trefis · 2022-04-05T11:54:01Z

Unless I misunderstood François's aim, actually … disregard my comment.

(PS: I like the general proposal)

fpottier · 2022-04-05T12:01:38Z

The example you point out is interesting, and its meaning would change if `foo` was interpreted in a new way, but that would be acceptable (I think).

To clarify my proposal, I suggest that BACKQUOTE should no longer be a token. The combination BACKQUOTE ident (which today is recognized by the parser) should be recognized by the lexer as a token, without allowing a space in the middle. And the combination BACKQUOTE ident BACKQUOTE could also be recognized by the lexer as a different token, if desired.
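
A hand-written sketch of that lexing rule, assuming a simplified ASCII-only identifier syntax. The token names TAG and QUOTED and the function scan_backquote are hypothetical, and a real implementation would live in the ocamllex rules rather than in a function like this.

```ocaml
type token =
  | TAG of string     (* backquote immediately followed by an identifier *)
  | QUOTED of string  (* backquote, identifier, closing backquote *)

let is_ident_start = function
  | 'a' .. 'z' | 'A' .. 'Z' | '_' -> true
  | _ -> false

let is_ident_char c =
  is_ident_start c || (match c with '0' .. '9' | '\'' -> true | _ -> false)

(* Scan the token starting at the backquote at position [i] of [s]. Returns
   the token and the position just after it, or None when the backquote is
   not immediately followed by an identifier (so a space is rejected). *)
let scan_backquote s i =
  let n = String.length s in
  let rec ident_end j =
    if j < n && is_ident_char s.[j] then ident_end (j + 1) else j
  in
  if i >= n || s.[i] <> '`' then None
  else if i + 1 >= n || not (is_ident_start s.[i + 1]) then None
  else
    let j = ident_end (i + 1) in
    let name = String.sub s (i + 1) (j - i - 1) in
    if j < n && s.[j] = '`' then Some (QUOTED name, j + 1)
    else Some (TAG name, j)
```

Under this rule the input `foo`bar starts with a QUOTED token followed by a plain identifier, which is the change of meaning mentioned above.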

didierremy · 2022-04-05T12:03:10Z

> Regarding `Foo` Bar, I would suggest modifying the current lexer/parser so that ` Bar (with a space after the backtick) is no longer considered valid. This would open the way to allowing `foo` to mean something. (That said, `foo` in Haskell is used for infix function application. Maybe that is something we would like to have, too?)

I second François with the idea that `foo` looks nice and could perhaps be reserved for some more useful feature (such as infixes) than the rare uses of keywords as identifiers.

nojb · 2022-04-05T13:50:03Z

What about \foo for raw identifiers?

gasche · 2022-04-05T14:13:53Z

My impression from this collective bikeshedding session is that the proposed \#bah syntax is short enough to be usable in practice, and ugly enough to not steal valuable syntactic space for a more commonly used feature of the future. Maybe we should stick with it for now :-)

gasche · 2022-04-05T14:17:37Z

Except for this syntactic discussion, the most substantial change proposal for the RFC is @dra27's proposal to also have sets-of-keywords defined by their OCaml language version. (With a tasteful choice of semantics, we could e.g. have -keywords-of 4.13 -use-keywords effect work to define a basis and extend on it, or -keywords-of 5.5 -no-keywords effect to restrict from it.)
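
One possible reading of "define a basis and extend on it" is a left-to-right fold over the options. This is a sketch of that reading only; the adjustment type, the function resolve and the contents of the base sets in the usage note are assumptions made for illustration.

```ocaml
module S = Set.Make (String)

type adjustment = Use of string | No of string

(* Start from the keyword set of a base version, then apply the adjustments
   from left to right. *)
let resolve ~base adjustments =
  List.fold_left
    (fun set -> function
      | Use k -> S.add k set
      | No k -> S.remove k set)
    base adjustments
```

For instance, with a hypothetical 4.13 base set that lacks effect, resolve ~base [Use "effect"] adds it, and with a hypothetical 5.5 base set that contains it, resolve ~base [No "effect"] removes it again.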

nojb · 2022-04-05T14:31:47Z

> @dra27's proposal to also have sets-of-keywords defined by their OCaml language version

I don't have a strong opinion, but it feels a bit ad-hoc to version keywords while we don't have a good story about versioning other aspects of the language/toolchain (warnings, stdlib, CLI, ...). But again, not a strong opinion either way.

gasche · 2022-04-05T14:46:25Z

Meh, the argument of not doing something reasonable on A because we are also bad at B and C is not so convincing.

dra27 · 2022-04-05T14:46:43Z

A tiny argument to reduce the feeling of ad hoc-ness: versioning the CLI and the stdlib involves supporting multiple things, whereas noting the version in which a keyword was added is a single "fact" (i.e. it's fixed in the code). Something similar has just been done - or is at least proposed - for warnings for the documentation. For warnings, arguably we wouldn't ever want "Warnings of 5.0" - that can silence real warnings in old code (that's just why we don't make new warnings fatal) whereas the same wouldn't be true of old code with new keywords.

nojb · 2022-04-05T14:52:10Z

> Meh, the argument of not doing something reasonable on A because we are also bad at B and C is not so convincing.

I agree. I withdraw my reservation.

gasche · 2022-04-08T13:11:44Z

My impression based on this discussion is that @stedolan's proposal is roughly consensual. I will keep oh-so-subtly pushing things towards a clear decision, in the hope of motivating @stedolan (with possibly some help on the parsing part?) to propose an implementation. (I don't think we should necessarily wait for a "formal approval" to start implementing things.)

garrigue · 2022-04-11T02:09:11Z

I do agree on the basic idea (i.e. allowing the set of keywords to be changed for backward compatibility, and offering raw identifiers to allow interaction between pieces of code using different sets of keywords).
That said, I would rather advocate doing keyword selection only by version number, to avoid ending up with diverging codebases (supposing that the set of keywords grows monotonically; there was one exception from ocaml 1 to 2).

Also, I don't like very much \# as a prefix for raw identifiers. I would rather avoid using a backslash. What about ``ident ?
I don't see any conflict, and it looks a little bit more palatable.

One could even add an extra backquote after to use as an infix: e1 ``plus` e2.

The introduction of raw identifiers also suggests that we could at last allow keywords as labels (who hasn't wished to use ~to or ~end as labels), but this is another subject.
I'm also not opposed to making the parsing of labels and variant tags lexer based, even if it breaks some code: such breaking style was never approved, and would have been used only in some random exceptional cases.

gadmm · 2022-05-20T13:12:08Z

In the past I have mentioned the concept of editions/epochs from Rust/C++, but since nobody has picked up on the suggestion so far I would like to mention it again. Knowing you @stedolan, you are probably well aware of this prior art, and you have probably thought about it while writing this proposal (I think that you are alluding to it). But still I wonder whether some of the ongoing discussion is not reinventing the wheel.

The documents that proposed epochs for Rust and C++ are good reads, they go through the motivations in depth and are written by people aware of the ramifications of BC issues. (see Rust, C++)

Rust editions were designed to let the language evolve while solving both problems of backwards-compatibility and avoiding ecosystem splits. They address the fragmentation issue by ensuring that everyone evolves towards the same set of "options" (echoing @garrigue's legitimate worries). They can also be useful as a common versioning scheme for some other aspects of the toolchain later on (cf. @nojb's natural concern). Crucially, there is a distinction between compiler versions and editions. Another important aspect which deviates from the current discussion is that editions are opt-in: you never have to "update" your build system configuration to tell a newer compiler that you want an earlier edition (this would be seen as a breakage).

In the case of Rust, together with the use of automated conversion tools, this opens the door to even more ambitious language changes (I mention it because such automated conversion has already been suggested for some proposed changes to OCaml, though I have not heard about it since).

Overall, I recognize in this proposal what could be building blocks for such a notion of epochs/editions (and maybe you already see it this way), especially:

the mechanism for referring to old identifiers in new code (when mixing editions) seems essential,
but I am less convinced by the current proposal about compiler options and directives.

How is your viewpoint related to editions/epochs, and why is the proposal not "let's introduce a notion of editions in OCaml"?
In particular, can you please expand on your comment about the challenge of "multiple parsers", which I am curious about (because it is the closest to the idea of epochs in your description), and the benefits you claim compared to your current proposal, especially in light of @let-def's comment saying there might be a simpler solution to get the same benefits. It looks to me like it is worth digging in this direction.

As a disclaimer, I am familiar with backwards-compatibility implications but less so with the relevant parts of the compiler, so I do not have a well-founded opinion about implementation strategies (hence my questions).

gasche · 2023-07-25T13:31:51Z

This PR has in effect been merged, as we merged ocaml/ocaml#12323, but only partially: we implemented the "raw token" syntax but not the shady compiler options. I am going to "close" rather than "merge" this, but I am not sure what to do.

gasche · 2024-10-23T21:27:07Z

I find myself using this feature in situations that are not related to forward-compatibility or new keywords, but when I do want to define a variant name that is also a keyword. For example I'm defining a module Shape that exports a shape for various kinds of built-in OCaml types, so there is Shape.tuple (the shape of a tuple), Shape.array (the shape of an array), and on a whim I also added Shape.\#function (the shape of a function), Shape.\#lazy (the shape of a value of type 'a Lazy.t) and Shape.\#object. Before I had my own naming convention to deal with this situation (a natural variable name happens to be a keyword), which was to add a trailing underscore (so Shape.function_, Shape.lazy_ and Shape.object_), but it is only slightly more pleasing to the eye, and it has the downside of everyone having a different convention around this.
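
A sketch of what such a module might look like, assuming a compiler that accepts the raw identifier syntax merged in ocaml/ocaml#12323. The type t and its constructors are hypothetical stand-ins, not the actual Shape module described above.

```ocaml
(* Requires a compiler with raw identifier support. The shape type below is
   a hypothetical stand-in used only to give the values something to hold. *)
module Shape = struct
  type t = Of_tuple | Of_array | Of_function | Of_lazy | Of_object

  let tuple = Of_tuple
  let array = Of_array

  (* These three names are keywords, so they are written as raw identifiers. *)
  let \#function = Of_function
  let \#lazy = Of_lazy
  let \#object = Of_object
end

let describe = function
  | Shape.Of_tuple -> "tuple"
  | Shape.Of_array -> "array"
  | Shape.Of_function -> "function"
  | Shape.Of_lazy -> "lazy"
  | Shape.Of_object -> "object"
```

A client then writes Shape.\#lazy next to Shape.array, without a per-project suffix convention such as lazy_.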

New keywords RFC

f0c0eac

stedolan mentioned this pull request May 10, 2022

Support 'raw identifier' syntax ocaml/ocaml#11252

Closed

Octachron mentioned this pull request Jan 9, 2023

MetaOCaml: reserved >. token ocaml/ocaml#10130

Closed

dra27 mentioned this pull request Jun 19, 2023

Add effect syntax ocaml/ocaml#12309

Merged

OlivierNicole mentioned this pull request Jun 23, 2023

Support 'raw identifier' syntax (new version) ocaml/ocaml#12323

Merged

gasche closed this Jul 25, 2023

Octachron mentioned this pull request Sep 24, 2024

Add a -keywords <version?+list> flag ocaml/ocaml#13471

Merged
