New keywords #27

Closed
stedolan
Contributor

@stedolan stedolan commented Jul 7, 2021

[RFC text copied below]


New keywords for OCaml

New language features require new syntax, and often the best new syntax involves one or more new keywords. However, adding keywords to a language brings backwards-compatibility concerns, as programs that use the keyword as an identifier stop working.

The aim of this RFC is to make it easier to add new keywords to OCaml. Specifically, two small changes are proposed to the lexical syntax:

  • a new class of "optional keywords", initially empty, which can be disabled by a command-line option or lexer directive. (If disabled, the keywords remain usable as identifiers.)

  • a new syntax of "raw identifiers", which allow words to be used as identifiers regardless of whether or not they are keywords.

With these changes, new keywords can be added as optional keywords, disabled by default. Old code works as normal, new code can opt in to the keyword, and identifiers colliding with the keyword can still be used as raw identifiers. Eventually, the new keyword can be enabled by default, but old code continues to work with an explicit compiler flag.

Goals and proposal

The main goal is to allow OCaml to be extended with new keywords in a backwards-compatible way. Specifically:

  1. Old code should still be usable, even if it uses identifiers that now collide with keywords.

  2. New and old code should mix, even if the new code needs to refer to an old identifier which is now a keyword.

The proposal is in two parts:

  1. Add compiler options -use-keyword foo, -no-keyword foo and lexer directives #use_keyword foo, #no_keyword foo to enable and disable an optional keyword foo.

    If foo is not recognised as an optional keyword, then -use-keyword foo and #use_keyword foo are errors, while -no-keyword foo and #no_keyword foo are silently ignored. (Silently ignoring these allows compatibility to be maintained before and after an optional keyword is introduced).

  2. Add a new syntax \#foo called a raw identifier. This syntax is equivalent to a plain identifier foo, except that \#foo is always an identifier even if foo is a keyword.

    This syntax can be used anywhere that foo can be used: ~\#foo is a labelled argument, `\#Foo is a polymorphic variant, '\#foo is a type variable, \#Foo is a module or constructor name, etc.

    (This feature is present in C# (@foo), Swift (`foo`) and Rust (r#foo). None of these syntaxes can be reused directly without conflicts, and the proposal here is closest to Rust's)

Part 1 of the proposal ensures that old code continues to work, although once a new keyword is enabled by default old code will need to use the -no-keyword foo flag to compile unmodified. Part 2 ensures that new code can refer to identifiers exported by old code, even if they are now keywords.
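
The intended behaviour of the four options in Part 1 can be modelled in a few lines. This is only a toy sketch of the rules stated above, not compiler code: the set `optional_keywords` and the function names `use_keyword` and `no_keyword` are hypothetical, chosen to mirror the flag names.

```ocaml
(* Toy model of the option semantics described above; not compiler code. *)
module S = Set.Make (String)

(* Words known as optional keywords (assumed for illustration only). *)
let optional_keywords = S.of_list [ "effect"; "macro" ]

(* -use-keyword foo / #use_keyword foo: an error if foo is not an
   optional keyword, otherwise foo becomes enabled. *)
let use_keyword enabled kw =
  if S.mem kw optional_keywords then Ok (S.add kw enabled)
  else Error (Printf.sprintf "unknown optional keyword: %s" kw)

(* -no-keyword foo / #no_keyword foo: silently ignored when foo is unknown,
   so the same flag works before and after foo is introduced. *)
let no_keyword enabled kw = S.remove kw enabled

let () =
  let enabled = S.empty in
  (match use_keyword enabled "effect" with
   | Ok e -> assert (S.mem "effect" e)
   | Error _ -> assert false);
  (match use_keyword enabled "frobnicate" with
   | Ok _ -> assert false
   | Error msg -> print_endline msg);
  (* Disabling an unknown keyword is a no-op rather than an error. *)
  assert (S.equal (no_keyword enabled "frobnicate") enabled)
```

The asymmetry is the point: a build script can pass -no-keyword foo unconditionally, whether or not the compiler in use already knows foo.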

Alternative approaches

There are many ways to extend a language. Here are some possibilities, all of which seem less preferable than the proposal above:

  • Just add keywords

    The traditional approach is to add keywords and accept some amount of breakage. The last time this occurred (adding nonrec in 4.02) it generated a long discussion on whether it is acceptable to add a new keyword, even one that will break no known code. For keyword proposals that are known to break code (e.g. effect, macro, implicit, unboxed), this seems unworkable.

  • Symbols instead of keywords

    Instead of adding keywords, it is possible to introduce new syntax using entirely non-alphabetic characters. However, it's hard to read and look up unfamiliar symbolic syntax, and the space of remaining options is small as the OCaml lexical syntax is quite crowded. (For an example of this, see RFC #10, which further overloads the # symbol in types, trying to keep this separate from the various current meanings of #).

  • Attributes instead of keywords

    The [@attribute] syntax can be used to add arbitrary annotations to the parse tree, and has already been used for several new language features, including immediate types, unboxed types and explicit tail calls.

    It has a couple of disadvantages: first, the syntax is noisier than and inconsistent with existing keyword-based syntax. For example, a single-field record declaration may be annotated as any of mutable, private or unboxed. Two of these are keywords, while one is spelled [@@unboxed], where the distinction is based mostly on the date that the feature was introduced.

    Secondly, since attributes are valid anywhere, subtle bugs are possible if they end up on the wrong parsetree node. For instance, type t [@@@immediate] is silently accepted and declares a non-immediate type: the extra @ in @@@immediate makes it a standalone annotation disconnected from type t, so that it gets parsed and ignored.

  • Contextual keywords

    Some languages (notably, C#) allow words to be used as identifiers yet be recognised as keywords in certain contexts, which provides a high degree of backwards compatibility at the expense of more complex parsing.

    However, there are two reasons why this approach is less effective in OCaml: first, OCaml distinguishes fewer contexts. In particular, the C# trick of making a word be a keyword in statement but not in expression context is not useful in a language that does not distinguish statements and expressions. Second, OCaml accepts a sequence of arbitrary space-separated identifiers as a function application, so it is harder to find a construction that does not already mean something.

  • Overloaded keywords

    It is tempting to reuse an existing keyword, by giving it a new meaning in a context in which it cannot currently be used. While it does preserve compatibility, this is mostly a bad idea: for instance, see the various confusing meanings of static in C. In particular, this sort of keyword reuse removes the ability to easily look up the meaning of some syntax, which is one of the main reasons to use keywords in the first place.

  • Multiple parsers

    Finally, we could ship multiple versions of the parser that accepted different editions of OCaml's syntax. This does have certain advantages, but has an unusually high maintenance cost, and seems undesirable on that basis alone.

@Lupus

Lupus commented Jul 7, 2021

I can't find the description of how Reason wants to approach backward-incompatible syntax changes, but in short: it expects an attribute with the syntax version at the top of the file. Files without this attribute are parsed at some fixed "current" version; when the attribute is present, the source is parsed according to the specified version. Also, refmt can upgrade a file from one version to another, automatically bumping the attribute to the latest version. /cc @jordwalke for details

Maybe OCaml can do something similar with the help of ocamlformat?

@stedolan
Contributor Author

stedolan commented Jul 7, 2021

I had a look at how Reason handles new keywords, which comes up e.g. when converting OCaml code to Reason syntax, as in OCaml switch is a valid variable name while it's a keyword in Reason.

Since reasonml/reason#1539, it's done by appending underscores: the OCaml code let switch = 1 is converted to the Reason code let switch_ = 1. This works fine, but I find it a bit more annoying than raw identifiers, as two different identifiers get used for the same thing in different files. (This is less of an issue if you convert everything at once, but an explicit goal here is interoperation)
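
The difference between the two renaming schemes can be sketched as two string transformations. The helper names below are hypothetical and the keyword list contains only the one word mentioned above; this is not how refmt is implemented.

```ocaml
(* Keywords of the target syntax; only the word from the example above. *)
let target_keywords = [ "switch" ]

let is_keyword name = List.mem name target_keywords

(* Underscore style: a colliding name gets a trailing underscore, so the
   same binding is spelled differently in converted and unconverted files. *)
let escape_underscore name = if is_keyword name then name ^ "_" else name

(* Raw-identifier style: the name itself is unchanged, only marked as raw. *)
let escape_raw name = if is_keyword name then "\\#" ^ name else name

let () =
  assert (escape_underscore "switch" = "switch_");
  assert (escape_raw "switch" = "\\#switch");
  assert (escape_raw "x" = "x")
```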

@let-def

let-def commented Jul 12, 2021

@stedolan Minor point: I don't think we need multiple parsers... The keyword table of lexer.mll can be dynamically populated based on the configuration. That's how Merlin has dealt with a few variants of OCaml language with a single grammar (including Meta OCaml and a few camlp4 extensions).
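
A minimal sketch of such a dynamically populated keyword table, assuming a single classification function that runs on each scanned word. The names `configure` and `classify` and the tiny base keyword list are hypothetical; the real table in lexer.mll is organised differently.

```ocaml
type token = Keyword of string | Ident of string

(* Keyword table filled at start-up from the configuration. *)
let keyword_table : (string, unit) Hashtbl.t = Hashtbl.create 16

let configure ~base ~optional_enabled =
  Hashtbl.reset keyword_table;
  List.iter
    (fun kw -> Hashtbl.replace keyword_table kw ())
    (base @ optional_enabled)

(* Classify one already-scanned word. A word written with the raw prefix \#
   is always an identifier; other words are looked up in the table. *)
let classify word =
  let n = String.length word in
  if n > 2 && String.sub word 0 2 = "\\#" then
    Ident (String.sub word 2 (n - 2))
  else if Hashtbl.mem keyword_table word then Keyword word
  else Ident word

let () =
  configure ~base:[ "let"; "in"; "fun" ] ~optional_enabled:[];
  assert (classify "effect" = Ident "effect");
  configure ~base:[ "let"; "in"; "fun" ] ~optional_enabled:[ "effect" ];
  assert (classify "effect" = Keyword "effect");
  assert (classify "\\#effect" = Ident "effect")
```

With this shape, one grammar serves every configuration: only the contents of the table change.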

@EduardoRFS

In the past I also discussed something like the syntax proposed here with @jordwalke for Reason. It would allow both syntaxes to reserve only the needed keywords while still making everything accessible.

@gasche
Member

gasche commented Apr 5, 2022

I believe that this is a problem that we need to address, and I like the two parts of @stedolan's proposal. Count me in favor of accepting this RFC.

@gasche
Member

gasche commented Apr 5, 2022

... but there has to be some bike-shedding. Is there a proposal for a "raw identifiers" syntax that uses delimiters, and not just a prefix marker? This could be a nice way to interoperate with other dialects or languages and reuse their identifiers, even if they use slightly different lexical conventions.

(Maybe \#{|...|} could work for this? If that's the best proposal we come up with, then it can be left for a later extension.)

@nojb

nojb commented Apr 5, 2022

> 2. (This feature is present in C# (@foo), Swift (`foo`) and Rust (r#foo). None of these syntaxes can be reused directly without conflicts, and the proposal here is closest to Rust's)

What's the problem with `foo`? As far as I can see the only (potential) conflict would be with polymorphic variants, but the single backquote token vs the raw identifier choice would be done in the lexer...

@stedolan
Contributor Author

stedolan commented Apr 5, 2022

> What's the problem with `foo`? As far as I can see the only (potential) conflict would be with polymorphic variants, but the single backquote token vs the raw identifier choice would be done in the lexer...

It's unlikely to be used in the wild, but currently `Foo` Bar is valid syntax, and means the same as `Foo (`Bar)
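
The conflict can be checked directly. A small sketch, assuming a compiler in which the backquote is still lexed as a separate token (as it was at the time of this discussion):

```ocaml
(* The backquote is its own token, so the second backquote on the first
   line does not close anything: it starts the tag `Bar. *)
let a = `Foo` Bar
let b = `Foo (`Bar)

let () = assert (a = b)
```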

@nojb

nojb commented Apr 5, 2022

> > What's the problem with `foo`? As far as I can see the only (potential) conflict would be with polymorphic variants, but the single backquote token vs the raw identifier choice would be done in the lexer...
>
> It's unlikely to be used in the wild, but currently `Foo` Bar is valid syntax, and means the same as `Foo (`Bar)

Right, I had missed that. Thanks.

@dra27
Member

dra27 commented Apr 5, 2022

Three thoughts:

  • We have the testing tools to see what the effect is, at least on opam-repository - `foo` looks much more raw-quotey to me than \#foo so maybe our last piece of breakage pain here would be to remove `Foo` Bar as valid syntax?!
  • -no-keyword could potentially need to be applied to a file multiple times (OCaml 6.0 adds foo and then 7.0 adds bar and it turns out my_amazing_module.ml written for OCaml 5.0 had both). I agree that shipping multiple parsers would be a pain, but really what's wanted is to say -keywords-of 5.0 as implying "add -no-keyword foo for every keyword added after 5.0"
  • A corollary of that second point is that it might be desirable to have a file level attribute to specify that, but I'm not sure that adding our own "shebang" or (lang ocaml) equivalent to every single OCaml source file is really the nicest way to go 🙂
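
The -keywords-of idea amounts to recording, for each keyword, the version that introduced it. A minimal sketch, where the table `introduced_in` and the function `disabled_for` are hypothetical and the entries are the made-up foo and bar from the example above:

```ocaml
(* Hypothetical record of the version in which each keyword appeared:
   foo in 6.0 and bar in 7.0, as in the example above. *)
let introduced_in = [ ("foo", (6, 0)); ("bar", (7, 0)) ]

(* Keywords to disable for a file written against [version]:
   every keyword introduced strictly after it. *)
let disabled_for version =
  List.filter_map
    (fun (kw, since) -> if compare since version > 0 then Some kw else None)
    introduced_in

let () =
  (* -keywords-of 5.0 behaves like -no-keyword foo -no-keyword bar *)
  assert (disabled_for (5, 0) = [ "foo"; "bar" ]);
  assert (disabled_for (6, 0) = [ "bar" ]);
  assert (disabled_for (7, 0) = [])
```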

@trefis

trefis commented Apr 5, 2022

> so maybe our last piece of breakage pain here would be to remove `Foo` Bar as valid syntax

On that note, a quick scan of opam packages didn't show any use of that "feature"; only string quoting in strings/comments.
(Though I might have messed up the regexp)

@fpottier

fpottier commented Apr 5, 2022

Overall, the proposal sounds very reasonable to me.

Regarding `Foo` Bar, I would suggest modifying the current lexer/parser so that ` Bar (with a space after the backtick) is no longer considered valid. This would open the way to allowing `foo` to mean something. (That said, `foo` in Haskell is used for infix function application. Maybe that is something we would like to have, too?)

@trefis

trefis commented Apr 5, 2022

> I would suggest modifying the current lexer/parser so that ` Bar (with a space after the backtick) is no longer considered valid.

That's not enough:

# `foo`bar;;
- : [> `foo of [> `bar ] ] = `foo `bar

@trefis

trefis commented Apr 5, 2022

Unless I misunderstood François's aim, actually … disregard my comment.

(PS: I like the general proposal)

@fpottier

fpottier commented Apr 5, 2022

The example you point out is interesting, and its meaning would change if `foo` was interpreted in a new way, but that would be acceptable (I think).

To clarify my proposal, I suggest that BACKQUOTE should no longer be a token. The combination BACKQUOTE ident (which today is recognized by the parser) should be recognized by the lexer as a token, without allowing a space in the middle. And the combination BACKQUOTE ident BACKQUOTE could also be recognized by the lexer as a different token, if desired.

@didierremy

didierremy commented Apr 5, 2022

> Regarding `Foo` Bar, I would suggest modifying the current lexer/parser so that ` Bar (with a space after the backtick) is no longer considered valid. This would open the way to allowing `foo` to mean something. (That said, `foo` in Haskell is used for infix function application. Maybe that is something we would like to have, too?)

I second François with the idea that `foo` looks nice and could perhaps be reserved for some more useful feature (such as infixes) than the rare uses of keywords as identifiers.

@nojb

nojb commented Apr 5, 2022

What about \foo for raw identifiers?

@gasche
Member

gasche commented Apr 5, 2022

My impression from this collective bikeshedding session is that the proposed \#bah syntax is short enough to be usable in practice, and ugly enough to not steal valuable syntactic space for a more commonly used feature of the future. Maybe we should stick with it for now :-)

@gasche
Member

gasche commented Apr 5, 2022

Except for this syntactic discussion, the most substantial change proposal for the RFC is @dra27's proposal to also have sets-of-keywords defined by their OCaml language version. (With a tasteful choice of semantics, we could eg. have -keywords-of 4.13 -use-keywords effect work to define a basis and extend on it, or -keywords-of 5.5 -no-keywords effect to restrict from it.)
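
One possible reading of "define a basis and extend on it" is to process the options left to right, with -keywords-of replacing the current set and the other two adjusting it. The sketch below assumes exactly that; the type `flag`, the function `resolve` and the two sample keyword sets are hypothetical.

```ocaml
module S = Set.Make (String)

(* Options in command-line order. The constructors mirror the flags
   discussed above; the basis sets are supplied by the caller. *)
type flag =
  | Keywords_of of S.t (* basis: replaces the current set *)
  | Use_keyword of string
  | No_keyword of string

let apply set = function
  | Keywords_of basis -> basis
  | Use_keyword kw -> S.add kw set
  | No_keyword kw -> S.remove kw set

let resolve ~default flags = List.fold_left apply default flags

let () =
  (* Hypothetical bases: effect is absent from the first, present in the second. *)
  let k_4_13 = S.of_list [ "let"; "nonrec" ] in
  let k_5_5 = S.add "effect" k_4_13 in
  let extended =
    resolve ~default:k_5_5 [ Keywords_of k_4_13; Use_keyword "effect" ]
  in
  let restricted =
    resolve ~default:k_4_13 [ Keywords_of k_5_5; No_keyword "effect" ]
  in
  assert (S.mem "effect" extended);
  assert (not (S.mem "effect" restricted))
```

Other orderings are conceivable (for example, always applying the basis first regardless of position); the fold above is only the simplest choice.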

@nojb

nojb commented Apr 5, 2022

> @dra27's proposal to also have sets-of-keywords defined by their OCaml language version

I don't have a strong opinion, but it feels a bit ad-hoc to version keywords while we don't have a good story about versioning other aspects of the language/toolchain (warnings, stdlib, CLI, ...). But again, not a strong opinion either way.

@gasche
Member

gasche commented Apr 5, 2022

Meh, the argument of not doing something reasonable on A because we are also bad at B and C is not so convincing.

@dra27
Member

dra27 commented Apr 5, 2022

A tiny argument to reduce the feeling of ad hoc-ness: versioning the CLI and the stdlib involves supporting multiple things, whereas noting the version in which a keyword was added is a single "fact" (i.e. it's fixed in the code). Something similar has just been done - or is at least proposed - for warnings in the documentation. For warnings, arguably we wouldn't ever want "Warnings of 5.0" - that can silence real warnings in old code (that's exactly why we don't make new warnings fatal) whereas the same wouldn't be true of old code with new keywords.

@nojb

nojb commented Apr 5, 2022

> Meh, the argument of not doing something reasonable on A because we are also bad at B and C is not so convincing.

I agree. I withdraw my reservation.

@gasche
Member

gasche commented Apr 8, 2022

My impression based on this discussion is that @stedolan's proposal is roughly consensual. I will keep oh-so-subtly pushing things towards a clear decision, in the hope of motivating @stedolan (with possibly some help on the parsing part?) to propose an implementation. (I don't think we should necessarily wait for a "formal approval" to start implementing things.)

@garrigue

I do agree on the basic idea (i.e. allow to change the set of keywords for backward compatibility, and offer raw identifiers to allow interaction between pieces of code using different sets of keywords).
That said, I would rather advocate doing keyword selection only by version number, to avoid ending up with diverging codebases (supposing that the set of keywords grows monotonically; there was one exception from OCaml 1 to 2).

Also, I don't like very much \# as a prefix for raw identifiers. I would rather avoid using a backslash. What about ``ident ?
I don't see any conflict, and it looks a little bit more palatable.

One could even add an extra backquote after to use as an infix: e1 ``plus` e2.

The introduction of raw identifiers also suggests that we could at last allow keywords as labels (who hasn't wished to use ~to or ~end as labels), but this is another subject.
I'm also not opposed to making the parsing of labels and variant tags lexer-based, even if it breaks some code: the style it would break was never approved, and would have been used only in a few exceptional cases.

@gadmm

gadmm commented May 20, 2022

In the past I have mentioned the concept of editions/epochs from Rust/C++, but since nobody has picked up on the suggestion so far I would like to mention it again. Knowing you @stedolan, you are probably well aware of this prior art, and you have probably thought about it while writing this proposal (I think that you are alluding to it). But still I wonder whether some of the ongoing discussion is not reinventing the wheel.

The documents that proposed epochs for Rust and C++ are good reads, they go through the motivations in depth and are written by people aware of the ramifications of BC issues. (see Rust, C++)

Rust editions were designed to let the language evolve while solving both problems of backwards-compatibility and avoiding ecosystem splits. They address the fragmentation issue by ensuring that everyone evolves towards the same set of "options" (echoing @garrigue's legitimate worries). They can also be useful as a common versioning scheme for some other aspects of the toolchain later on (cf. @nojb's natural concern). Crucially, there is a distinction between compiler versions and editions. Another important aspect which deviates from the current discussion is that editions are opt-in: you never have to "update" your build system configuration to tell a newer compiler that you want an earlier edition (this would be seen as a breakage).

In the case of Rust, together with the use of automated conversion tools, this opens the door to even more ambitious language changes (I mention it because such automated conversion has already been evoked for some proposed changes to OCaml, though I have not heard from it since).

Overall, I recognize in this proposal what could be building blocks for such a notion of epochs/editions (and maybe you already see it this way), especially:

  • the mechanism for referring to old identifiers in new code (when mixing editions) seems essential,
  • but I am less convinced by the current proposal about compiler options and directives.
  1. How is your viewpoint related to editions/epochs, and why is the proposal not "let's introduce a notion of editions in OCaml"?
  2. In particular, can you please expand on your comment about the challenge of "multiple parsers", which I am curious about (because it is the closest to the idea of epochs in your description), and on the benefits you claim for your current proposal in comparison, especially in light of @let-def's comment saying there might be a simpler solution to get the same benefits? It looks to me like it is worth digging in this direction.

As a disclaimer, I am familiar with backwards-compatibility implications but less familiar with the relevant parts of the compiler, so I do not have a well-informed opinion about implementation strategies (hence my questions).

@gasche
Member

gasche commented Jul 25, 2023

This PR has in effect been merged, as we merged ocaml/ocaml#12323, but only partially: we implemented the "raw token" syntax but not the shady compiler options. I am going to "close" rather than "merge" this, but I am not sure what to do.

@gasche
Member

gasche commented Oct 23, 2024

I find myself using this feature in situations that are not related to forward-compatibility or new keywords, but when I do want to define a variant name that is also a keyword. For example I'm defining a module Shape that exports a shape for various kinds of built-in OCaml types, so there is Shape.tuple (the shape of a tuple), Shape.array (the shape of an array), and on a whim I also added Shape.\#function (the shape of a function), Shape.\#lazy (the shape of a value of type 'a Lazy.t) and Shape.\#object. Before I had my own naming convention to deal with this situation (a natural variable name happens to be a keyword), which was to add a trailing underscore (so Shape.function_, Shape.lazy_ and Shape.object_), but it is only slightly more pleasing to the eye, and it has the downside of everyone having a different convention around this.
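
A minimal sketch of this use case, assuming a compiler that includes the raw identifier syntax from ocaml/ocaml#12323 (OCaml 5.2 or later, as far as I know). The type `t` and the function `describe` are hypothetical stand-ins; only the value names follow the description above.

```ocaml
(* Requires raw identifiers (assumed: OCaml 5.2 or later). *)
module Shape = struct
  type t = Tuple | Array | Function | Lazy | Object

  let tuple = Tuple
  let array = Array

  (* These three names are keywords, so they are written as raw identifiers. *)
  let \#function = Function
  let \#lazy = Lazy
  let \#object = Object
end

let describe = function
  | Shape.Tuple -> "tuple"
  | Shape.Array -> "array"
  | Shape.Function -> "function"
  | Shape.Lazy -> "lazy"
  | Shape.Object -> "object"

let () =
  assert (describe Shape.tuple = "tuple");
  assert (describe Shape.\#function = "function");
  assert (describe Shape.\#lazy = "lazy");
  print_endline (describe Shape.\#object)
```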
