Commit 01ddb3c

Merge RFC 3349: "Unicode and escape codes in literals"
The FCP for RFC 3349 completed on 2023-09-22. Let's merge it. Thanks and congratulations to its authors.
2 parents f1cfc16 + 62a7597 commit 01ddb3c

1 file changed

+128
-0
lines changed

text/3349-mixed-utf8-literals.md

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
- Feature Name: `mixed_utf8_literals`
- Start Date: 2022-11-15
- RFC PR: [rust-lang/rfcs#3349](https://github.com/rust-lang/rfcs/pull/3349)
- Tracking Issue: [rust-lang/rust#116907](https://github.com/rust-lang/rust/issues/116907)

# Summary
[summary]: #summary

Relax the restrictions on which characters and escape codes are allowed in string, char, byte string, and byte literals.

Most importantly, this means we accept the exact same characters and escape codes in `"…"` and `b"…"` literals. That is:

- Allow Unicode characters, including `\u{…}` escape codes, in byte string literals. E.g. `b"hello\xff我叫\u{1F980}"`
- Also allow non-ASCII `\x…` escape codes in regular string literals, as long as they are valid UTF-8. E.g. `"\xf0\x9f\xa6\x80"`
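
As a rough illustration of what the two example literals above would denote, the sketch below assembles the same byte sequences by hand, using only syntax that compilers without this feature already accept. The function names `mixed_byte_string` and `high_byte_string` are hypothetical and exist only for this example.

```rust
/// Bytes that `b"hello\xff我叫\u{1F980}"` would denote under this RFC,
/// assembled by hand: raw bytes are kept as-is, characters are UTF-8 encoded.
fn mixed_byte_string() -> Vec<u8> {
    let mut out = Vec::new();
    // ASCII characters followed by one high byte.
    out.extend_from_slice(b"hello\xff");
    // Literal non-ASCII characters, encoded as UTF-8 (3 bytes each).
    out.extend_from_slice("我叫".as_bytes());
    // A Unicode escape, encoded as UTF-8 (4 bytes).
    out.extend_from_slice("\u{1F980}".as_bytes());
    out
}

/// The string that `"\xf0\x9f\xa6\x80"` would denote: the four bytes are the
/// UTF-8 encoding of U+1F980, so UTF-8 validation succeeds.
fn high_byte_string() -> String {
    String::from_utf8(vec![0xf0, 0x9f, 0xa6, 0x80]).expect("valid UTF-8")
}
```

The first result is not valid UTF-8 as a whole (because of the `0xff` byte), which is fine for a byte string; the second compares equal to `"\u{1F980}"`.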

# Motivation
[motivation]: #motivation

Byte strings (`[u8]`) are a strict superset of regular (UTF-8) strings (`str`),
but Rust's byte string literals are currently not a superset of regular string literals:
they reject non-ASCII characters and `\u{…}` escape codes.

```
error: non-ASCII character in byte constant
 --> src/main.rs:2:16
  |
2 |     b"hello\xff你\u{597d}"
  |                ^^ byte constant must be ASCII
  |

error: unicode escape in byte string
 --> src/main.rs:2:17
  |
2 |     b"hello\xff你\u{597d}"
  |                  ^^^^^^^^ unicode escape in byte string
  |
```

This can be annoying when working with "conventionally UTF-8" strings, such as with the popular [`bstr` crate](https://docs.rs/bstr/latest/bstr/).
For example, right now, there is no convenient way to write a literal like `b"hello\xff你好"`.

Allowing all characters and all known escape codes in both types of string literals reduces the complexity of the language.
We'd no longer have [different escape codes](https://doc.rust-lang.org/reference/tokens.html#characters-and-strings)
for different literal types. We'd only require regular string literals to be valid UTF-8.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

Regular string literals (`""` and `r""`) must be valid UTF-8.
For example, valid strings are `"abc"`, `"🦀"`, `"\u{1F980}"` and `"\xf0\x9f\xa6\x80"`.
`"\xff"` is not valid, however, as that is not valid UTF-8.

Byte string literals (`b""` and `br""`) may include non-ASCII characters and Unicode escape codes (`\u{…}`), which will be encoded as UTF-8.

The `char` type does not store UTF-8, so while `'\u{1F980}'` is valid, trying to encode it in UTF-8 as in `'\xf0\x9f\xa6\x80'` is not accepted.
In a char literal (`''`), `\x` may only be used for values 0 through 0x7F.

Similarly, in a byte literal (`b''`), `\u` may only be used for values 0 through 0x7F, since those are the only code points that are unambiguously represented as a single byte.
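
The rules in this section can be approximated with a few small checks. This is only a sketch of the stated rules, not the compiler's implementation; all function names are hypothetical.

```rust
/// Whether a byte sequence would pass the validation applied to regular
/// string literals, i.e. whether it is valid UTF-8.
fn valid_string_bytes(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Whether `\x` with this value may appear in a char literal (`''`).
/// Only 0 through 0x7F is accepted, because a `char` holds one code point
/// rather than a sequence of UTF-8 bytes.
fn x_escape_ok_in_char(value: u8) -> bool {
    value <= 0x7F
}

/// Whether `\u{…}` with this code point may appear in a byte literal (`b''`).
/// Only 0 through 0x7F is accepted, because only those code points are
/// encoded as a single byte in UTF-8.
fn u_escape_ok_in_byte(code_point: u32) -> bool {
    code_point <= 0x7F
}

/// Number of bytes a character occupies once it is encoded as UTF-8,
/// as happens to characters inside byte string literals.
fn encoded_len(c: char) -> usize {
    c.len_utf8()
}
```

For instance, `valid_string_bytes(&[0xf0, 0x9f, 0xa6, 0x80])` holds while `valid_string_bytes(&[0xff])` does not, matching the `"\xf0\x9f\xa6\x80"` and `"\xff"` examples above.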

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

The ["characters and strings" section in the Rust Reference](https://doc.rust-lang.org/reference/tokens.html#characters-and-strings)
is updated with the following table:

|                 | Example     | Characters  | Escapes                   | Validation               |
|-----------------|-------------|-------------|---------------------------|--------------------------|
| Character       | 'H'         | All Unicode | ASCII, Unicode            | Valid Unicode code point |
| String          | "hello"     | All Unicode | ASCII, high byte, Unicode | Valid UTF-8              |
| Raw string      | r#"hello"#  | All Unicode | -                         | Valid UTF-8              |
| Byte            | b'H'        | All ASCII   | ASCII, high byte          | -                        |
| Byte string     | b"hello"    | All Unicode | ASCII, high byte, Unicode | -                        |
| Raw byte string | br#"hello"# | All Unicode | -                         | -                        |

With the following definitions for the escape codes:

- ASCII: `\'`, `\"`, `\n`, `\r`, `\t`, `\\`, `\0`, `\u{0}` through `\u{7F}`, `\x00` through `\x7F`
- Unicode: `\u{80}` and beyond.
- High byte: `\x80` through `\xFF`

Compared to before, the tokenizer should start accepting:
- Unicode characters in `b""` and `br""` literals (which will be encoded as UTF-8),
- all `\x` escapes in `""` literals,
- all `\u` escapes in `b""` literals (which will be encoded as UTF-8), and
- ASCII `\u` escapes in `b''` literals.

Regular string literals (`""`) are checked to be valid UTF-8 afterwards.
(Either during tokenization, or at a later point in time. See future possibilities.)
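
The "Escapes" column of the table and the escape code definitions above can be written down as a small lookup. This is a minimal sketch, not how the tokenizer is structured; the `Literal` and `Escape` types and all function names are hypothetical.

```rust
/// The six literal kinds from the table.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Literal { Char, Str, RawStr, Byte, ByteStr, RawByteStr }

/// The three escape code classes defined above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Escape { Ascii, HighByte, Unicode }

/// Classify a `\xNN` escape: `\x00` through `\x7F` is ASCII, the rest is high byte.
fn classify_x(value: u8) -> Escape {
    if value <= 0x7F { Escape::Ascii } else { Escape::HighByte }
}

/// Classify a `\u{…}` escape: `\u{0}` through `\u{7F}` is ASCII, the rest is Unicode.
fn classify_u(code_point: u32) -> Escape {
    if code_point <= 0x7F { Escape::Ascii } else { Escape::Unicode }
}

/// The "Escapes" column: is this class of escape accepted in this kind of literal?
fn escape_allowed(lit: Literal, esc: Escape) -> bool {
    match (lit, esc) {
        // Raw literals have no escapes at all.
        (Literal::RawStr | Literal::RawByteStr, _) => false,
        // Every non-raw literal accepts the ASCII escapes.
        (_, Escape::Ascii) => true,
        // A char holds one code point, so no high bytes.
        (Literal::Char, Escape::Unicode) => true,
        (Literal::Char, Escape::HighByte) => false,
        // A byte holds one byte, so no non-ASCII code points.
        (Literal::Byte, Escape::HighByte) => true,
        (Literal::Byte, Escape::Unicode) => false,
        // String and byte string literals accept everything.
        (Literal::Str | Literal::ByteStr, _) => true,
    }
}

/// The "Validation" column, restricted to the UTF-8 check.
fn needs_utf8_validation(lit: Literal) -> bool {
    matches!(lit, Literal::Str | Literal::RawStr)
}
```

For example, `escape_allowed(Literal::ByteStr, classify_u(0x1F980))` and `escape_allowed(Literal::Str, classify_x(0xf0))` both hold, which corresponds to the newly accepted cases listed above.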

# Drawbacks
[drawbacks]: #drawbacks

One might unintentionally write `\xf0` instead of `\u{f0}`.
However, for regular string literals that will result in an error in nearly all cases, since that's not valid UTF-8 by itself.
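
To make the difference concrete, the following sketch compares the bytes behind the two spellings. It uses only syntax that is already accepted today, and the function names are hypothetical.

```rust
/// Bytes that `"\u{f0}"` denotes: the two-byte UTF-8 encoding of U+00F0.
fn unicode_escape_bytes() -> Vec<u8> {
    "\u{f0}".as_bytes().to_vec()
}

/// Whether a lone 0xF0 byte, which is what `"\xf0"` would produce, is valid UTF-8.
/// 0xF0 starts a four-byte sequence, so on its own it is rejected.
fn lone_high_byte_is_valid() -> bool {
    std::str::from_utf8(&[0xf0]).is_ok()
}
```

The mistake only goes unnoticed when the bytes that follow happen to complete a valid UTF-8 sequence.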

# Alternatives
[alternatives]: #alternatives

- Only extend `b""` (that is, accept `b"🦀"`), but still do not accept non-ASCII `\x` in regular string literals (that is, keep rejecting `"\xf0\x9f\xa6\x80"`).

- Stabilize `concat_bytes!()` and require writing `b"hello\xff你好"` as `concat_bytes!(b"hello\xff", "你好")`.
  (Assuming we extend the macro to accept a mix of byte string literals and regular string literals.)

# Prior art
[prior-art]: #prior-art

- C and C++ do the same. (Assuming UTF-8 character set.)
- [The `bstr` crate](https://docs.rs/bstr/latest/bstr/)
- Python and JavaScript do it differently: `\xff` means `\u{ff}`, because their strings behave like UTF-32 or UTF-16 rather than UTF-8.
  (Also, Python's byte strings "accept" `\u` as just `'\\', 'u'`, without any warning or error.)

# Unresolved questions
[unresolved-questions]: #unresolved-questions

- Should `concat!("\xf0\x9f", "\xa6\x80")` work? (The string literals are not valid UTF-8 individually, but are valid UTF-8 after being concatenated.)

  (I don't care. I guess we should do whatever is easiest to implement.)
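
The situation in this question can be reproduced at the byte level with existing syntax. The sketch below only shows why the order of validation and concatenation matters; it says nothing about how `concat!` is or should be implemented, and the helper names are hypothetical.

```rust
/// Whether a byte sequence is valid UTF-8.
fn is_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Concatenate two byte sequences, as `concat!` would do with the
/// contents of its two arguments.
fn joined(a: &[u8], b: &[u8]) -> Vec<u8> {
    [a, b].concat()
}
```

With these helpers, `is_utf8(b"\xf0\x9f")` and `is_utf8(b"\xa6\x80")` are both false, while `is_utf8(&joined(b"\xf0\x9f", b"\xa6\x80"))` is true. Validating each literal before concatenation would reject the call; validating the result would accept it.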

# Future possibilities
[future-possibilities]: #future-possibilities

- Postpone the UTF-8 validation to a later stage, such that macros can accept literals with invalid UTF-8. E.g. `cstr!("\xff")`.

  - If we do that, we could also decide to accept _all_ escape codes, even unknown ones, to allow things like `some_macro!("\a\b\c")`.
    (The tokenizer would only need to know about `\"`.)

- Update the `concat!()` macro to accept `b""` strings and also not implicitly convert integers to strings, such that `concat!(b"", $x, b"\0")` becomes usable.
  (This would need to happen over an edition.)
