Commit 873890e

Merge pull request #3348 from m-ou-se/c-str-literal
RFC: `c"…"` string literals
2 parents c85eef1 + 2196c96 commit 873890e

1 file changed

+136
-0
lines changed

text/3348-c-str-literal.md

@@ -0,0 +1,136 @@
1+
- Feature Name: `c_str_literal`
2+
- Start Date: 2022-11-15
3+
- RFC PR: [rust-lang/rfcs#3348](https://github.com/rust-lang/rfcs/pull/3348)
4+
- Rust Issue: [rust-lang/rust#105723](https://github.com/rust-lang/rust/issues/105723)
5+
6+
# Summary
7+
[summary]: #summary
8+
9+
`c"…"` string literals.
10+
11+
# Motivation
12+
[motivation]: #motivation
13+
14+
Looking at the [number of `cstr!()` invocations just on GitHub](https://cs.github.com/?scopeName=All+repos&scope=&q=cstr%21+lang%3Arust) (about 3.2k files with matches), it seems like C string literals
15+
are a widely used feature. Implementing `cstr!()` as a `macro_rules` or `proc_macro` requires non-trivial code to get it completely right (e.g. refusing embedded nul bytes),
16+
and is still less flexible than it should be (e.g. in terms of accepted escape codes).
17+
18+
In Rust 2021, we reserved prefixes for (string) literals, so let's make use of that.
19+
20+
# Guide-level explanation
21+
[guide-level-explanation]: #guide-level-explanation
22+
23+
`c"abc"` is a [`&CStr`](https://doc.rust-lang.org/stable/core/ffi/struct.CStr.html): the bytes of the literal are stored in memory with a nul byte (`b'\0'`) appended.
24+
25+
All escape codes and characters accepted by `""` and `b""` literals are accepted, except nul bytes.
26+
So, both UTF-8 and non-UTF-8 data can co-exist in a C string. E.g. `c"hello\x80我叫\u{1F980}"`.
27+
28+
The raw string literal variant is prefixed with `cr`. For example, `cr"\"` and `cr##"Hello "world"!"##`. (Just like `r""` and `br""`.)
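
As a rough illustration of the literals described above, the following sketch assumes a toolchain that implements this RFC (stable Rust 1.77 or later, 2021 edition). The function names are hypothetical and exist only for this example.

```rust
use std::ffi::CStr;

/// Returns the bytes of a `c"…"` literal, including the appended nul byte.
fn literal_bytes() -> &'static [u8] {
    let s: &'static CStr = c"abc";
    s.to_bytes_with_nul()
}

/// UTF-8 and non-UTF-8 data in one literal (the example from the text above).
fn mixed_literal() -> &'static CStr {
    c"hello\x80我叫\u{1F980}"
}

/// Raw variants: no escape processing, just like `r""` and `br""`.
fn raw_literals() -> (&'static CStr, &'static CStr) {
    (cr"\", cr##"Hello "world"!"##)
}
```

Because `\x80` is not valid UTF-8 on its own, `mixed_literal().to_str()` returns an error, while `to_bytes()` still exposes the raw bytes.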
29+
30+
# Reference-level explanation
31+
[reference-level-explanation]: #reference-level-explanation
32+
33+
Two new [string literal types](https://doc.rust-lang.org/reference/tokens.html#characters-and-strings): `c"…"` and `cr#"…"#`.
34+
35+
Accepted escape codes: [Quote](https://doc.rust-lang.org/reference/tokens.html#quote-escapes) & [Unicode](https://doc.rust-lang.org/reference/tokens.html#unicode-escapes) & [Byte](https://doc.rust-lang.org/reference/tokens.html#byte-escapes).
36+
37+
Nul bytes are disallowed, whether as escape code or source character (e.g. `"\0"`, `"\x00"`, `"\u{0}"` or `"␀"`).
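
This compile-time rule resembles the run-time check that `CStr::from_bytes_with_nul` already performs on byte slices. A minimal sketch of that existing check, with a hypothetical helper name:

```rust
use std::ffi::CStr;

/// Hypothetical helper: true if `bytes` is exactly one nul-terminated C string,
/// i.e. it ends in a nul byte and contains no other nul bytes.
fn is_valid_c_string(bytes: &[u8]) -> bool {
    CStr::from_bytes_with_nul(bytes).is_ok()
}
```

Here `is_valid_c_string(b"abc\0")` holds, whereas an interior nul (`b"a\0bc\0"`) or a missing terminator (`b"abc"`) is rejected. With `c"…"` literals, the interior-nul case would be caught at compile time and the terminator is added automatically.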
38+
39+
Unicode characters are accepted and encoded as UTF-8. That is, `c"🦀"`, `c"\u{1F980}"` and `c"\xf0\x9f\xa6\x80"` are all accepted and equivalent.
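
The equivalence of the three spellings can be checked with a small sketch, again assuming a toolchain that implements this RFC (stable Rust 1.77 or later, 2021 edition). The function name is hypothetical.

```rust
use std::ffi::CStr;

/// The three spellings of the crab emoji from the paragraph above.
fn crab_spellings() -> [&'static CStr; 3] {
    [c"🦀", c"\u{1F980}", c"\xf0\x9f\xa6\x80"]
}
```

All three elements compare equal, and each has the four-byte UTF-8 encoding `f0 9f a6 80` as its contents.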
40+
41+
The type of the expression is [`&core::ffi::CStr`](https://doc.rust-lang.org/stable/core/ffi/struct.CStr.html). So, the `CStr` type will have to become a lang item.
42+
(`no_core` programs that don't use `c""` string literals won't need to define this lang item.)
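
Since the literal has type `&CStr`, it can be handed to code expecting a `const char *` through `CStr::as_ptr`. The sketch below is an assumption-laden illustration: it does not call a real C function, but reads the pointer back with `CStr::from_ptr` to mimic what a C callee would see. It assumes stable Rust 1.77 or later with the 2021 edition, and `roundtrip_len` is a hypothetical name.

```rust
use std::ffi::{c_char, CStr};

/// Converts a C string literal to a raw pointer and reads it back,
/// returning the length seen up to (not including) the nul terminator.
fn roundtrip_len() -> usize {
    let greeting: &CStr = c"hello";
    let ptr: *const c_char = greeting.as_ptr();
    // SAFETY: `ptr` comes from a valid, nul-terminated `'static` `CStr`.
    let seen_by_c: &CStr = unsafe { CStr::from_ptr(ptr) };
    seen_by_c.to_bytes().len()
}
```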
43+
44+
Interactions with string related macros:
45+
46+
- The [`concat` macro](https://doc.rust-lang.org/stable/std/macro.concat.html) will _not_ accept these literals, just like it doesn't accept byte string literals.
47+
- The [`format_args` macro](https://doc.rust-lang.org/stable/std/macro.format_args.html) will _not_ accept such a literal as the format string, just like it doesn't accept a byte string literal.
48+
49+
(This might change in the future. E.g. `format_args!(c"…")` would be cool, but that would require generalizing the macro and `fmt::Arguments` to work for other kinds of strings. (Ideally also for `b"…"`.))
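
Even though a C string cannot serve as the format string, it can still be passed as an argument after conversion. A minimal sketch using the existing `CStr::to_string_lossy` method (the `describe` function is hypothetical):

```rust
use std::ffi::CStr;

/// Formats a C string as an argument; invalid UTF-8 is replaced with U+FFFD.
fn describe(name: &CStr) -> String {
    format!("name = {}", name.to_string_lossy())
}
```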
50+
51+
# Rationale and alternatives
52+
[rationale-and-alternatives]: #rationale-and-alternatives
53+
54+
* No `c""` literal, but just a `cstr!()` macro. (Possibly as part of the standard library.)
55+
56+
This requires [complicated machinery](https://github.com/rust-lang/rust/pull/101607/files) to implement correctly.
57+
58+
The trivial implementation of using `concat!($s, "\0")` is problematic for several reasons, including non-string input and embedded nul bytes.
59+
(The unstable `concat_bytes!()` solves some of the problems.)
60+
61+
The popular [`cstr` crate](https://crates.io/crates/cstr) is a proc macro to work around the limitations of a `macro_rules` implementation, but that also has many downsides.
62+
63+
Even if we had the right language features for a trivial correct implementation, there are many code bases where C strings are the primary form of string,
64+
making `cstr!("..")` syntax quite cumbersome.
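
The problems with the trivial `concat!($s, "\0")` approach mentioned in this alternative can be sketched as follows. The `naive_cstr!` macro is hypothetical and written only for illustration; it is not the implementation of any existing crate.

```rust
use std::ffi::{CStr, FromBytesWithNulError};

/// Hypothetical naive macro: appends a nul byte and validates at run time.
macro_rules! naive_cstr {
    ($s:expr) => {
        CStr::from_bytes_with_nul(concat!($s, "\0").as_bytes())
    };
}

/// The intended use works.
fn ok_case() -> Result<&'static CStr, FromBytesWithNulError> {
    naive_cstr!("abc")
}

/// An embedded nul byte is only detected at run time, not at compile time.
fn embedded_nul() -> Result<&'static CStr, FromBytesWithNulError> {
    naive_cstr!("a\0b")
}

/// Non-string input is silently accepted, because `concat!` stringifies it.
fn non_string_input() -> Result<&'static CStr, FromBytesWithNulError> {
    naive_cstr!(123)
}
```

Here `embedded_nul()` returns an error instead of failing to compile, and `non_string_input()` quietly produces the C string `"123"`.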
65+
66+
- No `c""` literal, but make it possible for `""` to implicitly become a `&CStr` through magic.
67+
68+
We already allow integer literals (e.g. `123`) to become one of many types, so perhaps we could do the same to string literals.
69+
70+
(It could be a built-in fixed set of types (e.g. just `str`, `[u8]`, and `CStr`),
71+
or it could be something extensible through something like a `const trait FromStringLiteral`.
72+
Not sure exactly how that would work, but it sounds cool.)
73+
74+
* Allowing only valid UTF-8 and Unicode-oriented escape codes (like in `"…"`, e.g. `螃蟹` or `\u{1F980}` but not `\xff`).
75+
76+
For regular string literals, we have this restriction because `&str` is required to be valid UTF-8.
77+
However, C literals (and objects of our `&CStr` type) aren't necessarily valid UTF-8.
78+
79+
* Allowing only ASCII characters and byte-oriented escape codes (like in `b"…"`, e.g. `\xff` but not `螃蟹` or `\u{1F980}`).
80+
81+
While C literals (and `&CStr`) aren't necessarily valid UTF-8, they often do contain UTF-8 data.
82+
Refusing UTF-8 in them would make the feature less useful and would unnecessarily make it harder to use Unicode in programs that mainly use C strings.
83+
84+
* Having separate `c"…"` and `bc"…"` string literal prefixes for UTF-8 and non-UTF-8.
85+
86+
Both of those would be the same type (`&CStr`). Unless we add a special "always valid UTF-8 C string" type, there's not much use in separating them.
87+
88+
* Use `z` instead of `c` (`z"…"`), for "zero terminated" instead of "C string".
89+
90+
We already have a type called `CStr` for this, so `c` seems consistent.
91+
92+
- Also add `c'…'` as [`c_char`](https://doc.rust-lang.org/stable/core/ffi/type.c_char.html) literal.
93+
94+
It'd be identical to `b'…'`, except it'd be a `c_char` instead of `u8`.
95+
96+
This would easily lead to unportable code, since `c_char` is `i8` or `u8` depending on the platform. (Not a wrapper type, but a direct type alias.)
97+
E.g. `fn f(_: i8) {} f(c'a');` would compile only on some platforms.
98+
99+
An alternative is to allow `c'…'` to implicitly be either a `u8` or `i8`. (Just like integer literals can implicitly become one of many types.)
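
The portability concern with `c_char` can be illustrated without the proposed `c'…'` syntax. This sketch uses only the existing `c_char` alias; the function names are hypothetical.

```rust
use std::ffi::c_char;

/// True on platforms where `c_char` is `i8`, false where it is `u8`.
/// `c_char` is a plain type alias, so `MIN` is `i8::MIN` or `u8::MIN`.
fn c_char_is_signed() -> bool {
    (c_char::MIN as i32) < 0
}

/// Taking `c_char` (rather than `i8` or `u8`) keeps the signature portable.
fn takes_c_char(c: c_char) -> u8 {
    c as u8
}

/// An explicit cast from a byte literal compiles on every platform.
fn portable_call() -> u8 {
    takes_c_char(b'a' as c_char)
}
```

A function declared as `fn f(_: i8)` would accept a `c_char` argument only where `c_char_is_signed()` is true, which is the unportable situation described above.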
100+
101+
# Drawbacks
102+
[drawbacks]: #drawbacks
103+
104+
- The `CStr` type needs some work. `&CStr` is currently a wide pointer, but it's supposed to be a thin pointer. See https://doc.rust-lang.org/1.65.0/src/core/ffi/c_str.rs.html#87
105+
106+
It's not a blocker, but we might want to try to fix that before stabilizing `c"…"`.
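
Whether `&CStr` is wide or thin on a given toolchain can be observed by comparing its size to that of a raw `*const c_char`. This is a small sketch with a hypothetical function name; the result depends on the standard library version in use.

```rust
use std::ffi::{c_char, CStr};
use std::mem::size_of;

/// Size of `&CStr` measured in pointer-sized words:
/// 2 for a wide pointer (address + length), 1 for a thin pointer.
fn cstr_ref_words() -> usize {
    size_of::<&CStr>() / size_of::<*const c_char>()
}
```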
107+
108+
# Prior art
109+
[prior-art]: #prior-art
110+
111+
- C has C string literals (`"…"`). :)
112+
- Nim has `cstring"…"`.
113+
- COBOL has `Z"…"`.
114+
- Probably a lot more languages, but it's hard to search for. :)
115+
116+
# Unresolved questions
117+
[unresolved-questions]: #unresolved-questions
118+
119+
- Also add `c'…'` C character literals? (`u8`, `i8`, `c_char`, or something more flexible?)
120+
121+
- Should we make `&CStr` a thin pointer before stabilizing this? (If so, how?)
122+
123+
- Should the (unstable) [`concat_bytes` macro](https://github.com/rust-lang/rust/issues/87555) accept C string literals? (If so, should it evaluate to a C string or byte string?)
124+
125+
# Future possibilities
126+
[future-possibilities]: #future-possibilities
127+
128+
(These aren't necessarily all good ideas.)
129+
130+
- Make `concat!()` or `concat_bytes!()` work with `c"…"`.
131+
- Make `format_args!(c"…")` (and `format_args!(b"…")`) work.
132+
- Improve the `&CStr` type, and make it FFI safe.
133+
- Accept Unicode characters and escape codes in `b""` literals too: [RFC 3349](https://github.com/rust-lang/rfcs/pull/3349).
134+
- More prefixes! `w""`, `os""`, `path""`, `utf16""`, `brokenutf16""`, `utf32""`, `wtf8""`, `ebcdic""`, …
135+
- No more prefixes! Have `let a: &CStr = "…";` work through magic, removing the need for prefixes.
136+
(That won't happen any time soon probably, so that shouldn't block `c"…"` now.)
