| 1 | +- Feature Name: `c_str_literal` |
| 2 | +- Start Date: 2022-11-15 |
| 3 | +- RFC PR: [rust-lang/rfcs#3348](https://github.com/rust-lang/rfcs/pull/3348) |
| 4 | +- Rust Issue: [rust-lang/rust#105723](https://github.com/rust-lang/rust/issues/105723) |
| 5 | + |
| 6 | +# Summary |
| 7 | +[summary]: #summary |
| 8 | + |
| 9 | +Add `c"…"` string literals, which evaluate to a `&CStr`. |
| 10 | + |
| 11 | +# Motivation |
| 12 | +[motivation]: #motivation |
| 13 | + |
| 14 | +Judging by the [number of `cstr!()` invocations on GitHub alone](https://cs.github.com/?scopeName=All+repos&scope=&q=cstr%21+lang%3Arust) (about 3.2k files with matches), C string literals |
| 15 | +are a widely used feature. Implementing `cstr!()` as a `macro_rules` or `proc_macro` requires non-trivial code to get completely right (e.g. rejecting embedded nul bytes), |
| 16 | +and the result is still less flexible than it should be (e.g. in terms of accepted escape codes). |
| 17 | + |
| 18 | +In Rust 2021, we reserved prefixes for (string) literals, so let's make use of that. |
| 19 | + |
| 20 | +# Guide-level explanation |
| 21 | +[guide-level-explanation]: #guide-level-explanation |
| 22 | + |
| 23 | +`c"abc"` is a [`&CStr`](https://doc.rust-lang.org/stable/core/ffi/struct.CStr.html). Its bytes are stored in memory with a nul byte (`b'\0'`) appended. |
| 24 | + |
| 25 | +All escape codes and characters accepted by `""` and `b""` literals are accepted, except nul bytes. |
| 26 | +So, both UTF-8 and non-UTF-8 data can co-exist in a C string. E.g. `c"hello\x80我叫\u{1F980}"`. |
| 27 | + |
| 28 | +The raw string literal variant is prefixed with `cr`. For example, `cr"\"` and `cr##"Hello "world"!"##`. (Just like `r""` and `br""`.) |
| 29 | + |
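A minimal sketch of the literals described above, assuming a toolchain and edition in which `c"…"` literals are available (Rust 2021 or later). The helper function names are hypothetical and exist only for this illustration.

```rust
use std::ffi::CStr;

// Hypothetical helpers, named only for this illustration.
fn abc() -> &'static CStr {
    c"abc"
}

fn mixed() -> &'static CStr {
    // A non-UTF-8 byte escape next to UTF-8 source characters and a Unicode escape.
    c"hello\x80我叫\u{1F980}"
}

fn raw_backslash() -> &'static CStr {
    // Raw variant: the backslash is not an escape character here.
    cr"\"
}

fn raw_quoted() -> &'static CStr {
    cr##"Hello "world"!"##
}

fn main() {
    // The nul terminator is stored in memory but is not part of `to_bytes()`.
    assert_eq!(abc().to_bytes_with_nul(), &b"abc\0"[..]);
    assert_eq!(abc().to_bytes(), &b"abc"[..]);

    // The lone 0x80 byte makes the contents invalid UTF-8, which a `CStr` permits.
    assert!(mixed().to_str().is_err());

    assert_eq!(raw_backslash().to_bytes(), &b"\\"[..]);
    assert_eq!(raw_quoted().to_bytes(), &b"Hello \"world\"!"[..]);
}
```

The conversion to `&str` via `CStr::to_str` stays fallible, because the literal is allowed to contain non-UTF-8 bytes.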
| 30 | +# Reference-level explanation |
| 31 | +[reference-level-explanation]: #reference-level-explanation |
| 32 | + |
| 33 | +Two new [string literal types](https://doc.rust-lang.org/reference/tokens.html#characters-and-strings): `c"…"` and `cr#"…"#`. |
| 34 | + |
| 35 | +Accepted escape codes: [Quote](https://doc.rust-lang.org/reference/tokens.html#quote-escapes) & [Unicode](https://doc.rust-lang.org/reference/tokens.html#unicode-escapes) & [Byte](https://doc.rust-lang.org/reference/tokens.html#byte-escapes). |
| 36 | + |
| 37 | +Nul bytes are disallowed, whether as escape code or source character (e.g. `"\0"`, `"\x00"`, `"\u{0}"` or `"␀"`). |
| 38 | + |
| 39 | +Unicode characters are accepted and encoded as UTF-8. That is, `c"🦀"`, `c"\u{1F980}"` and `c"\xf0\x9f\xa6\x80"` are all accepted and equivalent. |
| 40 | + |
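The equivalence of the three spellings can be checked with a short sketch, again assuming a toolchain and edition that accept `c"…"` literals. `crab_forms` is a hypothetical helper, not part of any library.

```rust
use std::ffi::CStr;

// Hypothetical helper: the same C string written three ways.
fn crab_forms() -> [&'static CStr; 3] {
    [
        c"🦀",               // source character, encoded as UTF-8
        c"\u{1F980}",        // Unicode escape, encoded as UTF-8
        c"\xf0\x9f\xa6\x80", // the same four bytes written as byte escapes
    ]
}

fn main() {
    let [source_char, unicode_escape, byte_escapes] = crab_forms();
    assert_eq!(source_char, unicode_escape);
    assert_eq!(unicode_escape, byte_escapes);
    assert_eq!(source_char.to_bytes(), "🦀".as_bytes());

    // Under the rules described above, a nul in any form is rejected at compile time:
    // let bad = c"a\0b";
}
```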
| 41 | +The type of the expression is [`&core::ffi::CStr`](https://doc.rust-lang.org/stable/core/ffi/struct.CStr.html). So, the `CStr` type will have to become a lang item. |
| 42 | +(`no_core` programs that don't use `c""` string literals won't need to define this lang item.) |
| 43 | + |
| 44 | +Interactions with string related macros: |
| 45 | + |
| 46 | +- The [`concat` macro](https://doc.rust-lang.org/stable/std/macro.concat.html) will _not_ accept these literals, just like it doesn't accept byte string literals. |
| 47 | +- The [`format_args` macro](https://doc.rust-lang.org/stable/std/macro.format_args.html) will _not_ accept such a literal as the format string, just like it doesn't accept a byte string literal. |
| 48 | + |
| 49 | +(This might change in the future. E.g. `format_args!(c"…")` would be cool, but that would require generalizing the macro and `fmt::Arguments` to work for other kinds of strings. (Ideally also for `b"…"`.)) |
| 50 | + |
| 51 | +# Rationale and alternatives |
| 52 | +[rationale-and-alternatives]: #rationale-and-alternatives |
| 53 | + |
| 54 | +* No `c""` literal, but just a `cstr!()` macro. (Possibly as part of the standard library.) |
| 55 | + |
| 56 | + This requires [complicated machinery](https://github.com/rust-lang/rust/pull/101607/files) to implement correctly. |
| 57 | + |
| 58 | + The trivial implementation of using `concat!($s, "\0")` is problematic for several reasons, including non-string input and embedded nul bytes. |
| 59 | + (The unstable `concat_bytes!()` solves some of the problems.) |
| 60 | + |
| 61 | + The popular [`cstr` crate](https://crates.io/crates/cstr) is a proc macro that works around the limitations of a `macro_rules` implementation, but that also has many downsides. |
| 62 | + |
| 63 | + Even if we had the right language features for a trivial correct implementation, there are many code bases where C strings are the primary form of string, |
| 64 | + making `cstr!("..")` syntax quite cumbersome. |
| 65 | + |
| 66 | +- No `c""` literal, but make it possible for `""` to implicitly become a `&CStr` through magic. |
| 67 | + |
| 68 | + We already allow integer literals (e.g. `123`) to become one of many types, so perhaps we could do the same to string literals. |
| 69 | + |
| 70 | + (It could be a built-in fixed set of types (e.g. just `str`, `[u8]`, and `CStr`), |
| 71 | + or it could be something extensible through something like a `const trait FromStringLiteral`. |
| 72 | + Not sure how that would exactly work, but it sounds cool.) |
| 73 | + |
| 74 | +* Allowing only valid UTF-8 and unicode-oriented escape codes (like in `"…"`, e.g. `螃蟹` or `\u{1F980}` but not `\xff`). |
| 75 | + |
| 76 | + For regular string literals, we have this restriction because `&str` is required to be valid UTF-8. |
| 77 | + However, C literals (and objects of our `&CStr` type) aren't necessarily valid UTF-8. |
| 78 | + |
| 79 | +* Allowing only ASCII characters and byte-oriented escape codes (like in `b"…"`, e.g. `\xff` but not `螃蟹` or `\u{1F980}`). |
| 80 | + |
| 81 | + While C literals (and `&CStr`) aren't necessarily valid UTF-8, they often do contain UTF-8 data. |
| 82 | + Rejecting UTF-8 there would make the feature less useful and would make it unnecessarily hard to use Unicode in programs that mainly use C strings. |
| 83 | + |
| 84 | +* Having separate `c"…"` and `bc"…"` string literal prefixes for UTF-8 and non-UTF-8. |
| 85 | + |
| 86 | + Both of those would be the same type (`&CStr`). Unless we add a special "always valid UTF-8 C string" type, there's not much use in separating them. |
| 87 | + |
| 88 | +* Use `z` instead of `c` (`z"…"`), for "zero terminated" instead of "C string". |
| 89 | + |
| 90 | + We already have a type called `CStr` for this, so `c` seems consistent. |
| 91 | + |
| 92 | +- Also add `c'…'` as [`c_char`](https://doc.rust-lang.org/stable/core/ffi/type.c_char.html) literal. |
| 93 | + |
| 94 | + It'd be identical to `b'…'`, except it'd be a `c_char` instead of `u8`. |
| 95 | + |
| 96 | + This would easily lead to unportable code, since `c_char` is `i8` or `u8` depending on the platform. (Not a wrapper type, but a direct type alias.) |
| 97 | + E.g. `fn f(_: i8) {} f(c'a');` would compile only on some platforms. |
| 98 | + |
| 99 | + An alternative is to allow `c'…'` to implicitly be either a `u8` or `i8`. (Just like integer literals can implicitly become one of many types.) |
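To illustrate the portability concern in the last alternative: code that goes through the `c_char` alias compiles on every platform, while code that names `i8` or `u8` directly does not. This is only a sketch; `to_c_char` and `takes_c_char` are hypothetical names.

```rust
use std::ffi::c_char;

// Hypothetical helper: converts a byte to the platform's `c_char`.
// `c_char` is an alias for `i8` on some targets and `u8` on others.
fn to_c_char(byte: u8) -> c_char {
    byte as c_char
}

// Hypothetical helper that only ever names `c_char`, so it is portable.
fn takes_c_char(c: c_char) -> u8 {
    c as u8
}

fn main() {
    let a = to_c_char(b'a');
    assert_eq!(takes_c_char(a), b'a');

    // The following would compile only on targets where `c_char` is `i8`:
    // fn f(_: i8) {}
    // f(a);
}
```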
| 100 | + |
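The problems with the trivial `concat!($s, "\0")` approach from the first alternative can be seen in a small sketch. `naive_cstr!` is a hypothetical macro written only for this illustration; it is not the `cstr` crate and not a proposed standard library macro.

```rust
use std::ffi::{CStr, FromBytesWithNulError};

// Hypothetical naive macro: append a nul with `concat!` and validate at run time.
macro_rules! naive_cstr {
    ($s:expr) => {
        ::std::ffi::CStr::from_bytes_with_nul(concat!($s, "\0").as_bytes())
    };
}

fn ok_case() -> Result<&'static CStr, FromBytesWithNulError> {
    naive_cstr!("abc")
}

fn embedded_nul() -> Result<&'static CStr, FromBytesWithNulError> {
    // The embedded nul is only detected when the code runs, not at compile time.
    naive_cstr!("a\0b")
}

fn non_string_input() -> Result<&'static CStr, FromBytesWithNulError> {
    // `concat!` stringifies other literals, so non-string input is silently accepted.
    naive_cstr!(123)
}

fn main() {
    assert!(ok_case().is_ok());
    assert!(embedded_nul().is_err());
    assert_eq!(non_string_input().unwrap().to_bytes(), &b"123"[..]);
}
```

A `c"…"` literal avoids both issues by construction: the compiler sees a string literal and can reject embedded nul bytes while compiling.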
| 101 | +# Drawbacks |
| 102 | +[drawbacks]: #drawbacks |
| 103 | + |
| 104 | +- The `CStr` type needs some work. `&CStr` is currently a wide pointer, but it's supposed to be a thin pointer. See https://doc.rust-lang.org/1.65.0/src/core/ffi/c_str.rs.html#87 |
| 105 | + |
| 106 | + It's not a blocker, but we might want to try to fix that before stabilizing `c"…"`. |
| 107 | + |
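The pointer-width concern can be observed with a short sketch that compares the size of `&CStr` with the size of a raw C pointer. The helper names are hypothetical, and the printed numbers depend on the toolchain and target, so no particular output is claimed here.

```rust
use std::ffi::{c_char, CStr};
use std::mem::size_of;

// Hypothetical helpers, named only for this illustration.
fn cstr_ref_size() -> usize {
    size_of::<&CStr>()
}

fn thin_ptr_size() -> usize {
    size_of::<*const c_char>()
}

fn main() {
    // While `&CStr` is a wide pointer, it carries a length next to the address,
    // so the first number is larger than the second.
    println!("size_of::<&CStr>()          = {}", cstr_ref_size());
    println!("size_of::<*const c_char>()  = {}", thin_ptr_size());

    // `as_ptr` always yields the thin pointer that C code expects.
    let s = CStr::from_bytes_with_nul(b"abc\0").unwrap();
    let p: *const c_char = s.as_ptr();
    assert!(!p.is_null());
}
```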
| 108 | +# Prior art |
| 109 | +[prior-art]: #prior-art |
| 110 | + |
| 111 | +- C has C string literals (`"…"`). :) |
| 112 | +- Nim has `cstring"…"`. |
| 113 | +- COBOL has `Z"…"`. |
| 114 | +- Probably a lot more languages, but it's hard to search for. :) |
| 115 | + |
| 116 | +# Unresolved questions |
| 117 | +[unresolved-questions]: #unresolved-questions |
| 118 | + |
| 119 | +- Also add `c'…'` C character literals? (`u8`, `i8`, `c_char`, or something more flexible?) |
| 120 | + |
| 121 | +- Should we make `&CStr` a thin pointer before stabilizing this? (If so, how?) |
| 122 | + |
| 123 | +- Should the (unstable) [`concat_bytes` macro](https://github.com/rust-lang/rust/issues/87555) accept C string literals? (If so, should it evaluate to a C string or byte string?) |
| 124 | + |
| 125 | +# Future possibilities |
| 126 | +[future-possibilities]: #future-possibilities |
| 127 | + |
| 128 | +(These aren't necessarily all good ideas.) |
| 129 | + |
| 130 | +- Make `concat!()` or `concat_bytes!()` work with `c"…"`. |
| 131 | +- Make `format_args!(c"…")` (and `format_args!(b"…")`) work. |
| 132 | +- Improve the `&CStr` type, and make it FFI safe. |
| 133 | +- Accept unicode characters and escape codes in `b""` literals too: [RFC 3349](https://github.com/rust-lang/rfcs/pull/3349). |
| 134 | +- More prefixes! `w""`, `os""`, `path""`, `utf16""`, `brokenutf16""`, `utf32""`, `wtf8""`, `ebcdic""`, … |
| 135 | +- No more prefixes! Have `let a: &CStr = "…";` work through magic, removing the need for prefixes. |
| 136 | + (That probably won't happen any time soon, so it shouldn't block `c"…"` now.) |