Strings RFC #1848

Closed
325 changes: 325 additions & 0 deletions rfcs/003-strings.md
@@ -0,0 +1,325 @@
# Strings in windows-rs

## Summary

Rust and Windows each have a fair number of string types. This RFC proposes a system that makes it easy and obvious how to get the string types from both environments to work well together.

## Motivation

Making Rust and Windows string types work together is integral to ensuring correct, performant, and boilerplate-free code.

### What we want to accomplish

In general, we are trying to accomplish two goals:

* Correctly model the Windows string types in a way that leverages Rust's type and ownership systems to prevent incorrect use.
* Allow for easy and low-cost ways to convert between Windows and Rust types.

On top of this, we want these APIs to feel natural and obvious to Rust developers.

## Explanation

### The Different Types of Strings

#### Windows

Windows has the following string types:

Collaborator:
Another way to look at this is that Windows has a variety of string types that need to be supported for interop with various APIs but that languages can and should only use them on the ABI boundaries as much as is practical. The extreme example of this is C# where only System.String exists in C# and all the different string types are marshaled away behind the scenes. While I'm not suggesting we go quite that far, we should try to think of these as interop types needed only for the ABI boundary and put our energy into making it as efficient and seamless as possible to go from and get back to native and idiomatic Rust string types, and not try to make Windows replacements with all the bells and whistles of Rust's string types.

Contributor:
For PCSTR, we can just use *const c_char directly, and CStr provides from_ptr and as_ptr methods to convert to and from it. For PSTR, we would need CString and CStr to allow mutability.

For PWSTR and PCWSTR, it would need roughly the same, except with wide characters. PCWSTR (I think) should be able to be created from a BSTR or HSTRING.

Contributor:
That would be the ideal situation.

However, the unfortunate situation with Rust is, that Rust does not provide a zero-overhead string type that's compatible with Windows' native string encoding. Vec<u16>/&[u16] is the best Rust has to offer in this respect. I don't see any way "to go from and get back to native and idiomatic Rust string types" at all.

Being pragmatic here, and assuming that unmatched surrogate pairs won't appear in practice is a dangerous route to go down. Doing so will open up code based on the windows crate to attacks, even if only DoS attacks.

Collaborator:
Indeed that's the ideal hence the "as much as is practical". So we need a null terminated wide char but we need such things as little as humanly possible with the emphasis on interop rather than string features. And since we need HSTRING anyway and HSTRING is a null terminated wide char string that should cover those needs. Even if we end up calling it CWString, for symmetry, that just happens to be implemented as an HSTRING we can at least minimize the number of implementations we're carrying around.

Contributor Author:
That makes sense. I am a bit skeptical that we'll actually save much code by using an HSTRING as a backing implementation detail, but if this is truly just an implementation detail then it doesn't really matter. We can keep playing with the implementation without impacting users.


* `PSTR`: a mutable, nul-terminated string (i.e., composed of "characters" a.k.a. `u8`s) that is often but not always ANSI encoded.
* `PCSTR`: an immutable version of `PSTR`.
  * **QUESTION**: what are the practical differences between this and `PSTR`?

Contributor (@mbartlett21, Jun 28, 2022):
The difference is that PCSTR can be created from a CStr reference, and can also be converted back.
EDIT: That is because CStr doesn't allow mutability.

* `PWSTR`: a mutable, nul-terminated "wide" string (i.e., composed of "wide characters" a.k.a. `u16`s) that is often but not always UTF-16 encoded.

Collaborator:
> [...] that is often but not always UTF-16 encoded.

Can you expand on this? I'm not aware of any non-UTF-16 encoded PWSTRs. Are you instead trying to convey that a grapheme may span 1 or 2 16-bit code units (2-4 bytes)?

Collaborator:
It's possible to create invalid UTF-16 strings because the kernel does not check validity. This is exceedingly rare in practice but it is possible, especially if there's a malicious app causing mischief.

Contributor:
The prime candidate here is the filesystem. NTFS allows names to be just about any sequence of 16-bit values (with a few reserved code units). Those names will inevitably be returned from FindFirstFile/FindNextFile, so PWSTR/PCWSTR cannot make any assumptions about the validity of the UTF-16 encoding.

Likewise, named kernel objects (events, mutexes, pipes, ...) can have names that aren't valid UTF-16.

Not quite the norm, but invalid UTF-16 sequences are observed in the wild, usually as a result of an accident/bug, or, more frequently, in attacks.

Collaborator:
There's no use trying to define what these are. They're just pointer typedefs, which is why windows-rs defines them as simply as possible. We can provide safe ways to create them and pass them to API calls, but we cannot make any guarantees about them or provide generally safe conversions from them.

* `PCWSTR`: an immutable version of `PWSTR`.
* `BSTR`: an immutable, nul-terminated "wide" string used in COM.
* `HSTRING`: an immutable reference counted, nul-terminated "wide" string used in WinRT.
  * Note: `HSTRING`s can sometimes be "fast-pass", where the buffer and header are stack allocated. In this case the string must only be used while the stack frame is valid.

Generally, callees are expected to keep references to string types only for the duration of their stack frame, and to copy the string (or bump the reference count in the case of `HSTRING`) if they want to keep it around for longer.
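
The "often but not always UTF-16" caveat can be illustrated with the standard library alone. The sketch below is not part of the proposed API; `decode_wide` is a hypothetical helper name. It decodes the same buffer strictly and lossily, so a lone surrogate (as an NTFS file name could contain) shows up as an error in one case and as U+FFFD in the other.

```rust
use std::string::FromUtf16Error;

/// Hypothetical helper: decode a wide buffer both strictly and lossily.
fn decode_wide(wide: &[u16]) -> (Result<String, FromUtf16Error>, String) {
    (
        // Fails if the buffer contains an unpaired surrogate.
        String::from_utf16(wide),
        // Never fails; unpaired surrogates become U+FFFD.
        String::from_utf16_lossy(wide),
    )
}
```

Which of the two a caller wants depends on whether silently replacing data is acceptable, which is why the proposal offers both `to_string` and `to_string_lossy`.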

#### Rust

Rust's `alloc` core library and, by extension, its `std` library have the following string types:

* `String`: an owned pointer to UTF-8 encoded data together with a known length and capacity.
* `&'a str`: a view into UTF-8 encoded data that lasts for the lifetime `'a`.
* `CString`: an owned, C-compatible, nul-terminated string of bytes.
* `&'a CStr`: a view into `CString` data that lasts for the lifetime `'a`.
* `OsString` (on Windows): an owned sequence of 8-bit values, encoded in a less-strict variant of UTF-8.
* `&'a OsStr` (on Windows): a view into `OsString` data that lasts for the lifetime `'a`.

#### Equivalences

Windows' `PCSTR` is directly equivalent in memory to Rust's `CString` and `&CStr`; the difference is that the Rust types have clear ownership semantics while the Windows type does not.
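
As a rough illustration of that equivalence, the sketch below uses only `std::ffi`. The function name `observed_len` is made up for this example, and the raw `*const c_char` stands in for what a `PCSTR` would wrap. The pointer carries no ownership information; it is only valid because the owning `CString` is still alive.

```rust
use std::ffi::{c_char, CStr, CString};

/// Hypothetical helper: the length a C callee would observe when handed
/// `text` as a nul-terminated narrow string.
fn observed_len(text: &str) -> Option<usize> {
    // `CString::new` fails if `text` contains an interior nul.
    let owned = CString::new(text).ok()?;
    // This raw pointer is what a `PCSTR` would wrap.
    let ptr: *const c_char = owned.as_ptr();
    // SAFETY: `ptr` comes from `owned`, which is alive and nul-terminated.
    let borrowed: &CStr = unsafe { CStr::from_ptr(ptr) };
    Some(borrowed.to_bytes().len())
}
```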

### Modeling Windows String types

#### Ownership for Win32 Strings

Often when dealing with Win32 strings (i.e., `PSTR`, `PCSTR`, `PWSTR`, and `PCWSTR`), who owns the data and how long the data is valid for are well known:

* As "IN" params to a function, the string should only be considered valid for the lifetime of the function call.
* As string literals, the string data is valid for the entirety of the program.

However, sometimes we don't know the ownership or lifetime of a value outside of documentation:

* As "OUT" params to a function, the caller sometimes owns the data, but it may also be owned by someone else and thus only be valid for some lifetime.
  * [Example of a "borrowed" OUT param](https://microsoft.github.io/windows-docs-rs/doc/windows/Win32/Globalization/struct.IOptionDescription.html#method.Id). The `id` string is owned by the `IOptionDescription` object.
  * [Example of an "owned" OUT param](https://microsoft.github.io/windows-docs-rs/doc/windows/Win32/System/Com/struct.IEnumString.html#method.Next). The caller is responsible for freeing the `rgelt` param.
* As fields in a struct, the ownership of the data is unclear. The string can sometimes be freed when the struct is no longer needed but sometimes something else owns that string.

**In short**: there is no way to programmatically know the lifetime of a Win32 string except when it is used as an "IN" param.

#### Ownership for `BSTR`

`BSTR` has the same story as a Win32 string type.

#### Ownership for `HSTRING`

Ownership for `HSTRING` is fairly clear:

* As "IN" params, the HSTRING is only being borrowed.
* Typically in any other case, the `HSTRING` handle has had its reference count incremented, and when that handle is no longer needed, the reference count should be decremented.

**In short**: `HSTRING`s are always valid when in scope, and when they are no longer used, their reference counts should be decremented **except** in the case of "IN" params where they are logically "borrowed".
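
The borrow-versus-keep distinction can be modeled with a toy type. The sketch below is not the real `HSTRING`; `SharedWide` and both callee functions are hypothetical names, and `Rc` stands in for the WinRT reference count. It only shows the ownership pattern: an "IN" param leaves the count untouched, while a callee that keeps the string takes its own handle.

```rust
use std::rc::Rc;

/// Toy stand-in for a reference-counted wide string (NOT the real `HSTRING`).
#[derive(Clone)]
struct SharedWide(Rc<[u16]>);

impl SharedWide {
    fn new(text: &str) -> Self {
        SharedWide(text.encode_utf16().collect())
    }

    fn ref_count(&self) -> usize {
        Rc::strong_count(&self.0)
    }
}

/// An "IN" param is only borrowed: the count does not change.
fn callee_borrows(s: &SharedWide) -> usize {
    s.ref_count()
}

/// A callee that outlives the call takes its own handle, bumping the count.
/// Dropping the returned handle decrements it again.
fn callee_keeps(s: &SharedWide) -> SharedWide {
    s.clone()
}
```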

### Proposed API

It's clear that "IN" params are the scenario where ownership differs from the status quo. Therefore, the `windows` crate will treat the types as follows.

* Introduce a new type `CWString` which is the wide string equivalent to `CString`.

Collaborator:
I'll review more carefully when I get a moment but my most obvious question is what CWString gives you that HSTRING doesn't already provide. Obviously, there are already a dizzying number of string types and I'd love to not make that dizzier. 😉

Contributor Author:
Maybe? The difference is that HSTRING has the added complexity of being reference counted. CWString would have the added benefit of being symmetrical with the Rust std library CString type, but I agree that we want to try to limit the explosion of types if we can...

* Win32 strings and `BSTR`s should always be treated like raw pointers. Accessing them is unsafe since the compiler cannot verify that they are still live. The exception is their use as "IN" params, which we discuss below.
* `HSTRING`s are always live and should have their reference count decremented in their `Drop` impl.

### Converting between String types

#### From Windows to Rust types

The main distinction between `HSTRING` and the other Windows string types is that `HSTRING` is always valid while the other string types are essentially wrappers around raw pointers and thus cannot be assumed to point to valid memory. Therefore, while `HSTRING` can host a variety of safe conversions into Rust strings, the other strings cannot.

The main conversions afforded to these types are to:

* `&[u16]` or `&[u8]` (depending if the given Windows type is a wide string or not)
* `String`

Conversion of most types to `CString` and `OsString` doesn't make much sense, as all Windows types are already FFI safe and those Rust types are almost never needed outside of FFI. However, the user should be given access to the underlying buffers (including the trailing nul), and there should be conveniences for converting to `String`, since this unlocks the entire rich `std` library functionality for string manipulation.

All string types that aren't `HSTRING` can only expose unsafe functions, as it is not known whether they point to valid memory. Additionally, each type should allow conversion to a raw pointer to avoid having to use `std::mem::transmute` in cases where an API expects a raw pointer.
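
To make the reason for `unsafe` concrete, here is a minimal sketch of what an `as_wide`-style accessor has to do for a raw wide pointer. The free function `wide_from_ptr` is a hypothetical name, not part of the proposal. It must trust the caller that the pointer is valid and nul-terminated, which the compiler cannot check.

```rust
/// Hypothetical sketch of viewing a raw nul-terminated wide string as a slice
/// (without the trailing nul).
///
/// # Safety
/// `ptr` must be non-null and point to a nul-terminated `u16` buffer that
/// stays valid and unmodified for the lifetime `'a`.
unsafe fn wide_from_ptr<'a>(ptr: *const u16) -> &'a [u16] {
    let mut len = 0;
    // SAFETY: the caller guarantees that every element up to and including
    // the terminating nul is readable.
    unsafe {
        while *ptr.add(len) != 0 {
            len += 1;
        }
        std::slice::from_raw_parts(ptr, len)
    }
}
```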

#### From Rust to Windows types

It is common for Rust functionality to want to interact with some Windows APIs which require the Windows string types.

Converting from Rust types usually requires copying and transcoding the UTF-8 encoded bytes to UTF-16. This can be done with any type that can be referenced as a `str`, so it makes sense to provide `From<&T> where T: AsRef<str>` for `HSTRING`, as `HSTRING` will ensure the newly allocated buffer is freed when appropriate.

All the other Windows string types do not have clear ownership semantics and thus providing convenient conversions might risk leading the user to leak memory. This is the same reason logically borrowed types like `CStr` don't allow converting from owned types like `String`. As such, only direct conversions will be provided: conversion from `*const u8` and `*const u16`.
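
The copy from UTF-8 to UTF-16 mentioned above needs nothing beyond `std`. The helper name `to_wide_with_nul` below is made up for illustration; it shows the transcoding step that any `str`-to-wide conversion has to perform, including the trailing nul that Windows APIs expect.

```rust
/// Hypothetical helper: transcode UTF-8 to a nul-terminated UTF-16 buffer.
fn to_wide_with_nul(text: &str) -> Vec<u16> {
    text.encode_utf16() // one or two code units per `char`
        .chain(std::iter::once(0)) // trailing nul terminator
        .collect()
}
```

The returned `Vec<u16>` owns the buffer, so a pointer obtained from it is only valid while the vector is alive. This is the ownership question the paragraph above raises for the pointer-like Windows types.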

#### API Proposal

With the above in mind, here is what the API should look like:

```rust
impl HSTRING {
    // String data without the trailing 0
    fn as_wide(&self) -> &[u16] {}
    // String data with the trailing 0
    fn as_wide_with_nul(&self) -> &[u16] {}
    fn to_string(&self) -> Result<String, FromUtf16Error> {}
    // Review comment (Contributor): This function collides with
    // `ToString::to_string`, since `HSTRING` implements `Display` below.
    // Perhaps this could be renamed `try_to_string`?
    //
    // Review comment (Contributor Author): Good catch. That's a shame. We
    // could use `into_string`, which mirrors `CString::into_string` directly.
    // I originally didn't do this as it's not exactly trivial to reuse the
    // `String`'s buffer for the `HSTRING` buffer and thus I didn't want to
    // take `self`.
    fn to_string_lossy(&self) -> String {}
    fn as_ptr(&self) -> *const c_void {}
}

// Display shows the string with non-valid utf16 replaced with �
impl Display for HSTRING {}
// Same as Display but surrounded by ""
impl Debug for HSTRING {}

// Uses the `to_string_lossy` function
impl From<HSTRING> for String {}
// Uses the `to_string` function
impl TryFrom<HSTRING> for String {
    type Error = FromUtf16Error;
}
impl<T: ?Sized + AsRef<str>> From<&'_ T> for HSTRING {}
impl<T: ?Sized + AsRef<str>> TryFrom<&'_ T> for HSTRING {
    type Error = FromUtf16Error;
}

// --------------------------------

/// The wide equivalent to std::ffi::CString;
#[repr(transparent)]
struct CWString(..);
// Review comment (Contributor): Should there be a matching borrowed `CWStr`,
// like `CString` and `CStr`?
//
// Review comment (Contributor Author): Perhaps... I'm not sure how often this
// will be useful in practice, though it might be. `PCWSTR` would just be like
// a raw pointer, so ideally the user would only rarely use it, but I'm unsure
// how often you'd want a `&CWStr` and not a `&CWString`.
//
// Review comment (Contributor): @rylev One reason would be so there isn't a
// double indirection. It can also be used to convert a `PCWSTR` into a
// safely-used reference without having to copy or own the data (see, for
// example, `CStr::from_ptr`).

impl CWString {
    // This mirrors CString
    fn new<T: Into<Vec<u16>>>(t: T) -> Result<Self, NulError> {}
    fn from_str<T: AsRef<str>>(t: T) -> Result<Self, NulError> {}
    // String data without the trailing 0
    fn as_wide(&self) -> &[u16] {}
    // String data with the trailing 0
    fn as_wide_with_nul(&self) -> &[u16] {}
    fn as_pcwstr(&self) -> PCWSTR {}
    fn to_string(&self) -> Result<String, FromUtf16Error> {}
    fn to_string_lossy(&self) -> String {}
}

// This allows `CWString` to be turned into a `Borrowed<'a, PCWSTR>`
impl From<&CWString> for &PCWSTR {}
impl From<CWString> for HSTRING {}
impl From<HSTRING> for CWString {}

impl Drop for CWString {
    // The `CWString` owns its memory and will free it when it's dropped
}
// Display shows the string with non-valid utf16 replaced with �
impl Display for CWString {}
// Same as Display but surrounded by ""
impl Debug for CWString {}

// HSTRING should also include implementations of `PartialEq` for
// `String`, `&str`, `OsString`, and `&OsStr`

// --------------------------------

impl BSTR {
    fn from_ptr(ptr: *const c_void) -> Self {}
    fn as_ptr(&self) -> *const c_void {}
    // String data without the trailing 0
    unsafe fn as_wide(&self) -> &[u16] {}
    // String data with the trailing 0
    unsafe fn as_wide_with_nul(&self) -> &[u16] {}
    unsafe fn to_string(&self) -> Result<String, FromUtf16Error> {}
    unsafe fn to_string_lossy(&self) -> String {}
}


// --------------------------------

impl PCWSTR {
    fn from_ptr(ptr: *const u16) -> Self {}
    fn as_ptr(&self) -> *const u16 {}
    fn is_null(&self) -> bool {}
    // String data without the trailing 0
    unsafe fn as_wide(&self) -> &[u16] {}
    // String data with the trailing 0
    unsafe fn as_wide_with_nul(&self) -> &[u16] {}
    unsafe fn to_string(&self) -> Result<String, FromUtf16Error> {}
    unsafe fn to_string_lossy(&self) -> String {}
    unsafe fn display(&self) -> impl Display {}
}
// This allows `CString` and `CStr` to be turned into a `Borrowed<'a, PCSTR>`
impl From<&CString> for &PCSTR {}
impl From<&CStr> for &PCSTR {}

// --------------------------------

impl PCSTR {
    fn from_ptr(ptr: *const u8) -> Self {}
    fn as_ptr(&self) -> *const u8 {}
    fn is_null(&self) -> bool {}
    // String data without the trailing 0
    unsafe fn as_bytes(&self) -> &[u8] {}
    // String data with the trailing 0
    unsafe fn as_bytes_with_nul(&self) -> &[u8] {}
    // Converts to `&str` checking for valid UTF-8
    // (`Utf8Error` is the error type `std::str::from_utf8` returns)
    unsafe fn as_str(&self) -> Result<&str, Utf8Error> {}
    // Converts to `&str` not checking for valid UTF-8
    unsafe fn as_str_unchecked(&self) -> &str {}
}

// Note that `Display`, `Debug`, and the conversion traits are not
// included for non-`HSTRING` types because those traits are not
// marked as unsafe
```
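
For a sense of how little machinery an owned wide string needs, here is a minimal sketch. It is not the proposed `CWString` implementation; `WideCString` and `InteriorNul` are hypothetical stand-ins (the latter for `NulError`), and backing the type with `Vec<u16>` is an assumption made for this example.

```rust
/// Illustrative owned wide string; the buffer always ends with exactly one 0.
struct WideCString(Vec<u16>);

/// Position of the first interior nul (hypothetical stand-in for `NulError`).
#[derive(Debug, PartialEq)]
struct InteriorNul(usize);

impl WideCString {
    /// Mirrors `CString::new`: rejects interior nuls, appends the terminator.
    fn new(units: impl Into<Vec<u16>>) -> Result<Self, InteriorNul> {
        let mut units = units.into();
        if let Some(pos) = units.iter().position(|&u| u == 0) {
            return Err(InteriorNul(pos));
        }
        units.push(0);
        Ok(WideCString(units))
    }

    /// Transcodes UTF-8 to UTF-16 before building the string.
    fn from_str(text: &str) -> Result<Self, InteriorNul> {
        Self::new(text.encode_utf16().collect::<Vec<u16>>())
    }

    /// String data without the trailing 0.
    fn as_wide(&self) -> &[u16] {
        &self.0[..self.0.len() - 1]
    }

    /// String data with the trailing 0.
    fn as_wide_with_nul(&self) -> &[u16] {
        &self.0
    }

    /// Pointer suitable for wrapping in a `PCWSTR` while `self` is alive.
    fn as_ptr(&self) -> *const u16 {
        self.0.as_ptr()
    }

    fn to_string_lossy(&self) -> String {
        String::from_utf16_lossy(self.as_wide())
    }
}
```

Because the `Vec<u16>` frees itself, this sketch needs no explicit `Drop` impl. An implementation backed by `HSTRING`, as discussed in the review comments, would differ here.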

#### String literals

Many times the user simply wants to use a string literal to build a string of a certain type.

The windows crate provides the following macros for convenience:

```rust
let x: CWString = w!("hello");
let y: CString = c!("hello");
let z: HSTRING = h!("hello");
```

Collaborator:
I don't think we need these macros to produce allocated strings. That requires runtime support. They should all be able to return const string literals of PCWSTR and PCSTR respectively. Then they can simply be used to call functions like MessageBoxA/W directly and efficiently.

Collaborator:
Common Windows string types like HSTRING and BSTR can then just provide From<&T> implementations for PCWSTR for cases where a string needs to be computed at runtime and in future if/when CString and CWString are available in the std library we can just provide From implementations for those as well.

Collaborator:
Then we shouldn't need the Into<T> below, unless I'm forgetting something.

Collaborator:
And this is compatible with the existing state of the windows crate so would make it easier to land the Borrowed PR.

Contributor Author:
This proposal is a subset of what I had in mind so I think it's fine to try it out. We can always expand from there to have dedicated CWString types and such.

These macros build nul-terminated string data of the appropriate width (`u8` in the case of `PCSTR` and `u16` in the other cases).

**question**: for all these types built from static strings, could we special case them so that they don't use reference counting at all?

Contributor:
You could just special case for a string literal, then use that with concat!($str, "\0"), etc. For wide strings, you'd probably need something like const_heap (nightly api very far from stable) or compiler support.

Contributor Author:
This refers to whether we need a special version of HSTRING and the like that doesn't track reference counts since its backing memory is statically allocated. Actually statically allocating enough memory for a wide string is already a solved problem (e.g., see here).

Contributor (@mbartlett21, Jun 27, 2022):
Ok. I didn't realise that the const on stable had got that far. EDIT: And I forgot that you could use it to get a length for a constant array...

Contributor:
So essentially, you'd get a wide string reference (PCWSTR), then pass that to WindowsCreateStringReference (in windows::Win32::System::WinRT), then use the resultant HSTRING structure.

Collaborator:
Using WindowsCreateStringReference is generally a pessimization unless the calling language natively supports null terminated wide string literals, like C# and C++. For languages like Rust, that's not worth it as the resultant HSTRING typically ends up being copied almost immediately.
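
As a rough illustration of the statically allocated wide literals discussed above, here is a sketch that works on stable Rust. It is not the `w!` macro from the proposal. `ascii_to_wide` and `wide_ascii!` are hypothetical names, and the sketch only handles ASCII literals to keep the const-time transcoding short; a real macro would need full UTF-8 to UTF-16 conversion.

```rust
/// Widens an ASCII string into a nul-terminated `[u16; N]` at compile time.
/// `N` must be the string length plus one for the terminator.
const fn ascii_to_wide<const N: usize>(s: &str) -> [u16; N] {
    let bytes = s.as_bytes();
    assert!(bytes.len() < N, "N must leave room for the trailing nul");
    let mut out = [0u16; N];
    let mut i = 0;
    while i < bytes.len() {
        assert!(bytes[i] < 0x80, "this sketch only handles ASCII literals");
        out[i] = bytes[i] as u16;
        i += 1;
    }
    out
}

/// Expands to a `&'static [u16; LEN]` holding the literal plus a trailing nul.
macro_rules! wide_ascii {
    ($s:literal) => {{
        const LEN: usize = $s.len() + 1;
        static WIDE: [u16; LEN] = ascii_to_wide::<LEN>($s);
        &WIDE
    }};
}
```

The data lives in static memory, so no allocation or reference counting happens at runtime, and a pointer to it could be handed to an API expecting a `PCWSTR`.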


### Interaction with `InParam`

"IN" params that take a `PCWSTR` at the FFI layer will be projected as `P: Into<InParam<'a, CWString>>`. "IN" params that take a `PCSTR` at the FFI layer will be projected as `P: Into<InParam<'a, CString>>`.

**QUESTION**: This will just work when the user owns the strings they're passing over the FFI layer. But we'll need some way to turn `PCWSTR` and `PCSTR` into `InParam<'a, CWString>` and `InParam<'a, CString>` respectively. This cannot be a safe operation since the `PCWSTR` and `PCSTR` might be invalid. Perhaps an `unsafe fn InParam::from_abi_unchecked()`?

## Examples

Here is how some of these APIs would be used:

```rust
let locale = w!("en-US");
let supported = unsafe { factory.IsSupported(&locale)? };
supported.expect("en-US is supported");

// Create a ISpellChecker
let checker = unsafe { factory.CreateSpellChecker(&locale)? };

// Get errors enumerator for the supplied string
println!("Checking the text: '{}'", input);
let i = CWString::from_str(input).unwrap();
let errors = unsafe { checker.ComprehensiveCheck(&i)? };

// Loop through all the errors
while let Ok(error) = unsafe { errors.Next() } {
    // Get the start index and length of the error
    let start_index = unsafe { error.StartIndex()? };
    let length = unsafe { error.Length()? };

    // Get the substring from the utf8 encoded string
    let substring = &input[start_index as usize..(start_index + length) as usize];

    // Get the corrective action
    let action = unsafe { error.CorrectiveAction()? };
    println!("{:?}", action);

    match action {
        CORRECTIVE_ACTION_DELETE => {
            println!("Delete '{}'", substring);
        }
        CORRECTIVE_ACTION_REPLACE => {
            // Get the replacement as a widestring and convert to a Rust String
            let replacement = unsafe { error.Replacement()? };

            println!("Replace: {} with {}", substring, unsafe { replacement.display() });

            unsafe { CoTaskMemFree(replacement.as_ptr() as *mut _) };
        }
        CORRECTIVE_ACTION_GET_SUGGESTIONS => {
            // Get an enumerator for all the suggestions for a substring
            let suggestions = unsafe { checker.Suggest(CWString::from_str(substring).unwrap())? };

            // Loop through the suggestions
            loop {
                // Get the next suggestion breaking if the call to `Next` failed
                let mut suggestion = [PWSTR::default()];
                unsafe {
                    let _ = suggestions.Next(&mut suggestion, std::ptr::null_mut());
                }
                if suggestion[0].is_null() {
                    break;
                }

                println!("Maybe replace: {} with {}", substring, unsafe { suggestion[0].display() });

                unsafe { CoTaskMemFree(suggestion[0].as_ptr() as *mut _) };
            }
        }
        _ => {}
    }
}
```
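
One caveat about the substring step in the example: it slices the UTF-8 `input` with the indices reported by the spell checker. Assuming those indices count UTF-16 code units (the text is passed as a wide string, but this is an assumption, not something the RFC states), the byte slice is only correct for ASCII input. The sketch below shows one way to map such a range back onto a `&str`; `slice_by_utf16` is a hypothetical helper, not part of the proposed API.

```rust
/// Maps a range given in UTF-16 code units onto the matching `&str` slice.
/// Returns `None` if the range is out of bounds or splits a character.
fn slice_by_utf16(input: &str, start: usize, len: usize) -> Option<&str> {
    let end = start.checked_add(len)?;
    let mut units = 0; // UTF-16 code units seen so far
    let mut byte_start = None;
    let mut byte_end = None;
    for (byte_idx, ch) in input.char_indices() {
        if units == start {
            byte_start = Some(byte_idx);
        }
        if units == end {
            byte_end = Some(byte_idx);
        }
        units += ch.len_utf16();
    }
    // A range may begin or end exactly at the end of the string.
    if units == start {
        byte_start = Some(input.len());
    }
    if units == end {
        byte_end = Some(input.len());
    }
    Some(&input[byte_start?..byte_end?])
}
```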

```rust
Uri::CreateUri(&h!("http://test/"))
```