Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyByteArray::to_vec is unsound #4736

Open
robsdedude opened this issue Nov 27, 2024 · 12 comments
Open

PyByteArray::to_vec is unsound #4736

robsdedude opened this issue Nov 27, 2024 · 12 comments

Comments

@robsdedude
Copy link

robsdedude commented Nov 27, 2024

https://github.com/PyO3/pyo3/blob/v0.23.2/src/types/bytearray.rs#L297-L299

The implementation just calls the unsafe as_bytes function while not doing anything to uphold the safety invariant that no other thread modifies the bytearray. I think this is fine with the GIL, but without it, another thread might be ruining the show.

Aside: for the as_bytes API to be any useful with free threaded python people will want to know about critical sections (at least I think that's the right tool). So maybe a cross reference in the docs would go a long way.

@alex
Copy link
Contributor

alex commented Nov 27, 2024 via email

@ngoldbaum
Copy link
Contributor

I'm not sure if a critical section is sufficient here if other threads can write to the array via the buffer protocol. Does PyO3 expose read-write access via the buffer protocol?

@alex
Copy link
Contributor

alex commented Nov 27, 2024 via email

@ngoldbaum
Copy link
Contributor

So isn't it unsafe even in the GIL-enabled build? Just less likely to be an issue.

@alex
Copy link
Contributor

alex commented Nov 27, 2024 via email

@ngoldbaum ngoldbaum changed the title PyByteArrayMethods::to_vec might not be sound in free threaded python PyByteArrayMethods::to_vec is unsound Nov 27, 2024
@ngoldbaum
Copy link
Contributor

Perhaps there are other solutions that also address this problem too! I’m very excited to hear other people’s thoughts on how we can address this. As the presence of Python extension modules written in Rust becomes more prominent finding an efficient, interoperable, and sound way of handling Python buffers in Rust will become important.

slightly off-topic aside: I'm also very interested in this from a completely different angle: the buffer protocol isn't extensible, so if I want to add support for datetimes (just one example) I'd need to write a PEP. We should fix that. Adding a borrow-checker to NumPy might be interesting, although it will be tough to instrument every single place NumPy exposes views into array data (not to mention we'd need to support the legacy buffer protocol probably).

@robsdedude
Copy link
Author

robsdedude commented Nov 28, 2024

Is there value in using a critical section as a first step to get at least to the same (lower) level of unsoundness as is present with the GIL, where at least another native extension is needed to cause UB? I'm happy to open a PR in such case.

@ngoldbaum
Copy link
Contributor

I think that's worth doing. Whether or not to mark the function as a whole unsafe I'll leave to other more experienced PyO3 devs.

robsdedude added a commit to robsdedude/pyo3 that referenced this issue Nov 29, 2024
In free-threaded Python, to_vec needs to make sure to run inside a critical
section so that no other Python thread is mutating the bytearray causing UB.

See also PyO3#4736
@robsdedude robsdedude changed the title PyByteArrayMethods::to_vec is unsound PyByteArray::to_vec is unsound Nov 29, 2024
@davidhewitt
Copy link
Member

This is one of those really frustrating potentially-unsafe cases where the reader likely can't know whether a writer might exist. So while marking it as unsafe is potentially a useful signal for readers, it also creates an invariant which I think they likely just have to ignore and hope for the best. Either way, the status quo seems like a silent trap, and I would be open to adding unsafe just to signal there's dragons lurking (though that is a breaking change that would need to wait for 0.24).

The long run, we have to hope for a reworked buffer protocol, I guess. 🤔

@robsdedude
Copy link
Author

Reading the blog post my first thought was: this Python C API is just fundamentally broken. Or am I misreading this? To me it seems impossible to write a correct (i.e., race-free) native extension regardless of the language used. So with that in mind, trying to write a sound extension with PyO3 makes it impossible to use bytearrays at all (other than those the extension created itself and didn't yet give the user a reference to).

@davidhewitt
Copy link
Member

That's indeed my take, too. As I see it, the buffer protocol cannot protect against data races without a higher-level agreement on synchronization (whether that be the GIL or some other API contract, like the arrow capsule interface). For an isolated extension accepting an arbitrary buffer as input, there is likely no higher-level agreement. Hence why PyBuffer is such a clumsy API to use 😢

@alex
Copy link
Contributor

alex commented Nov 29, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants