Improving ergonomics of getting slices from a PyBuffer #2824

itamarst · 2022-12-20T20:23:21Z

I'm writing a function that accepts anything that can create a read-only buffer:

fn validate_cbor(&self, py: Python<'_>, cbor: &PyAny) -> PyResult<()> {
        let buffer = PyBuffer::<u8>::get(cbor)?;
        if !buffer.readonly() {
            return Err(PyValueError::new_err("Must be read-only byte buffer"));
        }
        let slice = buffer
            .as_slice(py)
            .ok_or_else(|| PyTypeError::new_err("Must be a contiguous sequence of bytes"))?;

        // Safety: The slice is &[ReadOnlyCell<u8>]. A ReadOnlyCell has the same
        // memory representation as the underlying data; it's
        // #[repr(transparent)] newtype around UnsafeCell. And per Rust docs
        // "UnsafeCell<T> has the same in-memory representation as its inner
        // type T". So the main issue is whether the data is _really_ read-only.
        // We do the read-only check above, and yes a caller can probably somehow
        // lie, but if they do that, that's really their fault.
        let cbor: &[u8] = unsafe { std::mem::transmute(slice) };
        
        // ... custom code here
}

It's not clear to me why PyBytes::as_bytes() shouldn't return &[ReadOnlyCell<u8>] too. After all, there is nothing preventing a particularly malicious user from overwriting the data in a Python bytes object, just need to write a little bit of C. We need to trust callers to some extent in a world where they can write C.

So my feeling is that as_slice() can just return Some(&[T]), unless there's some reason to expect buffer API implementations to lie about their readonly flag.

As a follow-up, if that makes sense then maybe the pattern above could be simplified and built-in to PyO3.

The text was updated successfully, but these errors were encountered:

itamarst · 2022-12-20T20:25:56Z

Taking a step back: accepting read-only buffers seems like a reasonably common use case for people who want memory-efficient APIs, so it'd be nice to have good ergonomics, especially for people who haven't heard about transmute and/or don't feel confident reasoning about whether they can use it.

adamreichold · 2022-12-21T08:33:41Z

After all, there is nothing preventing a particularly malicious user from overwriting the data in a Python bytes object, just need to write a little bit of C. We need to trust callers to some extent in a world where they can write C.

Writing C code is not required, at least when NumPy is involved: If I change your example into

fn validate_cbor(py: Python<'_>, cbor: &PyAny, cb: &PyAny) -> PyResult<()> {
    let buffer = PyBuffer::<u8>::get(cbor)?;
    if !buffer.readonly() {
        return Err(PyValueError::new_err("Must be read-only byte buffer"));
    }
    let slice = buffer
        .as_slice(py)
        .ok_or_else(|| PyTypeError::new_err("Must be a contiguous sequence of bytes"))?;

    let before = slice.iter().map(|cell| cell.get()).collect::<Vec<_>>();

    cb.call0()?;

    let after = slice.iter().map(|cell| cell.get()).collect::<Vec<_>>();

    assert_eq!(before, after);

    Ok(())
}

to include a callback cb which enables interleaving of Rust and Python code, then checking that before and after are equals can fail if called like

def test_validate_cbor():
    x = np.arange(0, 10, dtype=np.uint8)

    y = x.view()
    y.flags.writeable = False

    def cb():
        x[3] = 42

    validate_cbor(y, cb)

yielding

>       validate_cbor(y, cb)
E       pyo3_runtime.PanicException: assertion failed: `(left == right)`
E         left: `[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]`,
E        right: `[0, 1, 2, 42, 4, 5, 6, 7, 8, 9]`

which would not be sound if slice: &[T] instead of slice: &[ReadonlyCell<T>].

Of course, the view y into x "lies" about its read-only flag in a certain sense, i.e. one cannot manipulate the underlying memory via y, e.g.

>       y[3] = 23
E       ValueError: assignment destination is read-only

but it can still be manipulated using Python code via x.

For this reason, rust-numpy dynamically borrow checks all memory accesses done by safe Rust code so that safe Rust code cannot be the originator of memory unsafety, but considers both other native extension modules as well as Python code unsafe, meaning trusted to manually uphold aliasing discipline, c.f. https://docs.rs/numpy/latest/numpy/borrow/index.html

itamarst · 2022-12-21T12:59:40Z

Hm, yeah, NumPy views are a compelling argument that you can't trust the readonly flag in the general case.

So perhaps a more modest approach would be just keeping people from having to do transmutes themselves?

use std::mem::size_of;

impl<T> PyBuffer<T> {
    /// Safety: ... document constraints here ...
    unsafe fn as_regular_slice(&self) -> Option<&[T]> {
        if !self.readonly() {
            return None;
        }
        let slice = self.as_slice()?;
        const _: () = assert!(
            size_of::<ReadOnlyCell<T>>() == size_of::<T>(),
            "ReadOnlyCell<T> is different size than T"
        );
        Some(unsafe { std::mem::transmute::<&[ReadOnlyCell<T>], &[T]>(slice) })
    }
}

adamreichold · 2022-12-21T13:04:25Z

Hm, yeah, NumPy views are a compelling argument that you can't trust the readonly flag in the general case.

As hinted at above, for rust-numpy we took the opposite approach. We do trust Python code as we trust other "unsafe" code. (We consider the read-only flag, but only to prevent safe Rust code writing to read-only buffers, not to assume immutability.)

So perhaps a more modest approach would be just keeping people from having to do transmutes themselves?

Yeah, I think such methods make sense, but they should come with extensive documentation on the issue of shared mutability in Python code.

kylebarron · 2024-08-21T16:40:33Z

FWIW I had to do the same thing in kylebarron/geo-index#57 and I think
std::slice::from_raw_parts is a bit cleaner than a transmute. The transmute relies on #[repr(transparent)], which feels a bit shaky, and as_slice needs a py token.

let buffer = PyBuffer::<u8>::get_bound(obj)?;
if !buffer.readonly() {
    return Err(PyValueError::new_err("Must be read-only byte buffer."));
}
if buffer.dimensions() != 1 {
    return Err(PyValueError::new_err("Expected 1-dimensional array."));
}
// Note: this is probably superfluous for 1D array
if !buffer.is_c_contiguous() {
    return Err(PyValueError::new_err("Expected c-contiguous array."));
}
if buffer.len_bytes() == 0 {
    return Err(PyValueError::new_err("Buffer has no data."));
}

let len = buffer.item_count();
let data = buffer.buf_ptr() as *const u8;
unsafe { std::slice::from_raw_parts(data, len) }

davidhewitt · 2024-08-23T12:43:30Z

I worry that slicing into Python-mutable memory like this risks UB due to optimizations that rustc might do due to the assumption of immutability. I think in practice this UB would most likely show up as data races on reading the buffer when there is also a writer, but I have no idea if the compiler might choose to do more destructive optimizations.

I wonder if a safer option than copying directly from the Python buffer and violating Rust's aliasing rules is potentially to allocate a temporary stack-based buffer (say, [u8; 64]) and copy chunks from the buffer onto the stack, and process those iteratively. This has probably got a terrible performance price for working in chunks, but until Python can give us a buffer API where the contract promises an immutable buffer, that's the only safe way I can think of to hand out slices of &[T].

kylebarron · 2024-08-23T14:26:39Z

Hmm. Is this worry specifically around not holding the GIL while accessing the slices?

I later read the code more and saw that PyBuffer::as_slice uses the same std::slice::from_raw_parts, just wraps it in a ReadOnlyCell.

pyo3/src/buffer.rs

Lines 370 to 381 in ea6a0aa

    
               pub fn as_slice<'a>(&'a self, _py: Python<'a>) -> Option<&'a [ReadOnlyCell<T>]> { 
        
                   if self.is_c_contiguous() { 
        
                       unsafe { 
        
                           Some(slice::from_raw_parts( 
        
                               self.0.buf as *mut ReadOnlyCell<T>, 
        
                               self.item_count(), 
        
                           )) 
        
                       } 
        
                   } else { 
        
                       None 
        
                   } 
        
               }

I was thinking that directly calling std::slice::from_raw_parts while not holding the GIL was "unsafe" in the sense that the Python user needs to know not to mutate the buffer they're passing in.

alanhdu · 2024-08-26T00:17:34Z

I also have a use-case for this (trying to expose a Rust API that uses &[u8]) that I'd like to use Python's buffer protocol to allow zero-copy access. Rewriting the Rust code to take &[ReadOnlyCell<T>] or a Iterator<u8> (or whatever) is probably not going to happen.

IIUC, the problem with the buffer is basically the one capture din https://alexgaynor.net/2022/oct/23/buffers-on-the-edge/: & implies exclusive access, while there's nothing actually enforcing that for Python, leading to some fundamental issues. So I get that there's unsafety here, but I still think having some-kind of easy zero-copy way to convert a Python object w/ a buffer protocol into a &[u8] would be extremely useful (since the entire point of the buffer protocol is zero-copy data transfers). Even just exposing an unsafe function (with documentation about the problem) would (IMO) be an improvement.

That said, from a user's POV I do like pyo3-numpy's take on this, where we consider Python code "unsafe" (and thus trusted to manually uphold the mutability discipline). It strikes me as a pragmatic compromise, since the majority of Python code is single-threaded anyways and so in practice you'll get exclusive access to the buffer most of the time.

itamarst added the enhancement label Dec 20, 2022

itamarst changed the title ~~Maybe PyBuffer::as_slice() should just return Option<&[T]>~~ Improving ergonomics of getting slices from a PyBuffer Dec 21, 2022

H-Plus-Time mentioned this issue Aug 21, 2024

RTree Buffer protocol, python binding tests kylebarron/geo-index#55

Merged

kylebarron mentioned this issue Oct 1, 2024

Feature requests kylebarron/arro3#195

Open

kylebarron mentioned this issue Dec 16, 2024

Freethreading - module markers + CI kylebarron/arro3#275

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improving ergonomics of getting slices from a PyBuffer #2824

Improving ergonomics of getting slices from a PyBuffer #2824

itamarst commented Dec 20, 2022

itamarst commented Dec 20, 2022

adamreichold commented Dec 21, 2022

itamarst commented Dec 21, 2022

adamreichold commented Dec 21, 2022

kylebarron commented Aug 21, 2024

davidhewitt commented Aug 23, 2024 •

edited

Loading

kylebarron commented Aug 23, 2024

alanhdu commented Aug 26, 2024 •

edited

Loading

Improving ergonomics of getting slices from a PyBuffer #2824

Improving ergonomics of getting slices from a PyBuffer #2824

Comments

itamarst commented Dec 20, 2022

itamarst commented Dec 20, 2022

adamreichold commented Dec 21, 2022

itamarst commented Dec 21, 2022

adamreichold commented Dec 21, 2022

kylebarron commented Aug 21, 2024

davidhewitt commented Aug 23, 2024 • edited Loading

kylebarron commented Aug 23, 2024

alanhdu commented Aug 26, 2024 • edited Loading

davidhewitt commented Aug 23, 2024 •

edited

Loading

alanhdu commented Aug 26, 2024 •

edited

Loading