Changed PyByte::new_init and PyByteArray::new_init such that init can fail #1083
Conversation
The failing test is an interesting one. When I run that test in isolation, it succeeds. When I run it in combination with other tests it mostly fails but sometimes succeeds. Do you have an idea of where I introduced that probabilistic bug?
Thank you, but isn't it too complex?
src/types/bytearray.rs (Outdated)
@@ -23,36 +23,55 @@ impl PyByteArray {
/// Creates a new Python `bytearray` object with an `init` closure to write its contents.
/// Before calling `init` the bytearray is zero-initialised.
/// * If `init` returns `Err(e)`, `new_with` will return `Err(e)`.
/// * If `init` returns `Ok(new_len)`, the allocated bytearray will be truncated to length
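The semantics spelled out in this doc comment can be modelled in plain Rust. In the sketch below, `Vec<u8>` stands in for the Python-owned bytearray and `String` stands in for `PyErr`; the function name and error type are illustrative, not PyO3's actual API:

```rust
// A plain-Rust model of the `new_with` semantics under discussion:
// allocate, zero-initialise, run `init`, propagate errors, and
// truncate to the returned length.
fn new_with<F>(len: usize, init: F) -> Result<Vec<u8>, String>
where
    F: FnOnce(&mut [u8]) -> Result<usize, String>,
{
    // The buffer is zero-initialised before `init` is called.
    let mut buf = vec![0u8; len];
    // If `init` returns Err(e), new_with returns Err(e); the buffer
    // is dropped (deallocated) on the way out.
    let new_len = init(buf.as_mut_slice())?;
    // If `init` returns Ok(new_len), truncate to that length.
    buf.truncate(new_len);
    Ok(buf)
}
```

On the `Err` path the allocation is freed by `Vec`'s destructor; the PyO3 version has to do the equivalent `Py_DECREF` by hand, which is what the later comments discuss.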
Do we really want to truncate the bytes?
I cannot imagine any use case.
We can easily remove the truncation again - it was mentioned by @davidhewitt as a potential API, so I included it so we can see how it would look. For the bytes, the idea is that since `init` implies you are still generating the content, you might not know the exact size of the bytearray / bytestring up front, just an upper bound.
Imagine you want to serialize some large dataset directly to Python. You may not know its final length, so with this API you can pre-allocate the memory in Python, write to it, and then return the number of bytes written so that the array can be safely truncated.
I suppose that an alternative approach would be for the argument to `init` to be a mutable reference to some custom struct instead of `&mut [u8]`:
pub struct UninitializedPyBytes<'a>(*mut ffi::PyObject, Python<'a>);

impl<'a> UninitializedPyBytes<'a> {
    /// Resize the byte buffer, zero-initializing any additional storage allocated.
    /// May return Err on OOM.
    pub fn resize(&mut self, new_len: usize) -> PyResult<()> { /* ... */ }
    /// Access the underlying bytes.
    pub fn as_bytes(&mut self) -> &mut [u8] { /* ... */ }
}
We could also implement `DerefMut<Target = [u8]>`, and maybe `Read` and `Write` for this. `Write` could even support automatically growing the allocation.
Would we support the same interface for both `PyBytes` and `PyByteArray`? I don't think we need to implement `io::Read` here, though. Regarding automatic resizing by `io::Write`, should we follow the simple size x 2 rule? Would we also automatically truncate the bytes after writing (which I found requires an additional struct that implements `Write`, references the bytes, and implements `Drop`, so we ensure that the truncation to the eventual length is performed after writing has completed)?
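The doubling-growth plus truncate-on-finish behaviour being discussed can be sketched in plain Rust. Here `Vec<u8>` stands in for the Python-owned buffer, and `BytesWriter`/`finish` are illustrative names, not a proposed PyO3 API:

```rust
use std::io::{self, Write};

// Sketch of an auto-growing writer: grows the allocation by doubling
// and records the eventual length so a final truncation can happen
// once writing is complete.
struct BytesWriter {
    buf: Vec<u8>,   // stands in for the Python-owned allocation
    written: usize, // number of bytes actually written so far
}

impl BytesWriter {
    fn with_capacity(cap: usize) -> Self {
        BytesWriter { buf: vec![0u8; cap], written: 0 }
    }

    // Finish writing: truncate to the number of bytes actually written.
    fn finish(mut self) -> Vec<u8> {
        self.buf.truncate(self.written);
        self.buf
    }
}

impl Write for BytesWriter {
    fn write(&mut self, data: &[u8]) -> io::Result<usize> {
        // Grow by doubling until the write fits (the "size x 2 rule");
        // resize zero-fills any newly allocated storage.
        while self.written + data.len() > self.buf.len() {
            let new_len = (self.buf.len() * 2).max(1);
            self.buf.resize(new_len, 0);
        }
        self.buf[self.written..self.written + data.len()].copy_from_slice(data);
        self.written += data.len();
        Ok(data.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}
```

In the real API the final truncation would have to happen in a `Drop` impl (or an explicit finishing method), as the comment above notes, so that it runs even if the caller stops writing early.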
Yep, I guess `Read` is redundant. And yeah, I was thinking the struct could have a private method which finishes the writing, truncates if needed, and returns a finished `PyBytes` object.
Truncation has been removed in 4b3422e
src/types/bytearray.rs (Outdated)
let py = gil.python();
let locals = PyDict::new(py);
py.run(
I don't like this kind of test, since it can be really unstable because of GC.
We can confirm that `DECREF` is called, and that's sufficient. I think it's too fragile to test the memory difference.
If just the comment is ok, I'll gladly remove that test :)
This test has been removed in 4b3422e
Simply, please do not assert the memory difference.
And I'm also doubtful about assuming
I don't think the memory difference assert is the problem (unless I'm misreading the error log). I think it is a more fundamental memory access issue.
Personally, I am using the Python allocated bytes to serialise to with serde. As I only get write access to those bytes inside
src/types/bytearray.rs (Outdated)
Err(e) => {
    // Deallocate pyptr to avoid leaking the bytearray
    ffi::Py_DECREF(pyptr);
    return Err(e);
}
An interesting thought about the `UninitializedPyBytes` API is that it could have a `Drop` impl which automatically deallocates the pointer, which would mean that it'd happen automatically on `Err` return without us having to do anything.
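That Drop-based cleanup idea can be sketched without PyO3: a guard bumps a counter on creation (standing in for the allocation / reference it owns) and releases it in `Drop` unless ownership is explicitly handed off, so an early `Err` return cleans up automatically. All names here are illustrative:

```rust
use std::cell::Cell;

// A guard that owns a "live allocation" and releases it in Drop
// unless `release` defuses it first. The Cell<usize> counter stands
// in for the Python refcount manipulated by Py_DECREF.
struct Guard<'a> {
    live: &'a Cell<usize>, // counts live allocations
    armed: bool,
}

impl<'a> Guard<'a> {
    fn new(live: &'a Cell<usize>) -> Self {
        live.set(live.get() + 1); // "allocate" the resource
        Guard { live, armed: true }
    }

    // Success path: ownership is handed off, so Drop must not clean up.
    fn release(mut self) {
        self.armed = false;
    }
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        if self.armed {
            self.live.set(self.live.get() - 1); // "deallocate" (cf. Py_DECREF)
        }
    }
}
```

With such a guard in `new_with`, the explicit `ffi::Py_DECREF(pyptr)` in the `Err` arm above becomes unnecessary: dropping the guard on any early return does it.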
Segmentation fault is an interesting one - did you observe this locally?
For me, the type of error also changes randomly but has included
Having seen how this API is evolving, there seem to be a few different tradeoffs in usage, some open questions about the best design for the API, and questions about what users want from it. I'm beginning to wonder if this API should be moved to a separate crate for now. If so, I'd be very happy to mention this crate in the README.
Exposing this in a separate crate would be an interesting approach, though I am not sure I can promise to have the time to support it (I am currently doing an internship, during which the original issue came up, and will be doing my final-year project at university next year). Still, I would be happy to explore different API designs. For the main
A small crate with just a couple of functions hopefully won't generate many support requests ;) Perhaps removing the truncate option from pyo3 but allowing the API to be fallible is a good compromise. @kngwyu, what route do you prefer?
OK, I'm happy to know the use-case.
Agreed.
Hmm ... 🤔
For SIGSEGV: since this kind of bug is really difficult to debug, I'm going to debug it myself.
src/types/bytearray.rs (Outdated)
let buffer = ffi::PyByteArray_AsString(pyptr) as *mut u8;
debug_assert!(!buffer.is_null());
// Zero-initialise the uninitialised bytearray
std::ptr::write_bytes(buffer, 0u8, len);
// (Further) Initialise the bytearray in init
- init(std::slice::from_raw_parts_mut(buffer, len));
- pybytearray
+ match init(std::slice::from_raw_parts_mut(buffer, len)) {
I've just noticed that if `init` panics, then this will leak memory =(
I'm going to add `Py::into_ref` later today, which I think you should be able to use to avoid this case.
(The solution I'm thinking of would be that instead of `if pyptr.is_null()`, you can call `Py::from_owned_ptr_or_err` to take ownership of the pointer, and then `Py::into_ref(py)` at the end of the function.)
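The effect of the suggested change can be modelled in plain Rust: take ownership of the allocation in an owning type *before* running `init`, so that both an `Err` return and a panic inside `init` free the memory through the owner's destructor. Here `Box<[u8]>` stands in for `Py<PyByteArray>` and `String` for `PyErr`:

```rust
// Plain-Rust analogue of the proposed fix: owning the allocation
// from the start means there is no raw pointer left to leak.
fn new_with<F>(len: usize, init: F) -> Result<Box<[u8]>, String>
where
    F: FnOnce(&mut [u8]) -> Result<(), String>,
{
    // Ownership is taken immediately (cf. Py::from_owned_ptr_or_err).
    let mut owned: Box<[u8]> = vec![0u8; len].into_boxed_slice();
    // Whether `init` returns Err or panics, `owned` is dropped and
    // the allocation is freed; no manual cleanup is needed.
    init(&mut owned[..])?;
    Ok(owned)
}
```

Note the signature here matches the final (non-truncating) API: `init` returns `Ok(())` rather than a new length.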
Thanks for finding this issue! What will `Py::into_ref` do? Will I still have to call `ffi::Py_DECREF(pyptr)` on an `Err` in `init` to deallocate it, or will this be taken care of as I now have a `&PyBytes` reference from the beginning?
You'll have `Py<PyBytes>` from the beginning, and `.into_ref(py)` will convert this into `&PyBytes`.
It'll take care of deallocating automatically 🚀
Ah, I was slightly confused by having both `Py::from_owned_ptr_or_err(py, ptr)` and `py.from_owned_ptr_or_err(ptr)` :)
The code for the `new_with` function is actually starting to look very clean and concise - I look forward to using `Py<T>::into_ref(py)`.
Thanks!
See #1098 - hopefully you'll have this API soon!
`Py::into_ref` is now available to use 🚀
I just hope I didn't mess up merging anything ...
I did fail in one way, though, and had to rewrite some of my changes (I had prototyped everything with a declaration of `Py::into_ref`) - so I might have missed something.
We don't observe the segfault now.
It must have been related to that test somehow - maybe the dropping of the
LGTM! Thanks for your continued improvements to this API!
@kngwyu What do you think about the modified API?
LGTM, thanks!
@davidhewitt @kngwyu Thank you for your continued feedback! Just out of interest, do you have a rough schedule already for the next release of
I was wondering about this myself yesterday, so have opened an issue to collect opinions. See #1104
This is a followup from #1074 to allow `PyBytes::new_with` and `PyByteArray::new_with` to fail during initialisation.