RTree Buffer protocol, python binding tests #55
Conversation
…plicit from_buffer, to_buffer methods (latter causes a copy to occur). TODO: add tests equiv to geoarrow-rs
Something about #47 (comment) makes me suspect the quasi-zero-copy behaviour in from_interleaved isn't by design - is OwnedRTree meant to be partially entangled with its source numpy array?

We both independently figured out how to implement the buffer protocol hours apart kylebarron/arro3#156 ❤️
```diff
@@ -6,6 +6,7 @@ build-backend = "maturin"
 name = "geoindex-rs"
 requires-python = ">=3.8"
 dependencies = []
+dynamic = ["version"]
```
This says that the version key is dynamic?
Yep - it delegates to the build tool. It's usually intended for git-tag-based releasing, but in this case the source of truth for the version is the Cargo.toml. I figured this out when setting up packaging (the TOML linter wouldn't shut up about the [project] section being invalid 😅) via https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#static-vs-dynamic-metadata
I think it might be worth splitting this up into two PRs: one for import and one for export. I think the export might be simpler because there's prior work to go off. The export PR can ensure that creating a memoryview in Python from this buffer works, and it can also update the wheel builds to take off the

Then the import PR might need a little more discussion. I think we need to figure out how to create a

Along with this buffer protocol work, I was considering whether to update geo-index to use Arrow instead of numpy. But maybe that's because I can wrap my head around zero-copy arrow better thus far.

yeah I think we just need the python matrix
…ndal-esque (basically verbatim) __getbuffer__
```rust
}

#[cfg(any(Py_3_11, not(Py_LIMITED_API)))]
pub unsafe fn __releasebuffer__(&self, _view: *mut ffi::Py_buffer) {
    // is there anything to do here?
```
I don't know. In kylebarron/arro3#156 I hold an Option<> in the buffer and manually move the value out of the option to drop it. But I'm not sure if eventually Python would garbage collect the object, at which point it would be dropped.
Starting from PyO3/pyo3#2824 (comment), I think I might have just figured this and a functional search interface out 🤯 (based on RTreeRef). The transmute is mildly alarming, but quite robust (the 'satisfies the buffer protocol', readonly and contiguity guards above protect you quite a bit). The usage so far that I've tested (all results are completely identical):

```python
# read in a pre-serialized index, as well as pre-generated interleaved boxes
with open("foo.dat", "rb") as f:
    serialized: bytes = f.read()
boxes = np.load("boxes.npy", allow_pickle=False)
view = memoryview(serialized)

res = search_rtree(serialized, min_x, min_y, max_x, max_y)
res = search_rtree(view, min_x, min_y, max_x, max_y)

tree_instance = RTree.from_interleaved(boxes)
# tree_instance appears as just another bytes-like object
res = search_rtree(tree_instance, min_x, min_y, max_x, max_y)
view = memoryview(tree_instance)
res = search_rtree(view, min_x, min_y, max_x, max_y)

exact_np_view = np.frombuffer(serialized, 'u1')
res = search_rtree(exact_np_view, min_x, min_y, max_x, max_y)

incorrect_np_view = exact_np_view.view('f8')
# Will raise with 'buffer contents are not compatible with u8' 💥
res = search_rtree(incorrect_np_view, min_x, min_y, max_x, max_y)
```
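As an aside, the 'u8-compatibility' guard behaviour mentioned above can be mimicked from pure Python with memoryview. This is an illustrative sketch only (check_u8_buffer is a hypothetical helper, not the actual Rust-side validation):

```python
import numpy as np

def check_u8_buffer(obj) -> memoryview:
    # Roughly mirrors the format/contiguity guards described above:
    # accept only u8-shaped, C-contiguous buffers
    view = memoryview(obj)
    if view.format not in ("B", "b", "c"):
        raise ValueError("buffer contents are not compatible with u8")
    if not view.c_contiguous:
        raise ValueError("buffer must be C-contiguous")
    return view

serialized = bytes(range(16))
ok = check_u8_buffer(serialized)                        # bytes export format 'B'
ok2 = check_u8_buffer(np.frombuffer(serialized, "u1"))  # exact u8 view passes

try:
    check_u8_buffer(np.frombuffer(serialized, "f8"))    # f8 view (format 'd') is rejected
    raised = False
except ValueError:
    raised = True
```

The same bytes object passes whether wrapped in a memoryview, left as raw bytes, or viewed as a u1 numpy array, because all three export the same 'B'-format buffer.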
…~3.0 by the buffer protocol), from_buffer
It looks like it's possible to retain a struct-based approach, but the RTreeInner enum (and the many methods that match on it) would need an extra variant per bit-width for the ref counterparts - I agree, it's probably quite onerous. There are also fairly likely non-trivial safety issues in keeping a long-lived RTreeRef around, especially if it's built atop a numpy u8 view (modifying the array that view is based on is still allowed even if the view itself is readonly). I'll stick the imports in a secondary PR.
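That mutation hazard is easy to demonstrate from pure Python (a sketch; the variable names are illustrative):

```python
import numpy as np

# A 'readonly' u8 view over an f8 array...
boxes = np.zeros(4, dtype="f8")
u8_view = boxes.view("u1")
u8_view.flags.writeable = False

snapshot = u8_view.tobytes()
boxes[0] = 123.0  # ...doesn't stop writes through the base array
changed = u8_view.tobytes() != snapshot
```

Marking the view readonly only blocks writes through that view; anything holding a long-lived reference built on it (like an RTreeRef) would see the bytes change underneath.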
…on might not actually be necessary), try out matrixed pytest
Are you still able to trigger a segfault with this code?

Nope - the OpenDAL-derived one avoided that

Thanks @H-Plus-Time! I'm going to do some updates after merging for import

@H-Plus-Time wheel builds are failing after this PR https://github.com/kylebarron/geo-index/actions/runs/10581888047/job/29320301109 Do you have any thoughts?

Trying to fix this in #58

Ach, sorry, I really should have simulated that. I ended up needing to do

Yeah, that and we didn't actually need to create our own matrix of Python versions. maturin does that for us
So, from_buffer and to_buffer both work perfectly (I don't think there's really an equivalent for from_buffer in the buffer protocol).

The test_buffer_protocol alas triggers a segfault iff the numpy array originally passed to RTree.from_interleaved (and presumably from_separated) is not copied (that is, if boxes.__array__() is called without copy=True, which I simulated from the python side). The getbuffer, releasebuffer methods are admittedly pulled straight from pyo3 tests, so I could well have flubbed something there 🤷.
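For context, the copy-vs-view distinction being simulated looks roughly like this (np.array(..., copy=True) stands in here for the __array__(copy=True) call; illustrative only):

```python
import numpy as np

boxes = np.zeros((2, 4), dtype="f8")

view = np.asarray(boxes)             # no copy: shares memory with boxes
copied = np.array(boxes, copy=True)  # explicit copy: independent buffer

boxes[0, 0] = -1.0                   # leaks into the view, not the copy
```

Whether the Rust side ends up holding the shared buffer or the independent copy is exactly what determines if a later mutation (or deallocation) on the Python side can reach into the tree's backing memory.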
Re the removal of the abi3 feature flag - it looks like setting it in maturin flags interferes with multi-target wheel generation. My go-to for local simulation of what would happen in github actions is basically:

```shell
docker run --rm -v $(pwd)/..:/io -w /io/python ghcr.io/pyo3/maturin build -i {python3.8,python3.9,python3.10,python3.11}
```

I suspect mimicking bits of what connectorx has going on over at https://github.com/sfu-db/connector-x/blob/main/.github/workflows/release.yml should be sufficient (really just the python matrix).
Update: Nope, copying before from_interleaved doesn't avoid the segfault - debugging does, though.