Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RTree Buffer protocol, python binding tests #55

Merged
merged 11 commits into from
Aug 21, 2024

Conversation

H-Plus-Time
Copy link
Contributor

@H-Plus-Time H-Plus-Time commented Aug 20, 2024

So, from_buffer, to_buffer both work perfectly (I don't think there's really an equivalent for from_buffer in the buffer protocol).

The test_buffer_protocol alas triggers a segfault iff the numpy array originally passed to RTree.from_interleaved (and presumably from_separated) are not copied (that is, if boxes.__array__() is called without copy=True, which I simulated from the python side).

The getbuffer, releasebuffer methods are admittedly pulled straight from pyo3 tests, so I could well have flubbed something there 🤷 .

Re the removal of the abi3 feature flag - it looks like setting it in maturin flags interferes with multi-target wheel generation - my go-to for local simulation of what would happen in github actions is basically:

docker run --rm -v $(pwd)/..:/io -w /io/python ghcr.io/pyo3/maturin build -i {python3.8,python3.9,python3.10,python3.11}

I suspect mimicking bits of what connectorx has going over at (https://github.com/sfu-db/connector-x/blob/main/.github/workflows/release.yml) should be sufficient (really just the python matrix).

Update: Nope, copying before from_interleaved doesn't avoid the segfault - debugging does, though.

…plicit from_buffer, to_buffer methods (latter causes a copy to occur). TODO: add tests equiv to geoarrow-rs
@H-Plus-Time
Copy link
Contributor Author

Something about #47 (comment) makes me suspect the quasi-zero-copy behaviour in from_interleaved isn't by design - is OwnedRTree meant to be partially entangled with it's source numpy array?

@kylebarron
Copy link
Owner

We both independently figured out how to implement the buffer protocol hours apart kylebarron/arro3#156 ❤️

python/build.rs Outdated Show resolved Hide resolved
python/Cargo.toml Outdated Show resolved Hide resolved
@@ -6,6 +6,7 @@ build-backend = "maturin"
name = "geoindex-rs"
requires-python = ">=3.8"
dependencies = []
dynamic = ["version"]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This says that the version key is dynamic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep - it delegates to the build tool (usually intended for git tag based releasing, but in this case, the source of truth for the version is the Cargo.toml), figured it out when setting up packaging (toml linting wouldn't shut up about the [project] section being invalid 😅 ) via https://packaging.python.org/en/latest/guides/writing-pyproject-toml/#static-vs-dynamic-metadata

python/src/rtree.rs Outdated Show resolved Hide resolved
python/src/rtree.rs Outdated Show resolved Hide resolved
python/src/rtree.rs Outdated Show resolved Hide resolved
python/src/rtree.rs Outdated Show resolved Hide resolved
python/src/rtree.rs Outdated Show resolved Hide resolved
python/src/rtree.rs Outdated Show resolved Hide resolved
@kylebarron
Copy link
Owner

I think it might be worth splitting this up into two PRs: one for import and one for export. I think the export might be simpler because there's prior work to go off.

The export PR can ensure that creating a memoryview in Python from this buffer works, and it can also update the wheel builds to take off the abi3 tag and create a matrix in the github actions to build a wheel for each python version.

Then the import PR might need a little more discussion. I think we need to figure out how to create a &[u8] from a Python object that exports the buffer protocol. I've also been thinking that to best support importing the buffer protocol, it would be simplest to change the API to be primarily functional. I don't think we can ever create an OwnedRTree from a Python buffer protocol object without copying. But we can create an RTreeRef from a &[u8]. So a top-level function like search_rtree could work on numpy-provided data.

Along with this buffer protocol work, I was considering whether to update geo-index to use Arrow instead of numpy. But maybe that's because I can wrap my head around zero-copy arrow better thus far.

@kylebarron
Copy link
Owner

really just the python matrix

yeah I think we just need the python matrix

}
#[cfg(any(Py_3_11, not(Py_LIMITED_API)))]
pub unsafe fn __releasebuffer__(&self, _view: *mut ffi::Py_buffer) {
// is there anything to do here?
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. In kylebarron/arro3#156 I hold an Option<> in the buffer and manually move the value out of the option to drop it. But I'm not sure if eventually Python would garbage collect the object, at which point it would be dropped

@H-Plus-Time
Copy link
Contributor Author

H-Plus-Time commented Aug 21, 2024

I think we need to figure out how to create a &[u8] from a Python object that exports the buffer protocol

Starting from PyO3/pyo3#2824 (comment) , I think I might have just figured this and a functional search interface out 🤯 (based on RTreeRef). The transmute is mildly alarming, but quite robust (the 'satisfies the buffer protocol', readonly and contiguity guards above protect you quite a bit).

The usage so far that I've tested (all results are completely identical):

# read in a pre-serialized index, as well as pre-generated interleaved boxes
with open("foo.dat", "rb") as f:
    serialized: bytes = f.read()
boxes = np.load("boxes.npy", allow_pickle=False)

view = memoryview(serialized)
res = search_rtree(serialized, min_x, min_y, max_x, max_y)
res = search_rtree(view, min_x, min_y, max_x, max_y)

tree_instance = RTree.from_interleaved(boxes)

# tree_instance appears as just another bytes-like object 
res = search_rtree(tree_instance, min_x, min_y, max_x, max_y)
view = memoryview(tree_instance)
res = search_rtree(view, min_x, min_y, max_x, max_y)

exact_np_view = np.frombuffer(serialized, 'u1')
res = search_rtree(exact_np_view, min_x, min_y, max_x, max_y)
incorrect_np_view = exact_np_view.view('f8')
# Will raise with 'buffer contents are not compatible with u8' 💥
res = search_rtree(incorrect_np_view, min_x, min_y, max_x, max_y)

@H-Plus-Time
Copy link
Contributor Author

It looks like it's possible to retain a struct-based approach, but the RTreeInner enum (and the many methods that match on it) would need an extra variant per bit-width for the ref counterparts - I agree, its probably quite onerous.

Also fairly likely that there's non-trivial safety issues keeping a long-lived RTreeRef around, especially if it's built atop a numpy u8 view (modifying the array said view is based on is still allowed even if the view is readonly).

I'll stick the imports in a secondary PR.

python/src/rtree.rs Outdated Show resolved Hide resolved
@kylebarron
Copy link
Owner

Are you still able to trigger a segfault with this code?

@H-Plus-Time
Copy link
Contributor Author

Are you still able to trigger a segfault with this code?

Nope - the OpenDAL-derived one avoided that

@kylebarron
Copy link
Owner

Thanks @H-Plus-Time! I'm going to do some updates after merging for impot

@kylebarron kylebarron merged commit bf91b87 into kylebarron:main Aug 21, 2024
6 checks passed
@kylebarron
Copy link
Owner

@H-Plus-Time wheel builds are failing after this PR

https://github.com/kylebarron/geo-index/actions/runs/10581888047/job/29320301109

image

Do you have any thoughts?

@kylebarron
Copy link
Owner

Trying to fix this in #58

@H-Plus-Time
Copy link
Contributor Author

Ach, sorry, I really should have simulated that. I ended up needing to do --find-interpreter or -i <specific binary name> a few times locally as well - that seems to be what actually sorted it.

@kylebarron
Copy link
Owner

Yeah, that and we didn't actually need to create our own matrix of Python versions. maturin does that for us

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants