How to consume a single buffer & connection to array interchange #39
The way we handle this in cudf is we have a
Actually in our case it directly is. We often hold a reference to a cupy array or numba device array or pytorch tensor as the underlying owner of memory underneath a Buffer. We also directly implement it, i.e. we do this in a few places with string columns, where we use
Yes, for numerical and other columns that can be represented by a single buffer it would be nice to support
That makes sense. This attribute:
is also needed in this protocol - except it can't be optional.
That looks similar (the principle, not the code) to how numpy deals with it in `PyArray_FromInterface(PyObject *origin)`:

```c
/* Get data buffer from interface specification */
attr = _PyDict_GetItemStringWithError(iface, "data");

/* Case for data access through pointer */
if (attr && PyTuple_Check(attr)) {
    PyObject *dataptr;
    dataptr = PyTuple_GET_ITEM(attr, 0);
    if (PyLong_Check(dataptr)) {
        data = PyLong_AsVoidPtr(dataptr);
        base = origin;
        ret = (PyArrayObject *)PyArray_NewFromDescrAndBase(
                &PyArray_Type, dtype,
                n, dims, NULL, data,
                dataflags, NULL, base);
```
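For reference, a small Python sketch of the same mechanism seen from the consumer side (the `Producer` class is just a made-up stand-in for any object exposing `__array_interface__`): NumPy stores the producing object as the new array's `base`, which is what keeps the memory alive.

```python
import numpy as np

class Producer:
    """Hypothetical object exposing memory it owns via __array_interface__."""
    def __init__(self):
        self._arr = np.arange(5, dtype="float64")              # owns the memory
        self.__array_interface__ = self._arr.__array_interface__

p = Producer()
view = np.asarray(p)     # goes through PyArray_FromInterface, no copy
print(view.base is p)    # True: the producer is stored as `base`, keeping the memory alive
```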
Those 3 bullet points of (possibly) supported APIs, is this for the Buffer level? Because it's probably also useful to have that on the Column level as well.
Might not be that relevant, but: Arrow does support byte masks in some way, i.e. as boolean arrays (it just can't use them as nulls to compute anything with, but you can store them). You may not be able to use byte masks in a plain Arrow array, but at the same time I don't think you can use byte masks in numpy's
In
Ah, cool, I didn't know that.
Indeed, the
Summarizing some of the main comments:
Action for me: create two implementations, a minimal and a maximal one, so it's easier to compare.
What dtypes does it not support that it needs to at the buffer level? I thought we had previously decided that buffers were untyped since they could be interpreted in different ways? Even if they're typed, I would assume they're
Oh right, resolution of that was "due to variable length strings, buffers must be untyped (dtype info lives on the column not the buffer)". There's still an inconsistency in that
And this goes back to the strides discussion: if buffers are untyped, does it really make sense to have strides on them versus controlling that on the Column level? E.g. take something like a theoretical uint128 array, where it would have a data buffer. If I knew the max value of the array was
Now that we are considering going all in on DLPack at the buffer level, a potential alternative to consider would be using the actual Arrow C Data Interface instead (at the column level)? When describing/discussing the requirements in #35, an argument against the Arrow C Data Interface is that we were looking for a Python

In the end, for primitive arrays, DLPack and the Arrow C Data Interface are basically the same (it's almost the same C struct; in Arrow there are only some additional null pointers for child/dictionary arrays that are not needed for a primitive array; e.g. implementing conversion to/from numpy for primitive arrays wouldn't be harder compared to DLPack). But the Arrow C Data Interface natively supports the more complex nested/multi-buffer cases we also want to support (e.g. variable length strings, categoricals). What's missing for using the Arrow C Data Interface is:
Both those points have been discussed / worked out for DLPack in the context of the Array API (I think? I didn't fully follow this), so transferring what has been learned from that to add the same capabilities to the Arrow C Data Interface could be a nice addition to Arrow and would IMO be a great outcome of the consortium efforts.
I think it was decided that strided data would be supported, which the Arrow C Data Interface doesn't support, so we'd need to propose adding that. Another key point with the Arrow C Data Interface is that there isn't a way to control the lifetime of individual buffers versus the column. E.g. for a string column, say I use the
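A small pyarrow sketch of the multi-buffer situation being described (illustrative values only): a string column is backed by three separate buffers - validity bitmap, offsets, and character data - which is exactly the case where one might want to release some buffers while keeping others alive.

```python
import pyarrow as pa

col = pa.array(["spark", "arrow", None, "dlpack"])
# A string column carries three buffers: validity bitmap, int32 offsets, UTF-8 data.
for i, buf in enumerate(col.buffers()):
    print(i, None if buf is None else buf.size)
```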
I looked back at the last meeting notes, and there are some notes about how to implement it (should the strides live at buffer or column level? etc), and not much about reasons to support it (but maybe we didn't capture everything in the notes).
And is this problematic to ensure? (honest question, this is out of my comfort zone)
I would say it adds some complexity but is manageable. It matters more on devices like GPUs with much more limited memory available, where every extra byte counts and we'd want to release buffers / free memory as aggressively as possible.
Yes, we should definitely reconsider, because it's better to reuse than reinvent. If it's not a good idea, we should document very clearly why. A few thoughts:
(2) and (3)
It's a bit unfortunate, but it may be worth giving up on striding indeed if it brings us other benefits.
Yes, that isn't too hard.
The DLPack device support model already existed; it was the CUDA/ROCm stream support that we figured out how to do. Arrow can learn from DLPack there, but all it would use is CUDA/ROCm, I believe. The vector lanes and bit-width stuff (and striding), which DLPack has because of support for FPGAs and other devices, is probably too foreign to mix in with Arrow.
Yes, I do agree with that.
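For concreteness, the device tuple that DLPack already standardizes looks like this (a minimal sketch, assuming NumPy >= 1.22 for `__dlpack_device__`; the device type values are from the DLPack header):

```python
import numpy as np

# DLPack device types relevant here: kDLCPU = 1, kDLCUDA = 2, kDLROCM = 10
x = np.arange(3)
print(x.__dlpack_device__())   # (1, 0) -> (kDLCPU, device id 0)
```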
I was preparing some more code changes, but summarizing these questions well for tomorrow's call will be more useful I think.
Yea there's still more work for me to do here 😅. We've ironed out the semantics on the Python side for dlpack, but we need to do the same for a C interface now. Once we have the learnings from that my plan was to point to that for making a proposal to the Arrow C data interface.
Since we would only use this for the column-level interchange (and not the full dataframe), and so we still have a Python-level API layer to communicate about how missing values are stored, I think we could agree on some "standard" way to use a StructArray with 2 arrays (values, mask) to represent a boolean-masked array.
Yes, I agree the boolean mask support can be added via a convention.
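A minimal pyarrow sketch of what such a convention could look like (the child names "values" and "mask" are only an example, nothing agreed on):

```python
import numpy as np
import pyarrow as pa

values = np.array([1.5, 2.5, 3.5])
mask = np.array([False, True, False])    # True = missing, as in numpy masked arrays

# One possible convention: a StructArray with two non-nullable children.
masked_col = pa.StructArray.from_arrays(
    [pa.array(values), pa.array(mask)],
    names=["values", "mask"],
)
print(masked_col.type)   # struct<values: double, mask: bool>
```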
Here is a summary of the "use Arrow C Data Interface" option:

Boolean masks

Arrow does not support them natively. However, it does support a boolean dtype, so it is possible to represent a column with a boolean mask in the Arrow C Data Interface through a naming convention. That will break the following specification in the Arrow C Data Interface spec:

"Mandatory ... The pointer to the null bitmap buffer, if the data type specifies one, MAY be NULL only if"

This should be okay, given that we still have a Python-only API so we can define the convention that a

Memory management at the buffer level

Quoting Keith:

"Another key point is with the Arrow C data interface is that there isn't a way to control lifetime of individual buffers versus the column ... and the only way to do that is to make sure the release callback isn't called in this situation. ... it adds some complexity but is manageable, but moreso on devices like GPUs with much more limited memory available, every extra byte counts where we'd want to as aggressively release buffers / free memory as possible."

Whether or not to manage memory at the column or the buffer level is a choice that must be made in the standard; it cannot be left up to the implementing libraries. Reason: managing at the buffer level only helps if both libraries do that, otherwise all memory stays alive anyway - and there isn't even a buffer-level deleter to call unless we add one. I think this would be a significant change to the Arrow C Data Interface.

Device support

Assumption: we want to only support CPU and GPU (CUDA, ROCm) for now, but make sure it is extensible to other device types later.

Current design, which uses a

Next steps for device support for Arrow:
Strided buffers
The less efficient support for row-based dataframes is a larger downside than for the numpy-based ones, because row-based dataframes will always have columns that are strided in memory. It'd be much nicer if Arrow (or our protocol, whether based on Arrow or not) supported striding.

How would column exchange now actually work
I'm not sure this is necessary - in

There's still a problem to exactly mirror this for dataframes. We still want a superset of what the Arrow C Data Interface offers: boolean masks, device support, perhaps deleters at the buffer level. So we're talking about what is basically either a fork or a v2 of the Arrow C Data Interface. It may be helpful to sketch the calling code:

```python
# User calls
df = consumer_lib.from_dataframe(df_other, columns=['A', 'B'])

# What happens inside `from_dataframe`:
dfobj = df_other.__dataframe__().get_columns(columns)
cols = dict()
for name in dfobj.column_names():
    cols[name] = convert_column_to_my_native_format(dfobj.get_column_by_name(name))

# Instantiate our own dataframe:
df_out = mylib.DataFrame(cols)

# That native conversion function can use Arrow (maybe)
def convert_column_to_my_native_format(column):
    # Check if null representation is supported by Arrow natively
    if column.describe_null() == 4:  # byte mask (in future, use enum)
        # handle convention in custom implementation, cannot rely directly on Arrow here
        ...
    # This function is all compiled code
    return mylib._use_arrow_c_interface(column)
```
Do we know of row-based dataframe libraries in Python that can give access to columns as a strided array? (apart from numpy recarrays)
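As a point of reference for what "columns as a strided array" means, a minimal numpy structured-array example (the recarray case mentioned above):

```python
import numpy as np

# A row-oriented table: each record holds (int64, float64) contiguously in memory.
rows = np.zeros(4, dtype=[("a", "int64"), ("b", "float64")])

col_a = rows["a"]           # zero-copy view of one column
print(col_a.strides)        # (16,) - the stride equals the record size, not the itemsize
print(col_a.base is rows)   # True - the column view keeps the owning array alive
```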
Isn't that basically what
Thanks, concrete code snippets are always helpful! ;)
I'd expect Koalas to be able to do this (disclaimer, I don't know about its internals). Or maybe one of the Ibis backends.
Well, there's an opaque object that's not meant to be unpacked or even seen by the user. All a user would do is call
Yes. My point was that there are conventions (like for boolean mask) and extras that are TBD (device support, buffer-level deleters), so we can't say "this uses the Arrow C Data Interface, so you can use your existing implementation to parse it". There's no working C code to reuse, there's only the structs in the Arrow spec that we'd take over.
Since Koalas is Spark under the hood, I suspect they would use Arrow for efficient Spark->Python data transfer (but I don't know about its internals either).
But meant to be unpacked by library authors? So we still need the same for a potential Arrow C Data interface?
To better understand what you are referring to / looking for: what would be the equivalent for DLPack for this? Does it have a standalone C implementation that can be reused / exposed as a python library?
Regarding the boolean masks, I would personally not reuse the bitmask buffer in the vector of buffers and make it a boolean array, as that would make the struct no longer ingestible by Arrow. Rather, we can use an existing / valid construct from the Arrow type system to represent a masked array (e.g. a StructArray with 2 non-nullable fields (values array and mask array) can zero-copy represent a numpy-type masked array).
Only in C or C++. You vendor
It kind of is "let's not have a Python API for this". We'd either have to write that reusable library, or force all dataframe library authors to write C. E.g. Modin is now pure Python; with
The very short summary of the discussion on this was:
Re memory management: also @aregm had very clear use cases for having memory managed at the buffer level. This looks like a must-have.
For dataframe interchange, the smallest building block is a "buffer" (see gh-35, gh-38) - a block of memory. Interpreting that is nontrivial, especially if the goal is to build an interchange protocol in Python. That's why DLPack, the buffer protocol, `__array_interface__`, `__cuda_array_interface__`, `__array__` and `__arrow_array__` all exist, and are still complicated.

For what a buffer is, currently it's only a data pointer (`ptr`) and a size (`bufsize`) which together describe a contiguous block of memory, plus a device attribute (`__dlpack_device__`) and optionally DLPack support (`__dlpack__`). One open question is:

The other, larger question is how to make buffers nice to deal with for implementers of the protocol. The current Pandas prototype shows the issue:
From #38 (review) (@kkraus14 & @rgommers):

Yes that works and I've thought about it. The trouble is where to hold the reference. You really need one reference per buffer, not just store a reference to the whole exchange dataframe object (buffers can end up elsewhere outside the new pandas dataframe here). And given that a buffer just has a raw pointer plus a size, there's nothing to hold on to. I don't think there's a sane pure Python solution. `__cuda_array_interface__` is directly attached to the object you need to hold on to, which is not the case for this `Buffer`.

Yep, for numerical data types the solution can simply be: hurry up with implementing `__dlpack__`, and the problem goes away. The dtypes that DLPack does not support are more of an issue.

From #38 (comment) (@jorisvandenbossche):
I personally think it would be useful to keep those existing interface methods (or array, or arrow_array). For people that are using those interface, that will be easier to interface with the interchange protocol than manually converting the buffers.
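A rough sketch of the "hurry up with implementing `__dlpack__`" route, assuming the buffer's memory is owned by a contiguous 1-D NumPy array (an illustration, not the prototype's actual code):

```python
import numpy as np

class Buffer:
    """Sketch: a protocol buffer backed by (and keeping alive) a 1-D NumPy array."""

    def __init__(self, x: np.ndarray):
        if x.ndim != 1 or not x.flags.c_contiguous:
            raise ValueError("only contiguous 1-D arrays are supported")
        self._x = x                          # holding this reference keeps the memory alive

    @property
    def bufsize(self) -> int:
        return self._x.size * self._x.dtype.itemsize

    @property
    def ptr(self) -> int:
        return self._x.__array_interface__["data"][0]

    def __dlpack__(self, stream=None):
        return self._x.__dlpack__()          # requires NumPy >= 1.22

    def __dlpack_device__(self):
        return self._x.__dlpack_device__()   # (1, 0) -> CPU
```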
Alternative/extension to the current design
We could change the plain memory description + `__dlpack__` to:

1. The plain memory description: `ptr`, `bufsize`, and device
2. A `native` enum attribute, and if both producer and consumer happen to use that native format, they can call the corresponding protocol (`__arrow_array__` or `__array__`)
3. Other generic protocols (`__cuda_array_interface__`, buffer protocol, `__array_interface__`).

(1) is required for any implementation to be able to talk to any other implementation, but also the most clunky to support because it needs to solve the "who owns this memory and how do you prevent it from being freed" problem all over again. What is needed there is

The advantage of (2) and (3) is that they have the most hairy issue already solved, and will likely be faster.

And the MUST/MAY should address @kkraus14's concern that people will just standardize on the lowest common denominator (numpy).
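To make the (1)/(2)/(3) tiers concrete, a hedged consumer-side sketch - the `native` attribute, the enum values and `wrap_raw_memory` are placeholders, not agreed API:

```python
import enum
import numpy as np

class NativeFormat(enum.IntEnum):   # placeholder enum for option (2)
    OTHER = 0
    NUMPY = 1
    ARROW = 2

def wrap_raw_memory(ptr: int, bufsize: int):
    # Placeholder for option (1): wrap the raw pointer in the consumer's own memory
    # type, and arrange to keep the producing object alive (the hard part).
    raise NotImplementedError

def consume_buffer(buf, my_native=NativeFormat.NUMPY):
    # (2) MAY: if producer and consumer share a native format, take the fast path.
    if getattr(buf, "native", NativeFormat.OTHER) == my_native:
        return np.asarray(buf)             # e.g. via __array__
    # (3) MAY: generic protocols such as DLPack.
    if hasattr(buf, "__dlpack__"):
        return np.from_dlpack(buf)         # requires NumPy >= 1.22
    # (1) MUST: fall back to the raw description (ptr + bufsize).
    return wrap_raw_memory(buf.ptr, buf.bufsize)
```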
What is missing for dealing with memory buffers
A summary of why this is hard is:
So what we are aiming for (ambitiously) is:
The "holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well" seems necessary for supporting the raw memory description. This probably means that (a) the
Buffer
object should include the right Python object to keep a reference to (for Pandas that would typically be a 1-D numpy array), and (b) there must be some machinery to keep this reference alive (TBD what that looks like, likely not pure Python) in the implementation.The text was updated successfully, but these errors were encountered:
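To make (b) concrete, one possible (pure-Python, admittedly hacky) sketch of such machinery on the consumer side, assuming `buf` exposes `ptr` and `bufsize` and itself holds the producer's owning object:

```python
import ctypes
import numpy as np

class _ArrayWithOwner(np.ndarray):
    """ndarray subclass that can carry an extra reference to the exchange buffer."""

def buffer_to_numpy(buf, dtype) -> np.ndarray:
    raw = (ctypes.c_char * buf.bufsize).from_address(buf.ptr)   # no copy, no ownership
    arr = np.frombuffer(raw, dtype=dtype).view(_ArrayWithOwner)
    arr._owner = buf   # as long as `arr` lives, `buf` (and whatever it references) stays alive
    return arr
```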