How to consume a single buffer & connection to array interchange #39
The way we handle this in cudf is we have a
Actually in our case it directly is. We often hold a reference to a cupy array or numba device array or pytorch tensor as the underlying owner of memory underneath a Buffer. We also directly implement it, i.e. we do this in a few places with string columns, where we use
Yes, for numerical and other columns that can be represented by a single buffer it would be nice to support
That makes sense. This attribute:
is also needed in this protocol - except it can't be optional.
That looks similar (the principle, not the code) to how numpy deals with it in `PyArray_FromInterface(PyObject *origin)`:

```c
/* Get data buffer from interface specification */
attr = _PyDict_GetItemStringWithError(iface, "data");

/* Case for data access through pointer */
if (attr && PyTuple_Check(attr)) {
    PyObject *dataptr;
    dataptr = PyTuple_GET_ITEM(attr, 0);
    if (PyLong_Check(dataptr)) {
        data = PyLong_AsVoidPtr(dataptr);
        base = origin;
        ret = (PyArrayObject *)PyArray_NewFromDescrAndBase(
                &PyArray_Type, dtype,
                n, dims, NULL, data,
                dataflags, NULL, base);
```
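For reference, a small Python sketch of the same mechanism seen from the consumer side (the `Producer` class is just a made-up stand-in for any object exposing `__array_interface__`): NumPy stores the producing object as the new array's `base`, which is what keeps the memory alive.

```python
import numpy as np

class Producer:
    """Hypothetical object exposing memory it owns via __array_interface__."""
    def __init__(self):
        self._arr = np.arange(5, dtype="float64")              # owns the memory
        self.__array_interface__ = self._arr.__array_interface__

p = Producer()
view = np.asarray(p)     # goes through PyArray_FromInterface, no copy
print(view.base is p)    # True: the producer is stored as `base`, keeping the memory alive
```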
Those 3 bullet points of (possibly) supported APIs, is this for the Buffer level? Because it's probably also useful to have that on the Column level as well.
Might not be that relevant, but: Arrow does support byte masks in some way, i.e. as boolean arrays (it just can't use them as nulls to compute anything with, but you can store them). You may not be able to use byte masks in a plain Arrow array, but at the same time I don't think you can use byte masks in numpy's
In
Ah, cool, I didn't know that.
Indeed, the
Summarizing some of the main comments:
Action for me: create two implementations, a minimal and a maximal one, so it's easier to compare.
What dtypes does it not support that it needs to at the buffer level? I thought we had previously decided that buffers were untyped since they could be interpreted in different ways? Even if they're typed, I would assume they're
Oh right, resolution of that was "due to variable length strings, buffers must be untyped (dtype info lives on the column not the buffer)". There's still an inconsistency in that
And this goes back to the strides discussion: if buffers are untyped, does it really make sense to have strides on them versus controlling that on the Column level? E.g. take something like a theoretical uint128 array, where it would have a data buffer. If I knew the max value of the array was
Now that we are considering going all in on DLPack at the buffer level, a potential alternative to consider would be using the actual Arrow C Data Interface instead (at the column level)? When describing/discussing the requirements in #35, an argument against the Arrow C Data Interface is that we were looking for a Python

In the end, for primitive arrays, DLPack and the Arrow C Data Interface are basically the same (it's almost the same C struct; in Arrow there are only some additional null pointers for child/dictionary arrays that are not needed for a primitive array; e.g. implementing conversion to/from numpy for primitive arrays wouldn't be harder compared to DLPack). But the Arrow C Data Interface natively supports the more complex nested/multi-buffer cases we also want to support (e.g. variable length strings, categoricals). What's missing for using the Arrow C Data Interface is:
Both those points have been discussed / worked out for DLPack in the context of the Array API (I think? I didn't fully follow this), so transferring what has been learned from that to add the same capabilities to the Arrow C Data Interface could be a nice addition to Arrow and would IMO be a great outcome of the consortium efforts.
I think it was decided that strided data would be supported, which the Arrow C Data Interface doesn't support, so we'd need to propose adding that. Another key point with the Arrow C Data Interface is that there isn't a way to control the lifetime of individual buffers versus the column. E.g. for a string column, say I use the
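A small pyarrow sketch of the multi-buffer situation being described (illustrative values only): a string column is backed by three separate buffers - validity bitmap, offsets, and character data - which is exactly the case where one might want to release some buffers while keeping others alive.

```python
import pyarrow as pa

col = pa.array(["spark", "arrow", None, "dlpack"])
# A string column carries three buffers: validity bitmap, int32 offsets, UTF-8 data.
for i, buf in enumerate(col.buffers()):
    print(i, None if buf is None else buf.size)
```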
I looked back at the last meeting notes, and there are some notes about how to implement it (should the strides live at buffer or column level? etc), and not much about reasons to support it (but maybe we didn't capture everything in the notes).
And is this problematic to ensure? (honest question, this is out of my comfort zone)
I would say it adds some complexity but is manageable. It matters more on devices like GPUs with much more limited memory available, where every extra byte counts and we'd want to release buffers / free memory as aggressively as possible.
Yes, we should definitely reconsider, because it's better to reuse than reinvent. If it's not a good idea, we should document very clearly why. A few thoughts:
(2) and (3)
It's a bit unfortunate, but it may be worth giving up on striding indeed if it brings us other benefits.
Yes, that isn't too hard.
The DLPack device support model already existed; it was the CUDA/ROCm stream support that we figured out how to do. Arrow can learn from DLPack there, but all it would use is CUDA/ROCm, I believe. The vector lanes and bit-width stuff (and striding), which DLPack has because of support for FPGAs and other devices, is probably too foreign to mix in with Arrow.
Yes, I do agree with that.
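For concreteness, the device tuple that DLPack already standardizes looks like this (a minimal sketch, assuming NumPy >= 1.22 for `__dlpack_device__`; the device type values are from the DLPack header):

```python
import numpy as np

# DLPack device types relevant here: kDLCPU = 1, kDLCUDA = 2, kDLROCM = 10
x = np.arange(3)
print(x.__dlpack_device__())   # (1, 0) -> (kDLCPU, device id 0)
```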
I was preparing some more code changes, but summarizing these questions well for tomorrow's call will be more useful I think.
Yea there's still more work for me to do here 😅. We've ironed out the semantics on the Python side for dlpack, but we need to do the same for a C interface now. Once we have the learnings from that my plan was to point to that for making a proposal to the Arrow C data interface.
Since we would only use this for the column-level interchange (and not the full dataframe), and so we still have a Python-level API layer to communicate about how missing values are stored, I think we could agree on some "standard" way to use a StructArray with 2 arrays (values, mask) to represent a boolean-masked array.
Yes, I agree the boolean mask support can be added via a convention.
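A minimal pyarrow sketch of what such a convention could look like (the child names "values" and "mask" are only an example, nothing agreed on):

```python
import numpy as np
import pyarrow as pa

values = np.array([1.5, 2.5, 3.5])
mask = np.array([False, True, False])    # True = missing, as in numpy masked arrays

# One possible convention: a StructArray with two non-nullable children.
masked_col = pa.StructArray.from_arrays(
    [pa.array(values), pa.array(mask)],
    names=["values", "mask"],
)
print(masked_col.type)   # struct<values: double, mask: bool>
```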
Here is a summary of the "use Arrow C Data Interface" option:

Boolean masks

Arrow does not support them natively. However, it does support a boolean dtype, so it is possible to represent a column with a boolean mask in the Arrow C Data Interface through a naming convention. That will break the following specification in the Arrow C Data Interface spec:

"Mandatory ... The pointer to the null bitmap buffer, if the data type specifies one, MAY be NULL only if"

This should be okay, given that we still have a Python-only API so we can define the convention that a

Memory management at the buffer level

Quoting Keith:

"Another key point is with the Arrow C data interface is that there isn't a way to control lifetime of individual buffers versus the column ... and the only way to do that is to make sure the release callback isn't called in this situation. ... it adds some complexity but is manageable, but moreso on devices like GPUs with much more limited memory available, every extra byte counts where we'd want to as aggressively release buffers / free memory as possible."

Whether or not to manage memory at the column or the buffer level is a choice that must be made in the standard; it cannot be left up to the implementing libraries. Reason: managing at the buffer level only helps if both libraries do that, otherwise all memory stays alive anyway - and there isn't even a buffer-level deleter to call unless we add one. I think this would be a significant change to the Arrow C Data Interface.

Device support

Assumption: we want to only support CPU and GPU (CUDA, ROCm) for now, but make sure it is extensible to other device types later.

Current design, which uses a

Next steps for device support for Arrow:
Strided buffers
The less efficient support for row-based dataframes is a larger downside than for the numpy-based ones, because row-based dataframes will always have columns that are strided in memory. It'd be much nicer if Arrow (or our protocol, whether based on Arrow or not) supported striding.

How would column exchange now actually work
I'm not sure this is necessary - in

There's still a problem to exactly mirror this for dataframes. We still want a superset of what the Arrow C Data Interface offers: boolean masks, device support, perhaps deleters at the buffer level. So we're talking about what is basically either a fork or a v2 of the Arrow C Data Interface. It may be helpful to sketch the calling code:

```python
# User calls
df = consumer_lib.from_dataframe(df_other, columns=['A', 'B'])

# What happens inside `from_dataframe`:
dfobj = df_other.__dataframe__().get_columns(columns)
cols = dict()
for name in dfobj.column_names():
    cols[name] = convert_column_to_my_native_format(dfobj.get_column_by_name(name))

# Instantiate our own dataframe:
df_out = mylib.DataFrame(cols)

# That native conversion function can use Arrow (maybe)
def convert_column_to_my_native_format(column):
    # Check if null representation is supported by Arrow natively
    if column.describe_null() == 4:  # byte mask (in future, use enum)
        # handle convention in custom implementation, cannot rely directly on Arrow here
        ...
    # This function is all compiled code
    return mylib._use_arrow_c_interface(column)
```
Do we know of row-based dataframe libraries in Python that can give access to columns as a strided array? (apart from numpy recarrays)
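As a point of reference for what "columns as a strided array" means, a minimal numpy structured-array example (the recarray case mentioned above):

```python
import numpy as np

# A row-oriented table: each record holds (int64, float64) contiguously in memory.
rows = np.zeros(4, dtype=[("a", "int64"), ("b", "float64")])

col_a = rows["a"]           # zero-copy view of one column
print(col_a.strides)        # (16,) - the stride equals the record size, not the itemsize
print(col_a.base is rows)   # True - the column view keeps the owning array alive
```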
Isn't that basically what
Thanks, concrete code snippets are always helpful! ;)
I'd expect Koalas to be able to do this (disclaimer, I don't know about its internals). Or maybe one of the Ibis backends.
Well, there's an opaque object that's not meant to be unpacked or even seen by the user. All a user would do is call
Yes. My point was that there are conventions (like for boolean mask) and extras that are TBD (device support, buffer-level deleters), so we can't say "this uses the Arrow C Data Interface, so you can use your existing implementation to parse it". There's no working C code to reuse, there's only the structs in the Arrow spec that we'd take over.
Since Koalas is Spark under the hood, I suspect they would use Arrow for efficient Spark->Python data transfer (but I don't know about its internals either).
But meant to be unpacked by library authors? So we still need the same for a potential Arrow C Data interface?
To better understand what you are referring to / looking for: what would be the equivalent for DLPack for this? Does it have a standalone C implementation that can be reused / exposed as a python library?
Regarding the boolean masks, I would personally not reuse the bitmask buffer in the vector of buffers and make it a boolean array, as that would make the struct no longer ingestible by Arrow. Rather, we can use an existing / valid construct from the Arrow type system to represent a masked array (e.g. a StructArray with 2 non-nullable fields (values array and mask array) can zero-copy represent a numpy-type masked array).
Only in C or C++. You vendor
It kind of is "let's not have a Python API for this". We'd either have to write that reusable library, or force all dataframe library authors to write C. E.g. Modin is now pure Python; with
The very short summary of the discussion on this was:
Re memory management: also @aregm had very clear use cases for having memory managed at the buffer level. This looks like a must-have.
For dataframe interchange, the smallest building block is a "buffer" (see gh-35, gh-38) - a block of memory. Interpreting that is nontrivial, especially if the goal is to build an interchange protocol in Python. That's why DLPack, the buffer protocol, `__array_interface__`, `__cuda_array_interface__`, `__array__` and `__arrow_array__` all exist, and are still complicated.

For what a buffer is, currently it's only a data pointer (`ptr`) and a size (`bufsize`) which together describe a contiguous block of memory, plus a device attribute (`__dlpack_device__`) and optionally DLPack support (`__dlpack__`). One open question is:

The other, larger question is how to make buffers nice to deal with for implementers of the protocol. The current Pandas prototype shows the issue:
From #38 (review) (@kkraus14 & @rgommers):

Yes that works and I've thought about it. The trouble is where to hold the reference. You really need one reference per buffer, not just store a reference to the whole exchange dataframe object (buffers can end up elsewhere outside the new pandas dataframe here). And given that a buffer just has a raw pointer plus a size, there's nothing to hold on to. I don't think there's a sane pure Python solution. `__cuda_array_interface__` is directly attached to the object you need to hold on to, which is not the case for this `Buffer`.

Yep, for numerical data types the solution can simply be: hurry up with implementing `__dlpack__`, and the problem goes away. The dtypes that DLPack does not support are more of an issue.

From #38 (comment) (@jorisvandenbossche):
I personally think it would be useful to keep those existing interface methods (or array, or arrow_array). For people that are using those interface, that will be easier to interface with the interchange protocol than manually converting the buffers.
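A rough sketch of the "hurry up with implementing `__dlpack__`" route, assuming the buffer's memory is owned by a contiguous 1-D NumPy array (an illustration, not the prototype's actual code):

```python
import numpy as np

class Buffer:
    """Sketch: a protocol buffer backed by (and keeping alive) a 1-D NumPy array."""

    def __init__(self, x: np.ndarray):
        if x.ndim != 1 or not x.flags.c_contiguous:
            raise ValueError("only contiguous 1-D arrays are supported")
        self._x = x                          # holding this reference keeps the memory alive

    @property
    def bufsize(self) -> int:
        return self._x.size * self._x.dtype.itemsize

    @property
    def ptr(self) -> int:
        return self._x.__array_interface__["data"][0]

    def __dlpack__(self, stream=None):
        return self._x.__dlpack__()          # requires NumPy >= 1.22

    def __dlpack_device__(self):
        return self._x.__dlpack_device__()   # (1, 0) -> CPU
```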
Alternative/extension to the current design
We could change the plain memory description + `__dlpack__` to:

1. The plain memory description: `ptr`, `bufsize`, and device
2. A `native` enum attribute, and if both producer and consumer happen to use that native format, they can call the corresponding protocol (`__arrow_array__` or `__array__`)
3. Other generic protocols (`__cuda_array_interface__`, buffer protocol, `__array_interface__`).

(1) is required for any implementation to be able to talk to any other implementation, but also the most clunky to support because it needs to solve the "who owns this memory and how do you prevent it from being freed" problem all over again. What is needed there is

The advantage of (2) and (3) is that they have the most hairy issue already solved, and will likely be faster.

And the MUST/MAY should address @kkraus14's concern that people will just standardize on the lowest common denominator (numpy).
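To make the (1)/(2)/(3) tiers concrete, a hedged consumer-side sketch - the `native` attribute, the enum values and `wrap_raw_memory` are placeholders, not agreed API:

```python
import enum
import numpy as np

class NativeFormat(enum.IntEnum):   # placeholder enum for option (2)
    OTHER = 0
    NUMPY = 1
    ARROW = 2

def wrap_raw_memory(ptr: int, bufsize: int):
    # Placeholder for option (1): wrap the raw pointer in the consumer's own memory
    # type, and arrange to keep the producing object alive (the hard part).
    raise NotImplementedError

def consume_buffer(buf, my_native=NativeFormat.NUMPY):
    # (2) MAY: if producer and consumer share a native format, take the fast path.
    if getattr(buf, "native", NativeFormat.OTHER) == my_native:
        return np.asarray(buf)             # e.g. via __array__
    # (3) MAY: generic protocols such as DLPack.
    if hasattr(buf, "__dlpack__"):
        return np.from_dlpack(buf)         # requires NumPy >= 1.22
    # (1) MUST: fall back to the raw description (ptr + bufsize).
    return wrap_raw_memory(buf.ptr, buf.bufsize)
```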
What is missing for dealing with memory buffers
A summary of why this is hard is:
So what we are aiming for (ambitiously) is:
The "holding a reference to the producing object must guarantee the lifetime of the memory and that has worked relatively well" seems necessary for supporting the raw memory description. This probably means that (a) the
Buffer
object should include the right Python object to keep a reference to (for Pandas that would typically be a 1-D numpy array), and (b) there must be some machinery to keep this reference alive (TBD what that looks like, likely not pure Python) in the implementation.The text was updated successfully, but these errors were encountered:
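To make (b) concrete, one possible (pure-Python, admittedly hacky) sketch of such machinery on the consumer side, assuming `buf` exposes `ptr` and `bufsize` and itself holds the producer's owning object:

```python
import ctypes
import numpy as np

class _ArrayWithOwner(np.ndarray):
    """ndarray subclass that can carry an extra reference to the exchange buffer."""

def buffer_to_numpy(buf, dtype) -> np.ndarray:
    raw = (ctypes.c_char * buf.bufsize).from_address(buf.ptr)   # no copy, no ownership
    arr = np.frombuffer(raw, dtype=dtype).view(_ArrayWithOwner)
    arr._owner = buf   # as long as `arr` lives, `buf` (and whatever it references) stays alive
    return arr
```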