|
| 1 | +# API of the `__dataframe__` protocol |
| 2 | + |
| 3 | +Specification for objects to be accessed, for the purpose of dataframe |
| 4 | +interchange between libraries, via the `__dataframe__` method on a library's |
| 5 | +data frame object. |
| 6 | + |
| 7 | +For guiding requirements, see {ref}`design-requirements`. |
| 8 | + |
| 9 | + |
| 10 | +## Concepts in this design |
| 11 | + |
| 12 | +1. A `Buffer` class. A *buffer* is a contiguous block of memory - this is the |
| 13 | + only thing that actually maps to a 1-D array in the sense that it could be |
| 14 | + converted to NumPy, CuPy, et al. |
| 15 | +2. A `Column` class. A *column* has a single dtype. It can consist |
| 16 | + of multiple *chunks*. A single chunk of a column (which may be the whole |
| 17 | + column if ``num_chunks == 1``) is itself modeled as a `Column` instance, and |
| 18 | + contains one data *buffer* and, optionally, one *mask* for missing data. |
| 19 | +3. A `DataFrame` class. A *data frame* is an ordered collection of *columns*, |
| 20 | + which are identified by names that are unique strings. All of the data |
| 21 | + frame's columns have the same length. It can consist of multiple *chunks*. A |
| 22 | + single chunk of a data frame is itself modeled as a `DataFrame` instance. |
| 23 | +4. A *mask* concept. A *mask* of a single-chunk column is a *buffer*. |
| 24 | +5. A *chunk* concept. A *chunk* is a sub-dividing element that can be applied |
| 25 | + to a *data frame* or a *column*. |
| 26 | + |
| 27 | +Note that the only way to access these objects is through a call to |
| 28 | +`__dataframe__` on a data frame object. These classes are NOT meant as public |
| 29 | +API; treat them only as a description of the API of what a call to |
| 30 | +`__dataframe__` returns. They are the concepts needed to capture the memory |
| 31 | +layout and data access of a data frame. |
| 32 | + |
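To make the relationships between these concepts concrete, here is a minimal sketch of how they could nest. It is an illustration only: the attribute and method names (`data`, `mask`, `children`, `get_chunks`, `column_names`, `get_column_by_name`) and the choice to model `num_chunks` as a method are assumptions made for this example, and `bytes` stands in for a real memory block. The authoritative definitions are in the protocol file included under "Interface".

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Buffer:
    """A contiguous block of memory (here simply backed by `bytes`)."""
    data: bytes


@dataclass
class Column:
    """A single-dtype column; either one chunk or a sequence of chunks."""
    dtype: str
    data: Optional[Buffer] = None   # set only on a single-chunk column
    mask: Optional[Buffer] = None   # optional missing-data mask, also a buffer
    children: List["Column"] = field(default_factory=list)

    def num_chunks(self) -> int:
        # A column without sub-chunks counts as exactly one chunk.
        return len(self.children) if self.children else 1

    def get_chunks(self) -> Iterator["Column"]:
        # Each chunk is itself a Column; a single-chunk column yields itself.
        if self.children:
            yield from self.children
        else:
            yield self


@dataclass
class DataFrame:
    """An ordered collection of uniquely named columns."""
    columns: Dict[str, Column]      # dict keys are unique and keep their order

    def column_names(self) -> List[str]:
        return list(self.columns)

    def get_column_by_name(self, name: str) -> Column:
        return self.columns[name]


# Two single-chunk columns combined into one two-chunk column.
chunk_a = Column("int8", data=Buffer(bytes([1, 2, 3])))
chunk_b = Column("int8", data=Buffer(bytes([4, 5])), mask=Buffer(bytes([1, 0])))
col = Column("int8", children=[chunk_a, chunk_b])
df = DataFrame({"x": col})
```

In this sketch a consumer only ever reaches raw memory through a `Buffer`, which mirrors the statement above that a buffer is the only object that maps to a 1-D array.
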
| 33 | + |
| 34 | +## Design decisions |
| 35 | + |
| 36 | +1. Use a separate column abstraction in addition to a dataframe interface. |
| 37 | + |
| 38 | + Rationales: |
| 39 | + |
| 40 | + - This is how it works in R, Julia and Apache Arrow. |
| 41 | + - Semantically, most existing applications and users treat a column similarly to a 1-D array. |
| 42 | + - We should be able to connect a column to the array data interchange mechanism(s). |
| 43 | + |
| 44 | + Note that this does not imply that a library must have such a public, |
| 45 | + user-facing abstraction (e.g. ``pandas.Series``) - the column may be |
| 46 | + accessible only via ``__dataframe__``. |
| 47 | + |
| 48 | +2. Use methods and properties on an opaque object rather than returning |
| 49 | + hierarchical dictionaries describing memory. |
| 50 | + |
| 51 | + This is better for implementations that may rely on, for example, lazy |
| 52 | + computation. |
| 53 | + |
| 54 | +3. No row names. If a library uses row names, use a regular column for them. |
| 55 | + |
| 56 | + See the discussion at |
| 57 | + [wesm/dataframe-protocol/pull/1](https://github.com/wesm/dataframe-protocol/pull/1/files#r394316241). |
| 58 | + Optional row names are not a good idea, because people will assume they're |
| 59 | + present (see the cuDF experience: it was forced to add them because pandas |
| 60 | + has them). Requiring row names seems worse than leaving them out. Note that |
| 61 | + row labels could be added in the future - right now there are no clear |
| 62 | + requirements for more complex row labels that cannot be represented by a |
| 63 | + single column. These do exist; for example, Modin has table and tree-based row labels. |
| 64 | + |
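Design decision 2 above can be illustrated with a small contrast between the two styles. This is a sketch under stated assumptions, not part of the protocol: `describe_as_dict`, `LazyColumn`, `get_data` and the `materialized` flag are hypothetical names invented for this example.

```python
from typing import Callable, Dict, List


def describe_as_dict(values: List[int]) -> Dict[str, object]:
    """Dictionary style: every entry must exist before the dict is returned."""
    return {"length": len(values), "data": bytes(values)}


class LazyColumn:
    """Opaque-object style: data is produced only when a method asks for it."""

    def __init__(self, producer: Callable[[], List[int]]) -> None:
        self._producer = producer
        self.materialized = False   # flag for illustration only

    def get_data(self) -> bytes:
        # The (possibly expensive) computation is deferred until this call.
        self.materialized = True
        return bytes(self._producer())


eager = describe_as_dict([1, 2, 3])      # data is built immediately
lazy = LazyColumn(lambda: [1, 2, 3])     # nothing has been computed yet
was_materialized_before = lazy.materialized
payload = lazy.get_data()                # computation happens here
```

The dictionary has to be fully populated up front, whereas the object can postpone work until a consumer calls a specific method, which is the property a lazily evaluating implementation relies on.
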
| 65 | +## Interface |
| 66 | + |
| 67 | + |
| 68 | + |
| 69 | +```{literalinclude} dataframe_protocol.py |
| 70 | +--- |
| 71 | +language: python |
| 72 | +--- |