Commit eead53a

Add protocol API to Sphinx doc
1 parent 6cc4401 commit eead53a

4 files changed: +76 -68 lines changed

protocol/API.md

+72
@@ -0,0 +1,72 @@
+# API of the `__dataframe__` protocol
+
+Specification for objects to be accessed, for the purpose of dataframe
+interchange between libraries, via the `__dataframe__` method on a library's
+data frame object.
+
+For guiding requirements, see {ref}`design-requirements`.
+
+
+## Concepts in this design
+
+1. A `Buffer` class. A *buffer* is a contiguous block of memory - this is the
+   only thing that actually maps to a 1-D array in a sense that it could be
+   converted to NumPy, CuPy, et al.
+2. A `Column` class. A *column* has a single dtype. It can consist
+   of multiple *chunks*. A single chunk of a column (which may be the whole
+   column if ``num_chunks == 1``) is modeled as again a `Column` instance, and
+   contains 1 data *buffer* and (optionally) one *mask* for missing data.
+3. A `DataFrame` class. A *data frame* is an ordered collection of *columns*,
+   which are identified with names that are unique strings. All the data
+   frame's rows are the same length. It can consist of multiple *chunks*. A
+   single chunk of a data frame is modeled as again a `DataFrame` instance.
+4. A *mask* concept. A *mask* of a single-chunk column is a *buffer*.
+5. A *chunk* concept. A *chunk* is a sub-dividing element that can be applied
+   to a *data frame* or a *column*.
+
+Note that the only way to access these objects is through a call to
+`__dataframe__` on a data frame object. This is NOT meant as public API;
+only think of instances of the different classes here to describe the API of
+what is returned by a call to `__dataframe__`. They are the concepts needed
+to capture the memory layout and data access of a data frame.
+
+
+## Design decisions
+
+1. Use a separate column abstraction in addition to a dataframe interface.
+
+   Rationales:
+
+   - This is how it works in R, Julia and Apache Arrow.
+   - Semantically most existing applications and users treat a column similar to a 1-D array
+   - We should be able to connect a column to the array data interchange mechanism(s)
+
+   Note that this does not imply a library must have such a public user-facing
+   abstraction (ex. ``pandas.Series``) - it can only be accessed via
+   ``__dataframe__``.
+
+2. Use methods and properties on an opaque object rather than returning
+   hierarchical dictionaries describing memory.
+
+   This is better for implementations that may rely on, for example, lazy
+   computation.
+
+3. No row names. If a library uses row names, use a regular column for them.
+
+   See discussion at
+   [wesm/dataframe-protocol/pull/1](https://github.com/wesm/dataframe-protocol/pull/1/files#r394316241)
+   Optional row names are not a good idea, because people will assume they're
+   present (see cuDF experience, forced to add because pandas has them).
+   Requiring row names seems worse than leaving them out. Note that row labels
+   could be added in the future - right now there are no clear requirements for
+   more complex row labels that cannot be represented by a single column. These
+   do exist, for example Modin has table and tree-based row labels.
+
+## Interface
+
+
+
+```{literalinclude} dataframe_protocol.py
+---
+language: python
+---

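The concepts listed in `protocol/API.md` above can be pictured with a small Python sketch. Everything in it is illustrative: the class bodies, the `raw` attribute, and the method names `get_chunks`, `data_buffer`, `mask_buffer`, `column_names` and `get_column` are assumptions made for this example, not the interface defined in `dataframe_protocol.py`. Only the class names and `num_chunks` come from the text, and `num_chunks` is written as a method here purely for convenience. Chunking of a whole data frame is left out to keep the sketch short.

```python
from typing import Dict, Iterator, List, Optional


class Buffer:
    """A contiguous block of memory (here simply wrapped bytes)."""

    def __init__(self, raw: bytes) -> None:
        self.raw = raw


class Column:
    """A column with a single dtype, stored as one or more chunks."""

    def __init__(
        self,
        dtype: str,
        pieces: List[Buffer],
        masks: Optional[List[Optional[Buffer]]] = None,
    ) -> None:
        if masks is None:
            masks = [None] * len(pieces)
        if len(masks) != len(pieces):
            raise ValueError("need exactly one (optional) mask per data buffer")
        self.dtype = dtype
        self._pieces = list(pieces)
        self._masks = list(masks)

    def num_chunks(self) -> int:
        return len(self._pieces)

    def get_chunks(self) -> Iterator["Column"]:
        # Each chunk is again a Column holding exactly one data buffer.
        for piece, mask in zip(self._pieces, self._masks):
            yield Column(self.dtype, [piece], [mask])

    def data_buffer(self) -> Buffer:
        if self.num_chunks() != 1:
            raise ValueError("only a single-chunk column maps to one buffer")
        return self._pieces[0]

    def mask_buffer(self) -> Optional[Buffer]:
        # The mask of a single-chunk column is itself a buffer (or absent).
        if self.num_chunks() != 1:
            raise ValueError("only a single-chunk column has one mask")
        return self._masks[0]


class DataFrame:
    """An ordered collection of columns identified by unique string names."""

    def __init__(self, columns: Dict[str, Column]) -> None:
        # Dict keys are unique and keep insertion order.
        self._columns = dict(columns)

    def column_names(self) -> List[str]:
        return list(self._columns)

    def get_column(self, name: str) -> Column:
        return self._columns[name]


# A two-chunk column "a" without masks, wrapped in a one-column data frame.
column = Column("int8", [Buffer(b"\x01\x02"), Buffer(b"\x03")])
frame = DataFrame({"a": column})
chunks = list(frame.get_column("a").get_chunks())
```

In this sketch only a single-chunk `Column` exposes a data buffer, which mirrors the statement that a buffer is the only thing that maps to a 1-D array.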
protocol/dataframe_protocol.py

-67
@@ -1,70 +1,3 @@
-"""
-Specification for objects to be accessed, for the purpose of dataframe
-interchange between libraries, via the ``__dataframe__`` method on a libraries'
-data frame object.
-
-For guiding requirements, see https://github.com/data-apis/dataframe-api/pull/35
-
-
-Concepts in this design
------------------------
-
-1. A `Buffer` class. A *buffer* is a contiguous block of memory - this is the
-   only thing that actually maps to a 1-D array in a sense that it could be
-   converted to NumPy, CuPy, et al.
-2. A `Column` class. A *column* has a single dtype. It can consist
-   of multiple *chunks*. A single chunk of a column (which may be the whole
-   column if ``num_chunks == 1``) is modeled as again a `Column` instance, and
-   contains 1 data *buffer* and (optionally) one *mask* for missing data.
-3. A `DataFrame` class. A *data frame* is an ordered collection of *columns*,
-   which are identified with names that are unique strings. All the data
-   frame's rows are the same length. It can consist of multiple *chunks*. A
-   single chunk of a data frame is modeled as again a `DataFrame` instance.
-4. A *mask* concept. A *mask* of a single-chunk column is a *buffer*.
-5. A *chunk* concept. A *chunk* is a sub-dividing element that can be applied
-   to a *data frame* or a *column*.
-
-Note that the only way to access these objects is through a call to
-``__dataframe__`` on a data frame object. This is NOT meant as public API;
-only think of instances of the different classes here to describe the API of
-what is returned by a call to ``__dataframe__``. They are the concepts needed
-to capture the memory layout and data access of a data frame.
-
-
-Design decisions
-----------------
-
-**1. Use a separate column abstraction in addition to a dataframe interface.**
-
-Rationales:
-- This is how it works in R, Julia and Apache Arrow.
-- Semantically most existing applications and users treat a column similar to a 1-D array
-- We should be able to connect a column to the array data interchange mechanism(s)
-
-Note that this does not imply a library must have such a public user-facing
-abstraction (ex. ``pandas.Series``) - it can only be accessed via ``__dataframe__``.
-
-**2. Use methods and properties on an opaque object rather than returning
-hierarchical dictionaries describing memory**
-
-This is better for implementations that may rely on, for example, lazy
-computation.
-
-**3. No row names. If a library uses row names, use a regular column for them.**
-
-See discussion at https://github.com/wesm/dataframe-protocol/pull/1/files#r394316241
-Optional row names are not a good idea, because people will assume they're present
-(see cuDF experience, forced to add because pandas has them).
-Requiring row names seems worse than leaving them out.
-
-Note that row labels could be added in the future - right now there's no clear
-requirements for more complex row labels that cannot be represented by a single
-column. These do exist, for example Modin has has table and tree-based row
-labels.
-
-"""
-
-
 class Buffer:
     """
     Data in the buffer is guaranteed to be contiguous in memory.

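Design decision 2 in the text above prefers methods and properties on an opaque object over dictionaries describing memory, citing lazy computation. A minimal sketch of why that helps follows; `LazyColumn`, its `size` property, its `values` method and the `materialized` flag are hypothetical names invented for this illustration and are not part of `dataframe_protocol.py`.

```python
from typing import Callable, List


class LazyColumn:
    """Hypothetical column whose values are only computed when requested."""

    def __init__(self, size: int, compute: Callable[[], List[int]]) -> None:
        self._size = size
        self._compute = compute
        self.materialized = False

    @property
    def size(self) -> int:
        # Cheap metadata: answering this does not trigger the computation.
        return self._size

    def values(self) -> List[int]:
        # Expensive part: runs only when a consumer asks for the data.
        self.materialized = True
        return self._compute()


col = LazyColumn(3, lambda: [x * x for x in range(3)])
n = col.size                      # metadata access only
untouched = not col.materialized  # nothing has been computed yet
data = col.values()               # the computation happens here
```

A dictionary describing memory would have to be filled in completely before it is returned, so a producer would be forced to compute the values up front even if the consumer only wanted the metadata.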
protocol/design_requirements.md

+3 -1
@@ -1,4 +1,4 @@
-# The `__dataframe__` protocol
+# Design concepts and requirements
 
 This document aims to describe the design requirements and principles of the
 dataframe interchange protocol, and the functionality it needs to support.
@@ -20,6 +20,8 @@ A column or a dataframe can be "chunked"; a **chunk** is a subset of a column
 or dataframe that contains a set of (neighboring) rows.
 
 
+(design-requirements)=
+
 ## Protocol design requirements
 
 1. Must be a standard Python-level API that is unambiguously specified, and

protocol/index.rst

+1
@@ -10,4 +10,5 @@ Contents
 
    purpose_and_scope
    design_requirements
+   API
 

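The `{ref}` role, the `(design-requirements)=` target and the fenced `{literalinclude}` directive used in this commit are MyST Markdown syntax. The fragment below is a sketch of the Sphinx configuration such a build would need, assuming the project uses MyST-Parser; the repository's actual `conf.py` is not part of this commit and may differ.

```python
# conf.py (sketch): assumes the docs are built with Sphinx plus MyST-Parser.
extensions = [
    # Parses .md sources, including `{ref}` roles, `(label)=` targets and
    # fenced directives such as `{literalinclude}`.
    "myst_parser",
]
```

With this in place, adding `API` to the toctree in `index.rst` is enough for Sphinx to pick up `API.md`.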