
Commit 27b8e1c

Merge pull request #53 from rgommers/sphinx-site
Add content for a Sphinx site specifically for the protocol
2 parents 8498cf1 + 070b9cf commit 27b8e1c

9 files changed

+636
-174
lines changed

protocol/API.md

+72
@@ -0,0 +1,72 @@
1+
# API of the `__dataframe__` protocol
2+
3+
Specification for objects to be accessed, for the purpose of dataframe
4+
interchange between libraries, via the `__dataframe__` method on a library's
5+
data frame object.
6+
7+
For guiding requirements, see {ref}`design-requirements`.
8+
9+
10+
## Concepts in this design
11+
12+
1. A `Buffer` class. A *buffer* is a contiguous block of memory - this is the
13+
only thing that actually maps to a 1-D array, in the sense that it could be
14+
converted to NumPy, CuPy, et al.
15+
2. A `Column` class. A *column* has a single dtype. It can consist
16+
of multiple *chunks*. A single chunk of a column (which may be the whole
17+
column if ``num_chunks == 1``) is again modeled as a `Column` instance, and
18+
contains one data *buffer* and (optionally) one *mask* for missing data.
19+
3. A `DataFrame` class. A *data frame* is an ordered collection of *columns*,
20+
which are identified with names that are unique strings. All the data
21+
frame's rows are the same length. It can consist of multiple *chunks*. A
22+
single chunk of a data frame is again modeled as a `DataFrame` instance.
23+
4. A *mask* concept. A *mask* of a single-chunk column is a *buffer*.
24+
5. A *chunk* concept. A *chunk* is a sub-dividing element that can be applied
25+
to a *data frame* or a *column*.
26+
27+
Note that the only way to access these objects is through a call to
28+
`__dataframe__` on a data frame object. This is NOT meant as public API;
29+
think of instances of the different classes here only as describing the API of
30+
what is returned by a call to `__dataframe__`. They are the concepts needed
31+
to capture the memory layout and data access of a data frame.
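
As a rough illustration of how these concepts nest, here is a minimal sketch. It is not the protocol's interface: only `num_chunks` appears in the text above, while the field names and the `get_chunks`, `column_names` and `get_column` methods are assumptions made for this example, and a `bytes` object stands in for real memory.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass
class Buffer:
    """A contiguous block of memory (modeled here as plain bytes)."""
    data: bytes


@dataclass
class Column:
    """A column with a single dtype, stored as one or more chunks."""
    dtype: str
    data: Optional[Buffer] = None   # data buffer of a single-chunk column
    mask: Optional[Buffer] = None   # optional missing-data mask (also a buffer)
    chunks: List["Column"] = field(default_factory=list)  # empty if single-chunk

    @property
    def num_chunks(self) -> int:
        return len(self.chunks) or 1

    def get_chunks(self) -> Iterator["Column"]:
        # Every chunk is again a Column; a single-chunk column yields itself.
        if self.chunks:
            yield from self.chunks
        else:
            yield self


@dataclass
class DataFrame:
    """An ordered collection of columns keyed by unique string names."""
    columns: Dict[str, Column] = field(default_factory=dict)

    def column_names(self) -> List[str]:
        return list(self.columns)  # dicts preserve insertion order

    def get_column(self, name: str) -> Column:
        return self.columns[name]


# Hypothetical data: one column split into two chunks, the second with a mask.
chunk_a = Column(dtype="int64", data=Buffer(b"\x01" * 16))
chunk_b = Column(dtype="int64", data=Buffer(b"\x02" * 8), mask=Buffer(b"\x01"))
col = Column(dtype="int64", chunks=[chunk_a, chunk_b])
df = DataFrame(columns={"x": col})
```

In this sketch a consumer would walk `df.get_column("x").get_chunks()` and read each chunk's data buffer and optional mask, without needing to know how the producing library stores its data internally.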
32+
33+
34+
## Design decisions
35+
36+
1. Use a separate column abstraction in addition to a dataframe interface.
37+
38+
Rationales:
39+
40+
- This is how it works in R, Julia and Apache Arrow.
41+
- Semantically, most existing applications and users treat a column similarly to a 1-D array.
42+
- We should be able to connect a column to the array data interchange mechanism(s).
43+
44+
Note that this does not imply a library must have such a public user-facing
45+
abstraction (e.g. ``pandas.Series``) - it can only be accessed via
46+
``__dataframe__``.
47+
48+
2. Use methods and properties on an opaque object rather than returning
49+
hierarchical dictionaries describing memory.
50+
51+
This is better for implementations that may rely on, for example, lazy
52+
computation.
53+
54+
3. No row names. If a library uses row names, use a regular column for them.
55+
56+
See discussion at
57+
[wesm/dataframe-protocol/pull/1](https://github.com/wesm/dataframe-protocol/pull/1/files#r394316241).
58+
Optional row names are not a good idea, because people will assume they're
59+
present (see the cuDF experience: it was forced to add them because pandas has them).
60+
Requiring row names seems worse than leaving them out. Note that row labels
61+
could be added in the future - right now there are no clear requirements for
62+
more complex row labels that cannot be represented by a single column. These
63+
do exist; for example, Modin has table and tree-based row labels.
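
To make decisions 2 and 3 concrete, the following sketch shows an opaque object whose methods compute column data only when asked, and which exposes row labels as an ordinary column. The class name `LazyExchangeFrame`, its methods, the `computed` log and the column name `"index"` are all hypothetical, chosen for this example only; plain Python lists stand in for column data.

```python
from typing import Callable, Dict, List, Optional


class LazyExchangeFrame:
    """Hypothetical exchange object; column data is produced only on access."""

    def __init__(
        self,
        producers: Dict[str, Callable[[], list]],
        row_labels: Optional[List[str]] = None,
    ) -> None:
        self._producers = dict(producers)
        if row_labels is not None:
            # No dedicated row-name slot: the labels become a regular column.
            # "index" is an arbitrary name picked for this sketch.
            labels = list(row_labels)
            self._producers = {"index": lambda: labels, **self._producers}
        self.computed: List[str] = []  # records which columns were materialized

    def column_names(self) -> List[str]:
        # Answering this needs no computation, unlike a fully populated dict.
        return list(self._producers)

    def get_column(self, name: str) -> list:
        self.computed.append(name)
        return self._producers[name]()


frame = LazyExchangeFrame(
    {"a": lambda: [1, 2, 3], "b": lambda: [x * 10 for x in range(3)]},
    row_labels=["r0", "r1", "r2"],
)
names = frame.column_names()         # nothing has been computed yet
labels = frame.get_column("index")   # only the label column is materialized
```

A hierarchical dictionary describing memory would have required every column to exist in memory before it could be returned. With methods on an opaque object, a lazy implementation can defer that work until a consumer requests a specific column.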
64+
65+
## Interface
66+
67+
68+
69+
```{literalinclude} dataframe_protocol.py
70+
---
71+
language: python
72+
---

protocol/Makefile

+20
@@ -0,0 +1,20 @@
1+
# Minimal makefile for Sphinx documentation
2+
#
3+
4+
# You can set these variables from the command line, and also
5+
# from the environment for the first two.
6+
SPHINXOPTS ?=
7+
SPHINXBUILD ?= sphinx-build
8+
SOURCEDIR = .
9+
BUILDDIR = _build
10+
11+
# Put it first so that "make" without argument is like "make help".
12+
help:
13+
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
14+
15+
.PHONY: help Makefile
16+
17+
# Catch-all target: route all unknown targets to Sphinx using the new
18+
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
19+
%: Makefile
20+
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

protocol/conf.py

+146
@@ -0,0 +1,146 @@
1+
# Configuration file for the Sphinx documentation builder.
2+
#
3+
# This file only contains a selection of the most common options. For a full
4+
# list see the documentation:
5+
# https://www.sphinx-doc.org/en/master/usage/configuration.html
6+
7+
# -- Path setup --------------------------------------------------------------
8+
9+
# If extensions (or modules to document with autodoc) are in another directory,
10+
# add these directories to sys.path here. If the directory is relative to the
11+
# documentation root, use os.path.abspath to make it absolute, like shown here.
12+
#
13+
# import os
14+
# import sys
15+
# sys.path.insert(0, os.path.abspath('.'))
16+
17+
import sphinx_material
18+
19+
# -- Project information -----------------------------------------------------
20+
21+
project = 'Python dataframe interchange protocol'
22+
copyright = '2021, Consortium for Python Data API Standards'
23+
author = 'Consortium for Python Data API Standards'
24+
25+
# The full version, including alpha/beta/rc tags
26+
release = '2021-DRAFT'
27+
28+
29+
# -- General configuration ---------------------------------------------------
30+
31+
# Add any Sphinx extension module names here, as strings. They can be
32+
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
33+
# ones.
34+
extensions = [
35+
'myst_parser',
36+
'sphinx.ext.extlinks',
37+
'sphinx.ext.intersphinx',
38+
'sphinx.ext.todo',
39+
'sphinx_markdown_tables',
40+
'sphinx_copybutton',
41+
]
42+
43+
# Add any paths that contain templates here, relative to this directory.
44+
templates_path = ['_templates']
45+
46+
# List of patterns, relative to source directory, that match files and
47+
# directories to ignore when looking for source files.
48+
# This pattern also affects html_static_path and html_extra_path.
49+
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
50+
51+
# MyST options
52+
myst_heading_anchors = 3
53+
myst_enable_extensions = ["colon_fence"]
54+
55+
# -- Options for HTML output -------------------------------------------------
56+
57+
# The theme to use for HTML and HTML Help pages. See the documentation for
58+
# a list of builtin themes.
59+
#
60+
extensions.append("sphinx_material")
61+
html_theme_path = sphinx_material.html_theme_path()
62+
html_context = sphinx_material.get_html_context()
63+
html_theme = 'sphinx_material'
64+
65+
# Add any paths that contain custom static files (such as style sheets) here,
66+
# relative to this directory. They are copied after the builtin static files,
67+
# so a file named "default.css" will overwrite the builtin "default.css".
68+
html_static_path = ['_static']
69+
70+
71+
# -- Material theme options (see theme.conf for more information) ------------
72+
html_show_sourcelink = False
73+
html_sidebars = {
74+
"**": ["logo-text.html", "globaltoc.html", "localtoc.html", "searchbox.html"]
75+
}
76+
77+
html_theme_options = {
78+
79+
# Set the name of the project to appear in the navigation.
80+
'nav_title': 'Python dataframe interchange protocol',
81+
82+
# Set your GA account ID to enable tracking
83+
#'google_analytics_account': 'UA-XXXXX',
84+
85+
# Specify a base_url used to generate sitemap.xml. If not
86+
# specified, then no sitemap will be built.
87+
#'base_url': 'https://project.github.io/project',
88+
89+
# Set the color and the accent color (see
90+
# https://material.io/design/color/the-color-system.html)
91+
'color_primary': 'indigo',
92+
'color_accent': 'green',
93+
94+
# Set the repo location to get a badge with stats
95+
#'repo_url': 'https://github.com/project/project/',
96+
#'repo_name': 'Project',
97+
98+
"html_minify": False,
99+
"html_prettify": True,
100+
"css_minify": True,
101+
"logo_icon": "&#xe869",
102+
"repo_type": "github",
103+
"touch_icon": "images/apple-icon-152x152.png",
104+
"theme_color": "#2196f3",
105+
"master_doc": False,
106+
107+
# Visible levels of the global TOC; -1 means unlimited
108+
'globaltoc_depth': 2,
109+
# If False, expand all TOC entries
110+
'globaltoc_collapse': True,
111+
# If True, show hidden TOC entries
112+
'globaltoc_includehidden': True,
113+
114+
"nav_links": [
115+
{"href": "index", "internal": True, "title": "Dataframe interchange protocol"},
116+
{
117+
"href": "https://data-apis.org",
118+
"internal": False,
119+
"title": "Consortium for Python Data API Standards",
120+
},
121+
],
122+
"heroes": {
123+
"index": "A protocol for zero-copy data interchange between Python dataframe libraries",
124+
#"customization": "Configuration options to personalize your site.",
125+
},
126+
127+
#"version_dropdown": True,
128+
#"version_json": "_static/versions.json",
129+
"table_classes": ["plain"],
130+
}
131+
132+
133+
todo_include_todos = True
134+
#html_favicon = "images/favicon.ico"
135+
136+
html_use_index = True
137+
html_domain_indices = True
138+
139+
extlinks = {
140+
"duref": (
141+
"http://docutils.sourceforge.net/docs/ref/rst/" "restructuredtext.html#%s",
142+
"",
143+
),
144+
"durole": ("http://docutils.sourceforge.net/docs/ref/rst/" "roles.html#%s", ""),
145+
"dudir": ("http://docutils.sourceforge.net/docs/ref/rst/" "directives.html#%s", ""),
146+
}

protocol/dataframe_protocol.py

-67
@@ -1,70 +1,3 @@
1-
"""
2-
Specification for objects to be accessed, for the purpose of dataframe
3-
interchange between libraries, via the ``__dataframe__`` method on a libraries'
4-
data frame object.
5-
6-
For guiding requirements, see https://github.com/data-apis/dataframe-api/pull/35
7-
8-
9-
Concepts in this design
10-
-----------------------
11-
12-
1. A `Buffer` class. A *buffer* is a contiguous block of memory - this is the
13-
only thing that actually maps to a 1-D array in a sense that it could be
14-
converted to NumPy, CuPy, et al.
15-
2. A `Column` class. A *column* has a single dtype. It can consist
16-
of multiple *chunks*. A single chunk of a column (which may be the whole
17-
column if ``num_chunks == 1``) is modeled as again a `Column` instance, and
18-
contains 1 data *buffer* and (optionally) one *mask* for missing data.
19-
3. A `DataFrame` class. A *data frame* is an ordered collection of *columns*,
20-
which are identified with names that are unique strings. All the data
21-
frame's rows are the same length. It can consist of multiple *chunks*. A
22-
single chunk of a data frame is modeled as again a `DataFrame` instance.
23-
4. A *mask* concept. A *mask* of a single-chunk column is a *buffer*.
24-
5. A *chunk* concept. A *chunk* is a sub-dividing element that can be applied
25-
to a *data frame* or a *column*.
26-
27-
Note that the only way to access these objects is through a call to
28-
``__dataframe__`` on a data frame object. This is NOT meant as public API;
29-
only think of instances of the different classes here to describe the API of
30-
what is returned by a call to ``__dataframe__``. They are the concepts needed
31-
to capture the memory layout and data access of a data frame.
32-
33-
34-
Design decisions
35-
----------------
36-
37-
**1. Use a separate column abstraction in addition to a dataframe interface.**
38-
39-
Rationales:
40-
- This is how it works in R, Julia and Apache Arrow.
41-
- Semantically most existing applications and users treat a column similar to a 1-D array
42-
- We should be able to connect a column to the array data interchange mechanism(s)
43-
44-
Note that this does not imply a library must have such a public user-facing
45-
abstraction (ex. ``pandas.Series``) - it can only be accessed via ``__dataframe__``.
46-
47-
**2. Use methods and properties on an opaque object rather than returning
48-
hierarchical dictionaries describing memory**
49-
50-
This is better for implementations that may rely on, for example, lazy
51-
computation.
52-
53-
**3. No row names. If a library uses row names, use a regular column for them.**
54-
55-
See discussion at https://github.com/wesm/dataframe-protocol/pull/1/files#r394316241
56-
Optional row names are not a good idea, because people will assume they're present
57-
(see cuDF experience, forced to add because pandas has them).
58-
Requiring row names seems worse than leaving them out.
59-
60-
Note that row labels could be added in the future - right now there's no clear
61-
requirements for more complex row labels that cannot be represented by a single
62-
column. These do exist, for example Modin has has table and tree-based row
63-
labels.
64-
65-
"""
66-
67-
68 1
class Buffer:
69 2
    """
70 3
    Data in the buffer is guaranteed to be contiguous in memory.
