Commit 8352aba: Adding introduction, goals, scope and use cases to the RFC (#27)
1 parent 333db2b

2 files changed: +390 −3 lines

spec/01_purpose_and_scope.md (+213 −3)
## Introduction

This document defines a Python dataframe API.

A dataframe is a programming interface for expressing data manipulations over a
data structure consisting of rows and columns. Columns are named, and the values in a
column share a common data type. This definition is intentionally left broad.
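As a minimal illustration of this definition, the snippet below builds a small dataframe with pandas (discussed in the next section). The column names and values are invented example data; the point is only that columns are named and each column holds values of a single type.

```python
import pandas as pd

# A dataframe: rows and named columns, where each column
# holds values of a single shared data type.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],  # a column of strings
    "age": [34, 28, 41],                # a column of integers
})

print(df.shape)          # (3, 2): three rows, two columns
print(list(df.columns))  # ['name', 'age']
print(df.dtypes)         # one data type per column
```

Any structure with these properties, regardless of how it is stored or executed, fits the definition used in this document.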

## History and dataframe implementations

Dataframe libraries exist in several programming languages, such as
[R](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame),
[Scala](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html),
[Julia](https://juliadata.github.io/DataFrames.jl/stable/) and others.

In Python, the most popular dataframe library is [pandas](https://pandas.pydata.org/).
pandas was initially developed at a hedge fund, with a focus on
[panel data](https://en.wikipedia.org/wiki/Panel_data) and financial time series.
It was open sourced in 2009, and since then it has grown in popularity, expanding into
many domains beyond time series and financial data. While still rich in time series
functionality, today it is considered a general-purpose dataframe library. The original
`Panel` class that gave the library its name was deprecated in 2017 and removed in 2019,
to focus development on the main `DataFrame` class.

Internally, pandas is implemented on top of NumPy, which is used to store the data
and to perform many of the operations. Some parts of pandas are written in Cython.

As of 2020, the pandas website has around one and a half million visitors per month.

Other libraries have emerged in recent years to address some of the limitations of pandas.
In most cases, however, they implement a public API very similar to that of pandas, to
make the transition easier. A short description of the main Python dataframe libraries
follows.
[Dask](https://dask.org/) is a task scheduler built in Python, which implements a
dataframe interface. Dask dataframes use pandas internally in the workers, and provide
an API similar to pandas, adapted to their distributed and lazy nature.

[Vaex](https://vaex.io/) is an out-of-core alternative to pandas. Vaex uses HDF5 to
create memory maps that avoid loading data sets into memory. Some parts of Vaex are
implemented in C++.

[Modin](https://github.com/modin-project/modin) is a distributed dataframe
library originally built on [Ray](https://github.com/ray-project/ray), but designed
in a modular way that allows it to also use Dask as a scheduler, or to replace the
pandas-like public API with a SQLite-like one.

[cuDF](https://github.com/rapidsai/cudf) is a GPU dataframe library built on top
of Apache Arrow and RAPIDS. It provides an API similar to pandas.

[PySpark](https://spark.apache.org/docs/latest/api/python/index.html) is a
dataframe library that uses Spark as a backend. The PySpark public API is based on the
original Spark API, not on pandas.

[Koalas](https://github.com/databricks/koalas) is a dataframe library built on
top of PySpark that provides a pandas-like API.

[Ibis](https://ibis-project.org/) is a dataframe library with multiple SQL backends.
It uses SQLAlchemy and a custom SQL compiler to translate its pandas-like API into
SQL statements, executed by the backends. It supports conventional DBMSs, as well
as big data systems such as Apache Impala or BigQuery.

Given the growing Python dataframe ecosystem and its complexity, this document provides
a standard Python dataframe API. Until recently, pandas was the de facto standard for
Python dataframes. But today there is a growing number not only of dataframe libraries,
but also of libraries that interact with dataframes (for example visualization,
statistical or machine learning libraries). Interactions among libraries are becoming
complex, and the pandas public API is suboptimal as a standard, because of its size, its
complexity, and the implementation details it exposes (for example, NumPy data types
or `NaN`).
## Scope

In its first iteration, the scope of the API standard is limited to creating a data
exchange protocol. In future iterations the scope will broaden, including elements to
operate on the data.

The different elements of the API are in the scope of this document, including
signatures and semantics. To be more specific:

- Data structures and Python classes
- Functions, methods, attributes and other API elements
- Expected return values of the different operations
- Data types (Python and low-level types)

The scope of this document is limited to generic dataframes, not dataframes specific to
certain domains.
### Goals

The goal of the first iteration is to provide a data exchange protocol, so consumers of
dataframes can access their data through a standard interface.

The goal of future iterations will be to provide a standard interface that encapsulates
the implementation details of dataframe libraries. This will allow users and third-party
libraries to write code that interacts and operates with a standard dataframe, rather
than with specific implementations.

The main goals of the API defined in this document are:

- Make conversion of data among different implementations easier
- Let third-party libraries consume dataframes from any implementation

In the future, besides a data exchange protocol, the standard aims to include common
operations performed with dataframes, with the following goals in mind:

- Provide a common API for dataframes so software using dataframes can work with all
  implementations
- Provide a common API for dataframes on which to build user interfaces, for example
  libraries for interactive use or for specific domains and industries
- Help users transition from one dataframe library to another

See the [use cases](02_use_cases.html) section for details on the exact use cases considered.
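To make the exchange-protocol goal concrete, here is a hypothetical sketch of how a consumer library could work against a standard interface instead of a specific implementation. The method name `__dataframe__` and its return format are illustrative assumptions made for this sketch, not part of what this document defines.

```python
# Hypothetical sketch: the method name `__dataframe__` and the
# returned structure are assumptions for illustration only.

class MyDataFrame:
    """A minimal dataframe: named columns, each holding values of one type."""

    def __init__(self, columns):
        self._columns = columns  # dict: column name -> list of values

    def __dataframe__(self):
        # Expose the data in an implementation-independent form.
        return {"columns": dict(self._columns)}


def column_names(df):
    """A consumer that works with any object supporting the protocol."""
    return list(df.__dataframe__()["columns"])


df = MyDataFrame({"name": ["Alice", "Bob"], "age": [31, 42]})
print(column_names(df))  # ['name', 'age']
```

The key property is that `column_names` never touches `MyDataFrame` internals: any implementation exposing the same protocol, whether its data lives in memory, on disk or on a GPU, could be consumed by the same code.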
### Out-of-scope

#### Execution details

Implementation details of dataframes and the execution of operations are out of scope.
This includes:

- How data is represented and stored (whether in memory, on disk, or distributed)
- Expectations on when execution happens (eagerly or lazily)
- Other execution details

**Rationale:** The API defined in this document needs to be usable by libraries as
diverse as Ibis, Dask, Vaex or cuDF. The data can live in databases, distributed systems,
on disk or in GPU memory. Any decision that involves assumptions about where the data is
stored, or where execution happens, could prevent implementations from adopting the
standard.

#### High-level APIs

It is out of scope to provide an API designed for interactive use. While interactive use
is a key aspect of dataframes, an API designed for it can be built on top of the API
defined in this document.

Domain- or industry-specific APIs are also out of scope, but they can benefit from the
standard to better interact with the different dataframe implementations.

**Rationale:** Interactive and domain-specific users are key in the Python dataframe
ecosystem. But the number and diversity of users makes it unfeasible to standardize every
dataframe feature currently in use. In particular, functionality built as syntactic sugar
for convenience in interactive use, or heavily overloaded functionality, creates very
complex APIs. Examples are the pandas dataframe constructor, which accepts a huge number
of formats, and its heavily overloaded `__getitem__` (e.g. `df[something]`).
Implementations can provide convenient functionality like this for the users they target,
but it is out of scope for the standard, so that the standard stays simple and easy to
adopt.

### Non-goals

- Build an API that is appropriate for all users
- Have a unique dataframe implementation for Python
- Standardize functionalities specific to a domain or industry
## Stakeholders

This section lists the stakeholders considered for the definition of this API.

### Dataframe library authors

We encourage Python dataframe libraries to implement the API defined in this document.

The known Python dataframe libraries at the time of writing are:

- [cuDF](https://github.com/rapidsai/cudf)
- [Dask](https://dask.org/)
- [datatable](https://github.com/h2oai/datatable)
- [dexplo](https://github.com/dexplo/dexplo/)
- [Eland](https://github.com/elastic/eland)
- [Grizzly](https://github.com/weld-project/weld#grizzly)
- [Ibis](https://ibis-project.org/)
- [Koalas](https://github.com/databricks/koalas)
- [Mars](https://docs.pymars.org/en/latest/)
- [Modin](https://github.com/modin-project/modin)
- [pandas](https://pandas.pydata.org/)
- [PySpark](https://spark.apache.org/docs/latest/api/python/index.html)
- [StaticFrame](https://static-frame.readthedocs.io/en/latest/)
- [Turi Create](https://github.com/apple/turicreate)
- [Vaex](https://vaex.io/)
### Downstream library authors

Authors of libraries that consume dataframes. They can use the API defined in this
document to know how the data contained in a dataframe can be consumed, and which
operations are implemented.

A non-exhaustive list of downstream library categories:

- Plotting and visualization (e.g. Matplotlib, Bokeh, Altair, Plotly)
- Statistical libraries (e.g. statsmodels)
- Machine learning libraries (e.g. scikit-learn)

### Upstream library authors

Authors of libraries that provide functionality used by dataframes.

A non-exhaustive list of upstream categories:

- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow, NumPy)
- Task schedulers (e.g. Dask, Ray, Mars)
- Big data systems (e.g. Spark, Hive, Impala, Presto)
- Libraries for database access (e.g. SQLAlchemy)
### Dataframe power users

This group comprises developers of reusable code that uses dataframes: for example,
developers of applications that use dataframes, or authors of libraries that provide
specialized dataframe APIs built on top of the standard API.

People using dataframes interactively are considered out of scope. These users include
data analysts, data scientists and other users who are key for dataframes. But this type
of user may need shortcuts, or libraries that make decisions for them to save time; for
example automatic type inference, or heavy use of very compact syntax like Python square
brackets / `__getitem__`. Standardizing such practices can be extremely difficult, and it
is out of scope.

With the development of a standard API targeting developers who write reusable code, we
expect to also serve data analysts and other interactive users, but indirectly: by
providing a standard API on top of which other libraries can be built, including
libraries with the syntactic sugar required for fast analysis of data.
## High-level API overview
## References