|
## Introduction

This document defines a Python dataframe API.

A dataframe is a programming interface for expressing data manipulations over a
data structure consisting of rows and columns. Columns are named, and values in a
column share a common data type. This definition is intentionally left broad.
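
For illustration, here is a minimal example using pandas (any of the libraries
discussed below fits this definition equally well): each column has a name, and all
values within a column share a single data type.

```python
import pandas as pd

# Three named columns; within each column, all values share a common data type.
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],   # string (object) column
    "age": [34, 29, 41],                 # integer column
    "height": [1.68, 1.83, 1.75],        # floating-point column
})

print(df.dtypes)  # reports one data type per column
```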

## History and dataframe implementations

Dataframe libraries exist in several programming languages, such as
[R](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame),
[Scala](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html),
[Julia](https://juliadata.github.io/DataFrames.jl/stable/) and others.

In Python, the most popular dataframe library is [pandas](https://pandas.pydata.org/).
pandas was initially developed at a hedge fund, with a focus on
[panel data](https://en.wikipedia.org/wiki/Panel_data) and financial time series.
It was open sourced in 2009, and since then it has grown in popularity, expanding into
many domains beyond time series and financial data. While still rich in time series
functionality, it is today considered a general-purpose dataframe library. The original
`Panel` class that gave the library its name was deprecated in 2017 and removed in 2019,
to focus on the main `DataFrame` class.

Internally, pandas is implemented on top of NumPy, which is used to store the data
and to perform many of the operations. Some parts of pandas are written in Cython.

As of 2020, the pandas website has around one and a half million visitors per month.

Other libraries have emerged in recent years to address some of the limitations of pandas.
In most cases, these libraries implement a public API very similar to that of pandas, to
make the transition easier. A short description of the main dataframe libraries in
Python follows.

[Dask](https://dask.org/) is a task scheduler built in Python, which implements a
dataframe interface. Dask dataframe uses pandas internally in the workers, and it provides
an API similar to pandas, adapted to its distributed and lazy nature.

[Vaex](https://vaex.io/) is an out-of-core alternative to pandas. Vaex uses HDF5 to
create memory maps that avoid loading data sets into memory. Some parts of Vaex are
implemented in C++.

[Modin](https://github.com/modin-project/modin) is a distributed dataframe
library originally built on [Ray](https://github.com/ray-project/ray), but it has
a more modular design that allows it to also use Dask as a scheduler, or to replace the
pandas-like public API with a SQLite-like one.

[cuDF](https://github.com/rapidsai/cudf) is a GPU dataframe library built on top
of Apache Arrow and RAPIDS. It provides an API similar to pandas.

[PySpark](https://spark.apache.org/docs/latest/api/python/index.html) is a
dataframe library that uses Spark as a backend. The PySpark public API is based on the
original Spark API, and not on pandas.

[Koalas](https://github.com/databricks/koalas) is a dataframe library built on
top of PySpark that provides a pandas-like API.

[Ibis](https://ibis-project.org/) is a dataframe library with multiple SQL backends.
It uses SQLAlchemy and a custom SQL compiler to translate its pandas-like API into
SQL statements, executed by the backends. It supports conventional DBMSs, as well
as big data systems such as Apache Impala or BigQuery.

Given the growing and complex Python dataframe ecosystem, this document provides
a standard Python dataframe API. Until recently, pandas was the de facto standard for
Python dataframes. But there is now a growing number not only of dataframe libraries,
but also of libraries that interact with dataframes (for example visualization, statistical,
or machine learning libraries). Interactions among libraries are becoming complex, and the
pandas public API is suboptimal as a standard, because of its size, its complexity, and the
implementation details it exposes (for example, the use of NumPy data types or `NaN`).


## Scope

In the first iteration of the API standard, the scope is limited to creating a data exchange
protocol. In future iterations the scope will be broader, including elements to operate on
the data.

The different elements of the API are in the scope of this document. This includes signatures
and semantics. To be more specific:

- Data structures and Python classes
- Functions, methods, attributes and other API elements
- Expected returns of the different operations
- Data types (Python and low-level types)

The scope of this document is limited to generic dataframes, and not to dataframes specific to
certain domains.


### Goals

The goal of the first iteration is to provide a data exchange protocol, so that consumers of
dataframes can interact with a standard interface to access their data.
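
The protocol itself is not defined in this document. Purely as an illustration of the idea,
a consumer-facing sketch could look like the following; the method name `__dataframe__`, the
helper `from_dataframe` and the classes are hypothetical placeholders, not part of any standard.

```python
# Hypothetical sketch of a data exchange protocol (all names are placeholders).
# A producing library describes its data through a well-known method, and a
# consuming library rebuilds its own structures from that common description.
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence


@dataclass
class ExchangeColumn:
    """A single column: its name and a sequence of values sharing one data type."""
    name: str
    values: Sequence[Any]


@dataclass
class ExchangeDataFrame:
    """Implementation-independent description of a dataframe."""
    columns: List[ExchangeColumn]


def from_dataframe(obj: Any) -> Dict[str, list]:
    """Consume any dataframe object that exposes the (hypothetical) protocol."""
    if not hasattr(obj, "__dataframe__"):
        raise TypeError("object does not support the dataframe exchange protocol")
    exchange: ExchangeDataFrame = obj.__dataframe__()
    return {column.name: list(column.values) for column in exchange.columns}
```

With something along these lines, a consuming library would not need to know which
implementation produced a dataframe, only that it exposes the exchange interface.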

The goal of future iterations will be to provide a standard interface that encapsulates
implementation details of dataframe libraries. This will allow users and third-party libraries to
write code that interacts and operates with a standard dataframe, and not with specific implementations.

The main goals for the API defined in this document are:

- Make conversion of data among different implementations easier
- Let third-party libraries consume dataframes from any implementation

In the future, besides a data exchange protocol, the standard aims to include common operations
performed with dataframes, with the following goals in mind:

- Provide a common API for dataframes, so that software using dataframes can work with all
  implementations
- Provide a common API for dataframes on top of which user interfaces can be built, for example
  libraries for interactive use or for specific domains and industries
- Help users transition from one dataframe library to another

See the [use cases](02_use_cases.html) section for details on the exact use cases considered.


### Out-of-scope

#### Execution details

Implementation details of the dataframes and the execution of operations are out of scope. This includes:

- How data is represented and stored (whether the data is in memory, on disk, or distributed)
- Expectations on when the execution happens (in an eager or lazy way)
- Other execution details

**Rationale:** The API defined in this document needs to be usable by libraries as diverse as Ibis,
Dask, Vaex or cuDF. The data can live in databases, distributed systems, disk or GPU memory.
Any decision that involves assumptions on where the data is stored, or where execution happens,
could prevent implementations from adopting the standard.

#### High level APIs

It is out of scope to provide an API designed for interactive use. While interactive use
is a key aspect of dataframes, an API designed for interactive use can be built on top
of the API defined in this document.

Domain or industry specific APIs are also out of scope, but they can benefit from the standard
to better interact with the different dataframe implementations.

**Rationale:** Interactive and domain-specific users are key in the Python dataframe ecosystem.
But the number and diversity of users makes it unfeasible to standardize every dataframe feature
that is currently used. In particular, functionality built as syntactic sugar for convenience in
interactive use, or heavily overloaded functionality, creates very complex APIs. Examples are the
pandas dataframe constructor, which accepts a huge number of formats, and its `__getitem__`
(e.g. `df[something]`), which is heavily overloaded. Implementations can provide convenient
functionality like this for the users they are targeting, but it is out of scope for the standard,
so that the standard stays simple and easy to adopt.
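
To make the overloading concrete, the pandas `df[something]` syntax accepts, among other
things, a label, a list of labels, a boolean mask and a slice, each with different semantics:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

df["a"]           # a label: returns column "a" as a Series
df[["a", "b"]]    # a list of labels: returns a DataFrame with those columns
df[df["a"] > 1]   # a boolean Series: filters the rows
df[0:2]           # a slice: selects rows by position
```

The standard deliberately avoids this kind of heavily overloaded entry point.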


### Non-goals

- Build an API that is appropriate for all users
- Have a single dataframe implementation for Python
- Standardize functionality specific to a domain or industry


## Stakeholders

This section provides the list of stakeholders considered for the definition of this API.


### Dataframe library authors

We encourage Python dataframe libraries to implement the API defined in this document.

The known Python dataframe libraries at the time of writing this document are:

- [cuDF](https://github.com/rapidsai/cudf)
- [Dask](https://dask.org/)
- [datatable](https://github.com/h2oai/datatable)
- [dexplo](https://github.com/dexplo/dexplo/)
- [Eland](https://github.com/elastic/eland)
- [Grizzly](https://github.com/weld-project/weld#grizzly)
- [Ibis](https://ibis-project.org/)
- [Koalas](https://github.com/databricks/koalas)
- [Mars](https://docs.pymars.org/en/latest/)
- [Modin](https://github.com/modin-project/modin)
- [pandas](https://pandas.pydata.org/)
- [PySpark](https://spark.apache.org/docs/latest/api/python/index.html)
- [StaticFrame](https://static-frame.readthedocs.io/en/latest/)
- [Turi Create](https://github.com/apple/turicreate)
- [Vaex](https://vaex.io/)


### Downstream library authors

Authors of libraries that consume dataframes. They can use the API defined in this document
to know how the data contained in a dataframe can be consumed, and which operations are
implemented.

A non-exhaustive list of downstream library categories is:

- Plotting and visualization (e.g. Matplotlib, Bokeh, Altair, Plotly)
- Statistical libraries (e.g. statsmodels)
- Machine learning libraries (e.g. scikit-learn)
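
As an illustration of why downstream libraries benefit from a standard (a hypothetical
sketch, not code from any of the libraries listed above): without a standard, a consumer
typically has to special-case every dataframe implementation it wants to support.

```python
import pandas as pd

def extract_column(df, name):
    """Return a column as a plain list, special-casing each known implementation."""
    if isinstance(df, pd.DataFrame):
        return df[name].tolist()
    # ... one branch per additional supported library (Dask, Vaex, cuDF, ...)
    raise TypeError(f"unsupported dataframe type: {type(df)!r}")
```

With the standard data exchange protocol, a single code path would work for any
implementation that adopts it.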


### Upstream library authors

Authors of libraries that provide functionality used by dataframes.

A non-exhaustive list of upstream categories is:

- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow, NumPy)
- Task schedulers (e.g. Dask, Ray, Mars)
- Big data systems (e.g. Spark, Hive, Impala, Presto)
- Libraries for database access (e.g. SQLAlchemy)


### Dataframe power users

This group comprises developers of reusable code that uses dataframes: for example, developers of
applications that use dataframes, or authors of libraries that provide specialized dataframe
APIs built on top of the standard API.

People using dataframes in an interactive way are considered out of scope. These users include data
analysts, data scientists and other users that are key for dataframes. But this type of user may need
shortcuts, or libraries that make decisions for them to save them time, for example automatic type
inference, or extensive use of very compact syntax like Python square brackets / `__getitem__`.
Standardizing such practices can be extremely difficult, and it is out of scope.

With a standard API that targets developers writing reusable code, we expect
to also serve data analysts and other interactive users, but in an indirect way: by providing a
standard API on top of which other libraries can be built, including libraries with the syntactic
sugar required for fast analysis of data.


## High-level API overview


## References