+++
date = "2023-05-25"
author = "Marco Gorelli"
title = "Want to super-charge your library by writing dataframe-agnostic code? We'd love to hear from you"
tags = ["APIs", "standard", "consortium", "dataframes", "community", "pandas", "polars", "cudf", "modin", "vaex", "koalas", "ibis", "dask"]
categories = ["Consortium", "Standardization"]
description = "An RFC for a dataframe API Standard"
draft = false
weight = 40
+++

<h1 align="center">
  <img
    width="400"
    alt="standard-compliant dataframe"
    src="https://github.com/MarcoGorelli/impl-dataframe-api/assets/33491632/fb4bc907-2b85-4ad7-8d13-c2b9912b97f5">
</h1>

Tired of getting lost in if-then statements when dealing with API differences
between dataframe libraries? Would you like to be able to write your code
once, have it work with all major dataframe libraries, and be done?
Let's learn about an initiative which will enable you to write
cross-dataframe code - no special-casing nor data conversions required!

## Why would I want this anyway?

Say you want to write a function which selects rows of a dataframe based
on the [z-score](https://en.wikipedia.org/wiki/Standard_score) of a given
column, and you want it to work with any dataframe library. How might
you write that?

### Solution 1

Here's a typical solution:
```python
import pandas as pd
import polars as pl


def remove_outliers(df: object, column: str) -> object:
    if isinstance(df, pd.DataFrame):
        z_score = (df[column] - df[column].mean()) / df[column].std()
        return df[z_score.between(-3, 3)]
    if isinstance(df, pl.DataFrame):
        z_score = (pl.col(column) - pl.col(column).mean()) / pl.col(column).std()
        return df.filter(z_score.is_between(-3, 3))
    if isinstance(df, some_other_library.DataFrame):
        ...
```

This quickly gets unwieldy. Libraries like `cudf` and `modin` _might_ work
in the `isinstance(df, pd.DataFrame)` arm, but there's no guarantee -
their APIs are similar, but subtly different. Furthermore, as new libraries
come out, you'd have to keep updating your function to add new `if` statements.

Can we do better?

### Solution 2: Interchange Protocol

An alternative, which wouldn't involve special-casing, could be to
leverage the [DataFrame interchange protocol](https://data-apis.org/dataframe-protocol/latest/index.html):
```python
import pandas as pd


def remove_outliers(df: object, column: str) -> pd.DataFrame:
    df_pd = pd.api.interchange.from_dataframe(df)
    z_score = (df_pd[column] - df_pd[column].mean()) / df_pd[column].std()
    return df_pd[z_score.between(-3, 3)]
```

We got out of having to write if-then statements (🥳), but there are still a
couple of issues:
1. we had to convert to pandas: this might be expensive if your data was
   originally stored on GPU;
2. the return value is a `pandas.DataFrame`, rather than an object of your
   original dataframe library (see the quick illustration below).
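
For instance - a minimal sketch, assuming pandas>=1.5 and a library (such as polars)
whose dataframes support the interchange protocol - calling the function above on a
`polars.DataFrame` hands you back a pandas one:
```python
import polars as pl

df = pl.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
result = remove_outliers(df, "a")
print(type(result))  # a pandas DataFrame, not a polars one
```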

Can we do better? Can we really have it all?
### Solution 3: Introducing the Dataframe Standard
73+
74+
Yes, we really can. To write cross-dataframe code, we'll take these steps:
75+
1. enable the Standard using ``.__dataframe_standard__``. This will return
76+
a Standard-compliant dataframe;
77+
2. write your code, using the [Dataframe Standard specification](https://data-apis.org/dataframe-api/draft/API_specification/index.html)
78+
3. (optional) return a dataframe from your original library by calling `.dataframe`.
79+
80+
Let's see how this would look like for our ``remove_outliers`` example function:
```python
def remove_outliers(df, column):
    # Get a Standard-compliant dataframe.
    # NOTE: this has not yet been upstreamed, so won't work out-of-the-box!
    # See 'Resources' below for how to try it out.
    df_standard = df.__dataframe_standard__()
    # Use methods from the Standard specification.
    col = df_standard.get_column_by_name(column)
    z_score = (col - col.mean()) / col.std()
    df_standard_filtered = df_standard.get_rows_by_mask((z_score > -3) & (z_score < 3))
    # Return the result as a dataframe from the original library.
    return df_standard_filtered.dataframe
```

This will work, as if by magic, on any dataframe with a Standard-compliant implementation.
But it's not magic, of course - it's the power of standardisation!
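
As a usage sketch - assuming a Standard-compliant implementation is available for your
dataframe library (see the proof-of-concept in 'Resources' below) - the same function
works unchanged for both pandas and polars, and returns a dataframe of the same type
you passed in:
```python
import pandas as pd
import polars as pl

df_pd = pd.DataFrame({"a": [1.0, 2.0, 3.0]})
df_pl = pl.DataFrame({"a": [1.0, 2.0, 3.0]})

print(type(remove_outliers(df_pd, "a")))  # a pandas DataFrame
print(type(remove_outliers(df_pl, "a")))  # a polars DataFrame
```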

## The Standard's philosophy - will all dataframe libraries have the same API one day?

Let's start with what this isn't: the Standard isn't an attempt to force all dataframe
libraries to have the same API. It also isn't a way to convert
between dataframes: the [Interchange Protocol](https://data-apis.org/dataframe-protocol/latest/index.html),
whose adoption is increasing, already does that. It also doesn't aim to standardise
domain- or industry-specific functionality.

Rather, it is a minimal set of essential dataframe functionality which will behave
in a strict and predictable manner across libraries. Library authors trying to write
dataframe-agnostic code are expected to greatly benefit from this, as are their users.

## Who's this for? Do I need to learn yet another API?

If you're a casual user, then probably not.
The Dataframe Standard is currently mainly targeted towards library developers
who wish to support multiple dataframe libraries. Users of non-pandas dataframe
libraries would then be able to seamlessly use the Python packages which
provide functionality for dataframes (e.g. visualisation, feature engineering,
data cleaning) without having to do any expensive data conversions.

If you're a library author, then we'd love to hear from you. Would this be
useful to you? We expect it to be, as the demand for dataframe-agnostic tools
certainly seems to be there:
- https://github.com/mwaskom/seaborn/issues/3277
- https://github.com/scikit-learn/scikit-learn/issues/25896
- https://github.com/plotly/plotly.py/issues/3637
- (many, many more...)

## Are we there yet? What lies ahead?

This is a first draft, based on design discussions between authors of various
dataframe libraries, and a request for comments (RFC). Our goal is to solicit input
from a wider range of potential stakeholders, and to evolve the Standard throughout
the rest of 2023, resulting in a first official release towards the end of the year.

Future plans include:
- increasing the scope of the Standard based on real-world code from widely used
  packages (currently, the spec is very minimal);
- creating implementations of the Standard for several major dataframe libraries
  (initially available as a separate ``dataframe-api-compat`` package);
- creating a cross-dataframe test-suite;
- aiming to ensure each major dataframe library has a `__dataframe_standard__` method
  (until then, library authors can feature-detect it - see the sketch below).
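
A minimal sketch of such feature-detection (the helper name is hypothetical; only the
``__dataframe_standard__`` entry point described above is assumed):
```python
def to_standard(df):
    # Hypothetical helper: use the Standard-compliant dataframe if the
    # library provides the (not yet upstreamed) entry point, else fail loudly.
    if hasattr(df, "__dataframe_standard__"):
        return df.__dataframe_standard__()
    raise TypeError(
        f"{type(df).__name__} does not provide a Standard-compliant dataframe - "
        "see 'Resources' for a proof-of-concept implementation."
    )
```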

## Conclusion

We've introduced the Dataframe Standard, which allows you to write cross-dataframe code.
We learned about its philosophy, as well as what it doesn't aim to be. Finally, we saw
what plans lie ahead - the Standard is in active development, so please watch this space!

## Resources

- Read more on the [official website](https://data-apis.org/dataframe-api/), and contribute to the discussion on the [GitHub repo](https://github.com/data-apis/dataframe-api)
- Try out the [proof-of-concept implementation for pandas and polars](https://github.com/MarcoGorelli/impl-dataframe-api)!
