| 1 | ++++ |
| 2 | +date = "2023-05-25" |
| 3 | +author = "Marco Gorelli" |
| 4 | +title = "Want to super-charge your library by writing dataframe-agnostic code? We'd love to hear from you" |
| 5 | +tags = ["APIs", "standard", "consortium", "dataframes", "community", "pandas", "polars", "cudf", "modin", "vaex", "koalas", "ibis", "dask"] |
| 6 | +categories = ["Consortium", "Standardization"] |
| 7 | +description = "An RFC for a dataframe API Standard" |
| 8 | +draft = false |
| 9 | +weight = 40 |
| 10 | ++++ |
| 11 | + |
| 12 | +<h1 align="center"> |
| 13 | + <img |
| 14 | + width="400" |
| 15 | + alt="standard-compliant dataframe" |
| 16 | + src="https://github.com/MarcoGorelli/impl-dataframe-api/assets/33491632/fb4bc907-2b85-4ad7-8d13-c2b9912b97f5"> |
| 17 | +</h1> |
| 18 | + |
| 19 | +Tired of getting lost in if-then statements when dealing with API differences |
| 20 | +between dataframe libraries? Would you like to be able to write your code |
| 21 | +once, have it work with all major dataframe libraries, and be done? |
| 22 | +Let's learn about an initiative which will enable you to write |
| 23 | +cross-dataframe code - no special-casing or data conversions required! |
| 24 | + |
| 25 | +## Why would I want this anyway? |
| 26 | + |
| 27 | +Say you want to write a function which selects rows of a dataframe based |
| 28 | +on the [z-score](https://en.wikipedia.org/wiki/Standard_score) of a given |
| 29 | +column, and you want it to work with any dataframe library. How might |
| 30 | +you write that? |
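For reference, the z-score of a value is its distance from the column mean, measured in standard deviations. Here is a minimal, library-free sketch of the filter we're after, using only Python's standard library. The helper name `keep_within_z` and its `threshold` parameter are made up for illustration; they don't come from any dataframe library.

```python
from statistics import mean, stdev


def keep_within_z(values: list[float], threshold: float = 3.0) -> list[float]:
    """Keep the values whose z-score lies within [-threshold, threshold]."""
    mu = mean(values)
    # Sample standard deviation (ddof=1), which is also the pandas default.
    # Note: raises ZeroDivisionError below if all values are identical.
    sigma = stdev(values)
    return [v for v in values if abs((v - mu) / sigma) <= threshold]
```

The dataframe versions below do the same thing, just expressed in each library's own API.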
| 31 | + |
| 32 | +### Solution 1 |
| 33 | + |
| 34 | +Here's a typical solution (assuming `import pandas` and `import polars` at the top of the module): |
| 35 | +```python |
| 36 | +def remove_outliers(df: object, column: str) -> object: |
| 37 | +    if isinstance(df, pandas.DataFrame): |
| 38 | +        z_score = (df[column] - df[column].mean()) / df[column].std() |
| 39 | +        return df[z_score.between(-3, 3)] |
| 40 | +    if isinstance(df, polars.DataFrame): |
| 41 | +        z_score = (polars.col(column) - polars.col(column).mean()) / polars.col(column).std() |
| 42 | +        return df.filter(z_score.is_between(-3, 3)) |
| 43 | +    if isinstance(df, some_other_library.DataFrame): |
| 44 | +        ... |
| 45 | +``` |
| 46 | +This quickly gets unwieldy. Libraries like `cudf` and `modin` _might_ work |
| 47 | +in the `isinstance(df, pandas.DataFrame)` arm, but there's no guarantee - |
| 48 | +their APIs are similar, but subtly different. Furthermore, as new libraries |
| 49 | +come out, you'd have to keep updating your function to add new `if` statements. |
| 50 | + |
| 51 | +Can we do better? |
| 52 | + |
| 53 | +### Solution 2: Interchange Protocol |
| 54 | + |
| 55 | +An alternative, which wouldn't involve special-casing, could be to leverage the |
| 56 | +[DataFrame interchange protocol](https://data-apis.org/dataframe-protocol/latest/index.html), with pandas imported as `pd`: |
| 57 | +```python |
| 58 | +def remove_outliers(df: object, column: str) -> pd.DataFrame: |
| 59 | +    df_pd = pd.api.interchange.from_dataframe(df) |
| 60 | +    z_score = (df_pd[column] - df_pd[column].mean()) / df_pd[column].std() |
| 61 | +    return df_pd[z_score.between(-3, 3)] |
| 62 | +``` |
| 63 | +We got out of having to write if-then statements (🥳), but there are still a |
| 64 | +couple of issues: |
| 65 | +1. we had to convert to pandas: this might be expensive if your data was |
| 66 | + originally stored on GPU; |
| 67 | +2. the return value is a `pandas.DataFrame`, rather than an object of your |
| 68 | + original dataframe library. |
| 69 | + |
| 70 | +Can we do better? Can we really have it all? |
| 71 | + |
| 72 | +### Solution 3: Introducing the Dataframe Standard |
| 73 | + |
| 74 | +Yes, we really can. To write cross-dataframe code, we'll take these steps: |
| 75 | +1. enable the Standard using ``.__dataframe_standard__``. This will return |
| 76 | + a Standard-compliant dataframe; |
| 77 | +2. write your code, using the [Dataframe Standard specification](https://data-apis.org/dataframe-api/draft/API_specification/index.html); |
| 78 | +3. (optional) return a dataframe from your original library by calling `.dataframe`. |
| 79 | + |
| 80 | +Let's see what this would look like for our ``remove_outliers`` example function: |
| 81 | +```python |
| 82 | +def remove_outliers(df, column): |
| 83 | +    # Get a Standard-compliant dataframe. |
| 84 | +    # NOTE: this has not yet been upstreamed, so won't work out-of-the-box! |
| 85 | +    # See 'resources' below for how to try it out. |
| 86 | +    df_standard = df.__dataframe_standard__() |
| 87 | +    # Use methods from the Standard specification. |
| 88 | +    col = df_standard.get_column_by_name(column) |
| 89 | +    z_score = (col - col.mean()) / col.std() |
| 90 | +    df_standard_filtered = df_standard.get_rows_by_mask((z_score > -3) & (z_score < 3)) |
| 91 | +    # Return the result as a dataframe from the original library. |
| 92 | +    return df_standard_filtered.dataframe |
| 93 | +``` |
| 94 | +This will work, as if by magic, on any dataframe with a Standard-compliant implementation. |
| 95 | +But it's not magic, of course, it's the power of standardisation! |
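To make the mechanics concrete, here is a toy, pure-Python sketch of what could sit behind those three steps. The classes `ToyFrame`, `ToyStandardFrame` and `ToyColumn` are hypothetical stand-ins invented for this post, not part of pandas, polars, or the Standard itself. Only the names `__dataframe_standard__`, `get_column_by_name`, `get_rows_by_mask` and `.dataframe` are taken from the example above, and a real implementation would support far more.

```python
from statistics import mean, stdev


class ToyColumn:
    """Hypothetical column supporting only what remove_outliers needs."""

    def __init__(self, values):
        self.values = list(values)

    def _apply(self, other, op):
        # Broadcast scalars; pair up element-wise with another ToyColumn.
        others = other.values if isinstance(other, ToyColumn) else [other] * len(self.values)
        return ToyColumn(op(a, b) for a, b in zip(self.values, others))

    def __sub__(self, other):
        return self._apply(other, lambda a, b: a - b)

    def __truediv__(self, other):
        return self._apply(other, lambda a, b: a / b)

    def __gt__(self, other):
        return self._apply(other, lambda a, b: a > b)

    def __lt__(self, other):
        return self._apply(other, lambda a, b: a < b)

    def __and__(self, other):
        return self._apply(other, lambda a, b: a and b)

    def mean(self):
        return mean(self.values)

    def std(self):
        return stdev(self.values)


class ToyFrame:
    """Hypothetical 'native' dataframe: a dict mapping column names to lists."""

    def __init__(self, data):
        self.data = data

    def __dataframe_standard__(self):
        # Step 1: hand out a Standard-compliant view of this dataframe.
        return ToyStandardFrame(self)


class ToyStandardFrame:
    """Hypothetical Standard-compliant wrapper around a ToyFrame."""

    def __init__(self, native):
        # Step 3: `.dataframe` gives back the original library's object.
        self.dataframe = native

    def get_column_by_name(self, name):
        return ToyColumn(self.dataframe.data[name])

    def get_rows_by_mask(self, mask):
        filtered = {
            name: [v for v, keep in zip(values, mask.values) if keep]
            for name, values in self.dataframe.data.items()
        }
        return ToyStandardFrame(ToyFrame(filtered))


def remove_outliers(df, column):
    # Step 2: only Standard methods are used from here on.
    df_standard = df.__dataframe_standard__()
    col = df_standard.get_column_by_name(column)
    z_score = (col - col.mean()) / col.std()
    return df_standard.get_rows_by_mask((z_score > -3) & (z_score < 3)).dataframe
```

Calling `remove_outliers` on a `ToyFrame` returns a `ToyFrame` again, and the function itself never mentions the toy classes. That is the point: any library exposing the same entry point and methods could be passed in unchanged.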
| 96 | + |
| 97 | +## The Standard's philosophy - will all dataframe libraries have the same API one day? |
| 98 | + |
| 99 | +Let's start with what this isn't: the Standard isn't an attempt to force all dataframe |
| 100 | +libraries to have the same API. It also isn't a way to convert |
| 101 | +between dataframes: the [Interchange Protocol](https://data-apis.org/dataframe-protocol/latest/index.html), |
| 102 | +whose adoption is increasing, already does that. It also doesn't aim to standardise |
| 103 | +domain- or industry-specific functionality. |
| 104 | + |
| 105 | +Rather, it is a minimal set of essential dataframe functionality which will work |
| 106 | +the same way, in a strict and predictable manner, across dataframe libraries. |
| 107 | +Library authors trying to write dataframe-agnostic code are expected to benefit |
| 108 | +greatly from this, as are their users. |
| 109 | + |
| 110 | +## Who's this for? Do I need to learn yet another API? |
| 111 | + |
| 112 | +If you're a casual user, then probably not. |
| 113 | +The Dataframe Standard is currently aimed mainly at library developers |
| 114 | +who wish to support multiple dataframe libraries. Users of non-pandas dataframe |
| 115 | +libraries would then be able to seamlessly use the Python packages which |
| 116 | +provide functionality for dataframes (e.g. visualisation, feature engineering, |
| 117 | +data cleaning) without having to do any expensive data conversions. |
| 118 | + |
| 119 | +If you're a library author, then we'd love to hear from you. Would this be |
| 120 | +useful to you? We expect it to be, as the demand for dataframe-agnostic tools |
| 121 | +certainly seems to be there: |
| 122 | +- https://github.com/mwaskom/seaborn/issues/3277 |
| 123 | +- https://github.com/scikit-learn/scikit-learn/issues/25896 |
| 124 | +- https://github.com/plotly/plotly.py/issues/3637 |
| 125 | +- (many, many more...) |
| 126 | + |
| 127 | +## Are we there yet? What lies ahead? |
| 128 | + |
| 129 | +This is just a first draft, based on design discussions between authors of various |
| 130 | +dataframe libraries, and this post is a request for comments (RFC). Our goal is to solicit input |
| 131 | +from a wider range of potential stakeholders, and evolve the Standard throughout |
| 132 | +the rest of 2023, resulting in a first official release towards the end of the year. |
| 133 | + |
| 134 | +Future plans include: |
| 135 | +- increasing the scope of the Standard based on real-world code from widely used |
| 136 | + packages (currently, the spec is very minimal); |
| 137 | +- creating implementations of the Standard for several major dataframe libraries |
| 138 | + (initially available as a separate ``dataframe-api-compat`` package); |
| 139 | +- creating a cross-dataframe test-suite; |
| 140 | +- aiming to ensure each major dataframe library has a `__dataframe_standard__` method. |
| 141 | + |
| 142 | +## Conclusion |
| 143 | + |
| 144 | +We've introduced the Dataframe Standard, which allows you to write cross-dataframe code. |
| 145 | +We learned about its philosophy, as well as what it doesn't aim to be. Finally, we saw |
| 146 | +what plans lie ahead - the Standard is in active development, so please watch this space! |
| 147 | + |
| 148 | +## Resources |
| 149 | + |
| 150 | +- Read more on the [official website](https://data-apis.org/dataframe-api/), and contribute to the discussion on the [GitHub repo](https://github.com/data-apis/dataframe-api) |
| 151 | +- Try out the [proof-of-concept implementation for pandas and polars](https://github.com/MarcoGorelli/impl-dataframe-api)! |