fix: use PyCapsule Interface instead of Dataframe Interchange Protocol #3782
Open: MarcoGorelli wants to merge 1 commit into mwaskom:master from MarcoGorelli:pycapsule
+66
−66
@@ -42,6 +42,7 @@ dev = [
     "mypy",
     "pandas-stubs",
     "pre-commit",
+    "pyarrow",
     "flit",
 ]
 docs = [
@@ -1,6 +1,7 @@
 import functools
 import numpy as np
 import pandas as pd
+from seaborn.external.version import Version
 
 import pytest
 from numpy.testing import assert_array_equal
@@ -404,11 +405,11 @@ def test_bad_type(self, flat_list):
         with pytest.raises(TypeError, match=err):
             PlotData(flat_list, {})
 
-    @pytest.mark.skipif(
-        condition=not hasattr(pd.api, "interchange"),
-        reason="Tests behavior assuming support for dataframe interchange"
-    )
     def test_data_interchange(self, mock_long_df, long_df):
+        pytest.importorskip(
[Review comment on the line above] Nice, TIL
+            'pyarrow', '14.0',
+            reason="Tests behavior assuming support for PyCapsule Interface"
+        )
 
         variables = {"x": "x", "y": "z", "color": "a"}
         p = PlotData(mock_long_df, variables)
@@ -419,21 +420,22 @@ def test_data_interchange(self, mock_long_df, long_df):
         for var, col in variables.items():
             assert_vector_equal(p.frame[var], long_df[col])
 
-    @pytest.mark.skipif(
-        condition=not hasattr(pd.api, "interchange"),
-        reason="Tests behavior assuming support for dataframe interchange"
-    )
     def test_data_interchange_failure(self, mock_long_df):
+        pytest.importorskip(
+            'pyarrow', '14.0',
+            reason="Tests behavior assuming support for PyCapsule Interface"
+        )
 
-        mock_long_df._data = None # Break __dataframe__()
+        mock_long_df.__arrow_c_stream__ = lambda _x: 1 / 0 # Break __arrow_c_stream__()
         with pytest.raises(RuntimeError, match="Encountered an exception"):
             PlotData(mock_long_df, {"x": "x"})
 
-    @pytest.mark.skipif(
-        condition=hasattr(pd.api, "interchange"),
-        reason="Tests graceful failure without support for dataframe interchange"
-    )
     def test_data_interchange_support_test(self, mock_long_df):
+        pyarrow = pytest.importorskip('pyarrow')
+        if Version(pyarrow.__version__) >= Version('14.0.0'):
+            pytest.skip(
+                reason="Tests graceful failure without support for PyCapsule Interface"
+            )
 
         with pytest.raises(TypeError, match="Support for non-pandas DataFrame"):
             PlotData(mock_long_df, {"x": "x"})
Is this generally a dependency of non-pandas dataframe libraries now? Or could this change introduce a regression for e.g. polars users who are currently leveraging the dataframe interchange protocol?
Thanks for your review!
Polars doesn't depend on PyArrow, but `polars.DataFrame.to_pandas` always requires PyArrow. So, in practice, anyone working with both dataframe libraries may well already have PyArrow installed.
To avoid requiring PyArrow for the cases when it's not necessary, one way could be to do something like:
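The snippet that followed in the original comment is not preserved here. As a minimal sketch of such a fallback, assuming hypothetical helper names `pick_conversion` and `to_pandas` (neither is seaborn's API) and without checking the installed PyArrow version: prefer the PyCapsule Interface when PyArrow is importable, otherwise fall back to the interchange protocol.

```python
import importlib.util


def pick_conversion(data):
    """Name the conversion path to use for a non-pandas dataframe.

    Hypothetical helper for illustration only; not part of seaborn.
    """
    has_pyarrow = importlib.util.find_spec("pyarrow") is not None
    if hasattr(data, "__arrow_c_stream__") and has_pyarrow:
        return "pycapsule"
    if hasattr(data, "__dataframe__"):
        return "interchange"
    raise TypeError("Input supports neither PyCapsule nor dataframe interchange")


def to_pandas(data):
    """Convert `data` to a pandas DataFrame along the chosen path."""
    if pick_conversion(data) == "pycapsule":
        import pyarrow as pa  # assumes pyarrow >= 14.0 for PyCapsule support

        return pa.table(data).to_pandas()
    import pandas as pd  # assumes a pandas version that ships pd.api.interchange

    return pd.api.interchange.from_dataframe(data)
```

Keeping the path selection separate from the conversion makes it easy to see which inputs would still go through the interchange protocol.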
This has the upside of not requiring PyArrow in some cases, but the downside of hiding issues where the interchange protocol silently produces invalid results
It may be possible to do this PyCapsule Interface conversion in the future without PyArrow but with something lighter instead, like arro3 by @kylebarron (who I'm ccing in case he has comments too)
What would be your preference?
Some polars users may not have pyarrow installed. If seaborn needs to get pandas data, the only production-ready way to do `Arrow -> pandas` that I know of is using pyarrow.
As Marco mentions, I'm working on arro3, which is a minimal library for Arrow in Python, but pandas interop is not a primary concern, and it's not production-ready today.
FWIW pandas 3.x is going to strongly incentivize users to install PyArrow, although it stops short of outright requiring it. In theory, the only people that shouldn't have PyArrow installed are those that operate in space/resource constrained environments, probably in headless environments like AWS Lambda where seaborn won't be used
Of course up to you how much you want to support non-PyArrow configurations, but the dataframe interchange protocol is relatively buggy and gets very little support, so you may find it easier altogether to force users towards PyArrow
cuDF have said that they will deprecate the interchange format: rapidsai/cudf#17282
Plotly have stopped using it, so Seaborn is the only library left using it
At this point, I think there's a greater risk in keeping it - I don't want to force anything here of course, just making sure you're aware
Could you please clarify what you mean by "if the infra isn't there yet"?
The Arrow C Interface already has quite widespread adoption and I'm not aware of edge cases in its implementations. @WillAyd wrote about switching his Pantab project over to it in Leveraging the Arrow C Data Interface, and noted
That was nearly a year ago, and given that he's now suggesting it here in Plotly, I'd say that his experience has stayed just as positive
Regarding the PyArrow dependency, I'll also note that `polars.DataFrame.to_pandas` also requires PyArrow, so any Polars user (such as myself) would already have needed PyArrow installed if they were converting to pandas via the official Polars method.
Basically my threshold is "do I need to think about it at all". I'm just not interested in the minutiae of competing Python dataframe libraries or the various attempts to make them work better together. The previous approach was sold as a simple protocol that always works, but it turns out that wasn't the case. Maybe this new way is better; the problem is I have no real way to say for sure without spending a lot of time learning about something that doesn't interest me.
Shall I close and leave you to remove cross-dataframe compatibility altogether?
The problem is then I get issues bugging me about Polars, so I have to think about it anyway :D
😄 that's understandable
I'm aware that you said that using Narwhals was a complete non-starter, but just to showcase that as a possibility:
and then leave it up to Narwhals to convert to pandas in the best way for each input library
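The snippet from the original comment is not preserved here. A minimal sketch of what such a conversion could look like, assuming Narwhals is installed; the helper name `to_pandas_via_narwhals` is hypothetical and not taken from the PR:

```python
def to_pandas_via_narwhals(data):
    """Hypothetical helper: let Narwhals choose how to convert to pandas."""
    import narwhals as nw  # third-party dependency, assumed installed

    # eager_only=True rejects lazy inputs (e.g. polars.LazyFrame) up front.
    return nw.from_native(data, eager_only=True).to_pandas()
```

With this shape, the calling library never inspects which dataframe library produced `data`.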
Altair, Plotly, and Vegafusion are using it as required dependency now, and Bokeh have a PR in progress to do the same
For completeness: the way the other libraries are using Narwhals is by making the whole logic dataframe-agnostic. In Plotly this resulted in 2-3x better performance for many plots involving group-bys (compared with converting all inputs to pandas), but I understand that you may not be interested in that