Skip to content

How to convert Series to compliant Column? #190

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
MarcoGorelli opened this issue Jun 29, 2023 · 23 comments · Fixed by #192
Closed

How to convert Series to compliant Column? #190

MarcoGorelli opened this issue Jun 29, 2023 · 23 comments · Fixed by #192

Comments

@MarcoGorelli
Copy link
Contributor

MarcoGorelli commented Jun 29, 2023

Say I have a function which takes a pandas Series:

def calculate_std_of_series(ser: pd.Series):
    return ser.std(ddof=1)

I'd like to convert this to support any standard-compliant library. How can I do that?

I'd like to be able to do something like

def calculate_std_of_series(ser: Any):
    ser_compliant = ser.__series_standard__()
    return ser_compliant.std(correction=1)

Currently, I can't do that, because we don't have __series_standard__.

Are we OK to require the libraries implementing the Standard have a __series_standard__ method in whichever object they use to back their Column objects?


Context of why this is necessary: in plotly, it's possible to pass a Series to 'x' / 'y' / etc:

df = pd.DataFrame({'x': [1, 2, 3], 'y': [1, 3, 2]})
px.line(df, x='x', y=pd.Series([1,2,3]))

I expect there'll initially be pushback (just something I've come to expect 😄 ) - if you do disagree that we should do have __series_standard__, could you please suggest an alternative for writing library-agnostic functions which operate on Series?
And if the answer is that we don't want to support that, then could you please suggest how plotly could support

px.line(df, x='x', y=pd.Series([1,2,3]))

in a library-agnostic manner?

@MarcoGorelli
Copy link
Contributor Author

One alternative would be to pass a namespace to the function:

def calculate_std_of_series(ser: Any, namespace):
    ser_compliant = namespace.dataframe_from_dict({'col': ser}).get_column_by_name('col')
    return ser_compliant.std(correction=1)

but this feel unnecessarily complicated (not to mention that the particular namespace may not be known at the given place in the code)

@rgommers
Copy link
Member

Didn't we just add .column to address a gap here? It feels like we should have all the pieces already without needing another dunder method.

It's hard to say more without more context; it sounds like this is a private function? And is there a need for backwards compatibility with it? Your proposed return ser_compliant.std(correction=1) changes the return type, and it's not clear to me if that's fine here.

Perhaps we should talk through how array-api-compat and the scikit-learn/SciPy usages of that work. We went through several iterations of adoption approaches for questions like these, and it would be useful to take the learnings from that.

@MarcoGorelli
Copy link
Contributor Author

.column only works if you already have a standard-compliant Column

@MarcoGorelli
Copy link
Contributor Author

Your proposed return ser_compliant.std(correction=1) changes the return type, and it's not clear to me if that's fine here.

it doesn't, the return value here is just a single value:

>>> ser = pd.Series([1,2,3])
>>> ser.mean()
2.0
>>> ser.rename('col').to_frame().__dataframe_standard__().get_column_by_name('col').mean()
2.0

We already have a way to go from non-compliant DataFrame to compliant DataFrame: __dataframe_standard__.
I'm saying: we need a similar way to go from non-compliant Series to compliant Series

@rgommers
Copy link
Member

We already have a way to go from non-compliant DataFrame to compliant DataFrame: __dataframe_standard__.

To clarify: we have that as an example on how to go about adoption, and not in the standard itself.

I'm saying: we need a similar way to go from non-compliant Series to compliant Series

That's fine to add in the purpose and scope section I'd say, same as for dataframes. And then in dataframe-api-compat it can be used for specific libraries that are supported there.

I think you can just implement it there and see how it works? It's not a standardization question.

@jorisvandenbossche
Copy link
Member

I think you can just implement it there and see how it works? It's not a standardization question.

But there is no equivalent __dataframe_standard__ for Series (column objects), so I don't think you can currently "just implement it"?

@MarcoGorelli
Copy link
Contributor Author

To clarify: we have that as an example on how to go about adoption, and not in the standard itself.

I don't think it's just an example: if a library has a compliant namespace, then __dataframe_standard__ is required for going from non-compliant to compliant

def get_compliant_df(df):
"""Utility function to support programming against a dataframe API"""
if hasattr(df, '__dataframe_namespace__'):
# Is already Standard-compliant DataFrame, nothing to do here.
pass
elif hasattr(df, '__dataframe_standard__'):
# Convert to Standard-compliant DataFrame.
df = df.__dataframe_standard__()
else:
# Here we can raise an exception if we only want to support compliant dataframes,
# or convert to our default choice of dataframe if we want to accept (e.g.) dicts
raise TypeError(
"Expected Standard-compliant DataFrame, or DataFrame with Standard-compliant implementation"
)
return df

I don't think you can currently "just implement it"?

For now I'll implement it the same way that I've implemented __dataframe_standard__, i.e. by monkeypatching: https://github.com/data-apis/dataframe-api-compat

pd.Series.__column_namespace__ = pandas_standard.convert_to_standard_compliant_column
pl.Series.__column_namespace__ = polars_standard.convert_to_standard_compliant_column

(this is good enough for testing, but longer term we would upstream __dataframe_standard__ and __column_standard__ methods to each library with a Standard-compliant namespace)

@rgommers
Copy link
Member

Oh wait, I was going by Marco's statement and a quick grep. It's actually in the protocol "purpose and scope" section, not in the API standard one:

namespace = df.__dataframe_namespace__()
.

So there's no analogy at all here, the API standard has nothing for a dataframe either? And to go to a standard object, the answer so far is "use a constructor function".

@rgommers
Copy link
Member

if a library has a compliant namespace, then __dataframe_standard__ is required for going from non-compliant to compliant

That's an example, not a required method.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Jun 29, 2023

Strictly speaking it's indeed not a requirement, as we can't "require" something in third-party libraries that are not adhering to a specification. But it's still the only way for those third-party libraries to opt-in to supporting the standard through a separate object.

In my mind we discussed that as essentially a requirement, though, at #169 (otherwise there is no way you can know if some object supports the standard, assuming that users will use a pandas/polars/modin/ibis/... dataframe that does not itself supports the standard, but only supports converting itself into one)

@jorisvandenbossche
Copy link
Member

It's actually in the protocol "purpose and scope" section, not in the API standard one:

I just noticed the same (looking for that example), but shouldn't we move that to the spec docs?(or have it in both places)

@MarcoGorelli
Copy link
Contributor Author

If __dataframe_standard__ isn't required, then how is someone meant to write a library-agnostic dataframe function?

Let's take the example from the blog post:

def remove_outliers(df, column: str):
    z_score = (df[column] - df[column].mean())/df[column].std()
    return df[z_score.between(-3, 3)]

Could you please write out exactly how to make this function library-agnostic without using __dataframe_standard__?

@MarcoGorelli
Copy link
Contributor Author

__dataframe_standard__ is on the front page of the blog post https://data-apis.org/blog/dataframe_standard_rfc/, and is mentioned there in "step 1" of the "use the DataFrame Standard" solution

If it's not required, then we need to align on this as soon possible as it looks like there's been some serious miscommunication

@jorisvandenbossche
Copy link
Member

FWIW this blog post doesn't appear at https://data-apis.org/blog/ or is not linked on the home page, so it's not easy to find (unless you know it linked from some other post)

@rgommers
Copy link
Member

Thanks, gh-169 does clarify this. We've indeed agreed on it, but we have simply forgotten to document that agreement. So I guess we need to do that, and then having the analogous method for Column seems pretty obvious (I hope it'd be called __column_standard__ rather than __series_standard__ though).

@MarcoGorelli
Copy link
Contributor Author

agree on __column_standard__ instead of __series_standard__

I'll document this better then, but to summarise:

  • __dataframe_namespace__: check if a dataframe is already standard compliant
  • __column_namespace__: check if a column is already standard compliant
  • __dataframe_standard__: go from not-necessarily-compliant dataframe to compliant dataframe
  • __column_standard__: go from not-necessarily-compliant column to compliant column
  • DataFrame.dataframe: go from compliant dataframe to not-necessarily-compliant dataframe
  • Column.column: go from compliant column to not-necessarily-compliant column

@jorisvandenbossche
Copy link
Member

Do we need a separate __column_namespace__? It would return the same object as __dataframe_namespace__, and we could add another attribute (eg ndim) to distinguish dataframe from column

Other option for __column_standard__ is the longer __dataframe_column_standard__ (it's longer, but I think clearer)

@MarcoGorelli
Copy link
Contributor Author

sure, we can rename later, for now I'm just trying to plug holes in what we've got so that the standard can be useful

@rgommers
Copy link
Member

Thanks, that list LGTM! If we had that bullet list in a single place, that would really clarify things.

@MarcoGorelli
Copy link
Contributor Author

use a constructor function

Just to clarify - the current DataFrame constructor namespace.dataframe_from_dict assumes your dict is made up of columns which are already compliant

Just pointing this out double-check we're all on the same page here, I hope I'm not coming across as confrontational 😄 Thanks all for discussing

@rgommers
Copy link
Member

Good point, I suspect that we may return to that point again and will want a direct Column constructor at some point.

I hope I'm not coming across as confrontational

not at all!

@jorisvandenbossche
Copy link
Member

This issue also mentioned a "column_namespace" dunder, which isn't resolved yet?

@MarcoGorelli
Copy link
Contributor Author

isn't it?

def __column_namespace__(
self: Column, /, *, api_version: str | None = None
) -> Any:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants