BUG: FutureWarning when splitting a dataframe using np.split #57351

Open
2 of 3 tasks
amanlai opened this issue Feb 11, 2024 · 13 comments
Labels
Blocker: Blocking issue or pull request for an upcoming release
Compat: pandas objects compatibility with Numpy or Python functions
Deprecate: Functionality to remove in pandas
Regression: Functionality that used to work in a prior pandas version
Milestone

Comments

@amanlai
Contributor

amanlai commented Feb 11, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

To reproduce it:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
lst = np.split(df, 3)

Issue Description

The above code raises a FutureWarning:

FutureWarning: 'DataFrame.swapaxes' is deprecated and will be removed in a future version. Please use 'DataFrame.transpose' instead.

As far as I understand, np.split uses np.swapaxes, which is raising this warning.

Expected Behavior

No warning should be shown.

Installed Versions

python : 3.11.5
pandas : 2.2.0
numpy : 1.26.3

@amanlai added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Feb 11, 2024
@VISWESWARAN1998
Contributor

swapaxes is called from shape_base.py in the numpy package.

@simonjayhawkins added the Warnings (Warnings that appear or should be added to pandas) and Compat (pandas objects compatibility with Numpy or Python functions) labels on Feb 13, 2024
@rhshadrach
Member

Thanks for the report - is there a reason you prefer to use np.swapaxes(df) over df.T?

@rhshadrach added the Needs Info (Clarification about behavior needed to assess issue) and Deprecate (Functionality to remove in pandas) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) and Warnings (Warnings that appear or should be added to pandas) labels on Feb 15, 2024
@amanlai
Contributor Author

amanlai commented Feb 15, 2024

@rhshadrach the use case here is not np.swapaxes(df) itself, it's np.split(df), which apparently uses np.swapaxes under the hood.
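
A minimal sketch of why a pandas method can end up being called from NumPy code. As far as I know, np.swapaxes first tries the argument's own swapaxes method and only falls back to converting the argument to an ndarray when there is none. The Recorder class below is hypothetical and exists only to make that delegation visible; it is not part of pandas or NumPy.

```python
import numpy as np


class Recorder:
    """Hypothetical stand-in for an object that defines its own swapaxes."""

    def __init__(self):
        self.calls = []

    def swapaxes(self, axis1, axis2):
        # Record the arguments so the delegation can be inspected afterwards.
        self.calls.append((axis1, axis2))
        return self


obj = Recorder()

# np.swapaxes is handed a non-ndarray object that has a swapaxes method,
# so the call is expected to be forwarded to Recorder.swapaxes.
result = np.swapaxes(obj, 0, 1)

print(obj.calls)
```

If the same delegation applies to a DataFrame, that would explain how np.split(df) triggers the DataFrame.swapaxes deprecation warning on 2.2.x without the user ever calling swapaxes directly.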

@rhshadrach
Member

Ah, thanks @amanlai. On main, DataFrame.swapaxes has been removed and the OP gives the output:

[array([[1]]), array([[2]]), array([[3]])]

On 2.2.x, I am seeing

[   a
0  1,    a
1  2,    a
2  3]

cc @jorisvandenbossche @phofl @mroeschke

@rhshadrach added the Blocker (Blocking issue or pull request for an upcoming release) and Regression (Functionality that used to work in a prior pandas version) labels and removed the Needs Info (Clarification about behavior needed to assess issue) and Bug labels on Feb 23, 2024
@rhshadrach added this to the 3.0 milestone on Feb 23, 2024
@rhshadrach
Member

Marking this as a blocker for 3.0 so a decision is made.

@Aloqeely
Member

I don't think enough people use np.split on a DataFrame to justify reverting this deprecation. (The Stack Overflow question did not get many interactions.)

And as stated in numpy/numpy#24889 (comment) it's possible to calculate the start/stop slices and then manually slice using df.iloc[start:stop]
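
The start/stop approach mentioned above could look roughly like this. split_bounds is a hypothetical helper name, not a pandas or NumPy function. It is meant to mirror how np.array_split hands leftover rows to the first chunks (np.split itself requires an even division), and it is written in plain Python so it does not depend on either library.

```python
def split_bounds(n_rows, n_sections):
    """Return (start, stop) pairs that cut n_rows into n_sections chunks.

    Hypothetical helper: the first n_rows % n_sections chunks get one
    extra row, which is the distribution np.array_split documents.
    """
    if n_sections <= 0:
        raise ValueError("n_sections must be a positive integer")

    base, extras = divmod(n_rows, n_sections)
    bounds = []
    start = 0
    for i in range(n_sections):
        # The first `extras` chunks are one row longer than the rest.
        stop = start + base + (1 if i < extras else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


print(split_bounds(3, 3))
print(split_bounds(10, 3))
```

With a DataFrame df, the chunks would then be [df.iloc[start:stop] for start, stop in split_bounds(len(df), 3)], which keeps each piece a DataFrame.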

@WillAyd
Member

WillAyd commented May 20, 2024

I agree with @Aloqeely. Is there an upstream issue for NumPy on this topic? I think it should be resolved there, using df.T like @rhshadrach suggests.

@WillAyd
Member

WillAyd commented May 20, 2024

Ah, ignore my previous comment. I thought they were calling our swapaxes implementation, but I misread the OP. I assume they call swapaxes generically for their internal use, so it's not as easy as changing the call.

Even so, I don't think we should revert this deprecation.

@Aloqeely
Member

Is it sensible to implement a DataFrame.split function for convenience? Since np.split doesn't work appropriately on DataFrames anymore.

I can work on it next week if you are all ok with it.

@jorisvandenbossche
Member

Is there an upstream issue for NumPy on this topic?

numpy/numpy#24889 (comment)

@WillAyd
Member

WillAyd commented May 20, 2024

Is it sensible to implement a DataFrame.split function for convenience? Since np.split doesn't work appropriately on DataFrames anymore.

I don't think so - generally our goal is to reduce the footprint of our API, and I don't see this as a huge value add over the other method you have suggested:

And as stated in numpy/numpy#24889 (comment) it's possible to calculate the start/stop slices and then manually slice using df.iloc[start:stop]

@ddelange

ddelange commented Oct 5, 2024

Is it sensible to implement a DataFrame.split function for convenience? Since np.split doesn't work appropriately on DataFrames anymore.

I can work on it next week if you are all ok with it.

+1 👍 splitting a dataframe into equally sized chunks (except for the trailing chunk) is a routine task in ML and other batching applications.

requiring end-users to replace np.split/np.array_split with some bespoke iloc helper function sounds to me like a lot of duplicated boilerplate, wasted brain cycles over time, and overall increased bug surface for the ecosystem
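
For the batching case described here (fixed-size chunks with a shorter trailing chunk), a sketch of the kind of iloc helper under discussion might look like the following. chunk_bounds is a hypothetical name, not an existing pandas API, and the code is plain Python so it runs without pandas installed.

```python
def chunk_bounds(n_rows, chunk_size):
    """Yield (start, stop) pairs for consecutive chunks of chunk_size rows.

    Hypothetical helper: every chunk has chunk_size rows except possibly
    the last one, which holds whatever rows remain.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")

    for start in range(0, n_rows, chunk_size):
        # Clamp the final chunk so it never runs past the last row.
        yield start, min(start + chunk_size, n_rows)


print(list(chunk_bounds(10, 4)))
```

Applied to a DataFrame df, the batches would be (df.iloc[start:stop] for start, stop in chunk_bounds(len(df), 4)). Note that this differs from np.array_split, which takes a number of sections rather than a chunk size.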

@ddelange

ddelange commented Oct 7, 2024

another big reason to use np.array_split is the optional axis=1 argument, which is probably why it is calling np.swapaxes in the first place.


8 participants