TypeError if DataFrame contains duplicated column name (in some cases) #2718

Closed
1 task
saiwing-yeung opened this issue Nov 16, 2022 · 4 comments · Fixed by #3452
saiwing-yeung commented Nov 16, 2022

This is an edge case, but the error message makes it difficult to identify the underlying issue.
If a pandas DataFrame has duplicated column names and those columns are not integers, you get an exception when trying to plot anything. MWE:

import io
import pandas as pd
import altair as alt

df = pd.read_csv(io.StringIO("""
a, b, c, d
0, 1, 2, 2022-01-01
2, 3, 4, 2022-01-01
"""))
# Rename so that two columns share the name 'c'
df.columns = ['a', 'b', 'c', 'c']
alt.Chart(df).mark_point().encode(x='a', y='b')

results in

TypeError: to_list_if_array() got an unexpected keyword argument 'convert_dtype'

Note that

  • the duplicated columns are not used in plotting.
  • if both duplicated columns are of type integer, then you would just get a warning. But with most other types (including floats) it would generate an exception.
  • besides explicitly renaming the columns like this, another scenario where you might accidentally generate duplicated column names is calling toPandas() after joining two PySpark DataFrames.

Using altair 4.2.0
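
For readers hitting the same error, a pre-flight check along these lines can surface the real cause before the data reaches Altair. This is only a sketch: `find_duplicate_columns` is a hypothetical helper, not part of Altair or pandas, and it takes any iterable of column names (for a DataFrame you would pass `df.columns`).

```python
from collections import Counter


def find_duplicate_columns(columns):
    """Return {name: count} for every column name that appears more than once.

    Hypothetical helper for illustration; not an Altair or pandas API.
    """
    counts = Counter(columns)
    return {name: n for name, n in counts.items() if n > 1}


# Same column names as in the MWE above
columns = ["a", "b", "c", "c"]
duplicates = find_duplicate_columns(columns)
if duplicates:
    print(f"Duplicate column names: {duplicates}")
```

Because the check looks only at names, it flags the duplicate regardless of the column dtypes, which is the part the original TypeError obscures.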


@joelostblom
Contributor

My sense is that this issue should be handled by pandas and that it should not be possible to create a dataframe where two columns have the same name. Have you raised this on their issue tracker?

JanetMatsen commented May 4, 2023

I think it would be nice to raise a more informative error. It came up for me too.

In my case, I had a big dataframe that causes some encoding errors if I dump the whole dataframe in. So I made a list of the subset of columns that I wanted to send to Altair. However, if this list is long or generated by complex logic, then it is easy to mistakenly include one column name twice.

A cartoon of my workflow is below. In reality I kept about 10 of 300 columns when sending the data to Altair, and my list of ~10 had a duplicate:

import pandas as pd
import altair as alt

df = pd.DataFrame({'created_at':['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04', '2022-01-05']})
df["y"] = range(1, len(df)+1, 1)

keep_cols = ['created_at', 'created_at', 'y']  # <-- the logic that created this list led to a duplicate column name.

alt.Chart(df[keep_cols]).mark_circle(size=60).encode(
    x="created_at:T",
    y="y",
)
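
One possible workaround for a generated column list like the one above is to drop repeated names before indexing the DataFrame. The sketch below assumes that keeping the first occurrence of each name, in its original order, is what you want; `unique_cols` is a name introduced here for illustration.

```python
# Column list with an accidental repeat, as in the example above
keep_cols = ["created_at", "created_at", "y"]

# dict keys are unique and preserve insertion order (Python 3.7+),
# so this keeps the first occurrence of each name in its original position.
unique_cols = list(dict.fromkeys(keep_cols))

print(unique_cols)
```

Selecting `df[unique_cols]` instead of `df[keep_cols]` then yields a frame with one column per name.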

@joelostblom
Contributor

Maybe we could introduce something like df.flags.allows_duplicate_labels = False (docs) in sanitize_dataframe, which is where this error is raised. I wonder why this isn't the default in pandas already, so I opened pandas-dev/pandas#53217.
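
As a rough illustration of the pandas flag mentioned here (available in pandas 1.2 and later), the sketch below shows what happens when it is set on a frame that already has duplicate column names. It is not what Altair does internally; the `rejected` variable and the one-row frame are made up for the example.

```python
import pandas as pd

# One-row frame with two columns named "c", mirroring the MWE
df = pd.DataFrame([[0, 1, 2, "2022-01-01"]], columns=["a", "b", "c", "c"])

try:
    # Disallowing duplicate labels validates the existing axes immediately
    df.flags.allows_duplicate_labels = False
except pd.errors.DuplicateLabelError as exc:
    rejected = True
    print(f"Rejected: {exc}")
else:
    rejected = False

# A frame with unique names accepts the flag without complaint
clean = pd.DataFrame({"a": [0], "b": [1]}).set_flags(allows_duplicate_labels=False)
```

The resulting DuplicateLabelError points at the duplicated label directly, which is the kind of message the original report was asking for.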

Member

dangotbanned commented Jan 5, 2025

@MarcoGorelli is this resolved by adopting narwhals?

https://narwhals-dev.github.io/narwhals/pandas_like_concepts/column_names/

Duplicate column names are 🚫 banned 🚫.

Update

Yeah it is 🙂

import io
import pandas as pd
import altair as alt

df = pd.read_csv(
    io.StringIO("""
a, b, c, d
0, 1, 2, 2022-01-01
2, 3, 4, 2022-01-01
""")
)
df.columns = ["a", "b", "c", "c"]
alt.Chart(df).mark_point().encode(x="a", y="b")
File /site-packages/narwhals/_pandas_like/dataframe.py:106, in PandasLikeDataFrame._validate_columns(self, columns)
    [104]         msg += f"\n- '{key}' {value} times"
    [105] msg = f"Expected unique column names, got:{msg}"
--> [106] raise ValueError(msg)

ValueError: Expected unique column names, got:
- 'c' 2 times
