ENH: Disallow duplicate column names everywhere by default #53217

Open
1 of 3 tasks
Tracked by #2718
joelostblom opened this issue May 13, 2023 · 3 comments
Labels
Enhancement Index Related to the Index class or subclasses

Comments

@joelostblom
Contributor

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Duplicate column names can lead to confusing downstream behavior that is difficult to detect. For example, a couple of Altair users recently ran into this: vega/altair#2718.
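A small sketch of the kind of surprise this can cause. The frame and column names here are made up for illustration and are not taken from the Altair report:

```python
import pandas as pd

# Two columns share the name "a"; pandas accepts this by default.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "a", "b"])

# Selecting a unique name gives a Series ...
unique_col = df["b"]

# ... but selecting the duplicated name silently gives a DataFrame,
# which downstream code expecting a Series may not handle.
selected = df["a"]

print(type(unique_col).__name__, type(selected).__name__, selected.shape)
```

Code that assumes `df["a"]` is one-dimensional can fail, or misbehave, far away from where the duplicate was introduced.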

Feature Description

It was suggested in the PR that introduced the flag to disallow duplicates that this might be suitable as a default option in the future (#28394 (comment)), but I couldn't find a follow-up discussion, so I'm opening this issue to suggest that this become the default behavior. That would protect users from doing things they might not intend, like selecting the same column twice.

Alternative Solutions

Keep the current default

Additional Context

No response

@joelostblom joelostblom added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 13, 2023
@topper-123 topper-123 added Index Related to the Index class or subclasses and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 4, 2023
@topper-123
Contributor

I'm not sure what my opinion is on this, but I'm open to discussion.

Currently, we disallow duplicates by setting an attribute in flags (see here), which IMO is the wrong API. We should rather have a parameter in the Index constructor, like Index(..., allow_duplicates=False). Then it would be easier to discuss whether that parameter should default to False or True.
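For reference, a minimal sketch of how the existing flag-based opt-in looks from the user side, assuming pandas 1.2 or later, where `set_flags` and `DuplicateLabelError` are available. The example frames are illustrative only:

```python
import pandas as pd
from pandas.errors import DuplicateLabelError

# By default a new frame allows duplicate labels.
clean = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
default_flag = clean.flags.allows_duplicate_labels

# Opting out happens after construction, via the flags, not in the constructor.
dupes = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "a"])
try:
    dupes.set_flags(allows_duplicate_labels=False)
    rejected = False
except DuplicateLabelError:
    # Raised because the column labels are not unique.
    rejected = True

print(default_flag, rejected)
```

Note that the frame with duplicate columns has to exist before the check can run, which is part of why the flag feels like the wrong place for this setting.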

@topper-123
Contributor

To add, the flag-based approach doesn't allow us to decide if we want label duplicates in the DataFrame constructor, which doesn't seem right. E.g. we'd want

>>> df = pd.DataFrame(data,
...     index=Index(..., allow_duplicates=True|False),
...     columns=Index(..., allow_duplicates=True|False),
... )

for precise control in the constructor. Also, a decision has to be made on whether disallowing duplicate labels also means disallowing duplicate label indexing, e.g. should we disallow df.loc[["a", "a"]] when we disallow duplicate labels?
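To make the two points above concrete, here is a rough sketch. `make_index` is a hypothetical helper that only imitates the proposed `allow_duplicates` parameter. It is not a pandas API, and `Index` does not accept such a parameter today:

```python
import pandas as pd


def make_index(labels, allow_duplicates=True):
    """Hypothetical stand-in for the proposed Index(..., allow_duplicates=...)."""
    idx = pd.Index(labels)
    if not allow_duplicates and not idx.is_unique:
        dupes = idx[idx.duplicated()].unique().tolist()
        raise ValueError(f"duplicate labels not allowed: {dupes}")
    return idx


# Per-axis control at construction time, as in the snippet above.
df = pd.DataFrame(
    [[1, 2], [3, 4]],
    index=make_index(["a", "b"], allow_duplicates=False),
    columns=make_index(["x", "y"], allow_duplicates=False),
)

# The open question: label-based selection can itself produce duplicates.
# With default flags, this returns two rows that are both labeled "a".
repeated = df.loc[["a", "a"]]
print(repeated.index.tolist())
```

Whether a frame built from duplicate-free indexes should reject that last selection is exactly the decision still to be made.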

@tomhoq
Contributor

tomhoq commented Apr 18, 2024

Is this still to be implemented?
