ENH: Add parameter index to drop_duplicates to drop duplicate indices #58648

Open
1 of 3 tasks
bingbong-sempai opened this issue May 9, 2024 · 8 comments · May be fixed by #59133
Labels
Enhancement Needs Discussion Requires discussion from core team before further action

Comments

@bingbong-sempai

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

There is currently no elegant pattern for dropping rows with duplicate index values.
I think what people usually do is
df[~df.index.duplicated(keep='first')]
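For readers unfamiliar with the workaround, here is a minimal sketch of it on a toy frame. The data and the names `df` and `deduped` are made up for illustration; only `Index.duplicated` and `DataFrame.drop_duplicates` are existing pandas calls.

```python
import pandas as pd

# Toy frame (illustrative data) whose index contains the label 1 twice.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=[0, 1, 1])

# Index.duplicated flags every repeat after the first occurrence;
# negating the mask keeps only the first row for each index label.
deduped = df[~df.index.duplicated(keep="first")]
print(deduped)

# For comparison: drop_duplicates compares column values only, so it
# leaves this frame unchanged even though the index label 1 repeats.
print(df.drop_duplicates())
```

The boolean-mask form works, but it reads less directly than a dedicated argument on `drop_duplicates` would, which is the motivation for this request.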

Feature Description

Add a new parameter to drop_duplicates to specify dropping duplicate indices.
One option would be an index=True flag, similar to how merging on an index is specified.

Alternative Solutions

Allow the subset parameter of drop_duplicates to accept the name of the index.

Additional Context

No response

@bingbong-sempai bingbong-sempai added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 9, 2024
@Yousinator

Hi, I would like to work on this issue. I'll start implementing the feature and submit a PR soon.

@Yousinator Yousinator linked a pull request Jun 28, 2024 that will close this issue
5 tasks
@Aloqeely Aloqeely added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 1, 2024
@Aloqeely
Member

Aloqeely commented Jul 1, 2024

Thanks for the suggestion! I think this feature will be useful so I'm ok with it being added.

If I understood you correctly, if someone passes index=True then duplicate indices will be dropped alongside duplicate rows, but what if someone wants to drop duplicate indices only?

@bingbong-sempai
Author

This is how it could look if an index parameter is added:

| Description | Parameters |
| --- | --- |
| Drop duplicate indices only | subset=None, index=True |
| Drop duplicate columns only | subset=None, index=False |
| | subset=colnames, index=False |
| Drop mixed columns and index | subset=colnames, index=True |

It turned out a bit more complex than I expected.
Basically index is set to True any time you want to drop duplicate indices.

A simpler alternative might be to accept the index name in the subset parameter:

| Description | Parameters |
| --- | --- |
| Drop duplicate indices only | subset=index.name |
| Drop duplicate columns only | subset=None |
| | subset=colnames |
| Drop mixed columns and index | subset=[index.name, *colnames] |
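The subset-based alternative can be approximated today by moving the index into a column first. The sketch below is only an emulation under stated assumptions: `drop_duplicates_by_name` is a hypothetical helper, not a pandas API, and it assumes a single-level index with a name that does not clash with any column. Unlike the table above, passing `subset=None` here compares the index together with all columns.

```python
import pandas as pd


def drop_duplicates_by_name(df, subset=None, keep="first"):
    """Hypothetical helper: let `subset` refer to the index by its name."""
    name = df.index.name
    if name is None:
        raise ValueError("this sketch assumes a named, single-level index")
    # Turn the index into an ordinary column so drop_duplicates can see it.
    flat = df.reset_index()
    kept = flat.drop_duplicates(subset=subset, keep=keep)
    # Restore the original index from the temporary column.
    return kept.set_index(name)


# Illustrative data: index label 1 repeats, and the first two rows share values.
df = pd.DataFrame(
    {"a": [1, 1, 0], "b": [2, 2, 1]},
    index=pd.Index([0, 1, 1], name="id"),
)

index_only = drop_duplicates_by_name(df, subset=["id"])
columns_only = drop_duplicates_by_name(df, subset=["a", "b"])
mixed = drop_duplicates_by_name(df, subset=["id", "a", "b"])
```

With this data, `index_only` and `columns_only` each keep two rows (different ones), while `mixed` keeps all three because no two rows agree on both index and values.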

@Yousinator

Yousinator commented Jul 2, 2024

My current implementation is to drop indices only. However, my take on an alternative would be adding two parameters:

  1. index, which drops duplicate indices alongside duplicate rows
  2. index_only (or any other name), which drops only duplicate indices.

@Aloqeely
Member

Aloqeely commented Jul 2, 2024

So if subset=colnames and index=True, will that drop rows that have duplicate indices and then drop duplicate rows separately or will that drop duplicate rows taking their index into consideration?
e.g. row1 has index = 0, values of [1,2] -- row2 has index = 1, values of [1,2] -- row3 has index = 1, values of [0, 1]
there are 2 cases here:

  1. check duplicate indices --> removed row3 AND THEN check duplicate values --> removed row2 (Only row1 is left)
  2. check duplicate values WITH same index --> no row removed.
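The two cases can be reproduced with existing pandas calls on the three-row example above. This is a sketch for comparison only; the column names `x` and `y`, the index name `idx`, and the variable names are made up for illustration.

```python
import pandas as pd

# row1: index 0, values [1, 2]; row2: index 1, values [1, 2]; row3: index 1, values [0, 1]
df = pd.DataFrame(
    [[1, 2], [1, 2], [0, 1]],
    index=pd.Index([0, 1, 1], name="idx"),
    columns=["x", "y"],
)

# Case 1: drop duplicate index labels first, then drop duplicate values.
by_index = df[~df.index.duplicated(keep="first")]  # removes row3
case1 = by_index.drop_duplicates()                 # removes row2

# Case 2: a row is a duplicate only if index AND values both match.
# reset_index makes the index a regular column so duplicated() compares it too.
mask = df.reset_index().duplicated(keep="first").to_numpy()
case2 = df[~mask]

print(len(case1), len(case2))
```

Case 1 leaves only row1, while case 2 removes nothing, which matches the outcomes described in the list above.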

I'm personally leaning towards option 2 which basically treats the index as a value of the row

@Aloqeely
Member

Aloqeely commented Jul 2, 2024

@Yousinator Not a fan of having 2 parameters, that will be confusing with the existing ignore_index parameter.

@Yousinator

> @Yousinator Not a fan of having 2 parameters, that will be confusing with the existing ignore_index parameter.

My final take would be having a string value rather than a bool value for the parameter. It would be a bit confusing, but would be easier and simpler than the subset / index combination.

If we were to go for the subset / index combination, I would prefer the second option too.

If going with the second option, we could rearrange the indices at the end if duplicate indices exist with different values.

@bingbong-sempai
Author

> I'm personally leaning towards option 2 which basically treats the index as a value of the row

Same, I think this is what most people would expect.

3 participants