ENH: Add parameter `index` to `drop_duplicates` to drop duplicate indices #58648

bingbong-sempai · 2024-05-09T08:16:30Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

There currently is no elegant pattern to drop duplicate indices.
I think what people usually do is:
df[~df.index.duplicated(keep='first')]
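For readers who have not seen this idiom, here is a minimal runnable sketch of it. The frame and its values are made up for illustration; `Index.duplicated` and boolean-mask selection are existing pandas API.

```python
import pandas as pd

# Illustrative data only: the label "a" appears twice in the index.
df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "a", "b"])

# Index.duplicated marks every repeat of a label after its first occurrence;
# negating the mask keeps the first row for each label.
deduped = df[~df.index.duplicated(keep="first")]
print(deduped)
```

Passing keep="last" instead would retain the last row for each label.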

Feature Description

Add a new parameter to drop_duplicates to specify dropping duplicate indices.
An option could be index=True, similar to when merging on an index.

Alternative Solutions

Allow the subset parameter of drop_duplicates to accept the name of the index.

Additional Context

No response


Yousinator · 2024-06-25T02:06:41Z

Hi, I would like to work on this issue. I'll start implementing the feature and submit a PR soon.

Aloqeely · 2024-07-01T22:35:08Z

Thanks for the suggestion! I think this feature will be useful so I'm ok with it being added.

If I understood you correctly, if someone passes index=True then duplicate indices will be dropped alongside duplicate rows, but what if someone wants to drop duplicate indices only?

bingbong-sempai · 2024-07-02T03:45:28Z

This is what it could look like if an index parameter is added:

| Description | Parameters |
| --- | --- |
| Drop duplicate indices only | subset=None, index=True |
| Drop duplicate columns only | subset=None, index=False or subset=colnames, index=False |
| Drop mixed columns and index | subset=colnames, index=True |

It turned out a bit more complex than I expected.
Basically index is set to True any time you want to drop duplicate indices.
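To make the table concrete, here is a rough sketch that emulates it with today's pandas API. The function name `drop_duplicates_sketch`, the helper column `_index_label`, and the sample data are all hypothetical. The mixed case assumes the index label is compared as one more key next to the chosen columns, which is only one possible reading of the proposal, and it assumes a single-level index.

```python
import pandas as pd


def drop_duplicates_sketch(df, subset=None, index=False, keep="first"):
    """Hypothetical stand-in for the proposed drop_duplicates(..., index=...)."""
    if not index:
        # Existing behaviour: compare columns only.
        return df.drop_duplicates(subset=subset, keep=keep)
    if subset is None:
        # Proposed: index=True without a subset compares index labels only.
        mask = df.index.duplicated(keep=keep)
    else:
        # Assumed reading: compare the chosen columns together with the label.
        cols = [subset] if isinstance(subset, str) else list(subset)
        key = df[cols].reset_index(drop=True)
        key["_index_label"] = df.index.to_numpy()  # hypothetical helper column
        mask = key.duplicated(keep=keep).to_numpy()
    return df[~mask]


# Illustrative data only.
df = pd.DataFrame({"a": [1, 1, 1], "b": [5, 6, 5]}, index=["x", "x", "y"])

index_only = drop_duplicates_sketch(df, index=True)            # rows 0 and 2
columns_only = drop_duplicates_sketch(df, index=False)         # rows 0 and 1
mixed = drop_duplicates_sketch(df, subset=["a"], index=True)   # rows 0 and 2
```

Note that subset=None means "all columns" when index=False but "no columns" when index=True, which is probably where the extra complexity comes from.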

A simpler alternative might be to accept the index name in the subset parameter:

| Description | Parameters |
| --- | --- |
| Drop duplicate indices only | subset=index.name |
| Drop duplicate columns only | subset=None or subset=colnames |
| Drop mixed columns and index | subset=[index.name, *colnames] |
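The subset-based variant can be approximated today by moving the index into the columns first. This is only a sketch: `duplicated_by_labels` is a hypothetical helper, the data is invented, and it assumes the index has a name that does not clash with any column name.

```python
import pandas as pd


def duplicated_by_labels(df, labels, keep="first"):
    """Hypothetical helper: labels may name index levels or columns."""
    flat = df.reset_index()  # the named index becomes an ordinary column
    return flat.duplicated(subset=labels, keep=keep).to_numpy()


# Illustrative data only, with a named index as the proposal requires.
df = pd.DataFrame(
    {"city": ["Oslo", "Oslo", "Rome"], "temp": [3, 3, 3]},
    index=pd.Index([1, 1, 1], name="day"),
)

# Roughly subset=index.name: compare index labels only.
index_only = df[~duplicated_by_labels(df, ["day"])]

# Roughly subset=[index.name, *colnames]: compare index and columns together.
mixed = df[~duplicated_by_labels(df, ["day", "city"])]
```

An unnamed index, or an index name that matches a column name, would need separate handling under this approach.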

Yousinator · 2024-07-02T12:48:11Z

My current implementation is to drop indices only. However, my take on an alternative would be adding two parameters:

- index, which drops duplicate indices alongside duplicate rows
- index_only (or any other name), which drops only duplicate indices

Aloqeely · 2024-07-02T13:11:32Z

So if subset=colnames and index=True, will that drop rows that have duplicate indices and then drop duplicate rows separately, or will that drop duplicate rows taking their index into consideration?
For example: row1 has index = 0 and values [1, 2]; row2 has index = 1 and values [1, 2]; row3 has index = 1 and values [0, 1].
There are 2 cases here:

1. Check duplicate indices --> row3 removed, AND THEN check duplicate values --> row2 removed (only row1 is left).
2. Check duplicate values WITH the same index --> no row removed.

I'm personally leaning towards option 2 which basically treats the index as a value of the row
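The two cases can be reproduced with existing pandas calls on the three-row example from this comment. This is only an emulation of the two readings, not the proposed API; the column names `a` and `b` are made up.

```python
import pandas as pd

# row1: index 0, values [1, 2]; row2: index 1, values [1, 2];
# row3: index 1, values [0, 1]
df = pd.DataFrame({"a": [1, 1, 0], "b": [2, 2, 1]}, index=[0, 1, 1])

# Option 1: drop duplicate index labels first, then drop duplicate values.
without_dup_index = df[~df.index.duplicated(keep="first")]  # removes row3
option1 = without_dup_index.drop_duplicates()               # removes row2

# Option 2: treat the index as one more value of the row.
# reset_index() turns the unnamed index into a column called "index".
as_columns = df.reset_index()
option2 = df[~as_columns.duplicated().to_numpy()]

print(len(option1))  # 1
print(len(option2))  # 3
```

Under option 1 only row1 survives, while under option 2 all three rows are kept because no two rows share both index and values.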

Aloqeely · 2024-07-02T13:14:43Z

@Yousinator Not a fan of having 2 parameters, that will be confusing with the existing ignore_index parameter.

Yousinator · 2024-07-02T13:55:41Z

> @Yousinator Not a fan of having 2 parameters, that will be confusing with the existing ignore_index parameter.

My final take would be having a string value rather than a bool value for the parameter. It would be a bit confusing, but would be easier and simpler than the subset / index combination.

If we were to go for the subset / index combination, I would prefer the second option too.

If going with the second option, we could rearrange the indices at the end if duplicate indices exist with different values.

bingbong-sempai · 2024-07-03T01:28:10Z

> I'm personally leaning towards option 2 which basically treats the index as a value of the row

Same, I think this is what most people would expect.

bingbong-sempai added the Enhancement and Needs Triage (issue that has not been reviewed by a pandas team member) labels on May 9, 2024

Yousinator linked a pull request on Jun 28, 2024 that will close this issue: Drop duplicate indices #59133 (open, 5 tasks)

Aloqeely added the Needs Discussion (requires discussion from core team before further action) label and removed the Needs Triage label on Jul 1, 2024
