Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: .isin() method should use __contains__ rather than __iter__ for user-defined classes to determine presence. #59041

Closed
2 of 3 tasks
f3ss1 opened this issue Jun 18, 2024 · 6 comments · Fixed by #59201
Closed
2 of 3 tasks
Assignees

Comments

@f3ss1
Copy link

f3ss1 commented Jun 18, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Right now, if you would define a user class:

class MyClass:
    def __init__(self):
        self.collection = [1, 2, 3]
        self.another_collection = [4, 5, 6]
    
    def __contains__(self, item):
        return item in self.collection
    
    def __iter__(self):
        yield from self.another_collection

and would then initialize a pandas dataframe like this:

example_dataframe = pd.DataFrame(
    {
        'column_name': [3, 1, 4, 6, 13],
        'another_column_name': ['tolly', 'trolly', 'telly', 'belly', 'nelly']
    }
)

and would then call the .isin() method like this:

class_instance = MyClass()
example_dataframe['column_name'].isin(class_instance)

you would actually get this output:

False
False
True
True
False

which is if the values from self.another_collections specified in __iter__ are checked, rather than self.collection from __contains__. I do realize that this might stem from compatibility with other libraries, but this seems counter-intuitive.

Feature Description

A solution I suggest is either to change the behavior (which might result into ruining some peoples code, I believe), or adding a flag (which would lead to more complexity, I guess).

Alternative Solutions

See above.

Additional Context

No response

@f3ss1 f3ss1 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 18, 2024
@rhshadrach
Copy link
Member

isin uses a hash table for performance - this would not work with __contains__. Is there a real-world example where the results would differ?

I'd be okay with leaking a bit of the implementation details in the docstring - that isin iterates through the values in the passed list-like.

@rhshadrach rhshadrach added Needs Discussion Requires discussion from core team before further action isin isin method and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 19, 2024
@f3ss1
Copy link
Author

f3ss1 commented Jun 24, 2024

I believe the real-life example I have might not be the best programming practice. In my scenario, I have a dictionary that maps item names to item instances, and the __contains__ method works for both class names and instances. However, I'm starting to doubt if this is a good design choice in fact.

Still, I think that the current isin implementation and its documentation need some adjustments as they seem a bit misleading. Although it mentions that the object should be Iterable, implying that __contains__ might not be used, it also suggests using a dict, which contradicts this. Maybe introducing how it works with an Attention section in the documentations is indeed the most effective solution.

@rhshadrach
Copy link
Member

Although it mentions that the object should be Iterable, implying that __contains__ might not be used, it also suggests using a dict, which contradicts this.

Can you elaborate what the contradiction is?

@f3ss1
Copy link
Author

f3ss1 commented Jun 26, 2024

I believe my comment might be misleading :) It's not the contradiction in general for Python (of course dict is iterable as well) rather than in my personal opinion: when I see a dict, the first thing I think about is hashing and doing stuff like my_dict["a"] which leads to thinking that it would use __contains__.

@rhshadrach
Copy link
Member

I'm good with a note that we use __iter__ over __contains__ in a notes section of the docstring.

@rhshadrach rhshadrach added Docs good first issue and removed Enhancement Needs Discussion Requires discussion from core team before further action labels Jul 3, 2024
@AnaDenisa
Copy link
Contributor

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants