Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] .at accessor for new column fails in cudf.pandas #16112

Closed
bdice opened this issue Jun 27, 2024 · 3 comments · Fixed by #16177
Closed

[BUG] .at accessor for new column fails in cudf.pandas #16112

bdice opened this issue Jun 27, 2024 · 3 comments · Fixed by #16177
Assignees
Labels
bug Something isn't working cudf.pandas Issues specific to cudf.pandas

Comments

@bdice
Copy link
Contributor

bdice commented Jun 27, 2024

Describe the bug
I was helping a user Amith with a question about cudf.pandas and the library pandapower (see RAPIDS GoAI Slack). The pandapower notebook examples trigger a bug in cudf.pandas around the .at accessor. @Jacey0 helped debug this with me during PyCon 2024. Thanks Jacey!

Steps/Code to reproduce bug

Here is a Colab notebook with a reproducer: https://colab.research.google.com/gist/bdice/f5be320dada30671015d5b53642ec19c/solution-test.ipynb

Here is the minimal code snippet needed to reproduce:

%load_ext cudf.pandas
import pandas as pd

bus = pd.DataFrame({'name': []})
entries = {'name': 'Bus 0', 'geo': 'X and Y'}
for col, val in entries.items():
    print("Setting", col, "to", val, end="\n\n")
    print("Before:")
    print(bus, end="\n\n")
    bus.at[0, col] = val # pandas adds a new row and 'geo' column' fine
    print("After:")
    print(bus, end="\n\n")

Error output:

Setting name to Bus 0

Before:
Empty DataFrame
Columns: [name]
Index: []

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/cudf/core/series.py in _loc_to_iloc(self, arg)
    381                 if (n := len(indices)) == 0:
--> 382                     raise KeyError("Label scalar is out of bounds")
    383                 elif n == 1:

KeyError: 'Label scalar is out of bounds'


During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)

15 frames

KeyError: 'Label scalar is out of bounds'


During handling of the above exception, another exception occurred:

IndexError                                Traceback (most recent call last)

IndexError: Index out of bounds


During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/pandas/core/indexing.py in __setitem__(self, key, value)
    843         else:
    844             key = com.apply_if_callable(key, self.obj)
--> 845         indexer = self._get_setitem_indexer(key)
    846         self._has_valid_setitem_indexer(key)
    847 

AttributeError: '_AtIndexer' object has no attribute '_get_setitem_indexer'

Expected behavior
Here is the output from plain pandas:

Setting name to Bus 0

Before:
Empty DataFrame
Columns: [name]
Index: []

After:
    name
0  Bus 0

Setting geo to X and Y

Before:
    name
0  Bus 0

After:
    name      geo
0  Bus 0  X and Y
@bdice bdice added bug Something isn't working cudf.pandas Issues specific to cudf.pandas labels Jun 27, 2024
@galipremsagar galipremsagar self-assigned this Jun 27, 2024
@mroeschke
Copy link
Contributor

It appears the cudf.pandas bug is that _AtIndexer (and _iAtIndexer) indexer are not defined in _wrappers/pandas.py as proxy objects.

And of course, it appears cudf.DataFrame.loc.__setitem__ that expands is also buggy

@mroeschke
Copy link
Contributor

More investigation notes:

So cudf.DataFrame.at is just returns cudf.DataFrame.loc, so a pandas.DataFrame.at call will always succeed to return cudf.pandas' intermediate proxy of pandas.DataFrame.loc.

So if pandas.DataFrame.at.__setitem__ fails in cudf.pandas, we rewind to pandas.DataFrame.loc.__setitem__ instead of pandas.DataFrame.at.__setitem__.

I think we need pandas.DataFrame.at to be an IntermediateProxy of an IntermediateProxy as "proper" fix. I'm not sure if there's an straightforward way to do this today.

Alternatively, if we make cudf.DataFrame.at not map to cudf.DataFrame.loc (i.e. return a different object with the same implementation), that would be an easier fix...

@bdice
Copy link
Contributor Author

bdice commented Jul 3, 2024

Alternatively, if we make cudf.DataFrame.at not map to cudf.DataFrame.loc (i.e. return a different object with the same implementation), that would be an easier fix...

That sounds like the right thing to do. cudf's object model should mirror that of pandas. If at returns loc in cudf but not pandas, that's a problem in the mirrored object model.

rapids-bot bot pushed a commit that referenced this issue Jul 8, 2024
closes #16112

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #16177
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cudf.pandas Issues specific to cudf.pandas
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants