BUG: Enlarging multilevel index fails if one or more level keys are None #59153

Open
3 tasks done
micky-gee opened this issue Jul 1, 2024 · 2 comments
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@micky-gee

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

# Create a simple multilevel index with two levels (note that one entry on level 1 is None):
index = pd.MultiIndex.from_tuples([('A', 'a1'), ('A', 'a2'), ('B', 'b1'), ('B', None)])

# Create a DataFrame with that index and bind it to df, which the steps below use:
df = pd.DataFrame([(0, 6), (1, 5), (2, 4), (3, 7)], index=index)

#       0  1
#A a1   0  6
#  a2   1  5
#B b1   2  4
#  NaN  3  7

#Now it is possible to enlarge this dataframe with a new index entry provided none of the keys are None:
df.loc[('B', 'b2'),:] = [10, 11]

#           0     1
# A a1    0.0   6.0
#   a2    1.0   5.0
# B b1    2.0   4.0
#   NaN   3.0   7.0
#   b2   10.0  11.0

#However this will throw a KeyError:
df.loc[('A', None),:] = [12, 13]

#Also doesn't work with an index slice:
idx = pd.IndexSlice

#this will throw a KeyError:
df.loc[idx['A', None],:] = [12, 13]

Issue Description

It is possible to enlarge a DataFrame that has a multilevel index by passing the new key to df.loc[...]

It is also possible to create entries in multilevel indices that have None as a key, i.e. df.loc[('A', None),...]

It is not possible to enlarge a dataframe with a multilevel index if one or more of the keys is None.
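Until the .loc behaviour is addressed, one possible workaround is to append the row with pd.concat instead of enlarging through .loc. This is only a minimal sketch and not a fix for the underlying bug; the names new_row and enlarged are made up for illustration, and the new row lands at the end of the frame rather than next to the other 'A' entries.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([("A", "a1"), ("A", "a2"), ("B", "b1"), ("B", None)])
df = pd.DataFrame([(0, 6), (1, 5), (2, 4), (3, 7)], index=index)

# Build the new row as its own one-row frame keyed by ('A', None) ...
new_row = pd.DataFrame([(12, 13)], index=pd.MultiIndex.from_tuples([("A", None)]))

# ... and append it with concat rather than assigning through .loc.
enlarged = pd.concat([df, new_row])
print(enlarged)
```

If the row order matters, the result would still need to be reordered separately.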

Expected Behavior

Building on the example above,
df.loc[('A', None),:] = [12, 13]

should result in the following:

# A a1    0.0   6.0
#   a2    1.0   5.0
#   NaN  12.0  13.0
# B b1    2.0   4.0
#   NaN   3.0   7.0
#   b2   10.0  11.0

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.10.6.final.0
python-bits : 64
OS : Darwin
OS-release : 23.5.0
Version : Darwin Kernel Version 23.5.0: Wed May 1 20:19:05 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8112
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 63.2.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.4
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.4
numba : 0.59.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@micky-gee micky-gee added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 1, 2024
@micky-gee
Author

After some more digging, I've found the call within the multilevel index that is failing:

>>> index._engine.get_loc(('A', None))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "index.pyx", line 776, in pandas._libs.index.BaseMultiIndexCodesEngine.get_loc
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 2152, in pandas._libs.hashtable.UInt64HashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 2176, in pandas._libs.hashtable.UInt64HashTable.get_item
KeyError: 17

I think this has to do with how None is hashed and how that hash is converted to a position in the underlying data structure, but I'm not sure.

When I give a valid tuple to the multilevel index, I get an integer corresponding to an entry in the underlying data structure:

>>> index._engine.get_loc(('A', 'a2'))
1
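For context, a MultiIndex does not keep None as one of its level values; a missing entry is recorded as the code -1. The sketch below only inspects the public levels and codes attributes of the index from the example. It does not confirm that this is what makes the engine lookup fail. The variable names level1_values and level1_codes are made up for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([("A", "a1"), ("A", "a2"), ("B", "b1"), ("B", None)])

# The second level holds only the real labels; None is not among them.
level1_values = list(index.levels[1])

# Each row points into that level by position; the missing entry is coded as -1.
level1_codes = list(index.codes[1])

print(level1_values)
print(level1_codes)
```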

@micky-gee
Author

As part of trying to understand this problem more broadly, I've been investigating hashable types (None and NaN are both hashable) and how usable they are as index keys in pandas.

For a single-level index (as opposed to a multilevel index), here is an MWE that demonstrates the inconsistency:

>>> import pandas as pd
>>> import numpy as np
>>> index2 = pd.Index([1, 2, 3, None])
>>> df2 = pd.DataFrame([4, 5, 6, 9], index=index2)
>>> df2
     0
1.0  4
2.0  5
3.0  6
NaN  9

Now addressing the index entry with None results in a KeyError:

>>> df2.loc[None]
Traceback (most recent call last):
  File "/Users/michaelgrant/.pyenv/versions/s7s_strategy_private/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 175, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index_class_helper.pxi", line 19, in pandas._libs.index.Float64Engine._check_type
KeyError: None

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/michaelgrant/.pyenv/versions/s7s_strategy_private/lib/python3.10/site-packages/pandas/core/indexing.py", line 1191, in __getitem__
    return self._getitem_axis(maybe_callable, axis=axis)
  File "/Users/michaelgrant/.pyenv/versions/s7s_strategy_private/lib/python3.10/site-packages/pandas/core/indexing.py", line 1431, in _getitem_axis
    return self._get_label(key, axis=axis)
  File "/Users/michaelgrant/.pyenv/versions/s7s_strategy_private/lib/python3.10/site-packages/pandas/core/indexing.py", line 1381, in _get_label
    return self.obj.xs(label, axis=axis)
  File "/Users/michaelgrant/.pyenv/versions/s7s_strategy_private/lib/python3.10/site-packages/pandas/core/generic.py", line 4301, in xs
    loc = index.get_loc(key)
  File "/Users/michaelgrant/.pyenv/versions/s7s_strategy_private/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: None

However replacing None with np.nan works just fine:

>>> df2.loc[np.nan]
0    9
Name: nan, dtype: int64
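The difference above seems to follow from the index being built as a float index, with None stored as NaN. Here is a small sketch that checks this; can_locate is a hypothetical helper written for this comment, not a pandas function, and whether a None lookup fails may depend on the pandas version.

```python
import numpy as np
import pandas as pd

index2 = pd.Index([1, 2, 3, None])

# None is coerced on construction: the index is float64 and its last entry is NaN.
print(index2.dtype)
print(index2.isna())


def can_locate(index, key):
    """Hypothetical helper: report whether index.get_loc(key) succeeds."""
    try:
        index.get_loc(key)
    except (KeyError, TypeError):
        return False
    return True


# On pandas 2.2.2 the traceback above shows the None lookup raising KeyError,
# while the NaN lookup succeeds.
print(can_locate(index2, None))
print(can_locate(index2, np.nan))
```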
