Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Unexpected behavior math operations using multiindexes #51500

Closed
2 of 3 tasks
Tracked by #1 ...
AJZw opened this issue Feb 20, 2023 · 8 comments · Fixed by #59191
Closed
2 of 3 tasks
Tracked by #1 ...

BUG: Unexpected behavior math operations using multiindexes #51500

AJZw opened this issue Feb 20, 2023 · 8 comments · Fixed by #59191
Labels
Groupby MultiIndex Needs Tests Unit test(s) needed to prevent regressions

Comments

@AJZw
Copy link

AJZw commented Feb 20, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

data = [
    ["C","A","A","A"],
    ["C","A","B","A"],
    ["B","A","A","A"],
    ["B","A","A","B"],
]
data = pd.DataFrame(data, columns=["0","1","2","3"])

l_map = {
    "C":"B", # When renaming the categories, the new categories 'alphabetical ordering' 
    "B":"C", # must be flipped compared to the old categories.
}
data["0"] = data["0"].astype("category")
data["0"] = data["0"].cat.rename_categories(l_map)

# We need a multiindex of atleast 4 levels deep
a = data.groupby(by=["0","1","2"])["3"].value_counts()
b = data.groupby(by=["0","1","2"])["3"].size()

# Sorting the multi-index prevents the unexpected behavior
# b = b.sort_index(ascending=False)

# Unexpected behavior; also for a.mul(b), a.add(b), a.sub(b)
c = a.div(b)
print(c.loc[pd.IndexSlice["B",:,:,:]])
# 1 2 3
# A A A   NaN
#   B A   NaN
#   A NaN NaN
#   B NaN NaN
# Name: 3, dtype: float64

# Expected behavior
d = a.loc[pd.IndexSlice["B",:,:,:]]
e = b.loc[pd.IndexSlice["B",:,:]]
f = d.div(e)
print(f)
# 1 2 3
# A A A  1.0
#   B A  1.0
# Name: 3, dtype: float64

Issue Description

Performing a division/multiplication/addition/subtraction using two multiindexes leads to NaN's where that is not appropriate.

Expected Behavior

Performing a math operation produces the expected result (see also code snippet above).

Installed Versions

INSTALLED VERSIONS

commit : 2e218d1
python : 3.8.5.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : Swedish_Sweden.1252

pandas : 1.5.3
numpy : 1.23.5
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 47.1.0
pip : 22.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.3
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : 2022.7

@AJZw AJZw added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 20, 2023
@phofl
Copy link
Member

phofl commented Feb 20, 2023

Hi, thanks for your report. Can you provide a minimal reproducible example? See https://matthewrocklin.com/minimal-bug-reports for an explanation

@AJZw
Copy link
Author

AJZw commented Feb 20, 2023

Hi, of course. The code above is as minimal as i could get it though. Oh wait, upon rechecking I could reduce the input data little bit further, sorry! Now, bit more reduced and without the comments.

import pandas as pd

data = pd.DataFrame([["C","B","B"],["B","A","A"],["B","A","B"]], columns=["0","1","2"])

data["0"] = data["0"].astype("category")
data["0"] = data["0"].cat.rename_categories({"C":"B", "B":"C"})

a = data.groupby(by=["0","1"])["2"].value_counts()
b = data.groupby(by=["0","1"]).size()
a.div(b)

For the expected behavior, but with a differently ordered index, use:

a.div(b.sort_index(ascending=False))

My findings so far:

  • Removing any further column / row / or category-within-a-column of the DataFrame 'fixes' the unexpected behavior.
  • One column must be of type category as the categories must be renamed in such way that alphabetical order of the categories itself are flipped.
  • The column order given to the groupby(by) and [index] matters. I can only reproduce the unexpected behavior with order 0,1,2 & 1,0,2.
  • Replacing a = data.groupby(by=["0","1"])["2"].value_counts() with a = data.groupby(by=["0","1","2"]).size() cause the math operation to behave as expected.

Based on the above, I would guess that rename_categories() causes value_counts() to return a 'corrupted' index.
I got stuck trying to investigate further by looking up the MultiIndex code in the repository, but I havent been able to locate it with a quick search.

@phofl
Copy link
Member

phofl commented Feb 20, 2023

Ah it's not only the data but also the necessary steps. No need to include groupby, you could simply create the DataFrames that groupby returns directly without any additional methods.

Edit: To be more clear, you can just write the MultiIndex down the way it's returned by the groupby op.

@AJZw
Copy link
Author

AJZw commented Feb 20, 2023

I actually tried! But those 'manually created' indexes behave as expected.

I did have a quick look what could be the difference between the 'manual' index and the one generated using the code snippet above. If i remember correctly the MultiIndex.equals() returned False, but the MultiIndex._values were perfectly identical.

@phofl
Copy link
Member

phofl commented Feb 20, 2023

Ah, this is a valuable information for your initial post. This points to something in groupby not working as expected.

Your example works as expected on main btw. I recall that we fixed something similar a couple of months ago (the groupby case).

@AJZw
Copy link
Author

AJZw commented Feb 20, 2023

Thanks for checking. (I tried to build main, but issue #51499 prevented me.)
In that case feel free the close this, This was a fun puzzle 😄

@phofl
Copy link
Member

phofl commented Feb 20, 2023

This might need tests, not sure

@phofl phofl added Groupby MultiIndex Needs Tests Unit test(s) needed to prevent regressions and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 20, 2023
@luke396
Copy link
Contributor

luke396 commented Feb 22, 2023

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby MultiIndex Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
3 participants