[BUG] After replace [-np.inf, np.inf] with np.nan, group forward fill not working. #16136

Open
edwardluohao opened this issue Jun 30, 2024 · 1 comment
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@edwardluohao

edwardluohao commented Jun 30, 2024

Describe the bug
There is an inconsistency in the forward fill behavior of cudf when replacing np.inf and -np.inf values using a list. The same operation works correctly with pandas, or when np.inf and -np.inf are replaced separately.

Steps/Code to reproduce bug

import cudf
import numpy as np

data = {
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [1, -np.inf, 3, np.inf, 5, np.inf]
}

df = cudf.DataFrame(data)

print("Original DataFrame:")
print(df)

df['value'] = df['value'].replace([-np.inf, np.inf], np.nan)
df['value'] = df.groupby('group')['value'].ffill()

print("\nDataFrame after forward fill:")
print(df)

Output

DataFrame after forward fill:
group value
0 A 1.0
1 A NaN
2 A 3.0
3 B NaN
4 B 5.0
5 B NaN

Expected behavior
DataFrame after forward fill:
group value
0 A 1.0
1 A 1.0
2 A 3.0
3 B
4 B 5.0
5 B 5.0

Environment overview (please complete the following information)

  • Environment location: CentOS
  • Method of cuDF install: Conda

It works fine if the replace is split into two separate calls:

df['value'] = df['value'].replace(-np.inf, np.nan)
df['value'] = df['value'].replace(np.inf, np.nan)

or use pandas instead

@edwardluohao edwardluohao added the bug Something isn't working label Jun 30, 2024
@wence-
Contributor

wence- commented Jul 1, 2024

The problem appears already after the replace call:

import cudf
import numpy as np

s = cudf.Series([1, -np.inf, np.inf])

print(s.replace([-np.inf, np.inf], np.nan))

print(s.replace(-np.inf, np.nan).replace(np.inf, np.nan))

The former produces:

0    1.0
1    NaN
2    NaN
dtype: float64

The latter:

0     1.0
1    <NA>
2    <NA>
dtype: float64

groupby.ffill handles the latter case, but not the former, in the way you might expect from pandas (where NaN is considered a missing value).

I agree that replace should produce the same output for the two examples in this comment (I think the latter is "more correct").

To work around this, if you replace your usage of np.nan in your replace call with None, then everything works as anticipated.
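
A minimal sketch of that workaround, applied to the reproducer from the original report. The helper names `replace_inf_with_null` and `demo` are hypothetical and only for illustration; the one cudf-specific assumption is the one stated above, that passing None as the replacement value yields `<NA>` rather than NaN.

```python
INF = float("inf")


def replace_inf_with_null(series):
    """Replace +/-inf with a null (<NA>) instead of NaN.

    `series` is assumed to be a cudf.Series. Passing None as the
    replacement value is the workaround described above.
    """
    return series.replace([-INF, INF], None)


def demo():
    # Needs a machine with cudf installed; imported lazily so that
    # defining the helper above does not itself require cudf.
    import cudf

    df = cudf.DataFrame({
        "group": ["A", "A", "A", "B", "B", "B"],
        "value": [1, -INF, 3, INF, 5, INF],
    })
    df["value"] = replace_inf_with_null(df["value"])
    # The replaced entries are now nulls, so the grouped forward fill
    # should treat them as missing.
    df["value"] = df.groupby("group")["value"].ffill()
    print(df)
```

Calling `demo()` on a machine with cudf should reproduce the original report's steps with the only change being None in place of np.nan.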

Note that this is a consequence of cudf being slightly stricter than pandas in a number of places when it comes to differences between nan and NA: the latter indicates an actually missing value, the former (in cudf) does not.
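
To make the nan/NA distinction concrete, here is a toy model in plain Python. It is not cudf's implementation: it only assumes that a column is a list of values plus a separate validity mask, and that a grouped forward fill looks at the mask alone. The function name `group_ffill` and the list-based layout are illustrative.

```python
def group_ffill(groups, values, valid):
    """Forward-fill within each group, treating only mask-invalid slots as missing."""
    last_seen = {}  # group key -> most recent valid value in that group
    out_values, out_valid = [], []
    for key, value, is_valid in zip(groups, values, valid):
        if is_valid:
            # A valid slot is kept as-is, even if the value is NaN.
            last_seen[key] = value
            out_values.append(value)
            out_valid.append(True)
        elif key in last_seen:
            # A missing slot takes the last valid value of its group.
            out_values.append(last_seen[key])
            out_valid.append(True)
        else:
            # Nothing earlier in the group to fill from: stays missing.
            out_values.append(None)
            out_valid.append(False)
    return out_values, out_valid


groups = ["A", "A", "A", "B", "B", "B"]
nan = float("nan")

# Case 1: infinities became NaN. Every slot is still marked valid.
nan_values = [1.0, nan, 3.0, nan, 5.0, nan]
filled_nan, _ = group_ffill(groups, nan_values, [True] * 6)

# Case 2: infinities became nulls. The mask marks them as missing.
null_values = [1.0, None, 3.0, None, 5.0, None]
null_valid = [v is not None for v in null_values]
filled_null, filled_null_valid = group_ffill(groups, null_values, null_valid)

print(filled_nan)   # NaN entries are left untouched
print(filled_null)  # nulls are filled from earlier rows in the same group
```

In this model the NaN column passes through unchanged, which mirrors the reported output, while the null column is filled as the report expected, with the first row of group B staying missing because nothing precedes it.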

@mroeschke mroeschke added the Python Affects Python cuDF API. label Jul 1, 2024
Projects
Status: In Progress