[BUG] Data loss for extraction of year from far future date and datetime types #16196

Open
wence- opened this issue Jul 4, 2024 · 1 comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments


wence- commented Jul 4, 2024

Describe the bug

libcudf's cudf::datetime::extract_year returns an INT16 column, which can lose information for large positive or negative years.

The date32 type is:

signed 32 bit number of days since the unix epoch

The timestamp types are (for resolutions milli-, micro-, and nano-seconds):

signed 64 bit number of RESOLUTION ticks since the unix epoch

The most positive year representable by the date32 type is (approximately) $1970 + (2^{31} - 1)/365 \approx 5885486 \gg 2^{15} - 1$.

Similarly, the most positive years representable by the timestamp64[ms] and timestamp64[us] types are approximately 292473178 and 294441 respectively, both of which are again larger than $2^{15} - 1$.
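The estimates above can be reproduced with plain integer arithmetic. This is only a rough sketch that uses the same 365-day year as the text, so it ignores leap days; the helper names are made up for illustration and are not part of cudf.

```python
# Approximate the largest representable year for each type, using a
# 365-day year (leap days are ignored, as in the estimates in the text).
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365
INT16_MAX = 2**15 - 1


def max_year_from_days(max_days):
    """Hypothetical helper: approximate year reached max_days after 1970."""
    return 1970 + max_days // DAYS_PER_YEAR


def max_year_from_ticks(max_ticks, ticks_per_second):
    """Hypothetical helper: convert a tick count to whole days first."""
    max_days = max_ticks // (ticks_per_second * SECONDS_PER_DAY)
    return max_year_from_days(max_days)


limits = {
    "date32": max_year_from_days(2**31 - 1),
    "timestamp64[ms]": max_year_from_ticks(2**63 - 1, 10**3),
    "timestamp64[us]": max_year_from_ticks(2**63 - 1, 10**6),
    "timestamp64[ns]": max_year_from_ticks(2**63 - 1, 10**9),
}

for name, year in limits.items():
    print(f"{name}: ~{year}, fits in INT16: {year <= INT16_MAX}")
```

Under this approximation only the nanosecond resolution stays within the INT16 range, which is why the problem shows up for days, milliseconds, and microseconds.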

Steps/Code to reproduce bug

import cudf

s = cudf.Series([2**63 - 1], dtype="datetime64[us]")

cudf_year = s.dt.year[0]

pandas_year = s.to_pandas().dt.year[0]

print(cudf_year) # 32103, incorrect
print(pandas_year) # 294247, correct, depending on how much the earth's rotation speed changes over the next few millennia
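The incorrect value is what one would expect if the true year were narrowed to a signed 16-bit integer. The sketch below assumes two's-complement wraparound and uses a hypothetical helper, wrap_to_int16, to mimic that narrowing in plain Python; it does not call into cudf.

```python
def wrap_to_int16(value):
    """Reinterpret the low 16 bits of an integer as a signed 16-bit value."""
    return (value + 2**15) % 2**16 - 2**15


pandas_year = 294247  # year reported by pandas in the reproducer
print(wrap_to_int16(pandas_year))  # 32103, the value reported by cudf
```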

Expected behavior

We should produce the right answer. This might be doable by returning an INT32 column for year extraction.

wence- added the bug label Jul 4, 2024

wence- commented Jul 4, 2024

This is a bit fiddly, since std::chrono specifies that the minimum and maximum representable years are $-(2^{15} - 1)$ and $2^{15} - 1$ respectively. Since the manipulations rely on cuda::std::chrono, this may not be fixable.
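For illustration, the year can be computed directly from a day count with wide integer arithmetic, without going through a 16-bit year type. The sketch below is a Python transcription of the well-known days-to-civil algorithm (Howard Hinnant's civil_from_days), keeping only the year. It is not how libcudf is implemented, and the function name year_from_days is hypothetical; whether an equivalent could be written on the device side without cuda::std::chrono is an open question here.

```python
def year_from_days(days):
    """Proleptic Gregorian year for a count of days since 1970-01-01.

    Python integers are arbitrary precision, so nothing here is limited
    to the 16-bit range of a chrono-style year type.
    """
    z = days + 719468                 # shift the epoch to 0000-03-01
    era = z // 146097                 # 400-year eras (floor division)
    doe = z - era * 146097            # day of era, in [0, 146096]
    yoe = (doe - doe // 1460 + doe // 36524 - doe // 146096) // 365
    year = yoe + era * 400            # year, counted from 1 March
    doy = doe - (365 * yoe + yoe // 4 - yoe // 100)  # day of that year
    mp = (5 * doy + 2) // 153         # month index, 0 meaning March
    month = mp + 3 if mp < 10 else mp - 9
    return year + 1 if month <= 2 else year


# The reproducer's timestamp, converted from microseconds to whole days.
max_us_days = (2**63 - 1) // (86_400 * 10**6)
print(year_from_days(max_us_days))  # 294247, matching pandas
```

Returning such a value from year extraction would still need a wider output column, for example the INT32 column suggested under "Expected behavior".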

mroeschke added the libcudf label Jul 8, 2024
Projects
Status: In Progress