Add low memory JSON reader for cudf.pandas #16204

Merged
merged 15 commits into from
Jul 12, 2024

galipremsagar
Contributor

Description

Fixes: #16122

This PR introduces low-memory JSON reading for cudf.pandas read_json.
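As a rough illustration of the byte-range idea behind a low-memory JSON Lines reader, the sketch below uses only the Python standard library, not cudf or libcudf. The function name `read_jsonl_byte_range` and the rule that a record belongs to the range containing its first byte are assumptions made for this example; they are not taken from the PR.

```python
import json


def read_jsonl_byte_range(path, offset, size):
    """Parse the JSON Lines records whose first byte lies in [offset, offset + size).

    Hypothetical helper for illustration only; not part of cudf or libcudf.
    """
    records = []
    with open(path, "rb") as f:
        if offset > 0:
            # Step back one byte and consume through the next newline, so the
            # file position lands on the first record that starts at or after
            # `offset`.
            f.seek(offset - 1)
            f.readline()
        # A record is read in full even if it ends past the range, as long as
        # it starts inside the range.
        while f.tell() < offset + size:
            line = f.readline()
            if not line:  # end of file
                break
            if line.strip():  # skip blank lines
                records.append(json.loads(line))
    return records
```

Reading consecutive ranges of a fixed size and appending the results visits every record exactly once under this rule, while only one range's worth of parsed rows needs to be held at a time.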

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@galipremsagar added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels on Jul 6, 2024
@galipremsagar self-assigned this on Jul 6, 2024
@galipremsagar requested review from a team as code owners on Jul 6, 2024, 00:06
@github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) and Python (Affects Python cuDF API) labels on Jul 6, 2024
if len(final_columns) == 0:
    final_columns = new_chunk
else:
    for col_idx in range(len(meta_names)):
Contributor

Just confirming that the concatenation technique here is generally the same as done in the parquet reader?

Contributor Author

yup
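For context, the accumulate-then-extend pattern being discussed can be sketched as follows. This is only an illustration: plain Python lists stand in for device columns, `append_chunk` is a hypothetical name, and the body of the `for` loop is an assumption because the diff excerpt above does not show it.

```python
def append_chunk(final_columns, new_chunk):
    """Merge one chunk's columns into the running result, column by column.

    Hypothetical helper: lists stand in for the real column objects.
    """
    if len(final_columns) == 0:
        # First chunk: copy its columns so later extends do not mutate the input.
        return [list(col) for col in new_chunk]
    if len(new_chunk) != len(final_columns):
        raise ValueError("chunk has a different number of columns")
    for col_idx in range(len(final_columns)):
        # Assumed loop body: append the new rows to the matching column.
        final_columns[col_idx].extend(new_chunk[col_idx])
    return final_columns
```

Calling `append_chunk` once per chunk keeps a single growing set of columns rather than a list of per-chunk tables that has to be concatenated at the end.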

Contributor

@mroeschke left a comment

Just one non-blocking question

try:
    with nogil:
        c_result = move(libcudf_read_json(opts))
except (ValueError, OverflowError):
Member

Just curious if we could use some logic such as _has_next() in PQ chunked reader to break this loop instead of this exception?

Contributor

I don't like that we're catching the exception from a datasource here. The memory mapping is very much an implementation detail.
How expensive would it be to get the total source(s) size? Then we can loop until all of it is read.

Contributor Author

It is expensive: we basically have to seek to the end of each file to get the size of all data sources. For remote data sources, it also gets complicated to perform the seek properly.

Contributor Author

> Just curious if we could use some logic such as _has_next() in PQ chunked reader to break this loop instead of this exception?

We basically call into the libcudf layer for that. Is there any such provision for the JSON reader in libcudf?

Contributor

FWIW, we already need the file size(s) to read JSON input(s). With the current implementation of the low memory JSON reader, we get the size of each input file for each byte range, so getting the sizes one more time to have a cleaner loop would not add much.
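The size-driven loop suggested in this thread could look roughly like the sketch below. It assumes local files whose sizes are available through `os.path.getsize`; the helper names `total_source_size` and `iter_byte_ranges` are hypothetical and do not correspond to any libcudf or pylibcudf API.

```python
import os


def total_source_size(paths):
    """Sum the on-disk sizes of all input files (local paths assumed)."""
    return sum(os.path.getsize(p) for p in paths)


def iter_byte_ranges(total_size, chunk_size):
    """Yield (offset, size) pairs that cover [0, total_size) without overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    offset = 0
    while offset < total_size:
        # The last range is shortened so it never runs past the end.
        size = min(chunk_size, total_size - offset)
        yield offset, size
        offset += size
```

With the total known up front, the read loop ends when the ranges are exhausted, so it does not need to rely on an exception from the datasource to detect the end of input.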

@github-actions bot added the pylibcudf (Issues specific to the pylibcudf package) label on Jul 10, 2024
Contributor Author

@galipremsagar left a comment

@mroeschke @mhaseeb123 I just moved the chunked reader to pylibcudf to resolve the conflicts. Would you be able to take a look at the changes again?

@galipremsagar added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label on Jul 11, 2024
Contributor

@shrshi left a comment

Thank you for working on this! I have just one comment based on the byte range reading behaviour we have seen in this issue.

python/cudf/cudf/_lib/pylibcudf/io/json.pyx (resolved)
@galipremsagar
Contributor Author

/merge

@rapids-bot bot merged commit 954ce6d into rapidsai:branch-24.08 on Jul 12, 2024
81 checks passed
Labels
5 - Ready to Merge: Testing and reviews complete, ready to merge
improvement: Improvement / enhancement to an existing function
libcudf: Affects libcudf (C++/CUDA) code.
non-breaking: Non-breaking change
pylibcudf: Issues specific to the pylibcudf package
Python: Affects Python cuDF API.
Projects
Status: Done
Status: Landed
Development

Successfully merging this pull request may close these issues.

[FEA] Add a low-memory JSON lines reader option based on byte range reads
7 participants