Add low memory JSON reader for `cudf.pandas` #16204

galipremsagar · 2024-07-06T00:06:09Z

Description

This PR introduces low-memory JSON reading for cudf.pandas read_json.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

mroeschke · 2024-07-08T18:25:10Z

python/cudf/cudf/_lib/json.pyx

+        if len(final_columns) == 0:
+            final_columns = new_chunk
+        else:
+            for col_idx in range(len(meta_names)):


Just confirming that the concatenation technique here is generally the same as done in the parquet reader?
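For readers following the thread, the pattern in the diff above (adopt the first chunk's columns, then append each later chunk column by column) can be sketched in plain Python. This is only an illustration: `append_chunk` and the dict-of-lists layout are hypothetical stand-ins, not the cudf column types or the code in `json.pyx`.

```python
from typing import Dict


def append_chunk(
    final_columns: Dict[str, list], new_chunk: Dict[str, list]
) -> Dict[str, list]:
    """Accumulate `new_chunk` into `final_columns`, one column at a time."""
    if len(final_columns) == 0:
        # First chunk: adopt a copy of its columns as the running result.
        return {name: list(values) for name, values in new_chunk.items()}
    for name in final_columns:
        # Later chunks: extend each existing column with the new rows.
        final_columns[name].extend(new_chunk[name])
    return final_columns


# Two hypothetical chunks that share the same column names.
chunks = [
    {"a": [1, 2], "b": ["x", "y"]},
    {"a": [3], "b": ["z"]},
]

result: Dict[str, list] = {}
for chunk in chunks:
    result = append_chunk(result, chunk)
```

The sketch assumes every chunk carries the same column names; it does not handle chunks whose schemas differ.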

mroeschke

Just one non-blocking question

mhaseeb123 · 2024-07-08T19:04:17Z

python/cudf/cudf/_lib/json.pyx

+        try:
+            with nogil:
+                c_result = move(libcudf_read_json(opts))
+        except (ValueError, OverflowError):


Just curious if we could use some logic such as _has_next() in PQ chunked reader to break this loop instead of this exception?
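To make the suggestion concrete, here is a minimal sketch of a loop driven by a `has_next()` check rather than by a caught exception. `ChunkedJsonLinesReader` is a hypothetical pure-Python stand-in over an in-memory JSON Lines buffer; it is not a cudf, pylibcudf, or libcudf API, and it only mirrors the `has_next()` / `read_chunk()` shape mentioned for the Parquet chunked reader.

```python
import io
import json


class ChunkedJsonLinesReader:
    """Hypothetical reader: yields JSON Lines records in bounded chunks."""

    def __init__(self, data: bytes, chunk_size: int):
        self._buf = io.BytesIO(data)
        self._total = len(data)
        self._chunk_size = chunk_size

    def has_next(self) -> bool:
        # There is more to read as long as the cursor is before the end.
        return self._buf.tell() < self._total

    def read_chunk(self) -> list:
        # Read roughly chunk_size bytes, then finish the current line so
        # that no record is split across two chunks.
        raw = self._buf.read(self._chunk_size)
        if raw and not raw.endswith(b"\n"):
            raw += self._buf.readline()
        return [json.loads(line) for line in raw.splitlines() if line.strip()]


data = b'{"a": 1}\n{"a": 2}\n{"a": 3}\n'
reader = ChunkedJsonLinesReader(data, chunk_size=10)

rows = []
while reader.has_next():  # loop ends on a state check, not on an exception
    rows.extend(reader.read_chunk())
```

The termination condition lives in the reader, so the calling loop does not need to know how the underlying source signals end of input.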

I don't like that we're catching the exception from a datasource here. The memory mapping is very much an implementation detail.
How expensive would it be to get the total source(s) size? Then we can loop until all of it is read.

It is expensive: we basically have to seek to the end of the file to get the size of all data sources. For remote data sources, it gets complicated to perform the seek properly, too.

Just curious if we could use some logic such as _has_next() in PQ chunked reader to break this loop instead of this exception?

We basically call into the libcudf layer for that. Is there any such provision for the JSON reader in libcudf?

FWIW, we already need the file size(s) to read JSON input(s). With the current implementation of the low memory JSON reader, we get the size of each input file for each byte range, so getting the sizes one more time to have a cleaner loop would not add much.
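The alternative discussed here (query the total size once, then loop over byte ranges until all of it is read) might look like the sketch below. It assumes a local JSON Lines file and the convention that a record belongs to the byte range in which it starts. `read_byte_range` and `read_all` are hypothetical helpers written with the standard library only; they are not the libcudf implementation.

```python
import json
import os
import tempfile


def read_byte_range(path: str, offset: int, size: int) -> list:
    """Return the records whose first byte lies in [offset, offset + size)."""
    records = []
    with open(path, "rb") as f:
        if offset > 0:
            # Step back one byte and consume through the next newline. This
            # skips a record that started in the previous range, and leaves
            # the cursor at `offset` if a record starts exactly there.
            f.seek(offset - 1)
            f.readline()
        while f.tell() < offset + size:
            line = f.readline()
            if not line:
                break  # end of file
            if line.strip():
                records.append(json.loads(line))
    return records


def read_all(path: str, chunk_size: int) -> list:
    total = os.path.getsize(path)  # single size query up front
    rows, offset = [], 0
    while offset < total:  # loop until every byte range has been visited
        rows.extend(read_byte_range(path, offset, chunk_size))
        offset += chunk_size
    return rows


# Write a small JSON Lines file and read it back in 12-byte ranges.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    for i in range(5):
        tmp.write(json.dumps({"id": i}) + "\n")
    tmp_path = tmp.name

rows = read_all(tmp_path, chunk_size=12)
os.remove(tmp_path)
```

For remote sources the size query may be considerably more expensive than `os.path.getsize`, which is the cost concern raised earlier in the thread.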

galipremsagar

@mroeschke @mhaseeb123 I just moved the chunked reader to pylibcudf to resolve the conflicts. Would you be able to take a look at the changes again?

python/cudf/cudf/_lib/pylibcudf/io/json.pyx

shrshi

Thank you for working on this! I have just one comment based on the byte range reading behaviour we have seen in this issue.

python/cudf/cudf/_lib/pylibcudf/io/json.pyx

Co-authored-by: Shruti Shivakumar <[email protected]>

galipremsagar · 2024-07-12T23:23:22Z

/merge

galipremsagar added 3 commits July 5, 2024 22:07

Implement chunked json reader

ed86600

Merge

df6266d

add tests

8427e4a

galipremsagar added the improvement ("Improvement / enhancement to an existing function") and non-breaking ("Non-breaking change") labels Jul 6, 2024

galipremsagar requested review from mroeschke, GregoryKimball and vuule July 6, 2024 00:06

galipremsagar self-assigned this Jul 6, 2024

galipremsagar requested review from a team as code owners July 6, 2024 00:06

galipremsagar requested review from Matt711 and mhaseeb123 July 6, 2024 00:06

github-actions bot added the libcudf ("Affects libcudf (C++/CUDA) code.") and Python ("Affects Python cuDF API.") labels Jul 6, 2024

mroeschke reviewed Jul 8, 2024


mroeschke approved these changes Jul 8, 2024


mhaseeb123 reviewed Jul 8, 2024


mhaseeb123 approved these changes Jul 10, 2024


galipremsagar added 5 commits July 10, 2024 21:43

merge

8a81ad5

Merge remote-tracking branch 'upstream/branch-24.08' into 16122

04dca59

fix syntax

a007761

move common code together

1bf5569

move common code together

872a1fe

github-actions bot added the pylibcudf ("Issues specific to the pylibcudf package") label Jul 10, 2024

galipremsagar commented Jul 10, 2024


Matt711 reviewed Jul 11, 2024


python/cudf/cudf/_lib/pylibcudf/io/json.pyx

Merge branch 'branch-24.08' into 16122

a958de1

mroeschke approved these changes Jul 11, 2024


lithomas1 reviewed Jul 11, 2024


python/cudf/cudf/_lib/pylibcudf/io/json.pyx

update docstring

bf2578e

galipremsagar added the 5 - Ready to Merge ("Testing and reviews complete, ready to merge") label Jul 11, 2024

GregoryKimball requested a review from shrshi July 11, 2024 18:35

galipremsagar added 2 commits July 11, 2024 15:14

Merge branch 'branch-24.08' into 16122

9bcc074

Merge branch 'branch-24.08' into 16122

5dbb708

shrshi reviewed Jul 12, 2024


python/cudf/cudf/_lib/pylibcudf/io/json.pyx

galipremsagar and others added 2 commits July 12, 2024 16:04

Update python/cudf/cudf/_lib/pylibcudf/io/json.pyx

64992f8

Co-authored-by: Shruti Shivakumar <[email protected]>

Merge branch 'branch-24.08' into 16122

9bbe58e

shrshi approved these changes Jul 12, 2024


Update json.pyx

a70d841

rapids-bot bot merged commit 954ce6d into rapidsai:branch-24.08 Jul 12, 2024
81 checks passed


galipremsagar commented Jul 6, 2024

mroeschke Jul 8, 2024

galipremsagar Jul 8, 2024

mroeschke left a comment

mhaseeb123 Jul 8, 2024

vuule Jul 8, 2024

galipremsagar Jul 8, 2024

galipremsagar Jul 8, 2024

vuule Jul 8, 2024

galipremsagar left a comment

shrshi left a comment

galipremsagar commented Jul 12, 2024
