
Remove size constraints on source files in batched JSON reading #16162

Open
wants to merge 4 commits into base: branch-24.08
Conversation

@shrshi (Contributor) commented Jul 2, 2024

Description

Addresses #16138
The batched multi-source JSON reader fails when the size of any input source buffer exceeds INT_MAX bytes.
The goal of this PR is to remove this constraint by modifying the batching behavior of the reader. Instead of constructing batches from entire source files, batches are now constructed at the granularity of byte ranges of size at most INT_MAX bytes.
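The byte-range batching described above can be illustrated in isolation. This is a minimal host-only sketch, not the cudf internals: make_batches, max_batch_size, and the pair-based return type are all assumptions for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: split the concatenated size of all sources into
// byte ranges no larger than max_batch_size, independent of where the
// individual source files begin and end.
std::vector<std::pair<std::size_t, std::size_t>> make_batches(
  std::vector<std::size_t> const& source_sizes, std::size_t max_batch_size)
{
  std::size_t total = 0;
  for (auto s : source_sizes) { total += s; }

  std::vector<std::pair<std::size_t, std::size_t>> batches;  // (offset, size)
  for (std::size_t offset = 0; offset < total; offset += max_batch_size) {
    batches.emplace_back(offset, std::min(max_batch_size, total - offset));
  }
  return batches;
}
```

With two sources of 10 and 7 bytes and a 5-byte cap, this yields the ranges (0,5), (5,5), (10,5), (15,2); a source larger than the cap simply spans several batches instead of failing.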

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 2, 2024
@shrshi shrshi added cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed libcudf Affects libcudf (C++/CUDA) code. labels Jul 2, 2024
@shrshi shrshi marked this pull request as ready for review July 2, 2024 16:26
@shrshi shrshi requested a review from a team as a code owner July 2, 2024 16:26
@shrshi shrshi requested review from harrism and vuule July 2, 2024 16:26
@vuule (Contributor) left a comment

Looks good!
Just have a few questions and small suggestions.

: std::min(chunk_size, total_source_size - chunk_offset);

size_t const size_per_subchunk = estimate_size_per_subchunk(chunk_size);
size_t const batch_size_ub =
Contributor:

What does ub stand for?

shrshi (Contributor Author):
ub stands for the upper bound on the batch size. batch_size_ub is not INT_MAX directly, since the byte-range reader allocates a few more subchunks when the line end is not found in the current chunk. The upper bound is adjusted for this case to prevent overflow.
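The headroom arithmetic described in this reply might look something like the following hedged sketch. The function name and the num_extra_subchunks constant are illustrative assumptions; the actual expression lives in read_json.cu.

```cpp
#include <climits>
#include <cstddef>

// Illustrative only: keep the batch size far enough below INT_MAX that the
// reader can still append a few extra subchunks (when a chunk does not end
// on a line boundary) without the total exceeding INT_MAX.
std::size_t batch_upper_bound(std::size_t size_per_subchunk,
                              std::size_t num_extra_subchunks = 4)
{
  return static_cast<std::size_t>(INT_MAX) -
         num_extra_subchunks * size_per_subchunk;
}
```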

cpp/src/io/json/read_json.cu (outdated, resolved)
cpp/src/io/json/read_json.cu (outdated, resolved)
cpp/tests/large_strings/json_tests.cpp (outdated, resolved)
// function to extract first delimiter in the string in each chunk,
// collate together and form byte_range for each chunk,
// parse separately.
std::vector<cudf::io::table_with_metadata> skeleton_for_parellel_chunk_reader(
Contributor:

This reimplements the block reading support with the public API?

batch_offsets.push_back(pref_bytes_size);
for (size_t i = start_source; i < sources.size() && pref_bytes_size < end_bytes_size;) {
pref_source_size += sources[i]->size();
while (pref_bytes_size < end_bytes_size &&
Contributor:

Maybe add a few more comments here, since this non-trivial logic is the core of the feature. For example, a comment like "break the current source into batches" could go above this line.
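To make the suggestion concrete, the apparent structure of the loop above can be sketched with plain host code. compute_batch_offsets and every variable name here are illustrative assumptions, not the PR's actual code.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch: accumulate a running prefix sum of source sizes and
// emit a batch boundary every batch_size bytes, clamped to end_bytes.
std::vector<std::size_t> compute_batch_offsets(
  std::vector<std::size_t> const& source_sizes,
  std::size_t batch_size,
  std::size_t end_bytes)
{
  std::vector<std::size_t> offsets{0};
  std::size_t pref = 0;
  for (std::size_t i = 0;
       i < source_sizes.size() && offsets.back() < end_bytes; ++i) {
    pref += source_sizes[i];
    // break the current source into batches
    while (offsets.back() + batch_size < pref &&
           offsets.back() + batch_size < end_bytes) {
      offsets.push_back(offsets.back() + batch_size);
    }
  }
  if (offsets.back() < end_bytes) { offsets.push_back(end_bytes); }
  return offsets;
}
```

For two sources of 100 and 50 bytes, a 60-byte batch size, and a 150-byte end, this produces the offsets 0, 60, 120, 150: each adjacent pair of offsets is one batch, and the second batch straddles the boundary between the two sources.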

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Jul 10, 2024
@shrshi (Contributor Author) commented Jul 10, 2024

Notes on the JSON tests cleanup(?) exercise:

  1. The split_byte_range_reading function in json_utils.hpp splits the input source files into chunks of size chunk_size and constructs a partial table for each chunk. It is called by tests in both json_chunked_reader.cpp and large_strings/json_tests.cpp to evaluate the JSON byte-range reader.
  2. find_first_delimiter_in_chunk is not invoked by the reader, so it has been moved from src into a lambda function inside split_byte_range_reading. On a side note, should we consider moving the definition of find_first_delimiter from byte_range_info.cu to read_json.cu?
  3. All JSON tests have been moved to tests/io/json/.

Feedback is most welcome! 😃
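The chunking described in points 1 and 2 can be pictured with a small host-only sketch. chunk_record_starts is a made-up name for illustration; the real test utility operates on cudf sources and device buffers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative sketch: for each chunk_size-aligned offset, record where that
// chunk's byte range should begin so that ranges start on record boundaries.
// The first chunk starts at 0; each later chunk starts just past the first
// newline at or after its offset (buf.size() marks an empty trailing range).
std::vector<std::size_t> chunk_record_starts(std::string const& buf,
                                             std::size_t chunk_size)
{
  std::vector<std::size_t> starts;
  for (std::size_t off = 0; off < buf.size(); off += chunk_size) {
    if (off == 0) {
      starts.push_back(0);
      continue;
    }
    auto const pos = buf.find('\n', off);
    starts.push_back(pos == std::string::npos ? buf.size() : pos + 1);
  }
  return starts;
}
```

For a JSON-lines buffer of three 8-byte records with chunk_size 10, the starts land at 0, 16, and 24, so each byte range begins on a record boundary even though 10 does not divide the record length.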

@shrshi shrshi requested a review from vuule July 10, 2024 21:53