
Remove size constraints on source files in batched JSON reading #16162

Open
wants to merge 4 commits into base: branch-24.08
Conversation

@shrshi (Contributor) commented Jul 2, 2024

Description

Addresses #16138
The batched multi-source JSON reader fails when the size of any input source buffer exceeds INT_MAX bytes.
The goal of this PR is to remove this constraint by modifying the batching behavior of the reader. Instead of constructing batches from entire source files, batches are now constructed at the granularity of byte ranges of size at most INT_MAX bytes.
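The byte-range batching described above can be illustrated in isolation. This is a minimal host-only sketch, not the cudf internals: make_batches, max_batch_size, and the pair-based return type are all assumptions for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch: split the concatenated size of all sources into
// byte ranges no larger than max_batch_size, independent of where the
// individual source files begin and end.
std::vector<std::pair<std::size_t, std::size_t>> make_batches(
  std::vector<std::size_t> const& source_sizes, std::size_t max_batch_size)
{
  std::size_t total = 0;
  for (auto s : source_sizes) { total += s; }

  std::vector<std::pair<std::size_t, std::size_t>> batches;  // (offset, size)
  for (std::size_t offset = 0; offset < total; offset += max_batch_size) {
    batches.emplace_back(offset, std::min(max_batch_size, total - offset));
  }
  return batches;
}
```

With two sources of 10 and 7 bytes and a 5-byte cap, this yields the ranges (0,5), (5,5), (10,5), (15,2); a source larger than the cap simply spans several batches instead of failing.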

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 2, 2024
@shrshi shrshi added cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed libcudf Affects libcudf (C++/CUDA) code. labels Jul 2, 2024
@shrshi shrshi marked this pull request as ready for review July 2, 2024 16:26
@shrshi shrshi requested a review from a team as a code owner July 2, 2024 16:26
@shrshi shrshi requested review from harrism and vuule July 2, 2024 16:26
@vuule (Contributor) left a comment

Looks good!
Just have a few questions and small suggestions.

: std::min(chunk_size, total_source_size - chunk_offset);

size_t const size_per_subchunk = estimate_size_per_subchunk(chunk_size);
size_t const batch_size_ub =
Contributor:

What does ub stand for?

shrshi (Contributor Author):
ub stands for the upper bound on the batch size. batch_size_ub is not INT_MAX directly, since the byte-range reader allocates a few more subchunks when the line end is not found in the current chunk. The upper bound is adjusted for this case to prevent overflow.
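The headroom arithmetic described in this reply might look something like the following hedged sketch. The function name and the num_extra_subchunks constant are illustrative assumptions; the actual expression lives in read_json.cu.

```cpp
#include <climits>
#include <cstddef>

// Illustrative only: keep the batch size far enough below INT_MAX that the
// reader can still append a few extra subchunks (when a chunk does not end
// on a line boundary) without the total exceeding INT_MAX.
std::size_t batch_upper_bound(std::size_t size_per_subchunk,
                              std::size_t num_extra_subchunks = 4)
{
  return static_cast<std::size_t>(INT_MAX) -
         num_extra_subchunks * size_per_subchunk;
}
```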

cpp/src/io/json/read_json.cu (outdated, resolved)
cpp/src/io/json/read_json.cu (outdated, resolved)
cpp/tests/large_strings/json_tests.cpp (outdated, resolved)
// function to extract first delimiter in the string in each chunk,
// collate together and form byte_range for each chunk,
// parse separately.
std::vector<cudf::io::table_with_metadata> skeleton_for_parellel_chunk_reader(
Contributor:

This reimplements the block reading support with the public API?

batch_offsets.push_back(pref_bytes_size);
for (size_t i = start_source; i < sources.size() && pref_bytes_size < end_bytes_size;) {
pref_source_size += sources[i]->size();
while (pref_bytes_size < end_bytes_size &&
Contributor:

Maybe add a few more comments here, since this non-trivial logic is the core of the feature. For example, a comment like "break the current source into batches" could go above this line.
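To make the suggestion concrete, the apparent structure of the loop above can be sketched with plain host code. compute_batch_offsets and every variable name here are illustrative assumptions, not the PR's actual code.

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch: accumulate a running prefix sum of source sizes and
// emit a batch boundary every batch_size bytes, clamped to end_bytes.
std::vector<std::size_t> compute_batch_offsets(
  std::vector<std::size_t> const& source_sizes,
  std::size_t batch_size,
  std::size_t end_bytes)
{
  std::vector<std::size_t> offsets{0};
  std::size_t pref = 0;
  for (std::size_t i = 0;
       i < source_sizes.size() && offsets.back() < end_bytes; ++i) {
    pref += source_sizes[i];
    // break the current source into batches
    while (offsets.back() + batch_size < pref &&
           offsets.back() + batch_size < end_bytes) {
      offsets.push_back(offsets.back() + batch_size);
    }
  }
  if (offsets.back() < end_bytes) { offsets.push_back(end_bytes); }
  return offsets;
}
```

For two sources of 100 and 50 bytes, a 60-byte batch size, and a 150-byte end, this produces the offsets 0, 60, 120, 150: each adjacent pair of offsets is one batch, and the second batch straddles the boundary between the two sources.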

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Jul 10, 2024
@shrshi (Contributor Author) commented Jul 10, 2024

Notes on the JSON tests cleanup(?) exercise:

  1. The split_byte_range_reading function in json_utils.hpp splits the input source files into chunks of size chunk_size and constructs a partial table for each chunk. It is called by tests in both json_chunked_reader.cpp and large_strings/json_tests.cpp to evaluate the JSON byte-range reader.
  2. find_first_delimiter_in_chunk is not invoked by the reader, so it has been moved from src into a lambda function inside split_byte_range_reading. On a side note, should we consider moving the definition of find_first_delimiter from byte_range_info.cu to read_json.cu?
  3. All JSON tests have been moved to tests/io/json/.

Feedback is most welcome! 😃
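The chunking described in points 1 and 2 can be pictured with a small host-only sketch. chunk_record_starts is a made-up name for illustration; the real test utility operates on cudf sources and device buffers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative sketch: for each chunk_size-aligned offset, record where that
// chunk's byte range should begin so that ranges start on record boundaries.
// The first chunk starts at 0; each later chunk starts just past the first
// newline at or after its offset (buf.size() marks an empty trailing range).
std::vector<std::size_t> chunk_record_starts(std::string const& buf,
                                             std::size_t chunk_size)
{
  std::vector<std::size_t> starts;
  for (std::size_t off = 0; off < buf.size(); off += chunk_size) {
    if (off == 0) {
      starts.push_back(0);
      continue;
    }
    auto const pos = buf.find('\n', off);
    starts.push_back(pos == std::string::npos ? buf.size() : pos + 1);
  }
  return starts;
}
```

For a JSON-lines buffer of three 8-byte records with chunk_size 10, the starts land at 0, 16, and 24, so each byte range begins on a record boundary even though 10 does not divide the record length.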

@shrshi shrshi requested a review from vuule July 10, 2024 21:53