Remove size constraints on source files in batched JSON reading #16162
base: branch-24.08
Conversation
Looks good!
Just have a few questions and small suggestions.
    : std::min(chunk_size, total_source_size - chunk_offset);

size_t const size_per_subchunk = estimate_size_per_subchunk(chunk_size);
size_t const batch_size_ub =
What does `ub` stand for?
`ub` stands for the upper bound on the batch size. `batch_size_ub` is not `INT_MAX` directly since the byte range reader allocates a few more subchunks if the line end is not found in the current chunk; the upper bound is adjusted for this case to prevent overflow.
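For concreteness, here is a minimal sketch of how such an adjusted upper bound could be computed. The constant `max_subchunks_prealloced` and its value are assumptions for illustration; `chunk_size` and `estimate_size_per_subchunk` come from the diff above:

```cpp
#include <cstddef>
#include <limits>

// Sketch only: reserve headroom below INT_MAX for the extra subchunks the
// byte range reader may allocate while searching for the next line end.
constexpr std::size_t max_subchunks_prealloced = 3;  // assumed value, for illustration

std::size_t const size_per_subchunk = estimate_size_per_subchunk(chunk_size);
std::size_t const batch_size_ub =
  std::numeric_limits<int>::max() - max_subchunks_prealloced * size_per_subchunk;
```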
// function to extract first delimiter in the string in each chunk,
// collate together and form byte_range for each chunk,
// parse separately.
std::vector<cudf::io::table_with_metadata> skeleton_for_parellel_chunk_reader(
This reimplements the block reading support with the public API?
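For context, reading one such byte range through the public reader options could look roughly like the sketch below; the helper name `read_chunk` is made up here, and this is an approximation of what the test helper does once per chunk, not the actual test code:

```cpp
#include <cstddef>
#include <cudf/io/json.hpp>

// Sketch: parse a single byte range of a JSON Lines source via the public API.
cudf::io::table_with_metadata read_chunk(cudf::io::source_info const& src,
                                         std::size_t offset,
                                         std::size_t size)
{
  auto const opts = cudf::io::json_reader_options::builder(src)
                      .lines(true)
                      .byte_range_offset(offset)
                      .byte_range_size(size)
                      .build();
  return cudf::io::read_json(opts);
}
```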
batch_offsets.push_back(pref_bytes_size);
for (size_t i = start_source; i < sources.size() && pref_bytes_size < end_bytes_size;) {
  pref_source_size += sources[i]->size();
  while (pref_bytes_size < end_bytes_size &&
Maybe add a few more comments here, since this non-trivial logic is the core of the feature. For example, something like "break the current source into batches" could go above this line; see the sketch below.
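As an illustration, a hedged sketch of the loop with such comments added; the loop body past the quoted lines is an assumption about the intent, not the actual diff:

```cpp
// Start of the requested byte range is the first batch offset.
batch_offsets.push_back(pref_bytes_size);
for (size_t i = start_source; i < sources.size() && pref_bytes_size < end_bytes_size;) {
  // Running prefix sum of the sizes of the sources seen so far.
  pref_source_size += sources[i]->size();
  // Break the current source into batches: emit a new batch offset for every
  // batch_size bytes that the sources consumed so far can cover.
  while (pref_bytes_size < end_bytes_size &&
         pref_source_size >= pref_bytes_size + batch_size) {
    pref_bytes_size += batch_size;
    batch_offsets.push_back(pref_bytes_size);
  }
  i++;
  // (Tail handling for the final partial batch is elided in this sketch.)
}
```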
Notes on the json tests cleanup(?) exercise:
Description

Addresses #16138

The batched multi-source JSON reader fails when the size of any of the input source buffers exceeds `INT_MAX` bytes. The goal of this PR is to remove this constraint by modifying the batching behavior of the reader. Instead of constructing batches that include entire source files, the batches are now constructed at the granularity of byte ranges of size at most `INT_MAX` bytes.

Checklist