Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 #16195
base: branch-24.08
Conversation
auto const& row_groups_info = _file_itm_data.row_groups;
auto& chunks = _file_itm_data.chunks;
auto const num_rows = _file_itm_data.global_num_rows;
auto& row_groups_info = _file_itm_data.row_groups;
Removed const as the first row group's start_row needs to be adjusted for skip_rows
}
}
} else {
size_type count = 0;
for (size_t src_idx = 0; src_idx < per_file_metadata.size(); ++src_idx) {
auto const& fmd = per_file_metadata[src_idx];
for (size_t rg_idx = 0; rg_idx < fmd.row_groups.size(); ++rg_idx) {
for (size_t rg_idx = 0;
The loop will now stop as soon as count >= rows_to_read + rows_to_skip
@@ -1245,9 +1247,18 @@ void reader::impl::preprocess_file(read_mode mode)
  _expr_conv.get_converted_expr(),
  _stream);

// Inclusive scan the number of rows per source |
We only need to compute this when chunked reading is used.
@@ -81,6 +81,7 @@ cdef extern from "cudf/io/types.hpp" \
    map[string, string] user_data
    vector[unordered_map[string, string]] per_file_user_data
    vector[column_name_info] schema_info
    vector[size_t] num_rows_per_source
Added as per @galipremsagar's suggestion. No need to do anything else with it as of now!
does this mean we're now returning the same vector in Python as well, or are we maybe missing a copy?
std::vector<size_t> num_rows_per_source(_file_itm_data.num_rows_per_source.size(), 0);
// Subtract global skip rows from the start_row as we took care of that when computing |
Binary search the lower and upper index into the partial_sum_nrows_source
and compute the number of rows seen per source in between.
// Adjust the start_row of the first row group which was left unadjusted during
// select_row_groups().
if (skip_rows) {
It's not particularly elegant to use skip_rows to make the global_skip_rows adjustment to the very first row_group, but it adds the least code diff here. Suggestions welcome!
Thanks @mhaseeb123! I'll do a deep dive later...just a few first thoughts.
There's a lot going on here...I wonder if it should be two PRs, one for the bug fix and one to add the row counts to the metadata. As I read this I wonder which issue some changes are meant to address. 😅
Also, looking at the original issue and use case, the lazy part of me wonders if simply returning the num_rows field from each FileMetaData object wouldn't suffice. I realize using the chunked reader complicates things, but at the end of the day wouldn't one sum all the counts anyway (i.e. is there a use case for knowing the per-chunk per-file row counts)? skip_rows could be decremented off the head of the vector and num_rows off the tail if that's important.
I agree. I apologize for this but I couldn't push the fix without having changes from the
I implemented this solution first as well, but it can't cater for the case where a list of row_groups to read per source is provided.
Since we are returning the vector
…xes rely on absolute row numbers, not adjusted for skip_rows Co-authored-by: Ed Seidl <[email protected]>
True, that case could go the extra mile and sum up
We can agree to disagree here 😄. I just think this is a lot of complexity to add for a feature of dubious benefit. But if the complexity is necessary to fix the chunked skip_rows case, then I guess it can't be helped. ✌️
First I would like to thank you for looking into this and providing quick feedback.
Since we apply the filter at the very end on the output table, going the AST route would involve building a new column containing the source index for each row, applying the filter, and processing the column again to get the final counts, which may be very compute-expensive. We also don't yet have a use case or request for that.
I am open to either solution (reporting counts on a per-chunk basis, or reporting global counts once all chunks have been read). I am leaning towards the former as it provides better information and covers any future requests as well. The solution is orthogonal to the segfault fix, so switching to either is doable. I guess we can take opinions from @GregoryKimball and @vuule on this.
…23/cudf into report-nrows-per-source
out_metadata.num_rows_per_source =
  std::vector<size_t>(_file_itm_data.num_rows_per_source.size(), 0);
}
// If this is previously non-empty, simply fill in zeros
when do we hit this branch?
Description
Closes #15389
Closes #16186
This PR adds the capability to calculate and report the number of rows read from each data source into the table returned by the Parquet reader (both chunked and normal). The returned vector of counts is only valid (non-empty) when row selection (AST filter) is not being used.
This PR also fixes a segfault in the chunked Parquet reader that occurs when skip_rows > 0 and the number of passes > 1. The segfault was caused by a couple of arithmetic errors when computing the (start_row, num_rows) pair for the row_group_info, pass, and column chunk descriptor structs.
Both changes are included in this PR because the changes and the gtests from the former were needed to implement the segfault fix.
Checklist