
Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 #16195

Open

mhaseeb123 wants to merge 18 commits into base: branch-24.08

Conversation

@mhaseeb123 (Member) commented Jul 4, 2024

Description

Closes #15389
Closes #16186

This PR adds the capability to calculate and report the number of rows read from each data source into the table returned by the Parquet reader (both chunked and normal). The returned vector of counts is only valid (non-empty) when row selection (AST filter) is not being used.
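
For illustration, a minimal sketch of how a caller might consume the new counts (the num_rows_per_source member on table_metadata is what this PR adds; the file names are placeholders):

```cpp
#include <cudf/io/parquet.hpp>

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Sketch only: read two Parquet files and print how many rows of the returned
// table came from each input source. The counts are expected to be empty when
// an AST filter (row selection) is used.
void print_rows_per_source()
{
  auto const sources = std::vector<std::string>{"a.parquet", "b.parquet"};  // placeholders
  auto const options =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{sources}).build();
  auto const result = cudf::io::read_parquet(options);

  for (std::size_t i = 0; i < result.metadata.num_rows_per_source.size(); ++i) {
    std::cout << "source " << i << ": " << result.metadata.num_rows_per_source[i] << " rows\n";
  }
}
```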

This PR also fixes a segfault in the chunked Parquet reader when skip_rows > 0 and the number of passes > 1. The segfault was caused by a couple of arithmetic errors when computing (start_row, num_row) for the row_group_info, pass, and column chunk descriptor structs.

Both changes are included in one PR because the changes and gtests from the row-count work were needed to implement the segfault fix.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 4, 2024
@mhaseeb123 mhaseeb123 self-assigned this Jul 4, 2024
@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function non-breaking Non-breaking change cuIO cuIO issue labels Jul 4, 2024
@mhaseeb123 mhaseeb123 changed the title Report number of rows per data source in table read by Parquet reader if no row selection (AST filter) Report number of rows per data source in table read by Parquet reader when no selection Jul 4, 2024
@mhaseeb123 mhaseeb123 changed the title Report number of rows per data source in table read by Parquet reader when no selection Report number of rows per data source read by Parquet reader when no selection Jul 4, 2024

copy-pr-bot bot commented Jul 9, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


cpp/include/cudf/io/types.hpp (outdated review thread, resolved)
@mhaseeb123 mhaseeb123 marked this pull request as ready for review July 9, 2024 19:46
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner July 9, 2024 19:46
@mhaseeb123 mhaseeb123 marked this pull request as draft July 9, 2024 21:00
@mhaseeb123 mhaseeb123 marked this pull request as ready for review July 9, 2024 21:20
@mhaseeb123 mhaseeb123 added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 9, 2024
@mhaseeb123 mhaseeb123 changed the title Report number of rows per data source read by Parquet reader when no selection Report number of rows per data source read by Parquet reader when no row selection Jul 9, 2024
@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress and removed 3 - Ready for Review Ready for review by team labels Jul 10, 2024
@mhaseeb123 mhaseeb123 marked this pull request as ready for review July 11, 2024 00:49
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner July 11, 2024 00:49
@mhaseeb123 mhaseeb123 changed the title Report number of rows per data source read by Parquet reader when no row selection Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 Jul 11, 2024
@mhaseeb123 mhaseeb123 requested a review from vuule July 11, 2024 00:55
@mhaseeb123 (Member, Author) commented Jul 11, 2024

CC @etseidl @nvdbaranec

auto const& row_groups_info = _file_itm_data.row_groups;
auto& chunks = _file_itm_data.chunks;
auto const num_rows = _file_itm_data.global_num_rows;
auto& row_groups_info = _file_itm_data.row_groups;
@mhaseeb123 (Member, Author):

Removed const as the first row group's start_row needs to be adjusted for skip_rows

}
}
} else {
size_type count = 0;
for (size_t src_idx = 0; src_idx < per_file_metadata.size(); ++src_idx) {
auto const& fmd = per_file_metadata[src_idx];
for (size_t rg_idx = 0; rg_idx < fmd.row_groups.size(); ++rg_idx) {
for (size_t rg_idx = 0;
@mhaseeb123 (Member, Author) commented Jul 11, 2024:

The loop will now stop as soon as count >= rows_to_read + rows_to_skip
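
For context, a hedged sketch of the early-exit shape this describes (rows_to_skip and rows_to_read are taken from the comment above; where exactly the condition lives in the real loop is an assumption):

```cpp
// Sketch only: stop scanning row groups once the requested window
// [rows_to_skip, rows_to_skip + rows_to_read) has been covered.
size_type count = 0;
for (size_t src_idx = 0;
     src_idx < per_file_metadata.size() and count < rows_to_skip + rows_to_read;
     ++src_idx) {
  auto const& fmd = per_file_metadata[src_idx];
  for (size_t rg_idx = 0;
       rg_idx < fmd.row_groups.size() and count < rows_to_skip + rows_to_read;
       ++rg_idx) {
    count += fmd.row_groups[rg_idx].num_rows;
    // ... record this row group as selected ...
  }
}
```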

@@ -1245,9 +1247,18 @@ void reader::impl::preprocess_file(read_mode mode)
_expr_conv.get_converted_expr(),
_stream);

// Inclusive scan the number of rows per source
@mhaseeb123 (Member, Author):

Only need to compute this if chunked reading.
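
As a small illustration of that cumulative-count step (the variable names mirror the partial_sum_nrows_source mentioned in a later comment; the values are made up):

```cpp
#include <numeric>
#include <vector>

// Sketch: build a running total of rows per source so that a global row index
// can later be mapped back to its source with a binary search.
std::vector<std::size_t> num_rows_per_source{100, 250, 80};  // made-up counts
std::vector<std::size_t> partial_sum_nrows_source(num_rows_per_source.size());
std::inclusive_scan(num_rows_per_source.begin(), num_rows_per_source.end(),
                    partial_sum_nrows_source.begin());
// partial_sum_nrows_source is now {100, 350, 430}
```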

@@ -81,6 +81,7 @@ cdef extern from "cudf/io/types.hpp" \
map[string, string] user_data
vector[unordered_map[string, string]] per_file_user_data
vector[column_name_info] schema_info
vector[size_t] num_rows_per_source
@mhaseeb123 (Member, Author):

Added as per @galipremsagar's suggestion. No need to do anything else with it as of now!

Contributor:

does this mean we're now returning the same vector in Python as well, or are we maybe missing a copy?


std::vector<size_t> num_rows_per_source(_file_itm_data.num_rows_per_source.size(), 0);

// Subtract global skip rows from the start_row as we took care of that when computing
@mhaseeb123 (Member, Author):

Binary search the lower and upper index into the partial_sum_nrows_source and compute the number of rows seen per source in between.
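
A hedged sketch of that lookup, assuming an inclusive scan like the one above and a chunk whose output rows span the global window [chunk_start_row, chunk_end_row); the actual implementation in this PR may differ in details:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch only: map a chunk's global row window back onto the input sources and
// count how many of the chunk's rows came from each one.
std::vector<std::size_t> rows_in_chunk_per_source(
  std::vector<std::size_t> const& partial_sum_nrows_source,  // inclusive scan of rows per source
  std::size_t chunk_start_row,
  std::size_t chunk_end_row)
{
  std::vector<std::size_t> counts(partial_sum_nrows_source.size(), 0);

  // Index of the first source whose rows extend past chunk_start_row, and of the
  // first source whose cumulative count reaches chunk_end_row.
  auto const first = std::upper_bound(partial_sum_nrows_source.begin(),
                                      partial_sum_nrows_source.end(), chunk_start_row) -
                     partial_sum_nrows_source.begin();
  auto const last = std::lower_bound(partial_sum_nrows_source.begin(),
                                     partial_sum_nrows_source.end(), chunk_end_row) -
                    partial_sum_nrows_source.begin();

  for (auto src = first;
       src <= last and src < static_cast<std::ptrdiff_t>(counts.size());
       ++src) {
    auto const src_begin = (src == 0) ? std::size_t{0} : partial_sum_nrows_source[src - 1];
    auto const src_end   = partial_sum_nrows_source[src];
    auto const lo        = std::max(chunk_start_row, src_begin);
    auto const hi        = std::min(chunk_end_row, src_end);
    if (hi > lo) { counts[src] = hi - lo; }
  }
  return counts;
}
```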


// Adjust the start_row of the first row group which was left unadjusted during
// select_row_groups().
if (skip_rows) {
@mhaseeb123 (Member, Author):

Not particularly elegant to use skip_rows to make the global_skip_rows adjustment to the very first row group, but it adds the least code diff here. Suggestions welcome!

@mhaseeb123 mhaseeb123 added breaking Breaking change and removed non-breaking Non-breaking change labels Jul 11, 2024
@etseidl (Contributor) left a comment

Thanks @mhaseeb123! I'll do a deep dive later...just a few first thoughts.

There's a lot going on here...I wonder if it should be two PRs, one for the bug fix and one to add the row counts to the metadata. As I read this I wonder which issue some changes are meant to address. 😅

Also, looking at the original issue and use case, the lazy part of me wonders if simply returning the num_rows field from each FileMetaData object wouldn't suffice. I realize using the chunked reader complicates things, but at the end of the day wouldn't one sum all the counts anyway (i.e. is there a use case for knowing the per-chunk per-file row counts)? skip_rows could be decremented off the head of the vector and num_rows off the tail if that's important.
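
A rough sketch of that trimming idea (per_file_num_rows would come from each file's FileMetaData; skip_rows and num_rows from the reader options; purely illustrative, not code from this PR):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch only: start from each file's total row count, trim skip_rows off the
// head of the vector, then cap the total at num_rows so the tail is trimmed.
std::vector<std::size_t> trimmed_counts(std::vector<std::size_t> per_file_num_rows,
                                        std::size_t skip_rows,
                                        std::size_t num_rows)
{
  for (auto& c : per_file_num_rows) {  // decrement skip_rows off the head
    auto const skipped = std::min(skip_rows, c);
    c -= skipped;
    skip_rows -= skipped;
    if (skip_rows == 0) { break; }
  }
  for (auto& c : per_file_num_rows) {  // trim anything beyond num_rows off the tail
    c = std::min(c, num_rows);
    num_rows -= c;
  }
  return per_file_num_rows;
}
```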

cpp/src/io/parquet/reader_impl_helpers.cpp (outdated review thread, resolved)
@mhaseeb123 (Member, Author):

There's a lot going on here...I wonder if it should be two PRs, one for the bug fix and one to add the row counts to the metadata. As I read this I wonder which issue some changes are meant to address. 😅

I agree, and I apologize for this, but I couldn't push the fix without the num_rows_per_source changes, so I had to consolidate. I added some comments which may help explain what's going on. About 70% of the changeset is tests, so I am hoping it won't be too bad to review. Please feel free to comment if you would like me to walk through what's going on.

Also, looking at the original issue and use case, the lazy part of me wonders if simply returning the num_rows field from each FileMetaData object wouldn't suffice.

I implemented that solution first as well, but it can't handle the case where a list of row_groups to read per source is provided.

I realize using the chunked reader complicates things, but at the end of the day wouldn't one sum all the counts anyway (i.e. is there a use case for knowing the per-chunk per-file row counts)? skip_rows could be decremented off the head of the vector and num_rows off the tail if that's important.

Since we return the num_rows_per_source vector with every chunk (as part of table_metadata), I think it is better to report exactly what the particular chunk contains instead of returning the whole count with every chunk.

@etseidl (Contributor) commented Jul 11, 2024

Also, looking at the original issue and use case, the lazy part of me wonders if simply returning the num_rows field from each FileMetaData object wouldn't suffice.

I implemented this solution first as well but it can't cater for the case where a list of row_groups to read per source is provided.

True, that case could go the extra mile and sum up num_rows from the requested row groups, but then why? If a user has sufficient knowledge of their files to know exactly which row groups they want read from a set of files, they presumably already have enough knowledge of the file metadata to know how many rows they're requesting from each file. Perhaps go the AST route here and not populate the rows-per-source vector when row groups are set.
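
For reference, the "extra mile" version mentioned here would amount to something like the following (requested_row_groups is a placeholder for the per-source row-group lists the caller passes in):

```cpp
// Sketch only: when the caller requests specific row groups per source, the
// per-source count is just the sum of num_rows over those row groups.
std::vector<std::size_t> counts(per_file_metadata.size(), 0);
for (std::size_t src = 0; src < per_file_metadata.size(); ++src) {
  for (auto const rg_idx : requested_row_groups[src]) {
    counts[src] += per_file_metadata[src].row_groups[rg_idx].num_rows;
  }
}
```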

I realize using the chunked reader complicates things, but at the end of the day wouldn't one sum all the counts anyway (i.e. is there a use case for knowing the per-chunk per-file row counts)? skip_rows could be decremented off the head of the vector and num_rows off the tail if that's important.

Since we are returning the vector num_rows_per_source per chunk (as a part of table_metadata), I think it would be better to return what exactly the particular chunk contains instead of returning the whole count with every chunk.

We can agree to disagree here 😄. I just think this is a lot of complexity to add for a feature of dubious benefit. But if the complexity is necessary to fix the chunked skip_rows case, then I guess it can't be helped. ✌️

@mhaseeb123 (Member, Author) commented Jul 11, 2024

First I would like to thank you for looking into this and providing quick feedback.

True, that case could go the extra mile and sum up num_rows from the requested row groups, but then why? If a user has sufficient knowledge of their files to know exactly which row groups they want read from a set of files, they presumably already have enough knowledge of the file metadata to know how many rows they're requesting from each file. Perhaps go the AST route here and not populate the rows-per-source vector when row groups are set.

Since we apply the filter at the very end, on the output table, going the AST route would involve building a new column containing the source index for each row, applying the filter, and then processing that column again to get the final counts, which may be very compute-expensive. We also don't yet have a use case or request for that.

We can agree to disagree here 😄. I just think this is a lot of complexity to add for a feature of dubious benefit. But if the complexity is necessary to fix the chunked skip_rows case, then I guess it can't be helped. ✌️

I am open to either solution (report counts on a per-chunk basis, or report global counts once all chunks have been read). I am leaning towards the former as it provides better information and covers any future requests as well. This choice is orthogonal to the segfault fix, so switching to either is doable. I guess we can get opinions from @GregoryKimball and @vuule on this.

out_metadata.num_rows_per_source =
std::vector<size_t>(_file_itm_data.num_rows_per_source.size(), 0);
}
// If this is previously non-empty, simply fill in zeros
Contributor:

when do we hit this branch?

Labels
3 - Ready for Review (Ready for review by team), breaking (Breaking change), cuIO (cuIO issue), improvement (Improvement / enhancement to an existing function), libcudf (Affects libcudf (C++/CUDA) code), pylibcudf (Issues specific to the pylibcudf package), Python (Affects Python cuDF API)
Projects
Status: In Progress
Status: Burndown
4 participants