
Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 #16195

Open

mhaseeb123 wants to merge 18 commits into base: branch-24.08

Conversation

@mhaseeb123 (Member) commented Jul 4, 2024

Description

Closes #15389
Closes #16186

This PR adds the capability to calculate and report the number of rows read from each data source into the table returned by the Parquet reader (both chunked and normal). The returned vector of counts is only valid (non-empty) when row selection (AST filter) is not being used.
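
For illustration, a minimal sketch of how a caller might consume the new counts (the num_rows_per_source member on table_metadata is what this PR adds; the file names are placeholders):

```cpp
#include <cudf/io/parquet.hpp>

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Sketch only: read two Parquet files and print how many rows of the returned
// table came from each input source. The counts are expected to be empty when
// an AST filter (row selection) is used.
void print_rows_per_source()
{
  auto const sources = std::vector<std::string>{"a.parquet", "b.parquet"};  // placeholders
  auto const options =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{sources}).build();
  auto const result = cudf::io::read_parquet(options);

  for (std::size_t i = 0; i < result.metadata.num_rows_per_source.size(); ++i) {
    std::cout << "source " << i << ": " << result.metadata.num_rows_per_source[i] << " rows\n";
  }
}
```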

This PR also fixes a segfault in the chunked Parquet reader when skip_rows > 0 and the number of passes > 1. The segfault was caused by a couple of arithmetic errors when computing (start_row, num_row) for the row_group_info, pass, and column chunk descriptor structs.

Both changes are included in one PR because the changes and gtests from the row-count work were needed to implement the segfault fix.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jul 4, 2024
@mhaseeb123 mhaseeb123 self-assigned this Jul 4, 2024
@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress improvement Improvement / enhancement to an existing function non-breaking Non-breaking change cuIO cuIO issue labels Jul 4, 2024
@mhaseeb123 mhaseeb123 changed the title Report number of rows per data source in table read by Parquet reader if no row selection (AST filter) Report number of rows per data source in table read by Parquet reader when no selection Jul 4, 2024
@mhaseeb123 mhaseeb123 changed the title Report number of rows per data source in table read by Parquet reader when no selection Report number of rows per data source read by Parquet reader when no selection Jul 4, 2024

copy-pr-bot bot commented Jul 9, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


cpp/include/cudf/io/types.hpp (outdated review thread, resolved)
@mhaseeb123 mhaseeb123 marked this pull request as ready for review July 9, 2024 19:46
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner July 9, 2024 19:46
@mhaseeb123 mhaseeb123 marked this pull request as draft July 9, 2024 21:00
@mhaseeb123 mhaseeb123 marked this pull request as ready for review July 9, 2024 21:20
@mhaseeb123 mhaseeb123 added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jul 9, 2024
@mhaseeb123 mhaseeb123 changed the title Report number of rows per data source read by Parquet reader when no selection Report number of rows per data source read by Parquet reader when no row selection Jul 9, 2024
@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress and removed 3 - Ready for Review Ready for review by team labels Jul 10, 2024
@mhaseeb123 mhaseeb123 marked this pull request as ready for review July 11, 2024 00:49
@mhaseeb123 mhaseeb123 requested a review from a team as a code owner July 11, 2024 00:49
@mhaseeb123 mhaseeb123 changed the title Report number of rows per data source read by Parquet reader when no row selection Report number of rows per file read by PQ reader when no row selection and fix segfault in chunked PQ reader when skip_rows > 0 Jul 11, 2024
@mhaseeb123 mhaseeb123 requested a review from vuule July 11, 2024 00:55
@mhaseeb123 (Member, Author) commented Jul 11, 2024

CC @etseidl @nvdbaranec

auto const& row_groups_info = _file_itm_data.row_groups;
auto& chunks = _file_itm_data.chunks;
auto const num_rows = _file_itm_data.global_num_rows;
auto& row_groups_info = _file_itm_data.row_groups;
@mhaseeb123 (Member, Author):

Removed const as the first row group's start_row needs to be adjusted for skip_rows

}
}
} else {
size_type count = 0;
for (size_t src_idx = 0; src_idx < per_file_metadata.size(); ++src_idx) {
auto const& fmd = per_file_metadata[src_idx];
for (size_t rg_idx = 0; rg_idx < fmd.row_groups.size(); ++rg_idx) {
for (size_t rg_idx = 0;
@mhaseeb123 (Member, Author) commented Jul 11, 2024:

The loop will now stop as soon as count >= rows_to_read + rows_to_skip
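
For context, a hedged sketch of the early-exit shape this describes (rows_to_skip and rows_to_read are taken from the comment above; where exactly the condition lives in the real loop is an assumption):

```cpp
// Sketch only: stop scanning row groups once the requested window
// [rows_to_skip, rows_to_skip + rows_to_read) has been covered.
size_type count = 0;
for (size_t src_idx = 0;
     src_idx < per_file_metadata.size() and count < rows_to_skip + rows_to_read;
     ++src_idx) {
  auto const& fmd = per_file_metadata[src_idx];
  for (size_t rg_idx = 0;
       rg_idx < fmd.row_groups.size() and count < rows_to_skip + rows_to_read;
       ++rg_idx) {
    count += fmd.row_groups[rg_idx].num_rows;
    // ... record this row group as selected ...
  }
}
```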

@@ -1245,9 +1247,18 @@ void reader::impl::preprocess_file(read_mode mode)
_expr_conv.get_converted_expr(),
_stream);

// Inclusive scan the number of rows per source
@mhaseeb123 (Member, Author):

Only need to compute this if chunked reading.
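
As a small illustration of that cumulative-count step (the variable names mirror the partial_sum_nrows_source mentioned in a later comment; the values are made up):

```cpp
#include <numeric>
#include <vector>

// Sketch: build a running total of rows per source so that a global row index
// can later be mapped back to its source with a binary search.
std::vector<std::size_t> num_rows_per_source{100, 250, 80};  // made-up counts
std::vector<std::size_t> partial_sum_nrows_source(num_rows_per_source.size());
std::inclusive_scan(num_rows_per_source.begin(), num_rows_per_source.end(),
                    partial_sum_nrows_source.begin());
// partial_sum_nrows_source is now {100, 350, 430}
```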

@@ -81,6 +81,7 @@ cdef extern from "cudf/io/types.hpp" \
map[string, string] user_data
vector[unordered_map[string, string]] per_file_user_data
vector[column_name_info] schema_info
vector[size_t] num_rows_per_source
@mhaseeb123 (Member, Author):

Added as per @galipremsagar's suggestion. No need to do anything else with it as of now!

Contributor:

does this mean we're now returning the same vector in Python as well, or are we maybe missing a copy?


std::vector<size_t> num_rows_per_source(_file_itm_data.num_rows_per_source.size(), 0);

// Subtract global skip rows from the start_row as we took care of that when computing
@mhaseeb123 (Member, Author):

Binary search the lower and upper index into the partial_sum_nrows_source and compute the number of rows seen per source in between.
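
A hedged sketch of that lookup, assuming an inclusive scan like the one above and a chunk whose output rows span the global window [chunk_start_row, chunk_end_row); the actual implementation in this PR may differ in details:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch only: map a chunk's global row window back onto the input sources and
// count how many of the chunk's rows came from each one.
std::vector<std::size_t> rows_in_chunk_per_source(
  std::vector<std::size_t> const& partial_sum_nrows_source,  // inclusive scan of rows per source
  std::size_t chunk_start_row,
  std::size_t chunk_end_row)
{
  std::vector<std::size_t> counts(partial_sum_nrows_source.size(), 0);

  // Index of the first source whose rows extend past chunk_start_row, and of the
  // first source whose cumulative count reaches chunk_end_row.
  auto const first = std::upper_bound(partial_sum_nrows_source.begin(),
                                      partial_sum_nrows_source.end(), chunk_start_row) -
                     partial_sum_nrows_source.begin();
  auto const last = std::lower_bound(partial_sum_nrows_source.begin(),
                                     partial_sum_nrows_source.end(), chunk_end_row) -
                    partial_sum_nrows_source.begin();

  for (auto src = first;
       src <= last and src < static_cast<std::ptrdiff_t>(counts.size());
       ++src) {
    auto const src_begin = (src == 0) ? std::size_t{0} : partial_sum_nrows_source[src - 1];
    auto const src_end   = partial_sum_nrows_source[src];
    auto const lo        = std::max(chunk_start_row, src_begin);
    auto const hi        = std::min(chunk_end_row, src_end);
    if (hi > lo) { counts[src] = hi - lo; }
  }
  return counts;
}
```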


// Adjust the start_row of the first row group which was left unadjusted during
// select_row_groups().
if (skip_rows) {
@mhaseeb123 (Member, Author):

Not particularly elegant to use skip_rows to make the global_skip_rows adjustment to the very first row group, but it adds the least code diff here. Suggestions welcome!

@mhaseeb123 mhaseeb123 added breaking Breaking change and removed non-breaking Non-breaking change labels Jul 11, 2024
@etseidl (Contributor) left a comment

Thanks @mhaseeb123! I'll do a deep dive later...just a few first thoughts.

There's a lot going on here...I wonder if it should be two PRs, one for the bug fix and one to add the row counts to the metadata. As I read this I wonder which issue some changes are meant to address. 😅

Also, looking at the original issue and use case, the lazy part of me wonders if simply returning the num_rows field from each FileMetaData object wouldn't suffice. I realize using the chunked reader complicates things, but at the end of the day wouldn't one sum all the counts anyway (i.e. is there a use case for knowing the per-chunk per-file row counts)? skip_rows could be decremented off the head of the vector and num_rows off the tail if that's important.
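
A rough sketch of that trimming idea (per_file_num_rows would come from each file's FileMetaData; skip_rows and num_rows from the reader options; purely illustrative, not code from this PR):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch only: start from each file's total row count, trim skip_rows off the
// head of the vector, then cap the total at num_rows so the tail is trimmed.
std::vector<std::size_t> trimmed_counts(std::vector<std::size_t> per_file_num_rows,
                                        std::size_t skip_rows,
                                        std::size_t num_rows)
{
  for (auto& c : per_file_num_rows) {  // decrement skip_rows off the head
    auto const skipped = std::min(skip_rows, c);
    c -= skipped;
    skip_rows -= skipped;
    if (skip_rows == 0) { break; }
  }
  for (auto& c : per_file_num_rows) {  // trim anything beyond num_rows off the tail
    c = std::min(c, num_rows);
    num_rows -= c;
  }
  return per_file_num_rows;
}
```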

cpp/src/io/parquet/reader_impl_helpers.cpp (outdated review thread, resolved)
@mhaseeb123 (Member, Author):

There's a lot going on here...I wonder if it should be two PRs, one for the bug fix and one to add the row counts to the metadata. As I read this I wonder which issue some changes are meant to address. 😅

I agree, and I apologize for this, but I couldn't push the fix without the num_rows_per_source changes, so I had to consolidate. I added some comments which may help explain what's going on. About 70% of the changeset is tests, so I am hoping it won't be too bad to review. Please feel free to comment if you would like me to walk through what's going on.

Also, looking at the original issue and use case, the lazy part of me wonders if simply returning the num_rows field from each FileMetaData object wouldn't suffice.

I implemented that solution first as well, but it can't handle the case where a list of row_groups to read per source is provided.

I realize using the chunked reader complicates things, but at the end of the day wouldn't one sum all the counts anyway (i.e. is there a use case for knowing the per-chunk per-file row counts)? skip_rows could be decremented off the head of the vector and num_rows off the tail if that's important.

Since we return the num_rows_per_source vector with every chunk (as part of table_metadata), I think it is better to report exactly what the particular chunk contains instead of returning the whole count with every chunk.

@etseidl (Contributor) commented Jul 11, 2024

Also, looking at the original issue and use case, the lazy part of me wonders if simply returning the num_rows field from each FileMetaData object wouldn't suffice.

I implemented this solution first as well but it can't cater for the case where a list of row_groups to read per source is provided.

True, that case could go the extra mile and sum up num_rows from the requested row groups, but then why? If a user has sufficient knowledge of their files to know exactly which row groups they want read from a set of files, they presumably already have enough knowledge of the file metadata to know how many rows they're requesting from each file. Perhaps go the AST route here and not populate the rows-per-source vector when row groups are set.
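
For reference, the "extra mile" version mentioned here would amount to something like the following (requested_row_groups is a placeholder for the per-source row-group lists the caller passes in):

```cpp
// Sketch only: when the caller requests specific row groups per source, the
// per-source count is just the sum of num_rows over those row groups.
std::vector<std::size_t> counts(per_file_metadata.size(), 0);
for (std::size_t src = 0; src < per_file_metadata.size(); ++src) {
  for (auto const rg_idx : requested_row_groups[src]) {
    counts[src] += per_file_metadata[src].row_groups[rg_idx].num_rows;
  }
}
```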

I realize using the chunked reader complicates things, but at the end of the day wouldn't one sum all the counts anyway (i.e. is there a use case for knowing the per-chunk per-file row counts)? skip_rows could be decremented off the head of the vector and num_rows off the tail if that's important.

Since we are returning the vector num_rows_per_source per chunk (as a part of table_metadata), I think it would be better to return what exactly the particular chunk contains instead of returning the whole count with every chunk.

We can agree to disagree here 😄. I just think this is a lot of complexity to add for a feature of dubious benefit. But if the complexity is necessary to fix the chunked skip_rows case, then I guess it can't be helped. ✌️

@mhaseeb123 (Member, Author) commented Jul 11, 2024

First I would like to thank you for looking into this and providing quick feedback.

True, that case could go the extra mile and sum up num_rows from the requested row groups, but then why? If a user has sufficient knowledge of their files to know exactly which row groups they want read from a set of files, they presumably already have enough knowledge of the file metadata to know how many rows they're requesting from each file. Perhaps go the AST route here and not populate the rows-per-source vector when row groups are set.

Since we apply the filter at the very end, on the output table, going the AST route would involve building a new column containing the source index for each row, applying the filter, and then processing that column again to get the final counts, which may be very compute-expensive. We also don't yet have a use case or request for that.

We can agree to disagree here 😄. I just think this is a lot of complexity to add for a feature of dubious benefit. But if the complexity is necessary to fix the chunked skip_rows case, then I guess it can't be helped. ✌️

I am open to either solution (report counts on a per-chunk basis, or report global counts once all chunks have been read). I am leaning towards the former as it provides better information and covers any future requests as well. This choice is orthogonal to the segfault fix, so switching to either is doable. I guess we can get opinions from @GregoryKimball and @vuule on this.

out_metadata.num_rows_per_source =
std::vector<size_t>(_file_itm_data.num_rows_per_source.size(), 0);
}
// If this is previously non-empty, simply fill in zeros
Contributor:

when do we hit this branch?

Labels
3 - Ready for Review (Ready for review by team), breaking (Breaking change), cuIO (cuIO issue), improvement (Improvement / enhancement to an existing function), libcudf (Affects libcudf (C++/CUDA) code), pylibcudf (Issues specific to the pylibcudf package), Python (Affects Python cuDF API)
Projects
Status: In Progress
Status: Burndown
4 participants