Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] make_lists_column does not enforce arrow-compatible invariants on offsets column #16164

Closed
wence- opened this issue Jul 2, 2024 · 4 comments · Fixed by #16201
Closed
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

Comments

@wence-
Copy link
Contributor

wence- commented Jul 2, 2024

Describe the bug

The arrow spec says, of offsets:

The offsets buffer contains length + 1 signed integers (either 32-bit or 64-bit, depending on the logical type), which encode the start position of each slot in the data buffer.

(emphasis mine).

However, make_lists_column enforces:

CUDF_EXPECTS(
(num_rows == 0 && offsets_column->size() == 0) || num_rows == offsets_column->size() - 1,
"Invalid offsets column size for lists column.");

That is, it allows the offsets buffer to have length zero iff the number of rows in the list column is zero.

make_empty_lists_column uses this, laxer, definition and will produce a length-zero list column with an offset array of length zero.

Consequently, a library which checks the invariants on list columns when ingesting an empty list column that is exported by libcudf will observe that they are not satisfied. Libraries that do not check these invariants may read arbitrary memory and/or deduce incorrect sizes if they ever access the offset array.

Steps/Code to reproduce bug

import pyarrow as pa
import cudf._lib.pylibcudf as plc

h_table = pa.table({"a": pa.array([], type=pa.int8()), "b": pa.array([], type=pa.float32())})

d_table = plc.interop.from_arrow(h_table)

grouper = plc.groupby.GroupBy(plc.Table([d_table.columns()[0]]))

request = plc.groupby.GroupByRequest(d_table.columns()[1], [plc.aggregation.collect_list()])

_, (result,) = grouper.aggregate([request])

empty_list_column = result.columns()[0]

offsets, values = empty_list_column.children()

assert offsets.size() == values.size() - 1 # fails

arrow_column = plc.interop.to_arrow(empty_list_column)

import polars as pl

pl.from_arrow(arrow_column)

# raises: ComputeError: offsets must not exceed the values length

Expected behavior

Empty list columns should have a layout that is compatible with arrow.

I suppose there are two ways to fix this:

  1. In to_arrow special case construction of empty list columns (we already do this for empty strings).
  2. Enforce the correct invariant in construction of all list columns.

They're probably not mutually exclusive.

@wence- wence- added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. labels Jul 2, 2024
@davidwendt
Copy link
Contributor

I believe we do option 1 for strings columns so I think we should do the same for lists columns

if (child_arrays.empty()) {
// Empty string will have only one value in offset of 4 bytes
auto tmp_offset_buffer = allocate_arrow_buffer(sizeof(int32_t), ar_mr);
auto tmp_data_buffer = allocate_arrow_buffer(0, ar_mr);
memset(tmp_offset_buffer->mutable_data(), 0, sizeof(int32_t));
return std::make_shared<arrow::StringArray>(
0, std::move(tmp_offset_buffer), std::move(tmp_data_buffer));
}

Accessing child data in an empty column is considered UB in libcudf. I should add this to the developer guide.

@wence- wence- self-assigned this Jul 2, 2024
@wence-
Copy link
Contributor Author

wence- commented Jul 2, 2024

I believe we do option 1 for strings columns so I think we should do the same for lists columns

if (child_arrays.empty()) {
// Empty string will have only one value in offset of 4 bytes
auto tmp_offset_buffer = allocate_arrow_buffer(sizeof(int32_t), ar_mr);
auto tmp_data_buffer = allocate_arrow_buffer(0, ar_mr);
memset(tmp_offset_buffer->mutable_data(), 0, sizeof(int32_t));
return std::make_shared<arrow::StringArray>(
0, std::move(tmp_offset_buffer), std::move(tmp_data_buffer));
}

Accessing child data in an empty column is considered UB in libcudf. I should add this to the developer guide.

How would you represent a nested empty list column, such that the correct datatype can be determined, if accessing child data is UB?

@davidwendt
Copy link
Contributor

How would you represent a nested empty list column, such that the correct datatype can be determined, if accessing child data is UB?

That is a good point. I was hoping for a global rule. The empty child column is necessary to access the list column's type.

std::unique_ptr<column> make_empty_lists_column(data_type child_type,
rmm::cuda_stream_view stream,
rmm::device_async_resource_ref mr)
{
auto offsets = make_empty_column(data_type(type_to_id<size_type>()));
auto child = make_empty_column(child_type);
return make_lists_column(
0, std::move(offsets), std::move(child), 0, rmm::device_buffer{}, stream, mr);
}

So another thing worth documenting in the dev guide.

Are you ok with option 1 as you described? I feel that would be more consistent with how offsets are handled for empty strings columns.

@wence-
Copy link
Contributor Author

wence- commented Jul 2, 2024

The empty child column is necessary to access the list column's type.

FWIW, I really wish the nested type implementation in libcudf stored the structure in the data_type, rather than using data_type as a flat union tag discriminator. But I suspect that ship has sailed.

Are you ok with option 1 as you described? I feel that would be more consistent with how offsets are handled for empty strings columns.

Yes, I think so. I guess it means one always has to remember to special-case empty list columns in libcudf algorithms, but I suppose that is done at the moment anyway to avoid kernel launches for empty data.

rapids-bot bot pushed a commit that referenced this issue Jul 4, 2024
The offset column of a nested empty list column may be empty as discussed in #16164. `ListColumn.memory_usage` assumed that this column was non-empty

Unblocks rapidsai/cuspatial#1400

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #16193
@rapids-bot rapids-bot bot closed this as completed in 2664427 Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.
Projects
Status: No status
Development

Successfully merging a pull request may close this issue.

2 participants