[BUG] Segfault in pylibcudf to_arrow interop when passing nested list and metadata #16069

lithomas1 · 2024-06-24T15:36:29Z

Describe the bug
A clear and concise description of what the bug is.

plc.interop.to_arrow is segfaulting when passed (possibly invalid) metadata.

Steps/Code to reproduce bug
Follow this guide http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports to craft a minimal bug report. This helps us reproduce the issue you're having and resolve the issue more quickly.

import cudf._lib.pylibcudf as plc
import pyarrow as pa

def metadata_from_arrow_type(
    pa_type: pa.Array,
    name: str = "",
) -> plc.interop.ColumnMetadata | None:
    metadata = plc.interop.ColumnMetadata(name) #None
    if pa.types.is_list(pa_type) or pa.types.is_struct(pa_type):
        child_meta = []
        for i in range(pa_type.num_fields):
            field_meta = metadata_from_arrow_type(pa_type.field(i).type, pa_type.field(i).name)
            child_meta.append(field_meta)
        metadata = plc.interop.ColumnMetadata(
            name,
            # libcudf does not store field names, so just match pyarrow's.
            child_meta
        )
    return metadata

pa_array = pa.array(
    [[[4]], [[5, 6]], [[7]], [[8]], [[9]], [[10]]], type=pa.list_(pa.list_(pa.int64()))
)

plc_array = plc.interop.from_arrow(pa_array)

print(metadata_from_arrow_type(pa_array.type))
new_pa_array = plc.interop.to_arrow(plc_array, metadata=metadata_from_arrow_type(pa_array.type))

This code will segfault or raise memoryerror.

Expected behavior
A clear and concise description of what you expected to happen.

No segfault. If the metadata being passed is wrong, an error should be raised (like in the struct case).

Environment overview (please complete the following information)

Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
Method of cuDF install: [conda, Docker, or from source]
- If method of install is [Docker], provide docker pull & docker run commands used

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Additional context

The error seems to occur in the C++ arrow conversion code (but still could be a bug in either pylibcudf or libcudf), so I'm labelling as both a libcudf and pylibcudf issue.

The text was updated successfully, but these errors were encountered:

wence- · 2024-06-25T17:53:00Z

This is because the to_arrow interop expects, if the column_metadata for the children of a ListColumn is not empty, that you provide metadata for both the offsets column and the "values" column.

If you use the following:

def metadata_from_arrow_type(
    pa_type: pa.Array,
    name: str = "",
) -> plc.interop.ColumnMetadata | None:
    metadata = plc.interop.ColumnMetadata(name) #None
    if pa.types.is_list(pa_type):
        child_meta = [plc.interop.ColumnMetadata("offsets")]
        for i in range(pa_type.num_fields):
            field_meta = metadata_from_arrow_type(pa_type.field(i).type, pa_type.field(i).name)
            child_meta.append(field_meta)
        metadata = plc.interop.ColumnMetadata(name, child_meta)
    elif pa.types.is_struct(pa_type):
        child_meta = []
        for i in range(pa_type.num_fields):
            field_meta = metadata_from_arrow_type(pa_type.field(i).type, pa_type.field(i).name)
            child_meta.append(field_meta)
        metadata = plc.interop.ColumnMetadata(
            name,
            # libcudf does not store field names, so just match pyarrow's.
            child_meta
        )
    return metadata

Then there are no errors.

I debugged this by:

Compile to_arrow.cu with -g -O0
run valgrind --vgdb=yes python bug.py
Attach to the process with gdb via the instructions and spelunk a bit when I have an error.

lithomas1 · 2024-06-26T14:22:25Z

Thanks a bunch for the debugging!

This is because the to_arrow interop expects, if the column_metadata for the children of a ListColumn is not empty, that you provide metadata for both the offsets column and the "values" column.

This makes sense, I was thinking in terms of pyarrow ListType which doesn't have this.
(This could be better documented, though, probably on the interop page - which I now realize isn't getting built in the docs.)

I'll leave this issue open, though, since I don't think the code should segfault.
(For structs, a nice runtime error is raised).

wence- · 2024-06-26T14:41:32Z

Yeah, we can do something like this:

diff --git a/cpp/src/interop/to_arrow.cu b/cpp/src/interop/to_arrow.cu
index 2b3aa2f08f..0aed1f6612 100644
--- a/cpp/src/interop/to_arrow.cu
+++ b/cpp/src/interop/to_arrow.cu
@@ -374,6 +374,10 @@ std::shared_ptr<arrow::Array> dispatch_to_arrow::operator()<cudf::list_view>(
   column_view input_view = (tmp_column != nullptr) ? tmp_column->view() : input;
   auto children_meta =
     metadata.children_meta.empty() ? std::vector<column_metadata>{{}, {}} : metadata.children_meta;
+
+  CUDF_EXPECTS(metadata.children_meta.size() == static_cast<std::size_t>(input_view.num_children()),
+               "Number of field names and number of children doesn't match\n");
+
   auto child_arrays = fetch_child_array(input_view, children_meta, ar_mr, stream);
   if (child_arrays.empty()) {
     return std::make_shared<arrow::ListArray>(arrow::list(arrow::null()), 0, nullptr, nullptr);

However, the bigger question is what the preferred approach here is. It seems like the principle of least surprise would lead one to think, as you did, that for a list column one should provide a single metadata entry corresponding for the "element type" that's interior to the list. After all, the offsets are an implementation detail. But I don't know if changing those semantics would break any other usage.

lithomas1 · 2024-06-26T18:51:21Z

However, the bigger question is what the preferred approach here is. It seems like the principle of least surprise would lead one to think, as you did, that for a list column one should provide a single metadata entry corresponding for the "element type" that's interior to the list. After all, the offsets are an implementation detail. But I don't know if changing those semantics would break any other usage.

I haven't looked at libcudf internals, but I'd think that offset information is exposed in e.g. table_metadata which is returned in the readers.

On the pylibcudf side, we could hide this complexity by exposing the metadata_from_arrow_type helper you fixed for me here, as I think libcudf types map 1-1 to Arrow types.
(Or, we could directly accept a pyarrow schema instead and call metadata_from_arrow_type internally.)

P.S.
Me and @vyasr talked about how interop would look like in the future, and I think for columns/scalars we were leaning towards using the buffer protocol/arrow c data interface/cuda array interface to generically support interop. So for those types at the very least, you could just do np.array(plc column), etc. without dealing with interop.
(we were planning to bring it up to the group on Monday)

This doesn't work for tables, unfortunately, since table metadata is not stored with libcudf table.

vyasr · 2024-06-26T19:31:56Z

My general approach to pylibcudf to this point has been to keep it as a minimal, faithful export of libcudf algorithms to Python without adding much in the way of syntactic sugar to improve the API. I do think that work is worth doing, I just haven't prioritized it since my assumption is that internal usage of pylibcudf (inside cuDF classic and cudf.polars) is going to dwarf any external usage until we have a proper standalone package anyway. As a result there are a lot of APIs like this that have very sharp edges. I'm not sure how much effort we should invest into improving them just yet.

That said, in this particular case since we do a lot of Arrow interop in testing (especially interactively) it's probably worth making developers lives easier if possible. The question then is, do we want to make users create ColumnMetadata objects, or should we support something more ergonomic? I think #15130 captures the analogous questions for groupby-agg APIs and maybe is instructive as to how I'm thinking about a good API there. Do you think similar ideas would apply here?

wence- · 2024-06-27T16:00:04Z

For testing at the pylibcudf level, do we care about the metadata attached to arrow objects at all? libcudf doesn't care about metadata. So either pylibcudf also doesn't care. Or pylibcudf does care, but then it needs a principled way of attaching metadata to columns, I think.

vyasr · 2024-06-27T20:16:39Z

My claim is that pylibcudf doesn't care, but users may want to produce arrow objects with metadata and so interop is the only place where there should be a principled way to do this right? I don't think there needs to be metadata attached to columns, but there should be a way to produce arrow arrays from columns that allows the attachment of metadata upon array creation.

lithomas1 · 2024-06-28T16:24:50Z

A solution could maybe be to make users that want to have the metadata follow the pylibcudf Table around use the TableWithMetadata class I added for I/O.
(pylibcudf operations would then either accept TableWithMetadata or Table).

This also solves the usability issue of users having to keep track of metadata by hand after an I/O operation.
(since after I/O users get a TableWithMetadata, but they have to keep track of column metadata manually, which is easy to get wrong, especially when the dtypes change after operations)

Then, we could make to_arrow on a pylibcudf Table give an Arrow table with default names.
(And for users that want their own names, they can construct a pylibcudf TableWithMetadata from the Table and their own column names).

lithomas1 · 2024-07-03T18:10:49Z

Yeah, we can do something like this:
diff --git a/cpp/src/interop/to_arrow.cu b/cpp/src/interop/to_arrow.cu
index 2b3aa2f08f..0aed1f6612 100644
--- a/cpp/src/interop/to_arrow.cu
+++ b/cpp/src/interop/to_arrow.cu
@@ -374,6 +374,10 @@ std::shared_ptr<arrow::Array> dispatch_to_arrow::operator()<cudf::list_view>(
   column_view input_view = (tmp_column != nullptr) ? tmp_column->view() : input;
   auto children_meta =
     metadata.children_meta.empty() ? std::vector<column_metadata>{{}, {}} : metadata.children_meta;
+
+  CUDF_EXPECTS(metadata.children_meta.size() == static_cast<std::size_t>(input_view.num_children()),
+               "Number of field names and number of children doesn't match\n");
+
   auto child_arrays = fetch_child_array(input_view, children_meta, ar_mr, stream);
   if (child_arrays.empty()) {
     return std::make_shared<arrow::ListArray>(arrow::list(arrow::null()), 0, nullptr, nullptr);
However, the bigger question is what the preferred approach here is. It seems like the principle of least surprise would lead one to think, as you did, that for a list column one should provide a single metadata entry corresponding for the "element type" that's interior to the list. After all, the offsets are an implementation detail. But I don't know if changing those semantics would break any other usage.

@wence- Were you going to submit a patch for this?

When converting a list column to arrow with metadata, one must provide metadata information for both the offset and value columns, or none at all. This is not completely obvious (perhaps we only need the metadata for the inner value column), so explicitly assert this case. - Closes rapidsai#16069

When converting a list column to arrow with metadata, one must provide metadata information for both the offset and value columns, or none at all. This is not completely obvious (perhaps we only need the metadata for the inner value column), so explicitly assert this case. - Closes #16069 Authors: - Lawrence Mitchell (https://github.com/wence-) Approvers: - MithunR (https://github.com/mythrocks) - David Wendt (https://github.com/davidwendt) URL: #16198

lithomas1 added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. pylibcudf Issues specific to the pylibcudf package labels Jun 24, 2024

lithomas1 mentioned this issue Jun 24, 2024

Start migrating I/O writers to pylibcudf (starting with JSON) #15952

Merged

3 tasks

wence- mentioned this issue Jul 4, 2024

Assert valid metadata is passed in to_arrow for list_view #16198

Merged

3 tasks

rapids-bot bot closed this as completed in #16198 Jul 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Segfault in pylibcudf to_arrow interop when passing nested list and metadata #16069

[BUG] Segfault in pylibcudf to_arrow interop when passing nested list and metadata #16069

lithomas1 commented Jun 24, 2024 •

edited

Loading

wence- commented Jun 25, 2024

lithomas1 commented Jun 26, 2024

wence- commented Jun 26, 2024

lithomas1 commented Jun 26, 2024

vyasr commented Jun 26, 2024 •

edited

Loading

wence- commented Jun 27, 2024

vyasr commented Jun 27, 2024

lithomas1 commented Jun 28, 2024 •

edited

Loading

lithomas1 commented Jul 3, 2024

[BUG] Segfault in pylibcudf to_arrow interop when passing nested list and metadata #16069

[BUG] Segfault in pylibcudf to_arrow interop when passing nested list and metadata #16069

Comments

lithomas1 commented Jun 24, 2024 • edited Loading

wence- commented Jun 25, 2024

lithomas1 commented Jun 26, 2024

wence- commented Jun 26, 2024

lithomas1 commented Jun 26, 2024

vyasr commented Jun 26, 2024 • edited Loading

wence- commented Jun 27, 2024

vyasr commented Jun 27, 2024

lithomas1 commented Jun 28, 2024 • edited Loading

lithomas1 commented Jul 3, 2024

lithomas1 commented Jun 24, 2024 •

edited

Loading

vyasr commented Jun 26, 2024 •

edited

Loading

lithomas1 commented Jun 28, 2024 •

edited

Loading