Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Segfault in pylibcudf to_arrow interop when passing nested list and metadata #16069

Closed
lithomas1 opened this issue Jun 24, 2024 · 9 comments · Fixed by #16198
Closed

[BUG] Segfault in pylibcudf to_arrow interop when passing nested list and metadata #16069

lithomas1 opened this issue Jun 24, 2024 · 9 comments · Fixed by #16198
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. pylibcudf Issues specific to the pylibcudf package

Comments

@lithomas1
Copy link
Contributor

lithomas1 commented Jun 24, 2024

Describe the bug
A clear and concise description of what the bug is.

plc.interop.to_arrow is segfaulting when passed (possibly invalid) metadata.

Steps/Code to reproduce bug
Follow this guide http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports to craft a minimal bug report. This helps us reproduce the issue you're having and resolve the issue more quickly.

import cudf._lib.pylibcudf as plc
import pyarrow as pa

def metadata_from_arrow_type(
    pa_type: pa.Array,
    name: str = "",
) -> plc.interop.ColumnMetadata | None:
    metadata = plc.interop.ColumnMetadata(name) #None
    if pa.types.is_list(pa_type) or pa.types.is_struct(pa_type):
        child_meta = []
        for i in range(pa_type.num_fields):
            field_meta = metadata_from_arrow_type(pa_type.field(i).type, pa_type.field(i).name)
            child_meta.append(field_meta)
        metadata = plc.interop.ColumnMetadata(
            name,
            # libcudf does not store field names, so just match pyarrow's.
            child_meta
        )
    return metadata

pa_array = pa.array(
    [[[4]], [[5, 6]], [[7]], [[8]], [[9]], [[10]]], type=pa.list_(pa.list_(pa.int64()))
)

plc_array = plc.interop.from_arrow(pa_array)

print(metadata_from_arrow_type(pa_array.type))
new_pa_array = plc.interop.to_arrow(plc_array, metadata=metadata_from_arrow_type(pa_array.type))

This code will segfault or raise memoryerror.

Expected behavior
A clear and concise description of what you expected to happen.

No segfault. If the metadata being passed is wrong, an error should be raised (like in the struct case).

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
  • Method of cuDF install: [conda, Docker, or from source]
    • If method of install is [Docker], provide docker pull & docker run commands used

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Additional context

The error seems to occur in the C++ arrow conversion code (but still could be a bug in either pylibcudf or libcudf), so I'm labelling as both a libcudf and pylibcudf issue.

@lithomas1 lithomas1 added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. pylibcudf Issues specific to the pylibcudf package labels Jun 24, 2024
@wence-
Copy link
Contributor

wence- commented Jun 25, 2024

This is because the to_arrow interop expects, if the column_metadata for the children of a ListColumn is not empty, that you provide metadata for both the offsets column and the "values" column.

If you use the following:

def metadata_from_arrow_type(
    pa_type: pa.Array,
    name: str = "",
) -> plc.interop.ColumnMetadata | None:
    metadata = plc.interop.ColumnMetadata(name) #None
    if pa.types.is_list(pa_type):
        child_meta = [plc.interop.ColumnMetadata("offsets")]
        for i in range(pa_type.num_fields):
            field_meta = metadata_from_arrow_type(pa_type.field(i).type, pa_type.field(i).name)
            child_meta.append(field_meta)
        metadata = plc.interop.ColumnMetadata(name, child_meta)
    elif pa.types.is_struct(pa_type):
        child_meta = []
        for i in range(pa_type.num_fields):
            field_meta = metadata_from_arrow_type(pa_type.field(i).type, pa_type.field(i).name)
            child_meta.append(field_meta)
        metadata = plc.interop.ColumnMetadata(
            name,
            # libcudf does not store field names, so just match pyarrow's.
            child_meta
        )
    return metadata

Then there are no errors.

I debugged this by:

  • Compile to_arrow.cu with -g -O0
  • run valgrind --vgdb=yes python bug.py
  • Attach to the process with gdb via the instructions and spelunk a bit when I have an error.

@lithomas1
Copy link
Contributor Author

Thanks a bunch for the debugging!

This is because the to_arrow interop expects, if the column_metadata for the children of a ListColumn is not empty, that you provide metadata for both the offsets column and the "values" column.

This makes sense, I was thinking in terms of pyarrow ListType which doesn't have this.
(This could be better documented, though, probably on the interop page - which I now realize isn't getting built in the docs.)

I'll leave this issue open, though, since I don't think the code should segfault.
(For structs, a nice runtime error is raised).

@wence-
Copy link
Contributor

wence- commented Jun 26, 2024

Yeah, we can do something like this:

diff --git a/cpp/src/interop/to_arrow.cu b/cpp/src/interop/to_arrow.cu
index 2b3aa2f08f..0aed1f6612 100644
--- a/cpp/src/interop/to_arrow.cu
+++ b/cpp/src/interop/to_arrow.cu
@@ -374,6 +374,10 @@ std::shared_ptr<arrow::Array> dispatch_to_arrow::operator()<cudf::list_view>(
   column_view input_view = (tmp_column != nullptr) ? tmp_column->view() : input;
   auto children_meta =
     metadata.children_meta.empty() ? std::vector<column_metadata>{{}, {}} : metadata.children_meta;
+
+  CUDF_EXPECTS(metadata.children_meta.size() == static_cast<std::size_t>(input_view.num_children()),
+               "Number of field names and number of children doesn't match\n");
+
   auto child_arrays = fetch_child_array(input_view, children_meta, ar_mr, stream);
   if (child_arrays.empty()) {
     return std::make_shared<arrow::ListArray>(arrow::list(arrow::null()), 0, nullptr, nullptr);

However, the bigger question is what the preferred approach here is. It seems like the principle of least surprise would lead one to think, as you did, that for a list column one should provide a single metadata entry corresponding for the "element type" that's interior to the list. After all, the offsets are an implementation detail. But I don't know if changing those semantics would break any other usage.

@lithomas1
Copy link
Contributor Author

However, the bigger question is what the preferred approach here is. It seems like the principle of least surprise would lead one to think, as you did, that for a list column one should provide a single metadata entry corresponding for the "element type" that's interior to the list. After all, the offsets are an implementation detail. But I don't know if changing those semantics would break any other usage.

I haven't looked at libcudf internals, but I'd think that offset information is exposed in e.g. table_metadata which is returned in the readers.

On the pylibcudf side, we could hide this complexity by exposing the metadata_from_arrow_type helper you fixed for me here, as I think libcudf types map 1-1 to Arrow types.
(Or, we could directly accept a pyarrow schema instead and call metadata_from_arrow_type internally.)

P.S.
Me and @vyasr talked about how interop would look like in the future, and I think for columns/scalars we were leaning towards using the buffer protocol/arrow c data interface/cuda array interface to generically support interop. So for those types at the very least, you could just do np.array(plc column), etc. without dealing with interop.
(we were planning to bring it up to the group on Monday)

This doesn't work for tables, unfortunately, since table metadata is not stored with libcudf table.

@vyasr
Copy link
Contributor

vyasr commented Jun 26, 2024

My general approach to pylibcudf to this point has been to keep it as a minimal, faithful export of libcudf algorithms to Python without adding much in the way of syntactic sugar to improve the API. I do think that work is worth doing, I just haven't prioritized it since my assumption is that internal usage of pylibcudf (inside cuDF classic and cudf.polars) is going to dwarf any external usage until we have a proper standalone package anyway. As a result there are a lot of APIs like this that have very sharp edges. I'm not sure how much effort we should invest into improving them just yet.

That said, in this particular case since we do a lot of Arrow interop in testing (especially interactively) it's probably worth making developers lives easier if possible. The question then is, do we want to make users create ColumnMetadata objects, or should we support something more ergonomic? I think #15130 captures the analogous questions for groupby-agg APIs and maybe is instructive as to how I'm thinking about a good API there. Do you think similar ideas would apply here?

@wence-
Copy link
Contributor

wence- commented Jun 27, 2024

For testing at the pylibcudf level, do we care about the metadata attached to arrow objects at all? libcudf doesn't care about metadata. So either pylibcudf also doesn't care. Or pylibcudf does care, but then it needs a principled way of attaching metadata to columns, I think.

@vyasr
Copy link
Contributor

vyasr commented Jun 27, 2024

My claim is that pylibcudf doesn't care, but users may want to produce arrow objects with metadata and so interop is the only place where there should be a principled way to do this right? I don't think there needs to be metadata attached to columns, but there should be a way to produce arrow arrays from columns that allows the attachment of metadata upon array creation.

@lithomas1
Copy link
Contributor Author

lithomas1 commented Jun 28, 2024

A solution could maybe be to make users that want to have the metadata follow the pylibcudf Table around use the TableWithMetadata class I added for I/O.
(pylibcudf operations would then either accept TableWithMetadata or Table).

This also solves the usability issue of users having to keep track of metadata by hand after an I/O operation.
(since after I/O users get a TableWithMetadata, but they have to keep track of column metadata manually, which is easy to get wrong, especially when the dtypes change after operations)

Then, we could make to_arrow on a pylibcudf Table give an Arrow table with default names.
(And for users that want their own names, they can construct a pylibcudf TableWithMetadata from the Table and their own column names).

@lithomas1
Copy link
Contributor Author

Yeah, we can do something like this:

diff --git a/cpp/src/interop/to_arrow.cu b/cpp/src/interop/to_arrow.cu
index 2b3aa2f08f..0aed1f6612 100644
--- a/cpp/src/interop/to_arrow.cu
+++ b/cpp/src/interop/to_arrow.cu
@@ -374,6 +374,10 @@ std::shared_ptr<arrow::Array> dispatch_to_arrow::operator()<cudf::list_view>(
   column_view input_view = (tmp_column != nullptr) ? tmp_column->view() : input;
   auto children_meta =
     metadata.children_meta.empty() ? std::vector<column_metadata>{{}, {}} : metadata.children_meta;
+
+  CUDF_EXPECTS(metadata.children_meta.size() == static_cast<std::size_t>(input_view.num_children()),
+               "Number of field names and number of children doesn't match\n");
+
   auto child_arrays = fetch_child_array(input_view, children_meta, ar_mr, stream);
   if (child_arrays.empty()) {
     return std::make_shared<arrow::ListArray>(arrow::list(arrow::null()), 0, nullptr, nullptr);

However, the bigger question is what the preferred approach here is. It seems like the principle of least surprise would lead one to think, as you did, that for a list column one should provide a single metadata entry corresponding for the "element type" that's interior to the list. After all, the offsets are an implementation detail. But I don't know if changing those semantics would break any other usage.

@wence- Were you going to submit a patch for this?

wence- added a commit to wence-/cudf that referenced this issue Jul 4, 2024
When converting a list column to arrow with metadata, one must provide
metadata information for both the offset and value columns, or none at
all. This is not completely obvious (perhaps we only need the metadata
for the inner value column), so explicitly assert this case.

- Closes rapidsai#16069
rapids-bot bot pushed a commit that referenced this issue Jul 12, 2024
When converting a list column to arrow with metadata, one must provide metadata information for both the offset and value columns, or none at all. This is not completely obvious (perhaps we only need the metadata for the inner value column), so explicitly assert this case.

- Closes #16069

Authors:
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - David Wendt (https://github.com/davidwendt)

URL: #16198
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. pylibcudf Issues specific to the pylibcudf package
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

3 participants