[BUG] Segfault in pylibcudf to_arrow interop when passing nested list and metadata #16069
Comments
This is because the If you use the following:
Then there are no errors. I debugged this by:
|
Thanks a bunch for the debugging!
This makes sense, I was thinking in terms of pyarrow ListType which doesn't have this. I'll leave this issue open, though, since I don't think the code should segfault. |
Yeah, we can do something like this:
diff --git a/cpp/src/interop/to_arrow.cu b/cpp/src/interop/to_arrow.cu
index 2b3aa2f08f..0aed1f6612 100644
--- a/cpp/src/interop/to_arrow.cu
+++ b/cpp/src/interop/to_arrow.cu
@@ -374,6 +374,10 @@ std::shared_ptr<arrow::Array> dispatch_to_arrow::operator()<cudf::list_view>(
column_view input_view = (tmp_column != nullptr) ? tmp_column->view() : input;
auto children_meta =
metadata.children_meta.empty() ? std::vector<column_metadata>{{}, {}} : metadata.children_meta;
+
+ CUDF_EXPECTS(metadata.children_meta.size() == static_cast<std::size_t>(input_view.num_children()),
+ "Number of field names and number of children doesn't match\n");
+
auto child_arrays = fetch_child_array(input_view, children_meta, ar_mr, stream);
if (child_arrays.empty()) {
return std::make_shared<arrow::ListArray>(arrow::list(arrow::null()), 0, nullptr, nullptr);
However, the bigger question is what the preferred approach here is. It seems like the principle of least surprise would lead one to think, as you did, that for a list column one should provide a single metadata entry corresponding to the "element type" that's interior to the list. After all, the offsets are an implementation detail. But I don't know if changing those semantics would break any other usage. |
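The check proposed in the diff can be illustrated without a GPU. The following is a minimal pure-Python sketch, not libcudf or pylibcudf code: `ColumnMetadata` and `resolve_list_children_meta` are hypothetical names used only for illustration, and the sketch assumes a list column has exactly two children (offsets and values), as the `{{}, {}}` default in the diff suggests.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMetadata:
    """Hypothetical stand-in for a column metadata node: a name plus child metadata."""

    name: str = ""
    children_meta: list = field(default_factory=list)


def resolve_list_children_meta(metadata, num_children=2):
    """Return one metadata entry per child of a list column, or raise on a mismatch.

    Assumption: a list column has two children (offsets, values).
    """
    if not metadata.children_meta:
        # No child metadata supplied: fall back to one empty entry per child,
        # mirroring the `{{}, {}}` default in the diff.
        return [ColumnMetadata() for _ in range(num_children)]
    if len(metadata.children_meta) != num_children:
        # Fail loudly instead of indexing past the end of the metadata list.
        raise ValueError(
            f"expected {num_children} child metadata entries, "
            f"got {len(metadata.children_meta)}"
        )
    return metadata.children_meta


# Empty child metadata is accepted and replaced by defaults.
defaults = resolve_list_children_meta(ColumnMetadata("a"))

# A single child entry (only the "element" metadata) is rejected.
try:
    resolve_list_children_meta(ColumnMetadata("a", [ColumnMetadata("element")]))
    rejected = False
except ValueError:
    rejected = True
```

The point of the sketch is only that a count mismatch becomes an ordinary error rather than an out-of-bounds access; whether the real check should compare against the user-supplied metadata or the defaulted list is a detail for the actual patch.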
I haven't looked at libcudf internals, but I'd think that offset information is exposed in e.g. On the pylibcudf side, we could hide this complexity by exposing the P.S. This doesn't work for tables, unfortunately, since table metadata is not stored with libcudf table. |
My general approach to pylibcudf to this point has been to keep it as a minimal, faithful export of libcudf algorithms to Python without adding much in the way of syntactic sugar to improve the API. I do think that work is worth doing, I just haven't prioritized it since my assumption is that internal usage of pylibcudf (inside cuDF classic and cudf.polars) is going to dwarf any external usage until we have a proper standalone package anyway. As a result there are a lot of APIs like this that have very sharp edges. I'm not sure how much effort we should invest into improving them just yet. That said, in this particular case, since we do a lot of Arrow interop in testing (especially interactively), it's probably worth making developers' lives easier if possible. The question then is, do we want to make users create |
For testing at the pylibcudf level, do we care about the metadata attached to arrow objects at all? libcudf doesn't care about metadata. So either pylibcudf also doesn't care, or pylibcudf does care, in which case it needs a principled way of attaching metadata to columns, I think. |
My claim is that pylibcudf doesn't care, but users may want to produce arrow objects with metadata, so interop is the only place where there should be a principled way to do this, right? I don't think there needs to be metadata attached to columns, but there should be a way to produce arrow arrays from columns that allows the attachment of metadata upon array creation. |
A solution could maybe be to make users that want to have the metadata follow the pylibcudf Table around use the This also solves the usability issue of users having to keep track of metadata by hand after an I/O operation. Then, we could make |
@wence- Were you going to submit a patch for this? |
When converting a list column to arrow with metadata, one must provide metadata information for both the offset and value columns, or none at all. This is not completely obvious (perhaps we only need the metadata for the inner value column), so explicitly assert this case.
- Closes #16069
Authors:
- Lawrence Mitchell (https://github.com/wence-)
Approvers:
- MithunR (https://github.com/mythrocks)
- David Wendt (https://github.com/davidwendt)
URL: #16198
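To make the "both the offset and value columns" requirement concrete, here is a small pure-Python sketch of how the metadata tree for a nested list might be built. It does not use pylibcudf: `ColumnMetadata` and `wrap_list` are hypothetical helpers, and the names "offsets" and "element" are placeholders chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnMetadata:
    """Hypothetical stand-in for a column metadata node: a name plus child metadata."""

    name: str = ""
    children_meta: list = field(default_factory=list)


def wrap_list(name, element):
    """Build metadata for a list column from metadata for its element type.

    Assumption: child 0 describes the offsets and child 1 the values,
    so the caller only has to think about the element.
    """
    return ColumnMetadata(name, [ColumnMetadata("offsets"), element])


# Metadata for a list<list<T>> column named "a": each list level
# contributes its own (offsets, values) pair.
leaf = ColumnMetadata("element")
inner = wrap_list("element", leaf)
outer = wrap_list("a", inner)
```

A helper along these lines would let a caller supply only the element metadata, which is the shape the discussion above describes as least surprising, while still producing the two-entry form that the conversion expects.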
Describe the bug
plc.interop.to_arrow is segfaulting when passed (possibly invalid) metadata.
Steps/Code to reproduce bug
This code will segfault or raise MemoryError.
Expected behavior
No segfault. If the metadata being passed is wrong, an error should be raised (like in the struct case).
Additional context
The error seems to occur in the C++ arrow conversion code (but still could be a bug in either pylibcudf or libcudf), so I'm labelling as both a libcudf and pylibcudf issue.