Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Polars combined filter and head throws error with some datasets (row limit in scan error) #16172

Open
beckernick opened this issue Jul 2, 2024 · 3 comments
Assignees
Labels
1 - On Deck To be worked on next cudf.polars Issues specific to cudf.polars feature request New feature or request

Comments

@beckernick
Copy link
Member

The following filter + head operation works smoothly:

import polars as pl
from functools import partial
from cudf_polars.callback import execute_with_cudf
import numpy as np

ldf = pl.DataFrame({"a": ["a", "b","c", "c", "b"]}).lazy()
ldf.select(pl.col("a") == "b").head().collect(post_opt_callback=partial(execute_with_cudf, raise_on_fail=True)) # works as expected

It also works on a larger dataset:

import polars as pl
from functools import partial
from cudf_polars.callback import execute_with_cudf
import numpy as np

N = 100000000
K = 20

ldf = pl.DataFrame({
    "a": np.random.choice(K, N),
    "b": np.random.choice(K, N),
    "c": np.random.choice(K, N),
}).lazy()

ldf.select(pl.col("b") == 11).head().collect(post_opt_callback=partial(execute_with_cudf, raise_on_fail=True))

But, when I use a different dataset, I get an error (apologies for non-accessible dataset path):

transactions = pl.scan_csv("/raid/manass/cudf/data/half_transactions.csv")

transactions.select(pl.col("DAY") == 11).head().collect(
    post_opt_callback=partial(execute_with_cudf, raise_on_fail=True)
)
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[22], line 3
      1 transactions = pl.scan_csv("[/raid/manass/cudf/data/half_transactions.csv](http://10.117.23.184:8882/lab/tree/raid/raid/manass/cudf/data/half_transactions.csv)")
----> 3 transactions.select(pl.col("DAY") == 11).head().collect(
      4     post_opt_callback=partial(execute_with_cudf, raise_on_fail=True)
      5 )

File [/raid/nicholasb/miniconda3/envs/all_cuda-122_arch-x86_64/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942](http://10.117.23.184:8882/lab/tree/raid/raid/nicholasb/miniconda3/envs/all_cuda-122_arch-x86_64/lib/python3.11/site-packages/polars/lazyframe/frame.py#line=1941), in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

ComputeError: 'cuda' conversion failed: NotImplementedError: row limit in scan

But the filter on its own works:

transactions = pl.scan_csv("/raid/manass/cudf/data/half_transactions.csv")

transactions.select(pl.col("DAY") == 11).collect(
    post_opt_callback=partial(execute_with_cudf, raise_on_fail=True)
)
@beckernick beckernick added the bug Something isn't working label Jul 2, 2024
@lithomas1
Copy link
Contributor

Looks like from here

if self.file_options.n_rows is not None:
raise NotImplementedError("row limit in scan")

I'll take a look at this as part of my I/O work (this should be enable-able for csv, and for parquet, we need to do a little bit of work to expose the option from cython/pylibcudf)

@lithomas1 lithomas1 self-assigned this Jul 2, 2024
@lithomas1 lithomas1 added feature request New feature or request 1 - On Deck To be worked on next cudf.polars Issues specific to cudf.polars and removed bug Something isn't working labels Jul 2, 2024
@wence-
Copy link
Contributor

wence- commented Jul 2, 2024

I'll take a look at this as part of my I/O work (this should be enable-able for csv, and for parquet, we need to do a little bit of work to expose the option from cython/pylibcudf)

@lithomas1 If you think this will not be a day or two, can you open a PR that moves this notimplemented error into the __post_init__ method of Scan?

@lithomas1
Copy link
Contributor

I'll take a look at this as part of my I/O work (this should be enable-able for csv, and for parquet, we need to do a little bit of work to expose the option from cython/pylibcudf)

@lithomas1 If you think this will not be a day or two, can you open a PR that moves this notimplemented error into the __post_init__ method of Scan?

I should be able to get this in the next couple of days (it should be pretty easy to expose the bindings, but the full migration to pylibcudf is pretty far down the stack).

@lithomas1 lithomas1 assigned lithomas1 and unassigned lithomas1 Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
1 - On Deck To be worked on next cudf.polars Issues specific to cudf.polars feature request New feature or request
Projects
Status: In Progress
Development

No branches or pull requests

3 participants