Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support Polars expression calculating lengths / count from a parquet file #16180

Open
beckernick opened this issue Jul 2, 2024 · 1 comment
Labels
cudf.polars Issues specific to cudf.polars feature request New feature or request

Comments

@beckernick
Copy link
Member

We can't currently do a read + length/count operation on a parquet file.

import polars as pl
from functools import partial
from cudf_polars.callback import execute_with_cudf
import numpy as np

use_cudf = partial(execute_with_cudf, raise_on_fail=True)

ldf = pl.DataFrame({
    "date": ['2015-09-11', '2017-02-08', '2015-08-01', '2019-03-16', '2015-05-15'],
    "val": [1, 2, 3, 4, 5]
}).lazy()

ldf.sink_parquet("test.parquet")

print(ldf.select(pl.len()).collect())
print(ldf.select(pl.len()).collect(post_opt_callback=use_cudf))
print(pl.scan_parquet("test.parquet").select(pl.len()).collect(post_opt_callback=use_cudf))
shape: (1, 1)
┌─────┐
│ len │
│ --- │
│ u32 │
╞═════╡
│ 5   │
└─────┘
shape: (1, 1)
┌─────┐
│ len │
│ --- │
│ u32 │
╞═════╡
│ 5   │
└─────┘
---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[55], line 17
     15 print(ldf.select(pl.len()).collect())
     16 print(ldf.select(pl.len()).collect(post_opt_callback=use_cudf))
---> 17 print(pl.scan_parquet("test.parquet").select(pl.len()).collect(post_opt_callback=use_cudf))

File [/raid/nicholasb/miniconda3/envs/all_cuda-122_arch-x86_64/lib/python3.11/site-packages/polars/lazyframe/frame.py:1942](http://10.117.23.184:8882/lab/tree/raid/nicholasb/raid/nicholasb/miniconda3/envs/all_cuda-122_arch-x86_64/lib/python3.11/site-packages/polars/lazyframe/frame.py#line=1941), in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, cluster_with_columns, no_optimization, streaming, background, _eager, **_kwargs)
   1939 # Only for testing purposes atm.
   1940 callback = _kwargs.get("post_opt_callback")
-> 1942 return wrap_df(ldf.collect(callback))

ComputeError: 'cuda' conversion failed: NotImplementedError: function count

Perhaps this happens because we don't have the FAST COUNT op implemented, which may be a specialized op to take advantage of parquet file metadata for things like row counts?

print(ldf.select(pl.len()).explain())
print(pl.scan_parquet("test.parquet").select(pl.len()).explain())
 SELECT [len()] FROM
  DF ["date", "val"]; PROJECT 1/2 COLUMNS; SELECTION: None
FAST COUNT(*)
  DF []; PROJECT */0 COLUMNS; SELECTION: None
@beckernick beckernick added the feature request New feature or request label Jul 2, 2024
@mroeschke mroeschke added the cudf.polars Issues specific to cudf.polars label Jul 2, 2024
@beckernick beckernick changed the title [FEA] Support expression calculating lengths / count from a parquet file (polars) [FEA] Support Polars expression calculating lengths / count from a parquet file Jul 2, 2024
@lithomas1
Copy link
Contributor

Looks like needs to be exposed on polars side as well as our side (so this won't work out of the box with polars 1.0).

https://github.com/pola-rs/polars/blob/09c98c58d7806d587deb31e53c47e94565bbafc8/py-polars/src/lazyframe/visitor/nodes.rs#L572-L576

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf.polars Issues specific to cudf.polars feature request New feature or request
Projects
Status: In Progress
Development

No branches or pull requests

3 participants