Polars checks not being evaluated correctly #1662

Closed
2 tasks
mxblsdl opened this issue May 30, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@mxblsdl

mxblsdl commented May 30, 2024

Describe the bug
Column checks on polars LazyFrames do not register errors when they should. Values outside a defined range pass validation with no warnings or errors. The same data in a polars DataFrame does raise an error.

It looks like this was addressed in a recent PR but I am still seeing the bug in the 0.19.3 release.

  • I have checked that this issue has not already been reported.
    • The issue has been reported and a fix merged to main, but the bug persists in the most recent release
  • [x] I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample

# This code is taken from the examples page [here](https://pandera--1373.org.readthedocs.build/en/1373/polars.html)
# With values changed to be outside the defined range.

import pandera.polars as pa
import polars as pl


schema = pa.DataFrameSchema(
    {
        "state": pa.Column(str),
        "city": pa.Column(str),
        "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20)), # check is defined
    }
)


lf = pl.LazyFrame(
    {
        "state": ["FL", "FL", "FL", "CA", "CA", "CA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": [2, 12, 10, 16, 20, 180], # 2 and 180 fall outside the defined range
    }
)
print(schema.validate(lf).collect()) # no errors are raised

Expected behavior

I would expect a pandera.errors.SchemaError to be raised. Note that the polars.DataFrame version of this code does raise an error.

import pandera.polars as pa
import polars as pl


schema = pa.DataFrameSchema(
    {
        "state": pa.Column(str),
        "city": pa.Column(str),
        "price": pa.Column(int, pa.Check.in_range(min_value=5, max_value=20)),
    }
)


df = pl.DataFrame(
    {
        "state": ["FL", "FL", "FL", "CA", "CA", "CA"],
        "city": [
            "Orlando",
            "Miami",
            "Tampa",
            "San Francisco",
            "Los Angeles",
            "San Diego",
        ],
        "price": [2, 12, 10, 16, 20, 180],
    }
)
print(schema.validate(df)) # raises, because the data is already in memory

Desktop (please complete the following information):

  • OS: Windows 10
  • Browser: Chrome
  • Version: pandera: 0.19.3, polars: 0.20.28

@mxblsdl added the bug (Something isn't working) label May 30, 2024

@kacper-sellforte


https://pandera.readthedocs.io/en/stable/polars.html#how-it-works

I think this behaviour is expected. pa.Check.in_range(min_value=5, max_value=20) cannot be run on a pl.LazyFrame object because it requires reading the data.

@mxblsdl
Author

mxblsdl commented Jun 17, 2024

So are checks never assessed for LazyFrame objects?

I feel like the documentation should make this more explicit or a warning should be issued. The top example comes directly from Pandera documentation and having a check that is never assessed creates a false sense of coverage.

@kacper-sellforte

Checks are assessed for LazyFrame objects, but only those that don't require the data to be in memory are evaluated. Most importantly, that means data types.

@cosmicBboy
Collaborator

This is expected behavior @mxblsdl.

> I feel like the documentation should make this more explicit

I believe it already does. See https://pandera.readthedocs.io/en/stable/polars.html#how-it-works, which @kacper-sellforte linked above.

> or a warning should be issued

This is also a good idea. I think a better logging experience here would be helpful. Would you mind opening up a separate issue for this request?

The correct way to support this would be for polars to have a first-class expression that asserts whether a column contains any False values, in which case pandera could catch the error lazily when the LazyFrame is evaluated. I opened up an issue in the polars project: pola-rs/polars#16120

@cosmicBboy
Collaborator

Also see https://pandera.readthedocs.io/en/stable/polars.html#data-level-validation-with-lazyframes. You can set the environment variable with export PANDERA_VALIDATION_DEPTH=SCHEMA_AND_DATA, and pandera will make a LazyFrame.collect call under the hood, run the data-level checks, and convert the result back into a LazyFrame.

@mxblsdl
Author

mxblsdl commented Jul 16, 2024

Okay, thank you for taking a look at this. I guess I was just confused about the limits of LazyFrame evaluation. I will experiment with the env variable mentioned above and close the issue.

@mxblsdl mxblsdl closed this as completed Jul 16, 2024