Lazy validation raises misleading SchemaError when columns are missing #1732

Open · benlee1284 opened this issue Jul 6, 2024 · 1 comment
Labels: bug
Describe the bug

When a column required by the schema is missing, lazy validation raises a misleading polars SchemaError from pandera's internal failure-case concatenation, instead of a pandera SchemaErrors summarising all validation failures.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.


Code Sample, a copy-pastable example

import polars as pl
import pandera.polars as pa

schema = pa.DataFrameSchema(
    {
        "a": pa.Column(pl.Int32),
        "b": pa.Column(pl.Int32),
    }
)

# "b" is missing entirely, and "a" is inferred as Int64 rather than Int32
df = pl.DataFrame({"a": [1, 2, 3]})

schema.validate(df, lazy=True)
# SchemaError: type String is incompatible with expected type Null

Expected behavior

Expected a pandera SchemaErrors (rather than a polars SchemaError) detailing the missing column and any other validation failures in the DataFrame.

Desktop (please complete the following information):

  • OS: macOS Sonoma Version 14.5
  • Browser: Chrome
  • Version: 0.19.3
  • Python Version: 3.11


Additional context

It looks like the SchemaErrors class attempts to concatenate the DataFrames representing the different failure cases into a single DataFrame. In this case, these are the two DataFrames it tries to concat:

shape: (1, 6)
┌──────────────┬─────────────────┬────────┬────────────────────┬──────────────┬───────┐
│ failure_case ┆ schema_context  ┆ column ┆ check              ┆ check_number ┆ index │
│ ---          ┆ ---             ┆ ---    ┆ ---                ┆ ---          ┆ ---   │
│ str          ┆ str             ┆ null   ┆ str                ┆ i32          ┆ i32   │
╞══════════════╪═════════════════╪════════╪════════════════════╪══════════════╪═══════╡
│ b            ┆ DataFrameSchema ┆ null   ┆ column_in_datafram ┆ null         ┆ null  │
│              ┆                 ┆        ┆ e                  ┆              ┆       │
└──────────────┴─────────────────┴────────┴────────────────────┴──────────────┴───────┘
shape: (1, 6)
┌──────────────┬────────────────┬────────┬────────────────┬──────────────┬───────┐
│ failure_case ┆ schema_context ┆ column ┆ check          ┆ check_number ┆ index │
│ ---          ┆ ---            ┆ ---    ┆ ---            ┆ ---          ┆ ---   │
│ str          ┆ str            ┆ str    ┆ str            ┆ i32          ┆ i32   │
╞══════════════╪════════════════╪════════╪════════════════╪══════════════╪═══════╡
│ Int64        ┆ Column         ┆ a      ┆ dtype('Int32') ┆ null         ┆ null  │
└──────────────┴────────────────┴────────┴────────────────┴──────────────┴───────┘

It should be as simple as forcing the dtype of the `column` column to always be str, but I have no knowledge of the pandera codebase, so I have no idea whether it's actually as quick a fix as I think it is 😂

Full Traceback:

SchemaError                               Traceback (most recent call last)
Cell In[6], line 1
----> 1 schema.validate(df, lazy=True)

File ~/.local/share/virtualenvs/zeus-XZJKObwC/lib/python3.11/site-packages/pandera/api/polars/container.py:58, in DataFrameSchema.validate(self, check_obj, head, tail, sample, random_state, lazy, inplace)
     54     if is_dataframe:
     55         # if validating a polars DataFrame, use the global config setting
     56         check_obj = check_obj.lazy()
---> 58     output = self.get_backend(check_obj).validate(
     59         check_obj=check_obj,
     60         schema=self,
     61         head=head,
     62         tail=tail,
     63         sample=sample,
     64         random_state=random_state,
     65         lazy=lazy,
     66         inplace=inplace,
     67     )
     69 if is_dataframe:
     70     output = output.collect()

File ~/.local/share/virtualenvs/zeus-XZJKObwC/lib/python3.11/site-packages/pandera/backends/polars/container.py:122, in DataFrameSchemaBackend.validate(self, check_obj, schema, head, tail, sample, random_state, lazy, inplace)
    120         check_obj = self.drop_invalid_rows(check_obj, error_handler)
    121     else:
--> 122         raise SchemaErrors(
    123             schema=schema,
    124             schema_errors=error_handler.schema_errors,
    125             data=check_obj,
    126         )
    128 return check_obj

File ~/.local/share/virtualenvs/zeus-XZJKObwC/lib/python3.11/site-packages/pandera/errors.py:183, in SchemaErrors.__init__(self, schema, schema_errors, data)
    178 self.schema_errors = schema_errors
    179 self.data = data
    181 failure_cases_metadata = schema.get_backend(
    182     data
--> 183 ).failure_cases_metadata(schema.name, schema_errors)
    184 self.error_counts = failure_cases_metadata.error_counts
    185 self.failure_cases = failure_cases_metadata.failure_cases

File ~/.local/share/virtualenvs/zeus-XZJKObwC/lib/python3.11/site-packages/pandera/backends/polars/base.py:204, in PolarsSchemaBackend.failure_cases_metadata(self, schema_name, schema_errors)
    198         failure_cases_df = pl.DataFrame(scalar_failure_cases).cast(
    199             {"check_number": pl.Int32, "index": pl.Int32}
    200         )
    202     failure_case_collection.append(failure_cases_df)
--> 204 failure_cases = pl.concat(failure_case_collection)
    206 error_handler = ErrorHandler()
    207 error_handler.collect_errors(schema_errors)

File ~/.local/share/virtualenvs/zeus-XZJKObwC/lib/python3.11/site-packages/polars/functions/eager.py:187, in concat(items, how, rechunk, parallel)
    184     out = wrap_df(plr.concat_df(elems))
    185 elif how == "vertical_relaxed":
    186     out = wrap_ldf(
--> 187         plr.concat_lf(
    188             [df.lazy() for df in elems],
    189             rechunk=rechunk,
    190             parallel=parallel,
    191             to_supertypes=True,
    192         )
    193     ).collect(no_optimization=True)
    195 elif how == "diagonal":
    196     out = wrap_df(plr.concat_df_diagonal(elems))

SchemaError: type String is incompatible with expected type Null
benlee1284 added the bug label on Jul 6, 2024
benlee1284 (Author) commented on Jul 8, 2024

I was about to try to make a bugfix PR but I noticed the bug seems to be fixed on the main branch... although there's nothing in recent commits that would suggest a reason for it being fixed as far as I can see 🤔

@cosmicBboy Would it be possible to cut a bugfix release so I can see if it's definitely fixed? If not no worries
