Support for multi type (Unions) in schemas and validation #1152
Comments
@vianmixtkz Great writeup. This is something that would be great for Pandera to support.
Thanks @vianmixtkz this is an interesting use case: the way pandas handles mixed-type columns is to represent the data in an One thing we should clarify in the semantics of this feature is the following: we can interpret
Do we need special syntax to differentiate between these two cases, or is that something that we leave to the pandera type engine to handle? I.e.:
What I described here matches case 2. That is, in a given column, I'll have for example str on some rows and floats on other rows. With something like:

Case 1

class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    comment: Union[Series[str], Series[float]] = pa.Field()  # comment is either only str or only float in a given DataFrame

Case 2

class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)
    comment: Series[Union[str, float]] = pa.Field()  # comment is a column containing str on some rows and float on other rows

And yeah, I think the behavior you are describing is what users would expect.
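To make the two readings concrete, here is a minimal plain-Python sketch. It is not pandera's API: the helper names `is_single_type_column` and `is_mixed_type_column` are made up for illustration. Both take any iterable of values (for example, the contents of an object-dtype column) and a tuple of allowed types.

```python
from typing import Any, Iterable, Tuple


def is_single_type_column(values: Iterable[Any], allowed: Tuple[type, ...]) -> bool:
    """Case 1: the whole column holds one type, and that type is in `allowed`."""
    items = list(values)
    # True if at least one allowed type matches every value in the column.
    return any(all(isinstance(v, t) for v in items) for t in allowed)


def is_mixed_type_column(values: Iterable[Any], allowed: Tuple[type, ...]) -> bool:
    """Case 2: each value may independently be any of the allowed types."""
    return all(isinstance(v, allowed) for v in values)


only_str = ["ok", "late", "n/a"]
mixed = ["ok", 1.5, "n/a"]

print(is_single_type_column(only_str, (str, float)))  # column is uniformly str
print(is_single_type_column(mixed, (str, float)))     # str and float are mixed
print(is_mixed_type_column(mixed, (str, float)))      # every value is str or float
```

Under this sketch, every column that passes the case 1 check also passes the case 2 check, but not the other way around, which is why the two annotations need distinct semantics.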
fix: unionai-oss#1152
I would like pandera to support Union Type. That is, the validation of a Series/Column should allow multiple types.
1. Add a new PythonUnion type.
2. Add a new test for the new UnionType.
Signed-off-by: karajan1001 <[email protected]>
Just bumping this thread. Any consensus on how to proceed? It seems like #1227 is stale.
Revisiting this issue and thinking about it a little bit, here's another proposal for this issue:

import pandera as pa
from pandera.engines.pandas_engine import Object
from typing import Annotated, Union

class Model(pa.DataFrameModel):
    union_column: Union[str, float]  # the column data type must be either a str or float
    object_column: Object = pa.Field(dtype_kwargs={"allowable_types": [str, float]})
    # or use the annotated types
    object_column: Annotated[Object, [str, float]]

This syntax is less ambiguous as to what the actual type of the column is versus what the values within it are. However, it does require importing a special
I'm still open to the more ambiguous behavior where
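For the Annotated variant of the proposal, a library could in principle recover the allowed types from the annotation metadata using the standard `typing` helpers. The sketch below is an assumption about how that might look, not pandera's implementation: `Object` is a local stand-in class rather than the real pandera type, `Model` is a plain class, and `allowable_types` and `check_values` are hypothetical helper names.

```python
from typing import Annotated, Any, Iterable, List, Tuple, get_args, get_origin, get_type_hints


class Object:
    """Local stand-in for an object-dtype marker (assumption, not pandera's class)."""


class Model:
    # Proposed syntax: the metadata list names the value types allowed in the column.
    object_column: Annotated[Object, [str, float]]


def allowable_types(model: type, field: str) -> Tuple[type, ...]:
    """Return the types listed in the Annotated metadata of `field`, if any."""
    hint = get_type_hints(model, include_extras=True)[field]
    if get_origin(hint) is not Annotated:
        return ()
    _base, *metadata = get_args(hint)
    return tuple(metadata[0]) if metadata else ()


def check_values(values: Iterable[Any], types: Tuple[type, ...]) -> List[bool]:
    """Element-wise check: is each value an instance of one of `types`?"""
    return [isinstance(v, types) for v in values]


types = allowable_types(Model, "object_column")
print(types)
print(check_values(["a", 1.5, 3], types))
```

Keeping the allowed types in the annotation metadata leaves the column's declared type (`Object`) separate from the types of the values inside it, which is the distinction the proposal is after.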
Re: this proposal: #1152 (comment) Unfortunately
I'm not a fan of this case
Is your feature request related to a problem? Please describe.
I would like pandera to support Union Type. That is the validation of a Series/Column should allow multiple types.
Pydantic allows it.
Here is an example of my issue
Describe the solution you'd like
I think not allowing Unions is the desired behavior for now, but could you consider an option to allow it in the future?
Describe alternatives you've considered
Split the Union columns into multiple columns, one for each type, but this is not really something that I can control. Cf. next section.
Additional context
I have a valid use case for this. I am using pandas to handle CSVs where some columns contain hybrid data types.
I am using pandas for the preprocessing and pydantic for the validation, and I would like to use pandera to make this process (processing + validation) more robust.