Implement basic validation backend for Ibis tables #1451
base: ibis-dev
Conversation
Codecov Report — Attention: coverage decreased on this patch.

@@            Coverage Diff              @@
##           ibis-dev    #1451     +/-  ##
=============================================
- Coverage     94.29%   83.42%  -10.87%
=============================================
  Files            91      127      +36
  Lines          7024     9117    +2093
=============================================
+ Hits           6623     7606     +983
- Misses          401     1511    +1110
=============================================

☔ View full report in Codecov by Sentry.
This is a great start @deepyaman! I just created a …
@cosmicBboy Happy belated New Year! Hope you enjoyed the holidays. I made some (admittedly slow) progress on the Ibis backend, and I'd be happy to get a review to make sure things are on the right track. I've been learning a lot about how Pandera works. This is very incomplete, but I think I've implemented a happy path and an unhappy path that work:

Happy path:

>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
/opt/miniconda3/envs/pandera-dev/lib/python3.11/site-packages/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
warnings.warn(
>>>
>>> df = pd.DataFrame({
... "probability": [0.1, 0.4, 0.52, 0.23, 0.8, 0.76],
... "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>> schema_withchecks = pa.DataFrameSchema({"probability": pa.Column(float)})
>>> schema_withchecks.validate(t)[["probability", "category"]]
r0 := InMemoryTable
data:
PandasDataFrameProxy:
probability category
0 0.10 dog
1 0.40 dog
2 0.52 cat
3 0.23 duck
4 0.80 dog
5 0.76 dog
Selection[r0]
selections:
probability: r0.probability
category: r0.category
Unhappy path:

>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
/opt/miniconda3/envs/pandera-dev/lib/python3.11/site-packages/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
warnings.warn(
>>>
>>> df = pd.DataFrame({
... "probability": [1, 4, 52, 23, 8, 76],
... "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>> schema_withchecks = pa.DataFrameSchema({"probability": pa.Column(float)})
>>> schema_withchecks.validate(t)[["probability", "category"]]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/deepyaman/github/deepyaman/pandera/pandera/api/ibis/container.py", line 80, in validate
return self.get_backend(check_obj).validate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/container.py", line 72, in validate
error_handler.collect_error(
File "/Users/deepyaman/github/deepyaman/pandera/pandera/error_handlers.py", line 38, in collect_error
raise schema_error from original_exc
File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/container.py", line 103, in run_schema_component_checks
result = schema_component.validate(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/deepyaman/github/deepyaman/pandera/pandera/api/pandas/components.py", line 169, in validate
return self.get_backend(check_obj).validate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/components.py", line 69, in validate
error_handler.collect_error( # Why indent (unlike in container.py)?
File "/Users/deepyaman/github/deepyaman/pandera/pandera/error_handlers.py", line 38, in collect_error
raise schema_error from original_exc
pandera.errors.SchemaError: expected column 'probability' to have type float64, got int64
As a next step (if this looks good), I can probably work on writing tests for this functionality, and then implement additional checks (with the goal of getting the example in the PR description working first). Probably start with getting the value check on the same column (…
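To make the happy/unhappy paths above concrete without depending on pandera's internals, here is a library-agnostic sketch of the dtype check the backend performs: `validate` compares each declared column's dtype against the table's schema and raises on mismatch. All names (`SchemaError`, `Column`, `DataFrameSchema`) merely mirror pandera's; the implementation is illustrative only.

```python
class SchemaError(Exception):
    pass


class Column:
    """Toy column schema; only the expected dtype, for brevity."""

    def __init__(self, dtype: str):
        self.dtype = dtype


class DataFrameSchema:
    """Toy schema that checks each declared column's dtype by name."""

    def __init__(self, columns: dict):
        self.columns = columns

    def validate(self, table_schema: dict) -> dict:
        for name, column in self.columns.items():
            actual = table_schema.get(name)
            if actual != column.dtype:
                raise SchemaError(
                    f"expected column '{name}' to have type "
                    f"{column.dtype}, got {actual}"
                )
        return table_schema


schema = DataFrameSchema({"probability": Column("float64")})

# Happy path: dtypes match, so the table schema passes through unchanged.
print(schema.validate({"probability": "float64", "category": "string"}))

# Unhappy path: int64 where float64 is expected raises SchemaError.
try:
    schema.validate({"probability": "int64", "category": "string"})
except SchemaError as exc:
    print(exc)
```

Tests for the real backend would follow the same shape: assert the valid table round-trips through `validate`, and assert the invalid one raises `SchemaError` with the expected message.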
check_output=result.check_output,
reason_code=result.reason_code,
)
error_handler.collect_error(  # Why indent (unlike in container.py)?
@cosmicBboy This is another thing I don't understand yet; in the case where `result.schema_error is not None`, would the error handler not be triggered in `components.py`? But, in a very similar case, it would be triggered in `container.py`?
that looks like a bug!
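For readers following along, the indentation question matters because it determines whether `collect_error` runs only for failing results or for every result. A toy sketch of the pattern (the class names are stand-ins, not pandera's actual implementation):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CoreCheckResult:
    """Stand-in for a per-check result object (illustrative only)."""

    schema_error: Optional[Exception] = None


@dataclass
class ErrorHandler:
    """Collects errors when lazy; raises immediately otherwise."""

    lazy: bool = True
    collected: List[Exception] = field(default_factory=list)

    def collect_error(self, error: Exception) -> None:
        if not self.lazy:
            raise error
        self.collected.append(error)


def run_checks(results, handler):
    for result in results:
        if result.schema_error is not None:
            # Nested under the `if`: only genuine failures reach the
            # handler. One indentation level out, collect_error would
            # fire for every result, including passing ones -- which is
            # why the differing indentation between the two modules
            # looks suspicious.
            handler.collect_error(result.schema_error)


handler = ErrorHandler()
run_checks(
    [CoreCheckResult(), CoreCheckResult(schema_error=ValueError("bad dtype"))],
    handler,
)
print(len(handler.collected))  # 1: only the failing result is collected
```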
type: Any = dataclasses.field(repr=False, init=False)
"""Native Ibis dtype boxed by the data type."""

def __init__(self, dtype: Any):
Don't think I've written any code that hits this yet.
https://github.com/unionai-oss/pandera/pull/1451/files#diff-6bd99d89ccace74b1b743c0d34ae391325b1b194a0e11c7dcbb478513d353c7aR96-R99 should translate into this datatype via the `Engine` class.
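For context, the role of such an `Engine` class is to map a native dtype to its pandera-style `DataType` wrapper. A rough sketch of that registry pattern (hypothetical names keyed on dtype strings; pandera's real `Engine` metaclass is considerably more involved):

```python
from typing import Any, Dict


class Engine:
    """Toy dtype registry: resolves a native dtype name to a wrapper class."""

    _registry: Dict[str, type] = {}

    @classmethod
    def register(cls, native_name: str):
        def wrap(datatype_cls: type) -> type:
            cls._registry[native_name] = datatype_cls
            return datatype_cls

        return wrap

    @classmethod
    def dtype(cls, native_name: str) -> Any:
        try:
            return cls._registry[native_name]()
        except KeyError:
            # Mirrors the TypeError the backend code catches above.
            raise TypeError(
                f"no datatype registered for {native_name!r}"
            )


@Engine.register("float64")
class Float64:
    """Pandera-style wrapper around a native float64 dtype."""

    def __repr__(self) -> str:
        return "Float64"


print(Engine.dtype("float64"))  # resolves the native name to Float64
```

An unregistered name falls through to `TypeError`, which is exactly the branch the `try`/`except TypeError` fallback in the diff handles.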
I was searching to see if this exists, and what a delight to see your name here, @deepyaman. Thanks for kicking this off!
Thanks so much for all your work on kicking this effort off, @deepyaman! I'm still in the process of looking through all the changes in this PR, but the examples you provide here are in the right direction. If you can get some tests for these written, we can check this into the `ibis-dev` branch. Super excited to get this shipped! 🚀
@cosmicBboy Sounds great! I'm on holiday/unavailable until the end of the month, but I will prioritize adding the tests after that (and then continue making incremental progress)!
Cool, enjoy your time off! BTW, the beta release for polars support is out: https://github.com/unionai-oss/pandera/releases/tag/v0.19.0b0. Part of it comes with a set of library-agnostic base classes. You may want to leverage these classes in your implementation to reduce repetitive logic, depending on how similar/different it is from polars/pandas.
Hey @cosmicBboy @deepyaman, I'd like to jump in and help get this over the line. It's a little hard to follow where it's best to get stuck in; do either of you have any recommendations?
Thanks @datajoely! I think @deepyaman may still be on leave, so maybe we wait on him to provide more context, but at a high level, there are parts that are fairly easy to parallelize. For example, for the core pandera functionality with the polars integration, we have the following modules:
Work in each of these sections is fairly parallelizable. To help with implementation, pandera also provides library-agnostic base classes for some of the common class definitions:
This PR already implements a bunch of the pieces above. Probably the best way to start is to run the happy-path code described here and start poking around.
@cosmicBboy Thanks! This is actually a very helpful breakdown. I also briefly chatted with @datajoely earlier today about this (should've updated the issue), but it seems he's taken a look at the Polars backend more recently, and he will push up a branch with some things he'd been trying. As a first step, it may make sense to update this PR to leverage some of the generalized functionality more, since when I wrote this last December, I referenced the … I will have some time to dedicate to this once I get the stuff I'm currently working on out the door, hopefully later this month. 🤞 But I'm also happy to try to unblock @datajoely if there's anything I can do sooner.
Excellent, super helpful :)
So I've raised #1651 in draft; it would be great to get some thoughts on whether it's smarter to mirror the quite complicated Polars approach or to continue with this basic approach.
FYI I am working on rebasing this onto `ibis-dev`.
Amazing @deepyaman! Let me know if you need any help.
@cosmicBboy Thanks! Can you update … Also, would it be possible to set up …
Amazing! I just updated the …
(Force-pushed from ec1c6fc to 31a3fc9.)
Seems I hadn't fully rebased onto the latest `ibis-dev`. Let me add some more tests; other than that, I think I need to make sure Ibis is included in the requirements, and I was running into a few issues running …

I've also talked briefly to @datajoely. My suggestion was: …

@cosmicBboy, if you have a better suggestion for where to get started, or don't think coercion is a good place, that would be very much welcome.
Thanks @deepyaman, I'll take a look at this PR next week.

Yes!

Yes, I've been meaning to drop 3.8 support; maybe this happens when …

Yep, we can do this.
Sorry for the delay on reviewing this @deepyaman, will do so this coming week.
@deepyaman I just updated the …
try:
    return engine.Engine.dtype(cls, data_type)
except TypeError:
    np_dtype = data_type().to_numpy()
do ibis types have a `to_numpy` method?
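If it turns out some Ibis versions don't expose such a method, one defensive pattern is to probe for it and fall back to an explicit name mapping. This sketch uses hypothetical names (`to_numpy_dtype_name`, `FakeDtype`, a deliberately incomplete `_NUMPY_FALLBACK` table) and is not pandera or Ibis API:

```python
_NUMPY_FALLBACK = {
    # Illustrative subset; a real mapping would cover every Ibis dtype.
    "int64": "int64",
    "float64": "float64",
    "string": "object",
    "boolean": "bool",
}


def to_numpy_dtype_name(ibis_dtype) -> str:
    """Return a NumPy dtype name, preferring to_numpy() when it exists."""
    to_numpy = getattr(ibis_dtype, "to_numpy", None)
    if callable(to_numpy):
        # Use the native conversion when the dtype object provides one.
        return str(to_numpy())
    # Otherwise fall back to an explicit string-keyed mapping.
    return _NUMPY_FALLBACK[str(ibis_dtype)]


class FakeDtype:
    """Stand-in for an Ibis dtype object that lacks to_numpy."""

    def __str__(self) -> str:
        return "float64"


print(to_numpy_dtype_name(FakeDtype()))  # falls back to the mapping
```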
Signed-off-by: Deepyaman Datta <[email protected]>
@cosmicBboy Done, although CI seems to be choking on some installation issue.
Thanks! Yeah, haven't diagnosed the CI issue yet; for some reason this happens fairly often... restarting CI eventually resolves it.
Taking a stab at #1105.
No need to review, still very much a WIP. Working on implementing the backend classes now.

Initial goal is to get this example running:
See #1451 (comment) for the current state.