Implement basic validation backend for Ibis tables #1451

deepyaman · 2023-12-20T17:29:32Z

Taking a stab at #1105.

~~No need to review, still very much a WIP. Working on implementing the backend classes now.~~

Initial goal is to get this example running:

>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
>>>
>>> df = pd.DataFrame({
...     "probability": [0.1, 0.4, 0.52, 0.23, 0.8, 0.76],
...     "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>>
>>> schema_withchecks = pa.DataFrameSchema({
...     "probability": pa.Column(
...         float, pa.Check(lambda s: (s >= 0) & (s <= 1))),
...
...     # check that the "category" column contains a few discrete
...     # values, and the majority of the entries are dogs.
...     "category": pa.Column(
...         str, [
...             pa.Check(lambda s: s.isin(["dog", "cat", "duck"])),
...             pa.Check(lambda s: (s == "dog").mean() > 0.5),
...         ]),
... })
>>>
>>> schema_withchecks.validate(t)[["probability", "category"]]
   probability category
0         0.10      dog
1         0.40      dog
2         0.52      cat
3         0.23     duck
4         0.80      dog
5         0.76      dog

See #1451 (comment) for the current state.

codecov · 2023-12-20T17:36:48Z

Codecov Report

Attention: Patch coverage is 78.35052% with 63 lines in your changes missing coverage. Please review.

Project coverage is 83.42%. Comparing base (4df61da) to head (6e768a2).
Report is 119 commits behind head on ibis-dev.

| Files | Patch % | Lines |
|---|---|---|
| pandera/backends/ibis/base.py | 29.26% | 29 Missing ⚠️ |
| pandera/api/ibis/model.py | 73.33% | 12 Missing ⚠️ |
| pandera/backends/ibis/container.py | 78.26% | 10 Missing ⚠️ |
| pandera/engines/ibis_engine.py | 80.00% | 9 Missing ⚠️ |
| pandera/backends/ibis/components.py | 93.33% | 3 Missing ⚠️ |

Additional details and impacted files

@@              Coverage Diff              @@
##           ibis-dev    #1451       +/-   ##
=============================================
- Coverage     94.29%   83.42%   -10.87%     
=============================================
  Files            91      127       +36     
  Lines          7024     9117     +2093     
=============================================
+ Hits           6623     7606      +983     
- Misses          401     1511     +1110

cosmicBboy · 2023-12-22T15:14:37Z

This is a great start, @deepyaman! I just created an ibis-dev branch that we can merge all of our work into: https://github.com/unionai-oss/pandera/tree/ibis-dev

deepyaman · 2024-01-05T00:10:52Z

@cosmicBboy Happy belated New Year! Hope you enjoyed the holidays.

I made some (admittedly slow) progress on the Ibis backend, and I'd be happy to get a review to make sure things are on the right track. I've been learning a lot about how Pandera works.

This is very incomplete, but I think I've implemented a happy path and an unhappy path that work:

Happy path

>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
/opt/miniconda3/envs/pandera-dev/lib/python3.11/site-packages/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
  warnings.warn(
>>> 
>>> df = pd.DataFrame({
...     "probability": [0.1, 0.4, 0.52, 0.23, 0.8, 0.76],
...     "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>> schema_withchecks = pa.DataFrameSchema({"probability": pa.Column(float)})
>>> schema_withchecks.validate(t)[["probability", "category"]]
r0 := InMemoryTable
  data:
    PandasDataFrameProxy:
         probability category
      0         0.10      dog
      1         0.40      dog
      2         0.52      cat
      3         0.23     duck
      4         0.80      dog
      5         0.76      dog

Selection[r0]
  selections:
    probability: r0.probability
    category:    r0.category
>>>

Unhappy path

>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
/opt/miniconda3/envs/pandera-dev/lib/python3.11/site-packages/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
  warnings.warn(
>>> 
>>> df = pd.DataFrame({
...     "probability": [1, 4, 52, 23, 8, 76],
...     "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>> schema_withchecks = pa.DataFrameSchema({"probability": pa.Column(float)})
>>> schema_withchecks.validate(t)[["probability", "category"]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/api/ibis/container.py", line 80, in validate
    return self.get_backend(check_obj).validate(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/container.py", line 72, in validate
    error_handler.collect_error(
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/error_handlers.py", line 38, in collect_error
    raise schema_error from original_exc
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/container.py", line 103, in run_schema_component_checks
    result = schema_component.validate(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/api/pandas/components.py", line 169, in validate
    return self.get_backend(check_obj).validate(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/components.py", line 69, in validate
    error_handler.collect_error(  # Why indent (unlike in container.py)?
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/error_handlers.py", line 38, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: expected column 'probability' to have type float64, got int64
>>>

As a next step (if this looks good), I can probably work on writing tests for this functionality, and then implement additional checks (with the goal of getting the example in the PR description working first). Probably start with getting the value check on the same column (pa.Check(lambda s: (s >= 0) & (s <= 1))) working.
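
For readers who want to see the two behaviors side by side without installing Ibis, here is a minimal sketch in plain Python. It models a table as a dict of columns and shows a dtype check that fails fast, followed by the element-wise range check mentioned as the next goal. `check_dtype` and `check_values` are hypothetical helpers, not pandera or Ibis API, and the error message only imitates the one in the traceback above.

```python
from typing import Callable, Dict, List


class SchemaError(Exception):
    """Stand-in for pandera.errors.SchemaError (illustration only)."""


def check_dtype(table: Dict[str, List], column: str, expected: type) -> None:
    """Raise SchemaError if any value in `column` is not exactly `expected`."""
    for value in table[column]:
        # Compare exact types so that an int column is not accepted as float.
        if type(value) is not expected:
            raise SchemaError(
                f"expected column {column!r} to have type "
                f"{expected.__name__}, got {type(value).__name__}"
            )


def check_values(
    table: Dict[str, List], column: str, predicate: Callable[[object], bool]
) -> List[int]:
    """Return the row indices whose value fails `predicate`."""
    return [i for i, v in enumerate(table[column]) if not predicate(v)]


happy = {"probability": [0.1, 0.4, 0.52, 0.23, 0.8, 0.76]}
unhappy = {"probability": [1, 4, 52, 23, 8, 76]}

# Happy path: the dtype check passes, and no row violates 0 <= p <= 1.
check_dtype(happy, "probability", float)
failures = check_values(happy, "probability", lambda p: 0 <= p <= 1)

# Unhappy path: the integer column is rejected by the dtype check.
try:
    check_dtype(unhappy, "probability", float)
    dtype_error = None
except SchemaError as exc:
    dtype_error = str(exc)
```

In a real backend the value check would presumably be built as an Ibis expression and evaluated by the underlying engine rather than looped over in Python; the sketch only fixes the intended semantics.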

deepyaman · 2024-01-05T02:09:16Z

pandera/backends/ibis/components.py

+                        check_output=result.check_output,
+                        reason_code=result.reason_code,
+                    )
+                    error_handler.collect_error(  # Why indent (unlike in container.py)?


@cosmicBboy This is another thing I don't understand yet: in the case where result.schema_error is not None, would the error handler not be triggered in components.py? But in a very similar case, it would be triggered in container.py?

cosmicBboy (May 15, 2024): that looks like a bug!
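
The concern is easier to see in a stripped-down model. The sketch below is not pandera's implementation. The traceback earlier in this thread does show `collect_error` re-raising with `raise schema_error from original_exc`; the `lazy` flag, the `run_checks` helper, and the `(passed, message)` result shape are assumptions made purely for illustration.

```python
from typing import List, Optional, Tuple


class SchemaError(Exception):
    """Stand-in for pandera.errors.SchemaError (illustration only)."""


class ErrorHandler:
    """Hypothetical, simplified error handler."""

    def __init__(self, lazy: bool = False) -> None:
        self.lazy = lazy
        self.collected: List[SchemaError] = []

    def collect_error(
        self, schema_error: SchemaError, original_exc: Optional[BaseException] = None
    ) -> None:
        if not self.lazy:
            # Eager mode: fail on the first error, as in the traceback.
            raise schema_error from original_exc
        # Lazy mode: remember the error and keep validating.
        self.collected.append(schema_error)


def run_checks(results: List[Tuple[bool, str]], handler: ErrorHandler) -> None:
    """Report every failed result to the handler."""
    for passed, message in results:
        if not passed:
            # The call sits directly under the failure condition. Nesting it
            # one level deeper, under some unrelated branch, would let
            # failures slip through unreported.
            handler.collect_error(SchemaError(message))


results = [(True, ""), (False, "dtype mismatch"), (False, "value out of range")]

lazy_handler = ErrorHandler(lazy=True)
run_checks(results, lazy_handler)
lazy_messages = [str(e) for e in lazy_handler.collected]

try:
    run_checks(results, ErrorHandler(lazy=False))
    eager_message = None
except SchemaError as exc:
    eager_message = str(exc)
```

With this shape, the indentation of the `collect_error` call decides whether a failed result is ever reported, which is why the same call behaving differently in two near-identical modules is worth a second look.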

deepyaman · 2024-01-05T02:09:50Z

pandera/engines/ibis_engine.py

+    type: Any = dataclasses.field(repr=False, init=False)
+    """Native Ibis dtype boxed by the data type."""
+
+    def __init__(self, dtype: Any):


Don't think I've written any code that hits this yet.

datajoely · 2024-02-05T10:26:25Z

I was searching to see if this exists, and what a delight to see your name here, @deepyaman. Thanks for kicking this off!

cosmicBboy · 2024-02-19T04:30:37Z

Thanks for this PR, @deepyaman, and thanks so much for all your work on kicking this effort off!

I'm still in the process of looking through all the changes in this PR, but the examples you provide here are in the right direction. If you can get some tests for these written, we can check this into the ibis-dev branch so we can get the initial foothold on this feature merged into the main repo.

Super excited to get this shipped! 🚀

deepyaman · 2024-02-23T15:34:17Z

@cosmicBboy Sounds great! I'm on holiday/unavailable until the end of the month, but I will prioritize adding the tests after that (and then continue making incremental progress)!

cosmicBboy · 2024-03-15T18:42:25Z

Cool, enjoy your time off! Btw the beta release for polars support is out https://github.com/unionai-oss/pandera/releases/tag/v0.19.0b0

Part of it comes with a pandera.api.dataframe module that generalizes some of the common functionality between pandas and polars. It currently only does this for DataFrameModel, but I'm planning on doing the same for DataFrameSchema.

You may want to leverage these classes in your implementation to reduce repetitive logic, depending on how similar/different it is from polars/pandas.

datajoely · 2024-05-15T12:20:34Z

Hey @cosmicBboy @deepyaman, I'd like to jump in and get this over the line. It's a little hard to follow where's best to get stuck in; do either of you have any recommendations?

cosmicBboy · 2024-05-15T19:38:40Z

Thanks @datajoely! I think @deepyaman may still be on leave, so maybe we wait on him to provide more context, but at a high level, there are parts that are fairly easy to parallelize. For example, for the core pandera functionality with the polars integration, we have the following modules:

- The api: these are the user-facing schema definitions: https://github.com/unionai-oss/pandera/tree/main/pandera/api/polars
  - container.py: DataFrame/LazyFrame
  - components.py: Column definition
  - model.py: the DataFrameModel class-based syntax
  - model_config.py: configurations for DataFrameModel
- The backend: this is the implementation of how to actually validate data: https://github.com/unionai-oss/pandera/tree/main/pandera/backends/polars
  - These generally have modules associated with the api
  - Additionally it has the checks.py backend for how to run the built-in or user-defined checks, and builtin_checks.py that provide the implementations for the built-in pandera checks.
  - Backend registration function: https://github.com/unionai-oss/pandera/blob/main/pandera/backends/polars/register.py
- The type engine: https://github.com/unionai-oss/pandera/blob/main/pandera/engines/polars_engine.py
  - This contains the pandera dtype translation layer from the dataframe library to dtypes that pandera can understand.
- Generic classes for python type annotations: https://github.com/unionai-oss/pandera/blob/main/pandera/typing/polars.py
- Finally, the library integration entrypoint: https://github.com/unionai-oss/pandera/blob/main/pandera/polars.py

Work in each of these sections is fairly parallelizable. To help with implementation, pandera also provides library-agnostic base classes for some of the common class definitions:

- API spec for generic dataframes: https://github.com/unionai-oss/pandera/tree/main/pandera/api/dataframe
- The user-facing class for Checks: https://github.com/unionai-oss/pandera/blob/main/pandera/api/checks.py

This PR already implements a bunch of the pieces above. Probably the best way to start is to run the happy path code path described here, and start poking around.
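
To make the split between the api, the backend, and the registration step more concrete, here is a minimal sketch of the dispatch pattern that the `self.get_backend(check_obj).validate(...)` frames in the traceback suggest. Every class and method below is hypothetical and heavily simplified; it is not pandera's actual class hierarchy, only an illustration of a user-facing schema that looks up a backend by the type of object being validated.

```python
from typing import Dict, List, Tuple, Type


class BaseBackend:
    """Hypothetical backend interface: knows how to validate one data type."""

    def validate(self, check_obj, schema):
        raise NotImplementedError


class BaseSchema:
    """Hypothetical user-facing schema that delegates to a registered backend."""

    # Shared registry: (schema class, data object type) -> backend class.
    _backends: Dict[Tuple[type, type], Type[BaseBackend]] = {}

    @classmethod
    def register_backend(cls, obj_type: type, backend: Type[BaseBackend]) -> None:
        cls._backends[(cls, obj_type)] = backend

    def get_backend(self, check_obj) -> BaseBackend:
        key = (type(self), type(check_obj))
        if key not in self._backends:
            raise TypeError(f"no backend registered for {type(check_obj).__name__}")
        return self._backends[key]()

    def validate(self, check_obj):
        return self.get_backend(check_obj).validate(check_obj, schema=self)


class FakeTable:
    """Stand-in for a table object such as an Ibis table expression."""

    def __init__(self, columns: List[str]) -> None:
        self.columns = columns


class TableSchema(BaseSchema):
    """The 'api' layer: only describes what is expected."""

    def __init__(self, required: List[str]) -> None:
        self.required = required


class TableBackend(BaseBackend):
    """The 'backend' layer: actually inspects the data."""

    def validate(self, check_obj, schema):
        missing = [c for c in schema.required if c not in check_obj.columns]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        return check_obj


# Registration step, analogous in spirit to a register.py module.
TableSchema.register_backend(FakeTable, TableBackend)

table = FakeTable(["probability", "category"])
validated = TableSchema(["probability"]).validate(table)
```

Because the schema never touches the data directly, each layer can be worked on independently, which is what makes the work parallelizable.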

deepyaman · 2024-05-16T03:03:07Z

@cosmicBboy Thanks! This is actually a very helpful breakdown. I also briefly chatted with @datajoely earlier today about this (should've updated the issue), but it seems he's taken a look at the Polars backend more recently, and will push up a branch with some things he'd been trying. As a first step, it may make sense to update this PR to leverage some of the generalized functionality more; since I wrote this last December, I referenced the polars-dev branch a lot, but it was a WIP and things must have changed. I just haven't gotten a chance to look again yet. :)

I will have some time to dedicate to this once I get the stuff I'm currently working on out the door, hopefully by later this month. 🤞 But also happy to try to unblock @datajoely if there is anything I can do sooner.

datajoely · 2024-05-16T07:55:07Z

Excellent - super helpful :)

datajoely · 2024-05-21T10:31:17Z

So I've raised #1651 in draft. It would be great to get some thoughts on whether it's smarter to mirror the quite complicated Polars approach or to continue with this basic approach.

deepyaman · 2024-06-19T19:25:25Z

FYI I am working on rebasing this to main today and leveraging the new constructs mentioned! Hope to push something up soon.

cosmicBboy · 2024-06-19T20:15:16Z

Amazing @deepyaman! let me know if you need any help

deepyaman · 2024-06-20T14:03:56Z

@cosmicBboy Thanks! Can you update unionai-oss:ibis-dev to main?

Also, would it be possible to set up ibis-dev to run the checks, at least for Ibis, if we're using that as the "main" while building out Ibis integration?

cosmicBboy · 2024-06-20T14:19:08Z

Amazing! I just updated the ibis-dev branch to catch up with main

deepyaman · 2024-06-20T14:32:18Z

Seems I hadn't fully rebased onto latest main before? In any case, done now, and the happy/unhappy paths from #1451 (comment) are still passing.

Let me add some more tests; other than that, I think I need to make sure Ibis is included in the requirements, and I was running into a few issues running mypy locally (getting a bunch of errors on other parts of the codebase), but should be almost ready to be merged. 🤞

I've also talked briefly to @datajoely. My suggestion was:

Maybe you can look at implementing the functionality corresponding to tests/polars/test_polars_dtypes.py, around coercion? I haven't really looked into it. Can start with the numeric types (I've only added a couple of numeric types so far).

I was planning to try adding a non-dtype check next myself. Once that machinery is there, I think we can also parallelize implementation of checks.

@cosmicBboy if you have a better suggestion for where to get started/don't think coercion is a good place, I think that would be very much welcome.

cosmicBboy · 2024-06-27T14:13:25Z

Thanks @deepyaman, I'll take a look at this PR next week.

> Shall we skip Ibis tests on 3.8?

Yes!

> Looks like Pandera still supports 3.8, despite following NEP-29 (which is fine)

Yes, I've been meaning to drop 3.8 support, maybe this happens when ibis-dev is merged onto main when it's ready for a beta release?

> Would it be possible to rebase-merge this to ibis-dev instead of squashing?

Yep, we can do this

cosmicBboy · 2024-07-13T20:58:53Z

Sorry for the delay on reviewing this @deepyaman, will do so this coming week

cosmicBboy · 2024-07-13T21:38:58Z

@deepyaman I just updated the ibis-dev branch to main (@c895dc4), would you mind regenerating the requirements files with make nox-requirements?

cosmicBboy · 2024-07-13T21:40:35Z

pandera/engines/ibis_engine.py

+    type: Any = dataclasses.field(repr=False, init=False)
+    """Native Ibis dtype boxed by the data type."""
+
+    def __init__(self, dtype: Any):


https://github.com/unionai-oss/pandera/pull/1451/files#diff-6bd99d89ccace74b1b743c0d34ae391325b1b194a0e11c7dcbb478513d353c7aR96-R99 should translate into this datatype via the Engine class

cosmicBboy · 2024-07-13T21:41:16Z

pandera/engines/ibis_engine.py

+        try:
+            return engine.Engine.dtype(cls, data_type)
+        except TypeError:
+            np_dtype = data_type().to_numpy()


do ibis types have a to_numpy method?

deepyaman (Jul 16, 2024): They do! See https://github.com/ibis-project/ibis/blob/9.1.0/ibis/expr/datatypes/core.py#L260-L264
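
The fallback in the hunk above follows a common pattern: try a direct lookup first, and only on `TypeError` convert the unknown type to an intermediate representation that the engine already understands. The sketch below illustrates that pattern in plain Python. `Float64`, `Int64`, `REGISTRY`, `lookup`, and `resolve` are all hypothetical stand-ins; only the `data_type().to_numpy()` call shape is taken from the diff.

```python
from typing import Any, Dict


class Float64:
    """Hypothetical stand-in for a dtype class exposing to_numpy()."""

    def to_numpy(self) -> str:
        return "float64"


class Int64:
    """Hypothetical stand-in for a dtype class exposing to_numpy()."""

    def to_numpy(self) -> str:
        return "int64"


# Hypothetical registry of inputs the engine already understands.
REGISTRY: Dict[Any, str] = {
    float: "Float64",
    int: "Int64",
    "float64": "Float64",
    "int64": "Int64",
}


def lookup(data_type: Any) -> str:
    """Direct lookup; raises TypeError for anything not registered."""
    try:
        return REGISTRY[data_type]
    except (KeyError, TypeError):
        raise TypeError(f"data type {data_type!r} not understood") from None


def resolve(data_type: Any) -> str:
    """Try the direct lookup, then fall back to the NumPy equivalent."""
    try:
        return lookup(data_type)
    except TypeError:
        # Fallback: instantiate the dtype class and retry with its
        # NumPy-style name, mirroring data_type().to_numpy() in the diff.
        np_dtype = data_type().to_numpy()
        return lookup(np_dtype)


resolved = [resolve(float), resolve(Float64), resolve(Int64)]
```

The fallback only helps when the unknown type is a class that can be instantiated without arguments, which is worth keeping in mind for parameterized dtypes.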

deepyaman · 2024-07-16T16:46:22Z

@cosmicBboy Done, although CI seems to be choking on some installation issue.

cosmicBboy · 2024-07-16T21:29:00Z

thanks! yeah haven't diagnosed the CI issue yet, for some reason this happens fairly often... restarting CI eventually resolves it

deepyaman force-pushed the ibis-dev branch from 7d32137 to fe0e6b8 Compare December 22, 2023 01:54

cosmicBboy changed the base branch from main to ibis-dev December 22, 2023 15:14

deepyaman force-pushed the ibis-dev branch from a464a95 to 3d71d04 Compare December 30, 2023 22:34

deepyaman marked this pull request as ready for review January 5, 2024 02:35

datajoely mentioned this pull request May 21, 2024

Implement basic validation backend for Ibis tables (alternative) #1651

Closed

csubhodeep mentioned this pull request Jun 10, 2024

Support Ibis Backend #1105

Open

deepyaman force-pushed the ibis-dev branch from 863ded3 to a1dcf11 Compare June 20, 2024 13:59

deepyaman force-pushed the ibis-dev branch 2 times, most recently from ec1c6fc to 31a3fc9 Compare June 20, 2024 14:25

deepyaman mentioned this pull request Jun 20, 2024

[EPIC] Ibis and Pandera integration ibis-project/ibis#8999

Open

deepyaman changed the base branch from main to ibis-dev June 26, 2024 21:55

deepyaman force-pushed the ibis-dev branch from 9441745 to 067a70d Compare June 26, 2024 21:55

deepyaman changed the base branch from ibis-dev to main June 26, 2024 21:56

deepyaman force-pushed the ibis-dev branch from 067a70d to 132e803 Compare June 26, 2024 21:58

cosmicBboy changed the base branch from main to ibis-dev July 13, 2024 21:21

cosmicBboy changed the base branch from ibis-dev to main July 13, 2024 21:36

cosmicBboy changed the base branch from main to ibis-dev July 13, 2024 21:38

cosmicBboy reviewed Jul 13, 2024

deepyaman added 16 commits July 16, 2024 10:33, each signed off by Deepyaman Datta:

- 3936bf5 Add DataFrameModel, DataFrameSchema for ibis
- 25438e7 Implement a basic DataFrameSchema class for Ibis
- eae6c40 Refactor Ibis DataFrameSchema to extend pandas's
- 13c0d70 Add code for basic Column, and stub more modules
- 8e951eb Fix various Pylint and mypy violations in model.py
- 23d66be Add Ibis's parsing, validation, and error backends
- a88651b Implement stub to validate schema component checks
- 5ed2d6c Implement happy-path validation for floating types
- c1c5c87 Implement an unhappy path (got int64, not float64)
- 285812e Fix missing imports, and reformat using pre-commit
- 8196856 Inherit getter, only override setter, for .dtype
- b6a1d14 Add basic unit tests for Ibis data type validation
- 0ad03f3 Test model, implement column, and fix registration
- 6bde9e8 Resolve remaining type checking and linting issues
- e11ec00 Add ibis extra and regenerate requirements files
- 6e768a2 Re-enable Python equivalents for int and float

deepyaman force-pushed the ibis-dev branch from 132e803 to 6e768a2 Compare July 16, 2024 16:35
