Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement basic validation backend for Ibis tables #1451

Open
wants to merge 16 commits into
base: ibis-dev
Choose a base branch
from

Conversation

deepyaman
Copy link
Contributor

@deepyaman deepyaman commented Dec 20, 2023

Taking a stab at #1105.

No need to review, still very much a WIP. Working on implementing the backend classes now.

Initial goal is to get this example running:

>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
>>>
>>> df = pd.DataFrame({
...     "probability": [0.1, 0.4, 0.52, 0.23, 0.8, 0.76],
...     "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>>
>>> schema_withchecks = pa.DataFrameSchema({
...     "probability": pa.Column(
...         float, pa.Check(lambda s: (s >= 0) & (s <= 1))),
...
...     # check that the "category" column contains a few discrete
...     # values, and the majority of the entries are dogs.
...     "category": pa.Column(
...         str, [
...             pa.Check(lambda s: s.isin(["dog", "cat", "duck"])),
...             pa.Check(lambda s: (s == "dog").mean() > 0.5),
...         ]),
... })
>>>
>>> schema_withchecks.validate(t)[["probability", "category"]]
   probability category
0         0.10      dog
1         0.40      dog
2         0.52      cat
3         0.23     duck
4         0.80      dog
5         0.76      dog

See #1451 (comment) for the current state.

Copy link

codecov bot commented Dec 20, 2023

Codecov Report

Attention: Patch coverage is 78.35052% with 63 lines in your changes missing coverage. Please review.

Project coverage is 83.42%. Comparing base (4df61da) to head (6e768a2).
Report is 119 commits behind head on ibis-dev.

Files Patch % Lines
pandera/backends/ibis/base.py 29.26% 29 Missing ⚠️
pandera/api/ibis/model.py 73.33% 12 Missing ⚠️
pandera/backends/ibis/container.py 78.26% 10 Missing ⚠️
pandera/engines/ibis_engine.py 80.00% 9 Missing ⚠️
pandera/backends/ibis/components.py 93.33% 3 Missing ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##           ibis-dev    #1451       +/-   ##
=============================================
- Coverage     94.29%   83.42%   -10.87%     
=============================================
  Files            91      127       +36     
  Lines          7024     9117     +2093     
=============================================
+ Hits           6623     7606      +983     
- Misses          401     1511     +1110     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cosmicBboy
Copy link
Collaborator

This is a great start @deepyaman ! I just created a ibis-dev branch that we can merge all of our work into: https://github.com/unionai-oss/pandera/tree/ibis-dev

@cosmicBboy cosmicBboy changed the base branch from main to ibis-dev December 22, 2023 15:14
@deepyaman
Copy link
Contributor Author

deepyaman commented Jan 5, 2024

@cosmicBboy Happy belated New Year! Hope you enjoyed the holidays.

I made some (admittedly-slow) progress on the Ibis backend, and I'd be happy to get a review to make sure things are on the right track. Been learning a lot about how Pandera works.

This is very incomplete, but I think I'm implemented a happy and unhappy path that work:

Happy path

>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
/opt/miniconda3/envs/pandera-dev/lib/python3.11/site-packages/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
  warnings.warn(
>>> 
>>> df = pd.DataFrame({
...     "probability": [0.1, 0.4, 0.52, 0.23, 0.8, 0.76],
...     "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>> schema_withchecks = pa.DataFrameSchema({"probability": pa.Column(float)})
>>> schema_withchecks.validate(t)[["probability", "category"]]
r0 := InMemoryTable
  data:
    PandasDataFrameProxy:
         probability category
      0         0.10      dog
      1         0.40      dog
      2         0.52      cat
      3         0.23     duck
      4         0.80      dog
      5         0.76      dog

Selection[r0]
  selections:
    probability: r0.probability
    category:    r0.category
>>> 

Unhappy path

>>> import ibis
>>> import pandas as pd
>>> import pandera.ibis as pa
/opt/miniconda3/envs/pandera-dev/lib/python3.11/site-packages/pyspark/pandas/__init__.py:50: UserWarning: 'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to set this environment variable to '1' in both driver and executor sides if you use pyarrow>=2.0.0. pandas-on-Spark will set it for you but it does not work if there is a Spark context already launched.
  warnings.warn(
>>> 
>>> df = pd.DataFrame({
...     "probability": [1, 4, 52, 23, 8, 76],
...     "category": ["dog", "dog", "cat", "duck", "dog", "dog"],
... })
>>> t = ibis.memtable(df, name="t")
>>> schema_withchecks = pa.DataFrameSchema({"probability": pa.Column(float)})
>>> schema_withchecks.validate(t)[["probability", "category"]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/api/ibis/container.py", line 80, in validate
    return self.get_backend(check_obj).validate(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/container.py", line 72, in validate
    error_handler.collect_error(
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/error_handlers.py", line 38, in collect_error
    raise schema_error from original_exc
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/container.py", line 103, in run_schema_component_checks
    result = schema_component.validate(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/api/pandas/components.py", line 169, in validate
    return self.get_backend(check_obj).validate(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/backends/ibis/components.py", line 69, in validate
    error_handler.collect_error(  # Why indent (unlike in container.py)?
  File "/Users/deepyaman/github/deepyaman/pandera/pandera/error_handlers.py", line 38, in collect_error
    raise schema_error from original_exc
pandera.errors.SchemaError: expected column 'probability' to have type float64, got int64
>>> 

As a next step (if this looks good), I can probably work on writing tests for this functionality, and then implement additional checks (with the goal of getting the example in the PR description working first). Probably start with getting the value check on the same column (pa.Check(lambda s: (s >= 0) & (s <= 1))) working.

check_output=result.check_output,
reason_code=result.reason_code,
)
error_handler.collect_error( # Why indent (unlike in container.py)?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cosmicBboy This is another thing I don't understand yet; in the case result.schema_error is not None, would the error handler not be triggered in component.py? But, in a very similar case, it would be triggered in container.py?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that looks like a bug!

type: Any = dataclasses.field(repr=False, init=False)
"""Native Ibis dtype boxed by the data type."""

def __init__(self, dtype: Any):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think I've written any code that hits this yet.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deepyaman deepyaman marked this pull request as ready for review January 5, 2024 02:35
@datajoely
Copy link

I was searching to see if this exists and what a delight to see your name here @deepyaman thanks for kicking this off!

@cosmicBboy
Copy link
Collaborator

cosmicBboy commented Feb 19, 2024

Thanks for this PR @deepyaman thanks so much for all your work on kicking this effort off!

I'm still in the process of looking through all the changes in this PR, but the examples you provide here is the right direction. If you can get some tests for these written, we can check this into the ibis-dev branch so we can get the initial foothold on this feature merged into the main repo.

Super excited to get this shipped! 🚀

@deepyaman
Copy link
Contributor Author

Thanks for this PR @deepyaman thanks so much for all your work on kicking this effort off!

I'm still in the process of looking through all the changes in this PR, but the examples you provide here is the right direction. If you can get some tests for these written, we can check this into the ibis-dev branch so we can get the initial foothold on this feature merged into the main repo.

Super excited to get this shipped! 🚀

@cosmicBboy Sounds great! I'm on holiday/unavailable until the end of the month, but I will prioritize adding the tests after that (and then continue making incremental process)!

@cosmicBboy
Copy link
Collaborator

Cool, enjoy your time off! Btw the beta release for polars support is out https://github.com/unionai-oss/pandera/releases/tag/v0.19.0b0

Part of it comes with a pandera.api.dataframe module that generalizes some of the common functionality between pandas and polars. It currently only does this for DataFrameModel, but I'm planning on doing the same for DataFrameSchema.

You may want to leverage these classes in your implementation to reduce repetitive logic, depending on how similar/different it is from polars/pandas.

@datajoely
Copy link

Hey @cosmicBboy @deepyaman I'd like to jump in and get this over the line. It's a little hard to follow where's best to get stuck in, do either of you have any recommendations?

@cosmicBboy
Copy link
Collaborator

cosmicBboy commented May 15, 2024

Thanks @datajoely! I think @deepyaman may still be on leave, so maybe we wait on him to provide more context, but at a high level, there are parts that are fairly easy to parallelize. For example, for the core pandera functionality with the polars integration, we have the following modules:

Work in each of these sections is fairly parallelizable. To help with implementation, pandera also provides library-agnostic base classes for some of the common class definitions:

This PR already implements a bunch of the pieces above. Probably the best way to start is to run the happy path code path described here, and start poking around.

@deepyaman
Copy link
Contributor Author

Thanks @datajoely! I think @deepyaman may still be on leave, so maybe we wait on him to provide more context, but at a high level, there are parts that are fairly easy to parallelize. For example, for the core pandera functionality with the polars integration, we have the following modules:

[...]

@cosmicBboy Thanks! This is actually a very helpful breakdown. I also briefly chatted with @datajoely earlier today about this (should've updated the issue), but it seems he's taken a look at the Polars backend more recently, and will push up a branch with some things he'd been trying. As a first step, it may make sense to update this PR to leverage some of the generalized functionality more; since I wrote this last December, I referenced the polars-dev branch a lot, but it was a WIP and things must have changed. I just haven't gotten a chance to look again yet. :)

I will have some time to dedicate to this once I get the stuff I'm currently working on out the door, hopefully by later this month. 🤞 But also happy to try to unblock @datajoely if anything is there I can do sooner.

@datajoely
Copy link

Excellent - super helpful :)

@datajoely
Copy link

So I've raised #1651 in draft - it would be great to get some thoughts whether it's smarter to mirror the quite complicated Polars approach or to continue with this basic approach.

@deepyaman
Copy link
Contributor Author

FYI I am working on rebasing this to main today and leveraging the new constructs mentioned! Hope to push something up soon.

@cosmicBboy
Copy link
Collaborator

Amazing @deepyaman! let me know if you need any help

@deepyaman
Copy link
Contributor Author

Amazing @deepyaman! let me know if you need any help

@cosmicBboy Thanks! Can you update unionai-oss:ibis-dev to main?

Also, would it be possible to set up ibis-dev to run the checks, at least for Ibis, if we're using that as the "main" while building out Ibis integration?

@cosmicBboy
Copy link
Collaborator

Amazing! I just updated the ibis-dev branch to catch up with main

@deepyaman deepyaman force-pushed the ibis-dev branch 2 times, most recently from ec1c6fc to 31a3fc9 Compare June 20, 2024 14:25
@deepyaman
Copy link
Contributor Author

Seems I hadn't fully rebased onto latest main before? In any case, done now, and the happy/unhappy paths from #1451 (comment) are still passing.

Let me add some more tests; other than that, I think I need to make sure Ibis is included in the requirements, and I was running into a few issues running mypy locally (getting a bunch of errors on other parts of the codebase), but should be almost ready to be merged. 🤞

I've also talked briefly to @datajoely. My suggestion was:

Maybe you can look at implementing the functionality corresponding to tests/polars/test_polars_dtypes.py, around coercion? I haven't really looked into it. Can start with the numeric types (I've only added a couple of numeric types so far).

I was planning to try adding a non-dtype check next myself. Once that machinery is there, I think can also parallelize implementation of checks.

@cosmicBboy if you have a better suggestion for where to get started/don't think coercion is a good place, I think that would be very much welcome.

@deepyaman deepyaman changed the base branch from main to ibis-dev June 26, 2024 21:55
@deepyaman deepyaman changed the base branch from ibis-dev to main June 26, 2024 21:56
@cosmicBboy
Copy link
Collaborator

cosmicBboy commented Jun 27, 2024

Thanks @deepyaman, I'll take a look at this PR next week.

Shall we skip Ibis tests on 3.8?

Yes!

Looks like Pandera still supports 3.8, despite following NEP-29 (which is fine)

Yes, I've been meaning to drop 3.8 support, maybe this happens when ibis-dev is merged onto main when it's ready for a beta release?

Would it be possible to rebase-merge this to ibis-dev instead of squashing?

Yep, we can do this

@cosmicBboy
Copy link
Collaborator

Sorry for the delay on reviewing this @deepyaman, will do so this coming week

@cosmicBboy cosmicBboy changed the base branch from main to ibis-dev July 13, 2024 21:21
@cosmicBboy cosmicBboy changed the base branch from ibis-dev to main July 13, 2024 21:36
@cosmicBboy cosmicBboy changed the base branch from main to ibis-dev July 13, 2024 21:38
@cosmicBboy
Copy link
Collaborator

@deepyaman I just updated the ibis-dev branch to main (@c895dc4), would you mind regenerating the requirements files with make nox-requirements?

type: Any = dataclasses.field(repr=False, init=False)
"""Native Ibis dtype boxed by the data type."""

def __init__(self, dtype: Any):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try:
return engine.Engine.dtype(cls, data_type)
except TypeError:
np_dtype = data_type().to_numpy()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do ibis types have a to_numpy method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deepyaman
Copy link
Contributor Author

@deepyaman I just updated the ibis-dev branch to main (@c895dc4), would you mind regenerating the requirements files with make nox-requirements?

@cosmicBboy Done, although CI seems to be choking on some installation issue.

@cosmicBboy
Copy link
Collaborator

thanks! yeah haven't diagnosed the CI issue yet, for some reason this happens fairly often... restarting CI eventually resolves it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants