Add custom error message for pyspark register_check_method which currently defaults to None #1716

Open
3 tasks done
marrov opened this issue Jun 27, 2024 · 1 comment
Labels
bug Something isn't working

marrov commented Jun 27, 2024

Describe the bug
Currently, when you register a custom check for pyspark with register_check_method, there is no option to supply a custom error message, as there is with register_builtin_check. As a result, the error message reported for a failed check is None.

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandera.
  • (optional) I have confirmed this bug exists on the main branch of pandera.

Code Sample, a copy-pastable example

import json
import pandera.pyspark as pa
import pyspark.sql.types as T
from pandera.api.extensions import register_check_method
from pandera.api.pyspark.types import PysparkDataframeColumnObject
from pandera.backends.pyspark.decorators import register_input_datatypes
from pandera.backends.pyspark.utils import convert_to_list
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()
data = [("A", 1), ("B", -1)]
schema = T.StructType([T.StructField("id", T.StringType(), True), T.StructField("quantity", T.IntegerType(), True)])
orders = spark.createDataFrame(data, schema=schema)


@register_check_method()  # error="fraction_ge({value=}, {fraction=})"
@register_input_datatypes(acceptable_datatypes=convert_to_list(T.IntegerType))
def fraction_ge(data: PysparkDataframeColumnObject, value: int, fraction: float) -> bool:
    """Ensure that at least a specified fraction of integer values in a column are greater than or equal to a threshold."""
    if not 0 <= fraction <= 1:
        raise ValueError("Fraction must be between 0 and 1")
    total_count = data.dataframe.count()
    if total_count == 0:
        return False
    cond = F.col(data.column_name) >= value
    valid_count = data.dataframe.filter(cond).count()

    return (valid_count / total_count) >= fraction


class OrdersSchema(pa.DataFrameModel):
    id: T.StringType
    quantity: T.IntegerType = pa.Field(fraction_ge={"value": 0, "fraction": 0.9})


orders = OrdersSchema.validate(orders)
print(json.dumps(orders.pandera.errors, indent=4))

Result:

{
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "OrdersSchema",
                "column": "quantity",
                "check": "fraction_ge",
                "error": "column 'quantity' with type IntegerType() failed validation None"
            }
        ]
    }
}

Expected behavior

Ideally, the @register_check_method decorator should accept an optional error parameter, as @register_builtin_check does. With the example above, the decorator would look like:

@register_check_method(error="fraction_ge({value=}, {fraction=})")

The output on a failed check would be:

{
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "OrdersSchema",
                "column": "quantity",
                "check": "fraction_ge",
                "error": "column 'quantity' with type IntegerType() failed validation fraction_ge(value=0, fraction=0.9)"
            }
        ]
    }
}
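
One detail worth noting about the proposed template: the {value=} form is f-string syntax, and plain str.format raises a KeyError on it. The snippet below is only a minimal sketch of how such a template could be rendered from the check's keyword arguments. The helper render_check_error is hypothetical and not part of pandera, and I am assuming the check kwargs are available as a plain dict at the point where the message is built.

```python
import re

# Hypothetical helper, NOT part of pandera: renders an error template from
# the keyword arguments a check was called with.
# str.format does not understand the f-string style "{name=}" placeholder,
# so it is first expanded into "name={name!r}".
_DEBUG_PLACEHOLDER = re.compile(r"\{(\w+)=\}")


def render_check_error(template: str, **check_kwargs) -> str:
    """Expand "{name=}" placeholders, then format with the check kwargs."""
    expanded = _DEBUG_PLACEHOLDER.sub(r"\1={\1!r}", template)
    return expanded.format(**check_kwargs)


message = render_check_error(
    "fraction_ge({value=}, {fraction=})", value=0, fraction=0.9
)
print(message)  # fraction_ge(value=0, fraction=0.9)
```

Alternatively, the template could use ordinary placeholders, e.g. "fraction_ge(value={value}, fraction={fraction})", in which case str.format alone would be enough.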
marrov added the bug (Something isn't working) label Jun 27, 2024
cosmicBboy (Collaborator) commented

@marrov please feel free to make a PR for this!
