Python write_deltalake with schema_mode="merge" casts types #2642

Open
robjhornby-ts opened this issue Jul 2, 2024 · 0 comments
Labels
binding/python Issues for the Python package bug Something isn't working

robjhornby-ts commented Jul 2, 2024

Environment

Delta-rs version: 0.18.1 (Python deltalake 0.18.1)

Binding:

Environment:

  • Cloud provider:
  • OS: MacOS Sonoma 14.5 (Intel)
  • Other:

Bug

What happened:

When I use write_deltalake to append data to a table with schema_mode="merge" and the type of a field has changed, the new data is cast to the existing table schema (e.g. a string "1" becomes an int 1). No error is raised, even though the two schemas are incompatible.

What you expected to happen:

When a field's type changes, I'd expect the merge to fail, unless I deliberately opt in to something like type widening (e.g. int32 to int64).

How to reproduce it:

In the script below, the schema is built from a struct with a single field "x", whose type changes from int64 to string between the two writes. The table schema is printed after creation and after each write, and the final data shows that the string has been cast to an int.

import pyarrow as pa
from pathlib import Path
import tempfile
from deltalake import DeltaTable, write_deltalake


def invalid_schema_merge():
    with tempfile.TemporaryDirectory() as tmpdir:
        data_path = Path(tmpdir)
        schema_before = pa.schema(
            pa.struct(
                [
                    pa.field("x", pa.int64()),
                ]
            )
        )
        schema_after = pa.schema(
            pa.struct(
                [
                    pa.field("x", pa.string()),
                ]
            )
        )
        data_before = pa.Table.from_pylist([{"x": 100}], schema=schema_before)
        data_after = pa.Table.from_pylist([{"x": "1"}], schema=schema_after)

        table = DeltaTable.create(
            data_path,
            schema=schema_before,
            mode="overwrite",
        )
        print(table.schema())
        write_deltalake(
            table,
            data_before,
            schema=schema_before,
            mode="append",
            schema_mode="merge",
            engine="rust",
        )
        print("-" * 80)
        print(table.schema())

        write_deltalake(
            table,
            data_after,
            schema=schema_after,
            mode="append",
            schema_mode="merge",
            engine="rust",
        )
        print("-" * 80)
        print(table.schema())
        print("-" * 80)
        print("Final data: ", table.to_pyarrow_table())


invalid_schema_merge()

Prints:

Schema([Field(x, PrimitiveType("long"), nullable=True)])
--------------------------------------------------------------------------------
Schema([Field(x, PrimitiveType("long"), nullable=True)])
--------------------------------------------------------------------------------
Schema([Field(x, PrimitiveType("long"), nullable=True)])
--------------------------------------------------------------------------------
Final data:  pyarrow.Table
x: int64
----
x: [[1],[100]]

The schema is unchanged after writing the new data containing a string, and the string "1" has been cast to the int 1.

I don't know whether this is a bug or a feature request. I see some mention of type casting in the code but couldn't find it in the docs.

I'd like to be able to merge schemas in a way that raises an exception if a field's type has changed, unless I enable a setting for type casting (or similar). Let me know if more info would be useful, or if you want me to check more cases.
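As a possible stopgap until something like this exists in the library, a check along these lines could run before each write. It is a minimal sketch, not part of deltalake: the names SchemaTypeChangeError, check_no_type_changes and allowed_casts are hypothetical, and only top-level fields are compared.

```python
from typing import Hashable, Iterable, Mapping, Tuple


class SchemaTypeChangeError(ValueError):
    """Hypothetical error for a field whose type differs between schemas."""


def check_no_type_changes(
    existing: Mapping[str, Hashable],
    incoming: Mapping[str, Hashable],
    allowed_casts: Iterable[Tuple[Hashable, Hashable]] = (),
) -> None:
    """Raise if a field present in both schemas changes type.

    New fields are allowed, since adding them is the point of a merge.
    A type change is accepted only when the (incoming, existing) pair was
    explicitly listed in allowed_casts, e.g. ("int32", "int64").
    """
    allowed = set(allowed_casts)
    for name, new_type in incoming.items():
        if name not in existing:
            continue  # new column: a legitimate schema merge
        old_type = existing[name]
        if new_type != old_type and (new_type, old_type) not in allowed:
            raise SchemaTypeChangeError(
                f"field {name!r} changes type from {old_type} to {new_type}"
            )


# Type names are plain strings here so the sketch runs without pyarrow.
table_fields = {"x": "int64"}
check_no_type_changes(table_fields, {"x": "int64", "y": "string"})  # passes
try:
    check_no_type_changes(table_fields, {"x": "string"})
except SchemaTypeChangeError as err:
    print(err)
```

With pyarrow schemas, a mapping of this shape can be built with dict(zip(schema.names, schema.types)), and the current table schema should be obtainable from table.schema().to_pyarrow(). Fields nested inside structs would need a recursive comparison, which this sketch leaves out.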

@robjhornby-ts robjhornby-ts added the bug Something isn't working label Jul 2, 2024
@rtyler rtyler added the binding/python Issues for the Python package label Jul 5, 2024