Compaction can rewrite files without reducing file count #2591

Open
gfredericks opened this issue Jun 12, 2024 · 0 comments
Labels
bug Something isn't working

Environment

Delta-rs version: 0.15.3

Binding:

Environment: python 3.9.16


Bug

What happened:

I compacted a table, and it replaced a set of files with a new identically-sized set of files.

What you expected to happen:

I expect compaction to do no work when it will not reduce the file count. Combined with #2576, this means there is nothing I can do to get a table into a state where I am sure compaction is a no-op (incidentally, the example below also reproduces #2576, showing that it can take more than one compaction to get a table into a minimal state).
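For context, the check I would like to be able to rely on is something like the following, using the stats dict that `dt.optimize.compact()` returns (as in the reproduction below); `is_noop` is a hypothetical helper name, not part of the deltalake API:

```python
def is_noop(stats):
    """Hypothetical check: did this compaction round actually change anything?

    `stats` is the metrics dict returned by dt.optimize.compact(),
    which includes 'numFilesAdded' and 'numFilesRemoved'.
    """
    return stats["numFilesAdded"] == 0 and stats["numFilesRemoved"] == 0

# In the output below, every round after the first reports 2 added / 2 removed,
# so this condition is never reached and the loop never stabilizes.
print(is_noop({"numFilesAdded": 2, "numFilesRemoved": 2}))  # False
```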

How to reproduce it:

import deltalake
import pyarrow as pa
 
for z in range(10):
    deltalake.write_deltalake(
        './storageloop-table',
        pa.Table.from_pydict(
            {
                "x": pa.array([x % 207 for x in range(1000000)]),
                "y": pa.array([x % 3008 for x in range(1000000)]),
                "z": pa.array([z for _ in range(1000000)]),
            }
        ),
        mode='append',
    )
 
for _ in range(5):
    dt = deltalake.DeltaTable('./storageloop-table')
    print(f"Table has {len(dt.files())} files pre-compaction")
    # use a small target_size for this toy example so we can
    # reproduce it with smaller data
    stats = dt.optimize.compact(target_size=2**21)
    print(f"Compaction added {stats['numFilesAdded']} files and removed {stats['numFilesRemoved']} files")

Outputs:

Table has 10 files pre-compaction
Compaction added 3 files and removed 9 files
Table has 4 files pre-compaction
Compaction added 2 files and removed 4 files
Table has 2 files pre-compaction
Compaction added 2 files and removed 2 files
Table has 2 files pre-compaction
Compaction added 2 files and removed 2 files
Table has 2 files pre-compaction
Compaction added 2 files and removed 2 files

More details:

Without having looked at the implementation, my guess is that the compaction algorithm decides it can merge the two files and issues a write of a single file to the table, but some lower-level mechanism splits that write back up into two files.
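A toy model of that hypothesis (purely illustrative, not taken from the actual delta-rs code): if the planner merges files by total bytes but a lower-level writer re-splits the output at a maximum file size, two files each slightly over half the target will keep collapsing into one logical write and splitting back into two, so the file count never drops:

```python
def split_write(total_bytes, max_file_size):
    # Hypothetical lower-level writer: splits one logical write into
    # as many physical files as needed to stay under max_file_size.
    return -(-total_bytes // max_file_size)  # ceiling division

target_size = 2**21  # same target as the reproduction
# Two files each slightly over half the target: the planner merges them...
files = [target_size // 2 + 1024, target_size // 2 + 1024]
merged = sum(files)
# ...but the merged write exceeds target_size, so the writer splits it
# back into two files: no net progress, round after round.
print(split_write(merged, target_size))  # 2
```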
