Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load_dataset fails to load dataset saved by save_to_disk #7018

Open
sliedes opened this issue Jul 1, 2024 · 0 comments
Open

load_dataset fails to load dataset saved by save_to_disk #7018

sliedes opened this issue Jul 1, 2024 · 0 comments

Comments

@sliedes
Copy link

sliedes commented Jul 1, 2024

Describe the bug

This code fails to load the dataset it just saved:

from datasets import load_dataset
from transformers import AutoTokenizer

MODEL = "google-bert/bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

dataset = load_dataset("yelp_review_full")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.save_to_disk("dataset")

tokenized_datasets = load_dataset("dataset/")  # raises

It raises ValueError: Couldn't infer the same data file format for all splits. Got {NamedSplit('train'): ('arrow', {}), NamedSplit('test'): ('json', {})}.

I believe this bug is caused by the logic that tries to infer dataset format. It counts the most common file extension. However, a small dataset can fit in a single .arrow file and have two JSON metadata files, causing the format to be inferred as JSON:

$ ls -l dataset/test
-rw-r--r-- 1 sliedes sliedes 191498784 Jul  1 13:55 data-00000-of-00001.arrow
-rw-r--r-- 1 sliedes sliedes      1730 Jul  1 13:55 dataset_info.json
-rw-r--r-- 1 sliedes sliedes       249 Jul  1 13:55 state.json

Steps to reproduce the bug

Execute the code above.

Expected behavior

The dataset is loaded successfully.

Environment info

  • datasets version: 2.20.0
  • Platform: Linux-6.9.3-arch1-1-x86_64-with-glibc2.39
  • Python version: 3.12.4
  • huggingface_hub version: 0.23.4
  • PyArrow version: 16.1.0
  • Pandas version: 2.2.2
  • fsspec version: 2024.5.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant