Streaming dataset not returning data #7024

Open
johnwee1 opened this issue Jul 4, 2024 · 0 comments
johnwee1 commented Jul 4, 2024

Describe the bug

I'm posting here because I'm still not sure what the issue is, or whether I'm using IterableDatasets incorrectly.
I'm following the guide here https://huggingface.co/learn/cookbook/en/fine_tuning_code_llm_on_single_gpu pretty much to a tee, and I have verified that it works when fine-tuning on the provided dataset.

However, I'm doing some extra data preprocessing (filtering out entries), and when I swap the dataset out for mine, training fails. I eventually fixed this by simply setting streaming=False in load_dataset.

Could this be some sort of network / firewall issue on my end?
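For reference, this is the non-streaming call that ended up working for me; my assumption is that only the streaming flag needs to change (apart from dropping buffer_size from shuffle, as noted below), and the rest of the pipeline stays the same:

commitpackft = load_dataset(
    "chargoddard/commitpack-ft-instruct", split="train", streaming=False  # workaround: download the dataset instead of streaming it
).filter(lambda example: example["language"] == "Python")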

Steps to reproduce the bug

I made a post describing in more detail how I reproduced this problem, before I found my workaround: https://discuss.huggingface.co/t/problem-with-custom-iterator-of-streaming-dataset-not-returning-anything/94551

Here is the problematic dataset snippet, which works when streaming=False (and with the buffer_size keyword removed from shuffle):

from datasets import load_dataset

commitpackft = load_dataset(
    "chargoddard/commitpack-ft-instruct", split="train", streaming=True
).filter(lambda example: example["language"] == "Python")


def form_template(example):
    """Forms a template for each example following the alpaca format for CommitPack"""
    example["content"] = (
        "### Human: " + example["instruction"] + " " + example["input"] + " ### Assistant: " + example["output"]
    )
    return example


# Remove every original column since it's all inside "content" now
dataset = commitpackft.map(
    form_template,
    remove_columns=["id", "language", "license", "instruction", "input", "output"],
).shuffle(seed=42, buffer_size=10000)

validation_data = dataset.take(4000)
train_data = dataset.skip(4000)
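As a quick sanity check (not part of the original training script), pulling a couple of examples directly from the streaming pipeline shows whether the iterator yields anything at all before it is handed to the trainer; validation_data here is the IterableDataset from the snippet above:

from itertools import islice

# Try to materialize a few examples from the streaming pipeline.
# If this hangs or prints an empty list, the iterator itself is the problem.
print(list(islice(validation_data, 2)))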

The annoying part is that this only fails during training, and I can't predict exactly when it will fail, except that it always fails during evaluation.

Expected behavior

The expected behavior is that I should be able to get something back from the iterator when it is called, instead of getting nothing or getting stuck in a loop somewhere.

Environment info

  • datasets version: 2.20.0
  • Platform: Linux-5.4.0-121-generic-x86_64-with-glibc2.31
  • Python version: 3.11.7
  • huggingface_hub version: 0.23.4
  • PyArrow version: 16.1.0
  • Pandas version: 2.2.2
  • fsspec version: 2024.5.0