Streaming dataset not returning data #7024

Open
johnwee1 opened this issue Jul 4, 2024 · 0 comments
johnwee1 commented Jul 4, 2024

Describe the bug

I'm posting here because I'm still not sure what the issue is, or whether I'm using IterableDatasets incorrectly.
I'm following the guide here https://huggingface.co/learn/cookbook/en/fine_tuning_code_llm_on_single_gpu pretty much to a tee, and I have verified that it works when fine-tuning on the provided dataset.

However, I'm doing some extra data preprocessing (filtering out entries), and when I swap the dataset out for mine, training fails. I eventually fixed this by simply setting streaming=False in load_dataset.

Could this be some sort of network / firewall issue on my end?
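For reference, this is the non-streaming call that ended up working for me; my assumption is that only the streaming flag needs to change (apart from dropping buffer_size from shuffle, as noted below), and the rest of the pipeline stays the same:

commitpackft = load_dataset(
    "chargoddard/commitpack-ft-instruct", split="train", streaming=False  # workaround: download the dataset instead of streaming it
).filter(lambda example: example["language"] == "Python")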

Steps to reproduce the bug

I made a post describing in more detail how I reproduced this problem, before I found my workaround: https://discuss.huggingface.co/t/problem-with-custom-iterator-of-streaming-dataset-not-returning-anything/94551

Here is the problematic dataset snippet, which works when streaming=False (and with the buffer_size keyword removed from shuffle):

from datasets import load_dataset

commitpackft = load_dataset(
    "chargoddard/commitpack-ft-instruct", split="train", streaming=True
).filter(lambda example: example["language"] == "Python")


def form_template(example):
    """Forms a template for each example following the alpaca format for CommitPack"""
    example["content"] = (
        "### Human: " + example["instruction"] + " " + example["input"] + " ### Assistant: " + example["output"]
    )
    return example


# Remove every original column since it's all inside "content" now
dataset = commitpackft.map(
    form_template,
    remove_columns=["id", "language", "license", "instruction", "input", "output"],
).shuffle(seed=42, buffer_size=10000)

validation_data = dataset.take(4000)
train_data = dataset.skip(4000)
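As a quick sanity check (not part of the original training script), pulling a couple of examples directly from the streaming pipeline shows whether the iterator yields anything at all before it is handed to the trainer; validation_data here is the IterableDataset from the snippet above:

from itertools import islice

# Try to materialize a few examples from the streaming pipeline.
# If this hangs or prints an empty list, the iterator itself is the problem.
print(list(islice(validation_data, 2)))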

The annoying part is that this only fails during training, and I can't predict exactly when it will fail, except that it always fails during evaluation.

Expected behavior

The expected behavior is that I should be able to get something back from the iterator when it is called, instead of getting nothing or getting stuck in a loop somewhere.

Environment info

  • datasets version: 2.20.0
  • Platform: Linux-5.4.0-121-generic-x86_64-with-glibc2.31
  • Python version: 3.11.7
  • huggingface_hub version: 0.23.4
  • PyArrow version: 16.1.0
  • Pandas version: 2.2.2
  • fsspec version: 2024.5.0