add split argument to Generator #7015

piercus · 2024-07-01T08:09:25Z

Actual

When creating a multi-split dataset using generators like

datasets.DatasetDict({
  "val": datasets.Dataset.from_generator(
      generator=generator_val,
      features=features
  ),
  "test": datasets.Dataset.from_generator(
      generator=generator_test,
      features=features,
  )
})

It displays (for both test and val)

Generating train split

Expected

I would like to be able to improve this behavior by doing

datasets.DatasetDict({
  "val": datasets.Dataset.from_generator(
      generator=generator_val,
      features=features,
      split="val"
  ),
  "test": datasets.Dataset.from_generator(
      generator=generator_test,
      features=features,
      split="test"
  )
})

It would display

Generating val split

and

Generating test split

Proposal

This PR adds an explicit split argument, replacing the implicit "train" split, in the following classes and functions:

Generator
from_generator
AbstractDatasetInputStream
GeneratorDatasetInputStream

Please share your feedback.

HuggingFaceDocBuilderDev · 2024-07-09T08:07:48Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

albertvillanova

Thanks for your proposed contribution @piercus.

This is a nice one!

Some comments below. Basically, I would propose to define the split parameter just as an attribute of GeneratorConfig instead of Generator, GeneratorDatasetInputStream, AbstractDatasetInputStream and SqlDatasetReader.
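
A minimal sketch of this suggestion, using simplified stand-in classes rather than the actual `datasets` implementation. The names `GeneratorConfigSketch` and `GeneratorBuilderSketch` and the `progress_message` helper are hypothetical, and a plain `str` is used where the real code would presumably carry a `NamedSplit`:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterator, List, Optional


@dataclass
class GeneratorConfigSketch:
    """Hypothetical, simplified stand-in for GeneratorConfig."""

    generator: Optional[Callable[..., Iterator[Dict[str, Any]]]] = None
    gen_kwargs: Dict[str, Any] = field(default_factory=dict)
    # Assumption: plain string here; the review suggests NamedSplit / Split.TRAIN.
    split: str = "train"


class GeneratorBuilderSketch:
    """Hypothetical builder that reads the split name from its config only."""

    def __init__(self, **config_kwargs: Any) -> None:
        # The split travels inside the config, so no class in the call chain
        # needs its own `split` constructor argument.
        self.config = GeneratorConfigSketch(**config_kwargs)

    def split_names(self) -> List[str]:
        # A single split, named by the config instead of a hard-coded "train".
        return [self.config.split]

    def progress_message(self) -> str:
        return f"Generating {self.config.split} split"


def generator_val() -> Iterator[Dict[str, Any]]:
    yield {"x": 1}


default_builder = GeneratorBuilderSketch(generator=generator_val)
val_builder = GeneratorBuilderSketch(generator=generator_val, split="val")
print(default_builder.progress_message())
print(val_builder.progress_message())
```

Keeping the value on the config means callers that never pass `split` still get the "train" default without any extra plumbing.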

src/datasets/arrow_dataset.py

albertvillanova · 2024-07-09T08:57:36Z

src/datasets/arrow_dataset.py

@@ -1088,6 +1089,8 @@ def from_generator(
                Number of processes when downloading and generating the dataset locally.
                This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default.
                If `num_proc` is greater than one, then all list values in `gen_kwargs` must be the same length. These values will be split between calls to the generator. The number of shards will be the minimum of the shortest list in `gen_kwargs` and `num_proc`.
+            split (`str`, defaults to `"train"`):
+                Split name to be assigned to the dataset.


This docstring should go below <Added version="2.7.0"/>, because the version added tag corresponds to the num_proc parameter above split.

I would suggest aligning its type with the rest of the code as: ([`NamedSplit`], defaults to `Split.TRAIN`).

I would also add a specific version-added tag for the split parameter. We may eventually change its version depending on the next release.

I just used <Added version="2.21.0"/>; please cross-check.

src/datasets/io/abc.py

src/datasets/io/generator.py

src/datasets/iterable_dataset.py

albertvillanova · 2024-07-09T09:10:30Z

src/datasets/iterable_dataset.py

@@ -2074,7 +2075,8 @@ def from_generator(
                Keyword arguments to be passed to the `generator` callable.
                You can define a sharded iterable dataset by passing the list of shards in `gen_kwargs`.
                This can be used to improve shuffling and when iterating over the dataset with multiple workers.
-
+            split(`str`, default="train"):
+                Split name to be assigned to the dataset.


Same comments as before.

I also added <Added version="2.21.0"/>; please cross-check.

src/datasets/packaged_modules/generator/generator.py

albertvillanova · 2024-07-09T09:14:52Z

tests/test_arrow_dataset.py

-    dataset = Dataset.from_generator(data_generator, features=features, cache_dir=cache_dir)
-    _check_generator_dataset(dataset, expected_features)
+    dataset = Dataset.from_generator(data_generator, features=features, cache_dir=cache_dir, split=split)
+    _check_generator_dataset(dataset, expected_features, split)



I would add a specific test_dataset_from_generator_split with a parametrized split values, such as not passing any value, passing NamedSplit("train"), passing literal "train", passing other NamedSplit, etc.

I added test_dataset_from_generator_split. I still changed _check_generator_dataset so that the same generic check is shared everywhere.
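
The parametrization discussed above could be sketched roughly as follows. This is only an illustration with stand-ins: `NamedSplitSketch` and `resolve_split` are hypothetical and stdlib-only, not the `datasets` classes or the test that was actually added in this PR:

```python
class NamedSplitSketch(str):
    """Hypothetical stand-in for a named split; here simply a str subclass."""


def resolve_split(split=None) -> NamedSplitSketch:
    """Hypothetical helper: map None, a literal or a named split to a named split."""
    if split is None:
        # Not passing any value falls back to the default "train" split.
        return NamedSplitSketch("train")
    return NamedSplitSketch(str(split))


# (input value, expected split name), mirroring the cases listed in the review.
CASES = [
    (None, "train"),                        # no value passed
    (NamedSplitSketch("train"), "train"),   # named split equal to the default
    ("train", "train"),                     # literal "train"
    (NamedSplitSketch("test"), "test"),     # another named split
    ("test", "test"),                       # another literal
]


def check_all_cases() -> int:
    """Run the shared check on every case and return how many were checked."""
    for split, expected in CASES:
        resolved = resolve_split(split)
        assert isinstance(resolved, NamedSplitSketch)
        assert resolved == expected
    return len(CASES)


checked = check_all_cases()
```

In a pytest suite, the `CASES` list would typically feed `pytest.mark.parametrize` so that each value shows up as its own test id.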

piercus · 2024-07-10T06:17:40Z

@albertvillanova thanks for the review, please take a look

albertvillanova

Thanks! Good work.

Just a fix of the non-passing test and a nit.

src/datasets/iterable_dataset.py

tests/test_arrow_dataset.py

Co-authored-by: Albert Villanova del Moral <[email protected]>

src/datasets/iterable_dataset.py

Co-authored-by: Albert Villanova del Moral <[email protected]>

piercus · 2024-07-11T07:44:44Z

@albertvillanova please take a look

albertvillanova

Thanks for your contribution!

Note the CI action to generate the docs is failing due to an unrelated issue: https://github.com/huggingface/datasets/actions/runs/9887484572/job/27309892176?pr=7015

Therefore, if we do not want to break the generation of docs, this other PR should be merged before yours:

Fix doc generation when NamedSplit is used as parameter default value #7036

add split argument to Generator, from_generator, AbstractDatasetInputStream, GeneratorDatasetInputStream

3524459

albertvillanova mentioned this pull request Jul 9, 2024

from_generator does not allow to specify the split name #7033

Open

albertvillanova linked an issue Jul 9, 2024 that may be closed by this pull request

from_generator does not allow to specify the split name #7033

Open

albertvillanova requested changes Jul 9, 2024

piercus added 4 commits July 10, 2024 07:48

split generator review feedbacks

eef7c96

import Split

bdd9662

tag added version in iterable_dataset, rollback change in _concatenate_iterable_datasets

5512e3f

rm useless Generator __init__

6f1c18b

piercus requested a review from albertvillanova July 10, 2024 06:17

albertvillanova requested changes Jul 10, 2024

docstring formatting

d74a862

Co-authored-by: Albert Villanova del Moral <[email protected]>

albertvillanova reviewed Jul 10, 2024

piercus and others added 3 commits July 10, 2024 10:12

format docstring

7e50f23

Co-authored-by: Albert Villanova del Moral <[email protected]>

fix test_dataset_from_generator_split[None]

96b9e37

Merge branch 'main' into generator-split

b912261

piercus requested a review from albertvillanova July 11, 2024 07:44

albertvillanova approved these changes Jul 11, 2024
