[`feat`] Move dataset card creation to method for easier overriding #6988

tomaarsen · 2024-06-20T10:47:57Z

Hello!

Pull Request overview

Move dataset card creation to method for easier overriding

Details

It's common for me to fully automatically download, reformat, and upload a dataset (e.g. see https://huggingface.co/datasets?other=sentence-transformers), but one aspect that I cannot easily automate is the dataset card generation. This is because during push_to_hub, the dataset card is created in 3 lines of code in a much larger method. To automatically generate a dataset card, I need to either:

Subclass Dataset/DatasetDict, copy the entire push_to_hub method to override the ~3 lines used to generate the dataset card. This is not viable as the method is likely to change over time.
Use push_to_hub normally, then separately download the pushed (but empty) dataset card, update it, and reupload the modified dataset. This works fine, but prevents me from being able to return a Dataset to my users which will automatically use a nice dataset card.

So, in this PR I'm proposing to move the dataset generation into another method so that it can be overridden more easily. For example, imagine the following use case:

import json
from typing import Any, Dict, Optional
from datasets import Dataset, load_dataset
from datasets.info import DatasetInfosDict, DatasetInfo
from datasets.utils.metadata import MetadataConfigs
from huggingface_hub import DatasetCardData, DatasetCard

TEMPLATE = r"""---
{dataset_card_data}
---

# Dataset Card for {source_dataset_name} with mined hard negatives

This dataset is a collection of {column_one}-{column_two}-negative triplets from the {source_dataset_name} dataset. See [{source_dataset_name}](https://huggingface.co/datasets/{source_dataset_id}) for additional information. This dataset can be used directly with Sentence Transformers to train embedding models.

## Mining Parameters
The negative samples have been mined using the following parameters:
- `range_min`: {range_min}, i.e. we skip the {range_min} most similar samples
- `range_max`: {range_max}, i.e. we only look at the top {range_max} most similar samples
- `margin`: {margin}, i.e. we require negative similarity + margin < positive similarity, so negative samples can't be more similar than the known true answer
- `sampling_strategy`: {sampling_strategy}, i.e. whether to randomly sample from the candidate negatives or take the "top" negatives
- `num_negatives`: {num_negatives}, i.e. we mine {num_negatives} negatives per question-answer pair

## Dataset Format
- Columns: {column_one}, {column_two}, negative
- Column types: str, str, str
- Example:
```python
{example}
```
"""

class HNMDataset(Dataset):
    @classmethod
    def from_dict(cls, *args, mining_kwargs: Dict[str, Any], **kwargs) -> "HNMDataset":
        dataset = super().from_dict(*args, **kwargs)
        dataset.mining_kwargs = mining_kwargs
        return dataset
    
    def _create_dataset_card(
        self,
        dataset_card_data: DatasetCardData,
        dataset_card: Optional[DatasetCard],
        config_name: str,
        info_to_dump: DatasetInfo,
        metadata_config_to_dump: MetadataConfigs,
    ) -> DatasetCard:
        if dataset_card:
            return dataset_card

        DatasetInfosDict({config_name: info_to_dump}).to_dataset_card_data(dataset_card_data)
        MetadataConfigs({config_name: metadata_config_to_dump}).to_dataset_card_data(dataset_card_data)
        dataset_card_data.tags = ["sentence-transformers"]
        dataset_name = self.mining_kwargs["source_dataset"].info.dataset_name
        # Very messy, just as an example:
        dataset_id = list(self.mining_kwargs["source_dataset"].info.download_checksums.keys())[0].removeprefix("hf://datasets/").split("@")[0]
        content = TEMPLATE.format(**{
            "dataset_card_data": str(dataset_card_data),
            "source_dataset_name": dataset_name,
            "source_dataset_id": dataset_id,
            "range_min": self.mining_kwargs["range_min"],
            "range_max": self.mining_kwargs["range_max"],
            "margin": self.mining_kwargs["margin"],
            "sampling_strategy": self.mining_kwargs["sampling_strategy"],
            "num_negatives": self.mining_kwargs["num_negatives"],
            "column_one": self.column_names[0],
            "column_two": self.column_names[1],
            "example": json.dumps(self[0], indent=4),
        })
        return DatasetCard(content)

source_dataset = load_dataset("sentence-transformers/gooaq", split="train[:100]")
dataset = HNMDataset.from_dict({
    "query": source_dataset["question"],
    "answer": source_dataset["answer"],
    # "negative": ... <- In my case, this column would be 'mined' automatically with these parameters
}, mining_kwargs={
    "range_min": 10,
    "range_max": 20,
    "max_score": 0.9,
    "margin": 0.1,
    "sampling_strategy": "random",
    "num_negatives": 3,
    "source_dataset": source_dataset,
})
dataset.push_to_hub("tomaarsen/mining_demo", private=True)

In this script, I've created a subclass which stores some additional information about how the dataset was generated. It's a bit hacky (e.g. setting a mining_kwargs parameter in from_dict that wasn't created in __init__, but that's just a consequence of how the from_... methods don't accept kwargs), but it allows me to create a "hard negatives mining" function that returns a dataset which people can use locally like normal, but if they choose to upload it, then it'll automatically include some information, e.g.: https://huggingface.co/datasets/tomaarsen/mining_demo

This allows others to actually find this dataset (e.g. via the sentence-transformers tag) and get an idea of the quality, source, etc. by looking at the model card.

Note

I'm not fixed on this solution whatsoever: I am also completely fine with other solutions, e.g. a dataset.set_dataset_card_creator method that allows me to provide a function without even having to subclass anything. I'm open to all ideas :)

cc @albertvillanova @lhoestq
cc @LysandreJik

Tom Aarsen

HuggingFaceDocBuilderDev · 2024-06-20T10:52:06Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq · 2024-06-20T17:10:30Z

Dataset objects are not made to be subclassed, so I don't think going in that direction is a good idea. In particular there is absolutely no test to make sure it works well, and nothing in the internal has been made to anticipate this use case.

I'd suggest to use a separate function to push changes to the Dataset card, and call it after push_to_hub(). This way people can also use a similar logic with other tools that datasets. You can also use composition instead of subclassing.

tomaarsen · 2024-06-20T22:49:58Z

Would you consider an alternative where a Dataset instance carries a dataset card template which can be updated?

I don't want to burden my users with having to call another method after push_to_hub themselves. If you're not a fan of the template approach above either, then I'll likely subclass push_to_hub to once again download the just-uploaded-but-empty dataset card, update it, and reupload it. It'll just be a bit more requests than necessary, but not a big deal overall.

Tom Aarsen

lhoestq · 2024-06-21T12:58:23Z

Actually I find the idea of overriding _create_dataset_card better than implementing a templating logic. My main concern is that if we go in that direction we better make sure that subclasses of Dataset are working well.

Well if it's been working fine on your side why not, but make sure you test correctly features that could not work because of subclassing (e.g. I'm pretty sure map() won't return your subclass of Dataset). Or at least the ones that matter for your lib.

If it sounds good to you I'm fine with merging your addition to let you override the dataset card.

tomaarsen · 2024-06-21T13:38:16Z

e.g. I'm pretty sure map() won't return your subclass of Dataset

I understand that there's limitations such as this one. The subclass doesn't have to be robust - I'd just like some simple automatic dataset card generation options directly after generating the dataset. This can be removed if the user does additional steps before pushing the model, e.g. mapping, filtering, saving to disk and uploading the loaded dataset, etc.

If it sounds good to you I'm fine with merging your addition to let you override the dataset card.

That would be quite useful for me! I appreciate it.

I'm not very sure what the test failures are caused by, I believe the only change in behaviour is that

        DatasetInfosDict({config_name: info_to_dump}).to_dataset_card_data(dataset_card_data)
        MetadataConfigs({config_name: metadata_config_to_dump}).to_dataset_card_data(dataset_card_data)

are not called when dataset_card was already defined. Unless these have side-effects other than updating dataset_card_data, it shouldn't be any different than main.

Tom Aarsen

lhoestq · 2024-06-21T16:04:57Z

Let's try to have this PR merged then !

IMO your current implementation can be improved since you path both the dataset card data and the dataset card itself, which is redundant. Also I anticipate the failures in the CI to come from your default implementation which doesn't correspond to what it was doing before

Unless these have side-effects other than updating dataset_card_data, it shouldn't be any different than main.

Indeed the dataset_card_data is the value from attribute of the dataset_card from a few lines before your changes, so yes it modifies the dataset_card object too.

tomaarsen added 2 commits June 20, 2024 12:33

Move dataset card creation to method for easier overriding

798b01c

Reformat

7a9ccbc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[`feat`] Move dataset card creation to method for easier overriding #6988

[`feat`] Move dataset card creation to method for easier overriding #6988

tomaarsen commented Jun 20, 2024 •

edited

Loading

HuggingFaceDocBuilderDev commented Jun 20, 2024

lhoestq commented Jun 20, 2024 •

edited

Loading

tomaarsen commented Jun 20, 2024

lhoestq commented Jun 21, 2024

tomaarsen commented Jun 21, 2024

lhoestq commented Jun 21, 2024

[feat] Move dataset card creation to method for easier overriding #6988

Are you sure you want to change the base?

[feat] Move dataset card creation to method for easier overriding #6988

Conversation

tomaarsen commented Jun 20, 2024 • edited Loading

Pull Request overview

Details

Note

HuggingFaceDocBuilderDev commented Jun 20, 2024

lhoestq commented Jun 20, 2024 • edited Loading

tomaarsen commented Jun 20, 2024

lhoestq commented Jun 21, 2024

tomaarsen commented Jun 21, 2024

lhoestq commented Jun 21, 2024

[`feat`] Move dataset card creation to method for easier overriding #6988

[`feat`] Move dataset card creation to method for easier overriding #6988

tomaarsen commented Jun 20, 2024 •

edited

Loading

lhoestq commented Jun 20, 2024 •

edited

Loading