save_to_disk() freezes when saving on s3 bucket with multiprocessing #6936

Open
ycattan opened this issue May 30, 2024 · 0 comments

ycattan commented May 30, 2024

Describe the bug

I'm trying to save a Dataset using the save_to_disk() function with:

  • num_proc > 1
  • dataset_path set to an S3 bucket path, e.g. "s3://{bucket_name}/{dataset_folder}/"

The hf progress bar shows up, but the saving does not seem to start and the call hangs.
When using a single process (num_proc=1), everything works fine.
When saving the dataset to local disk (as opposed to an S3 bucket) with num_proc > 1, everything works fine.

Thank you for your help! :)

Steps to reproduce the bug

I tried without any storage options:

from datasets import load_dataset

sandbox_ds = load_dataset("openai_humaneval")
sandbox_ds["test"].save_to_disk(
    "s3://bucket-name/test_multiprocessing_saving/",
    num_proc=4,
)

and with the specific s3fs storage options:

from datasets import load_dataset
from s3fs import S3FileSystem

def get_s3fs():
    return S3FileSystem()

sandbox_ds = load_dataset("openai_humaneval")
sandbox_ds["test"].save_to_disk(
    "s3://bucket-name/test_multiprocessing_saving/",
    num_proc=4,
    storage_options=get_s3fs().storage_options, # also tried: storage_options=S3FileSystem().storage_options
)

I suspect I may be using the storage_options parameter incorrectly, but I didn't find anything online that made it work.

NB: Behavior is the same when trying to save the whole DatasetDict.
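
To help narrow down where the call gets stuck, one option is a small watchdog built on the standard library's faulthandler module. The sketch below is only a diagnostic aid, not a fix. The helper name dump_stacks_if_stuck and the 60-second timeout are hypothetical choices for illustration, and it only prints the stacks of the parent process's threads, not those of the worker processes.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def dump_stacks_if_stuck(timeout_s=60.0, file=None):
    """Hypothetical helper: while the body runs, print the stack of every
    thread in this process each time `timeout_s` seconds elapse."""
    # `file` must be a real file object with a file descriptor.
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        # Stop the watchdog once the body finishes or raises.
        faulthandler.cancel_dump_traceback_later()


# Intended usage with the reproduction above (needs S3 access, so it is
# left commented out here):
#
# with dump_stacks_if_stuck(60):
#     sandbox_ds["test"].save_to_disk(
#         "s3://bucket-name/test_multiprocessing_saving/",
#         num_proc=4,
#     )
```

If the save hangs, the periodic dumps should show which call the main process is blocked in, which may be useful to attach to this report.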

Expected behavior

Progress bar fills in and saving is carried out.

Environment info

datasets==2.18.0
