WinError 32 The process cannot access the file during load_dataset #6917

Open
elwe-2808 opened this issue May 24, 2024 · 0 comments

Describe the bug

When I try to load the opus_books dataset from Hugging Face (following the guide on the website):

from datasets import load_dataset, Dataset

dataset = load_dataset("Helsinki-NLP/opus_books", "en-fr", features=["id", "translation"])

I get an error:
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:/Users/Me/.cache/huggingface/datasets/Helsinki-NLP___parquet/ca-de-a39f1ef185b9b73b/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec.incomplete\\parquet-train-00000-00000-of-NNNNN.arrow'

Full stacktrace

AttributeError                            Traceback (most recent call last)
File c:\Users\Me\.conda\envs\ia\lib\site-packages\datasets\builder.py:1858, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1857 _time = time.time()
-> 1858 for _, table in generator:
   1859     if max_shard_size is not None and writer._num_bytes > max_shard_size:

File c:\Users\Me\.conda\envs\ia\lib\site-packages\datasets\packaged_modules\parquet\parquet.py:59, in Parquet._generate_tables(self, files)
     58 def _generate_tables(self, files):
---> 59     schema = self.config.features.arrow_schema if self.config.features is not None else None
     60     if self.config.features is not None and self.config.columns is not None:

AttributeError: 'list' object has no attribute 'arrow_schema'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
File c:\Users\Me\.conda\envs\ia\lib\site-packages\datasets\builder.py:1882, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1881 num_shards = shard_id + 1
-> 1882 num_examples, num_bytes = writer.finalize()
   1883 writer.close()

File c:\Users\Me\.conda\envs\ia\lib\site-packages\datasets\arrow_writer.py:584, in ArrowWriter.finalize(self, close_stream)
    583 # If schema is known, infer features even if no examples were written
--> 584 if self.pa_writer is None and self.schema:
...
--> 627         os.unlink(fullname)
    628     except OSError:
    629         onerror(os.unlink, fullname, sys.exc_info())

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:/Users/Me/.cache/huggingface/datasets/Helsinki-NLP___parquet/ca-de-a39f1ef185b9b73b/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec.incomplete\\parquet-train-00000-00000-of-NNNNN.arrow'
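
Reading the traceback from the top, the first exception is raised at parquet.py line 59, where `.arrow_schema` is looked up on the value passed as `features`, which here is a plain list. The PermissionError seems to come later, while that first error is being handled. A minimal sketch of the failing lookup, using a hypothetical `FakeConfig` stand-in and helper names of my own rather than the real builder classes:

```python
class FakeConfig:
    """Hypothetical stand-in for the builder config; not a datasets class."""

    def __init__(self, features):
        self.features = features


def resolve_schema(config):
    # Same expression as parquet.py line 59 in the traceback.
    return config.features.arrow_schema if config.features is not None else None


def first_error_message():
    # A plain list of column names, as passed in the report.
    config = FakeConfig(features=["id", "translation"])
    try:
        resolve_schema(config)
    except AttributeError as exc:
        return str(exc)
    return None
```

Calling `first_error_message()` returns the same AttributeError text as the first exception in the traceback, which suggests the list argument triggers the chain independently of Windows.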

Steps to reproduce the bug

Run these two lines:

from datasets import load_dataset, Dataset

dataset = load_dataset("Helsinki-NLP/opus_books", "en-fr", features=["id", "translation"])

Expected behavior

I expect the dataset to be loaded without any errors.
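
For comparison, here is a minimal sketch of a call that does not hand a plain list to `features`. It assumes that `features` expects a `datasets.Features` mapping and that the `en-fr` config has a string `id` column and an en/fr `translation` column; neither assumption is confirmed in this report, and the function name `load_opus_books_en_fr` is hypothetical. Leaving out `features` entirely should also let the loader infer the schema on its own.

```python
def load_opus_books_en_fr():
    """Load the en-fr config with an explicit schema instead of a list of names."""
    # Imported lazily so that defining this function does not require the library.
    from datasets import Features, Translation, Value, load_dataset

    # Assumed schema: a string id plus an en/fr translation pair.
    features = Features(
        {
            "id": Value("string"),
            "translation": Translation(languages=["en", "fr"]),
        }
    )
    return load_dataset("Helsinki-NLP/opus_books", "en-fr", features=features)
```

This only addresses the first AttributeError. Whether the failed cleanup of the `.incomplete` directory should surface as WinError 32 is a separate question.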

Environment info

Package       Version
transformers  4.37.2
python        3.9.19
pytorch       2.3.0
datasets      2.12.0
arrow         1.2.3

I am using Conda on Windows 11.
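
Windows may matter here because it normally refuses to delete a file that is still open, whereas POSIX systems allow it. One plausible reading of the traceback (an assumption, not a confirmed diagnosis) is that the `.arrow` shard was still open when the `.incomplete` directory was being removed. A small standard-library sketch of that behaviour, with a hypothetical helper name and file name:

```python
import os
import shutil
import tempfile


def remove_dir_with_open_file(close_first):
    """Try to delete a directory containing a file that was opened for writing.

    Returns True if the directory was removed, False if the OS refused.
    """
    tmp_dir = tempfile.mkdtemp(suffix=".incomplete")
    path = os.path.join(tmp_dir, "shard.arrow")  # hypothetical shard name
    handle = open(path, "wb")
    handle.write(b"partial data")
    try:
        if close_first:
            handle.close()
        try:
            shutil.rmtree(tmp_dir)
            return True
        except PermissionError:
            # On Windows this is typically WinError 32 (sharing violation).
            return False
    finally:
        # Always release the handle and clean up whatever is left.
        if not handle.closed:
            handle.close()
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

With `close_first=True` the removal should succeed on any platform. With `close_first=False` the result depends on the operating system, and on Windows it is expected to return False.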
