cannot split dataset when using load_dataset #6982
Comments
It seems the bug happens on every Windows system: I tried Windows 8.1, 10, and 11, and all of them failed. It does not happen on Linux (Ubuntu and CentOS 7) or on Mac, on either my virtual or physical machines. I still don't know what the problem is. Could it be related to the path? I also cannot load, on my Windows server, the split files that were created on Linux, even after replacing the paths in the Arrow files. I have worked on this for a week and still cannot fix it, which is frustrating.
Have you properly logged in? Are you using a valid token? Note that this dataset is gated and you must follow the right procedure to be able to access it. You can find more info in the docs: https://huggingface.co/docs/hub/datasets-gated#access-gated-datasets-as-a-user
I finally found out what happened. It is not about logging in. When I copied the dataset from its original path (C:/Users/cybes/.cache/huggingface/datasets/downloads/extracted/XXX/cv-corpus-7.0-2021-07-21) to the desktop and loaded each TSV file in it one by one, a warning occurred when loading the test split. After I manually deleted the offending entries in the "segment" column, the error no longer occurred. Even when I replaced the files in the original path with these revised TSV files and used the previous loading method (common_voice_train = load_dataset("mozilla-foundation/common_voice_7_0", "ja", split="train", trust_remote_code=True)), it worked properly.
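For anyone hitting the same problem, the manual check described above could be scripted roughly as follows. This is only a sketch: `check_tsv` and `check_all_tsvs` are hypothetical helper names, the directory argument stands for wherever the extracted corpus was copied, and flagging non-empty "segment" values assumes the offending entries live in that column, which may not match every corpus release.

```python
import csv
from pathlib import Path


def check_tsv(path):
    """Return a list of (line_number, problem) tuples for one TSV file."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE treats quote characters as ordinary text, so a stray
        # quote cannot swallow the rows that follow it.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, None)
        if header is None:
            return [(1, "empty file")]
        # Assumption: suspicious values sit in a column named "segment".
        segment_idx = header.index("segment") if "segment" in header else None
        for row in reader:
            line_no = reader.line_num
            if len(row) != len(header):
                problems.append(
                    (line_no, f"expected {len(header)} fields, got {len(row)}")
                )
            elif segment_idx is not None and row[segment_idx].strip():
                problems.append(
                    (line_no, f"non-empty segment: {row[segment_idx]!r}")
                )
    return problems


def check_all_tsvs(directory):
    """Map each TSV file name in `directory` to its problems (if any)."""
    report = {}
    for tsv in sorted(Path(directory).glob("*.tsv")):
        problems = check_tsv(tsv)
        if problems:
            report[tsv.name] = problems
    return report
```

Calling `check_all_tsvs` on the copied corpus folder returns only the files that contain flagged rows, which narrows down which split (for example test.tsv) needs manual cleanup.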
Describe the bug
When I use load_dataset to load mozilla-foundation/common_voice_7_0, it successfully downloads and extracts the dataset, but it cannot generate the Arrow files.
This bug happens on my server and my laptop, the same as in #6906, but it does not happen in Google Colab. I have worked on it for days. Even when I load the dataset from a local path, it generates the train and validation splits, but the bug occurs again on the test split.
Steps to reproduce the bug
from datasets import load_dataset, load_metric, Audio
common_voice_train = load_dataset("mozilla-foundation/common_voice_7_0", "ja", split="train", token=selftoken, trust_remote_code=True)
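Since the report says the train and validation splits generate while the test split fails, one way to isolate the failing split is to request each split separately and record any exception. The sketch below is an illustration, not part of the original report: `find_failing_splits` and `load_common_voice_split` are hypothetical helpers, and the split names are assumed to be "train", "validation", and "test".

```python
def find_failing_splits(load_split, splits=("train", "validation", "test")):
    """Call load_split(name) for each split and collect error messages."""
    failures = {}
    for split in splits:
        try:
            load_split(split)
        except Exception as exc:  # broad on purpose: we only report errors
            failures[split] = f"{type(exc).__name__}: {exc}"
    return failures


def load_common_voice_split(split, token=None):
    """Load one split with the same arguments as in the report."""
    # Imported lazily so find_failing_splits can be used without `datasets`.
    from datasets import load_dataset

    return load_dataset(
        "mozilla-foundation/common_voice_7_0",
        "ja",
        split=split,
        token=token,  # assumption: a valid access token for the gated dataset
        trust_remote_code=True,
    )
```

Running `find_failing_splits(load_common_voice_split)` returns a dictionary keyed by the splits that raised, so an empty result means every split loaded.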
Expected behavior
Environment info
Environment:
Python 3.9
Windows 11 Pro
VS Code + Jupyter