feat: support non streamable arrow file binary format #7025
base: main
requesting review - @albertvillanova @lhoestq
8be0e3f to c75c4c3
@@ -42,8 +42,12 @@ def _split_generators(self, dl_manager):
# Infer features if they are stored in the arrow schema
if self.info.features is None:
for file in itertools.chain.from_iterable(files):
with open(file, "rb") as f:
self.info.features = datasets.Features.from_arrow_schema(pa.ipc.open_stream(f).schema)
data_memory_map = pa.memory_map(file)
Memory mapping is only available for local files; in streaming mode, however, `file` is a URL (and `open` is extended to work with URLs and returns a valid `f`).
Could you make it work using `f` rather than `data_memory_map`?
Ideally this should work when passing `streaming=True` to `load_dataset`.
Updated @lhoestq thanks
c75c4c3 to 2e3af68
Signed-off-by: Mehant Kammakomati <[email protected]>
2e3af68 to a3412c5
Awesome, thank you! This will be pretty useful :)
Before we merge, could you also add a test in `tests/packaged_modules/test_arrow.py`?
I noticed it's pretty empty right now compared to `test_json.py` or `test_csv.py` though, so maybe I can take care of it next week if needed.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Co-authored-by: Quentin Lhoest <[email protected]>
Thank you for the review.
@lhoestq Would you like to take that up? It needs some test data to be added, and I see no supporting examples for similar data formats (parquet, pandas, etc.). Thanks
Support Arrow files (`.arrow`) that are in non-streamable binary file formats.