Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support non streamable arrow file binary format #7025

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

kmehant
Copy link

@kmehant kmehant commented Jul 4, 2024

Support Arrow files (.arrow) that are in non streamable binary file formats.

@kmehant
Copy link
Author

kmehant commented Jul 4, 2024

requesting review - @albertvillanova @lhoestq

@kmehant kmehant force-pushed the support-non-streamable-arrow-files branch from 8be0e3f to c75c4c3 Compare July 4, 2024 17:44
@@ -42,8 +42,12 @@ def _split_generators(self, dl_manager):
# Infer features if they are stored in the arrow schema
if self.info.features is None:
for file in itertools.chain.from_iterable(files):
with open(file, "rb") as f:
self.info.features = datasets.Features.from_arrow_schema(pa.ipc.open_stream(f).schema)
data_memory_map = pa.memory_map(file)
Copy link
Member

@lhoestq lhoestq Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Memory mapping is only available for local files, however in streaming mode file is a URL (and open is extended to work with URLs and returns a valid f).

Could you make it work using f rather than data_memory_map ?
Ideally this should work when passing streaming=True to load_dataset

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated @lhoestq thanks

@kmehant kmehant force-pushed the support-non-streamable-arrow-files branch from c75c4c3 to 2e3af68 Compare July 9, 2024 02:18
@kmehant kmehant force-pushed the support-non-streamable-arrow-files branch from 2e3af68 to a3412c5 Compare July 9, 2024 02:21
Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome thank you ! this will be pretty useful :)

Before we merge could you also add a test in tests/packaged_modules/test_arrow.py ?

I noticed it's pretty empty right now compared to test_json.py or test_csv.py though, maybe I can take care of it next week if needed

src/datasets/packaged_modules/arrow/arrow.py Outdated Show resolved Hide resolved
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Co-authored-by: Quentin Lhoest <[email protected]>
@kmehant
Copy link
Author

kmehant commented Jul 9, 2024

Thank you for the review.

Before we merge could you also add a test in tests/packaged_modules/test_arrow.py ?

I noticed it's pretty empty right now compared to test_json.py or test_csv.py though, maybe I can take care of it next week if needed

@lhoestq Would you like to take that up? since it needs adding some test data and I see no supportive examples for similar data formats - parquet pandas etc. Thanks

@kmehant kmehant requested a review from lhoestq July 9, 2024 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants