[Data] ArrowVariableShapedTensorArray with LargeListArray #46434

Open
vipese-idoven opened this issue Jul 4, 2024 · 4 comments · Fixed by #45352
Labels
data (Ray Data-related issues), enhancement (Request for new feature and/or capability), P1 (Issue that should be fixed within a few weeks)

Comments

@vipese-idoven

vipese-idoven commented Jul 4, 2024

Description

The current implementation can only create ArrowVariableShapedTensorArray objects with at most (2^31)-1 elements, because it uses PyArrow's ListArray (ray.air.util.tensor_extensions.arrow, L812), which uses 32-bit offsets for indexing. As a result, storing data such as long time-series, which can contain more elements than 32-bit offsets can address, causes an overflow.
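To make the limit concrete, here is a minimal sketch of the arithmetic. It assumes the last offset of the list array equals the total number of flattened elements across all tensors; `final_offset` and `fits_int32` are hypothetical helper names for illustration, not part of Ray or PyArrow.

```python
import math

INT32_MAX = 2**31 - 1  # largest offset a 32-bit ListArray can hold


def final_offset(shapes):
    """Total number of flattened elements across all tensors.

    Assumption: this is the value of the last offset in the list array.
    """
    return sum(math.prod(shape) for shape in shapes)


def fits_int32(shapes):
    """True if every offset for these tensors fits in a signed 32-bit integer."""
    return final_offset(shapes) <= INT32_MAX


# Hypothetical workload: one-hour mono recordings sampled at 16 kHz.
one_hour_16khz = (3600 * 16_000,)  # 57,600,000 samples each

print(fits_int32([one_hour_16khz] * 37))  # 37 recordings still fit
print(fits_int32([one_hour_16khz] * 38))  # the 38th pushes past 2^31 - 1

# A single 24-hour recording at 44.1 kHz already exceeds the limit on its own.
print(fits_int32([(24 * 3600 * 44_100,)]))
```

Under these assumptions, the overflow depends on the total element count in the array, not only on the length of any single signal.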

Replacing ListArray with PyArrow's LargeListArray would make it possible to store arrays with up to (2^63)-1 elements. (Note: this would also require changing OFFSET_DTYPE in L722.)

Use case

The goal is to be able to store long time-series in arrow format (like long audios, or audios with high sample frequencies).

@vipese-idoven vipese-idoven added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 4, 2024
@anyscalesam
Collaborator

@vipese-idoven what is your use case for this; are you looking for batch processing on those audio files?

@vipese-idoven
Author

> @vipese-idoven what is your use case for this; are you looking for batch processing on those audio files?

I've used Ray Data for batch processing, and the resulting signals can be very long. I was hoping to store the pre-processed data in Arrow format for later segmentation and classification, to avoid pre-processing it again (or doing it on the fly).

@scottjlee
Contributor

There is a WIP PR from an external contributor, but it had to be reverted due to some failing release tests.

@scottjlee scottjlee changed the title [Ray Air] ArrowVariableShapedTensorArray with LargeListArray [Data] ArrowVariableShapedTensorArray with LargeListArray Jul 16, 2024
@scottjlee scottjlee added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 16, 2024
@vipese-idoven
Author

> There is a WIP PR from an external contributor, but it had to be reverted due to some failing release tests.

Awesome! Happy to help if need be

3 participants