
Gathering a list of strings from multiple devices using Fabric #20016

Open
Haran71 opened this issue Jun 26, 2024 · 1 comment
Labels: docs (Documentation related), question (Further information is requested)


Haran71 commented Jun 26, 2024

Bug description

I have a list of strings on each device in multi-GPU evaluation, and I want to collect them from all devices into a single list that is available on every device.

m_preds = fabric.all_gather(all_preds) 
m_gt = fabric.all_gather(all_gt) 

When I run the above code (all_preds and all_gt are lists of strings), m_preds and m_gt come back as the same lists as all_preds and all_gt on each respective device. Am I doing something wrong?

What version are you seeing the problem on?

v2.2

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @Borda

Haran71 added the labels bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) on Jun 26, 2024
awaelchli (Member) commented

Hey @Haran71

The documentation states:

Gather tensors or collections of tensors from multiple processes.

        This method needs to be called on all processes and the tensors need to have the same shape across all
        processes, otherwise your program will stall forever.

        Args:
            data: int, float, tensor of shape (batch, ...), or a (possibly nested) collection thereof.
            group: the process group to gather results from. Defaults to all processes (world).
            sync_grads: flag that allows users to synchronize gradients for the ``all_gather`` operation

        Return:
            A tensor of shape (world_size, batch, ...), or if the input was a collection
            the output will also be a collection with tensors of this shape. For the special case where
            world_size is 1, no additional dimension is added to the tensor(s).

It does not mention anywhere that strings are supported; the documentation clearly states that this is meant to work on tensors.
The reason there is no error is that we want to support collections such as dictionaries: tensor entries get gathered, while other entry types (like strings) are passed through unchanged. Perhaps the documentation could mention that explicitly.
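
To illustrate (a minimal sketch, not from the original thread; the Fabric setup and the dictionary contents are made up), the pass-through behavior looks roughly like this:

import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="cpu", devices=2)  # assumed 2-process setup
fabric.launch()

data = {
    "scores": torch.tensor([0.1, 0.9]),  # tensor leaf: gets gathered
    "names": ["cat", "dog"],             # string leaf: returned as-is
}
out = fabric.all_gather(data)
# out["scores"] has shape (world_size, 2) == (2, 2)
# out["names"] is still ["cat", "dog"] on every rank, mirroring the report above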

If you have predictions you'd like to all-gather, I suggest keeping them as numbers/tensors, gathering those, and then converting them back to strings at the end.
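
For example (a hypothetical sketch; the label vocabulary and variable names are assumptions, with all_preds and fabric as in the report above), encode each string as an integer id that is identical across ranks, gather the ids as a tensor, and decode afterwards:

import torch

# Hypothetical fixed mapping, identical on every rank.
label_to_id = {"cat": 0, "dog": 1, "bird": 2}
id_to_label = {v: k for k, v in label_to_id.items()}

# all_preds: this rank's list of string predictions. It must have the
# same length on every rank, otherwise all_gather will stall.
pred_ids = torch.tensor([label_to_id[p] for p in all_preds], device=fabric.device)

gathered = fabric.all_gather(pred_ids)  # shape: (world_size, len(all_preds))
m_preds = [id_to_label[i] for i in gathered.flatten().tolist()]  # back to strings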

awaelchli added the labels question (Further information is requested) and docs (Documentation related), and removed the labels bug (Something isn't working) and needs triage (Waiting to be triaged by maintainers) on Jun 27, 2024