[core][experimental] Support multiple readers for IntraProcessChannel #46431

Merged
merged 6 commits into from
Jul 10, 2024

Conversation

kevin85421
Member

@kevin85421 kevin85421 commented Jul 4, 2024

Why are these changes needed?

This PR enables IntraProcessChannel to be read more than once. Before this PR, the data caches in serialization_context would be removed if read once. This PR also adds a test to simulate the pattern of pipeline parallelism.
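As a rough illustration of the behavior change, the sketch below keeps a per-channel read counter so that a cached value survives until every reader has consumed it. `ReaderCountedCache`, `get_data`, and `has_data` are hypothetical names and are not Ray's `_SerializationContext` API; only the `set_data(channel_id, value, num_readers)` signature and the `num_readers > 0` check mirror the diff in this PR.

```python
from typing import Any, Dict, Tuple


class ReaderCountedCache:
    """Hypothetical stand-in for a per-process channel data cache."""

    def __init__(self) -> None:
        # Maps channel_id -> (cached value, number of reads still expected).
        self._entries: Dict[str, Tuple[Any, int]] = {}

    def set_data(self, channel_id: str, value: Any, num_readers: int) -> None:
        assert num_readers > 0, "num_readers must be greater than 0."
        self._entries[channel_id] = (value, num_readers)

    def get_data(self, channel_id: str) -> Any:
        value, remaining = self._entries[channel_id]
        remaining -= 1
        if remaining == 0:
            # The last expected reader has arrived, so drop the cached value.
            del self._entries[channel_id]
        else:
            self._entries[channel_id] = (value, remaining)
        return value

    def has_data(self, channel_id: str) -> bool:
        return channel_id in self._entries
```

With `num_readers=1` this sketch reproduces the old remove-on-first-read behavior; larger values keep the entry alive for the remaining readers.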

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
@kevin85421 kevin85421 changed the title [WIP] Support multiple readers for IntraProcessChannel [core][experimental] Support multiple readers for IntraProcessChannel Jul 5, 2024
@kevin85421 kevin85421 marked this pull request as ready for review July 5, 2024 18:21
Comment on lines +1192 to +1206
    # Worker 0: FFFBBB
    assert ray.get(worker_0.get_logs.remote()) == [
        "FWD rank-0, batch-0",
        "FWD rank-0, batch-1",
        "FWD rank-0, batch-2",
        "BWD rank-0, batch-0",
        "BWD rank-0, batch-1",
        "BWD rank-0, batch-2",
    ]
    # Worker 1: BBB
    assert ray.get(worker_1.get_logs.remote()) == [
        "BWD rank-1, batch-0",
        "BWD rank-1, batch-1",
        "BWD rank-1, batch-2",
    ]
Member

Nice! This is exactly what we need!
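The schedule comments in the test (`FFFBBB` for worker 0, `BBB` for worker 1) can be read as compact descriptions of the expected log lists. A minimal sketch, assuming each `F` is one forward step and each `B` is one backward step, with batches numbered per step type in order of appearance; `expected_logs` is a hypothetical helper and is not part of the PR's test code.

```python
from typing import List


def expected_logs(rank: int, schedule: str) -> List[str]:
    """Expand a schedule string such as "FFFBBB" into log lines."""
    prefixes = {"F": "FWD", "B": "BWD"}
    # Next batch index to assign for each step type.
    next_batch = {"F": 0, "B": 0}
    logs = []
    for step in schedule:
        logs.append(f"{prefixes[step]} rank-{rank}, batch-{next_batch[step]}")
        next_batch[step] += 1
    return logs
```

Under these assumptions, `expected_logs(0, "FFFBBB")` and `expected_logs(1, "BBB")` produce the two lists asserted in the test.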


      def set_use_external_transport(self, use_external_transport: bool) -> None:
          self.use_external_transport = use_external_transport

-     def set_data(self, channel_id: str, value: Any) -> None:
+     def set_data(self, channel_id: str, value: Any, num_readers: int) -> None:
+         assert num_readers > 0, "num_readers must be greater than 0."
Member

A newbie question here: Seems that there will be one _SerializationContext per DAG actor. If there's nobody reading the returned value of this node (this is a leaf node), will it raise an error?

Member Author

Seems that there will be one _SerializationContext per DAG actor.

You are correct.

If there's nobody reading the returned value of this node (this is a leaf node), will it raise an error?

For IntraProcessChannel, if there is no reader, the channel will not be created. You can see the logic in shared_memory_channel.py.

        if num_local_readers > 0:
            local_channel = IntraProcessChannel(num_local_readers)
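The guard quoted above can be sketched in isolation as follows. `FakeIntraProcessChannel` and `maybe_create_local_channel` are hypothetical stand-ins written for this illustration; they are not the classes or functions in `shared_memory_channel.py`.

```python
from typing import Optional


class FakeIntraProcessChannel:
    """Hypothetical stand-in; not Ray's IntraProcessChannel."""

    def __init__(self, num_readers: int) -> None:
        assert num_readers > 0, "num_readers must be greater than 0."
        self.num_readers = num_readers


def maybe_create_local_channel(
    num_local_readers: int,
) -> Optional[FakeIntraProcessChannel]:
    # Same shape as the quoted guard: with no local readers, no channel
    # is created, so the num_readers assertion is never reached.
    if num_local_readers > 0:
        return FakeIntraProcessChannel(num_local_readers)
    return None
```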

Member Author

For the leaf node, I think we currently don't do anything. Maybe we should raise a ValueError or log a warning.

Member

I see. From my experiments, these leaf nodes will not be executed.

It's not a blocker for now, since there will be no leaf nodes in PP. But for FSDP, there might be some collective calls that don't have return values and thus become leaf nodes.

Contributor

@ruisearch42 ruisearch42 left a comment

Overall LGTM

Signed-off-by: kaihsun <[email protected]>
Signed-off-by: kaihsun <[email protected]>
@kevin85421 kevin85421 added the go add ONLY when ready to merge, run all tests label Jul 9, 2024
Member

@woshiyyya woshiyyya left a comment

Thanks for fixing this!

Contributor

@ruisearch42 ruisearch42 left a comment

I thought you planned to raise an error for non-used leaf node in this PR?

@kevin85421
Member Author

I thought you planned to raise an error for non-used leaf node in this PR?

I will handle the leaf node in a separate PR. I think it is not relevant to this pull request.

Contributor

@ruisearch42 ruisearch42 left a comment

Approving the change to unblock train workload.

I think we should follow up to better understand edge cases such as leaf nodes (e.g., by adding more tests) and handle them properly (e.g., raise an error or support them).
Example leaf node scenario:

driver --> a --> b --> driver
           | 
           ----> c
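One way to flag such nodes is to look for nodes with no downstream consumers. The sketch below is only an illustration of the scenario in the diagram, assuming the DAG is given as a plain adjacency dict; `EXAMPLE_DAG` and `find_leaf_nodes` are hypothetical and do not reflect how Ray's compiled DAG represents or validates its graph.

```python
from typing import Dict, List

# Edges of the example above: driver -> a, a -> b, a -> c, b -> driver.
EXAMPLE_DAG: Dict[str, List[str]] = {
    "driver": ["a"],
    "a": ["b", "c"],
    "b": ["driver"],
    "c": [],
}


def find_leaf_nodes(dag: Dict[str, List[str]]) -> List[str]:
    """Return nodes with no downstream consumers, in sorted order."""
    return sorted(node for node, consumers in dag.items() if not consumers)
```

For the example graph, `c` is the only node whose output nobody reads. A follow-up could use a check of this kind either to raise an error or to decide that the node still needs to run.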

@kevin85421
Member Author

@woshiyyya @ruisearch42 I opened issue #46528 to track the progress.

@jjyao jjyao merged commit ca67a49 into ray-project:master Jul 10, 2024
6 checks passed
Labels
accelerated-dag core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests
6 participants