[Core] Add object back to memory store when object recovery is skipped #46460

Merged
merged 7 commits into from
Jul 8, 2024


jjyao
Collaborator

@jjyao jjyao commented Jul 6, 2024

Why are these changes needed?

The periodic CoreWorker.RecoverObjects removes every object it is about to recover from the memory store, on the expectation that ObjectRecoveryManager::RecoverObject will eventually add the object back once recovery completes. There is one case where the object is never added back: when recovery is skipped because the object already has a pinned or spilled copy. This PR fixes that by adding the object back to the memory store when recovery is skipped.
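The behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyRecoveryManager`, `periodic_recover_objects`, `IN_STORE_MARKER`, and the `readd_on_skip` flag are hypothetical names invented for illustration, and the dictionary stands in for the memory store. None of this is Ray's actual C++ code.

```python
# Toy model only: every name here is hypothetical and is not Ray's real API.
IN_STORE_MARKER = "available"  # placeholder entry meaning "a get can resolve this object"


class ToyRecoveryManager:
    def __init__(self, memory_store, pinned, readd_on_skip):
        self.memory_store = memory_store    # object id -> marker
        self.pinned = pinned                # ids that already have a pinned or spilled copy
        self.readd_on_skip = readd_on_skip  # True models the behaviour after this PR
        self.resubmitted = []               # ids whose creating task was resubmitted

    def recover_object(self, object_id):
        if object_id in self.pinned:
            # Recovery is skipped because a usable copy already exists.
            if self.readd_on_skip:
                # The fix: restore the entry that the periodic runner removed.
                self.memory_store[object_id] = IN_STORE_MARKER
            return "skipped"
        # Otherwise the object is recovered; in this toy model the entry would be
        # restored later, when the resubmitted task finishes (not modelled here).
        self.resubmitted.append(object_id)
        return "resubmitted"


def periodic_recover_objects(manager, objects_to_recover):
    """Remove each object from the store, then ask the manager to recover it."""
    results = {}
    for object_id in sorted(objects_to_recover):
        manager.memory_store.pop(object_id, None)
        results[object_id] = manager.recover_object(object_id)
    objects_to_recover.clear()
    return results


def run(readd_on_skip):
    """Return True if object b is still resolvable after one periodic run."""
    store = {"b": IN_STORE_MARKER}
    manager = ToyRecoveryManager(store, pinned={"b"}, readd_on_skip=readd_on_skip)
    periodic_recover_objects(manager, {"b"})
    return "b" in store
```

With `readd_on_skip=False` the entry for `b` is gone after the run, which corresponds to a get that never resolves; with `readd_on_skip=True` the entry is restored.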

Sequence of events:

  1. Task A runs on node 1 and generates objects a and b.
  2. Task B runs on node 2 and has object a as argument so a has a secondary copy on node 2.
  3. Task C runs on node 3 and has object b as argument so b has a secondary copy on node 3.
  4. Node 1 crashes. ObjectRecoveryManager recovers objects a and b by promoting their secondary copies to primary, so a's primary copy is now on node 2 and b's primary copy is on node 3.
  5. Node 2 crashes. ObjectRecoveryManager recovers object a by resubmitting task A.
  6. Node 3 crashes, object b is added to objects_to_recover_ in ReferenceCounter.
  7. Resubmitted task A runs on node 4 and finishes, so b's primary copy is on node 4.
  8. The periodic CoreWorker.RecoverObjects runner fires and the core worker calls RecoverObject(b), but b already has a primary copy (on node 4), so the recovery is skipped. Without this PR, b has already been removed from the memory store and is never added back, so any future get on b hangs forever.
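The eight steps above can be traced with a short simulation of where each object's copies live. This is a hedged sketch: the function name `simulate` and the dictionaries `primary` and `secondary` are hypothetical bookkeeping invented for illustration; only `objects_to_recover` mirrors a name from the description, and the code does not reflect how ReferenceCounter or ObjectRecoveryManager are actually implemented.

```python
# Toy trace of the scenario; all structures here are illustrative assumptions.
def simulate():
    primary = {}                # object id -> node holding the primary copy
    secondary = {}              # object id -> node holding a secondary copy
    objects_to_recover = set()  # objects queued for the periodic runner
    resubmitted_tasks = []      # tasks resubmitted for lineage reconstruction

    # 1. Task A runs on node 1 and generates a and b.
    primary.update({"a": 1, "b": 1})
    # 2-3. Tasks B and C pull a to node 2 and b to node 3.
    secondary["a"] = 2
    secondary["b"] = 3
    # 4. Node 1 crashes: secondary copies are promoted to primary.
    for obj in ("a", "b"):
        primary[obj] = secondary.pop(obj)
    # 5. Node 2 crashes: a has no copy left, so task A is resubmitted.
    primary.pop("a")
    resubmitted_tasks.append("A")
    # 6. Node 3 crashes: b is queued for recovery.
    primary.pop("b")
    objects_to_recover.add("b")
    # 7. Resubmitted task A finishes on node 4 and regenerates both objects.
    primary.update({"a": 4, "b": 4})
    # 8. The periodic runner finds b queued, yet b already has a primary copy,
    #    so its recovery is skipped.
    skipped = sorted(obj for obj in objects_to_recover if obj in primary)
    return primary, resubmitted_tasks, skipped
```

The trace shows why step 8 takes the skip path: b is still queued from step 6 even though step 7 has already produced a fresh primary copy on node 4.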

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Jul 6, 2024
jjyao added 4 commits July 6, 2024 22:32
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
@jjyao jjyao changed the title from "[Core] Only delete object from memory store if recovery happens" to "[Core] Add object back to memory store when object recovery is skipped" Jul 8, 2024
@jjyao jjyao marked this pull request as ready for review July 8, 2024 18:50
jjyao added 2 commits July 8, 2024 12:14
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
@jjyao jjyao merged commit 33ee732 into ray-project:master Jul 8, 2024
5 checks passed
@jjyao jjyao deleted the jjyao/hang branch July 8, 2024 21:57
GeneDer added a commit to GeneDer/ray that referenced this pull request Jul 9, 2024
Labels
go add ONLY when ready to merge, run all tests

3 participants