[core] lineage-reconstructed Non-deterministic generators hang callers #46425

Open
rynewang opened this issue Jul 3, 2024 · 0 comments
Labels
bug Something that is supposed to be working; but isn't core Issues that should be addressed in Ray Core p0.5 uueeehhh

rynewang commented Jul 3, 2024

What happened + What you expected to happen

If a generator task yields 100 object refs, the objects are then lost, and the generator is rerun and yields only 50, a caller waiting on the last 50 objects hangs until it fails with ObjectFetchTimedOutError.

E                       ray.exceptions.RayTaskError(ObjectFetchTimedOutError): ray::consumes() (pid=87331, ip=10.0.0.180)
E                         File "/Users/ruiyangwang/gits/ray/python/ray/tests/test_data_chaos.py", line 185, in consumes
E                           nums = ray.get(objs)
E                       ray.exceptions.ObjectFetchTimedOutError: Failed to retrieve object 16310a0f0a45af5cffffffffffffffffffffffff0100000034000000. To see information about where this ObjectRef was created in Python, set the environment variable RAY_record_ref_creation_sites=1 during `ray start` and `ray.init()`.
E                       
E                       Fetch for object 16310a0f0a45af5cffffffffffffffffffffffff0100000034000000 timed out because no locations were found for the object. This may indicate a system-level bug.
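
The failure mode above can be pictured with a toy model. This is a sketch only: a plain dict stands in for the object store, the `(task_id, index)` tuples are hypothetical stand-ins for object refs, and `run_generator` is an illustrative helper. None of it reflects Ray's actual implementation.

```python
# Toy model only: a dict stands in for the object store; this is not Ray's code.

def run_generator(task_id, num_items, store):
    """Simulate one execution of a generator task and return the refs it produced."""
    refs = []
    for index in range(num_items):
        ref = (task_id, index)  # hypothetical ref: (task, return index)
        store[ref] = index      # the value is now available in the store
        refs.append(ref)
    return refs


store = {}
caller_refs = run_generator("gen", 100, store)  # first run: caller holds 100 refs

store.clear()                    # node dies: data is lost, the refs survive
run_generator("gen", 50, store)  # re-execution yields only 50 items

# Refs the caller still holds that no execution will ever fill in again.
missing = [ref for ref in caller_refs if ref not in store]
print(len(missing), missing[0], missing[-1])
```

In this model the refs with return indices 50 through 99 have no producer after the rerun, so a fetch on any of them can only end in a timeout.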

Versions / Dependencies

master

Reproduction script

import os
import sys
import ray
import numpy as np
import pytest

@pytest.fixture
def short_timeout(monkeypatch):
    monkeypatch.setenv("RAY_fetch_fail_timeout_milliseconds", "1000")
    yield


def test_f(short_timeout, ray_start_cluster):
    """
    Tests non-deterministic generators against lineage reconstruction.
    Timeline:

    1. On the worker node, run a generator that yields 100 objects.
    2. Kill the worker node; the object refs still exist, but the data is lost.
    3. Submit a consumer task (it needs a worker node) that consumes all 100 objects.
    4. Start a new worker node so the consumer task and lineage reconstruction can run.
    5. Lineage reconstruction reruns the generator, which now yields only 50.
    6. Verify that the consumer task can still run (currently it cannot).
    """
    cluster = ray_start_cluster
    cluster.add_node(num_cpus=1, resources={"head": 1})
    cluster.wait_for_nodes()

    ray.init(address=cluster.address)

    @ray.remote(num_cpus=0, resources={"head": 0.1})
    class ValueHolder:
        def __init__(self, val):
            self.value = val

        def set(self, val):
            self.value = val
        
        def get(self):
            return self.value

    @ray.remote(num_cpus=1, resources={"worker": 1})
    def generates(value_holder):
        num = ray.get(value_holder.get.remote())
        print(f"generates {num}")
        for i in range(num):
            print(f"generating {i}")
            yield np.ones((1000, 1000), dtype=np.uint8) * i
        print(f"generated {num}")
    
    @ray.remote(num_cpus=1, resources={"worker": 1})
    def consumes(objs, expected_num):
        # Currently times out: ray.get raises ObjectFetchTimedOutError
        # ("no locations were found for the object").
        nums = ray.get(objs)

        assert len(nums) == expected_num
        print(f"consumes {len(nums)}")
        print(nums)
        return expected_num
    
    worker_node = cluster.add_node(num_cpus=10, resources={"worker": 10})
    cluster.wait_for_nodes()

    holder = ValueHolder.remote(100)
    gen = ray.get(generates.remote(holder))
    objs = list(gen)
    assert len(objs) == 100

    # kill the worker node
    cluster.remove_node(worker_node, allow_graceful=False)

    # Make sure gen only generates 50 now...
    ray.get(holder.set.remote(50))
    # ... but a consumer takes all 100
    consumer = consumes.remote(objs, 100)
    # start a new worker node
    worker_node = cluster.add_node(num_cpus=10, resources={"worker": 10})
    cluster.wait_for_nodes()

    ray.get(consumer)





if __name__ == "__main__":
    if os.environ.get("PARALLEL_CI"):
        sys.exit(pytest.main(["-n", "auto", "--boxed", "-vs", __file__]))
    else:
        sys.exit(pytest.main(["-sv", __file__]))
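
To make concrete what "non-deterministic" means in the script above, here is a plain-Python sketch with no Ray involved. `Holder` is a hypothetical stand-in for the `ValueHolder` actor, and `generates_from_state` / `generates_from_arg` are illustrative names. The first generator reads its length from mutable state at execution time, as `generates` does; the second takes the length as an argument. The issue does not establish whether passing the count by value avoids the hang in Ray. The sketch only shows why a rerun can differ from the first run.

```python
class Holder:
    """Plain-Python stand-in for the ValueHolder actor (hypothetical, no Ray)."""

    def __init__(self, value):
        self.value = value


def generates_from_state(holder):
    # The length is read at execution time, so a rerun can see a different value.
    for i in range(holder.value):
        yield i


def generates_from_arg(num):
    # The length is fixed by the argument, so every rerun yields the same count.
    for i in range(num):
        yield i


holder = Holder(100)
first_run = len(list(generates_from_state(holder)))
holder.value = 50  # state changes between the first run and the rerun
rerun = len(list(generates_from_state(holder)))

fixed_first = len(list(generates_from_arg(100)))
fixed_rerun = len(list(generates_from_arg(100)))

print(first_run, rerun, fixed_first, fixed_rerun)
```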

Issue Severity

High: It blocks me from completing my task.

@rynewang rynewang added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 3, 2024
@anyscalesam anyscalesam added the core Issues that should be addressed in Ray Core label Jul 8, 2024
@jjyao jjyao added p0.5 uueeehhh and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 8, 2024