[Data] Progress bar estimates are initially incorrect if read tasks yield multiple outputs #46420

Open
bveeramani opened this issue Jul 3, 2024 · 0 comments · May be fixed by #46601
Labels
bug Something that is supposed to be working; but isn't data Ray Data-related issues P2 Important issue, but not time-critical

@bveeramani
Member

What happened + What you expected to happen

Ray Data initially assumes that each read task produces exactly one block.

Running: 1/1 CPU, 0/0 GPU, 384.0MB/1.0GB object_store_memory:  80%|███████████████████████▏     | 8/10 [00:10<00:02,  1.13s/it]
- ReadRange->MapBatches(sleep): 1 active, 9 queued, [cpu: 1.0, objects: 256.0MB]:  90%|████████ | 9/10 [00:10<00:01,  1.10s/it

When a read task produces multiple outputs, you might see both the numerator and the denominator increase by one.

Running: 1/1 CPU, 0/0 GPU, 384.0MB/1.0GB object_store_memory: 100%|████████████████████████████| 61/61 [01:08<00:00,  1.09s/it]

This continues until Ray Data finally corrects its estimate when a task completes.

Running: 1/1 CPU, 0/0 GPU, 384.0MB/1.0GB object_store_memory:  27%|██████▊                  | 274/1000 [00:23<01:01, 11.87it/s]

This behavior is janky and confusing if you don't know what's going on under the hood.
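To make the three progress bars above easier to follow, here is a minimal sketch of an estimator that behaves the way this issue describes. It is a toy model, not Ray Data's actual implementation: the class `BlockCountEstimator` and its methods are hypothetical names, and the extrapolation rule (average blocks per finished task times the number of tasks) is an assumption inferred from the observed output.

```python
from dataclasses import dataclass


@dataclass
class BlockCountEstimator:
    """Hypothetical model of the progress bar total; not Ray Data's code."""

    num_tasks: int
    blocks_seen: int = 0      # numerator: every block produced so far
    finished_tasks: int = 0
    finished_blocks: int = 0  # blocks produced by tasks that have completed

    def on_block(self) -> None:
        """Record one output block from a running task."""
        self.blocks_seen += 1

    def on_task_finished(self, blocks_from_task: int) -> None:
        """Record that a task completed after emitting `blocks_from_task` blocks."""
        self.finished_tasks += 1
        self.finished_blocks += blocks_from_task

    @property
    def estimated_total(self) -> int:
        """Denominator shown in the progress bar."""
        if self.finished_tasks == 0:
            # Assumption: one block per task until a task completes, but the
            # total is never allowed to fall below the blocks already seen.
            return max(self.num_tasks, self.blocks_seen)
        # Assumption: extrapolate from the average output of finished tasks.
        per_task = self.finished_blocks / self.finished_tasks
        return max(self.blocks_seen, round(per_task * self.num_tasks))


est = BlockCountEstimator(num_tasks=10)
print(est.blocks_seen, est.estimated_total)  # 0 10

for _ in range(61):
    est.on_block()
print(est.blocks_seen, est.estimated_total)  # 61 61

for _ in range(39):
    est.on_block()
est.on_task_finished(blocks_from_task=100)
print(est.blocks_seen, est.estimated_total)  # 100 1000
```

Under these assumptions the total starts at 10, tracks the numerator once more than 10 blocks have arrived, and jumps to 1000 as soon as the first task reports that it produced 100 blocks.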

Versions / Dependencies

d14c95c

Reproduction script

import time

import numpy as np

import ray

ray.init(num_cpus=1)


target_block_size = ray.data.DataContext.get_current().target_max_block_size


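def sleep(batch):
    # Yield 100 outputs per input batch, each about one target block in size,
    # so every task produces many blocks rather than the single block that
    # Ray Data initially assumes.
    for _ in range(100):
        time.sleep(0.1)
        yield {"batch": np.zeros((target_block_size,), dtype=np.uint8)}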


ray.data.range(10, override_num_blocks=10).map_batches(
    sleep, batch_size=None
).materialize()

Issue Severity

None

@bveeramani bveeramani added bug Something that is supposed to be working; but isn't P2 Important issue, but not time-critical data Ray Data-related issues labels Jul 3, 2024
@scottjlee scottjlee self-assigned this Jul 8, 2024