
Ray on Spark fractional GPU error calling setup_ray_cluster #39537

Closed
Joseph-Sarsfield opened this issue Sep 11, 2023 · 5 comments · Fixed by #46443
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
P2: Important issue, but not time-critical

Comments

@Joseph-Sarsfield

What happened + What you expected to happen

  1. Bug: spark.task.resource.gpu.amount does not support fractional GPU values, which are required for running parallel Spark jobs on a GPU. See https://github.com/ray-project/ray/blob/master/python/ray/util/spark/cluster_init.py, line 1026:

     ```python
     num_spark_task_gpus = int(
         spark.sparkContext.getConf().get("spark.task.resource.gpu.amount", "0")
     )
     ```

  2. Suggested fix: ignore spark.task.resource.gpu.amount when num_gpus_worker_node is passed manually (see the sketch after this list).
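
A minimal sketch of one possible fix, assuming the intent is simply to tolerate fractional config values; the helper name parse_task_gpu_amount and the round-up policy are assumptions here, not necessarily what the upstream fix in #46443 does:

```python
import math

def parse_task_gpu_amount(conf_value: str) -> int:
    # spark.task.resource.gpu.amount may legitimately be fractional, e.g. "0.5",
    # so parse it as a float first; int("0.5") raises ValueError.
    # Rounding up to a whole GPU per Ray worker node is an assumed policy here.
    return math.ceil(float(conf_value))

assert parse_task_gpu_amount("0") == 0
assert parse_task_gpu_amount("2") == 2
assert parse_task_gpu_amount("0.5") == 1  # previously: ValueError
```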

Versions / Dependencies

All versions

Reproduction script

Set spark.task.resource.gpu.amount to a fractional value and pass a non-None num_gpus_worker_node in the call to setup_ray_cluster, as sketched below.
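
A minimal reproduction sketch, assuming a GPU-enabled Spark cluster; the config value and worker counts are illustrative, and the parameter names follow the setup_ray_cluster signature referenced in this report:

```python
from pyspark.sql import SparkSession
from ray.util.spark import setup_ray_cluster

spark = (
    SparkSession.builder
    .config("spark.task.resource.gpu.amount", "0.5")  # fractional GPUs per task
    .getOrCreate()
)

# Fails while parsing the Spark conf:
# ValueError: invalid literal for int() with base 10: '0.5'
setup_ray_cluster(num_worker_nodes=2, num_gpus_worker_node=1)
```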

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@Joseph-Sarsfield added the bug and triage (Needs triage: priority, bug/not-bug, and owning component) labels Sep 11, 2023
@Joseph-Sarsfield (Author)

Hi @WeichenXu123, this is related to Ray on Spark.

@jjyao added the core label Sep 25, 2023
@rkooo567 added the P2 label and removed the triage label Sep 25, 2023
@Joseph-Sarsfield (Author)

@jjyao do we have an update on this? "spark.task.resource.gpu.amount" can legitimately be a decimal value and shouldn't be used to set num_gpus_worker_node.

ValueError: invalid literal for int() with base 10: '0.5'
https://github.com/ray-project/ray/blob/master/python/ray/util/spark/cluster_init.py#L1026C44-L1026C73

@jjyao removed their assignment Jul 3, 2024
@WeichenXu123 (Contributor)

Hi, Ray on Spark doesn't support fractional GPUs yet; we can add support if you need it.

@Joseph-Sarsfield (Author)

@WeichenXu123 yes please, we have currently forked Ray to bypass the exception.

@WeichenXu123 (Contributor)

@Joseph-Sarsfield PR is out.
