Setup Commands not running with Ray Cluster on GCP #46451
Comments
@arandhaw when you say the worker nodes are not set up correctly, what are the symptoms?
I am experiencing the same issue.
My |
I think the setup commands are only run on the first worker that is created. I'm saying that because using
@jjyao let me clarify exactly what seems to occur. Since I install dependencies in the setup commands (e.g., "pip install torch"), my Ray jobs fail because none of the required libraries have been installed. To be clear, the problem is not that the commands are failing and raising error messages. They are not being run at all.
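One way to confirm this symptom on a given node is to check whether the expected libraries are importable at all. The following is a minimal sketch, not part of the original report: the helper name `missing_modules` is hypothetical, and "torch" is taken from the example in the comment above. It could be run directly on a worker over SSH, or wrapped in a Ray task to run on each node.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found on this node."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "torch" comes from the comment above; list whatever your setup_commands install.
    required = ["torch"]
    missing = missing_modules(required)
    if missing:
        print("Missing on this node:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If a library shows up as missing on a worker but not on the head node, that suggests the setup commands did not take effect on that worker.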
I may have solved my problem. It turns out the version of Ray installed on the head and worker nodes was 2.8.1, whereas the version on the cluster launcher was 2.32.0. I had assumed that Ray would install itself on the head and worker nodes, but I think it was using an older version that was part of the VM image. Adding "pip install -U ray[all]" to the setup commands seems to have fixed the problem. It would be nice if the documentation were clearer (or if Ray gave a meaningful error message).
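Since the mismatch described above produced no error message, a quick version check on each node can surface it. This is a rough sketch under stated assumptions: the helper names `installed_version` and `same_version` are hypothetical, and the expected version "2.32.0" is the launcher version reported in the comment above. It reads package metadata only, so it does not need to import Ray.

```python
from importlib import metadata


def installed_version(package="ray"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def same_version(expected, found):
    """True only when a version was found and it matches `expected` exactly."""
    return found is not None and found.strip() == expected.strip()


if __name__ == "__main__":
    expected = "2.32.0"  # launcher version reported in the comment above
    found = installed_version("ray")
    if same_version(expected, found):
        print(f"ray {found} matches the launcher")
    else:
        print(f"ray version mismatch: expected {expected}, found {found}")
```

Running this on the head node and on a worker should show whether the VM image ships an older Ray than the machine running the cluster launcher.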
What happened + What you expected to happen
I have been creating Ray clusters on cloud VMs in Google Cloud. I've been having issues with the setup_commands in the Ray cluster YAML file. These are supposed to run when new nodes are created.
The commands always run correctly on the head node. However, when the autoscaler creates new workers, sometimes one or both of the worker nodes are not set up correctly. No errors appear in the logs. The problem seems to appear and disappear at random.
The YAML file below is the configuration file I've been using. You'll need to change the in 3 places for your specific cloud project. I've been creating the clusters using ray up in the Google Cloud Shell, then SSH'ing into the head node to run scripts. The error first started appearing when the autoscaler added more than one worker.
Versions / Dependencies
Most recent version of Ray.
The cluster is created from the Google Cloud Shell.
Reproduction script
Issue Severity
High: It blocks me from completing my task.