
Setup Commands not running with Ray Cluster on GCP #46451

Open
arandhaw opened this issue Jul 5, 2024 · 5 comments
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
core-clusters: For launching and managing Ray clusters/jobs/kubernetes
P1: Issue that should be fixed within a few weeks

Comments

@arandhaw

arandhaw commented Jul 5, 2024

What happened + What you expected to happen

I have been creating Ray clusters on cloud VMs in Google Cloud, and I've been having issues with the setup_commands in the Ray cluster YAML file. These are supposed to run whenever a new node is created.

The commands always run correctly on the head node. However, sometimes when new workers are created by the autoscaler, one or both of the worker nodes are not set up correctly. No errors appear in the logs; the setup simply does not happen. It appears to work or stop working randomly.

The YAML file below is the configuration I've been using. You'll need to change the project ID placeholders in three places for your specific cloud project. I've been creating the clusters with ray up from the Google Cloud Shell, then SSH'ing into the head node to run scripts. The error first started appearing when the autoscaler added more than one worker.
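
For reference, the launch workflow described above looks roughly like this (a minimal sketch; cluster.yaml is just an illustrative name for the config file below):

# Launch (or update) the cluster from the Cloud Shell using the config below.
ray up cluster.yaml -y

# Open an SSH session on the head node to run scripts.
ray attach cluster.yaml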

Versions / Dependencies

Ray: most recent version.
The cluster is created from the Google Cloud Shell.

Reproduction script

# A unique identifier for the head node and workers of this cluster.
cluster_name: gpu-cluster

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-east1
    availability_zone: "us-east1-c"
    project_id: <the project ID>  # Globally unique project id
# The maximum number of workers nodes to launch in addition to the head
# node.
max_workers: 2

# The autoscaler will scale up the cluster faster with higher upscaling speed.
# E.g., if the task requires adding more nodes then autoscaler will gradually
# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
# This number should be > 0.
upscaling_speed: 2.0

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 20
# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu

# Tell the autoscaler the allowed node types and the resources they provide.
available_node_types:
    ray_head:
        # The resources provided by this node type.
        resources: {"CPU": 16}
        # Provider-specific config for the head node, e.g. instance type.
        node_config:
            machineType: n1-standard-16
            serviceAccounts:
              - email: "ray-autoscaler-sa-v1@<project name>.iam.gserviceaccount.com"
                scopes:
                 - "https://www.googleapis.com/auth/cloud-platform"
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231209-debian-11
            scheduling:
              - onHostMaintenance: TERMINATE

    ray_worker_gpu:
        # The minimum number of nodes of this type to launch.
        # This number should be >= 0.
        min_workers: 0
        # The maximum number of workers nodes of this type to launch.
        # This takes precedence over min_workers.
        max_workers: 2
        # The resources provided by this node type.
        resources: {"CPU": 8, "GPU": 1}
        # Provider-specific config for this node type, e.g. instance type. By default
        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
        # For more documentation on available fields, see:
        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
        node_config:
            machineType: n1-standard-8
            serviceAccounts:
              - email: "ray-autoscaler-sa-v1@<project-id>.iam.gserviceaccount.com"
                scopes:
                 - "https://www.googleapis.com/auth/cloud-platform"
            disks:
              - boot: true
                autoDelete: true
                type: PERSISTENT
                initializeParams:
                  diskSizeGb: 100
                  # See https://cloud.google.com/compute/docs/images for more images
                  sourceImage: projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231209-debian-11
            # Make sure to set scheduling->onHostMaintenance to TERMINATE when GPUs are present.
            # Workers run on preemptible instances by default (see the scheduling
            # block below); remove "preemptible: true" to use on-demand instances.
            guestAccelerators:
              - acceleratorType: nvidia-tesla-t4
                acceleratorCount: 1
            metadata:
              items:
                - key: install-nvidia-driver
                  value: "True"
            scheduling:
              - preemptible: true
              - onHostMaintenance: TERMINATE
# Specify the node type of the head node (as configured above).
head_node_type: ray_head

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {}
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",


# run before setup commands (also outside of any docker containers)
initialization_commands:
 - 'echo "Setup Commands Started" >> /home/ubuntu/logs.txt 2>&1'

# List of shell commands to run to set up nodes.
setup_commands:
  - 'echo "Setup Commands Started" >> /home/ubuntu/logs.txt 2>&1'
  - "pip3 install torch >> /home/ubuntu/logs.txt 2>&1"
  - "pip3 install torchvision >> /home/ubuntu/logs.txt 2>&1"
  - "pip3 install Pillow >> /home/ubuntu/logs.txt 2>&1"
  - "pip3 install requests >> /home/ubuntu/logs.txt 2>&1"
  - "pip3 install Flask >> /home/ubuntu/logs.txt 2>&1"

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - 'echo "Head Commands Started" >> /home/ubuntu/logs.txt 2>&1'
  - "pip install google-api-python-client==1.7.8"

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands:
  - 'echo "Worker command Started" >> /home/ubuntu/logs.txt 2>&1'

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Issue Severity

High: It blocks me from completing my task.

@arandhaw added the bug and triage labels on Jul 5, 2024
@anyscalesam added the core and core-clusters labels on Jul 8, 2024
@jjyao
Contributor

jjyao commented Jul 8, 2024

@arandhaw when you say the worker nodes are not set up correctly, what are the symptoms?

@jjyao added the P1 label and removed the triage label on Jul 8, 2024
@g-goessel

I am experiencing the same issue.
When I start one worker node at a time, it works fine.
If the autoscaler tries to start multiple nodes at once, it fails to run the setup commands. The result is that Ray is not installed on those workers.

2024-07-09 11:57:41,321 INFO updater.py:452 -- [5/7] Initializing command runner
2024-07-09 11:57:41,321 INFO updater.py:498 -- [6/7] No setup commands to run.
2024-07-09 11:57:41,322 INFO updater.py:503 -- [7/7] Starting the Ray runtime
2024-07-09 11:57:41,322 VINFO command_runner.py:371 -- Running export RAY_OVERRIDE_RESOURCES='{"CPU":1}';export RAY_HEAD_IP=10.128.0.62; ray stop
2024-07-09 11:57:41,322 VVINFO command_runner.py:373 -- Full command is ssh -tt -i ~/ray_bootstrap_key.pem -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o IdentitiesOnly=yes -o ExitOnForwardFailure=yes -o ServerAliveInterval=5 -o ServerAliveCountMax=3 -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_24bf68e341/c022e6b155/%C -o ControlPersist=10s -o ConnectTimeout=120s [email protected] bash --login -c -i 'source ~/.bashrc; export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && (export RAY_OVERRIDE_RESOURCES='"'"'{"CPU":1}'"'"';export RAY_HEAD_IP=10.128.0.62; ray stop)'

==> /tmp/ray/session_latest/logs/monitor.log <==
2024-07-09 11:57:41,488 INFO discovery.py:873 -- URL being requested: POST https://compute.googleapis.com/compute/v1/projects/animated-bit-413114/zones/us-central1-a/instances/ray-gcp-99498f66b357d8db-worker-5b1c305c-compute/setLabels?alt=json
2024-07-09 11:57:41,791 INFO node.py:348 -- wait_for_compute_zone_operation: Waiting for operation operation-1720526261534-61ccf3ca540ad-b590f7ab-89ab5f7d to finish...
2024-07-09 11:57:41,792 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/animated-bit-413114/zones/us-central1-a/operations/operation-1720526261534-61ccf3ca540ad-b590f7ab-89ab5f7d?alt=json
2024-07-09 11:57:46,930 INFO discovery.py:873 -- URL being requested: GET https://compute.googleapis.com/compute/v1/projects/animated-bit-413114/zones/us-central1-a/operations/operation-1720526261534-61ccf3ca540ad-b590f7ab-89ab5f7d?alt=json
2024-07-09 11:57:47,058 INFO node.py:367 -- wait_for_compute_zone_operation: Operation operation-1720526261534-61ccf3ca540ad-b590f7ab-89ab5f7d finished.

==> /tmp/ray/session_latest/logs/monitor.out <==
2024-07-09 11:57:47,058 ERR updater.py:171 -- New status: update-failed
2024-07-09 11:57:47,064 ERR updater.py:173 -- !!!
2024-07-09 11:57:47,064 VERR updater.py:183 -- Exception details: {'message': 'SSH command failed.'}
2024-07-09 11:57:47,065 ERR updater.py:185 -- Full traceback: Traceback (most recent call last):
File "/home/ubuntu/.local/share/pipx/venvs/ray/lib/python3.12/site-packages/ray/autoscaler/_private/updater.py", line 166, in run
self.do_update()
File "/home/ubuntu/.local/share/pipx/venvs/ray/lib/python3.12/site-packages/ray/autoscaler/_private/updater.py", line 531, in do_update
self.cmd_runner.run(
File "/home/ubuntu/.local/share/pipx/venvs/ray/lib/python3.12/site-packages/ray/autoscaler/_private/command_runner.py", line 379, in run
return self._run_helper(
^^^^^^^^^^^^^^^^^
File "/home/ubuntu/.local/share/pipx/venvs/ray/lib/python3.12/site-packages/ray/autoscaler/_private/command_runner.py", line 298, in _run_helper
raise click.ClickException(fail_msg) from None
click.exceptions.ClickException: SSH command failed.

My setup_commands block is not empty, and it successfully installed Ray on the head node.

@g-goessel

I think the setup commands are only run on the first worker that is being created.

I'm saying that because ray monitor revealed that my setup commands were being run on exactly one host, and all the others failed. Eventually, all the worker nodes were configured.
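
For anyone reproducing this, the autoscaler output can be followed from the machine running the cluster launcher, roughly like this (a sketch; cluster.yaml is a placeholder for the config file):

# Tail the autoscaler/monitor logs for the cluster; node updater messages
# such as "[6/7] No setup commands to run." show up here.
ray monitor cluster.yaml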

@arandhaw
Author

arandhaw commented Jul 10, 2024

@jjyao let me clarify exactly what seems to occur.
The problem occurs when the autoscaler creates new worker nodes.
What is supposed to happen is that during startup, the setup commands are run on the worker nodes.
Instead, what sometimes happens is that on some of the nodes, none of the setup commands are run.

Since I install dependencies in the setup commands (e.g., "pip install torch"), my Ray jobs fail because none of the required libraries have been installed.

To be clear, the problem is not that the commands are failing and raising error messages. They are not being run at all.
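
A quick way to check whether the setup ran at all on a given worker is to SSH in and look for the artifacts it should have left behind (a sketch, using the logs.txt path from the config above):

# On an affected worker node:
cat /home/ubuntu/logs.txt   # empty or missing if setup_commands never ran
pip3 show torch             # "not found" if the installs never happened
ray --version               # compare against the version on the cluster launcher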

@arandhaw
Author

I may have solved my problem. It turns out the version of Ray installed on the head and worker nodes was 2.8.1, whereas the version on the cluster launcher was 2.32.0. I had assumed that Ray would install itself on the head/worker nodes, but I think it was using an older version that was part of the VM image. Adding "pip install -U ray[all]" to the setup commands seems to have fixed the problem.

It would be nice if the documentation were clearer (or if Ray gave a meaningful error message).
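
For anyone hitting the same thing, the workaround amounts to something like the following in the cluster YAML (a sketch based on the comment above; pinning to the launcher's version, e.g. ray[all]==2.32.0, may be safer than -U):

setup_commands:
  # Make the nodes run the same Ray version as the cluster launcher,
  # rather than the older version that ships with the VM image.
  - 'pip3 install -U "ray[all]" >> /home/ubuntu/logs.txt 2>&1'
  # ...followed by the application dependencies (torch, torchvision, Pillow,
  # requests, Flask) from the original setup_commands above.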
