docs/resiliency/ray_resilient_jax.md
The first step to launching any workload using Ray is to start a Ray Cluster. This consists of a Ray head node and a number of "worker nodes". There is a slight abuse of terminology that can be a little confusing here with the usage of the word "node", but both Ray head nodes and Ray worker nodes can in general be thought of as sub-allocations within a physical node. This means that on a physical node with 8 GPUs and 128 CPUs, a Ray worker node could pertain to a sub-allocation of up to 8 GPUs and 128 CPUs, and similarly for the Ray head node. The coordinator process always runs on the Ray head node, and actors always run on Ray worker nodes. In this guide we will assume that each actor gets 1 GPU and 16 CPUs, and that we operate in a 1 GPU per process setting.
### Starting a Ray Cluster manually
We will begin with a simple example of how to manually start a Ray cluster on 2 physical nodes. This will involve a single Ray head node and 2 Ray worker nodes, where each Ray worker node is allocated all GPUs of the node it runs on. Moreover, each worker node will have as many actors scheduled on it as it has GPUs available, with each actor able to address only a single GPU. This ensures that we are indeed operating in a 1 GPU per process setting. We will assume the IP addresses of the physical nodes are `IP_ADDR_1` and `IP_ADDR_2`, and that the head node will be allocated on the physical node with `IP_ADDR_1`.
First, run the following script on one physical node:
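The script itself is not included in this excerpt. As a rough sketch of what manually starting such a cluster can look like (the port `6379` and the explicit resource flags are assumptions for illustration, not taken from the original):

```shell
# On the physical node with IP_ADDR_1: start the Ray head node.
# --num-cpus / --num-gpus bound the resources the head node advertises;
# the head node here hosts only the coordinator, so it claims no GPUs.
ray start --head --port=6379 --num-cpus=16 --num-gpus=0

# On each physical node that should host a Ray worker node: join the
# cluster, advertising all 8 GPUs and 128 CPUs of the machine.
ray start --address="IP_ADDR_1:6379" --num-cpus=128 --num-gpus=8
```

Actors requesting 1 GPU and 16 CPUs each are then scheduled by Ray onto these worker nodes, up to 8 per machine.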