
Commit e997826

clarify NUM_GPU actors per worker node
1 parent 1f54fd9 commit e997826

File tree

1 file changed (+4, -4 lines changed)


docs/resiliency/ray_resilient_jax.md

+4 -4
@@ -36,11 +36,11 @@ The first step to launching any workload using Ray is to start a Ray Cluster. Th
of "worker nodes". There is a slight abuse of terminology that can be a little confusing here with the usage of the word "node" but both Ray
head nodes and Ray worker nodes can in general be thought of as sub-allocations within a physical node. This means that on a physical
node with 8 GPUs and 128 CPUs, a Ray worker node could pertain to a sub-allocation of up to 8 GPUs and 128 CPUs. And similarly for the Ray head node. The coordinator process
-always runs on the Ray head node and actors always run on Ray worker nodes. In this guide we will assume that each actor gets 1 GPU, 16 CPUs and that we operate in a 1 process per GPU setting.
+always runs on the Ray head node and actors always run on Ray worker nodes. In this guide we will assume that each actor gets 1 GPU, 16 CPUs and that we operate in a 1 GPU per process setting.

### Starting a Ray Cluster manually

-We will begin with a simple example of how to manually start a Ray cluster on 2 physical nodes. This will involve a single Ray head node and 2 Ray worker nodes, where each Ray worker node is allocated all GPUs of the node it runs on. We will assume the IP addresses of the physical nodes are `IP_ADDR_1` and `IP_ADDR_2` and that the head node will be allocated on the physical node with `IP_ADDR_1`.
+We will begin with a simple example of how to manually start a Ray cluster on 2 physical nodes. This will involve a single Ray head node and 2 Ray worker nodes, where each Ray worker node is allocated all GPUs of the node it runs on. Moreover, each worker node will have as many actors scheduled on it as it has GPUs available, with each actor being able to address only a single GPU. This ensures that we are indeed operating in a 1 GPU per process setting. We will assume the IP addresses of the physical nodes are `IP_ADDR_1` and `IP_ADDR_2` and that the head node will be allocated on the physical node with `IP_ADDR_1`.

First, run the following script on one physical node:
```console
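
The "1 GPU per process" phrasing introduced above maps directly onto Ray's per-actor resource requests. As a minimal sketch in Python (not part of this commit; the `Worker` class and driver loop are illustrative and assume a Ray cluster with GPUs is already running), an actor matching the guide's assumed 1 GPU / 16 CPU allocation could be declared like this:

```python
import ray

# Minimal sketch, not part of this commit: each actor requests exactly 1 GPU and
# 16 CPUs, matching the guide's "1 GPU per process" assumption. Ray sets
# CUDA_VISIBLE_DEVICES for the actor process, so it can address only its own GPU.
@ray.remote(num_gpus=1, num_cpus=16)
class Worker:
    def gpu_ids(self):
        # The GPU IDs Ray assigned to this actor (a single ID in this setup).
        return ray.get_gpu_ids()

ray.init(address="auto")                       # attach to an already-running cluster
num_gpus = int(ray.cluster_resources().get("GPU", 0))
workers = [Worker.remote() for _ in range(num_gpus)]   # one actor per GPU
print(ray.get([w.gpu_ids.remote() for w in workers]))
```

Because each actor claims a whole GPU, Ray can place at most as many of these actors on a Ray worker node as that node has GPUs.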
@@ -594,9 +594,9 @@ max_worker_port=10257
for ((i = 1; i <= num_ray_worker_nodes; i++)); do
  node_i=${nodes_array[$i]}

-  srun --exact --nodes=1 --ntasks=1 --cpus-per-task=$((16 * gpus_per_node)) -w "$node_i" \
+  srun --exact --nodes=1 --ntasks=1 --cpus-per-task=$((16 * $gpus_per_node)) -w "$node_i" \
    ray start --address "$ip_head" \
-      --resources="{\"worker_units\": gpus_per_node}" \
+      --resources="{\"worker_units\": $gpus_per_node}" \
      --min-worker-port=$min_worker_port \
      --max-worker-port=$max_worker_port --block &
  sleep 3
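
The `+` lines above add the missing `$` expansions. Inside `$(( ))` bash resolves a bare variable name anyway, but inside the quoted `--resources` string the `$` is required, otherwise the literal text `gpus_per_node` (rather than its numeric value) would be passed as the `worker_units` capacity. Once each Ray worker node advertises as many `worker_units` as it has GPUs, having every actor consume one unit caps the actor count per worker node at NUM_GPU. A minimal Python sketch of that pattern (not from this commit; `TrainActor` is illustrative):

```python
import ray

# Minimal sketch, not part of this commit: "worker_units" is the custom resource
# registered on each Ray worker node via `ray start --resources`. Requesting one
# unit per actor (together with 1 GPU and 16 CPUs) means at most gpus_per_node of
# these actors can be scheduled on any single worker node.
@ray.remote(num_gpus=1, num_cpus=16, resources={"worker_units": 1})
class TrainActor:
    def node_ip(self):
        # Report which Ray worker node this actor was placed on.
        return ray.util.get_node_ip_address()

ray.init(address="auto")   # attach to the cluster started by the script above
total_units = int(ray.cluster_resources().get("worker_units", 0))
actors = [TrainActor.remote() for _ in range(total_units)]
print(sorted(ray.get([a.node_ip.remote() for a in actors])))
```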
