Commit 1f54fd9

minor edit
1 parent 06a6908 commit 1f54fd9

1 file changed, +2 -7 lines changed

docs/resiliency/ray_resilient_jax.md

@@ -480,7 +480,7 @@ When the coordinator determines an actor has hanged, it raises an exception that

 ## Launching a job Ray job

-With that we've described every important aspect of the coordinator and the actors that help them work together to achieve failure resilient training. The final piece of the puzzle is to launch the training job that leverages all the logic implemented in `RayClusterCoordinator`, `ResilientWorker` and `ModelTrainer`, on the Ray cluster. This is achieved through the following two scripts:
+With that we've described every important aspect of the coordinator and the actors that help them work together to achieve failure resilient training. The final piece of the puzzle is to launch the training job that leverages all the logic implemented in `RayClusterCoordinator`, `ResilientWorker` and `ModelTrainer`, on the Ray cluster. This is achieved through the following entrypoint (which could be in its own main.py file or all the code above as well as the entrypoint could be implemented in a large main.py file):

 ```python
 # main.py
@@ -503,12 +503,7 @@ cluster_coordinator.initialize_workers(jax_compilation_cache=job_runtime_env['ja
 run_results = asyncio.run(cluster_coordinator.run(restore=False))
 ```

-Now that we have:
-
-- Brought up a Ray Cluster
-- Implemented the Coordinator and Actor functionality
-
-We can launch a workload on the Ray cluster that runs the computation we want in a fault tolerant manner. The script that does so is called a driver script and looks as follows:
+and a script called the driver as shown below:

 ```python
 # launch_ray_cluster_job.py

0 commit comments
