Ray tutorial #1302

Open
wants to merge 15 commits into
base: main

keshavb96
This document presents a detailed tutorial on using Ray together with JAX to achieve fault-tolerant training.

Contributor

gspschmid left a comment

Tried this out and needed to make a few changes to get the container ready (see inline comments).

Once that was set up I ran into the following error on a viking node:

$ docker build -t ray_resiliency_example -f Dockerfile .
...
$ docker run --gpus=all --name resilient_jax --network=host --security-opt seccomp=unconfined --cap-add SYS_PTRACE -it --shm-size=50g --ulimit memlock=-1 ray_resiliency_example
root@viking-prod-283:/ray_resiliency_example# ./launch_ray_job.sh
...
redis.exceptions.ConnectionError: Error 111 connecting to 10.78.2.240:6380. Connection refused.

Full log here: https://gist.github.com/gspschmid/ff1d8e7873a5010d880cc8350bf314f1

Nvm, launch_ray_job.sh had the line that launches redis commented out; it seems to work after uncommenting that!
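
For anyone hitting the same "Connection refused" error, a small preflight check could make the cause more obvious before the Ray job is submitted. The following is only a minimal sketch, assuming that plain TCP reachability is a good enough signal that redis is up. `redis_reachable` is a hypothetical helper and is not part of this PR or of Ray.

```python
import socket


def redis_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only checks that something is
    listening on the port, not that the listener is actually redis.
    """
    try:
        # create_connection raises OSError (for example ConnectionRefusedError,
        # errno 111 on Linux) when nothing is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `redis_reachable("10.78.2.240", 6380)` (the address from the error above) before launching the job and aborting on `False` would point at a missing redis process rather than surfacing as a `redis.exceptions.ConnectionError` later on.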


gspschmid left a comment

One comment re triggering multiple non-rank-0 failures

keshavb96 marked this pull request as ready for review March 12, 2025 02:13