Add support for axlearn
#1285
@olupton
TODO:
…box into sbosisio/support_axlearn
Looking good. A few more comments, but we can refine later.
I think some of the error handling from
JAX-Toolbox/.github/workflows/_test_maxtext_k8s.yaml
Lines 74 to 90 in b0ec72a
      - name: Retrieve Kubernetes job status
        shell: bash -exo pipefail {0}
        run: |
          while readarray -d : -t status < <(kubectl get job/${JOB_NAME} -o 'jsonpath={.status.failed}:{.status.succeeded}'); do
            failure=${status[0]:-0}
            success=${status[1]:-0}
            total=$((failure+success))
            if [[ ${total} < 2 ]]; then
              sleep 1
            elif [[ ${total} == 2 ]]; then
              break
            else
              # FIXME
              exit 255
            fi
          done
          exit ${failure}
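If that status check gets reused for AXLearn, it might be worth factoring it into a small helper. Below is a minimal sketch under a few assumptions: the function names `classify_job_status` and `wait_for_job` are hypothetical and do not exist in JAX-Toolbox, and the expected pod count (2 in the snippet above) is taken as a parameter. It uses arithmetic comparison because `<` inside `[[ ]]` compares strings lexicographically in bash, which only happens to work for single-digit totals.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of JAX-Toolbox): classify the failed and
# succeeded pod counts reported by a Kubernetes Job.
# Prints "pending", "complete", or "unexpected".
classify_job_status() {
  local failure=${1:-0} success=${2:-0} expected=${3:-2}
  local total=$((failure + success))
  if (( total < expected )); then
    echo pending      # some pods have not finished yet
  elif (( total == expected )); then
    echo complete     # every pod has either failed or succeeded
  else
    echo unexpected   # more finished pods than expected
  fi
}

# Hypothetical wrapper that polls with the same kubectl query as the
# workflow step above. Returns the number of failed pods, or 255 if the
# reported counts exceed the expected total.
wait_for_job() {
  local job_name=$1 expected=${2:-2} status state
  while true; do
    readarray -d : -t status < <(kubectl get "job/${job_name}" \
      -o 'jsonpath={.status.failed}:{.status.succeeded}')
    state=$(classify_job_status "${status[0]:-0}" "${status[1]:-0}" "${expected}")
    case ${state} in
      pending) sleep 1 ;;
      complete) return "${status[0]:-0}" ;;
      *) return 255 ;;
    esac
  done
}
```

In a workflow step this could be called as `wait_for_job "${JOB_NAME}" 2`, with the step's exit code taken from the function's return value.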
Can you also add AXLearn to the README and plumb through its badge, etc.?
🚀