Add support for axlearn #1285

Merged (98 commits) on Mar 5, 2025
Steboss (Contributor) commented Feb 3, 2025

No description provided.

Steboss changed the title from "Sbosisio/support axlearn" to "Add support for axlearn" on Feb 4, 2025
Steboss requested review from olupton and yhtang on February 14, 2025 17:23
Steboss (Contributor, Author) commented Feb 27, 2025

@olupton
Some updates on the last push:

  • in store-delete-k8s-ghcr the token name is now built from ${RANDOM} plus the run ID and run attempt (echo "token-name=${RANDOM}-${GITHUB_RUN_ID}-${GITHUB_RUN_ATTEMPT}" >> $GITHUB_OUTPUT), so that each process gets its own unique token
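
For anyone who wants to try the naming scheme outside of Actions, here is a minimal sketch. It assumes that GITHUB_RUN_ID and GITHUB_RUN_ATTEMPT may be unset locally, so the defaults below and the fallback to standard output are additions for illustration and are not part of the PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Same three components as the workflow step. The ":-" defaults are an
# assumption for local runs, where the GitHub Actions variables are not set.
token_name="${RANDOM}-${GITHUB_RUN_ID:-0}-${GITHUB_RUN_ATTEMPT:-1}"

if [[ -n "${GITHUB_OUTPUT:-}" ]]; then
  # Inside GitHub Actions: expose the value as a step output.
  echo "token-name=${token_name}" >> "${GITHUB_OUTPUT}"
else
  # Outside GitHub Actions: just print the line.
  echo "token-name=${token_name}"
fi
```

Note that ${RANDOM} only yields values from 0 to 32767, so on its own it does not rule out collisions; the run ID and run attempt carry most of the uniqueness, and ${RANDOM} separates processes within the same run attempt.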

TODO:

  • Implement the S3 mount (https://github.com/awslabs/mountpoint-s3-csi-driver) in k8s jobs
  • at the moment we only test the fuji-1B model; test-fuji-model.sh needs a --model-name option to allow tests on other models
  • add a pre-commit file that runs yq formatting, ruff, and newline checks
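
The --model-name item could look roughly like the sketch below. The contents of test-fuji-model.sh are not shown in this thread, so the function name parse_model_name, the fuji-1B default, and the choice to ignore unknown arguments are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical parser; the real test-fuji-model.sh may handle arguments differently.
parse_model_name() {
  local model_name="fuji-1B"   # assumed default: the model currently under test
  while [[ $# -gt 0 ]]; do
    case "$1" in
      --model-name)
        # Form: --model-name VALUE
        [[ $# -ge 2 ]] || { echo "error: --model-name needs a value" >&2; return 1; }
        model_name="$2"
        shift 2
        ;;
      --model-name=*)
        # Form: --model-name=VALUE
        model_name="${1#--model-name=}"
        shift
        ;;
      *)
        shift   # ignore arguments this sketch does not know about
        ;;
    esac
  done
  echo "${model_name}"
}

parse_model_name "$@"
```

Called without arguments it prints the default, and called with --model-name it prints the supplied value, so existing invocations would keep their current behaviour.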

olupton (Collaborator) left a comment

Looking good. A few more comments, but we can refine later.

I think some of the error handling from

- name: Retrieve Kubernetes job status
  shell: bash -exo pipefail {0}
  run: |
    # Poll until failed + succeeded adds up to the 2 expected completions
    while readarray -d : -t status < <(kubectl get job/${JOB_NAME} -o 'jsonpath={.status.failed}:{.status.succeeded}'); do
      failure=${status[0]:-0}
      success=${status[1]:-0}
      total=$((failure+success))
      # -lt / -eq compare numerically; < inside [[ ]] compares strings
      if [[ ${total} -lt 2 ]]; then
        sleep 1
      elif [[ ${total} -eq 2 ]]; then
        break
      else
        # FIXME
        exit 255
      fi
    done
    exit ${failure}
might be worth including in a follow-up.
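
To make the decision logic in that step easier to reuse and test without a cluster, it could be pulled into a small function. This is only a sketch: the name classify_job_status and the idea of passing the expected completion count as a parameter are assumptions, not something present in the existing workflow.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper. $1 is the "failed:succeeded" string produced by the
# jsonpath query above, $2 is the number of completions the job should reach.
classify_job_status() {
  local expected="$2" failed succeeded
  IFS=: read -r failed succeeded <<< "$1"
  failed="${failed:-0}"         # kubectl leaves a field empty until it is set
  succeeded="${succeeded:-0}"
  local total=$(( failed + succeeded ))
  if (( total < expected )); then
    echo "pending"
  elif (( total == expected )); then
    echo "finished"
  else
    echo "unexpected"
  fi
}

classify_job_status ":" 2     # prints "pending"
classify_job_status "0:2" 2   # prints "finished"
classify_job_status "1:2" 2   # prints "unexpected"
```

The polling loop would then sleep on "pending", break on "finished", and fail on "unexpected", which mirrors the three branches of the original step.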

Can you also add AXLearn to the README and plumb through its badge, etc.?

olupton (Collaborator) left a comment

🚀

Steboss merged commit 45e0560 into main on Mar 5, 2025
118 of 121 checks passed
Steboss deleted the sbosisio/support_axlearn branch on March 5, 2025 12:34