Kubernetes mode terminating early or not terminating at all. #3578
Comments
Hello! Thank you for filing an issue. The maintainers will triage your issue shortly. In the meantime, please take a look at the troubleshooting guide for bug reports. If this is a feature request, please review our contribution guidelines.
Hey @Josh-Engle, can you please upgrade to the […]
@nikola-jokic I duplicated the issue in […]
Thank you for letting us know!
Hi @nikola-jokic, could you link the PR that fixes this issue? I'm just curious to see what it was.
Hi @nikola-jokic, we are actually seeing this error on […]. From the image above we can see that it exits in 5m 29s instead.

Here is a gist of the runner log: https://gist.github.com/genesis-jamin/774d115df441c3afdd755f73a3c499dc

You can grep the logs for "Finished process 170 with exit code 0" to see where the […]
I'm re-opening as we are actually still seeing this issue on […]
Someone else also reported the same issue in the container hooks repo: actions/runner-container-hooks#165. I'm going to copy my comment to that issue as well for posterity's sake.
We were able to root cause this: it turns out it's related to our k8s cluster setup. Our k8s cluster is hosted on GKE, and we noticed that every time a GitHub step would terminate early, it happened right after the cluster scaled down and evicted some […]

We were able to somewhat mitigate this issue by adding taints / tolerations so that […]

Another option for us is to disable autoscaling, but that defeats the purpose of using ARC in the first place 😆
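For anyone who wants to try the same mitigation, here is a minimal sketch of the taint/toleration approach described above. Everything in it is an assumption for illustration only: the taint key `arc-runners`, the node pool name `arc-runner-pool`, and the values layout follow the gha-runner-scale-set Helm chart's `template.spec`, so adapt it to your own cluster.

```yaml
# Hedged sketch, not the exact configuration from this thread.
# Idea: taint the autoscaled node pool so kube-system pods are not scheduled
# onto it (and are not evicted when it scales down), and let only the runner
# pods tolerate that taint.
#
# Assumed taint on the autoscaled node pool (set via the GKE node pool
# settings): key: arc-runners, value: "true", effect: NoSchedule
#
# gha-runner-scale-set values.yaml excerpt (pool and key names are made up):
template:
  spec:
    nodeSelector:
      cloud.google.com/gke-nodepool: arc-runner-pool   # assumed pool name
    tolerations:
      - key: arc-runners
        operator: Equal
        value: "true"
        effect: NoSchedule
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
```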
@jamin-chen's root cause is right on point, but the workaround didn't work for me. Looking a bit deeper into what that service does and how it works in GKE clusters, these references 1 2 helped me understand it a bit better. I've also noticed that this affected any […]

The reason only tainting nodes was not enough for me is that the […]. While poking around the service, I ended up finding a complementary action to the workaround. The replica count of those agents is managed by […]. Updating it to have fewer changes on smaller clusters, and adding taints to the node pools that scale up/down during jobs, can ensure those pods land on "stable" nodes.

This can be considered a safe change because it will not be reconciled by the addon manager, and we know that because it has an annotation […]

With those changes, I managed to reduce/eliminate the scaling events for […]
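For anyone attempting the same thing, here is a rough sketch of what that autoscaler override could look like. It assumes the agent replicas are driven by a cluster-proportional-autoscaler reading a ConfigMap in `kube-system`; the ConfigMap name and the numbers below are illustrative guesses, so check your own cluster before changing anything.

```yaml
# Hedged sketch only: verify the actual ConfigMap name and current values in
# your cluster first (e.g. kubectl -n kube-system get configmap | grep autoscaler).
apiVersion: v1
kind: ConfigMap
metadata:
  name: konnectivity-agent-autoscaler-config   # assumed name
  namespace: kube-system
data:
  # A larger nodesPerReplica means the replica count changes less often as
  # the cluster scales; the values below are placeholders, not recommendations.
  linear: |-
    {
      "coresPerReplica": 0,
      "nodesPerReplica": 16,
      "min": 2,
      "preventSinglePointFailure": true
    }
```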
I'm using an OKE cluster with […]. And when testing the […]

As a workaround, I print something every minute during the […]

I'm wondering if this behavior is related to the setting in […]
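A sketch of that keep-alive workaround, for reference. The step name, the heartbeat interval, and the placeholder `sleep 420` workload are all assumptions, not the exact job from this thread.

```yaml
# Hedged sketch of the "print something every minute" workaround: keep the
# step's log stream from going silent for minutes at a time.
steps:
  - name: Long quiet step with heartbeat output
    run: |
      # Emit a line every 60 seconds in the background.
      while true; do echo "heartbeat: $(date -u)"; sleep 60; done &
      HEARTBEAT_PID=$!

      # Placeholder for the real long-running, quiet command.
      sleep 420

      kill "$HEARTBEAT_PID"
```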
Hi, I'm experiencing the same on AKS v1.31.3 with […]. In my case, I'm seeing more "non-terminating" jobs than jobs stopped after 5 minutes. I have 2 different jobs that have this issue, and both of them have a step where nothing is printed in the console for several minutes.
Checks
Controller Version
0.9.0
Deployment Method
Helm
Checks
To Reproduce
Describe the bug
Even though the workflow should have slept for 7 minutes, it either completes successfully after about 4 minutes or it never completes.
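A minimal workflow sketch of the reproduction described above (the `runs-on` label is a placeholder for the actual runner scale set installation name):

```yaml
# Hedged repro sketch: a single quiet step that sleeps for 7 minutes on a
# Kubernetes-mode runner. "arc-runner-set" is a placeholder label.
name: sleep-repro
on: workflow_dispatch
jobs:
  sleep:
    runs-on: arc-runner-set
    steps:
      - name: Sleep for 7 minutes
        run: sleep 420
```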
Terminating Early: [screenshot of the workflow run finishing early]

Never Terminating: [screenshot of the workflow run hanging]
Describe the expected behavior
The workflow should have completed only after the sleep command completed.
Additional Context
Controller Logs
Runner Pod Logs