
Kubernetes mode terminating early or not terminating at all. #3578

Open
4 tasks done
Josh-Engle opened this issue Jun 5, 2024 · 12 comments
Labels
bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), question (Further information is requested)

Comments

@Josh-Engle

Controller Version

0.9.0

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Create a workflow with a job that sleeps for 7 minutes and uses an ARC runner scale set configured in Kubernetes mode.

Run the workflow.
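For illustration, a minimal sketch of such a workflow (the workflow name and job container image are arbitrary assumptions; `sandbox` matches the `runnerScaleSetName` from the values under Additional Context, and the job container causes the step to run in a workflow pod created by the container hook):

```yaml
# Hypothetical reproduction workflow, not the exact one used here.
name: k8s-mode-sleep-repro
on: workflow_dispatch
jobs:
  sleep-test:
    runs-on: sandbox          # ARC runner scale set in kubernetes mode
    container: ubuntu:22.04   # step runs in a workflow pod via the container hook
    steps:
      - name: Sleep for 7 minutes
        run: sleep 420
```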

Describe the bug

Even though the job should sleep for 7 minutes, it either completes successfully after 4 minutes OR it never completes.

Terminating Early: (screenshot)

Never Terminating: (screenshot)

Describe the expected behavior

The workflow should have completed only after the sleep command completed.

Additional Context

values:
  - containerMode:
      kubernetesModeWorkVolumeClaim:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
        storageClassName: gh-runner-class
      type: kubernetes
    controllerServiceAccount:
      name: arc-gha-rs-controller
      namespace: arc-systems
    githubConfigSecret: sandbox
    githubConfigUrl: https://github.com/****
    githubServerTLS:
      certificateFrom:
        configMapKeyRef:
          key: ca.crt
          name: crt
      runnerMountPath: /usr/local/share/ca-certificates/
    maxRunners: 200
    minRunners: 5
    proxy:
      http:
        url: '*******'
      https:
        url: '*******'
      noProxy:
      - '*******'
      - '*******'
      - '*******'
      - '*******'
      - '*******'
    runnerGroup: preprod
    runnerScaleSetName: sandbox
    template:
      spec:
        containers:
        - command:
          - /home/runner/run.sh
          env:
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "false"
          image: ghcr.io/actions/actions-runner:latest
          name: runner
        securityContext:
          fsGroup: 123

Controller Logs

[actions-runner-controller logs](https://gist.github.com/Josh-Engle/ea789d7eae1ca30e472fd90fba86bece)

Runner Pod Logs

[Runner Pod Logs](https://gist.github.com/Josh-Engle/86fcca504a8c3175e87b1a1efbd3f794)

### kubectl describe output for a job that never terminated

Name:             ****************.k8s-qrp6d-runner-9f8xw
Namespace:        arc-runners
Priority:         0
Service Account:  *************.k8s-gha-rs-kube-mode
Node:             aks-runnerpool-35648369-vmss000015/10.139.31.13
Start Time:       Wed, 05 Jun 2024 11:48:26 -0700
Labels:           actions-ephemeral-runner=True
                  actions.github.com/organization=********************
                  actions.github.com/scale-set-name=****************
                  actions.github.com/scale-set-namespace=arc-runners
                  app.kubernetes.io/component=runner
                  app.kubernetes.io/part-of=gha-runner-scale-set
                  app.kubernetes.io/version=0.9.0
                  pod-template-hash=67bb8777b7
Annotations:      actions.github.com/patch-id: 70
                  actions.github.com/runner-group-name: *****************
                  actions.github.com/runner-scale-set-name: ******************
                  actions.github.com/runner-spec-hash: cdff6bcb
Status:           Running
IP:               10.244.0.76
IPs:
  IP:           10.244.0.76
Controlled By:  EphemeralRunner/*******************.k8s-qrp6d-runner-9f8xw
Containers:
  runner:
    Container ID:  containerd://71795364d03d6e29cc0a7d40623120040354ab416e7139528b39841c164d47f1
    Image:         ghcr.io/actions/actions-runner:latest
    Image ID:      ghcr.io/actions/actions-runner@sha256:95db6fbb020b9f734e8a00389291dae766f0e6ad3d1171ae2d68e9ad8ac4a985
    Port:          <none>
    Host Port:     <none>
    Command:
      /home/runner/run.sh
    State:          Running
      Started:      Wed, 05 Jun 2024 11:48:50 -0700
    Ready:          True
    Restart Count:  0
    Environment:
      ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER:    false
      ACTIONS_RUNNER_CONTAINER_HOOKS:          /home/runner/k8s/index.js
      ACTIONS_RUNNER_POD_NAME:                 *********************.k8s-qrp6d-runner-9f8xw (v1:metadata.name)
      NODE_EXTRA_CA_CERTS:                     /usr/local/share/ca-certificates/ca.crt
      RUNNER_UPDATE_CA_CERTS:                  1
      ACTIONS_RUNNER_INPUT_JITCONFIG:          <set to the key 'jitToken' in secret '*************.k8s-qrp6d-runner-9f8xw'>  Optional: false
      GITHUB_ACTIONS_RUNNER_EXTRA_USER_AGENT:  actions-runner-controller/0.9.0
      http_proxy:                              <set to the key 'http_proxy' in secret '*************.k8s-qrp6d-754b578d-runner-proxy'>   Optional: false
      https_proxy:                             <set to the key 'https_proxy' in secret '*************.k8s-qrp6d-754b578d-runner-proxy'>  Optional: false
      no_proxy:                                <set to the key 'no_proxy' in secret '*************.k8s-qrp6d-754b578d-runner-proxy'>     Optional: false
    Mounts:
      /home/runner/_work from work (rw)
      /usr/local/share/ca-certificates/ca.crt from github-server-tls-cert (rw,path="ca.crt")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c6wc6 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  github-server-tls-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      crt
    Optional:  false
  work:
    Type:          EphemeralVolume (an inline specification for a volume that gets created and deleted with the pod)
    StorageClass:  gh-runner-class
    Volume:        
    Labels:            <none>
    Annotations:       <none>
    Capacity:      
    Access Modes:  
    VolumeMode:    Filesystem
  kube-api-access-c6wc6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>
Josh-Engle added the bug (Something isn't working), gha-runner-scale-set (Related to the gha-runner-scale-set mode), and needs triage (Requires review from the maintainers) labels on Jun 5, 2024
Contributor

github-actions bot commented Jun 5, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@nikola-jokic
Collaborator

Hey @Josh-Engle,

Can you please upgrade to 0.9.2 and let us know if you are still seeing this issue? Versions 0.9.0 and 0.9.1 can potentially kill the runner before it finishes the job, so I'm wondering if that is the source of the issue.
If you confirm that it is still happening on 0.9.2, can you please provide a runner log for a job that exits prematurely, and one for a job that never exits? I failed to reproduce this behavior on 0.9.2, so I'm wondering if the fix we introduced to the controller eliminated this problem.

nikola-jokic added the question (Further information is requested) label and removed the needs triage (Requires review from the maintainers) label on Jun 17, 2024
@Josh-Engle
Author

@nikola-jokic I reproduced the issue on 0.9.0 and 0.9.1, but it went away on 0.9.2, so you're spot on. Thanks for the help!

@nikola-jokic
Collaborator

Thank you for letting us know!

@genesis-jamin

Hi @nikola-jokic, could you link the PR that fixes this issue? I'm just curious to see what it was.

@genesis-jamin

genesis-jamin commented Jul 10, 2024

Hi @nikola-jokic, we are actually seeing this error on 0.9.3, where the job exits prematurely with exit code 0. We first noticed this in our pytest steps, where the step would finish with Completed status before all tests had run. I was able to reproduce the behavior with a step that runs sleep 6000:

(screenshot)

From the screenshot above, we can see that the step exits after 5m 29s instead.

Here is a gist to the runner log: https://gist.github.com/genesis-jamin/774d115df441c3afdd755f73a3c499dc

You can grep the logs for "Finished process 170 with exit code 0" to see where the sleep 6000 step ends. Let me know if you need anything else to help debug.

@Josh-Engle
Author

I'm re-opening this, as we are actually still seeing the issue on 0.9.2. The affected workflows were working correctly immediately after the upgrade, but they are now exhibiting the same behavior as before.

Josh-Engle reopened this Jul 10, 2024
@genesis-jamin

Someone else also reported the same issue in the container hooks repo: actions/runner-container-hooks#165

I'm going to copy my comment to that issue as well for posterity's sake.

@genesis-jamin

We were able to root-cause this: it turns out it's related to our k8s cluster setup. Our cluster is hosted on GKE, and we noticed that every time a GitHub step terminated early, it happened right after the cluster scaled down and evicted a konnectivity-agent pod. I am not a k8s networking expert myself, but it seems this pod is responsible for maintaining the websocket connection used when exec'ing into a pod (which is what the runner does to run commands inside the workflow pod).

We were able to somewhat mitigate this issue by adding taints/tolerations so that konnectivity-agent pods don't run on our pytest node pool (which scales up and down frequently); a rough sketch of this is shown below. This helps us avoid the case where a konnectivity-agent pod is evicted because its node scales down, but it does not solve the case where the konnectivity autoscaler scales down the number of replicas.
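For anyone trying the same approach, this is roughly what the runner side could look like in the scale set values (the taint key/value and node pool flag are illustrative assumptions, not the exact setup used here; workflow pods created by the container hook may also need the same toleration via a hook pod template):

```yaml
# Sketch only. Example node pool taint, applied when creating the pool, e.g.:
#   gcloud container node-pools create ci-pool --node-taints=dedicated=gha-ci:NoSchedule ...
# konnectivity-agent does not tolerate this taint, so those pods stay on other nodes.
template:
  spec:
    tolerations:
      - key: dedicated          # hypothetical taint key
        operator: Equal
        value: gha-ci           # hypothetical taint value
        effect: NoSchedule
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
```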

Another option for us is to disable autoscaling, but that defeats the purpose of using ARC in the first place 😆

@c-fteixeira

@jamin-chen's root cause is right on point, but the workaround didn't work for me. Looking a bit deeper into what that service does and how it works in GKE clusters, a couple of references (1, 2) helped me understand it a bit better.

I've also noticed that this affects any kubectl exec command, even from outside the cluster, and I was able to replicate it multiple times, so it seems more likely to be a GKE + k8s-client library problem.

The reason tainting nodes alone was not enough for me is that konnectivity-agent is responsible for the tunnel between worker and control-plane nodes even if there isn't an agent running on your specific node. That means if your connection is established through one of those pods and it is killed for any reason, the socket will close, and the promise from that exec will either resolve successfully or hang, with no reconnection (maybe this is the real problem/solution path, but reconnecting sockets and attaching to pre-existing buffer streams don't seem trivial).

While poking around the service, I ended up finding a complementary action to the workaround. The replica count of those agents is managed by konnectivity-agent-autoscaler; that service is a GKE default addon that takes its configuration from the configmap kube-system/konnectivity-agent-autoscaler-config, where we can change the scaling settings.

Updating it so the replica count changes less often on smaller clusters, and adding taints to the node pools that scale up/down during jobs, can ensure those pods land on "stable" nodes. This can be considered a safe change because it will not be reconciled by the addon manager; we know that because the configmap has the annotation addonmanager.kubernetes.io/mode=EnsureExists (ref).
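As an illustration of that configmap change (the values shown are examples only; check the existing data on your cluster first, since the exact keys come from the cluster-proportional-autoscaler used by the addon):

```yaml
# Sketch only. Inspect the current settings first:
#   kubectl -n kube-system get configmap konnectivity-agent-autoscaler-config -o yaml
# Raising nodesPerReplica and pinning a minimum reduces replica churn as nodes
# come and go; the field names assume the autoscaler's "linear" mode.
apiVersion: v1
kind: ConfigMap
metadata:
  name: konnectivity-agent-autoscaler-config
  namespace: kube-system
data:
  linear: |-
    {
      "nodesPerReplica": 16,
      "min": 3,
      "preventSinglePointFailure": true
    }
```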

With those changes, I managed to reduce/eliminate the scaling events for konnectivity-agent and, as such, reduce the occurrence of these errors to an acceptable level (they will still happen if your stable nodes fail while jobs are running).

@ywang-psee

ywang-psee commented Oct 10, 2024

I'm using an OKE cluster with gha-runner-scale-set version 0.9.3, and the runner is set to Kubernetes mode.

When testing the sleep 6000 command in a container created by the hook, we encountered the same issue: the step exits with code 0 after 5 minutes. I noticed that there are no logs (in the GitHub Actions workflow logs) during this 5-minute period.

As a workaround, I print something every minute during the sleep 6000, and that resolved the issue.
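Roughly what that workaround can look like as a workflow step (a sketch only; the interval and the backgrounded command are arbitrary, and any periodic output should do):

```yaml
steps:
  - name: Long-running command with periodic keep-alive output
    run: |
      # Run the real command in the background, then print something every
      # minute until it finishes so the log stream never goes silent.
      sleep 6000 &
      CMD_PID=$!
      while kill -0 "$CMD_PID" 2>/dev/null; do
        echo "still running at $(date -u +%H:%M:%S)"
        sleep 60
      done
      wait "$CMD_PID"
```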

I'm wondering if this behavior is related to the setting in runner-container-hooks, specifically job.spec.ttlSecondsAfterFinished = 300.

@AblionGE

Hi,

I'm experiencing the same on AKS v1.31.3 with gha-runner-scale-set version 0.10.1 and runners set to Kubernetes mode.

In my case, I'm seeing more "non-terminating" jobs than ones stopped after 5 minutes. I have 2 different jobs with this issue, and both of them have a step where nothing is printed to the console for several minutes.
