Scheduling gets stuck when a node daemon never cleans a startupTaint #7595

lobanov commented Jan 14, 2025

We observed that nodes using the AWS EFS CSI driver sometimes fail to clear the efs.csi.aws.com/agent-not-ready:NoExecute taint due to a transient driver failure (not exactly sure why; it happens rarely). When that happens, the pod never gets scheduled to the new node because of the taint. Karpenter does not provision a replacement node, nor does it deprovision the stuck node, believing it is fit for scheduling.
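For context, the taint is declared as a startupTaint on the NodePool, so Karpenter expects the EFS CSI node daemon to remove it once the driver is ready. Below is a trimmed sketch of that part of the configuration (only the taint itself is copied from our setup; the rest is the standard karpenter.sh/v1 NodePool shape):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: lab-cpu
spec:
  template:
    spec:
      # Startup taint that the EFS CSI driver's node daemon is expected to remove
      # once it is ready; until then Karpenter still treats the node as a valid
      # scheduling target for the pending pod.
      startupTaints:
        - key: efs.csi.aws.com/agent-not-ready
          effect: NoExecute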

Here is a sequence of events:

2025-01-14T01:17:15Z [Warning] 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
2025-01-14T01:17:16Z [Normal] Pod should schedule on: nodeclaim/lab-cpu-lm7x6
2025-01-14T01:22:36Z [Warning] 0/4 nodes are available: 1 node(s) had untolerated taint {efs.csi.aws.com/agent-not-ready: }, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
2025-01-14T01:25:46Z [Normal] Pod should schedule on: nodeclaim/lab-cpu-lm7x6, node/ip-10-64-47-242.ap-southeast-1.compute.internal
2025-01-14T01:26:30Z [Warning] skip schedule deleting pod: REDACTED

Logs from the Karpenter pod (contiguous block):

{"level":"INFO","time":"2025-01-14T01:17:16.792Z","logger":"controller","message":"found provisionable pod(s)","commit":"3298d91","controller":"provisioner","namespace":"","name":"","reconcileID":"480d0e1e-f95d-42d8-9728-e4b03473c6b7","Pods":"REDACTED","duration":"96.585701ms"}
{"level":"INFO","time":"2025-01-14T01:17:16.792Z","logger":"controller","message":"computed new nodeclaim(s) to fit pod(s)","commit":"3298d91","controller":"provisioner","namespace":"","name":"","reconcileID":"480d0e1e-f95d-42d8-9728-e4b03473c6b7","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2025-01-14T01:17:16.807Z","logger":"controller","message":"created nodeclaim","commit":"3298d91","controller":"provisioner","namespace":"","name":"","reconcileID":"480d0e1e-f95d-42d8-9728-e4b03473c6b7","NodePool":{"name":"lab-cpu"},"NodeClaim":{"name":"lab-cpu-lm7x6"},"requests":{"cpu":"3770m","memory":"26864Mi","pods":"6"},"instance-types":"m7i-flex.2xlarge, m7i-flex.4xlarge, m7i-flex.8xlarge"}
{"level":"INFO","time":"2025-01-14T01:17:18.726Z","logger":"controller","message":"launched nodeclaim","commit":"3298d91","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"lab-cpu-lm7x6"},"namespace":"","name":"lab-cpu-lm7x6","reconcileID":"a93f070a-159e-4056-9161-36fe5d8ed913","provider-id":"aws:///ap-southeast-1a/i-05592660cf21930d2","instance-type":"m7i-flex.2xlarge","zone":"ap-southeast-1a","capacity-type":"on-demand","allocatable":{"cpu":"7910m","ephemeral-storage":"89Gi","memory":"29317Mi","pods":"58","vpc.amazonaws.com/pod-eni":"18"}}
{"level":"INFO","time":"2025-01-14T01:17:38.107Z","logger":"controller","message":"registered nodeclaim","commit":"3298d91","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"lab-cpu-lm7x6"},"namespace":"","name":"lab-cpu-lm7x6","reconcileID":"b8918fbe-0fb0-4e98-acce-93a705a40d41","provider-id":"aws:///ap-southeast-1a/i-05592660cf21930d2","Node":{"name":"ip-10-64-47-242.ap-southeast-1.compute.internal"}}
{"level":"INFO","time":"2025-01-14T01:47:30.925Z","logger":"controller","message":"tainted node","commit":"3298d91","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-10-64-47-242.ap-southeast-1.compute.internal"},"namespace":"","name":"ip-10-64-47-242.ap-southeast-1.compute.internal","reconcileID":"b8caa83e-b903-402e-898e-fb3e3166b82a","taint.Key":"karpenter.sh/disrupted","taint.Value":"","taint.Effect":"NoSchedule"}
{"level":"INFO","time":"2025-01-14T01:47:31.982Z","logger":"controller","message":"deleted node","commit":"3298d91","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-10-64-47-242.ap-southeast-1.compute.internal"},"namespace":"","name":"ip-10-64-47-242.ap-southeast-1.compute.internal","reconcileID":"b343231e-b57c-444c-85c3-76279475975e"}
{"level":"INFO","time":"2025-01-14T01:47:32.219Z","logger":"controller","message":"deleted nodeclaim","commit":"3298d91","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"lab-cpu-lm7x6"},"namespace":"","name":"lab-cpu-lm7x6","reconcileID":"eacb1ad5-f4ff-4563-9180-2b07558c7f68","Node":{"name":"ip-10-64-47-242.ap-southeast-1.compute.internal"},"provider-id":"aws:///ap-southeast-1a/i-05592660cf21930d2"}

It seems Karpenter does not detect that pod placement has failed, so it does not create a new nodeclaim, and the pod remains earmarked for this node. The node eventually gets disrupted and deprovisioned after 30 minutes (something else was probably scheduled on it; I'm still going through the logs).

Edit: the node was manually deleted in the EC2 console by a team member. According to kube-scheduler logs, nothing else was scheduled on it, so it seems Karpenter didn't disrupt the node once it was empty, even though the nodepool is configured like this:

disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 1m

My questions:

  1. In the face of possible transient failures during node initialization, is there a way to de-provision nodes that do not clear a startup taint after some time, so that Karpenter can try to provision a suitable node again?
  2. Is there a way to troubleshoot this sort of situation further should it happen again?

Context:

  • AWS EKS 1.29
  • AWS EFS driver 2.0.4
  • Karpenter 1.1.1 installed via Helm chart
  • Bottlerocket AMI alias [email protected]