Scheduling gets stuck when a node daemon never cleans a startupTaint #7595

lobanov commented Jan 14, 2025

We observed that nodes using the AWS EFS CSI driver sometimes fail to clear the efs.csi.aws.com/agent-not-ready:NoExecute taint due to a transient driver failure (not exactly sure why; it happens rarely). When that happens, the pod never gets scheduled to the new node because of the taint. Karpenter does not provision a replacement node, nor does it deprovision the stuck node, believing it is fit for scheduling.
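For context, the taint is declared as a startupTaint on the NodePool, so Karpenter expects the EFS CSI node daemon to remove it once the driver is ready. Below is a trimmed sketch of that part of the configuration (only the taint itself is copied from our setup; the rest is the standard karpenter.sh/v1 NodePool shape):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: lab-cpu
spec:
  template:
    spec:
      # Startup taint that the EFS CSI driver's node daemon is expected to remove
      # once it is ready; until then Karpenter still treats the node as a valid
      # scheduling target for the pending pod.
      startupTaints:
        - key: efs.csi.aws.com/agent-not-ready
          effect: NoExecute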

Here is a sequence of events:

2025-01-14T01:17:15Z [Warning] 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
2025-01-14T01:17:16Z [Normal] Pod should schedule on: nodeclaim/lab-cpu-lm7x6
2025-01-14T01:22:36Z [Warning] 0/4 nodes are available: 1 node(s) had untolerated taint {efs.csi.aws.com/agent-not-ready: }, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
2025-01-14T01:25:46Z [Normal] Pod should schedule on: nodeclaim/lab-cpu-lm7x6, node/ip-10-64-47-242.ap-southeast-1.compute.internal
2025-01-14T01:26:30Z [Warning] skip schedule deleting pod: REDACTED

Logs from the Karpenter pod (contiguous block):

{"level":"INFO","time":"2025-01-14T01:17:16.792Z","logger":"controller","message":"found provisionable pod(s)","commit":"3298d91","controller":"provisioner","namespace":"","name":"","reconcileID":"480d0e1e-f95d-42d8-9728-e4b03473c6b7","Pods":"REDACTED","duration":"96.585701ms"}
{"level":"INFO","time":"2025-01-14T01:17:16.792Z","logger":"controller","message":"computed new nodeclaim(s) to fit pod(s)","commit":"3298d91","controller":"provisioner","namespace":"","name":"","reconcileID":"480d0e1e-f95d-42d8-9728-e4b03473c6b7","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2025-01-14T01:17:16.807Z","logger":"controller","message":"created nodeclaim","commit":"3298d91","controller":"provisioner","namespace":"","name":"","reconcileID":"480d0e1e-f95d-42d8-9728-e4b03473c6b7","NodePool":{"name":"lab-cpu"},"NodeClaim":{"name":"lab-cpu-lm7x6"},"requests":{"cpu":"3770m","memory":"26864Mi","pods":"6"},"instance-types":"m7i-flex.2xlarge, m7i-flex.4xlarge, m7i-flex.8xlarge"}
{"level":"INFO","time":"2025-01-14T01:17:18.726Z","logger":"controller","message":"launched nodeclaim","commit":"3298d91","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"lab-cpu-lm7x6"},"namespace":"","name":"lab-cpu-lm7x6","reconcileID":"a93f070a-159e-4056-9161-36fe5d8ed913","provider-id":"aws:///ap-southeast-1a/i-05592660cf21930d2","instance-type":"m7i-flex.2xlarge","zone":"ap-southeast-1a","capacity-type":"on-demand","allocatable":{"cpu":"7910m","ephemeral-storage":"89Gi","memory":"29317Mi","pods":"58","vpc.amazonaws.com/pod-eni":"18"}}
{"level":"INFO","time":"2025-01-14T01:17:38.107Z","logger":"controller","message":"registered nodeclaim","commit":"3298d91","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"lab-cpu-lm7x6"},"namespace":"","name":"lab-cpu-lm7x6","reconcileID":"b8918fbe-0fb0-4e98-acce-93a705a40d41","provider-id":"aws:///ap-southeast-1a/i-05592660cf21930d2","Node":{"name":"ip-10-64-47-242.ap-southeast-1.compute.internal"}}
{"level":"INFO","time":"2025-01-14T01:47:30.925Z","logger":"controller","message":"tainted node","commit":"3298d91","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-10-64-47-242.ap-southeast-1.compute.internal"},"namespace":"","name":"ip-10-64-47-242.ap-southeast-1.compute.internal","reconcileID":"b8caa83e-b903-402e-898e-fb3e3166b82a","taint.Key":"karpenter.sh/disrupted","taint.Value":"","taint.Effect":"NoSchedule"}
{"level":"INFO","time":"2025-01-14T01:47:31.982Z","logger":"controller","message":"deleted node","commit":"3298d91","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-10-64-47-242.ap-southeast-1.compute.internal"},"namespace":"","name":"ip-10-64-47-242.ap-southeast-1.compute.internal","reconcileID":"b343231e-b57c-444c-85c3-76279475975e"}
{"level":"INFO","time":"2025-01-14T01:47:32.219Z","logger":"controller","message":"deleted nodeclaim","commit":"3298d91","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"lab-cpu-lm7x6"},"namespace":"","name":"lab-cpu-lm7x6","reconcileID":"eacb1ad5-f4ff-4563-9180-2b07558c7f68","Node":{"name":"ip-10-64-47-242.ap-southeast-1.compute.internal"},"provider-id":"aws:///ap-southeast-1a/i-05592660cf21930d2"}

It seems Karpenter does not detect that pod placement has failed, so it does not create a new nodeclaim, and the pod remains earmarked for this node. The node eventually gets disrupted and deprovisioned after 30 minutes (something else was probably scheduled on it; I'm still going through the logs).

Edit: the node was manually deleted in the EC2 console by a team member. According to kube-scheduler logs, nothing else was scheduled on it, so it seems Karpenter didn't disrupt the node once it was empty, even though the nodepool is configured like this:

disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 1m

My questions:

  1. In the face of possible transient failures during node initialization, is there a way to de-provision nodes that do not clear a startup taint after some time, so that Karpenter can try to provision a suitable node again?
  2. Is there a way to troubleshoot this sort of situation further should it happen again?

Context:

  • AWS EKS 1.29
  • AWS EFS driver 2.0.4
  • Karpenter 1.1.1 installed via Helm chart
  • Bottlerocket AMI alias [email protected]