Description
Observed Behavior:
2023-10-26T16:47:09.627Z ERROR controller.provisioner creating scheduler, tracking topology counts, getting node ip-192-168-128-105.us-west-2.compute.internal, Node "ip-192-168-128-105.us-west-2.compute.internal" not found {"commit": "2012cf9"}
2023-10-26T16:47:19.627Z ERROR controller.provisioner creating scheduler, tracking topology counts, getting node ip-192-168-128-105.us-west-2.compute.internal, Node "ip-192-168-128-105.us-west-2.compute.internal" not found {"commit": "2012cf9"}
2023-10-26T16:47:29.628Z ERROR controller.provisioner creating scheduler, tracking topology counts, getting node ip-192-168-128-105.us-west-2.compute.internal, Node "ip-192-168-128-105.us-west-2.compute.internal" not found {"commit": "2012cf9"}
Expected Behavior:
Provisioning should not block on pods awaiting garbage collection.
Reproduction Steps (Please include YAML):
Here's what's happening:
Karpenter looks at other pods during scheduling if pod topology spread, pod affinity, or pod anti-affinity is defined.
We retrieve the Node for those pods using pod.Spec.NodeName.
However, if that Node no longer exists, Karpenter errors out and retries (see the sketch after these steps).
Karpenter attempts to drain pods from Nodes when they are terminated; however, pods that tolerate the NoSchedule/NoExecute taints cannot be drained, as they would immediately reschedule after eviction.
Karpenter then deletes the EC2 instance and the corresponding Node object.
Any pods that were on that node and could not be drained are leaked: they sit in the API server until they are garbage collected by the Kube Controller Manager.
In the meantime, Karpenter fails to schedule, since the node referenced by those pods can no longer be found.
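For illustration, here is a minimal sketch in Go of the lookup path described above. It is not Karpenter's actual source; the function name trackPodTopology and its wiring are assumptions, but it shows how resolving nodes via pod.Spec.NodeName surfaces the "Node not found" error once the node has been deleted.

```go
package topology

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// trackPodTopology is a hypothetical stand-in for the topology-counting step.
// It resolves each scheduled pod's node by name and fails if the node is gone.
func trackPodTopology(ctx context.Context, kubeClient client.Client, pod *v1.Pod) error {
	if pod.Spec.NodeName == "" {
		return nil // pod is not bound to a node yet; nothing to count
	}
	node := &v1.Node{}
	if err := kubeClient.Get(ctx, client.ObjectKey{Name: pod.Spec.NodeName}, node); err != nil {
		// A leaked pod whose node was already deleted lands here with a
		// "not found" error, which aborts scheduling and gets retried,
		// producing the repeated errors shown in the log above.
		return fmt.Errorf("getting node %s, %w", pod.Spec.NodeName, err)
	}
	// ...record the node's topology domain (zone, hostname, etc.) in the counts.
	return nil
}
```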
There are two paths forward:
[Short Term] I can make a change to our topology logic to ignore pods if their node cannot be found; this will prevent provisioning from locking up in this edge case (a sketch of this change follows below).
[Longer Term] We want to address this holistically via Mega Issue: Node Disruption Lifecycle Taints #624, which changes how nodes are deregistered from the API server so that pods no longer leak.
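A minimal sketch of the short-term change, under the same assumptions as the sketch above (hypothetical trackPodTopology helper, controller-runtime client): a "not found" error is treated as "skip this pod" rather than a scheduling failure, while other API errors still propagate and retry.

```go
package topology

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// trackPodTopologyIgnoringMissingNodes is a hypothetical variant of the earlier
// sketch that skips pods whose node has already been deleted.
func trackPodTopologyIgnoringMissingNodes(ctx context.Context, kubeClient client.Client, pod *v1.Pod) error {
	if pod.Spec.NodeName == "" {
		return nil
	}
	node := &v1.Node{}
	if err := kubeClient.Get(ctx, client.ObjectKey{Name: pod.Spec.NodeName}, node); err != nil {
		if apierrors.IsNotFound(err) {
			// The node is gone: this is a leaked pod awaiting garbage
			// collection, so ignore it rather than blocking provisioning.
			return nil
		}
		// Any other API error is still surfaced and retried as before.
		return fmt.Errorf("getting node %s, %w", pod.Spec.NodeName, err)
	}
	// ...record the node's topology domain in the counts as before.
	return nil
}
```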
Versions:
Kubernetes Version (kubectl version): 1.27