Karpenter fails to scale down to desired state, Indefinitely blocked by PDB #7586
Comments
Do you mind sharing your NodePool configuration (for 0.36.2, 0.36.8 and 1.0.8) and logs from when this happened?
I don't have logs from when this originally happened, as we don't retain our logs for long periods of time.
Sure, replicating this behavior and providing logs from that attempt works. If you are replicating this on Karpenter v1.0.8, can you share the NodePool configuration for it as well?
I tried to reproduce the behavior that you are seeing with the NodePool spec that you shared earlier. What is the total number of worker nodes you expect to have in the cluster? How many pods are running on each worker node? I see that you have shared a screenshot of memory usage across the nodes. Why not monitor the actual node count to see whether there's actually an increase in the number of nodes?
This behavior shows up even at a smaller scale. During my stress test on Karpenter 1.0.8, I simulated a production spike by creating 200 pods, each requiring 20 GB of memory. Our actual production spikes are larger, but this setup closely approximates the scenario. After ensuring all pods were ready, I waited a few minutes before deleting them. The expected behavior was for Karpenter to gradually scale the cluster back down to its original size. However, as shown in the image I posted, Karpenter does scale down, but stops after a few hours. Despite the resource requests returning to pre-spike levels, the cluster remains 2–3 times larger than necessary.
Why is node count relevant here? If a spike results in 16 pods, each requesting 4 GB of memory, why does it matter whether Karpenter decides to create one large node or four smaller ones? Shouldn't we focus on the total capacity of the nodes in the cluster and compare it to the actual resource requests instead?
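For context, a minimal sketch of the kind of stress deployment described above (200 replicas, each requesting 20 GB of memory). The name, image, and labels are placeholders, not the manifests actually used in the test:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: karpenter-stress            # placeholder name
spec:
  replicas: 200                     # simulated production spike
  selector:
    matchLabels:
      app: karpenter-stress
  template:
    metadata:
      labels:
        app: karpenter-stress
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # placeholder workload; only the request matters
          resources:
            requests:
              memory: 20Gi          # large request so Karpenter must provision new nodes
```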
I am not sure I completely understand the issue here. From the title of the issue, I thought the problem was Karpenter not expiring nodes due to blocking PDBs. But from your previous message it looks like the issue is that you deleted the pods and Karpenter didn't scale down? Can you help me understand what issue we are trying to solve here, and also share logs if you had a chance to reproduce this?
Thank you for the question, and I realize I wasn't clear earlier, my apologies for that. Let me clarify: it's a combination of both issues. On one hand, we observed Karpenter failing to expire nodes.
Yes, I think updating the title of the issue would be really helpful. So now if I understand the issue that you are running into -
The description of the issue is a little confusing as well, because 5-day node expiry is different from a user deleting the pods on the cluster. I think they are unrelated in this case. Also, if you were able to reproduce this, can you please share the logs? Thank you.
@omri-kaslasi I have the opposite problem on 1.0.8 - nodes are expired although a PDB is set. So double-check whether you are really running karpenter>=1.0.
Correct, except point 4.
Replicated this behavior again on a cluster with Karpenter 1.0.8 and I see the same behavior: Karpenter starts scaling down after the peak, but the cluster's total capacity is larger than it should be. In the attached picture I manually created a peak at 9:30 and reduced it at around 10:15 (at 11:00 I returned the cluster to how it was prior to my manual peak).
@omri-kaslasi We are also facing a similar issue, but in our case nodes are not consolidated due to a topology constraint. When the nodes are not scaled up, the pods are scheduled by kube-scheduler onto existing nodes even though maxSkew is 1, since it honors ScheduleAnyway. But when the nodes are scaled up, Karpenter's simulation honors the maxSkew over ScheduleAnyway and does not consolidate. Topology constraint:
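(The original constraint was not captured here; a constraint of the shape described, with maxSkew 1 and ScheduleAnyway, would look roughly like the sketch below. The topologyKey and labels are placeholders, not the actual values from that cluster.)

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # placeholder key
    whenUnsatisfiable: ScheduleAnyway          # soft constraint for kube-scheduler
    labelSelector:
      matchLabels:
        app: example                           # placeholder selector
```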
We don't have topologySpreadConstraints defined; I'll take a look at what the default values are and whether they might have an impact.
Bump so the issue won't get closed as stale.
@omri-kaslasi , |
Description
Observed Behavior:
Karpenter is not removing expired nodes, even though we have expireAfter (5 days) configured. The issue appears to be related to PodDisruptionBudgets (PDBs) blocking node consolidation attempts.
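For context, this is roughly the shape of the expiry configuration described, as a sketch assuming the v1beta1 NodePool API used by Karpenter 0.36.x (on v1.0.x the field moves under the NodeClaim template). The actual NodePool spec is not reproduced here; everything except expireAfter and the nodepool name is illustrative:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: node-mixed                  # nodepool name taken from the graphs in Edit 1; the rest is illustrative
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 120h               # 5 days
  template:
    spec:
      nodeClassRef:
        name: default               # placeholder node class
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
```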
I reviewed the blocking PDBs and they are configured correctly (the selector is OK, maxUnavailable is set to 25%).
Example of one of the PDBs blocking:
If I'm not mistaken, 4 disruptions allowed means 4 pods can be stopped.
To confirm we are not in an edge case where 4 pods affected by this PDB are on the same node, I checked and saw only 1 pod on the relevant node.
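For reference, a PDB of the shape described (maxUnavailable 25% with a selector matching the workload) would look roughly like the sketch below; the name and labels are placeholders, not the actual manifest from the cluster. With, say, 16 matching pods, 25% works out to the 4 allowed disruptions mentioned above.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb                 # placeholder name
spec:
  maxUnavailable: "25%"             # e.g. with 16 matching pods this allows 4 disruptions
  selector:
    matchLabels:
      app: example                  # placeholder; the real selector matches the workload's labels
```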
I fetched Karpenter events and see what look like multi-node consolidation attempts (color-matched node with nodeclaim).
I don't see any standalone single-node consolidation attempts in the events, which should be able to remove the expired nodes (I reviewed multiple nodes; none of them would be blocked by a PDB if they were disrupted by themselves).
I attempted to find out whether single-node consolidation is disabled but haven't found anything in the logs/documentation. I enabled debug logging on Karpenter, but that didn't provide any relevant information.
Edit 1:
Tried on a cluster with Karpenter 1.0.8, same issue. Adding graphs from Datadog:
Right graph: the memory request spikes were manually caused by me.
Left graph: sum of memory across all nodes in the nodepool (node-mixed is the relevant nodepool). node-mixed spiked as expected, but more than 2 hours later we still have more than double the capacity we had before the spike (and describing the nodeclaims shows the PDB-related blocking stated above).
Expected Behavior:
Nodes with an expired expireAfter configuration should be removed, even when PDBs are in place, as long as the disruptions are within the allowed limits. According to the documentation, single-node consolidation is supposed to run on all nodes, which doesn't seem to happen.
Observed on Karpenter 0.36.2; upgraded to 0.36.8 but didn't see a change in behavior.
Reproduction Steps (Please include YAML):
Have multiple deployments with a large replica count and a PDB. Increase the replica count (to trigger Karpenter to create additional nodes), wait a few minutes, then reduce the replica count. Karpenter will begin scaling down and will remove some of the nodes, but not all.
I've generated generic YAMLs for this and am attaching them:
KarpenterStress.txt
Versions:
Kubernetes Version (kubectl version): 1.29