
Karpenter fails to scale down to desired state, Indefinitely blocked by PDB #7586

Open
omri-kaslasi opened this issue Jan 13, 2025 · 16 comments
Labels: bug (Something isn't working), needs-triage (Issues that need to be triaged)

@omri-kaslasi commented Jan 13, 2025

Description

Observed Behavior:
Karpenter is not removing expired nodes, even though we have expireAfter (5 days) configured.
[Screenshot: 2025-01-12 18:29:40]
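For reference, in the v1beta1 API that 0.36.x uses, the 5-day expiry sits under spec.disruption; a minimal illustrative fragment (not our actual NodePool spec) looks like this:

    # Illustrative v1beta1 NodePool fragment, assuming Karpenter 0.36.x
    spec:
      disruption:
        expireAfter: 120h   # 5 days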

The issue appears to be related to PodDisruptionBudgets (PDBs) blocking node consolidation attempts.
[Screenshot: 2025-01-12 18:29:00]

I reviewed the blocking PDBs and they are configured correctly (the selector is correct and maxUnavailable is set to 25%).
Example of one of the blocking PDBs:
[Screenshot: PDB details, generation 1]
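For context, a minimal sketch of a PDB like the blocking ones (the names and pod counts here are illustrative, not taken from our manifests); with 16 ready pods, maxUnavailable: 25% works out to 4 allowed disruptions:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: example-app-pdb          # hypothetical name
    spec:
      maxUnavailable: "25%"
      selector:
        matchLabels:
          app: example-app           # hypothetical selector
    status:                          # written by the controller; shown for illustration
      expectedPods: 16
      currentHealthy: 16
      desiredHealthy: 12
      disruptionsAllowed: 4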

If I'm not mistaken, 4 disruptions allowed means 4 pods covered by this PDB can be evicted.
To confirm we are not in an edge case where 4 pods covered by this PDB sit on the same node, I checked and saw only 1 such pod on the relevant node.
[Screenshot: 2025-01-13 11:13:01]

I fetched the Karpenter events and see what look like multi-node consolidation attempts (nodes are color-matched with their nodeclaims):
[Screenshot: Karpenter events]

I don't see any standalone single-node consolidation attempts in the events, which should be able to remove the expired nodes (I reviewed multiple nodes; none of them would be blocked by a PDB if they were disrupted on their own).

I attempted to find out whether single-node consolidation is disabled, but haven't found anything in the logs or documentation. I enabled debug logging on Karpenter, but that didn't provide any relevant information.

Edit 1:
I tried this on a cluster with Karpenter 1.0.8 and hit the same issue. Adding graphs from Datadog:
Right graph: the memory-request spikes were caused manually by me.
Left graph: the sum of memory across all nodes in the nodepool (node-mixed is the relevant nodepool). node-mixed spiked as expected, but more than 2 hours later we still have more than double the capacity we had before the spike (and describing the nodeclaims shows they are blocked by PDBs, as stated above).
[Image: Datadog graphs]

Expected Behavior:
Nodes that have passed their expireAfter should be removed, even when PDBs are in place, as long as the disruptions stay within the allowed limits.
According to the documentation, single-node consolidation is supposed to consider all nodes, but it doesn't seem to run.
Observed on Karpenter 0.36.2; upgraded to 0.36.8 but didn't see a change in behavior.

Reproduction Steps (Please include YAML):
Have multiple deployments with a large replica count and a PDB. Increase the replica count (to trigger Karpenter creating additional nodes), wait a few minutes, then reduce the replica count (Karpenter will begin scaling down and remove some of the nodes, but not all).
I've generated generic YAMLs for this and am attaching them:
KarpenterStress.txt
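A minimal sketch of the kind of deployment/PDB pair described above (the real manifests are in the attached KarpenterStress.txt; the names, image, and sizes here are placeholders):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: stress-app                          # placeholder name
    spec:
      replicas: 50                              # scale up to force new nodes, then back down
      selector:
        matchLabels:
          app: stress-app
      template:
        metadata:
          labels:
            app: stress-app
        spec:
          containers:
            - name: pause
              image: registry.k8s.io/pause:3.9  # placeholder workload
              resources:
                requests:
                  memory: 4Gi                   # placeholder request, large enough to matter
    ---
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: stress-app-pdb
    spec:
      maxUnavailable: "25%"
      selector:
        matchLabels:
          app: stress-app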

Versions:

  • Chart Version: 0.36.2 and 0.36.8
  • Kubernetes Version (kubectl version): 1.29
@omri-kaslasi added the bug (Something isn't working) and needs-triage (Issues that need to be triaged) labels Jan 13, 2025
@omri-kaslasi (Author) commented Jan 13, 2025

Update:
I tried this on a cluster with Karpenter 1.0.8 and hit the same issue. Adding graphs from Datadog:
Right graph: the memory-request spikes were caused manually by me.
Left graph: the sum of memory across all nodes in the nodepool (node-mixed is the relevant nodepool). node-mixed spiked as expected, but more than 2 hours later we still have more than double the capacity we had before the spike (and describing the nodeclaims shows they are blocked by PDBs, as stated above).

[Image: Datadog graphs]

@jigisha620 (Contributor) commented:

Do you mind sharing your NodePool configuration (for 0.36.2, 0.36.8, and 1.0.8) and logs from when this happened?

@omri-kaslasi (Author) commented Jan 14, 2025

I don't have logs from when this originally happened, as we don't retain our logs for long periods of time.
Would replicating this behavior (as I wrote in my update) and providing logs from that attempt help?
If so, I'll provide them, but I'll need to go over the logs and probably censor a lot of names/sections.
Adding the NodePool configuration:

nodepool.txt

@jigisha620 (Contributor) commented Jan 15, 2025

Sure, replicating this behavior and providing logs from that attempt works. If you are replicating this on Karpenter v1.0.8, can you share nodepool configuration for it as well?

@jigisha620 (Contributor) commented:

I tried to reproduce the behavior that you are seeing with the nodepool spec that you shared earlier, by setting the nodepool budget to 20% and the PDB to 25%. I had 4 worker nodes, each running 1 pod that is counted by the PDB when calculating availability. That means that 1 node can expire and get terminated, as allowed by both the PDB and the nodepool budget. That is what I see happening, and there's no issue on my end.

What is the total number of worker nodes you expect to have in the cluster? How many pods are running on each worker node? I see that you have shared a screenshot of memory usage across the nodes. Why not monitor the actual node count to see whether there is actually an increase in the number of nodes?
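For reference, a nodepool budget like the one used in this test maps to roughly the following v1beta1 fragment; this is an illustrative sketch, not the exact spec used here:

    spec:
      disruption:
        budgets:
          - nodes: "20%"   # at most 20% of this NodePool's nodes may be disrupted at once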

@omri-kaslasi (Author) commented Jan 16, 2025

> I tried to reproduce the behavior that you are seeing with the nodepool spec that you shared earlier, by setting the nodepool budget to 20% and the PDB to 25%. I had 4 worker nodes, each running 1 pod that is counted by the PDB when calculating availability. That means that 1 node can expire and get terminated, as allowed by both the PDB and the nodepool budget. That is what I see happening, and there's no issue on my end.

That holds at a small scale. During my stress test on Karpenter 1.0.8, I simulated a production spike by creating 200 pods, each requesting 20 GB of memory. Our actual production spikes are larger, but this setup closely approximates the scenario. After ensuring all pods were ready, I waited a few minutes before deleting them.

The expected behavior was for Karpenter to gradually scale the cluster back down to its original size. However, as shown in the image I posted, Karpenter does scale down, but stops after a few hours. Despite the resource requests returning to pre-spike levels, the cluster remains 2–3 times larger than necessary.

[Image: Datadog graph]

> What is the total number of worker nodes you expect to have in the cluster? How many pods are running on each worker node? I see that you have shared a screenshot of memory usage across the nodes. Why not monitor the actual node count to see whether there is actually an increase in the number of nodes?

Why is node count relevant here? If a spike results in 16 pods, each requesting 4 GB of memory, why does it matter whether Karpenter decides to create one large node or four smaller ones? Shouldn't we focus on the total capacity of the nodes in the cluster and compare it to the actual resource requests instead?

@jigisha620 (Contributor) commented:

I am not sure I completely understand the issue here. From the title of the issue, I thought the issue was that Karpenter is not expiring nodes due to blocking PDBs. But from your previous message, it looks like the issue is that you deleted the pods and Karpenter didn't scale down? Can you help me understand what issue we are trying to solve here, and also share logs if you had a chance to reproduce this?

@omri-kaslasi (Author) commented:

Thank you for the question, and I realize I wasn't clear earlier; my apologies for that. Let me clarify: it's a combination of both issues.

On one hand, we observed Karpenter failing to expire nodes.
On the other hand, during a significant spike in production workload, we expected Karpenter to scale down nodes once the spike subsided to match the cluster's current requests.
While it was able to scale down a bit, it wasn't enough to right-size the cluster. When examining the details of multiple nodes, we consistently found that their disruption was being blocked by Pod Disruption Budgets (PDBs).
Should I update the title of the issue?

@jigisha620 (Contributor) commented:

Yes, I think updating the title of the issue would be really helpful. So, if I understand correctly, the issue that you are running into is:

  1. You added 200 pods to your cluster
  2. Karpenter created nodes in response
  3. You deleted 200 pods created in the first step
  4. Karpenter started scaling down, but not completely, and you suspect this is happening due to expiration.

The description of the issue is a little confusing as well, because 5-day node expiry is different from a user deleting pods on the cluster. I think they are unrelated in this case. Also, if you were able to reproduce this, can you please share the logs? Thank you.

@pkit commented Jan 20, 2025

@omri-kaslasi I have the opposite problem on 1.0.8: nodes are expired even though a PDB is set. So double-check that you are really running Karpenter >= 1.0.
I do see your behavior on 0.37.5, but not on 1.0.8.

@jigisha620 self-assigned this Jan 20, 2025
@omri-kaslasi (Author) commented:

> Yes, I think updating the title of the issue would be really helpful. So, if I understand correctly, the issue that you are running into is:
>
>   1. You added 200 pods to your cluster
>   2. Karpenter created nodes in response
>   3. You deleted 200 pods created in the first step
>   4. Karpenter started scaling down, but not completely, and you suspect this is happening due to expiration.
>
> The description of the issue is a little confusing as well, because 5-day node expiry is different from a user deleting pods on the cluster. I think they are unrelated in this case. Also, if you were able to reproduce this, can you please share the logs? Thank you.

Correct, except for point 4.
I suspect it is due to PDBs. I see that many large nodes are blocked by PDBs, which by itself sounds fine.
But when I look at the nodes themselves, I see that if an attempt were made to remove any of them on its own, no PDB would block it (for example, an attempt to remove the node via single-node consolidation).
Expiration is just another reason the nodes are expected to be removed, yet we see they are blocked.
When I review the Karpenter events, I see a lot of events that look like multi-node consolidation attempts, but I can't find any single-node attempts.

@omri-kaslasi (Author) commented:

> @omri-kaslasi I have the opposite problem on 1.0.8: nodes are expired even though a PDB is set. So double-check that you are really running Karpenter >= 1.0. I do see your behavior on 0.37.5, but not on 1.0.8.

Replicated this behavior again on a cluster with Karpenter 1.0.8 and I see the same thing: Karpenter starts scaling down after the peak, but the cluster's total capacity remains larger than it should be.

In the attached picture I manually created a peak at 9:30 and reduced it at around 10:15 (at 11:00 I returned the cluster to how it was prior to my manual peak).
We can see the total memory capacity of the cluster increases (so Karpenter scales up well) and then scales down, but it stops at a point that is way higher than it should be (I'll wait an additional hour or two and update this comment if I see Karpenter making progress).

[Image: Datadog graph]

@omri-kaslasi changed the title from "Expired nodes are not removed, Indefinitely blocked by PDB" to "Karpenter fails to scale down correctly, Indefinitely blocked by PDB" Jan 21, 2025
@omri-kaslasi changed the title from "Karpenter fails to scale down correctly, Indefinitely blocked by PDB" to "Karpenter fails to scale down to desired state, Indefinitely blocked by PDB" Jan 21, 2025
@rajakadali4134 commented Jan 22, 2025

@omri-kaslasi We are also facing a similar issue, but in our case nodes are not consolidated due to a topology constraint. When the nodes are not scaled up, the pods are scheduled by kube-scheduler onto existing nodes even though maxSkew is 1 (it honors ScheduleAnyway), but when the nodes are scaled up, the Karpenter simulation honors the maxSkew over ScheduleAnyway and does not consolidate.

topology constraint:

   topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway

@omri-kaslasi (Author) commented:

> @omri-kaslasi We are also facing a similar issue, but in our case nodes are not consolidated due to a topology constraint. When the nodes are not scaled up, the pods are scheduled by kube-scheduler onto existing nodes even though maxSkew is 1 (it honors ScheduleAnyway), but when the nodes are scaled up, the Karpenter simulation honors the maxSkew over ScheduleAnyway and does not consolidate.
>
> topology constraint:
>
>    topologySpreadConstraints:
>       - maxSkew: 1
>         topologyKey: topology.kubernetes.io/zone
>         whenUnsatisfiable: DoNotSchedule
>       - maxSkew: 1
>         topologyKey: kubernetes.io/hostname
>         whenUnsatisfiable: ScheduleAnyway

We don't have topologySpreadConstraints defined; I'll take a look at what the default values are and whether they might have an impact.
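For what it's worth, when a pod defines no topologySpreadConstraints, kube-scheduler falls back to its built-in default spread constraints, which (to my understanding) are both soft (ScheduleAnyway):

    # kube-scheduler built-in default spread constraints (applied only when a pod defines none)
    defaultConstraints:
      - maxSkew: 3
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      - maxSkew: 5
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway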

@omri-kaslasi (Author) commented:

Bump so the issue won't get closed as stale.

@jigisha620 (Contributor) commented:

> Replicated this behavior again on a cluster with Karpenter 1.0.8 and I see the same thing: Karpenter starts scaling down after the peak, but the cluster's total capacity remains larger than it should be.

@omri-kaslasi, sorry for the delayed response, but do you have Karpenter controller logs from this attempt? It would be really helpful to look at the logs to understand what really happened here.
