
Karpenter fails to scale down to desired state, Indefinitely blocked by PDB #7586

Open
omri-kaslasi opened this issue Jan 13, 2025 · 16 comments
Labels: bug (Something isn't working), needs-triage (Issues that need to be triaged)

@omri-kaslasi commented Jan 13, 2025

Description

Observed Behavior:
Karpenter is not removing expired nodes, even though we have expireAfter (5 days) configured.
[Screenshot: 2025-01-12 18:29:40]
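For reference, in the v1beta1 API that 0.36.x uses, the 5-day expiry sits under spec.disruption; a minimal illustrative fragment (not our actual NodePool spec) looks like this:

    # Illustrative v1beta1 NodePool fragment, assuming Karpenter 0.36.x
    spec:
      disruption:
        expireAfter: 120h   # 5 days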

The issue appears to be related to PodDisruptionBudgets (PDBs) blocking node consolidation attempts.
[Screenshot: 2025-01-12 18:29:00]

I reviewed the blocking PDBs and they are configured correctly (the selector is correct and maxUnavailable is set to 25%).
Example of one of the blocking PDBs:
[Screenshot: PDB details, generation 1]
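For context, a minimal sketch of a PDB like the blocking ones (the names and pod counts here are illustrative, not taken from our manifests); with 16 ready pods, maxUnavailable: 25% works out to 4 allowed disruptions:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: example-app-pdb          # hypothetical name
    spec:
      maxUnavailable: "25%"
      selector:
        matchLabels:
          app: example-app           # hypothetical selector
    status:                          # written by the controller; shown for illustration
      expectedPods: 16
      currentHealthy: 16
      desiredHealthy: 12
      disruptionsAllowed: 4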

If I'm not mistaken, 4 disruptions allowed means 4 pods covered by this PDB can be evicted.
To confirm we are not in an edge case where 4 pods covered by this PDB sit on the same node, I checked and saw only 1 such pod on the relevant node.
[Screenshot: 2025-01-13 11:13:01]

I fetched the Karpenter events and see what look like multi-node consolidation attempts (nodes are color-matched with their nodeclaims):
[Screenshot: Karpenter events]

I don't see any standalone single-node consolidation attempts in the events, which should be able to remove the expired nodes (I reviewed multiple nodes; none of them would be blocked by a PDB if they were disrupted on their own).

I attempted to find out whether single-node consolidation is disabled, but haven't found anything in the logs or documentation. I enabled debug logging on Karpenter, but that didn't provide any relevant information.

Edit 1:
I tried this on a cluster with Karpenter 1.0.8 and hit the same issue. Adding graphs from Datadog:
Right graph: the memory-request spikes were caused manually by me.
Left graph: the sum of memory across all nodes in the nodepool (node-mixed is the relevant nodepool). node-mixed spiked as expected, but more than 2 hours later we still have more than double the capacity we had before the spike (and describing the nodeclaims shows they are blocked by PDBs, as stated above).
[Image: Datadog graphs]

Expected Behavior:
Nodes that have passed their expireAfter should be removed, even when PDBs are in place, as long as the disruptions stay within the allowed limits.
According to the documentation, single-node consolidation is supposed to consider all nodes, but it doesn't seem to run.
Observed on Karpenter 0.36.2; upgraded to 0.36.8 but didn't see a change in behavior.

Reproduction Steps (Please include YAML):
Have multiple deployments with a large replica count and a PDB. Increase the replica count (to trigger Karpenter creating additional nodes), wait a few minutes, then reduce the replica count (Karpenter will begin scaling down and remove some of the nodes, but not all).
I've generated generic YAMLs for this and am attaching them:
KarpenterStress.txt
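A minimal sketch of the kind of deployment/PDB pair described above (the real manifests are in the attached KarpenterStress.txt; the names, image, and sizes here are placeholders):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: stress-app                          # placeholder name
    spec:
      replicas: 50                              # scale up to force new nodes, then back down
      selector:
        matchLabels:
          app: stress-app
      template:
        metadata:
          labels:
            app: stress-app
        spec:
          containers:
            - name: pause
              image: registry.k8s.io/pause:3.9  # placeholder workload
              resources:
                requests:
                  memory: 4Gi                   # placeholder request, large enough to matter
    ---
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: stress-app-pdb
    spec:
      maxUnavailable: "25%"
      selector:
        matchLabels:
          app: stress-app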

Versions:

  • Chart Version: 0.36.2 and 0.36.8
  • Kubernetes Version (kubectl version): 1.29
@omri-kaslasi added the bug (Something isn't working) and needs-triage (Issues that need to be triaged) labels Jan 13, 2025
@omri-kaslasi (Author) commented Jan 13, 2025

Update:
I tried this on a cluster with Karpenter 1.0.8 and hit the same issue. Adding graphs from Datadog:
Right graph: the memory-request spikes were caused manually by me.
Left graph: the sum of memory across all nodes in the nodepool (node-mixed is the relevant nodepool). node-mixed spiked as expected, but more than 2 hours later we still have more than double the capacity we had before the spike (and describing the nodeclaims shows they are blocked by PDBs, as stated above).

[Image: Datadog graphs]

@jigisha620 (Contributor) commented:

Do you mind sharing your NodePool configuration (for 0.36.2, 0.36.8, and 1.0.8) and logs from when this happened?

@omri-kaslasi (Author) commented Jan 14, 2025

I don't have logs from when this originally happened, as we don't retain our logs for long periods of time.
Would replicating this behavior (as I wrote in my update) and providing logs from that attempt help?
If so, I'll provide them, but I'll need to go over the logs and probably censor a lot of names/sections.
Adding the NodePool configuration:

nodepool.txt

@jigisha620 (Contributor) commented Jan 15, 2025

Sure, replicating this behavior and providing logs from that attempt works. If you are replicating this on Karpenter v1.0.8, can you share nodepool configuration for it as well?

@jigisha620 (Contributor) commented:

I tried to reproduce the behavior that you are seeing with the nodepool spec that you shared earlier, by setting the nodepool budget to 20% and the PDB to 25%. I had 4 worker nodes, each running 1 pod that is counted by the PDB when calculating availability. That means that 1 node can expire and get terminated, as allowed by both the PDB and the nodepool budget. That is what I see happening, and there's no issue on my end.

What is the total number of worker nodes you expect to have in the cluster? How many pods are running on each worker node? I see that you have shared a screenshot of memory usage across the nodes. Why not monitor the actual node count to see whether there is actually an increase in the number of nodes?
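For reference, a nodepool budget like the one used in this test maps to roughly the following v1beta1 fragment; this is an illustrative sketch, not the exact spec used here:

    spec:
      disruption:
        budgets:
          - nodes: "20%"   # at most 20% of this NodePool's nodes may be disrupted at once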

@omri-kaslasi (Author) commented Jan 16, 2025

> I tried to reproduce the behavior that you are seeing with the nodepool spec that you shared earlier, by setting the nodepool budget to 20% and the PDB to 25%. I had 4 worker nodes, each running 1 pod that is counted by the PDB when calculating availability. That means that 1 node can expire and get terminated, as allowed by both the PDB and the nodepool budget. That is what I see happening, and there's no issue on my end.

That holds at a small scale. During my stress test on Karpenter 1.0.8, I simulated a production spike by creating 200 pods, each requesting 20 GB of memory. Our actual production spikes are larger, but this setup closely approximates the scenario. After ensuring all pods were ready, I waited a few minutes before deleting them.

The expected behavior was for Karpenter to gradually scale the cluster back down to its original size. However, as shown in the image I posted, Karpenter does scale down, but stops after a few hours. Despite the resource requests returning to pre-spike levels, the cluster remains 2–3 times larger than necessary.

[Image: Datadog graph]

> What is the total number of worker nodes you expect to have in the cluster? How many pods are running on each worker node? I see that you have shared a screenshot of memory usage across the nodes. Why not monitor the actual node count to see whether there is actually an increase in the number of nodes?

Why is node count relevant here? If a spike results in 16 pods, each requesting 4 GB of memory, why does it matter whether Karpenter decides to create one large node or four smaller ones? Shouldn't we focus on the total capacity of the nodes in the cluster and compare it to the actual resource requests instead?

@jigisha620 (Contributor) commented:

I am not sure I completely understand the issue here. From the title of the issue, I thought the issue was that Karpenter is not expiring nodes due to blocking PDBs. But from your previous message, it looks like the issue is that you deleted the pods and Karpenter didn't scale down? Can you help me understand what issue we are trying to solve here, and also share logs if you had a chance to reproduce this?

@omri-kaslasi (Author) commented:

Thank you for the question, and I realize I wasn't clear earlier; my apologies for that. Let me clarify: it's a combination of both issues.

On one hand, we observed Karpenter failing to expire nodes.
On the other hand, during a significant spike in production workload, we expected Karpenter to scale down nodes once the spike subsided to match the cluster's current requests.
While it was able to scale down a bit, it wasn't enough to right-size the cluster. When examining the details of multiple nodes, we consistently found that their disruption was being blocked by Pod Disruption Budgets (PDBs).
Should I update the title of the issue?

@jigisha620 (Contributor) commented:

Yes, I think updating the title of the issue would be really helpful. So, if I understand correctly, the issue that you are running into is:

  1. You added 200 pods to your cluster
  2. Karpenter created nodes in response
  3. You deleted 200 pods created in the first step
  4. Karpenter started scaling down, but not completely, and you suspect this is happening due to expiration.

The description of the issue is a little confusing as well, because 5-day node expiry is different from a user deleting pods on the cluster. I think they are unrelated in this case. Also, if you were able to reproduce this, can you please share the logs? Thank you.

@pkit commented Jan 20, 2025

@omri-kaslasi I have the opposite problem on 1.0.8: nodes are expired even though a PDB is set. So double-check that you are really running Karpenter >= 1.0.
I do see your behavior on 0.37.5, but not on 1.0.8.

@jigisha620 self-assigned this Jan 20, 2025
@omri-kaslasi (Author) commented:

> Yes, I think updating the title of the issue would be really helpful. So, if I understand correctly, the issue that you are running into is:
>
>   1. You added 200 pods to your cluster
>   2. Karpenter created nodes in response
>   3. You deleted 200 pods created in the first step
>   4. Karpenter started scaling down, but not completely, and you suspect this is happening due to expiration.
>
> The description of the issue is a little confusing as well, because 5-day node expiry is different from a user deleting pods on the cluster. I think they are unrelated in this case. Also, if you were able to reproduce this, can you please share the logs? Thank you.

Correct, except for point 4.
I suspect it is due to PDBs. I see that many large nodes are blocked by PDBs, which by itself sounds fine.
But when I look at the nodes themselves, I see that if an attempt were made to remove any of them on its own, no PDB would block it (for example, an attempt to remove the node via single-node consolidation).
Expiration is just another reason the nodes are expected to be removed, yet we see they are blocked.
When I review the Karpenter events, I see a lot of events that look like multi-node consolidation attempts, but I can't find any single-node attempts.

@omri-kaslasi (Author) commented:

> @omri-kaslasi I have the opposite problem on 1.0.8: nodes are expired even though a PDB is set. So double-check that you are really running Karpenter >= 1.0. I do see your behavior on 0.37.5, but not on 1.0.8.

Replicated this behavior again on a cluster with Karpenter 1.0.8 and I see the same thing: Karpenter starts scaling down after the peak, but the cluster's total capacity remains larger than it should be.

In the attached picture I manually created a peak at 9:30 and reduced it at around 10:15 (at 11:00 I returned the cluster to how it was prior to my manual peak).
We can see the total memory capacity of the cluster increases (so Karpenter scales up well) and then scales down, but it stops at a point that is way higher than it should be (I'll wait an additional hour or two and update this comment if I see Karpenter making progress).

[Image: Datadog graph]

@omri-kaslasi changed the title from "Expired nodes are not removed, Indefinitely blocked by PDB" to "Karpenter fails to scale down correctly, Indefinitely blocked by PDB" Jan 21, 2025
@omri-kaslasi changed the title from "Karpenter fails to scale down correctly, Indefinitely blocked by PDB" to "Karpenter fails to scale down to desired state, Indefinitely blocked by PDB" Jan 21, 2025
@rajakadali4134 commented Jan 22, 2025

@omri-kaslasi We are also facing a similar issue, but in our case nodes are not consolidated due to a topology constraint. When the nodes are not scaled up, the pods are scheduled by kube-scheduler onto existing nodes even though maxSkew is 1 (it honors ScheduleAnyway), but when the nodes are scaled up, the Karpenter simulation honors the maxSkew over ScheduleAnyway and does not consolidate.

topology constraint:

   topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway

@omri-kaslasi (Author) commented:

> @omri-kaslasi We are also facing a similar issue, but in our case nodes are not consolidated due to a topology constraint. When the nodes are not scaled up, the pods are scheduled by kube-scheduler onto existing nodes even though maxSkew is 1 (it honors ScheduleAnyway), but when the nodes are scaled up, the Karpenter simulation honors the maxSkew over ScheduleAnyway and does not consolidate.
>
> topology constraint:
>
>    topologySpreadConstraints:
>       - maxSkew: 1
>         topologyKey: topology.kubernetes.io/zone
>         whenUnsatisfiable: DoNotSchedule
>       - maxSkew: 1
>         topologyKey: kubernetes.io/hostname
>         whenUnsatisfiable: ScheduleAnyway

We don't have topologySpreadConstraints defined; I'll take a look at what the default values are and whether they might have an impact.
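For what it's worth, when a pod defines no topologySpreadConstraints, kube-scheduler falls back to its built-in default spread constraints, which (to my understanding) are both soft (ScheduleAnyway):

    # kube-scheduler built-in default spread constraints (applied only when a pod defines none)
    defaultConstraints:
      - maxSkew: 3
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      - maxSkew: 5
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway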

@omri-kaslasi (Author) commented:

Bump so the issue won't get closed as stale.

@jigisha620 (Contributor) commented:

> Replicated this behavior again on a cluster with Karpenter 1.0.8 and I see the same thing: Karpenter starts scaling down after the peak, but the cluster's total capacity remains larger than it should be.

@omri-kaslasi, sorry for the delayed response, but do you have Karpenter controller logs from this attempt? It would be really helpful to look at the logs to understand what really happened here.
