
feat: Consolidation tolerance #795

Open
stevenpitts opened this issue Nov 16, 2023 · 14 comments
Labels: kind/feature

@stevenpitts

Description

What problem are you trying to solve?

I am trying to reduce the frequency of consolidation on clusters that have frequent but insignificant resource request changes.

An active cluster can cause frequent consolidation events.
For example, if a deployment with an HPA scales up and down by one replica every 10 minutes, it's very likely that a new node will be spun up and then spun down every 10 minutes so that cost stays optimized. This could even result in a packed node getting deleted, if Karpenter decides that a different node type or number of nodes would be more cost-efficient.

That can be really disruptive. PDBs help, but to guard against users experiencing slowness you'd need to set a PDB with practically 1% maxUnavailable.
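For concreteness, such a PDB would look something like the sketch below (the workload name and selector are illustrative, and one of these would be needed per deployment):

```yaml
# Illustrative PDB approximating "practically 1% maxUnavailable".
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb        # hypothetical name
spec:
  maxUnavailable: "1%"    # leaves almost no room for voluntary disruption
  selector:
    matchLabels:
      app: my-app         # hypothetical label
```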

Once a consolidationPolicy of WhenUnderutilized works alongside consolidateAfter, that will help greatly, but consolidation would still likely happen every (for example) 2 hours, even with very small net changes in resource requests.

I think a way of configuring "consolidation tolerance" would help here. One implementation could be a way of specifying cost tolerance.
In pseudo-configuration, there could be a consolidationCostTolerance field that I might set to "$50 per hour".
If an HPA decides a deployment needs a new replica and there's no space, Karpenter would spin up a new combination of nodes that has enough space for all desired pods while remaining cost effective. Later on, the HPA might decrement desired replicas. Karpenter would then normally want to consolidate, since a more cost-effective combination of nodes now exists for the requested resources.
The idea is that consolidation would not happen unless currentCostPerHour - consolidatedCostPerHour is greater than $50.
This way, consolidation would not trigger until there is a significant amount of unused resources on the nodes.
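A minimal sketch of that pseudo-configuration, assuming the field would hang off the NodePool's disruption block (consolidationCostTolerance is hypothetical and not part of Karpenter's API):

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    # Hypothetical field proposed in this issue: skip any consolidation action
    # unless currentCostPerHour - consolidatedCostPerHour > $50.
    consolidationCostTolerance: "$50 per hour"
```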

How important is this feature to you?

This feature is fairly important. Even when all the features described in disruption controls become stable, existing solutions only reduce the frequency of consolidation, slow it down, or block it during certain hours.
We could set a PDB of 1% maxUnavailable on every deployment, but that feels like a pretty extreme demand.

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@stevenpitts added the kind/feature label Nov 16, 2023
@ellistarn
Contributor

We've discussed the idea of an "improvement threshold" https://github.com/aws/karpenter-core/pull/768/files#diff-e6f78172a1d86c735a03ec76853021c670f4203f387c45b601670eca0e2ae1a4R26, which may model this quite nicely. Thoughts?

@stevenpitts
Author

> We've discussed the idea of an "improvement threshold" https://github.com/aws/karpenter-core/pull/768/files#diff-e6f78172a1d86c735a03ec76853021c670f4203f387c45b601670eca0e2ae1a4R26, which may model this quite nicely. Thoughts?

That does seem like what I'm looking for! The design doc appears primarily focused on a spot issue I'm not too familiar with, but

> Note: Regardless of the decision made to solve the spot consolidation problem, we'd likely want to implement a price improvement threshold in the future to prevent consolidation from interrupting nodes to make marginal improvements.

👍

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Feb 15, 2024
@stevenpitts
Author

/remove-lifecycle stale

Since it's still not totally clear what direction the project is going in with regard to this problem.

@k8s-ci-robot removed the lifecycle/stale label Feb 16, 2024
@sumeet-baghel

@stevenpitts What is your current strategy to mitigate this problem?

Have you tried creating a custom PriorityClass with a higher priority for critical workloads? This might help in a scenario where Karpenter decides to delete a few nodes.

I haven't used Karpenter myself, so this might be a dumb question.

@stevenpitts
Author

@sumeet-baghel Hello stranger!
Right now we're just relying on do-not-disrupt annotations for temporary critical workloads. Haven't found a great solution yet.
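For context, the karpenter.sh/do-not-disrupt annotation goes on the pods themselves, e.g. via a Deployment's pod template; a sketch with illustrative names:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-workload    # hypothetical name
spec:
  selector:
    matchLabels:
      app: critical-workload
  template:
    metadata:
      labels:
        app: critical-workload
      annotations:
        # Tells Karpenter not to voluntarily disrupt this workload's pods.
        karpenter.sh/do-not-disrupt: "true"
    spec:
      containers:
        - name: app
          image: example/app:latest    # hypothetical image
```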

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 30, 2024
@stevenpitts
Author

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label May 30, 2024
@ellistarn
Contributor

Anyone interested in picking up "PriceImprovementThreshold"?

@stevenpitts
Author

@ellistarn I think that from the RFC it's unclear what the maintainers think the solution should look like. Is there a more specific doc I should read about it? Or are you still looking for feedback/opinions on the RFC?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Sep 7, 2024
@stevenpitts
Author

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Sep 7, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 6, 2024
@stevenpitts
Author

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Dec 6, 2024