Selective Disabling of Automatic Node Draining for AWS Health Events #7576

Open

JacobAmar opened this issue Jan 9, 2025 · 2 comments

Labels: feature (New feature or request), triage/needs-information (Marks that the issue still needs more information to properly triage)

Comments

@JacobAmar

Description

What problem are you trying to solve?

We need the ability to disable automatic node draining for AWS health events on specific node pools. This is crucial for stateful applications such as databases and Kafka clusters, which require special handling during termination.

Currently, when an AWS health event (such as a required instance stop-start) occurs, Karpenter automatically drains the affected nodes. For certain node pools running stateful applications, however, we want control over when and how Karpenter disrupts these nodes.

If there's an existing feature that addresses this concern, I would appreciate information about it.

How important is this feature to you?
This feature is very important to us. It directly impacts our ability to manage and maintain the stability of our stateful applications in Kubernetes clusters managed by Karpenter.

Proposed Solution:
I propose adding a configuration option to Karpenter that allows users to disable automatic node draining for specific node pools when AWS health events occur. This could be implemented as either of the following (a rough sketch of the first option follows the list):

  • A new field in the NodePool CRD to disable automatic draining for AWS health events.
  • A global configuration option with the ability to specify exceptions for certain node pools.
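
To make the first option concrete, here is a rough sketch of what a NodePool-level toggle could look like. The `onAWSHealthEvent` field is purely hypothetical and does not exist in Karpenter today; the rest follows the v1 NodePool schema:

```yaml
# Hypothetical sketch only: "onAWSHealthEvent" is not a real Karpenter field.
# It illustrates a possible per-NodePool opt-out from automatic draining.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: stateful-workloads
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    # Hypothetical field: do not automatically drain nodes in this pool
    # when an AWS health event (e.g. instance stop-start) is scheduled.
    onAWSHealthEvent: DoNotDrain
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```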

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
@JacobAmar added the feature (New feature or request) and needs-triage (Issues that need to be triaged) labels on Jan 9, 2025
@jmdeal
Contributor

jmdeal commented Jan 10, 2025

Do these workloads tolerate any form of automated disruption by Karpenter (e.g. consolidation or drift)? As of v1.0.0, the karpenter.sh/do-not-disrupt annotation prevents pods from being drained even if the node is disrupted. It stops nodes running pods with that annotation from being gracefully disrupted, and if such a node is forcibly disrupted (e.g. by an AWS health event), the pod won't be drained and the instance won't be removed until one of the following conditions is met:

  • The pod is manually evicted from the node
  • The NodeClaim's terminationGracePeriod has elapsed (if configured)
  • The pod has entered a terminal state (e.g. Succeeded or Failed)
  • The annotation is removed from the pod
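
For reference, a minimal sketch of the pod-level annotation (the pod name and image are placeholders):

```yaml
# Standard Karpenter annotation on a pod; the pod name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: kafka-broker-0
  annotations:
    # Blocks voluntary disruption of the node and prevents this pod from
    # being drained during a forced disruption.
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
    - name: kafka
      image: apache/kafka:3.8.0
```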

Does this fit your use-case?

@jmdeal added the triage/needs-information (Marks that the issue still needs more information to properly triage) label and removed the needs-triage (Issues that need to be triaged) label on Jan 10, 2025
@jmdeal
Contributor

jmdeal commented Jan 13, 2025

After some offline discussion, it sounds like the goal here is not to prevent Karpenter from draining the node during these termination events, but to prevent the termination events in the first place. Is that correct? do-not-disrupt will prevent Karpenter from initiating a termination, but if the termination is initiated out-of-band that's a different story. Could you provide some examples, i.e. what service is initiating the termination and based on what events?
