Selective Disabling of Automatic Node Draining for AWS Health Events #7576

Open

JacobAmar opened this issue Jan 9, 2025 · 2 comments

Labels: feature (New feature or request), triage/needs-information (Marks that the issue still needs more information to properly triage)

Comments

@JacobAmar

Description

What problem are you trying to solve?

We need the ability to disable automatic node draining for AWS health events on specific node pools. This is crucial for stateful applications such as databases and Kafka clusters, which require special handling during termination.

Currently, when an AWS health event (such as a required instance stop-start) occurs, Karpenter automatically drains the affected nodes. For certain node pools running stateful applications, however, we want control over when and how Karpenter disrupts these nodes.

If there's an existing feature that addresses this concern, I would appreciate information about it.

How important is this feature to you?
This feature is very important to us. It directly impacts our ability to manage and maintain the stability of our stateful applications in Kubernetes clusters managed by Karpenter.

Proposed Solution:
I propose adding a configuration option to Karpenter that allows users to disable automatic node draining for specific node pools when AWS health events occur. This could be implemented as either of the following (a rough sketch of the first option follows the list):

  • A new field in the NodePool CRD to disable automatic draining for AWS health events.
  • A global configuration option with the ability to specify exceptions for certain node pools.
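
To make the first option concrete, here is a rough sketch of what a NodePool-level toggle could look like. The `onAWSHealthEvent` field is purely hypothetical and does not exist in Karpenter today; the rest follows the v1 NodePool schema:

```yaml
# Hypothetical sketch only: "onAWSHealthEvent" is not a real Karpenter field.
# It illustrates a possible per-NodePool opt-out from automatic draining.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: stateful-workloads
spec:
  disruption:
    consolidationPolicy: WhenEmpty
    # Hypothetical field: do not automatically drain nodes in this pool
    # when an AWS health event (e.g. instance stop-start) is scheduled.
    onAWSHealthEvent: DoNotDrain
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
```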

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
@JacobAmar added the feature (New feature or request) and needs-triage (Issues that need to be triaged) labels on Jan 9, 2025
@jmdeal
Contributor

jmdeal commented Jan 10, 2025

Do these workloads tolerate any form of automated disruption by Karpenter (e.g. consolidation or drift)? As of v1.0.0, the karpenter.sh/do-not-disrupt annotation prevents pods from being drained even if the node is disrupted. It stops nodes running pods with that annotation from being gracefully disrupted, and if such a node is forcibly disrupted (e.g. by an AWS health event), the pod won't be drained and the instance won't be removed until one of the following conditions is met:

  • The pod is manually evicted from the node
  • The NodeClaim's terminationGracePeriod has elapsed (if configured)
  • The pod has entered a terminal state (e.g. Succeeded or Failed)
  • The annotation is removed from the pod
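
For reference, a minimal sketch of the pod-level annotation (the pod name and image are placeholders):

```yaml
# Standard Karpenter annotation on a pod; the pod name and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: kafka-broker-0
  annotations:
    # Blocks voluntary disruption of the node and prevents this pod from
    # being drained during a forced disruption.
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
    - name: kafka
      image: apache/kafka:3.8.0
```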

Does this fit your use-case?

@jmdeal added the triage/needs-information (Marks that the issue still needs more information to properly triage) label and removed the needs-triage (Issues that need to be triaged) label on Jan 10, 2025
@jmdeal
Contributor

jmdeal commented Jan 13, 2025

After some offline discussion, it sounds like the goal here is not to prevent Karpenter from draining the node during these termination events, but to prevent the termination events in the first place. Is that correct? do-not-disrupt will prevent Karpenter from initiating a termination, but if the termination is initiated out-of-band that's a different story. Could you provide some examples, i.e. what service is initiating the termination and based on what events?
