
Support Cascade Delete When Removing Karpenter from my Cluster #1040

Open
jonathan-innis opened this issue Feb 22, 2024 · 10 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@jonathan-innis
Member

Description

What problem are you trying to solve?

I'd like to be able to configure cascading delete behavior for Karpenter so that, on NodePool deletion or CRD deletion, I can set values that convey to Karpenter that I want a more expedited termination of my nodes rather than waiting for all nodes to fully drain.

Right now, nodes can hang during termination because of stuck pods or fully blocking PDBs in our graceful drain logic. Because a NodePool deletion or CRD deletion causes all the nodes to gracefully drain, these deletion operations can also hang, halting the whole process. Ideally, a user could pass something like --grace-period when deleting a resource, and Karpenter could reason about how to propagate it down to all the resources the deletion cascades to.

At a minimum, we should allow CRD deletions to be unblocked so that cluster operators can uninstall Karpenter from clusters without getting stuck behind graceful node drains that may hang.

An initial implementation of this was attempted in #466, and there was some discussion in the community about enabling gracePeriod to be passed through to CRs in the same way it can be passed to pods today, affecting the deletionTimestamp for a CR and allowing controller authors to build custom logic around this gracePeriod concept.
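As a rough sketch of the desired UX (the CR-level --grace-period flag shown here is hypothetical and not implemented today; only pods honor it):

```
# Works today: cap a pod's graceful termination at delete time
kubectl delete pod stuck-pod --grace-period=30

# Hypothetical: the same flag honored on a Karpenter CR, cascading the
# deadline down to NodeClaims and their nodes (not implemented today)
kubectl delete nodepool default --grace-period=60
```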

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jonathan-innis added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 22, 2024
@sftim

sftim commented Feb 26, 2024

enabling the ability to pass gracePeriod through to CRs in the same way that you can pass them through to pods today to affect the deletionTimestamp for a CR

Building a coalition of supporters for this idea takes effort, but it may pay off really well.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 26, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 25, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned (Won't fix, can't repro, duplicate, stale) Jul 25, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@jonathan-innis
Member Author

/reopen

@k8s-ci-robot
Contributor

@jonathan-innis: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot reopened this Aug 1, 2024
@k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Aug 1, 2024
@jonathan-innis
Member Author

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 1, 2024
@jonathan-innis
Member Author

/triage accepted

@k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 1, 2024
@jonathan-innis
Member Author

Discussed this in WG today: the consensus was that folks generally still want graceful termination of their nodes; they don't want Karpenter to always forcefully terminate all nodes on their behalf. There are workarounds today with the TerminationGracePeriod implementation: users can start the teardown of Karpenter's CRDs, let the NodeClaims begin terminating, and then have a user or automation annotate all of the nodes with karpenter.sh/nodeclaim-termination-timestamp to mark the time by which each NodeClaim must be removed.

If you want forceful termination, you could set the timestamp to the current time, and everything should start forcefully removing itself, with the instances that Karpenter launched torn down as well.
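A rough sketch of that workaround with kubectl (the node label selector and the RFC 3339 timestamp format are assumptions; verify both against your Karpenter version's docs):

```
# Begin teardown; deleting the CRDs starts graceful termination of NodeClaims
kubectl delete crd nodepools.karpenter.sh nodeclaims.karpenter.sh

# To force termination, set the deadline to "now" on all Karpenter-managed nodes
# (assumes nodes carry the karpenter.sh/nodepool label and the annotation
# accepts an RFC 3339 timestamp -- both assumptions, not confirmed here)
kubectl annotate nodes -l karpenter.sh/nodepool \
  karpenter.sh/nodeclaim-termination-timestamp="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --overwrite
```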
