
Karpenter taints nodes as un-schedulable too early causing them to be unusable and scaled down #1421

Closed
miadabrin opened this issue Jul 15, 2024 · 7 comments
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@miadabrin

miadabrin commented Jul 15, 2024

Description

Karpenter taints nodes as unschedulable too early, causing them to be unusable and scaled down, especially when pods are being added and removed.

(Please note the time differences between the DisruptionBlocked and NodeNotSchedulable events.) If a node is picked for a pod, it logically should not be tainted.
The issue might have been introduced in PR #1180.
Observed Behavior:
Karpenter taints the node too early (as not schedulable) and then won't deprovision it, since it is a candidate for a pod.
Sample events for node 1:

Events:
  Type     Reason                   Age                From                     Message
  ----     ------                   ----               ----                     -------
  Normal   Starting                 33m                kube-proxy               
  Normal   Starting                 33m                kubelet                  Starting kubelet.
  Normal   NodeAllocatableEnforced  33m                kubelet                  Updated Node Allocatable limit across pods
  Normal   Synced                   33m                cloud-node-controller    Node synced successfully
  Warning  InvalidDiskCapacity      33m                kubelet                  invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  33m (x2 over 33m)  kubelet                  Node ip-172-20-108-21.ca-central-1.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    33m (x2 over 33m)  kubelet                  Node ip-172-20-108-21.ca-central-1.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     33m (x2 over 33m)  kubelet                  Node ip-172-20-108-21.ca-central-1.compute.internal status is now: NodeHasSufficientPID
  Normal   RegisteredNode           33m                node-controller          Node ip-172-20-108-21.ca-central-1.compute.internal event: Registered Node ip-172-20-108-21.ca-central-1.compute.internal in Controller
  Normal   ControllerVersionNotice  33m                vpc-resource-controller  The node is managed by VPC resource controller version v1.5.0
  Normal   NodeReady                33m                kubelet                  Node ip-172-20-108-21.ca-central-1.compute.internal status is now: NodeReady
  Normal   NodeTrunkInitiated       33m                vpc-resource-controller  The node has trunk interface initialized successfully
  Normal   DisruptionBlocked        33m                karpenter                Cannot disrupt Node: Nominated for a pending pod
  Normal   NodeNotSchedulable       32m                kubelet                  Node ip-172-20-108-21.ca-central-1.compute.internal status is now: NodeNotSchedulable

Node 2:

Events:
  Type     Reason                   Age                 From                     Message
  ----     ------                   ----                ----                     -------
  Normal   Starting                 10m                 kube-proxy               
  Normal   Starting                 10m                 kubelet                  Starting kubelet.
  Normal   NodeAllocatableEnforced  10m                 kubelet                  Updated Node Allocatable limit across pods
  Warning  InvalidDiskCapacity      10m                 kubelet                  invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientPID     10m (x2 over 10m)   kubelet                  Node ip-172-20-112-55.ca-central-1.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeHasSufficientMemory  10m (x2 over 10m)   kubelet                  Node ip-172-20-112-55.ca-central-1.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    10m (x2 over 10m)   kubelet                  Node ip-172-20-112-55.ca-central-1.compute.internal status is now: NodeHasNoDiskPressure
  Normal   Synced                   10m                 cloud-node-controller    Node synced successfully
  Normal   RegisteredNode           10m                 node-controller          Node ip-172-20-112-55.ca-central-1.compute.internal event: Registered Node ip-172-20-112-55.ca-central-1.compute.internal in Controller
  Normal   ControllerVersionNotice  10m (x2 over 3d)    vpc-resource-controller  The node is managed by VPC resource controller version v1.5.0
  Normal   NodeReady                10m                 kubelet                  Node ip-172-20-112-55.ca-central-1.compute.internal status is now: NodeReady
  Normal   NodeTrunkInitiated       9m59s (x2 over 3d)  vpc-resource-controller  The node has trunk interface initialized successfully
  Normal   DisruptionBlocked        9m59s               karpenter                Cannot disrupt Node: Nominated for a pending pod
  Normal   NodeNotSchedulable       9m30s               kubelet                  Node ip-172-20-112-55.ca-central-1.compute.internal status is now: NodeNotSchedulable

Karpenter logs for the second node:

{"level":"INFO","time":"2024-07-15T20:01:06.492Z","logger":"controller","message":"registered nodeclaim","commit":"490ef94","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"production-mixed-default-jgmhj"},"namespace":"","name":"production-mixed-default-jgmhj","reconcileID":"c13f9ee9-9e02-4230-afc9-08065b5988d3","provider-id":"aws:///ca-central-1a/i-0939b93fe0841ae20","Node":{"name":"ip-172-20-100-252.ca-central-1.compute.internal"}}
{"level":"INFO","time":"2024-07-15T20:01:31.445Z","logger":"controller","message":"initialized nodeclaim","commit":"490ef94","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"production-mixed-default-jgmhj"},"namespace":"","name":"production-mixed-default-jgmhj","reconcileID":"c7eba0f4-110a-476b-85e5-0a873626ba4d","provider-id":"aws:///ca-central-1a/i-0939b93fe0841ae20","Node":{"name":"ip-172-20-100-252.ca-central-1.compute.internal"},"allocatable":{"cpu":"15890m","ephemeral-storage":"47233297124","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"28811800Ki","pods":"234"}}

Expected Behavior:
Karpenter should not taint a node as not schedulable if a pod is going to be scheduled on it. This is causing unnecessary scaling issues.
Reproduction Steps (Please include YAML):
Create a NodePool with:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  generation: 3
  name: production-warehouse
spec:
  disruption:
    budgets:
    - duration: 45m
      nodes: "0"
      schedule: '@hourly'
    - nodes: "1"
    consolidateAfter: 2400s
    consolidationPolicy: WhenEmpty
    expireAfter: 648000s
  limits:
    cpu: "8"
    memory: 32Gi
  template:
    metadata:
      labels:
        data-warehouse-worker: "true"
    spec:
      kubelet:
        evictionSoft:
          memory.available: 1Gi
        evictionSoftGracePeriod:
          memory.available: 30s
      nodeClassRef:
        name: default
      requirements:
     ...
  weight: 31

Scheduling pods and then removing them causes nodes to be created and then tainted immediately.
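
For reference, a minimal workload that exercises this churn could look like the sketch below. This is not taken from the original report: the Deployment name, image, and resource requests are hypothetical, and only the data-warehouse-worker node selector comes from the NodePool template above.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: churn-test                      # hypothetical name
spec:
  replicas: 4
  selector:
    matchLabels:
      app: churn-test
  template:
    metadata:
      labels:
        app: churn-test
    spec:
      nodeSelector:
        data-warehouse-worker: "true"   # matches the NodePool template labels above
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "2"                    # sized so a few replicas force a fresh node
            memory: 4Gi

Scaling this Deployment up and back down to zero within a few minutes reproduces the create-then-taint sequence shown in the node events above.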
Versions:

  • Chart Version:
    v0.37.0
  • Kubernetes Version (kubectl version):
    Server Version: v1.27.13-eks-3af4770
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@miadabrin added the kind/bug label Jul 15, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-triage label Jul 15, 2024
@jmdeal
Member

jmdeal commented Jul 26, 2024

Pod nomination is only going to exclude a node from disruption for that reconciliation cycle. If, by the time the next cycle rolls around, the node is no longer nominated, it will be considered as a candidate. Without anything to show otherwise, I assume that is what's happening here. If you're able to provide a full set of Karpenter logs, that would be useful in determining exactly what is happening.

As for tainting, Karpenter will only taint a node once it has decided to make a disruption decision. The only reason it may bail from this decision and un-taint the node later is if we failed to launch or register the replacement node. All #1180 did was reduce the gap in time between validation and tainting the node; if anything, it made it more likely we wouldn't taint the node in the first place.

Your reproduction steps are to schedule pods and then delete them, which is followed by Karpenter consolidating the nodes, correct? This sounds like expected behavior to me; if there are no longer any pods to schedule against that node, we're going to remove it.
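
For anyone cross-referencing this with the node events above: once Karpenter commits to a disruption decision it marks the node with a NoSchedule taint on the Node spec. A rough sketch of what to look for on the Node object, assuming the taint key and value used by the v0.3x line (verify against your own cluster):

spec:
  taints:
  - key: karpenter.sh/disruption      # assumed key for v0.37; appears only after a disruption decision
    value: disrupting
    effect: NoSchedule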

@miadabrin
Author

miadabrin commented Aug 1, 2024

@jmdeal I think the details matter here. If you look at the timing of the events on the node, you can see that the node is tainted almost immediately after it becomes Ready (even though it is nominated for a pod). This doesn't look like something anyone would find useful, in my opinion.

If we zoom out a little, the nodes are provisioned in response to pods being scheduled on them. If a node is provisioned and immediately killed, that suggests either: 1) the initial estimation was wrong and we didn't need the node in the first place, or 2) there is some race-condition-like issue that is causing the node to go down even though it is needed.

Based on the scaling behaviour I have observed here (spinning up another node at the exact time the current node is being scaled down), the second case is more likely.

It might be alleviated once this PR is released, so folks will be able to make it happen less often, but I think the underlying issue might be something more serious.
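
For anyone looking to make this happen less often in the meantime, the disruption fields already present in the NodePool above are the relevant levers. A hypothetical adjustment, reusing the same v1beta1 fields shown in the reproduction (values are illustrative only, not a recommendation):

spec:
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 3600s           # illustrative: wait longer before consolidating an empty node
    budgets:
    - nodes: "0"                      # block all voluntary disruption during this window
      schedule: '@hourly'
      duration: 45m
    - nodes: "1"                      # otherwise disrupt at most one node at a time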

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Oct 30, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Nov 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned Dec 29, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
