
Cannot disrupt Node: state node is nominated for a pending pod #7521

Open
vb-atelio opened this issue Dec 11, 2024 · 8 comments
Labels
bug Something isn't working needs-triage Issues that need to be triaged

Comments

@vb-atelio

Description

Observed Behavior:
Karpenter refused to drain a node (instance type: m7i.12xlarge) even though it is clearly underutilized (only 8 pods running), with the reason: state node is nominated for a pending pod. When I run kubectl get pods --all-namespaces --field-selector=status.phase=Pending, I see that there are no pending pods.
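
One way to surface the blocking events directly (a minimal diagnostic sketch; <node-name> is a placeholder) is to filter events by reason:

# List the DisruptionBlocked events Karpenter emits for nodes it refuses to disrupt
kubectl get events --all-namespaces --field-selector reason=DisruptionBlocked

# Or check the Events section of the node itself
kubectl describe node <node-name>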

Expected Behavior:
Karpenter should disrupt this node, drain it, and schedule its pods on another node, or at least report the correct reason why it cannot drain the node.

Reproduction Steps (Please include YAML):
nodepool.yaml

Name:         default
Namespace:
Labels:       <none>
Annotations:  compatibility.karpenter.sh/v1beta1-nodeclass-reference: {"name":"default"}
              karpenter.sh/nodepool-hash: 12063359807553009501
              karpenter.sh/nodepool-hash-version: v3
API Version:  karpenter.sh/v1
Kind:         NodePool
Metadata:
  Creation Timestamp:  2024-10-29T06:33:00Z
  Generation:          24
  Resource Version:    47771428
  UID:                 857db43c-c406-4952-8648-d363b9079f63
Spec:
  Disruption:
    Budgets:
      Nodes:               50%
    Consolidate After:     0s
    Consolidation Policy:  WhenEmptyOrUnderutilized
  Limits:
    Count:   50
    Cpu:     4k
    Memory:  4000Gi
  Template:
    Metadata:
      Labels:
        Type:  karpenter
    Spec:
      Expire After:  720h
      Node Class Ref:
        Group:  karpenter.k8s.aws
        Kind:   EC2NodeClass
        Name:   default
      Requirements:
        Key:       karpenter.sh/capacity-type
        Operator:  In
        Values:
          on-demand
        Key:       node.kubernetes.io/instance-type
        Operator:  In
        Values:
          m7i.12xlarge

Versions:

  • Chart Version: 1.0.2
  • Kubernetes Version (kubectl version): 1.30
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@vb-atelio vb-atelio added bug Something isn't working needs-triage Issues that need to be triaged labels Dec 11, 2024
@jigisha620
Contributor

Hi @vb-atelio,
Can you share detailed logs from when this happened? How did you determine that the node was underutilized? Did you monitor node usage during this period? If yes, can you please share it?
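
For reference, a minimal sketch of how node utilization could be captured (kubectl top assumes metrics-server is installed):

# Live CPU/memory usage as reported by metrics-server
kubectl top node <node-name>

# Requests, limits, and the non-terminated pods scheduled on the node
kubectl describe node <node-name>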

@tufitko

tufitko commented Dec 19, 2024

@jigisha620
I have the same problem. I'll try to describe it:

A node is marked for deletion due to expiration, but it hosts a pod with the karpenter.sh/do-not-disrupt annotation and an attached volume. Karpenter waits for the volume to detach before proceeding with the node deletion. (Karpenter will wait indefinitely while the pod is running, and it also won't evict this pod (ref).)

At the same time, Karpenter nominates the pod from the node marked for deletion onto another node (the nomination logic can be found here, for example).

The new node receiving the nominated pod might be empty or underutilized, but due to the presence of the nominated pod, Karpenter cannot disrupt it.
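
A quick way to check whether the node blocked from deletion is hosting such a pod (a sketch; assumes jq is available) is to list pods carrying the annotation:

# List pods annotated karpenter.sh/do-not-disrupt=true and the nodes they run on
kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(.metadata.annotations["karpenter.sh/do-not-disrupt"] == "true")
      | "\(.metadata.namespace)/\(.metadata.name) on \(.spec.nodeName)"'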

Karpenter version: 1.1.1

@tufitko

tufitko commented Dec 23, 2024

@jigisha620 any updates here?

@dfsdevops

I have the same issue. The node in question isn't running anything except normal DaemonSets, so it should be ripe for deletion, yet for some reason Karpenter is refusing to delete it. I noticed this when one of my subnets ran out of IPs... I'm manually deleting the nodes now.
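
For anyone in the same situation, a minimal sketch of the manual cleanup; deleting the NodeClaim (or the Node) should let Karpenter's termination finalizer cordon, drain, and terminate the instance gracefully:

# Find the NodeClaim backing the stuck node, then delete it
kubectl get nodeclaims
kubectl delete nodeclaim <nodeclaim-name>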

@Pierre-Raffa

same here

kubectl describe  node ip-10-0-64-84.us-east-2.compute.internal
Name:               ip-10-0-64-84.us-east-2.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=c5d.large
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2c
                    k8s.io/cloud-provider-aws=d1a7bd0bd2a5288aa5b57aada6f8224b
                    karpenter.k8s.aws/instance-category=c
                    karpenter.k8s.aws/instance-cpu=2
                    karpenter.k8s.aws/instance-cpu-manufacturer=intel
                    karpenter.k8s.aws/instance-ebs-bandwidth=4750
                    karpenter.k8s.aws/instance-encryption-in-transit-supported=false
                    karpenter.k8s.aws/instance-family=c5d
                    karpenter.k8s.aws/instance-generation=5
                    karpenter.k8s.aws/instance-hypervisor=nitro
                    karpenter.k8s.aws/instance-local-nvme=50
                    karpenter.k8s.aws/instance-memory=4096
                    karpenter.k8s.aws/instance-network-bandwidth=750
                    karpenter.k8s.aws/instance-size=large
                    karpenter.sh/capacity-type=spot
                    karpenter.sh/initialized=true
                    karpenter.sh/nodepool=runtime
                    karpenter.sh/registered=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-64-84.us-east-2.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=c5d.large
                    role=runtime
                    topology.ebs.csi.aws.com/zone=us-east-2c
                    topology.k8s.aws/zone-id=use2-az3
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2c
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.0.64.84
                    compatibility.karpenter.k8s.aws/kubelet-drift-hash: 15379597991425564585
                    csi.volume.kubernetes.io/nodeid:
                      {"ebs.csi.aws.com":"i-0c24fac1bc2411ec7","secrets-store.csi.k8s.io":"ip-10-0-64-84.us-east-2.compute.internal"}
                    karpenter.k8s.aws/ec2nodeclass-hash: 14353939326230165957
                    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
                    karpenter.sh/nodepool-hash: 14701915054057220949
                    karpenter.sh/nodepool-hash-version: v3
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 08 Jan 2025 14:48:12 +0000
Taints:             role=runtime:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-0-64-84.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 08 Jan 2025 17:42:08 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 08 Jan 2025 17:42:17 +0000   Wed, 08 Jan 2025 14:48:11 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 08 Jan 2025 17:42:17 +0000   Wed, 08 Jan 2025 14:48:11 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 08 Jan 2025 17:42:17 +0000   Wed, 08 Jan 2025 14:48:11 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 08 Jan 2025 17:42:17 +0000   Wed, 08 Jan 2025 14:48:22 +0000   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.0.64.84
  InternalDNS:  ip-10-0-64-84.us-east-2.compute.internal
  Hostname:     ip-10-0-64-84.us-east-2.compute.internal
Capacity:
  cpu:                2
  ephemeral-storage:  20959212Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3758984Ki
  pods:               29
Allocatable:
  cpu:                1930m
  ephemeral-storage:  18242267924
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3068808Ki
  pods:               29
System Info:
  Machine ID:                 ec250acedd312a7f31c7b735539bc3b5
  System UUID:                ec250ace-dd31-2a7f-31c7-b735539bc3b5
  Boot ID:                    0c3d70e4-0397-4967-ac8c-d8846933cdc3
  Kernel Version:             5.10.230-223.885.amzn2.x86_64
  OS Image:                   Amazon Linux 2
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.7.23
  Kubelet Version:            v1.29.10-eks-59bf375
  Kube-Proxy Version:         v1.29.10-eks-59bf375
ProviderID:                   aws:///us-east-2c/i-0c24fac1bc2411ec7
Non-terminated Pods:          (6 in total)
  Namespace                   Name                                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                    ------------  ----------  ---------------  -------------  ---
  kube-system                 aws-node-69kq4                                          50m (2%)      0 (0%)      0 (0%)           0 (0%)         174m
  kube-system                 ebs-csi-node-hcfwf                                      30m (1%)      0 (0%)      120Mi (4%)       768Mi (25%)    174m
  kube-system                 kube-proxy-6fv4j                                        100m (5%)     0 (0%)      0 (0%)           0 (0%)         174m
  kube-system                 secrets-store-csi-driver-8zd4m                          70m (3%)      400m (20%)  140Mi (4%)       400Mi (13%)    174m
  runtime                     grafana-agent-logs-j8d9w                                100m (5%)     100m (5%)   256Mi (8%)       256Mi (8%)     174m
  runtime                     kube-prometheus-stack-prometheus-node-exporter-xv2pj    50m (2%)      100m (5%)   30Mi (1%)        50Mi (1%)      174m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                400m (20%)   600m (31%)
  memory             546Mi (18%)  1474Mi (49%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
Events:
  Type    Reason             Age                   From       Message
  ----    ------             ----                  ----       -------
  Normal  DisruptionBlocked  3m6s (x82 over 172m)  karpenter  Cannot disrupt Node: state node is nominated for a pending pod

But I have a pending pod which should be scheduled on the node above, yet it can't be:

...
Node-Selectors:               role=runtime
Tolerations:                  node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                              node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                              role=runtime:NoSchedule
Topology Spread Constraints:  topology.kubernetes.io/zone:DoNotSchedule when max skew 1 is exceeded for selector app.kubernetes.io/component=ingester,app.kubernetes.io/name=loki-distributed
Events:
  Type     Reason             Age                   From                Message
  ----     ------             ----                  ----                -------
  Normal   NotTriggerScaleUp  9m29s (x10 over 14m)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had untolerated taint {os: windows}, 1 node(s) had untolerated taint {role: karpenter}
  Warning  FailedScheduling   4m31s (x46 over 14m)  default-scheduler   0/43 nodes are available: 1 node(s) didn't match Pod's node affinity/selector, 2 Insufficient memory, 2 node(s) didn't match pod topology spread constraints, 2 node(s) had untolerated taint {os: windows}, 2 node(s) had untolerated taint {role: karpenter}, 2 node(s) had volume node affinity conflict, 32 node(s) had untolerated taint {role: linux-runners}. preemption: 0/43 nodes are available: 1 Insufficient memory, 3 node(s) didn't match pod topology spread constraints, 39 Preemption is not helpful for scheduling.
  Normal   NotTriggerScaleUp  4m28s (x39 over 14m)  cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) had untolerated taint {role: karpenter}, 1 node(s) had untolerated taint {os: windows}
  Normal   Nominated          2m11s (x7 over 14m)   karpenter           Pod should schedule on: nodeclaim/runtime-2v5xp, node/ip-10-0-64-84.us-east-2.compute.internal

The only log line I have for the pod is:

{"level":"INFO","time":"2025-01-08T16:59:18.063Z","logger":"controller","message":"pod(s) have a preferred Anti-Affinity which can prevent consolidation","commit":"b897114","controller":"provisioner","namespace":"","name":"","reconcileID":"62586553-5ecf-4775-b159-8ffd11cf9bea","pods":"runtime/loki-distributed-ingester-2"}

and for the node there are no logs except the ones about node creation.
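
For context, the nominations themselves can be listed the same way as the blocking events (a sketch):

# Show which pods Karpenter has nominated onto which nodeclaims/nodes
kubectl get events --all-namespaces --field-selector reason=Nominated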

@george-zubrienko

george-zubrienko commented Jan 13, 2025

Can confirm the same even with spotToSpotConsolidation set to true, for the following budget configuration:

      disruption = {
        consolidationPolicy = "WhenEmptyOrUnderutilized"
        consolidateAfter    = "1m0s"
        budgets = [
          {
            nodes = "50%"
            reasons = [
              "Underutilized"
            ]
          },
          {
            nodes = "50%"
            reasons = [
              "Empty"
            ]
          },
          {
            nodes = "0"
            reasons = [
              "Drifted"
            ]
          }
        ]
      }

or simply with the default budget:

      disruption = {
        consolidationPolicy = "WhenEmptyOrUnderutilized"
        consolidateAfter    = "1m0s"
      }

With pods arriving every 2-3 minutes, I still get:

Events:
  Type    Reason             Age                From       Message
  ----    ------             ----               ----       -------
  Normal  DisruptionBlocked  52m (x3 over 77m)  karpenter  Cannot disrupt Node: state node is nominated for a pending pod

On 0.37.6 this worked correctly, with the node being replaced; on 1.0.8 it is broken.

@george-zubrienko

@jigisha620 same issue on v1.1.1: a 64-core spot node, even with spotToSpotConsolidation enabled, is not consolidated with a replace over almost 2 hours:

Image

@george-zubrienko

Some additional context here: in our case I think the reason is that when consolidateAfter is set to 2m and, on average, a pod is removed or added every 1m, Karpenter will not consider the node for consolidation, even though it is eligible by definition.

Once I changed the period to 15s, I finally saw the following in the logs:

{"level":"INFO","time":"2025-01-13T21:16:57.505Z","logger":"controller","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (25 pods) i-xxxxxxxx.xx-xx-1.compute.internal/m7g.16xlarge/spot","commit":"3298d91","controller":"disruption","namespace":"","name":"","reconcileID":"...","command-id":"...","reason":"underutilized"}

So it seems that consolidateAfter must be lower than the pod scheduling period when a NodePool is used for workloads that are sporadic rather than constant or slowly changing.
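
As a workaround sketch (the NodePool name default is taken from the original report), the value can be patched in place:

# Lower consolidateAfter so sporadic pod churn does not keep deferring consolidation
kubectl patch nodepool default --type merge \
  -p '{"spec":{"disruption":{"consolidateAfter":"15s"}}}'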
