
Karpenter consolidation replaces the node with exact same node (EC2 instance) type #4826

Closed
badrish-s opened this issue Oct 13, 2023 · 7 comments
Labels
lifecycle/closed, lifecycle/stale, question (Issues that are support related questions)

Comments

@badrish-s
Contributor

badrish-s commented Oct 13, 2023

Description

Observed Behavior:

Karpenter consolidation replaces a node with the exact same node (EC2 instance) type when spec.disruption.consolidationPolicy: WhenUnderutilized is set. Also, eks-node-viewer doesn't show the node to be deleted/replaced as "Cordoned" - at least, that was the behaviour observed during consolidation with earlier versions of Karpenter.

Expected Behavior:

My understanding is that consolidation should kick in under the following situations for OnDemand instance types:

  • Deletes a node – when pods can run on the free capacity of other nodes in the cluster
  • Deletes a node – when the node is empty
  • Replaces a node – when pods can run on a combination of the free capacity of other nodes in the cluster plus a more efficient replacement node

However, I noticed that the "replace node" action happens whenever Karpenter finds the node is underutilized - the node is replaced on a continuous basis and with the exact same node type. In my case a t4g.nano was replaced with another t4g.nano; the replacement node was not more efficient than the original node in any way, it was exactly the same. This behaviour made me think the replacement is happening based on utilization only.

Also, the node to be deleted/replaced should be cordoned first, then drained, and deleted only after its pods are placed onto the new node.
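
(For reference, a node that has been cordoned for disruption would show spec.unschedulable: true in its manifest; a minimal illustrative sketch follows, using the node name from the screenshots further below. This is not output captured from the cluster.)

apiVersion: v1
kind: Node
metadata:
  name: ip-192-168-151-16.us-west-2.compute.internal  # example node from this report
spec:
  # Set by the cordon step before the node is drained; this is the field
  # eks-node-viewer reports as "Cordoned".
  unschedulable: true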

Reproduction Steps:

NodeClass.yaml (karpenter-demo is my Cluster name):

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  creationTimestamp: null
  name: default
spec:
  amiFamily: AL2
  role: KarpenterNodeRole-karpenter-demo
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: karpenter-demo
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: karpenter-demo
status: {}

Nodepool.yaml

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  creationTimestamp: null
  name: default
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: Never
  limits:
    cpu: 10k
  template:
    spec:
      nodeClassRef:
        name: default
      requirements:
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      resources: {}
status: {}
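
(Not part of the original reproduction: to rule out underutilization-driven replacement while testing, the disruption block could be switched to only consolidate empty nodes. A hedged sketch, assuming v1beta1 accepts consolidateAfter together with WhenEmpty:)

  disruption:
    # Alternative used only to isolate the behaviour, not the config above:
    # remove nodes once they are completely empty, after a short delay.
    consolidationPolicy: WhenEmpty
    consolidateAfter: 30s
    expireAfter: Never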

Deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 1
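
(For what it's worth, one way to confirm whether consolidation is driving the churn would be to opt the pods out of voluntary disruption via the karpenter.sh/do-not-disrupt pod annotation. A minimal sketch of the pod template change under that assumption, not part of the original reproduction:)

  template:
    metadata:
      labels:
        app: inflate
      annotations:
        # Opts these pods out of Karpenter's voluntary disruption (v1beta1),
        # so the node they land on should not be consolidated away.
        karpenter.sh/do-not-disrupt: "true"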

Screens from eks-node-viewer:

ip-192-168-151-16.us-west-2.compute.internal (t4g.nano) is being consolidated (because it is underutilized?) and replaced with ip-192-168-24-218.us-west-2.compute.internal (again, a t4g.nano).

[Screenshot 2023-10-12 at 6:54:45 PM]

[Screenshot 2023-10-12 at 6:55:55 PM]

[Screenshot 2023-10-12 at 6:56:26 PM]

After some time, ip-192-168-24-218.us-west-2.compute.internal is again replaced with another t4g.nano instance, and the cycle repeats continuously.

Additionally, unlike with earlier versions of Karpenter, eks-node-viewer doesn't show the node to be replaced as "Cordoned". Since the logs were rotating fast, it was hard to check whether pods were being gracefully moved to the new node.

Do I have something misconfigured in the NodePool, NodeClass, or Deployment manifest? Or is this the expected consolidation behaviour in v1beta1 that needs additional configuration to work as expected? If there are no misconfigurations or additional configurations to control this, then this is a potential bug that needs attention.

Versions:

  • Chart Version:
  • Kubernetes Version (kubectl version):
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@badrish-s added the bug (Something isn't working) label Oct 13, 2023
@sadath-12
Contributor

As far as cordoning and draining of the nodes selected for disruption are concerned, that will be handled once kubernetes-sigs/karpenter#624 is resolved.

@ellistarn
Contributor

You're using an unreleased version of the beta?

@ellistarn
Contributor

Can you provide some logs?

@ellistarn added the question (Issues that are support related questions) label and removed the bug (Something isn't working) label Oct 25, 2023
@njtran
Contributor

njtran commented Oct 25, 2023

@badrish-s does your t4g.nano node have any node memory pressure taints?
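
(For reference, such a taint would appear in the node spec roughly as below; an illustrative sketch only, not output captured from the cluster.)

spec:
  taints:
  # Taint added by the node lifecycle controller when the node reports
  # the MemoryPressure condition.
  - key: node.kubernetes.io/memory-pressure
    effect: NoSchedule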

Contributor

github-actions bot commented Nov 9, 2023

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.

github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Nov 23, 2023
@badrish-s
Contributor Author

Apologies. I was on vacation and didn't get a chance to respond to this issue earlier. I am back now, and I set up Karpenter v1beta1 fresh (using the instructions) and applied the same set of NodePool, NodeClass, and Deployment manifests. I am unable to reproduce the issue now, i.e. the node is NOT being replaced on a continuous basis with the exact same node type.

I'd like to mention that the KARPENTER VERSION I am currently using is the latest, i.e. v0.32.2 - this was different when I originally tested and reported this issue in October 2023, when I was testing with an internal-only image - v0-2012cf98c2e2e9625e858842c9f2d177efb0c364. I believe I either did something incorrect earlier or the issue has been addressed in the latest version v0.32.2. GitHub Actions has closed this issue due to inactivity, and I will let it remain that way until I see this again (hopefully never). Thanks for looking into this!

@jonathan-innis
Contributor

Sounds good @badrish-s. Glad to hear that the issue appears to be resolved on the latest version!


5 participants