fix: Ignore pods awaiting garbage collection during topology calculat… #642

Merged: 2 commits, Oct 27, 2023

@ellistarn ellistarn commented Oct 27, 2023

Fixes #640

Description

Pods can exist with a NodeName that references a Node that does not exist. These pods will be removed by the pod garbage collector and should not be included in topology calculations. Today they are still counted, which blocks provisioning until the API Server cleans up the pod and increases provisioning latency.

How was this change tested?

  • make presubmit
  • Manually:
    • Created a Deployment
    • Deleted NodeClaims
    • Scaled the Deployment to create more pods that share topology, while the existing pods are leaked
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: inflate
      tolerations:
      - key: node.kubernetes.io/unschedulable
        effect: "NoSchedule"
      terminationGracePeriodSeconds: 0
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: 1
EOF
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:22:53.258Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"de4388d-dirty","nodeclaims":1,"pods":3}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:22:53.258Z","logger":"controller.provisioner","message":"computed 2 unready node(s) will fit 6 pod(s)","commit":"de4388d-dirty"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:22:53.273Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"de4388d-dirty","nodepool":"default","nodeclaim":"default-8m46q","requests":{"cpu":"3155m","memory":"120Mi","pods":"6"},"instance-types":"c5.2xlarge, c5.4xlarge, c5.9xlarge, c5.xlarge, c5a.2xlarge and 95 other(s)"}
karpenter-6549c476f5-sqfnz controller {"level":"DEBUG","time":"2023-10-27T18:22:53.426Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"de4388d-dirty","nodeclaim":"default-8m46q","nodepool":"default","launch-template-name":"karpenter.k8s.aws/5944867442244304826","id":"lt-0cfa350d4f5d4382f"}
karpenter-6549c476f5-sqfnz controller {"level":"DEBUG","time":"2023-10-27T18:22:54.879Z","logger":"controller.provisioner","message":"waiting on cluster sync","commit":"de4388d-dirty"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:22:55.314Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"de4388d-dirty","nodeclaim":"default-8m46q","nodepool":"default","provider-id":"aws:///us-west-2d/i-0352ad78d65d36c23","instance-type":"c6a.xlarge","zone":"us-west-2d","capacity-type":"on-demand","allocatable":{"cpu":"3920m","ephemeral-storage":"17Gi","memory":"6584Mi","pods":"58","vpc.amazonaws.com/pod-eni":"18"}}
karpenter-6549c476f5-sqfnz controller {"level":"DEBUG","time":"2023-10-27T18:23:06.174Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"de4388d-dirty","nodeclaim":"default-sxp4t","nodepool":"default","provider-id":"aws:///us-west-2b/i-0e8974831ae0024bf","node":"ip-192-168-110-58.us-west-2.compute.internal"}
karpenter-6549c476f5-sqfnz controller {"level":"DEBUG","time":"2023-10-27T18:23:07.842Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"de4388d-dirty","nodeclaim":"default-fd4kc","nodepool":"default","provider-id":"aws:///us-west-2a/i-0dc98d57f9aa1f7d3","node":"ip-192-168-81-90.us-west-2.compute.internal"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:23:21.798Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"de4388d-dirty","nodeclaim":"default-sxp4t","nodepool":"default","provider-id":"aws:///us-west-2b/i-0e8974831ae0024bf","node":"ip-192-168-110-58.us-west-2.compute.internal"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:23:23.464Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"de4388d-dirty","nodeclaim":"default-fd4kc","nodepool":"default","provider-id":"aws:///us-west-2a/i-0dc98d57f9aa1f7d3","node":"ip-192-168-81-90.us-west-2.compute.internal"}
karpenter-6549c476f5-sqfnz controller {"level":"DEBUG","time":"2023-10-27T18:23:31.593Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"de4388d-dirty","nodeclaim":"default-8m46q","nodepool":"default","provider-id":"aws:///us-west-2d/i-0352ad78d65d36c23","node":"ip-192-168-157-179.us-west-2.compute.internal"}
karpenter-6549c476f5-sqfnz controller {"level":"DEBUG","time":"2023-10-27T18:23:41.850Z","logger":"controller.disruption","message":"discovered subnets","commit":"de4388d-dirty","subnets":["subnet-053505c566f46fab2 (us-west-2d)","subnet-02e2f3e9dbded948f (us-west-2a)","subnet-0fbb73e571bbef1b7 (us-west-2b)","subnet-0c4df9999ecbf637e (us-west-2b)","subnet-0f4b3a1820268bbdb (us-west-2a)","subnet-01d4f4bd25cb07a1c (us-west-2d)"]}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:23:46.980Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"de4388d-dirty","nodeclaim":"default-8m46q","nodepool":"default","provider-id":"aws:///us-west-2d/i-0352ad78d65d36c23","node":"ip-192-168-157-179.us-west-2.compute.internal"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:23:56.575Z","logger":"controller.node.termination","message":"cordoned node","commit":"de4388d-dirty","node":"ip-192-168-157-179.us-west-2.compute.internal"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:23:56.606Z","logger":"controller.node.termination","message":"cordoned node","commit":"de4388d-dirty","node":"ip-192-168-81-90.us-west-2.compute.internal"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:23:56.618Z","logger":"controller.node.termination","message":"cordoned node","commit":"de4388d-dirty","node":"ip-192-168-110-58.us-west-2.compute.internal"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:23:57.130Z","logger":"controller.node.termination","message":"deleted node","commit":"de4388d-dirty","node":"ip-192-168-110-58.us-west-2.compute.internal"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:23:57.137Z","logger":"controller.node.termination","message":"deleted node","commit":"de4388d-dirty","node":"ip-192-168-157-179.us-west-2.compute.internal"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:23:57.138Z","logger":"controller.node.termination","message":"deleted node","commit":"de4388d-dirty","node":"ip-192-168-81-90.us-west-2.compute.internal"}
karpenter-6549c476f5-kvn46 controller {"level":"ERROR","time":"2023-10-27T18:23:57.541Z","logger":"webhook","message":"http: TLS handshake error from 192.168.164.78:47260: EOF\n","commit":"de4388d-dirty"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:23:57.545Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"de4388d-dirty","nodeclaim":"default-8m46q","nodepool":"default","node":"ip-192-168-157-179.us-west-2.compute.internal","provider-id":"aws:///us-west-2d/i-0352ad78d65d36c23"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:23:57.548Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"de4388d-dirty","nodeclaim":"default-fd4kc","nodepool":"default","node":"ip-192-168-81-90.us-west-2.compute.internal","provider-id":"aws:///us-west-2a/i-0dc98d57f9aa1f7d3"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:23:57.549Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"de4388d-dirty","nodeclaim":"default-sxp4t","nodepool":"default","node":"ip-192-168-110-58.us-west-2.compute.internal","provider-id":"aws:///us-west-2b/i-0e8974831ae0024bf"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:24:08.327Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"de4388d-dirty","pods":"default/inflate-5c66bc95bd-lst5d, default/inflate-5c66bc95bd-r6526, default/inflate-5c66bc95bd-nbhvd, default/inflate-5c66bc95bd-x9zjb, default/inflate-5c66bc95bd-tc7jh","duration":"38.62269ms"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:24:08.327Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"de4388d-dirty","nodeclaims":3,"pods":5}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:24:08.380Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"de4388d-dirty","nodepool":"default","nodeclaim":"default-4q8t9","requests":{"cpu":"2155m","memory":"120Mi","pods":"5"},"instance-types":"c3.2xlarge, c3.xlarge, c4.2xlarge, c4.xlarge, c5.2xlarge and 95 other(s)"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:24:08.383Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"de4388d-dirty","nodepool":"default","nodeclaim":"default-qgdsh","requests":{"cpu":"2155m","memory":"120Mi","pods":"5"},"instance-types":"c3.2xlarge, c3.xlarge, c4.2xlarge, c4.xlarge, c5.2xlarge and 95 other(s)"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:24:08.390Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"de4388d-dirty","nodepool":"default","nodeclaim":"default-wcrdh","requests":{"cpu":"1155m","memory":"120Mi","pods":"4"},"instance-types":"c5.2xlarge, c5.4xlarge, c5.large, c5.xlarge, c5a.2xlarge and 95 other(s)"}
karpenter-6549c476f5-sqfnz controller {"level":"DEBUG","time":"2023-10-27T18:24:08.590Z","logger":"controller.nodeclaim.lifecycle","message":"discovered launch template","commit":"de4388d-dirty","nodeclaim":"default-4q8t9","nodepool":"default","launch-template-name":"karpenter.k8s.aws/12808623025736453274"}
karpenter-6549c476f5-sqfnz controller {"level":"DEBUG","time":"2023-10-27T18:24:08.723Z","logger":"controller.nodeclaim.lifecycle","message":"created launch template","commit":"de4388d-dirty","nodeclaim":"default-wcrdh","nodepool":"default","launch-template-name":"karpenter.k8s.aws/14084534824751046234","id":"lt-09038d23f8c26a3c4"}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:24:10.628Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"de4388d-dirty","nodeclaim":"default-qgdsh","nodepool":"default","provider-id":"aws:///us-west-2b/i-0cf1d097c62b6bd75","instance-type":"c6a.xlarge","zone":"us-west-2b","capacity-type":"on-demand","allocatable":{"cpu":"3920m","ephemeral-storage":"17Gi","memory":"6584Mi","pods":"58","vpc.amazonaws.com/pod-eni":"18"}}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:24:10.695Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"de4388d-dirty","nodeclaim":"default-wcrdh","nodepool":"default","provider-id":"aws:///us-west-2d/i-04796526b33128025","instance-type":"c6a.large","zone":"us-west-2d","capacity-type":"on-demand","allocatable":{"cpu":"1930m","ephemeral-storage":"17Gi","memory":"3114Mi","pods":"29","vpc.amazonaws.com/pod-eni":"9"}}
karpenter-6549c476f5-sqfnz controller {"level":"INFO","time":"2023-10-27T18:24:10.709Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"de4388d-dirty","nodeclaim":"default-4q8t9","nodepool":"default","provider-id":"aws:///us-west-2a/i-031a4c730377367eb","instance-type":"c6a.xlarge","zone":"us-west-2a","capacity-type":"on-demand","allocatable":{"cpu":"3920m","ephemeral-storage":"17Gi","memory":"6584Mi","pods":"58","vpc.amazonaws.com/pod-eni":"18"}}
karpenter-6549c476f5-sqfnz controller {"level":"DEBUG","time":"2023-10-27T18:24:29.206Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"de4388d-dirty","nodeclaim":"default-qgdsh","nodepool":"default","provid

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

jonathan-innis
jonathan-innis previously approved these changes Oct 27, 2023

@jonathan-innis jonathan-innis left a comment

One optional comment
