DaemonSets not being correctly calculated when choosing a node #715

Open
kfirsch opened this issue Mar 22, 2023 · 33 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. v1.x Issues prioritized for post-1.0

Comments

@kfirsch

kfirsch commented Mar 22, 2023

Version

Karpenter Version: v0.24.0

Kubernetes Version: v1.21.0

Context

Due to the significant resource usage of certain Daemonsets, particularly when operating on larger machines, we have chosen to divide these Daemonsets based on affinity rules that use Karpenter's labels such as karpenter.k8s.aws/instance-cpu or karpenter.k8s.aws/instance-size.

Expected Behavior

When selecting a node for provisioning, Karpenter should only consider the DaemonSets that will actually run on that node.

Actual Behavior

It appears that Karpenter wrongly includes all of the split DaemonSets instead of only the applicable one, which can result in poor instance selection when provisioning new nodes and in inaccurate consolidation decisions.

Steps to Reproduce the Problem

  • Create a fresh cluster with Karpenter deployed and a default provisioner:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  providerRef:
    name: default
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["c6i"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand"]
  consolidation:
    enabled: true
  • Duplicate one of your Daemonsets and split the two copies between small and large machines using the following settings:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: karpenter.k8s.aws/instance-cpu
          operator: Lt
          values:
          - "31"

  resources:
    requests:
      cpu: 1
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: karpenter.k8s.aws/instance-cpu
          operator: Gt
          values:
          - "30"

  resources:
    requests:
      cpu: 10
  • Create a simple Pod with a 1 CPU request (see the example pod spec after this list). Karpenter should provision a 2-CPU or at most 4-CPU instance, but it instead provisions a large (>10 CPU) machine because it wrongly includes the bigger DaemonSet in the 2/4/8 CPU evaluation.

  • The same behavior occurs when using karpenter.k8s.aws/instance-size or even podAntiAffinity rules in the DaemonSet affinities.
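
For reference, a minimal pod spec along these lines is enough to trigger the provisioning decision (the pod name and image below are illustrative, not taken from the report):

apiVersion: v1
kind: Pod
metadata:
  name: cpu-test                         # illustrative name
spec:
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9   # any small image works; the request is what matters
      resources:
        requests:
          cpu: 1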

Thank you for your help in addressing this issue.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@njtran
Contributor

njtran commented Mar 23, 2023

Trying it out myself with a 1 CPU request and the same affinities and resource requests on the daemonsets, I was able to get a c5.4xlarge, but I saw that my logs were printing out the wrong resource requests.

karpenter-67c74b794c-rxplx controller 2023-03-23T01:38:04.125Z	INFO	controller.provisioner	launching machine with 1 pods requesting {"cpu":"12125m","pods":"5"} from types c4.4xlarge, c6i.4xlarge, m4.4xlarge, m5a.4xlarge, m6idn.4xlarge and 32 other(s)	{"commit": "17a08aa-dirty", "provisioner": "default"}

Once I removed the large DS with 10 cpu, I saw it change the resource requests.

karpenter-67c74b794c-rxplx controller 2023-03-23T01:39:12.402Z	INFO	controller.provisioner	launching machine with 1 pods requesting {"cpu":"2125m","pods":"4"} from types r6idn.2xlarge, r6a.2xlarge, m5d.xlarge, m5dn.2xlarge, c6in.xlarge and 110 other(s)	{"commit": "17a08aa-dirty", "provisioner": "default"}

This makes me think our daemonset overhead computation logic is including daemonsets that aren't compatible with the instance types that are being launched, which may be why the larger instance is being created, resulting in the behavior you see.

@njtran
Contributor

njtran commented Mar 23, 2023

This definitely doesn't sound like the right behavior. I'll dig into this first thing tomorrow to see if this is the case.

@njtran
Contributor

njtran commented Mar 23, 2023

Hey @kfirsch, tracked down the issue. This is something that's not currently supported by the scheduling code. The scheduling logic calculates the resource requirements of non-daemonset pods differently from daemonsets.

Karpenter optimistically includes all daemonsets that are compatible with a Provisioner's requirements during bin-packing. In this case, that means Karpenter assumes the daemonset overhead for every instance type allowed by this Provisioner is at least 11 vCPU, so it thinks there's more overhead than there actually is for each instance type, which is why it tends to pick larger instance types.

To fix this would require a non-trivial amount of code changes to the scheduling logic, but it definitely is a bug.

In the meantime, if you're able to use multiple provisioners for each of these daemonsets to ensure that the bin-packing only considers one of them at a time, that should solve your issue.
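
A rough sketch of that workaround, assuming the DaemonSets can key off a provisioner-applied label instead of karpenter.k8s.aws/instance-cpu directly (the node-size label and provisioner names below are illustrative, not from this issue): each provisioner carries the instance-cpu requirement and applies its own label, so each DaemonSet only matches one provisioner and the bin-packing only counts the relevant one.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: small                    # illustrative name
spec:
  providerRef:
    name: default
  labels:
    node-size: small             # applied to every node this provisioner launches
  requirements:
    - key: karpenter.k8s.aws/instance-cpu
      operator: Lt
      values: ["31"]
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: large                    # illustrative name
spec:
  providerRef:
    name: default
  labels:
    node-size: large             # applied to every node this provisioner launches
  requirements:
    - key: karpenter.k8s.aws/instance-cpu
      operator: Gt
      values: ["30"]

The small/large DaemonSets would then use nodeSelector: {node-size: small} and nodeSelector: {node-size: large} respectively, which Karpenter can evaluate at the provisioner level.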

@njtran njtran added the v1 Issues requiring resolution by the v1 milestone label Mar 28, 2023
@maxforasteiro

I don't know if it's a similar issue, but on my cluster I have a daemonset that stays in a pending state due to a lack of CPU on one of the nodes, and Karpenter doesn't know how to deal with that. I tried to cordon the node and delete some of the pods running on it, which makes Karpenter spin up a new instance, but as soon as I uncordon the old node, Karpenter removes one of the nodes since it calculates that it is no longer needed, causing the same daemonset to become unschedulable again.

    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: ["c", "m", "r"]
    - key: karpenter.k8s.aws/instance-size
      operator: NotIn
      values: [nano, micro, small, medium, large]

These are my provisioner requirements.

should I open another issue or are these two problems related?

@ellistarn
Contributor

There are a number of reasons these calculations could be off. Can you take a look at the logs and find the line that describes our binpacking decision? If you have custom AMIs or custom system-reserved resources that Karpenter isn't aware of, that can cause problems.

@maxforasteiro

I see this in the logs:

2023-03-30T11:01:20.604Z INFO controller.provisioner found provisionable pod(s) {"commit": "dc3af1a", "pods": 1}
2023-03-30T11:01:20.604Z INFO controller.provisioner computed new node(s) to fit pod(s) {"commit": "dc3af1a", "nodes": 1, "pods": 1}
2023-03-30T11:01:20.604Z INFO controller.provisioner launching machine with 1 pods requesting {"cpu":"1005m","memory":"768Mi","pods":"7"} from types r6idn.16xlarge, m6id.2xlarge, c6in.xlarge, m5a.16xlarge, r6a.xlarge and 294 other(s) {"commit": "dc3af1a", "provisioner": "default-provisioner"}
2023-03-30T11:01:21.013Z DEBUG controller.provisioner.cloudprovider created launch template {"commit": "dc3af1a", "provisioner": "default-provisioner", "launch-template-name": "Karpenter-fleet-sandbox-13145882181278882901", "launch-template-id": "lt-07145405b3919a1b0"}
2023-03-30T11:01:23.863Z DEBUG controller.provisioner.cloudprovider removing offering from offerings {"commit": "dc3af1a", "provisioner": "default-provisioner", "reason": "InsufficientInstanceCapacity", "instance-type": "m2.2xlarge", "zone": "eu-west-1c", "capacity-type": "spot", "ttl": "3m0s"}
2023-03-30T11:01:24.066Z INFO controller.provisioner.cloudprovider launched new instance {"commit": "dc3af1a", "provisioner": "default-provisioner", "id": "i-02e939aee74f23a04", "hostname": "ip-yy-yy-yy-yy.eu-west-1.compute.internal", "instance-type": "c5.xlarge", "zone": "eu-west-1c", "capacity-type": "spot"}
2023-03-30T11:05:17.972Z INFO controller.deprovisioning deprovisioning via consolidation delete, terminating 1 nodes ip-xx-xx-xx-xx.eu-west-1.compute.internal/c5.xlarge/spot {"commit": "dc3af1a"}
2023-03-30T11:05:17.995Z INFO controller.termination cordoned node {"commit": "dc3af1a", "node": "ip-xx-xx-xx-xx.eu-west-1.compute.internal"}
2023-03-30T11:05:51.281Z INFO controller.termination cordoned node {"commit": "dc3af1a", "node": "ip-xx-xx-xx-xx.eu-west-1.compute.internal"}
2023-03-30T11:05:51.347Z INFO controller.termination cordoned node {"commit": "dc3af1a", "node": "ip-xx-xx-xx-xx.eu-west-1.compute.internal"}
2023-03-30T11:06:20.817Z INFO controller.termination deleted node {"commit": "dc3af1a", "node": "ip-xx-xx-xx-xx.eu-west-1.compute.internal"}
2023-03-30T11:11:47.715Z DEBUG controller.aws deleted launch template {"commit": "dc3af1a"}

When my daemonset was pending and unschedulable, I didn't see any events being created by Karpenter. These events only appeared when I cordoned the node and deleted one of the pods running on it, allowing the daemonset to be scheduled on the cordoned node and creating a new node for the deleted pod.
After some fiddling around I was able to make everything fit.

@maxforasteiro

I found the answer in #731 and I agree with the comment there.

@tzneal
Contributor

tzneal commented Apr 14, 2023

Karpenter currently calculates the applicable daemonsets at the provisioner level using label selectors/taints, etc. It does not check whether there are requirements on the daemonsets that would exclude them from running on particular instance types that the provisioner could launch.

The workaround for now is to use multiple provisioners with taints/tolerations or label selectors to limit daemonsets to only nodes launched from specific provisioners.
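
A sketch of the taint-based variant (the taint key and provisioner name below are illustrative): the provisioner taints the nodes it launches, so only a DaemonSet that explicitly tolerates that taint is counted as overhead for this provisioner.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: large-nodes                     # illustrative name
spec:
  providerRef:
    name: default
  taints:
    - key: example.com/large-only       # illustrative taint key
      value: "true"
      effect: NoSchedule
  requirements:
    - key: karpenter.k8s.aws/instance-cpu
      operator: Gt
      values: ["30"]

The "large" DaemonSet's pod template then carries the matching toleration, while the "small" one does not:

tolerations:
  - key: example.com/large-only
    operator: Equal
    value: "true"
    effect: NoSchedule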

@marksumm

@tzneal I have a nodeSelector based on a custom label, which has been applied to the ebs-csi-node DaemonSet. The idea is that the CSI node pod should only run on machines that actually need dynamically provisioned EBS volumes. For some reason, these pods are not affected by this issue.

@tzneal
Contributor

tzneal commented Apr 14, 2023

Yes, it works with taints/tolerations and labels on the provisioner. It doesn't work for labels that need to be discovered from instance types that the provisioner might potentially launch.
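
For illustration (excerpts only, not taken from the issue): a DaemonSet selector on a label the provisioner itself declares can be resolved before any node exists, whereas a required affinity on an instance-type-derived label such as karpenter.k8s.aws/instance-cpu can only be resolved once a concrete instance type is chosen, so today that DaemonSet is counted against every instance type the provisioner could launch.

# Resolvable at the provisioner level (works), assuming the provisioner
# declares this custom label via .spec.labels:
nodeSelector:
  example.com/needs-ebs: "true"         # illustrative custom label
---
# Only resolvable per instance type (not handled today):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: karpenter.k8s.aws/instance-cpu
              operator: Lt
              values: ["31"]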

uditsidana referenced this issue in uditsidana/karpenter May 29, 2023
Addressing the issue - https://github.com/aws/karpenter/issues/3634 where daemonset calculation of resources is not working as expected.
@uristernik

uristernik commented May 30, 2023

Yes, it works with taints/tolerations and labels on the provisioner. It doesn't work for labels that need to be discovered from instance types that the provisioner might potentially launch.

Just to clarify:

Assuming we want to exclude a daemon set named DS_X from nodes with label node_type: NODE_Y

Setting the node_type: NODE_Y label on the provisioner and setting the .spec.template.spec.nodeSelector field on daemon set DS_X to match all node_type values except NODE_Y would work as expected, because it's known prior to adding the node?

But setting the node_type: NODE_Y label on the provisioner and setting .spec.template.spec.affinity with a match expression to not run on nodes with the node_type: NODE_Y label won't, because it's not known prior to adding the node?

I am asking because we are experiencing this behaviour, but the two cases seem pretty much the same to me (and if I understand what you wrote correctly, this should be supported).

@tzneal
Contributor

tzneal commented Jul 23, 2023

But setting the node_type: NODE_Y label on the provisioner and setting .spec.template.spec.affinity with a match expression to not run on nodes with the node_type: NODE_Y label won't, because it's not known prior to adding the node?

If you set a required node affinity to not run on nodes with a label, and the provisioner is configured to apply that label to all nodes it launches, then we shouldn't consider that daemonset for that provisioner.
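
A sketch of that case, reusing the hypothetical names from the question above: because the provisioner declares node_type: NODE_Y for every node it launches, the required anti-affinity on DS_X can be evaluated without knowing the instance type, so DS_X should be excluded from this provisioner's daemonset overhead.

# Provisioner excerpt
labels:
  node_type: NODE_Y                     # applied to all nodes from this provisioner
---
# DaemonSet DS_X pod template excerpt
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node_type
              operator: NotIn
              values: ["NODE_Y"]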

@billrayburn billrayburn added v1.x Issues prioritized for post-1.0 and removed v1 Issues requiring resolution by the v1 milestone labels Aug 23, 2023
@Onlinehead

+1 here.
I have a set of cache nodes, created as a statefulset with a requirement of 62000m of memory, a node selector purpose=cache, and a provisioner with instance type equal to x2gd.xlarge.
Here is a message from the log:
incompatible with provisioner "cache", daemonset overhead={"cpu":"200m","memory":"128Mi","pods":"5"}, no instance type satisfied resources {"cpu":"200m","memory":"60128Mi","pods":"6"} and requirements karpenter.sh/capacity-type In [on-demand], karpenter.sh/provisioner-name In [cache], kubernetes.io/arch In [amd64 arm64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [x2gd.xlarge], purpose In [cache] (no instance type which had enough resources and the required offering met the scheduling requirements);

The x2gd.xlarge type has 64 GB of memory, so it should satisfy the request. Moreover, cluster-autoscaler, which I migrated the cluster from, works well in that case.
Karpenter created a node only when I decreased the memory request to 50Gi.

@njtran njtran transferred this issue from aws/karpenter-provider-aws Nov 2, 2023
@kcchy

kcchy commented Nov 23, 2023

+1 here.

An r5.4xlarge should have enough capacity; is there a way to solve this issue?

2023-11-17T07:24:17.807Z ERROR controller.provisioner Could not schedule pod, incompatible with provisioner "flink", daemonset overhead={"cpu":"455m","memory":"459383552","pods":"7"}, no instance type satisfied resources {"cpu":"11955m","memory":"126087176960","pods":"8"}

Karpenter Version: v0.29.2

@zackery-parkhurst

zackery-parkhurst commented Dec 7, 2023

+1 here

Karpenter version 0.30

I was also just bitten by this bug.

I have a daemonset that uses node affinity to schedule on specific nodes. When I adjusted the resource requests/limits on that daemonset, it broke all of my Karpenter provisioners: no pod could schedule because of the daemonset overhead.

But the daemonset in question is not scheduled on any of the nodes that Karpenter creates, so it makes no sense that Karpenter would consider it in its calculations.

Are there any plans to fix this in the future?

For now, to make my Karpenter provisioners work, I will have to rightsize the pods of a daemonset that is never going to be scheduled on the nodes created by Karpenter. That means I have capacity that is simply wasted, and I'm forced to make pods smaller than I should have to.


This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2023
@tzneal tzneal removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 21, 2024
@bartoszgridgg

Hey
This is impacting us. Is it possible to add affinity checks as well?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 10, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 10, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 9, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@njtran njtran reopened this Dec 12, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Dec 12, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Karpenter contributors determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Jan 11, 2025
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@saurav-agarwalla
Contributor

/reopen

@k8s-ci-robot
Contributor

@saurav-agarwalla: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot reopened this Jan 14, 2025
@saurav-agarwalla
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 14, 2025
@saurav-agarwalla
Contributor

/assign

saurav-agarwalla added a commit to saurav-agarwalla/karpenter-provider-aws that referenced this issue Jan 14, 2025
@rschalo
Contributor

rschalo commented Feb 3, 2025

/remove-label needs-triage

@k8s-ci-robot
Contributor

@rschalo: The label(s) /remove-label needs-triage cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor, ci-short, ci-extended, ci-full. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/remove-label needs-triage

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@karen-wang-ai

@saurav-agarwalla I'm also running into this issue. If it's actively being worked on, I'm happy to wait, but if not, I'll explore the workaround. Let me know—thanks!
