DaemonSets not being correctly calculated when choosing a node #715
Comments
Trying it out myself with a 1 CPU request and the same affinities and resource requests on the daemonsets, I was able to get a c5.4xlarge, but I saw that my logs were printing out the wrong resource requests.
Once I removed the large DS with 10 cpu, I saw it change the resource requests.
This makes me think our daemonset overhead computation logic is including daemonsets that aren't compatible with the instance types that are being launched, which may be why the larger instance is being created, resulting in the behavior you see.
This definitely doesn't sound like the right behavior. I'll dig into this first thing tomorrow to see if this is the case.
Hey @kfirsch, tracked down the issue. This is something that's not currently supported by the scheduling code. The scheduling logic calculates the resource requirements of non-daemonset pods differently than daemonsets: Karpenter optimistically includes all daemonsets that are compatible with a Provisioner's requirements during bin-packing. This means Karpenter thinks the daemonset overhead for every instance type allowed by this Provisioner will be at least 11 vCPU in this case, i.e. it assumes more overhead than there actually is for each instance type, which is why it tends to pick larger instance types. Fixing this would require a non-trivial amount of changes to the scheduling logic, but it definitely is a bug. In the meantime, if you're able to use multiple Provisioners, one per daemonset, so that bin-packing only considers one of them at a time, that should solve your issue.
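For anyone who wants the concrete shape of that workaround, a rough sketch is below. The Provisioner/DaemonSet names, the pool label, the taint key, the instance-cpu threshold, and the image are all hypothetical placeholders; the point is only that the large DaemonSet tolerates a taint and selects a label that a dedicated Provisioner applies, so the default Provisioner's bin-packing never counts its 10 vCPU.

```yaml
# Sketch of the multi-Provisioner workaround (hypothetical names and values).
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: big-daemonset-pool
spec:
  # providerRef/limits omitted for brevity
  labels:
    pool: big-daemonset            # applied to every node this Provisioner launches
  taints:
    - key: pool
      value: big-daemonset
      effect: NoSchedule           # keeps other workloads and daemonsets off this pool
  requirements:
    - key: karpenter.k8s.aws/instance-cpu
      operator: Gt
      values: ["15"]               # only large instances in this pool (assumption)
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: big-daemon
spec:
  selector:
    matchLabels:
      app: big-daemon
  template:
    metadata:
      labels:
        app: big-daemon
    spec:
      nodeSelector:
        pool: big-daemonset        # only lands on the dedicated Provisioner's nodes
      tolerations:
        - key: pool
          value: big-daemonset
          effect: NoSchedule
      containers:
        - name: big-daemon
          image: public.ecr.aws/docker/library/busybox:1.36   # placeholder image
          command: ["sleep", "3600"]
          resources:
            requests:
              cpu: "10"            # the large request from this report
```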
I don't know if it's a similar issue, but on my cluster I have a daemonset that stays in a Pending state due to lack of CPU on one of the nodes, and Karpenter doesn't know how to deal with that. I tried to cordon the node and delete some of the pods running on it, which makes Karpenter spin up a new instance, but as soon as I uncordon the old node, Karpenter removes one of the nodes since it calculates that it is not needed anymore, causing the same daemonset to become un-schedulable again.
These are my provider requirements. Should I open another issue, or are these two problems related?
There are a number of reasons these calculations could be off. Can you take a look at the logs and find the line that describes our bin-packing decision? If you have custom AMIs or custom system-reserved resources that Karpenter isn't aware of, it can cause problems.
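On that second point, here is a minimal sketch of how reserved resources can be declared on a v1alpha5 Provisioner so Karpenter's bin-packing accounts for them; the quantities below are placeholders and should match whatever the AMI/bootstrap actually reserves:

```yaml
# Placeholder quantities; declare what your nodes actually reserve.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  kubeletConfiguration:
    systemReserved:
      cpu: 500m
      memory: 1Gi
      ephemeral-storage: 2Gi
    kubeReserved:
      cpu: 200m
      memory: 500Mi
```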
I see this in the logs:
When my daemonset was pending and un-schedulable, I didn't see any events being created by Karpenter. These events only appeared once I cordoned the node and deleted one of the pods running on it, allowing the daemonset to be scheduled on the cordoned node and creating a new node for the deleted pod.
Karpenter currently calculates the applicable daemonsets at the provisioner level with label selectors/taints, etc. It does not check whether there are requirements on the daemonsets that would exclude them from running on particular instances that the provisioner could or couldn't launch. The workaround for now is to use multiple provisioners with taints/tolerations or label selectors to limit daemonsets to only nodes launched from specific provisioners.
@tzneal I have a nodeSelector based on a custom label, which has been applied to the ebs-csi-node DaemonSet. The idea is that the CSI node pod should only run on machines that actually need dynamically provisioned EBS volumes. For some reason, these pods are not affected by this issue.
Yes, it works with taints/tolerations and labels on the provisioner. It doesn't work for labels that need to be discovered from instance types that the provisioner might potentially launch.
Addressing the issue - https://github.com/aws/karpenter/issues/3634 where daemonset calculation of resources is not working as expected.
Just to clarify: assuming we want to exclude a daemonset named ..., setting the label ... on the provisioner the ..., but setting the label ... on the provisioner the ... I am asking because we are experiencing this behaviour, but it seems to me to be pretty much the same (and if I understand what you wrote correctly, it should be supported).
If you set a required node affinity to not run on nodes with a label, and the provisioner is configured to apply that label to all nodes it launches, then we shouldn't consider that daemonset for that provisioner.
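A hypothetical example of that pattern (all names and labels made up): the Provisioner labels every node it launches, and the DaemonSet's required node affinity excludes nodes carrying that label, so it should not be counted as overhead for that Provisioner.

```yaml
# Hypothetical names/labels throughout.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: general
spec:
  labels:
    workload-class: general        # applied to every node this Provisioner launches
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: heavy-agent
spec:
  selector:
    matchLabels:
      app: heavy-agent
  template:
    metadata:
      labels:
        app: heavy-agent
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload-class
                    operator: NotIn          # never run on the "general" pool's nodes
                    values: ["general"]
      containers:
        - name: agent
          image: public.ecr.aws/docker/library/busybox:1.36   # placeholder image
          command: ["sleep", "3600"]
          resources:
            requests:
              cpu: "4"
```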
+1 here.
+1 here. r5.4xlarge should have enough capacity; is there a way to solve this issue?
Karpenter Version: v0.29.2
+1 here. Karpenter version 0.30. I was also just bitten by this bug. I have a daemonset that uses node affinity to schedule on specific nodes. It broke all of my Karpenter provisioners, and no pod could schedule because of the daemonset overhead, even though the daemonset in question is not scheduled on any of the nodes that Karpenter is creating, so it makes no sense that Karpenter would consider it in its calculations. Are there plans to fix this in the future? For now, to make my Karpenter provisioners work, I will have to rightsize the pods of a daemonset that is never going to be scheduled on my Karpenter-created nodes, meaning I have capacity that is just wasted and am forced to make pods smaller when I shouldn't have to.
This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
Hey
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This issue is currently awaiting triage. If Karpenter contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. The triage/accepted label can be applied by org members. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/reopen
@saurav-agarwalla: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/remove-lifecycle rotten |
/assign |
/remove-label needs-triage |
@rschalo: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@saurav-agarwalla I'm also running into this issue. If it's actively being worked on, I'm happy to wait, but if not, I'll explore the workaround. Let me know, thanks!
Version
Karpenter Version: v0.24.0
Kubernetes Version: v1.21.0
Context
Due to the significant resource usage of certain Daemonsets, particularly when operating on larger machines, we have chosen to divide these Daemonsets based on affinity rules that use Karpenter's labels such as karpenter.k8s.aws/instance-cpu or karpenter.k8s.aws/instance-size.
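For illustration, a trimmed-down sketch of one of these split Daemonsets is shown below; the name, CPU threshold, request size, and image are placeholders rather than our exact manifests.

```yaml
# Illustrative split DaemonSet pinned to larger instances via Karpenter's instance-cpu label.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent-large
spec:
  selector:
    matchLabels:
      app: node-agent-large
  template:
    metadata:
      labels:
        app: node-agent-large
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: karpenter.k8s.aws/instance-cpu
                    operator: Gt
                    values: ["15"]           # only run on instances with more than 15 vCPUs
      containers:
        - name: agent
          image: public.ecr.aws/docker/library/busybox:1.36   # placeholder image
          command: ["sleep", "3600"]
          resources:
            requests:
              cpu: "10"                      # large request, only sensible on big nodes
```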
Expected Behavior
When selecting a node for provisioning, Karpenter should only consider the Daemonsets that will actually run on that node.
Actual Behavior
It appears that Karpenter is wrongly including all of the split Daemonsets instead of only the appropriate one, which can result in poor instance selection when provisioning new nodes or inaccurate consolidation actions.
Steps to Reproduce the Problem
Create a simple Pod with a 1 CPU request. Karpenter should provision a 2 or at most 4 vCPU instance, but it will instead provision a large (>10 vCPU) machine because it wrongly includes the bigger Daemonset in the 2/4/8 vCPU evaluation.
The same behavior occurs when using karpenter.k8s.aws/instance-size or even podAntiAffinity rules in the Daemonset affinities.
Thank you for your help in addressing this issue.
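A minimal pod for the first reproduction step could look like this (name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: repro-1cpu
spec:
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:1.36   # placeholder image
      command: ["sleep", "3600"]
      resources:
        requests:
          cpu: "1"
```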