
Improve error messages if resource limit is reached #686

Open
runningman84 opened this issue Jul 24, 2023 · 13 comments
Labels
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. operational-excellence

Comments

@runningman84

Description

What problem are you trying to solve?
The error message does not show the real cause:

karpenter/karpenter-9d8575fb-hmrf7[controller]: 2023-07-24T09:45:37.841Z	ERROR	controller.provisioner	Could not schedule pod, incompatible with provisioner "arm", did not tolerate arch=arm64:NoSchedule; incompatible with provisioner "x86", no instance type satisfied resources {"cpu":"16","memory":"32Gi","pods":"1"} and requirements karpenter.k8s.aws/instance-category In [c m r], karpenter.k8s.aws/instance-generation Exists >3, karpenter.k8s.aws/instance-hypervisor In [nitro], kubernetes.io/os In [linux], karpenter.sh/capacity-type In [on-demand], kubernetes.io/arch In [amd64], karpenter.sh/provisioner-name In [x86]	{"commit": "dc3af1a", "pod": "canda/canda-avstock-7bbc89bb7b-8t9lk"}

In this case the real cause was the provisioner's CPU resource limit:

  # spec.limits excerpt from the affected Provisioner
  limits:
    resources:
      cpu: "200"

How important is this feature to you?
We have run into a lot of these issues lately, and a better error message along these lines would help:

Could not schedule pod, incompatible with provisioner "arm" ..., incompatible with provisioner "x86" due to cpu limit (196 out of 200)
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
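
A minimal sketch in Go of the kind of check that could produce the message suggested above; the type cpuLimitCheck and its fields are hypothetical (not Karpenter's actual code) and only illustrate the arithmetic: 196 CPUs already provisioned plus the pod's 16 would exceed the provisioner's limit of 200.

// Hypothetical sketch only (assumed types and field names, not Karpenter's
// actual scheduler code): report the provisioner's CPU limit as the single
// cause when it is what blocks scheduling.
package main

import "fmt"

type cpuLimitCheck struct {
	provisioner string
	used        int64 // CPU already provisioned under this provisioner
	limit       int64 // the provisioner's spec.limits.resources.cpu
	requested   int64 // CPU the pending pod needs
}

// reason returns a short, cause-first incompatibility message when the
// limit is what prevents scheduling, matching the format suggested above.
func (c cpuLimitCheck) reason() (string, bool) {
	if c.used+c.requested > c.limit {
		return fmt.Sprintf("incompatible with provisioner %q due to cpu limit (%d out of %d)",
			c.provisioner, c.used, c.limit), true
	}
	return "", false
}

func main() {
	msg, blocked := cpuLimitCheck{provisioner: "x86", used: 196, limit: 200, requested: 16}.reason()
	if blocked {
		fmt.Println("Could not schedule pod, " + msg)
	}
}
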
@runningman84 runningman84 added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 24, 2023
@sidewinder12s

100% agree on this. The more complicated the provisioner configuration and the more provisioners you have configured, the more indecipherable these error messages become for end users, and they are close to indecipherable for admins.

  Warning  FailedScheduling  24m                  karpenter          Failed to schedule pod, incompatible with provisioner "buildfarm-gpu", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, no instance type satisfied resources {"cpu":"61510m","ephemeral-storage":"1074Mi","memory":"99963Mi","pods":"12"} and requirements dedicated-node In [buildfarm], node-subtype In [gpu], node-type In [buildfarm], pricing-model In [on-demand], karpenter.k8s.aws/instance-category In [g], karpenter.k8s.aws/instance-encryption-in-transit-supported In [true], karpenter.k8s.aws/instance-generation In [4 5 6], karpenter.sh/capacity-type In [on-demand], karpenter.sh/provisioner-name In [buildfarm-gpu], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [m5dn.16xlarge r5dn.16xlarge], node.kubernetes.io/lifecycle In [on-demand], topology.kubernetes.io/zone In [us-east-1b] (no instance type met all requirements); incompatible with provisioner "default", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, incompatible requirements, label "dedicated-node" does not have known values; incompatible with provisioner "system", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, did not tolerate node-type=system:NoSchedule; incompatible with provisioner "buildfarm", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, no instance type satisfied resources {"cpu":"61510m","ephemeral-storage":"1074Mi","memory":"99963Mi","pods":"12"} and requirements dedicated-node In [buildfarm], node-subtype In [cpu], node-type In [buildfarm], pricing-model In [on-demand], karpenter.k8s.aws/instance-category In [c m r], karpenter.k8s.aws/instance-encryption-in-transit-supported In [true], karpenter.k8s.aws/instance-generation Exists >4, karpenter.sh/capacity-type In [on-demand], karpenter.sh/provisioner-name In [buildfarm], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [m5dn.16xlarge r5dn.16xlarge], node.kubernetes.io/lifecycle In [on-demand], topology.kubernetes.io/zone In [us-east-1b] (no instance type met the scheduling requirements or had enough resources)

@njtran njtran transferred this issue from aws/karpenter-provider-aws Nov 2, 2023
@jonathan-innis jonathan-innis added operational-excellence kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Nov 25, 2023
@jonathan-innis
Member

jonathan-innis commented Nov 25, 2023

The more complicated the provisioner configuration and the more provisioners you have configured, the more indecipherable these error messages become for end users, and they are close to indecipherable for admins

There's definitely some work that we should do here. It gets a tad complicated since it's hard for us to know exactly which NodePool you intended to schedule to in the first place, so we print out all the incompatibilities for completeness.

@sidewinder12s @runningman84 Did y'all have any thoughts around how we could make this error message shorter and more targeted to help you discover the exact issue?
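
One rough direction, just to make the question concrete (the types below are assumptions, not the real scheduler code): keep one entry per NodePool for completeness, but print detail that is identical across entries, such as the daemonset overhead, only once, and reduce each NodePool to its primary reason.

// Rough idea only, with assumed types (not the real scheduler code): keep one
// entry per NodePool for completeness, print shared detail once, and reduce
// each NodePool to its primary reason.
package main

import (
	"fmt"
	"sort"
	"strings"
)

type nodePoolFailure struct {
	nodePool string
	reason   string // primary reason only, e.g. "cpu limit reached (196 out of 200)"
}

func summarize(sharedOverhead string, failures []nodePoolFailure) string {
	// Sort for a stable, readable ordering in events and logs.
	sort.Slice(failures, func(i, j int) bool { return failures[i].nodePool < failures[j].nodePool })
	parts := make([]string, 0, len(failures))
	for _, f := range failures {
		parts = append(parts, fmt.Sprintf("%s: %s", f.nodePool, f.reason))
	}
	return fmt.Sprintf("could not schedule pod (daemonset overhead %s); %s",
		sharedOverhead, strings.Join(parts, "; "))
}

func main() {
	fmt.Println(summarize(`{"cpu":"1510m","memory":"1659Mi","pods":"11"}`, []nodePoolFailure{
		{nodePool: "buildfarm-gpu", reason: "no instance type satisfied requirements"},
		{nodePool: "system", reason: "did not tolerate node-type=system:NoSchedule"},
		{nodePool: "default", reason: `label "dedicated-node" does not have known values`},
	}))
}
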

@runningman84
Author

I still think that my original suggestion seems good:
Could not schedule pod, incompatible with provisioner "arm" ..., incompatible with provisioner "x86" due to cpu limit (196 out of 200)
If a given provisioner does not fit because of its limits, just state that and ignore all other conditions for that provisioner….
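
A rough sketch of that short-circuit in Go (hypothetical names, not actual Karpenter code): check the provisioner's limits first and, if they are already reached, return only that reason and skip the detailed requirements evaluation for that provisioner.

// Sketch of the proposed short-circuit (hypothetical names, assumed behavior):
// check the provisioner's limits first and, if exceeded, skip the detailed
// instance-type and requirements evaluation for that provisioner entirely.
package main

import (
	"errors"
	"fmt"
)

type provisioner struct {
	name     string
	cpuUsed  int64
	cpuLimit int64
}

// limitExceeded reports whether the provisioner's CPU limit is already reached.
func (p provisioner) limitExceeded() bool { return p.cpuUsed >= p.cpuLimit }

// incompatibility builds the per-provisioner part of the scheduling error.
func incompatibility(p provisioner, evaluateRequirements func() error) error {
	if p.limitExceeded() {
		// Short-circuit: the limit alone explains the failure, so the long
		// requirements/instance-type detail is skipped for this provisioner.
		return fmt.Errorf("incompatible with provisioner %q due to cpu limit (%d out of %d)",
			p.name, p.cpuUsed, p.cpuLimit)
	}
	// Otherwise fall back to the existing detailed evaluation.
	return evaluateRequirements()
}

func main() {
	err := incompatibility(provisioner{name: "x86", cpuUsed: 200, cpuLimit: 200}, func() error {
		return errors.New("no instance type satisfied resources and requirements ...")
	})
	fmt.Println(err)
}
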

@sidewinder12s

Yeah, either what @runningman84 said, to keep the message consistent with how those messages are written out, or something even more explicit like: compatible with provisioner X, but its limit is reached. Though that might cause some confusion if you have overlapping provisioners.

@sadath-12
Contributor

/assign

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2024
@sidewinder12s

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 3, 2024
@sidewinder12s

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 4, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 2, 2024
@sidewinder12s

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 2, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2024
@sidewinder12s

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 13, 2025