
Improve error messages if resource limit is reached #686

Open
runningman84 opened this issue Jul 24, 2023 · 13 comments
Labels
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. operational-excellence

Comments

@runningman84

Description

What problem are you trying to solve?
The error message does not show the real cause:

karpenter/karpenter-9d8575fb-hmrf7[controller]: 2023-07-24T09:45:37.841Z	ERROR	controller.provisioner	Could not schedule pod, incompatible with provisioner "arm", did not tolerate arch=arm64:NoSchedule; incompatible with provisioner "x86", no instance type satisfied resources {"cpu":"16","memory":"32Gi","pods":"1"} and requirements karpenter.k8s.aws/instance-category In [c m r], karpenter.k8s.aws/instance-generation Exists >3, karpenter.k8s.aws/instance-hypervisor In [nitro], kubernetes.io/os In [linux], karpenter.sh/capacity-type In [on-demand], kubernetes.io/arch In [amd64], karpenter.sh/provisioner-name In [x86]	{"commit": "dc3af1a", "pod": "canda/canda-avstock-7bbc89bb7b-8t9lk"}

In this case the real cause was the provisioner's CPU resource limit:

  # spec.limits excerpt from the affected Provisioner
  limits:
    resources:
      cpu: "200"

How important is this feature to you?
We have run into a lot of these issues lately, and a better error message along these lines would help:

Could not schedule pod, incompatible with provisioner "arm" ..., incompatible with provisioner "x86" due to cpu limit (196 out of 200)
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
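
A minimal sketch in Go of the kind of check that could produce the message suggested above; the type cpuLimitCheck and its fields are hypothetical (not Karpenter's actual code) and only illustrate the arithmetic: 196 CPUs already provisioned plus the pod's 16 would exceed the provisioner's limit of 200.

// Hypothetical sketch only (assumed types and field names, not Karpenter's
// actual scheduler code): report the provisioner's CPU limit as the single
// cause when it is what blocks scheduling.
package main

import "fmt"

type cpuLimitCheck struct {
	provisioner string
	used        int64 // CPU already provisioned under this provisioner
	limit       int64 // the provisioner's spec.limits.resources.cpu
	requested   int64 // CPU the pending pod needs
}

// reason returns a short, cause-first incompatibility message when the
// limit is what prevents scheduling, matching the format suggested above.
func (c cpuLimitCheck) reason() (string, bool) {
	if c.used+c.requested > c.limit {
		return fmt.Sprintf("incompatible with provisioner %q due to cpu limit (%d out of %d)",
			c.provisioner, c.used, c.limit), true
	}
	return "", false
}

func main() {
	msg, blocked := cpuLimitCheck{provisioner: "x86", used: 196, limit: 200, requested: 16}.reason()
	if blocked {
		fmt.Println("Could not schedule pod, " + msg)
	}
}
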
@runningman84 runningman84 added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 24, 2023
@sidewinder12s

100% agree on this. The more complicated the provisioner configuration and the more provisioners you have configured, the more indecipherable these error messages become for end users, and they are close to indecipherable for admins.

  Warning  FailedScheduling  24m                  karpenter          Failed to schedule pod, incompatible with provisioner "buildfarm-gpu", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, no instance type satisfied resources {"cpu":"61510m","ephemeral-storage":"1074Mi","memory":"99963Mi","pods":"12"} and requirements dedicated-node In [buildfarm], node-subtype In [gpu], node-type In [buildfarm], pricing-model In [on-demand], karpenter.k8s.aws/instance-category In [g], karpenter.k8s.aws/instance-encryption-in-transit-supported In [true], karpenter.k8s.aws/instance-generation In [4 5 6], karpenter.sh/capacity-type In [on-demand], karpenter.sh/provisioner-name In [buildfarm-gpu], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [m5dn.16xlarge r5dn.16xlarge], node.kubernetes.io/lifecycle In [on-demand], topology.kubernetes.io/zone In [us-east-1b] (no instance type met all requirements); incompatible with provisioner "default", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, incompatible requirements, label "dedicated-node" does not have known values; incompatible with provisioner "system", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, did not tolerate node-type=system:NoSchedule; incompatible with provisioner "buildfarm", daemonset overhead={"cpu":"1510m","ephemeral-storage":"1074Mi","memory":"1659Mi","pods":"11"}, no instance type satisfied resources {"cpu":"61510m","ephemeral-storage":"1074Mi","memory":"99963Mi","pods":"12"} and requirements dedicated-node In [buildfarm], node-subtype In [cpu], node-type In [buildfarm], pricing-model In [on-demand], karpenter.k8s.aws/instance-category In [c m r], karpenter.k8s.aws/instance-encryption-in-transit-supported In [true], karpenter.k8s.aws/instance-generation Exists >4, karpenter.sh/capacity-type In [on-demand], karpenter.sh/provisioner-name In [buildfarm], kubernetes.io/arch In [amd64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [m5dn.16xlarge r5dn.16xlarge], node.kubernetes.io/lifecycle In [on-demand], topology.kubernetes.io/zone In [us-east-1b] (no instance type met the scheduling requirements or had enough resources)

@njtran njtran transferred this issue from aws/karpenter-provider-aws Nov 2, 2023
@jonathan-innis jonathan-innis added operational-excellence kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. and removed kind/feature Categorizes issue or PR as related to a new feature. labels Nov 25, 2023
@jonathan-innis
Member

jonathan-innis commented Nov 25, 2023

The more complicated the provisioner configuration and the more provisioners you have configured, the more indecipherable these error messages become for end users, and they are close to indecipherable for admins

There's definitely some work that we should do here. It gets a tad complicated since it's hard for us to know exactly which NodePool you intended to schedule to in the first place, so we print out all the incompatibilities for completeness.

@sidewinder12s @runningman84 Did y'all have any thoughts around how we could make this error message shorter and more targeted to help you discover the exact issue?
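
One rough direction, just to make the question concrete (the types below are assumptions, not the real scheduler code): keep one entry per NodePool for completeness, but print detail that is identical across entries, such as the daemonset overhead, only once, and reduce each NodePool to its primary reason.

// Rough idea only, with assumed types (not the real scheduler code): keep one
// entry per NodePool for completeness, print shared detail once, and reduce
// each NodePool to its primary reason.
package main

import (
	"fmt"
	"sort"
	"strings"
)

type nodePoolFailure struct {
	nodePool string
	reason   string // primary reason only, e.g. "cpu limit reached (196 out of 200)"
}

func summarize(sharedOverhead string, failures []nodePoolFailure) string {
	// Sort for a stable, readable ordering in events and logs.
	sort.Slice(failures, func(i, j int) bool { return failures[i].nodePool < failures[j].nodePool })
	parts := make([]string, 0, len(failures))
	for _, f := range failures {
		parts = append(parts, fmt.Sprintf("%s: %s", f.nodePool, f.reason))
	}
	return fmt.Sprintf("could not schedule pod (daemonset overhead %s); %s",
		sharedOverhead, strings.Join(parts, "; "))
}

func main() {
	fmt.Println(summarize(`{"cpu":"1510m","memory":"1659Mi","pods":"11"}`, []nodePoolFailure{
		{nodePool: "buildfarm-gpu", reason: "no instance type satisfied requirements"},
		{nodePool: "system", reason: "did not tolerate node-type=system:NoSchedule"},
		{nodePool: "default", reason: `label "dedicated-node" does not have known values`},
	}))
}
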

@runningman84
Author

I still think that my original suggestion seems good:
Could not schedule pod, incompatible with provisioner "arm" ..., incompatible with provisioner "x86" due to cpu limit (196 out of 200)
If a given provisioner does not fit because of its limits, just state that and ignore all other conditions for that provisioner….
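
A rough sketch of that short-circuit in Go (hypothetical names, not actual Karpenter code): check the provisioner's limits first and, if they are already reached, return only that reason and skip the detailed requirements evaluation for that provisioner.

// Sketch of the proposed short-circuit (hypothetical names, assumed behavior):
// check the provisioner's limits first and, if exceeded, skip the detailed
// instance-type and requirements evaluation for that provisioner entirely.
package main

import (
	"errors"
	"fmt"
)

type provisioner struct {
	name     string
	cpuUsed  int64
	cpuLimit int64
}

// limitExceeded reports whether the provisioner's CPU limit is already reached.
func (p provisioner) limitExceeded() bool { return p.cpuUsed >= p.cpuLimit }

// incompatibility builds the per-provisioner part of the scheduling error.
func incompatibility(p provisioner, evaluateRequirements func() error) error {
	if p.limitExceeded() {
		// Short-circuit: the limit alone explains the failure, so the long
		// requirements/instance-type detail is skipped for this provisioner.
		return fmt.Errorf("incompatible with provisioner %q due to cpu limit (%d out of %d)",
			p.name, p.cpuUsed, p.cpuLimit)
	}
	// Otherwise fall back to the existing detailed evaluation.
	return evaluateRequirements()
}

func main() {
	err := incompatibility(provisioner{name: "x86", cpuUsed: 200, cpuLimit: 200}, func() error {
		return errors.New("no instance type satisfied resources and requirements ...")
	})
	fmt.Println(err)
}
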

@sidewinder12s

Yeah, either what @runningman84 said, to keep the message consistent with how those messages are written out, or something even more explicit like: compatible with provisioner X, but its limit is reached. Though that might cause some confusion if you have overlapping provisioners.

@sadath-12
Contributor

/assign

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2024
@sidewinder12s

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 4, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 3, 2024
@sidewinder12s

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 4, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 2, 2024
@sidewinder12s

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 2, 2024
@k8s-triage-robot

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2024
@sidewinder12s

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 13, 2025