
Unschedulable Pod Metric #705

Closed · Tracked by #1051
sidewinder12s opened this issue Apr 14, 2023 · 18 comments · Fixed by #1698
Labels: good-first-issue (Good for newcomers), kind/feature (Categorizes issue or PR as related to a new feature.), metrics-audit

Comments

@sidewinder12s

Tell us about your request

We used the Cluster Autoscaler metric cluster_autoscaler_unschedulable_pods_count.

Would it be possible to have Karpenter expose a similar metric?

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

How to alert when Karpenter is unable to scale out

Are you currently working around this issue?

We are looking at the Kubernetes Scheduler metric scheduler_pending_pods; however, if you have multiple schedulers this gets more complicated.

EKS also does not expose metrics from the Kubernetes Scheduler.

Additional Context

We might have been abusing this metric a bit, since the OSS Cluster Autoscaler also observes the entire cluster state.

We have two autoscalers running (CAS and an internal autoscaler), so we were somewhat overloading this metric, since CAS still took a whole-cluster view even though it wasn't scaling most of our nodes.

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
sidewinder12s added the kind/feature label on Apr 14, 2023
@sidewinder12s (Author)

Actually, this metric might work for our use case: kube_pod_status_unschedulable

I'll leave this open for a bit to see if anyone else has thoughts.

@jonathan-innis (Member) commented Apr 14, 2023

@sidewinder12s We're planning on looking over all of our metrics right now to get a more comprehensive idea of where we have gaps, so this is good feedback for that process.

njtran added the good-first-issue label on May 8, 2023
@njtran (Contributor) commented May 8, 2023

I can see adding this as a gauge, but I do believe you should be able to get the equivalent with something like sum(kube_pod_status_unschedulable), as you said.

njtran transferred this issue from aws/karpenter-provider-aws on Nov 2, 2023
@sidewinder12s (Author)

This ask is related to #676

I am trying to write alerting for when Karpenter is failing to function, whether because a provisioner is erroring out, has hit its limits, has permissions problems, etc.

I think kube_pod_status_unschedulable is too broad: it doesn't provide much context for why a pod is unschedulable, and it exposes cluster operators to user error rather than to problems with something they actually control.

@Horiodino

Hi, how about I work on this?

@Bryce-Soghigian (Member)

@njtran Many users with large clusters will probably want to avoid scraping the high-cardinality kube_pod* series.

All of the kube_pod* metrics are rather high cardinality and aren't as scalable as the CAS metric, IMO.

@jonathan-innis (Member)

@Horiodino Assigned!

@jonathan-innis (Member)

This one seems like it might be difficult for the same reason that #686 is difficult. The second you get multiple NodePools into the mix, we have to figure out which NodePool we think you intended to schedule the pod to; otherwise, we might fire multiple metrics, one for each NodePool. With Karpenter, there is basically a one-to-one relationship between a pod and the reason it wouldn't have scheduled to a given NodePool. Maybe this means what we are really looking for here is a pod metric with a nodepool label and a pod label that tells us the reason the pod didn't schedule against each NodePool.

I still haven't looked into how complex that might be to orchestrate or how much cardinality that would add to pod metrics that already have pretty high cardinality to begin with.
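
For illustration, a minimal sketch of what such a per-NodePool, per-reason metric could look like, assuming prometheus/client_golang; the metric name, label names, and helper function here are hypothetical, not Karpenter's actual API:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical gauge vector: one time series per (nodepool, reason) pair.
// Cardinality grows with the number of NodePools times the number of
// distinct failure reasons, which is the concern raised above.
var unschedulablePodsByNodePool = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "karpenter",
		Name:      "unschedulable_pods",
		Help:      "Pods that could not be scheduled, by NodePool and reason (illustrative only).",
	},
	[]string{"nodepool", "reason"},
)

func init() {
	prometheus.MustRegister(unschedulablePodsByNodePool)
}

// RecordUnschedulable would be called from the scheduling simulation once the
// reason a set of pods cannot fit a given NodePool is known.
func RecordUnschedulable(nodepool, reason string, count float64) {
	unschedulablePodsByNodePool.WithLabelValues(nodepool, reason).Set(count)
}
```

Even without a pod label, the series count here is NodePools x reasons, so the cardinality trade-off in the comment above applies directly.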

@sidewinder12s (Author)

Would looking at how Cluster Autoscaler handles this help at all?

https://github.com/kubernetes/autoscaler/blob/13c58757a70b0f121897fa605aa0cb56667da4d1/cluster-autoscaler/metrics/metrics.go#L145

It sounds like it's a single count metric that just holds the number of pods that have not been processed by the scheduler plus the number that the scheduler was not able to schedule in that cycle.

I think that would largely accomplish what I'd hoped to get out of this ticket, which was an indication of unschedulable pods from Karpenter's perspective only. When actually using the metric, you'd likely alert on a rate or on a duration of elevated count to indicate that Karpenter is unable to process some pod.

100% agree that getting any more specific, or going per NodePool, would be a lot more difficult and could have explosive cardinality.

@jonathan-innis (Member)

> When actually using the metric, you'd likely alert on a rate or on a duration of elevated count to indicate that Karpenter is unable to process some pod.

+1, on board with this. Textbook overthinking on my part. A basic gauge that gets changed on each provisioning loop sounds completely reasonable.
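
A minimal sketch of that shape, assuming prometheus/client_golang; the metric name and the provisioning-loop hook are hypothetical, not the names used in the eventual implementation:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical single gauge, analogous to
// cluster_autoscaler_unschedulable_pods_count: no per-pod or per-NodePool
// labels, so cardinality stays constant regardless of cluster size.
var unschedulablePodsCount = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "karpenter",
	Name:      "unschedulable_pods_count",
	Help:      "Number of pods the last provisioning loop could not schedule (illustrative only).",
})

func init() {
	prometheus.MustRegister(unschedulablePodsCount)
}

// RecordProvisioningResult would run at the end of each provisioning loop,
// overwriting the gauge with the number of pods that still have nowhere to go.
func RecordProvisioningResult(unschedulable int) {
	unschedulablePodsCount.Set(float64(unschedulable))
}
```

An alert would then watch for the gauge staying above zero for some duration, along the lines of the rate/duration-of-elevated-count approach described above.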

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Apr 17, 2024
@sidewinder12s (Author)

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Apr 18, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jul 17, 2024
@sidewinder12s (Author)

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Jul 17, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Oct 15, 2024
@sidewinder12s (Author)

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Oct 15, 2024
@omerap12 (Member)

Hey @jonathan-innis, could you take a look at #1778 when you get a chance? If everything looks good, I'll go ahead and finalize the PR. (I'm still deciding whether to create a unit test file.)
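
For what it's worth, a gauge of this shape is straightforward to unit test with client_golang's testutil package; a minimal sketch, reusing the hypothetical names from the earlier gauge sketch:

```go
package metrics

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestUnschedulablePodsCount(t *testing.T) {
	// Simulate a provisioning loop that left three pods unschedulable.
	RecordProvisioningResult(3)

	// testutil.ToFloat64 reads the current value of a single-series collector.
	if got := testutil.ToFloat64(unschedulablePodsCount); got != 3 {
		t.Fatalf("expected gauge value 3, got %v", got)
	}
}
```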
