Unschedulable Pod Metric #705
Comments
Actually, this metric might work for our uses. I'll leave this open for a bit to see if anyone else has thoughts.
@sidewinder12s We're planning on looking over all of our metrics right now to get a more comprehensive idea of where we have gaps, so this is good feedback for that process.
I can see adding this as a gauge, but I do believe you should be able to get the equivalent with some of the existing kube_pod* metrics.
This ask is related to #676. I am trying to write alerting for when Karpenter is failing to function, whether because a provisioner is erroring out, it hit its limits, permissions issues, etc.
Hi, how about I work on it!
@njtran Many users with large clusters will probably want to avoid scraping the high-cardinality kube_pod* series; all of the kube_pod* metrics are rather high cardinality and aren't as scalable as the CAS metric, IMO.
@Horiodino Assigned!
Seems like this one might be difficult for the same reason that #686 is difficult. The second you get multiple NodePools into the mix, we have to figure out which NodePool we think you intended the pod to schedule against; otherwise, we might fire multiple metrics, one for each NodePool. There's basically a one-to-one relationship between each NodePool and the reason the pod wouldn't have scheduled to it with Karpenter. Maybe what we're really looking for here is a pod metric with a nodepool label and a pod label that tells us why the pod didn't schedule against each NodePool (see the sketch below). I still haven't looked into how complex that would be to orchestrate or how much cardinality it would add to pod metrics that already have pretty high cardinality to begin with.
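For illustration only, here is a rough sketch of what such a per-NodePool, per-pod metric could look like. The metric and label names are hypothetical, not Karpenter's actual code, and it assumes the usual client_golang plus controller-runtime metrics setup. The point is that the series count grows as pods × NodePools × reasons, which is where the cardinality concern comes from.

```go
// Hypothetical sketch of a per-NodePool scheduling-failure metric.
// Every (pod, namespace, nodepool, reason) combination becomes its own series,
// so cardinality multiplies quickly on large clusters.
package scheduling

import (
	"github.com/prometheus/client_golang/prometheus"
	crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var podSchedulingFailures = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Namespace: "karpenter",
		Subsystem: "pods",
		Name:      "scheduling_failures", // hypothetical name
		Help:      "Set to 1 when a pod could not be scheduled against a given NodePool, labeled with the reason.",
	},
	[]string{"pod", "namespace", "nodepool", "reason"},
)

func init() {
	// Register with controller-runtime's Prometheus registry (assumed here).
	crmetrics.Registry.MustRegister(podSchedulingFailures)
}
```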
Would looking at how Cluster Autoscaler handles this help at all? It sounds like it's a single count metric: the number of pods that have not been processed by the scheduler plus the number in that cycle that the scheduler was not able to schedule. I think that would largely accomplish what I'd hoped for out of this ticket, which was to get an indication of unschedulable pods from Karpenter's perspective only. When actually using the metric, you'd likely alert on a rate or duration of elevated count to indicate that Karpenter is unable to process some pod. 100% agree that getting any more specific, or going per NodePool, would be a lot more difficult and could have explosive cardinality.
This sounds reasonable @sidewinder12s. It could simply be added here: https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/provisioning/scheduling/scheduler.go#L193-L196
+1, on board with this. Textbook overthinking on my part. A basic gauge that gets updated on each provisioning loop sounds completely reasonable (a sketch follows below).
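For concreteness, a minimal sketch of what that gauge might look like, assuming the client_golang and controller-runtime metrics setup Karpenter already uses; the metric name and the helper are illustrative, not the actual change that landed.

```go
// Minimal sketch, not the actual implementation: a single low-cardinality gauge
// that is set once at the end of each provisioning/scheduling loop.
package scheduling

import (
	"github.com/prometheus/client_golang/prometheus"
	crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var unschedulablePodsCount = prometheus.NewGauge(prometheus.GaugeOpts{
	Namespace: "karpenter",
	Subsystem: "provisioner",
	Name:      "unschedulable_pods_count", // hypothetical name, mirroring the CAS metric
	Help:      "Number of pods the last provisioning loop could not schedule against any NodePool.",
})

func init() {
	crmetrics.Registry.MustRegister(unschedulablePodsCount)
}

// recordUnschedulablePods is a hypothetical helper called once per provisioning
// loop with the number of pods that failed to schedule; exporting only the count
// keeps this a single time series regardless of cluster size.
func recordUnschedulablePods(count int) {
	unschedulablePodsCount.Set(float64(count))
}
```

Alerting would then just be a matter of firing when the gauge stays above zero for some period, much like cluster_autoscaler_unschedulable_pods_count is used today.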
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
Hey @jonathan-innis, could you take a look when you get a chance (#1778)? If everything looks good, I'll go ahead and finalize the PR. (Still deciding whether to create a unit test file.)
Tell us about your request
We used the cluster_autoscaler_unschedulable_pods_count metric with the cluster_autoscaler. Would it be possible to have Karpenter expose a similar metric?
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
How to alert when Karpenter is unable to scale out
Are you currently working around this issue?
We are looking at the Kubernetes Scheduler metric scheduler_pending_pods; however, if you have multiple schedulers this gets more complicated, and EKS also does not expose metrics from the Kubernetes Scheduler.
Additional Context
We might have been abusing this metric a bit, since the OSS Cluster Autoscaler also observes the entire cluster state. We have two autoscalers running (CAS and an internal autoscaler), so we were somewhat overloading this metric: CAS still took a whole-cluster view even though it wasn't scaling most of our nodes.
Attachments
No response