[question] Inconsistent computation of production and non-production resources in the load_aware plugin of the koord-scheduler #2317
Open · ditingdapeng opened this issue on Jan 9, 2025 · 4 comments
What happened:
When using Koordinator's load-aware scheduler plugin, I found that when hotspot issues occur, many nodes exceeding the utilization threshold still reach the scoring phase before non-production pods are bound to nodes. The expected behavior is that highly utilized nodes are filtered out during the filter phase.
With this concern in mind, I reviewed the filter code and found an inconsistency in how production and non-production resources are calculated within the filter. As a result, node resources are underestimated when scheduling non-production pods. Specific references can be found below.
I am unsure whether this discrepancy is an intentional part of the design, and I look forward to your response!
What you expected to happen:
The code works as follows: when calculating the total node resources, it subtracts the estimated usage of abnormal (in-flight) pods, whereas when calculating production resources, it adds that same estimated usage.
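If I read the code correctly, the two paths can be sketched like this. This is a minimal Python sketch with illustrative names and numbers; the actual plugin is written in Go, and `allocatable`, `node_usage`, and `estimated_in_flight` are my placeholders, not the real identifiers:

```python
# Minimal sketch of the asymmetry described above. All names and numbers
# are illustrative placeholders, not the actual koord-scheduler code.

def free_on_total_path(allocatable, node_usage, estimated_in_flight):
    # Total-node path: the estimated usage of in-flight pods is
    # subtracted from the node's remaining resources.
    return allocatable - node_usage - estimated_in_flight

def used_on_prod_path(prod_pod_usage, estimated_in_flight):
    # Production path: the same estimated usage is added on top of
    # the measured production pod usage.
    return prod_pod_usage + estimated_in_flight

# Example: 100 units allocatable, 60 reported node usage (40 of which is
# production pods), plus 10 units of estimated in-flight usage.
print(free_on_total_path(100, 60, 10))  # 30 left on the total path
print(used_on_prod_path(40, 10))        # 50 counted on the prod path
```

On this reading, the in-flight estimate shrinks the free capacity on one path and inflates the used amount on the other, which is what made me suspect non-production resources are underestimated.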
I also noticed the same issue while using the load_aware plugin in Koordinator. Specifically, the inconsistency in resource computation between production and non-production workloads has been a point of confusion for me as well.
As you mentioned, I would have expected nodes exceeding the threshold to be filtered out during the filter phase, but it seems they are still considered during the scoring phase for non-production pods. This behavior can sometimes lead to unexpected scheduling results.
I’m also curious if this is an intentional design decision or if there might be room for improvement in the computation logic. Looking forward to insights from the maintainers or contributors on this matter!
@ditingdapeng Please note that nodeUsage >= sum(podUsage) because of basic node-level overhead. The assignedPodEstimatedUsed term is mainly for in-flight pods, including both abnormal pods that are not reported in the NodeMetric status and normal pods that were just assigned and do not yet have a valid pod metric. So there is no certain underestimation for non-Prod pods when comparing the given formula terms. In any case, it is still an interesting topic. How about joining the bi-weekly community meeting to discuss it together?
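To illustrate the point with made-up numbers (a Python sketch; the values and variable names are hypothetical, and only nodeUsage, sum(podUsage), and assignedPodEstimatedUsed correspond to terms discussed above):

```python
# Illustrative numbers only; real values come from the NodeMetric status.
pod_usage_sum = 60       # sum of per-pod usage reported in NodeMetric
overhead = 8             # node-level overhead (kubelet, runtime, system)
node_usage = pod_usage_sum + overhead   # hence nodeUsage >= sum(podUsage)

# assignedPodEstimatedUsed covers in-flight pods: abnormal pods missing
# from NodeMetric plus just-assigned pods without a valid pod metric yet.
assigned_pod_estimated_used = 10

allocatable = 100
free_total = allocatable - node_usage - assigned_pod_estimated_used

# The estimate stands in for real usage that NodeMetric has not observed
# yet, so subtracting it acts as a conservative placeholder rather than
# a certain underestimation of what non-Prod pods can use.
print(node_usage)   # 68
print(free_total)   # 22
```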
@ditingdapeng As we discussed at the meeting, you can check the real values inside NodeMetric to verify whether the scheduling result is as expected. We can go into more detail here.
code link
code link
Environment: