-
Notifications
You must be signed in to change notification settings - Fork 40.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prioritized Alternatives in Device Requests #128586
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
super cursory quick look...
31aaf8f
to
19ab559
Compare
/assign |
19ab559
to
e49122d
Compare
e49122d
to
fa17f45
Compare
also hold for enough folks to chime in. /hold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm from the scheduler perspective
slices: slices, | ||
celCache: celCache, | ||
adminAccessEnabled: adminAccessEnabled, | ||
prioritizedListEnabled: prioritizedListEnabled, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered creating features structure similar to the resourceclaim.Features?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I am actually making this change in the PR for Partitionable Devices. I suggest we address this in that PR so we can get this one merged.
/cc @macsko @sanposhiho for sig-scheduling approval |
@dom4ha: GitHub didn't allow me to request PR reviews from the following users: for, approval. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where's the integration test for this feature? This file got the change to enable it, but don't we need any new test case specifically?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how useful such a test would be. As agreed upon in kubernetes/enhancements#5077, not everything needs all kinds of tests.
There is an E2E test, which covers the full flow (API, scheduler, etc.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be Yes for some features, but this feature could be achieved with different components. Why don't we need an integration test for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which additional code paths or scenarios would an integration test cover that isn't already covered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we make sure the full scenario works from one component to others (or even more specifically, from one struct to others within the same component)? How do we make sure one component's behavior/output matches the expectations from others?
The integration test is not for the test coverage. It's not that once unit tests cover all code path, you can skip the integration tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The firstAvailable
is mostly internal to the plugin. It doesn't change how the plugin interacts with the scheduler (but see below). We still need to test with an apiserver to ensure that the fields flow properly from user to apiserver to scheduler to plugin, which is covered by the E2E test.
The only reason that I can think of for an integration test is the "feature disabled, field is set, PreFilter returns error". That is better tested through an integration test. But IMHO it's not important.
That shouldn't be in scheduler_perf. We can probably add it to test/integration/scheduler/plugins/plugins_test.go
.
The happy path could be in scheduler_perf.
Morten is taking some time off right now. To unblock this PR, let me see whether I can do something and address #128586 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason that I can think of for an integration test is the "feature disabled, field is set, PreFilter returns error". That is better tested through an integration test. But IMHO it's not important.
Unfortunately that is exactly the scenario that currently cannot be handled in an integration test: apiserver and scheduler have to be brought up in the same process and share the global default feature gate. Without modifying both components it's impossible to bring up the apiserver with the feature enabled and the scheduler without it. It's not supported to change the feature gate during a test run (explicitly prevented by featuregatetesting.SetFeatureGateDuringTest
). Perhaps some custom code could do it.
I think this falls under "version skew testing" which is not required for alpha. @sanposhiho: okay to do it later?
Instead, I added some test cases to scheduler_perf and to test/integration/dra, see #130622.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay to do it later?
🙆♂️
Instead, I added some test cases to scheduler_perf and to test/integration/dra, see #130622.
Thanks.
return nil, status | ||
} | ||
} else { | ||
for _, subRequest := range request.FirstAvailable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to feature-gate this path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. As in the allocator it should refuse to schedule pods which depend on firstAvailable
.
This then raises the question: does the allocator then still need that feature gate check? It should never get called for claims involving firstAvailable
when the feature is off.
In other words, move the check up one level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it's better to do it in both places. The allocator might also be used outside of the scheduler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the validation path only, so the only consequence of not having the feature-gate here is that PreFilter won't fail with the complain that request.DeviceClassName
is missing (when firstAvailable
was present) in PreFilter. The further attempt to use firstAvailable
should still fail in allocator with a different error (most likely in Filter).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. But it's still better to fail earlier. I've implemented that, just need to finish the integration test updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in #130622.
return nil, fmt.Errorf("claim %s, request %s: admin access is requested, but the feature is disabled", klog.KObj(claim), request.Name) | ||
// Error out if the prioritizedList feature is not enabled and the request | ||
// has subrequests. This is to avoid surprising behavior for users. | ||
if !a.prioritizedListEnabled && hasSubRequests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if users followed enable -> disable path? i.e., a user created it with FirstAvailable
with the feature gate enabled, and then disable the feature gate for some reason. It looks like they must update all claims not to have FirstAvailable
when they need to disable the gate.
Is there any existing behavior that FirstAvailable
requests can fall back to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any existing behavior that FirstAvailable requests can fall back to?
No. All that Kubernetes can do is to not behave badly. Ignoring the sub-request and scheduling the pod without the devices that it needs is worse than refusing to schedule with an explanation why.
As you said, the user then needs to take action and either define their workload differently or accept that the cluster cannot run it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then, is there any way to make it visible to users? (instead of logs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refusal to deal with the request would cause PreFilter to fail and thus would get reported to the user as a scheduling failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But only assuming the comment #128586 (comment) is addressed. Without it the validation will pass in PreFilter and a pod will most likely remain unschedulable (fail in Filter), which might be acceptable as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#130622 adds that, so with that PR it'll fail already in PreFilter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know whether @mortent is available to take my commits from #130622 into his branch for this PR. I therefore prefer merging that as a follow-up.
@sanposhiho: are you okay with that? I think you can do the formal LGTM.
I believe everything else has been reviewed (@dom4ha in #128586 (review) for scheduler, Tim in #128586 (review) for the API). It also looks good to me.
return nil, status | ||
} | ||
} else { | ||
for _, subRequest := range request.FirstAvailable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in #130622.
return nil, fmt.Errorf("claim %s, request %s: admin access is requested, but the feature is disabled", klog.KObj(claim), request.Name) | ||
// Error out if the prioritizedList feature is not enabled and the request | ||
// has subrequests. This is to avoid surprising behavior for users. | ||
if !a.prioritizedListEnabled && hasSubRequests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#130622 adds that, so with that PR it'll fail already in PreFilter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason that I can think of for an integration test is the "feature disabled, field is set, PreFilter returns error". That is better tested through an integration test. But IMHO it's not important.
Unfortunately that is exactly the scenario that currently cannot be handled in an integration test: apiserver and scheduler have to be brought up in the same process and share the global default feature gate. Without modifying both components it's impossible to bring up the apiserver with the feature enabled and the scheduler without it. It's not supported to change the feature gate during a test run (explicitly prevented by featuregatetesting.SetFeatureGateDuringTest
). Perhaps some custom code could do it.
I think this falls under "version skew testing" which is not required for alpha. @sanposhiho: okay to do it later?
Instead, I added some test cases to scheduler_perf and to test/integration/dra, see #130622.
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dims, johnbelamaric, mortent, thockin The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
sig-scheduling.
note: Some of my comments aren't addressed within this PR, but #130622 will address those. Also, those points are, either way, not super critical ones that we must include in the same PR.
LGTM label has been added. Git tree hash: 8942038fe5779e85d2db34003055e21edc7d9c1e
|
/hold cancel |
Thanks @sanposhiho for your review! I've rebased #130622, let's continue there. |
What type of PR is this?
/kind feature
/kind api-change
/kind deprecation
What this PR does / why we need it:
This implements https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/4816-dra-prioritized-list
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: