Proposal to enhance FederatedResourceQuota to enforce resource limits directly on Karmada level #6181
base: master
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
@@ Coverage Diff @@
## master #6181 +/- ##
==========================================
- Coverage 47.95% 47.90% -0.06%
==========================================
Files 674 676 +2
Lines 55841 55976 +135
==========================================
+ Hits 26781 26813 +32
- Misses 27311 27412 +101
- Partials 1749 1751 +2
/assign
Start working on this.
Thank you @mszacillo so much for bringing this up. I like this feature~
It would be great to add a user story section to this proposal to elaborate on the use case, since it's hard to connect the failover feature with quota management.
1. FederatedResourceQuota should enforce Overall resource limits if Static Assignment is not defined.
- FederatedResourceQuota will be updated whenever resources are applied against the relevant namespace
- We can either update FRQ by default, or consider including a scope for the quota
> We can either update FRQ by default, or consider including a scope for the quota

I don't quite understand this: who updates the FRQ, how, and why?
I should have been more clear here. The FederatedResourceQuota status will be updated whenever resources are applied against the relevant namespace.
That said, I think I can reframe this as a separate API which enforces resource limits on the Karmada control-plane, and keep the existing federated resource quota as is. That way we have separation between the APIs and can introduce the feature without stepping over existing use-cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will look at the separate API later, but it seems concerning because this proposal falls exactly under the scope of FederatedResourceQuota. The new API would still need to clarify its relationship with FederatedResourceQuota, unless its intention is to replace it.
**Reconciliation**: Controller will reconcile whenever a resource binding is created, updated, or deleted. Controller will only reconcile if the resource in question has a pointer to a FederatedResourceQuota.

**Reconcile Logic**: When reconciling, the controller will fetch the list of ResourceBindings by namespace and add up their resource requirements. The existing implementation grabs all RBs by namespace; however, this could probably be improved by only calculating the delta of the applied resource, rather than calculating the entire resource footprint of the namespace.
Currently, there is no way to get the previous resource requirements when reconciling ResourceBindings in the reconcile loop, so we can't calculate the delta.
Fetching the whole list of ResourceBindings may raise performance concerns, but it doesn't seem to be a blocker.
Good point, this was a concern of mine as well. I'm thinking we could perhaps maintain a ResourceBinding cache kept by the controller. In the event of an update to a ResourceBinding, the controller would:
- Check the cache to see if the ResourceBinding exists.
- If yes, take the previous spec and calculate a resource delta to update the FRQ status
- If no, update the FRQ status using the new ResourceBinding's full resource requirements as the delta
- Update the cache with the new ResourceBinding spec.
In the case that the controller pod is restarted or crashes, the cache will be lost and will need to be repopulated. But this would be an improvement over fetching the whole list of ResourceBindings each reconcile loop.
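The cache-plus-delta idea above could be sketched roughly as follows. This is a minimal illustration with stdlib-only stand-in types (`ResourceList` and `bindingCache` are hypothetical names, not Karmada's actual API; amounts are plain integers rather than `resource.Quantity`):

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for corev1.ResourceList,
// tracking requested amounts (e.g. millicores, bytes) per resource name.
type ResourceList map[string]int64

// bindingCache is a hypothetical in-memory cache of the last-seen
// resource requirements per ResourceBinding, keyed by namespace/name.
type bindingCache map[string]ResourceList

// applyUpdate returns the delta to add to the FederatedResourceQuota
// status for an updated ResourceBinding, then refreshes the cache.
func (c bindingCache) applyUpdate(key string, newUsage ResourceList) ResourceList {
	delta := ResourceList{}
	old := c[key] // nil (zero usage) if the binding was not cached
	for name, amount := range newUsage {
		delta[name] = amount - old[name]
	}
	for name, amount := range old {
		if _, seen := newUsage[name]; !seen {
			delta[name] = -amount // resource no longer requested
		}
	}
	c[key] = newUsage
	return delta
}

func main() {
	cache := bindingCache{}
	// First reconcile: binding not cached, delta equals full usage.
	d1 := cache.applyUpdate("ns1/flink-app", ResourceList{"cpu": 2000, "memory": 4096})
	fmt.Println(d1["cpu"], d1["memory"]) // 2000 4096
	// Second reconcile: only the difference is reported.
	d2 := cache.applyUpdate("ns1/flink-app", ResourceList{"cpu": 3000, "memory": 4096})
	fmt.Println(d2["cpu"], d2["memory"]) // 1000 0
}
```

As noted, a restart would lose this cache, so the controller would still need a full-list fallback to repopulate it.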
This is an option, I guess. For now, perhaps we don't have to dive into the implementation details; I believe there are many ways to handle that. I haven't looked at how Kubernetes calculates the used quota yet; maybe we can borrow some ideas from it.
In addition, even without the cache you mentioned, we can still build an index against ResourceBinding on a field like `.spec.resourceQuotaName`, so we can get all of the ResourceBindings with quota enabled (no need to list all of them). By the way, there is an effort to manage all of the indexers across the system. If you want to know how the index works, you can refer to #6204.
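Conceptually, such a field index (which a real controller would register via controller-runtime's `FieldIndexer`) just maps the indexed field value to the matching objects. A stdlib-only sketch of the idea, with illustrative names:

```go
package main

import "fmt"

// quotaIndex sketches the idea behind a field index on
// `.spec.resourceQuotaName`: instead of listing every ResourceBinding
// and filtering, look up only the bindings tied to a given quota.
type quotaIndex map[string]map[string]struct{} // quotaName -> set of binding keys

func (idx quotaIndex) add(quotaName, bindingKey string) {
	if idx[quotaName] == nil {
		idx[quotaName] = map[string]struct{}{}
	}
	idx[quotaName][bindingKey] = struct{}{}
}

// bindingsFor returns the keys of all ResourceBindings that reference
// the given FederatedResourceQuota, without scanning unrelated ones.
func (idx quotaIndex) bindingsFor(quotaName string) []string {
	keys := make([]string, 0, len(idx[quotaName]))
	for k := range idx[quotaName] {
		keys = append(keys, k)
	}
	return keys
}

func main() {
	idx := quotaIndex{}
	idx.add("team-a-quota", "ns1/flink-app")
	idx.add("team-a-quota", "ns1/web-app")
	idx.add("team-b-quota", "ns2/batch-job")
	fmt.Println(len(idx.bindingsFor("team-a-quota"))) // 2
}
```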
As part of this change, we will do two things:

1. Edit the existing FederatedResourceQuota validating webhook to prevent users from toggling StaticAssignments on/off when the resource quota is already in use. Since the overall limits and static assignments have different controllers, we don’t want them both reconciling the resource at once.
Do you mean preventing users from turning the StaticAssignments configuration of FederatedResourceQuota on/off?

> Since the overall limits and static assignments have different controllers, we don’t want them both reconciling the resource at once.

If so, can you elaborate on what the problem is here?
The issue is that both the existing static assignment controller and the new proposed controller will touch the same parts of the FederatedResourceQuota status. Since they calculate the resource usage differently, there would be conflicting changes in the status.
The more I think about this, the more I'm tempted to separate out this resource limit enforcement into its own API.
2. Create a ValidatingWebhook to enforce FederatedResourceQuota limits. The existing implementation reuses Karmada default + thirdparty resource interpreters to calculate the predicted delta resource usage for the quota. If the applied resource goes above the limit, then the webhook will deny the request.
The new validating webhook will need to watch all kinds of resources, or at least all supported workload types, like Deployment and FlinkDeployment.

> The existing implementation reuses Karmada default + thirdparty resource interpreters to calculate the predicted delta resource usage for the quota.

Do you mean the [GetReplicas](https://github.com/karmada-io/karmada/blob/3232c52d57b331d7120eeaac9386b848197475df/pkg/resourceinterpreter/interpreter.go#L47)?
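Whatever interpreter supplies the predicted usage, the webhook's core decision reduces to a bounds check: used plus the incoming delta must stay within every hard limit, otherwise the admission request is denied. A minimal sketch with illustrative types (not Karmada's actual API):

```go
package main

import "fmt"

// ResourceList models requested amounts per resource name
// (e.g. cpu in millicores, memory in MiB).
type ResourceList map[string]int64

// admit sketches the webhook's decision: deny the request if the
// quota's used amounts plus the predicted delta of the incoming
// object would exceed any hard limit.
func admit(hard, used, delta ResourceList) (bool, string) {
	for name, limit := range hard {
		if used[name]+delta[name] > limit {
			return false, fmt.Sprintf("quota exceeded for %s: used %d + requested %d > limit %d",
				name, used[name], delta[name], limit)
		}
	}
	return true, ""
}

func main() {
	hard := ResourceList{"cpu": 8000, "memory": 16384}
	used := ResourceList{"cpu": 6000, "memory": 8192}
	// A workload asking for 3 more CPUs would push usage past the 8-CPU limit.
	ok, reason := admit(hard, used, ResourceList{"cpu": 3000, "memory": 2048})
	fmt.Println(ok, reason)
}
```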
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
Could we support dynamic assignment of FederatedResourceQuota? Potentially yes, but there are some drawbacks with that approach:
Definitely yes! :)
I wanted this about two years ago!
1. Each time an application fails over, the FederatedResourceQuota will need to check that the feasible clusters have enough quota and, if not, rebalance the resource quotas before scheduling work. This adds complexity to the scheduling step and would increase E2E failover latency.
It may depend on how the quota is balanced dynamically.
How about just maintaining the overall quota and usage data at the Karmada control plane? No ResourceQuota would be split across member clusters.
That's ideally how I'd like this feature to work!
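The appeal of tracking quota only at the control plane can be shown with a small sketch: if usage is summed per namespace across all ResourceBindings regardless of placement, then a failover that moves a binding between member clusters leaves the namespace total unchanged, so no per-cluster rebalancing is needed. Types and names below are illustrative:

```go
package main

import "fmt"

// binding is a simplified stand-in for a ResourceBinding: a namespaced
// workload scheduled to some member cluster with a CPU request.
type binding struct {
	namespace, cluster string
	cpuMillis          int64
}

// namespaceUsage sums usage per namespace; the target cluster is
// deliberately ignored, which is the whole point of control-plane-only
// quota tracking.
func namespaceUsage(bindings []binding, ns string) int64 {
	var total int64
	for _, b := range bindings {
		if b.namespace == ns {
			total += b.cpuMillis
		}
	}
	return total
}

func main() {
	bindings := []binding{
		{"ns1", "cluster-a", 2000},
		{"ns1", "cluster-b", 1000},
	}
	before := namespaceUsage(bindings, "ns1")
	bindings[0].cluster = "cluster-c" // simulate a failover
	fmt.Println(before == namespaceUsage(bindings, "ns1")) // true
}
```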
Signed-off-by: mszacillo <[email protected]>
Hi @RainbowMango, I have gone ahead and improved the diagrams to make the motivation clearer. They now include FlinkDeployments with their resource usage, and show how failover does not work with the existing statically assigned ResourceQuotas. Please let me know if there are any other concerns regarding this proposal. Thanks!
@whitewindmills Would you like to take a look at this proposal as well? It is highly likely that the scheduler should be involved to enforce the FederatedResourceQuota.
Got it, I'll take a look ASAP.
What type of PR is this?
/kind design
What this PR does / why we need it:
This proposal enhances the FederatedResourceQuota so that it can impose namespaced resource limits directly on the Karmada control-plane level.
Which issue(s) this PR fixes:
Fixes #5179
Special notes for your reviewer:
Following our discussion during the community meeting, there were some questions regarding API server request latency as a result of adding an admission webhook to enforce resource limits. I ran some tests by measuring the time taken to apply some FlinkDeployments to the Karmada control-plane:
- Without webhook: average E2E request latency over 100 retries: 370 ms
- With webhook: average E2E request latency over 100 retries: 390 ms

The request latency increases slightly, as expected, but we can run some more comprehensive performance tests for this feature.
Does this PR introduce a user-facing change?: