
Proposal to enhance FederatedResourceQuota to enforce resource limits directly on Karmada level #6181

Open · wants to merge 1 commit into master

Conversation

mszacillo
Contributor

@mszacillo mszacillo commented Mar 3, 2025

What type of PR is this?
/kind design

What this PR does / why we need it:

This proposal enhances the FederatedResourceQuota so that it can impose namespaced resource limits directly on the Karmada control-plane level.

Which issue(s) this PR fixes:
Fixes #5179

Special notes for your reviewer:
Following our discussion during the community meeting, there were some questions regarding API server request latency as a result of adding an admission webhook to enforce resource limits. I ran some tests by measuring the time taken to apply some FlinkDeployments to the Karmada control-plane:

Without webhook: Average e2e request latency over 100 retries: 370 ms
With webhook: Average e2e request latency over 100 retries: 390 ms

The request latency increases slightly, as expected, but I can run some more comprehensive performance tests for this feature.

Does this PR introduce a user-facing change?:

NONE

@karmada-bot karmada-bot added the kind/design Categorizes issue or PR as related to design. label Mar 3, 2025
@karmada-bot karmada-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 3, 2025
@codecov-commenter

codecov-commenter commented Mar 3, 2025


Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 47.90%. Comparing base (3b6c0e0) to head (1a2f8e3).
Report is 61 commits behind head on master.


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6181      +/-   ##
==========================================
- Coverage   47.95%   47.90%   -0.06%     
==========================================
  Files         674      676       +2     
  Lines       55841    55976     +135     
==========================================
+ Hits        26781    26813      +32     
- Misses      27311    27412     +101     
- Partials     1749     1751       +2     
Flag Coverage Δ
unittests 47.90% <ø> (-0.06%) ⬇️



Member

@RainbowMango RainbowMango left a comment


/assign
Start working on this.

Member

@RainbowMango RainbowMango left a comment


Thank you @mszacillo so much for bringing this up. I like this feature~

It would be great to add a user story section to this proposal to elaborate on the use case, because it's hard to connect the failover feature with quota management.


1. FederatedResourceQuota should enforce Overall resource limits if Static Assignment is not defined.
- FederatedResourceQuota will be updated whenever resources are applied against the relevant namespace
- We can either update the FRQ by default, or consider including a scope for the quota
Member


We can either update the FRQ by default, or consider including a scope for the quota

I don't quite understand this: who updates the FRQ, how, and why?

Contributor Author


I should have been clearer here. The FederatedResourceQuota status will be updated whenever resources are applied against the relevant namespace.

That said, I think I can reframe this as a separate API which enforces resource limits on the Karmada control-plane, and keep the existing federated resource quota as is. That way we have separation between the APIs and can introduce the feature without stepping over existing use-cases.

Member


I will look at the separate API later, but it seems concerning, because this proposal falls squarely within the scope of FederatedResourceQuota. The new API would still need to clarify its relationship with FederatedResourceQuota, unless its intention is to replace FederatedResourceQuota.


**Reconciliation**: Controller will reconcile whenever a resource binding is created, updated, or deleted. Controller will only reconcile if the resource in question has a pointer to a FederatedResourceQuota.

**Reconcile Logic**: When reconciling, the controller will fetch the list of ResourceBindings by namespace and add up their resource requirements. The existing implementation grabs all RBs by namespace, however this could probably be improved by only calculating the delta of the applied resource, rather than calculating the entire resource footprint of the namespace.
Member


Currently, there is no way to get the previous resource requirements when reconciling ResourceBindings in the reconcile loop, so we can't calculate the delta.

Fetching the whole list of ResourceBindings may raise performance concerns, but it doesn't seem to be a blocker.

Contributor Author


Good point, this was a concern of mine as well. I'm thinking we could maintain a ResourceBinding cache in the controller. In the event of an update to a ResourceBinding, the controller would:

  1. Check the cache to see if the ResourceBinding exists.
  2. If yes, take the previous spec and calculate a resource delta to update the FRQ status.
  3. If no, update the FRQ status using the new ResourceBinding's full resource usage.
  4. Update the cache with the new ResourceBinding spec.

In the case that the controller pod is restarted or crashes, the cache will be lost and will need to be repopulated. But this would be an improvement over fetching the whole list of ResourceBindings each reconcile loop.

Member


This is an option, I guess. For now, perhaps we don't have to dive into the implementation details; I believe there are many ways to handle that. I haven't looked at how Kubernetes calculates the used quota yet, but maybe we can borrow some of that logic.

In addition, even without the cache you mentioned, we can still build an index on a field like .spec.resourceQuotaName of ResourceBinding, so we can get all of the ResourceBindings that have a quota enabled (without having to list all of them). By the way, there is an effort underway to manage all of the indexers across the system. If you want to know how the index works, you can refer to #6204.


As part of this change, we will do two things:

1. Edit the existing FederatedResourceQuota validating webhook to prevent users from toggling StaticAssignments on/off when the resource quota is already in use. Since the overall limits and static assignments have different controllers, we don’t want them both reconciling the resource at once.
Member


Do you mean preventing users from turning the StaticAssignments configuration of FederatedResourceQuota on/off?

Since the overall limits and static assignments have different controllers, we don’t want them both reconciling the resource at once.

If yes, can you elaborate on what the problem is here?

Contributor Author

@mszacillo mszacillo Mar 17, 2025


The issue is that both the existing static assignment controller and the new proposed controller will touch the same parts of the FederatedResourceQuota status. Since they calculate the resource usage differently, there would be conflicting changes in the status.

The more I think about this, the more I'm tempted to separate out this resource limit enforcement into its own API.

As part of this change, we will do two things:

1. Edit the existing FederatedResourceQuota validating webhook to prevent users from toggling StaticAssignments on/off when the resource quota is already in use. Since the overall limits and static assignments have different controllers, we don’t want them both reconciling the resource at once.
2. Create a ValidatingWebhook to enforce FederatedResourceQuota limits. The existing implementation reuses Karmada default + thirdparty resource interpreters to calculate the predicted delta resource usage for the quota. If the applied resource goes above the limit, then the webhook will deny the request.
Member


Suggested change
2. Create a ValidatingWebhook to enforce FederatedResourceQuota limits. The existing implementation reuses Karmada default + thirdparty resource interpreters to calculate the predicted delta resource usage for the quota. If the applied resource goes above the limit, then the webhook will deny the request.
1. the new validation webhook will be watching for all kinds of resources, at least all supported workload types, like Deployment, FlinkDeployment.
> The existing implementation reuses Karmada default + thirdparty resource interpreters to calculate the predicted delta resource usage for the quota.
Do you mean the [GetReplicas](https://github.com/karmada-io/karmada/blob/3232c52d57b331d7120eeaac9386b848197475df/pkg/resourceinterpreter/interpreter.go#L47)?

@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rainbowmango. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 17, 2025
@mszacillo mszacillo changed the title Proposal for federated resource quota enhancement Proposal to support ResourceQuotaEnforcer API Mar 17, 2025

![failover-example-2](resources/failover-example-2.png)

Could we support dynamic assignment of FederatedResourceQuota? Potentially yes, but there are some drawbacks with that approach:
Member


Definitely yes! :)
I wanted this about two years ago!

![failover-example-2](resources/failover-example-2.png)

Could we support dynamic assignment of FederatedResourceQuota? Potentially yes, but there are some drawbacks with that approach:
1. Each time an application fails over, the FederatedResourceQuota will need to check that the feasible clusters have enough quota and, if not, rebalance the resource quotas before scheduling work. This adds complexity to the scheduling step and would increase E2E failover latency.
Member


It may depend on how the quota is balanced dynamically.

How about just maintaining the overall quota and usage data at the Karmada control plane? No ResourceQuota would be split across member clusters.

Contributor Author


That's ideally how I'd like this feature to work!

@mszacillo mszacillo changed the title Proposal to support ResourceQuotaEnforcer API Proposal to enhance FederatedResourceQuota to enforce resource limits directly on Karmada level Mar 28, 2025
@mszacillo
Contributor Author

Hi @RainbowMango,

I have gone ahead and improved the diagrams to make the motivation clearer. They now include FlinkDeployments with their resource usage, and show how failover does not work with the existing statically assigned ResourceQuotas.

Please let me know if there are any other concerns with regards to this proposal. Thanks!

@RainbowMango
Member

@whitewindmills Would you like to take a look at this proposal as well? It is highly likely that the scheduler should be involved to enforce the FederatedResourceQuota.

@whitewindmills
Member

Got it, I'll take a look ASAP.
/assign

Successfully merging this pull request may close these issues.

FederatedResourceQuota should be failover friendly