
Proposal to enhance FederatedResourceQuota to enforce resource limits directly on Karmada level #6181

Open · wants to merge 1 commit into master

Conversation

mszacillo
Contributor

@mszacillo mszacillo commented Mar 3, 2025

What type of PR is this?
/kind design

What this PR does / why we need it:

This proposal enhances the FederatedResourceQuota so that it can impose namespaced resource limits directly on the Karmada control-plane level.

Which issue(s) this PR fixes:
Fixes #5179

Special notes for your reviewer:
Following our discussion during the community meeting, there were some questions regarding API server request latency as a result of adding an admission webhook to enforce resource limits. I ran some tests by measuring the time taken to apply some FlinkDeployments to the Karmada control-plane:

Without webhook: Average e2e request latency over 100 retries: 370 ms
With webhook: Average e2e request latency over 100 retries: 390 ms

The request latency increases slightly, as expected, but I can run some more comprehensive performance tests for this feature.

Does this PR introduce a user-facing change?:

NONE

@karmada-bot karmada-bot added the kind/design Categorizes issue or PR as related to design. label Mar 3, 2025
@karmada-bot karmada-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 3, 2025
@codecov-commenter

codecov-commenter commented Mar 3, 2025


Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 47.90%. Comparing base (3b6c0e0) to head (1a2f8e3).
Report is 61 commits behind head on master.


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6181      +/-   ##
==========================================
- Coverage   47.95%   47.90%   -0.06%     
==========================================
  Files         674      676       +2     
  Lines       55841    55976     +135     
==========================================
+ Hits        26781    26813      +32     
- Misses      27311    27412     +101     
- Partials     1749     1751       +2     
Flag Coverage Δ
unittests 47.90% <ø> (-0.06%) ⬇️



Member

@RainbowMango RainbowMango left a comment


/assign
Start working on this.

Member

@RainbowMango RainbowMango left a comment


Thank you @mszacillo so much for bringing this up. I like this feature~

It would be great to add a user story section to this proposal to elaborate on the use case, because it's hard to connect the failover feature with quota management.


1. FederatedResourceQuota should enforce Overall resource limits if Static Assignment is not defined.
- FederatedResourceQuota will be updated whenever resources are applied against the relevant namespace
- We can either update the FRQ by default, or consider including a scope for the quota
Member


We can either update the FRQ by default, or consider including a scope for the quota

I don't quite understand this: who updates the FRQ, how, and why?

Contributor Author


I should have been clearer here. The FederatedResourceQuota status will be updated whenever resources are applied against the relevant namespace.

That said, I think I can reframe this as a separate API which enforces resource limits on the Karmada control-plane, and keep the existing federated resource quota as is. That way we have separation between the APIs and can introduce the feature without stepping over existing use-cases.

Member


I will look at the separate API later, but it seems concerning, because this proposal falls squarely within the scope of FederatedResourceQuota. The new API would still need to clarify its relationship with FederatedResourceQuota, unless its intention is to replace FederatedResourceQuota.


**Reconciliation**: Controller will reconcile whenever a resource binding is created, updated, or deleted. Controller will only reconcile if the resource in question has a pointer to a FederatedResourceQuota.

**Reconcile Logic**: When reconciling, the controller will fetch the list of ResourceBindings by namespace and add up their resource requirements. The existing implementation grabs all RBs by namespace, however this could probably be improved by only calculating the delta of the applied resource, rather than calculating the entire resource footprint of the namespace.
Member


Currently, there is no way to get the previous resource requirements when reconciling ResourceBindings in the reconcile loop, so we can't calculate the delta.

Fetching the whole list of ResourceBindings may raise performance concerns, but it doesn't seem to be a blocker.

Contributor Author


Good point, this was a concern of mine as well. I'm thinking we could maintain a ResourceBinding cache in the controller. In the event of an update to a ResourceBinding, the controller would:

  1. Check the cache to see if the ResourceBinding exists.
  2. If yes, take the previous spec and calculate a resource delta to update the FRQ status.
  3. If no, update the FRQ status using the new ResourceBinding's full resource usage.
  4. Update the cache with the new ResourceBinding spec.

In the case that the controller pod is restarted or crashes, the cache will be lost and will need to be repopulated. But this would be an improvement over fetching the whole list of ResourceBindings each reconcile loop.

Member


This is an option, I guess. For now, perhaps we don't have to dive into the implementation details; I believe there are many ways to handle that. I haven't looked at how Kubernetes calculates the used quota yet, but maybe we can borrow some of that logic.

In addition, even without the cache you mentioned, we can still build an index on a field like .spec.resourceQuotaName of ResourceBinding, so we can get all of the ResourceBindings that have a quota enabled (without having to list all of them). By the way, there is an effort underway to manage all of the indexers across the system. If you want to know how the index works, you can refer to #6204.


As part of this change, we will do two things:

1. Edit the existing FederatedResourceQuota validating webhook to prevent users from toggling StaticAssignments on/off when the resource quota is already in use. Since the overall limits and static assignments have different controllers, we don’t want them both reconciling the resource at once.
Member


Do you mean preventing users from turning the StaticAssignments configuration of FederatedResourceQuota on/off?

Since the overall limits and static assignments have different controllers, we don’t want them both reconciling the resource at once.

If yes, can you elaborate on what the problem is here?

Contributor Author

@mszacillo mszacillo Mar 17, 2025


The issue is that both the existing static assignment controller and the new proposed controller will touch the same parts of the FederatedResourceQuota status. Since they calculate the resource usage differently, there would be conflicting changes in the status.

The more I think about this, the more I'm tempted to separate out this resource limit enforcement into its own API.

As part of this change, we will do two things:

1. Edit the existing FederatedResourceQuota validating webhook to prevent users from toggling StaticAssignments on/off when the resource quota is already in use. Since the overall limits and static assignments have different controllers, we don’t want them both reconciling the resource at once.
2. Create a ValidatingWebhook to enforce FederatedResourceQuota limits. The existing implementation reuses Karmada default + thirdparty resource interpreters to calculate the predicted delta resource usage for the quota. If the applied resource goes above the limit, then the webhook will deny the request.
Member


Suggested change
2. Create a ValidatingWebhook to enforce FederatedResourceQuota limits. The existing implementation reuses Karmada default + thirdparty resource interpreters to calculate the predicted delta resource usage for the quota. If the applied resource goes above the limit, then the webhook will deny the request.
1. the new validation webhook will be watching for all kinds of resources, at least all supported workload types, like Deployment, FlinkDeployment.
> The existing implementation reuses Karmada default + thirdparty resource interpreters to calculate the predicted delta resource usage for the quota.
Do you mean the [GetReplicas](https://github.com/karmada-io/karmada/blob/3232c52d57b331d7120eeaac9386b848197475df/pkg/resourceinterpreter/interpreter.go#L47)?

@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rainbowmango. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 17, 2025
@mszacillo mszacillo changed the title Proposal for federated resource quota enhancement Proposal to support ResourceQuotaEnforcer API Mar 17, 2025

![failover-example-2](resources/failover-example-2.png)

Could we support dynamic assignment of FederatedResourceQuota? Potentially yes, but there are some drawbacks with that approach:
Member


Definitely yes! :)
I wanted this about two years ago!

![failover-example-2](resources/failover-example-2.png)

Could we support dynamic assignment of FederatedResourceQuota? Potentially yes, but there are some drawbacks with that approach:
1. Each time an application fails over, the FederatedResourceQuota will need to check that the feasible clusters have enough quota and, if not, rebalance the resource quotas before scheduling work. This adds complexity to the scheduling step and would increase E2E failover latency.
Member


It may depend on how the quota is balanced dynamically.

How about just maintaining the overall quota and usage data at the Karmada control plane? No ResourceQuota would be split across member clusters.

Contributor Author


That's ideally how I'd like this feature to work!

@mszacillo mszacillo changed the title Proposal to support ResourceQuotaEnforcer API Proposal to enhance FederatedResourceQuota to enforce resource limits directly on Karmada level Mar 28, 2025
@mszacillo
Contributor Author

Hi @RainbowMango,

I have gone ahead and improved the diagrams to make the motivation clearer. They now include FlinkDeployments with their resource usage, and show how failover does not work with the existing statically assigned ResourceQuotas.

Please let me know if there are any other concerns with regards to this proposal. Thanks!

@RainbowMango
Member

@whitewindmills Would you like to take a look at this proposal as well? It is highly likely that the scheduler should be involved to enforce the FederatedResourceQuota.

@whitewindmills
Member

Got it, I'll take a look ASAP.
/assign

Successfully merging this pull request may close these issues.

FederatedResourceQuota should be failover friendly