GEP 3388 Retry Budget API Implementation #3607

ericdbishop · 2025-02-10T18:04:43Z

What type of PR is this?

/kind documentation
/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Does this PR introduce a user-facing change?:

adds a new BackendTrafficPolicy with ability to configure budgeted retries

k8s-ci-robot · 2025-02-10T18:04:50Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ericdbishop
Once this PR has been reviewed and has the lgtm label, please assign robscott for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2025-02-10T18:04:53Z

Hi @ericdbishop. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ericdbishop · 2025-02-10T18:14:31Z

apis/v1alpha2/backendtrafficpolicy_types.go

+	// Retry defines the configuration for when to retry a request to a target
+	// backend.
+	//
+	// Implementations SHOULD retry on connection errors (disconnect, reset, timeout,
+	// TCP failure) if a retry stanza is configured.
+	//
+	// Support: Extended
+	//
+	// +optional
+	// <gateway:experimental>
+	Retry *CommonRetryPolicy `json:"retry,omitempty"`


Planning to correct this description (previously discussed here), but I'm also considering changing Retry to RetryBudget so we can better capture the distinction between a constrained budget on retries, versus the static count retries that are configured within HTTPRoute. I think CommonRetryPolicy is okay but would also be curious if we think RetryBudgetPolicy would be more self-explanatory.

CommonRetryPolicy was originally an abstraction from the initial "two possible approaches" proposal just to minimize duplication - agreed that the Common* prefix is probably no longer appropriate, but not quite sure what the correct name should be here:

I feel like *Policy implies a top-level resource like BackendTrafficPolicy that is actually an impl of the policy attachment pattern, not a sub-resource.

We could just collapse the fields into BackendTrafficPolicy inline, but I like the way SessionPersistence is broken out currently - it feels like it will be more composable if we add additional functionality to BackendTrafficPolicy

I'm not quite sure yet if we do indeed want to narrow the scope down to RetryBudget or choose a name that could allow additional fields within this stanza.

I agree with those points, I think it makes sense to leave the retry budget configuration broken out from BackendTrafficPolicy instead of inline. I could see replacing the name CommonRetryPolicy with something like RetryConstraint if we wanted to allow for the possibility down the line of constraining retries based off of something other than a budget.

I took some liberty here in renaming Retry and the CommonRetryPolicy struct to RetryConstraint, but open to better suggestions.

ericdbishop · 2025-02-10T18:21:32Z

apis/v1alpha2/backendtrafficpolicy_types.go

+	// Support: Extended
+	//
+	// +optional
+	BudgetPercent *int `json:"budgetPercent,omitempty"`


Previous comment on validation. The maximum valid argument for BudgetPercent should be 100 as that is effectively the same as having no retry budget at all, but should the minimum value we allow be 0? Should users be allowed to block all retries in that way?

I set the minimum as 0 for the time being.
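For reference, a minimal sketch of how the 0 to 100 bounds could be expressed. The struct name `budgetSketch` and the helper `validateBudgetPercent` are hypothetical and not part of this PR; the validation markers are standard kubebuilder markers, shown here as an assumption about how the bounds might be declared.

```go
package main

import (
	"errors"
	"fmt"
)

// budgetSketch mirrors only the field under discussion. The name is
// hypothetical; the markers show one way the 0-100 bounds could be declared.
type budgetSketch struct {
	// +optional
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=100
	BudgetPercent *int `json:"budgetPercent,omitempty"`
}

// validateBudgetPercent is a hypothetical helper applying the same bounds in Go.
func validateBudgetPercent(p *int) error {
	if p == nil {
		return nil // field omitted; defaulting is handled elsewhere
	}
	if *p < 0 || *p > 100 {
		return errors.New("budgetPercent must be between 0 and 100")
	}
	return nil
}

func main() {
	for _, v := range []int{0, 20, 100, 101} {
		v := v
		fmt.Printf("budgetPercent=%d err=%v\n", v, validateBudgetPercent(&v))
	}
}
```

With these bounds, 0 is accepted (all retries blocked) and anything above 100 is rejected.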

ericdbishop · 2025-02-10T18:32:21Z

apis/v1alpha2/backendtrafficpolicy_types.go

+// CommonRetryPolicy defines the configuration for when to retry a request.
+type CommonRetryPolicy struct {


What's the minimum viable set of fields here for an implementation to say that they support retry budgets?

Link to comment.

Given confirmation that Envoy's retry_budget spec could be modified to include a parameter that matches BudgetInterval, I think it would be safe to require that implementations should include all fields to be considered supporting retry budgets.

But that being said, I could see how BudgetInterval could be excluded to match Envoy's existing retry budget behavior which @mikemorris detailed here, making only MinRetryRate and BudgetPercent truly necessary.

> I could see how BudgetInterval could be excluded to match Envoy's existing retry budget behavior

In the context of @tonya11en's comment at #3573 (comment) and envoyproxy/envoy#30205 (comment), even though this could be possible to enable, I'm unsure if it would actually be desirable even for Envoy-based implementations of Gateway API?

This additionally has some bearing on the semantic meaning of budgetInterval: 0 (weird, effectively a rate with a division by zero unless we use it as a shorthand for Envoy's current behavior) vs if we want to prescribe a default interval when omitting the field entirely (which could make UX more concise).

I do think having a default budgetInterval would make sense. Going off of the default I see for Linkerd's ttl parameter, maybe 10s is reasonable? I agree that the meaning of budgetInterval: 0 is strange. Also, since it is a duration, it would require a unit of time. After seeing that additional context, it does seem more desirable to require implementations to include all three parameters.
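A rough sketch of the defaulting and validation behavior discussed here, assuming a 10s default and treating a zero interval as invalid. `resolveBudgetInterval` and `defaultBudgetInterval` are hypothetical names, and `time.ParseDuration` is only a stand-in for whatever duration type the API ends up using.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// defaultBudgetInterval is the 10s default floated in this thread
// (an assumption, not settled API).
const defaultBudgetInterval = 10 * time.Second

// resolveBudgetInterval applies the default when the field is omitted and
// rejects intervals that are zero, negative, or missing a unit.
func resolveBudgetInterval(raw *string) (time.Duration, error) {
	if raw == nil {
		return defaultBudgetInterval, nil
	}
	d, err := time.ParseDuration(*raw)
	if err != nil {
		return 0, err // e.g. "10" has no unit
	}
	if d <= 0 {
		return 0, errors.New("budgetInterval must be greater than zero")
	}
	return d, nil
}

func main() {
	for _, s := range []string{"10s", "0s", "10"} {
		s := s
		d, err := resolveBudgetInterval(&s)
		fmt.Printf("input=%q interval=%v err=%v\n", s, d, err)
	}
	d, _ := resolveBudgetInterval(nil)
	fmt.Printf("omitted interval=%v\n", d)
}
```

Rejecting zero outright avoids the division-by-zero reading of budgetInterval: 0, at the cost of not having a shorthand for Envoy's current behavior.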

ericdbishop · 2025-02-13T14:07:14Z

apisx/v1alpha2/backendtrafficpolicy.go

+	// RetryConstraint defines the configuration for when to allow or prevent
+	// further retries to a target backend by dynamically calculating a 'retry
+	// budget'. This budget is calculated based on the percentage of incoming
+	// traffic composed of retries over a given time interval. Once the budget
+	// is exceeded, additional retries will be rejected by the backend.
+	//
+	// For example, if the retry budget interval is 10 seconds, there have been
+	// 1000 active requests in the past 10 seconds, and the allowed percentage
+	// of requests that can be retried is 20% (the default), then 200 of those
+	// requests may be composed of retries. Active requests will only be
+	// considered for the duration of the interval when calculating the retry
+	// budget.
+	//
+	// Configuring a RetryConstraint in BackendTrafficPolicy is compatible with
+	// HTTPRoute Retry settings for each HTTPRouteRule that targets the same
+	// backend. While the HTTPRouteRule Retry stanza can specify whether a
+	// request should be retried and the number of retry attempts each client
+	// may perform, RetryConstraint helps prevent cascading failures, such as
+	// retry storms, during periods of consistent failures.
+	//
+	// After the retry budget has been exceeded, additional retries to the
+	// backend must return a 503 response to the client.
+	//
+	// Additional configurations for defining a constraint on retries MAY be
+	// defined in the future.


This entire description requires wordsmithing.
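To make the arithmetic in the example concrete, here is a small sketch of the budget calculation as described in the comment above. The function names are hypothetical, and this ignores MinRetryRate and any sliding-window bookkeeping an implementation would need.

```go
package main

import "fmt"

// allowedRetries returns how many of the active requests seen during the
// budget interval may be retries, given a percentage budget.
func allowedRetries(activeRequests, budgetPercent int) int {
	return activeRequests * budgetPercent / 100
}

// retryPermitted reports whether one more retry still fits within the budget.
func retryPermitted(activeRequests, retriesSoFar, budgetPercent int) bool {
	return retriesSoFar < allowedRetries(activeRequests, budgetPercent)
}

func main() {
	// 1000 active requests in the interval, 20% budget.
	fmt.Println(allowedRetries(1000, 20))
	fmt.Println(retryPermitted(1000, 199, 20))
	fmt.Println(retryPermitted(1000, 200, 20))
}
```

With 1000 active requests and a 20% budget, the 200th retry is the last one permitted within the interval.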


ericdbishop · 2025-02-13T14:14:16Z

@mikemorris @robscott @kflynn @youngnick @dprotaso Hi team, would appreciate an initial review as this PR is already pretty large. I cherry-picked/followed some of @dprotaso's changes from #3588 so would appreciate clarification if I correctly created a separate API group, thanks!

ericdbishop · 2025-02-13T14:20:00Z

apisx/v1alpha2/backendtrafficpolicy.go

+	// retry storms, during periods of consistent failures.
+	//
+	// After the retry budget has been exceeded, additional retries to the
+	// backend must return a 503 response to the client.


Are we opinionated on what should be returned to the client after the targeted backend's retry budget has been exceeded? Envoy returns a 503, in addition to setting an x-envoy-overloaded header on the downstream response.

No strong opinion on this. I can see a case for a SHOULD instead of a MUST here, but probably better to start with more restrictive language and loosen if necessary in the future. On that note, recommend capitalizing all of the RFC 2119 keywords.

I would somewhat expect this would more logically be a 429 response, @kflynn how does Linkerd handle this?

Why would 429 apply here? Per https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429, this is a signal to the client that it has exceeded some limit, but in this case it is the load balancer that has exceeded the budget. From the client's perspective, it may have sent only a single request (ever), so a 429 seems strange.

That makes sense to me @htuch that it may not be any individual client's fault that the service is unavailable, thanks for helping to clarify this.

k8s-ci-robot · 2025-02-14T23:50:58Z

PR needs rebase.


robscott

Thanks @ericdbishop!

robscott · 2025-02-15T01:08:15Z

apisx/v1alpha2/backendtrafficpolicy.go

+	// +listMapKey=name
+	// +kubebuilder:validation:MinItems=1
+	// +kubebuilder:validation:MaxItems=16
+	TargetRefs []LocalPolicyTargetReference `json:"targetRefs"`


This specific form of target ref is missing a way to target an individual port on a Service. That may be fine, just want to call it out for future readers.

robscott · 2025-02-15T01:09:08Z

apisx/v1alpha2/backendtrafficpolicy.go

+	// further retries to a target backend by dynamically calculating a 'retry
+	// budget'. This budget is calculated based on the percentage of incoming
+	// traffic composed of retries over a given time interval. Once the budget
+	// is exceeded, additional retries will be rejected by the backend.


This doesn't sound quite right. It's not the backend that's rejecting retries here right?

robscott · 2025-02-15T01:10:37Z

apisx/v1alpha2/backendtrafficpolicy.go

+	// of requests that can be retried is 20% (the default), then 200 of those
+	// requests may be composed of retries. Active requests will only be


How are requests measured here? What if there were 1000 requests to the Gateway, but 100 of them were retried 3 times each, does that count as 100 or 300 here?
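The two readings of this question can be sketched as follows. `retriesCounted` is a hypothetical helper, and neither interpretation is confirmed by the current description; the point is only that the choice decides whether this scenario stays within a 20% budget.

```go
package main

import "fmt"

// retriesCounted returns the number of retries charged against the budget.
// perAttempt=false counts each retried request once; perAttempt=true counts
// every retry attempt separately.
func retriesCounted(retriedRequests, attemptsEach int, perAttempt bool) int {
	if perAttempt {
		return retriedRequests * attemptsEach
	}
	return retriedRequests
}

func main() {
	budget := 1000 * 20 / 100 // 20% of 1000 requests
	for _, perAttempt := range []bool{false, true} {
		n := retriesCounted(100, 3, perAttempt)
		fmt.Printf("perAttempt=%v retries=%d withinBudget=%v\n", perAttempt, n, n <= budget)
	}
}
```

Counting per request gives 100 retries, which fits the budget of 200; counting per attempt gives 300, which exceeds it.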

robscott · 2025-02-15T01:13:58Z

apisx/v1alpha2/backendtrafficpolicy.go

+	// retry storms, during periods of consistent failures.
+	//
+	// After the retry budget has been exceeded, additional retries to the
+	// backend must return a 503 response to the client.


No strong opinion on this. I can see a case for a SHOULD instead of a MUST here, but probably better to start with more restrictive language and loosen if necessary in the future. On that note, recommend capitalizing all of the RFC 2119 keywords.

robscott · 2025-02-15T01:15:31Z

apisx/v1alpha2/backendtrafficpolicy.go

+	// BudgetPercent defines the maximum percentage of active requests that may
+	// be made up of retries.
+	//
+	// Support: Extended


We need some more work on this, but I'd argue that we should have a concept of "this field MUST be supported if you support this feature" (retry budgets in this case).

/cc @youngnick @mlavacca @shaneutt

robscott · 2025-02-15T01:15:56Z

apisx/v1alpha2/backendtrafficpolicy.go

+	// Support: Extended
+	//
+	// +optional
+	// +kubebuilder:default=10s


Seems like something we'll want to define a min and max value for.

htuch · 2025-02-19T16:25:32Z

apisx/v1alpha2/backendtrafficpolicy.go

+	TargetRefs []LocalPolicyTargetReference `json:"targetRefs"`
+
+	// RetryConstraint defines the configuration for when to allow or prevent
+	// further retries to a target backend by dynamically calculating a 'retry


TBC is backend here referencing an individual target host or a service? Envoy's retry budget is a cluster setting.

htuch · 2025-02-20T21:08:40Z

apisx/v1alpha2/backendtrafficpolicy.go

+	//
+	// +optional
+	// +kubebuilder:default={count: 10, interval: 1s}
+	MinRetryRate *RequestRate `json:"minRetryRate,omitempty"`


I think this is safe given it's an override for very low volume traffic. That said, features that make use of absolute rates and capacities, that are maintained in a single proxy instance (e.g. Envoy's circuit breakers), tend to behave unpredictably when the Gateway is not just a single proxy instance but autoscaled or part of some multi-tenant deployment with an opaque number of instances. For example, here, if we set minRetryRate to something like 1 per second, any given backend service might see anything from 1 to 10k retries per second if there are 1-10k proxy instances backing the Gateway, even if the budget is configured to try to deter traffic on the higher end.

Again, I don't think it's an acute issue as we're setting a minimum traffic volume, and the retry budget needs some context specific tuning to make sense in any case, but it's something end users of limits and controls set service-wide should be aware of when using a feature on a multi-instance Gateway implementation.

Hmmm, yep, definitely understand. A similar consideration would apply for local vs global rate limiting. I haven't seen any actual requests for or implementations of global retry budgeting. It is an interesting problem, but I agree that tuning based on rough expected scale feels like it could be acceptable for now?
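A quick sketch of the scaling concern above, assuming each proxy instance enforces minRetryRate locally with no shared state. `aggregateRetryFloor` is a hypothetical helper for illustration only.

```go
package main

import "fmt"

// aggregateRetryFloor returns the retries per second a backend could see from
// the minRetryRate floor alone when every proxy instance enforces it locally.
func aggregateRetryFloor(perInstancePerSecond, instances int) int {
	return perInstancePerSecond * instances
}

func main() {
	// minRetryRate of 1 per second, as in the example above.
	for _, instances := range []int{1, 100, 10000} {
		fmt.Printf("instances=%d floor=%d retries/s\n", instances, aggregateRetryFloor(1, instances))
	}
}
```

The floor grows linearly with the number of instances, which is why tuning against a rough expected scale matters on multi-instance Gateways.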

k8s-ci-robot requested review from robscott and youngnick February 10, 2025 18:04

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 10, 2025

ericdbishop changed the title ~~Gep 3388 retry budget api implementation~~ GEP 3388 Retry Budget API Implementation Feb 10, 2025


ericdbishop added 9 commits February 13, 2025 09:08

apis: add implementation for GEP-3388 HTTPRoute Retry Budget

4b4fd67

fmt and add descriptions for parameters

1993417

Move GEP 3388 to Experimental

81ee318

make generate

7b61c97

Minor change

6661e47

Require both parameters of RequestRate

e359c3e

Begin fixing Retry description. Add defaults, some validation, in Com…

b166a68

…monRetryPolicy

Taking the liberty of renaming CommonRetryPolicy to RetryConstraint

1a122fa

Shamelessly copying from backendlbpolicy and backendtlspolicy to conf…

5a02c8e

…orm with api structure

ericdbishop and others added 5 commits February 13, 2025 09:08

Fleshing out the description for RetryConstraint

5f2b55b

refactor codegen scripts to make it easier to generate two clients

d6bcae5

Attempting to match the experimental API structure that dprotaso made…

6f04b8b

… in kubernetes-sigs#3588

Delete files that were generated before moving to apisx

dcc5729

undo commenting

ededf82

ericdbishop force-pushed the gep-3388-retry-budget-api-implementation branch from d11236f to ededf82 Compare February 13, 2025 14:08

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 13, 2025

ericdbishop marked this pull request as ready for review February 13, 2025 14:09

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 13, 2025

k8s-ci-robot requested review from mikemorris and shaneutt February 13, 2025 14:09


k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 14, 2025

robscott reviewed Feb 15, 2025


k8s-ci-robot requested a review from mlavacca February 15, 2025 01:16

htuch reviewed Feb 19, 2025


htuch reviewed Feb 20, 2025
