KEP-4951: Fill PRR questionnaire #5094

Merged 7 commits on Feb 12, 2025

3 changes: 3 additions & 0 deletions keps/prod-readiness/sig-autoscaling/4951.yaml
@@ -0,0 +1,3 @@
kep-number: 4951
alpha:
approver: "@soltysh"
66 changes: 46 additions & 20 deletions keps/sig-autoscaling/4951-configurable-hpa-tolerance/README.md
@@ -102,9 +102,9 @@ checklist items _must_ be updated for the enhancement to be released.

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [x] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [ ] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
@@ -283,7 +283,7 @@ when drafting this test plan.
[testing-guidelines]: https://git.k8s.io/community/contributors/devel/sig-testing/testing.md
-->

[ ] I/we understand the owners of the involved components may require updates to
[x] I/we understand the owners of the involved components may require updates to
existing tests to make this code solid enough prior to committing the changes necessary
to implement this enhancement.

@@ -335,7 +335,7 @@ For Beta and GA, add links to added tests together with links to k8s-triage for
https://storage.googleapis.com/k8s-triage/index.html
-->

- <test>: <link to test coverage>
N/A, the feature is tested using unit tests and e2e tests.

##### e2e tests

@@ -491,7 +491,8 @@ well as the [existing list] of feature gates.

- [x] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: HPAConfigurableTolerance
- Components depending on the feature gate: `kube-controller-manager`
- Components depending on the feature gate: `kube-controller-manager` and
`kube-apiserver`.

###### Does enabling the feature change any default behavior?

@@ -517,7 +518,8 @@ NOTE: Also set `disable-supported` to `true` or `false` in `kep.yaml`.

The feature can be disabled by restarting the `kube-controller-manager` with the feature gate set to `false`.

Any `tolerance` values set on existing HPAs will be ignored by the `kube-controller-manager` when the feature gate is off.
Any `tolerance` values set on existing HPAs will be ignored by the
`kube-controller-manager` and `kube-apiserver` when the feature gate is off.
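
One way to picture the rollback behavior described above is the following minimal sketch. It assumes that, when the gate is off or the field is unset, the controller falls back to the cluster-wide default (10%, per the value mentioned elsewhere in this KEP). The names `effective_tolerance` and `DEFAULT_TOLERANCE` are hypothetical and do not come from the Kubernetes code base.

```python
from typing import Optional

# Assumption: the cluster-wide default is the 10% tolerance cited in this KEP.
DEFAULT_TOLERANCE = 0.10


def effective_tolerance(configured: Optional[float], gate_enabled: bool) -> float:
    """Pick the tolerance to use for one scaling decision.

    `configured` is the per-HPA `tolerance` value, or None when unset.
    The per-HPA value only takes effect while the feature gate is enabled.
    """
    if gate_enabled and configured is not None:
        return configured
    # Gate disabled or field unset: any stored value is ignored.
    return DEFAULT_TOLERANCE
```

With this sketch, an HPA that stores `tolerance: 0.05` would behave as if it had the default tolerance once the gate is turned off, and would pick up `0.05` again when the gate is re-enabled.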

###### What happens if we reenable the feature if it was previously rolled back?

@@ -538,6 +540,9 @@ You can take a look at one potential example of such test in:
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
-->

We will add a unit test verifying that HPAs with and without the new fields are
properly validated, both when the feature gate is enabled and when it is
disabled.

### Rollout, Upgrade and Rollback Planning

<!--
@@ -594,6 +599,9 @@ checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->

The presence of the new `tolerance` HPA field indicates that the feature is
used.
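
As an illustration of such a check, the sketch below inspects an HPA object that has already been loaded as a dictionary (for example from JSON output). It assumes the field lives at `spec.behavior.scaleUp.tolerance` and `spec.behavior.scaleDown.tolerance`; that path is taken from the KEP design rather than from this PR, and the helper name `uses_tolerance` is hypothetical.

```python
from typing import Any, Mapping


def uses_tolerance(hpa: Mapping[str, Any]) -> bool:
    """Return True if the HPA sets `tolerance` for scale-up or scale-down.

    Assumption: the field path is spec.behavior.<direction>.tolerance.
    """
    behavior = (hpa.get("spec") or {}).get("behavior") or {}
    return any(
        (behavior.get(direction) or {}).get("tolerance") is not None
        for direction in ("scaleUp", "scaleDown")
    )


# Example objects, trimmed to the fields the helper reads.
with_field = {"spec": {"behavior": {"scaleUp": {"tolerance": "0.05"}}}}
without_field = {"spec": {"maxReplicas": 10}}
```

An operator could apply a helper like this to every HPA in the cluster to count how many objects rely on the feature.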

###### How can someone using this feature know that it is working for their instance?

<!--
@@ -605,13 +613,18 @@ and operation of this feature.
Recall that end users cannot usually observe component logs or access metrics.
-->

- [ ] Events
- Event Reason:
- [ ] API .status
- Condition name:
- Other field:
- [ ] Other (treat as last resort)
- Details:
- [x] Events
- Event Reason: `SuccessfulRescale`

The tolerance is applied to the ratio between the _current_ and _desired_ metric
values. Users can get both values using
[`kubectl describe`](https://github.com/kubernetes/kubernetes/blob/1b7a0591871772fbbc0fda430b3b73bc24c0e738/staging/src/k8s.io/kubectl/pkg/describe/describe.go#L4109)
and use them to verify that scaling events are triggered when their ratio is out
of tolerance.
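
As a rough illustration of the check described above, the sketch below reports whether the current-to-desired ratio has left the tolerance band. The function name and the 10% default are illustrative assumptions (the default mirrors the hard-coded value mentioned elsewhere in this KEP); this is not the controller's actual code.

```python
def out_of_tolerance(current: float, desired: float, tolerance: float = 0.10) -> bool:
    """Return True when current/desired deviates from 1.0 by more than `tolerance`."""
    if desired <= 0:
        raise ValueError("desired metric value must be positive")
    ratio = current / desired
    # A ratio within [1 - tolerance, 1 + tolerance] triggers no rescale.
    return abs(ratio - 1.0) > tolerance


# Values as a user might read them from `kubectl describe`:
# current 105 vs. desired 100 is a 5% deviation.
print(out_of_tolerance(105, 100))                  # False with the 10% default
print(out_of_tolerance(105, 100, tolerance=0.01))  # True with a 1% tolerance
```

A user who sets a 1% tolerance and sees a 5% deviation without a `SuccessfulRescale` event would have reason to investigate further, for instance whether the HPA is already at its replica bounds.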

We will update the controller-manager logs to help users understand the behavior
of the autoscaler. The data added to the logs will include the tolerance used
for each scaling decision.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

@@ -630,18 +643,21 @@ These goals will help you determine what you need to measure (SLIs) in the next
question.
-->

N/A.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

<!--
Pick one or more of these and delete the rest.
-->

- [ ] Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- [ ] Other (treat as last resort)
- Details:
This KEP is not expected to have any impact on SLIs/SLOs as it doesn't introduce
a new HPA behavior, but merely allows users to easily change the value of a
parameter that's otherwise difficult to update.

Standard HPA metrics (e.g.
`horizontal_pod_autoscaler_controller_metric_computation_duration_seconds`) can
be used to verify the HPA controller health.

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

@@ -650,6 +666,12 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co
implementation difficulties, etc.).
-->

Users may want to see a signal that autoscaling isn't happening because of the
tolerance, but this is not directly related to this KEP (this problem already
exists today with the hard-coded 10% tolerance), and taking this KEP as an
opportunity to improve the situation is difficult (see
[this thread](https://github.com/kubernetes/enhancements/pull/4954#discussion_r1857098884)).

### Dependencies

<!--
@@ -775,6 +797,8 @@ Are there any tests that were run/should be run to understand performance charac
and validate the declared limits?
-->

No.

### Troubleshooting

<!--
@@ -820,6 +844,8 @@ Major milestones might include:
- when the KEP was retired or superseded
-->

2025-01-21: KEP PR merged.

## Drawbacks

<!--
8 changes: 4 additions & 4 deletions keps/sig-autoscaling/4951-configurable-hpa-tolerance/kep.yaml
@@ -4,13 +4,13 @@ authors:
- "@pr00se"
- "@jm-franc"
owning-sig: sig-autoscaling
status: provisional
status: implementable
creation-date: 2024-11-05
reviewers:
- "@gjtempleton"
- "@raywainman"
approvers:
- TBD
- "@gjtempleton"

see-also:
- "/keps/sig-autoscaling/853-configurable-hpa-scale-velocity"
@@ -40,5 +40,5 @@ feature-gates:
disable-supported: true

# The following PRR answers are required at beta release
#metrics:
# - my_feature_metric
metrics:
- horizontal_pod_autoscaler_controller_metric_computation_duration_seconds