Commit 1538972

Snapshottable API server cache
1 parent 3fb4087 commit 1538972

2 files changed

+405
-0
lines changed

@@ -0,0 +1,378 @@
1+
# KEP-4988 Snapshottable API server cache
2+
3+
<!-- toc -->
4+
- [Release Signoff Checklist](#release-signoff-checklist)
5+
- [Summary](#summary)
6+
- [Motivation](#motivation)
7+
- [Goals](#goals)
8+
- [Non-Goals](#non-goals)
9+
- [Proposal](#proposal)
10+
- [Risks and Mitigations](#risks-and-mitigations)
11+
- [Client setting limit while not supporting pagination](#client-setting-limit-while-not-supporting-pagination)
12+
- [Memory overhead](#memory-overhead)
13+
- [Delegating slow pagination to etcd](#delegating-slow-pagination-to-etcd)
14+
- [Test Plan](#test-plan)
15+
- [Prerequisite testing updates](#prerequisite-testing-updates)
16+
- [Unit tests](#unit-tests)
17+
- [Integration tests](#integration-tests)
18+
- [e2e tests](#e2e-tests)
19+
- [Graduation Criteria](#graduation-criteria)
20+
- [Alpha](#alpha)
21+
- [Beta](#beta)
22+
- [GA](#ga)
23+
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
24+
- [Version Skew Strategy](#version-skew-strategy)
25+
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
26+
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
27+
- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
28+
- [Monitoring Requirements](#monitoring-requirements)
29+
- [Dependencies](#dependencies)
30+
- [Scalability](#scalability)
31+
- [Troubleshooting](#troubleshooting)
32+
- [Implementation History](#implementation-history)
33+
- [Drawbacks](#drawbacks)
34+
- [Alternatives](#alternatives)
35+
- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
36+
<!-- /toc -->
37+
38+
## Release Signoff Checklist
39+
40+
Items marked with (R) are required *prior to targeting to a milestone / release*.
41+
42+
- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
43+
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
44+
- [ ] (R) Design details are appropriately documented
45+
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
46+
- [ ] e2e Tests for all Beta API Operations (endpoints)
47+
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
48+
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
49+
- [ ] (R) Graduation criteria is in place
50+
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
51+
- [ ] (R) Production readiness review completed
52+
- [ ] (R) Production readiness review approved
53+
- [ ] "Implementation History" section is up-to-date for milestone
54+
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
55+
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
56+
57+
[kubernetes.io]: https://kubernetes.io/
58+
[kubernetes/enhancements]: https://git.k8s.io/enhancements
59+
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
60+
[kubernetes/website]: https://git.k8s.io/website
61+
62+
## Summary
63+
64+
The kube-apiserver's caching mechanism (watchcache) efficiently serves requests
65+
for the latest observed state. However, `LIST` requests for previous states,
66+
either via pagination or by specifying a `resourceVersion`, bypass the cache and
67+
are served directly from etcd. This significantly increases the performance cost,
68+
and in aggregate, can cause stability issues. This is especially pronounced when
69+
dealing with large resources, as transferring large data blobs through multiple
70+
systems can create significant memory pressure. This document proposes an
71+
enhancement to the kube-apiserver's caching layer to enable efficiently serving all
72+
`LIST` requests from the cache.
73+
74+
## Motivation
75+
76+
When the API server serves `LIST` requests directly from etcd, it introduces
77+
significant stability and reliability concerns:
78+
79+
* **Unpredictable Memory Pressure:** Retrieving data from etcd and constructing
80+
responses involves significant memory allocations on the API server.
81+
The volume of data retrieved from etcd can vary drastically depending on
82+
object sizes. This results in unpredictable memory pressure, making it difficult
83+
to provision resources effectively and increasing the risk of Out-of-Memory (OOM) errors.
84+
* **Ineffective API Priority and Fairness (APF) Throttling:** The API server's
85+
overload protection mechanism, API Priority and Fairness (APF), primarily
86+
throttles based on the *predicted cost* of a request, which is derived from
87+
factors like latency and object count. While these factors provide some
88+
indication of computational cost, they do not accurately reflect the memory
89+
footprint. Crucially, we lack visibility into the per-request memory allocations.
90+
Therefore, APF cannot effectively throttle requests based on actual memory usage,
91+
leaving the API server vulnerable to memory exhaustion.
92+
93+
These issues with serving data directly from etcd lead to unpredictable and volatile API server memory usage.
94+
95+
Remarkably, the API server already maintains all the necessary data in the watchcache.
96+
By enabling all `LIST` requests to be served from the watchcache, we can
97+
significantly reduce memory pressure and improve the effectiveness of APF throttling,
98+
leading to a more stable and reliable API server.
99+
100+
### Goals
101+
102+
- Reduce memory allocations by supporting all types of LIST requests from cache
103+
104+
### Non-Goals
105+
106+
- Change semantics of the `LIST` request
107+
- Support indexing when serving all types of requests.
108+
- Enforce that no client requests are served from etcd
109+
110+
## Proposal
111+
112+
Leveraging the recent rewrite of the watchcache storage layer to use a B-tree
113+
(https://github.com/kubernetes/kubernetes/pull/126754), we propose to utilize
114+
B-tree snapshots to serve the remaining types of `LIST` requests.
115+
116+
While we propose a mechanism that can serve all types of requests, we
117+
limit the enablement to pagination for now.
118+
119+
Mechanism:
120+
1. **Snapshot Creation:** When a watch event is received, the cacher will create
121+
a snapshot of the B-tree based cache using the efficient [Clone()] method.
122+
This creates a lazy copy, only duplicating the necessary tree structure, resulting in
123+
minimal overhead. The watch cache already stores the history of watch events, so
124+
the B-tree will contain pointers to memory already in use, with no need for actual copies.
125+
2. **Snapshot Storage:** The snapshot will be stored in a tree data structure,
126+
keyed by resourceVersion. A tree allows efficient lookup of the nextSmaller element,
127+
as resourceVersions are not contiguous.
128+
3. **Serving Subsequent Pages:** When a subsequent request with a continue token
129+
arrives, the API server will:
130+
- Extract the resourceVersion from the continue token.
131+
- Look up the nextSmaller snapshot and return a response based on it.
132+
- There are two edge cases relating to the requested resourceVersion:
133+
- It's smaller than any available snapshot, meaning it was cleaned up (see Snapshot Cleanup below).
134+
In that case we fall back to serving from etcd.
135+
- It's larger than the latest snapshot, meaning it's a future resourceVersion or
136+
the watch cache is behind. In that case we can execute a consistent read from etcd,
137+
to either confirm it is a future resourceVersion or learn that we can wait for the watch cache to catch up.
138+
4. **Snapshot Cleanup:** Snapshots will be subject to a Time-To-Live (TTL)
139+
mechanism, the same as watch events. We will reuse that process, which limits
140+
events to 10,000 and a 75s window (can be overridden by the request timeout).
141+
We also need to remember to purge the snapshots during cache re-initialization.
142+
143+
[Clone()]: https://pkg.go.dev/github.com/google/btree#BTree.Clone
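
The lookup and cleanup rules in the mechanism above can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the kube-apiserver implementation (which is written in Go). The names `SnapshotStore`, `Source`, `lookup`, `cleanup`, and `reset` are hypothetical. The sketch assumes that nextSmaller means the snapshot with the largest resourceVersion at or below the requested one, and that an empty store falls back to etcd. It uses a sorted list with `bisect` in place of a tree.

```python
import bisect
from dataclasses import dataclass, field
from enum import Enum


class Source(Enum):
    SNAPSHOT = "snapshot"           # serve from a cached snapshot
    ETCD = "etcd"                   # snapshot was cleaned up: fall back to etcd
    CONSISTENT_READ = "consistent"  # requested RV is ahead of the newest snapshot


@dataclass
class SnapshotStore:
    """Hypothetical store of cache snapshots keyed by resourceVersion."""
    ttl_seconds: float = 75.0
    max_snapshots: int = 10_000
    _rvs: list = field(default_factory=list)      # sorted resourceVersions
    _entries: dict = field(default_factory=dict)  # rv -> (created_at, snapshot)

    def add(self, rv, snapshot, now):
        # Watch events arrive in increasing resourceVersion order.
        assert not self._rvs or rv > self._rvs[-1]
        self._rvs.append(rv)
        self._entries[rv] = (now, snapshot)
        self.cleanup(now)

    def cleanup(self, now):
        # Drop the oldest snapshots while over the count limit or past the TTL.
        # list.pop(0) is O(n); a real tree would remove the minimum cheaply.
        while self._rvs and (
            len(self._rvs) > self.max_snapshots
            or now - self._entries[self._rvs[0]][0] > self.ttl_seconds
        ):
            del self._entries[self._rvs.pop(0)]

    def reset(self):
        # Purge all snapshots, e.g. during cache re-initialization.
        self._rvs.clear()
        self._entries.clear()

    def lookup(self, rv):
        # Edge case 1: older than every snapshot (or nothing stored).
        if not self._rvs or rv < self._rvs[0]:
            return Source.ETCD, None
        # Edge case 2: newer than the latest snapshot.
        if rv > self._rvs[-1]:
            return Source.CONSISTENT_READ, None
        # Common case: largest stored resourceVersion <= rv.
        i = bisect.bisect_right(self._rvs, rv) - 1
        return Source.SNAPSHOT, self._entries[self._rvs[i]][1]
```

For example, with snapshots stored at resourceVersions 100 and 105, `lookup(103)` resolves to the snapshot taken at 100, because no event changed the cached state between 100 and 105.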
144+
145+
### Risks and Mitigations
146+
147+
#### Client setting limit while not supporting pagination
148+
149+
#### Memory overhead
150+
151+
The B-tree only stores pointers to the actual objects, not the objects themselves.
152+
The objects are already cached to serve watch, so it should only add a small
153+
overhead for the B-tree structure itself, which is negligible compared to the
154+
size of the cached objects.
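
The reference-sharing argument can be illustrated with a small sketch. It is written in Python for brevity and is not the Go implementation. `CachedObject` and `take_snapshot` are hypothetical names, and the shallow `dict` copy is only a stand-in for the B-tree clone: it copies every reference eagerly, whereas the lazy clone described in the proposal duplicates tree structure only as needed. What the sketch shows is that a snapshot holds references to objects the cache already owns, so the payloads are never duplicated.

```python
class CachedObject:
    """Stand-in for an API object already held by the watch cache."""

    def __init__(self, name, payload):
        self.name = name
        self.payload = payload


def take_snapshot(index):
    # Copies only the key -> reference mapping, never the objects themselves.
    return dict(index)


original = CachedObject("a", b"x" * 1_000_000)
live = {"pods/a": original}

snapshot = take_snapshot(live)

# A later update swaps the reference in the live index only; the snapshot
# keeps pointing at the original object, which is still the single copy.
live["pods/a"] = CachedObject("a", b"y" * 1_000_000)
```

After the update, `snapshot["pods/a"]` is still the very same object as `original`, so the only extra memory attributable to the snapshot is the index structure.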
155+
156+
#### Delegating slow pagination to etcd
157+
158+
To avoid breaking users, the proposal still allows pagination requests older than
159+
75s to pass to etcd. This can have a huge performance impact if the resource is
160+
large. However, this still seems safer than:
161+
* Increasing the watch cache size 4 times to match etcd.
162+
* Blocking requests older than 75s.
163+
164+
### Test Plan
165+
166+
[x] I/we understand the owners of the involved components may require updates to
167+
existing tests to make this code solid enough prior to committing the changes necessary
168+
to implement this enhancement.
169+
170+
##### Prerequisite testing updates
171+
172+
- Ensure pagination is well tested
173+
174+
##### Unit tests
175+
176+
- `k8s.io/apiserver/pkg/storage/cacher`: `2024-12-12` - `<test coverage>`
177+
178+
##### Integration tests
179+
180+
<!--
181+
Integration tests are contained in k8s.io/kubernetes/test/integration.
182+
Integration tests allow control of the configuration parameters used to start the binaries under test.
183+
This is different from e2e tests which do not allow configuration of parameters.
184+
Doing this allows testing non-default options and multiple different and potentially conflicting command line options.
185+
-->
186+
187+
<!--
188+
This question should be filled when targeting a release.
189+
For Alpha, describe what tests will be added to ensure proper quality of the enhancement.
190+
191+
For Beta and GA, add links to added tests together with links to k8s-triage for those tests:
192+
https://storage.googleapis.com/k8s-triage/index.html
193+
-->
194+
195+
- <test>: <link to test coverage>
196+
197+
##### e2e tests
198+
199+
Given we're only modifying kube-apiserver, integration tests are sufficient.
200+
201+
### Graduation Criteria
202+
203+
#### Alpha
204+
205+
- Feature implemented behind a feature gate
206+
- Feature is covered with unit and integration tests
207+
208+
#### Beta
209+
210+
- Feature is enabled by default
211+
212+
#### GA
213+
214+
TODO
215+
216+
### Upgrade / Downgrade Strategy
217+
218+
The feature is purely in-memory, so upgrade/downgrade doesn't require any
219+
specific considerations.
220+
221+
### Version Skew Strategy
222+
223+
The feature touches only kube-apiserver, and coordination between individual
224+
instances is not needed.
225+
226+
## Production Readiness Review Questionnaire
227+
228+
### Feature Enablement and Rollback
229+
230+
###### How can this feature be enabled / disabled in a live cluster?
231+
232+
- [X] Feature gate (also fill in values in `kep.yaml`)
233+
- Feature gate name: PaginationFromCache
234+
- Components depending on the feature gate: kube-apiserver
235+
- [ ] Other
236+
- Describe the mechanism:
237+
- Will enabling / disabling the feature require downtime of the control
238+
plane?
239+
- Will enabling / disabling the feature require downtime or reprovisioning
240+
of a node?
241+
242+
###### Does enabling the feature change any default behavior?
243+
244+
Yes, paginated `LIST` requests to kube-apiserver will no longer require a request to etcd.
245+
246+
###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
247+
248+
Yes, by disabling the feature gate in kube-apiserver.
249+
250+
###### What happens if we reenable the feature if it was previously rolled back?
251+
252+
The feature is purely in-memory, so it will work just as if it were enabled for the first time.
253+
254+
###### Are there any tests for feature enablement/disablement?
255+
256+
The feature is purely in-memory, so feature enablement/disablement tests will not provide
257+
additional value on top of feature tests themselves.
258+
259+
### Rollout, Upgrade and Rollback Planning
260+
261+
###### How can a rollout or rollback fail? Can it impact already running workloads?
262+
263+
264+
###### What specific metrics should inform a rollback?
265+
266+
<!--
267+
What signals should users be paying attention to when the feature is young
268+
that might indicate a serious problem?
269+
-->
270+
271+
###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
272+
273+
<!--
274+
Describe manual testing that was done and the outcomes.
275+
Longer term, we may want to require automated upgrade/rollback tests, but we
276+
are missing a bunch of machinery and tooling and can't do that now.
277+
-->
278+
279+
###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
280+
281+
No
282+
283+
### Monitoring Requirements
284+
285+
###### How can an operator determine if the feature is in use by workloads?
286+
287+
This is a control-plane feature, not a workload feature.
288+
289+
###### How can someone using this feature know that it is working for their instance?
290+
291+
This is a control-plane feature, not a workload feature.
292+
293+
###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
294+
295+
[API call latency SLO](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md)
296+
297+
###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
298+
299+
[API call latency SLI](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md)
300+
301+
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
302+
303+
### Dependencies
304+
305+
###### Does this feature depend on any specific services running in the cluster?
306+
307+
No
308+
309+
### Scalability
310+
311+
###### Will enabling / using this feature result in any new API calls?
312+
313+
No
314+
315+
###### Will enabling / using this feature result in introducing new API types?
316+
317+
No
318+
319+
###### Will enabling / using this feature result in any new calls to the cloud provider?
320+
321+
No
322+
323+
###### Will enabling / using this feature result in increasing size or count of the existing API objects?
324+
325+
No
326+
327+
###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
328+
329+
No, we expect the [API call latency SLI](https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md) to improve.
330+
331+
332+
###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
333+
334+
Overall, we expect that the cost of serving pagination will go down; however, caching
335+
might increase RAM usage if the client reads the first page but never
336+
paginates. We expect that most controllers will read all pages.
337+
338+
###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
339+
340+
No
341+
342+
### Troubleshooting
343+
344+
###### How does this feature react if the API server and/or etcd is unavailable?
345+
346+
This is a kube-apiserver feature; it does not work if kube-apiserver is unavailable.
347+
348+
###### What are other known failure modes?
349+
350+
No
351+
352+
###### What steps should be taken if SLOs are not being met to determine the problem?
353+
354+
Disable the feature gate.
355+
356+
## Implementation History
357+
358+
## Drawbacks
359+
360+
<!--
361+
Why should this KEP _not_ be implemented?
362+
-->
363+
364+
## Alternatives
365+
366+
<!--
367+
What other approaches did you consider, and why did you rule them out? These do
368+
not need to be as detailed as the proposal, but should include enough
369+
information to express the idea and why it was not acceptable.
370+
-->
371+
372+
## Infrastructure Needed (Optional)
373+
374+
<!--
375+
Use this section if you need things from the project/SIG. Examples include a
376+
new subproject, repos requested, or GitHub details. Listing these here allows a
377+
SIG to get the process for these resources started right away.
378+
-->
