Commit 96db93a

VM higher density: addressing review comments

Signed-off-by: Igor Bezukh <[email protected]>

1 parent 8ee2bc3 commit 96db93a

File tree

1 file changed: +50 -76 lines changed
enhancements/kubelet/virtualization-higher-workload-density.md

+50 -76
@@ -25,27 +25,27 @@ status: implementable
## Summary

Fit more workloads onto a given node - achieve a higher workload
-density - by overcommitting it's memory resources. Due to timeline
+density - by overcommitting its memory resources. Due to timeline
needs a multi-phased approach is considered.

## Motivation

Today, OpenShift Virtualization is reserving memory (`requests.memory`)
-according to the needs of the virtual machine and it's infrastructure
+according to the needs of the virtual machine and its infrastructure
(the VM related pod). However, usually an application within the virtual
machine does not utilize _all_ the memory _all_ the time. Instead,
only _sometimes_ there are memory spikes within the virtual machine.
And usually this is also true for the infrastructure part (the pod) of a VM:
Not all the memory is used all the time. Because of this assumption, in
the following we are not differentiating between the guest of a VM and the
-infrastructure of a VM, instead we are just speaking colectively of a VM.
+infrastructure of a VM; instead we are just speaking collectively of a VM.

Now - extrapolating this behavior from one to all virtual machines on a
-given node leads to the observation that _on average_ there is no memory
-ressure and often a rather low memory utilization - despite the fact that
-much memory has been reserved.
-Reserved but underutilized hardware resources - like memory in this case -
-are a cost factor to cluster owners.
+given node leads to the observation that _on average_ much of the reserved memory is not utilized.
+Moreover, the memory pages that are utilized can be classified into frequently used (a.k.a. the working set) and inactive memory pages.
+In case of memory pressure the inactive memory pages can be swapped out.
+From the cluster owner's perspective, reserved but underutilized hardware resources - like memory in this case -
+are a cost factor.

This proposal is about increasing the virtual machine density and thus
memory utilization per node, in order to reduce the cost per virtual machine.
@@ -77,10 +77,10 @@ memory utilization per node, in order to reduce the cost per virtual machine.

### Non-Goals

-* Complete life-cycling of the WASP Agent. We are not intending to write
+* Complete life-cycling of the [WASP](https://github.com/openshift-virtualization/wasp-agent) Agent. We are not intending to write
  an Operator for memory over-commit for two reasons:
  * [Kubernetes SWAP] is close, writing a fully fledged operator seems
-    to be no good use of resources
+    to be no good use of developer resources
  * To simplify the transition from WASP to [Kubernetes SWAP]
* Allow swapping for VM pods only. We don't want to diverge from the upstream approach
  since [Kubernetes SWAP] allows swapping for all pods associated with the burstable QoS class.
@@ -108,34 +108,31 @@ We expect to mitigate the following situations

#### Scope

-Memory over-commitment will be limited to
-virtual machines running in the burstable QoS class.
-Virtual machines in the guaranteed QoS classes are not getting over
-committed due to alignment with upstream Kubernetes. Virtual machines
-will never be in the best-effort QoS because memory requests are
-always set.
+Following the upstream Kubernetes approach, every workload marked as burstable QoS will be able to swap.
+There is no differentiation between the types of workload: a regular pod or a VM.
+With that being said, swapping will be allowed by WASP (and later on by kube swap) for pods
+with the Burstable QoS class.

-Swapping will be allowed by WASP (and later on by kube swap) for all pods
-that are associated with the burstable QoS. Thus, over-commited VM stability
-can be achieved during memory spikes by swapping out "cold" memory pages.
+Among the VM workloads, VMs with a high-performance configuration (NUMA affinity, CPU affinity, etc.) cannot be overcommitted.
+Also, VMs with the best-effort QoS class don't exist, because requesting memory is mandatory in the VM spec.
+For VMs of the Burstable QoS class, over-committed VM stability can be achieved during memory spikes by swapping out "cold" memory pages.

#### Timeline & Phases

-| Phase | Target |
-|--------------------------------------------------------------------|--------------|
-| Phase 1 - Out-Of-Tree SWAP with WASP | 2024 |
-| Phase 2 - Transition to Kubernetes SWAP. Limited to CNV-only users | mid-end 2025 |
-| Phase 3 - Kubernetes SWAP for all Openshift users | TBD |
+| Phase | Target | WASP Graduation | Kube swap Graduation |
+|--------------------------------------------------------------------|--------------|-----------------|----------------------|
+| Phase 1 - Out-Of-Tree SWAP with WASP | 2024 | Tech-Preview | Beta |
+| Phase 2 - Transition to Kubernetes SWAP. Limited to CNV-only users | mid-end 2025 | GA | GA |
+| Phase 3 - Kubernetes SWAP for all OpenShift users | TBD | GA | GA |

Because [Kubernetes SWAP] is currently in Beta and is only expected to GA within
Kubernetes releases 1.33-1.35 (discussion about its GA criteria is still ongoing),
this proposal is taking a three-phased approach in order to meet the timeline requirements.

* **Phase 1** - OpenShift Virtualization will provide an out-of-tree
-  solution to enable higher workload density and swap-based eviction mechanisms.
+  solution (WASP) to enable higher workload density and swap-based eviction mechanisms.
* **Phase 2** - OpenShift Virtualization will transition to [Kubernetes SWAP] (in-tree).
-  OpenShift will allow using SWAP only for CNV users, that is,
-  whenever OpenShift Virtualization is installed on the cluster.
+  OpenShift will [allow](#swap-ga-for-cnv-users-only) SWAP to be configured only if OpenShift Virtualization is installed on the cluster.
  In this phase, WASP will be dropped in favor of GAed Kubernetes mechanisms.
* **Phase 3** - OpenShift will GA SWAP for every user, even if OpenShift Virtualization
  is not installed on the cluster.
@@ -160,7 +157,7 @@ virtual machine in a cluster.

a. The cluster admin is adding the `failOnSwap=false` flag to the
   kubelet configuration via a `KubeletConfig` CR, in order to ensure
-   that the kubelet will start once swap has been rolled out.
+   that the kubelets will start once swap has been rolled out.
a. The cluster admin is calculating the amount of swap space to
   provision based on the amount of physical RAM and the overcommitment
   ratio.
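A minimal sketch of the `KubeletConfig` step above; the CR name and pool-selector label are assumptions, and note that the upstream kubelet configuration field is spelled `failSwapOn`:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: enable-swap-on-workers        # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""  # assumed pool label
  kubeletConfig:
    failSwapOn: false                 # lets kubelets start on nodes where swap is enabled
```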
@@ -171,16 +168,16 @@ virtual machine in a cluster.

4. The cluster admin is configuring OpenShift Virtualization for higher
   workload density via

-   a. the OpenShift Virtualization Console "Settings" page
-   b. or `HCO` API
+   a. the OpenShift Virtualization Console "Settings" page
+   b. or the [`HCO` API](https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md#configure-higher-workload-density)

The cluster is now set up for higher workload density.

In phase 3, deploying the WASP agent will not be needed.

#### Workflow: Leveraging higher workload density

-1. The VM Owner is creating a regular virtual machine and is launching it.
+1. The VM Owner is creating a regular virtual machine and is launching it. The VM owner must not specify memory requests in the VM spec, only the guest memory size.
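The workflow above can be sketched with a hypothetical KubeVirt `VirtualMachine` (name and sizes are placeholders): only `domain.memory.guest` is set, and no memory request appears in the spec.

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: higher-density-vm       # hypothetical name
spec:
  runStrategy: Always
  template:
    spec:
      domain:
        memory:
          guest: 4Gi            # guest memory size - the only memory field set
        # no resources.requests.memory here - the pod's memory request is
        # derived, which lets the node-level overcommit take effect
        devices: {}
```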

### API Extensions

@@ -229,7 +226,7 @@ The design is driven by the following guiding principles:
An OCI Hook to enable swap by setting the container's cgroup
`memory.swap.max=max`.

-* **Technology Preview**
+* **Tech Preview**
  * Uses `UnlimitedSwap`.
* **General Availability**
  * Limited to burstable QoS class pods.
@@ -242,7 +239,7 @@ For more info, refer to the upstream documentation on how to calculate
###### Provisioning swap

Provisioning of swap is left to the cluster administrator.
-The hook itself is not making any assumption where the swap is located.
+The OCI hook itself is not making any assumptions about where the swap is located.

As long as there is no additional tooling available, the recommendation
is to use `MachineConfig` objects to provision swap on nodes.
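One possible `MachineConfig` sketch for file-based swap provisioning; the unit name, file path, and size are assumptions for illustration, not a recommendation from this proposal:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 90-worker-swap              # hypothetical name
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.4.0
    systemd:
      units:
        - name: swap-provision.service
          enabled: true
          contents: |
            [Unit]
            Description=Provision and enable a file-based swap
            [Service]
            Type=oneshot
            # create the 8G swap file once, then enable it on every boot;
            # the size must match the overcommit calculation for the node
            ExecStart=/bin/sh -c 'test -f /var/swapfile || (fallocate -l 8G /var/swapfile && chmod 600 /var/swapfile && mkswap /var/swapfile); swapon /var/swapfile'
            RemainAfterExit=yes
            [Install]
            WantedBy=multi-user.target
```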
@@ -263,9 +260,9 @@ Without system services such as `kubelet` or `crio`, any container will
not be able to run well.

Thus, in order to protect the `system.slice` and ensure that the node's
-infrastructure health is prioritized over workload health, the agent is
+infrastructure health is prioritized over workload health, the WASP agent is
reconfiguring the `system.slice` and setting `memory.swap.max=0` to
-prevent any system service within from swapping.
+prevent any system service from swapping.

###### Preventing SWAP traffic I/O saturation

@@ -274,7 +271,7 @@ potentially preventing other processes from performing I/O.

In order to ensure that system services are able to perform I/O, the
agent is configuring `io.latency=50` for the `system.slice` in order
-to ensure that it's I/O requests are prioritized over any other slice.
+to ensure that its I/O requests are prioritized over any other slice.
This is because, by default, no other slice is configured to have
`io.latency` set.

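The two `system.slice` protections described above amount to two cgroup v2 writes. This dry-run sketch only prints the intended writes instead of performing them; the block-device `8:0` in the `io.latency` value is a placeholder for the root disk's major:minor numbers.

```shell
#!/bin/sh
# Dry-run sketch (illustrative, not the WASP agent's actual code) of the
# cgroup v2 settings applied to system.slice.
set -eu

SLICE=/sys/fs/cgroup/system.slice

# 1. Keep system services out of swap entirely.
swap_knob="$SLICE/memory.swap.max"
swap_val="0"

# 2. Prioritize system.slice I/O with a 50ms latency target
#    ("8:0" is a placeholder device major:minor).
io_knob="$SLICE/io.latency"
io_val="8:0 target=50"

printf 'would write "%s" to %s\n' "$swap_val" "$swap_knob"
printf 'would write "%s" to %s\n' "$io_val" "$io_knob"
```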
@@ -393,8 +390,19 @@ None.

## Test Plan

-Add e2e tests for the WASP agent repository for regression testing against
-OpenShift.
+The cluster under test has worker nodes with identical amounts of RAM and disk size.
+Memory overcommit is configured to 200%. There should be enough free space on the disk
+to create the required file-based swap, i.e. 8G of RAM and 200% overcommit require
+at least 8G of free space on the root disk.
+
+* Fill the cluster with dormant VMs until each worker node is overcommitted.
+* Test the following scenarios:
+  * Node drain
+  * VM live-migration
+  * Cluster upgrade
+* The expectation is that both the nodes and the workloads remain stable.
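The sizing rule in the test plan can be sketched as follows; the helper and example values are illustrative, not part of the proposal:

```shell
#!/bin/sh
# Illustrative sizing helper: with memory overcommit, the swap file must
# cover the committed memory that exceeds physical RAM, i.e.
# RAM * (overcommit% - 100) / 100.
set -eu

ram_gib=8           # physical RAM of the worker node (example value)
overcommit_pct=200  # configured memory overcommit percentage

swap_gib=$(( ram_gib * (overcommit_pct - 100) / 100 ))
echo "provision at least ${swap_gib}G of swap space on the root disk"
# prints: provision at least 8G of swap space on the root disk
```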

## Graduation Criteria

@@ -433,48 +441,14 @@ object and the `openshift-cnv` namespace exist.

## Upgrade / Downgrade Strategy

-If applicable, how will the component be upgraded and downgraded? Make sure this
-is in the test plan.
-
-Consider the following in developing an upgrade/downgrade strategy for this
-enhancement:
-- What changes (in invocations, configurations, API use, etc.) is an existing
-  cluster required to make on upgrade in order to keep previous behavior?
-- What changes (in invocations, configurations, API use, etc.) is an existing
-  cluster required to make on upgrade in order to make use of the enhancement?
+On the OpenShift level no specific action is needed, since all of the APIs used
+by the WASP agent deliverables are stable (DaemonSet, OCI Hook, MachineConfig, KubeletConfig).

Upgrade expectations:
-- Each component should remain available for user requests and
-  workloads during upgrades. Ensure the components leverage best practices in handling [voluntary
-  disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to
-  this should be identified and discussed here.
-- Micro version upgrades - users should be able to skip forward versions within a
-  minor release stream without being required to pass through intermediate
-  versions - i.e. `x.y.N->x.y.N+2` should work without requiring `x.y.N->x.y.N+1`
-  as an intermediate step.
-- Minor version upgrades - you only need to support `x.N->x.N+1` upgrade
-  steps. So, for example, it is acceptable to require a user running 4.3 to
-  upgrade to 4.5 with a `4.3->4.4` step followed by a `4.4->4.5` step.
-- While an upgrade is in progress, new component versions should
-  continue to operate correctly in concert with older component
-  versions (aka "version skew"). For example, if a node is down, and
-  an operator is rolling out a daemonset, the old and new daemonset
-  pods must continue to work correctly even while the cluster remains
-  in this partially upgraded state for some time.
+- WASP from CNV-X.Y must work with OCP-X.Y as well as OCP-X.(Y+1)

Downgrade expectations:
-- If an `N->N+1` upgrade fails mid-way through, or if the `N+1` cluster is
-  misbehaving, it should be possible for the user to rollback to `N`. It is
-  acceptable to require some documented manual steps in order to fully restore
-  the downgraded cluster to its previous state. Examples of acceptable steps
-  include:
-  - Deleting any CVO-managed resources added by the new version. The
-    CVO does not currently delete resources that no longer exist in
-    the target version.
-
-* On the cgroup level WASP agent supports only cgroups v2
-* On OpenShift level no specific action needed, since all of the APIs used
-  by the WASP agent deliverables are stable (DaemonSet, OCI Hook, MachineConfig, KubeletConfig)
+- WASP from CNV-X.Y must work with OCP-X.Y as well as OCP-X.(Y-1)

## Version Skew Strategy
