
Feature: OpenShift Virtualization Higher Density #1679

Open
iholder101 wants to merge 22 commits into master from cnv-swap-2itr

Conversation

iholder101

A feature describing OpenShift Virtualization's path to higher density, based on:

- phase 1: WASP
- phase 2: Kubernetes swap

This is a replacement for #1630.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 17, 2024

openshift-ci bot commented Sep 17, 2024

Hi @iholder101. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 17, 2024
@iholder101 iholder101 marked this pull request as ready for review September 17, 2024 11:02
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 17, 2024
@iholder101
Author

/cc @enp0s3 @Barakmor1 @fabiand @mrunalp @haircommander @kannon92


#### Timeline

* GA higher workload density in OpenShift Virtualization in 2024
haircommander (Member)

is this timeline still accurate?

Author

Yes, I believe it is.
Please keep me honest @stu-gott.

stu-gott

Hi @haircommander, I've checked the Jira planning and we are on track, so yes, this is indeed accurate.


@haircommander The timeline bullet "GA higher workload density in OpenShift Virtualization in 2024" relates to phase 1 only. Maybe we should add that in brackets.

Author

Hey @haircommander!
@enp0s3 and I reworked the PR. Can you please have another look?


enp0s3 commented Sep 20, 2024

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 20, 2024
A feature describing CNV's path to higher density based on

- phase 1: wasp
- phase 2: kube swap

Signed-off-by: Itamar Holder <[email protected]>
dankenigsberg (Contributor) left a comment

Thanks for this proposal for a much-requested feature. I think it is high time to have it merged.

## Summary

Fit more workloads onto a given node - achieve a higher workload
density - by overcommitting it's memory resources. Due to timeline
Contributor

Suggested change:
- density - by overcommitting it's memory resources. Due to timeline
+ density - by overcommitting its memory resources. Due to timeline


Done

## Motivation

Today, OpenShift Virtualization is reserving memory (`requests.memory`)
according to the needs of the virtual machine and it's infrastructure
Contributor

Suggested change:
- according to the needs of the virtual machine and it's infrastructure
+ according to the needs of the virtual machine and its infrastructure


Done

Comment on lines 44 to 46
given node leads to the observation that _on average_ there is no memory
ressure and often a rather low memory utilization - despite the fact that
much memory has been reserved.
Contributor

Suggested change:
- given node leads to the observation that _on average_ there is no memory
- ressure and often a rather low memory utilization - despite the fact that
- much memory has been reserved.
+ given node leads to the observation that _on average_ much of the reserved memory is not utilized.

I think it is not precise to say there is no pressure. There is. But we can reduce it, because the memory causing the pressure is not used and can be swapped out.


I've re-phrased the section


### Non-Goals

* Complete life-cycling of the WASP Agent. We are not intending to write
Contributor

This is the first time WASP agent is mentioned. Please add a URL.


Done

* Complete life-cycling of the WASP Agent. We are not intending to write
an Operator for memory over commit for two reasons:
* [Kubernetes SWAP] is close, writing a fully fledged operator seems
to be no good use of resources
Contributor

Suggested change:
- to be no good use of resources
+ to be no good use of developer resources


Done

## Test Plan

Add e2e tests for the WASP agent repository for regression testing against
OpenShift.
Contributor

I think that we should include here a bit more details about how we are (already) testing it. Most importantly: configure 200% over-commitment, fill up the cluster with dormant VMs and verify that the cluster is responsive and survives upgrade.


Done

Comment on lines 439 to 463
Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade in order to keep previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade in order to make use of the enhancement?

Upgrade expectations:
- Each component should remain available for user requests and
workloads during upgrades. Ensure the components leverage best practices in handling [voluntary
disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to
this should be identified and discussed here.
- Micro version upgrades - users should be able to skip forward versions within a
minor release stream without being required to pass through intermediate
versions - i.e. `x.y.N->x.y.N+2` should work without requiring `x.y.N->x.y.N+1`
as an intermediate step.
- Minor version upgrades - you only need to support `x.N->x.N+1` upgrade
steps. So, for example, it is acceptable to require a user running 4.3 to
upgrade to 4.5 with a `4.3->4.4` step followed by a `4.4->4.5` step.
- While an upgrade is in progress, new component versions should
continue to operate correctly in concert with older component
versions (aka "version skew"). For example, if a node is down, and
an operator is rolling out a daemonset, the old and new daemonset
pods must continue to work correctly even while the cluster remains
in this partially upgraded state for some time.
Contributor

this seems like generic content. should we not replace it with something specific, or drop it?


Done

Comment on lines 481 to 491
How will the component handle version skew with other components?
What are the guarantees? Make sure this is in the test plan.

Consider the following in developing a version skew strategy for this
enhancement:
- During an upgrade, we will always have skew among components, how will this impact your work?
- Does this enhancement involve coordinating behavior in the control plane and
in the kubelet? How does an n-2 kubelet without this feature available behave
when this feature is used?
- Will any other components on the node change? For example, changes to CSI, CRI
or CNI may require updating that component before the kubelet.
Contributor

WASP from CNV-X.Y must work with OCP-X.Y as well as OCP-X.(Y+1)



## Operational Aspects of API Extensions

None
Contributor

I think that this is the place to discuss the fact that all workers have to have the same memory size and the same disk topology, and that deploying and upgrading WASP is a manual step.


Done

Comment on lines 539 to 547
Examples:
- The mutating admission webhook "xyz" has FailPolicy=Ignore and hence
will not block the creation or updates on objects when it fails. When the
webhook comes back online, there is a controller reconciling all objects, applying
labels that were not applied during admission webhook downtime.
- Namespaces deletion will not delete all objects in etcd, leading to zombie
objects when another namespace with the same name is created.

TBD
Contributor

Let us replace this generic content.


Done

@enp0s3 enp0s3 force-pushed the cnv-swap-2itr branch 2 times, most recently from 26e4e09 to 96db93a Compare December 29, 2024 09:06
|--------------------------------------------------------------------|--------------|-----------------|----------------------|
| Phase 1 - Out-Of-Tree SWAP with WASP | 2024 | Tech-Preview | Beta |
| Phase 2 - Transition to Kubernetes SWAP. Limited to CNV-only users | mid-end 2025 | GA | GA |
| Phase 3 - Kubernetes SWAP for all Openshift users | TBD | GA | GA |
Contributor

Suggested change:
- | Phase 3 - Kubernetes SWAP for all Openshift users | TBD | GA | GA |
+ | Phase 3 - Kubernetes SWAP for all OpenShift users | TBD | GA | GA |


Done

* **General Availability**
* Limited to burstable QoS class pods.
* Uses `LimitedSwap`.
* Limited to non-high-priority pods.
Contributor

* Starts containers with `memory.swap.max=max`, as in Tech Preview.
* Sets `memory.swap.max` according to the container request. For more info, refer to the upstream documentation on how to calculate limited swap.


Done


For more info, refer to the upstream documentation on how to calculate
[limited swap](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2400-node-swap#steps-to-calculate-swap-limit).
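
For reference, the KEP linked above derives a container's swap limit proportionally from its memory request. Informally (see the KEP for the authoritative formula and edge cases):

$$
\text{containerSwapLimit} = \frac{\text{containerMemoryRequest}}{\text{nodeMemoryCapacity}} \times \text{nodeSwapCapacity}
$$

So, for example, a container requesting a quarter of the node's memory is allowed roughly a quarter of the node's swap space.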

Contributor

I would benefit from seeing a diagram (or a link to a diagram) showing the execution order of the CRI, OCI hook and Wasp update.

I would also state clearly that we have a race condition here: container binary may allocate memory before WASP sets a limited value.


@dankenigsberg Added both.

@enp0s3 enp0s3 force-pushed the cnv-swap-2itr branch 8 times, most recently from 421cb05 to 5ee0ec2 Compare January 6, 2025 21:12
#### Hypershift / Hosted Control Planes

The `MachineConfig` based swap provisioning will not work, as HCP does
not provide the `MachineConfig` APIs.
Contributor

I believe there's a way for HCP to deploy rhcos with needed changes, but you can leave this as it is for now.

#### Single-node Deployments or MicroShift

Single-node and MicroShift deployments are out of scope of this
proposal.
Contributor

Please add a word about why: we don't support swapping on control-plane nodes. I'd also be explicit about not supporting compact clusters for this reason.


Done

@enp0s3 enp0s3 force-pushed the cnv-swap-2itr branch 2 times, most recently from 59ce661 to ea40818 Compare January 19, 2025 14:36

enp0s3 commented Jan 19, 2025

@haircommander @mrunalp Hi, can you please have another review?

### Non-Goals

* Complete life-cycling of the [WASP](https://github.com/OpenShift-virtualization/wasp-agent) Agent. We are not intending to write
an Operator for memory over commit for two reasons:
Member

is this still true?


Updated.

Comment on lines 135 to 137
* **Phase 1** - OpenShift Virtualization will provide an out-of-tree
solution (WASP) to enable higher workload density and swap-based eviction mechanisms.
* **Phase 2** - WASP will include swap-based eviction mechanisms.
Member

can you describe that these have been done (we're in phase 2, right?)


Done

In this phase, WASP deprecation will start.
* **Phase 4** - OpenShift will GA SWAP for every user, even if OpenShift Virtualization
is not installed on the cluster. In this phase WASP will be removed, machine-level configuration
will be managed by Machine Config Operator.
Member

Swap Operator?


Fixed


### API Extensions

#### Phase 3
Member

this phase glosses over some other MC pieces that are needed (to actually provision swap). that's part of the motivation of doing SwapOperator: to couple the actual swap provisioning with pointing the kubelet to use swap


@haircommander Good point! Let me then suggest the following: since in phase 3 kube swap will be GAed but wasp-agent will still be there (deprecated, not removed), the user doesn't actually need to move to kube swap right away. This transition can happen in phase 4 via the swap operator (and that's even better: it is less error-prone).
Therefore the transition to phase 3 can be a no-op from the user's perspective.

More info about kubelet swap API can be found [here](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md#api-changes)
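
As an illustrative sketch only (the CR name and selector are hypothetical, and some versions may additionally require the NodeSwap feature gate), pointing the kubelet at swap on OpenShift worker nodes could look like this:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: kubelet-swap                     # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    failSwapOn: false                    # let the kubelet start on nodes with swap enabled
    memorySwap:
      swapBehavior: LimitedSwap          # see the NodeSwap KEP linked above
```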

#### Phase 4
In previous phases the machine-level confgiuration was deployed manually using machine configs. The transition to phase 4
Member

spellcheck: configuration*


Done


#### Phase 4
In previous phases the machine-level confgiuration was deployed manually using machine configs. The transition to phase 4
will require re-deployment of the configuration by the operator.
Member

"the operator" this is not very clear. I think we should spell out what will be taken over by the swap operator.

Member

if there are open questions, call them out


I've added content to the open question section regarding the swap operator

### Topology Considerations

#### Hosted Control Planes: Hosting cluster with OCP Virt
No special consideration required since it's identical to the regular cluster case
Member

you're not able to use MCs in HCP AFAIU. it's also not clear once we get to SwapOperator world how exactly that'd work. Probably need to extend the HostedCluster https://github.com/openshift/hypershift/blob/d03bec5285b9dcca819686037a75027e514c6d64/api/hypershift/v1beta1/hostedcluster_types.go#L1617

Member

oh I see nevermind, I got the cases wrong


#### Hosted Control Planes: Hosted cluster with OCP Virt
Since nested virt is not supported, the only topology that is supported is when the data plane resides on a bare metal.
In that case the abovementioned `KubeletCofnig` and `MachineConfig` should be injected into the `NodePool` CR.
haircommander (Member), Jan 22, 2025

KubeletConfig*


Done
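
To illustrate the NodePool injection discussed above (a sketch assuming HyperShift's `spec.config` mechanism; the ConfigMap names are hypothetical), the `MachineConfig` and `KubeletConfig` would be wrapped in ConfigMaps in the NodePool's namespace and referenced from the NodePool:

```yaml
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: baremetal-workers                # hypothetical
  namespace: clusters
spec:
  # Each entry references a ConfigMap in the NodePool's namespace whose
  # "config" key contains the serialized MachineConfig / KubeletConfig.
  config:
    - name: swap-machineconfig           # hypothetical ConfigMap
    - name: kubelet-swap-config          # hypothetical ConfigMap
```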


Dealing with memory pressure on a node is what differentiates the TP from GA.

* **Technology Preview** - `memory.high` is set on the `kubepods.slice`
Member

has this been implemented?

fabiand

It was. @enp0s3 @iholder101 is it still?


@fabiand @haircommander This has been implemented in OCP Virt 4.16 inside the MachineConfig, but since 4.17 (GA) we've removed it in favor of swap-based evictions.
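
For context on the thread above, a minimal sketch of how a `memory.high` cap on `kubepods.slice` can be applied via a MachineConfig systemd unit (the unit name and threshold are illustrative assumptions, not the actual 4.16 manifest):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 90-kubepods-memory-high          # hypothetical name
spec:
  config:
    ignition:
      version: 3.4.0
    systemd:
      units:
        - name: kubepods-memory-high.service   # hypothetical unit
          enabled: true
          contents: |
            [Unit]
            Description=Throttle kubepods.slice via memory.high (illustrative)
            After=kubelet.service

            [Service]
            Type=oneshot
            # Example threshold; cgroup v2 accepts bytes (K/M/G suffixes) or "max".
            ExecStart=/bin/sh -c 'echo 30G > /sys/fs/cgroup/kubepods.slice/memory.high'

            [Install]
            WantedBy=multi-user.target
```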


| | TP | GA |
|------------------------------|---------------------|---------------------|
| SWAP Provisioning | MachineConfig | MachineConfig |
Member

GA: operator?


In CNV, higher density with WASP is already GA.


I think it is worth speaking about what can be contributed to openshift in order to take openshift close to GA, and let CNV benefit from it at the same time.


One clear advantage of having the swap operator contributed directly to OpenShift, rather than CNV -> OpenShift, is fewer API transitions for the end user.


@fabiand Perhaps this table is redundant since we have the phases table.


| Risk | Mitigation |
|--------------------------------------------|----------------------------------------------------------------------------------------------------------|
| Miss details and introduce instability | * Adjust overcommit ratio <br/> * Tweak eviction thresholds <br/> * Use de-scheduler to balance the load |
Member

is anything needed in the descheduler to do this?


Added link.


openshift-ci bot commented Jan 28, 2025

@iholder101: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|-----------|--------|---------|----------|----------------------|
| ci/prow/markdownlint | 215df9a | link | true | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
