
Feature: OpenShift Virtualization Higher Density #1679

Open
iholder101 wants to merge 22 commits into master from cnv-swap-2itr

Conversation

iholder101

A feature describing OpenShift Virtualization's path to higher density, based on:

- phase 1: WASP
- phase 2: Kubernetes swap

This is a replacement for #1630.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 17, 2024

openshift-ci bot commented Sep 17, 2024

Hi @iholder101. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 17, 2024
@iholder101 iholder101 marked this pull request as ready for review September 17, 2024 11:02
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 17, 2024
@iholder101
Author

/cc @enp0s3 @Barakmor1 @fabiand @mrunalp @haircommander @kannon92


#### Timeline

* GA higher workload density in OpenShift Virtualization in 2024
haircommander (Member)

is this timeline still accurate?

Author

Yes, I believe it is.
Please keep me honest @stu-gott.

stu-gott

Hi @haircommander, I've checked the Jira planning and we are on track, so yes, this is indeed accurate.


@haircommander The timeline bullet "GA higher workload density in OpenShift Virtualization in 2024" relates to phase 1 only. Maybe we should add that in brackets.

Author

Hey @haircommander!
@enp0s3 and I reworked the PR. Can you please have another look?


enp0s3 commented Sep 20, 2024

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 20, 2024
A feature describing CNV's path to higher density based on

- phase 1: wasp
- phase 2: kube swap

Signed-off-by: Itamar Holder <[email protected]>
dankenigsberg (Contributor) left a comment

Thanks for this proposal for a much-requested feature. I think it is high time to have it merged.

## Summary

Fit more workloads onto a given node - achieve a higher workload
density - by overcommitting it's memory resources. Due to timeline
Contributor

Suggested change:
- density - by overcommitting it's memory resources. Due to timeline
+ density - by overcommitting its memory resources. Due to timeline


Done

## Motivation

Today, OpenShift Virtualization is reserving memory (`requests.memory`)
according to the needs of the virtual machine and it's infrastructure
Contributor

Suggested change:
- according to the needs of the virtual machine and it's infrastructure
+ according to the needs of the virtual machine and its infrastructure


Done

Comment on lines 44 to 46
given node leads to the observation that _on average_ there is no memory
ressure and often a rather low memory utilization - despite the fact that
much memory has been reserved.
Contributor

Suggested change:
- given node leads to the observation that _on average_ there is no memory
- ressure and often a rather low memory utilization - despite the fact that
- much memory has been reserved.
+ given node leads to the observation that _on average_ much of the reserved memory is not utilized.

I think it is not precise to say there is no pressure. There is. But we can reduce it, because the memory causing the pressure is not used and can be swapped out.


I've re-phrased the section


### Non-Goals

* Complete life-cycling of the WASP Agent. We are not intending to write
Contributor

This is the first time WASP agent is mentioned. Please add a URL.


Done

* Complete life-cycling of the WASP Agent. We are not intending to write
an Operator for memory over commit for two reasons:
* [Kubernetes SWAP] is close, writing a fully fledged operator seems
to be no good use of resources
Contributor

Suggested change:
- to be no good use of resources
+ to be no good use of developer resources


Done

## Test Plan

Add e2e tests for the WASP agent repository for regression testing against
OpenShift.
Contributor

I think that we should include here a bit more details about how we are (already) testing it. Most importantly: configure 200% over-commitment, fill up the cluster with dormant VMs and verify that the cluster is responsive and survives upgrade.


Done

Comment on lines 439 to 463
Consider the following in developing an upgrade/downgrade strategy for this
enhancement:
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade in order to keep previous behavior?
- What changes (in invocations, configurations, API use, etc.) is an existing
cluster required to make on upgrade in order to make use of the enhancement?

Upgrade expectations:
- Each component should remain available for user requests and
workloads during upgrades. Ensure the components leverage best practices in handling [voluntary
disruption](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/). Any exception to
this should be identified and discussed here.
- Micro version upgrades - users should be able to skip forward versions within a
minor release stream without being required to pass through intermediate
versions - i.e. `x.y.N->x.y.N+2` should work without requiring `x.y.N->x.y.N+1`
as an intermediate step.
- Minor version upgrades - you only need to support `x.N->x.N+1` upgrade
steps. So, for example, it is acceptable to require a user running 4.3 to
upgrade to 4.5 with a `4.3->4.4` step followed by a `4.4->4.5` step.
- While an upgrade is in progress, new component versions should
continue to operate correctly in concert with older component
versions (aka "version skew"). For example, if a node is down, and
an operator is rolling out a daemonset, the old and new daemonset
pods must continue to work correctly even while the cluster remains
in this partially upgraded state for some time.
Contributor

this seems like generic content. should we not replace it with something specific, or drop it?


Done

Comment on lines 481 to 491
How will the component handle version skew with other components?
What are the guarantees? Make sure this is in the test plan.

Consider the following in developing a version skew strategy for this
enhancement:
- During an upgrade, we will always have skew among components, how will this impact your work?
- Does this enhancement involve coordinating behavior in the control plane and
in the kubelet? How does an n-2 kubelet without this feature available behave
when this feature is used?
- Will any other components on the node change? For example, changes to CSI, CRI
or CNI may require updating that component before the kubelet.
Contributor

WASP from CNV-X.Y must work with OCP-X.Y as well as OCP-X.(Y+1)



## Operational Aspects of API Extensions

None
Contributor

I think that this is the place to discuss the fact that all workers have to have the same memory size and the same disk topology, and that deploying and upgrading WASP is a manual step.


Done

Comment on lines 539 to 547
Examples:
- The mutating admission webhook "xyz" has FailPolicy=Ignore and hence
will not block the creation or updates on objects when it fails. When the
webhook comes back online, there is a controller reconciling all objects, applying
labels that were not applied during admission webhook downtime.
- Namespaces deletion will not delete all objects in etcd, leading to zombie
objects when another namespace with the same name is created.

TBD
Contributor

Let us replace this generic content.


Done

@enp0s3 enp0s3 force-pushed the cnv-swap-2itr branch 2 times, most recently from 26e4e09 to 96db93a Compare December 29, 2024 09:06
|--------------------------------------------------------------------|--------------|-----------------|----------------------|
| Phase 1 - Out-Of-Tree SWAP with WASP | 2024 | Tech-Preview | Beta |
| Phase 2 - Transition to Kubernetes SWAP. Limited to CNV-only users | mid-end 2025 | GA | GA |
| Phase 3 - Kubernetes SWAP for all Openshift users | TBD | GA | GA |
Contributor

Suggested change:
- | Phase 3 - Kubernetes SWAP for all Openshift users | TBD | GA | GA |
+ | Phase 3 - Kubernetes SWAP for all OpenShift users | TBD | GA | GA |


Done

* **General Availability**
* Limited to burstable QoS class pods.
* Uses `LimitedSwap`.
* Limited to non-high-priority pods.
Contributor

* Starts containers with `memory.swap.max=max`, as in Tech Preview.
* Sets `memory.swap.max` according to the container request. For more info, refer to the upstream documentation on how to calculate limited swap.


Done


For more info, refer to the upstream documentation on how to calculate
[limited swap](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2400-node-swap#steps-to-calculate-swap-limit).
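
For reference, the KEP linked above derives a container's swap limit proportionally from its memory request. Informally (see the KEP for the authoritative formula and edge cases):

$$
\text{containerSwapLimit} = \frac{\text{containerMemoryRequest}}{\text{nodeMemoryCapacity}} \times \text{nodeSwapCapacity}
$$

So, for example, a container requesting a quarter of the node's memory is allowed roughly a quarter of the node's swap space.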

Contributor

I would benefit from seeing a diagram (or a link to a diagram) showing the execution order of the CRI, OCI hook and Wasp update.

I would also state clearly that we have a race condition here: container binary may allocate memory before WASP sets a limited value.


@dankenigsberg Added both.

@enp0s3 enp0s3 force-pushed the cnv-swap-2itr branch 8 times, most recently from 421cb05 to 5ee0ec2 Compare January 6, 2025 21:12
#### Hypershift / Hosted Control Planes

The `MachineConfig` based swap provisioning will not work, as HCP does
not provide the `MachineConfig` APIs.
Contributor

I believe there's a way for HCP to deploy rhcos with needed changes, but you can leave this as it is for now.

#### Single-node Deployments or MicroShift

Single-node and MicroShift deployments are out of scope of this
proposal.
Contributor

Please add a word about why: we don't support swapping on control-plane nodes. I'd also be explicit about not supporting compact clusters for this reason.


Done

@enp0s3 enp0s3 force-pushed the cnv-swap-2itr branch 2 times, most recently from 59ce661 to ea40818 Compare January 19, 2025 14:36

enp0s3 commented Jan 19, 2025

@haircommander @mrunalp Hi, can you please have another review?

### Non-Goals

* Complete life-cycling of the [WASP](https://github.com/OpenShift-virtualization/wasp-agent) Agent. We are not intending to write
an Operator for memory over commit for two reasons:
Member

is this still true?


Updated.

Comment on lines 135 to 137
* **Phase 1** - OpenShift Virtualization will provide an out-of-tree
solution (WASP) to enable higher workload density and swap-based eviction mechanisms.
* **Phase 2** - WASP will include swap-based eviction mechanisms.
Member

can you describe that these have been done (we're in phase 2, right?)


Done

In this phase, WASP deprecation will start.
* **Phase 4** - OpenShift will GA SWAP for every user, even if OpenShift Virtualization
is not installed on the cluster. In this phase WASP will be removed, machine-level configuration
will be managed by Machine Config Operator.
Member

Swap Operator?


Fixed


### API Extensions

#### Phase 3
Member

this phase glosses over some other MC pieces that are needed (to actually provision swap). that's part of the motivation of doing SwapOperator: to couple the actual swap provisioning with pointing the kubelet to use swap


@haircommander Good point! Let me then suggest the following: since in phase 3 kube swap will be GAed but wasp-agent will still be there (deprecated, not removed), the user doesn't actually need to move to kube swap right away. This transition can happen in phase 4 via the swap operator (and that's even better: it is less error-prone).
Therefore the transition to phase 3 can be a no-op from the user's perspective.

More info about kubelet swap API can be found [here](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2400-node-swap/README.md#api-changes)
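
As an illustrative sketch only (the CR name and selector are hypothetical, and some versions may additionally require the NodeSwap feature gate), pointing the kubelet at swap on OpenShift worker nodes could look like this:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: kubelet-swap                     # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    failSwapOn: false                    # let the kubelet start on nodes with swap enabled
    memorySwap:
      swapBehavior: LimitedSwap          # see the NodeSwap KEP linked above
```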

#### Phase 4
In previous phases the machine-level confgiuration was deployed manually using machine configs. The transition to phase 4
Member

spellcheck: configuration*


Done


#### Phase 4
In previous phases the machine-level confgiuration was deployed manually using machine configs. The transition to phase 4
will require re-deployment of the configuration by the operator.
Member

"the operator" this is not very clear. I think we should spell out what will be taken over by the swap operator.

Member

if there are open questions, call them out


I've added content to the open question section regarding the swap operator

### Topology Considerations

#### Hosted Control Planes: Hosting cluster with OCP Virt
No special consideration required since it's identical to the regular cluster case
Member

you're not able to use MCs in HCP AFAIU. it's also not clear once we get to SwapOperator world how exactly that'd work. Probably need to extend the HostedCluster https://github.com/openshift/hypershift/blob/d03bec5285b9dcca819686037a75027e514c6d64/api/hypershift/v1beta1/hostedcluster_types.go#L1617

Member

oh I see nevermind, I got the cases wrong


#### Hosted Control Planes: Hosted cluster with OCP Virt
Since nested virt is not supported, the only topology that is supported is when the data plane resides on a bare metal.
In that case the abovementioned `KubeletCofnig` and `MachineConfig` should be injected into the `NodePool` CR.
haircommander (Member), Jan 22, 2025

KubeletConfig*


Done
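
To illustrate the NodePool injection discussed above (a sketch assuming HyperShift's `spec.config` mechanism; the ConfigMap names are hypothetical), the `MachineConfig` and `KubeletConfig` would be wrapped in ConfigMaps in the NodePool's namespace and referenced from the NodePool:

```yaml
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: baremetal-workers                # hypothetical
  namespace: clusters
spec:
  # Each entry references a ConfigMap in the NodePool's namespace whose
  # "config" key contains the serialized MachineConfig / KubeletConfig.
  config:
    - name: swap-machineconfig           # hypothetical ConfigMap
    - name: kubelet-swap-config          # hypothetical ConfigMap
```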


Dealing with memory pressure on a node is what differentiates the TP from GA.

* **Technology Preview** - `memory.high` is set on the `kubepods.slice`
Member

has this been implemented?

fabiand

It was. @enp0s3 @iholder101 is it still?


@fabiand @haircommander This has been implemented in OCP Virt 4.16 inside the MachineConfig, but since 4.17 (GA) we've removed it in favor of swap-based evictions.
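
For context on the thread above, a minimal sketch of how a `memory.high` cap on `kubepods.slice` can be applied via a MachineConfig systemd unit (the unit name and threshold are illustrative assumptions, not the actual 4.16 manifest):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 90-kubepods-memory-high          # hypothetical name
spec:
  config:
    ignition:
      version: 3.4.0
    systemd:
      units:
        - name: kubepods-memory-high.service   # hypothetical unit
          enabled: true
          contents: |
            [Unit]
            Description=Throttle kubepods.slice via memory.high (illustrative)
            After=kubelet.service

            [Service]
            Type=oneshot
            # Example threshold; cgroup v2 accepts bytes (K/M/G suffixes) or "max".
            ExecStart=/bin/sh -c 'echo 30G > /sys/fs/cgroup/kubepods.slice/memory.high'

            [Install]
            WantedBy=multi-user.target
```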


| | TP | GA |
|------------------------------|---------------------|---------------------|
| SWAP Provisioning | MachineConfig | MachineConfig |
Member

GA: operator?


In CNV, higher density with WASP is already GA.


I think it is worth speaking about what can be contributed to openshift in order to take openshift close to GA, and let CNV benefit from it at the same time.


One clear advantage of having the swap operator contributed directly to OpenShift, rather than CNV -> OpenShift, is fewer API transitions for the end user.


@fabiand Perhaps this table is redundant since we have the phases table.


| Risk | Mitigation |
|--------------------------------------------|----------------------------------------------------------------------------------------------------------|
| Miss details and introduce instability | * Adjust overcommit ratio <br/> * Tweak eviction thresholds <br/> * Use de-scheduler to balance the load |
Member

is anything needed in the descheduler to do this?


Added link.


openshift-ci bot commented Jan 28, 2025

@iholder101: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|-----------|--------|---------|----------|----------------------|
| ci/prow/markdownlint | 215df9a | link | true | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
