OCPCLOUD-2775: add cluster api autoscaler integration enhancement #1736

Open
wants to merge 1 commit into
base: master

@elmiko elmiko commented Jan 15, 2025

this enhancement describes how we will integrate the cluster autoscaler, and related controllers, with the Cluster API machine management layer.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 15, 2025

openshift-ci-robot commented Jan 15, 2025

@elmiko: This pull request references OCPCLOUD-2775 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

openshift-ci bot commented Jan 15, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ashcrow for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@elmiko elmiko force-pushed the add-cas-cao-capi-integration branch 2 times, most recently from 1c80b56 to 4554fb1 Compare January 16, 2025 16:39
Contributor Author

elmiko commented Jan 16, 2025

i'm not sure why it's barfing on the metadata

Contributor Author

elmiko commented Jan 16, 2025

figured it out, needed quoting on the github handles

@elmiko elmiko force-pushed the add-cas-cao-capi-integration branch from 4554fb1 to 73bfea9 Compare January 16, 2025 18:51
Contributor

openshift-ci bot commented Jan 16, 2025

@elmiko: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines +85 to +89
version and would allow us to drop some patches we are carrying. The Cluster
API MachineSet sync controller will be updated to recognize when the
Cluster Autoscaler has made a change to a Cluster API resource and then sync
the change to the corresponding Machine API resource, regardless of which resource
is authoritative.
Contributor

It would be good to clarify exactly what kind of writes the CAS would be making. Am I right in thinking that it's just the scale subresource?

Comment on lines +94 to +97
locate the resource. The Cluster API MachineSet sync controller will be updated
to ensure that when the Cluster Autoscaler Operator adds the autoscaling
annotations, they are copied to any related resources, regardless of which
is authoritative.
Contributor

I think in this case, since we own the CAO, we don't necessarily need an exception within the CAPI sync controller, and could handle this in CAO. I would expect CAO to look at a MAPI MachineSet, check whether it's authoritative, and then apply the annotations to the correct resource.

Will it still be annotations on the CAPI side?
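
A rough sketch of what that check could look like inside the CAO, written in Python over plain dicts for brevity. Everything named here is an assumption for illustration: the `status.authoritativeAPI` field and its values, the annotation keys, and the helper names `authoritative_target` and `apply_limits` are hypothetical and not taken from this PR. It also assumes the CAPI side still uses annotations, which is the open question above.

```python
# Hypothetical annotation keys; the real keys (and whether the CAPI side
# uses annotations at all) are not settled in this thread.
MIN_KEY = "machine.openshift.io/cluster-api-autoscaler-node-group-min-size"
MAX_KEY = "machine.openshift.io/cluster-api-autoscaler-node-group-max-size"


def authoritative_target(mapi_ms, capi_ms):
    """Pick the MachineSet the CAO should write to.

    Assumes the MAPI MachineSet reports the authoritative API in
    status.authoritativeAPI as "MachineAPI" or "ClusterAPI".
    Returns None for any other value (for example while migrating).
    """
    api = mapi_ms.get("status", {}).get("authoritativeAPI")
    if api == "MachineAPI":
        return mapi_ms
    if api == "ClusterAPI":
        return capi_ms  # may be None if the mirror does not exist yet
    return None


def apply_limits(mapi_ms, capi_ms, min_size, max_size):
    """Write the min/max annotations to the authoritative resource only."""
    target = authoritative_target(mapi_ms, capi_ms)
    if target is None:
        return None  # nothing safe to write; the caller would retry later
    annotations = target.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[MIN_KEY] = str(min_size)
    annotations[MAX_KEY] = str(max_size)
    return target
```

With this shape the sync controller needs no autoscaler-specific exception: the CAO only ever writes to the authoritative copy, and the existing annotation sync carries the values to the mirror.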

Comment on lines +294 to +296
The Cluster Autoscaler Operator will update the Machine API MachineSet resource, and
then the MachineSet sync controller will sync the change to the Cluster API MachineSet
resource. The sync controller will use the managed fields (i.e. `.metadata.managedFields`)
Contributor

Can we make it an error for there to be a MachineAutoscaler that points to both MAPI and CAPI versions of the same MachineSet?

And then CAO can just update the correct, authoritative resource with whatever information it needs to, without having to worry about conflicting resources

I think if we do this, we need less special logic within the sync controllers

In theory we could validate this using a validating admission policy, but I don't think it's fool-proof
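
To make the proposed error condition concrete, here is a minimal sketch of the conflict check, again in Python over plain dicts. It assumes that the MAPI and CAPI mirrors of a MachineSet share the same name and that a MachineAutoscaler identifies its target through `spec.scaleTargetRef`; the function name `find_conflicts` is hypothetical.

```python
from collections import defaultdict

MAPI_GROUP = "machine.openshift.io"
CAPI_GROUP = "cluster.x-k8s.io"


def find_conflicts(machine_autoscalers):
    """Return MachineSet names targeted through both the MAPI and CAPI groups.

    Assumes mirrored MachineSets share a name, so the name alone
    identifies a mirror pair.
    """
    groups_by_target = defaultdict(set)
    for ma in machine_autoscalers:
        ref = ma["spec"]["scaleTargetRef"]
        if ref.get("kind") != "MachineSet":
            continue
        # "machine.openshift.io/v1beta1" -> "machine.openshift.io"
        group = ref["apiVersion"].split("/", 1)[0]
        groups_by_target[ref["name"]].add(group)
    return sorted(
        name
        for name, groups in groups_by_target.items()
        if {MAPI_GROUP, CAPI_GROUP} <= groups
    )
```

A check like this needs to see every MachineAutoscaler at once, which is one reason an admission-time check alone may not catch every case.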

Comment on lines +358 to +362
* The Cluster Autoscaler Operator has added the minimum and maximum size annotations, and ownership
annotation to a record. If the sync controller sees an update to these annotations on a
non-authoritative resource originating from the Cluster Autoscaler Operator, it will copy
that change to the authoritative resource if no MachineAutoscaler is referencing the
authoritative resource.
Contributor

Since we own the CAO, I think implementing this within CAO is likely the easier option.

The sync logic already syncs annotations between MachineSets. We just need to add special logic to handle well-known annotations if they somehow end up converting from annotations to structured status.

Comment on lines +363 to +368
* A provider MachineSet controller has added the scale from zero annotations to a
non-authoritative record. This occurs when the Cluster API resource is marked as
authoritative but the Machine API resource is updated by the provider MachineSet controller.
In these cases the scale from zero annotations will be copied to the non-authoritative
Cluster API resource. The data from the MachineSet controller is only applied to
Machine API resources currently.
Contributor

Is there an equivalent of this controller in CAPI? If not, is it on the roadmap? If it is, we will want to ensure these controllers follow the same pausing behavior as the rest of the controllers.

Comment on lines +369 to +375
* `.spec.replicas` changes will be synced from the Cluster API MachineSet to the Machine
API MachineSet regardless of which is authoritative when the change originates from the
Cluster Autoscaler. As the Cluster Autoscaler will be configured to operate against Cluster
API resources only, there will be a need to identify when the Cluster Autoscaler has updated
a non-authoritative Cluster API resource so that the authoritative resource can be updated.
This will only occur when the sync controller observes an update to the replicas field from
the Cluster Autoscaler.
Contributor

We intend to use a VAP to block updates to a non-authoritative resource. Have you investigated how we let the CAS through the VAP (what exception we need in place) to make just this replicas field change?
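
For reference, the enhancement text says the sync controller will use `.metadata.managedFields` to identify where a change came from. A minimal sketch of such a check follows, in Python over a plain dict. The field manager name `cluster-autoscaler` is an assumption, as is the helper name `replicas_written_by`; the actual manager string would need to be confirmed against what the CAS sends.

```python
AUTOSCALER_MANAGER = "cluster-autoscaler"  # hypothetical field manager name


def replicas_written_by(machine_set, manager=AUTOSCALER_MANAGER):
    """Report whether `manager` owns .spec.replicas per managedFields.

    Each managedFields entry records a manager name and, in fieldsV1,
    the set of fields it owns, keyed as "f:<field>".
    """
    entries = machine_set.get("metadata", {}).get("managedFields", [])
    for entry in entries:
        if entry.get("manager") != manager:
            continue
        spec_fields = entry.get("fieldsV1", {}).get("f:spec", {})
        if "f:replicas" in spec_fields:
            return True
    return False
```

The same signal is what any exception in the VAP would have to key on, so whichever identity is chosen (field manager, service account, or subresource) should match between the policy and the sync controller.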

could create a race condition where updating the minimum and maximum size
values will lead to an inaccurate update to both MachineSets.

To address the risk of possible race conditions on MachineAutoscalers we have a few
Contributor

Third option: CAS will take no action when multiple MachineAutoscalers refer to the same MachineSet (MAPI and CAPI mirrors) and will instead update the status to show that one of the two should be removed.

Fourth option: use some sort of admission-time validation to try to prevent this.

Options three and four can be done in conjunction with one another.

Comment on lines +465 to +466
2. Will we want to remove MachineAutoscalers that reference Cluster API MachineSets
during a downgrade?
Contributor

We don't support downgrades, so we probably don't need to be concerned about this.

We won't automatically create these during an upgrade, so downgrading a failed upgrade shouldn't produce an issue
