
Control plane should scale up in parallel, not serially #12007

Open
joejulian opened this issue Mar 20, 2025 · 15 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@joejulian
Contributor

In issue #2016 @dlipovetsky correctly suggested the opposite be true:

We recently implemented control plane scale up. If the desired number of control plane machines exceeds the actual number, and at least one control plane machine exists, the controller will create multiple machines in parallel. (Once created, each machine runs kubeadm join --control-plane).

I think we should scale up control planes serially. Before creating an additional control plane machine, we should verify that every etcd member has started. We could also verify that the etcd cluster has quorum (if it does not have quorum, creating a new machine might be a waste of time and resources. On the other hand, if it does have quorum, it might lose it after we create the machine)

Today, etcd still recommends that the cluster be scaled up or down one member at a time. Moreover, there are known issues with running kubeadm join --control-plane in parallel.

In the future, we will likely be able to scale up in parallel by using etcd non-voting members (learners). Kubeadm is already exploring this idea.

/cc @detiber @randomvariable @chuckha

In this description, he states that we would likely be able to scale up in parallel by using etcd non-voting members (learners). Kubeadm has since added support for learners, so we should look at returning to parallel scale-up using this feature.
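
To make the pre-flight idea quoted above concrete, here is a minimal sketch (not KCP's actual code) of using the etcd clientv3 API to confirm that every member has started and that none is still a learner before creating another control plane machine. The endpoints, the absence of TLS configuration, and the function name are illustrative assumptions; KCP's real etcd client handling is more involved.

```go
// Illustrative only: a pre-flight check of the kind described above, not KCP's
// actual implementation. Assumes go.etcd.io/etcd/client/v3 and reachable client
// endpoints; TLS setup and retries are omitted for brevity.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// allMembersStarted returns true only if every etcd member has started
// (an unstarted member reports an empty Name) and no member is still a learner.
func allMembersStarted(ctx context.Context, endpoints []string) (bool, error) {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return false, err
	}
	defer cli.Close()

	resp, err := cli.MemberList(ctx)
	if err != nil {
		return false, err
	}
	for _, m := range resp.Members {
		if m.Name == "" || m.IsLearner {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	ok, err := allMembersStarted(context.Background(), []string{"https://10.0.0.10:2379"})
	fmt.Println(ok, err)
}
```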

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 20, 2025
@richardcase
Member

/assign

@ivelichkovich

I think we'll still need to do the first CP with kubeadm init, then the rest could go in parallel

@richardcase
Member

I think we'll still need to do the first CP with kubeadm init, then the rest could go in parallel

Yeah, that makes sense.

@sbueringer
Member

sbueringer commented Mar 20, 2025

Please consider that KCP is built under the assumption that we only create/delete one Machine at a time.

If we want to implement this change, this assumption goes out the window.

(My assumption is that this issue is about KCP. Is that correct? KCP is not mentioned at all.)

@richardcase
Member

(My assumption is that this issue is about KCP. Is that correct? KCP is not mentioned at all.)

@sbueringer - yep, this is about KCP. I'll work on a PoC initially so that it will aid discussion.

@sbueringer
Member

sbueringer commented Mar 21, 2025

Just a few more notes:

  • KCP will be refactored as part of the v1beta2 work to ensure it works based on v1beta2 conditions (today the logic is based on v1beta1 conditions)
  • In-place updates will also require some bigger changes to KCP soon (xref: 📖 Add In-place updates proposal #11029)
  • Slightly more context to this one: "assumption that we only create/delete one Machine at a time". I would recommend a full audit of KCP to find these assumptions. Some are probably pretty hard to find (e.g. when we calculate conditions, or when we select a Machine for remediation).
  • If we depend on specific kubeadm / etcd features / versions we have to consider our support ranges

@richardcase
Member

Thanks for the notes and insight @sbueringer 🙇

@neolit123
Member

neolit123 commented Mar 24, 2025

parallel join is generally not supported by kubeadm and YMMV.
While newer kubeadm switched to etcd learners, note that etcd 3.6.0 will by default enforce a maximum learner count of 1, so if CAPI needs to support this it would also have to override the kubeadm etcd pod manifest to set this flag to a value > 1.

EDIT: the restriction was already there in 3.5; the max learner count was simply hard-coded as 1. etcd 3.6 just supports customizing it via a flag (--max-learner).

kubernetes/kubernetes#130583 (comment)
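
For illustration only, here is a minimal sketch of where such an override could live if CAPI were to raise the learner limit, namely the local etcd extraArgs of the kubeadm ClusterConfiguration that KCP renders. This assumes CAPI's v1beta1 kubeadm bootstrap types; the exact flag name is an assumption and must be checked against the etcd version shipped with the target Kubernetes release (the comment above refers to it as --max-learner).

```go
// Illustrative sketch, not an agreed design: raising etcd's learner limit by
// passing an extra flag to the local etcd static pod via the kubeadm
// ClusterConfiguration rendered by KCP.
package main

import (
	"fmt"

	bootstrapv1 "sigs.k8s.io/cluster-api/bootstrap/kubeadm/api/v1beta1"
)

func main() {
	clusterCfg := bootstrapv1.ClusterConfiguration{
		Etcd: bootstrapv1.Etcd{
			Local: &bootstrapv1.LocalEtcd{
				// Hypothetical flag name; verify it against the etcd release in use.
				ExtraArgs: map[string]string{"max-learners": "2"},
			},
		},
	}
	fmt.Printf("etcd extraArgs: %v\n", clusterCfg.Etcd.Local.ExtraArgs)
}
```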

@sbueringer
Member

sbueringer commented Mar 24, 2025

parallel join is generally not supported by kubeadm

I wasn't aware of this. Sounds like a showstopper to me.

@neolit123
Member

neolit123 commented Mar 24, 2025

Also, it's not documented anywhere on the k8s.io website, and our test tool https://github.com/kubernetes/kubeadm/tree/main/kinder only joins serially, so we have no e2e tests for it.

I recall users and some VMware projects doing parallel join at some point, but that was a while back.

@fabriziopandini
Member

TBH, I'm struggling a little bit to understand what benefit this change will bring to users: since init cannot happen in parallel, we already unblock workers to join immediately after init completes, so IMO whether the 2nd and 3rd CPs join sequentially or in parallel will not bring any benefit to the overall cluster provisioning time.

Also, quoting similar discussions in the past, we have always ended up preferring stability over speed in KCP, e.g. #3876.

I also agree that kubeadm support is a showstopper, and that implementing this will be way trickier and riskier than you might expect, because KCP is built under the assumption that we only create/delete one Machine at a time.

Assuming we find a way forward to address the showstopper, before diving deep into KCP changes I would suggest doing a preliminary impact analysis of all the code paths in KCP (scale up, down, rollout, remediation, basically everything) and discussing the outcomes before creating a PR (considering the complexity of KCP's code organization, doing such complex discussions within PR comments will probably be more time-consuming/dispersive than having a focused discussion in a design doc).

@richardcase
Member

I would suggest doing a preliminary impact analysis of all the code paths in KCP (scale up, down, rollout, remediation, basically everything) and discussing the outcomes before creating a PR (considering the complexity of KCP's code organization, doing such complex discussions within PR comments will probably be more time-consuming/dispersive than having a focused discussion in a design doc)

Yep, I agree with you here @fabriziopandini. The PoC was really a way to help with the investigation and understand everything as an aid to writing a proposal... and not as a way to put up a PR for discussion.

I'm struggling a little bit to understand what benefit this change will bring to the users

The benefit is getting the control plane to its desired state sooner by creating the 2nd and 3rd (or even 4th and 5th) nodes in parallel. This is most beneficial when using infra providers that take a long time to provision, like MaaS.

parallel join is generally not supported by kubeadm and YMMV.

This does sound like a showstopper for now, especially if there are no e2e tests. Parallel joins would have to be supported (with tests) in kubeadm before this could proceed.

@richardcase
Member

Thanks for all the helpful input @neolit123 @sbueringer @fabriziopandini

@chrischdi
Member

Doing issue triage

/kind feature
/priority backlog
/triage accepted

As per the comments above: this should first be sorted out in kubeadm before it becomes actionable on the CAPI side.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 25, 2025
@richardcase
Member

As there is nothing to do on this at present:

/unassign
