
Control plane should scale up in parallel, not serially #12007

Open
joejulian opened this issue Mar 20, 2025 · 15 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@joejulian
Contributor

In issue #2016 @dlipovetsky correctly suggested the opposite be true:

We recently implemented control plane scale up. If the desired number of control plane machines exceeds the actual number, and at least one control plane machine exists, the controller will create multiple machines in parallel. (Once created, each machine runs kubeadm join --control-plane).

I think we should scale up control planes serially. Before creating an additional control plane machine, we should verify that every etcd member has started. We could also verify that the etcd cluster has quorum (if it does not have quorum, creating a new machine might be a waste of time and resources. On the other hand, if it does have quorum, it might lose it after we create the machine)

Today, etcd still recommends that the cluster be scaled up or down one member at a time. Moreover, there are known issues with running kubeadm join --control-plane in parallel.

In the future, we will likely be able to scale up in parallel by using etcd non-voting members (learners). Kubeadm is already exploring this idea.

/cc @detiber @randomvariable @chuckha

In this description, he states that we would likely be able to scale up in parallel by using etcd non-voting members (learners). Kubeadm has since added support for learners, so we should look at returning to parallel scale-up using this feature.
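
To make the pre-flight idea quoted above concrete, here is a minimal sketch (not KCP's actual code) of using the etcd clientv3 API to confirm that every member has started and that none is still a learner before creating another control plane machine. The endpoints, the absence of TLS configuration, and the function name are illustrative assumptions; KCP's real etcd client handling is more involved.

```go
// Illustrative only: a pre-flight check of the kind described above, not KCP's
// actual implementation. Assumes go.etcd.io/etcd/client/v3 and reachable client
// endpoints; TLS setup and retries are omitted for brevity.
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// allMembersStarted returns true only if every etcd member has started
// (an unstarted member reports an empty Name) and no member is still a learner.
func allMembersStarted(ctx context.Context, endpoints []string) (bool, error) {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return false, err
	}
	defer cli.Close()

	resp, err := cli.MemberList(ctx)
	if err != nil {
		return false, err
	}
	for _, m := range resp.Members {
		if m.Name == "" || m.IsLearner {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	ok, err := allMembersStarted(context.Background(), []string{"https://10.0.0.10:2379"})
	fmt.Println(ok, err)
}
```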

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 20, 2025
@richardcase
Member

/assign

@ivelichkovich

I think we'll still need to do the first CP with kubeadm init, then the rest could go in parallel

@richardcase
Member

I think we'll still need to do the first CP with kubeadm init, then the rest could go in parallel

Yeah, that makes sense.

@sbueringer
Member

sbueringer commented Mar 20, 2025

Please consider that KCP is built under the assumption that we only create/delete one Machine at a time.

If we want to implement this change, this assumption goes out the window.

(My assumption is that this issue is about KCP. Is that correct? KCP is not mentioned at all.)

@richardcase
Member

(My assumption is that this issue is about KCP. Is that correct? KCP is not mentioned at all.)

@sbueringer - yep, this is about KCP. I'll work on a PoC initially so that it will aid discussion.

@sbueringer
Member

sbueringer commented Mar 21, 2025

Just a few more notes:

  • KCP will be refactored as part of the v1beta2 work to ensure it works based on v1beta2 conditions (today the logic is based on v1beta1 conditions)
  • In-place updates will also require some bigger changes to KCP soon (xref: 📖 Add In-place updates proposal #11029)
  • Slightly more context to this one: "assumption that we only create/delete one Machine at a time". I would recommend a full audit of KCP to find these assumptions. Some are probably pretty hard to find (e.g. when we calculate conditions, or when we select a Machine for remediation).
  • If we depend on specific kubeadm / etcd features / versions we have to consider our support ranges

@richardcase
Member

Thanks for the notes and insight @sbueringer 🙇

@neolit123
Member

neolit123 commented Mar 24, 2025

parallel join is generally not supported by kubeadm and YMMV.
While newer kubeadm switched to etcd learners, note that etcd 3.6.0 will by default enforce a maximum learner count of 1, so if CAPI needs to support this it would also have to override the kubeadm etcd pod manifest to set this flag to a value > 1.

EDIT: the restriction was already there in 3.5; the max learner count was simply hard-coded as 1. etcd 3.6 just supports customizing it via a flag (--max-learner).

kubernetes/kubernetes#130583 (comment)
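
For illustration only, here is a minimal sketch of where such an override could live if CAPI were to raise the learner limit, namely the local etcd extraArgs of the kubeadm ClusterConfiguration that KCP renders. This assumes CAPI's v1beta1 kubeadm bootstrap types; the exact flag name is an assumption and must be checked against the etcd version shipped with the target Kubernetes release (the comment above refers to it as --max-learner).

```go
// Illustrative sketch, not an agreed design: raising etcd's learner limit by
// passing an extra flag to the local etcd static pod via the kubeadm
// ClusterConfiguration rendered by KCP.
package main

import (
	"fmt"

	bootstrapv1 "sigs.k8s.io/cluster-api/bootstrap/kubeadm/api/v1beta1"
)

func main() {
	clusterCfg := bootstrapv1.ClusterConfiguration{
		Etcd: bootstrapv1.Etcd{
			Local: &bootstrapv1.LocalEtcd{
				// Hypothetical flag name; verify it against the etcd release in use.
				ExtraArgs: map[string]string{"max-learners": "2"},
			},
		},
	}
	fmt.Printf("etcd extraArgs: %v\n", clusterCfg.Etcd.Local.ExtraArgs)
}
```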

@sbueringer
Member

sbueringer commented Mar 24, 2025

parallel join is generally not supported by kubeadm

I wasn't aware of this. Sounds like a showstopper to me.

@neolit123
Member

neolit123 commented Mar 24, 2025

Also, it's not documented anywhere on the k8s.io website, and our test tool https://github.com/kubernetes/kubeadm/tree/main/kinder only joins serially, so we have no e2e tests for it.

I recall users and some VMware projects doing parallel join at some point, but that was a while back.

@fabriziopandini
Member

TBH, I'm struggling a little bit to understand what benefit this change will bring to users: since init cannot happen in parallel, we already unblock workers to join immediately after init completes, so IMO whether the 2nd and 3rd CPs join sequentially or in parallel will not bring any benefit to the overall cluster provisioning time.

Also, quoting similar discussions in the past, we have always ended up preferring stability over speed in KCP, e.g. #3876.

I also agree that kubeadm support is a showstopper, and that implementing this will be way trickier and riskier than you might expect, because KCP is built under the assumption that we only create/delete one Machine at a time.

Assuming we find a way forward to address the showstopper, before diving deep into KCP changes I would suggest doing a preliminary impact analysis of all the code paths in KCP (scale up, down, rollout, remediation, basically everything) and discussing the outcomes before creating a PR (considering the complexity of KCP's code organization, doing such complex discussions within PR comments will probably be more time-consuming/dispersive than having a focused discussion in a design doc).

@richardcase
Member

I would suggest doing a preliminary impact analysis of all the code paths in KCP (scale up, down, rollout, remediation, basically everything) and discussing the outcomes before creating a PR (considering the complexity of KCP's code organization, doing such complex discussions within PR comments will probably be more time-consuming/dispersive than having a focused discussion in a design doc)

Yep, I agree with you here @fabriziopandini. The PoC was really a way to help with the investigation and understand everything as an aid to writing a proposal... and not as a way to put up a PR for discussion.

I'm struggling a little bit to understand what benefit this change will bring to the users

The benefit is getting the control plane to its desired state sooner by creating the 2nd and 3rd (or even 4th and 5th) nodes in parallel. This is most beneficial when using infra providers that take a long time to provision, like MaaS.

parallel join is generally not supported by kubeadm and YMMV.

This does sound like a showstopper for now, especially if there are no e2e tests. Parallel joins would have to be supported (with tests) in kubeadm before this could proceed.

@richardcase
Member

Thanks for all the helpful input @neolit123 @sbueringer @fabriziopandini

@chrischdi
Member

Doing issue triage

/kind feature
/priority backlog
/triage accepted

As per the comments above: this should first be sorted out in kubeadm before it becomes actionable on the CAPI side.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/backlog Higher priority than priority/awaiting-more-evidence. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 25, 2025
@richardcase
Member

As there is nothing to do on this at present:

/unassign
