Recreating control-plane members generates etcd errors and orphaned etcd members #3577

Closed
nicolaiort-datev opened this issue Mar 10, 2025 · 1 comment · Fixed by #3584

Labels: kind/bug (Categorizes issue or PR as related to a bug.), sig/cluster-management (Denotes a PR or issue as being assigned to SIG Cluster Management.)

@nicolaiort-datev

What happened?

We experimented with rolling-update logic (related to #3540) by adding a new node to a kubeone cluster and removing it afterwards. This resulted in etcd errors and in other nodes being reported as NotReady for several minutes.

Etcdserver error

Error from server: etcdserver: request timed out

Etcd member list showing that the old node has not been removed

kubectl exec -n kube-system etcd-hakuna-matata-cp-1 -- /bin/sh -c "ETCDCTL_API=3 ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key etcdctl member list"
3d2d8344072d885a, started, hakuna-matata-cp-3, https://172.18.1.73:2380, https://172.18.1.73:2379, false
6b4c061798973815, started, hakuna-matata-cp-2, https://172.18.1.77:2380, https://172.18.1.77:2379, false
b0a470b479170e7a, started, hakuna-matata-cp-1, https://172.18.1.172:2380, https://172.18.1.172:2379, false
f4bbe456fed03cdf, started, hakuna-matata-cp-0, https://172.18.1.206:2380, https://172.18.1.206:2379, false
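
For reference, endpoint status can be checked in the same way while the stale peer is still listed. This is a generic etcdctl invocation, not a command from the original run; it only reuses the pod name and certificate paths shown above:

kubectl exec -n kube-system etcd-hakuna-matata-cp-1 -- /bin/sh -c "ETCDCTL_API=3 ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key etcdctl endpoint status --cluster -w table"

The endpoint of the removed node would be expected to fail here, matching the dial error in the KubeOne warning further below.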

Expected behavior

Removing a node from Kubernetes and running kubeone apply should clean up any orphaned etcd members. The following warning shows up, but the etcd member list still shows the member.

Kubeone warning

8.1.206:2379"  endpoint status, error: failed to dial endpoint 172.18.1.73:2379 with maintenance client: context deadline exceeded 
WARN[16:02:26 CET] scheduling etcd member ID:3d2d8344072d885a name:"hakuna-matata-cp-3" peerURLs:"https://172.18.1.73:2380" clientURLs:"https://172.18.1.73:2379"  to delete 

Member list

kubectl exec -n kube-system etcd-hakuna-matata-cp-1 -- /bin/sh -c "ETCDCTL_API=3 ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key etcdctl member list"
6b4c061798973815, started, hakuna-matata-cp-2, https://172.18.1.77:2380, https://172.18.1.77:2379, false
b0a470b479170e7a, started, hakuna-matata-cp-1, https://172.18.1.172:2380, https://172.18.1.172:2379, false
f4bbe456fed03cdf, started, hakuna-matata-cp-0, https://172.18.1.206:2380, https://172.18.1.206:2379, false
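
If an orphaned member does remain in the list, it can also be removed by hand with a standard etcdctl member remove. This is a workaround sketch rather than part of the original report; the member ID is the one from the warning above, used purely as an example:

kubectl exec -n kube-system etcd-hakuna-matata-cp-1 -- /bin/sh -c "ETCDCTL_API=3 ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key etcdctl member remove 3d2d8344072d885a"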

How to reproduce the issue?

  1. Create a kubeone cluster with 3 control-plane (CP) nodes via terraform
  2. Initialize the cluster via kubeone apply
  3. Increase the node count to 4
  4. Terraform apply && Generate Output
  5. Kubeone apply
  6. Do the following for nodes 1-3 (aka 0-2); a shell sketch of this loop follows the list:
    • Drain the node
    • Remove the node via kubectl
    • Recreate the node and its load balancer via terraform
    • Kubeone apply
  7. Drain the node and remove it from the cluster (kubectl drain && kubectl delete node)
  8. Decrease the node count to 3
  9. Terraform apply (deletes the vm) && Generate Output
  10. Kubeone apply
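
The loop in step 6 roughly corresponds to the following shell sketch. Node names, the Terraform resource address, and the tf.json output file are illustrative assumptions, not taken from the actual setup:

for i in 0 1 2; do
  # Drain and remove the control-plane node from Kubernetes.
  kubectl drain "hakuna-matata-cp-$i" --ignore-daemonsets --delete-emptydir-data
  kubectl delete node "hakuna-matata-cp-$i"
  # Recreate the VM (and its load balancer attachment) via Terraform, then refresh the output consumed by KubeOne.
  terraform apply -replace="openstack_compute_instance_v2.control_plane[$i]"
  terraform output -json > tf.json
  # Reconcile the cluster with the recreated node.
  kubeone apply --manifest kubeone.yaml --tfjson tf.json
done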

What KubeOne version are you using?

1.9.1 and 1.9.2

Client 1

kubeone version 
{
  "kubeone": {
    "major": "1",
    "minor": "9",
    "gitVersion": "1.9.1",
    "gitCommit": "none",
    "gitTreeState": "",
    "buildDate": "2024-12-23T13:23:27Z",
    "goVersion": "go1.23.4",
    "compiler": "gc",
    "platform": "darwin/arm64"
  },
  "machine_controller": {
    "major": "1",
    "minor": "60",
    "gitVersion": "v1.60.0",
    "gitCommit": "",
    "gitTreeState": "",
    "buildDate": "",
    "goVersion": "",
    "compiler": "",
    "platform": "linux/amd64"
  }
}

Client 2

kubeone version
{
  "kubeone": {
    "major": "1",
    "minor": "9",
    "gitVersion": "1.9.2",
    "gitCommit": "none",
    "gitTreeState": "",
    "buildDate": "2025-02-05T11:58:13Z",
    "goVersion": "go1.23.6",
    "compiler": "gc",
    "platform": "darwin/arm64"
  },
  "machine_controller": {
    "major": "1",
    "minor": "61",
    "gitVersion": "v1.61.0",
    "gitCommit": "",
    "gitTreeState": "",
    "buildDate": "",
    "goVersion": "",
    "compiler": "",
    "platform": "linux/amd64"
  }
}

Provide your KubeOneCluster manifest here (if applicable)

apiVersion: kubeone.k8c.io/v1beta2
kind: KubeOneCluster
versions:
  kubernetes: "1.29.6"
cloudProvider:
  openstack: {}
  external: true
clusterNetwork:
  cni:
    cilium:
      kubeProxyReplacement: "strict"
      enableHubble: true
  kubeProxy:
    skipInstallation: true
addons:
  enable: true
  path: "./addons"
  addons:
  - name: cluster-autoscaler
  - name: default-storage-class

What cloud provider are you running on?

OpenStack

What operating system are you running in your cluster?

Flatcar Linux

Additional information

This does not seem to happen consistently: we had several runs without any problems, but in others the very first attempt resulted in the orphaned member. We're still investigating and can only guess at the problem's source.

Our best guess is that removing and re-adding etcd members overloads the etcd cluster (together with the NotReady nodes), so the member deletion never gets triggered, leaving orphaned members behind.

cc @toschneck

@nicolaiort-datev added the kind/bug and sig/cluster-management labels Mar 10, 2025
@ahmedwaleedmalik
Member

ahmedwaleedmalik commented Mar 20, 2025

For #3577, we have added an improvement to KubeOne through #3584 that handles orphaned etcd members in a better way.

etcd members in Kubernetes are unique, and in-place replacements are not recommended. I’d suggest updating the process for rolling upgrades of control-plane VMs to:

  1. Do the following for nodes 1-3 (aka 0-2):
    • Drain the node
    • Remove the node via kubectl (optional)
    • Remove the VM from the terraform configuration
    • Run kubeone apply; KubeOne will then take care of removing the orphaned etcd member and node.

The rest of the process remains the same as you explained in this ticket.
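
A minimal sketch of that suggested flow for one control-plane VM, assuming the same hypothetical node and file names as in the reproduction steps above:

# Drain the node; deleting the Node object is optional.
kubectl drain hakuna-matata-cp-0 --ignore-daemonsets --delete-emptydir-data
# Remove the VM from the Terraform configuration (e.g. drop the instance or lower the count), then:
terraform apply
terraform output -json > tf.json
# kubeone apply removes the orphaned etcd member and the stale Node object.
kubeone apply --manifest kubeone.yaml --tfjson tf.json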
