Recreating control-plane members generates etcd errors and orphaned etcd members #3577

Closed
nicolaiort-datev opened this issue Mar 10, 2025 · 1 comment · Fixed by #3584

Labels: kind/bug (Categorizes issue or PR as related to a bug.), sig/cluster-management (Denotes a PR or issue as being assigned to SIG Cluster Management.)

@nicolaiort-datev

What happened?

We experimented with rolling-update logic (related to #3540) by adding a new node to a kubeone cluster and removing it afterwards. This resulted in etcd errors and in other nodes being reported as NotReady for several minutes.

Etcdserver error

Error from server: etcdserver: request timed out

Etcd member list showing that the old node has not been removed

kubectl exec -n kube-system etcd-hakuna-matata-cp-1 -- /bin/sh -c "ETCDCTL_API=3 ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key etcdctl member list"
3d2d8344072d885a, started, hakuna-matata-cp-3, https://172.18.1.73:2380, https://172.18.1.73:2379, false
6b4c061798973815, started, hakuna-matata-cp-2, https://172.18.1.77:2380, https://172.18.1.77:2379, false
b0a470b479170e7a, started, hakuna-matata-cp-1, https://172.18.1.172:2380, https://172.18.1.172:2379, false
f4bbe456fed03cdf, started, hakuna-matata-cp-0, https://172.18.1.206:2380, https://172.18.1.206:2379, false
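
For reference, endpoint status can be checked in the same way while the stale peer is still listed. This is a generic etcdctl invocation, not a command from the original run; it only reuses the pod name and certificate paths shown above:

kubectl exec -n kube-system etcd-hakuna-matata-cp-1 -- /bin/sh -c "ETCDCTL_API=3 ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key etcdctl endpoint status --cluster -w table"

The endpoint of the removed node would be expected to fail here, matching the dial error in the KubeOne warning further below.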

Expected behavior

Removing a node from Kubernetes and running kubeone apply should clean up any orphaned etcd members. The following warning shows up, but the etcd member list still shows the member.

Kubeone warning

8.1.206:2379"  endpoint status, error: failed to dial endpoint 172.18.1.73:2379 with maintenance client: context deadline exceeded 
WARN[16:02:26 CET] scheduling etcd member ID:3d2d8344072d885a name:"hakuna-matata-cp-3" peerURLs:"https://172.18.1.73:2380" clientURLs:"https://172.18.1.73:2379"  to delete 

Member list

kubectl exec -n kube-system etcd-hakuna-matata-cp-1 -- /bin/sh -c "ETCDCTL_API=3 ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key etcdctl member list"
6b4c061798973815, started, hakuna-matata-cp-2, https://172.18.1.77:2380, https://172.18.1.77:2379, false
b0a470b479170e7a, started, hakuna-matata-cp-1, https://172.18.1.172:2380, https://172.18.1.172:2379, false
f4bbe456fed03cdf, started, hakuna-matata-cp-0, https://172.18.1.206:2380, https://172.18.1.206:2379, false
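
If an orphaned member does remain in the list, it can also be removed by hand with a standard etcdctl member remove. This is a workaround sketch rather than part of the original report; the member ID is the one from the warning above, used purely as an example:

kubectl exec -n kube-system etcd-hakuna-matata-cp-1 -- /bin/sh -c "ETCDCTL_API=3 ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt ETCDCTL_CERT=/etc/kubernetes/pki/etcd/healthcheck-client.crt ETCDCTL_KEY=/etc/kubernetes/pki/etcd/healthcheck-client.key etcdctl member remove 3d2d8344072d885a"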

How to reproduce the issue?

  1. Create a kubeone cluster with 3 control-plane (CP) nodes via terraform
  2. Initialize the cluster via kubeone apply
  3. Increase the node count to 4
  4. Terraform apply && Generate Output
  5. Kubeone apply
  6. Do the following for nodes 1-3 (aka 0-2); a shell sketch of this loop follows the list:
    • Drain the node
    • Remove the node via kubectl
    • Recreate the node and its load balancer via terraform
    • Kubeone apply
  7. Drain the node and remove it from the cluster (kubectl drain && kubectl delete node)
  8. Decrease the node count to 3
  9. Terraform apply (deletes the vm) && Generate Output
  10. Kubeone apply
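
The loop in step 6 roughly corresponds to the following shell sketch. Node names, the Terraform resource address, and the tf.json output file are illustrative assumptions, not taken from the actual setup:

for i in 0 1 2; do
  # Drain and remove the control-plane node from Kubernetes.
  kubectl drain "hakuna-matata-cp-$i" --ignore-daemonsets --delete-emptydir-data
  kubectl delete node "hakuna-matata-cp-$i"
  # Recreate the VM (and its load balancer attachment) via Terraform, then refresh the output consumed by KubeOne.
  terraform apply -replace="openstack_compute_instance_v2.control_plane[$i]"
  terraform output -json > tf.json
  # Reconcile the cluster with the recreated node.
  kubeone apply --manifest kubeone.yaml --tfjson tf.json
done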

What KubeOne version are you using?

1.9.1 and 1.9.2

Client 1

kubeone version 
{
  "kubeone": {
    "major": "1",
    "minor": "9",
    "gitVersion": "1.9.1",
    "gitCommit": "none",
    "gitTreeState": "",
    "buildDate": "2024-12-23T13:23:27Z",
    "goVersion": "go1.23.4",
    "compiler": "gc",
    "platform": "darwin/arm64"
  },
  "machine_controller": {
    "major": "1",
    "minor": "60",
    "gitVersion": "v1.60.0",
    "gitCommit": "",
    "gitTreeState": "",
    "buildDate": "",
    "goVersion": "",
    "compiler": "",
    "platform": "linux/amd64"
  }
}

Client 2

kubeone version
{
  "kubeone": {
    "major": "1",
    "minor": "9",
    "gitVersion": "1.9.2",
    "gitCommit": "none",
    "gitTreeState": "",
    "buildDate": "2025-02-05T11:58:13Z",
    "goVersion": "go1.23.6",
    "compiler": "gc",
    "platform": "darwin/arm64"
  },
  "machine_controller": {
    "major": "1",
    "minor": "61",
    "gitVersion": "v1.61.0",
    "gitCommit": "",
    "gitTreeState": "",
    "buildDate": "",
    "goVersion": "",
    "compiler": "",
    "platform": "linux/amd64"
  }
}

Provide your KubeOneCluster manifest here (if applicable)

apiVersion: kubeone.k8c.io/v1beta2
kind: KubeOneCluster
versions:
  kubernetes: "1.29.6"
cloudProvider:
  openstack: {}
  external: true
clusterNetwork:
  cni:
    cilium:
      kubeProxyReplacement: "strict"
      enableHubble: true
  kubeProxy:
    skipInstallation: true
addons:
  enable: true
  path: "./addons"
  addons:
  - name: cluster-autoscaler
  - name: default-storage-class

What cloud provider are you running on?

OpenStack

What operating system are you running in your cluster?

Flatcar Linux

Additional information

This does not seem to happen consistently: we had several runs without any problems, but in others the very first attempt resulted in the orphaned member. We're still investigating and can only guess at the problem's source.

Our best guess is that removing and re-adding etcd members overloads the etcd cluster (together with the NotReady nodes), so the member deletion never gets triggered, leaving orphaned members behind.

cc @toschneck

@nicolaiort-datev added the kind/bug and sig/cluster-management labels Mar 10, 2025
@ahmedwaleedmalik
Member

ahmedwaleedmalik commented Mar 20, 2025

For #3577, we have added an improvement to KubeOne through #3584 that handles orphaned etcd members in a better way.

etcd members in Kubernetes are unique, and in-place replacements are not recommended. I’d suggest updating the process for rolling upgrades of control-plane VMs to:

  1. Do the following for nodes 1-3 (aka 0-2):
    • Drain the node
    • Remove the node via kubectl (optional)
    • Remove the VM from the terraform configuration
    • Run kubeone apply; KubeOne will then take care of removing the orphaned etcd member and node.

The rest of the process remains the same as you explained in this ticket.
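
A minimal sketch of that suggested flow for one control-plane VM, assuming the same hypothetical node and file names as in the reproduction steps above:

# Drain the node; deleting the Node object is optional.
kubectl drain hakuna-matata-cp-0 --ignore-daemonsets --delete-emptydir-data
# Remove the VM from the Terraform configuration (e.g. drop the instance or lower the count), then:
terraform apply
terraform output -json > tf.json
# kubeone apply removes the orphaned etcd member and the stale Node object.
kubeone apply --manifest kubeone.yaml --tfjson tf.json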
