
failing e2e test jobs after ControlPlaneKubeletLocalMode enabled by default #3154

Open
neolit123 opened this issue Jan 29, 2025 · 4 comments
Labels: area/feature-gates, kind/failing-test, priority/critical-urgent
Milestone: v1.33

neolit123 (Member) commented Jan 29, 2025

I suspect it's the PR that enabled ControlPlaneKubeletLocalMode by default, because the only PR merged after it is cosmetic (a klog change).

Failing jobs:

https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kubeadm-kinder-dryrun-latest/1884502182036770816/build-log.txt

[etcd] Would wait for the new etcd member to join the cluster
[kubelet-check] Waiting for a healthy kubelet at http://127.0.0.1:10248/healthz. This can take up to 4m0s
[kubelet-check] The kubelet is not healthy after 4m0.001133273s

Unfortunately, an error has occurred:
	The HTTP call equal to 'curl -sSL http://127.0.0.1:10248/healthz' returned error: Get "http://127.0.0.1:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused

https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kubeadm-kinder-external-ca-latest/1884261091186315264/build-log.txt

I0129 09:33:28.655624     245 loader.go:402] Config loaded from file:  /etc/kubernetes/kubelet.conf
I0129 09:33:28.656811     245 loader.go:402] Config loaded from file:  /etc/kubernetes/kubelet.conf
I0129 09:33:28.657129     245 kubelet.go:337] [kubelet-start] preserving the crisocket information for the node
I0129 09:33:28.657219     245 patchnode.go:32] [patchnode] Uploading the CRI socket "unix:///run/containerd/containerd.sock" to Node "kinder-external-ca-control-plane-2" as an annotation
...
I0128 15:30:34.908571     219 round_trippers.go:632] "Response" verb="GET" url="https://172.17.0.5:6443/api/v1/nodes/kinder-external-ca-worker-1?timeout=10s" status="" milliseconds=0
I0128 15:30:35.408661     219 round_trippers.go:632] "Response" verb="GET" url="https://172.17.0.5:6443/api/v1/nodes/kinder-external-ca-worker-1?timeout=10s" status="" milliseconds=0
Get "https://172.17.0.5:6443/api/v1/nodes/kinder-external-ca-worker-1?timeout=10s": dial tcp 172.17.0.5:6443: connect: connection refused
error writing CRISocket for this node
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runKubeletWaitBootstrapPhase
	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/kubelet.go:339
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:261
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:450
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run

Both cases need investigation: in one case it seems kubeadm is not reaching the kubelet, and in the other the API server. These don't look like flakes, as the jobs failed consistently N times. These jobs are a bit uncommon, i.e. they perform custom actions like dry-run / external CA.

The regular job is green:

cc @chrischdi

@neolit123 added the priority/critical-urgent, kind/failing-test, and area/feature-gates labels Jan 29, 2025
@neolit123 added this to the v1.33 milestone Jan 29, 2025
neolit123 (Member, Author) commented Jan 29, 2025

@chrischdi, the dedicated fg=false job also started failing, oddly:
https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-control-plane-kubelet-local-mode-latest

edit: actually this one is clearer; this task needs updating:


# task-09-post-upgrade
/bin/bash -c set -x

IP_ADDRESS="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' kinder-control-plane-local-kubelet-mode-lb)"

CMD="docker exec kinder-control-plane-local-kubelet-mode-control-plane-1"
${CMD} grep "server: https://${IP_ADDRESS}:6443" /etc/kubernetes/kubelet.conf || exit 1

CMD="docker exec kinder-control-plane-local-kubelet-mode-control-plane-2"
${CMD} grep "server: https://${IP_ADDRESS}:6443" /etc/kubernetes/kubelet.conf || exit 1

CMD="docker exec kinder-control-plane-local-kubelet-mode-control-plane-3"
${CMD} grep "server: https://${IP_ADDRESS}:6443" /etc/kubernetes/kubelet.conf || exit 1

# Ensure exit status of 0
exit 0


++ docker inspect '--format={{ .NetworkSettings.IPAddress }}' kinder-control-plane-local-kubelet-mode-lb
+ IP_ADDRESS=172.17.0.7
+ CMD='docker exec kinder-control-plane-local-kubelet-mode-control-plane-1'
+ docker exec kinder-control-plane-local-kubelet-mode-control-plane-1 grep 'server: https://172.17.0.7:6443' /etc/kubernetes/kubelet.conf
+ exit 1
 exit status 1
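
For reference, with ControlPlaneKubeletLocalMode enabled the kubelet.conf on each control plane node points at that node's own API server rather than the LB (see the local repro further down), so a local-mode variant of the check could look like this sketch. The per-node docker inspect lookup is my assumption, mirroring the LB lookup above:

for N in 1 2 3; do
  NODE="kinder-control-plane-local-kubelet-mode-control-plane-${N}"
  # sketch: under local mode each control plane kubelet should talk to its own API server
  NODE_IP="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' "${NODE}")"
  docker exec "${NODE}" grep "server: https://${NODE_IP}:6443" /etc/kubernetes/kubelet.conf || exit 1
done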

neolit123 (Member, Author) commented Jan 29, 2025

https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kubeadm-kinder-external-ca-latest/1884261091186315264/build-log.txt


The external CA workflow calls a kinder action, setup-external-ca:
https://github.com/kubernetes/kubeadm/blob/main/kinder/ci/workflows/external-ca-tasks.yaml#L56

It needs to be updated because it uses a naive approach that generates the same kubelet.conf on both worker and control plane nodes:
https://github.com/kubernetes/kubeadm/blob/main/kinder/pkg/cluster/manager/actions/setup-external-ca.go#L111

Without that, the kubelet.conf on worker nodes will point to a non-existent local API server; instead it should point to the LB. The culprit is "kubeadm init phase kubeconfig kubelet --control-plane-endpoint=%s --v=%d", where the CPE should be the LB.
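
As a quick way to confirm the symptom (a sketch mirroring the task-09 grep above; the LB container name is a guess following the cluster's naming pattern):

LB_IP="$(docker inspect --format='{{ .NetworkSettings.IPAddress }}' kinder-external-ca-lb)"
# a worker's kubelet.conf must point at the LB, not at the worker itself
docker exec kinder-external-ca-worker-1 \
  grep "server: https://${LB_IP}:6443" /etc/kubernetes/kubelet.conf \
  || echo "kubelet.conf points elsewhere (the failure seen in the log above)"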

I don't think there is a bigger issue here, i.e. we don't need to patch k/k.

edit: hmm, but --control-plane-endpoint=%s is already the LB IP according to the kinder source, yet the file ends up with 172.17.0.5, which is the worker IP, and there is no API server there at port 6443.

neolit123 (Member, Author) commented Jan 29, 2025

Tested locally:

sudo kubeadm init phase certs ca
sudo kubeadm init phase kubeconfig all --control-plane-endpoint=foo.bar --v=5
sudo cat /etc/kubernetes/kubelet.conf | grep server
    server: https://192.168.0.101:6443

So that's a regression: the server should have been https://foo.bar:6443. We need to think about how the kubelet local mode will continue to respect the user-provided ClusterConfiguration.controlPlaneEndpoint or the flag.
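
For completeness, the same expectation holds when the endpoint comes from the config file instead of the flag; a sketch (assumes the CA generated by the certs ca step above is still in place):

cat <<'EOF' >kubeadm.yaml
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
controlPlaneEndpoint: "foo.bar:6443"
EOF
sudo kubeadm init phase kubeconfig kubelet --config=kubeadm.yaml
sudo grep server /etc/kubernetes/kubelet.conf   # expected: server: https://foo.bar:6443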

I will send a revert PR for

until we fix all these issues.

edit: here it is:

neolit123 (Member, Author) commented Jan 29, 2025

https://storage.googleapis.com/kubernetes-ci-logs/logs/ci-kubernetes-e2e-kubeadm-kinder-external-ca-latest/1884261091186315264/build-log.txt

error writing CRISocket for this node
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runKubeletWaitBootstrapPhase
	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/kubelet.go:339

This issue seems to be that runKubeletWaitBootstrapPhase assumes there is a real kubelet running:
https://github.com/kubernetes/kubernetes/blob/3bc8f01c74e80cb85e6f3813db1b410adba22bfe/cmd/kubeadm/app/cmd/phases/join/kubelet.go#L285
yet during a join dry-run, one is never started:
https://github.com/kubernetes/kubernetes/blob/3bc8f01c74e80cb85e6f3813db1b410adba22bfe/cmd/kubeadm/app/cmd/phases/join/kubelet.go#L258

Perhaps we should wrap the waiting (a sketch; dryRun and waitForKubelet are placeholders for whatever the phase actually uses):

if dryRun {
	// print what would happen instead of blocking on /healthz
	fmt.Println("[dryrun] Would wait for a healthy kubelet")
} else {
	// actually wait for the kubelet to come up
	if err := waitForKubelet(); err != nil {
		return err
	}
}
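
For reference, that would match how other phases already behave under dry-run, e.g. the "[etcd] Would wait for the new etcd member to join the cluster" line in the dry-run log above.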
