Handle Kubernetes API server failover #3008

Closed · mxey opened this issue May 8, 2018 · 11 comments · Fixed by #3522
Labels
  • good first issue: Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@mxey commented May 8, 2018

Environment
Dashboard version: v1.8.3
Kubernetes version: v1.10.2
Operating system: CentOS Linux release 7.4.1708 (Core)
Steps to reproduce
  1. Have a multi-master Kubernetes cluster
  2. Run dashboard with in-cluster config
  3. Stop one of the API servers
Observed result
  • Dashboard hangs while trying to load cluster resources, until Linux eventually times out the TCP connection.
  • The dashboard pod is not killed and restarted automatically either, because the liveness probe does not exercise the Kubernetes API connection (see the sketch after this list).
  • Even after the TCP timeout and eventual reconnect, the log still repeatedly shows Synchronizer kubernetes-dashboard-key-holder-kube-system exited with error: kubernetes-dashboard-key-holder-kube-system watch ended with timeout (see Extremely dangerous logging #2723 (comment))
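
A minimal sketch of what a liveness check that exercises the API connection could look like, assuming client-go and the standard in-cluster config (the port, path, and handler here are placeholders, not the dashboard's actual endpoints):

```go
// Sketch only: a /healthz handler that pings the API server, so a liveness
// probe pointed at it fails (and kubelet restarts the pod) when the
// connection to the API server is dead.
package main

import (
	"log"
	"net/http"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatalf("in-cluster config: %v", err)
	}
	// A short client timeout avoids hanging until the kernel's TCP timeout.
	config.Timeout = 5 * time.Second

	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatalf("creating client: %v", err)
	}

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		// Asking for the server version exercises the API connection end to end.
		if _, err := client.Discovery().ServerVersion(); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```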
Expected result
@m3co-code

I just experienced the same issue. The dashboard was trying to synchronize in a fast loop (thousands of log entries in one second), consuming a lot of CPU.

@bryk added the good first issue label on May 25, 2018
@vdboor commented Jun 4, 2018

This happens even with a single-master cluster.

Steps to reproduce:

  • Run the dashboard (e.g. with the default configuration from the Helm chart)
  • Restart the API server (find it with kubectl get pods -n kube-system -l component=kube-apiserver)

The logging flood stops when the dashboard pod is deleted/recreated.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 2, 2018
@vdboor commented Sep 3, 2018

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Sep 3, 2018
@maciaszczykm (Member)

/lifecycle frozen

@k8s-ci-robot added the lifecycle/frozen label on Sep 3, 2018
@floreks added the kind/bug label on Dec 13, 2018
@ninlil commented Jan 15, 2019

Same problem.

Running Azure AKS
Kubernetes versions 1.11.4 and 1.11.5
Dashboard image: k8s.gcr.io/kubernetes-dashboard-amd64:v1.10.0

When Microsoft restarts the managed API server (which happens occasionally), the dashboard starts logging around 450 lines every 5 minutes.

Any update on actually getting the reconnect solved?

@floreks (Member) commented Jan 15, 2019

@ninlil it will be fixed in v2. For now, the workaround is to delete the Dashboard pod after the API server restarts.

@WhoAteDaCake

Still happening in 1.10.1? I was flooded with gigabytes of logs in no time. Deleting the dashboard pod did not solve it.

@floreks (Member) commented Jun 11, 2019

Can you upload the beginning of the log? The first 30 minutes, let's say.

@spingel commented Jul 26, 2019

These are the initial log entries that we saw when we encountered the issue:

2019/07/26 02:28:58 Synchronizer kubernetes-dashboard-key-holder-kube-system exited with error: unexpected object: &Secret{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Data:map[string][]byte{},Type:,StringData:map[string]string{},}

2019/07/26 02:29:00 Restarting synchronizer: kubernetes-dashboard-key-holder-kube-system.

2019/07/26 02:29:00 Starting secret synchronizer for kubernetes-dashboard-key-holder in namespace kube-system

2019/07/26 02:29:00 Synchronizer kubernetes-dashboard-key-holder-kube-system exited with error: kubernetes-dashboard-key-holder-kube-system watch ended with timeout

2019/07/26 02:29:02 Restarting synchronizer: kubernetes-dashboard-key-holder-kube-system.

2019/07/26 02:29:02 Starting secret synchronizer for kubernetes-dashboard-key-holder in namespace kube-system

2019/07/26 02:29:02 Synchronizer kubernetes-dashboard-key-holder-kube-system exited with error: kubernetes-dashboard-key-holder-kube-system watch ended with timeout

The last three errors repeat every 2 seconds, causing a flood of log entries.
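
For comparison, a rough sketch of how such a restart loop could be rate-limited with exponential backoff instead of retrying every ~2 seconds. It uses the wait package from k8s.io/apimachinery; runWatch is a placeholder, and this is not the dashboard's actual synchronizer code:

```go
// Sketch only: restart a failed watch with exponential backoff instead of a
// fixed 2-second loop. runWatch stands in for the synchronizer's watch.
package main

import (
	"errors"
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func runWatch() error {
	// Placeholder: block on the watch and return an error when it ends,
	// e.g. "watch ended with timeout".
	return errors.New("watch ended with timeout")
}

func main() {
	backoff := wait.Backoff{
		Duration: 2 * time.Second, // first retry after 2s
		Factor:   2.0,             // then 4s, 8s, 16s, ...
		Jitter:   0.1,
		Steps:    8, // stop retrying after 8 attempts
	}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if werr := runWatch(); werr != nil {
			log.Printf("Restarting synchronizer after error: %v", werr)
			return false, nil // not done yet; retry after the next backoff step
		}
		return true, nil
	})
	if err != nil {
		// Retries exhausted; give up instead of flooding the log forever.
		log.Fatalf("synchronizer did not recover: %v", err)
	}
}
```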

@floreks (Member) commented Jul 26, 2019

It's no longer an issue with v2, as it forces a restart of the pod after a few retries.
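
Roughly what "forces a restart of the pod after a few retries" could look like; the names and the retry limit below are assumptions for illustration, not the dashboard's actual implementation:

```go
// Sketch only: exit the process once the synchronizer has failed several
// times in a row, so kubelet restarts the container. syncOnce and the limit
// are assumed placeholders.
package main

import (
	"errors"
	"log"
	"time"
)

const maxConsecutiveFailures = 5 // assumed limit

// syncOnce stands in for one run of the secret synchronizer's watch.
func syncOnce() error {
	return errors.New("watch ended with timeout") // placeholder
}

func main() {
	failures := 0
	for {
		if err := syncOnce(); err != nil {
			failures++
			log.Printf("Synchronizer exited with error (%d/%d): %v", failures, maxConsecutiveFailures, err)
			if failures >= maxConsecutiveFailures {
				// Exiting here lets kubelet restart the pod instead of
				// letting the loop flood the logs indefinitely.
				log.Fatal("too many consecutive synchronizer failures, exiting")
			}
			time.Sleep(2 * time.Second)
			continue
		}
		failures = 0 // reset after a successful run
	}
}
```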
