diff --git a/README.md b/README.md index aacd18861..7ed9992a9 100644 --- a/README.md +++ b/README.md @@ -12,10 +12,12 @@ Now it is running as a [Kubernetes Addon](https://github.com/kubernetes/kubernetes/tree/master/cluster/addons) enabled by default in the GKE cluster. It is also enabled by default in AKS as part of the [AKS Linux Extension](https://learn.microsoft.com/en-us/azure/aks/faq#what-is-the-purpose-of-the-aks-linux-extension-i-see-installed-on-my-linux-vmss-instances). + # Background There are tons of node problems that could possibly affect the pods running on the node, such as: + * Infrastructure daemon issues: ntp service down; * Hardware issues: Bad CPU, memory or disk; * Kernel issues: Kernel deadlock, corrupted file system; @@ -34,6 +36,7 @@ layers. Once upstream layers have visibility to those problems, we can discuss t node-problem-detector uses `Event` and `NodeCondition` to report problems to apiserver. + * `NodeCondition`: Permanent problem that makes the node unavailable for pods should be reported as `NodeCondition`. * `Event`: Temporary problem that has limited impact on pod but is informative @@ -45,6 +48,7 @@ A problem daemon is a sub-daemon of node-problem-detector. It monitors specific kinds of node problems and reports them to node-problem-detector. A problem daemon could be: + * A tiny daemon designed for dedicated Kubernetes use-cases. * An existing node health monitoring daemon integrated with node-problem-detector. 
@@ -61,10 +65,10 @@ List of supported problem daemons types: | Problem Daemon Types | NodeCondition | Description | Configs | Disabling Build Tag | |----------------|:---------------:|:------------|:--------|:--------------------| -| [SystemLogMonitor](https://github.com/kubernetes/node-problem-detector/tree/master/pkg/systemlogmonitor) | KernelDeadlock ReadonlyFilesystem FrequentKubeletRestart FrequentDockerRestart FrequentContainerdRestart | A system log monitor monitors system log and reports problems and metrics according to predefined rules. | [filelog](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor-filelog.json), [kmsg](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json), [kernel](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor-counter.json) [abrt](https://github.com/kubernetes/node-problem-detector/blob/master/config/abrt-adaptor.json) [systemd](https://github.com/kubernetes/node-problem-detector/blob/master/config/systemd-monitor-counter.json) | disable_system_log_monitor +| [SystemLogMonitor](https://github.com/kubernetes/node-problem-detector/tree/master/pkg/systemlogmonitor) | KernelDeadlock ReadonlyFilesystem FrequentKubeletRestart FrequentContainerdRestart | A system log monitor monitors system log and reports problems and metrics according to predefined rules. 
| [filelog](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor-filelog.json), [kmsg](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json), [kernel](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor-counter.json) [abrt](https://github.com/kubernetes/node-problem-detector/blob/master/config/abrt-adaptor.json) [systemd](https://github.com/kubernetes/node-problem-detector/blob/master/config/systemd-monitor-counter.json) | disable_system_log_monitor | [SystemStatsMonitor](https://github.com/kubernetes/node-problem-detector/tree/master/pkg/systemstatsmonitor) | None(Could be added in the future) | A system stats monitor for node-problem-detector to collect various health-related system stats as metrics. See the proposal [here](https://docs.google.com/document/d/1SeaUz6kBavI283Dq8GBpoEUDrHA2a795xtw0OvjM568/edit). | [system-stats-monitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/system-stats-monitor.json) | disable_system_stats_monitor | [CustomPluginMonitor](https://github.com/kubernetes/node-problem-detector/tree/master/pkg/custompluginmonitor) | On-demand(According to users configuration), existing example: NTPProblem | A custom plugin monitor for node-problem-detector to invoke and check various node problems with user-defined check scripts. See the proposal [here](https://docs.google.com/document/d/1jK_5YloSYtboj-DtfjmYKxfNnUxCAvohLnsH5aGCAYQ/edit#). | [example](https://github.com/kubernetes/node-problem-detector/blob/4ad49bbd84b8ced45ac825eac01ec93d9235935e/config/custom-plugin-monitor.json) | disable_custom_plugin_monitor -| [HealthChecker](https://github.com/kubernetes/node-problem-detector/tree/master/pkg/healthchecker) | KubeletUnhealthy ContainerRuntimeUnhealthy| A health checker for node-problem-detector to check kubelet and container runtime health. 
| [kubelet](https://github.com/kubernetes/node-problem-detector/blob/master/config/health-checker-kubelet.json) [docker](https://github.com/kubernetes/node-problem-detector/blob/master/config/health-checker-docker.json) [containerd](https://github.com/kubernetes/node-problem-detector/blob/master/config/health-checker-containerd.json) | +| [HealthChecker](https://github.com/kubernetes/node-problem-detector/tree/master/pkg/healthchecker) | KubeletUnhealthy ContainerRuntimeUnhealthy| A health checker for node-problem-detector to check kubelet and container runtime health. | [kubelet](https://github.com/kubernetes/node-problem-detector/blob/master/config/health-checker-kubelet.json) [containerd](https://github.com/kubernetes/node-problem-detector/blob/master/config/health-checker-containerd.json) | # Exporter @@ -105,7 +109,6 @@ certain backends. Some of them can be disabled at compile-time using a build tag Node problem detector will start a separate custom plugin monitor for each configuration. You can use different custom plugin monitors to monitor different node problems. - #### For Health Checkers Health checkers are configured as custom plugins, using the config/health-checker-*.json config files. @@ -118,9 +121,11 @@ connects the apiserver. This is ignored if `--enable-k8s-exporter` is `false`. [`source`](https://github.com/kubernetes/heapster/blob/master/docs/source-configuration.md#kubernetes) flag of [Heapster](https://github.com/kubernetes/heapster). For example, to run without auth, use the following config: + ``` http://APISERVER_IP:APISERVER_PORT?inClusterConfig=false ``` + Refer to [heapster docs](https://github.com/kubernetes/heapster/blob/master/docs/source-configuration.md#kubernetes) for a complete list of available options. * `--address`: The address to bind the node problem detector server. * `--port`: The port to bind the node problem detector server. Use 0 to disable. 
@@ -149,7 +154,7 @@ For example, to run without auth, use the following config: * Run `make` in the top directory. It will: * Build the binary. - * Build the docker image. The binary and `config/` are copied into the docker image. + * Build the container image. The binary and `config/` are copied into the container image. If you do not need certain categories of problem daemons, you could choose to disable them at compilation time. This is the best way of keeping your node-problem-detector runtime compact without unnecessary code (e.g. global @@ -165,7 +170,7 @@ to see how to disable each problem daemon during compilation time. ## Push Image -`make push` uploads the docker image to a registry. By default, the image will be uploaded to +`make push` uploads the container image to a registry. By default, the image will be uploaded to `staging-k8s.gcr.io`. It's easy to modify the `Makefile` to push the image to another registry. @@ -198,6 +203,7 @@ To run node-problem-detector standalone, you should set `inClusterConfig` to `fa teach node-problem-detector how to access apiserver with `apiserver-override`. To run node-problem-detector standalone with an insecure apiserver connection: + ``` node-problem-detector --apiserver-override=http://APISERVER_IP:APISERVER_INSECURE_PORT?inClusterConfig=false ``` @@ -247,21 +253,23 @@ You can try node-problem-detector in a running cluster by injecting messages to When adding new rules or developing node-problem-detector, it is probably easier to test it on the local workstation in the standalone mode. For the API server, an easy way is to use ```kubectl proxy``` to make a running cluster's API server available locally. You will get some errors because your local workstation is not recognized by the API server. But you should still be able to test your new rules regardless. For example, to test [KernelMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json) rules: + 1. 
```make``` (build node-problem-detector locally) 2. ```kubectl proxy --port=8080``` (make a running cluster's API server available locally) 3. Update [KernelMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json)'s ```logPath``` to your local kernel log directory. For example, on some Linux systems, it is ```/run/log/journal``` instead of ```/var/log/journal```. 3. ```./bin/node-problem-detector --logtostderr --apiserver-override=http://127.0.0.1:8080?inClusterConfig=false --config.system-log-monitor=config/kernel-monitor.json --config.system-stats-monitor=config/system-stats-monitor.json --port=20256 --prometheus-port=20257``` (or point to any API server address:port and Prometheus port) 4. ```sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg"``` 5. You can see ```KernelOops``` event in the node-problem-detector log. -6. ```sudo sh -c "echo 'kernel: INFO: task docker:20744 blocked for more than 120 seconds.' >> /dev/kmsg"``` -7. You can see ```DockerHung``` event and condition in the node-problem-detector log. -8. You can see ```DockerHung``` condition at [http://127.0.0.1:20256/conditions](http://127.0.0.1:20256/conditions). +6. ```sudo sh -c "echo 'kernel: INFO: task foo:20744 blocked for more than 120 seconds.' >> /dev/kmsg"``` +7. You can see ```TaskHung``` event and condition in the node-problem-detector log. +8. You can see ```TaskHung``` condition at [http://127.0.0.1:20256/conditions](http://127.0.0.1:20256/conditions). 9. You can see disk-related system metrics in Prometheus format at [http://127.0.0.1:20257/metrics](http://127.0.0.1:20257/metrics). **Note**: -- You can see more rule examples under [test/kernel_log_generator/problems](https://github.com/kubernetes/node-problem-detector/tree/master/test/kernel_log_generator/problems). 
-- For [KernelMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json) message injection, all messages should have ```kernel: ``` prefix (also note there is a space after ```:```); or use [generator.sh](https://github.com/kubernetes/node-problem-detector/blob/master/test/kernel_log_generator/generator.sh). -- To inject other logs into journald like systemd logs, use ```echo 'Some systemd message' | systemd-cat -t systemd```. + +* You can see more rule examples under [test/kernel_log_generator/problems](https://github.com/kubernetes/node-problem-detector/tree/master/test/kernel_log_generator/problems). +* For [KernelMonitor](https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor.json) message injection, all messages should have ```kernel: ``` prefix (also note there is a space after ```:```); or use [generator.sh](https://github.com/kubernetes/node-problem-detector/blob/master/test/kernel_log_generator/generator.sh). +* To inject other logs into journald like systemd logs, use ```echo 'Some systemd message' | systemd-cat -t systemd```. ## Dependency Management @@ -295,6 +303,7 @@ Kubernetes cluster to a healthy state. The following remedy systems exist: NPD is tested via unit tests, [NPD e2e tests](https://github.com/kubernetes/node-problem-detector/blob/master/test/e2e/README.md), Kubernetes e2e tests and Kubernetes nodes e2e tests. Prow handles the [pre-submit tests](https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/node-problem-detector/node-problem-detector-presubmits.yaml) and [CI tests](https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/node-problem-detector/node-problem-detector-ci.yaml). CI test results can be found below: + 1. [Unit tests](https://testgrid.k8s.io/sig-node-node-problem-detector#ci-npd-test) 2. [NPD e2e tests](https://testgrid.k8s.io/sig-node-node-problem-detector#ci-npd-e2e-test) 3.
[Kubernetes e2e tests](https://testgrid.k8s.io/sig-node-node-problem-detector#ci-npd-e2e-kubernetes-gce-gci) diff --git a/cmd/healthchecker/options/options.go b/cmd/healthchecker/options/options.go index 5711fc324..fe9e04168 100644 --- a/cmd/healthchecker/options/options.go +++ b/cmd/healthchecker/options/options.go @@ -49,15 +49,15 @@ type HealthCheckerOptions struct { // AddFlags adds health checker command line options to pflag. func (hco *HealthCheckerOptions) AddFlags(fs *pflag.FlagSet) { fs.StringVar(&hco.Component, "component", types.KubeletComponent, - "The component to check health for. Supports kubelet, docker, kube-proxy, and cri") + "The component to check health for. Supports kubelet, kube-proxy, and cri") // Deprecated: For backward compatibility on linux environment. Going forward "service" will be used instead of systemd-service if runtime.GOOS == "linux" { fs.MarkDeprecated("systemd-service", "please use --service flag instead") fs.StringVar(&hco.Service, "systemd-service", "", - "The underlying service responsible for the component. Set to the corresponding component for docker and kubelet, containerd for cri.") + "The underlying service responsible for the component. Set to the corresponding component for kubelet, containerd for cri.") } fs.StringVar(&hco.Service, "service", "", - "The underlying service responsible for the component. Set to the corresponding component for docker and kubelet, containerd for cri.") + "The underlying service responsible for the component. Set to the corresponding component for kubelet, containerd for cri.") fs.BoolVar(&hco.EnableRepair, "enable-repair", true, "Flag to enable/disable repair attempt for the component.") fs.StringVar(&hco.CriCtlPath, "crictl-path", types.DefaultCriCtl, "The path to the crictl binary. This is used to check health of cri component.") @@ -79,9 +79,8 @@ func (hco *HealthCheckerOptions) AddFlags(fs *pflag.FlagSet) { // Returns error if invalid, nil otherwise. 
func (hco *HealthCheckerOptions) IsValid() error { // Make sure the component specified is valid. - if hco.Component != types.KubeletComponent && hco.Component != types.DockerComponent && - hco.Component != types.CRIComponent && hco.Component != types.KubeProxyComponent { - return fmt.Errorf("the component specified is not supported. Supported components are : ") + if hco.Component != types.KubeletComponent && hco.Component != types.CRIComponent && hco.Component != types.KubeProxyComponent { + return fmt.Errorf("the component specified is not supported. Supported components are : ") } // Make sure the service is specified if repair is enabled. if hco.EnableRepair && hco.Service == "" { diff --git a/cmd/logcounter/options/options.go b/cmd/logcounter/options/options.go index 07fbea26d..af4f2bc4a 100644 --- a/cmd/logcounter/options/options.go +++ b/cmd/logcounter/options/options.go @@ -40,7 +40,7 @@ type LogCounterOptions struct { // AddFlags adds log counter command line options to pflag. 
func (fedo *LogCounterOptions) AddFlags(fs *pflag.FlagSet) { - fs.StringVar(&fedo.JournaldSource, "journald-source", "", "The source configuration of journald, e.g., kernel, kubelet, dockerd, etc") + fs.StringVar(&fedo.JournaldSource, "journald-source", "", "The source configuration of journald, e.g., kernel, kubelet, etc") fs.StringVar(&fedo.LogPath, "log-path", "", "The log path that log watcher looks up") fs.StringVar(&fedo.Lookback, "lookback", "", "The time log watcher looks up") fs.StringVar(&fedo.Delay, "delay", "", diff --git a/config/docker-monitor-counter.json b/config/docker-monitor-counter.json deleted file mode 100644 index fcbd66e37..000000000 --- a/config/docker-monitor-counter.json +++ /dev/null @@ -1,33 +0,0 @@ -{ - "plugin": "custom", - "pluginConfig": { - "invoke_interval": "5m", - "timeout": "1m", - "max_output_length": 80, - "concurrency": 1 - }, - "source": "docker-monitor", - "conditions": [ - { - "type": "CorruptDockerOverlay2", - "reason": "NoCorruptDockerOverlay2", - "message": "docker overlay2 is functioning properly" - } - ], - "rules": [ - { - "type": "permanent", - "condition": "CorruptDockerOverlay2", - "reason": "CorruptDockerOverlay2", - "path": "/home/kubernetes/bin/log-counter", - "args": [ - "--journald-source=dockerd", - "--log-path=/var/log/journal", - "--lookback=5m", - "--count=10", - "--pattern=returned error: readlink /var/lib/docker/overlay2.*: invalid argument.*" - ], - "timeout": "1m" - } - ] -} diff --git a/config/docker-monitor-filelog.json b/config/docker-monitor-filelog.json deleted file mode 100644 index db504cac7..000000000 --- a/config/docker-monitor-filelog.json +++ /dev/null @@ -1,20 +0,0 @@ -{ - "plugin": "filelog", - "pluginConfig": { - "timestamp": "^time=\"(\\S*)\"", - "message": "msg=\"([^\n]*)\"", - "timestampFormat": "2006-01-02T15:04:05.999999999-07:00" - }, - "logPath": "/var/log/docker.log", - "lookback": "5m", - "bufferSize": 10, - "source": "docker-monitor", - "conditions": [], - "rules": [ - { - 
"type": "temporary", - "reason": "CorruptDockerImage", - "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*" - } - ] -} diff --git a/config/docker-monitor.json b/config/docker-monitor.json deleted file mode 100644 index 27435be26..000000000 --- a/config/docker-monitor.json +++ /dev/null @@ -1,36 +0,0 @@ -{ - "plugin": "journald", - "pluginConfig": { - "source": "dockerd" - }, - "logPath": "/var/log/journal", - "lookback": "5m", - "bufferSize": 10, - "source": "docker-monitor", - "metricsReporting": true, - "conditions": [ - { - "type": "CorruptDockerOverlay2", - "reason": "NoCorruptDockerOverlay2", - "message": "docker overlay2 is functioning properly" - } - ], - "rules": [ - { - "type": "temporary", - "reason": "CorruptDockerImage", - "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*" - }, - { - "type": "permanent", - "condition": "CorruptDockerOverlay2", - "reason": "CorruptDockerOverlay2", - "pattern": "returned error: readlink /var/lib/docker/overlay2.*: invalid argument.*" - }, - { - "type": "temporary", - "reason": "DockerContainerStartupFailure", - "pattern": "OCI runtime start failed: container process is already dead: unknown" - } - ] -} diff --git a/config/health-checker-docker.json b/config/health-checker-docker.json deleted file mode 100644 index ab9e102d4..000000000 --- a/config/health-checker-docker.json +++ /dev/null @@ -1,33 +0,0 @@ -{ - "plugin": "custom", - "pluginConfig": { - "invoke_interval": "10s", - "timeout": "3m", - "max_output_length": 80, - "concurrency": 1 - }, - "source": "health-checker", - "metricsReporting": true, - "conditions": [ - { - "type": "ContainerRuntimeUnhealthy", - "reason": "ContainerRuntimeIsHealthy", - "message": "Container runtime on the node is functioning properly" - } - ], - "rules": [ - { - "type": "permanent", - 
"condition": "ContainerRuntimeUnhealthy", - "reason": "DockerUnhealthy", - "path": "/home/kubernetes/bin/health-checker", - "args": [ - "--component=docker", - "--enable-repair=true", - "--cooldown-time=2m", - "--health-check-timeout=60s" - ], - "timeout": "3m" - } - ] -} diff --git a/config/kernel-monitor-filelog.json b/config/kernel-monitor-filelog.json index 1f7d39ac1..c9f0a417d 100644 --- a/config/kernel-monitor-filelog.json +++ b/config/kernel-monitor-filelog.json @@ -41,12 +41,6 @@ "type": "temporary", "reason": "KernelOops", "pattern": "divide error: 0000 \\[#\\d+\\] SMP" - }, - { - "type": "permanent", - "condition": "KernelDeadlock", - "reason": "DockerHung", - "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\." } ] } diff --git a/config/kernel-monitor.json b/config/kernel-monitor.json index a132e47a3..e5adce466 100644 --- a/config/kernel-monitor.json +++ b/config/kernel-monitor.json @@ -89,12 +89,6 @@ "condition": "CperHardwareErrorFatal", "reason": "CperHardwareErrorFatal", "pattern": ".*\\[Hardware Error\\]: event severity: fatal$" - }, - { - "type": "permanent", - "condition": "KernelDeadlock", - "reason": "DockerHung", - "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\." 
} ] } diff --git a/config/systemd-monitor-counter.json b/config/systemd-monitor-counter.json index aaed32f8d..94fcaabe4 100644 --- a/config/systemd-monitor-counter.json +++ b/config/systemd-monitor-counter.json @@ -14,11 +14,6 @@ "reason": "NoFrequentKubeletRestart", "message": "kubelet is functioning properly" }, - { - "type": "FrequentDockerRestart", - "reason": "NoFrequentDockerRestart", - "message": "docker is functioning properly" - }, { "type": "FrequentContainerdRestart", "reason": "NoFrequentContainerdRestart", @@ -42,21 +37,6 @@ ], "timeout": "1m" }, - { - "type": "permanent", - "condition": "FrequentDockerRestart", - "reason": "FrequentDockerRestart", - "path": "/home/kubernetes/bin/log-counter", - "args": [ - "--journald-source=systemd", - "--log-path=/var/log/journal", - "--lookback=20m", - "--count=5", - "--pattern=Starting (Docker Application Container Engine|docker.service|docker.service - Docker Application Container Engine)...", - "--revert-pattern=Stopping (Docker Application Container Engine|docker.service|docker.service - Docker Application Container Engine)..." - ], - "timeout": "1m" - }, { "type": "permanent", "condition": "FrequentContainerdRestart", diff --git a/config/systemd-monitor.json b/config/systemd-monitor.json index e73785e38..d995eee05 100644 --- a/config/systemd-monitor.json +++ b/config/systemd-monitor.json @@ -15,11 +15,6 @@ "reason": "KubeletStart", "pattern": "Started (Kubernetes kubelet|kubelet.service|kubelet.service - Kubernetes kubelet)." }, - { - "type": "temporary", - "reason": "DockerStart", - "pattern": "Starting (Docker Application Container Engine|docker.service|docker.service - Docker Application Container Engine)..." 
- }, { "type": "temporary", "reason": "ContainerdStart", diff --git a/config/systemd/node-problem-detector-metric-only.service b/config/systemd/node-problem-detector-metric-only.service index 76cd5ad8a..3d127827a 100644 --- a/config/systemd/node-problem-detector-metric-only.service +++ b/config/systemd/node-problem-detector-metric-only.service @@ -8,7 +8,7 @@ Restart=always RestartSec=10 ExecStart=/home/kubernetes/bin/node-problem-detector --v=2 --logtostderr --enable-k8s-exporter=false \ --exporter.stackdriver=/home/kubernetes/node-problem-detector/config/exporter/stackdriver-exporter.json \ - --config.system-log-monitor=/home/kubernetes/node-problem-detector/config/kernel-monitor.json,/home/kubernetes/node-problem-detector/config/readonly-monitor.json,/home/kubernetes/node-problem-detector/config/docker-monitor.json,/home/kubernetes/node-problem-detector/config/systemd-monitor.json \ + --config.system-log-monitor=/home/kubernetes/node-problem-detector/config/kernel-monitor.json,/home/kubernetes/node-problem-detector/config/readonly-monitor.json,/home/kubernetes/node-problem-detector/config/systemd-monitor.json \ --config.custom-plugin-monitor=/home/kubernetes/node-problem-detector/config/kernel-monitor-counter.json,/home/kubernetes/node-problem-detector/config/systemd-monitor-counter.json \ --config.system-stats-monitor=/home/kubernetes/node-problem-detector/config/system-stats-monitor.json,/home/kubernetes/node-problem-detector/config/net-cgroup-system-stats-monitor.json diff --git a/config/windows-health-checker-docker.json b/config/windows-health-checker-docker.json deleted file mode 100644 index a41054a81..000000000 --- a/config/windows-health-checker-docker.json +++ /dev/null @@ -1,34 +0,0 @@ -{ - "plugin": "custom", - "pluginConfig": { - "invoke_interval": "10s", - "timeout": "3m", - "max_output_length": 80, - "concurrency": 1 - }, - "source": "health-checker", - "metricsReporting": true, - "conditions": [ - { - "type": "ContainerRuntimeUnhealthy", - 
"reason": "ContainerRuntimeIsHealthy", - "message": "Container runtime on the node is functioning properly" - } - ], - "rules": [ - { - "type": "permanent", - "condition": "ContainerRuntimeUnhealthy", - "reason": "DockerUnhealthy", - "path": "C:\\etc\\kubernetes\\node\\bin\\health-checker.exe", - "args": [ - "--component=docker", - "--enable-repair=true", - "--cooldown-time=2m", - "--health-check-timeout=60s" - ], - "timeout": "3m" - } - ] - } - \ No newline at end of file diff --git a/deployment/node-problem-detector-config.yaml b/deployment/node-problem-detector-config.yaml index e80aac8b4..ece6655f0 100644 --- a/deployment/node-problem-detector-config.yaml +++ b/deployment/node-problem-detector-config.yaml @@ -50,12 +50,6 @@ data: "reason": "MemoryReadError", "pattern": "CE memory read error .*" }, - { - "type": "permanent", - "condition": "KernelDeadlock", - "reason": "DockerHung", - "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\." - }, { "type": "permanent", "condition": "ReadonlyFilesystem", @@ -88,25 +82,6 @@ data: } ] } - docker-monitor.json: | - { - "plugin": "journald", - "pluginConfig": { - "source": "dockerd" - }, - "logPath": "/var/log/journal", - "lookback": "5m", - "bufferSize": 10, - "source": "docker-monitor", - "conditions": [], - "rules": [ - { - "type": "temporary", - "reason": "CorruptDockerImage", - "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*" - } - ] - } kind: ConfigMap metadata: name: node-problem-detector-config diff --git a/deployment/node-problem-detector-healthchecker.yaml b/deployment/node-problem-detector-healthchecker.yaml index f7f4fc50f..cdbb7a306 100644 --- a/deployment/node-problem-detector-healthchecker.yaml +++ b/deployment/node-problem-detector-healthchecker.yaml @@ -28,7 +28,7 @@ spec: command: - /node-problem-detector - --logtostderr - - 
--config.system-log-monitor=/config/kernel-monitor.json,/config/readonly-monitor.json,/config/docker-monitor.json + - --config.system-log-monitor=/config/kernel-monitor.json,/config/readonly-monitor.json - --config.custom-plugin-monitor=/config/health-checker-kubelet.json image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19 resources: @@ -88,8 +88,6 @@ spec: path: kernel-monitor.json - key: readonly-monitor.json path: readonly-monitor.json - - key: docker-monitor.json - path: docker-monitor.json - name: machine-id hostPath: path: /etc/machine-id diff --git a/deployment/node-problem-detector.yaml b/deployment/node-problem-detector.yaml index a3614e96b..8e732e569 100644 --- a/deployment/node-problem-detector.yaml +++ b/deployment/node-problem-detector.yaml @@ -28,7 +28,7 @@ spec: command: - /node-problem-detector - --logtostderr - - --config.system-log-monitor=/config/kernel-monitor.json,/config/readonly-monitor.json,/config/docker-monitor.json + - --config.system-log-monitor=/config/kernel-monitor.json,/config/readonly-monitor.json image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19 resources: limits: @@ -80,8 +80,6 @@ spec: path: kernel-monitor.json - key: readonly-monitor.json path: readonly-monitor.json - - key: docker-monitor.json - path: docker-monitor.json tolerations: - effect: NoSchedule operator: Exists diff --git a/pkg/healthchecker/health_checker.go b/pkg/healthchecker/health_checker.go index 94b718c8e..e278cf746 100644 --- a/pkg/healthchecker/health_checker.go +++ b/pkg/healthchecker/health_checker.go @@ -141,13 +141,6 @@ func getHealthCheckFunc(hco *options.HealthCheckerOptions) func() (bool, error) return healthCheckEndpointOKFunc(types.KubeletHealthCheckEndpoint(), hco.HealthCheckTimeout) case types.KubeProxyComponent: return healthCheckEndpointOKFunc(types.KubeProxyHealthCheckEndpoint(), hco.HealthCheckTimeout) - case types.DockerComponent: - return func() (bool, error) { - if _, err := 
execCommand(hco.HealthCheckTimeout, getDockerPath(), "ps"); err != nil { - return false, nil - } - return true, nil - } case types.CRIComponent: return func() (bool, error) { _, err := execCommand( diff --git a/pkg/healthchecker/health_checker_darwin.go b/pkg/healthchecker/health_checker_darwin.go index 19f87731c..d14107eb6 100644 --- a/pkg/healthchecker/health_checker_darwin.go +++ b/pkg/healthchecker/health_checker_darwin.go @@ -42,8 +42,3 @@ func checkForPattern(service, logStartTime, logPattern string, logCountThreshold klog.Fatalf("checkForPattern is not supported in %s", runtime.GOOS) return false, nil } - -func getDockerPath() string { - klog.Fatalf("getDockerPath is not supported in %s", runtime.GOOS) - return "" -} diff --git a/pkg/healthchecker/health_checker_linux.go b/pkg/healthchecker/health_checker_linux.go index 724f966b3..5636cfbb4 100644 --- a/pkg/healthchecker/health_checker_linux.go +++ b/pkg/healthchecker/health_checker_linux.go @@ -38,7 +38,6 @@ func getUptimeFunc(service string) func() (time.Duration, error) { // RestartSec of systemd and invoke interval of plugin got in sync. The service was repeatedly killed in // activating state and hence ActiveEnterTimestamp was never updated. out, err := execCommand(types.CmdTimeout, "systemctl", "show", service, "--property=InactiveExitTimestamp") - if err != nil { return time.Duration(0), err } @@ -61,19 +60,9 @@ func getRepairFunc(hco *options.HealthCheckerOptions) func() { // indicate if the component restart is due to an administrative plan (restart) // or a system issue that needs repair (kill). // See https://github.com/kubernetes/node-problem-detector/issues/847. - switch hco.Component { - case types.DockerComponent: - // Use "docker ps" for docker health check. Not using crictl for docker to remove - // dependency on the kubelet. 
- return func() { - execCommand(types.CmdTimeout, "pkill", "-SIGUSR1", "dockerd") - execCommand(types.CmdTimeout, "systemctl", "kill", "--kill-who=main", hco.Service) - } - default: - // Just kill the service for all other components - return func() { - execCommand(types.CmdTimeout, "systemctl", "kill", "--kill-who=main", hco.Service) - } + // Just kill the service for all components + return func() { + execCommand(types.CmdTimeout, "systemctl", "kill", "--kill-who=main", hco.Service) } } @@ -100,7 +89,3 @@ func checkForPattern(service, logStartTime, logPattern string, logCountThreshold } return true, nil } - -func getDockerPath() string { - return "docker" -} diff --git a/pkg/healthchecker/health_checker_test.go b/pkg/healthchecker/health_checker_test.go index 03d51c51a..7e7de25ad 100644 --- a/pkg/healthchecker/health_checker_test.go +++ b/pkg/healthchecker/health_checker_test.go @@ -134,10 +134,6 @@ func TestComponentsSupported(t *testing.T) { description: "Kubelet should be supported", component: types.KubeletComponent, }, - { - description: "Docker should be supported", - component: types.DockerComponent, - }, { description: "CRI should be supported", component: types.CRIComponent, @@ -150,8 +146,6 @@ func TestComponentsSupported(t *testing.T) { if checkFunc == nil { t.Errorf("component %v should be supported", tc.component) } - }) } - } diff --git a/pkg/healthchecker/types/types.go b/pkg/healthchecker/types/types.go index e7e7266c2..97405be63 100644 --- a/pkg/healthchecker/types/types.go +++ b/pkg/healthchecker/types/types.go @@ -35,7 +35,6 @@
const ( KubeletComponent = "kubelet" CRIComponent = "cri" - DockerComponent = "docker" ContainerdService = "containerd" KubeProxyComponent = "kube-proxy" @@ -80,12 +79,12 @@ func setKubeEndpoints() { kubeletHealthCheckEndpoint = fmt.Sprintf("http://%s:%s/healthz", hostAddress, kubeletPort) kubeProxyHealthCheckEndpoint = fmt.Sprintf("http://%s:%s/healthz", hostAddress, kubeProxyPort) - } func KubeProxyHealthCheckEndpoint() string { return kubeProxyHealthCheckEndpoint } + func KubeletHealthCheckEndpoint() string { return kubeletHealthCheckEndpoint } diff --git a/pkg/systemlogmonitor/README.md b/pkg/systemlogmonitor/README.md index f86e55279..9c4fb1280 100644 --- a/pkg/systemlogmonitor/README.md +++ b/pkg/systemlogmonitor/README.md @@ -108,9 +108,8 @@ example: ``` # HELP problem_counter Number of times a specific type of problem have occurred. # TYPE problem_counter counter -problem_counter{reason="DockerHung"} 1 +problem_counter{reason="TaskHung"} 1 # HELP problem_gauge Whether a specific type of problem is affecting the node or not. 
 # TYPE problem_gauge gauge
-problem_gauge{condition="KernelDeadlock",reason="DockerHung"} 1
+problem_gauge{condition="KernelDeadlock",reason="TaskHung"} 1
 ```
-
diff --git a/pkg/systemlogmonitor/logwatchers/filelog/log_watcher.go b/pkg/systemlogmonitor/logwatchers/filelog/log_watcher.go
index f0b8259be..976ae268c 100644
--- a/pkg/systemlogmonitor/logwatchers/filelog/log_watcher.go
+++ b/pkg/systemlogmonitor/logwatchers/filelog/log_watcher.go
@@ -134,7 +134,7 @@ func (s *filelogWatcher) watchLoop() {
 }
 
 func (s *filelogWatcher) filterSkipList(line string) bool {
-	for _ , skipItem := range s.cfg.SkipList {
+	for _, skipItem := range s.cfg.SkipList {
 		if strings.Contains(line, skipItem) {
 			return true
 		}
diff --git a/pkg/systemlogmonitor/logwatchers/filelog/log_watcher_test.go b/pkg/systemlogmonitor/logwatchers/filelog/log_watcher_test.go
index c59f86e80..7e52d30bf 100644
--- a/pkg/systemlogmonitor/logwatchers/filelog/log_watcher_test.go
+++ b/pkg/systemlogmonitor/logwatchers/filelog/log_watcher_test.go
@@ -185,22 +185,21 @@ func TestFilterSkipList(t *testing.T) {
 			},
 		},
 	}
-	testcase := []struct{
-		log string
+	testcase := []struct {
+		log    string
 		expect bool
 	}{
 		{
-			log: `Jan 2 03:04:03 kernel: [0.000000] 1`,
+			log:    `Jan 2 03:04:03 kernel: [0.000000] 1`,
 			expect: false,
 		},
 		{
-			log: `Jan 2 03:04:04 audit: [1.000000] 2`,
+			log:    `Jan 2 03:04:04 audit: [1.000000] 2`,
 			expect: true,
 		},
 		{
-			log: `Jan 2 03:04:05 kubelet: [2.000000] 3`,
+			log:    `Jan 2 03:04:05 kubelet: [2.000000] 3`,
 			expect: true,
-
 		},
 	}
 	for i, test := range testcase {
diff --git a/pkg/systemlogmonitor/types/types.go b/pkg/systemlogmonitor/types/types.go
index 75c3e6e57..7d38c065f 100644
--- a/pkg/systemlogmonitor/types/types.go
+++ b/pkg/systemlogmonitor/types/types.go
@@ -23,7 +23,7 @@ import (
 )
 
 // Log is the log item returned by translator. It's very easy to extend this
-// to support other log monitoring, such as docker log monitoring.
+// to support other log monitoring.
 type Log struct {
 	Timestamp time.Time
 	Message   string
diff --git a/pkg/util/metrics/helpers_test.go b/pkg/util/metrics/helpers_test.go
index 548921ccd..772cdbd42 100644
--- a/pkg/util/metrics/helpers_test.go
+++ b/pkg/util/metrics/helpers_test.go
@@ -81,7 +81,7 @@ func TestPrometheusMetricsParsingAndMatching(t *testing.T) {
 				},
 				{
 					Name:   "problem_counter",
-					Labels: map[string]string{"reason": "DockerHung"},
+					Labels: map[string]string{"reason": "TaskHung"},
 				},
 				{
 					Name:   "problem_counter",
diff --git a/pkg/util/metrics/testdata/sample_metrics.txt b/pkg/util/metrics/testdata/sample_metrics.txt
index 2103a13a0..deff2922c 100644
--- a/pkg/util/metrics/testdata/sample_metrics.txt
+++ b/pkg/util/metrics/testdata/sample_metrics.txt
@@ -21,7 +21,7 @@ host_uptime{kernel_version="4.14.127+",os_version="cos 73-11647.217.0"} 81
 problem_counter{reason="ContainerdStart"} 1
 problem_counter{reason="CorruptDockerImage"} 0
 problem_counter{reason="CorruptDockerOverlay2"} 0
-problem_counter{reason="DockerHung"} 0
+problem_counter{reason="TaskHung"} 0
 problem_counter{reason="DockerStart"} 1
 problem_counter{reason="FilesystemIsReadOnly"} 0
 problem_counter{reason="FrequentContainerdRestart"} 0
@@ -35,7 +35,7 @@ problem_counter{reason="UnregisterNetDevice"} 0
 # HELP problem_gauge Whether a specific type of problem is affecting the node or not.
 # TYPE problem_gauge gauge
 problem_gauge{reason="CorruptDockerOverlay2",type="CorruptDockerOverlay2"} 0
-problem_gauge{reason="DockerHung",type="KernelDeadlock"} 0
+problem_gauge{reason="TaskHung",type="KernelDeadlock"} 0
 problem_gauge{reason="FilesystemIsReadOnly",type="ReadonlyFilesystem"} 0
 problem_gauge{reason="FrequentContainerdRestart",type="FrequentContainerdRestart"} 0
 problem_gauge{reason="FrequentDockerRestart",type="FrequentDockerRestart"} 0
diff --git a/test/build.sh b/test/build.sh
index 6a351cf56..b7223267f 100755
--- a/test/build.sh
+++ b/test/build.sh
@@ -21,7 +21,6 @@ set -o errexit
 set -o nounset
 set -o pipefail
 
-
 NPD_STAGING_PATH=${NPD_STAGING_PATH:-"gs://k8s-staging-npd"}
 NPD_STAGING_REGISTRY=${NPD_STAGING_REGISTRY:-"gcr.io/node-problem-detector-staging"}
 PR_ENV_FILENAME=${PR_ENV_FILENAME:-"pr.env"}
@@ -30,7 +29,6 @@ CI_CUSTOM_FLAGS_ENV_FILENAME=${CI_CUSTOM_FLAGS_ENV_FILENAME:-"ci-custom-flags.en
 ROOT_PATH=$(git rev-parse --show-toplevel)
 GCS_URL_PREFIX="https://storage.googleapis.com/"
 
-
 function print-help() {
   echo "Usage: build.sh [flags] [command]"
   echo
@@ -57,7 +55,7 @@ function print-help() {
 
 function get-version() {
   if [ -d .git ]; then
-    echo `git describe --tags --dirty`
+    echo $(git describe --tags --dirty)
   else
     echo "UNKNOWN"
   fi
@@ -75,7 +73,7 @@ function write-env-file() {
     exit 1
   fi
 
- cat > ${ROOT_PATH}/${env_file} <${ROOT_PATH}/${env_file} <> ${ROOT_PATH}/${env_file} <>${ROOT_PATH}/${env_file} < /proc/sys/kernel/hung_task_timeout_secs" disables this message.
-docker D ffff8801a8f2b078 0 20744 1 0x00000000
- ffff8801a8f2ac20 0000000000000082 0000000000012f00 ffff880057a17fd8
- 0000000000012f00 ffff8801a8f2ac20 ffffffff818bb4a0 ffff880057a17d80
- ffffffff818bb4a4 ffff8801a8f2ac20 00000000ffffffff ffffffff818bb4a8
-Call Trace:
- [] ? schedule_preempt_disabled+0x25/0x70
- [] ? __mutex_lock_slowpath+0xd3/0x1c0
- [] ? mutex_lock+0x1b/0x2a
- [] ? copy_net_ns+0x6c/0x130
- [] ? create_new_namespaces+0xf4/0x180
- [] ? copy_namespaces+0x6c/0x90
- [] ? copy_process.part.25+0x966/0x1c30
- [] ? do_fork+0xe1/0x390
- [] ? __alloc_fd+0x7c/0x120
- [] ? stub_clone+0x69/0x90
- [] ? system_call_fast_compare_end+0x10/0x15`
-
-	writeKernelMessageOrDie(dockerHungPattern)
-}
diff --git a/test/kernel_log_generator/problems/docker_hung b/test/kernel_log_generator/problems/docker_hung
deleted file mode 100644
index 3ccdc7609..000000000
--- a/test/kernel_log_generator/problems/docker_hung
+++ /dev/null
@@ -1,19 +0,0 @@
-INFO: task docker:20744 blocked for more than 120 seconds.
- Tainted: G C 3.16.0-4-amd64 #1
-"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
-docker D ffff8801a8f2b078 0 20744 1 0x00000000
- ffff8801a8f2ac20 0000000000000082 0000000000012f00 ffff880057a17fd8
- 0000000000012f00 ffff8801a8f2ac20 ffffffff818bb4a0 ffff880057a17d80
- ffffffff818bb4a4 ffff8801a8f2ac20 00000000ffffffff ffffffff818bb4a8
-Call Trace:
- [] ? schedule_preempt_disabled+0x25/0x70
- [] ? __mutex_lock_slowpath+0xd3/0x1c0
- [] ? mutex_lock+0x1b/0x2a
- [] ? copy_net_ns+0x6c/0x130
- [] ? create_new_namespaces+0xf4/0x180
- [] ? copy_namespaces+0x6c/0x90
- [] ? copy_process.part.25+0x966/0x1c30
- [] ? do_fork+0xe1/0x390
- [] ? __alloc_fd+0x7c/0x120
- [] ? stub_clone+0x69/0x90
- [] ? system_call_fast_compare_end+0x10/0x15