
NIC Pod fails to bind to unix socket when NGINX master process exits unexpectedly and does not clean up #4604

Open
shaun-nx opened this issue Nov 2, 2023 · 5 comments
Assignees
Labels
backlog (Pull requests/issues that are backlog items), bug (An issue reporting a potential bug), devops (Pull requests/issues related to DevOps), docker (Pull requests that update Docker code), needs triage (An issue that needs to be triaged)

Comments

@shaun-nx
Contributor

shaun-nx commented Nov 2, 2023

Describe the bug
When the NGINX master process exits unexpectedly (e.g. the process is killed using kill -9 <master-process-pid>), system files generated by NGINX are not cleaned up.

This bug outlines the impact of unix socket files in /var/lib/nginx persisting after the NGINX master process exits unexpectedly.

Log output from NGINX Ingress Controller when the master process exits unexpectedly

E1102 09:38:53.243649       1 main.go:501] nginx command exited with an error: signal: killed
I1102 09:38:53.243740       1 main.go:511] Shutting down the controller
I1102 09:38:53.244035       1 main.go:521] Exiting with a status: 1
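
For context on the mechanism, here is a minimal standalone Go sketch (using a hypothetical /tmp path rather than the real /var/lib/nginx sockets) showing why a leftover unix socket file makes a later bind() fail with EADDRINUSE, which matches the "(98: Address already in use)" lines in the full log output further down:

package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	// Hypothetical path standing in for /var/lib/nginx/nginx-config-version.sock.
	const sock = "/tmp/stale-example.sock"
	os.Remove(sock) // start clean in case a previous run left the file behind

	// First "run": bind the unix socket, then close the listener WITHOUT
	// unlinking the socket file. This mimics the NGINX master process being
	// killed with SIGKILL before it can remove its *.sock files.
	l, err := net.Listen("unix", sock)
	if err != nil {
		panic(err)
	}
	l.(*net.UnixListener).SetUnlinkOnClose(false)
	l.Close()

	// Second "run": with the stale file still on disk, bind() fails with
	// EADDRINUSE, matching the "(98: Address already in use)" log lines.
	if _, err := net.Listen("unix", sock); err != nil {
		fmt.Println("bind with stale socket file:", err)
	}

	// Removing the stale file first lets the bind succeed again; this is
	// essentially the cleanup that never happens after kill -9.
	os.Remove(sock)
	if l2, err := net.Listen("unix", sock); err == nil {
		fmt.Println("bind after removing stale socket file: ok")
		l2.Close()
	}
}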

To Reproduce
Steps to reproduce the behavior:

  1. Deploy all the necessary prerequisites outlined in the Installation with Manifests docs.
  2. Deploy the below Deployment manifest, which is configured with a volume of type emptyDir: {} and a volumeMount for /var/lib/nginx
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress
  namespace: nginx-ingress
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-ingress
  template:
    metadata:
      labels:
        app: nginx-ingress
        app.kubernetes.io/name: nginx-ingress
    spec:
      serviceAccountName: nginx-ingress
      automountServiceAccountToken: true
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      volumes:
      - name: nginx-lib
        emptyDir: {}
      containers:
      - image: nginx/nginx-ingress:3.3.1
        imagePullPolicy: IfNotPresent
        name: nginx-ingress
        ports:
        - name: http
          containerPort: 80
        - name: https
          containerPort: 443
        - name: readiness-port
          containerPort: 8081
        - name: prometheus
          containerPort: 9113
        readinessProbe:
          httpGet:
            path: /nginx-ready
            port: readiness-port
          periodSeconds: 1
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
        securityContext:
          allowPrivilegeEscalation: false
          runAsUser: 101 #nginx
          runAsNonRoot: true
          capabilities:
            drop:
            - ALL
            add:
            - NET_BIND_SERVICE
        volumeMounts:
        - mountPath: /var/lib/nginx
          name: nginx-lib
        env:
        - name: POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        args:
          - -nginx-configmaps=$(POD_NAMESPACE)/nginx-config
  3. Attach a debug container to the running NGINX Ingress Controller pod using kubectl debug -it -n <ic-namespace> <ic-pod> --image=busybox:1.28 --target=nginx-ingress
  4. Within the debug container, run ps -ef to get the process ID of the NGINX master process
  5. Stop the NGINX master process using kill -9 <master-process-pid>
  6. View the logs of the NGINX Ingress Controller pod and see NGINX fail to bind to its unix sockets (a consolidated command sketch follows this list).
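
Consolidated command sketch for steps 3-6, assuming the nginx-ingress namespace and app=nginx-ingress label from the manifest above (adjust names to your environment):

# find the Ingress Controller pod
IC_POD=$(kubectl get pods -n nginx-ingress -l app=nginx-ingress -o jsonpath='{.items[0].metadata.name}')

# step 3: attach a debug container targeting the nginx-ingress container
kubectl debug -it -n nginx-ingress "$IC_POD" --image=busybox:1.28 --target=nginx-ingress

# steps 4-5 (inside the debug container): find and kill the NGINX master process
ps -ef | grep 'nginx: master process'
kill -9 <master-process-pid>

# step 6 (back outside the debug session): watch the pod logs for the bind() failures
kubectl logs -f -n nginx-ingress "$IC_POD" -c nginx-ingress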

Expected behavior
NGINX Ingress Controller is able to recover and operate normally after the NGINX master process exits unexpectedly.

Your environment

  • Version of the Ingress Controller - 3.3.1
  • Version of Kubernetes - 1.27.4
  • Kubernetes platform - k3d
  • Using NGINX or NGINX Plus - NGINX 1.25.2

Additional context
Log output

NGINX Ingress Controller Version=3.3.1 Commit=0f828bb5f4159d7fb52bcff0159d1ddd99f16f87 Date=2023-10-13T16:23:42Z DirtyState=false Arch=linux/arm64 Go=go1.21.3
I1102 09:38:54.316209       1 flags.go:297] Starting with flags: ["-nginx-configmaps=nginx-ingress/nginx-config"]
I1102 09:38:54.320330       1 main.go:236] Kubernetes version: 1.27.4
I1102 09:38:54.328891       1 main.go:382] Using nginx version: nginx/1.25.2
I1102 09:38:54.337340       1 main.go:782] Pod label updated: nginx-ingress-64f9fcdb96-dpgsk
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [notice] 16#16: try again to bind() after 500ms
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [notice] 16#16: try again to bind() after 500ms
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [notice] 16#16: try again to bind() after 500ms
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [notice] 16#16: try again to bind() after 500ms
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [emerg] 16#16: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
2023/11/02 09:38:54 [notice] 16#16: try again to bind() after 500ms
2023/11/02 09:38:54 [emerg] 16#16: still could not bind()
F1102 09:39:54.341336       1 manager.go:288] Could not get newest config version: could not get expected version: 0 after 1m0s
@shaun-nx shaun-nx added the bug and backlog labels Nov 2, 2023

github-actions bot commented Nov 2, 2023

Hi @shaun-nx thanks for reporting!

Be sure to check out the docs and the Contributing Guidelines while you wait for a human to take a look at this 🙂

Cheers!

@shaun-nx
Contributor Author

Fixed in #7121

@MarkTopping

Hello,

I've been performing some testing with NGINX Ingress Controller v4.0.1 this morning to verify the fix for this ticket (the issue is also captured in #6752 and #4370).

I'm afraid to report that v4.0.1 does not resolve the issue.

I deployed the Ingress Controller with a low memory limit of 125Mi so that I could trigger an OOMKill with some modest traffic. After the OOMKill, the pod was unable to restart. We see the same familiar error messages, captured here with debug log level enabled:

{"time":"2025-02-12T08:37:10.48020867Z","level":"DEBUG","msg":"Unable to fetch version: error getting client: Get \"http://config-version/configVersion\": dial unix /var/lib/nginx/nginx-config-version.sock: connect: connection refused"}

{"// Comment": "PREVIOUS REPEATED MESSAGE ~ 1500 times" }

{"time":"2025-02-12T08:37:10.789984148Z","level":"DEBUG","msg":"Unable to fetch version: error getting client: Get \"http://config-version/configVersion\": dial unix /var/lib/nginx/nginx-config-version.sock: connect: connection refused"}
2025/02/12 08:37:10 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
2025/02/12 08:37:10 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
2025/02/12 08:37:10 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
2025/02/12 08:37:10 [notice] 18#18: try again to bind() after 500ms
{"time":"2025-02-12T08:37:10.79009325Z","level":"DEBUG","msg":"Unable to fetch version: error getting client: Get \"http://config-version/configVersion\": dial unix /var/lib/nginx/nginx-config-version.sock: connect: connection refused"}

{"// Comment": "PREVIOUS REPEATED MESSAGE ~ 4000 times" }

{"time":"2025-02-12T08:37:11.289973867Z","level":"DEBUG","msg":"Unable to fetch version: error getting client: Get \"http://config-version/configVersion\": dial unix /var/lib/nginx/nginx-config-version.sock: connect: connection refused"}
2025/02/12 08:37:10 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
2025/02/12 08:37:10 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-502-server.sock failed (98: Address already in use)
2025/02/12 08:37:10 [emerg] 18#18: bind() to unix:/var/lib/nginx/nginx-418-server.sock failed (98: Address already in use)
2025/02/12 08:37:10 [notice] 18#18: try again to bind() after 500ms
{"time":"2025-02-12T08:37:11.29015717Z","level":"DEBUG","msg":"Unable to fetch version: error getting client: Get \"http://config-version/configVersion\": dial unix /var/lib/nginx/nginx-config-version.sock: connect: connection refused"}

{"// Comment": "AND SO ON" }

@MarkTopping

@shaun-nx Hi, I'm just wondering if you saw my message above?
I refrained from opening a new ticket on the assumption you'd prefer to re-open one of the existing ones that reference this issue, but let me know if you think otherwise.

@jjngx
Contributor

jjngx commented Feb 21, 2025

@MarkTopping thank you for reporting. We will re-open the issue and continue investigating.

@jjngx jjngx reopened this Feb 21, 2025
@jjngx jjngx added the docker and devops labels Feb 21, 2025
@jjngx jjngx moved this from Done 🚀 to Todo ☑ in NGINX Ingress Controller Feb 21, 2025
@jjngx jjngx moved this from Todo ☑ to Prioritized backlog in NGINX Ingress Controller Feb 21, 2025
@vepatel vepatel added the needs triage label Feb 21, 2025