
Startup Probe kills "/bin/opm serve" process and prevents operatorhubio pod to start #3269

Open
fjammes opened this issue May 21, 2024 · 4 comments

Comments

@fjammes

fjammes commented May 21, 2024

Type of question

General context and help around the operator-sdk

Question

What did you do?

Install operator-sdk v0.28.0

What did you expect to see?

Operator startup

What did you see instead? Under which circumstances?

Operatorhubio pod does not start:

runner@arc-runners-x2src-runner-mxhq2:~$ kubectl get pods -A | grep operatorhubio
olm                  operatorhubio-catalog-gqxnw                  0/1     CrashLoopBackOff   15 (4m6s ago)   55m
runner@arc-runners-x2src-runner-mxhq2:~$ kubectl describe pods -n olm operatorhubio-catalog-gqxnw | tail -n 5
  Normal   Pulled     52m                    kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 16.469534578s
  Normal   Created    52m (x2 over 54m)      kubelet            Created container registry-server
  Normal   Started    52m (x2 over 54m)      kubelet            Started container registry-server
  Warning  Unhealthy  5m47s (x150 over 54m)  kubelet            Startup probe failed: timeout: failed to connect service ":50051" within 1s
  Warning  BackOff    42s (x132 over 42m)    kubelet            Back-off restarting failed container
runner@arc-runners-x2src-runner-mxhq2:~$ kubectl logs  -n olm operatorhubio-catalog-gqxnw
time="2024-05-21T10:21:02Z" level=info msg="starting pprof endpoint" address="localhost:6060"
time="2024-05-21T10:21:02Z" level=info msg="found existing cache contents" backend=pogreb.v1 cache=/tmp/cache configs=/configs

The process seems to freeze for 2-3 minutes at the step logged above.
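
For anyone reproducing this, the logs of the previously killed container show how far opm got before the startup probe restarted it. A minimal sketch, using the pod name from the output above:

# Logs from the last attempt that the kubelet killed
kubectl logs -n olm operatorhubio-catalog-gqxnw --previous
# Watch restarts and probe failures in real time
kubectl get pods -n olm operatorhubio-catalog-gqxnw -w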

Environment

  • operator-lifecycle-manager version: v0.28.0

  • Kubernetes version information:

kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.3", GitCommit:"25b4e43193bcda6c7328a6d147b1fb73a33f1598", GitTreeState:"clean", BuildDate:"2023-06-14T09:53:42Z", GoVersion:"go1.20.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.0", GitCommit:"a866cbe2e5bbaa01cfd5e969aa3e033f3282a8a2", GitTreeState:"clean", BuildDate:"2022-09-01T23:30:43Z", GoVersion:"go1.19", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.27) and server (1.25) exceeds the supported minor version skew of +/-1
  • Kubernetes cluster kind:

ARC and kind based:

kind version
kind v0.15.0 go1.19 linux/amd64

Additional context

The command /bin/opm serve /configs --cache-dir=/tmp/cache takes ~2-3 minutes to start in this container, and this trips the startupProbe. It occurs on only one of our infrastructures. Is there a way to increase the probe duration, or to debug what is happening in the opm process?
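
For reference, the probe window currently applied to the catalog pod can be read straight from the pod spec; a minimal sketch using the pod name from above (failureThreshold x periodSeconds is the total time the container gets before it is killed):

kubectl get pod -n olm operatorhubio-catalog-gqxnw \
  -o jsonpath='{.spec.containers[0].startupProbe}{"\n"}'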

@fjammes fjammes changed the title Startup Probe kill "/bin/opm serve" process and prevent operatorhubio pod to start Startup Probe kills "/bin/opm serve" process and prevents operatorhubio pod to start May 21, 2024
@jkranner

I am also facing this issue.
Pod: operatorhubio-catalog-ql6bs
Startup probe failed: timeout: failed to connect service ":50051" within 1s
Then keeps crash-looping.

@fjammes
Author

fjammes commented Jul 24, 2024

Still blocked on this issue when OLM runs on servers with slow disks. Is there a way to configure the startupProbe through the OLM install procedure?
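
Since slow disks are suspected, one way to sanity-check read throughput from inside the catalog container is something like the sketch below. This assumes the image ships a busybox-style shell with find and cat, which may not be the case; if it does not, the same timing can be done on the node against the container's filesystem:

# Read the whole catalog and cache content once and time it (hypothetical check)
kubectl exec -n olm operatorhubio-catalog-gqxnw -- \
  sh -c 'time find /configs /tmp/cache -type f -exec cat {} + > /dev/null'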

@mateuszkca

I have this problem too.
I found that it only affects k8s running on workers with CentOS 9 / Rocky 9, regardless of the Docker version.
On those nodes the catalog pod takes about 75 seconds to start.
With CentOS 8 / Rocky 8 there is no problem and the catalog pod starts in about 6 seconds.
Any progress in resolving this problem?
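
To compare the two node families, the start-to-ready delay can also be pulled from the pod status; a minimal sketch (pod name from earlier in the thread, only meaningful once the container has reached Ready):

kubectl get pod -n olm operatorhubio-catalog-gqxnw \
  -o jsonpath='{.status.containerStatuses[0].state.running.startedAt}{"\n"}{.status.conditions[?(@.type=="Ready")].lastTransitionTime}{"\n"}'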

@grokspawn
Contributor

OLMv0 does not currently support a configurable startupProbe (ref). If it's not too late in the day for me to do the math, that is 100 seconds of startup delay. Without a better understanding of what's going on here, I'd be reluctant to advocate for an arbitrary duration bump, since that might just push the issue off to another day.

There are a couple of things that you could do to try to get a better understanding of why your catalog pods are taking so long to be ready:

  1. you can disable cache validation in running instances of your catalog. Right now it looks like you have a pre-generated catalog cache, and opm will validate that cache when opm serve starts. If you disable this and get a better experience, then we'd have a better focus on what needs to be improved; and/or
  2. opm serve exposes a pprof endpoint for creation...readiness CPU profiling, discrete from an optional default pprof endpoint. You would need to port-forward (or exec into the pod and access) localhost:6060, as sketched below.
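
If it helps, the pprof endpoint from the startup log above (localhost:6060) is normally reachable through a port-forward even though it binds to localhost inside the pod; a minimal sketch, assuming the standard net/http/pprof paths:

# Terminal 1: forward the pprof port from the catalog pod
kubectl port-forward -n olm pod/operatorhubio-catalog-gqxnw 6060:6060
# Terminal 2: capture a 30-second CPU profile while the pod is still starting up
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"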
