Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(healthchecks): Adding Agent Health Checks for legacy & hubble control planes #1092

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

mereta
Copy link
Contributor

@mereta mereta commented Dec 2, 2024

Description

Extending retina port 10093 to serve a /health endpoint.
For subsequent iterations a different checker can be plugged in to the Handler.

The most important changes include modifying the probe paths and parameters, and adding a new health check handler in the server code.

Changes to Kubernetes Deployment Manifests:

Changes to Configuration Values:

Changes to Server Code:

  • pkg/server/server.go: Added import for healthz package and implemented serveHealth function to handle /health endpoint using healthz.CheckHandler. [1] [2] [3]

Related Issue

#1047

Checklist

  • I have read the contributing documentation.
  • I signed and signed-off the commits (git commit -S -s ...). See this documentation on signing commits.
  • I have correctly attributed the author(s) of the code.
  • I have tested the changes locally.
  • I have followed the project's style guidelines.
  • I have updated the documentation, if necessary.
  • I have added tests, if applicable.

Screenshots (if applicable) or Testing Completed

Testing was completed manually for legacy & hubble control planes.
Windows image tested with legacy retina, hubble retina N/A.

Additional Notes

Add any additional notes or context about the pull request here.


Please refer to the CONTRIBUTING.md file for more information on how to contribute to this project.

cmd/hubble/daemon_linux.go Outdated Show resolved Hide resolved
cmd/legacy/daemon.go Outdated Show resolved Hide resolved
cmd/legacy/daemon.go Outdated Show resolved Hide resolved
pkg/enricher/enricher.go Outdated Show resolved Hide resolved
cmd/hubble/daemon_linux.go Outdated Show resolved Hide resolved
cmd/hubble/daemon_linux.go Outdated Show resolved Hide resolved
cmd/legacy/daemon.go Outdated Show resolved Hide resolved
mereta and others added 2 commits December 4, 2024 12:19
Co-authored-by: Timothy J. Raymond <[email protected]>
Signed-off-by: Mereta <[email protected]>
@mereta mereta marked this pull request as ready for review December 4, 2024 13:53
@mereta mereta requested a review from a team as a code owner December 4, 2024 13:53
@mereta mereta self-assigned this Dec 4, 2024
@mereta mereta changed the title feat(healthchecks): Adding Example Health Checks feat(healthchecks): Adding Agent Health Checks for legacy & hubble control planes Dec 4, 2024
cmd/hubble/daemon_linux.go Outdated Show resolved Hide resolved
@@ -86,6 +86,7 @@ daemonset:
metricsBindAddress: ":18080"
ports:
containerPort: 10093
healthPort: 18081
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why create a new port? and not reuse the existing 10093 ? since we are a hostnet: true pod, any port number we use comes with restrictions as it takes away from nodes usable port ranges. In AKS, we will have to pre register the ports we will be using to make sure customers are also aware of the ports they should not use to avoid any conflict.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be already configured in Legacy & Hubble control planes as default HealthProbeBindAddress.
Its always provided when we set up k8s controller runtime options.
Hubble:

HealthProbeBindAddress: ":18001",

Legacy: https://github.com/microsoft/retina/blob/4b12472cc5007e0931238dd5e26f819e6be093b5/cmd/root.go#L42C10-L42C15

I think also @rbtr mentioned having a configurable port.
Do you think we should change to 10093?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I did some testing and got errors when using port 10093.
I don't think we can reuse the Retina port, some findings below:

We start a http api server with that port in controller manager:

// create HTTP server for API server

And the trail for setting up health check is:
The HealthProbeBindAddress we pass as part of options in the ctrl manager ends up here where a listener is created.
comment:
This will throw an error if the bind address is invalid or already in use.
(same logic as providing pprof address below)
https://github.com/kubernetes-sigs/controller-runtime/blob/aea2e32a936584b06ae6f7992f856fe7292b0297/pkg/manager/manager.go#L407

If listener exists k8s tries to add HealthProbeServer:
https://github.com/kubernetes-sigs/controller-runtime/blob/aea2e32a936584b06ae6f7992f856fe7292b0297/pkg/manager/internal.go#L400

addHealthProverServer() definition - creates a new server:
https://github.com/kubernetes-sigs/controller-runtime/blob/aea2e32a936584b06ae6f7992f856fe7292b0297/pkg/manager/internal.go#L290

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree with Vamsi. Controller Manager's implementation isn't fit for purpose here, so we should just use the standard library and write the two HTTP handlers on whatever server is bound to 10093. It will probably provide a better healthcheck overall, since we're testing whether traffic can be served on 10093.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vakalapa @timraymond Updated the PR.

Copy link

This PR will be closed in 7 days due to inactivity.

@github-actions github-actions bot added the meta/waiting-for-author Blocked and waiting on the author label Jan 16, 2025
@github-actions github-actions bot removed the meta/waiting-for-author Blocked and waiting on the author label Jan 22, 2025
Copy link
Contributor

@ritwikranjan ritwikranjan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly opinions and curious question rather than any explicit suggestion.

periodSeconds: 30
timeoutSeconds: 15
failureThreshold: 3
successThreshold: 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anubhabMajumdar Any input on the values? Do they look ok?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants