Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(telemetry): Application Insights Shutdown on Retina Operator #1227

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

mereta
Copy link
Contributor

@mereta mereta commented Jan 16, 2025

Description

Fix for missing telemetry after application shutdown.
This pull request includes changes to improve error handling in the operator/cmd/legacy/deployment.go and operator/cmd/root.go files. The main focus is to replace os.Exit(1) calls with error returns to allow for better error management and testing.

Error handling improvements:

  • operator/cmd/legacy/deployment.go: Modified the Start method to return an error instead of calling os.Exit(1) on failure. This change affects multiple points within the method where errors are now returned instead of causing the program to exit.
  • operator/cmd/root.go: Updated the Run function to RunE to handle errors returned by the Start method of the Operator. This ensures that any errors encountered during the startup process are properly returned and handled.

Related Issue

#1203

Checklist

  • I have read the contributing documentation.
  • I signed and signed-off the commits (git commit -S -s ...). See this documentation on signing commits.
  • I have correctly attributed the author(s) of the code.
  • I have tested the changes locally.
  • I have followed the project's style guidelines.
  • I have updated the documentation, if necessary.
  • I have added tests, if applicable.

Screenshots (if applicable) or Testing Completed

Please add any relevant screenshots or GIFs to showcase the changes made.

Additional Notes

Add any additional notes or context about the pull request here.


Please refer to the CONTRIBUTING.md file for more information on how to contribute to this project.

@mereta mereta requested a review from a team as a code owner January 16, 2025 12:18
@mereta mereta changed the title Telemetry shutdown fix(telemetry shutdown) Jan 16, 2025
@mereta mereta changed the title fix(telemetry shutdown) fix(telemetry): Application Insights Shutdown on Retina Operator Jan 16, 2025
@mereta mereta linked an issue Jan 16, 2025 that may be closed by this pull request
Copy link
Contributor

@huntergregory huntergregory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also make the relevant changes for the non-legacy (cilium-crd) operator? Up to you if it should be the same PR or separate.

In your testing, could you build a new retina image with your changes, deploy it in your cluster, and induce an error to verify that the final error log shows up in app insights? Could induce an error by modifying the code for testing.

@@ -176,10 +165,14 @@ func (o *Operator) Start() {
"version": buildinfo.Version,
telemetry.PropertyApiserver: apiserverURL,
}

telemetry.InitAppInsights(buildinfo.ApplicationInsightsID, buildinfo.Version)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we verified in the past that operator was sending logs to app insights. I'm confused what this function does

We also seem to init app insights here and with a different config:

retina/pkg/log/zap.go

Lines 95 to 100 in a2c261f

aiTelemetryConfig := appinsights.NewTelemetryConfiguration(lOpts.ApplicationInsightsID)
sinkcfg := zapai.SinkConfig{
GracePeriod: 30 * time.Second, //nolint:gomnd // ignore
TelemetryConfiguration: *aiTelemetryConfig,
}
sinkcfg.MaxBatchSize = 10000

@jimassa / @matmerr what's the difference between configuring zap with AI vs initializing AI client? What's the impact of having different configs?

@@ -176,10 +165,14 @@ func (o *Operator) Start() {
"version": buildinfo.Version,
telemetry.PropertyApiserver: apiserverURL,
}

telemetry.InitAppInsights(buildinfo.ApplicationInsightsID, buildinfo.Version)
defer telemetry.ShutdownAppInsights()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's shutdown wherever we handle the error?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the request @huntergregory . The defer will run when we return the errors?

d.Start()

if err := d.Start(); err != nil {
return errors.Wrap(err, "failed to start retina-operator")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's log the error before the program exits

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to amend that. "Let's assure that any error returned is logged immediately before os.Exit()". I'm pretty sure the answer to this is "yes, RunE already does that." But it would be good to double-check.

Log in-situ and return error is redundant and carries a lot of mental overhead both in reading and writing. The simple rule of "if an error makes it to the top of the call stack, it gets logged, loudly" is semantically cleaner and does the right thing by default. When combined with "all errors are wrapped," you also get detailed context to track down what went wrong.

@@ -25,10 +26,14 @@ var (
Use: "retina-operator",
Short: "Retina Operator",
Long: "Start Retina Operator",
Run: func(cmd *cobra.Command, args []string) {
RunE: func(_ *cobra.Command, _ []string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does RunE do? Another option is to leave as Run and os.exit on err?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as Run but with an error return

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and importantly it will return a non-zero exit on err != nil for parity with existing behavior

}

mainLogger.Info("Starting manager")
if err := mgr.Start(ctrlCtx); err != nil {
mainLogger.Error("Problem running manager", zap.Error(err))
os.Exit(1)
return errors.Wrap(err, "problem running manager")
}

// start heartbeat goroutine for application insights
go tel.Heartbeat(ctx, HeartbeatFrequency)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Start is a blocking call, so it never gets reached. Heartbeat should be called above

@@ -165,8 +155,7 @@ func (o *Operator) Start() {

apiserverURL, err := telemetry.GetK8SApiserverURLFromKubeConfig()
if err != nil {
mainLogger.Error("Apiserver URL is cannot be found", zap.Error(err))
os.Exit(1)
return errors.Wrap(err, "apiserver URL is cannot be found")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return errors.Wrap(err, "apiserver URL is cannot be found")
return errors.Wrap(err, "apiserver URL cannot be found")

@@ -176,10 +165,14 @@ func (o *Operator) Start() {
"version": buildinfo.Version,
telemetry.PropertyApiserver: apiserverURL,
}

telemetry.InitAppInsights(buildinfo.ApplicationInsightsID, buildinfo.Version)
defer telemetry.ShutdownAppInsights()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the request @huntergregory . The defer will run when we return the errors?

@@ -25,10 +26,14 @@ var (
Use: "retina-operator",
Short: "Retina Operator",
Long: "Start Retina Operator",
Run: func(cmd *cobra.Command, args []string) {
RunE: func(_ *cobra.Command, _ []string) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and importantly it will return a non-zero exit on err != nil for parity with existing behavior

d.Start()

if err := d.Start(); err != nil {
return errors.Wrap(err, "failed to start retina-operator")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to amend that. "Let's assure that any error returned is logged immediately before os.Exit()". I'm pretty sure the answer to this is "yes, RunE already does that." But it would be good to double-check.

Log in-situ and return error is redundant and carries a lot of mental overhead both in reading and writing. The simple rule of "if an error makes it to the top of the call stack, it gets logged, loudly" is semantically cleaner and does the right thing by default. When combined with "all errors are wrapped," you also get detailed context to track down what went wrong.

@mereta mereta self-assigned this Jan 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

log: if there's a crash, final telemetry logs might not be sent
4 participants