Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(telemetry): Application Insights Shutdown on Retina Operator #1227

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
61 changes: 24 additions & 37 deletions operator/cmd/legacy/deployment.go
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,6 @@ import (
"fmt"
"net/http"
"net/http/pprof"
"os"
"time"

"go.uber.org/zap/zapcore"
Expand Down Expand Up @@ -40,6 +39,7 @@ import (
retinaendpointcontroller "github.com/microsoft/retina/pkg/controllers/operator/retinaendpoint"
"github.com/microsoft/retina/pkg/log"
"github.com/microsoft/retina/pkg/telemetry"
"github.com/pkg/errors"
)

var (
Expand Down Expand Up @@ -81,7 +81,7 @@ func NewOperator(metricsAddr, probeAddr, configFile string, enableLeaderElection
}
}

func (o *Operator) Start() {
func (o *Operator) Start() error {
mainLogger = log.Logger().Named("main")

mainLogger.Sugar().Infof("Starting legacy operator")
Expand All @@ -93,8 +93,7 @@ func (o *Operator) Start() {
var err error
oconfig, err = config.GetConfig(o.configFile)
if err != nil {
fmt.Printf("failed to load config with err %s", err.Error())
os.Exit(1)
return errors.Wrap(err, "failed to load config")
}

mainLogger.Sugar().Infof("Operator configuration", zap.Any("configuration", oconfig))
Expand All @@ -103,15 +102,9 @@ func (o *Operator) Start() {
oconfig.CaptureConfig.CaptureImageVersion = buildinfo.Version
oconfig.CaptureConfig.CaptureImageVersionSource = captureUtils.VersionSourceOperatorImageVersion

if err != nil {
fmt.Printf("failed to load config with err %s", err.Error())
os.Exit(1)
}

err = initLogging(oconfig, buildinfo.ApplicationInsightsID)
if err != nil {
fmt.Printf("failed to initialize logging with err %s", err.Error())
os.Exit(1)
return errors.Wrap(err, "failed to initialize logging")
}

ctrl.SetLogger(crzap.New(crzap.UseFlagOptions(opts), crzap.Encoder(zapcore.NewConsoleEncoder(log.EncoderConfig()))))
Expand All @@ -138,15 +131,13 @@ func (o *Operator) Start() {
// LeaderElectionReleaseOnCancel: true,
})
if err != nil {
mainLogger.Error("Unable to start manager", zap.Error(err))
os.Exit(1)
return errors.Wrap(err, "unable to start manager")
}

ctx := context.Background()
clientset, err := apiextv1.NewForConfig(mgr.GetConfig())
if err != nil {
mainLogger.Error("Failed to get apiextension clientset", zap.Error(err))
os.Exit(1)
return errors.Wrap(err, "failed to get apiextension clientset")
}

if oconfig.InstallCRDs {
Expand All @@ -155,8 +146,7 @@ func (o *Operator) Start() {
var crds map[string]*v1.CustomResourceDefinition
crds, err = deploy.InstallOrUpdateCRDs(ctx, oconfig.EnableRetinaEndpoint, clientset)
if err != nil {
mainLogger.Error("unable to register CRDs", zap.Error(err))
os.Exit(1)
return errors.Wrap(err, "unable to register CRDs")
}
for name := range crds {
mainLogger.Info("CRD registered", zap.String("name", name))
Expand All @@ -165,8 +155,7 @@ func (o *Operator) Start() {

apiserverURL, err := telemetry.GetK8SApiserverURLFromKubeConfig()
if err != nil {
mainLogger.Error("Apiserver URL is cannot be found", zap.Error(err))
os.Exit(1)
return errors.Wrap(err, "apiserver URL is cannot be found")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return errors.Wrap(err, "apiserver URL is cannot be found")
return errors.Wrap(err, "apiserver URL cannot be found")

}

var tel telemetry.Telemetry
Expand All @@ -176,10 +165,14 @@ func (o *Operator) Start() {
"version": buildinfo.Version,
telemetry.PropertyApiserver: apiserverURL,
}

telemetry.InitAppInsights(buildinfo.ApplicationInsightsID, buildinfo.Version)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we verified in the past that operator was sending logs to app insights. I'm confused what this function does

We also seem to init app insights here and with a different config:

retina/pkg/log/zap.go

Lines 95 to 100 in a2c261f

aiTelemetryConfig := appinsights.NewTelemetryConfiguration(lOpts.ApplicationInsightsID)
sinkcfg := zapai.SinkConfig{
GracePeriod: 30 * time.Second, //nolint:gomnd // ignore
TelemetryConfiguration: *aiTelemetryConfig,
}
sinkcfg.MaxBatchSize = 10000

@jimassa / @matmerr what's the difference between configuring zap with AI vs initializing AI client? What's the impact of having different configs?

defer telemetry.ShutdownAppInsights()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's shutdown wherever we handle the error?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the request @huntergregory . The defer will run when we return the errors?

defer telemetry.TrackPanic()

tel, err = telemetry.NewAppInsightsTelemetryClient("retina-operator", properties)
if err != nil {
mainLogger.Error("failed to create telemetry client", zap.Error(err))
os.Exit(1)
return errors.Wrap(err, "failed to create telemetry client")
}
} else {
mainLogger.Info("telemetry disabled", zap.String("apiserver", apiserverURL))
Expand All @@ -188,20 +181,17 @@ func (o *Operator) Start() {

kubeClient, err := kubernetes.NewForConfig(mgr.GetConfig())
if err != nil {
mainLogger.Error("Failed to get clientset", zap.Error(err))
os.Exit(1)
return errors.Wrap(err, "failed to get clientset")
}

captureReconciler, err := captureController.NewCaptureReconciler(
mgr.GetClient(), mgr.GetScheme(), kubeClient, oconfig.CaptureConfig,
)
if err != nil {
mainLogger.Error("Unable to create capture reconciler", zap.Error(err))
os.Exit(1)
return errors.Wrap(err, "unable to create capture reconciler")
}
if err = captureReconciler.SetupWithManager(mgr); err != nil {
mainLogger.Error("Unable to setup retina capture controller with manager", zap.Error(err))
os.Exit(1)
return errors.Wrap(err, "unable to setup retina capture controller with manager")
}

ctrlCtx := ctrl.SetupSignalHandler()
Expand All @@ -225,37 +215,34 @@ func (o *Operator) Start() {

pc := podcontroller.New(mgr.GetClient(), mgr.GetScheme(), retinaendpointchannel)
if err = (pc).SetupWithManager(mgr); err != nil {
mainLogger.Error("Unable to create controller", zap.String("controller", "podcontroller"), zap.Error(err))
os.Exit(1)
return errors.Wrap(err, "unable to create controller - podcontroller")
}
}
}

mc := metricsconfiguration.New(mgr.GetClient(), mgr.GetScheme())
if err = (mc).SetupWithManager(mgr); err != nil {
mainLogger.Error("Unable to create controller", zap.String("controller", "metricsconfiguration"), zap.Error(err))
os.Exit(1)
return errors.Wrap(err, "unable to create controller - metricsconfiguration")
}

//+kubebuilder:scaffold:builder

if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
mainLogger.Error("Unable to set up health check", zap.Error(err))
os.Exit(1)
return errors.Wrap(err, "unable to set up health check")
}
if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
mainLogger.Error("Unable to set up ready check", zap.Error(err))
os.Exit(1)
return errors.Wrap(err, "unable to set up ready check")
}

mainLogger.Info("Starting manager")
if err := mgr.Start(ctrlCtx); err != nil {
mainLogger.Error("Problem running manager", zap.Error(err))
os.Exit(1)
return errors.Wrap(err, "problem running manager")
}

// start heartbeat goroutine for application insights
go tel.Heartbeat(ctx, HeartbeatFrequency)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Start is a blocking call, so it never gets reached. Heartbeat should be called above


return nil
}

func EnablePProf() {
Expand Down
9 changes: 7 additions & 2 deletions operator/cmd/root.go
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ import (
"os"

"github.com/microsoft/retina/operator/cmd/legacy"
"github.com/pkg/errors"
"github.com/spf13/cobra"
)

Expand All @@ -25,10 +26,14 @@ var (
Use: "retina-operator",
Short: "Retina Operator",
Long: "Start Retina Operator",
Run: func(cmd *cobra.Command, args []string) {
RunE: func(_ *cobra.Command, _ []string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does RunE do? Another option is to leave as Run and os.exit on err?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as Run but with an error return

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and importantly it will return a non-zero exit on err != nil for parity with existing behavior

fmt.Println("Starting Retina Operator")
d := legacy.NewOperator(metricsAddr, probeAddr, cfgFile, enableLeaderElection)
d.Start()

if err := d.Start(); err != nil {
return errors.Wrap(err, "failed to start retina-operator")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's log the error before the program exits

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to amend that. "Let's assure that any error returned is logged immediately before os.Exit()". I'm pretty sure the answer to this is "yes, RunE already does that." But it would be good to double-check.

Log in-situ and return error is redundant and carries a lot of mental overhead both in reading and writing. The simple rule of "if an error makes it to the top of the call stack, it gets logged, loudly" is semantically cleaner and does the right thing by default. When combined with "all errors are wrapped," you also get detailed context to track down what went wrong.

}
return nil
},
}
)
Expand Down
Loading