Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(sidecar): Allow sidecar to not crash on startup if objstore is not available #7708

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

amaury-d
Copy link

@amaury-d amaury-d commented Sep 9, 2024

This PR adds --shipper.retry-init option to sidecar to allow it to continue to serve prometheus read path if shipper init is failing at startup (eg: network issue on objstore).
On each upload, if shipper is not initialized, reinitialization will be attempted again.

This PR also brings a new metric thanos_sidecar_shipper_up to alert in case of shipper issue.

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

  • New --shipper.retry-init option to allow retrying shipper init on each upload
  • New thanos_sidecar_shipper_up metric
  • Add .bin to .gitignore

Verification

I used thanos-docker-compose.
I then configured a bogus objstore.config to force crash.

1. Current situation with broken objstore at startup

ts=2024-09-06T13:30:59.31896362Z caller=main.go:145 level=error err="decode account key: illegal base64 data at input byte 0\ncreate AZURE client\ngithub.com/thanos-io/objstore/client.NewBucket\n\t/Users/adecreme/go/pkg/mod/github.com/thanos-io/[email protected]/client/factory.go:90\nmain.runSidecar\n\t/Users/adecreme/github/thanos/thanos/cmd/thanos/sidecar.go:370\nmain.registerSidecar.func1\n\t/Users/adecreme/github/thanos/thanos/cmd/thanos/sidecar.go:105\nmain.main\n\t/Users/adecreme/github/thanos/thanos/cmd/thanos/main.go:143\nruntime.main\n\t/opt/homebrew/Cellar/[email protected]/1.22.6/libexec/src/runtime/proc.go:271\nruntime.goexit\n\t/opt/homebrew/Cellar/[email protected]/1.22.6/libexec/src/runtime/asm_arm64.s:1222\npreparing sidecar command failed\nmain.main\n\t/Users/adecreme/github/thanos/thanos/cmd/thanos/main.go:145\nruntime.main\n\t/opt/homebrew/Cellar/[email protected]/1.22.6/libexec/src/runtime/proc.go:271\nruntime.goexit\n\t/opt/homebrew/Cellar/[email protected]/1.22.6/libexec/src/runtime/asm_arm64.s:1222"

Then sidecar crashes.

2. With --shipper.retry-init with broken objstore

ts=2024-09-06T14:47:30.920080081Z caller=grpc.go:167 level=info service=gRPC/server component=sidecar msg="listening for serving gRPC" address=0.0.0.0:10901
ts=2024-09-06T14:47:32.916587861Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2024-09-06T14:47:32.917102072Z caller=azure.go:149 level=debug msg="creating new Azure bucket connection" component=sidecar
ts=2024-09-06T14:47:32.917182156Z caller=sidecar.go:401 level=warn err="create AZURE client: decode account key: illegal base64 data at input byte 0" msg="Failed to create bucket. Sidecar will start without upload feature and will retry later."
ts=2024-09-06T14:48:02.917020664Z caller=factory.go:53 level=info msg="loading bucket configuration"
ts=2024-09-06T14:48:02.917300082Z caller=azure.go:149 level=debug msg="creating new Azure bucket connection" component=sidecar
ts=2024-09-06T14:48:02.917340791Z caller=sidecar.go:401 level=warn err="create AZURE client: decode account key: illegal base64 data at input byte 0" msg="Failed to create bucket. Sidecar will start without upload feature and will retry later."

Sidecar doesn't crash.

Metric:

# HELP thanos_sidecar_shipper_up Boolean indicator whether the sidecar shipper is running.
# TYPE thanos_sidecar_shipper_up gauge
thanos_sidecar_shipper_up 0

@MichaHoffmann
Copy link
Contributor

MichaHoffmann commented Sep 9, 2024

Do all object storage clients attempt to create the bucket?

Edit: I spot checked the s3 provider and it doesnt create a bucket - so this would likely break that, right? As in - if we start with S3 we will never create the bucket and human interaction is needed. FWIW: we have a sidecar that creates the bucket if it doesnt exist.

@amaury-d
Copy link
Author

amaury-d commented Sep 9, 2024

So, in objstore factory.go, I see that all objstore call NewBucket

	case string(GCS):
		bucket, err = gcs.NewBucket(context.Background(), logger, config, component)
	case string(S3):
		bucket, err = s3.NewBucket(logger, config, component)
	case string(AZURE):
		bucket, err = azure.NewBucket(logger, config, component)
	case string(SWIFT):
		bucket, err = swift.NewContainer(logger, config)
	case string(COS):
		bucket, err = cos.NewBucket(logger, config, component)
	case string(ALIYUNOSS):
		bucket, err = oss.NewBucket(logger, config, component)
	case string(FILESYSTEM):
		bucket, err = filesystem.NewBucketFromConfig(config)
	case string(BOS):
		bucket, err = bos.NewBucket(logger, config, component)
	case string(OCI):
		bucket, err = oci.NewBucket(logger, config)
	case string(OBS):
		bucket, err = obs.NewBucket(logger, config)

The PR moved the call to NewBucket that occurs for all provider from the startup to the scheduled upload part so it shouldn't break S3.

@MichaHoffmann
Copy link
Contributor

So, in objstore factory.go, I see that all objstore call NewBucket

	case string(GCS):
		bucket, err = gcs.NewBucket(context.Background(), logger, config, component)
	case string(S3):
		bucket, err = s3.NewBucket(logger, config, component)
	case string(AZURE):
		bucket, err = azure.NewBucket(logger, config, component)
	case string(SWIFT):
		bucket, err = swift.NewContainer(logger, config)
	case string(COS):
		bucket, err = cos.NewBucket(logger, config, component)
	case string(ALIYUNOSS):
		bucket, err = oss.NewBucket(logger, config, component)
	case string(FILESYSTEM):
		bucket, err = filesystem.NewBucketFromConfig(config)
	case string(BOS):
		bucket, err = bos.NewBucket(logger, config, component)
	case string(OCI):
		bucket, err = oci.NewBucket(logger, config)
	case string(OBS):
		bucket, err = obs.NewBucket(logger, config)

The PR moved the call to NewBucket that occurs for all provider from the startup to the scheduled upload part so it shouldn't break S3.

New bucket creates a new implementation of the objstore bucket interface - that does not mean that they create a bucket in object storage i think!

@amaury-d amaury-d force-pushed the fix/sidecar branch 2 times, most recently from 6ec7fdc to 10349bf Compare September 10, 2024 08:11
@amaury-d amaury-d marked this pull request as draft September 19, 2024 12:57
@amaury-d amaury-d changed the title fix(sidecar): Allow sidecar to not crash on startup if objstore is not available #7585 fix(sidecar): Allow sidecar to not crash on startup if objstore is not available Sep 24, 2024
@amaury-d
Copy link
Author

Note: pull request updated to make this new behavior disabled by default and enabled only via --shipper.retry-init option

This commit allows sidecar to continue to serve prometheus read path if objstore is not available at startup.
Bucket creation will be attempted again on next upload.

This option is disabled by default and can be enabled by passing
argument "--shipper.retry-init".

This commit also brings a new metric to alert in case of bucket initialization crash: thanos_sidecar_shipper_up

Signed-off-by: Amaury Decrême <[email protected]>
@amaury-d amaury-d marked this pull request as ready for review September 26, 2024 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants