
Replace dgraph-io/badger cache storage with etcd-io/bbolt #42571

Closed
wants to merge 36 commits into from

Conversation

stefans-elastic
Contributor

@stefans-elastic stefans-elastic commented Feb 3, 2025

Proposed commit message

Replace the dgraph-io/badger persistent storage for the key-value cache with etcd-io/bbolt. Originally this was meant simply to get rid of the go.opencensus.io dependency introduced by badger (please see the parent issue for more details). After it became evident that this would not remove the go.opencensus.io dependency, it was decided that the work should still be done, since etcd-io/bbolt is already used elsewhere in the project and having multiple storage backends for the cache is undesirable (again, please see the comments in the parent issue for more details).

The implementation should be fairly straightforward, but I would like to clarify one thing: since bolt doesn't support value expiration, the expiration time (and TTL) is stored as metadata of the value. Upon retrieval the value is checked for expiration; if it has expired, nil is returned and the value is deleted from the bolt DB.
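For illustration, a minimal sketch (not the actual PR code) of the scheme described above, with the TTL stored as metadata next to the value and expired entries removed lazily on read. The package, bucket, and type names are hypothetical.

package kvcache

import (
	"encoding/json"
	"time"

	bolt "go.etcd.io/bbolt"
)

// bucketName is a hypothetical bucket holding the cache entries.
var bucketName = []byte("kv-cache")

// entry wraps the cached value with its expiration metadata, since bbolt has
// no built-in TTL support.
type entry struct {
	Value     []byte    `json:"value"`
	ExpiresAt time.Time `json:"expires_at"` // zero time means "never expires"
}

// Put stores the value together with its expiration time.
func Put(db *bolt.DB, key, value []byte, ttl time.Duration) error {
	e := entry{Value: value}
	if ttl > 0 {
		e.ExpiresAt = time.Now().Add(ttl)
	}
	raw, err := json.Marshal(e)
	if err != nil {
		return err
	}
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists(bucketName)
		if err != nil {
			return err
		}
		return b.Put(key, raw)
	})
}

// Get returns nil for missing or expired keys; expired entries are deleted.
func Get(db *bolt.DB, key []byte) ([]byte, error) {
	var value []byte
	err := db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket(bucketName)
		if b == nil {
			return nil
		}
		raw := b.Get(key)
		if raw == nil {
			return nil
		}
		var e entry
		if err := json.Unmarshal(raw, &e); err != nil {
			return err
		}
		if !e.ExpiresAt.IsZero() && time.Now().After(e.ExpiresAt) {
			return b.Delete(key) // expired: drop it and report a miss
		}
		value = e.Value
		return nil
	})
	return value, err
}

Note that using db.Update (rather than db.View) in Get lets the expired entry be deleted in the same transaction, at the cost of taking a write lock on every read.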

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

Author's Checklist

  • [ ]

How to test this PR locally

Related issues

Use cases

Screenshots

Logs

@stefans-elastic stefans-elastic self-assigned this Feb 3, 2025
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Feb 3, 2025
Contributor

mergify bot commented Feb 3, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @stefans-elastic? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

@stefans-elastic stefans-elastic added the Team:Obs-InfraObs Label for the Observability Infrastructure Monitoring team label Feb 4, 2025
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Feb 4, 2025
@stefans-elastic stefans-elastic added needs_team Indicates that the issue/PR needs a Team:* label backport-8.x Automated backport to the 8.x branch with mergify backport-8.16 Automated backport with mergify backport-8.17 Automated backport with mergify labels Feb 4, 2025
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Feb 4, 2025
@botelastic

botelastic bot commented Feb 4, 2025

This pull request doesn't have a Team:<team> label.

@stefans-elastic stefans-elastic changed the title Drop dbadger io Replace dgraph-io/badger cache storage with etcd-io/bbolt Feb 4, 2025
@stefans-elastic stefans-elastic marked this pull request as ready for review February 4, 2025 11:41
@stefans-elastic stefans-elastic requested review from a team as code owners February 4, 2025 11:41
Comment on lines 108 to 109
if err != nil {
    c.log.Debugf("Key '%s' not found in key-value store", k)
Member

Sorry, but could you log the error instead of saying the key was not found?
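For example, something along these lines (illustrative only, not necessarily the exact code in the PR):

if err != nil {
	// log the underlying error instead of assuming the key was simply missing
	c.log.Debugf("failed to get key '%s' from the key-value store: %v", k, err)
}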

Contributor Author

Sure, done. Please check it out.

@stefans-elastic
Contributor Author

/test

@stefans-elastic
Contributor Author

@elastic/beats-tech-leads / @VihasMakwana could you review this PR?

@cmacknz
Member

cmacknz commented Feb 11, 2025

Looks like the persistent cache has a few uses related to cloudfoundry, including the add_cloudfoundry_metadata processor via the newClientCacheWrap function.

rg newClientCacheWrap
x-pack/libbeat/common/cloudfoundry/hub.go
82:func (h *Hub) ClientWithCache() (Client, error) {

x-pack/libbeat/common/cloudfoundry/cache_integration_test.go
41:             client, err := hub.ClientWithCache()
52:             client, err := hub.ClientWithCache()

x-pack/libbeat/processors/add_cloudfoundry_metadata/add_cloudfoundry_metadata.go
51:     client, err := hub.ClientWithCache()
  1. Let's indicate in the changelog that this change only impacts Cloudfoundry-related functionality. It looks like the impact would be that we essentially clear the cache and start from scratch, which doesn't seem breaking to me.
  2. Do we have any way to sanity check any of this running on Cloudfoundry itself?

@stefans-elastic
Contributor Author

Looks like the persistent cache has a few uses related to cloudfoundry, including the add_cloudfoundry_metadata processor via the newClientCacheWrap function.

  1. Let's indicate in the changelog that this change only impacts Cloudfoundry-related functionality. It looks like the impact would be that we essentially clear the cache and start from scratch, which doesn't seem breaking to me.
  2. Do we have any way to sanity check any of this running on Cloudfoundry itself?
  1. I've updated the changelog message. Please take a look.
  2. I'm not sure how to do this (I would need some assistance with testing on Cloudfoundry).

Member

@cmacknz cmacknz left a comment

Changelog looks good. I also have no experience working with cloudfoundry, so I can't be much help there, but we must have tested this in the past.

I see @jsoriano in the Git history so maybe he can give us some leads.

@jsoriano
Member

I see @jsoriano in the Git history so maybe he can give us some leads.

I cannot help with testing as I haven't used CF in years, but I can try to help with the background of this cache.

You can find here a summary of the analysis that led to using badger for this use case: #19511 (comment)

The summary-of-the-summary is that we needed it to perform well in clusters with several thousand applications, and we needed it to clean up unused entries. Badger fit better than other alternatives, as it performed well under pressure and had built-in TTL support.
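(For reference, a rough sketch of what badger's built-in TTL looks like, assuming the github.com/dgraph-io/badger/v4 module path; the package, function, and variable names are placeholders.)

package kvcache

import (
	"time"

	badger "github.com/dgraph-io/badger/v4"
)

// putWithTTL shows badger's native expiration: the entry is written with a
// TTL and badger stops returning it (and eventually garbage-collects it)
// once the TTL passes, with no extra bookkeeping in the caller.
func putWithTTL(db *badger.DB, key, value []byte, ttl time.Duration) error {
	return db.Update(func(txn *badger.Txn) error {
		return txn.SetEntry(badger.NewEntry(key, value).WithTTL(ttl))
	})
}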

The kind of expiration added in this PR may not work so well for this use case, because it won't remove entries that stop being accessed, such as the ones for applications that stop producing events, which is the most common case in which we want entries to be removed here.
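(One possible mitigation, not part of this PR, would be a periodic sweep that deletes expired entries even if they are never read again. A rough sketch, reusing the hypothetical entry layout and bucket from the sketch in the PR description above:)

// sweepExpired scans the whole bucket and removes entries whose stored
// expiration time has already passed, so entries for applications that
// stopped producing events still get cleaned up eventually. It could be run
// from a ticker-driven goroutine.
func sweepExpired(db *bolt.DB) error {
	now := time.Now()
	return db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket(bucketName)
		if b == nil {
			return nil
		}
		var expired [][]byte
		err := b.ForEach(func(k, v []byte) error {
			var e entry
			if err := json.Unmarshal(v, &e); err != nil {
				return nil // skip malformed entries
			}
			if !e.ExpiresAt.IsZero() && now.After(e.ExpiresAt) {
				// copy the key; it is used after iteration finishes
				expired = append(expired, append([]byte(nil), k...))
			}
			return nil
		})
		if err != nil {
			return err
		}
		for _, k := range expired {
			if err := b.Delete(k); err != nil {
				return err
			}
		}
		return nil
	})
}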

Another thing to take into account is that add_cloudfoundry_metadata may be unnecessary in current deployments, as Cloudfoundry started attaching this metadata to all events and we don't need to query and cache it (see #26868), so maybe we don't need to care a lot about its performance.

@jsoriano
Member

Btw, maybe we need to add some release notes, to warn users of add_cloudfoundry_metadata to be careful when upgrading to the version containing this change, as their caches will be regenerated on first start, potentially making loads of queries to CF APIs.

@stefans-elastic
Contributor Author

@jsoriano

Btw, maybe we need to add some release notes

Do you mean I need to add an entry to CHANGELOG.next.asciidoc?

@jsoriano
Member

Btw, maybe we need to add some release notes

Do you mean I need to add an entry to CHANGELOG.next.asciidoc?

Yes, maybe this is enough, as this seems to appear in https://www.elastic.co/guide/en/beats/libbeat/current/release-notes-8.17.2.html

@stefans-elastic
Contributor Author

@jsoriano I've added a CHANGELOG.next.asciidoc entry; please take a look.

@jsoriano
Member

PR open to try to address the root issue upstream: hypermodeinc/badger#2169

@mauri870
Member

PR open to try to address the root issue upstream: hypermodeinc/badger#2169

Thanks for that! The repo seems fairly active; if we can avoid the rewrite, that would be great.

@cmacknz
Member

cmacknz commented Feb 13, 2025

We definitely need to test this on CloudFoundry before releasing it to anybody. Looking through the log of closed SDHs, it is definitely still used, but I don't know by how many users.

That CI doesn't effectively test this is concerning; we are maintaining this by hoping nothing that breaks it ever changes.

If we can fix upstream and avoid a potential long tail of support pain here, then that could be the best path. Efforts are probably better focused on figuring out how to maintain CloudFoundry support properly first.

@stefans-elastic
Contributor Author

We definitely need to test this on CloudFoundry before releasing it to anybody. Looking through the log of closed SDHs, it is definitely still used, but I don't know by how many users.

That CI doesn't effectively test this is concerning; we are maintaining this by hoping nothing that breaks it ever changes.

If we can fix upstream and avoid a potential long tail of support pain here, then that could be the best path. Efforts are probably better focused on figuring out how to maintain CloudFoundry support properly first.

@cmacknz Should I close this PR?

@cmacknz
Member

cmacknz commented Feb 14, 2025

It definitely sounds like we are not in a position to make big changes to this yet. I'm not officially a codeowner for cloudfoundry, so if you feel like the risk is too great, go ahead and close it.

Just because we wrote code doesn't mean we have to keep it :)

@stefans-elastic
Contributor Author

@cmacknz I really wouldn't like to break anything, so I guess it would be safer not to merge this PR. That being said, having two different cache stores (bbolt and badger) in the codebase isn't ideal. Since using bbolt here might be risky (the lack of TTL functionality might cause stale data to be kept in the storage), we might want to consider switching from bbolt to badger in the other places.

@jsoriano
Member

Fix merged upstream, but I don't know when a version containing it will be released.

@stefans-elastic
Contributor Author

Closing this PR as we are going to update badger once a new version is released that doesn't depend on opencensus (the change was made in this PR).

Labels
  • backport-8.x Automated backport to the 8.x branch with mergify
  • backport-8.16 Automated backport with mergify
  • backport-8.17 Automated backport with mergify
  • backport-9.0 Automated backport to the 9.0 branch
  • bug
  • Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
  • Team:Obs-InfraObs Label for the Observability Infrastructure Monitoring team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Drop dbadger-io dependency
7 participants