Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

security: change client cert expiration caching eviction policy #143081

Conversation

angles-n-daemons
Copy link
Contributor

security: change client cert expiration caching eviction policy

Previously, we cached and emitted metrics for client certificates
indefinitely or until they expired. However, this strategy ignores
the case where a team rotates their certificates, discontinuing to
use soon-to-expire ones, in which crdb would continue to report on
their expiration.

This change modifies the caching behavior so that eviction from the
cache is based on last usage. It does a few additional things here:

  • Supports the removal of child metrics by key.
  • Removes the server.client_cert_expiration_cache.capacity cluster
    setting.
  • Adds a shared utility for adding jitter.
  • Moves caching and metrics reporting from a single cache to two
    caches, one for expiration, one for ttl.

Fixes: #142686
Epic: CRDB-40209

Release note(ops change): Cluster setting
server.client_cert_expiration_cache.capacity removed.
Metrics security.certificate.expiration.client and
security.certificate.ttl.client now report lowest value seen for
user in last 24h.

@angles-n-daemons angles-n-daemons requested review from a team as code owners March 18, 2025 18:58
@angles-n-daemons angles-n-daemons requested review from dhartunian, arjunmahishi, aa-joshi and Abhinav1299 and removed request for a team March 18, 2025 18:58
Copy link

blathers-crl bot commented Mar 18, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@angles-n-daemons
Copy link
Contributor Author

Note to reviewers, only review the last commit.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@angles-n-daemons angles-n-daemons force-pushed the client-cert-expiration-by-usage branch 2 times, most recently from b7fc3d0 to 9c78f7a Compare March 18, 2025 19:02
// time window (24h) and groups the certificates by user. When metrics are
// read, the cache reports for each user the earliest expiration for any of
// their recently used certificates.
type Cache struct {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the bulk of the new changes.

@angles-n-daemons angles-n-daemons force-pushed the client-cert-expiration-by-usage branch 2 times, most recently from 0f0f0e1 to 6f504b4 Compare March 18, 2025 19:47
if certs, ok := c.cache[user]; ok {
// compute the earliest expiration time.
var minExpiration int64
for _, cert := range certs {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does a scan through the certificates for a given user. It's assumed that a given user does not have a lot of certificates.


// ClientCertExpirationCacheCapacity is the cluster setting that controls the
// maximum number of client cert expirations in the cache.
var ClientCertExpirationCacheCapacity = settings.RegisterIntSetting(
Copy link
Contributor Author

@angles-n-daemons angles-n-daemons Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This setting is removed, it wasn't immediately clear to me what scale of client certificates the system deals with. Even if it were 1 per connecting node, I'd assume the number of nodes talking a single database node in a given 24h period does not exceed 1e5.

@angles-n-daemons angles-n-daemons force-pushed the client-cert-expiration-by-usage branch from 6f504b4 to a579bf7 Compare March 18, 2025 19:53
ctx context.Context, identity, serial string, expiration int64,
) error {
for _, cache := range []*clientcert.Cache{cm.clientCertExpirationCache, cm.clientCertTTLCache} {
if cache == nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks wrong. What am I missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm..this does look wrong, and was working...let me check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely a bug, fixed this up.

return time.Duration(float64(interval) * (0.75 + 0.5*math_rand.Float64()))
// evict is a utility function for removing a specific certificate from the
// cache and removing the corresponding user if they have no more certificates.
func (c *Cache) evict(ctx context.Context, user, serial string) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename this to evictLocked since it assumes mutex is taken

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah good call, changed.

clientCertExpirationCache *clientcert.Cache

// Client cert ttl cache.
clientCertTTLCache *clientcert.Cache
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence about splitting things up into two caches. Their implementations are so similar that it feels like there should be one cache that's specialized to update both metrics since they're supposed to be identical. In this case because these are only used here in 3 specific locations, it's easy to keep the usages contained.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm up for discussing it a little further. For more context behind motivation, making each cache responsible for one metric drastically improves the testing story.

The other improvement is that code duplication would have to exist somewhere, and would likely be multiplied if pushed down. If anything, I'd almost prefer to create an intermediary composite object for managing TTL and Expiration as one surface to the certificate manager.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I thought this over a bit more and was able to make it better than I worried about. Check out the new change, lmk what you think.

@angles-n-daemons angles-n-daemons force-pushed the client-cert-expiration-by-usage branch from a579bf7 to 2e3055a Compare March 19, 2025 18:28
@@ -0,0 +1,133 @@
// Copyright 2025 The Cockroach Authors.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gpt made this

@angles-n-daemons angles-n-daemons force-pushed the client-cert-expiration-by-usage branch 2 times, most recently from a69bbbc to 200ae5a Compare March 21, 2025 16:41
Previously, we cached and emitted metrics for client certificates
indefinitely or until they expired. However, this strategy ignores
the case where a team rotates their certificates, discontinuing to
use soon-to-expire ones, in which crdb would continue to report on
their expiration.

This change modifies the caching behavior so that eviction from the
cache is based on last usage. It does a few additional things here:
 - Supports the removal of child metrics by key.
 - Removes the `server.client_cert_expiration_cache.capacity` cluster
	 setting.
 - Adds a shared utility for adding jitter.
 - Moves caching and metrics reporting from a single cache to two
   caches, one for expiration, one for ttl.

Fixes: cockroachdb#142686
Epic: CRDB-40209

Release note(ops change): Cluster setting
`server.client_cert_expiration_cache.capacity` removed.
Metrics `security.certificate.expiration.client` and
`security.certificate.ttl.client` now report lowest value seen for
user in last 24h.
@angles-n-daemons angles-n-daemons force-pushed the client-cert-expiration-by-usage branch from 200ae5a to 522c333 Compare March 21, 2025 18:25
Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the modification to maintain a single cache, thanks!

In general, I think this PR is not "backport friendly" because the diff is large and contains tangential changes (I agree they're good but just not suited for a backport where we prefer to change as little as possible). Can you restructure into two separate commits potentially? Or perhaps 3 might be necessary.

  1. Smallest possible diff that gets a version of this functionality working. It's not as important in the backport to have these refactors and changes. They make the diff larger, and increase risk.
  2. Refactors etc. that are nice to have but not needed for backports
  3. Second iteration of feature to make the code nicer if necessary on master only.

Let me know if this is tricky or not possible and we can discuss

@@ -1,4 +1,4 @@
// Copyright 2023 The Cockroach Authors.
// Copyright 2025 The Cockroach Authors.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: generally there's no need to bump copyright years in files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh nice catch!

@angles-n-daemons
Copy link
Contributor Author

angles-n-daemons commented Mar 21, 2025

Making the PR smaller is definitely possible, and tricky simultaneously. Is it worth hacking on for a day to see if I can trim down the changeset?

Here are some of the not-quite-required additions:

  • a shared jitter utility (with tests)
  • a new package for the cache to simplify namespace collisions
  • bytes monitor setup upstream
  • long-held metrics object

On Monday I'll try to put up some example PRs which are smaller, pruning out the easier of the above.

@angles-n-daemons
Copy link
Contributor Author

Closing this in favor of a smaller changeset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

security: certificate caching should evict on usage
3 participants