-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
security: change client cert expiration caching eviction policy #143081
security: change client cert expiration caching eviction policy #143081
Conversation
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
Note to reviewers, only review the last commit. |
b7fc3d0
to
9c78f7a
Compare
// time window (24h) and groups the certificates by user. When metrics are | ||
// read, the cache reports for each user the earliest expiration for any of | ||
// their recently used certificates. | ||
type Cache struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the bulk of the new changes.
0f0f0e1
to
6f504b4
Compare
if certs, ok := c.cache[user]; ok { | ||
// compute the earliest expiration time. | ||
var minExpiration int64 | ||
for _, cert := range certs { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does a scan through the certificates for a given user. It's assumed that a given user does not have a lot of certificates.
pkg/security/cert_expiry_cache.go
Outdated
|
||
// ClientCertExpirationCacheCapacity is the cluster setting that controls the | ||
// maximum number of client cert expirations in the cache. | ||
var ClientCertExpirationCacheCapacity = settings.RegisterIntSetting( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This setting is removed, it wasn't immediately clear to me what scale of client certificates the system deals with. Even if it were 1 per connecting node, I'd assume the number of nodes talking a single database node in a given 24h period does not exceed 1e5.
6f504b4
to
a579bf7
Compare
pkg/security/certificate_manager.go
Outdated
ctx context.Context, identity, serial string, expiration int64, | ||
) error { | ||
for _, cache := range []*clientcert.Cache{cm.clientCertExpirationCache, cm.clientCertTTLCache} { | ||
if cache == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks wrong. What am I missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm..this does look wrong, and was working...let me check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely a bug, fixed this up.
return time.Duration(float64(interval) * (0.75 + 0.5*math_rand.Float64())) | ||
// evict is a utility function for removing a specific certificate from the | ||
// cache and removing the corresponding user if they have no more certificates. | ||
func (c *Cache) evict(ctx context.Context, user, serial string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename this to evictLocked
since it assumes mutex is taken
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah good call, changed.
pkg/security/certificate_manager.go
Outdated
clientCertExpirationCache *clientcert.Cache | ||
|
||
// Client cert ttl cache. | ||
clientCertTTLCache *clientcert.Cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm on the fence about splitting things up into two caches. Their implementations are so similar that it feels like there should be one cache that's specialized to update both metrics since they're supposed to be identical. In this case because these are only used here in 3 specific locations, it's easy to keep the usages contained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm up for discussing it a little further. For more context behind motivation, making each cache responsible for one metric drastically improves the testing story.
The other improvement is that code duplication would have to exist somewhere, and would likely be multiplied if pushed down. If anything, I'd almost prefer to create an intermediary composite object for managing TTL and Expiration as one surface to the certificate manager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I thought this over a bit more and was able to make it better than I worried about. Check out the new change, lmk what you think.
a579bf7
to
2e3055a
Compare
@@ -0,0 +1,133 @@ | |||
// Copyright 2025 The Cockroach Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gpt made this
a69bbbc
to
200ae5a
Compare
Previously, we cached and emitted metrics for client certificates indefinitely or until they expired. However, this strategy ignores the case where a team rotates their certificates, discontinuing to use soon-to-expire ones, in which crdb would continue to report on their expiration. This change modifies the caching behavior so that eviction from the cache is based on last usage. It does a few additional things here: - Supports the removal of child metrics by key. - Removes the `server.client_cert_expiration_cache.capacity` cluster setting. - Adds a shared utility for adding jitter. - Moves caching and metrics reporting from a single cache to two caches, one for expiration, one for ttl. Fixes: cockroachdb#142686 Epic: CRDB-40209 Release note(ops change): Cluster setting `server.client_cert_expiration_cache.capacity` removed. Metrics `security.certificate.expiration.client` and `security.certificate.ttl.client` now report lowest value seen for user in last 24h.
200ae5a
to
522c333
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the modification to maintain a single cache, thanks!
In general, I think this PR is not "backport friendly" because the diff is large and contains tangential changes (I agree they're good but just not suited for a backport where we prefer to change as little as possible). Can you restructure into two separate commits potentially? Or perhaps 3 might be necessary.
- Smallest possible diff that gets a version of this functionality working. It's not as important in the backport to have these refactors and changes. They make the diff larger, and increase risk.
- Refactors etc. that are nice to have but not needed for backports
- Second iteration of feature to make the code nicer if necessary on
master
only.
Let me know if this is tricky or not possible and we can discuss
@@ -1,4 +1,4 @@ | |||
// Copyright 2023 The Cockroach Authors. | |||
// Copyright 2025 The Cockroach Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: generally there's no need to bump copyright years in files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh nice catch!
Making the PR smaller is definitely possible, and tricky simultaneously. Is it worth hacking on for a day to see if I can trim down the changeset? Here are some of the not-quite-required additions:
On Monday I'll try to put up some example PRs which are smaller, pruning out the easier of the above. |
Closing this in favor of a smaller changeset. |
security: change client cert expiration caching eviction policy
Previously, we cached and emitted metrics for client certificates
indefinitely or until they expired. However, this strategy ignores
the case where a team rotates their certificates, discontinuing to
use soon-to-expire ones, in which crdb would continue to report on
their expiration.
This change modifies the caching behavior so that eviction from the
cache is based on last usage. It does a few additional things here:
server.client_cert_expiration_cache.capacity
clustersetting.
caches, one for expiration, one for ttl.
Fixes: #142686
Epic: CRDB-40209
Release note(ops change): Cluster setting
server.client_cert_expiration_cache.capacity
removed.Metrics
security.certificate.expiration.client
andsecurity.certificate.ttl.client
now report lowest value seen foruser in last 24h.