-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
structlogging: restructure hot range logger for testability #142996
structlogging: restructure hot range logger for testability #142996
Conversation
ab85aff
to
93bbfb0
Compare
@@ -24,6 +24,13 @@ import ( | |||
// ReportTopHottestRanges limits the number of ranges to be reported per iteration | |||
const ReportTopHottestRanges = 5 | |||
|
|||
// HotRangeLogManualTicker is a channel that can be used to force the hot range | |||
// the logging task to tick. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is this an extra the
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah it is, I'll get rid of it.
// logHotRanges collects the hot ranges from this node's status server and | ||
// sends them to the TELEMETRY log channel. | ||
func (s *hotRangesLoggingScheduler) logHotRanges(ctx context.Context, stopper *stop.Stopper) { | ||
// early exit conditions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use full sentences for comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can actually be removed.
// Within normal operation, there will only be one function listening to this | ||
// ticker, but in the tests there may be multiple "nodes" within the process. | ||
// Tests then will need to send multiple requests, to trigger all the nodes. | ||
var HotRangeLogManualTicker = make(chan struct{}, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we provided a way to mock the ticker being used in tests instead of having both a manual test ticker and the regular one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not easily, because the channel isn't exposed in a way that an interface can be built around it. Would you prefer it? I can give an example of what it would look like here, or provide an alternative utility.
time.Sleep(intervalDuration * 2) | ||
testutils.SucceedsSoon(t, func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typically we discourage test.Sleep
to wait out intervals. You're using SucceedsSoon
already so there's no need to sleep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I'll get rid of this one.
|
||
// very that there's no logged hot ranges, despite the system ticking | ||
structlogging.HotRangeLogManualTicker <- struct{}{} | ||
if time.Since(start) > intervalDuration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part seems a little fragile. From what I understand we want to test that no logs are sent before we've waited out the duration. Maybe there's a better way to do this. Can we check timestamps in the log spy?
Although, if we move to mock the ticker directly I think that might be more straightforard. I'm not really sure what the purpose of the manual ticker is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test we can actually get rid of. The ranges themselves actually won't come online before some period of time - so any waits may not do what we want.
structlogging.TelemetryHotRangesStatsInterval.Override(ctx, &ts.ClusterSettings().SV, time.Millisecond) | ||
structlogging.TelemetryHotRangesStatsLoggingDelay.Override(ctx, &ts.ClusterSettings().SV, 0*time.Millisecond) | ||
|
||
structlogging.HotRangeLogManualTicker <- struct{}{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this if we have SucceedsSoon?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't necessarily need it - I'll remove it for now.
5be14fd
to
3a0700e
Compare
// shouldLog checks the below conditions to see whether it should emit logs. | ||
// - Is the cluster setting server.telemetry.hot_ranges_stats.enabled true? | ||
func (s *hotRangesLoggingScheduler) shouldLog() bool { | ||
return TelemetryHotRangesStatsEnabled.Get(&s.st.SV) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this is a stylistic preference, but this function just checks 1 cluster setting and there's also only one caller of shouldLog
as far as I can tell. Maybe we can just inline this call in maybeLogHotRanges
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, your question is exactly correct - for now this function acts only as a passthrough. Part of the reasoning behind this is the additional checks which will shortly follow this PR. In the future the function will look like:
func shouldLog(lastLogTime time.Time, topReplicaCPU ) bool {
setttingIntervalDuration := IntervalSetting.Get(&s.st.SV)
settingEnabled := EnabledSetting.Get(&s.st.SV)
if settingEnabled && time.Since(lastLogTime) > settingIntervalDuration {
return true
}
if topReplicaCPU > LogCPUThreshold {
return true
}
return false
}
We're going to move the ticker interval to something small like 1m. From there we aim to log at the cluster setting's interval, or when the node in question goes over a certain load.
// Get first 5 logs since the logging loop may have fired multiple times. | ||
// We should have gotten 5 distinct range ids, one for each split point above. | ||
logs := spy.Logs()[:5] | ||
// Depend on a range which we don't exist to go anywhere. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to say "Look for a descriptor, which we always expect to exist in the system."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that sgtm! Ah, did you mean expect
in the original comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh yes! yeah - the original wording doesn't make too much sense.
return nil | ||
}) | ||
structlogging.TelemetryHotRangesStatsInterval.Override(ctx, &ts.ClusterSettings().SV, 1*time.Hour) | ||
// TestHotRangesStatsTenants tests that hot ranges stats are logged per node. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Commented test name doesn't match - also I'm not sure the comment actually reflects what's being tested anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, wrong PR, this should be changed now.
3a0700e
to
248495c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the updates!
This change does a few things to improve the testability of the hot ranges logger. The includes: __Logger__: * The introduction of a shouldLog function, which determines whether the system should log or not. * The breakout of the logging action into its own function. __Tests__: * The addition of a setup and teardown utility for the hot ranage logger. * The breakout of the default case and a timed case. * Tests for hot ranges which exist in the system to start. Fixes: cockroachdb#142995 Epic: CRDB-43150 Release note: None
248495c
to
d13ad97
Compare
np, thanks for the review! |
bors r+ |
structlogging: restructure hot range logger for testability
This change does a few things to improve the testability of the hot
ranges logger. The includes:
Logger:
the system should log or not.
Tests:
logger.
Fixes: #142995
Epic: CRDB-43150