Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

structlogging: restructure hot range logger for testability #142996

Merged
merged 1 commit into from
Mar 26, 2025

Conversation

angles-n-daemons
Copy link
Contributor

@angles-n-daemons angles-n-daemons commented Mar 17, 2025

structlogging: restructure hot range logger for testability

This change does a few things to improve the testability of the hot
ranges logger. The includes:

Logger:

  • The introduction of a shouldLog function, which determines whether
    the system should log or not.
  • The breakout of the logging action into its own function.

Tests:

  • The addition of a setup and teardown utility for the hot ranage
    logger.
  • The breakout of the default case and a timed case.
  • Tests for hot ranges which exist in the system to start.

Fixes: #142995
Epic: CRDB-43150

@angles-n-daemons angles-n-daemons requested a review from a team as a code owner March 17, 2025 17:59
@angles-n-daemons angles-n-daemons requested review from xinhaoz and removed request for a team March 17, 2025 17:59
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@angles-n-daemons angles-n-daemons force-pushed the hot-range-logger-cleanup branch 2 times, most recently from ab85aff to 93bbfb0 Compare March 17, 2025 18:01
@@ -24,6 +24,13 @@ import (
// ReportTopHottestRanges limits the number of ranges to be reported per iteration
const ReportTopHottestRanges = 5

// HotRangeLogManualTicker is a channel that can be used to force the hot range
// the logging task to tick.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is this an extra the ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah it is, I'll get rid of it.

// logHotRanges collects the hot ranges from this node's status server and
// sends them to the TELEMETRY log channel.
func (s *hotRangesLoggingScheduler) logHotRanges(ctx context.Context, stopper *stop.Stopper) {
// early exit conditions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use full sentences for comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can actually be removed.

// Within normal operation, there will only be one function listening to this
// ticker, but in the tests there may be multiple "nodes" within the process.
// Tests then will need to send multiple requests, to trigger all the nodes.
var HotRangeLogManualTicker = make(chan struct{}, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we provided a way to mock the ticker being used in tests instead of having both a manual test ticker and the regular one?

Copy link
Contributor Author

@angles-n-daemons angles-n-daemons Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not easily, because the channel isn't exposed in a way that an interface can be built around it. Would you prefer it? I can give an example of what it would look like here, or provide an alternative utility.

Comment on lines 167 to 168
time.Sleep(intervalDuration * 2)
testutils.SucceedsSoon(t, func() error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically we discourage test.Sleep to wait out intervals. You're using SucceedsSoon already so there's no need to sleep.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I'll get rid of this one.


// very that there's no logged hot ranges, despite the system ticking
structlogging.HotRangeLogManualTicker <- struct{}{}
if time.Since(start) > intervalDuration {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part seems a little fragile. From what I understand we want to test that no logs are sent before we've waited out the duration. Maybe there's a better way to do this. Can we check timestamps in the log spy?

Although, if we move to mock the ticker directly I think that might be more straightforard. I'm not really sure what the purpose of the manual ticker is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test we can actually get rid of. The ranges themselves actually won't come online before some period of time - so any waits may not do what we want.

structlogging.TelemetryHotRangesStatsInterval.Override(ctx, &ts.ClusterSettings().SV, time.Millisecond)
structlogging.TelemetryHotRangesStatsLoggingDelay.Override(ctx, &ts.ClusterSettings().SV, 0*time.Millisecond)

structlogging.HotRangeLogManualTicker <- struct{}{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this if we have SucceedsSoon?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't necessarily need it - I'll remove it for now.

@angles-n-daemons angles-n-daemons force-pushed the hot-range-logger-cleanup branch 6 times, most recently from 5be14fd to 3a0700e Compare March 21, 2025 16:32
Comment on lines +111 to +115
// shouldLog checks the below conditions to see whether it should emit logs.
// - Is the cluster setting server.telemetry.hot_ranges_stats.enabled true?
func (s *hotRangesLoggingScheduler) shouldLog() bool {
return TelemetryHotRangesStatsEnabled.Get(&s.st.SV)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is a stylistic preference, but this function just checks 1 cluster setting and there's also only one caller of shouldLog as far as I can tell. Maybe we can just inline this call in maybeLogHotRanges?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, your question is exactly correct - for now this function acts only as a passthrough. Part of the reasoning behind this is the additional checks which will shortly follow this PR. In the future the function will look like:

func shouldLog(lastLogTime time.Time, topReplicaCPU ) bool {
    setttingIntervalDuration := IntervalSetting.Get(&s.st.SV)
    settingEnabled := EnabledSetting.Get(&s.st.SV)
    if settingEnabled && time.Since(lastLogTime) > settingIntervalDuration {
        return true
    }
    if topReplicaCPU > LogCPUThreshold {
        return true
    }
    return false
}

We're going to move the ticker interval to something small like 1m. From there we aim to log at the cluster setting's interval, or when the node in question goes over a certain load.

// Get first 5 logs since the logging loop may have fired multiple times.
// We should have gotten 5 distinct range ids, one for each split point above.
logs := spy.Logs()[:5]
// Depend on a range which we don't exist to go anywhere.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand this comment.

Copy link
Contributor Author

@angles-n-daemons angles-n-daemons Mar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to say "Look for a descriptor, which we always expect to exist in the system."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that sgtm! Ah, did you mean expect in the original comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yes! yeah - the original wording doesn't make too much sense.

return nil
})
structlogging.TelemetryHotRangesStatsInterval.Override(ctx, &ts.ClusterSettings().SV, 1*time.Hour)
// TestHotRangesStatsTenants tests that hot ranges stats are logged per node.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Commented test name doesn't match - also I'm not sure the comment actually reflects what's being tested anymore.

Copy link
Contributor Author

@angles-n-daemons angles-n-daemons Mar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, wrong PR, this should be changed now.

@angles-n-daemons angles-n-daemons force-pushed the hot-range-logger-cleanup branch from 3a0700e to 248495c Compare March 26, 2025 15:43
Copy link
Member

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for the updates!

This change does a few things to improve the testability of the hot
ranges logger. The includes:

__Logger__:
 * The introduction of a shouldLog function, which determines whether
         the system should log or not.
 * The breakout of the logging action into its own function.

__Tests__:
 * The addition of a setup and teardown utility for the hot ranage
         logger.
 * The breakout of the default case and a timed case.
 * Tests for hot ranges which exist in the system to start.

Fixes: cockroachdb#142995
Epic: CRDB-43150

Release note: None
@angles-n-daemons angles-n-daemons force-pushed the hot-range-logger-cleanup branch from 248495c to d13ad97 Compare March 26, 2025 18:23
@angles-n-daemons
Copy link
Contributor Author

np, thanks for the review!

@angles-n-daemons
Copy link
Contributor Author

bors r+

@craig craig bot merged commit 07b1421 into cockroachdb:master Mar 26, 2025
23 of 24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

structlogging: reorganize hot range logger for testability
3 participants