kvserver: add raft.proposal.leader.applied.latency #143094

Open · wants to merge 2 commits into master from metricsavg

Conversation
Conversation

@wenyihu6 (Contributor) commented Mar 18, 2025

kvserver: add raft.proposal.leader.applied.latency

This commit adds raft.proposal.leader.applied.latency to StoreMetrics.

It tracks the duration elapsed between a proposal being submitted to Raft and
its application to the state machine by the local Raft leader.

While similar to 'raft.replication.latency', this metric starts counting only
after the proposal has been submitted to Raft, excluding earlier processing
steps such as acquiring proposal quota.

Only successful write commands are measured. Measurements are not recorded for:

  • Failed proposals
  • Read-only commands
  • Read-write commands that don't result in actual writes

Note that this metric only captures application on the proposing leader node
itself; it does not include the time taken for follower replicas to apply the
changes.

Informs: #72393
Release note: none
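
Sketching how a measurement like this could be wired up may help readers outside the KV team. The following is a minimal, self-contained Go sketch, not CockroachDB's actual implementation: `proposal`, `latencyHistogram`, `onSubmitToRaft`, and `onLocalApply` are hypothetical stand-ins. The point is only that the clock starts when the proposal is handed to Raft, stops when the local leader applies the command, and that the excluded cases listed above are skipped.

```go
package main

import (
	"fmt"
	"time"
)

// proposal is a hypothetical stand-in for a Raft proposal. The only field the
// metric needs is the time at which the proposal was submitted to Raft.
type proposal struct {
	submittedToRaft time.Time
	isReadOnly      bool
	wroteSomething  bool
}

// latencyHistogram is a trivial placeholder for a real histogram metric.
type latencyHistogram struct{ samples []time.Duration }

func (h *latencyHistogram) RecordValue(d time.Duration) { h.samples = append(h.samples, d) }

// onSubmitToRaft stamps the proposal as it is handed to Raft. Earlier steps
// (e.g. waiting for proposal quota) are intentionally not included.
func onSubmitToRaft(p *proposal) { p.submittedToRaft = time.Now() }

// onLocalApply records an applied-latency sample once the local leader has
// applied the command, but only for successful write commands that actually
// wrote something.
func onLocalApply(p *proposal, applyErr error, h *latencyHistogram) {
	if applyErr != nil || p.isReadOnly || !p.wroteSomething {
		return // failed, read-only, or no-op write: not measured
	}
	h.RecordValue(time.Since(p.submittedToRaft))
}

func main() {
	var h latencyHistogram
	p := &proposal{wroteSomething: true}
	onSubmitToRaft(p)
	time.Sleep(2 * time.Millisecond) // pretend replication + local application happened here
	onLocalApply(p, nil, &h)
	fmt.Println("recorded samples:", h.samples)
}
```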


kvserver: add raft.proposal.leader.ack.latency

This commit adds raft.proposal.leader.ack.latency to StoreMetrics.

This metric tracks the duration elapsed between a proposal being submitted to
Raft and the local Raft leader's acknowledgement of success to the client.

Note that this may not include the time taken to apply the command to the
state machine when asynchronous consensus is used.

Only successful write commands are measured. Measurements are not recorded for:

  • Failed proposals
  • Read-only commands
  • Read-write commands that don't result in actual writes

Informs: #72393
Release note: none
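
To make the difference from raft.proposal.leader.applied.latency concrete, here is a small hedged sketch of a single proposal's timeline with simulated timestamps (none of this is CockroachDB code): the ack latency stops when the leader acknowledges success to the client, which for early-ackable commands can happen before local application, while the applied latency stops only at local application on the leader.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical timeline of a single successful write proposal on the leader.
	submittedToRaft := time.Now()

	time.Sleep(1 * time.Millisecond) // replication to a quorum (simulated)
	ackedToClient := time.Now()      // leader acks success to the client...

	time.Sleep(1 * time.Millisecond) // ...possibly before it finishes applying
	appliedLocally := time.Now()

	// raft.proposal.leader.ack.latency stops at the acknowledgement; it may
	// exclude local application time when the command is early-acked.
	fmt.Println("ack latency:    ", ackedToClient.Sub(submittedToRaft))

	// raft.proposal.leader.applied.latency stops only at local application on
	// the proposing leader (follower application is not included).
	fmt.Println("applied latency:", appliedLocally.Sub(submittedToRaft))
}
```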

@cockroach-teamcity (Member) commented: This change is Reviewable

@wenyihu6 force-pushed the metricsavg branch 4 times, most recently from a51aa19 to 186829a on March 18, 2025 21:46
@wenyihu6 marked this pull request as ready for review March 18, 2025 22:00
@wenyihu6 requested a review from a team as a code owner March 18, 2025 22:00
@wenyihu6 requested reviews from tbg and arulajmani March 18, 2025 22:00
@wenyihu6 requested a review from pav-kv March 18, 2025 22:03
@wenyihu6 force-pushed the metricsavg branch 7 times, most recently from e802284 to 66da83a on March 19, 2025 02:46
@tbg (Member) commented Mar 19, 2025

I know that Nathan proposed this in the original issue, but these two metrics seem very similar. The only difference is that leader.ack.latency would roughly be recorded in

	if err := appTask.AckCommittedEntriesBeforeApplication(ctx); err != nil {
		return stats, err
	}

which is just a bit earlier in the same handleRaftReady cycle that also applies the command. We already measure the latency of the entire ready cycle (in a histogram), so I'm not sure why we would need both metrics - histograms are fairly expensive (#137780 comes to mind - an overhead we haven't been able to fully remove even after @dhartunian spent plenty of time trying to optimize it away).

Also, raft.proposal.leader.ack.latency is perhaps easy to misinterpret. In this PR, the metric really measures when the entry comes up for application. Theoretically, this can be well after it is committed: imagine the leader building up a large uncommitted raft log. It may then commit a large number of entries at once, but we limit how much comes up for application in a single raft cycle. The entry would be committed potentially ~seconds before it comes up for application. And since an entry comes up for application only after all previous entries have replicated (and applied), there is always a lot of contagion between entries. True, kv.replication.latency closes over even more stuff, but I'm still not sure we'll benefit from more "similar" metrics.

Long story short, I'm hesitant to grow the zoo of replication-related metrics with very similar ones. Instead, metrics that show the true replication lag - how long it takes to get a quorum, or how much is in flight - would be more useful because they'd give us deeper insight into the replication layer. It's tricky to get this right, I think. Half the battle is thinking it through ahead of time. I see in #143094 that you don't need these metrics for your current project, so putting this on ice would be an option.

Let me know if you think I'm missing something here!
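
To illustrate the pacing effect described above with entirely made-up numbers (this is not CockroachDB code, and the 8 MiB per-cycle budget is an assumption for the example): if a large batch of entries commits at once but each ready cycle hands only a bounded amount of committed entries to the apply loop, the tail of the batch waits several cycles before it comes up for application, and a metric anchored at application absorbs that queueing time.

```go
package main

import "fmt"

func main() {
	// 100 committed entries of 1 MiB each, but (hypothetically) at most
	// 8 MiB of committed entries are handed to the apply loop per ready cycle.
	const committedEntries = 100
	const entryBytes = 1 << 20
	const applyBudgetPerCycle = 8 << 20

	entriesPerCycle := applyBudgetPerCycle / entryBytes
	cycles := (committedEntries + entriesPerCycle - 1) / entriesPerCycle

	// The last committed entry only comes up for application in the final
	// cycle, even though it was committed in the same batch as the first one.
	fmt.Printf("entries applied per cycle: %d\n", entriesPerCycle)
	fmt.Printf("ready cycles needed:       %d\n", cycles)
	fmt.Printf("the last entry waits ~%d extra cycles after commit\n", cycles-1)
}
```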

@wenyihu6 (Contributor, Author) commented Mar 20, 2025

TFTR! Agreed that these two metrics are very similar to replication.latency. Given this is no longer needed for the closed ts project, happy to put this on ice until we have a clear need. If needed in the future, we could consider adding them into a range report or a log line to avoid the histogram overhead you mentioned.

> In this PR, the metric really measures when the entry comes up for application. Theoretically, this can be well after it is committed: imagine the leader building up a large uncommitted raft log. It may then commit a large number of entries at once, but we limit how much comes up for application in a single raft cycle. The entry would be committed potentially ~seconds before it comes up for application.

Curious to learn more here - in this scenario, do clients continue to wait here

	case propResult := <-ch:

until the local application is done if they happen to miss this fast path?

	return forEachCheckedCmdIter(ctx, stagedIter, func(cmd CheckedCommand, ctx context.Context) error {
		if !cmd.Rejected() && cmd.IsLocal() && cmd.CanAckBeforeApplication() {
			return cmd.AckSuccess(cmd.Ctx())
		}
		return nil
	})

How often is this fast path hit in general?
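
For readers following the question above, here is a minimal, hypothetical sketch of the waiting pattern being asked about; none of the names are CockroachDB's, and it only illustrates the channel handoff: the proposer blocks on a result channel, and whether it unblocks before or after local application depends on whether the early-ack fast path fires.

```go
package main

import (
	"fmt"
	"time"
)

type propResult struct{ err error }

// leaderLifecycle simulates the leader side: it may ack the proposal early
// (before application) or only once local application has finished.
func leaderLifecycle(ch chan<- propResult, canAckBeforeApplication bool) {
	time.Sleep(1 * time.Millisecond) // replication to a quorum (simulated)
	if canAckBeforeApplication {
		ch <- propResult{} // fast path: client unblocks before application
	}
	time.Sleep(1 * time.Millisecond) // local application (simulated)
	if !canAckBeforeApplication {
		ch <- propResult{} // slow path: client waits for local application
	}
}

func main() {
	ch := make(chan propResult, 1)
	go leaderLifecycle(ch, false /* canAckBeforeApplication */)

	start := time.Now()
	res := <-ch // the proposer blocks here, as in `case propResult := <-ch:`
	fmt.Println("unblocked after", time.Since(start), "err:", res.err)
}
```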

@mgartner (Collaborator)

... histograms are fairly expensive (#137780 comes to mind - an overhead we haven't been able to fully remove even after @dhartunian spent plenty of time trying to optimize it away).

The "stats" name overloading strikes again! #137780 is not related to the histograms @dhartunian optimized; it is related to SQL table stats histograms, i.e., histograms of SQL column values. After landing various optimizations there, I no longer measure significant overhead from them. #143230 will close that issue.

@tbg (Member) commented Mar 21, 2025

🙈 apologies, I meant to link #133306.

@tbg (Member) commented Mar 21, 2025

@wenyihu6: see here:

// CanAckBeforeApplication implements apply.CheckedCommand.
func (c *replicatedCmd) CanAckBeforeApplication() bool {
	// CanAckBeforeApplication determines whether the request type is compatible
	// with acknowledgement of success before it has been applied. For now, this
	// determination is based on whether the request is performing transactional
	// writes. This could be exposed as an option on the batch header instead.
	//
	// We don't try to ack async consensus writes before application because we
	// know that there isn't a client waiting for the result.
	req := c.proposal.Request
	if !req.IsIntentWrite() || req.AsyncConsensus {
		return false
	}
	if et, ok := req.GetArg(kvpb.EndTxn); ok && et.(*kvpb.EndTxnRequest).InternalCommitTrigger != nil {
		// Don't early-ack for commit triggers, just to keep things simple - the
		// caller is reasonably expecting to be able to run another replication
		// change right away, and some code paths pull the descriptor out of memory
		// where it may not reflect the previous operation yet.
		return false
	}
	return true
}

Most "regular" intent writes are async consensus at this point. So EndTxn is the important case that hits the optimization.
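
Restating the quoted predicate with toy types (not the real kvpb ones) to show which request shapes hit the early-ack path tbg describes: it requires a transactional (intent) write using synchronous consensus and, if the batch contains an EndTxn, no internal commit trigger.

```go
package main

import "fmt"

// req is a toy stand-in for a batch request, capturing only the fields the
// quoted CanAckBeforeApplication logic looks at.
type req struct {
	isIntentWrite       bool
	asyncConsensus      bool
	hasEndTxn           bool
	hasEndTxnCommitTrig bool
}

func canAckBeforeApplication(r req) bool {
	if !r.isIntentWrite || r.asyncConsensus {
		return false
	}
	if r.hasEndTxn && r.hasEndTxnCommitTrig {
		return false
	}
	return true
}

func main() {
	// A "regular" intent write using async consensus: no client is waiting,
	// so no early ack.
	fmt.Println(canAckBeforeApplication(req{isIntentWrite: true, asyncConsensus: true})) // false

	// An EndTxn with sync consensus and no commit trigger: the important case
	// that benefits from acking before application.
	fmt.Println(canAckBeforeApplication(req{isIntentWrite: true, hasEndTxn: true})) // true

	// An EndTxn carrying an internal commit trigger: excluded to keep things simple.
	fmt.Println(canAckBeforeApplication(req{isIntentWrite: true, hasEndTxn: true, hasEndTxnCommitTrig: true})) // false
}
```

Run against the three examples, only the plain EndTxn case returns true, matching the point that EndTxn is the important consumer of the optimization.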

@tbg removed their request for review March 24, 2025 12:26