-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make live-repair part of invariants checks #1610
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some questions about logs
info!(self.log, "No Live Repair required at this time"); | ||
} else if !self.downstairs.start_live_repair( | ||
// Try to start live-repair, logging if it fails | ||
if !self.downstairs.start_live_repair( | ||
&self.state, | ||
self.ddef.get_def().unwrap().extent_count(), | ||
) { | ||
// It's hard to hit this condition; we need a Downstairs to be in | ||
// LiveRepairReady, but for no other downstairs to be in Active. | ||
warn!(self.log, "Could not start live repair, trying again later"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How fast will this check loop?
Could this warning (which I do like to see) end up spamming the log file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this will spam the logs if it fails. Frequency depends on how often events are hitting Upstairs::apply
; minimum of 1/sec for stat updates, but it could be more if the Upstairs is spamming IO. I don't have strong opinions on the best way to handle this, any ideas?
My weak inclination is to remove all logging from the failure path (or move them to debug!
logs, which is what we do for other invariant maintenance); we'll see enough other information in the logs to triangulate what's going wrong in the rare case where we want to live-repair but there are no Active downstairs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your weak inclination here, at least for this message. Any message spamming the log is almost always worse, and in this case there should be enough other information (seeing DS transitions) that we can determine what happened.
} | ||
Ok(NegotiationResult::Replay) => { | ||
self.downstairs.replay_jobs(client_id); | ||
} | ||
Ok(NegotiationResult::LiveRepair) => { | ||
// Immediately check for live-repair | ||
self.repair_check_deadline = Some(Instant::now()); | ||
// We will immediately check for live-repair as part of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this place be good to log a message that someone wants live repair?
Meaning, we could get one message and one message only? Verify range to target?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, done in de089ab
|
Yes.
Good catch, but after staring at the code, I don't think it's actually any different from our current situation! Right now, when we reach The only way we'd repair two Downstairs simultaneously is if we get very lucky and they (1) complete negotiation at exactly the same time and (2) the second negotiation message wins the race with |
Right now, the Crucible upstairs has a timer which periodically checks whether live-repair should start. Maintaining this timer is tricky: we set it to
None
if live-repair isn't possible, orSome(t)
if live-repair is possible. We have to manually keep the timer in sync with the rest of the system's state.In #1608, @leftwo expresses concern about whether he's maintaining the timer correctly; it's certainly not obvious.
This PR removes the timer, moving checking for live-repair into the "invariant maintenance" category of work. In other words, every time
Upstairs::apply
is called, it will also check whether live-repair needs to start. This is a cheap check, and doing it every time means we don't have to think about keeping the timer correctly enabled / disabled.There's one edge case of behavior change: we will no longer perform live-repair on two Downstairs simultaneously. One of them will necessarily enter
LiveRepairReady
first, since events are handled in a single task; it will then immediately enterLiveRepair
in the same call toUpstairs::apply
.