
Make live-repair part of invariants checks #1610

Open

mkeeter wants to merge 2 commits into main
Conversation

@mkeeter (Contributor) commented Jan 17, 2025

Right now, the Crucible upstairs has a timer which periodically checks whether live-repair should start. Maintaining this timer is tricky: we set it to None if live-repair isn't possible, or Some(t) if live-repair is possible. We have to manually keep the timer in sync with the rest of the system's state.

In #1608, @leftwo expresses concern about whether he's maintaining the timer correctly; it's certainly not obvious.

This PR removes the timer and moves the live-repair check into the "invariant maintenance" category of work. In other words, every time Upstairs::apply is called, it also checks whether live-repair needs to start. This is a cheap check, and doing it every time means we don't have to think about keeping the timer correctly enabled or disabled.

There's one edge case of behavior change: we will no longer perform live-repair on two Downstairs simultaneously. One of them will necessarily enter LiveRepairReady first, since events are handled in a single task; it will then immediately enter LiveRepair in the same call to Upstairs::apply.
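
To make the shape of the change concrete, here is a minimal sketch of the invariant-maintenance idea. The struct fields and method names below are simplified stand-ins, not the actual Crucible types:

```rust
// Illustrative sketch only: live-repair is checked as invariant
// maintenance on every `apply`, instead of via a separately-maintained
// timer.  Names and fields here are simplified stand-ins.
struct Downstairs {
    live_repair_ready: bool,
    other_active: bool,
}

impl Downstairs {
    /// Try to start live-repair; returns true if it actually started.
    fn start_live_repair(&mut self) -> bool {
        if self.live_repair_ready && self.other_active {
            self.live_repair_ready = false;
            true
        } else {
            false
        }
    }
}

enum UpstairsEvent {
    StatUpdate,
    GuestIo,
}

struct Upstairs {
    downstairs: Downstairs,
}

impl Upstairs {
    fn apply(&mut self, event: UpstairsEvent) {
        // Normal event handling (elided).
        self.handle_event(event);

        // Invariant maintenance: run the cheap live-repair check on every
        // event, so there is no timer to keep in sync with the rest of
        // the system's state.
        if self.downstairs.live_repair_ready && !self.downstairs.start_live_repair() {
            // Rare case: a Downstairs is LiveRepairReady, but no other
            // Downstairs is Active, so repair cannot start yet.
        }
    }

    fn handle_event(&mut self, _event: UpstairsEvent) {}
}
```

Because the check runs on every apply, its cost has to stay trivial; in this sketch it is just a couple of boolean tests.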

@leftwo (Contributor) left a comment:
Just some questions about logs

upstairs/src/upstairs.rs

-                info!(self.log, "No Live Repair required at this time");
-            } else if !self.downstairs.start_live_repair(
+            // Try to start live-repair, logging if it fails
+            if !self.downstairs.start_live_repair(
                 &self.state,
                 self.ddef.get_def().unwrap().extent_count(),
             ) {
                 // It's hard to hit this condition; we need a Downstairs to be in
                 // LiveRepairReady, but for no other downstairs to be in Active.
                 warn!(self.log, "Could not start live repair, trying again later");
@leftwo (Contributor) commented:
How fast will this check loop?
Could this warning (which I do like to see) end up spamming the log file?

@mkeeter (Contributor, Author) commented Jan 17, 2025

Yes, this will spam the logs if it fails. The frequency depends on how often events hit Upstairs::apply: a minimum of 1/sec for stat updates, but it could be more if the Upstairs is handling a lot of IO. I don't have strong opinions on the best way to handle this; any ideas?

My weak inclination is to remove all logging from the failure path (or move it to debug! logs, which is what we do for other invariant maintenance); we'll see enough other information in the logs to triangulate what's going wrong in the rare case where we want to live-repair but no Downstairs is Active.
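
For illustration, here is a self-contained sketch of the debug!-level alternative; the types, helper function, and message wording are hypothetical, not the actual Crucible code:

```rust
use slog::{debug, Logger};

// Hypothetical stand-in for the real Crucible Downstairs type.
struct Downstairs;

impl Downstairs {
    // Pretend repair could not start, to exercise the failure path.
    fn start_live_repair(&mut self, _extent_count: u32) -> bool {
        false
    }
}

// Sketch of the failure path with the message demoted to debug!,
// matching how other invariant-maintenance work is logged.
fn check_live_repair(log: &Logger, downstairs: &mut Downstairs, extent_count: u32) {
    if !downstairs.start_live_repair(extent_count) {
        // Hypothetical wording; fires once per Upstairs::apply at most.
        debug!(log, "could not start live repair; will retry on a later apply");
    }
}
```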

@leftwo (Contributor) commented:

I agree with your weak inclination here, at least for this message. A message that spams the log is almost always worse than no message, and in this case there should be enough other information (seeing the DS state transitions) that we can determine what happened.

            }
            Ok(NegotiationResult::Replay) => {
                self.downstairs.replay_jobs(client_id);
            }
            Ok(NegotiationResult::LiveRepair) => {
-                // Immediately check for live-repair
-                self.repair_check_deadline = Some(Instant::now());
+                // We will immediately check for live-repair as part of
@leftwo (Contributor) commented:

Would this be a good place to log a message that someone wants live repair?
Meaning, we would get one message and one message only? Verify range to target?

@mkeeter (Contributor, Author) commented:

Good idea, done in de089ab

@jmpesp (Contributor) commented Jan 17, 2025

> There's one edge case of behavior change: we will no longer perform live-repair on two Downstairs simultaneously. One of them will necessarily enter LiveRepairReady first, since events are handled in a single task; it will then immediately enter LiveRepair in the same call to Upstairs::apply.

  1. Does this mean that the states would remain Active + LiveRepair + LiveRepairReady for the duration of the repair?
  2. This doubles the time it takes to repair the two Downstairs, because they'd be done back to back. This is a bit unfortunate.

@mkeeter (Contributor, Author) commented Jan 17, 2025

> 1. Does this mean that the states would remain Active + LiveRepair + LiveRepairReady for the duration of the repair?

Yes.

> 2. This doubles the time it takes to repair the two because they'd be done back to back. This is a bit unfortunate.

Good catch, but after staring at the code, I don't think it's actually any different from our current situation!

Right now, when we reach LiveRepairReady during negotiation, we set repair_check_deadline to Instant::now(). This means that the next call to Upstairs::select will almost certainly choose UpstairsAction::RepairCheck and begin live-repair of just that Downstairs.

The only way we'd repair two Downstairs simultaneously is if we get very lucky and they (1) complete negotiation at exactly the same time and (2) the second negotiation message wins the race with tokio::time::sleep_until(Instant::now()).
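
As a rough illustration of why that race is so hard to lose, here is a hedged sketch of a select loop in the pre-PR style; select_action, the channel type, and the message payload are assumptions, not the real Crucible code:

```rust
use tokio::sync::mpsc;
use tokio::time::{sleep_until, Instant};

enum UpstairsAction {
    RepairCheck,
    DownstairsMessage(String), // placeholder payload type
}

async fn select_action(
    repair_check_deadline: Option<Instant>,
    downstairs_rx: &mut mpsc::Receiver<String>,
) -> UpstairsAction {
    tokio::select! {
        // Timer arm: pending forever when no deadline is set.
        _ = async {
            match repair_check_deadline {
                Some(t) => sleep_until(t).await,
                None => std::future::pending().await,
            }
        } => UpstairsAction::RepairCheck,

        // Downstairs messages (negotiation results, IO acks, ...) race
        // against the timer arm; a deadline of `Instant::now()` has
        // already elapsed, so it nearly always wins that race.
        Some(msg) = downstairs_rx.recv() => UpstairsAction::DownstairsMessage(msg),
    }
}
```

Under this scheme, a second Downstairs reaching LiveRepairReady would have to deliver its negotiation message before the already-elapsed sleep arm fires, which is exactly the unlikely race described above.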
