roachtest: tpcc/multiregion/survive=zone/chaos=true failed [fatal raft error: match(10192) is out of range] #143058
There was a fatal error in raft. From n3 logs:
This means that replica r301/7 (on s3) received an
Typically, we see this right when the follower restarts, since this is the most likely situation in which a file system loses a write. This is the case here:
is only 3s before the crash (
We don't have the data directories, so it's going to be difficult to RCA what happened here. Some notes:
My best guess is that the write was lost from the file system or Pebble WAL. However, since we didn't power cycle the VM, it is more likely than not a problem above the file system. I'm not sure how to make this actionable. We could, in principle, set up our testing infrastructure to retain disks for such clusters. Then we could, for instance, examine the WAL files that are still present. However, even that would not be enough, since Pebble would have deleted the relevant WAL files in the seconds leading up to the crash. But at least we could do some more rudimentary verification of what the Raft log bounds are.
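To make the "rudimentary verification" idea concrete, here is a minimal sketch of the kind of bounds check one could run against a retained disk. Everything in it is hypothetical: `logBounds`, `checkMatch`, the error text, and the example indexes other than 10192 (taken from the issue title) are made up for illustration and are not CockroachDB or raft APIs.

```go
package main

import "fmt"

// logBounds is a hypothetical summary of a replica's durable Raft log,
// e.g. as recovered from a retained data directory.
type logBounds struct {
	firstIndex uint64 // index of the first entry still present in the log
	lastIndex  uint64 // index of the last durably appended entry
}

// checkMatch returns an error if the bounds are malformed, or if the leader
// believes the follower holds entries up to match while the follower's
// durable log ends earlier. The second case is what a lost write would
// look like from the follower's side.
func checkMatch(b logBounds, match uint64) error {
	if b.firstIndex > b.lastIndex+1 {
		return fmt.Errorf("malformed log bounds [%d, %d]", b.firstIndex, b.lastIndex)
	}
	if match > b.lastIndex {
		return fmt.Errorf("match(%d) is out of range: log ends at index %d", match, b.lastIndex)
	}
	return nil
}

func main() {
	// lastIndex here is an invented value; only 10192 comes from the issue title.
	b := logBounds{firstIndex: 10000, lastIndex: 10190}
	if err := checkMatch(b, 10192); err != nil {
		fmt.Println("log bounds check failed:", err)
	}
}
```

A check like this would only confirm that the durable log is shorter than what the leader had acknowledged; it would not say whether the entries were dropped by the file system, the WAL, or something above it.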
Anything interesting with
Can someone from Storage interpret this?
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.tpcc/multiregion/survive=zone/chaos=true failed with artifacts on release-24.3.9-rc @ b97183a1624094224049587f5aa836c3ff03ea95:
Parameters:
arch=amd64
cloud=gce
coverageBuild=false
cpu=4
encrypted=true
fs=ext4
localSSD=true
runtimeAssertionsBuild=true
ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
See: Grafana
Same failure on other branches
This test on roachdash | Improve this report!
Jira issue: CRDB-48640