failure-injection: add disk stall failure modes #143104

Open · wants to merge 3 commits into master from disk-stalls
Conversation

@DarrylWong (Contributor) commented Mar 18, 2025:

This change adds cgroup and dmsetup disk stalls to the failure injection library. Most of this logic is a port of the existing disk stall implementations found in roachtestutil; however, several additions were made so that cleanup and restore of said failures return the system to its original state.

Informs: #138970
Release note: none

blathers-crl (bot) commented Mar 18, 2025:

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member) commented:

This change is Reviewable

@DarrylWong DarrylWong force-pushed the disk-stalls branch 4 times, most recently from 6696122 to af95c8d on March 19, 2025 15:58
// 22.04+ addition, as older distributions and the upstream cgroup implementation do not
// have this check.
//
// This additional check appears to protect against the io hanging when allowing bursts
@DarrylWong (author) commented:

Mystery solved, although I can't say it was worth the time spent reading the cgroups source code 🤣 At least we can now say roachprod documents cgroups better than cgroups does.

@DarrylWong DarrylWong force-pushed the disk-stalls branch 5 times, most recently from 0e9ad9c to d9375bd on March 20, 2025 15:19
return err
}

// When we unmounted the disk in setup, the cgroups controllers may have been removed, re-add them.
@DarrylWong (author) commented:

Having to manually add back the controllers feels iffy to me, but no matter how hard I tried, I couldn't get them to be added back short of restarting the roachprod cluster.

Member commented:

> ...no matter how hard I tried, I couldn't get them to be added back short of restarting the roachprod cluster.

By "restarting", you mean VMs (not systemd)? It does look error-prone. What if we miss something; what's the chance that the cluster remains fully reusable (by an independent roachtest)?

@DarrylWong (author) commented:

> By "restarting", you mean VMs

Yeah, neither restarting the daemon nor restarting the cockroach service (although I don't think this would do anything, since we restart the node anyway) re-added the controllers. Looking at the roachprod start script, we don't touch cgroups, so I have no idea what the difference would be.

// If true, allow the failure mode to restart nodes as needed. E.g. dmsetup requires
// the cockroach process to not be running to properly set up. If RestartNodes is true,
// then the failure mode will restart the cluster for the user.
RestartNodes bool
@DarrylWong (author) commented:

The current disk stall roachtests are okay with the node dying, since they don't try to do anything else in the test after the fact. However, I still think we should restart the node at the end so artifacts can be collected; you can see in all of the current roachtests that there are no cockroach logs for the stalled node.

var err error

// Disabling journaling requires the cockroach process to not have been started yet.
if diskStallArgs.RestartNodes {
@DarrylWong (author) commented:

Note to self: having to restart the cluster to set up/clean up dmsetup seems really disruptive - makes me lean towards using cgroups in the failure injection framework.

@DarrylWong DarrylWong changed the title Disk stalls failure-injection: add disk stall failure modes Mar 20, 2025
@DarrylWong DarrylWong force-pushed the disk-stalls branch 3 times, most recently from 1b98045 to 172479a on March 20, 2025 18:12
@DarrylWong DarrylWong marked this pull request as ready for review March 20, 2025 18:15
@DarrylWong DarrylWong requested a review from a team as a code owner March 20, 2025 18:15
@DarrylWong DarrylWong requested review from srosenberg and golgeek and removed request for a team March 20, 2025 18:15
var tests []failureSmokeTest
for _, stallWrites := range []bool{true, false} {
for _, stallReads := range []bool{true, false} {
if !stallWrites && !stallReads {
Member commented:

I wonder if it might be still useful as a smoke test to verify that the no-op versions of validateFailure and validateRestore actually work as expected.

@DarrylWong (author) commented:

Hmm, do you mean check that they should fail? If so, then failure-injection/smoke-test/noop should cover that, but let me know if I'm misunderstanding.

@DarrylWong DarrylWong force-pushed the disk-stalls branch 2 times, most recently from 25c1f6e to 917453a on March 27, 2025 17:44
return errors.CombineErrors(err, res[0].Err)
}

func (f *GenericFailure) WaitForSQLLiveness(
Member commented:

Nit: sqlliveness is for session management [1], this is more like "is sql ready". I also wonder if we could merge it with WaitForSQLReady? (The "pinging" logic could be passed as a func, but the retry logic is reused.)

[1] https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20200615_sql_liveness.md

@DarrylWong (author) commented:

👍 Changed the name to WaitForSqlReady. I also added a retryForDuration helper that extracts the retry logic and makes it reusable.

> I also wonder if we could merge it with WaitForSQLReady?

Do you mean the roachtestutil WaitForSQLReady? I'm not aware of something similar at the roachprod layer.

Member commented:

Yep. We can refactor on subsequent passes.

@DarrylWong DarrylWong force-pushed the disk-stalls branch 3 times, most recently from e960d18 to dcd297b on March 27, 2025 20:52
@herkolategan (Collaborator) left a comment:

Nice work, left a few comments. Also curious if the cgroup test would work with the dmsetup staller as both seem to wait for and recover from node death?

if err = failureMode.Restore(ctx, quietLogger, t.args); err != nil {
l.Printf("%s: Running Recover(); details in %s.log", t.failureName, file)
if err = c.AddGrafanaAnnotation(ctx, l, grafana.AddAnnotationRequest{
Text: fmt.Sprintf("%s recovred", t.testName),
Collaborator commented:

Small typo: "recovred" -> "recovered"

@@ -70,7 +83,8 @@ func (f *GenericFailure) Run(
l.Printf("Local cluster detected, logging command instead of running:\n%s", cmd)
return nil
}
return f.c.Run(ctx, l, l.Stdout, l.Stderr, install.WithNodes(node), f.runTitle, cmd)
l.Printf(cmd)
Collaborator commented:

Was this for debugging? If it's meant to stay in, could add something to the print to indicate it's a command.

@DarrylWong (author) commented:

Yeah, it was intentional. Before, I left it up to the caller whether to log the command, but I found I always wanted it logged.

> could add something to the print to indicate it's a command.

Done; I also reworded the local cluster log to make it less confusing.

// something went wrong in Recover.
err := s.setThroughput(ctx, l, stallType, throughput{limited: false}, nodes, cockroachIOController)
if err != nil {
l.PrintfCtx(ctx, "error unstalling the disk; stumbling on: %v", err)
Collaborator commented:

Curiosity question: when should Cleanup return an error, versus just warning as it does here?

@DarrylWong (author) commented:

The user may or may not have successfully called Restore, so Cleanup just retries on a best-effort basis. I added this because a disk stall is so disruptive that it seemed like a bad idea to leave the cluster in a bad state.

Maybe this needs to be clarified in the failure injection library API contract, i.e. whether Cleanup can be called without calling Recover first. I think it makes sense in the context of how users would want to use Cleanup, e.g. something went wrong and we just want to restore everything in one shot.

type DiskStallArgs struct {
StallLogs bool
StallReads bool
StallWrites bool
Collaborator commented:

Since we allow setting more than one stall type, is there a scenario where we would want the throughput to be different for the read vs. the write stall?

@DarrylWong (author) commented:

I was thinking that if you wanted to set two different throughputs, you could just inject two different failures. I originally had a misleading comment saying concurrent disk stall failures weren't allowed; I removed that.

// While this is exactly what we want, it's not the intended use case and is invalid.
bytesPerSecond := 4
if diskStallArgs.Throughput > 0 {
bytesPerSecond = diskStallArgs.Throughput
Collaborator commented:

Doesn't this still allow the FI framework to accidentally set it to a value in (0, 4)? Might be worthwhile to error out earlier with a more descriptive error than the one from the cli call.

@DarrylWong (author) commented:

Good call. I also changed the default to 2; from what I can tell there's no reason not to use 2, I was just porting over the existing behavior.

return nil
}

func getStallType(diskStallArgs DiskStallArgs) ([]bandwidthType, error) {
Collaborator commented:

Nit: feels like this should be plural, getStallTypes, so usage reads stallTypes, err = getStallTypes(...).

@DarrylWong (author) commented:

Done.

})
}

func (s *CGroupDiskStaller) WaitForFailureToPropagate(
@herkolategan (Collaborator) commented Mar 31, 2025:

A general question on how the disk staller will be used. Since we have the throughput setting I'm assuming it will always be set to a value that will cause the node to die?

@DarrylWong (author) commented:

> I'm assuming it will always be set to a value that will cause the node to die?

Good question, I don't think that's true, so maybe always waiting for the node to die is too strong. The registerDiskBandwidthOverload test, for example, doesn't want the node to die; it just slows writes to 128 MiB to test AC.

I have two trains of thought for fixing this. The first is that we can just add an ExpectNodeDeath flag. The second is that we don't need to do anything; tests that don't expect the node to die can just skip calling WaitForFailureToPropagate.

My thinking is that WaitForFailureToPropagate would be used mostly as a suggestion anyway (i.e. log a warning but don't fail the test). Waiting for a disk stall to be detected (when the throughput isn't 0) or waiting for ranges to rebalance is workload dependent and not bounded anyway. E.g. if we slow writes to 5 MiB, it's not obvious whether the node will die, so wait up to N minutes and log a warning if it doesn't.

@DarrylWong (author) commented:

> Also curious if the cgroup test would work with the dmsetup staller as both seem to wait for and recover from node death?

Unfortunately the cgroups test works by parsing /sys/fs/cgroup/system.slice/io.stat but dmsetup disables cgroups so that file doesn't exist.

Switching to iotop and filtering by PID should work for both, but I'll have to experiment with it. FWIW though, I think what the dmsetup test does is a lot simpler and more intuitive; i.e., if writes are stalled, testing that we can't write a file seems like a stronger check than heuristically deciding the disk is stalled because we observed "low enough" bytes.

@DarrylWong DarrylWong force-pushed the disk-stalls branch 3 times, most recently from fc0225d to a497ab1 on March 31, 2025 16:06
This change adds cgroup and dmsetup disk stalls to the failure
injection library. Most of this logic is a port of the existing
disk stall implementations found in roachtestutil; however,
several additions were made so that cleanup and restore of said
failures return the system to its original state.
…just one

This will exercise the failure injection library's ability to inject
failures on multiple nodes at once.

To support this, a SeededRandGroups helper was added to NodeListOption.
While the two names have similar meanings in the context of failure
injection, we often use restore to refer to restoring a backup. Let's
remove any ambiguity by renaming it to recover.
}

func (f *GenericFailure) Run(
ctx context.Context, l *logger.Logger, node install.Nodes, args ...string,
) error {
cmd := strings.Join(args, " ")
l.Printf("running cmd: %s", cmd)
// In general, most failures shouldn't be run locally out of caution.
if f.c.IsLocal() {
Member commented:

Maybe this should be enforced as a precondition, i.e., panic instead of logging?

return errors.Wrapf(err, "never connected to node %d after %s", node, timeout)
}

func (f *GenericFailure) WaitForSQLNodeDeath(
Member commented:

Nit: WaitForSQLUnavailable. "Death" implies the process is down, which PingNode can't establish; so, the current name is somewhat misleading.

start := timeutil.Now()
err := retryForDuration(ctx, timeout, func() error {
if err := f.PingNode(ctx, l, node); err != nil {
l.Printf("Connections to node %d dropped after %s", node, timeutil.Since(start))
Member commented:

Nit: "dropped" -> "unavailable" for the same reason, as above. We can't really tell if it's unavailable because of dropped network packets or some other reason.

return &CGroupDiskStaller{GenericFailure: genericFailure}, nil
}

const CgroupsDiskStallName = "cgroup-disk-stall"
Member commented:

Nit: move above the func s.t. all var/const/type declarations precede the definitions.

return CgroupsDiskStallName
}

type DiskStallArgs struct {
@srosenberg (Member) commented Apr 1, 2025:

Nit: move above the funcs s.t. all var/const/type declarations precede the definitions.

if err := s.Run(ctx, l, s.c.Nodes, "mkdir -p {store-dir}/logs"); err != nil {
return err
}
if err := s.Run(ctx, l, s.c.Nodes, "rm -f logs && ln -s {store-dir}/logs logs || true"); err != nil {
Member commented:

This would end up wiping node logs, every time this failure is injected, for the duration of some arbitrary test.

if diskStallArgs.StallWrites {
stallTypes = []bandwidthType{writeBandwidth}
}
if diskStallArgs.StallReads {
Member commented:

Should we further validate if StallLogs is set without StallWrites? In that case, we don't expect the process to panic.

`sudo dmsetup create data1`); err != nil {
return err
}
// This has occasionally been seen to fail with "Device or resource busy",
Member commented:

Nit: missing TODO?

// snapd will run "snapd auto-import /dev/dm-0" via udev triggers when
// /dev/dm-0 is created. This possibly interferes with the dmsetup create
// reload, so uninstall snapd.
if err = s.Run(ctx, l, s.c.Nodes, `sudo apt-get purge -y snapd`); err != nil {
Member commented:

This makes me wonder if we should disable cluster reuse for anything that touches system dependencies. apt tends to flake occasionally and adds a source of non-determinism.

return nil
}

func (s *DmsetupDiskStaller) WaitForFailureToPropagate(
Member commented:

Both of these helpers seem generic; could be moved to GenericFailure.

stalledNode := stalledNodeGroup.SeededRandNode(rng)
unaffectedNode := unaffectedNodeGroup.SeededRandNode(rng)

ableToCreateFile := func(ctx context.Context, l *logger.Logger, c cluster.Cluster, node option.NodeListOption) bool {
Member commented:

Nit: ableToCreateFile -> touchFile. Since we can't tell exactly why this func failed, e.g., ssh timeout, we should use a "weaker", i.e., less precise, name. Otherwise, the name could mislead the author/reader into thinking that the false result is caused by something else entirely, e.g., file permissions.

}
return nil
},
workload: func(ctx context.Context, c cluster.Cluster) error {
Member commented:

Same workload is duplicated above, albeit with different CLI options.

if err := test.run(ctx, t.L(), c, fr); err != nil {
t.Fatal(errors.Wrapf(err, "%s failed", test.testName))
}
cancel()
Member commented:

For clean shutdown, we want this inside defer since Fatal escapes.

// Check the number of bytes read and written to disk.
res, err := s.RunWithDetails(
ctx, l, node,
fmt.Sprintf(`grep -E '%d:%d' /sys/fs/cgroup/system.slice/io.stat |`, maj, min),
Member commented:

It seems there are some lightweight metrics collectors, e.g., [1], which we could easily add as a scrape target. This could make it easier to debug; rather than sampling over discrete intervals ourselves, we'd have a more or less continuous view in Grafana.

[1] https://github.com/arianvp/cgroup-exporter/tree/main/collector

@srosenberg (Member) left a comment:

Nice work! There are a number of things we could improve in subsequent PR(s). It might be instructive to document some of them by way of GH issues. Otherwise, feel free to merge after a final sanity pass. 🚢
