Region snapshot replacement for deleted snapshots #7862
Open

jmpesp wants to merge 2 commits into main from region_snapshot_replacement_for_deleted_snapshots
Conversation
There were many tests for when a snapshot is deleted _after_ a region snapshot replacement was requested and the process had started, but not for when one was deleted _before_ any of the associated sagas ran. Deleting the snapshot caused the snapshot volume to be hard-deleted, which caused the region snapshot replacement machinery to believe it couldn't do its job.

My first thought was that Nexus needed to _not_ hard-delete the snapshot volume, and that this would let the region snapshot replacement proceed. Nexus would also need to reconstruct any snapshot volumes that had already been hard-deleted through some arduous process of finding all the copies of the snapshot volume in the read-only parent fields of other volumes.

But then I started diving into the code relevant to the early stages of region snapshot replacement. When inserting a region snapshot replacement request, the snapshot volume ID was required so that a volume repair record could "lock" the volume. The comments here state that the repair record isn't needed (as snapshot volumes are never directly constructed, and the lock is meant to serialize access to a volume so that an associated Upstairs won't receive multiple replacement requests) and that the lock was done out of an abundance of caution.

Looking at the "region snapshot replacement start" saga, it doesn't need the contents of the volume either: the snapshot volume ID is used to find other resources, but the volume contents themselves are never read.

Digging further, it became apparent that not only was this lock not required (narrator: spoiler alert, it actually was), the snapshot volume wasn't either!

The first step in realizing this was to stop requiring the snapshot's (or read-only region's) volume ID when inserting region snapshot replacement requests into the DB, and to stop creating the volume repair record. This required changing a bunch of unit tests, but those unit tests were creating 100% blank volumes to pass along for this purpose, further proving that the contents of the volume were not actually required:

```patch
@@ -1675,21 +1484,6 @@ mod test {
         let region_id = Uuid::new_v4();
         let snapshot_id = Uuid::new_v4();

-        let volume_id = VolumeUuid::new_v4();
-
-        datastore
-            .volume_create(
-                volume_id,
-                VolumeConstructionRequest::Volume {
-                    id: Uuid::new_v4(), // not required to match!
-                    block_size: 512,
-                    sub_volumes: vec![], // nothing needed here
-                    read_only_parent: None,
-                },
-            )
-            .await
-            .unwrap();
-
         let request = RegionSnapshotReplacement::new_from_region_snapshot(
             dataset_id,
             region_id,
```

The next step in realizing that the snapshot volume was not required: no other region snapshot replacement related code had to change!

_Most_ of the test suite passed after this change; however, I added an integration test for the issue seen in #7790, and it plus some others were intermittently failing.

Was this the wrong move? Nope, it just revealed a case where serialization _was_ required:

- Without a volume repair record on the snapshot volume, multiple region snapshot replacement requests could now run concurrently for the same snapshot volume.
- This is ok! The swapping of read-only targets will be serialized by the transaction in `volume_replace_snapshot`.
- But: caller A and caller B both attempting to
  1. get the number of currently allocated regions for a snapshot volume, then
  2. allocate a single read-only replacement region for the snapshot volume using a redundancy level increased by 1

  will, with a certain interleaving, _both_ think that they've increased the number of allocated regions for the snapshot volume by 1, but both see the same new region. Classic TOCTOU!
- Two region snapshot replacement requests for different region snapshots that both have the _same_ new replacement region will result (after the two `volume_replace_snapshot` calls) in a snapshot volume that has duplicate targets (this was the motivation for the check in #7846, "Validate Volume region sets have unique targets").

Concurrently getting the number of currently allocated regions and then calling `arbitrary_region_allocate` in this way is not safe, so Nexus is required to serialize these callers (a sketch of the bad interleaving follows below). The general way to do this is... using the volume repair record to "lock" the volume.
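To make that interleaving concrete, here is a minimal, deterministic sketch of the race. `FakeDatastore`, `allocate_to_redundancy`, and the region names are hypothetical stand-ins rather than the real Nexus datastore API; the toy allocator simply tops the volume up to the requested redundancy, mirroring the count-then-allocate-with-redundancy-plus-one pattern described above:

```rust
// Both callers read the region count before either allocation lands, so the
// allocator creates exactly one new region, and both callers then claim it
// as "their" replacement.

struct FakeDatastore {
    // Read-only regions currently allocated for the snapshot volume.
    allocated: Vec<&'static str>,
}

impl FakeDatastore {
    fn allocated_region_count(&self) -> usize {
        self.allocated.len()
    }

    // Hypothetical stand-in for `arbitrary_region_allocate`: ensure that
    // `redundancy` regions exist, then return the full allocation.
    fn allocate_to_redundancy(&mut self, redundancy: usize) -> Vec<&'static str> {
        while self.allocated.len() < redundancy {
            self.allocated.push("region-N"); // the single new region
        }
        self.allocated.clone()
    }
}

fn main() {
    // Three read-only regions back the snapshot volume today.
    let originals = ["region-A", "region-B", "region-C"];
    let mut db = FakeDatastore { allocated: originals.to_vec() };

    // Step 1 for both callers: read the current count (both see 3).
    let count_seen_by_a = db.allocated_region_count();
    let count_seen_by_b = db.allocated_region_count();

    // Step 2 for both callers: allocate up to redundancy + 1 (both ask for 4).
    let allocation_a = db.allocate_to_redundancy(count_seen_by_a + 1);
    let allocation_b = db.allocate_to_redundancy(count_seen_by_b + 1);

    // Each caller diffs the allocation against the original targets to find
    // "its" new region; both find the same one.
    let new_for_a: Vec<&str> =
        allocation_a.iter().copied().filter(|r| !originals.contains(r)).collect();
    let new_for_b: Vec<&str> =
        allocation_b.iter().copied().filter(|r| !originals.contains(r)).collect();

    assert_eq!(new_for_a, ["region-N"]);
    assert_eq!(new_for_b, ["region-N"]); // the same single region, twice
    assert_eq!(db.allocated_region_count(), 4); // only one region was ever added
}
```

With a "lock" held across both steps, the second caller cannot read the count until the first caller's new region is in place, so it asks for (and gets) a region of its own.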
Before this commit, creating a volume repair record for a hard-deleted volume was not possible: the creation routine would return an error, saying that it didn't make sense to lock a non-existent volume. Nexus is faced with the following scenario:

- users can delete snapshots at any time, which will hard-delete the snapshot volume
- getting the current number of allocated regions for the hard-deleted snapshot volume and then calling `arbitrary_region_allocate` needs to be exclusive

So this commit relaxes that restriction: volume repair records can now be created for volumes that are either not yet created or hard-deleted. This means that region snapshot replacements will still be serialized even for hard-deleted snapshot volumes (sketched below).

The alternative would be either some other form of exclusivity (another lock) or spending cycles changing the "get the number of currently allocated regions, then allocate an additional one" pattern to be safe for multiple callers. In the name of urgency, the existing volume repair record is used without the aforementioned restriction.

Fixes #7790
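To illustrate the shape of the exclusion the repair record now provides, here is a minimal sketch with hypothetical names (`RepairRecords`, `try_lock`, and so on are stand-ins; the real record is a database row created and deleted by the datastore, not an in-process mutex, and after this change it no longer requires the volume row to exist):

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Hypothetical stand-in for a volume ID; the real code uses typed UUIDs.
type VolumeId = u64;

// Toy stand-in for the volume repair record table: at most one outstanding
// record per volume, regardless of whether the volume itself still exists.
struct RepairRecords {
    locked: Mutex<HashSet<VolumeId>>,
}

impl RepairRecords {
    // Returns true if the record was created, false if one already exists.
    fn try_lock(&self, volume_id: VolumeId) -> bool {
        self.locked.lock().unwrap().insert(volume_id)
    }

    fn unlock(&self, volume_id: VolumeId) {
        self.locked.lock().unwrap().remove(&volume_id);
    }
}

fn replace_region_snapshot(repairs: &RepairRecords, snapshot_volume_id: VolumeId) {
    // Take the repair record first; a second request for the same snapshot
    // volume backs off instead of racing the count-then-allocate step.
    if !repairs.try_lock(snapshot_volume_id) {
        println!("volume {snapshot_volume_id} is already being repaired; retry later");
        return;
    }

    // ... count currently allocated regions, call the allocator with
    // redundancy + 1, and swap the read-only target, now exclusively ...

    repairs.unlock(snapshot_volume_id);
}

fn main() {
    let repairs = RepairRecords { locked: Mutex::new(HashSet::new()) };

    // Simulate an in-flight replacement already holding the record...
    assert!(repairs.try_lock(42));
    // ...so a concurrent request for the same snapshot volume backs off.
    replace_region_snapshot(&repairs, 42);

    // Once the first replacement releases the record, the next request can
    // proceed, even if the snapshot volume itself was hard-deleted.
    repairs.unlock(42);
    replace_region_snapshot(&repairs, 42);
}
```

The design choice is the one described above: reuse the existing volume repair record for exclusivity rather than introducing a new lock or reworking the count-then-allocate pattern to be safe for concurrent callers.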
leftwo reviewed Apr 8, 2025
It's been a while, so I merged with main and pushed just now.
leftwo approved these changes Apr 15, 2025
I've run this overnight with expunging a sled and letting repair move stuff around. I did around 10 expungements and the repairs have all completed.