Region snapshot replacement for deleted snapshots #7862
Open

jmpesp wants to merge 2 commits into main from region_snapshot_replacement_for_deleted_snapshots
Conversation
There were many tests for when a snapshot is deleted _after_ a region snapshot replacement was requested and the process had started, but not for when one was deleted _before_ any of the associated sagas ran. Deleting the snapshot caused the snapshot volume to be hard-deleted, which caused the region snapshot replacement machinery to believe it couldn't do its job.

My first thought was that Nexus needed to _not_ hard-delete the snapshot volume, and that this would let the region snapshot replacement proceed. Nexus would also need to reconstruct any snapshot volumes that had already been hard-deleted through some arduous process of finding all the copies of the snapshot volume in the read-only parent fields of other volumes.

But then I started diving into the code relevant to the early stages of region snapshot replacement. When inserting a region snapshot replacement request, the snapshot volume ID was required so that a volume repair record could "lock" the volume. The comments here state that the repair record isn't needed (as snapshot volumes are never directly constructed, and the lock is meant to serialize access to a volume so that an associated Upstairs won't receive multiple replacement requests) and that the lock was done out of an abundance of caution.

Looking at the "region snapshot replacement start" saga, it doesn't need the contents of the volume either: the snapshot volume ID is used to find other resources, but the volume contents themselves are never read.

Digging further, it became apparent that not only was this lock not required (narrator: spoiler alert, it actually was), the snapshot volume wasn't either!

The first step in realizing this was to stop requiring the snapshot's (or read-only region's) volume ID when inserting region snapshot replacement requests into the DB, and to stop creating the volume repair record. This required changing a bunch of unit tests, but those unit tests were creating 100% blank volumes to pass along for this purpose, further proving that the contents of the volume were not actually required:

```patch
@@ -1675,21 +1484,6 @@ mod test {
         let region_id = Uuid::new_v4();
         let snapshot_id = Uuid::new_v4();

-        let volume_id = VolumeUuid::new_v4();
-
-        datastore
-            .volume_create(
-                volume_id,
-                VolumeConstructionRequest::Volume {
-                    id: Uuid::new_v4(), // not required to match!
-                    block_size: 512,
-                    sub_volumes: vec![], // nothing needed here
-                    read_only_parent: None,
-                },
-            )
-            .await
-            .unwrap();
-
         let request = RegionSnapshotReplacement::new_from_region_snapshot(
             dataset_id,
             region_id,
```

The next step in realizing that the snapshot volume was not required: no other region snapshot replacement related code had to change!

_Most_ of the test suite passed after this change; however, I added an integration test for the issue seen in #7790, and it plus some others were intermittently failing.

Was this the wrong move? Nope, it just revealed a case where serialization _was_ required:

- Without a volume repair record on the snapshot volume, multiple region snapshot replacement requests could now run concurrently for the same snapshot volume.
- This is ok! The swapping of read-only targets will be serialized by the transaction in `volume_replace_snapshot`.
- But: caller A and caller B both attempting to
  1. get the number of currently allocated regions for a snapshot volume, then
  2. allocate a single read-only replacement region for the snapshot volume using a redundancy level increased by 1

  will, with a certain interleaving, _both_ think that they've increased the number of allocated regions for the snapshot volume by 1, but both see the same new region. Classic TOCTOU!
- Two region snapshot replacement requests for different region snapshots that both have the _same_ new replacement region will result (after the two `volume_replace_snapshot` calls) in a snapshot volume that has duplicate targets (this was the motivation for the check in #7846, "Validate Volume region sets have unique targets").

Concurrently getting the number of currently allocated regions and then calling `arbitrary_region_allocate` in this way is not safe, so Nexus is required to serialize these callers (a sketch of the bad interleaving follows below). The general way to do this is... using the volume repair record to "lock" the volume.
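To make that interleaving concrete, here is a minimal, deterministic sketch of the race. `FakeDatastore`, `allocate_to_redundancy`, and the region names are hypothetical stand-ins rather than the real Nexus datastore API; the toy allocator simply tops the volume up to the requested redundancy, mirroring the count-then-allocate-with-redundancy-plus-one pattern described above:

```rust
// Both callers read the region count before either allocation lands, so the
// allocator creates exactly one new region, and both callers then claim it
// as "their" replacement.

struct FakeDatastore {
    // Read-only regions currently allocated for the snapshot volume.
    allocated: Vec<&'static str>,
}

impl FakeDatastore {
    fn allocated_region_count(&self) -> usize {
        self.allocated.len()
    }

    // Hypothetical stand-in for `arbitrary_region_allocate`: ensure that
    // `redundancy` regions exist, then return the full allocation.
    fn allocate_to_redundancy(&mut self, redundancy: usize) -> Vec<&'static str> {
        while self.allocated.len() < redundancy {
            self.allocated.push("region-N"); // the single new region
        }
        self.allocated.clone()
    }
}

fn main() {
    // Three read-only regions back the snapshot volume today.
    let originals = ["region-A", "region-B", "region-C"];
    let mut db = FakeDatastore { allocated: originals.to_vec() };

    // Step 1 for both callers: read the current count (both see 3).
    let count_seen_by_a = db.allocated_region_count();
    let count_seen_by_b = db.allocated_region_count();

    // Step 2 for both callers: allocate up to redundancy + 1 (both ask for 4).
    let allocation_a = db.allocate_to_redundancy(count_seen_by_a + 1);
    let allocation_b = db.allocate_to_redundancy(count_seen_by_b + 1);

    // Each caller diffs the allocation against the original targets to find
    // "its" new region; both find the same one.
    let new_for_a: Vec<&str> =
        allocation_a.iter().copied().filter(|r| !originals.contains(r)).collect();
    let new_for_b: Vec<&str> =
        allocation_b.iter().copied().filter(|r| !originals.contains(r)).collect();

    assert_eq!(new_for_a, ["region-N"]);
    assert_eq!(new_for_b, ["region-N"]); // the same single region, twice
    assert_eq!(db.allocated_region_count(), 4); // only one region was ever added
}
```

With a "lock" held across both steps, the second caller cannot read the count until the first caller's new region is in place, so it asks for (and gets) a region of its own.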
Before this commit, creating a volume repair record for a hard-deleted volume was not possible: the creation routine would return an error, saying that it didn't make sense to lock a non-existent volume. Nexus is faced with the following scenario:

- users can delete snapshots at any time, which will hard-delete the snapshot volume
- getting the current number of allocated regions for the hard-deleted snapshot volume and then calling `arbitrary_region_allocate` needs to be exclusive

So this commit relaxes that restriction: volume repair records can now be created for volumes that are either not yet created or hard-deleted. This means that region snapshot replacements will still be serialized even for hard-deleted snapshot volumes (sketched below).

The alternative would be either some other form of exclusivity (another lock) or spending cycles changing the "get the number of currently allocated regions, then allocate an additional one" pattern to be safe for multiple callers. In the name of urgency, the existing volume repair record is used without the aforementioned restriction.

Fixes #7790
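To illustrate the shape of the exclusion the repair record now provides, here is a minimal sketch with hypothetical names (`RepairRecords`, `try_lock`, and so on are stand-ins; the real record is a database row created and deleted by the datastore, not an in-process mutex, and after this change it no longer requires the volume row to exist):

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Hypothetical stand-in for a volume ID; the real code uses typed UUIDs.
type VolumeId = u64;

// Toy stand-in for the volume repair record table: at most one outstanding
// record per volume, regardless of whether the volume itself still exists.
struct RepairRecords {
    locked: Mutex<HashSet<VolumeId>>,
}

impl RepairRecords {
    // Returns true if the record was created, false if one already exists.
    fn try_lock(&self, volume_id: VolumeId) -> bool {
        self.locked.lock().unwrap().insert(volume_id)
    }

    fn unlock(&self, volume_id: VolumeId) {
        self.locked.lock().unwrap().remove(&volume_id);
    }
}

fn replace_region_snapshot(repairs: &RepairRecords, snapshot_volume_id: VolumeId) {
    // Take the repair record first; a second request for the same snapshot
    // volume backs off instead of racing the count-then-allocate step.
    if !repairs.try_lock(snapshot_volume_id) {
        println!("volume {snapshot_volume_id} is already being repaired; retry later");
        return;
    }

    // ... count currently allocated regions, call the allocator with
    // redundancy + 1, and swap the read-only target, now exclusively ...

    repairs.unlock(snapshot_volume_id);
}

fn main() {
    let repairs = RepairRecords { locked: Mutex::new(HashSet::new()) };

    // Simulate an in-flight replacement already holding the record...
    assert!(repairs.try_lock(42));
    // ...so a concurrent request for the same snapshot volume backs off.
    replace_region_snapshot(&repairs, 42);

    // Once the first replacement releases the record, the next request can
    // proceed, even if the snapshot volume itself was hard-deleted.
    repairs.unlock(42);
    replace_region_snapshot(&repairs, 42);
}
```

The design choice is the one described above: reuse the existing volume repair record for exclusivity rather than introducing a new lock or reworking the count-then-allocate pattern to be safe for concurrent callers.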
leftwo reviewed Apr 8, 2025
It's been a while, so I merged with main and pushed just now.
leftwo approved these changes Apr 15, 2025
I've run this overnight with expunging a sled and letting repair move stuff around. I did around 10 expungements and the repairs have all completed.