Restic mover job should fail if restic repo is locked #1429

Open
onedr0p opened this issue Oct 22, 2024 · 8 comments

@onedr0p (Contributor) commented Oct 22, 2024

Important

This request requires restic 0.17.0 for the repository-locked exit code and Kubernetes 1.31 for the stable podFailurePolicy feature.

The restic mover should not retry the backup over and over when the repo is locked; we could use the pod failure policy feature and restic's exit codes to achieve this.

podReplacementPolicy: Failed # required when using podFailurePolicy
podFailurePolicy:
  rules:
    - action: FailJob
      onExitCodes:
        containerName: restic
        operator: In
        values: [11] # exit code 11 indicates a locked restic repo

That should mark the job as failed so it is not retried until the next VolSync schedule.

The current behavior is that, on a locked restic repo, the same job retries over and over until the backoffLimit is reached.

Originally posted by @onedr0p in #1415
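
For illustration, here is a fuller sketch of where those fields would sit in the mover's batch/v1 Job spec. The exit-code rule and the podReplacementPolicy/restartPolicy requirements come from the Kubernetes Job API; the job name, image, and entrypoint below are placeholders, not necessarily what VolSync actually generates.

apiVersion: batch/v1
kind: Job
metadata:
  name: volsync-src-example        # placeholder name, not the actual mover job name
spec:
  backoffLimit: 2
  podReplacementPolicy: Failed     # only "Failed" is allowed when podFailurePolicy is set
  podFailurePolicy:
    rules:
      - action: FailJob            # stop retrying; the repo lock will not clear on its own
        onExitCodes:
          containerName: restic
          operator: In
          values: [11]             # restic >= 0.17.0: repository is already locked
  template:
    spec:
      restartPolicy: Never         # podFailurePolicy requires restartPolicy: Never
      containers:
        - name: restic
          image: quay.io/backube/volsync:latest   # placeholder image
          command: ["/mover-restic/entry.sh"]     # placeholder entrypoint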

@PrivatePuffin

I want to add that this issue is even worse than it sounds.
I got into a situation where a mover repeatedly fetched 99 GB(!) from S3 storage by retrying over and over to put 120 GB into a 100 GB PVC.

The sizing issue was my mistake, but it shouldn't repeatedly restart and burn through my S3 budget either.

@onedr0p (Contributor, Author) commented Oct 22, 2024

Yeah, it also feels like backoffLimit should be set to 0 (or at least be configurable); if it fails the first time, I highly doubt it will ever succeed.
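
Purely as a hypothetical sketch of what "configurable" could look like: the moverBackoffLimit field below does not exist in VolSync today, while the rest mirrors a typical restic ReplicationSource.

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: data-backup
spec:
  sourcePVC: data
  trigger:
    schedule: "0 * * * *"          # hourly backups
  restic:
    repository: restic-config      # Secret with RESTIC_REPOSITORY / RESTIC_PASSWORD
    copyMethod: Snapshot
    moverBackoffLimit: 0           # hypothetical field: give up after the first failed attempt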

@PrivatePuffin

Yeah, maybe one retry by default is excusable, but with the S3 backend in particular, users can get into BOATLOADS of financial trouble if things go horribly wrong.

@tesshuflower (Contributor)

Currently, the design of VolSync is essentially very Kubernetes-centric, preferring to retry until things work.

If we just fail the job, then we have no chance to retry even for something like a network hiccup. However, if we could detect specific errors from restic, this is an interesting idea. If we did something, it would definitely need to be an opt-in feature.

One issue is that simply stopping the job from retrying will not solve the scheduling problem, as VolSync will not schedule a new synchronization until the previous one has completed (i.e. the job has completed); VolSync will still recreate the job after the backoffLimit is hit.
We would need modifications to the scheduling code to mark the synchronization as completed with an error. Then there's still the question of whether we hang forever or retry on a schedule (which could still get someone into the same situation).
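
One note on the network-hiccup concern: podFailurePolicy rules only match the failures they list, so a sketch like the one below (assuming restic 0.17.0 exit codes) would only fail fast on a locked repo, ignore pod disruptions, and still let any other failure retry under the normal backoffLimit.

podFailurePolicy:
  rules:
    - action: FailJob              # give up only on the non-retryable case
      onExitCodes:
        containerName: restic
        operator: In
        values: [11]               # restic >= 0.17.0: repository is already locked
    - action: Ignore               # node drain/preemption does not count against backoffLimit
      onPodConditions:
        - type: DisruptionTarget
# Any other failure (e.g. a transient network error exiting with code 1) matches no rule
# and is retried as usual until backoffLimit is reached.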

@PrivatePuffin

@tesshuflower I agree, a few retries are not a bad thing.
But having it retry endlessly is going to cause more trouble than it solves.

@samip5 commented Oct 22, 2024

Yeah, my S3 bill is how I noticed the whole problem, after the fact.

@onedr0p (Contributor, Author) commented Oct 23, 2024

However, if we could detect specific errors from restic, this is an interesting idea. If we did something, it would definitely need to be an opt-in feature.

That's good to hear, and I'm glad you agree. There's no point in retrying if we know from restic's exit codes that the job will never succeed. Hopefully there can be some improvements in this area.

@PrivatePuffin

Yeah, my S3 bill is how I noticed the whole problem, after the fact.

Same here. Luckily for me it wasn't the bill, it was the quota warning.
