Restic mover job should fail if restic repo is locked #1429

Open
onedr0p opened this issue Oct 22, 2024 · 8 comments

@onedr0p (Contributor) commented Oct 22, 2024

Important

This request requires restic 0.17.0 for the repository-locked exit code and Kubernetes 1.31 for the stable podFailurePolicy feature.

The restic mover should not retry the backup over and over when the repo is locked; we could use the pod failure policy feature and restic's exit codes to achieve this.

podReplacementPolicy: Failed # required when using podFailurePolicy
podFailurePolicy:
  rules:
    - action: FailJob
      onExitCodes:
        containerName: restic
        operator: In
        values: [11] # exit code 11 indicates a locked restic repo

That should mark the job as failed so it is not retried until the next VolSync schedule.

The current behavior is that, on a locked restic repo, the same job retries over and over until the backoffLimit is reached.

Originally posted by @onedr0p in #1415
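
For illustration, here is a fuller sketch of where those fields would sit in the mover's batch/v1 Job spec. The exit-code rule and the podReplacementPolicy/restartPolicy requirements come from the Kubernetes Job API; the job name, image, and entrypoint below are placeholders, not necessarily what VolSync actually generates.

apiVersion: batch/v1
kind: Job
metadata:
  name: volsync-src-example        # placeholder name, not the actual mover job name
spec:
  backoffLimit: 2
  podReplacementPolicy: Failed     # only "Failed" is allowed when podFailurePolicy is set
  podFailurePolicy:
    rules:
      - action: FailJob            # stop retrying; the repo lock will not clear on its own
        onExitCodes:
          containerName: restic
          operator: In
          values: [11]             # restic >= 0.17.0: repository is already locked
  template:
    spec:
      restartPolicy: Never         # podFailurePolicy requires restartPolicy: Never
      containers:
        - name: restic
          image: quay.io/backube/volsync:latest   # placeholder image
          command: ["/mover-restic/entry.sh"]     # placeholder entrypoint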

@PrivatePuffin

I want to add that this issue is even worse than it sounds.
I got into a situation where a mover repeatedly fetched 99 GB(!) from S3 storage by retrying over and over to put 120 GB into a 100 GB PVC.

The sizing issue was my mistake, but it shouldn't repeatedly restart and burn through my S3 budget either.

@onedr0p (Contributor, Author) commented Oct 22, 2024

Yeah, it also feels like backoffLimit should be set to 0 (or at least be configurable); if it fails the first time, I highly doubt it will ever succeed.
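
Purely as a hypothetical sketch of what "configurable" could look like: the moverBackoffLimit field below does not exist in VolSync today, while the rest mirrors a typical restic ReplicationSource.

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: data-backup
spec:
  sourcePVC: data
  trigger:
    schedule: "0 * * * *"          # hourly backups
  restic:
    repository: restic-config      # Secret with RESTIC_REPOSITORY / RESTIC_PASSWORD
    copyMethod: Snapshot
    moverBackoffLimit: 0           # hypothetical field: give up after the first failed attempt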

@PrivatePuffin

Yeah, maybe one retry by default is excusable, but with the S3 backend in particular, users can get into BOATLOADS of financial trouble if things go horribly wrong.

@tesshuflower (Contributor)

Currently, the design of VolSync is essentially very Kubernetes-centric, preferring to retry until things work.

If we just fail the job, then we have no chance to retry even for something like a network hiccup. However, if we could detect specific errors from restic, this is an interesting idea. If we did something, it would definitely need to be an opt-in feature.

One issue is that simply stopping the job from retrying will not solve the scheduling problem, as VolSync will not schedule a new synchronization until the previous one has completed (i.e. the job has completed); VolSync will still recreate the job after the backoffLimit is hit.
We would need modifications to the scheduling code to mark the synchronization as completed with an error. Then there's still the question of whether we hang forever or retry on a schedule (which could still get someone into the same situation).
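
One note on the network-hiccup concern: podFailurePolicy rules only match the failures they list, so a sketch like the one below (assuming restic 0.17.0 exit codes) would only fail fast on a locked repo, ignore pod disruptions, and still let any other failure retry under the normal backoffLimit.

podFailurePolicy:
  rules:
    - action: FailJob              # give up only on the non-retryable case
      onExitCodes:
        containerName: restic
        operator: In
        values: [11]               # restic >= 0.17.0: repository is already locked
    - action: Ignore               # node drain/preemption does not count against backoffLimit
      onPodConditions:
        - type: DisruptionTarget
# Any other failure (e.g. a transient network error exiting with code 1) matches no rule
# and is retried as usual until backoffLimit is reached.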

@PrivatePuffin

@tesshuflower I agree, a few retries are not a bad thing.
But having it retry endlessly is going to cause more trouble than it solves.

@samip5 commented Oct 22, 2024

Yeah, my S3 bill is how I noticed the whole problem, after the fact.

@onedr0p (Contributor, Author) commented Oct 23, 2024

However, if we could detect specific errors from restic, this is an interesting idea. If we did something, it would definitely need to be an opt-in feature.

That's good to hear, and I'm glad you agree. There's no point in retrying if we know from restic's exit codes that the job will never succeed. Hopefully there can be some improvements in this area.

@PrivatePuffin

Yeah, my S3 bill is how I noticed the whole problem, after the fact.

Same here. Luckily for me it wasn't the bill, it was the quota warning.
