Is your proposal related to a problem?
I'm working on two Kubernetes jobs which read/write/delete blocks from object storage:
periodic rewrites on new blocks uploaded to object storage
running maintenance scripts in case of corruption that halts the Compactor (which happens quite often :/)
For the first job, I want to ensure that the Compactor isn't currently running and working on the same blocks I want to rewrite
For the second job, I want to notify the Compactor that it's fine to carry on
Describe the solution you'd like
It would be great if my tasks could call an endpoint like ://thanos-compactor/suspend when they are starting their work.
This would notify Compactor to drop & forget everything it's doing immediately, going into its halted state.
After everything is done, a call to ://thanos-compactor/resume would then get it running again, leaving the halted state and resyncing block information to continue (or rather restart) compaction.
Compactor should support storing its suspension status locally, so it stays suspended after being restarted.
These endpoints should be opt-in via a flag (or separate ones?), as this shouldn't be made available without proper monitoring for Compactors that stay halted for too long.
Having "Resume" in the UI would be nice as well; not so sure about a "Suspend" button, as it could raise concerns about accidental (or malicious) activations impacting stability and performance. But on the other hand, there is already a --disable-admin-operations flag.
Describe alternatives you've considered
Putting thanos compact in between my scripts
My first idea was to run Compactor as a job as well, making it easier to perform the operations in sequence.
While this would be possible with a custom image executing some pre and post thanos compact tasks, it also gets more complex when I don't want to execute those tasks on the same schedule.
Locking streams or bucket instead of Compactor
suspend / resume would work in my use case, but I'm only working with one Compactor instance right now, against a simple, non-replicated bucket.
With more than one Compactor instance working on different streams, discovering, suspending and resuming the correct one might get a bit more involved.
In that case it might be easier to put some lock file into the bucket, which could be checked by the Compactor before every upload.
Haven't thought about it too much, so I'm not sure whether this would even work properly in all cases once replication gets involved.
This would also waste more CPU cycles, since the Compactor only learns about the lock at a later point.
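A minimal sketch of the lock-file alternative, assuming a hypothetical bucket client with `exists`/`upload` methods (a stand-in, not a real Thanos interface), and a made-up lock object name:

```python
LOCK_KEY = "compactor.lock"  # hypothetical object name, not a Thanos convention


def upload_if_unlocked(bucket, block_path, dest_key):
    """Skip the upload when a lock object is present in the bucket.

    `bucket` is any object exposing exists(key) and upload(key, path);
    it stands in for a real object-storage client.
    """
    if bucket.exists(LOCK_KEY):
        # The Compactor only learns about the lock here, after the
        # compaction work has already been done: wasted CPU cycles.
        return False
    bucket.upload(dest_key, block_path)
    return True
```

This illustrates the downside mentioned above: the check can only happen just before the upload, so the preceding compaction work is thrown away.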
Doing relabeling in Compactor
As described in #4941 (comment), my particular use case of relabeling would be handled even better directly by the Compactor.
Additional context
/resume would even be useful without /suspend, in cases where it's easier to do a request (or click a button in the UI) rather than restarting the process, for example when the Compactor isn't running in the cluster.
These endpoints should be opt-in via a flag (or separate ones?), as this shouldn't be made available without proper monitoring for Compactors that stay halted for too long.
This could also be addressed by having a mandatory "requestUntil" timestamp for suspensions, at which the Compactor would auto-resume if it wasn't resumed earlier and the suspension wasn't extended with another call.
A max-suspend flag on process start could also specify the maximum duration a Compactor can be suspended.
The suspend endpoint would respond with how long it will stay suspended, min(requestUntil, max-suspend), so callers know when to schedule another request to renew the suspension.
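The deadline computation described above could be sketched like this (names illustrative, not an actual Thanos API):

```python
from datetime import datetime, timedelta, timezone


def effective_suspend_until(request_until, max_suspend, now=None):
    """Return the timestamp a /suspend response would report back.

    The suspension ends at min(requestUntil, now + max-suspend), so
    callers know when to schedule the next renewal request.
    """
    now = now or datetime.now(timezone.utc)
    hard_limit = now + max_suspend  # cap imposed by the max-suspend flag
    return min(request_until, hard_limit)
```

A caller asking for a longer suspension than max-suspend allows would get the capped deadline back and could renew before it expires.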