compact: Penalty deduplication very slow during compaction #7759

Open
mkrull opened this issue Sep 18, 2024 · 3 comments

Comments

mkrull commented Sep 18, 2024

Thanos, Prometheus and Golang version used:

prometheus, version 2.51.0+dedupelabels (branch: HEAD, revision: e67325fe4f59ea136be6123dc78e951781bf8280)
  build user:       root@4390ea63360d
  build date:       20240320-13:35:33
  go version:       go1.22.1
  platform:         linux/amd64
  tags:             netgo,builtinassets,dedupelabels
thanos, version 0.35.1 (branch: HEAD, revision: 086a698b2195adb6f3463ebbd032e780f39d2050)
  build user:       root@be0f036fd8fa
  build date:       20240528-13:54:20
  go version:       go1.21.10
  platform:         linux/amd64
  tags:             netgo

Thanos 0.31.0 and 0.36.1 yielded similar results.

Object Storage Provider:

S3

What happened:

We have blocks collected from a specific service that hit either a bug or a worst-case scenario in penalty-based vertical compaction. The example below is an extreme case, but other, similar blocks still take 2h-8h, with the occasional 15h+ run, on decent AWS hardware (m7i.8xl or similar). Larger or similarly sized blocks elsewhere in our infrastructure take a fraction of that time with the same settings.

thanos-compactor-12 thanos ts=2024-09-18T03:54:23.905492239Z caller=compact.go:1233 level=info group="0@{ ...snip..., metrics_shard=\"prom-12\", ...snip...}" groupKey=0@9689886630010143309 msg="compacted blocks" new=01J7Z0FJCMBZ06817ZBJXR758P duration=24h24m14.990136102s duration_ms=87854990 overlapping_blocks=true blocks="[data/compact/0@9689886630010143309/01J7YA5EP10FQ82BEZ64674J7Y data/compact/0@9689886630010143309/01J7YA59A4VJ0TPJXCQ05X2SD7]"

where duration=24h24m14.990136102s seems completely out of proportion for two blocks that are large, but not outrageously so:

01J7YA5EP10FQ82BEZ64674J7Y
Start Time: September 16, 2024 8:00 PM
End Time: September 16, 2024 10:00 PM
Duration: 2 hours
Series: 1334644
Samples: 723080671
Chunks: 5896758
Total size: 2.25 GiB
Chunks: 2.06 GiB (91.41%)
Index: 198.06 MiB (8.59%)
Daily: 27.02 GiB / day
Resolution: 0
Level: 1
Source: sidecar
01J7YA59A4VJ0TPJXCQ05X2SD7
Start Time: September 16, 2024 8:00 PM
End Time: September 16, 2024 10:00 PM
Duration: 2 hours
Series: 1334526
Samples: 724170555
Chunks: 5893744
Total size: 2.27 GiB
Chunks: 2.07 GiB (91.48%)
Index: 197.95 MiB (8.52%)
Daily: 27.22 GiB / day
Resolution: 0
Level: 1
Source: sidecar

producing:

01J7Z0FJCMBZ06817ZBJXR758P
Start Time: September 16, 2024 8:00 PM
End Time: September 16, 2024 10:00 PM
Duration: 2 hours
Series: 1335035
Samples: 713111069
Chunks: 6930001
Total size: 2.24 GiB
Chunks: 2.04 GiB (91.13%)
Index: 203.55 MiB (8.87%)
Daily: 26.90 GiB / day
Resolution: 0
Level: 2
Source: compactor

The profile suggests most of the time is spent at https://github.com/thanos-io/thanos/blob/v0.36.1/pkg/dedup/iter.go#L422. That in itself is expected, but the line gets hit hundreds of billions to trillions of times.

[pprof screenshot: pprof001]
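
For anyone not familiar with the penalty merge, this is my rough mental model of it as a slice-based sketch (illustrative only, not the actual iterator in pkg/dedup/iter.go; the helper names and example values are mine, the 5000ms initial penalty is what the docs describe):

package main

import (
	"fmt"
	"math"
)

type sample struct {
	t int64 // timestamp in milliseconds
	v float64
}

// mergeWithPenalty merges two overlapping replica series. After emitting a
// sample from one replica, the other replica is "penalized" so that its
// samples too close to the one just emitted are skipped; this keeps the merged
// series at the original scrape frequency instead of roughly doubling it.
func mergeWithPenalty(a, b []sample) []sample {
	const initialPenalty = int64(5000) // ms, used before any sample delta is known

	var out []sample
	var penA, penB int64
	lastT := int64(math.MinInt64)
	i, j := 0, 0

	// seek advances idx until s[idx].t >= t, mimicking an iterator Seek.
	seek := func(s []sample, idx int, t int64) int {
		for idx < len(s) && s[idx].t < t {
			idx++
		}
		return idx
	}

	for {
		// Skip samples of either replica that fall inside its penalty window.
		i = seek(a, i, lastT+1+penA)
		j = seek(b, j, lastT+1+penB)

		switch {
		case i >= len(a) && j >= len(b):
			return out
		case i >= len(a): // only replica B left
			out = append(out, b[j])
			lastT, penB = b[j].t, 0
			j++
			continue
		case j >= len(b): // only replica A left
			out = append(out, a[i])
			lastT, penA = a[i].t, 0
			i++
			continue
		}

		// Both replicas have a candidate: take the earlier one and penalize the
		// other by twice the delta since the last emitted sample.
		if a[i].t <= b[j].t {
			if lastT == math.MinInt64 {
				penB = initialPenalty
			} else {
				penB = 2 * (a[i].t - lastT)
			}
			penA = 0
			out = append(out, a[i])
			lastT = a[i].t
			i++
		} else {
			if lastT == math.MinInt64 {
				penA = initialPenalty
			} else {
				penA = 2 * (b[j].t - lastT)
			}
			penB = 0
			out = append(out, b[j])
			lastT = b[j].t
			j++
		}
	}
}

func main() {
	a := []sample{{0, 1}, {15000, 2}, {30000, 3}}    // replica A, 15s scrapes
	b := []sample{{1000, 1}, {16000, 2}, {31000, 3}} // replica B, offset by 1s
	fmt.Println(mergeWithPenalty(a, b))              // keeps only replica A's samples
}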

What you expected to happen:

Compaction to finish well below 2h so that the compactors can keep up.

How to reproduce it (as minimally and precisely as possible):

I am still trying to figure out what exactly happens. I can reproduce the long durations with the same blocks on my local machine and will spend some time on it to hopefully get more information.
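
Roughly, the local setup is a filesystem "bucket" containing copies of the two blocks plus a one-shot compactor run with the same deduplication settings as production (paths and the replica label name below are placeholders for my actual configuration):

# bucket.yaml
type: FILESYSTEM
config:
  directory: /tmp/thanos-repro/bucket   # holds the two block ULIDs

thanos compact \
  --data-dir=/tmp/thanos-repro/data \
  --objstore.config-file=bucket.yaml \
  --deduplication.func=penalty \
  --deduplication.replica-label=prometheus_replica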

Full logs to relevant components:

Nothing out of the ordinary, even in debug mode.

Anything else we need to know:

I am pretty sure we are "holding it wrong" and would like to figure out in what way. I will add more information as requested or once I find something.

dosubot bot commented Sep 18, 2024

Hey @mkrull! I'm here to help you with any bugs, questions, or contributions you have in mind. Let's work together to solve this issue.

The penalty deduplication process in Thanos can take an excessively long time during compaction if at least one replica label is not specified. This is a requirement for the penalty-based deduplication algorithm to function correctly. Without specifying a replica label, the deduplication process becomes inefficient, leading to prolonged processing times.

To resolve this issue, ensure that you specify at least one replica label using the --deduplication.replica-label flag when using the penalty-based deduplication algorithm. Here is the relevant code snippet that highlights this requirement:

cmd.Flag("deduplication.func", "Experimental. Deduplication algorithm for merging overlapping blocks. "+
	"Possible values are: \"\", \"penalty\". If no value is specified, the default compact deduplication merger is used, which performs 1:1 deduplication for samples. "+
	"When set to penalty, penalty based deduplication algorithm will be used. At least one replica label has to be set via --deduplication.replica-label flag.").
	Default("").EnumVar(&cc.dedupFunc, compact.DedupAlgorithmPenalty, "")

cmd.Flag("deduplication.replica-label", "Label to treat as a replica indicator of blocks that can be deduplicated (repeated flag). This will merge multiple replica blocks into one. This process is irreversible."+
	"Experimental. When one or more labels are set, compactor will ignore the given labels so that vertical compaction can merge the blocks."+
	"Please note that by default this uses a NAIVE algorithm for merging which works well for deduplication of blocks with **precisely the same samples** like produced by Receiver replication."+
	"If you need a different deduplication algorithm (e.g one that works well with Prometheus replicas), please set it via --deduplication.func.").
	StringsVar(&cc.dedupReplicaLabels)

Ensure that you set the --deduplication.replica-label flag appropriately to enable efficient penalty-based deduplication [1].
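
For example, if your Prometheus replicas are distinguished by a prometheus_replica external label (substitute your own label name), the compactor would be started with:

thanos compact \
  --deduplication.func=penalty \
  --deduplication.replica-label=prometheus_replica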

Additionally, consider the following points to further optimize the compaction process:

  1. Compaction Groups / Block Streams: Ensure that blocks are grouped correctly by external labels. Blocks with the same labels are considered as produced by the same source, which helps in efficient compaction.
  2. Vertical Compactions: Thanos supports vertical compaction, which can help in merging overlapping blocks. Ensure that vertical compaction is enabled if you have overlapping blocks.
  3. Scalability: You can run multiple Compactors against a single Bucket as long as each instance compacts a separate stream of blocks. This can help in scaling the compaction process [2].

By addressing these points, you can significantly improve the efficiency of the penalty deduplication process during compaction in Thanos.

To continue talking to Dosu, mention @dosu.

mkrull commented Sep 18, 2024

To address the above:
The compactors have the replica label configured, and each compactor works on blocks from one pair of Prometheus replicas.

mkrull commented Sep 19, 2024

The metric causing the slow deduplication has pretty high cardinality, with ~200,000 series in the 2h block. Those series are spread across many chunks (~30-50 each) due to how the metric is exposed.

Removing that metric brings down the compaction time to expected levels.

While this is certainly not an ideal use of Prometheus, it is still a non-obvious impact from a user perspective and is tricky to deal with when the maintainers of a Prometheus + Thanos stack have limited control over the metrics being produced.
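
As a rough sanity check on the profile numbers (back-of-envelope only, not verified against the code): at a ~15s scrape interval this metric alone amounts to roughly 200,000 series × ~500 samples × 2 replicas ≈ 2×10⁸ overlapping samples per 2h block, so a merge that advances linearly should hit that line a few hundred million times. Reaching the observed hundreds of billions to trillions of hits means each sample is revisited thousands of times, i.e. something closer to quadratic in the overlapping samples per series (200,000 × ~1,000²/2 ≈ 10¹¹), which would fit the unusually high chunk count per series.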
