
Revisit payload splitting strategy for Datadog Metrics destination. #327

Open
tobz opened this issue Nov 15, 2024 · 2 comments
Labels
area/memory: Memory bounds and memory management.
destination/datadog-metrics: Datadog Metrics destination.
effort/intermediate: Involves changes that can be worked on by non-experts but might require guidance.
type/enhancement: An enhancement in functionality or support.

Comments

@tobz
Member

tobz commented Nov 15, 2024

Context

In #306, we added support for splitting request payloads in the Datadog Metrics destination when we detected that we had exceeded one of the (un)compressed payload size limits.

As part of this work, we took an approach where we incrementally encoded each metric into a single scratch buffer while also writing each encoded metric into the compressor. The thinking was: when it comes time to split a request due to exceeding a limit, all we need to do is re-compress the already-encoded metrics, skipping having to re-encode them all over again.
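For illustration, a minimal sketch of that dual-write shape, assuming flate2 for compression; `Metric`, `encode_metric`, and `RequestBuilder` here are hypothetical stand-ins, not the actual request builder API:

```rust
use std::io::{self, Write};

use flate2::{write::ZlibEncoder, Compression};

struct Metric; // hypothetical stand-in for the real metric type

fn encode_metric(_metric: &Metric, buf: &mut Vec<u8>) {
    buf.extend_from_slice(b"<encoded metric bytes>");
}

struct RequestBuilder {
    // Every encoded metric so far, uncompressed: this is the buffer that can
    // grow as large as the uncompressed payload limit.
    scratch: Vec<u8>,
    compressor: ZlibEncoder<Vec<u8>>,
}

impl RequestBuilder {
    fn new() -> Self {
        Self {
            scratch: Vec::new(),
            compressor: ZlibEncoder::new(Vec::new(), Compression::default()),
        }
    }

    fn push(&mut self, metric: &Metric) -> io::Result<()> {
        let start = self.scratch.len();
        encode_metric(metric, &mut self.scratch);

        // Dual write: the same encoded bytes also go to the compressor, so a
        // split only needs to re-compress from `scratch`, not re-encode.
        self.compressor.write_all(&self.scratch[start..])
    }
}
```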

However, what I failed to consider was how much this would increase the component's memory bounds. Since we're now potentially holding the entire uncompressed payload in the scratch buffer, the scratch buffer can grow as large as the uncompressed limit, which, when accounting for having both a sketch-specific and a series-specific request builder, means we could have two persistent allocations totaling up to ~65MiB... which is pretty suboptimal.

We should rethink our approach here, balancing the overhead of having to re-encode a bunch of metrics against the frequency at which payload splitting actually occurs.

@tobz added the area/memory, destination/datadog-metrics, effort/intermediate, and type/enhancement labels on Nov 15, 2024
@tobz
Member Author

tobz commented Nov 15, 2024

One approach is to just hold on to all of the metrics we've encoded so far and do the re-encoding on demand when a split is required. This would be fairly memory-efficient -- we would only need to allocate the Vec<Metric>, and we could reuse it over time -- and we wouldn't generally expect this vector to grow very large: a ballpark figure of ~256 bytes per Metric means we're spending ~256KB for every 1,000 metrics, so at 100,000 metrics, we're at ~25MB. That's still a good ways less than half of our current potential upper bound.
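As a sketch of the bookkeeping, again with hypothetical `Metric`/`encode_metric` stand-ins and flate2 assumed for compression (the naive halving in `split` is just for illustration; real logic would pack against the actual size limits):

```rust
use std::io::{self, Write};

use flate2::{write::ZlibEncoder, Compression};

struct Metric; // hypothetical stand-in

fn encode_metric(_metric: &Metric) -> Vec<u8> {
    b"<encoded metric bytes>".to_vec()
}

struct RequestBuilder {
    // Ballpark ~256 bytes per entry: ~25MB at 100,000 metrics, and the
    // allocation can be reused across requests.
    pending: Vec<Metric>,
    compressor: ZlibEncoder<Vec<u8>>,
}

impl RequestBuilder {
    fn push(&mut self, metric: Metric) -> io::Result<()> {
        self.compressor.write_all(&encode_metric(&metric))?;
        self.pending.push(metric);
        Ok(())
    }

    // On split, discard the oversized compressed payload and rebuild smaller
    // ones by re-encoding everything we've buffered.
    fn split(&mut self) -> io::Result<Vec<Vec<u8>>> {
        let mut payloads = Vec::new();
        let half = self.pending.len() / 2 + 1;
        for chunk in self.pending.chunks(half) {
            let mut compressor = ZlibEncoder::new(Vec::new(), Compression::default());
            for metric in chunk {
                compressor.write_all(&encode_metric(metric))?;
            }
            payloads.push(compressor.finish()?);
        }
        self.pending.clear();
        Ok(payloads)
    }
}
```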

However, the main downsides of this approach are that we re-encode all of the metrics when splitting, and that it's very difficult to put an upper bound on the size of our Vec<Metric> allocation without introducing further logic to force a flush if we run out of space. The second is the bigger problem: if you want to maximize your ability to continue packing metrics into the current compressed payload, you really don't want a low upper bound on the size of the Vec<Metric>, as it would force you to flush too early.

Again, we're still spending a lot of potential memory on tracking those metrics for what is generally a very rare scenario.

@tobz
Copy link
Member Author

tobz commented Nov 15, 2024

Another approach could be to come up with a strategy for checkpointing the compressor output and only tracking the encoded metrics since the last checkpoint, reducing the number of metrics we need to keep around and re-encode in the event of having to split.

Essentially, if we could determine where it's safe to chop the compressed output, we could flush the compressor when we detect we need to split, take the subslice of the compressed output up to the latest "checkpoint", and then re-encode all of the metrics that we know came after that checkpoint and stick them into their own payload.
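Sketching that bookkeeping, with the same hypothetical stand-ins as above and hand-waving the actual boundary detection (which is the open question here):

```rust
use std::io::{self, Write};

use flate2::{write::ZlibEncoder, Compression};

struct Metric; // hypothetical stand-in

fn encode_metric(_metric: &Metric) -> Vec<u8> {
    b"<encoded metric bytes>".to_vec()
}

struct CheckpointingBuilder {
    compressor: ZlibEncoder<Vec<u8>>,
    // Compressed length at the last point we believe it's safe to truncate.
    checkpoint: usize,
    // Only the metrics encoded since that checkpoint: the most we would ever
    // have to re-encode on a split.
    since_checkpoint: Vec<Metric>,
}

impl CheckpointingBuilder {
    fn push(&mut self, metric: Metric) -> io::Result<()> {
        self.compressor.write_all(&encode_metric(&metric))?;
        self.since_checkpoint.push(metric);
        Ok(())
    }

    fn checkpoint(&mut self) -> io::Result<()> {
        // Flushing pushes all in-flight data out of the compressor; the hard
        // part (not solved here) is guaranteeing that the output up to this
        // offset is something the intake can actually decompress if we chop
        // the stream at this point.
        self.compressor.flush()?;
        self.checkpoint = self.compressor.get_ref().len();
        self.since_checkpoint.clear();
        Ok(())
    }

    // On split: ship the compressed bytes up to `checkpoint` as one payload,
    // then re-encode only `since_checkpoint` into a fresh payload.
}
```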
