
Revisit payload splitting strategy for Datadog Metrics destination. #327

Open
tobz opened this issue Nov 15, 2024 · 2 comments
Labels
area/memory: Memory bounds and memory management.
destination/datadog-metrics: Datadog Metrics destination.
effort/intermediate: Involves changes that can be worked on by non-experts but might require guidance.
type/enhancement: An enhancement in functionality or support.

Comments

@tobz
Member

tobz commented Nov 15, 2024

Context

In #306, we added support for splitting request payloads in the Datadog Metrics destination when we detected that we had exceeded one of the (un)compressed payload size limits.

As part of this work, we took an approach where we incrementally encoded each metric into a single scratch buffer while also writing each encoded metric into the compressor. The thinking was: when it comes time to split a request due to exceeding a limit, all we need to do is re-compress the already-encoded metrics, skipping having to re-encode them all over again.
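For illustration, a minimal sketch of that dual-write shape, assuming flate2 for compression; `Metric`, `encode_metric`, and `RequestBuilder` here are hypothetical stand-ins, not the actual request builder API:

```rust
use std::io::{self, Write};

use flate2::{write::ZlibEncoder, Compression};

struct Metric; // hypothetical stand-in for the real metric type

fn encode_metric(_metric: &Metric, buf: &mut Vec<u8>) {
    buf.extend_from_slice(b"<encoded metric bytes>");
}

struct RequestBuilder {
    // Every encoded metric so far, uncompressed: this is the buffer that can
    // grow as large as the uncompressed payload limit.
    scratch: Vec<u8>,
    compressor: ZlibEncoder<Vec<u8>>,
}

impl RequestBuilder {
    fn new() -> Self {
        Self {
            scratch: Vec::new(),
            compressor: ZlibEncoder::new(Vec::new(), Compression::default()),
        }
    }

    fn push(&mut self, metric: &Metric) -> io::Result<()> {
        let start = self.scratch.len();
        encode_metric(metric, &mut self.scratch);

        // Dual write: the same encoded bytes also go to the compressor, so a
        // split only needs to re-compress from `scratch`, not re-encode.
        self.compressor.write_all(&self.scratch[start..])
    }
}
```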

However, what I failed to consider was how much this would increase the component's memory bounds. Since we're now potentially holding the entire uncompressed payload in the scratch buffer, the scratch buffer can grow as large as the uncompressed limit, which, when accounting for having both a sketch-specific and a series-specific request builder, means we could have two persistent allocations totaling up to ~65MiB... which is pretty suboptimal.

We should rethink our approach here, balancing the overhead of having to re-encode a bunch of metrics against the frequency at which payload splitting actually occurs.

@tobz added the area/memory, destination/datadog-metrics, effort/intermediate, and type/enhancement labels on Nov 15, 2024
@tobz
Member Author

tobz commented Nov 15, 2024

One approach is to just hold on to all of the metrics we've encoded so far and do the re-encoding on demand when a split is required. This would be fairly memory-efficient -- we would only need to allocate the Vec<Metric>, and we could reuse it over time -- and we wouldn't generally expect this vector to grow very large: a ballpark figure of ~256 bytes per Metric means we're spending ~256KB for every 1,000 metrics, so at 100,000 metrics, we're at ~25MB. That's still a good ways less than half of our current potential upper bound.
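As a sketch of the bookkeeping, again with hypothetical `Metric`/`encode_metric` stand-ins and flate2 assumed for compression (the naive halving in `split` is just for illustration; real logic would pack against the actual size limits):

```rust
use std::io::{self, Write};

use flate2::{write::ZlibEncoder, Compression};

struct Metric; // hypothetical stand-in

fn encode_metric(_metric: &Metric) -> Vec<u8> {
    b"<encoded metric bytes>".to_vec()
}

struct RequestBuilder {
    // Ballpark ~256 bytes per entry: ~25MB at 100,000 metrics, and the
    // allocation can be reused across requests.
    pending: Vec<Metric>,
    compressor: ZlibEncoder<Vec<u8>>,
}

impl RequestBuilder {
    fn push(&mut self, metric: Metric) -> io::Result<()> {
        self.compressor.write_all(&encode_metric(&metric))?;
        self.pending.push(metric);
        Ok(())
    }

    // On split, discard the oversized compressed payload and rebuild smaller
    // ones by re-encoding everything we've buffered.
    fn split(&mut self) -> io::Result<Vec<Vec<u8>>> {
        let mut payloads = Vec::new();
        let half = self.pending.len() / 2 + 1;
        for chunk in self.pending.chunks(half) {
            let mut compressor = ZlibEncoder::new(Vec::new(), Compression::default());
            for metric in chunk {
                compressor.write_all(&encode_metric(metric))?;
            }
            payloads.push(compressor.finish()?);
        }
        self.pending.clear();
        Ok(payloads)
    }
}
```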

However, the main downsides of this approach are that we re-encode all of the metrics when splitting, and that it's very difficult to put an upper bound on the size of our Vec<Metric> allocation without introducing further logic to force a flush if we run out of space. The second is the bigger problem: if you want to maximize your ability to continue packing metrics into the current compressed payload, you really don't want a low upper bound on the size of the Vec<Metric>, as it would force you to flush too early.

Again, we're still spending a lot of potential memory on tracking those metrics for what is generally a very rare scenario.

@tobz
Copy link
Member Author

tobz commented Nov 15, 2024

Another approach could be to come up with a strategy for checkpointing the compressor output and only tracking the encoded metrics since the last checkpoint, reducing the number of metrics we need to keep around and re-encode in the event of having to split.

Essentially, if we could determine where it's safe to chop the compressed output, we could flush the compressor when we detect we need to split, take the subslice of the compressed output up to the latest "checkpoint", and then re-encode all of the metrics that we know came after that checkpoint and stick them into their own payload.
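Sketching that bookkeeping, with the same hypothetical stand-ins as above and hand-waving the actual boundary detection (which is the open question here):

```rust
use std::io::{self, Write};

use flate2::{write::ZlibEncoder, Compression};

struct Metric; // hypothetical stand-in

fn encode_metric(_metric: &Metric) -> Vec<u8> {
    b"<encoded metric bytes>".to_vec()
}

struct CheckpointingBuilder {
    compressor: ZlibEncoder<Vec<u8>>,
    // Compressed length at the last point we believe it's safe to truncate.
    checkpoint: usize,
    // Only the metrics encoded since that checkpoint: the most we would ever
    // have to re-encode on a split.
    since_checkpoint: Vec<Metric>,
}

impl CheckpointingBuilder {
    fn push(&mut self, metric: Metric) -> io::Result<()> {
        self.compressor.write_all(&encode_metric(&metric))?;
        self.since_checkpoint.push(metric);
        Ok(())
    }

    fn checkpoint(&mut self) -> io::Result<()> {
        // Flushing pushes all in-flight data out of the compressor; the hard
        // part (not solved here) is guaranteeing that the output up to this
        // offset is something the intake can actually decompress if we chop
        // the stream at this point.
        self.compressor.flush()?;
        self.checkpoint = self.compressor.get_ref().len();
        self.since_checkpoint.clear();
        Ok(())
    }

    // On split: ship the compressed bytes up to `checkpoint` as one payload,
    // then re-encode only `since_checkpoint` into a fresh payload.
}
```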
