Fix internal telemetry so that we don't emit quantile summaries. #500
Labels

- destination/prometheus: Prometheus Scrape destination.
- effort/intermediate: Involves changes that can be worked on by non-experts but might require guidance.
- source/internal-metrics: Internal Metrics source.
- type/bug: Bug fixes.
Currently, internal histogram metrics are emitted as distributions, which is the (IMO) right thing to do because we can get accurate quantiles over a high number of samples in a space-efficient way. All great stuff.
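For intuition on why that's space-efficient, here's a toy, illustrative sketch (far simpler than a real DDSketch, and not our actual implementation) of a log-bucketed distribution: it answers quantile queries with bounded relative error while storing only per-bucket counts rather than raw samples.

```rust
use std::collections::BTreeMap;

/// Illustrative only: a toy DDSketch-style structure. Values map to
/// logarithmic buckets, so quantile estimates carry a bounded relative error
/// while storage grows with the number of buckets, not the number of samples.
/// (A real DDSketch also handles zero/negative values, bucket collapsing, etc.)
struct MiniSketch {
    gamma: f64,                 // bucket growth factor, derived from the accuracy target
    counts: BTreeMap<i32, u64>, // bucket index -> sample count
    total: u64,
}

impl MiniSketch {
    fn new(relative_accuracy: f64) -> Self {
        let gamma = (1.0 + relative_accuracy) / (1.0 - relative_accuracy);
        Self { gamma, counts: BTreeMap::new(), total: 0 }
    }

    fn insert(&mut self, value: f64) {
        assert!(value > 0.0, "toy version only handles positive values");
        let index = (value.ln() / self.gamma.ln()).ceil() as i32;
        *self.counts.entry(index).or_insert(0) += 1;
        self.total += 1;
    }

    fn quantile(&self, q: f64) -> Option<f64> {
        if self.total == 0 {
            return None;
        }
        let rank = (q * (self.total - 1) as f64) as u64;
        let mut seen = 0u64;
        for (&index, &count) in &self.counts {
            seen += count;
            if seen > rank {
                // Midpoint of the bucket's value range, which is what bounds the error.
                return Some(2.0 * self.gamma.powi(index) / (1.0 + self.gamma));
            }
        }
        None
    }
}

fn main() {
    let mut sketch = MiniSketch::new(0.01); // ~1% relative error
    for i in 1..=1_000_000u64 {
        sketch.insert(i as f64);
    }
    // A million samples, but storage is bounded by the number of log buckets.
    println!("buckets stored: {}", sketch.counts.len()); // several hundred
    println!("p99 estimate: {:?}", sketch.quantile(0.99)); // within ~1% of 990_000
}
```

A million inserts land in several hundred buckets, and the p99 estimate stays within the configured relative accuracy.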
Our problem, however, is that those internal metrics are then exposed over a Prometheus scrape endpoint to get picked up by the Agent, and we don't currently support doing anything with distributions other than rendering them as an aggregated summary, where we emit specific quantiles. This is bad because those quantiles can't be meaningfully aggregated downstream: they can't be summed, averaged, and so on. It's also bad because the measurements are aggregated together endlessly, so the resulting quantiles end up oversmoothed, and recent changes never get reflected in the output.
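To make the aggregation problem concrete, here's a small, self-contained example (made-up latency values, standard library only) showing that averaging the p99s reported by two separate summaries lands nowhere near the true p99 of the combined data:

```rust
/// Illustrative only: exact p99 over a sorted slice (nearest-rank method).
fn p99(sorted: &[f64]) -> f64 {
    let rank = ((sorted.len() as f64) * 0.99).ceil() as usize;
    sorted[rank - 1]
}

fn main() {
    // Two hypothetical scrape targets with very different latency profiles.
    let a: Vec<f64> = (1..=1000).map(|v| v as f64).collect();        // 1, 2, ..., 1000
    let b: Vec<f64> = (1..=1000).map(|v| (v * 10) as f64).collect(); // 10, 20, ..., 10000

    let mut merged: Vec<f64> = a.iter().chain(b.iter()).copied().collect();
    merged.sort_by(|x, y| x.partial_cmp(y).unwrap());

    // Averaging the per-target p99s is not the p99 of the combined data:
    println!("average of the two p99s = {}", (p99(&a) + p99(&b)) / 2.0); // 5445
    println!("true p99 of merged data = {}", p99(&merged));              // 9800
}
```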
One potential solution would be to emit our internal histogram metrics as actual histograms, and add the necessary support to the Prometheus destination. In doing so, the Agent could use its existing support for generating distributions from Prometheus histograms, which solves both of our problems: the data can be meaningfully aggregated downstream again, and the quantiles are no longer derived from an endlessly-aggregated summary, so recent changes actually show up.
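For reference, a native Prometheus histogram in the text exposition format is just a set of cumulative `_bucket` series keyed by `le`, plus `_sum` and `_count`. A minimal sketch of the rendering the Prometheus destination would need (the metric name and bucket bounds below are made up for illustration):

```rust
/// Rough sketch of rendering a native Prometheus histogram in the text
/// exposition format: cumulative `_bucket` series keyed by `le`, plus `_sum`
/// and `_count`. Metric name and bucket bounds are made up for illustration.
fn render_histogram(name: &str, bounds: &[f64], samples: &[f64]) -> String {
    let mut out = format!("# TYPE {name} histogram\n");
    for &bound in bounds {
        // Buckets are cumulative: each one counts every sample <= its bound.
        let cumulative = samples.iter().filter(|&&s| s <= bound).count();
        out.push_str(&format!("{name}_bucket{{le=\"{bound}\"}} {cumulative}\n"));
    }
    // The +Inf bucket always equals the total sample count.
    out.push_str(&format!("{name}_bucket{{le=\"+Inf\"}} {}\n", samples.len()));
    out.push_str(&format!("{name}_sum {}\n", samples.iter().sum::<f64>()));
    out.push_str(&format!("{name}_count {}\n", samples.len()));
    out
}

fn main() {
    let samples = [0.002, 0.004, 0.012, 0.031, 0.8];
    let bounds = [0.005, 0.01, 0.025, 0.05, 0.1];
    print!("{}", render_histogram("adp_request_duration_seconds", &bounds, &samples));
}
```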
The main downside to this is that histograms carry individual values, so a histogram with 1,024 samples takes roughly 8KB, whereas a DDSketch might take a mere fraction of that: 100-200 bytes, if not less. For some metrics, we're updating them very frequently, so 1,024 samples (picked purely for the sake of example) might actually be on the low end, making the problem even worse at the upper bound.

Using distributions internally is the most efficient, but we don't have a good way to actually translate those into Prometheus histograms. The most direct route would instead be to emit the metrics directly to Datadog rather than scraping them via the Datadog Agent... but we switched away from that approach due to issues with getting metrics to line up with their doppelgangers being emitted by the Core Agent, so we would have to go back and solve that problem to remove the need for the Prometheus destination.
Even then, however, we still expose the Prometheus scrape endpoint for local observability, and things like target metrics collection in SMP runs... so we wouldn't really be able to get rid of it entirely.
As you can see, no silver bullet. :)