Fix internal telemetry so that we don't emit quantile summaries. #500

Open · tobz opened this issue Feb 13, 2025 · 0 comments

Labels: destination/prometheus (Prometheus Scrape destination.) · effort/intermediate (Involves changes that can be worked on by non-experts but might require guidance.) · source/internal-metrics (Internal Metrics source.) · type/bug (Bug fixes.)

tobz commented Feb 13, 2025

Currently, internal histogram metrics are emitted as distributions, which is (IMO) the right thing to do, because we get accurate quantiles over a high number of samples in a space-efficient way. All great stuff.
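
For a sense of why that's space-efficient: a DDSketch-style distribution buckets samples on a logarithmic scale, so memory grows with the number of occupied buckets rather than the number of samples. A toy illustration (not our actual sketch implementation):

```rust
use std::collections::BTreeMap;

/// Toy DDSketch-style structure: positive samples are mapped to logarithmic
/// buckets, so memory is proportional to the occupied value range, not the
/// number of samples observed.
struct ToySketch {
    gamma: f64,                  // bucket growth factor, derived from accuracy
    buckets: BTreeMap<i32, u64>, // bucket index -> sample count
    count: u64,
}

impl ToySketch {
    fn new(relative_accuracy: f64) -> Self {
        let gamma = (1.0 + relative_accuracy) / (1.0 - relative_accuracy);
        Self { gamma, buckets: BTreeMap::new(), count: 0 }
    }

    fn add(&mut self, value: f64) {
        // Bucket index grows logarithmically with the value (positive values only).
        let idx = (value.ln() / self.gamma.ln()).ceil() as i32;
        *self.buckets.entry(idx).or_insert(0) += 1;
        self.count += 1;
    }

    fn quantile(&self, q: f64) -> Option<f64> {
        let rank = (q * self.count.saturating_sub(1) as f64) as u64;
        let mut seen = 0u64;
        for (&idx, &c) in &self.buckets {
            seen += c;
            if seen > rank {
                // Midpoint estimate for the bucket covering this rank.
                return Some(2.0 * self.gamma.powi(idx) / (self.gamma + 1.0));
            }
        }
        None
    }
}

fn main() {
    let mut sketch = ToySketch::new(0.01);
    for i in 1..=100_000u32 {
        sketch.add(f64::from(i));
    }
    // 100,000 samples collapse into a few hundred buckets, while p99 stays
    // within roughly 1% relative error.
    println!("buckets={} p99~{:?}", sketch.buckets.len(), sketch.quantile(0.99));
}
```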

Our problem, however, is that those internal metrics are then exposed over a Prometheus scrape endpoint to get picked up by the Agent, and we don't currently support doing anything with distributions other than rendering them as an aggregated summary, where we emit specific quantiles. This is bad because those quantiles can't be meaningfully summed, averaged, and so on downstream. It's also bad because the measurements are aggregated together endlessly, so the resulting quantiles end up oversmoothed, to the point that recent changes are never reflected in the output.
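
For concreteness, the aggregated summary renders roughly like this in the Prometheus exposition format (metric name hypothetical):

```
# TYPE event_processing_seconds summary
event_processing_seconds{quantile="0.5"} 0.012
event_processing_seconds{quantile="0.95"} 0.064
event_processing_seconds{quantile="0.99"} 0.087
event_processing_seconds_sum 4242.5
event_processing_seconds_count 1262320
```

There's no valid way to combine the `quantile="0.99"` series from two processes into an overall p99, and because those quantiles are computed over every sample since process start, a latency spike from five minutes ago barely moves them.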

One potential solution would be to emit our internal histogram metrics as actual histograms, and add the necessary support to the Prometheus destination. In doing so, the Agent could use its existing support for generating distributions from Prometheus histograms, which solves both of our problems (a rough sketch of the bucket conversion follows the list below):

  • we get native distributions on the backend (so we can sum, average, etc)
  • since the Agent tracks the delta of the histogram buckets, we don't end up with aggregation-induced oversmoothing
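
Here's a rough sketch of the sample-to-bucket conversion (bucket bounds and function names are made up for illustration; the real destination would render these as cumulative `_bucket` series plus `_sum` and `_count`):

```rust
/// Upper bounds for the histogram buckets, plus an implicit +Inf bucket.
/// (Hypothetical bounds; real ones would need tuning per metric.)
const BOUNDS: &[f64] = &[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0];

/// Folds raw samples into cumulative Prometheus-style bucket counts.
/// Returns (cumulative counts per bound plus +Inf, sum, count).
fn to_prometheus_buckets(samples: &[f64]) -> (Vec<u64>, f64, u64) {
    let mut counts = vec![0u64; BOUNDS.len() + 1]; // last slot is +Inf
    let mut sum = 0.0;
    for &s in samples {
        // Find the first bucket whose upper bound contains the sample.
        let idx = BOUNDS.iter().position(|&b| s <= b).unwrap_or(BOUNDS.len());
        counts[idx] += 1;
        sum += s;
    }
    // Prometheus buckets are cumulative: each bucket includes all smaller ones.
    for i in 1..counts.len() {
        counts[i] += counts[i - 1];
    }
    (counts, sum, samples.len() as u64)
}

fn main() {
    let samples = [0.002, 0.004, 0.03, 0.2, 1.5];
    let (buckets, sum, count) = to_prometheus_buckets(&samples);
    println!("buckets={buckets:?} sum={sum} count={count}");
}
```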

The main downside to this is that our internal histograms store individual values, so a histogram holding 1,024 samples takes roughly 8KB (1,024 × 8-byte floats), whereas a DDSketch might take a mere fraction of that: 100-200 bytes, if not less. And since some metrics are updated very frequently, 1,024 samples may actually be a low estimate, making the problem even worse at the upper bound.

Using distributions internally is the most efficient approach, but we don't have a good way to translate those into Prometheus histograms. The most direct route would instead be to emit the metrics directly to Datadog rather than scraping them via the Datadog Agent... but we switched away from that due to issues getting our metrics to line up with their doppelgängers emitted by the Core Agent, so we'd have to go back and solve that problem in order to remove the need for the Prometheus destination.

Even then, however, we still expose the Prometheus scrape endpoint for local observability, and things like target metrics collection in SMP runs... so we wouldn't really be able to get rid of it entirely.

As you can see, no silver bullet. :)
