Fix internal telemetry so that we don't emit quantile summaries. #500
Labels

- destination/prometheus: Prometheus Scrape destination.
- effort/intermediate: Involves changes that can be worked on by non-experts but might require guidance.
- source/internal-metrics: Internal Metrics source.
- type/bug: Bug fixes.
Currently, internal histogram metrics are emitted as distributions, which is the (IMO) right thing to do because we can get accurate quantiles over a high number of samples in a space-efficient way. All great stuff.
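For intuition on why that's space-efficient, here's a toy, illustrative sketch (far simpler than a real DDSketch, and not our actual implementation) of a log-bucketed distribution: it answers quantile queries with bounded relative error while storing only per-bucket counts rather than raw samples.

```rust
use std::collections::BTreeMap;

/// Illustrative only: a toy DDSketch-style structure. Values map to
/// logarithmic buckets, so quantile estimates carry a bounded relative error
/// while storage grows with the number of buckets, not the number of samples.
/// (A real DDSketch also handles zero/negative values, bucket collapsing, etc.)
struct MiniSketch {
    gamma: f64,                 // bucket growth factor, derived from the accuracy target
    counts: BTreeMap<i32, u64>, // bucket index -> sample count
    total: u64,
}

impl MiniSketch {
    fn new(relative_accuracy: f64) -> Self {
        let gamma = (1.0 + relative_accuracy) / (1.0 - relative_accuracy);
        Self { gamma, counts: BTreeMap::new(), total: 0 }
    }

    fn insert(&mut self, value: f64) {
        assert!(value > 0.0, "toy version only handles positive values");
        let index = (value.ln() / self.gamma.ln()).ceil() as i32;
        *self.counts.entry(index).or_insert(0) += 1;
        self.total += 1;
    }

    fn quantile(&self, q: f64) -> Option<f64> {
        if self.total == 0 {
            return None;
        }
        let rank = (q * (self.total - 1) as f64) as u64;
        let mut seen = 0u64;
        for (&index, &count) in &self.counts {
            seen += count;
            if seen > rank {
                // Midpoint of the bucket's value range, which is what bounds the error.
                return Some(2.0 * self.gamma.powi(index) / (1.0 + self.gamma));
            }
        }
        None
    }
}

fn main() {
    let mut sketch = MiniSketch::new(0.01); // ~1% relative error
    for i in 1..=1_000_000u64 {
        sketch.insert(i as f64);
    }
    // A million samples, but storage is bounded by the number of log buckets.
    println!("buckets stored: {}", sketch.counts.len()); // several hundred
    println!("p99 estimate: {:?}", sketch.quantile(0.99)); // within ~1% of 990_000
}
```

A million inserts land in several hundred buckets, and the p99 estimate stays within the configured relative accuracy.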
Our problem, however, is that those internal metrics are then exposed over a Prometheus scrape endpoint to get picked up by the Agent, and we don't currently support doing anything with distributions other than rendering them as an aggregated summary, where we emit specific quantiles. This is bad because those quantiles can't be meaningfully aggregated downstream: they can't be summed, averaged, and so on. It's also bad because the measurements are aggregated together endlessly, so the resulting quantiles end up oversmoothed, and recent changes never get reflected in the output.
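To make the aggregation problem concrete, here's a small, self-contained example (made-up latency values, standard library only) showing that averaging the p99s reported by two separate summaries lands nowhere near the true p99 of the combined data:

```rust
/// Illustrative only: exact p99 over a sorted slice (nearest-rank method).
fn p99(sorted: &[f64]) -> f64 {
    let rank = ((sorted.len() as f64) * 0.99).ceil() as usize;
    sorted[rank - 1]
}

fn main() {
    // Two hypothetical scrape targets with very different latency profiles.
    let a: Vec<f64> = (1..=1000).map(|v| v as f64).collect();        // 1, 2, ..., 1000
    let b: Vec<f64> = (1..=1000).map(|v| (v * 10) as f64).collect(); // 10, 20, ..., 10000

    let mut merged: Vec<f64> = a.iter().chain(b.iter()).copied().collect();
    merged.sort_by(|x, y| x.partial_cmp(y).unwrap());

    // Averaging the per-target p99s is not the p99 of the combined data:
    println!("average of the two p99s = {}", (p99(&a) + p99(&b)) / 2.0); // 5445
    println!("true p99 of merged data = {}", p99(&merged));              // 9800
}
```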
One potential solution would be to emit our internal histogram metrics as actual histograms, and add the necessary support to the Prometheus destination. In doing so, the Agent could use its existing support for generating distributions from Prometheus histograms, which solves both of our problems: the data can be meaningfully aggregated downstream again, and the quantiles are no longer derived from an endlessly-aggregated summary, so recent changes actually show up.
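For reference, a native Prometheus histogram in the text exposition format is just a set of cumulative `_bucket` series keyed by `le`, plus `_sum` and `_count`. A minimal sketch of the rendering the Prometheus destination would need (the metric name and bucket bounds below are made up for illustration):

```rust
/// Rough sketch of rendering a native Prometheus histogram in the text
/// exposition format: cumulative `_bucket` series keyed by `le`, plus `_sum`
/// and `_count`. Metric name and bucket bounds are made up for illustration.
fn render_histogram(name: &str, bounds: &[f64], samples: &[f64]) -> String {
    let mut out = format!("# TYPE {name} histogram\n");
    for &bound in bounds {
        // Buckets are cumulative: each one counts every sample <= its bound.
        let cumulative = samples.iter().filter(|&&s| s <= bound).count();
        out.push_str(&format!("{name}_bucket{{le=\"{bound}\"}} {cumulative}\n"));
    }
    // The +Inf bucket always equals the total sample count.
    out.push_str(&format!("{name}_bucket{{le=\"+Inf\"}} {}\n", samples.len()));
    out.push_str(&format!("{name}_sum {}\n", samples.iter().sum::<f64>()));
    out.push_str(&format!("{name}_count {}\n", samples.len()));
    out
}

fn main() {
    let samples = [0.002, 0.004, 0.012, 0.031, 0.8];
    let bounds = [0.005, 0.01, 0.025, 0.05, 0.1];
    print!("{}", render_histogram("adp_request_duration_seconds", &bounds, &samples));
}
```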
The main downside to this is that histograms carry individual values, so a histogram with 1,024 samples takes roughly 8KB, whereas a DDSketch might take a mere fraction of that: 100-200 bytes, if not less. For some metrics, we're updating them very frequently, so 1,024 samples (picked purely for the sake of example) might actually be on the low end, making the problem even worse at the upper bound.

Using distributions internally is the most efficient, but we don't have a good way to actually translate those into Prometheus histograms. The most direct route would instead be to emit the metrics directly to Datadog rather than scraping them via the Datadog Agent... but we switched away from that approach due to issues with getting metrics to line up with their doppelgangers being emitted by the Core Agent, so we would have to go back and solve that problem to remove the need for the Prometheus destination.
Even then, however, we still expose the Prometheus scrape endpoint for local observability, and things like target metrics collection in SMP runs... so we wouldn't really be able to get rid of it entirely.
As you can see, no silver bullet. :)