Metrics for memory zones #1198

sharnoff · 2025-01-08T17:07:43Z

Follow-up to INC-361

Problem description / Motivation

Sometimes we get OOM-kills on our VMs, even though they have vm.overcommit_memory=2, because ZONE_NORMAL is exhausted and kernel allocations fail.

When this kind of thing happens, it's often hard to validate, and we don't have any way to check how close we are to running out.

We should expose memory usage per zone (ZONE_NORMAL, ZONE_MOVABLE, etc.) from each VM.

IMO this is unlikely to be included by any standard metrics exporter, so maybe we collect this from neonvm-daemon?

mickael-carl · 2025-01-08T17:17:44Z

Just noting here that node_exporter has supported that metric since 2021. I'd also like to suggest switching away from vector for host metrics for this reason 🙂

sharnoff added the c/autoscaling/neonvm Component: autoscaling: NeonVM label Jan 8, 2025