
Add a tracer for Linux capabilities #1692

Draft
wants to merge 1 commit into base: main
Conversation

@ptodev (Contributor) commented Feb 21, 2025

This is an attempt to get Beyla to track which Linux capabilities are used by processes.

Fixes #1485

Technically, the issue above asks Beyla to output a log, whereas this PR adds a metric for capabilities. A metric might be more useful since it can be used for visualisations and alerts. However, I'm not sure how to handle cardinality issues with labels such as PIDs.

Comment on lines +210 to +211
&ebpfcommon.IdentityPidsFilter{},
// p.pidsFilter,
Contributor Author
If I use p.pidsFilter then the ring buffer filter rejects all the spans because the namespace in the PID struct is empty. What should this namespace be set to? I couldn't find an example in the code.
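In case it helps the discussion, this is what I understand the namespace to normally mean on Linux (a minimal sketch of the underlying concept, not Beyla code): the PID namespace of a process is identified by the inode in the target of the /proc/<pid>/ns/pid symlink, so two processes share a PID namespace exactly when that value matches.

package main

import (
	"fmt"
	"os"
)

// pidNamespace returns the PID-namespace identifier of a process,
// e.g. "pid:[4026531836]". Two processes are in the same PID namespace
// exactly when this value is equal for both of them.
func pidNamespace(pid int32) (string, error) {
	return os.Readlink(fmt.Sprintf("/proc/%d/ns/pid", pid))
}

func main() {
	ns, err := pidNamespace(int32(os.Getpid()))
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading pid namespace:", err)
		os.Exit(1)
	}
	fmt.Println("self pid namespace:", ns)
}

How Beyla actually populates the namespace field of its PID struct may differ, which is the part I couldn't find an example of.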

bpf_dbg_printk("=== capable (1) the pid %d doesnt match ===", id);
return 0;
}
bpf_dbg_printk("=== capable (1) the pid %d matches ===", id);
Contributor Author

I configured Beyla to only monitor Alloy, and yet for some reason I'm seeing capability metrics for all sorts of PIDs. I'm not sure why that is. Also, none of the PID labels on the metrics match the Alloy PID...
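One thing I want to rule out (purely a guess on my part) is a host-PID versus namespaced-PID mismatch: the kprobe reports the kernel's view of the PID, which can differ from the PID seen inside a container's PID namespace. A quick, Beyla-independent way to compare the two for a given process is the NSpid line in /proc/<pid>/status; the sketch below is illustrative only.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// nsPIDs returns the "NSpid:" values for a process: the first entry is the
// host-view PID, later entries are the PID in each nested PID namespace.
func nsPIDs(pid int) ([]string, error) {
	f, err := os.Open(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return nil, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "NSpid:") {
			return strings.Fields(strings.TrimPrefix(sc.Text(), "NSpid:")), nil
		}
	}
	return nil, sc.Err()
}

func main() {
	pids, err := nsPIDs(os.Getpid())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("NSpid:", pids) // e.g. [12345] on the host, [12345 17] inside a container
}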

Comment on lines +114 to +118
	CapabilityRequests = Name{
		Section: "capability.requests",
		Prom:    "capability_requests_total",
		OTEL:    "capability.requests",
	}
Contributor Author

I'm not sure how to handle the explosion in cardinality with this metric: there could be lots of PIDs. Maybe we just don't include the PID as a label, and instead emit logs for the higher-cardinality information?
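To make the low-cardinality option concrete, this is roughly the shape I have in mind, sketched directly with client_golang rather than Beyla's own metrics plumbing (the package and label names are placeholders): keep the capability and service name as labels and drop the PID, so the number of series is bounded by services times capabilities.

package capmetrics

import "github.com/prometheus/client_golang/prometheus"

// capabilityRequests counts capability checks, keyed only by bounded labels.
// Omitting the PID keeps cardinality at (#services x #capabilities) instead
// of growing with every process that has ever run.
var capabilityRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "capability_requests_total",
		Help: "Number of Linux capability checks observed, per service and capability.",
	},
	[]string{"service_name", "capability"},
)

func init() {
	prometheus.MustRegister(capabilityRequests)
}

// RecordCapability would be called once per event read from the ring buffer;
// the higher-cardinality details (PID, stack, ...) could go to a log instead.
func RecordCapability(serviceName, capability string) {
	capabilityRequests.WithLabelValues(serviceName, capability).Inc()
}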

}
}

func (p *Tracer) AllowPID(pid, ns uint32, svc *svc.Attrs) {
Contributor Author

It takes a bit of time for this filtering to kick in. Therefore, I think Beyla might miss kprobes triggering right when an Alloy process starts up. Not sure what the best way to handle this is.

@ptodev (Contributor Author) commented Feb 21, 2025

I've been testing this by running Beyla locally with these kinds of settings:

export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"
export OTEL_EXPORTER_OTLP_ENDPOINT="https://otlp-gateway-prod-gb-south-0.grafana.net/otlp"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Basic=="

export BEYLA_OPEN_PORT=12345
export BEYLA_LOG_LEVEL="DEBUG"
export BEYLA_TRACE_PRINTER="text"
export BEYLA_BPF_DEBUG="TRUE"
export BEYLA_SERVICE_NAME="alloy"
export BEYLA_CONFIG_PATH="/home/ubuntu/Desktop/beyla.yml"
export BEYLA_OTEL_METRICS_FEATURES="application,application_process,application_span"
export BEYLA_OTEL_METRICS_INSTRUMENTATIONS="*"

/home/ubuntu/github/beyla/beyla

beyla.yml:

discovery:
  default_exclude_services:

I can indeed see some metrics, but the PIDs are not for the Alloy instance that I want to monitor:
[Screenshot, 2025-02-21 18:20:57: capability metrics whose PID labels do not belong to the monitored Alloy instance]


func (p *Tracer) KProbes() map[string]ebpfcommon.ProbeDesc {
	kprobes := map[string]ebpfcommon.ProbeDesc{
		"capable": {
Contributor

I have one general comment: I think this is a good proposal for debugging, but adding this kprobe will have very high overhead, since capabilities are checked a lot in the kernel, pretty much on any system call.

I see a couple of ideas here:

  1. This should be made optional, like a capabilities debug configuration.
  2. We don't necessarily need to produce metrics; I think it would be sufficient to print on standard output which capability is being asked for, and maybe a goroutine stack trace dump.

I think it's important that we switch the concept from something we run in production to something that can be used for debugging.
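To make idea 2 a bit more concrete, something along these lines would be enough (a rough sketch, not code from this PR; the event handling and the partial capability-name table are placeholders):

package main

import (
	"fmt"
	"os"
	"runtime"
)

// capabilityNames maps a few capability numbers to names; a real version
// would cover the full list from linux/capability.h.
var capabilityNames = map[int]string{
	12: "CAP_NET_ADMIN",
	19: "CAP_SYS_PTRACE",
	21: "CAP_SYS_ADMIN",
}

// debugCapabilityEvent prints which capability was checked for which PID and,
// when the PID is our own process, dumps all goroutine stacks so we can see
// what the process was doing at the time.
func debugCapabilityEvent(pid, capability int) {
	name, ok := capabilityNames[capability]
	if !ok {
		name = fmt.Sprintf("capability(%d)", capability)
	}
	fmt.Printf("capability check: pid=%d cap=%s\n", pid, name)

	if pid == os.Getpid() {
		buf := make([]byte, 1<<20)
		n := runtime.Stack(buf, true) // true = include all goroutines
		os.Stdout.Write(buf[:n])
	}
}

func main() {
	// Simulate one event for our own process.
	debugCapabilityEvent(os.Getpid(), 21)
}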

Contributor

Apart from what @grcevski said, even if optional, we may want to profile the kind of overhead this will have - I can help you with that @ptodev. We should probably investigate this vs a tracepoint as well (as it may be a bit faster, which will matter in this particular case).

Contributor Author

I completely agree that it should be optional!

However, I don't see why this should be considered just a debugging feature. There are users who may want to gather security-related telemetry. Treating this sort of information as a first-class citizen in Beyla would mean users can put the telemetry in Grafana Cloud and set up dashboards and alerts for it. In the future, other security-related features could also be implemented.

Contributor

That makes sense. Ultimately, I think we will need to measure/profile this and make an educated decision moving forward. In my experience with previous instrumentation of high-traffic codepaths (think syscalls like read() or write()), this can very easily overflow the event buffer and raise CPU usage significantly - so I guess this is our main concern here. Yet, there's only one way to validate this: profiling - so perhaps we can set up a few scenarios and measure and see how it goes.
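One of those scenarios could be as simple as a syscall-heavy load generator run twice, once without Beyla and once with the capable probe attached, comparing throughput and Beyla's own CPU usage. A minimal sketch is below; it assumes the chosen workload actually triggers capability checks, otherwise the loop body would need to be swapped for a privileged operation.

package main

import (
	"fmt"
	"os"
	"time"
)

// A crude load generator: issue a large number of cheap syscalls and report
// throughput. Run it once without Beyla and once with the capable probe
// attached, and compare the two numbers (and Beyla's CPU usage).
func main() {
	f, err := os.OpenFile("/dev/null", os.O_WRONLY, 0)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	const iterations = 5_000_000
	buf := []byte("x")

	start := time.Now()
	for i := 0; i < iterations; i++ {
		if _, err := f.Write(buf); err != nil { // one write(2) syscall per iteration
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
	elapsed := time.Since(start)
	fmt.Printf("%d syscalls in %v (%.0f syscalls/s)\n",
		iterations, elapsed, float64(iterations)/elapsed.Seconds())
}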

Contributor

Ah, I see. I hadn't thought of this as something beyond Beyla internals, but what you say makes sense, @ptodev.

Contributor Author

Thank you for the feedback! Would it be ok if we proceed like this:

  1. I can change this PR to only print a debug log on Beyla's stdout whenever a process requests a capability. We won't filter the processes, because if we do we might not log capabilities being requested right when the process is starting.
    • What would be the best way to profile the code?
    • Is there an example somewhere on how to set up tracepoints? I don't see a tracepoint option in the bpfPrograms struct. (A mechanical example is sketched right after this list.)
  2. I could open a subsequent PR to allow for the creation of OTel metrics based on capabilities. The metrics could include the process name, PID and the name of the capability that was requested. Since those are OTel metrics, high cardinality issues due to including the PID can probably be ignored. If we need to also produce Prometheus metrics we could omit the PID. (A rough sketch of such a metric is at the end of this comment.)
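On the tracepoint sub-question: I don't know yet whether a suitable tracepoint exists for capability checks, but mechanically the attachment with cilium/ebpf (which Beyla builds on) looks like the sketch below. The object file path, program name and the syscalls/sys_enter_openat tracepoint are placeholders, not a proposal.

package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

// attachTracepoint shows the mechanics of attaching an already-loaded eBPF
// program to a tracepoint with cilium/ebpf. Whether a suitable tracepoint
// exists for capability checks is exactly the open question here.
func attachTracepoint(prog *ebpf.Program) (link.Link, error) {
	return link.Tracepoint("syscalls", "sys_enter_openat", prog, nil)
}

func main() {
	// In Beyla, the program would come from the generated bpf objects;
	// loading a collection from an object file is shown only for completeness.
	spec, err := ebpf.LoadCollectionSpec("bpf_x86_bpfel.o") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	coll, err := ebpf.NewCollection(spec)
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	tp, err := attachTracepoint(coll.Programs["my_tracepoint_prog"]) // placeholder name
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Close()
}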

I'll set up a meeting with @rafaelroquetto for later this week to discuss profiling and tracepoints. And I'll update the PR afterwards.
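For item 2, this is a rough sketch of the metric shape I'm thinking of, written against the otel-go metric API directly rather than Beyla's exporter pipeline (the meter name and attribute keys are placeholders):

package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// A bare SDK meter provider; Beyla would wire in its own OTLP exporter.
	provider := sdkmetric.NewMeterProvider()
	defer provider.Shutdown(ctx)

	meter := provider.Meter("beyla.capabilities")
	counter, err := meter.Int64Counter("capability.requests",
		metric.WithDescription("Linux capability checks observed per process"))
	if err != nil {
		log.Fatal(err)
	}

	// One increment per capability event. The PID stays as an attribute here
	// because OTel backends tolerate higher cardinality than Prometheus,
	// but it could be dropped for a Prometheus export path.
	counter.Add(ctx, 1,
		metric.WithAttributes(
			attribute.String("process.command", "alloy"),
			attribute.Int("process.pid", 12345),
			attribute.String("capability", "CAP_NET_ADMIN"),
		))
}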

Successfully merging this pull request may close these issues:

Output a log when an application uses a Linux capability