Add a tracer for Linux capabilities #1692
&ebpfcommon.IdentityPidsFilter{},
// p.pidsFilter,
If I use p.pidsFilter then the ring buffer filter rejects all the spans because the namespace in the PID struct is empty. What should this namespace be set to? I couldn't find an example in the code.
    bpf_dbg_printk("=== capable (1) the pid %d doesnt match ===", id);
    return 0;
}
bpf_dbg_printk("=== capable (1) the pid %d matches ===", id);
I configured Beyla to only monitor Alloy, and yet for some reason I'm seeing capability metrics for all sorts of PIDs. I'm not sure why that's the case? Also, none of the PID labels on the metrics match the Alloy PID...
CapabilityRequests = Name{
	Section: "capability.requests",
	Prom:    "capability_requests_total",
	OTEL:    "capability.requests",
}
I'm not sure how to handle the explosion in cardinality with this metric. There could be lots of PIDs. Maybe we just don't include PID as a label? And instead we could have logs for the more high cardinality information?
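One way to picture the "drop the PID label" option is a counter keyed only by process name and capability. The following is a minimal sketch, not code from this PR: the `capEvent` and `counterKey` types and the `aggregate` function are hypothetical names used for illustration, and the event fields are assumptions about what the tracer would deliver to userspace.

```go
package main

import (
	"fmt"
	"sort"
)

// capEvent is a hypothetical userspace view of one capability check.
type capEvent struct {
	PID        uint32
	Comm       string // process name
	Capability string // e.g. "CAP_NET_ADMIN"
}

// counterKey is the metric label set. It deliberately omits the PID,
// so the number of series is bounded by (process names x capabilities).
type counterKey struct {
	Comm       string
	Capability string
}

// aggregate counts events per label set, ignoring the PID.
func aggregate(events []capEvent) map[counterKey]uint64 {
	counts := make(map[counterKey]uint64)
	for _, ev := range events {
		counts[counterKey{Comm: ev.Comm, Capability: ev.Capability}]++
	}
	return counts
}

func main() {
	events := []capEvent{
		{PID: 101, Comm: "alloy", Capability: "CAP_NET_ADMIN"},
		{PID: 102, Comm: "alloy", Capability: "CAP_NET_ADMIN"},
		{PID: 200, Comm: "sshd", Capability: "CAP_SETUID"},
	}
	counts := aggregate(events)

	// Sort the keys so the output order is stable.
	keys := make([]counterKey, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].Comm != keys[j].Comm {
			return keys[i].Comm < keys[j].Comm
		}
		return keys[i].Capability < keys[j].Capability
	})
	for _, k := range keys {
		fmt.Printf("comm=%s capability=%s count=%d\n", k.Comm, k.Capability, counts[k])
	}
}
```

With this shape, two Alloy processes with different PIDs collapse into one series, and the per-PID detail would have to come from logs instead.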
	}
}

func (p *Tracer) AllowPID(pid, ns uint32, svc *svc.Attrs) {
It takes a bit of time for this filtering to kick in. Therefore, I think Beyla might miss kprobes triggering right when an Alloy process starts up. Not sure what the best way to handle this is.

func (p *Tracer) KProbes() map[string]ebpfcommon.ProbeDesc {
	kprobes := map[string]ebpfcommon.ProbeDesc{
		"capable": {
I have one general comment: I think this is a good proposal for debugging, but adding this kprobe will add very high overhead, since capabilities are checked a lot in the kernel, on pretty much any system call.
I see a couple of ideas here:
- This should be made optional, like a capabilities debug configuration.
- We don't necessarily need to produce metrics. I think it would be sufficient to print to standard output which capability is being requested, and maybe a goroutine stack trace dump.
I think it's important that we switch the concept from something we run in production to something that can be used for debugging.
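The "make it optional" idea could look roughly like the sketch below, where the `capable` kprobe is only registered when a configuration flag is set. Everything here is an assumption for illustration: `probeDesc` is a local stand-in for `ebpfcommon.ProbeDesc`, and `tracerConfig`, `CapabilitiesEnabled`, and the lowercase `tracer` type are hypothetical names, not existing Beyla options.

```go
package main

import "fmt"

// probeDesc is a local stand-in for Beyla's ebpfcommon.ProbeDesc.
// The Required field is an assumption made for this sketch.
type probeDesc struct {
	Required bool
}

// tracerConfig is hypothetical; the flag name is not an existing option.
type tracerConfig struct {
	CapabilitiesEnabled bool
}

type tracer struct {
	cfg tracerConfig
}

// kprobes returns no probes unless the feature is explicitly enabled,
// so the hot capable() path is left untouched by default.
func (tr *tracer) kprobes() map[string]probeDesc {
	if !tr.cfg.CapabilitiesEnabled {
		return nil
	}
	return map[string]probeDesc{
		"capable": {Required: true},
	}
}

func main() {
	for _, enabled := range []bool{false, true} {
		tr := &tracer{cfg: tracerConfig{CapabilitiesEnabled: enabled}}
		fmt.Printf("enabled=%t probes=%d\n", enabled, len(tr.kprobes()))
	}
}
```

Defaulting the flag to off means users who never ask for capability telemetry pay no kprobe overhead.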
I completely agree that it should be optional!
However, I don't see why this should be considered just a feature for debugging. There are users who may want to gather security-related telemetry. Treating this sort of information as a first-class citizen in Beyla would mean users can put the telemetry in Grafana Cloud and set up dashboards and alerts for it. In the future, other security-related features could also be implemented.
That makes sense. Ultimately, I think we will need to measure/profile this and make an educated decision moving forward. In my experience with previous instrumentation of high-traffic codepaths (think syscalls like read() or write()), this can very easily overflow the event buffer and raise CPU usage significantly - so I guess this is our main concern here. Yet, there's only one way to validate this: profiling - so perhaps we can set up a few scenarios and measure and see how it goes.
Ah I see, well that makes sense @ptodev, I didn't think of this as something beyond Beyla internals, but what you say makes sense.
Thank you for the feedback! Would it be ok if we proceed like this:
- I can change this PR to only print a debug log in the Beyla stdout whenever a process requests a capability. We won't filter the processes, because if we do we might not log capabilities being requested right when the process is starting.
- What would be the best way to profile the code?
- Is there an example somewhere on how to set up tracepoints? I don't see a tracepoint option in the bpfPrograms struct.
- I could open a subsequent PR to allow for the creation of OTel metrics based on capabilities. The metrics could include the process name, PID and the name of the capability that was requested. Since those are OTel metrics, high cardinality issues due to including the PID can probably be ignored. If we need to also produce Prometheus metrics we could omit the PID.
I'll set up a meeting with @rafaelroquetto for later this week to discuss profiling and tracepoints. And I'll update the PR afterwards.
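For the debug-log step, the log line needs a readable capability name rather than the raw number passed to `capable`. Below is a minimal sketch of that translation, assuming the event delivered to userspace carries the PID, the process name, and the numeric capability. The `capEvent` type, `capName`, and `formatEvent` are hypothetical names; the table follows the numbering in the kernel's `include/uapi/linux/capability.h`.

```go
package main

import "fmt"

// capNames maps capability numbers to names, in the order defined by
// include/uapi/linux/capability.h (CAP_CHOWN = 0 ... CAP_CHECKPOINT_RESTORE = 40).
var capNames = []string{
	"CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
	"CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
	"CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
	"CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
	"CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
	"CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
	"CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
	"CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
	"CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
	"CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
	"CAP_CHECKPOINT_RESTORE",
}

// capName returns the symbolic name for a capability number, or a
// numeric fallback for values this table does not cover.
func capName(c int32) string {
	if c < 0 || int(c) >= len(capNames) {
		return fmt.Sprintf("CAP_UNKNOWN(%d)", c)
	}
	return capNames[c]
}

// capEvent is a hypothetical event shape, not the struct used in this PR.
type capEvent struct {
	PID  uint32
	Comm string
	Cap  int32
}

// formatEvent builds the debug log line for one capability request.
func formatEvent(ev capEvent) string {
	return fmt.Sprintf("capability requested pid=%d comm=%s cap=%s",
		ev.PID, ev.Comm, capName(ev.Cap))
}

func main() {
	fmt.Println(formatEvent(capEvent{PID: 4242, Comm: "alloy", Cap: 12}))
}
```

The numeric fallback keeps the log useful on newer kernels that define capabilities beyond the ones listed here.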
This is an attempt to get Beyla to track which Linux capabilities are used by processes.
Fixes #1485
Technically, the issue above asks Beyla to output a log, whereas this PR adds a metric for capabilities. A metric might be more useful since it can be used for visualisations and alerts. However, I'm not sure how to handle cardinality issues with labels such as PIDs.