-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hostmetrics: use WMI to fetch ppid #35337
base: main
Are you sure you want to change the base?
Conversation
Scraping metrics on Windows largely has one truly expensive operation: getting the Parent Process ID. gopsutil does this with the [CreateToolhelp32Snapshot](https://learn.microsoft.com/en-us/windows/win32/api/tlhelp32/nf-tlhelp32-createtoolhelp32snapshot) Win32 API call. This takes a snapshot of a large amount of process details and ends up being very expensive. This is causing CPU performance issues for Windows users. This PR does the following in an attempt to address this perforamnce issue: * Repurpose the handlecount package to be for querying any WMI process data and add ParentProcessID to that querying process * Add the featureflag `hostmetrics.process.wmiParentProcessID` which will enable using the WMI query for parent process ID fetching * Whether the featureflag is on or off, this PR also adds a guard around fetching the parent process ID that will only check it if the resource attribute for ppid is enabled. This means users in very resource constrained environments can opt out of that fetching either way
I finally had some time to rework this PR with new changes since the old one was closed when I let it go stale. Hopefully this time we can get it pushed through. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Not stale |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Not stale |
receiver/hostmetricsreceiver/internal/scraper/processscraper/process_scraper.go
Outdated
Show resolved
Hide resolved
func (wmiProcessQueryer) wmiProcessQuery() (map[int64]*wmiProcInfo, error) { | ||
processes := []Win32_Process{} | ||
// Based on reflection of Win32_Process type, this creates the following query: | ||
// `get-wmiobject -query "select ProcessId, ParentProcessId, HandleCount from Win32_Process"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method retrieves both parent process ID and process handle count on every run, regardless of whether both are needed.
Would it make sense to call different commands depending on which information is needed? Would it help in performance to for example only retrieve parent process ID and not retrieve handle count, assuming the user didn't enable the process.handles
metric?
The default configuration for the process scraper is to require parent process ID (as the process.parent_pid
resource attribute is enabled by default) and not require handle count (as the process.handles
metric is disabled by default).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accomplished via manual query construction based on the passed in config in acb81f8
receiver/hostmetricsreceiver/internal/scraper/processscraper/documentation.md
Show resolved
Hide resolved
processes := []Win32_Process{} | ||
// Based on reflection of Win32_Process type, this creates the following query: | ||
// `get-wmiobject -query "select ProcessId, ParentProcessId, HandleCount from Win32_Process"` | ||
q := wmi.CreateQuery(&processes, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding to @andrzej-stencel point above: wmi.CreateQuery
uses reflection to create a textual representation for the query and that won't change after startup, even, without optimizing to collect only the configured information there is no need to create the query on every "scrape".
Looking at github.com/yusufpapurcu/wmi
it seems that you can keep the same type Win32_Process
with all the fields that you may want depending on the config, even if those fields are not included in the textual query. However, for that to not report an error we need to proper set the WMI client configuration, see https://github.com/yusufpapurcu/wmi/blob/master/wmi.go#L108-L113 - It will be interesting evaluating the cost of that, i.e.: data structure with fields not present in the query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in acb81f8
I don't directly expect much impact to allowing missing fields based on a read of the code; looks like the only codepath difference when allowing missing fields is for a particular error to be ignored here: https://github.com/yusufpapurcu/wmi/blob/master/wmi.go#L354
I will try and do some due dilligence on the effect of including/not include certain struct fields one way or another. When I originally tested the difference between always fetching both fields vs only fetching one, I did not notice a substantial difference in WMI's performance. However this test was done a while ago, and it would be a good idea to do it again.
Thanks for the reviews @pjanotti and @andrzej-stencel, I'm acknowledging that I've seen them but can't address yet. Right after I asked for that review at the Collector SIG I had to be out for a bit so I wasn't able to get to it. Thanks for getting to it promptly even though I ended up being unable to answer. Will respond next week! |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
@braydonk let's know if you would like some help to move this forward, it is a good issue to improve. |
On Oct 28, 2024 @pjanotti wrote:
For this initial revival of the PR I stuck with the spirit of my original approach, this was my thought process. I agree that it is odd at a surface level to seemingly arbitrarily relate certain metrics and the retrieval method. The reason I think it's a good idea is because the retrieval method has an unfortunate underlying nuance, that being the dependence on a separate service. I think it's very rare for WMI to be completely inaccessible on a given machine, but a user may still want to disallow the collector to use the service. That's why I made it so the flow is something like this:
This all hinges on the premise that the user should be allowed to unilaterally allow/disallow the collector to use WMI. If this isn't actually as important as I'm making it out to be, we can reconsider. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already a performance improvement over the current code. However, looking a bit at how the process data is retrieved on Windows it becomes clear that the optimization instead could be done at the time that the PIDs are captured handles, err := s.getProcessHandles(ctx)
. Internally, gopsutils
, uses EnumProcesses
to get the list of PIDs, then one by one we execute code to get other information about each process. So it calls CreateToolhelp32Snapshot
for each process to get the parent PID, instead it should call CreateToolhelp32Snapshot
instead of EnumProcesses
when obtaining the initial list of PIDs. Besides performance the current code has issues regarding correctness and availability of information (processes can terminate between the calls and PIDs can be reused). However, this is a more involved optimization than the current PR since we may need to call Win32 APIs directly instead of relying on gopsutils
.
Given the current implementation I wonder if using WMI to get all properties would actually perform better. The marshalling cost of all requested properties, for a high enough number of processes, can make WMI more expensive than the Win32 API, however, given the current implementation I suspect WMI can perform better - anyway, needs to be measured.
receiver/hostmetricsreceiver/internal/scraper/processscraper/documentation.md
Show resolved
Hide resolved
component: hostmetricsreceiver | ||
|
||
# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`). | ||
note: Use Windows Management Interface to fetch process.ppid by default. Add `disable_wmi` config option to fallback to old behaviour. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The remaining of the PR references wmi_enabled
instead of disable_wmi
|
||
#### High CPU Usage On Windows | ||
|
||
Getting the Parent Process ID of all processes on Windows is a very expensive operation. There are two options to combat this: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still true after this change? Could you please collect the latest image capture of perfmon with the latest code?
On my box:
$ Measure-Command { Get-WmiObject Win32_Process | Select-Object ProcessId, ParentProcessId, HandleCount }
Days : 0
Hours : 0
Minutes : 0
Seconds : 0
Milliseconds : 796
Ticks : 7969745
TotalDays : 9.22424189814815E-06
TotalHours : 0.000221381805555556
TotalMinutes : 0.0132829083333333
TotalSeconds : 0.7969745
TotalMilliseconds : 796.9745
So I rough approximation it will be a bit less than 2 milliseconds per process (I had 437 processes on my box). I will guess that not being on Powershell it should a bit cheaper.
} | ||
|
||
if !queryOptSet { | ||
return "", errors.New("no query options supplied") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather check the length of queryOpts at the top before anything else instead of tracking via a bool
In the context of the WMI manager it makes sense to require some option, not so much at the query itself since selecting only ProcessId
seems reasonable.
|
||
import "errors" | ||
|
||
var ErrPlatformSupport = errors.New("WMI process info collection is only supported on Windows") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides the suggestion below, does it need to be public?
var ErrPlatformSupport = errors.New("WMI process info collection is only supported on Windows") | |
var ErrPlatformNotSupported = errors.New("WMI process info collection is only supported on Windows") |
var ErrHandlesPlatformSupport = errors.New("process handle collection is only supported on Windows") | ||
|
||
func NewManager() Manager { | ||
func NewManager(queryOpts ...QueryOption) Manager { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first I think it would be better to add an error to the return signature and return it here.
This should also be private.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will also allow to check for empty queryOpts right here which avoids the need to the return error on the function creating the query string. Logically the fact that a query for PIDs doesn't make sense is closer to the way the manager is used, not to the query itself. In fact, the scraper first query for PIDs.
@@ -210,7 +227,7 @@ func (s *processScraper) getProcessMetadata(ctx context.Context) ([]*processMeta | |||
|
|||
var errs scrapererror.ScrapeErrors | |||
|
|||
if err := s.refreshHandleCounts(); err != nil { | |||
if err := s.refreshWMIProcInfo(); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not to be addressed in this PR, but, at this point I wonder if it won't be better to use WMI for everything if we are getting this info here. The issue is that the underlying library gets a list of PIDs and then open/close handles to the process to get the remaining info. It also adds a bit of noise since the process can terminate in between the calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possibility is if the PID is recycled fast enough we could get the information for the wrong process. Note this is already a possibility with current code, even without the WMI query, it is not something being introduced here .
func (m *wmiProcInfoManager) GetProcessPpid(pid int64) (int64, error) { | ||
procInfo, err := m.getProcessInfo(pid) | ||
if err != nil { | ||
return 0, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PID 0
is always the System Idle Process
. Per Raymond Chen DWORD(-1), i.e.: (0xFFFFFFF
) is a better alternative - that said here we are treating the PID as an int64
, perhaps something to review separately.
var ErrPlatformSupport = errors.New("WMI process info collection is only supported on Windows") | ||
|
||
// Manager is the public interface for a WMI Process Info manager. | ||
type Manager interface { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why it needs to be public? Similar Q to some other types here.
ErrNoProcesses = errors.New("no process info is currently registered") | ||
ErrProcessNotFound = errors.New("process info not found") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do they need to be public?
Given these nuances and my comment about the current code. Unless there is some property that we think is not available in WMI we should consider an option like |
Thanks for the review @pjanotti! First, updated performance data. Here's a perfmon screenshot running two collectors next to each other. (It's actually the same binary, but I made two copies with different names for screenshot demonstration) WMI Off: WMI On: Important note for comparing the two profiles is that the profile for WMI On is 450ms compared to the WMI Off profile at 2.5s.
I think I'm missing something here; is it possible to call
This is an issue I've known about for some time but been unsure how to tackle without rearchitecting the scraper. The fact that the list of processes is gathered and then information is fetched individually for each makes it so this problem will always exist on any platform. Much of the scraper was written before my time as a CODEOWNER, but staying relatively the same cross-platform by leaning on I think trying to make WMI fetch all information on Windows would make it so the scraper looks completely different and would diminish that benefit of ease of maintenance (arguably the usage of WMI as it exists and in this PR is already a blow to maintainability, but is necessitated by unacceptable performance on Windows in the current implementation). Perhaps this is worth addressing at a different angle, such as an entirely new Windows specific process scraper. I'm not sure where that would live or how we'd go about making that though.
Had to dig through the archives for this, but back when I introduced the Personally I am interested in finding an avenue to open the floor for making this receiver more efficient even if it requires less reliance on All other comments on the PR are noted but I won't have free time for a couple days to address them. I plan to ASAP. Thank you for your review! |
Thanks for the updated chart and profiles @braydonk!
To get parent PID current code calls gopsutils
I think we can isolate enough the high-level of the receiver from each platform implementation. Perhaps the main thing here is that not many developers in the community know and use Windows so having something like
At this point I don't think there is enough difference on the data output to justify that. It seems to me that if we make the calls to win32 API directly we can keep most of the non-Windows code as it is. That said it is possible that I'm being too optimistic. |
We are keeping a close eye on this PR as it would really help us out. We are a heavy Windows shop and are actively working with Coralogix to ramp up our Observability efforts. The alternative is likely running If there is anyway we can help please ask. |
@pjanotti steve lerner here- this is one of our customers above who would love to get this merged- they have large fleets of new windows servers and are going 100% otel... obrigadão irmão tava no Rio a semana pasada! |
Description:
Scraping process metrics on Windows only really has one truly expensive operation: getting the Parent Process ID. gopsutil does this with the CreateToolhelp32Snapshot Win32 API call. This takes a snapshot of a large amount of process details and ends up being very expensive. This is causing CPU performance issues for Windows users.
This PR does the following in an attempt to address this performance issue:
disable_wmi
to turn WMI off in the rare case where you can't use WMIprocess.ppid
resource attribute is enabled. This means users in very resource constrained environments can opt out of fetching the metric regardless.Link to tracking Issue: #32947
Testing:
data:image/s3,"s3://crabby-images/e0980/e0980f39c7a30aa83180befe5ce1bc01e8b5d7f3" alt="perfmon screenshot showing the collector without the feature flag using on average 40% more CPU"
I tested this by doing two builds of the collector and comparing the performance against each other. With the WMI feature flag enabled, there is a reduction in the CPU usage. (red = before, green = after)
I also compared the original way of getting parent process ID against disabling the resource attribute. The difference is even more drastic. (red = before, green = after)
data:image/s3,"s3://crabby-images/1698c/1698cc8743bd353d3d627071a8e4fae1b56a785e" alt="perfmon screenshot showing the collector that has the parent process ID resource attribute labelled using drastically less CPU"
Documentation:
Added the new configuration value to the README. Added documentation note to
process.handles
to note that it has a hard requirement on WMI.