Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hostmetrics: use WMI to fetch ppid #35337

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

braydonk
Copy link
Contributor

Description:
Scraping process metrics on Windows only really has one truly expensive operation: getting the Parent Process ID. gopsutil does this with the CreateToolhelp32Snapshot Win32 API call. This takes a snapshot of a large amount of process details and ends up being very expensive. This is causing CPU performance issues for Windows users.

This PR does the following in an attempt to address this performance issue:

  • Repurpose the handlecount package to be for querying any Window Management Interface process data and add ParentProcessID to that querying process
  • Change the default ParentProcessID fetching process to be through the new WMI fetching API, with a new configuration option disable_wmi to turn WMI off in the rare case where you can't use WMI
  • Whether using WMI or not, this PR also adds a guard around fetching the parent process ID that will only check it if the process.ppid resource attribute is enabled. This means users in very resource constrained environments can opt out of fetching the metric regardless.

Link to tracking Issue: #32947

Testing:
I tested this by doing two builds of the collector and comparing the performance against each other. With the WMI feature flag enabled, there is a reduction in the CPU usage. (red = before, green = after)
perfmon screenshot showing the collector without the feature flag using on average 40% more CPU

I also compared the original way of getting parent process ID against disabling the resource attribute. The difference is even more drastic. (red = before, green = after)
perfmon screenshot showing the collector that has the parent process ID resource attribute labelled using drastically less CPU

Documentation:
Added the new configuration value to the README. Added documentation note to process.handles to note that it has a hard requirement on WMI.

Scraping metrics on Windows largely has one truly expensive operation:
getting the Parent Process ID. gopsutil does this with the
[CreateToolhelp32Snapshot](https://learn.microsoft.com/en-us/windows/win32/api/tlhelp32/nf-tlhelp32-createtoolhelp32snapshot)
Win32 API call. This takes a snapshot of a large amount of process
details and ends up being very expensive. This is causing CPU
performance issues for Windows users.

This PR does the following in an attempt to address this perforamnce
issue:
* Repurpose the handlecount package to be for querying any WMI process
  data and add ParentProcessID to that querying process
* Add the featureflag `hostmetrics.process.wmiParentProcessID` which
  will enable using the WMI query for parent process ID fetching
* Whether the featureflag is on or off, this PR also adds a guard around
  fetching the parent process ID that will only check it if the resource
  attribute for ppid is enabled. This means users in very resource
  constrained environments can opt out of that fetching either way
@braydonk braydonk requested review from dmitryax and a team as code owners September 21, 2024 02:09
@braydonk braydonk changed the title Ppid wmi hostmetrics: use WMI to fetch ppid Sep 21, 2024
@braydonk
Copy link
Contributor Author

I finally had some time to rework this PR with new changes since the old one was closed when I let it go stale. Hopefully this time we can get it pushed through.

Copy link
Contributor

github-actions bot commented Oct 5, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Oct 5, 2024
@braydonk
Copy link
Contributor Author

braydonk commented Oct 6, 2024

Not stale

@dashpole dashpole removed the Stale label Oct 6, 2024
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Oct 21, 2024
@braydonk
Copy link
Contributor Author

Not stale

@dashpole dashpole removed the Stale label Oct 21, 2024
@atoulme atoulme added the Run Windows Enable running windows test on a PR label Oct 23, 2024
func (wmiProcessQueryer) wmiProcessQuery() (map[int64]*wmiProcInfo, error) {
processes := []Win32_Process{}
// Based on reflection of Win32_Process type, this creates the following query:
// `get-wmiobject -query "select ProcessId, ParentProcessId, HandleCount from Win32_Process"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method retrieves both parent process ID and process handle count on every run, regardless of whether both are needed.

Would it make sense to call different commands depending on which information is needed? Would it help in performance to for example only retrieve parent process ID and not retrieve handle count, assuming the user didn't enable the process.handles metric?

The default configuration for the process scraper is to require parent process ID (as the process.parent_pid resource attribute is enabled by default) and not require handle count (as the process.handles metric is disabled by default).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accomplished via manual query construction based on the passed in config in acb81f8

processes := []Win32_Process{}
// Based on reflection of Win32_Process type, this creates the following query:
// `get-wmiobject -query "select ProcessId, ParentProcessId, HandleCount from Win32_Process"`
q := wmi.CreateQuery(&processes, "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding to @andrzej-stencel point above: wmi.CreateQuery uses reflection to create a textual representation for the query and that won't change after startup, even, without optimizing to collect only the configured information there is no need to create the query on every "scrape".

Looking at github.com/yusufpapurcu/wmi it seems that you can keep the same type Win32_Process with all the fields that you may want depending on the config, even if those fields are not included in the textual query. However, for that to not report an error we need to proper set the WMI client configuration, see https://github.com/yusufpapurcu/wmi/blob/master/wmi.go#L108-L113 - It will be interesting evaluating the cost of that, i.e.: data structure with fields not present in the query.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in acb81f8

I don't directly expect much impact to allowing missing fields based on a read of the code; looks like the only codepath difference when allowing missing fields is for a particular error to be ignored here: https://github.com/yusufpapurcu/wmi/blob/master/wmi.go#L354

I will try and do some due dilligence on the effect of including/not include certain struct fields one way or another. When I originally tested the difference between always fetching both fields vs only fetching one, I did not notice a substantial difference in WMI's performance. However this test was done a while ago, and it would be a good idea to do it again.

@braydonk
Copy link
Contributor Author

braydonk commented Nov 1, 2024

Thanks for the reviews @pjanotti and @andrzej-stencel, I'm acknowledging that I've seen them but can't address yet. Right after I asked for that review at the Collector SIG I had to be out for a bit so I wasn't able to get to it. Thanks for getting to it promptly even though I ended up being unable to answer. Will respond next week!

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 16, 2024
Copy link
Contributor

github-actions bot commented Dec 1, 2024

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Dec 1, 2024
@pjanotti
Copy link
Contributor

pjanotti commented Dec 2, 2024

@braydonk let's know if you would like some help to move this forward, it is a good issue to improve.

@braydonk braydonk reopened this Feb 22, 2025
@braydonk
Copy link
Contributor Author

On Oct 28, 2024 @pjanotti wrote:

I'm not sure that it is a good idea to tie the data retrieved with the underlying technology to get it. Can't we instead rely on logging proper warning at startup if WMI is not available and use the enable/disable metrics to allow the user to suppress the warnings?

For this initial revival of the PR I stuck with the spirit of my original approach, this was my thought process.

I agree that it is odd at a surface level to seemingly arbitrarily relate certain metrics and the retrieval method. The reason I think it's a good idea is because the retrieval method has an unfortunate underlying nuance, that being the dependence on a separate service. I think it's very rare for WMI to be completely inaccessible on a given machine, but a user may still want to disallow the collector to use the service.

That's why I made it so the flow is something like this:

  • There is a config option to completely disallow WMI in the case where a user wants that, it is on by default
  • If it is off and process.handles is enabled, config validation fails because that is the only possible method to collect that metric
  • If it is off and process.parent_pid is enabled (which it is by default), it will fall back to the original process snapshotting method of retrieving the attribute
  • If WMI is allowed but process.handles and process.parent_pid are both disabled, no WMI interaction will take place

This all hinges on the premise that the user should be allowed to unilaterally allow/disallow the collector to use WMI. If this isn't actually as important as I'm making it out to be, we can reconsider.

@github-actions github-actions bot removed the Stale label Feb 23, 2025
Copy link
Contributor

@pjanotti pjanotti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already a performance improvement over the current code. However, looking a bit at how the process data is retrieved on Windows it becomes clear that the optimization instead could be done at the time that the PIDs are captured handles, err := s.getProcessHandles(ctx). Internally, gopsutils, uses EnumProcesses to get the list of PIDs, then one by one we execute code to get other information about each process. So it calls CreateToolhelp32Snapshot for each process to get the parent PID, instead it should call CreateToolhelp32Snapshot instead of EnumProcesses when obtaining the initial list of PIDs. Besides performance the current code has issues regarding correctness and availability of information (processes can terminate between the calls and PIDs can be reused). However, this is a more involved optimization than the current PR since we may need to call Win32 APIs directly instead of relying on gopsutils.

Given the current implementation I wonder if using WMI to get all properties would actually perform better. The marshalling cost of all requested properties, for a high enough number of processes, can make WMI more expensive than the Win32 API, however, given the current implementation I suspect WMI can perform better - anyway, needs to be measured.

component: hostmetricsreceiver

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Use Windows Management Interface to fetch process.ppid by default. Add `disable_wmi` config option to fallback to old behaviour.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The remaining of the PR references wmi_enabled instead of disable_wmi


#### High CPU Usage On Windows

Getting the Parent Process ID of all processes on Windows is a very expensive operation. There are two options to combat this:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still true after this change? Could you please collect the latest image capture of perfmon with the latest code?

On my box:

$ Measure-Command { Get-WmiObject Win32_Process | Select-Object ProcessId, ParentProcessId, HandleCount }
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 796
Ticks             : 7969745
TotalDays         : 9.22424189814815E-06
TotalHours        : 0.000221381805555556
TotalMinutes      : 0.0132829083333333
TotalSeconds      : 0.7969745
TotalMilliseconds : 796.9745

So I rough approximation it will be a bit less than 2 milliseconds per process (I had 437 processes on my box). I will guess that not being on Powershell it should a bit cheaper.

}

if !queryOptSet {
return "", errors.New("no query options supplied")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather check the length of queryOpts at the top before anything else instead of tracking via a bool

In the context of the WMI manager it makes sense to require some option, not so much at the query itself since selecting only ProcessId seems reasonable.


import "errors"

var ErrPlatformSupport = errors.New("WMI process info collection is only supported on Windows")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the suggestion below, does it need to be public?

Suggested change
var ErrPlatformSupport = errors.New("WMI process info collection is only supported on Windows")
var ErrPlatformNotSupported = errors.New("WMI process info collection is only supported on Windows")

var ErrHandlesPlatformSupport = errors.New("process handle collection is only supported on Windows")

func NewManager() Manager {
func NewManager(queryOpts ...QueryOption) Manager {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first I think it would be better to add an error to the return signature and return it here.

This should also be private.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will also allow to check for empty queryOpts right here which avoids the need to the return error on the function creating the query string. Logically the fact that a query for PIDs doesn't make sense is closer to the way the manager is used, not to the query itself. In fact, the scraper first query for PIDs.

@@ -210,7 +227,7 @@ func (s *processScraper) getProcessMetadata(ctx context.Context) ([]*processMeta

var errs scrapererror.ScrapeErrors

if err := s.refreshHandleCounts(); err != nil {
if err := s.refreshWMIProcInfo(); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not to be addressed in this PR, but, at this point I wonder if it won't be better to use WMI for everything if we are getting this info here. The issue is that the underlying library gets a list of PIDs and then open/close handles to the process to get the remaining info. It also adds a bit of noise since the process can terminate in between the calls.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility is if the PID is recycled fast enough we could get the information for the wrong process. Note this is already a possibility with current code, even without the WMI query, it is not something being introduced here .

func (m *wmiProcInfoManager) GetProcessPpid(pid int64) (int64, error) {
procInfo, err := m.getProcessInfo(pid)
if err != nil {
return 0, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PID 0 is always the System Idle Process. Per Raymond Chen DWORD(-1), i.e.: (0xFFFFFFF) is a better alternative - that said here we are treating the PID as an int64, perhaps something to review separately.

var ErrPlatformSupport = errors.New("WMI process info collection is only supported on Windows")

// Manager is the public interface for a WMI Process Info manager.
type Manager interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why it needs to be public? Similar Q to some other types here.

Comment on lines +33 to +34
ErrNoProcesses = errors.New("no process info is currently registered")
ErrProcessNotFound = errors.New("process info not found")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do they need to be public?

@pjanotti
Copy link
Contributor

pjanotti commented Feb 24, 2025

@braydonk

The reason I think it's a good idea is because the retrieval method has an unfortunate underlying nuance, that being the dependence on a separate service.

Given these nuances and my comment about the current code. Unless there is some property that we think is not available in WMI we should consider an option like use_wmi that does a single WMI query for all properties. When use_wmi is disabled we could do the Win32 API route and in that case some of the info may not be available (parent PID can be made cheaply available without WMI, not sure about handle count).

@braydonk
Copy link
Contributor Author

braydonk commented Feb 25, 2025

Thanks for the review @pjanotti!

First, updated performance data. Here's a perfmon screenshot running two collectors next to each other.

image

(It's actually the same binary, but I made two copies with different names for screenshot demonstration)

WMI Off:
config file
pprof profile

WMI On:
config file
pprof profile

Important note for comparing the two profiles is that the profile for WMI On is 450ms compared to the WMI Off profile at 2.5s.


Internally, gopsutils, uses EnumProcesses to get the list of PIDs, then one by one we execute code to get other information about each process. So it calls CreateToolhelp32Snapshot for each process to get the parent PID, instead it should call CreateToolhelp32Snapshot instead of EnumProcesses when obtaining the initial list of PIDs.

I think I'm missing something here; is it possible to call CreateToolhelp32Snapshot on a bulk set of processes? The API documentation makes it look like this can only be done for one process at a time: https://learn.microsoft.com/en-us/windows/win32/api/tlhelp32/nf-tlhelp32-createtoolhelp32snapshot


Besides performance the current code has issues regarding correctness and availability of information (processes can terminate between the calls and PIDs can be reused).

This is an issue I've known about for some time but been unsure how to tackle without rearchitecting the scraper. The fact that the list of processes is gathered and then information is fetched individually for each makes it so this problem will always exist on any platform. Much of the scraper was written before my time as a CODEOWNER, but staying relatively the same cross-platform by leaning on gopsutil appears to be a goal. It ends up being to the detriment of any platform in the end, but is much simpler to maintain.

I think trying to make WMI fetch all information on Windows would make it so the scraper looks completely different and would diminish that benefit of ease of maintenance (arguably the usage of WMI as it exists and in this PR is already a blow to maintainability, but is necessitated by unacceptable performance on Windows in the current implementation).

Perhaps this is worth addressing at a different angle, such as an entirely new Windows specific process scraper. I'm not sure where that would live or how we'd go about making that though.


We could do the Win32 API route and in that case some of the info may not be available (parent PID can be made cheaply available without WMI, not sure about handle count).

Had to dig through the archives for this, but back when I introduced the process.handles metric in the first place I had an initial version that used NtQuerySystemInformation instead. It was rejected in favour of using WMI instead. I'm not sure if the original reservations that were stated at the time still apply. One thing's for sure, it's harder to understand for a general maintainer who may not know how it works. NtQuerySystemInformation also doesn't provide a stability guarantee and that was a concern. To my knowledge, the only other way to get Parent PID other than CreateToolhelp32Snapshot is NtQuerySystemInformation (which I think is what WMI does under the hood?)

Personally I am interested in finding an avenue to open the floor for making this receiver more efficient even if it requires less reliance on gopsutil in favour of direct system API usage to allow for gathering the data more efficiently. I'm not sure how to broach the topic more broadly.


All other comments on the PR are noted but I won't have free time for a couple days to address them. I plan to ASAP. Thank you for your review!

@pjanotti
Copy link
Contributor

Thanks for the updated chart and profiles @braydonk!

Internally, gopsutils, uses EnumProcesses to get the list of PIDs, then one by one we execute code to get other information about each process. So it calls CreateToolhelp32Snapshot for each process to get the parent PID, instead it should call CreateToolhelp32Snapshot instead of EnumProcesses when obtaining the initial list of PIDs.

I think I'm missing something here; is it possible to call CreateToolhelp32Snapshot on a bulk set of processes? The API documentation makes it look like this can only be done for one process at a time: https://learn.microsoft.com/en-us/windows/win32/api/tlhelp32/nf-tlhelp32-createtoolhelp32snapshot

To get parent PID current code calls gopsutils PpidWithContext. Until the ppid is cached it needs to call getFromSnapProcess which makes the call to CreateToolhelp32Snapshot, but, it must use the flag TH32CS_SNAPPROCESS to capture the ppid and that takes a snapshot of all processes in the box, the parameter with the PID is ignored in this case. Notice use of win32 APIs Process32First and Process32Next at https://github.com/shirou/gopsutil/blob/v4.25.1/process/process_windows.go#L847-L868. So if we used this as the mechanism to capture the list of PIDs we would already be getting the ppid for each process and in a snapshot that should be consistent. There will be other information that would require us to open each process handle, but, the cost to get the ppid would be great reduced.

staying relatively the same cross-platform by leaning on gopsutil appears to be a goal. It ends up being to the detriment of any platform in the end, but is much simpler to maintain.

I think we can isolate enough the high-level of the receiver from each platform implementation. Perhaps the main thing here is that not many developers in the community know and use Windows so having something like gopsutils shielding them from Windows is a more practical reason. That said I'm here to improve the collector experience on Windows 😄 so I can help here.

Perhaps this is worth addressing at a different angle, such as an entirely new Windows specific process scraper. I'm not sure where that would live or how we'd go about making that though.

At this point I don't think there is enough difference on the data output to justify that. It seems to me that if we make the calls to win32 API directly we can keep most of the non-Windows code as it is. That said it is possible that I'm being too optimistic.

@rustyautopsy
Copy link

rustyautopsy commented Feb 25, 2025

We are keeping a close eye on this PR as it would really help us out. We are a heavy Windows shop and are actively working with Coralogix to ramp up our Observability efforts. The alternative is likely running otel with windows_exporter which is less then ideal.

If there is anyway we can help please ask.

@slernercx
Copy link

@pjanotti steve lerner here- this is one of our customers above who would love to get this merged- they have large fleets of new windows servers and are going 100% otel... obrigadão irmão tava no Rio a semana pasada!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
receiver/hostmetrics Run Windows Enable running windows test on a PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants