Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subscribe and its variants run into a busy loop taking 100% CPU when groups is not empty #975

Open
tnqn opened this issue May 23, 2024 · 3 comments · May be fixed by #977
Open

Subscribe and its variants run into a busy loop taking 100% CPU when groups is not empty #975

tnqn opened this issue May 23, 2024 · 3 comments · May be fixed by #977

Comments

@tnqn
Copy link
Contributor

tnqn commented May 23, 2024

My colleage @hongliangl tried to upgrade netlink to a recent commit to pick up a required change. However, our CI became flaky when validating the new netlink version. We identified the flake started from the commit merged via #941.

The above commit changes the socket created by Subscribe to non-blocking when groups are provided. However, it didn't change how the message was received from the socket, causing the receiver goroutine to run into a busy loop, taking 100% CPU.

netlink/addr_linux.go

Lines 367 to 372 in 856e190

for {
msgs, from, err := s.Receive()
if err != nil {
if err == syscall.EAGAIN {
continue
}

The issue can be reproduced by the following code:

// netlink-reproducer.go
package main

import (
	"fmt"
	"os"

	"github.com/vishvananda/netlink"
)

func main() {
	ch := make(chan netlink.AddrUpdate, 1000)
	stopCh := make(chan struct{})
	defer close(stopCh)
	if err := netlink.AddrSubscribeWithOptions(ch, stopCh, netlink.AddrSubscribeOptions{
		ErrorCallback: func(err error) {
			fmt.Fprintf(os.Stderr, "Received err: %v\n", err)
		},
	}); err != nil {
		fmt.Fprintf(os.Stderr, "Failed to subscribe: %v\n", err)
	}
	for {
		select {
		case update := <-ch:
			fmt.Fprintf(os.Stderr, "Received: %v\n", update)
		}
	}
}

The process would always take 100%+ CPU:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 412687 root      20   0 1968136   8172   2660 S 103.6   0.0   0:15.53 netlink-reprodu

To fix it, all subscribers' receiver goroutines should use poll or select to wait for events first before receiving messages from the socket. I could take a stab at fixing the implementation, but I wonder if we could revert the commit that introduces the bug first to unblock projects requiring other changes of the library.

@tnqn
Copy link
Contributor Author

tnqn commented May 23, 2024

cc @aboch @kuroa-me

@kuroa-me
Copy link
Contributor

Apologies for my sluppy implementation, this was such a basic mistake...

@tnqn Do you mind fixing the implementation after the revert? Would be a great opportunity for me to learn.

@tnqn
Copy link
Contributor Author

tnqn commented May 23, 2024

@tnqn Do you mind fixing the implementation after the revert? Would be a great opportunity for me to learn.

Sure, I'm working on a fix, will ping you once I open a PR. Sorry for reverting the commit, I'm just not sure how long it takes to land the fix but we need some changes in the recent commit to unblock our project.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants