fix: update netlink dep to latest commit #3237

Open
wants to merge 2 commits into
base: master

Conversation

vadasambar

  • this fixes interrupted sys call errors in the plugin

What type of PR is this?
bug

Which issue does this PR fix?:

#3196

What does this PR do / Why do we need it?:
This PR fixes frequent interrupted system call errors like this one:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox \"xxxxxxxx\": plugin type=\"aws-cni\" name=\"aws-cni\" failed (add): add command: failed to setup network: SetupPodNetwork: failed to setup veth pair: failed to setup veth network: setup NS network: failed while waiting for v6 addresses to be stable: could not list addresses: interrupted system call

Testing done on this change:

Will this PR introduce any new dependencies?:

No

Will this break upgrades or downgrades? Has updating a running cluster been tested?:
No

Does this change require updates to the CNI daemonset config files to work?:

No

Does this PR introduce any user-facing change?:

fix: interrupted system call error

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@vadasambar vadasambar requested a review from a team as a code owner March 18, 2025 11:45
@@ -32,7 +32,7 @@ require (
 	github.com/sirupsen/logrus v1.9.3
 	github.com/spf13/pflag v1.0.5
 	github.com/stretchr/testify v1.10.0
-	github.com/vishvananda/netlink v1.3.0
+	github.com/vishvananda/netlink v1.3.1-0.20250303224720-0e7078ed04c8
Author

@vadasambar vadasambar Mar 18, 2025

v1.3.1 hasn't been released yet. Until netlink gets a v1.3.1 release, this uses the latest commit at the time of creating this PR: vishvananda/netlink@0e7078e

Author

@vadasambar vadasambar Mar 18, 2025

We could handle the error in the code here:

addrs, err := netLink.AddrList(link, netlink.FAMILY_V6)
if err != nil {
	return fmt.Errorf("could not list addresses: %v", err)
}

with something like:

		if err != nil && !errors.Is(err, unix.EINTR) {
			return fmt.Errorf("could not list addresses: %v", err)
		}

but we might end up ignoring unix.EINTR errors that really matter.

P.S.: updated code link

@haouc
Contributor

haouc commented Mar 19, 2025

@vadasambar thanks for updating this. Did you get a chance to test? If you did, can you comment with the repro you had in the overview?

@vadasambar
Author

> @vadasambar thanks for updating this. Did you get a chance to test? If you did, can you comment with the repro you had in the overview?

This PR only updates a dependency. If the tests pass for this repo, we would be in a better state even if the issue is not fully fixed (I think it will be fixed).

The only problem I see is using a master/trunk version of a dependency. Provided the upstream dependency is doing its due diligence, we should be fine? Let me know what you think.

@vadasambar
Author

@haouc ^

@yash97
Contributor

yash97 commented Mar 20, 2025

I don't see the urgency to pin a package commit when upgrading the dependency. We can update the package when it is officially released. Even with this update, we only get the benefit of error categorization: either we retry silently using this new update, which is not officially released yet, or we let kubelet retry on failure. It doesn't seem worth the risk of importing a package version that has no official release as of now.

@vadasambar
Author

vadasambar commented Mar 21, 2025

> I don't see the urgency to pin a package commit when upgrading the dependency. We can update the package when it is officially released. Even with this update, we only get the benefit of error categorization.

We could wait for upstream to release a new version, but we don't know when that will happen. The fix was merged in Sep 2024.

> Even with this update, we only get the benefit of error categorization: either we retry silently using this new update, which is not officially released yet, or we let kubelet retry on failure. It doesn't seem worth the risk of importing a package version that has no official release as of now.

Right now the CNI returns an error and lets kubelet retry. Before, it used the result of the call even if the system call was interrupted, and kept retrying until the result contained the value we were looking for. If we let kubelet retry, it's going to add to the pod startup time.

P.S.: the problem is only going to get worse with increased pod churn or larger clusters, since the probability of a system call getting interrupted only goes up.

I guess we have 3 options now:

  1. Wait for upstream to release the fix (no timeline on this; last releases: 2024, 2022, 2020)
  2. Use the upstream's trunk version (comes with risk of using an unreleased version of a dependency)
  3. Handle the error in CNI (need to be careful about dropping interrupted system call errors; might need to replicate upstream's fix until there's a new release, something like this and this)
