Description
We've encountered a problem with our bonding configuration after our most recent Flatcar upgrade from v3760.2.0 to v3975.2.1. The behavior is odd in that the actor churn on the bond0 interface does not always begin after the initial upgrade reboot; instead, it most frequently appears after a subsequent reboot.
We can commonly recover from this by rebooting, but that does not always fix it.
We have tried downing and re-upping the affected bond0 interface, but that does not seem to have any effect (see the sketch after this list).
We tried upgrading to the next known stable release, 3975.2.2, but we see the same problem.
We tried downgrading to v3760.2.0 and that worked: the interface no longer enters churn.
We then tried upgrading back to 3975.2.1 and rebooting again after the upgrade reboot, and the churn reappeared.
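For reference, the recovery attempts looked roughly like the following (a sketch only; bond0 is the bond name on our nodes and the commands were run as root):

```shell
# Down and re-up the affected bond -- this did not clear the churn for us.
ip link set bond0 down
ip link set bond0 up

# Reboot the node, which only sometimes recovers it.
systemctl reboot
```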
Impact
Nodes that are rebooted again after the initial upgrade reboot enter churn on the secondary interface of bond0 and are subsequently unable to communicate with other nodes in the cluster.
Environment and steps to reproduce
Set-up:
a. Baremetal Flatcar OS 3760.2.0 upgraded via Nebraska to Flatcar OS 3975.2.1
Task:
a. After the node is upgraded and rebooted, it is rebooted a second time; the churn then appears, causing node logins and command execution to lag.
Action(s):
a. Rebooted the node again after the initial upgrade reboot.
b. Node logins and commands begin to hang, taking many seconds to minutes to complete.
c. /proc/net/bonding/bond0 shows churn on the secondary interface, with no system MAC address present (see the inspection commands after this list).
Error:
a. Other nodes were unable to communicate with the affected node.
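The churn state mentioned above can be inspected with something like this (the field names come from the 802.3ad section of the bonding proc file and may differ slightly between kernel versions):

```shell
# Dump the full LACP/bonding state for bond0.
cat /proc/net/bonding/bond0

# Narrow the output to the churn-state and system MAC address fields.
grep -iE 'churn|system mac' /proc/net/bonding/bond0
```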
Expected behavior
Nodes should continue to communicate with the other nodes in the cluster after a reboot.
Additional information
Kernel messages from the ice driver and the bonding driver as the bond comes up:
[ 18.646307] ice 0000:41:00.1 enp65s0f1np1: NIC Link is up 25 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: FC-FEC/BASE-R, Autoneg Advertised: On, Autoneg Negotiated: False, Flow Control: None
[ 18.666616] bond0: (slave enp65s0f1np1): Enslaving as a backup interface with an up link
[ 18.675470] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
[ 18.686452] ice 0000:41:00.1 enp65s0f1np1: Error ADDING CP rule for fail-over
[ 18.693782] ice 0000:41:00.1 enp65s0f1np1: Shared SR-IOV resources in bond are active
[ 18.702648] ice 0000:41:00.0: Primary interface not in switchdev mode - VF LAG disabled
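These messages come from the kernel log during boot; something along these lines pulls them from a running node (the grep pattern, PCI address, and interface name match our hardware and would need adjusting elsewhere):

```shell
# Bonding and ice-driver messages from the current boot.
journalctl -k -b | grep -E 'bond0|ice 0000:41'

# The same via the kernel ring buffer.
dmesg | grep -E 'bond0|ice 0000:41'
```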
We were asked to try the following but are still seeing issues:
Create /etc/systemd/network/98-bond-mac.link
Add the following to the newly created /etc/systemd/network/98-bond-mac.link
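The exact contents we were given are not reproduced here; purely as an illustrative sketch, assuming the intent is to stop udev from rewriting the bond's MAC address, the file could look something like this (the match name and policy are placeholders to adjust as needed):

```shell
# Illustrative sketch only -- not the exact contents we were asked to use.
# Assumption: keep udev from changing the MAC address on bond0 (run as root).
cat <<'EOF' > /etc/systemd/network/98-bond-mac.link
[Match]
OriginalName=bond0

[Link]
MACAddressPolicy=none
EOF

# Show which .link file udev would apply to bond0; the change itself only
# takes effect after a reboot or after the bond is recreated.
udevadm test-builtin net_setup_link /sys/class/net/bond0
```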