Incorrectly installed vrf routes due to improperly processed netlink RTM_NEWROUTE and RTM_NEWLINK (wrong order) #18041

Open
2 tasks done
Kurczaczek21 opened this issue Feb 6, 2025 · 0 comments
Labels
triage Needs further investigation

Kurczaczek21 commented Feb 6, 2025

Description

After upgrading FRR from 8.4.2 to 10.1.1, a problem appeared when creating VRFs and adding a default route to them. It is related to the redistribution of kernel routes. The issue still occurs on the latest FRR version.

Sometimes FRR processes the RTM_NEWROUTE message before the corresponding RTM_NEWLINK message. Because the RTM_NEWLINK has not been processed yet, the destination VRF does not exist at that point, and the route is added to the default VRF instead. We believe this ordering is the cause of the bug. The misplaced default route is then advertised via the BGP session.

NOTE:
A possible solution is attached to this issue (see the Additional context section).

Version

FRRouting 10.1.1 (h1001) on Linux(5.15.0-127-generic).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
configured with:
    '--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-option-checking' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--sbindir=/usr/lib/frr' '--with-vtysh-pager=/usr/bin/pager' '--libdir=/usr/lib/x86_64-linux-gnu/frr' '--with-moduledir=/usr/lib/x86_64-linux-gnu/frr/modules' '--disable-dependency-tracking' '--disable-rpki' '--disable-scripting' '--enable-pim6d' '--with-libpam' '--enable-doc' '--enable-doc-html' '--enable-snmp' '--enable-fpm' '--disable-protobuf' '--disable-zeromq' '--enable-ospfapi' '--enable-bgp-vnc' '--enable-multipath=256' '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-configfile-mask=0640' '--enable-logfile-mask=0640' 'build_alias=x86_64-linux-gnu' 'PYTHON=python3'

How to reproduce

This problem does not occur on every iteration, so I use a script to automate the process.

The script and the necessary config files are all attached to this issue.
NOTE - I couldn't upload files with a .conf extension, so I replaced every dot with a dash (. to -) and added a .txt extension.
Original file names and descriptions:
repro_v2.sh (repro_v2-sh.txt) - script that adds kernel routes and handles config reloads.
frr.conf_no_vrf (frr-conf_no_vrf.txt) - startup config with some BGP configuration.
frr.conf_with_vrf (frr-conf_with_vrf.txt) - FRR config used in every iteration to create the proper bgp vrf config.
frr.conf_with_vrf_org (frr-conf_with_vrf_org.txt) - used to reset the previous config file, so the reproduction script can be run again without problems.

Reproduction typically takes around 50-150 iterations, but this number can vary.

  1. Run FRR 10.1.1
    Startup config:
frr version 10.1.1
frr defaults datacenter
hostname missing-vrf-rt-bug
log stdout
log syslog
log file /frr.log
!
debug zebra nexthop
debug zebra kernel
debug bgp zebra
debug bgp bestpath 0.0.0.0/0
debug zebra rib detailed
debug zebra events
debug zebra dplane
debug zebra nht detailed
zebra nexthop-group keep 30
zebra dplane limit 2000
service integrated-vtysh-config
!
router bgp 4250100001
 bgp router-id 10.40.0.1
 no bgp suppress-duplicates
 no bgp hard-administrative-reset
 no bgp default ipv4-unicast
 address-family ipv4 unicast
  redistribute kernel
 exit-address-family
 address-family ipv6 unicast
  redistribute kernel
 exit-address-family
 coalesce-time 1000
 bgp graceful-restart stalepath-time 15
 bgp graceful-restart
 bgp graceful-restart preserve-fw-state
 bgp bestpath as-path multipath-relax
exit
!
ip nht resolve-via-default
!
end 
  2. Add the VRFs and routes to the kernel, e.g.:
ip link add vrfv$VRF_NUM type vrf table $VRF_NUM
ip r add blackhole default metric 4278198272 table $VRF_NUM
ip -6 r add blackhole default metric 4278198272 table $VRF_NUM
ip link add brv$VRF_NUM type bridge
ip link set brv$VRF_NUM master vrfv$VRF_NUM
ip link add vxlan$VRF_NUM type vxlan id $VRF_NUM dstport 4789
ip link set vxlan$VRF_NUM master brv$VRF_NUM
ip link set vrfv$VRF_NUM up
ip link set brv$VRF_NUM up
ip link set vxlan$VRF_NUM up
  3. Add the router bgp vrf configuration for the specific VRF that was just added to the kernel, along with its configuration for kernel route redistribution. Example config file: frr.conf_with_vrf (the file frr.conf_with_vrf_org is also attached, so that the script can start its iterations again from the original config):
frr version 10.1.1
frr defaults datacenter
hostname missing-vrf-rt-bug
log file /frr.log
log stdout
log syslog
!
debug zebra nexthop
debug zebra kernel
debug bgp zebra
debug bgp bestpath 0.0.0.0/0
debug zebra rib detailed
debug zebra events
debug zebra dplane
debug zebra nht detailed
zebra nexthop-group keep 30
zebra dplane limit 2000
service integrated-vtysh-config
!
vrf vrfv2980010
 vni 2980010
exit-vrf
!
vrf mgmt
exit-vrf
!
vrf vrfv252
exit-vrf
!
router bgp 4250100001
 bgp router-id 10.40.0.1
 no bgp suppress-duplicates
 no bgp hard-administrative-reset
 no bgp default ipv4-unicast
 address-family ipv4 unicast
  redistribute kernel
 exit-address-family
 address-family ipv6 unicast
  redistribute kernel
 exit-address-family
 coalesce-time 1000
 bgp graceful-restart stalepath-time 15
 bgp graceful-restart
 bgp graceful-restart preserve-fw-state
 bgp bestpath as-path multipath-relax
exit
!
router bgp 4250100001 vrf vrfv2980010
 bgp router-id 10.40.0.1
 !
 address-family ipv4 unicast
  redistribute kernel
 exit-address-family
 !
 address-family ipv6 unicast
  redistribute kernel
 exit-address-family
exit
!
ip nht resolve-via-default
! 

  4. Reload FRR with the updated config, e.g. /usr/lib/frr/frr-reload.py --reload /etc/frr/frr.conf_with_vrf
    This is not required to reproduce the bug itself, but it adds the router bgp 4250100001 vrf vrfvXXXXX config to FRR, which gives more context about how much impact this bug can have.

  5. Verify whether the bug showed up by running vtysh -c "show ip route vrf vrfv$VRF_NUM 0.0.0.0". If the default route does not exist there for vrfv$VRF_NUM, the bug has been successfully reproduced.

If the bug is not reproduced, try re-adding the VRF and routes. A script that automates this process is attached, along with the configs.
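
The kernel-side commands from the "add routes and VRFs" step can be wrapped in a function that takes the VRF number, which makes repeated attempts easier. This is only a minimal sketch and not the attached repro_v2.sh: the function names add_vrf and del_vrf, the RUN dry-run switch, and the cleanup commands in del_vrf are assumptions added for illustration.

```shell
# Hypothetical helpers (not the attached repro_v2.sh).
# RUN is a command prefix: the default "echo" only prints the commands (dry run).
# Set RUN="" and run as root to actually apply them.
RUN="${RUN-echo}"

# Create the VRF, its blackhole default routes, and the bridge/VXLAN devices,
# in the same order as the commands listed in the reproduction steps.
add_vrf() {
    $RUN ip link add "vrfv$1" type vrf table "$1"
    $RUN ip r add blackhole default metric 4278198272 table "$1"
    $RUN ip -6 r add blackhole default metric 4278198272 table "$1"
    $RUN ip link add "brv$1" type bridge
    $RUN ip link set "brv$1" master "vrfv$1"
    $RUN ip link add "vxlan$1" type vxlan id "$1" dstport 4789
    $RUN ip link set "vxlan$1" master "brv$1"
    $RUN ip link set "vrfv$1" up
    $RUN ip link set "brv$1" up
    $RUN ip link set "vxlan$1" up
}

# Assumed cleanup between attempts: remove the routes first, then the devices.
del_vrf() {
    $RUN ip r del blackhole default metric 4278198272 table "$1"
    $RUN ip -6 r del blackhole default metric 4278198272 table "$1"
    $RUN ip link del "vxlan$1"
    $RUN ip link del "brv$1"
    $RUN ip link del "vrfv$1"
}
```

With the default RUN=echo, calling add_vrf 2980049 prints the commands without touching the kernel, which is a cheap way to review them before running as root.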

Expected behavior

Expected output for vtysh -c "show ip route vrf vrfv$VRF_NUM 0.0.0.0 json" :

root@h1001:/# vtysh -c "show ip route vrf vrfv2980049 0.0.0.0 json"
{
  "0.0.0.0/0":[
    {
      "prefix":"0.0.0.0/0",
      "prefixLen":0,
      "protocol":"kernel",
      "vrfId":371,
      "vrfName":"vrfv2980049",  <----------------VRF == vrfv<number>
      "selected":true,
      "destSelected":true,
      "distance":255,
      "metric":8192,
      "installed":true,
      "table":2980049,
      "internalStatus":16,
      "internalFlags":8,
      "internalNextHopNum":1,
      "internalNextHopActiveNum":1,
      "nexthopGroupId":67,
      "installedNexthopGroupId":67,
      "uptime":"00:00:31",
      "nexthops":[
        {
          "flags":3,
          "fib":true,
          "unreachable":true,
          "blackhole":true,
          "active":true
        }
      ]
    }
  ]
}

The new default route should not be visible in the default VRF (vtysh -c 'show bgp ipv4 unicast 0.0.0.0/0') and should appear in the specific VRF (vtysh -c "show bgp vrf vrfv2980049 ipv4 unicast 0.0.0.0").

IPv4, specific VRF:

root@h1001:/# vtysh -c "show bgp vrf vrfv2980049 ipv4 unicast 0.0.0.0"
BGP routing table entry for 0.0.0.0/0, version 1
Paths: (1 available, best #1, vrf vrfv2980049)
  Not advertised to any peer
  Local
    0.0.0.0(h1001) from 0.0.0.0 (10.40.0.1)
      Origin incomplete, metric 8192, aigp-metric 8192, weight 32768, valid, sourced, bestpath-from-AS Local, best (First path received)
      Last update: Thu Feb  6 07:08:07 2025

IPv4, default VRF:

root@h1001:/# vtysh -c "show bgp ipv4 unicast 0.0.0.0"
% Network not in table

IPv6, specific VRF:

root@h1001:/# vtysh -c "show bgp vrf vrfv2980049 ipv6 unicast ::/0"
BGP routing table entry for ::/0, version 1
Paths: (1 available, best #1, vrf vrfv2980049)
  Not advertised to any peer
  Local
    ::(h1001) from :: (10.40.0.1)
      Origin incomplete, metric 8192, aigp-metric 8192, weight 32768, valid, sourced, bestpath-from-AS Local, best (First path received)
      Last update: Thu Feb  6 07:08:07 2025

IPv6, default VRF:

root@h1001:/# vtysh -c "show bgp ipv6 unicast ::/0"
% Network not in table

Actual behavior

After some reproduction iterations, the bug will show up, and the route will be added to the default VRF instead of the specific one.

Unwanted output (BUG) for vtysh -c "show ip route vrf vrfv$VRF_NUM 0.0.0.0 json":

root@h1001:/# vtysh -c "show ip route vrf vrfv2980049 0.0.0.0 json"
{
  "0.0.0.0/0":[
    {
      "prefix":"0.0.0.0/0",
      "prefixLen":0,
      "protocol":"kernel",
      "vrfId":0,
      "vrfName":"default",     <----------------VRF == default
      "selected":true,
      "destSelected":true,
      "distance":255,
      "metric":8192,
      "installed":true,
      "table":2980049,
      "internalStatus":16,
      "internalFlags":8,
      "internalNextHopNum":1,
      "internalNextHopActiveNum":1,
      "nexthopGroupId":170,
      "installedNexthopGroupId":170,
      "uptime":"00:01:34",
      "nexthops":[
        {
          "flags":3,
          "fib":true,
          "unreachable":true,
          "blackhole":true,
          "active":true
        }
      ]
    }
  ]
}
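
The difference that matters between the expected and the unwanted JSON output is the vrfName field ("vrfv<number>" versus "default"). A minimal sketch of an automated check follows, assuming the JSON layout shown in this report. The function names route_vrf_name and default_route_in_vrf are hypothetical, and the sed-based extraction is a simplification, not a real JSON parser.

```shell
# Print the vrfName of the first route entry found in
# `show ip route ... json` output read from stdin.
route_vrf_name() {
    sed -n 's/.*"vrfName":[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Return 0 if the IPv4 default route looked up in vrfv$1 is attributed
# to that VRF, and non-zero if it is missing or attributed to another VRF.
default_route_in_vrf() {
    name="$(vtysh -c "show ip route vrf vrfv$1 0.0.0.0 json" | route_vrf_name)"
    [ "$name" = "vrfv$1" ]
}
```

A reproduction loop could call default_route_in_vrf "$VRF_NUM" after each attempt and stop at the first non-zero return.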

The new default route now appears in the default VRF (vtysh -c 'show bgp ipv4 unicast 0.0.0.0/0'), and the default route is now missing in Linux.

IPv4, specific VRF: (missing route)

root@h1001:/# vtysh -c "show bgp vrf vrfv2980049 ipv4 unicast 0.0.0.0"
% Network not in table

IPv4, default VRF: (unwanted route from specific VRF present)

root@h1001:/# vtysh -c "show bgp ipv4 unicast 0.0.0.0"
BGP routing table entry for 0.0.0.0/0, version 1
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Local
    0.0.0.0(h1001) from 0.0.0.0 (10.40.0.1)
      Origin incomplete, metric 8192, aigp-metric 8192, weight 32768, valid, sourced, bestpath-from-AS Local, best (First path received)
      Last update: Thu Feb  6 07:06:07 2025

IPv6, specific VRF:

root@h1001:/# vtysh -c "show bgp vrf vrfv2980049 ipv6 unicast ::/0"
BGP routing table entry for ::/0, version 1
Paths: (1 available, best #1, vrf vrfv2980049)
  Not advertised to any peer
  Local
    ::(h1001) from :: (10.40.0.1)
      Origin incomplete, metric 8192, aigp-metric 8192, weight 32768, valid, sourced, bestpath-from-AS Local, best (First path received)
      Last update: Thu Feb  6 07:06:07 2025

IPv6, default VRF:

root@h1001:/# vtysh -c "show bgp ipv6 unicast ::/0"
% Network not in table

Additional context

As far as I know, this issue can be reproduced only for IPv4.

A potential fix for this problem is attached as a patch: vrf_rt_installation.patch

vrf_rt_installaton.patch
Note that this only fixes the default route override, and it does not fix the entire problem.

FRR logs when the bug is present:
(The first line shows what we want to avoid.)

2025/01/31 09:59:45 BGP: [RHWNZ-VRQBG] Rx route ADD VRF 0 kernel[0] 0.0.0.0/0 nexthop 0.0.0.0 (type 6 if 0) metric 8192 distance 255 tag 0
2025/01/31 09:59:45 BGP: [RBZV6-DW61Y] Tx redistribute add VRF 5188 afi 2 kernel 0
2025/01/31 09:59:45 BGP: [RHWNZ-VRQBG] Rx route ADD VRF 5188 kernel[0] ::/0 nexthop :: (type 6 if 0) metric 8192 distance 255 tag 0
2025/01/31 09:59:45 BGP: [VTVCM-Y2NW3] Configuration Read in Took: 00:00:00
2025/01/31 09:59:45 BGP: [G6NKK-8C6DV] end_config: VTY:0x55a1421839d0, pending SET-CFG: 0
2025/01/31 09:59:45 BGP: [V7N4G-NR80B] bgp_process_main_one: p=0.0.0.0/0(VRF default) afi=IPv4, safi=unicast start
2025/01/31 09:59:45 BGP: [WEWEC-8SE72] 0.0.0.0/0(VRF default): path Static announcement is the bestpath from AS 0
2025/01/31 09:59:45 BGP: [JW7VP-K1YVV] 0.0.0.0/0(VRF default): Comparing path frr_if1 flags Valid Counted Mpath  with path frr_if0 flags Selected Valid Counted
2025/01/31 09:59:45 BGP: [PX4TR-N07FM] 0.0.0.0/0: path frr_if1 and path frr_if0 are equal via multipath-relax
2025/01/31 09:59:45 BGP: [KAAY6-34F21] 0.0.0.0/0: path frr_if1 loses to path frr_if0 due to oldest external
2025/01/31 09:59:45 BGP: [WEWEC-8SE72] 0.0.0.0/0(VRF default): path frr_if0 is the bestpath from AS 4240135001
2025/01/31 09:59:45 BGP: [JW7VP-K1YVV] 0.0.0.0/0(VRF default): Comparing path Static announcement flags Valid Dmed Selected Unsorted  with path frr_if0 flags Selected Valid Dmed Selected Counted
2025/01/31 09:59:45 BGP: [TG8CF-BT6NT] 0.0.0.0/0: path Static announcement wins over path frr_if0 due to weight 32768 > 0
2025/01/31 09:59:45 BGP: [N6CTF-2RSKS] 0.0.0.0/0(VRF default): After path selection, newbest is path Static announcement oldbest was frr_if0
2025/01/31 09:59:45 BGP: [ZF63D-VT50R] 0.0.0.0/0(VRF default): path Static announcement is the bestpath, add to the multipath list
2025/01/31 09:59:45 BGP: [JW7VP-K1YVV] 0.0.0.0/0(VRF default): Comparing path frr_if0 flags Selected Valid Dmed Selected Counted  with path Static announcement flags Valid Dmed Selected
2025/01/31 09:59:45 BGP: [MJZKY-GPYZM] 0.0.0.0/0: path frr_if0 loses to path Static announcement due to weight 0 < 32768
2025/01/31 09:59:45 BGP: [JW7VP-K1YVV] 0.0.0.0/0(VRF default): Comparing path frr_if1 flags Valid Dmed Check Counted Mpath  with path Static announcement flags Valid Dmed Selected
2025/01/31 09:59:45 BGP: [MJZKY-GPYZM] 0.0.0.0/0: path frr_if1 loses to path Static announcement due to weight 0 < 32768
2025/01/31 09:59:45 BGP: [VMWTX-J3P3K] 0.0.0.0/0(VRF default): starting mpath update, newbest Static announcement num candidates 1 old-mpath-count 1 old-cum-bw 0
2025/01/31 09:59:45 BGP: [YW4BC-Y5TG6] 0.0.0.0/0(VRF default): comparing candidate Static announcement with existing mpath frr_if1
2025/01/31 09:59:45 BGP: [YW4BC-Y5TG6] 0.0.0.0/0(VRF default): comparing candidate NONE with existing mpath frr_if1
2025/01/31 09:59:45 BGP: [W4AZV-S1BGV] 0.0.0.0/0: remove mpath path frr_if1 nexthop 0.0.0.0, cur count 1
2025/01/31 09:59:45 BGP: [S7KWG-REZFR] 0.0.0.0/0(VRF default): New mpath count (incl newbest) 1 mpath-change YES all_paths_lb 0 cum_bw 0
2025/01/31 09:59:45 BGP: [GVV1N-MH1P0] bgp_process_main_one: p=0.0.0.0/0(VRF default) afi=IPv4, safi=unicast, old_select=0x55a142120720, new_select=0x55a142224770
root@h1001:/# cat /frr.log | grep -E "RTM_NEWLINK ADD for vrfv|RTM_NEWROUTE 0.0.0.0"
2025/01/22 11:03:41 ZEBRA: [YXPF5-B2CE0] netlink_route_multipath_msg_encode: RTM_NEWROUTE 0.0.0.0/0 vrf 0(254)
2025/01/22 11:04:24 ZEBRA: [K2A4T-TS83H] RTM_NEWROUTE 0.0.0.0/0 vrf default(0) table_id: 2980024 metric: 8192 Admin Distance: 255
2025/01/22 11:04:24 ZEBRA: [W6P02-X0WYC] RTM_NEWLINK ADD for vrfv2980024(261) vrf_id 261 type 2 sl_type 0 master 0
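
The wrong ordering visible in the last two log lines above can also be searched for automatically. The sketch below assumes the zebra debug log format shown in this report and the vrfv<table id> naming used by the reproduction steps. The function name find_early_routes is hypothetical. It reports every table whose RTM_NEWROUTE was logged under "vrf default(0)" before the RTM_NEWLINK ADD for the matching VRF device.

```shell
# Scan zebra debug logs (files given as arguments, or stdin) and report
# tables whose route was processed before the matching VRF link was added.
find_early_routes() {
    awk '
    # Route logged against the default VRF: remember its table id and timestamp.
    /RTM_NEWROUTE/ && /vrf default\(0\)/ {
        for (i = 1; i < NF; i++)
            if ($i == "table_id:") pending[$(i + 1)] = $1 " " $2
        next
    }
    # VRF link added: if a route for the same table was already seen, report it.
    /RTM_NEWLINK ADD for vrfv/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /^vrfv[0-9]+\(/) {
                id = $i
                sub(/^vrfv/, "", id)   # strip the device name prefix
                sub(/\(.*/, "", id)    # strip the "(ifindex)" suffix
                if (id in pending) {
                    print "table " id ": route seen at " pending[id] " before link ADD at " $1 " " $2
                    delete pending[id]
                }
            }
    }' "$@"
}
```

Running find_early_routes /frr.log prints one line per affected table and nothing when every route arrived after its VRF link. Routes in the main table of the default VRF are ignored as long as no device named after that table id exists.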

Checklist

  • I have searched the open issues for this bug.
  • I have not included sensitive information in this report.
@Kurczaczek21 Kurczaczek21 added the triage Needs further investigation label Feb 6, 2025