PANIC in rd_fill_getroute_reply #585

Closed
andrenth opened this issue Jul 19, 2022 · 4 comments

@andrenth
Collaborator

This message has shown up twice in the logs during the current testing period of version 1.1:

PANIC in rd_fill_getroute_reply():
Invalid FIB action (6) in FIB while being processed by CPS block in rd_fill_getroute_reply

The following kernel logs have appeared at the same time in kern.log:

Jul 19 05:01:25 gtk1 kernel: [395312.386002] show_signal: 1 callbacks suppressed
Jul 19 05:01:25 gtk1 kernel: [395312.386004] traps: lcore-worker-8[14109] general protection fault ip:7fcd8f7f4c50 sp:7fcd8c3e7050 error:0 in libgcc_s.so.1[7fcd8f7e8000+12000]
Jul 19 05:01:28 gtk1 kernel: [395315.776939] BUG: kernel NULL pointer dereference, address: 0000000000000010
Jul 19 05:01:28 gtk1 kernel: [395315.777020] #PF: supervisor read access in kernel mode
Jul 19 05:01:28 gtk1 kernel: [395315.777064] #PF: error_code(0x0000) - not-present page
Jul 19 05:01:28 gtk1 kernel: [395315.777108] PGD 0 P4D 0 
Jul 19 05:01:28 gtk1 kernel: [395315.777138] Oops: 0000 [#1] SMP PTI
Jul 19 05:01:28 gtk1 kernel: [395315.777175] CPU: 30 PID: 14131 Comm: lcore-worker-30 Tainted: G           OE     5.4.0-117-generic #132-Ubuntu
Jul 19 05:01:28 gtk1 kernel: [395315.777254] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.06.E006.013120181511 01/31/2018
Jul 19 05:01:28 gtk1 kernel: [395315.777343] RIP: 0010:vmacache_find+0x29/0xc0
Jul 19 05:01:28 gtk1 kernel: [395315.777383] Code: 00 66 66 66 66 90 55 45 31 c0 65 48 8b 0c 25 c0 bb 01 00 48 89 e5 48 3b b9 10 08 00 00 74 05 4c 89 c0 5d c3 f6 41 26 20 75 f5 <48> 8b 47 10 48 3b 81 20 08 00 00 75 44 48 89 f0 ba 04 00 00 00 45
Jul 19 05:01:28 gtk1 kernel: [395315.777523] RSP: 0000:ffffb10c8ebafa00 EFLAGS: 00010246
Jul 19 05:01:28 gtk1 kernel: [395315.777568] RAX: ffff8999b1ac1740 RBX: 0000000000000000 RCX: ffff8999b1ac1740
Jul 19 05:01:28 gtk1 kernel: [395315.777626] RDX: 0000000000000000 RSI: 000000350ba54000 RDI: 0000000000000000
Jul 19 05:01:28 gtk1 kernel: [395315.777684] RBP: ffffb10c8ebafa00 R08: 0000000000000000 R09: ffffb10c8ebafbc0
Jul 19 05:01:28 gtk1 kernel: [395315.777742] R10: ffff89999f05ea80 R11: ffff89999f05ea80 R12: 000000350ba54000
Jul 19 05:01:28 gtk1 kernel: [395315.777799] R13: 000000350ba54000 R14: 0000000000000000 R15: 0000000000000000
Jul 19 05:01:28 gtk1 kernel: [395315.777858] FS:  00007fcd76fed400(0000) GS:ffff89b9beb80000(0000) knlGS:0000000000000000
Jul 19 05:01:28 gtk1 kernel: [395315.777922] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 19 05:01:28 gtk1 kernel: [395315.777970] CR2: 0000000000000010 CR3: 0000003fb8e0a003 CR4: 00000000000606e0
Jul 19 05:01:28 gtk1 kernel: [395315.778027] Call Trace:
Jul 19 05:01:28 gtk1 kernel: [395315.778060]  find_vma+0x1b/0x70
Jul 19 05:01:28 gtk1 kernel: [395315.778096]  ? __switch_to_asm+0x34/0x70
Jul 19 05:01:28 gtk1 kernel: [395315.778135]  find_extend_vma+0x22/0x90
Jul 19 05:01:28 gtk1 kernel: [395315.778171]  __get_user_pages+0xc3/0x7d0
Jul 19 05:01:28 gtk1 kernel: [395315.778209]  get_user_pages_remote+0x146/0x230
Jul 19 05:01:28 gtk1 kernel: [395315.778257]  kni_fifo_trans_pa2va+0x1d1/0x2c0 [rte_kni]
Jul 19 05:01:28 gtk1 kernel: [395315.778305]  kni_net_release_fifo_phy+0x36/0x40 [rte_kni]
Jul 19 05:01:28 gtk1 kernel: [395315.778352]  kni_dev_remove+0x33/0x40 [rte_kni]
Jul 19 05:01:28 gtk1 kernel: [395315.778394]  kni_release+0xab/0x160 [rte_kni]
Jul 19 05:01:28 gtk1 kernel: [395315.778435]  __fput+0xcc/0x260
Jul 19 05:01:28 gtk1 kernel: [395315.778466]  ____fput+0xe/0x10
Jul 19 05:01:28 gtk1 kernel: [395315.778497]  task_work_run+0x8f/0xb0
Jul 19 05:01:28 gtk1 kernel: [395315.778533]  do_exit+0x36e/0xaf0
Jul 19 05:01:28 gtk1 kernel: [395315.778567]  ? _cond_resched+0x19/0x30
Jul 19 05:01:28 gtk1 kernel: [395315.778602]  ? mutex_lock+0x13/0x40
Jul 19 05:01:28 gtk1 kernel: [395315.778636]  ? pipe_wait+0xaf/0xc0
Jul 19 05:01:28 gtk1 kernel: [395315.778670]  do_group_exit+0x47/0xb0
Jul 19 05:01:28 gtk1 kernel: [395315.778706]  get_signal+0x169/0x890
Jul 19 05:01:28 gtk1 kernel: [395315.778742]  do_signal+0x34/0x6c0
Jul 19 05:01:28 gtk1 kernel: [395315.778775]  ? __vfs_read+0x29/0x40
Jul 19 05:01:28 gtk1 kernel: [395315.778809]  ? vfs_read+0xab/0x160
Jul 19 05:01:28 gtk1 kernel: [395315.778845]  exit_to_usermode_loop+0xbf/0x160
Jul 19 05:01:28 gtk1 kernel: [395315.778885]  do_syscall_64+0x163/0x190
Jul 19 05:01:28 gtk1 kernel: [395315.778922]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 19 05:01:28 gtk1 kernel: [395315.778967] RIP: 0033:0x7fcd8ff553cc
Jul 19 05:01:28 gtk1 kernel: [395315.780417] Code: Bad RIP value.
Jul 19 05:01:28 gtk1 kernel: [395315.781845] RSP: 002b:00007fcd76fea2a0 EFLAGS: 00003246 ORIG_RAX: 0000000000000000
Jul 19 05:01:28 gtk1 kernel: [395315.783334] RAX: fffffffffffffe00 RBX: 000055d3be8756e0 RCX: 00007fcd8ff553cc
Jul 19 05:01:28 gtk1 kernel: [395315.784807] RDX: 0000000000000001 RSI: 00007fcd76fea2ef RDI: 0000000000000084
Jul 19 05:01:28 gtk1 kernel: [395315.786258] RBP: 00007fcd76fea2ef R08: 0000000000000000 R09: 00007fcd76fea2f0
Jul 19 05:01:28 gtk1 kernel: [395315.787715] R10: 000055d3be083500 R11: 0000000000003246 R12: 000055d3be874060
Jul 19 05:01:28 gtk1 kernel: [395315.789172] R13: 0000000000001680 R14: 000000000000001e R15: 0000000000000087
Jul 19 05:01:28 gtk1 kernel: [395315.790639] Modules linked in: rte_kni(OE) binfmt_misc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper rapl joydev input_leds intel_cstate ipmi_si ipmi_devintf ipmi_msghandler mgag200 drm_vram_helper ttm drm_kms_helper fb_sys_fops syscopyarea sysfillrect sysimgblt mei_me mei ioatdma mac_hid sch_fq_codel uio_pci_generic uio drm ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear raid1 isci hid_generic ixgbe igb usbhid xfrm_algo ahci libsas i2c_algo_bit lpc_ich scsi_transport_sas libahci i2c_i801 crc32_pclmul hid dca mdio wmi [last unloaded: rte_kni]
Jul 19 05:01:28 gtk1 kernel: [395315.801477] CR2: 0000000000000010
Jul 19 05:01:28 gtk1 kernel: [395315.803041] ---[ end trace 6f8b3699caf10e87 ]---
Jul 19 05:01:28 gtk1 kernel: [395315.851009] RIP: 0010:vmacache_find+0x29/0xc0
Jul 19 05:01:28 gtk1 kernel: [395315.852584] Code: 00 66 66 66 66 90 55 45 31 c0 65 48 8b 0c 25 c0 bb 01 00 48 89 e5 48 3b b9 10 08 00 00 74 05 4c 89 c0 5d c3 f6 41 26 20 75 f5 <48> 8b 47 10 48 3b 81 20 08 00 00 75 44 48 89 f0 ba 04 00 00 00 45
Jul 19 05:01:28 gtk1 kernel: [395315.855858] RSP: 0000:ffffb10c8ebafa00 EFLAGS: 00010246
Jul 19 05:01:28 gtk1 kernel: [395315.857487] RAX: ffff8999b1ac1740 RBX: 0000000000000000 RCX: ffff8999b1ac1740
Jul 19 05:01:28 gtk1 kernel: [395315.859132] RDX: 0000000000000000 RSI: 000000350ba54000 RDI: 0000000000000000
Jul 19 05:01:28 gtk1 kernel: [395315.860763] RBP: ffffb10c8ebafa00 R08: 0000000000000000 R09: ffffb10c8ebafbc0
Jul 19 05:01:28 gtk1 kernel: [395315.862411] R10: ffff89999f05ea80 R11: ffff89999f05ea80 R12: 000000350ba54000
Jul 19 05:01:28 gtk1 kernel: [395315.864076] R13: 000000350ba54000 R14: 0000000000000000 R15: 0000000000000000
Jul 19 05:01:28 gtk1 kernel: [395315.865754] FS:  00007fcd76fed400(0000) GS:ffff89b9beb80000(0000) knlGS:0000000000000000
Jul 19 05:01:28 gtk1 kernel: [395315.867461] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 19 05:01:28 gtk1 kernel: [395315.869171] CR2: 00007fcd8ff553a2 CR3: 0000003fb8e0a003 CR4: 00000000000606e0
Jul 19 05:01:28 gtk1 kernel: [395315.870917] Fixing recursive fault but reboot is needed!

An attempt to restart Gatekeeper causes an immediate reboot.

AltraMayor added this to the Version 1.1 milestone Jul 19, 2022
AltraMayor added a commit that referenced this issue Jul 29, 2022
This patch is meant to help to collect information for issue #585,
and to allow Gatekeeper servers in production to keep running even
if they find the routing table is corrupted while dumping it.

This patch also pushes issue #572 forward.
@AltraMayor
Owner

Pull request #593 logs more information about the memory corruption and allows Gatekeeper to keep running. It is not a fix, but a palliative while we work on a final solution.
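
To illustrate the log-and-continue idea, here is a minimal sketch in plain C. It is not the code from pull request #593: the enum values, the `struct fib_entry` layout, and the names `fib_action_name()` and `dump_routes()` are hypothetical and exist only for this example. It assumes the dump loop can detect an out-of-range action and skip that entry instead of panicking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical action codes; the real Gatekeeper enum is not shown in this thread. */
enum fib_action {
	FIB_ACTION_FWD_GRANTOR,
	FIB_ACTION_FWD_GATEWAY,
	FIB_ACTION_DROP,
	FIB_ACTION_MAX, /* Number of valid actions; not an action itself. */
};

/* Hypothetical, simplified FIB entry. */
struct fib_entry {
	unsigned int action;     /* Raw value read from the table; may be corrupted. */
	unsigned int prefix_len;
};

/* Return a printable name for a valid action, or NULL if the value is invalid. */
static const char *
fib_action_name(unsigned int action)
{
	switch (action) {
	case FIB_ACTION_FWD_GRANTOR:
		return "fwd-grantor";
	case FIB_ACTION_FWD_GATEWAY:
		return "fwd-gateway";
	case FIB_ACTION_DROP:
		return "drop";
	default:
		return NULL;
	}
}

/*
 * Copy the valid entries of fib[0..n) into out[] and return how many
 * were copied. Entries with an invalid action are logged and counted
 * in *skipped instead of aborting the process.
 */
static size_t
dump_routes(const struct fib_entry *fib, size_t n,
	struct fib_entry *out, size_t *skipped)
{
	size_t copied = 0;
	size_t i;

	assert(fib != NULL && out != NULL && skipped != NULL);
	*skipped = 0;

	for (i = 0; i < n; i++) {
		if (fib_action_name(fib[i].action) == NULL) {
			fprintf(stderr,
				"%s(): invalid FIB action (%u) at index %zu; skipping entry\n",
				__func__, fib[i].action, i);
			(*skipped)++;
			continue;
		}
		out[copied++] = fib[i];
	}
	return copied;
}
```

A caller would pass an output array with room for at least `n` entries and could report the `skipped` count alongside the reply, so that the corruption stays visible in the logs while the server keeps running.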

@AltraMayor
Owner

This issue is the combination of two problems: 1. a bug in the LPM iterator, and 2. a bug in the KNI driver that seems to be triggered when Gatekeeper terminates without releasing the resources associated with the kernel module of the KNI. Pull request #594 addresses the first problem.

@AltraMayor
Owner

Given that this issue is no longer reproducible in production, I'm moving the investigation of the KNI kernel module to release 1.2.

AltraMayor modified the milestones: Version 1.1, Version 1.2 Oct 6, 2022
AltraMayor added a commit that referenced this issue Mar 1, 2024
DPDK dropped its KNI library at version 23.11.
This commit replaces DPDK's KNI library with virtio-user.

This commit closes #481, closes #570, closes #585, closes #674.
AltraMayor added the bug label Mar 6, 2024
@AltraMayor
Owner

Pull request #678 dropped the KNI library, so the last problem no longer exists.
