Kernel error when loading ib_mthca

MathieuVII

Member
Oct 21, 2020
Hi,

I'm using an old InfiniBand card (MT25208 InfiniHost III Ex) at home, which stopped working a few kernel versions ago.

pveversion:
pve-manager/7.2-4/ca9d43cc (running kernel: 5.15.35-1-pve)
When I load ib_mthca, I get some kernel errors. I don't know where this part of the kernel is patched, or where I should file a bug report (if I should file one at all).
Code:
[604128.986395] ib_mthca: unknown parameter 'debug_level' ignored
[604128.986628] ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
[604128.986631] ib_mthca: Initializing 0000:02:00.0
[604128.986647] ib_mthca 0000:02:00.0: enabling device (0140 -> 0142)
[604130.191695] ================================================================================
[604130.191766] UBSAN: shift-out-of-bounds in drivers/infiniband/hw/mthca/mthca_cmd.c:1089:24
[604130.191823] shift exponent 32 is too large for 32-bit type 'int'
[604130.191865] CPU: 3 PID: 2754261 Comm: modprobe Tainted: P           O      5.15.35-1-pve #1
[604130.191869] Hardware name: Supermicro Super Server/A2SDi-4C-HLN4F, BIOS 1.5 05/17/2021
[604130.191872] Call Trace:
[604130.191875]  <TASK>
[604130.191878]  dump_stack_lvl+0x4a/0x5f
[604130.191886]  dump_stack+0x10/0x12
[604130.191888]  ubsan_epilogue+0x9/0x45
[604130.191891]  __ubsan_handle_shift_out_of_bounds.cold+0x61/0xef
[604130.191897]  mthca_QUERY_DEV_LIM.cold+0x237/0x260 [ib_mthca]
[604130.191915]  mthca_dev_lim+0x50/0x31a [ib_mthca]
[604130.191928]  mthca_init_tavor+0xc4/0x1b7 [ib_mthca]
[604130.191941]  ? insert_vmap_area.constprop.0+0xb4/0xf0
[604130.191946]  ? __cond_resched+0x1a/0x50
[604130.191950]  ? down_write+0x13/0x50
[604130.191953]  ? kernfs_activate+0xc5/0x110
[604130.191957]  ? kernfs_add_one+0xe2/0x130
[604130.191960]  ? __kernfs_create_file+0x7f/0xc0
[604130.191963]  ? sysfs_add_file_mode_ns+0x8a/0x180
[604130.191967]  ? sysfs_create_file_ns+0x66/0x90
[604130.191971]  ? device_create_file+0x42/0x80
[604130.191975]  ? dma_pool_create+0x198/0x240
[604130.191980]  __mthca_init_one+0x355/0x69f [ib_mthca]
[604130.191993]  ? vprintk_default+0x1d/0x20
[604130.191996]  ? vprintk+0x58/0x90
[604130.191999]  ? _printk+0x58/0x6f
[604130.192004]  mthca_init_one.cold+0x44/0x92 [ib_mthca]
[604130.192018]  local_pci_probe+0x4b/0x90
[604130.192022]  pci_device_probe+0x115/0x1f0
[604130.192026]  really_probe+0x21e/0x420
[604130.192029]  __driver_probe_device+0x115/0x190
[604130.192033]  driver_probe_device+0x23/0xc0
[604130.192036]  __driver_attach+0xbd/0x1d0
[604130.192039]  ? __device_attach_driver+0x110/0x110
[604130.192043]  bus_for_each_dev+0x7e/0xc0
[604130.192046]  driver_attach+0x1e/0x20
[604130.192049]  bus_add_driver+0x135/0x200
[604130.192052]  driver_register+0x91/0xf0
[604130.192056]  __pci_register_driver+0x68/0x70
[604130.192058]  mthca_init+0x169/0x17f [ib_mthca]
[604130.192071]  ? __mthca_check_profile_val+0x8a/0x8a [ib_mthca]
[604130.192083]  do_one_initcall+0x48/0x1d0
[604130.192087]  ? kmem_cache_alloc_trace+0x19e/0x2e0
[604130.192092]  do_init_module+0x52/0x280
[604130.192096]  load_module+0x2613/0x2a00
[604130.192102]  __do_sys_finit_module+0xbf/0x120
[604130.192106]  __x64_sys_finit_module+0x1a/0x20
[604130.192110]  do_syscall_64+0x5c/0xc0
[604130.192112]  ? syscall_exit_to_user_mode+0x27/0x50
[604130.192116]  ? __x64_sys_newfstat+0x16/0x20
[604130.192120]  ? do_syscall_64+0x69/0xc0
[604130.192122]  ? do_syscall_64+0x69/0xc0
[604130.192124]  ? syscall_exit_to_user_mode+0x27/0x50
[604130.192127]  ? __x64_sys_close+0x12/0x40
[604130.192130]  ? do_syscall_64+0x69/0xc0
[604130.192133]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[604130.192137] RIP: 0033:0x7f84f5d0d9b9
[604130.192141] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a7 54 0c 00 f7 d8 64 89 01 48
[604130.192144] RSP: 002b:00007ffdd72b1408 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[604130.192149] RAX: ffffffffffffffda RBX: 0000561eb0b65e50 RCX: 00007f84f5d0d9b9
[604130.192151] RDX: 0000000000000000 RSI: 0000561eb0b66620 RDI: 0000000000000003
[604130.192153] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000000
[604130.192154] R10: 0000000000000003 R11: 0000000000000246 R12: 0000561eb0b66620
[604130.192156] R13: 0000000000000000 R14: 0000561eb0b66050 R15: 0000561eb0b65e50
[604130.192160]  </TASK>
[604130.192161] ================================================================================
[604130.192216] ================================================================================
[604130.192271] UBSAN: shift-out-of-bounds in drivers/infiniband/hw/mthca/mthca_cmd.c:1101:26
[604130.192325] shift exponent 64 is too large for 32-bit type 'int'
[604130.192365] CPU: 3 PID: 2754261 Comm: modprobe Tainted: P           O      5.15.35-1-pve #1
[604130.192368] Hardware name: Supermicro Super Server/A2SDi-4C-HLN4F, BIOS 1.5 05/17/2021
[604130.192370] Call Trace:
[604130.192371]  <TASK>
[604130.192373]  dump_stack_lvl+0x4a/0x5f
[604130.192376]  dump_stack+0x10/0x12
[604130.192379]  ubsan_epilogue+0x9/0x45
[604130.192382]  __ubsan_handle_shift_out_of_bounds.cold+0x61/0xef
[604130.192386]  mthca_QUERY_DEV_LIM.cold+0x17/0x260 [ib_mthca]
[604130.192401]  mthca_dev_lim+0x50/0x31a [ib_mthca]
[604130.192414]  mthca_init_tavor+0xc4/0x1b7 [ib_mthca]
[604130.192427]  ? insert_vmap_area.constprop.0+0xb4/0xf0
[604130.192430]  ? __cond_resched+0x1a/0x50
[604130.192433]  ? down_write+0x13/0x50
[604130.192436]  ? kernfs_activate+0xc5/0x110
[604130.192439]  ? kernfs_add_one+0xe2/0x130
[604130.192442]  ? __kernfs_create_file+0x7f/0xc0
[604130.192445]  ? sysfs_add_file_mode_ns+0x8a/0x180
[604130.192449]  ? sysfs_create_file_ns+0x66/0x90
[604130.192452]  ? device_create_file+0x42/0x80
[604130.192456]  ? dma_pool_create+0x198/0x240
[604130.192460]  __mthca_init_one+0x355/0x69f [ib_mthca]
[604130.192473]  ? vprintk_default+0x1d/0x20
[604130.192476]  ? vprintk+0x58/0x90
[604130.192478]  ? _printk+0x58/0x6f
[604130.192482]  mthca_init_one.cold+0x44/0x92 [ib_mthca]
[604130.192496]  local_pci_probe+0x4b/0x90
[604130.192499]  pci_device_probe+0x115/0x1f0
[604130.192502]  really_probe+0x21e/0x420
[604130.192506]  __driver_probe_device+0x115/0x190
[604130.192509]  driver_probe_device+0x23/0xc0
[604130.192512]  __driver_attach+0xbd/0x1d0
[604130.192516]  ? __device_attach_driver+0x110/0x110
[604130.192519]  bus_for_each_dev+0x7e/0xc0
[604130.192523]  driver_attach+0x1e/0x20
[604130.192525]  bus_add_driver+0x135/0x200
[604130.192529]  driver_register+0x91/0xf0
[604130.192532]  __pci_register_driver+0x68/0x70
[604130.192535]  mthca_init+0x169/0x17f [ib_mthca]
[604130.192547]  ? __mthca_check_profile_val+0x8a/0x8a [ib_mthca]
[604130.192559]  do_one_initcall+0x48/0x1d0
[604130.192563]  ? kmem_cache_alloc_trace+0x19e/0x2e0
[604130.192567]  do_init_module+0x52/0x280
[604130.192570]  load_module+0x2613/0x2a00
[604130.192576]  __do_sys_finit_module+0xbf/0x120
[604130.192581]  __x64_sys_finit_module+0x1a/0x20
[604130.192584]  do_syscall_64+0x5c/0xc0
[604130.192586]  ? syscall_exit_to_user_mode+0x27/0x50
[604130.192589]  ? __x64_sys_newfstat+0x16/0x20
[604130.192592]  ? do_syscall_64+0x69/0xc0
[604130.192594]  ? do_syscall_64+0x69/0xc0
[604130.192597]  ? syscall_exit_to_user_mode+0x27/0x50
[604130.192600]  ? __x64_sys_close+0x12/0x40
[604130.192602]  ? do_syscall_64+0x69/0xc0
[604130.192604]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[604130.192608] RIP: 0033:0x7f84f5d0d9b9
[604130.192611] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a7 54 0c 00 f7 d8 64 89 01 48
[604130.192613] RSP: 002b:00007ffdd72b1408 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[604130.192616] RAX: ffffffffffffffda RBX: 0000561eb0b65e50 RCX: 00007f84f5d0d9b9
[604130.192618] RDX: 0000000000000000 RSI: 0000561eb0b66620 RDI: 0000000000000003
[604130.192620] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000000
[604130.192622] R10: 0000000000000003 R11: 0000000000000246 R12: 0000561eb0b66620
[604130.192624] R13: 0000000000000000 R14: 0000561eb0b66050 R15: 0000561eb0b65e50
[604130.192627]  </TASK>
[604130.192628] ================================================================================
[604130.192683] ================================================================================
[604130.192737] UBSAN: shift-out-of-bounds in drivers/infiniband/hw/mthca/mthca_cmd.c:1212:34
[604130.192791] shift exponent 32 is too large for 32-bit type 'int'
[604130.192832] CPU: 3 PID: 2754261 Comm: modprobe Tainted: P           O      5.15.35-1-pve #1
[604130.192835] Hardware name: Supermicro Super Server/A2SDi-4C-HLN4F, BIOS 1.5 05/17/2021
[604130.192836] Call Trace:
[604130.192837]  <TASK>
[604130.192839]  dump_stack_lvl+0x4a/0x5f
[604130.192842]  dump_stack+0x10/0x12
[604130.192845]  ubsan_epilogue+0x9/0x45
[604130.192848]  __ubsan_handle_shift_out_of_bounds.cold+0x61/0xef
[604130.192852]  mthca_QUERY_DEV_LIM.cold+0x217/0x260 [ib_mthca]
[604130.192866]  mthca_dev_lim+0x50/0x31a [ib_mthca]
[604130.192880]  mthca_init_tavor+0xc4/0x1b7 [ib_mthca]
[604130.192892]  ? insert_vmap_area.constprop.0+0xb4/0xf0
[604130.192896]  ? __cond_resched+0x1a/0x50
[604130.192899]  ? down_write+0x13/0x50
[604130.192901]  ? kernfs_activate+0xc5/0x110
[604130.192905]  ? kernfs_add_one+0xe2/0x130
[604130.192908]  ? __kernfs_create_file+0x7f/0xc0
[604130.192913]  __mthca_init_one+0x355/0x69f [ib_mthca]
[604130.192926]  ? vprintk_default+0x1d/0x20
[604130.192929]  ? vprintk+0x58/0x90
[604130.192932]  ? _printk+0x58/0x6f
[604130.192935]  mthca_init_one.cold+0x44/0x92 [ib_mthca]
[604130.192949]  local_pci_probe+0x4b/0x90
[604130.192952]  pci_device_probe+0x115/0x1f0
[604130.192955]  really_probe+0x21e/0x420
[604130.192959]  __driver_probe_device+0x115/0x190
[604130.192962]  driver_probe_device+0x23/0xc0
[604130.192966]  __driver_attach+0xbd/0x1d0
[604130.192969]  ? __device_attach_driver+0x110/0x110
[604130.192972]  bus_for_each_dev+0x7e/0xc0
[604130.192976]  driver_attach+0x1e/0x20
[604130.192979]  bus_add_driver+0x135/0x200
[604130.192982]  driver_register+0x91/0xf0
[604130.192985]  __pci_register_driver+0x68/0x70
[604130.192988]  mthca_init+0x169/0x17f [ib_mthca]
[604130.193000]  ? __mthca_check_profile_val+0x8a/0x8a [ib_mthca]
[604130.193012]  do_one_initcall+0x48/0x1d0
[604130.193016]  ? kmem_cache_alloc_trace+0x19e/0x2e0
[604130.193020]  do_init_module+0x52/0x280
[604130.193023]  load_module+0x2613/0x2a00
[604130.193029]  __do_sys_finit_module+0xbf/0x120
[604130.193034]  __x64_sys_finit_module+0x1a/0x20
[604130.193037]  do_syscall_64+0x5c/0xc0
[604130.193039]  ? syscall_exit_to_user_mode+0x27/0x50
[604130.193042]  ? __x64_sys_newfstat+0x16/0x20
[604130.193045]  ? do_syscall_64+0x69/0xc0
[604130.193047]  ? do_syscall_64+0x69/0xc0
[604130.193050]  ? syscall_exit_to_user_mode+0x27/0x50
[604130.193053]  ? __x64_sys_close+0x12/0x40
[604130.193055]  ? do_syscall_64+0x69/0xc0
[604130.193057]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[604130.193061] RIP: 0033:0x7f84f5d0d9b9
[604130.193063] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a7 54 0c 00 f7 d8 64 89 01 48
[604130.193066] RSP: 002b:00007ffdd72b1408 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[604130.193069] RAX: ffffffffffffffda RBX: 0000561eb0b65e50 RCX: 00007f84f5d0d9b9
[604130.193071] RDX: 0000000000000000 RSI: 0000561eb0b66620 RDI: 0000000000000003
[604130.193073] RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000000
[604130.193075] R10: 0000000000000003 R11: 0000000000000246 R12: 0000561eb0b66620
[604130.193077] R13: 0000000000000000 R14: 0000561eb0b66050 R15: 0000561eb0b65e50
[604130.193080]  </TASK>
[604130.193081] ================================================================================

Thank you
 
It looks a bit like a UBSAN warning; see the following thread for a bit more background:

https://forum.proxmox.com/threads/proxmox-7-2-upgrade-broke-my-raid.109368/

Does the system continue to work, or does it hang/crash?
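
To tell the UBSAN reports apart from an actual driver failure, it may help to list the distinct source locations UBSAN flagged. This is only a minimal sketch: it assumes the report lines keep the `UBSAN: <kind> in <file>:<line>:<col>` shape seen in the log above, and the function name `ubsan_sites` is made up for illustration.

```shell
#!/bin/sh
# ubsan_sites: read kernel log text on stdin and print each distinct
# "file:line:col" location named in a UBSAN report.
# (Hypothetical helper for illustration; not part of any Proxmox tooling.)
ubsan_sites() {
    # Keep only the text after "UBSAN: <kind> in ", then drop duplicates.
    sed -n 's/^.*UBSAN: [^ ]* in \(.*\)$/\1/p' | sort -u
}
```

It could be fed with `dmesg | ubsan_sites`. Separately, `lspci -k -s 02:00.0` (the address is the one from the log above) shows whether a kernel driver is currently bound to the card.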
Thank you for your answer, I'm going to look at the link you provided.

Overall, the system still works, but without support for the InfiniBand card (I use a 1GbE card instead). It's a very old card, and it looks like the driver shipped with the kernel is from 2008.

Best regards,

Mathieu
 
In case you have the time and opportunity:
Since we had some similar reports - just to make sure that UBSAN is really not causing the issues - we prepared a test-kernel where UBSAN is disabled:
http://download.proxmox.com/temp/kernel-5.15.35-noubsan/

it would be great if you could:
* fetch the packages
* compare the sha512sums
* install them via `apt install /path/to/packages/pve-kernel-5.15.35-3testubsan-pve_5.15.35-1~testubsan_amd64.deb` (or with `dpkg -i`)

and boot once to see if the issues go away.
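
The checksum step above could be sketched roughly as follows. This is not an official procedure: the helper name `verify_sha512` is made up, and it assumes the expected SHA-512 value is copied by hand from wherever the checksums are published next to the packages.

```shell
#!/bin/sh
# verify_sha512 FILE EXPECTED_HEX
# Returns 0 only if FILE's SHA-512 digest equals EXPECTED_HEX.
# (Hypothetical helper for illustration.)
verify_sha512() {
    file=$1
    expected=$2
    # Reading from stdin makes sha512sum print "<hex digest>  -";
    # awk keeps just the digest.
    actual=$(sha512sum < "$file" | awk '{print $1}')
    # An empty digest means the file could not be read: treat as failure.
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

With the package downloaded, something like `verify_sha512 /path/to/packages/pve-kernel-5.15.35-3testubsan-pve_5.15.35-1~testubsan_amd64.deb "<expected hex>" && apt install /path/to/packages/pve-kernel-5.15.35-3testubsan-pve_5.15.35-1~testubsan_amd64.deb` would only install it when the digest matches.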

Thanks!
 
Thank you for your help,
I'm going to test it this week. I doubt it's caused by UBSAN, since the driver stopped working before I got those messages.

Best regards,

Mathieu
 
