8.2.2 upgrade breaks 1st node. Manual network start needed

tl5k5

Hey all,
I have a 3-node Proxmox/Ceph cluster and decided to update it. After updating the first node, networking no longer comes up on it at boot. Others have reported that their 8.2.x upgrade changed the network device names, but that's not what I'm seeing. After booting and logging in, I can manually run systemctl restart networking, and everything starts connecting and working as it should.
I've tested disabling all virtualization capabilities in the BIOS, but that did not help. I also installed intel-microcode just to see if that would fix it.

Any help would be appreciated.

Thanks!

Code:
#dmesg -l warn

[    2.032488] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[    2.032492] TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.
[    2.032493] MMIO Stale Data CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/processor_mmio_stale_data.html for more details.
[    2.143659] Invalid PCCT: 0 PCC subspaces
[    2.962993] i8042: probe of i8042 failed with error -5
[    2.991094] device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
[    2.991773] platform eisa.0: EISA: Cannot allocate resource for mainboard
[    2.991774] platform eisa.0: Cannot allocate resource for EISA slot 1
[    2.991776] platform eisa.0: Cannot allocate resource for EISA slot 2
[    2.991777] platform eisa.0: Cannot allocate resource for EISA slot 3
[    2.991779] platform eisa.0: Cannot allocate resource for EISA slot 4
[    2.991780] platform eisa.0: Cannot allocate resource for EISA slot 5
[    2.991782] platform eisa.0: Cannot allocate resource for EISA slot 6
[    2.991783] platform eisa.0: Cannot allocate resource for EISA slot 7
[    2.991784] platform eisa.0: Cannot allocate resource for EISA slot 8
[    3.057920] ENERGY_PERF_BIAS: Set to 'normal', was 'performance'
[    3.593471] lpc_ich 0000:00:1f.0: No MFD cells added
[    3.627784] bnxt_en 0000:29:00.0 (unnamed net_device) (uninitialized): Device requests max timeout of 100 seconds, may trigger hung task watchdog
[    3.658620] bnxt_en 0000:29:00.1 (unnamed net_device) (uninitialized): Device requests max timeout of 100 seconds, may trigger hung task watchdog
[    4.777073] device-mapper: thin: Data device (dm-7) discard unsupported: Disabling discard passdown.
[    5.955566] ERST: [Firmware Warn]: too many record IDs!
[    6.204355] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit uses KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update the service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[    6.205157] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit uses KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update the service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[    6.205560] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit uses KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update the service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[    6.205955] systemd[1]: /lib/systemd/system/ceph-volume@.service:8: Unit uses KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update the service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[    6.320226] pstore: backend 'erst' already in use: ignoring 'efi_pstore'
[    6.346943] spl: loading out-of-tree module taints kernel.
[    6.393504] zfs: module license 'CDDL' taints kernel.
[    6.393508] Disabling lock debugging due to kernel taint
[    6.393528] zfs: module license taints kernel.
[    7.139662] power_meter ACPI000D:00: Ignoring unsafe software power cap!
[    7.139666] power_meter ACPI000D:00: hwmon_device_register() is deprecated. Please convert the driver to use hwmon_device_register_with_info().
[    7.197992] ipmi_si IPI0001:00: The BMC does not support clearing the recv irq bit, compensating, but the BMC needs to be fixed.
[    7.256194] ------------[ cut here ]------------
[    7.256200] CPU: 14 PID: 2304 Comm: (udev-worker) Tainted: P           O       6.8.4-2-pve #1
[    7.256203] Hardware name: HPE ProLiant DL580 Gen10/ProLiant DL580 Gen10, BIOS U34 07/20/2023
[    7.256204] Call Trace:
[    7.256206]  <TASK>
[    7.256208]  dump_stack_lvl+0x48/0x70
[    7.256214]  dump_stack+0x10/0x20
[    7.256215]  __ubsan_handle_shift_out_of_bounds+0x1ac/0x360
[    7.256221]  bnxt_qplib_alloc_init_hwq.cold+0x8c/0xd7 [bnxt_re]
[    7.256238]  bnxt_qplib_create_qp+0x1d5/0x8c0 [bnxt_re]
[    7.256250]  ? bnxt_re_create_qp+0x5f4/0xf30 [bnxt_re]
[    7.256264]  bnxt_re_create_qp+0x71d/0xf30 [bnxt_re]
[    7.256273]  ? __kmalloc+0x1ab/0x400
[    7.256278]  create_qp+0x17a/0x290 [ib_core]
[    7.256310]  ? create_qp+0x17a/0x290 [ib_core]
[    7.256336]  ib_create_qp_kernel+0x3b/0xe0 [ib_core]
[    7.256361]  create_mad_qp+0x8e/0x100 [ib_core]
[    7.256393]  ? __pfx_qp_event_handler+0x10/0x10 [ib_core]
[    7.256423]  ib_mad_init_device+0x2c2/0x8a0 [ib_core]
[    7.256454]  add_client_context+0x127/0x1c0 [ib_core]
[    7.256482]  enable_device_and_get+0xe6/0x1e0 [ib_core]
[    7.256509]  ib_register_device+0x506/0x610 [ib_core]
[    7.256539]  bnxt_re_probe+0xe7d/0x11a0 [bnxt_re]
[    7.256550]  ? __pfx_bnxt_re_probe+0x10/0x10 [bnxt_re]
[    7.256559]  auxiliary_bus_probe+0x3e/0xa0
[    7.256562]  really_probe+0x1c9/0x430
[    7.256566]  __driver_probe_device+0x8c/0x190
[    7.256568]  driver_probe_device+0x24/0xd0
[    7.256571]  __driver_attach+0x10b/0x210
[    7.256573]  ? __pfx___driver_attach+0x10/0x10
[    7.256576]  bus_for_each_dev+0x8a/0xf0
[    7.256578]  driver_attach+0x1e/0x30
[    7.256580]  bus_add_driver+0x156/0x260
[    7.256583]  driver_register+0x5e/0x130
[    7.256586]  __auxiliary_driver_register+0x73/0xf0
[    7.256589]  ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
[    7.256597]  bnxt_re_mod_init+0x3e/0xff0 [bnxt_re]
[    7.256605]  ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
[    7.256612]  do_one_initcall+0x5b/0x340
[    7.256617]  do_init_module+0x97/0x290
[    7.256620]  load_module+0x213a/0x22a0
[    7.256627]  init_module_from_file+0x96/0x100
[    7.256630]  ? init_module_from_file+0x96/0x100
[    7.256634]  idempotent_init_module+0x11c/0x2b0
[    7.256639]  __x64_sys_finit_module+0x64/0xd0
[    7.256640]  do_syscall_64+0x84/0x180
[    7.256643]  ? do_syscall_64+0x93/0x180
[    7.256646]  ? syscall_exit_to_user_mode+0x86/0x260
[    7.256648]  ? do_syscall_64+0x93/0x180
[    7.256650]  ? do_syscall_64+0x93/0x180
[    7.256651]  ? exc_page_fault+0x94/0x1b0
[    7.256653]  entry_SYSCALL_64_after_hwframe+0x73/0x7b
[    7.256656] RIP: 0033:0x749e762ff719
[    7.256667] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
[    7.256669] RSP: 002b:00007ffee1332be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[    7.256671] RAX: ffffffffffffffda RBX: 000062fd9e78b920 RCX: 0000749e762ff719
[    7.256673] RDX: 0000000000000000 RSI: 0000749e76492efd RDI: 000000000000000f
[    7.256674] RBP: 0000749e76492efd R08: 0000000000000000 R09: 000062fd9e7399b0
[    7.256675] R10: 000000000000000f R11: 0000000000000246 R12: 0000000000020000
[    7.256676] R13: 0000000000000000 R14: 000062fd9e770e70 R15: 000062fd9dc51ec1
[    7.256679]  </TASK>
[    7.256680] ---[ end trace ]---
[  109.288123] bnxt_en 0000:29:00.0: QPLIB: bnxt_re_is_fw_stalled: FW STALL Detected. cmdq[0xe]=0x3 waited (102032 > 100000) msec active 1
[  109.288328] ------------[ cut here ]------------
[  109.288331] WARNING: CPU: 14 PID: 2304 at drivers/infiniband/core/cq.c:322 ib_free_cq+0x109/0x150 [ib_core]
[  109.288436] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common nfit x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd rapl bnxt_re(+) ib_uverbs intel_cstate pcspkr acpi_power_meter mgag200 mei_me ib_core ipmi_si ioatdma acpi_ipmi mei intel_pch_thermal i2c_algo_bit hpilo dca ipmi_devintf acpi_tad ipmi_msghandler joydev input_leds mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq hid_generic usbmouse usbkbd usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c ses enclosure uas usb_storage xhci_pci xhci_pci_renesas crc32_pclmul smartpqi ehci_pci bnxt_en scsi_transport_sas xhci_hcd ehci_hcd lpc_ich wmi
[  109.288581] CPU: 14 PID: 2304 Comm: (udev-worker) Tainted: P           O       6.8.4-2-pve #1
[  109.288583] Hardware name: HPE ProLiant DL580 Gen10/ProLiant DL580 Gen10, BIOS U34 07/20/2023
[  109.288585] RIP: 0010:ib_free_cq+0x109/0x150 [ib_core]
[  109.288610] Code: e8 fc 9c 02 00 65 ff 0d 9d 07 1f 3e 0f 85 70 ff ff ff 0f 1f 44 00 00 e9 66 ff ff ff 48 8d 7f 50 e8 6c ba cc e2 e9 35 ff ff ff <0f> 0b 31 c0 31 f6 31 ff c3 cc cc cc cc 0f 0b eb 80 44 0f b6 25 64
[  109.288612] RSP: 0018:ffffacb261967630 EFLAGS: 00010202
[  109.288614] RAX: 0000000000000002 RBX: 0000000000000001 RCX: 0000000000000000
[  109.288616] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9d4cde98ac00
[  109.288617] RBP: ffffacb2619676a0 R08: 0000000000000000 R09: 0000000000000000
[  109.288618] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9d4d07200000
[  109.288620] R13: ffff9d4cc2245300 R14: 00000000ffffff92 R15: ffff9d4ce79e8000
[  109.288621] FS:  0000749e7610e8c0(0000) GS:ffff9d588f300000(0000) knlGS:0000000000000000
[  109.288623] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  109.288624] CR2: 000062fd9e76c0c8 CR3: 0000000c6e1c6004 CR4: 00000000007706f0
[  109.288626] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  109.288627] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  109.288628] PKRU: 55555554
[  109.288629] Call Trace:
[  109.288631]  <TASK>
[  109.288632]  ? show_regs+0x6d/0x80
[  109.288638]  ? __warn+0x89/0x160
[  109.288643]  ? ib_free_cq+0x109/0x150 [ib_core]
[  109.288668]  ? report_bug+0x17e/0x1b0
[  109.288673]  ? handle_bug+0x46/0x90
[  109.288678]  ? exc_invalid_op+0x18/0x80
[  109.288680]  ? asm_exc_invalid_op+0x1b/0x20
[  109.288685]  ? ib_free_cq+0x109/0x150 [ib_core]
[  109.288709]  ? ib_mad_init_device+0x54c/0x8a0 [ib_core]
[  109.288739]  add_client_context+0x127/0x1c0 [ib_core]
[  109.288765]  enable_device_and_get+0xe6/0x1e0 [ib_core]
[  109.288791]  ib_register_device+0x506/0x610 [ib_core]
[  109.288819]  bnxt_re_probe+0xe7d/0x11a0 [bnxt_re]
[  109.288832]  ? __pfx_bnxt_re_probe+0x10/0x10 [bnxt_re]
[  109.288841]  auxiliary_bus_probe+0x3e/0xa0
[  109.288845]  really_probe+0x1c9/0x430
[  109.288848]  __driver_probe_device+0x8c/0x190
[  109.288851]  driver_probe_device+0x24/0xd0
[  109.288854]  __driver_attach+0x10b/0x210
[  109.288856]  ? __pfx___driver_attach+0x10/0x10
[  109.288859]  bus_for_each_dev+0x8a/0xf0
[  109.288861]  driver_attach+0x1e/0x30
[  109.288863]  bus_add_driver+0x156/0x260
[  109.288866]  driver_register+0x5e/0x130
[  109.288869]  __auxiliary_driver_register+0x73/0xf0
[  109.288871]  ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
[  109.288880]  bnxt_re_mod_init+0x3e/0xff0 [bnxt_re]
[  109.288887]  ? __pfx_bnxt_re_mod_init+0x10/0x10 [bnxt_re]
[  109.288894]  do_one_initcall+0x5b/0x340
[  109.288899]  do_init_module+0x97/0x290
[  109.288903]  load_module+0x213a/0x22a0
[  109.288909]  init_module_from_file+0x96/0x100
[  109.288912]  ? init_module_from_file+0x96/0x100
[  109.288916]  idempotent_init_module+0x11c/0x2b0
[  109.288921]  __x64_sys_finit_module+0x64/0xd0
[  109.288923]  do_syscall_64+0x84/0x180
[  109.288925]  ? do_syscall_64+0x93/0x180
[  109.288927]  ? syscall_exit_to_user_mode+0x86/0x260
[  109.288930]  ? do_syscall_64+0x93/0x180
[  109.288931]  ? do_syscall_64+0x93/0x180
[  109.288933]  ? exc_page_fault+0x94/0x1b0
[  109.288935]  entry_SYSCALL_64_after_hwframe+0x73/0x7b
[  109.288937] RIP: 0033:0x749e762ff719
[  109.288952] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 64 89 01 48
[  109.288953] RSP: 002b:00007ffee1332be8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[  109.288955] RAX: ffffffffffffffda RBX: 000062fd9e78b920 RCX: 0000749e762ff719
[  109.288957] RDX: 0000000000000000 RSI: 0000749e76492efd RDI: 000000000000000f
[  109.288958] RBP: 0000749e76492efd R08: 0000000000000000 R09: 000062fd9e7399b0
[  109.288959] R10: 000000000000000f R11: 0000000000000246 R12: 0000000000020000
[  109.288961] R13: 0000000000000000 R14: 000062fd9e770e70 R15: 000062fd9dc51ec1
[  109.288964]  </TASK>
[  109.288965] ---[ end trace 0000000000000000 ]---
 
Code:
#dmesg -l err
[    1.805402] x86/cpu: VMX (outside TXT) disabled by BIOS
[    2.962991] i8042: Can't read CTR while initializing i8042
[    7.256196] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
[    7.256198] shift exponent 64 is too large for 64-bit type 'long unsigned int'
[  109.288139] bnxt_en 0000:29:00.0 bnxt_re0: Failed to modify HW QP
[  109.288147] infiniband bnxt_re0: Couldn't change QP1 state to INIT: -110
[  109.288155] infiniband bnxt_re0: Couldn't start port
[  109.288289] bnxt_en 0000:29:00.0 bnxt_re0: Failed to destroy HW QP
[  109.288968] bnxt_en 0000:29:00.0 bnxt_re0: Free MW failed: 0xffffff92
[  109.288972] infiniband bnxt_re0: Couldn't open port 1





Code:
#dmesg | grep bnxt_en
[    3.627784] bnxt_en 0000:29:00.0 (unnamed net_device) (uninitialized): Device requests max timeout of 100 seconds, may trigger hung task watchdog
[    3.658286] bnxt_en 0000:29:00.0 eth0: Broadcom BCM57414 NetXtreme-E 10Gb/25Gb Ethernet found at mem e2210000, node addr 5c:ba:2c:67:8f:e0
[    3.658294] bnxt_en 0000:29:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[    3.658620] bnxt_en 0000:29:00.1 (unnamed net_device) (uninitialized): Device requests max timeout of 100 seconds, may trigger hung task watchdog
[    3.686724] bnxt_en 0000:29:00.1 eth1: Broadcom BCM57414 NetXtreme-E 10Gb/25Gb Ethernet found at mem e2200000, node addr 5c:ba:2c:67:8f:e8
[    3.686731] bnxt_en 0000:29:00.1: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
[    3.696989] bnxt_en 0000:29:00.0 eno1np0: renamed from eth0
[    3.713015] bnxt_en 0000:29:00.1 eno2np1: renamed from eth1
[  109.288123] bnxt_en 0000:29:00.0: QPLIB: bnxt_re_is_fw_stalled: FW STALL Detected. cmdq[0xe]=0x3 waited (102032 > 100000) msec active 1
[  109.288139] bnxt_en 0000:29:00.0 bnxt_re0: Failed to modify HW QP
[  109.288289] bnxt_en 0000:29:00.0 bnxt_re0: Failed to destroy HW QP
[  109.288436] Modules linked in: intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common nfit x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd rapl bnxt_re(+) ib_uverbs intel_cstate pcspkr acpi_power_meter mgag200 mei_me ib_core ipmi_si ioatdma acpi_ipmi mei intel_pch_thermal i2c_algo_bit hpilo dca ipmi_devintf acpi_tad ipmi_msghandler joydev input_leds mac_hid zfs(PO) spl(O) vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq hid_generic usbmouse usbkbd usbhid hid dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c ses enclosure uas usb_storage xhci_pci xhci_pci_renesas crc32_pclmul smartpqi ehci_pci bnxt_en scsi_transport_sas xhci_hcd ehci_hcd lpc_ich wmi
[  109.288968] bnxt_en 0000:29:00.0 bnxt_re0: Free MW failed: 0xffffff92
[  211.688179] bnxt_en 0000:29:00.1: QPLIB: bnxt_re_is_fw_stalled: FW STALL Detected. cmdq[0xe]=0x3 waited (102365 > 100000) msec active 1
[  211.688196] bnxt_en 0000:29:00.1 bnxt_re1: Failed to modify HW QP
[  211.688411] bnxt_en 0000:29:00.1 bnxt_re1: Failed to destroy HW QP
[  211.688473] bnxt_en 0000:29:00.1 bnxt_re1: Free MW failed: 0xffffff92
[  746.952441] bnxt_en 0000:29:00.0 eno1np0: NIC Link is Up, 10000 Mbps (NRZ) full duplex, Flow control: none
[  746.952448] bnxt_en 0000:29:00.0 eno1np0: FEC autoneg off encoding: None
[  747.266321] bnxt_en 0000:29:00.1 eno2np1: NIC Link is Up, 10000 Mbps (NRZ) full duplex, Flow control: ON - receive & transmit
[  747.266328] bnxt_en 0000:29:00.1 eno2np1: FEC autoneg off encoding: None
[  748.462338] bnxt_en 0000:29:00.0 eno1np0: entered promiscuous mode
[  748.462638] bnxt_en 0000:29:00.1 eno2np1: entered promiscuous mode
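
To make the stall entries easier to spot in a long log, here is a minimal sketch that pulls the "FW STALL Detected" lines out of kernel log text and prints the PCI address together with the reported wait. The function name fw_stalls is made up for illustration, and the pattern assumes the exact message format shown in the output above; other driver versions may word it differently.

```shell
# Hypothetical helper: summarize bnxt_re firmware-stall messages.
# Reads kernel log text on stdin; assumes the message format seen above.
fw_stalls() {
    grep 'FW STALL Detected' |
        sed -E 's/.*bnxt_en ([0-9a-f:.]+):.*waited \(([0-9]+) > ([0-9]+)\) msec.*/\1 waited \2 ms (limit \3 ms)/'
}

# Usage (typically needs root):
#   dmesg | fw_stalls
```

Fed the first stall line from the log above, this prints "0000:29:00.0 waited 102032 ms (limit 100000 ms)".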
 
I am having a similar issue, but on a Realtek quad-port NIC with two LACP bonds. I can ping the IP locally, but nothing else on the network segment. All bonds and NICs are marked as "up". I did a lot of research, and it seems to point to an issue with ASPM.
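
The post does not say how ASPM was checked, so the following is only a rough sketch of one way to look at it. It filters `lspci -vv` output down to the LnkCtl lines, which report the ASPM state of each PCIe link, and prefixes each with its device address. The function name aspm_status is hypothetical, and the filter assumes the usual `lspci -vv` layout (unindented device header, indented capability lines).

```shell
# Hypothetical helper: show the ASPM state per PCIe device.
# Reads `lspci -vv` text on stdin; assumes the usual lspci layout.
aspm_status() {
    awk '
        # Unindented header line, e.g. "29:00.0 Ethernet controller: ..."
        /^[0-9a-f:]+\.[0-9a-f] / { dev = $1 }
        # Link control line, which carries the current ASPM setting
        /LnkCtl:/ && /ASPM/ { $1 = $1; print dev, $0 }
    '
}

# Usage (run as root so lspci can read the capability registers):
#   lspci -vv | aspm_status
#   cat /sys/module/pcie_aspm/parameters/policy   # active kernel ASPM policy
```

This only reports the current state; it does not confirm that ASPM is the cause of the connectivity problem.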
 
