This all started last month, when my server began getting random CPU lockups that crash or reboot the system. I have tried all of the following, with no real solution yet:
- rolled back to an older kernel
- started Proxmox with no VMs running, just the host
- ran a full memtest with no errors
- ran a prime95 test for 72 hours with no errors
- removed HSA, as for a while I thought that was the problem (I still think there might be a problem with it, but that's for after I get the host stable)
- removed all pcie devices that are not needed
- nuked my PVE 8 install and installed a fresh PVE 9.1.4
Code:
Jan 08 20:55:39 tardis kernel: watchdog: BUG: soft lockup - CPU#60 stuck for 1251s! [ebtables:33416]
Jan 08 20:55:40 tardis kernel: Modules linked in: tcp_diag inet_diag ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter nf_tables softdog sunrpc binfmt_misc bondi>
Jan 08 20:55:40 tardis kernel: CPU: 60 UID: 0 PID: 33416 Comm: ebtables Tainted: P O L 6.17.4-2-pve #1 PREEMPT(voluntary)
Jan 08 20:55:40 tardis kernel: Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [L]=SOFTLOCKUP
Jan 08 20:55:40 tardis kernel: Hardware name: Supermicro Super Server/H11SSL-i, BIOS 3.4 07/28/2025
Jan 08 20:55:40 tardis kernel: RIP: 0010:memcpy_orig+0x16/0x130
Jan 08 20:55:40 tardis kernel: Code: 21 01 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 48 89 f8 48 83 fa 20 72 7e 40 38 fe 7c 35 48 83 ea 20 48 83 ea 20 <4c> 8b 06 4c 8b 4e 08 4c 8b 56 10 4c 8b 5e 18>
Jan 08 20:55:40 tardis kernel: RSP: 0000:ffffccc210d379c0 EFLAGS: 00000202
Jan 08 20:55:40 tardis kernel: RAX: ffff8c0881a9a000 RBX: 0000000000000000 RCX: fffff4c540000000
Jan 08 20:55:40 tardis kernel: RDX: 0000000000000340 RSI: ffff8be88286dc80 RDI: ffff8c0881a9ac80
Jan 08 20:55:40 tardis kernel: RBP: ffffccc210d379c8 R08: 0000000000008000 R09: 0000000000800000
Jan 08 20:55:40 tardis kernel: R10: 0000000000600000 R11: 0000000000600000 R12: fffff4c5810a1b40
Jan 08 20:55:40 tardis kernel: R13: fffff4c60106a680 R14: 0000000000000001 R15: fffff4c5810a1b40
Jan 08 20:55:40 tardis kernel: FS: 000076f8172a9740(0000) GS:ffff8c188c186000(0000) knlGS:0000000000000000
Jan 08 20:55:40 tardis kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 08 20:55:40 tardis kernel: CR2: 000076f81750fc80 CR3: 00000010354ef000 CR4: 00000000003506f0
Jan 08 20:55:40 tardis kernel: Call Trace:
Jan 08 20:55:40 tardis kernel: <TASK>
Jan 08 20:55:40 tardis kernel: ? copy_mc_to_kernel+0x39/0x50
Jan 08 20:55:40 tardis kernel: folio_mc_copy+0x8a/0xf0
Jan 08 20:55:40 tardis kernel: ? rmap_walk_anon+0x1a3/0x220
Jan 08 20:55:40 tardis kernel: __migrate_folio.isra.0+0x9d/0x200
Jan 08 20:55:40 tardis kernel: move_to_new_folio+0x99/0x130
Jan 08 20:55:40 tardis kernel: migrate_pages_batch+0xa23/0xe70
Jan 08 20:55:40 tardis kernel: ? srso_return_thunk+0x5/0x5f
Jan 08 20:55:40 tardis kernel: ? change_pte_range+0x6fb/0xe20
Jan 08 20:55:40 tardis kernel: ? srso_return_thunk+0x5/0x5f
Jan 08 20:55:40 tardis kernel: ? change_pte_range+0x1ca/0xe20
Jan 08 20:55:40 tardis kernel: migrate_pages+0x9a7/0xda0
Jan 08 20:55:40 tardis kernel: ? __pfx_alloc_misplaced_dst_folio+0x10/0x10
Jan 08 20:55:40 tardis kernel: ? srso_return_thunk+0x5/0x5f
Jan 08 20:55:40 tardis kernel: ? lru_gen_del_folio+0x111/0x1e0
Jan 08 20:55:40 tardis kernel: migrate_misplaced_folio+0xc0/0x250
Jan 08 20:55:40 tardis kernel: __handle_mm_fault+0xdd4/0xfd0
Jan 08 20:55:40 tardis kernel: handle_mm_fault+0x119/0x370
Jan 08 20:55:40 tardis kernel: do_user_addr_fault+0x2f8/0x830
Jan 08 20:55:40 tardis kernel: exc_page_fault+0x7f/0x1b0
Jan 08 20:55:40 tardis kernel: asm_exc_page_fault+0x27/0x30
Jan 08 20:55:40 tardis kernel: RIP: 0033:0x76f81736798b
Jan 08 20:55:40 tardis kernel: Code: bb b8 12 00 48 d1 f9 48 89 0d b9 b8 12 00 48 8b 90 f0 01 00 00 48 89 15 5b 29 13 00 48 8b 90 f8 01 00 00 48 89 15 45 29 13 00 <48> 8b 90 00 02 00 00 48 89 15 7f b8 12 00 48>
Jan 08 20:55:40 tardis kernel: RSP: 002b:00007ffc3450ec08 EFLAGS: 00010216
Jan 08 20:55:40 tardis kernel: RAX: 000076f81750fa80 RBX: 000076f8172d2e00 RCX: 0000000000400000
Jan 08 20:55:40 tardis kernel: RDX: 0000000000600000 RSI: 0000000000000000 RDI: 000076f8172b5600
Jan 08 20:55:40 tardis kernel: RBP: 00007ffc3450ed20 R08: 0000000000000000 R09: 0000000000000000
Jan 08 20:55:40 tardis kernel: R10: 000076f8174d0e80 R11: 000076f8172b5600 R12: 000076f8174d0730
Jan 08 20:55:40 tardis kernel: R13: 000076f8172d33b8 R14: 000076f817491410 R15: 000076f8172ac000
Jan 08 20:55:40 tardis kernel: </TASK>
Jan 08 20:55:49 tardis systemd-journald[1489]: Received SIGTERM from PID 1 (systemd).
Jan 08 20:55:49 tardis systemd[1]: Stopping systemd-journald.service - Journal Service...
Jan 08 20:55:49 tardis systemd-journald[1489]: Journal stopped
This latest hang happened with no VMs running, or even created; it was just Proxmox on the host. If I am reading the log correctly, the lockup originates on the network side of things, but I don't know whether it comes from the NIC or from the network software. I would love some help with this, as I am really scratching my head over what is causing the problem.
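In case it helps anyone compare several hangs, here is a rough Python sketch for pulling the key fields out of a log like the one above. It is only an illustration: the function name `summarize_lockups` and both regular expressions are my own assumptions, written against the line format shown in this post, and may need adjusting for other kernels. It records the CPU, stuck time, and process from each soft lockup line, then lists the call-trace frames that are not prefixed with `?` (the `?` marks frames the kernel's unwinder is unsure about).

```python
import re

# Assumed format, taken from the watchdog line in this post:
# "... soft lockup - CPU#60 stuck for 1251s! [ebtables:33416]"
LOCKUP_RE = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)

# A call-trace frame such as "kernel: ? folio_mc_copy+0x8a/0xf0".
# The optional "? " prefix marks a frame the unwinder could not verify.
FRAME_RE = re.compile(
    r"kernel:\s+(?P<guess>\? )?(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_lockups(lines):
    """Return one dict per soft lockup, with the verified trace frames."""
    events = []
    in_trace = False
    for line in lines:
        hit = LOCKUP_RE.search(line)
        if hit:
            events.append({
                "cpu": int(hit.group("cpu")),
                "stuck_seconds": int(hit.group("secs")),
                "comm": hit.group("comm"),
                "pid": int(hit.group("pid")),
                "frames": [],
            })
            in_trace = False
            continue
        if not events:
            continue  # ignore everything before the first lockup line
        if "</TASK>" in line:
            in_trace = False
        elif "<TASK>" in line:
            in_trace = True
        elif in_trace:
            frame = FRAME_RE.search(line)
            # Keep only frames without the "? " prefix.
            if frame and not frame.group("guess"):
                events[-1]["frames"].append(frame.group("sym"))
    return events
```

To use it, save the kernel messages from the crashed boot to a file (for example with `journalctl -k -b -1`) and pass the file's lines to `summarize_lockups`.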