Hi,
One of my servers gets stuck every few weeks and has to be rebooted.
CPU is 12 x 13th Gen Intel(R) Core(TM) i7-1355U (1 Socket) with 96GB DDR5 (non-ECC).
Proxmox 9.1.2, kernel 6.17.2-1-pve (the same has been happening with previous versions of Proxmox and kernels; I have just upgraded to 6.17.4-1), so everything is up to date (including microcode).
It runs only a few VMs: OpenWRT, FreePBX, Proxmox Backup Server and a Windows 10.
NVMe SSD as ZFS storage, plus an external USB HDD for the PBS.
CPU usage is quite low (10-15%, somewhat higher when PBS is active); it rises to 75% when this starts happening.
Memory usage is around 30% right after boot, but usually climbs to 70-80% within hours (ZFS ARC), so that looks normal.
IO delay is typically around 0.10-0.15%.
The first signs show:
Code:
Dec 17 15:13:15 proxmox kernel: Oops: general protection fault, probably for non-canonical address 0x77614ee07c1f748c: 0000 [#1] SMP NOPTI
Dec 17 15:13:15 proxmox kernel: CPU: 5 UID: 0 PID: 2720 Comm: kvm Tainted: P O 6.17.2-1-pve #1 PREEMPT(voluntary)
Dec 17 15:13:15 proxmox kernel: Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
Dec 17 15:13:15 proxmox kernel: Hardware name: Default string Default string/Default string, BIOS 5.27 10/13/2023
Dec 17 15:13:15 proxmox kernel: RIP: 0010:folio_mark_dirty+0x24/0x60
Dec 17 15:13:15 proxmox kernel: Code: 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb e8 1e a6 02 00 48 85 c0 74 34 48 89 c7 48 8b 03 a9 00 00 01 00 75 20 <48> 8b 47 68 48 89 de 48 8b 40 10 ff d0 0f 1f 00 48 8b 5d f8 c9 31
Dec 17 15:13:15 proxmox kernel: RSP: 0018:ffffd0ae2c9af678 EFLAGS: 00010246
Dec 17 15:13:15 proxmox kernel: RAX: c27b2e45e401e401 RBX: fffff7c499217180 RCX: 0400000000000040
Dec 17 15:13:15 proxmox kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 77614ee07c1f7424
Dec 17 15:13:15 proxmox kernel: RBP: ffffd0ae2c9af680 R08: 0000000000000000 R09: 0000000000000000
Dec 17 15:13:15 proxmox kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 80000016485c6867
Dec 17 15:13:15 proxmox kernel: R13: fffff7c499217180 R14: ffffd0ae2c9afa48 R15: ffff8d9934def098
Dec 17 15:13:15 proxmox kernel: FS: 00007be893ca1840(0000) GS:ffff8d9ffea06000(0000) knlGS:0000000000000000
Dec 17 15:13:15 proxmox kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 17 15:13:15 proxmox kernel: CR2: 0000000001ab7fd0 CR3: 000000011cb20005 CR4: 0000000000f72ef0
Dec 17 15:13:15 proxmox kernel: PKRU: 55555554
Dec 17 15:13:15 proxmox kernel: Call Trace:
Dec 17 15:13:15 proxmox kernel: <TASK>
Dec 17 15:13:15 proxmox kernel: unmap_page_range+0x1098/0x17a0
Dec 17 15:13:15 proxmox kernel: ? kvm_flush_remote_tlbs+0x4d/0x70 [kvm]
Dec 17 15:13:15 proxmox kernel: unmap_single_vma.isra.0+0x78/0xd0
Dec 17 15:13:15 proxmox kernel: zap_page_range_single_batched+0xd1/0x1a0
Dec 17 15:13:15 proxmox kernel: madvise_vma_behavior+0xc22/0xdb0
Dec 17 15:13:15 proxmox kernel: madvise_walk_vmas+0x264/0x2f0
Dec 17 15:13:15 proxmox kernel: madvise_do_behavior+0xaa/0x300
Dec 17 15:13:15 proxmox kernel: do_madvise+0xf4/0x160
Dec 17 15:13:15 proxmox kernel: __x64_sys_madvise+0x2b/0x40
Dec 17 15:13:15 proxmox kernel: x64_sys_call+0x21bf/0x2330
Dec 17 15:13:15 proxmox kernel: do_syscall_64+0x80/0xa30
Dec 17 15:13:15 proxmox kernel: ? __x64_sys_madvise+0x2b/0x40
Dec 17 15:13:15 proxmox kernel: ? x64_sys_call+0x21bf/0x2330
Dec 17 15:13:15 proxmox kernel: ? do_syscall_64+0xb8/0xa30
Dec 17 15:13:15 proxmox kernel: ? __tlb_batch_free_encoded_pages+0x5b/0xc0
Dec 17 15:13:15 proxmox kernel: ? tlb_finish_mmu+0x88/0x1b0
Dec 17 15:13:15 proxmox kernel: ? do_madvise+0x121/0x160
Dec 17 15:13:15 proxmox kernel: ? __x64_sys_madvise+0x2b/0x40
Dec 17 15:13:15 proxmox kernel: ? x64_sys_call+0x21bf/0x2330
Dec 17 15:13:15 proxmox kernel: ? do_syscall_64+0xb8/0xa30
Dec 17 15:13:15 proxmox kernel: ? __x64_sys_madvise+0x2b/0x40
Dec 17 15:13:15 proxmox kernel: ? x64_sys_call+0x21bf/0x2330
Dec 17 15:13:15 proxmox kernel: ? do_syscall_64+0xb8/0xa30
Dec 17 15:13:15 proxmox kernel: ? do_syscall_64+0xb8/0xa30
Dec 17 15:13:15 proxmox kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Dec 17 15:13:15 proxmox kernel: RIP: 0033:0x7be896f69bb7
Dec 17 15:13:15 proxmox kernel: Code: ff e8 2d 7d ff ff 48 8b 54 24 28 64 48 2b 14 25 28 00 00 00 75 05 48 83 c4 38 c3 e8 f3 ff 00 00 0f 1f 00 b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 11 a2 0d 00 f7 d8 64 89 01 48
Dec 17 15:13:15 proxmox kernel: RSP: 002b:00007ffd29e8ade8 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
Dec 17 15:13:15 proxmox kernel: RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007be896f69bb7
Dec 17 15:13:15 proxmox kernel: RDX: 0000000000000004 RSI: 0000000000001000 RDI: 00007be48ea13000
Dec 17 15:13:15 proxmox kernel: RBP: 000000000ac13000 R08: 0000000000000000 R09: 0000000400000000
Dec 17 15:13:15 proxmox kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007be893c9f268
Dec 17 15:13:15 proxmox kernel: R13: 00005c56fb54a3c0 R14: 00007be48ea13000 R15: 00000000ffffffff
Dec 17 15:13:15 proxmox kernel: </TASK>
Dec 17 15:13:15 proxmox kernel: Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel cfg80211 udp_diag tcp_diag inet_diag nf_conntrack_netlink nfnetlink_acct rpcsec_gss_krb5 nfsv4 nfs netfs ebtable_filter ebtables ip6table_raw ip6t_>
Dec 17 15:13:15 proxmox kernel: soundwire_cadence snd_sof_pci snd_sof_xtensa_dsp sunrpc snd_sof snd_sof_utils intel_rapl_msr snd_soc_acpi_intel_match intel_rapl_common snd_soc_acpi_intel_sdca_quirks soundwire_generic_allocation snd_soc_acpi intel_uncore_frequency intel_unc>
Dec 17 15:13:15 proxmox kernel: ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq xe intel_vsec drm_gpuvm drm_gpusvm_helper gpu_sched drm_ttm_helper drm_exec drm_suballoc_helper uas usb_storage hid_generic usbkbd usbmouse usbhid hid i915 nvme drm>
Dec 17 15:13:15 proxmox kernel: ---[ end trace 0000000000000000 ]---
Dec 17 15:13:15 proxmox kernel: RIP: 0010:folio_mark_dirty+0x24/0x60
Dec 17 15:13:15 proxmox kernel: Code: 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb e8 1e a6 02 00 48 85 c0 74 34 48 89 c7 48 8b 03 a9 00 00 01 00 75 20 <48> 8b 47 68 48 89 de 48 8b 40 10 ff d0 0f 1f 00 48 8b 5d f8 c9 31
Dec 17 15:13:15 proxmox kernel: RSP: 0018:ffffd0ae2c9af678 EFLAGS: 00010246
Dec 17 15:13:15 proxmox kernel: RAX: c27b2e45e401e401 RBX: fffff7c499217180 RCX: 0400000000000040
Dec 17 15:13:15 proxmox kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 77614ee07c1f7424
Dec 17 15:13:15 proxmox kernel: RBP: ffffd0ae2c9af680 R08: 0000000000000000 R09: 0000000000000000
Dec 17 15:13:15 proxmox kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 80000016485c6867
Dec 17 15:13:15 proxmox kernel: R13: fffff7c499217180 R14: ffffd0ae2c9afa48 R15: ffff8d9934def098
Dec 17 15:13:15 proxmox kernel: FS: 00007be893ca1840(0000) GS:ffff8d9ffea06000(0000) knlGS:0000000000000000
Dec 17 15:13:15 proxmox kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 17 15:13:15 proxmox kernel: CR2: 0000000001ab7fd0 CR3: 000000011cb20005 CR4: 0000000000f72ef0
Dec 17 15:13:15 proxmox kernel: PKRU: 55555554
Dec 17 15:13:15 proxmox kernel: ------------[ cut here ]------------
Dec 17 15:13:15 proxmox kernel: WARNING: CPU: 5 PID: 2720 at kernel/exit.c:898 do_exit+0x7d6/0xa20
Dec 17 15:13:15 proxmox kernel: Modules linked in: wireguard curve25519_x86_64 libcurve25519_generic ip6_udp_tunnel udp_tunnel cfg80211 udp_diag tcp_diag inet_diag nf_conntrack_netlink nfnetlink_acct rpcsec_gss_krb5 nfsv4 nfs netfs ebtable_filter ebtables ip6table_raw ip6t_>
Dec 17 15:13:15 proxmox kernel: soundwire_cadence snd_sof_pci snd_sof_xtensa_dsp sunrpc snd_sof snd_sof_utils intel_rapl_msr snd_soc_acpi_intel_match intel_rapl_common snd_soc_acpi_intel_sdca_quirks soundwire_generic_allocation snd_soc_acpi intel_uncore_frequency intel_unc>
Dec 17 15:13:15 proxmox kernel: ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq xe intel_vsec drm_gpuvm drm_gpusvm_helper gpu_sched drm_ttm_helper drm_exec drm_suballoc_helper uas usb_storage hid_generic usbkbd usbmouse usbhid hid i915 nvme drm>
Dec 17 15:13:15 proxmox kernel: CPU: 5 UID: 0 PID: 2720 Comm: kvm Tainted: P D O 6.17.2-1-pve #1 PREEMPT(voluntary)
Dec 17 15:13:15 proxmox kernel: Tainted: [P]=PROPRIETARY_MODULE, [D]=DIE, [O]=OOT_MODULE
Dec 17 15:13:15 proxmox kernel: Hardware name: Default string Default string/Default string, BIOS 5.27 10/13/2023
Dec 17 15:13:15 proxmox kernel: RIP: 0010:do_exit+0x7d6/0xa20
Dec 17 15:13:15 proxmox kernel: Code: 4c 89 ab f0 0a 00 00 48 89 45 c0 48 8b 83 10 0d 00 00 e9 33 fe ff ff 48 8b bb d0 0a 00 00 31 f6 e8 2f e2 ff ff e9 e6 fd ff ff <0f> 0b e9 6d f8 ff ff 4c 89 e6 bf 05 06 00 00 e8 d6 41 01 00 e9 a6
Dec 17 15:13:15 proxmox kernel: RSP: 0018:ffffd0ae2c9afec0 EFLAGS: 00010282
Dec 17 15:13:15 proxmox kernel: RAX: 0000000000000286 RBX: ffff8d88dd0d0000 RCX: 0000000000000000
Dec 17 15:13:15 proxmox kernel: RDX: 000000000000270f RSI: 0000000000002710 RDI: 000000000000000b
Dec 17 15:13:15 proxmox kernel: RBP: ffffd0ae2c9aff10 R08: 0000000000000000 R09: 0000000000000000
Dec 17 15:13:15 proxmox kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000b
Dec 17 15:13:15 proxmox kernel: R13: 0000000000000001 R14: 0000000000000246 R15: 77614ee07c1f748c
Dec 17 15:13:15 proxmox kernel: FS: 00007be893ca1840(0000) GS:ffff8d9ffea06000(0000) knlGS:0000000000000000
Dec 17 15:13:15 proxmox kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 17 15:13:15 proxmox kernel: CR2: 0000000001ab7fd0 CR3: 000000011cb20005 CR4: 0000000000f72ef0
Dec 17 15:13:15 proxmox kernel: PKRU: 55555554
Dec 17 15:13:15 proxmox kernel: Call Trace:
Dec 17 15:13:15 proxmox kernel: <TASK>
Dec 17 15:13:15 proxmox kernel: make_task_dead+0x93/0xa0
Dec 17 15:13:15 proxmox kernel: rewind_stack_and_make_dead+0x16/0x20
Dec 17 15:13:15 proxmox kernel: RIP: 0033:0x7be896f69bb7
Dec 17 15:13:15 proxmox kernel: Code: ff e8 2d 7d ff ff 48 8b 54 24 28 64 48 2b 14 25 28 00 00 00 75 05 48 83 c4 38 c3 e8 f3 ff 00 00 0f 1f 00 b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 11 a2 0d 00 f7 d8 64 89 01 48
Dec 17 15:13:15 proxmox kernel: RSP: 002b:00007ffd29e8ade8 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
Dec 17 15:13:15 proxmox kernel: RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007be896f69bb7
Dec 17 15:13:15 proxmox kernel: RDX: 0000000000000004 RSI: 0000000000001000 RDI: 00007be48ea13000
Dec 17 15:13:15 proxmox kernel: RBP: 000000000ac13000 R08: 0000000000000000 R09: 0000000400000000
Dec 17 15:13:15 proxmox kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007be893c9f268
Dec 17 15:13:15 proxmox kernel: R13: 00005c56fb54a3c0 R14: 00007be48ea13000 R15: 00000000ffffffff
Dec 17 15:13:15 proxmox kernel: </TASK>
Dec 17 15:13:15 proxmox kernel: ---[ end trace 0000000000000000 ]---
Dec 17 15:13:15 proxmox kernel: BUG: kernel NULL pointer dereference, address: 00000000000005a9
Dec 17 15:13:15 proxmox kernel: #PF: supervisor write access in kernel mode
Dec 17 15:13:15 proxmox kernel: #PF: error_code(0x0002) - not-present page
Dec 17 15:13:15 proxmox kernel: Oops: Oops: 0002 [#2] SMP NOPTI
Dec 17 15:13:15 proxmox kernel: CPU: 5 UID: 0 PID: 2720 Comm: kvm Tainted: P D W O 6.17.2-1-pve #1 PREEMPT(voluntary)
Dec 17 15:13:15 proxmox kernel: Tainted: [P]=PROPRIETARY_MODULE, [D]=DIE, [W]=WARN, [O]=OOT_MODULE
Dec 17 15:13:15 proxmox kernel: Hardware name: Default string Default string/Default string, BIOS 5.27 10/13/2023
Dec 17 15:13:15 proxmox kernel: RIP: 0010:__blk_flush_plug+0x80/0x140
Dec 17 15:13:15 proxmox kernel: Code: 00 00 ad de 48 89 5d c0 48 89 5d c8 48 39 c1 74 6a 49 8b 47 30 48 8b 75 b8 48 39 c6 74 4a 49 8b 4f 30 49 8b 57 38 48 8b 45 c0 <48> 89 59 08 ...
Dec 17 15:13:42 proxmox kernel: watchdog: BUG: soft lockup - CPU#4 stuck for 26s! [ksmd:99]
...
Dec 17 15:14:10 proxmox kernel: watchdog: BUG: soft lockup - CPU#4 stuck for 52s! [ksmd:99]
...
Dec 17 15:14:42 proxmox kernel: watchdog: BUG: soft lockup - CPU#4 stuck for 82s! [ksmd:99]
After that, the "watchdog: BUG: soft lockup - CPU#4 stuck for 82s! [ksmd:99]" message repeats roughly every 30 seconds with an increasing stuck time, until about 5 hours later messages such as this start to appear:
Out of memory: Killed process 1855 ....
Then the server hangs (or becomes too slow to be accessible; initially it still seems to respond to ping, but not to ssh), and the only way to recover is a power cycle.
Memory has been tested several times, with no issues found.
Any hints about what to look for or how to fix it?
Otherwise, I'm tempted to write a script that reboots the server when it detects "general protection fault" in journalctl. I'm not sure whether a script for that already exists. If someone can tell me whether something similar has already been done, that would be nice!
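To make the idea concrete, here is a rough sketch of the kind of watcher I have in mind, in Python. I have not run it on the affected server. The names `FATAL_PATTERNS`, `is_fatal` and `watch_and_reboot` are my own placeholders, the two patterns are simply taken from the log above, and the forced reboot is an assumption about what still works once the kernel is in this state.

```python
import re
import subprocess

# Kernel messages treated as fatal; both strings appear in the log above.
FATAL_PATTERNS = (
    re.compile(r"general protection fault"),
    re.compile(r"BUG: kernel NULL pointer dereference"),
)


def is_fatal(line: str) -> bool:
    """Return True if a kernel log line matches one of the fatal patterns."""
    return any(p.search(line) for p in FATAL_PATTERNS)


def watch_and_reboot() -> None:
    """Follow the kernel journal and force a reboot on the first fatal line."""
    # -k: kernel messages only, -f: follow, -n 0: skip old entries,
    # -o cat: print only the message text
    proc = subprocess.Popen(
        ["journalctl", "-k", "-f", "-n", "0", "-o", "cat"],
        stdout=subprocess.PIPE,
        text=True,
    )
    assert proc.stdout is not None
    for line in proc.stdout:
        if is_fatal(line):
            # Assumption: a clean shutdown may hang on a half-dead kernel, so
            # the double --force asks systemd to reboot immediately without
            # stopping services first.
            subprocess.run(
                ["systemctl", "reboot", "--force", "--force"], check=False
            )
            break
```

It would have to run as root, for example from a small systemd service that calls `watch_and_reboot()`. It is also possible that the kernel's own `kernel.panic_on_oops` and `kernel.panic` sysctls already cover this without any script, but I have not tried them here.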
Tks!