PVE random crash / pvestatd.service killed

a-0 — New Member · Feb 15, 2024
Yesterday, my PVE crashed "out of nowhere" (I had not changed any configuration or issued any commands; the usual VMs were running as always). It had run flawlessly on exactly this hardware for two years.

Since the first thing that happened according to journalctl -xeb-1 was that pvestatd.service was killed, I assume the issue may lie there.

The symptoms were that none of the services running in LXCs/VMs were reachable, nor was the WebUI, but the server did not shut down. Sadly, I had no physical access until today, when I hard-rebooted the machine, so this is all I know. The task log has no entries around that time.

Full log of the incident:

Code:
Feb 14 16:15:11 rack0 systemd[1]: pvestatd.service: Main process exited, code=killed, status=9/KILL
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ An ExecStart= process belonging to unit pvestatd.service has exited.
░░
░░ The process' exit code is 'killed' and its exit status is 9.
Feb 14 16:15:11 rack0 kernel: BUG: unable to handle page fault for address: 0000000000006204
Feb 14 16:15:11 rack0 kernel: #PF: supervisor read access in kernel mode
Feb 14 16:15:11 rack0 kernel: #PF: error_code(0x0000) - not-present page
Feb 14 16:15:11 rack0 kernel: PGD 0 P4D 0
Feb 14 16:15:11 rack0 kernel: Oops: 0000 [#1] SMP NOPTI
Feb 14 16:15:11 rack0 kernel: CPU: 1 PID: 5021 Comm: pvestatd Tainted: P           O      5.15.116-1-pve #1
Feb 14 16:15:11 rack0 kernel: Hardware name: BIOSTAR Group B560MX-E PRO/B560MX-E PRO, BIOS 5.19 12/21/2021
Feb 14 16:15:11 rack0 kernel: RIP: 0010:pid_nr_ns+0x14/0x40
Feb 14 16:15:11 rack0 kernel: Code: ba e8 50 7d 5a 00 eb b9 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00 00 55 45 31 c0 48 89 e5 48 85 ff 74 15 8b 46 40 <3b> 47 04 77 0d 48 c1 e0 04 48 01 c7 48 39 77 68 74 09 44 89 c0 5d
Feb 14 16:15:11 rack0 kernel: RSP: 0018:ffffb47e0a253d60 EFLAGS: 00010206
Feb 14 16:15:11 rack0 kernel: RAX: 0000000000000000 RBX: ffffffffba08a780 RCX: 0000000000000000
Feb 14 16:15:11 rack0 kernel: RDX: 0000000000040006 RSI: ffffffffba08a780 RDI: 0000000000006200
Feb 14 16:15:11 rack0 kernel: RBP: ffffb47e0a253d60 R08: 0000000000000000 R09: ffffffffba08a780
Feb 14 16:15:11 rack0 kernel: R10: 0000000000000228 R11: ffffb47e0a253ce0 R12: 0000000000006200
Feb 14 16:15:11 rack0 kernel: R13: 000000000004473d R14: 000000000004473d R15: ffffb47e0a253e68
Feb 14 16:15:11 rack0 kernel: FS:  00007fc486078280(0000) GS:ffff93a8d5840000(0000) knlGS:0000000000000000
Feb 14 16:15:11 rack0 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 14 16:15:11 rack0 kernel: CR2: 0000000000006204 CR3: 000000014c506004 CR4: 0000000000772ee0
Feb 14 16:15:11 rack0 kernel: PKRU: 55555554
Feb 14 16:15:11 rack0 kernel: Call Trace:
Feb 14 16:15:11 rack0 kernel:  <TASK>
Feb 14 16:15:11 rack0 kernel:  ? __die_body.cold+0x1a/0x1f
Feb 14 16:15:11 rack0 kernel:  ? __die+0x2b/0x37
Feb 14 16:15:11 rack0 kernel:  ? page_fault_oops+0x136/0x2c0
Feb 14 16:15:11 rack0 kernel:  ? do_user_addr_fault+0x1e0/0x660
Feb 14 16:15:11 rack0 kernel:  ? do_user_addr_fault+0x31a/0x660
Feb 14 16:15:11 rack0 kernel:  ? number+0x39a/0x400
Feb 14 16:15:11 rack0 kernel:  ? exc_page_fault+0x77/0x170
Feb 14 16:15:11 rack0 kernel:  ? asm_exc_page_fault+0x27/0x30
Feb 14 16:15:11 rack0 kernel:  ? pid_nr_ns+0x14/0x40
Feb 14 16:15:11 rack0 kernel:  next_tgid+0x4a/0x100
Feb 14 16:15:11 rack0 kernel:  proc_pid_readdir+0xaf/0x220
Feb 14 16:15:11 rack0 kernel:  proc_root_readdir+0x3a/0x50
Feb 14 16:15:11 rack0 kernel:  iterate_dir+0x9f/0x1d0
Feb 14 16:15:11 rack0 kernel:  __x64_sys_getdents64+0x78/0x110
Feb 14 16:15:11 rack0 kernel:  ? __ia32_compat_sys_getdents+0x110/0x110
Feb 14 16:15:11 rack0 kernel:  do_syscall_64+0x59/0xc0
Feb 14 16:15:11 rack0 kernel:  ? exit_to_user_mode_prepare+0x37/0x1b0
Feb 14 16:15:11 rack0 kernel:  ? irqentry_exit_to_user_mode+0x9/0x20
Feb 14 16:15:11 rack0 kernel:  ? irqentry_exit+0x1d/0x30
Feb 14 16:15:11 rack0 kernel:  ? exc_page_fault+0x89/0x170
Feb 14 16:15:11 rack0 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xcb
Feb 14 16:15:11 rack0 kernel: RIP: 0033:0x7fc486176f07
Feb 14 16:15:11 rack0 kernel: Code: 0f 1f 00 48 8b 47 20 c3 66 2e 0f 1f 84 00 00 00 00 00 90 48 81 fa ff ff ff 7f b8 ff ff ff 7f 48 0f 47 d0 b8 d9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 59 af 10 00 f7 d8 64 89 02 48
Feb 14 16:15:11 rack0 kernel: RSP: 002b:00007fffb71f5a48 EFLAGS: 00000293 ORIG_RAX: 00000000000000d9
Feb 14 16:15:11 rack0 kernel: RAX: ffffffffffffffda RBX: 000056367ee182d0 RCX: 00007fc486176f07
Feb 14 16:15:11 rack0 kernel: RDX: 0000000000008000 RSI: 000056367ee18300 RDI: 0000000000000008
Feb 14 16:15:11 rack0 kernel: RBP: 000056367ee18300 R08: 0000000000000030 R09: 00005636784853b0
Feb 14 16:15:11 rack0 kernel: R10: 000056367edb64a8 R11: 0000000000000293 R12: ffffffffffffff80
Feb 14 16:15:11 rack0 kernel: R13: 000056367ee182d4 R14: 0000000000000000 R15: 000056367edb64a8
Feb 14 16:15:11 rack0 kernel:  </TASK>
Feb 14 16:15:11 rack0 kernel: Modules linked in: tcp_diag udp_diag inet_diag binfmt_misc cfg80211 8021q garp mrp wireguard curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_counter xt_mark nft_compat rpcsec_gss_krb5 nfsv4 n>
Feb 14 16:15:11 rack0 kernel:  snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine kvm snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec irqbypass crct10dif_pclmul ghash_clmulni_intel snd_hda_core aesni_intel snd_hwdep cec snd_pcm rc_core crypto_simd mei_hdcp i2c_algo_bit cryptd fb_sys_fops snd_timer intel_cstate ee1004 snd syscopyarea mei_me sysfillrect soundcore>
Feb 14 16:15:11 rack0 kernel: CR2: 0000000000006204
Feb 14 16:15:11 rack0 kernel: ---[ end trace c87fe6b1c9027956 ]---
Feb 14 16:15:11 rack0 kernel: RIP: 0010:pid_nr_ns+0x14/0x40
Feb 14 16:15:11 rack0 kernel: Code: ba e8 50 7d 5a 00 eb b9 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00 00 55 45 31 c0 48 89 e5 48 85 ff 74 15 8b 46 40 <3b> 47 04 77 0d 48 c1 e0 04 48 01 c7 48 39 77 68 74 09 44 89 c0 5d
Feb 14 16:15:11 rack0 kernel: RSP: 0018:ffffb47e0a253d60 EFLAGS: 00010206
Feb 14 16:15:11 rack0 kernel: RAX: 0000000000000000 RBX: ffffffffba08a780 RCX: 0000000000000000
Feb 14 16:15:11 rack0 kernel: RDX: 0000000000040006 RSI: ffffffffba08a780 RDI: 0000000000006200
Feb 14 16:15:11 rack0 kernel: RBP: ffffb47e0a253d60 R08: 0000000000000000 R09: ffffffffba08a780
Feb 14 16:15:11 rack0 kernel: R10: 0000000000000228 R11: ffffb47e0a253ce0 R12: 0000000000006200
Feb 14 16:15:11 rack0 kernel: R13: 000000000004473d R14: 000000000004473d R15: ffffb47e0a253e68
Feb 14 16:15:11 rack0 kernel: FS:  00007fc486078280(0000) GS:ffff93a8d5840000(0000) knlGS:0000000000000000
Feb 14 16:15:11 rack0 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 14 16:15:11 rack0 kernel: CR2: 0000000000006204 CR3: 000000014c506004 CR4: 0000000000772ee0
Feb 14 16:15:11 rack0 kernel: PKRU: 55555554
Feb 14 16:15:11 rack0 systemd[1]: pvestatd.service: Failed with result 'signal'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit pvestatd.service has entered the 'failed' state with result 'signal'.
Feb 14 16:15:11 rack0 systemd[1]: pvestatd.service: Consumed 14h 56min 28.993s CPU time.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit pvestatd.service completed and consumed the indicated resources.
Feb 14 16:15:12 rack0 kernel: BUG: Bad page map in process ksmd  pte:800000065c87d307 pmd:4bd1a8067
Feb 14 16:15:12 rack0 kernel: addr:00007f7e50de71b0 vm_flags:a8120073 anon_vma:ffff93a540921410 mapping:0000000000000000 index:7f7e50de7
Feb 14 16:15:12 rack0 kernel: file:(null) fault:0x0 mmap:0x0 readpage:0x0

Is this issue known? Would it be mitigated by e.g. auto-restarting pvestatd.service upon kill/crash?
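For the restart part of the question: a systemd drop-in could restart pvestatd automatically whenever its main process is killed. This is only a sketch of a partial mitigation, not an official Proxmox recommendation; the file name below is an assumption, and it cannot prevent the underlying kernel oops, only recover the daemon afterwards.

```ini
# Hypothetical drop-in: /etc/systemd/system/pvestatd.service.d/restart.conf
# Restart pvestatd if it exits abnormally (e.g. killed with SIGKILL),
# waiting 5 seconds between attempts.
[Service]
Restart=on-failure
RestartSec=5
```

This could be created with `systemctl edit pvestatd.service` followed by `systemctl daemon-reload`. Note that the trace above shows the fault occurring inside a getdents64 syscall while pvestatd scans /proc, so a restarted daemon may well hit the same oops again until the kernel-side cause is addressed.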
 
Hello, I'd like to add to this request: I have an issue with soft lockups too, and now I see this failure as well.

Code:
Feb 29 12:21:11 pve1 kernel: BUG: unable to handle page fault for address: 0000000000225245
Feb 29 12:21:11 pve1 kernel: #PF: supervisor read access in kernel mode
Feb 29 12:21:11 pve1 kernel: #PF: error_code(0x0000) - not-present page
Feb 29 12:21:11 pve1 kernel: PGD 0 P4D 0
Feb 29 12:21:11 pve1 kernel: Oops: 0000 [#6] PREEMPT SMP NOPTI
Feb 29 12:21:11 pve1 kernel: CPU: 17 PID: 842297 Comm: worker Tainted: P      D W  O       6.5.11-8-pve #1
Feb 29 12:21:11 pve1 kernel: Hardware name: Gigabyte Technology Co., Ltd. X570S AORUS ELITE AX/X570S AORUS ELITE AX, BIOS F7d 12/25/2023
Feb 29 12:21:11 pve1 kernel: RIP: 0010:__block_write_begin_int+0x48/0x5c0
Feb 29 12:21:11 pve1 kernel: Code: 89 f0 48 89 4d 88 41 81 e0 ff 0f 00 00 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 48 8b 47 18 44 01 c2 89 55 bc 41 0>
Feb 29 12:21:11 pve1 kernel: RSP: 0018:ffff9abf799e7b80 EFLAGS: 00010206
Feb 29 12:21:11 pve1 kernel: RAX: 0000000000225245 RBX: ffffc0406da8e040 RCX: ffffffffbb906d60
Feb 29 12:21:11 pve1 kernel: RDX: 0000000000001000 RSI: 0000000214299000 RDI: ffffc040627746c8
Feb 29 12:21:11 pve1 kernel: RBP: ffff9abf799e7c18 R08: 0000000000000000 R09: ffff9abf799e7cb8
Feb 29 12:21:11 pve1 kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000214299000
Feb 29 12:21:11 pve1 kernel: R13: ffff9abf799e7cb0 R14: 0000000000001000 R15: ffffffffbb906d60
Feb 29 12:21:11 pve1 kernel: FS:  00007f97963e36c0(0000) GS:ffff8a263ee40000(0000) knlGS:0000000000000000
Feb 29 12:21:11 pve1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 29 12:21:11 pve1 kernel: CR2: 0000000000225245 CR3: 0000000194ff0000 CR4: 0000000000750ee0
Feb 29 12:21:11 pve1 kernel: PKRU: 55555554
Feb 29 12:21:11 pve1 kernel: Call Trace:
Feb 29 12:21:11 pve1 kernel:  <TASK>
Feb 29 12:21:11 pve1 kernel:  ? show_regs+0x6d/0x80
Feb 29 12:21:11 pve1 kernel:  ? __die+0x24/0x80
Feb 29 12:21:11 pve1 kernel:  ? page_fault_oops+0x176/0x500
Feb 29 12:21:11 pve1 kernel:  ? do_user_addr_fault+0x31d/0x6a0
Feb 29 12:21:11 pve1 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Feb 29 12:21:11 pve1 kernel:  ? exc_page_fault+0x83/0x1b0
Feb 29 12:21:11 pve1 kernel:  ? asm_exc_page_fault+0x27/0x30
Feb 29 12:21:11 pve1 kernel:  ? __pfx_blkdev_get_block+0x10/0x10
Feb 29 12:21:11 pve1 kernel:  ? __pfx_blkdev_get_block+0x10/0x10
Feb 29 12:21:11 pve1 kernel:  ? __block_write_begin_int+0x48/0x5c0
Feb 29 12:21:11 pve1 kernel:  ? __pfx_blkdev_get_block+0x10/0x10
Feb 29 12:21:11 pve1 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Feb 29 12:21:11 pve1 kernel:  ? __filemap_get_folio+0x98/0x230
Feb 29 12:21:11 pve1 kernel:  ? __pfx_blkdev_get_block+0x10/0x10
Feb 29 12:21:11 pve1 kernel:  ? __pfx_blkdev_get_block+0x10/0x10
Feb 29 12:21:11 pve1 kernel:  block_write_begin+0x57/0x140
Feb 29 12:21:11 pve1 kernel:  blkdev_write_begin+0x20/0x40
Feb 29 12:21:11 pve1 kernel:  generic_perform_write+0xd4/0x230
Feb 29 12:21:11 pve1 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Feb 29 12:21:11 pve1 kernel:  __generic_file_write_iter+0xae/0xd0
Feb 29 12:21:11 pve1 kernel:  blkdev_write_iter+0xf3/0x180
Feb 29 12:21:11 pve1 kernel:  ? security_file_permission+0x39/0x70
Feb 29 12:21:11 pve1 kernel:  vfs_write+0x254/0x440
Feb 29 12:21:11 pve1 kernel:  __x64_sys_pwrite64+0xa6/0xd0
Feb 29 12:21:11 pve1 kernel:  do_syscall_64+0x5b/0x90
Feb 29 12:21:11 pve1 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Feb 29 12:21:11 pve1 kernel:  ? syscall_exit_to_user_mode+0x37/0x60
Feb 29 12:21:11 pve1 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Feb 29 12:21:11 pve1 kernel:  ? do_syscall_64+0x67/0x90
Feb 29 12:21:11 pve1 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Feb 29 12:21:11 pve1 kernel:  ? exit_to_user_mode_prepare+0xa5/0x190
Feb 29 12:21:11 pve1 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Feb 29 12:21:11 pve1 kernel:  ? syscall_exit_to_user_mode+0x37/0x60
Feb 29 12:21:11 pve1 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Feb 29 12:21:11 pve1 kernel:  ? do_syscall_64+0x67/0x90
Feb 29 12:21:11 pve1 kernel:  ? do_syscall_64+0x67/0x90
Feb 29 12:21:11 pve1 kernel:  ? srso_alias_return_thunk+0x5/0x7f
Feb 29 12:21:11 pve1 kernel:  ? do_syscall_64+0x67/0x90
Feb 29 12:21:11 pve1 kernel:  ? do_syscall_64+0x67/0x90
Feb 29 12:21:11 pve1 kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Feb 29 12:21:11 pve1 kernel: RIP: 0033:0x7f995e3153b7
Feb 29 12:21:11 pve1 kernel: Code: 08 89 3c 24 48 89 4c 24 18 e8 05 f4 f8 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 12 00 00 0>
Feb 29 12:21:11 pve1 kernel: RSP: 002b:00007f97963de160 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
Feb 29 12:21:11 pve1 kernel: RAX: ffffffffffffffda RBX: 00007f97975f56c0 RCX: 00007f995e3153b7
Feb 29 12:21:11 pve1 kernel: RDX: 0000000000007000 RSI: 00007f98a8681000 RDI: 000000000000002b
Feb 29 12:21:11 pve1 kernel: RBP: 00007f98a8681000 R08: 0000000000000000 R09: 00005614c18d5fb0
Feb 29 12:21:11 pve1 kernel: R10: 0000000214299000 R11: 0000000000000293 R12: 0000000000000000
Feb 29 12:21:11 pve1 kernel: R13: 00005614c0813f78 R14: 00005614c18d5f88 R15: 00007f9795be3000
Feb 29 12:21:11 pve1 kernel:  </TASK>
Feb 29 12:21:11 pve1 kernel: Modules linked in: tcp_diag inet_diag nf_conntrack_netlink overlay xt_nat xfrm_user xfrm_algo ipt_REJECT nf_reject_ipv4 xt_L>
Feb 29 12:21:11 pve1 kernel:  cryptd i2c_algo_bit ecdh_generic snd input_leds rapl video gigabyte_wmi wmi_bmof k10temp soundcore ccp pcspkr ecc libarc4 m>
Feb 29 12:21:11 pve1 kernel: CR2: 0000000000225245
 
I had exactly the same messages and the same behavior two times. In another case I had trouble with the filesystem, and now permanent network problems. At the moment we unfortunately cannot use our training environment. I have no idea what I could do to solve the problem, except reinstalling PVE 7.
 
For me, the issues were "resolved" by hard-rebooting the server. That is a very ugly solution, but luckily in my case nothing was corrupted by it, and the boot went flawlessly as ever. Since then, this has not happened again, knock on wood.
 
What is the current stable version of PVE 7?

Since I updated from 7 to 8, I have problems when I do backups with high I/O: "BUG: unable to handle page fault for address" can appear at any moment for no apparent reason, even though the server is not under high load.

 
I had used the last 7.x version without any problems. I think there is a kernel problem in version 8. In my case the SDN permissions also seem to have problems, because the network is very unstable. We have 800 users (not concurrent), and it seems SDN permission handling has trouble.
 