pvestatd crash

Mar 28, 2024
Hi everyone, this is my first post.
I'm writing about a problem with one node; the same problem occurs on two different clusters in production.
In our datacenter we have 3 clusters with a total of 9 nodes.
On two of these nodes the pvestatd service crashes often, and every time I check the dashboard I have to restart the service. In this case it is a cluster of 2 nodes, and on the other node, which is identical in hardware and also in the number of VMs and resources assigned, the service never crashes. The datacenter performed a hardware test, RAM first, without errors.
I checked the integrity of the packages via debsums, which reports FAILED only on some configuration files modified, I think, by the PVE installer:
/etc/issue FAILED
/etc/lvm/lvm.conf FAILED
/etc/cron.d/mdadm FAILED
/etc/apt/sources.list.d/pve-enterprise.list FAILED
/etc/systemd/timesyncd.conf FAILED
So everything seems fine.
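For anyone checking the same thing, this is how the report can be narrowed to only the FAILED lines (a sketch; the path /tmp/debsums.log is just an example, and the sample entries are taken from the list above):

```shell
# On a live node the report would come from:  debsums 2>/dev/null > /tmp/debsums.log
# Here the report is simulated with entries from this post:
cat > /tmp/debsums.log <<'EOF'
/etc/issue                                  FAILED
/usr/share/perl5/IO/Multiplex.pm            OK
/etc/lvm/lvm.conf                           FAILED
EOF
grep 'FAILED$' /tmp/debsums.log   # prints only the two FAILED lines
```

If I remember correctly, `debsums -c` lists only the changed files directly, which avoids the grep.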
In the syslog, on the line before the first error
pvedaemon[374198]: VM 201 qmp command failed - VM 201 qmp command 'query-proxmox-support' failed - got timeout
I find this:
Mar 27 22:55:57 pve2node1 pvestatd[38449]: qemu status update error: Can't locate object method "Handle=HASH(0x55eabbc33fc8)" via package "IO::Multiplex" at /usr/share/perl5/IO/Multiplex.pm line 966.

pve-manager/7.4-17/513c62be
Linux 5.15.143-1-pve #1 SMP PVE 5.15.143-1
32 x 13th Gen Intel(R) Core(TM) i9-13900 (1 Socket)
126GB RAM
2 x 2TB NVMe for root in mirror
and
2 x 4TB SSD Enterprise ZFS mirror
2 x 4TB SSD Enterprise ZFS mirror
Thanks!
 
The service continues to crash.
This is the output of journalctl -b -u pvestatd.service:

Mar 29 22:00:52 pve2node1 systemd[1]: Started PVE Status Daemon.
Mar 29 22:25:52 pve2node1 pvestatd[768068]: closing with write buffer at /usr/share/perl5/IO/Multiplex.pm line 928.
Mar 29 22:25:52 pve2node1 pvestatd[768068]: closing with write buffer at /usr/share/perl5/IO/Multiplex.pm line 928.
Mar 29 22:25:52 pve2node1 pvestatd[768068]: closing with read buffer at /usr/share/perl5/IO/Multiplex.pm line 927.
Mar 29 22:25:52 pve2node1 pvestatd[768068]: closing with read buffer at /usr/share/perl5/IO/Multiplex.pm line 927.
Mar 29 22:25:52 pve2node1 pvestatd[768068]: closing with read buffer at /usr/share/perl5/IO/Multiplex.pm line 927.
Mar 29 22:25:52 pve2node1 pvestatd[768068]: qemu status update error: Can't locate object method "Handle=HASH(0x55fd78c0d698)" via package "IO::Multiplex" at /usr/share/perl5/IO/Multiplex.pm line 966.
Mar 29 22:29:22 pve2node1 pvestatd[768068]: Use of uninitialized value $vmid in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer/Helpers.pm line 29.
Mar 29 22:29:22 pve2node1 pvestatd[768068]: Use of uninitialized value $vmid in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer/Helpers.pm line 29.
Mar 29 22:29:22 pve2node1 pvestatd[768068]: Use of uninitialized value $vmid in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer/Helpers.pm line 29.
Mar 29 22:29:22 pve2node1 pvestatd[768068]: Use of uninitialized value $vmid in concatenation (.) or string at /usr/share/perl5/PVE/QemuServer/Helpers.pm line 29.
Mar 29 22:35:52 pve2node1 systemd[1]: pvestatd.service: Main process exited, code=killed, status=11/SEGV
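As a stopgap while looking for the root cause (not a fix), a systemd drop-in can restart pvestatd automatically after a crash like the SIGSEGV above, e.g. a file /etc/systemd/system/pvestatd.service.d/restart.conf containing:

```ini
# Restart the unit on abnormal exit (e.g. SIGSEGV), with a short delay.
[Service]
Restart=on-failure
RestartSec=10
```

followed by systemctl daemon-reload. This only masks the crash; the errors keep landing in the journal.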
 
Hey. If you are able to boot into Windows, there are some applications you can run which will stress test the system and report any CPU errors.

It may be worth trying to raise the CPU voltage a little, as the 13900/14900 seem to be rather power hungry, and I believe the generic motherboard presets may be a little conservative for some CPUs. On my MSI motherboard the setting was 'CPU Lite Load', which needed to be increased by a couple of steps.
 
Hi!
Thanks for the reply. The server is in a datacenter and the motherboard is a Supermicro; I can't boot Windows, but I can tell you that the CPU is always at 20-60% and never goes above 90%.
The datacenter is running a hardware test now.
 
The hardware check, after 3 hours, came back negative, so all hardware is OK. And for 2 days everything was fine; I thought the problem had been solved by the BIOS update and NIC firmware upgrade done during the hardware check, but today the server HUNG.
In the syslog I found this:
Apr 04 09:19:38 pve3n2 kernel: BUG: Bad page map in process pvestatd pte:8000000158beb845 pmd:1469ce067
This is the first error; going back to the night before, I found the first segfault:
Apr 04 01:51:14 pve3n2 kernel: pve-firewall[2205]: segfault at 0 ip 00006169e21ef6a4 sp 00007fff73e39570 error 4 in perl[6169e20a2000+195000] likely on CPU 12 (core 24, socket 0)
Apr 04 01:51:14 pve3n2 kernel: Code: 43 02 48 8d 0c 83 31 c0 48 39 d9 48 0f 45 c1 48 89 44 24 10 0f b6 43 01 48 8b 74 24 18 4c 8b 4e 10 4c 39 cd 0f 83 ec 00 00 00 <0f> b6 5d 00 48 81 fb c3 00 00 00 40 0f 9f c6 3d 96 00 00 00 0f 87

Random errors, so I think it's RAM related, but the 3-hour hardware check says everything is fine!!!
Should I replace the memory?
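In the meantime, one way to see how often these errors recur is to filter the kernel log for the relevant patterns; a sketch below works on a saved copy (journalctl -k > /tmp/kern.log is one way to produce it; here the log is simulated with the two lines quoted above):

```shell
# Simulated kernel log containing the two errors reported above:
cat > /tmp/kern.log <<'EOF'
Apr 04 09:19:38 pve3n2 kernel: BUG: Bad page map in process pvestatd pte:8000000158beb845 pmd:1469ce067
Apr 04 01:51:14 pve3n2 kernel: pve-firewall[2205]: segfault at 0 ip 00006169e21ef6a4 sp 00007fff73e39570 error 4
EOF
# Count lines mentioning typical memory/CPU fault symptoms:
grep -ciE 'bad page|segfault|machine check|mce' /tmp/kern.log   # prints 2
```

If the count keeps growing across boots despite the clean hardware test, booting a dedicated memtest86+ image for several hours (or overnight) is a more thorough RAM check than a short vendor test, since marginal DIMMs can pass brief runs.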
 