pvestatd segfaults

unterkomplex

Hi,


I noticed that pvestatd was not running on one of my nodes. It turns out that the service had segfaulted:
Aug 31 16:39:55 pve1 kernel: pvestatd[1862]: segfault at 100000000000 ip 000063217ac22321 sp 00007ffc2cfa5140 error 4 in perl[95321,63217abd1000+1ae000] likely on CPU 3 (core 3, socket 1)
Aug 31 16:39:55 pve1 kernel: Code: 00 00 00 66 0f 1f 44 00 00 48 8d 4a 01 48 83 c0 08 49 89 0c 24 48 8b 75 00 48 3b 56 18 73 52 48 89 ca 48 8b 18 48 85 db 74 df <48> 8b 13 48 89 10 48 8b 45 00 48 83 68 10 01 83 b>
Aug 31 16:39:55 pve1 systemd[1]: pvestatd.service: Main process exited, code=killed, status=11/SEGV
Aug 31 16:39:55 pve1 systemd[1]: pvestatd.service: Failed with result 'signal'.
Aug 31 16:39:55 pve1 systemd[1]: pvestatd.service: Consumed 7h 27min 59.427s CPU time, 160.6M memory peak.
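For anyone reading along: the "error 4" field in these lines is the x86 page-fault error code, a small bitmask (bit 0: page was present, bit 1: write access, bit 2: user mode). A quick shell sketch to decode such a value:

err=4   # the "error N" value from the kernel log line
(( err & 1 )) && echo "protection violation" || echo "page not present"
(( err & 2 )) && echo "write access" || echo "read access"
(( err & 4 )) && echo "user mode" || echo "kernel mode"

For error 4 this prints "page not present / read access / user mode", i.e. pvestatd read from an unmapped address (0x100000000000 here), which looks like a stray pointer inside the perl interpreter rather than anything kernel-side.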

I was able to restart the service manually and it seems to work fine now:
Sep 01 10:20:25 pve1 systemd[1]: Starting pvestatd.service - PVE Status Daemon...
Sep 01 10:20:27 pve1 pvestatd[2099919]: starting server
Sep 01 10:20:27 pve1 systemd[1]: Started pvestatd.service - PVE Status Daemon.
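As a stopgap until the cause is understood, systemd could also restart the daemon automatically after a crash. A sketch (first check with systemctl cat pvestatd whether the shipped unit already sets a Restart= policy; mine apparently didn't restart on its own):

systemctl edit pvestatd
# in the override file that opens, add:
#   [Service]
#   Restart=on-failure
#   RestartSec=10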

I checked for any other segfaults:
root@pve1:~# journalctl | grep segfault
Nov 03 23:02:28 pve1 kernel: dnsmasq[228401]: segfault at 5ea02dc81e1e ip 00007445fd6c42d5 sp 00007ffc327293a8 error 4 in libdbus-1.so.3.32.4[7445fd6a1000+30000] likely on CPU 0 (core 0, socket 0)
Nov 28 08:55:59 pve1 kernel: dnsmasq[1685]: segfault at 616309463d64 ip 00007d02c4f892d5 sp 00007ffe04781958 error 4 in libdbus-1.so.3.32.4[7d02c4f66000+30000] likely on CPU 3 (core 3, socket 0)
Dec 08 22:11:05 pve1 kernel: dnsmasq[1265]: segfault at 5941068861ac ip 00007c80e1a0a2d5 sp 00007fff0b538e28 error 4 in libdbus-1.so.3.32.4[7c80e19e7000+30000] likely on CPU 1 (core 1, socket 0)
Aug 10 04:24:01 pve1 kernel: task UPID:pve1:[2514843]: segfault at 2989dc5a8 ip 00007529bff58087 sp 00007ffc9b12f8c0 error 4 in libc.so.6[7529bfee9000+155000] likely on CPU 2 (core 2, socket 0)
Aug 21 14:29:43 pve1 kernel: python3[2469825]: segfault at ffffffffff8 ip 00007d1eaae60efa sp 00007ffcc1a05fa0 error 4 in libc.so.6[7d1eaadee000+155000] likely on CPU 0 (core 0, socket 0)
Aug 31 09:12:34 pve1 kernel: python3[1528251]: segfault at 100000000008 ip 00007d900a4b9653 sp 00007ffd7a3b4c90 error 4 in libcrypto.so.3[26d653,7d900a343000+381000] likely on CPU 0 (core 0, socket 1)
Aug 31 16:39:55 pve1 kernel: pvestatd[1862]: segfault at 100000000000 ip 000063217ac22321 sp 00007ffc2cfa5140 error 4 in perl[95321,63217abd1000+1ae000] likely on CPU 3 (core 3, socket 1)

and it looks like there was one segfault a couple of hours earlier (with no reboot in between):
Aug 31 09:12:34 pve1 kernel: python3[1528251]: segfault at 100000000008 ip 00007d900a4b9653 sp 00007ffd7a3b4c90 error 4 in libcrypto.so.3[26d653,7d900a343000+381000] likely on CPU 0 (core 0, socket 1)
Aug 31 09:12:35 pve1 kernel: Code: 89 ee 48 89 f5 49 c1 e6 03 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 49 8b 07 4a 8b 1c 30 48 85 db 74 19 0f 1f 40 00 48 89 d8 <48> 8b 5b 08 48 89 ee 48 8b 38 41 ff d4 48 85 db 75 eb 41 83 ed 01


The node was upgraded from Proxmox 8.4 to 9 on 27th August.
A third segfault happened on Proxmox 8.4:
Aug 21 14:29:43 pve1 kernel: python3[2469825]: segfault at ffffffffff8 ip 00007d1eaae60efa sp 00007ffcc1a05fa0 error 4 in libc.so.6[7d1eaadee000+155000] likely on CPU 0 (core 0, socket 0)
Aug 21 14:29:43 pve1 kernel: Code: ac 2c 10 00 e8 f7 62 fe ff 0f 1f 80 00 00 00 00 48 85 ff 0f 84 bf 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d e6 8e 13 00 <48> 8b 47 f8 64 8b 2b a8 02 75 5b 48 8b 15 6c 8e 13 00 64 48 83 3a


The CPU is quite old: AMD Embedded G-Series GX-420GI Radeon R7E.

I am not sure whether this is more likely due to faulty hardware or a bug. Happy to provide more details if that helps with the investigation.
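If it segfaults again, a core dump would give much more to go on than the kernel one-liners above. A sketch using systemd-coredump (the PID is a placeholder):

apt install systemd-coredump      # captures core dumps of future crashes
coredumpctl list pvestatd         # after the next crash
coredumpctl info <PID>            # crash metadata; with debug symbols, also a backtrace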
 
Hi!

As there are multiple programs segfaulting, I'd check whether there are any problems with memory (memtest), filesystem corruption, or package corruption (e.g. SMART tests, checking packages with debsums -c, etc.).
 
I tried
  • debsums
  • 24h memtest
  • smartctl long test
  • zpool scrub
and all checks returned no errors.
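For reference, apart from memtest (run from boot media), those checks roughly correspond to the following; rpool and /dev/sda are placeholders for the actual pool and disk:

debsums -s                    # only report changed/corrupted package files
smartctl -t long /dev/sda     # start the long self-test ...
smartctl -a /dev/sda          # ... and read the result once it has finished
zpool scrub rpool
zpool status rpool            # check the scrub result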
I guess it is hard to investigate further, but I'm leaving this here in case others encounter similar issues.
 
It's also worth checking the dmesg/syslog around the time the segfaults happen, and whether there are any errors during boot. Were any BIOS settings changed? What about resetting the BIOS settings to defaults?
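For example, to look at a window around one of the segfaults (date and time are placeholders):

journalctl --since "2025-08-31 16:30" --until "2025-08-31 16:50"
journalctl -k -p warning      # kernel messages of the current boot, warnings and above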
 
Hi,
Thanks for your support

  • The BIOS has been on the latest version and its settings have not changed since last year
  • I have now reset the settings and then configured them again, just to be sure
  • I checked journalctl -k -b0 and didn't see anything obvious (the entry just before the segfault is from the boot two days earlier)

I will see if it occurs again, but if no one else reports similar issues then I agree it looks more like a hardware issue or something related to my specific config
 
I will see if it occurs again, but if no one else reports similar issues then I agree it looks more like a hardware issue or something related to my specific config

I have the same issue on my MS-01 system. It worked perfectly fine until I upgraded to Proxmox 9. Only after the upgrade did pvestatd start to suddenly stop, with this kind of error in the log:

[dom set 14 20:22:37 2025] pvestatd[2053]: segfault at 2b4c ip 000056eba86c772f sp 00007ffe1a040720 error 4 in perl[19872f,56eba8573000+1ae000] likely on CPU 5 (core 8, socket 0)

Even though I didn't think it was a hardware issue, I ran hardware tests on memory and CPU, and no issues were reported.
 
Even though I didn't think it was a hardware issue, I ran hardware tests on memory and CPU, and no issues were reported.
Which CPU tests have you run? A good test suite that usually shows signs of hardware trouble is stress-ng, as these kinds of errors usually appear when there's quite a load on the CPU. Otherwise, random segfaults of widespread executables (such as perl, dnsmasq, python, ...) on a stable kernel, occurring right after one another, are most of the time a sign of hardware issues.
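A possible invocation (the load figures and duration are just examples):

stress-ng --cpu 0 --cpu-method all --vm 2 --vm-bytes 75% --verify --metrics --timeout 4h
# --cpu 0 means "use all CPUs"; --verify re-checks computed results, which is what tends to expose flaky cores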
 
A good test suite that usually shows signs of hardware trouble is stress-ng, as these kinds of errors usually appear when there's quite a load on the CPU.

I will try it, thanks.

Otherwise, random segfaults of widespread executables (such as perl, dnsmasq, python, ...) on a stable kernel, occurring right after one another, are most of the time a sign of hardware issues.

I would agree with you if it weren't for the fact that I ran Proxmox 8.x for a year on the same hardware with no issues. How would you explain that?
 
I would agree with you if it weren't for the fact that I ran Proxmox 8.x for a year on the same hardware with no issues. How would you explain that?
Hardware wears out like any other component, and sometimes that is even exacerbated by implementation faults: for example, the 13th- and 14th-generation Intel Core 700 and 900 series had problems with overvoltage, which could leave cores permanently degraded or even fail them entirely. Another cause could be a loose cable, or a hardware configuration that was changed (e.g. through BIOS options).

All in all, an issue with the hardware is more likely when multiple unrelated binaries segfault that are in use by millions of people. But it's also only a first check to eliminate sources of error before looking at other things ;).