pve-ha-lrm segfault on a single HA node

crypsis

New Member
Aug 7, 2024
A single node in our HA-enabled Proxmox cluster crashed yesterday. The HA local resource manager (pve-ha-lrm) segfaulted, after which the watchdog was no longer updated, and that forced a reboot of the node. Below are the exact syslog lines that show the Perl warning (the likely root cause) and the resulting failure cascade.

Proxmox version: 9.1.9
Kernel: Linux 6.17.13-7-pve
Linux distro: Debian Trixie


Code:
2026-05-31T15:39:57.155Z Attempt to free unreferenced scalar: SV 0x5d19f136f858, Perl interpreter: 0x5d19ea3c22a0 at /usr/share/perl5/PVE/HA/LRM.pm line 871.
2026-05-31T15:39:57.155Z pve-ha-lrm: segfault at 55 ip 00005d19b06f46a2 sp 00007ffc31574af0 error 4 in perl[946a2,5d19b06a4000+1ae000] likely on CPU 22 (core 21, socket 0)
2026-05-31T15:39:57.155Z Code: 00 48 89 df e8 df 88 0f 00 e9 3d ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 41 54 55 48 89 fd 53 48 8b 7e 08 48 89 f3 4c 8b 66 10 <48> 63 47 04 83 f8 fe 74 3d f6 44 07 09 04 74 1e e8 c9 6c 14 00 48
2026-05-31T15:39:57.163Z client (PID 5875) did not stop watchdog - disable watchdog updates
2026-05-31T15:39:57.167Z pve-ha-lrm.service: Main process exited, code=killed, status=11/SEGV
2026-05-31T15:39:57.170Z pve-ha-lrm.service: Failed with result 'signal'.
2026-05-31T15:39:57.174Z pve-ha-lrm.service: Consumed 2d 20h 22min 5.771s CPU time, 226.7M memory peak.
2026-05-31T15:39:58.164Z exit watchdog-mux with active connections
2026-05-31T15:39:58.169Z watchdog: watchdog0: watchdog did not stop!
2026-05-31T15:39:58.172Z watchdog-mux.service: Deactivated successfully.
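For readers who want to interpret the kernel line above, here is a minimal Python sketch that pulls out the fault address and decodes the x86 page-fault error code (bit 0: page present, bit 1: write access, bit 2: user mode). The function name `decode_segfault` and the returned field names are made up for illustration. The sketch assumes the x86 message format shown in the log, where the error code is printed in hexadecimal.

```python
import re

# Matches the x86 kernel segfault message format seen in the log above.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Return the fault address and the decoded x86 page-fault error bits."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    err = int(m.group("err"), 16)
    return {
        "fault_address": int(m.group("addr"), 16),
        "instruction_pointer": int(m.group("ip"), 16),
        "page_present": bool(err & 0x1),   # 0 = page not mapped
        "access": "write" if err & 0x2 else "read",
        "mode": "user" if err & 0x4 else "kernel",
    }


line = (
    "pve-ha-lrm: segfault at 55 ip 00005d19b06f46a2 sp 00007ffc31574af0 "
    "error 4 in perl[946a2,5d19b06a4000+1ae000]"
)
info = decode_segfault(line)
print(info)
```

For this log line the result is a user-mode read of an unmapped page at address 0x55. That would be consistent with the interpreter dereferencing an almost-null pointer, though it does not by itself say why the pointer was invalid.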

Are there any known bugs in the current Proxmox 9 HA stack that produce this exact “Attempt to free unreferenced scalar” warning in pve-ha-lrm (LRM.pm line 871), followed by a segfault?
 
Hi!

Which exact version of the pve-ha-manager package was running on the node at the time the segfault happened?
 
As a follow-up and to eliminate other possible causes: Have such segfaults or memory corruption occurred before on the node? If some maintenance time is possible, does a memtest with a few passes complete successfully?
 
No, this is the first time such a segfault has happened on the node.
I have not run a memtest yet; I would need to schedule it outside of office hours since this is a production node.
Regarding possible memory corruption: the iLO health checks did not report any memory errors at the time of the segfault.
 
Could you share the full log of pve-ha-lrm on the node where it crashed before the "Attempt to free unreferenced scalar" error? And preferably the HA configuration (resources.cfg and rules.cfg) and the log output of the pve-ha-crm on the active master node at that time in the same time slice as well.
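One way to capture such a time slice is to query the journal per unit. The sketch below only builds the `journalctl` command line. The helper name `journalctl_cmd` and the 45-minute window are illustrative assumptions, not values from this thread. The `UTC` suffix on the timestamps relies on the systemd time syntax, and `--utc` makes the output use UTC as well.

```python
import shlex


def journalctl_cmd(unit, since, until, utc=True):
    """Build a journalctl argument list for one unit and one time window."""
    cmd = ["journalctl", "-u", unit, "--since", since, "--until", until, "--no-pager"]
    if utc:
        cmd.append("--utc")  # print timestamps in UTC instead of local time
    return cmd


# Hypothetical window around the crash; adjust to the period of interest.
cmd = journalctl_cmd(
    "pve-ha-lrm",
    "2026-05-31 15:00:00 UTC",
    "2026-05-31 15:45:00 UTC",
)
print(shlex.join(cmd))
```

The same helper can be called with `pve-ha-crm` on the master node so that both logs cover the same window.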

Also how many max_workers are configured in the datacenter options?

The "Attempt to free unreferenced scalar" error is a bit tricky, because it involves the Perl interpreter panicking. I suspect that it either is a threading issue or a memory corruption for now, but I'm looking to see if I can reproduce the issue.
 
I don't have any pve-ha-lrm logs from before the crash beyond the ones I already provided. I do have the full journalctl output, but it shows exactly the same lines that I obtained from the normal syslog and nothing more that is specific to pve-ha-lrm.

I have a full log of the pve-ha-crm on the master node, but again it only shows the first action taken after the crash had already happened on pve31-ams.

I have posted the resources, rules, and pve-ha-crm log as .txt attachments.

NOTE: When comparing the logs I provided earlier with the timestamps in "master-node-31-05-2026.txt", you need to add 2 hours to the earlier logs.
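The offset in the note above can be applied mechanically. This is a small sketch that assumes the earlier syslog timestamps are UTC (they end in `Z`) and that the other logs use a fixed UTC+2 offset. The function name `syslog_utc_to_local` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed +2 hour offset, as stated in the note above.
LOCAL_OFFSET = timezone(timedelta(hours=2))


def syslog_utc_to_local(stamp):
    """Convert a '2026-05-31T15:39:57.155Z' style stamp to the +2h local time."""
    # strptime is used so that the trailing "Z" also parses on older Python versions.
    utc = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)
    return utc.astimezone(LOCAL_OFFSET)


local = syslog_utc_to_local("2026-05-31T15:39:57.155Z")
print(local.strftime("%b %d %H:%M:%S"))
```

With this offset, the 15:39:57 entry from the first log lines up with 17:39:57 in local time.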

pve31-ams was the crashed server
pve30-ams was the master at that time

Maybe this is easier to compare. It is from the journalctl output of the crashed server, pve31-ams:
Code:
May 31 17:39:57 pve31-ams pve-ha-lrm[5875]: Attempt to free unreferenced scalar: SV 0x5d19f136f858, Perl interpreter: 0x5d19ea3c22a0 at /usr/share/perl5/PVE/HA/LRM.pm line 871.
May 31 17:39:57 pve31-ams kernel: pve-ha-lrm[5875]: segfault at 55 ip 00005d19b06f46a2 sp 00007ffc31574af0 error 4 in perl[946a2,5d19b06a4000+1ae000] likely on CPU 22 (core 21, socket 0)
May 31 17:39:57 pve31-ams kernel: Code: 00 48 89 df e8 df 88 0f 00 e9 3d ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 41 54 55 48 89 fd 53 48 8b 7e 08 48 89 f3 4c 8b 66 10 <48> 63 47 04 83 f8 fe 74 3d f6 44 07 09 04 74 1e e8 c9 6c 14 00 48
May 31 17:39:57 pve31-ams watchdog-mux[1941]: client (PID 5875) did not stop watchdog - disable watchdog updates
May 31 17:39:57 pve31-ams systemd[1]: pve-ha-lrm.service: Main process exited, code=killed, status=11/SEGV
May 31 17:39:57 pve31-ams systemd-journald[1301]: Received client request to sync journal.
May 31 17:39:57 pve31-ams systemd[1]: pve-ha-lrm.service: Failed with result 'signal'.
May 31 17:39:57 pve31-ams systemd[1]: pve-ha-lrm.service: Consumed 2d 20h 22min 5.771s CPU time, 226.7M memory peak.
May 31 17:39:58 pve31-ams watchdog-mux[1941]: exit watchdog-mux with active connections
May 31 17:39:58 pve31-ams systemd-journald[1301]: Received client request to sync journal.
May 31 17:39:58 pve31-ams kernel: watchdog: watchdog0: watchdog did not stop!
May 31 17:39:58 pve31-ams systemd[1]: watchdog-mux.service: Deactivated successfully.
 


Thanks for the info!

It's a bit unfortunate that the log has no entries before the error that would show which tasks the LRM was working on at the time.

Can I assume that the cluster is in the ballpark of 33+ nodes, judging by the node names? How many max_workers are configured in the datacenter options (/etc/pve/datacenter.cfg)?
 

The cluster has a total of 4 nodes: pve30, pve31, pve32, and pve33.
max_workers = 4

Contents of /etc/pve/datacenter.cfg

Code:
console: html5
ha: shutdown_policy=migrate
max_workers: 4
migration: secure,network=172.18.0.203/24
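For anyone who wants to check this setting across several nodes, the file above can be read with a few lines of Python. This is only a sketch that treats each line as `key: value`. It is not the parser Proxmox itself uses, and the function name `parse_datacenter_cfg` is made up for illustration.

```python
def parse_datacenter_cfg(text):
    """Parse simple 'key: value' lines into a dict, skipping blanks and comments."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so values may contain further colons.
        key, sep, value = line.partition(":")
        if sep:
            options[key.strip()] = value.strip()
    return options


# Contents copied from the post above; on a node you would read
# /etc/pve/datacenter.cfg instead.
sample = """\
console: html5
ha: shutdown_policy=migrate
max_workers: 4
migration: secure,network=172.18.0.203/24
"""

cfg = parse_datacenter_cfg(sample)
max_workers = int(cfg["max_workers"])
print(max_workers)
```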