segfault on all cluster nodes

tkropf

Renowned Member
Dec 3, 2014
5
0
66
Hi forum,

we run a six node pve/ceph cluster with two corosync rings.

Yesterday we have a segfault on all nodes a the same time. Every node make a reboot.

Feb 25 13:16:01 node1 systemd[1]: Started Proxmox VE replication runner.
Feb 25 13:16:01 node1 pve-ha-crm[4427]: service 'vm:3064' without node
Feb 25 13:16:04 node1 kernel: [2668121.457529] server[3363]: segfault at 7efe5d546c79 ip 000055f06e82b7b9 sp 00007efe53ffea10 error 4 in pmxcfs[55f06e80e0
00+2b000]
Feb 25 13:16:04 node1 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
Feb 25 13:16:04 node1 systemd[1]: pve-cluster.service: Unit entered failed state.
Feb 25 13:16:04 node1 systemd[1]: pve-cluster.service: Failed with result 'signal'.
Feb 25 13:16:05 node1 pve-ha-lrm[6471]: updating service status from manager failed: Connection refused
Feb 25 13:16:05 node1 pve-ha-lrm[6471]: lost lock 'ha_agent_node1_lock - can't create '/etc/pve/priv/lock' (pmxcfs not mounted?)
Feb 25 13:16:06 node1 pveproxy[188716]: ipcc_send_rec[1] failed: Connection refused
Feb 25 13:16:06 node1 pveproxy[188716]: ipcc_send_rec[2] failed: Connection refused
Feb 25 13:16:06 node1 pveproxy[188716]: ipcc_send_rec[3] failed: Connection refused
Feb 25 13:16:06 node1 pveproxy[195960]: ipcc_send_rec[1] failed: Connection refused

We see a lot of messages simular to this in the syslog

Feb 25 13:16:01 node1 pve-ha-crm[4427]: service 'vm:3064' without node

HA-manager dont'n know the ressource 3064 but we see him at /etc/pve/ha/resources.cfg as ignored.

We have no network problem on the corosync nics and switches.

Anybody know this problem?
 
Hi. This is something we heard of, but at a very low occurrence and not reproducible here. It's highly probably a bug in pmxcfs where the log ringbuffer gets thrown off and accesses a non-existend entry address, leading to a segmentation fault... :/

Interestingly we had a report yestereday in the german forum ( https://forum.proxmox.com/threads/a...cluster-mit-12-nodes-segfault-cfs_loop.51913/ ) and one quite a bit ago ( https://forum.proxmox.com/threads/pmxcfs-segfaults.30823/ )... also your case is related to them, if I check where the segfault happened:

Code:
# addr2line -e /usr/bin/pmxcfs -fCi $(printf "%x" $[0x000055f06e82b7b9 - 0x55f06e80e000])   
clog_dump_json                                                               
/home/builder/source/build/src/logger.c:166

It's always a different location but always the clog involved... Could you turn on coredumps for a chance this happens again, with a coredump (e.g., by installing systemd-coredump) we'd have much more information, as we sadly cannot reproduce this here yet...