segfault on all cluster nodes

tkropf

Member
Dec 3, 2014
Hi forum,

we run a six-node PVE/Ceph cluster with two corosync rings.

Yesterday we had a segfault on all nodes at the same time. Every node rebooted.

Feb 25 13:16:01 node1 systemd[1]: Started Proxmox VE replication runner.
Feb 25 13:16:01 node1 pve-ha-crm[4427]: service 'vm:3064' without node
Feb 25 13:16:04 node1 kernel: [2668121.457529] server[3363]: segfault at 7efe5d546c79 ip 000055f06e82b7b9 sp 00007efe53ffea10 error 4 in pmxcfs[55f06e80e000+2b000]
Feb 25 13:16:04 node1 systemd[1]: pve-cluster.service: Main process exited, code=killed, status=11/SEGV
Feb 25 13:16:04 node1 systemd[1]: pve-cluster.service: Unit entered failed state.
Feb 25 13:16:04 node1 systemd[1]: pve-cluster.service: Failed with result 'signal'.
Feb 25 13:16:05 node1 pve-ha-lrm[6471]: updating service status from manager failed: Connection refused
Feb 25 13:16:05 node1 pve-ha-lrm[6471]: lost lock 'ha_agent_node1_lock - can't create '/etc/pve/priv/lock' (pmxcfs not mounted?)
Feb 25 13:16:06 node1 pveproxy[188716]: ipcc_send_rec[1] failed: Connection refused
Feb 25 13:16:06 node1 pveproxy[188716]: ipcc_send_rec[2] failed: Connection refused
Feb 25 13:16:06 node1 pveproxy[188716]: ipcc_send_rec[3] failed: Connection refused
Feb 25 13:16:06 node1 pveproxy[195960]: ipcc_send_rec[1] failed: Connection refused

We see a lot of messages similar to this one in the syslog:

Feb 25 13:16:01 node1 pve-ha-crm[4427]: service 'vm:3064' without node

The HA manager doesn't know the resource 3064, but we see it in /etc/pve/ha/resources.cfg marked as ignored.

We have no network problems on the corosync NICs or switches.

Does anybody know this problem?
 
Hi. This is something we have heard of, but at a very low occurrence, and it is not reproducible here. It's very probably a bug in pmxcfs where the log ring buffer gets thrown off and accesses a non-existent entry address, leading to a segmentation fault... :/

Interestingly, we had a report yesterday in the German forum ( https://forum.proxmox.com/threads/a...cluster-mit-12-nodes-segfault-cfs_loop.51913/ ) and one quite a while ago ( https://forum.proxmox.com/threads/pmxcfs-segfaults.30823/ )... Your case also seems related to them, judging by where the segfault happened:

Code:
# addr2line -e /usr/bin/pmxcfs -fCi $(printf "%x" $[0x000055f06e82b7b9 - 0x55f06e80e000])   
clog_dump_json                                                               
/home/builder/source/build/src/logger.c:166
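For reference, the offset passed to addr2line above is simply the faulting instruction pointer minus the load address of the pmxcfs mapping, both taken from the kernel segfault line in the log. A minimal sketch of that arithmetic in plain shell:

```shell
# Values from the kernel segfault line above:
# "ip 000055f06e82b7b9 ... in pmxcfs[55f06e80e000+2b000]"
ip=0x000055f06e82b7b9    # faulting instruction pointer
base=0x55f06e80e000      # load address of the pmxcfs mapping

# Offset inside the binary, the value addr2line can symbolize
printf '%x\n' $(( ip - base ))   # prints 1d7b9
```

Note that the resulting offset (0x1d7b9) falls inside the mapping size reported by the kernel (0x2b000), which is a quick sanity check that the right mapping was used.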

It's always a different location, but the clog is always involved... Could you turn on coredumps in case this happens again? With a coredump (e.g., by installing systemd-coredump) we'd have much more information, as we sadly cannot reproduce this here yet...
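For anyone else following along, a minimal sketch of how that could look on a Debian-based PVE node (the package and the coredumpctl subcommands are standard systemd tooling; the output path is just an example):

```shell
# Install the systemd coredump handler so future crashes are captured
apt install systemd-coredump

# After the next crash, list captured dumps for pmxcfs
coredumpctl list pmxcfs

# Show metadata (signal, timestamp, backtrace summary) for the latest one
coredumpctl info pmxcfs

# Extract the core file itself, e.g. to attach to a bug report
coredumpctl dump pmxcfs -o /tmp/pmxcfs.core
```

With the core file plus the matching pmxcfs binary, the developers can inspect the exact state of the log ring buffer at crash time instead of only the faulting address.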
 
