pmxcfs segfaults

laowolf

I encountered a pmxcfs segfault several days ago. The following is my PVE version information.

root@pmx12:~# pveversion -v
proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
pve-manager: 4.3-1 (running version: 4.3-1/e7cdc165)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-88
pve-firmware: 1.1-9
libpve-common-perl: 4.0-73
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-61
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6.1-6
pve-container: 1.0-75
pve-firewall: 2.0-29
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.4-1
lxcfs: 2.0.3-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
zfsutils: 0.6.5.7-pve10~bpo80
openvswitch-switch: 2.5.0-1

The following is the error log from one node. Actually, all the nodes encountered the same error at the same time, and the pmxcfs daemon was killed by the kernel.


Nov 30 05:55:02 pmx12 pmxcfs[7329]: [status] notice: received log
Nov 30 05:56:57 pmx12 pmxcfs[7329]: [status] notice: received log
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [dcdb] notice: members: 2/8645, 3/7329, 4/7402, 5/9037, 6/7430
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [dcdb] notice: starting data syncronisation
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [status] notice: members: 2/8645, 3/7329, 4/7402, 5/9037, 6/7430
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [status] notice: starting data syncronisation
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [dcdb] notice: received sync request (epoch 2/8645/00000016)
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [status] notice: received sync request (epoch 2/8645/00000016)
Nov 30 05:57:01 pmx12 kernel: show_signal_msg: 3 callbacks suppressed
Nov 30 05:57:01 pmx12 kernel: cfs_loop[7330]: segfault at 7efd95c6d17c ip 000000000041ad90 sp 00007efd35c28428 error 4 in pmxcfs[400000+28000]
Nov 30 05:57:01 pmx12 systemd[1]: pve-cluster.service: main process exited, code=killed, status=11/SEGV
Nov 30 05:57:01 pmx12 systemd[1]: Unit pve-cluster.service entered failed state.
Nov 30 05:57:04 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Transport endpoint is not connected
Nov 30 05:57:04 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:04 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:05 pmx12 pve-ha-lrm[13303]: ipcc_send_rec failed: Transport endpoint is not connected
Nov 30 05:57:05 pmx12 pve-ha-lrm[13303]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:05 pmx12 pve-ha-lrm[13303]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:09 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:09 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:09 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:10 pmx12 pve-ha-lrm[13303]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:10 pmx12 pve-ha-lrm[13303]: ipcc_send_rec failed: Connection refused

My questions are:
1. How can I get more information about the cause of the fault?
2. Is this a known bug? If so, how can I fix it? If not, how can I avoid it?

Any idea or suggestions?
 
Hi diemater, thanks for the concern and the reply.
The problem happened unexpectedly in the early morning of 2016-11-30. I have never encountered this before and have no idea how to reproduce it.
I'd like to dig into the case, but unfortunately I don't know how to.
Could you give me some hints on getting more information about the bug?
 
Could you give me some hints on getting more information about the bug?

This is a known issue, but it is not very frequent (about 3 reports in 5 years). The problem is that the code looks correct, and we are unable to reproduce the issue here...
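
If it happens again, a core dump from pmxcfs would help a lot. A minimal sketch for enabling one for pve-cluster.service (assuming systemd and a writable /var/crash directory; paths and file names are only examples, adjust to taste):

Code:
# allow the service to write core files via a systemd drop-in
mkdir -p /etc/systemd/system/pve-cluster.service.d
cat > /etc/systemd/system/pve-cluster.service.d/coredump.conf <<'EOF'
[Service]
LimitCORE=infinity
EOF
# write cores to a fixed location
mkdir -p /var/crash
echo 'kernel.core_pattern=/var/crash/core.%e.%p.%t' > /etc/sysctl.d/90-coredump.conf
sysctl --system
systemctl daemon-reload
systemctl restart pve-cluster

Note that restarting pve-cluster briefly restarts pmxcfs. After the next crash, "gdb /usr/bin/pmxcfs /var/crash/core.pmxcfs.<pid>.<time>" followed by "bt full" should produce a usable backtrace.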
 
This is a known issue, but it is not very frequent (about 3 reports in 5 years). The problem is that the code looks correct, and we are unable to reproduce the issue here...

I have run Proxmox for about 3 years, and this is the first time I have encountered this problem.
I'll keep a close watch on my Proxmox cluster. But how can I preserve the useful information when it happens again?
 
But how can I preserve the useful information when it happens again?

The kernel log points to the address (ip 000000000041ad90), and objdump reveals that this is at logger.c:184 - any information on how to reproduce the issue would be helpful (what did you do at that time, anything special?)
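
For reference, the mapping itself can be reproduced with binutils; a sketch, assuming the binary lives at /usr/bin/pmxcfs and was built with debug info (without it you only get the function name). "error 4" in the kernel line means a user-mode read of an unmapped address.

Code:
# pmxcfs is mapped at 0x400000 ("pmxcfs[400000+28000]" in the kernel line),
# so the instruction pointer 0x41ad90 can be resolved directly:
addr2line -f -e /usr/bin/pmxcfs 0x41ad90
# or look at the surrounding disassembly:
objdump -d /usr/bin/pmxcfs | grep -A 5 '41ad90:'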
 
The kernel log points to the address (ip 000000000041ad90), and objdump reveals that this is at logger.c:184 - any information on how to reproduce the issue would be helpful (what did you do at that time, anything special?)
It happened in the early morning of 2016-11-30. I think no one was working at that time and nothing was being done.
 
My 2 cents:
Code:
[Wed Feb 13 05:24:44 2019] perf: interrupt took too long (4931 > 4920), lowering kernel.perf_event_max_sample_rate to 40500
[Wed Apr 10 16:24:53 2019] cfs_loop[6168]: segfault at 7f3bad915000 ip 00007f3bad08378a sp 00007f3ba4c323a8 error 4 in libc-2.24.so[7f3bad000000+195000]
[Sun Apr 28 14:13:13 2019] hrtimer: interrupt took 10710 ns

I was investigating why qemu-ga in one of my Windows VMs is using a lot of CPU and found this.
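
For what it's worth, the faulting address can be mapped the same way as in the earlier reply; a sketch, assuming the stock Debian stretch libc path and the libc6-dbg package for file/line information:

Code:
# libc-2.24.so is mapped at 0x7f3bad000000, so the offset of the fault is
# 0x7f3bad08378a - 0x7f3bad000000 = 0x8378a
addr2line -f -e /lib/x86_64-linux-gnu/libc-2.24.so 0x8378a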

# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-11-pve)
pve-manager: 5.3-9 (running version: 5.3-9/ba817b29)
pve-kernel-4.15: 5.3-2
pve-kernel-4.15.18-11-pve: 4.15.18-33
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.40-1-pve: 4.4.40-82
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.4.16-1-pve: 4.4.16-64
pve-kernel-4.4.15-1-pve: 4.4.15-60
pve-kernel-4.4.6-1-pve: 4.4.6-48
ceph: 12.2.10-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-46
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-37
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-34
pve-docs: 5.3-2
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-46
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
 
