pmxcfs segfaults

laowolf

I encountered a pmxcfs segfault several days ago. The following is my PVE version information.

root@pmx12:~# pveversion -v
proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
pve-manager: 4.3-1 (running version: 4.3-1/e7cdc165)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-88
pve-firmware: 1.1-9
libpve-common-perl: 4.0-73
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-61
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6.1-6
pve-container: 1.0-75
pve-firewall: 2.0-29
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.4-1
lxcfs: 2.0.3-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
zfsutils: 0.6.5.7-pve10~bpo80
openvswitch-switch: 2.5.0-1

The following is the error log from one node. Actually, all the nodes encountered the same error at the same time, and the pmxcfs daemon was killed by the kernel.


Nov 30 05:55:02 pmx12 pmxcfs[7329]: [status] notice: received log
Nov 30 05:56:57 pmx12 pmxcfs[7329]: [status] notice: received log
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [dcdb] notice: members: 2/8645, 3/7329, 4/7402, 5/9037, 6/7430
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [dcdb] notice: starting data syncronisation
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [status] notice: members: 2/8645, 3/7329, 4/7402, 5/9037, 6/7430
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [status] notice: starting data syncronisation
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [dcdb] notice: received sync request (epoch 2/8645/00000016)
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [status] notice: received sync request (epoch 2/8645/00000016)
Nov 30 05:57:01 pmx12 kernel: show_signal_msg: 3 callbacks suppressed
Nov 30 05:57:01 pmx12 kernel: cfs_loop[7330]: segfault at 7efd95c6d17c ip 000000000041ad90 sp 00007efd35c28428 error 4 in pmxcfs[400000+28000]
Nov 30 05:57:01 pmx12 systemd[1]: pve-cluster.service: main process exited, code=killed, status=11/SEGV
Nov 30 05:57:01 pmx12 systemd[1]: Unit pve-cluster.service entered failed state.
Nov 30 05:57:04 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Transport endpoint is not connected
Nov 30 05:57:04 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:04 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:05 pmx12 pve-ha-lrm[13303]: ipcc_send_rec failed: Transport endpoint is not connected
Nov 30 05:57:05 pmx12 pve-ha-lrm[13303]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:05 pmx12 pve-ha-lrm[13303]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:09 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:09 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:09 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:10 pmx12 pve-ha-lrm[13303]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:10 pmx12 pve-ha-lrm[13303]: ipcc_send_rec failed: Connection refused

My questions are:
1. How can I get more information about the cause of the fault?
2. Is this a known bug? If so, how can I fix it? If not, how can I avoid it?

Any idea or suggestions?
 
Hi diemater, thanks for the concern and the reply.
The problem happened unexpectedly in the early morning of 2016-11-30. I have never encountered this before and have no idea how to reproduce it.
I'd like to dig into the case, but unfortunately I don't know how to.
Could you give me some hints on getting more information about the bug?
 
Could you give me some hints on getting more information about the bug?

This is a known issue, but it is not very frequent (about 3 reports in 5 years). The problem is that the code looks correct, and we are unable to reproduce the issue here...
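
If it happens again, a core dump from pmxcfs would help a lot. A minimal sketch for enabling one for pve-cluster.service (assuming systemd and a writable /var/crash directory; paths and file names are only examples, adjust to taste):

Code:
# allow the service to write core files via a systemd drop-in
mkdir -p /etc/systemd/system/pve-cluster.service.d
cat > /etc/systemd/system/pve-cluster.service.d/coredump.conf <<'EOF'
[Service]
LimitCORE=infinity
EOF
# write cores to a fixed location
mkdir -p /var/crash
echo 'kernel.core_pattern=/var/crash/core.%e.%p.%t' > /etc/sysctl.d/90-coredump.conf
sysctl --system
systemctl daemon-reload
systemctl restart pve-cluster

Note that restarting pve-cluster briefly restarts pmxcfs. After the next crash, "gdb /usr/bin/pmxcfs /var/crash/core.pmxcfs.<pid>.<time>" followed by "bt full" should produce a usable backtrace.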
 
This is a known issue, but it is not very frequent (about 3 reports in 5 years). The problem is that the code looks correct, and we are unable to reproduce the issue here...

I have run Proxmox for about 3 years, and this is the first time I have encountered this problem.
I'll keep a close watch on my Proxmox cluster. But how can I preserve the useful information when it happens again?
 
But how can I preserve the useful information when it happens again?

The kernel log points to the address (ip 000000000041ad90), and objdump reveals that this is at logger.c:184 - any information on how to reproduce the issue would be helpful (what did you do at that time, anything special?)
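
For reference, the mapping itself can be reproduced with binutils; a sketch, assuming the binary lives at /usr/bin/pmxcfs and was built with debug info (without it you only get the function name). "error 4" in the kernel line means a user-mode read of an unmapped address.

Code:
# pmxcfs is mapped at 0x400000 ("pmxcfs[400000+28000]" in the kernel line),
# so the instruction pointer 0x41ad90 can be resolved directly:
addr2line -f -e /usr/bin/pmxcfs 0x41ad90
# or look at the surrounding disassembly:
objdump -d /usr/bin/pmxcfs | grep -A 5 '41ad90:'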
 
The kernel log points to the address (ip 000000000041ad90), and objdump reveals that this is at logger.c:184 - any information on how to reproduce the issue would be helpful (what did you do at that time, anything special?)
It happened in the early morning of 2016-11-30. I think no one was working at that time and nothing was being done.
 
My 2 cents:
Code:
[Wed Feb 13 05:24:44 2019] perf: interrupt took too long (4931 > 4920), lowering kernel.perf_event_max_sample_rate to 40500
[Wed Apr 10 16:24:53 2019] cfs_loop[6168]: segfault at 7f3bad915000 ip 00007f3bad08378a sp 00007f3ba4c323a8 error 4 in libc-2.24.so[7f3bad000000+195000]
[Sun Apr 28 14:13:13 2019] hrtimer: interrupt took 10710 ns

I was investigating why qemu-ga in one of my Windows VMs is using a lot of CPU and found this.
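
For what it's worth, the faulting address can be mapped the same way as in the earlier reply; a sketch, assuming the stock Debian stretch libc path and the libc6-dbg package for file/line information:

Code:
# libc-2.24.so is mapped at 0x7f3bad000000, so the offset of the fault is
# 0x7f3bad08378a - 0x7f3bad000000 = 0x8378a
addr2line -f -e /lib/x86_64-linux-gnu/libc-2.24.so 0x8378a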

# pveversion -v
proxmox-ve: 5.3-1 (running kernel: 4.15.18-11-pve)
pve-manager: 5.3-9 (running version: 5.3-9/ba817b29)
pve-kernel-4.15: 5.3-2
pve-kernel-4.15.18-11-pve: 4.15.18-33
pve-kernel-4.13.4-1-pve: 4.13.4-26
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.40-1-pve: 4.4.40-82
pve-kernel-4.4.35-2-pve: 4.4.35-79
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.4.16-1-pve: 4.4.16-64
pve-kernel-4.4.15-1-pve: 4.4.15-60
pve-kernel-4.4.6-1-pve: 4.4.6-48
ceph: 12.2.10-pve1
corosync: 2.4.4-pve1
criu: 2.11.1-1~bpo90
glusterfs-client: 3.8.8-1
ksm-control-daemon: 1.2-2
libjs-extjs: 6.0.1-2
libpve-access-control: 5.1-3
libpve-apiclient-perl: 2.0-5
libpve-common-perl: 5.0-46
libpve-guest-common-perl: 2.0-20
libpve-http-server-perl: 2.0-11
libpve-storage-perl: 5.0-37
libqb0: 1.0.3-1~bpo9
lvm2: 2.02.168-pve6
lxc-pve: 3.1.0-2
lxcfs: 3.0.2-2
novnc-pve: 1.0.0-2
proxmox-widget-toolkit: 1.0-22
pve-cluster: 5.0-33
pve-container: 2.0-34
pve-docs: 5.3-2
pve-edk2-firmware: 1.20181023-1
pve-firewall: 3.0-17
pve-firmware: 2.0-6
pve-ha-manager: 2.0-6
pve-i18n: 1.0-9
pve-libspice-server1: 0.14.1-2
pve-qemu-kvm: 2.12.1-1
pve-xtermjs: 3.10.1-1
qemu-server: 5.0-46
smartmontools: 6.5+svn4324-1
spiceterm: 3.0-5
vncterm: 1.5-3
zfsutils-linux: 0.7.12-pve1~bpo1
 
