I ran into a pmxcfs segfault several days ago. Here is my PVE version information:
root@pmx12:~# pveversion -v
proxmox-ve: 4.3-66 (running kernel: 4.4.19-1-pve)
pve-manager: 4.3-1 (running version: 4.3-1/e7cdc165)
pve-kernel-4.4.6-1-pve: 4.4.6-48
pve-kernel-4.2.2-1-pve: 4.2.2-16
pve-kernel-4.4.19-1-pve: 4.4.19-66
pve-kernel-4.2.3-2-pve: 4.2.3-22
lvm2: 2.02.116-pve3
corosync-pve: 2.4.0-1
libqb0: 1.0-1
pve-cluster: 4.0-46
qemu-server: 4.0-88
pve-firmware: 1.1-9
libpve-common-perl: 4.0-73
libpve-access-control: 4.0-19
libpve-storage-perl: 4.0-61
pve-libspice-server1: 0.12.8-1
vncterm: 1.2-1
pve-qemu-kvm: 2.6.1-6
pve-container: 1.0-75
pve-firewall: 2.0-29
pve-ha-manager: 1.0-35
ksm-control-daemon: 1.2-1
glusterfs-client: 3.5.2-2+deb8u2
lxc-pve: 2.0.4-1
lxcfs: 2.0.3-pve1
criu: 1.6.0-1
novnc-pve: 0.5-8
zfsutils: 0.6.5.7-pve10~bpo80
openvswitch-switch: 2.5.0-1
Below is the error log from one node. In fact, all nodes hit the same error at the same time, and the pmxcfs daemon was killed by the kernel on each of them.
Nov 30 05:55:02 pmx12 pmxcfs[7329]: [status] notice: received log
Nov 30 05:56:57 pmx12 pmxcfs[7329]: [status] notice: received log
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [dcdb] notice: members: 2/8645, 3/7329, 4/7402, 5/9037, 6/7430
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [dcdb] notice: starting data syncronisation
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [status] notice: members: 2/8645, 3/7329, 4/7402, 5/9037, 6/7430
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [status] notice: starting data syncronisation
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [dcdb] notice: received sync request (epoch 2/8645/00000016)
Nov 30 05:57:01 pmx12 pmxcfs[7329]: [status] notice: received sync request (epoch 2/8645/00000016)
Nov 30 05:57:01 pmx12 kernel: show_signal_msg: 3 callbacks suppressed
Nov 30 05:57:01 pmx12 kernel: cfs_loop[7330]: segfault at 7efd95c6d17c ip 000000000041ad90 sp 00007efd35c28428 error 4 in pmxcfs[400000+28000]
Nov 30 05:57:01 pmx12 systemd[1]: pve-cluster.service: main process exited, code=killed, status=11/SEGV
Nov 30 05:57:01 pmx12 systemd[1]: Unit pve-cluster.service entered failed state.
Nov 30 05:57:04 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Transport endpoint is not connected
Nov 30 05:57:04 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:04 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:05 pmx12 pve-ha-lrm[13303]: ipcc_send_rec failed: Transport endpoint is not connected
Nov 30 05:57:05 pmx12 pve-ha-lrm[13303]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:05 pmx12 pve-ha-lrm[13303]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:09 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:09 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:09 pmx12 pve-ha-crm[13291]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:10 pmx12 pve-ha-lrm[13303]: ipcc_send_rec failed: Connection refused
Nov 30 05:57:10 pmx12 pve-ha-lrm[13303]: ipcc_send_rec failed: Connection refused
My questions are:
1. How can I get further information about the cause of the segfault?
2. Is this a known bug? If so, how can I fix it? If not, how can I avoid it?
Any ideas or suggestions?
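For what it's worth, here is a minimal Python sketch of how I read the kernel segfault line from the log above. It is only an illustration, not a diagnosis: the helper name decode_segfault and the field names are my own, and I am assuming the usual x86 page-fault error code bits (bit 0 = page was present, bit 1 = write access, bit 2 = user mode). The offset is simply ip minus the mapping base printed in the brackets.

```python
import re

# Matches the kernel "segfault at ..." line in the form shown in the log above.
# All numeric fields on that line are hexadecimal.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    info = {
        "thread": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "ip": ip,
        # Assumed x86 page-fault error code bits (see lead-in).
        "page_present": bool(err & 1),
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }
    if m.group("obj"):
        info["object"] = m.group("obj")
        # Offset of the faulting instruction inside the mapped object.
        info["offset"] = ip - int(m.group("base"), 16)
    return info


if __name__ == "__main__":
    log_line = (
        "Nov 30 05:57:01 pmx12 kernel: cfs_loop[7330]: segfault at 7efd95c6d17c "
        "ip 000000000041ad90 sp 00007efd35c28428 error 4 in pmxcfs[400000+28000]"
    )
    info = decode_segfault(log_line)
    print(info["thread"], info["mode"], info["access"],
          "page_present=%s" % info["page_present"], hex(info["offset"]))
```

If I read it correctly, "error 4" means a user-mode read of an unmapped page at 7efd95c6d17c, and the faulting instruction is at offset 0x1ad90 inside pmxcfs. That only tells me where it crashed, not why; mapping the address to a function would still need debug symbols or a core dump.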