PVE6.0-5: Corosync3 segvaults randomly on nodes

astnwt · Aug 12, 2019

Hey guys,

we updated from PVE5 to PVE6 recently and noticed that nodes on our 4-node cluster leave randomly. Checking pvecm status states that CMAP cannot be initialized, so I had a look at corosync on the failed node only to learn that it obviously segfaulted.

This happened on 3 of 4 cluster nodes since we upgraded. Of course I could apply some nasty workaround like a shellscript watchdog that fires up corosync again after it died - but I really would like to fix the underlying problem.

Aug 12 21:03:16 scp4 pmxcfs[1572]: [status] crit: received short message (0 bytes)
Aug 12 21:03:16 scp4 pmxcfs[1572]: [status] crit: leaving CPG group
Aug 12 21:03:16 scp4 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Aug 12 21:03:16 scp4 systemd[1]: corosync.service: Failed with result 'signal'.
Aug 12 21:03:16 scp4 pmxcfs[1572]: [dcdb] crit: cpg_send_message failed: 2
Aug 12 21:03:16 scp4 pmxcfs[1572]: [dcdb] crit: cpg_send_message failed: 2
Aug 12 21:03:16 scp4 pmxcfs[1572]: [dcdb] crit: cpg_send_message failed: 2
[...]

Firing up corosync3 again, using systemctl start corosync works, but I don't know how long and I'm not used to "any" trouble in PVE5, which was rock solid with all its underlying components.

Greetings from Lower Austria,

- Daniel

dietmar · Aug 13, 2019

Please can you open a bug report at bugzilla.proxmox.com ?

andrew sp · Aug 13, 2019

+1
same here, random corosync SEGVs on PVE6.0 cluster

astnwt · Aug 13, 2019

@dietmar: Thanks for reaching out. I followed your advice and opened a case on https://bugzilla.proxmox.com/show_bug.cgi?id=2326 a few minutes ago.

For the sake of visibility and documentation:

The problem seems to include KNET, at least tonight the cluster disintegrated again (without segfaulting,
but it was still corosync / KNET related).

If somebody has a comparable situation, some further post-mortem info
in the adjusted and stripped down syslog I'm attaching to this post:

Notable events in attached logs:
00:17 - KNET: Link seems to be lost (it's not, our monitoring still reaches the host)
00:20 - KNET: complains about MTU issues
04:35 - KNET: Link seems to be lost (again, our monitoring does not confirm this)
05:24 - KNET/Corosync: Now things start to get out of hand. Cluster falls apart..
10:12 - INTERVENTION:
We notice that since 05:24, 2 of our 4 cluster nodes are "gone", logging in on https/8006 does not work (realm list does not get populated, firing up the web interface in the browser itself took a minute). Machine itself is perfectly reachable via SSH. pvecm status stalls without output. I choose to restart corosync (systemctl restart corosync).
10:12 - After restarting corosync, both missing cluster nodes join again, all issues are suddenly gone, everything turns back to normal operation.

Hope that helps others who are experiencing something along these lines.

RokaKen · Aug 13, 2019

This issue seems to be related to PVE 5.4-11 + Corosync 3.x: major issues

astnwt · Aug 13, 2019

@RokaKen: besides the SEGV that happened yesterday, the whole issue is the same it seems, yes.
Thanks for pointing that out - I stumbled upon the issue when I started to hunt down the coroync SEGV - but when the problems got worse this night I did not remember the thread.

I keep this open for the SEGV part though and will start watching the other thread

fips · Sep 6, 2019

+1
I have got the same problem with our corosync 3 PVE6 cluster

tom · Sep 6, 2019

fips said:
I have got the same problem with our corosync 3 PVE6 cluster

please post your pveversion -v.

fips · Sep 6, 2019

Sure:

Code:

root@vmb2:~# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.18-1-pve)
pve-manager: 6.0-5 (running version: 6.0-5/f8a710d7)
pve-kernel-5.0: 6.0-6
pve-kernel-helper: 6.0-6
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.4.98-6-pve: 4.4.98-107
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.76-1-pve: 4.4.76-94
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph: 14.2.1-pve2
ceph-fuse: 14.2.1-pve2
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve2
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-3
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-63
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-5
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1

tom · Sep 6, 2019

you missed latest updates for libknet, please upgrade and compare your list with:

https://pve.proxmox.com/wiki/Downlo...Proxmox_Virtual_Environment_6.x_to_latest_6.0

if you still see issues after applying the updates, please post again.

chrigiboy · Oct 2, 2019

Same issue here, corosync.service 11/SEGV

Oct 1 02:35:00 server3 systemd[1]: pvesr.service: Succeeded.
Oct 1 02:35:00 server3 systemd[1]: Started Proxmox VE replication runner.
Oct 1 02:35:20 server3 corosync[51563]: [TOTEM ] Retransmit List: 15f68
Oct 1 02:35:39 server3 corosync[51563]: [TOTEM ] Retransmit List: 15fdb
Oct 1 02:35:59 server3 pmxcfs[33528]: [status] crit: cpg_dispatch failed: 2
Oct 1 02:35:59 server3 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Oct 1 02:35:59 server3 pmxcfs[33528]: [status] crit: cpg_leave failed: 2
Oct 1 02:35:59 server3 systemd[1]: corosync.service: Failed with result 'signal'.
Oct 1 02:35:59 server3 pmxcfs[33528]: [quorum] crit: quorum_dispatch failed: 2
Oct 1 02:35:59 server3 pmxcfs[33528]: [status] notice: node lost quorum
Oct 1 02:35:59 server3 pmxcfs[33528]: [dcdb] crit: cpg_dispatch failed: 2
Oct 1 02:35:59 server3 pmxcfs[33528]: [dcdb] crit: cpg_leave failed: 2
Oct 1 02:35:59 server3 pmxcfs[33528]: [confdb] crit: cmap_dispatch failed: 2
Oct 1 02:35:59 server3 pve-ha-lrm[1832]: unable to write lrm status file - unable to open file '/etc/pve/nodes/server3/lrm_status.tmp.1832' - Permission denied
Oct 1 02:36:00 server3 systemd[1]: Starting Proxmox VE replication runner...
Oct 1 02:36:00 server3 pmxcfs[33528]: [quorum] crit: quorum_initialize failed: 2
Oct 1 02:36:00 server3 pmxcfs[33528]: [quorum] crit: can't initialize service
Oct 1 02:36:00 server3 pmxcfs[33528]: [confdb] crit: cmap_initialize failed: 2
Oct 1 02:36:00 server3 pmxcfs[33528]: [confdb] crit: can't initialize service
Oct 1 02:36:00 server3 pmxcfs[33528]: [dcdb] notice: start cluster connection
Oct 1 02:36:00 server3 pmxcfs[33528]: [dcdb] crit: cpg_initialize failed: 2
Oct 1 02:36:00 server3 pmxcfs[33528]: [dcdb] crit: can't initialize service
Oct 1 02:36:00 server3 pmxcfs[33528]: [status] notice: start cluster connection
Oct 1 02:36:00 server3 pmxcfs[33528]: [status] crit: cpg_initialize failed: 2
Oct 1 02:36:00 server3 pmxcfs[33528]: [status] crit: can't initialize service
Oct 1 02:36:01 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:02 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:03 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:04 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:05 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:06 server3 pmxcfs[33528]: [quorum] crit: quorum_initialize failed: 2
Oct 1 02:36:06 server3 pmxcfs[33528]: [confdb] crit: cmap_initialize failed: 2
Oct 1 02:36:06 server3 pmxcfs[33528]: [dcdb] crit: cpg_initialize failed: 2
Oct 1 02:36:06 server3 pmxcfs[33528]: [status] crit: cpg_initialize failed: 2
Oct 1 02:36:06 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:07 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:08 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:09 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:10 server3 pvesr[56406]: error with cfs lock 'file-replication_cfg': no quorum!
Oct 1 02:36:10 server3 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Oct 1 02:36:10 server3 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Oct 1 02:36:10 server3 systemd[1]: Failed to start Proxmox VE replication runner.
Oct 1 02:36:12 server3 pmxcfs[33528]: [quorum] crit: quorum_initialize failed: 2
Oct 1 02:36:12 server3 pmxcfs[33528]: [confdb] crit: cmap_initialize failed: 2
Oct 1 02:36:12 server3 pmxcfs[33528]: [dcdb] crit: cpg_initialize failed: 2
Oct 1 02:36:12 server3 pmxcfs[33528]: [status] crit: cpg_initialize failed: 2
Oct 1 02:36:18 server3 pmxcfs[33528]: [quorum] crit: quorum_initialize failed: 2
Oct 1 02:36:18 server3 pmxcfs[33528]: [confdb] crit: cmap_initialize failed: 2
Oct 1 02:36:18 server3 pmxcfs[33528]: [dcdb] crit: cpg_initialize failed: 2
Oct 1 02:36:18 server3 pmxcfs[33528]: [status] crit: cpg_initialize failed: 2

root@server3:~# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-7 (running version: 6.0-7/28984024)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.15-1-pve: 5.0.15-1
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.12-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-8
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2

Stoiko Ivanov · Oct 2, 2019

Christian Geissler said:
libknet1: 1.11-pve1

there has been an update to libknet, which should address the issue.
please update to the latest version and reboot

alebeta · Oct 3, 2019

Hi friends, I have the same issue...

strange because the node had been updated few days ago, and has the latest updates.

Related logs:

Oct 3 03:10:30 proxmox07 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Oct 3 03:10:30 proxmox07 systemd[1]: corosync.service: Failed with result 'signal'.
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_dispatch failed: 2
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_leave failed: 2
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [confdb] crit: cmap_dispatch failed: 2
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [quorum] crit: quorum_dispatch failed: 2
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [status] notice: node lost quorum
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [status] crit: cpg_dispatch failed: 2
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [status] crit: cpg_leave failed: 2
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [status] crit: cpg_dispatch failed: 2
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [status] crit: cpg_leave failed: 2
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [quorum] crit: quorum_initialize failed: 2
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [quorum] crit: can't initialize service
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [confdb] crit: cmap_initialize failed: 2
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [confdb] crit: can't initialize service
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [dcdb] notice: start cluster connection
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_initialize failed: 2
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [dcdb] crit: can't initialize service
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [status] notice: start cluster connection
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [status] crit: cpg_initialize failed: 2
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [status] crit: can't initialize service
Oct 3 03:10:35 proxmox07 pve-ha-lrm[1821]: lost lock 'ha_agent_proxmox07_lock - cfs lock update failed - Permission denied
Oct 3 03:10:37 proxmox07 pve-ha-crm[1813]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Oct 3 03:10:37 proxmox07 pmxcfs[1572]: [quorum] crit: quorum_initialize failed: 2
Oct 3 03:10:37 proxmox07 pmxcfs[1572]: [confdb] crit: cmap_initialize failed: 2
Oct 3 03:10:37 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_initialize failed: 2
Oct 3 03:10:37 proxmox07 pmxcfs[1572]: [status] crit: cpg_initialize failed: 2
Oct 3 03:10:40 proxmox07 pve-ha-lrm[1821]: status change active => lost_agent_lock
Oct 3 03:10:42 proxmox07 pve-ha-crm[1813]: status change master => lost_manager_lock
Oct 3 03:10:42 proxmox07 pve-ha-crm[1813]: watchdog closed (disabled)
Oct 3 03:10:42 proxmox07 pve-ha-crm[1813]: status change lost_manager_lock => wait_for_quorum
Oct 3 03:10:43 proxmox07 pmxcfs[1572]: [quorum] crit: quorum_initialize failed: 2
Oct 3 03:10:43 proxmox07 pmxcfs[1572]: [confdb] crit: cmap_initialize failed: 2
Oct 3 03:10:43 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_initialize failed: 2
Oct 3 03:10:43 proxmox07 pmxcfs[1572]: [status] crit: cpg_initialize failed: 2
Oct 3 03:10:49 proxmox07 pmxcfs[1572]: [quorum] crit: quorum_initialize failed: 2
Oct 3 03:10:49 proxmox07 pmxcfs[1572]: [confdb] crit: cmap_initialize failed: 2
Oct 3 03:10:49 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_initialize failed: 2
Oct 3 03:10:49 proxmox07 pmxcfs[1572]: [status] crit: cpg_initialize failed: 2
Oct 3 03:10:55 proxmox07 pmxcfs[1572]: [quorum] crit: quorum_initialize failed: 2
Oct 3 03:10:55 proxmox07 pmxcfs[1572]: [confdb] crit: cmap_initialize failed: 2
Oct 3 03:10:55 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_initialize failed: 2
Oct 3 03:10:55 proxmox07 pmxcfs[1572]: [status] crit: cpg_initialize failed: 2
Oct 3 03:11:00 proxmox07 systemd[1]: Starting Proxmox VE replication runner...
Oct 3 03:11:00 proxmox07 pvesr[631]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 3 03:11:01 proxmox07 pmxcfs[1572]: [quorum] crit: quorum_initialize failed: 2
Oct 3 03:11:01 proxmox07 pmxcfs[1572]: [confdb] crit: cmap_initialize failed: 2
Oct 3 03:11:01 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_initialize failed: 2
Oct 3 03:11:01 proxmox07 pmxcfs[1572]: [status] crit: cpg_initialize failed: 2
Oct 3 03:11:01 proxmox07 pvesr[631]: trying to acquire cfs lock 'file-replication_cfg' ...

pveversion -v:

proxmox-ve: 6.0-2 (running kernel: 5.0.21-2-pve)
pve-manager: 6.0-7 (running version: 6.0-7/28984024)
pve-kernel-5.0: 6.0-8
pve-kernel-helper: 6.0-8
pve-kernel-4.15: 5.4-9
pve-kernel-5.0.21-2-pve: 5.0.21-6
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-14-pve: 4.15.18-39
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.12-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1

kev1904 · Oct 15, 2019

Checkout:

https://bugzilla.proxmox.com/show_bug.cgi?id=2326#c75

Search

Search

PVE6.0-5: Corosync3 segvaults randomly on nodes

astnwt

Renowned Member

dietmar

Proxmox Staff Member

andrew sp

New Member

astnwt

Renowned Member

Attachments

RokaKen

Active Member

astnwt

Renowned Member

fips

Renowned Member

tom

Proxmox Staff Member

fips

Renowned Member

tom

Proxmox Staff Member

chrigiboy

Well-Known Member

Stoiko Ivanov

Proxmox Staff Member

alebeta

Well-Known Member

kev1904

Well-Known Member