PVE6.0-5: Corosync3 segfaults randomly on nodes

astnwt

Hey guys,

We recently updated from PVE5 to PVE6 and noticed that nodes in our 4-node cluster leave the cluster at random. On an affected node, pvecm status reports that CMAP cannot be initialized, so I had a look at corosync there, only to learn that it had segfaulted.

This has happened on 3 of the 4 cluster nodes since we upgraded. Of course I could apply some nasty workaround like a shell script watchdog that fires up corosync again after it dies, but I would really like to fix the underlying problem.

Aug 12 21:03:16 scp4 pmxcfs[1572]: [status] crit: received short message (0 bytes)
Aug 12 21:03:16 scp4 pmxcfs[1572]: [status] crit: leaving CPG group
Aug 12 21:03:16 scp4 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Aug 12 21:03:16 scp4 systemd[1]: corosync.service: Failed with result 'signal'.
Aug 12 21:03:16 scp4 pmxcfs[1572]: [dcdb] crit: cpg_send_message failed: 2
Aug 12 21:03:16 scp4 pmxcfs[1572]: [dcdb] crit: cpg_send_message failed: 2
Aug 12 21:03:16 scp4 pmxcfs[1572]: [dcdb] crit: cpg_send_message failed: 2
[...]
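
For anyone who wants to check how often this has happened on their own nodes, here is a minimal Python sketch that scans syslog lines for the systemd message shown above. The regular expression is derived only from this excerpt, so the exact line format on other systemd versions is an assumption, and `find_signal_deaths` is a hypothetical helper name.

```python
import re

# Matches the systemd line reporting that a unit's main process was killed by a signal.
# The pattern is based on the excerpt in this thread; other setups may log differently.
KILLED = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) (?P<host>\S+) systemd\[1\]: "
    r"(?P<unit>\S+\.service): Main process exited, code=killed, "
    r"status=(?P<signo>\d+)/(?P<signame>\w+)"
)


def find_signal_deaths(lines, unit="corosync.service"):
    """Return (timestamp, host, signal name) for each time `unit` was killed by a signal."""
    hits = []
    for line in lines:
        m = KILLED.match(line)
        if m and m.group("unit") == unit:
            hits.append((m.group("stamp"), m.group("host"), m.group("signame")))
    return hits


# Sample lines copied from the excerpt above.
sample = [
    "Aug 12 21:03:16 scp4 pmxcfs[1572]: [status] crit: leaving CPG group",
    "Aug 12 21:03:16 scp4 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV",
    "Aug 12 21:03:16 scp4 systemd[1]: corosync.service: Failed with result 'signal'.",
]
print(find_signal_deaths(sample))
```

The same function accepts any iterable of lines, so an open syslog file can be passed in directly (the file path depends on the node's logging setup).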


Starting corosync again with systemctl start corosync works, but I don't know for how long. I'm not used to any trouble of this kind from PVE5, which was rock solid with all its underlying components.


Greetings from Lower Austria,

- Daniel
 
@dietmar: Thanks for reaching out. I followed your advice and opened a case on https://bugzilla.proxmox.com/show_bug.cgi?id=2326 a few minutes ago.

For the sake of visibility and documentation:

The problem seems to involve KNET: tonight the cluster disintegrated again, this time without a segfault, but it was still corosync/KNET related.

If somebody is in a comparable situation, further post-mortem info is in the adjusted and stripped-down syslog attached to this post.

Notable events in attached logs:
00:17 - KNET: Link seems to be lost (it's not, our monitoring still reaches the host)
00:20 - KNET: complains about MTU issues
04:35 - KNET: Link seems to be lost (again, our monitoring does not confirm this)
05:24 - KNET/Corosync: Now things start to get out of hand. The cluster falls apart.
10:12 - INTERVENTION:
We notice that since 05:24, 2 of our 4 cluster nodes are "gone". Logging in on https/8006 does not work: the realm list does not get populated, and just loading the web interface in the browser took a minute. The machine itself is perfectly reachable via SSH.
pvecm status stalls without output. I chose to restart corosync (systemctl restart corosync).
10:12 - After restarting corosync, both missing cluster nodes join again, all issues are suddenly gone, and everything returns to normal operation.

Hope that helps others who are experiencing something along these lines.
 

@RokaKen: Yes, apart from the SEGV that happened yesterday, it seems to be the same issue.
Thanks for pointing that out. I stumbled upon the issue when I started to hunt down the corosync SEGV, but when the problems got worse last night I did not remember the thread.

I'll keep this thread open for the SEGV part, though, and will start watching the other thread :)
 
Sure:

Code:
root@vmb2:~# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.18-1-pve)
pve-manager: 6.0-5 (running version: 6.0-5/f8a710d7)
pve-kernel-5.0: 6.0-6
pve-kernel-helper: 6.0-6
pve-kernel-5.0.18-1-pve: 5.0.18-3
pve-kernel-4.15.18-20-pve: 4.15.18-46
pve-kernel-4.4.98-6-pve: 4.4.98-107
pve-kernel-4.4.83-1-pve: 4.4.83-96
pve-kernel-4.4.76-1-pve: 4.4.76-94
pve-kernel-4.4.35-1-pve: 4.4.35-77
ceph: 14.2.1-pve2
ceph-fuse: 14.2.1-pve2
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.10-pve2
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-3
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-7
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-63
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-5
pve-cluster: 6.0-5
pve-container: 3.0-5
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-2
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve1
 
Same issue here, corosync.service 11/SEGV

Oct 1 02:35:00 server3 systemd[1]: pvesr.service: Succeeded.
Oct 1 02:35:00 server3 systemd[1]: Started Proxmox VE replication runner.
Oct 1 02:35:20 server3 corosync[51563]: [TOTEM ] Retransmit List: 15f68
Oct 1 02:35:39 server3 corosync[51563]: [TOTEM ] Retransmit List: 15fdb
Oct 1 02:35:59 server3 pmxcfs[33528]: [status] crit: cpg_dispatch failed: 2
Oct 1 02:35:59 server3 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Oct 1 02:35:59 server3 pmxcfs[33528]: [status] crit: cpg_leave failed: 2
Oct 1 02:35:59 server3 systemd[1]: corosync.service: Failed with result 'signal'.
Oct 1 02:35:59 server3 pmxcfs[33528]: [quorum] crit: quorum_dispatch failed: 2
Oct 1 02:35:59 server3 pmxcfs[33528]: [status] notice: node lost quorum
Oct 1 02:35:59 server3 pmxcfs[33528]: [dcdb] crit: cpg_dispatch failed: 2
Oct 1 02:35:59 server3 pmxcfs[33528]: [dcdb] crit: cpg_leave failed: 2
Oct 1 02:35:59 server3 pmxcfs[33528]: [confdb] crit: cmap_dispatch failed: 2
Oct 1 02:35:59 server3 pve-ha-lrm[1832]: unable to write lrm status file - unable to open file '/etc/pve/nodes/server3/lrm_status.tmp.1832' - Permission denied
Oct 1 02:36:00 server3 systemd[1]: Starting Proxmox VE replication runner...
Oct 1 02:36:00 server3 pmxcfs[33528]: [quorum] crit: quorum_initialize failed: 2
Oct 1 02:36:00 server3 pmxcfs[33528]: [quorum] crit: can't initialize service
Oct 1 02:36:00 server3 pmxcfs[33528]: [confdb] crit: cmap_initialize failed: 2
Oct 1 02:36:00 server3 pmxcfs[33528]: [confdb] crit: can't initialize service
Oct 1 02:36:00 server3 pmxcfs[33528]: [dcdb] notice: start cluster connection
Oct 1 02:36:00 server3 pmxcfs[33528]: [dcdb] crit: cpg_initialize failed: 2
Oct 1 02:36:00 server3 pmxcfs[33528]: [dcdb] crit: can't initialize service
Oct 1 02:36:00 server3 pmxcfs[33528]: [status] notice: start cluster connection
Oct 1 02:36:00 server3 pmxcfs[33528]: [status] crit: cpg_initialize failed: 2
Oct 1 02:36:00 server3 pmxcfs[33528]: [status] crit: can't initialize service
Oct 1 02:36:01 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:02 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:03 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:04 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:05 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:06 server3 pmxcfs[33528]: [quorum] crit: quorum_initialize failed: 2
Oct 1 02:36:06 server3 pmxcfs[33528]: [confdb] crit: cmap_initialize failed: 2
Oct 1 02:36:06 server3 pmxcfs[33528]: [dcdb] crit: cpg_initialize failed: 2
Oct 1 02:36:06 server3 pmxcfs[33528]: [status] crit: cpg_initialize failed: 2
Oct 1 02:36:06 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:07 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:08 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:09 server3 pvesr[56406]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 1 02:36:10 server3 pvesr[56406]: error with cfs lock 'file-replication_cfg': no quorum!
Oct 1 02:36:10 server3 systemd[1]: pvesr.service: Main process exited, code=exited, status=13/n/a
Oct 1 02:36:10 server3 systemd[1]: pvesr.service: Failed with result 'exit-code'.
Oct 1 02:36:10 server3 systemd[1]: Failed to start Proxmox VE replication runner.
Oct 1 02:36:12 server3 pmxcfs[33528]: [quorum] crit: quorum_initialize failed: 2
Oct 1 02:36:12 server3 pmxcfs[33528]: [confdb] crit: cmap_initialize failed: 2
Oct 1 02:36:12 server3 pmxcfs[33528]: [dcdb] crit: cpg_initialize failed: 2
Oct 1 02:36:12 server3 pmxcfs[33528]: [status] crit: cpg_initialize failed: 2
Oct 1 02:36:18 server3 pmxcfs[33528]: [quorum] crit: quorum_initialize failed: 2
Oct 1 02:36:18 server3 pmxcfs[33528]: [confdb] crit: cmap_initialize failed: 2
Oct 1 02:36:18 server3 pmxcfs[33528]: [dcdb] crit: cpg_initialize failed: 2
Oct 1 02:36:18 server3 pmxcfs[33528]: [status] crit: cpg_initialize failed: 2
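
To get a quick overview of a long excerpt like the one above, the repeated pmxcfs messages can be tallied per subsystem. This is only a rough Python sketch: the pattern is taken from the lines in this post, and `tally_crit` is a hypothetical helper, not part of any Proxmox tooling.

```python
import re
from collections import Counter

# pmxcfs lines in this thread look like:
#   "... pmxcfs[PID]: [subsystem] crit: message"
PMXCFS_CRIT = re.compile(r"pmxcfs\[\d+\]: \[(?P<subsystem>\w+)\] crit: (?P<msg>.*)$")


def tally_crit(lines):
    """Count pmxcfs 'crit' messages per (subsystem, message) pair."""
    counts = Counter()
    for line in lines:
        m = PMXCFS_CRIT.search(line)
        if m:
            counts[(m.group("subsystem"), m.group("msg"))] += 1
    return counts


# Sample lines copied from the excerpt above.
sample = [
    "Oct 1 02:36:06 server3 pmxcfs[33528]: [quorum] crit: quorum_initialize failed: 2",
    "Oct 1 02:36:06 server3 pmxcfs[33528]: [confdb] crit: cmap_initialize failed: 2",
    "Oct 1 02:36:10 server3 pvesr[56406]: error with cfs lock 'file-replication_cfg': no quorum!",
    "Oct 1 02:36:12 server3 pmxcfs[33528]: [quorum] crit: quorum_initialize failed: 2",
]
for (subsystem, msg), n in tally_crit(sample).most_common():
    print(f"{n:3d}  [{subsystem}] {msg}")
```

Run over the full log, this makes it easier to see that all four pmxcfs services (quorum, confdb, dcdb, status) keep failing to reconnect once corosync is gone.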




root@server3:~# pveversion -v
proxmox-ve: 6.0-2 (running kernel: 5.0.15-1-pve)
pve-manager: 6.0-7 (running version: 6.0-7/28984024)
pve-kernel-5.0: 6.0-7
pve-kernel-helper: 6.0-7
pve-kernel-4.15: 5.4-6
pve-kernel-5.0.21-1-pve: 5.0.21-2
pve-kernel-5.0.15-1-pve: 5.0.15-1
pve-kernel-4.15.18-18-pve: 4.15.18-44
pve-kernel-4.15.18-12-pve: 4.15.18-36
ceph-fuse: 12.2.12-pve1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.11-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-4
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-8
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-64
lxcfs: 3.0.3-pve60
novnc-pve: 1.0.0-60
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
zfsutils-linux: 0.8.1-pve2
 
Hi friends, I have the same issue...

Strange, because the node was updated a few days ago and has the latest updates.

Related logs:

Oct 3 03:10:30 proxmox07 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Oct 3 03:10:30 proxmox07 systemd[1]: corosync.service: Failed with result 'signal'.
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_dispatch failed: 2
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_leave failed: 2
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [confdb] crit: cmap_dispatch failed: 2
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [quorum] crit: quorum_dispatch failed: 2
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [status] notice: node lost quorum
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [status] crit: cpg_dispatch failed: 2
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [status] crit: cpg_leave failed: 2
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [status] crit: cpg_dispatch failed: 2
Oct 3 03:10:30 proxmox07 pmxcfs[1572]: [status] crit: cpg_leave failed: 2
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [quorum] crit: quorum_initialize failed: 2
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [quorum] crit: can't initialize service
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [confdb] crit: cmap_initialize failed: 2
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [confdb] crit: can't initialize service
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [dcdb] notice: start cluster connection
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_initialize failed: 2
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [dcdb] crit: can't initialize service
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [status] notice: start cluster connection
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [status] crit: cpg_initialize failed: 2
Oct 3 03:10:31 proxmox07 pmxcfs[1572]: [status] crit: can't initialize service
Oct 3 03:10:35 proxmox07 pve-ha-lrm[1821]: lost lock 'ha_agent_proxmox07_lock - cfs lock update failed - Permission denied
Oct 3 03:10:37 proxmox07 pve-ha-crm[1813]: lost lock 'ha_manager_lock - cfs lock update failed - Permission denied
Oct 3 03:10:37 proxmox07 pmxcfs[1572]: [quorum] crit: quorum_initialize failed: 2
Oct 3 03:10:37 proxmox07 pmxcfs[1572]: [confdb] crit: cmap_initialize failed: 2
Oct 3 03:10:37 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_initialize failed: 2
Oct 3 03:10:37 proxmox07 pmxcfs[1572]: [status] crit: cpg_initialize failed: 2
Oct 3 03:10:40 proxmox07 pve-ha-lrm[1821]: status change active => lost_agent_lock
Oct 3 03:10:42 proxmox07 pve-ha-crm[1813]: status change master => lost_manager_lock
Oct 3 03:10:42 proxmox07 pve-ha-crm[1813]: watchdog closed (disabled)
Oct 3 03:10:42 proxmox07 pve-ha-crm[1813]: status change lost_manager_lock => wait_for_quorum
Oct 3 03:10:43 proxmox07 pmxcfs[1572]: [quorum] crit: quorum_initialize failed: 2
Oct 3 03:10:43 proxmox07 pmxcfs[1572]: [confdb] crit: cmap_initialize failed: 2
Oct 3 03:10:43 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_initialize failed: 2
Oct 3 03:10:43 proxmox07 pmxcfs[1572]: [status] crit: cpg_initialize failed: 2
Oct 3 03:10:49 proxmox07 pmxcfs[1572]: [quorum] crit: quorum_initialize failed: 2
Oct 3 03:10:49 proxmox07 pmxcfs[1572]: [confdb] crit: cmap_initialize failed: 2
Oct 3 03:10:49 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_initialize failed: 2
Oct 3 03:10:49 proxmox07 pmxcfs[1572]: [status] crit: cpg_initialize failed: 2
Oct 3 03:10:55 proxmox07 pmxcfs[1572]: [quorum] crit: quorum_initialize failed: 2
Oct 3 03:10:55 proxmox07 pmxcfs[1572]: [confdb] crit: cmap_initialize failed: 2
Oct 3 03:10:55 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_initialize failed: 2
Oct 3 03:10:55 proxmox07 pmxcfs[1572]: [status] crit: cpg_initialize failed: 2
Oct 3 03:11:00 proxmox07 systemd[1]: Starting Proxmox VE replication runner...
Oct 3 03:11:00 proxmox07 pvesr[631]: trying to acquire cfs lock 'file-replication_cfg' ...
Oct 3 03:11:01 proxmox07 pmxcfs[1572]: [quorum] crit: quorum_initialize failed: 2
Oct 3 03:11:01 proxmox07 pmxcfs[1572]: [confdb] crit: cmap_initialize failed: 2
Oct 3 03:11:01 proxmox07 pmxcfs[1572]: [dcdb] crit: cpg_initialize failed: 2
Oct 3 03:11:01 proxmox07 pmxcfs[1572]: [status] crit: cpg_initialize failed: 2
Oct 3 03:11:01 proxmox07 pvesr[631]: trying to acquire cfs lock 'file-replication_cfg' ...




pveversion -v:



proxmox-ve: 6.0-2 (running kernel: 5.0.21-2-pve)
pve-manager: 6.0-7 (running version: 6.0-7/28984024)
pve-kernel-5.0: 6.0-8
pve-kernel-helper: 6.0-8
pve-kernel-4.15: 5.4-9
pve-kernel-5.0.21-2-pve: 5.0.21-6
pve-kernel-4.15.18-21-pve: 4.15.18-48
pve-kernel-4.15.18-14-pve: 4.15.18-39
ceph-fuse: 12.2.11+dfsg1-2.1+b1
corosync: 3.0.2-pve2
criu: 3.11-3
glusterfs-client: 5.5-3
ksm-control-daemon: 1.3-1
libjs-extjs: 6.0.1-10
libknet1: 1.12-pve1
libpve-access-control: 6.0-2
libpve-apiclient-perl: 3.0-2
libpve-common-perl: 6.0-5
libpve-guest-common-perl: 3.0-1
libpve-http-server-perl: 3.0-2
libpve-storage-perl: 6.0-9
libqb0: 1.0.5-1
lvm2: 2.03.02-pve3
lxc-pve: 3.1.0-65
lxcfs: 3.0.3-pve60
novnc-pve: 1.1.0-1
openvswitch-switch: 2.10.0+2018.08.28+git.8ca7c82b7d+ds1-12
proxmox-mini-journalreader: 1.1-1
proxmox-widget-toolkit: 2.0-7
pve-cluster: 6.0-7
pve-container: 3.0-7
pve-docs: 6.0-4
pve-edk2-firmware: 2.20190614-1
pve-firewall: 4.0-7
pve-firmware: 3.0-2
pve-ha-manager: 3.0-2
pve-i18n: 2.0-3
pve-qemu-kvm: 4.0.0-5
pve-xtermjs: 3.13.2-1
qemu-server: 6.0-7
smartmontools: 7.0-pve2
spiceterm: 3.1-1
vncterm: 1.6-1
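
Since three nodes in this thread posted `pveversion -v` output, a small Python sketch for diffing two such listings may help when comparing affected nodes. The listings here all show corosync 3.0.2-pve2 but different libknet1 builds (1.10-pve2, 1.11-pve1, 1.12-pve1); a diff only makes such differences visible and does not establish which package is responsible. `parse_pveversion` and `diff_versions` are hypothetical helper names, and the parsing assumes the `package: version` layout seen above.

```python
def parse_pveversion(text):
    """Map package name -> version string from `pveversion -v` style output."""
    versions = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(": ")
        if not sep:
            continue  # skip shell prompts, headings and blank lines
        # Drop a trailing "(running ...)" annotation, if present.
        versions[name.strip()] = rest.split(" (", 1)[0].strip()
    return versions


def diff_versions(a, b):
    """Return {package: (version_a, version_b)} for packages that differ or are missing."""
    return {
        pkg: (a.get(pkg), b.get(pkg))
        for pkg in sorted(set(a) | set(b))
        if a.get(pkg) != b.get(pkg)
    }


# Shortened listings using values posted in this thread.
node_a = """root@vmb2:~# pveversion -v
pve-manager: 6.0-5 (running version: 6.0-5/f8a710d7)
corosync: 3.0.2-pve2
libknet1: 1.10-pve2"""

node_b = """pve-manager: 6.0-7 (running version: 6.0-7/28984024)
corosync: 3.0.2-pve2
libknet1: 1.12-pve1"""

for pkg, (va, vb) in diff_versions(parse_pveversion(node_a), parse_pveversion(node_b)).items():
    print(f"{pkg}: {va} -> {vb}")
```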
 
