Hey guys,
we recently upgraded from PVE5 to PVE6 and noticed that nodes in our 4-node cluster randomly drop out of the cluster. Running pvecm status on an affected node reports that CMAP cannot be initialized, so I had a look at corosync on the failed node, only to find that it had segfaulted.
This has happened on 3 of the 4 cluster nodes since the upgrade. Of course I could apply some nasty workaround, like a shell-script watchdog that restarts corosync after it dies, but I would really like to fix the underlying problem.
Aug 12 21:03:16 scp4 pmxcfs[1572]: [status] crit: received short message (0 bytes)
Aug 12 21:03:16 scp4 pmxcfs[1572]: [status] crit: leaving CPG group
Aug 12 21:03:16 scp4 systemd[1]: corosync.service: Main process exited, code=killed, status=11/SEGV
Aug 12 21:03:16 scp4 systemd[1]: corosync.service: Failed with result 'signal'.
Aug 12 21:03:16 scp4 pmxcfs[1572]: [dcdb] crit: cpg_send_message failed: 2
Aug 12 21:03:16 scp4 pmxcfs[1572]: [dcdb] crit: cpg_send_message failed: 2
Aug 12 21:03:16 scp4 pmxcfs[1572]: [dcdb] crit: cpg_send_message failed: 2
[...]
Starting corosync 3 again with systemctl start corosync works, but I don't know for how long. I'm not used to any trouble of this kind from PVE5, which was rock solid with all its underlying components.
Greetings from Lower Austria,
- Daniel