[SOLVED] pmxcfs failed, Failed to start Corosync cluster Engine

plofkat

We had a critical power failure that took down the entire cluster: the inverter on the UPS failed, tripping the mains breakers and killing all servers before maintenance staff could reach the power room.
It is a 6-server cluster (vwk-prox01 to vwk-prox06).

The shared storage folder mounts properly on all servers, but there seems to be a problem with pmxcfs.

The following shows up on all servers.

Code:
journalctl -xn
-- Logs begin at Tue 2018-06-19 09:16:08 SAST, end at Tue 2018-06-19 09:29:59 SAST. --
Jun 19 09:29:47 vwk-prox02 pmxcfs[1350]: [dcdb] crit: cpg_initialize failed: 2
Jun 19 09:29:47 vwk-prox02 pmxcfs[1350]: [status] crit: cpg_initialize failed: 2
Jun 19 09:29:53 vwk-prox02 pmxcfs[1350]: [quorum] crit: quorum_initialize failed: 2
Jun 19 09:29:53 vwk-prox02 pmxcfs[1350]: [confdb] crit: cmap_initialize failed: 2
Jun 19 09:29:53 vwk-prox02 pmxcfs[1350]: [dcdb] crit: cpg_initialize failed: 2
Jun 19 09:29:53 vwk-prox02 pmxcfs[1350]: [status] crit: cpg_initialize failed: 2
Jun 19 09:29:59 vwk-prox02 pmxcfs[1350]: [quorum] crit: quorum_initialize failed: 2
Jun 19 09:29:59 vwk-prox02 pmxcfs[1350]: [confdb] crit: cmap_initialize failed: 2
Jun 19 09:29:59 vwk-prox02 pmxcfs[1350]: [dcdb] crit: cpg_initialize failed: 2
Jun 19 09:29:59 vwk-prox02 pmxcfs[1350]: [status] crit: cpg_initialize failed: 2

Code:
systemctl status corosync.service -l
● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled)
   Active: failed (Result: timeout) since Tue 2018-06-19 09:18:06 SAST; 7min ago
  Process: 1421 ExecStart=/usr/share/corosync/corosync start (code=killed, signal=TERM)

Jun 19 09:16:36 vwk-prox02 corosync[1431]: [MAIN  ] Corosync Cluster Engine ('2.4.2'): started and ready to provide service.
Jun 19 09:16:36 vwk-prox02 corosync[1431]: [MAIN  ] Corosync built-in features: augeas systemd pie relro bindnow
Jun 19 09:18:06 vwk-prox02 systemd[1]: corosync.service start operation timed out. Terminating.
Jun 19 09:18:06 vwk-prox02 systemd[1]: Failed to start Corosync Cluster Engine.
Jun 19 09:18:06 vwk-prox02 systemd[1]: Unit corosync.service entered failed state.
Jun 19 09:18:06 vwk-prox02 corosync[1421]: Starting Corosync Cluster Engine (corosync):
.
 
Update on this:

One server's onboard NIC was knocked out, yet corosync does seem to start on this server.
Is there any way I can use the (I assume) working database on my other servers to get the cluster online?
 
Last update on this, in case anyone is interested or ever has the same issue.

The root cause was a combination of a faulty NIC on one of the servers, which caused chaos on the network, and failed DNS on the node.
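
For anyone debugging something similar, here is a minimal sketch of a name-resolution check that could be run on each node. It is not what was actually run in this thread. The function name check_resolves is made up for illustration, and the check is only relevant if your corosync configuration refers to nodes by hostname rather than by IP address.

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread): report which of the
# given host names fail to resolve through the system resolver.
# getent consults /etc/hosts and DNS according to /etc/nsswitch.conf.

check_resolves() {
    rc=0
    for name in "$@"; do
        if getent hosts "$name" >/dev/null 2>&1; then
            echo "ok   $name"
        else
            echo "FAIL $name"
            rc=1
        fi
    done
    # Non-zero exit status if at least one name did not resolve.
    return $rc
}

# Example call, using the node names from this thread:
#   check_resolves vwk-prox01 vwk-prox02 vwk-prox03 vwk-prox04 vwk-prox05 vwk-prox06
```

Any FAIL line points to a node name that the local resolver cannot look up. A name that resolves here can still be unreachable because of a bad NIC, so this only covers the DNS half of the problem.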
 
