Cluster Failed?

XN-Matt

We've got all our nodes online and added one server to HA, but it remains in the queued status.

I have attached an image showing the cluster, but it just remains in the dead status. I have restarted the cluster process too, without much change.

Is there anything else to check?
 

Attachments

  • Screen Shot 2017-09-06 at 4.39.45 PM.png

It seems that the CRM is not running, or isn't able to get its lock on any of the nodes, which could mean that there is a quorum or pmxcfs problem.

Can you please post the output of:
Code:
systemctl -l status pve-ha-crm pve-ha-lrm corosync
from all nodes?
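
In the meantime, if you want to rule out a quorum or pmxcfs problem on your side, the following standard checks may also help (just a suggestion, run on each node):
Code:
# quorum and membership as the cluster stack sees it
pvecm status

# pmxcfs runs inside pve-cluster.service, so its state and recent log are worth a look
systemctl -l status pve-cluster
journalctl -u pve-cluster -b --no-pager | tail -n 50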
 
Here you go.

Code:
● pve-ha-crm.service - PVE Cluster Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2017-09-07 02:11:34 BST; 10h ago
  Process: 24537 ExecStop=/usr/sbin/pve-ha-crm stop (code=exited, status=0/SUCCESS)
  Process: 24550 ExecStart=/usr/sbin/pve-ha-crm start (code=exited, status=0/SUCCESS)
 Main PID: 24552 (pve-ha-crm)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/pve-ha-crm.service
           └─24552 pve-ha-crm

Sep 07 02:11:33 c1-h6 systemd[1]: Starting PVE Cluster Ressource Manager Daemon...
Sep 07 02:11:34 c1-h6 pve-ha-crm[24552]: starting server
Sep 07 02:11:34 c1-h6 pve-ha-crm[24552]: status change startup => wait_for_quorum
Sep 07 02:11:34 c1-h6 systemd[1]: Started PVE Cluster Ressource Manager Daemon.
Sep 07 02:11:39 c1-h6 pve-ha-crm[24552]: status change wait_for_quorum => slave
Sep 07 02:12:39 c1-h6 pve-ha-crm[24552]: ipcc_send_rec failed: Transport endpoint is not connected

● pve-ha-lrm.service - PVE Local HA Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2017-09-04 00:24:16 BST; 3 days ago
  Process: 7800 ExecStop=/usr/sbin/pve-ha-lrm stop (code=exited, status=0/SUCCESS)
  Process: 8704 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
 Main PID: 8737 (pve-ha-lrm)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/pve-ha-lrm.service
           └─8737 pve-ha-lrm

Sep 07 01:53:30 c1-h6 pve-ha-lrm[8737]: service 'ct:167' without node
Sep 07 01:53:30 c1-h6 pve-ha-lrm[8737]: service 'vm:105' without node
Sep 07 01:53:30 c1-h6 pve-ha-lrm[8737]: service 'vm:104' without node
Sep 07 01:53:30 c1-h6 pve-ha-lrm[8737]: service 'vm:127' without node
Sep 07 02:11:30 c1-h6 pve-ha-lrm[8737]: ipcc_send_rec failed: Transport endpoint is not connected
Sep 07 02:12:40 c1-h6 pve-ha-lrm[8737]: ipcc_send_rec failed: Transport endpoint is not connected
Sep 07 04:16:03 c1-h6 pve-ha-lrm[8658]: starting service vm:145
Sep 07 04:16:03 c1-h6 pve-ha-lrm[8659]: start VM 145: UPID:c1-h6:000021D3:01A6CE56:59B0B9F3:qmstart:145:root@pam:
Sep 07 04:16:03 c1-h6 pve-ha-lrm[8658]: <root@pam> starting task UPID:c1-h6:000021D3:01A6CE56:59B0B9F3:qmstart:145:root@pam:
Sep 07 04:16:05 c1-h6 pve-ha-lrm[8658]: <root@pam> end task UPID:c1-h6:000021D3:01A6CE56:59B0B9F3:qmstart:145:root@pam: OK

● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2017-09-07 02:11:19 BST; 10h ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 24472 (corosync)
    Tasks: 2 (limit: 4915)
   CGroup: /system.slice/corosync.service
           └─24472 /usr/sbin/corosync -f

Sep 07 02:11:19 c1-h6 corosync[24472]:  [QUORUM] Members[1]: 2
Sep 07 02:11:19 c1-h6 corosync[24472]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 07 02:11:19 c1-h6 corosync[24472]: notice  [TOTEM ] A new membership (10.0.0.14:232) was formed. Members joined: 4 6 7 8 9 3 1 5
Sep 07 02:11:19 c1-h6 corosync[24472]:  [TOTEM ] A new membership (10.0.0.14:232) was formed. Members joined: 4 6 7 8 9 3 1 5
Sep 07 02:11:19 c1-h6 corosync[24472]: notice  [QUORUM] This node is within the primary component and will provide service.
Sep 07 02:11:19 c1-h6 corosync[24472]: notice  [QUORUM] Members[9]: 4 6 7 8 9 2 3 1 5
Sep 07 02:11:19 c1-h6 corosync[24472]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 07 02:11:19 c1-h6 corosync[24472]:  [QUORUM] This node is within the primary component and will provide service.
Sep 07 02:11:19 c1-h6 corosync[24472]:  [QUORUM] Members[9]: 4 6 7 8 9 2 3 1 5
Sep 07 02:11:19 c1-h6 corosync[24472]:  [MAIN  ] Completed service synchronization, ready to provide service.

It is worth noting that I did a restart which sorted that out, but now this node (from the output above) just remains on "starting" when a VM on it is added to HA. Moving the same VM elsewhere and enabling HA there goes from starting to started.
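
For reference, the "starting"/"started" I'm describing are the per-service states shown by the HA manager, along these lines:
Code:
# overall HA view: quorum, current master, per-node LRM status and per-service state
ha-manager status

# the configured HA resources and their requested states
ha-manager config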
 
