Cluster Failed?

XN-Matt

We've got all our nodes online and added one server to HA, but it remains in the queued status.

I have attached an image showing the cluster, but it just remains in the dead status. I have restarted the cluster process too, without much change.

Is there anything else to check?
 

Attachments

  • Screen Shot 2017-09-06 at 4.39.45 PM.png

It seems that the CRM is not running, or isn't able to get its lock on any of the nodes, which could mean that there is a quorum or pmxcfs problem.

Can you please post the output of:
Code:
systemctl -l status pve-ha-crm pve-ha-lrm corosync
from all nodes?
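
In the meantime, if you want to rule out a quorum or pmxcfs problem on your side, the following standard checks may also help (just a suggestion, run on each node):
Code:
# quorum and membership as the cluster stack sees it
pvecm status

# pmxcfs runs inside pve-cluster.service, so its state and recent log are worth a look
systemctl -l status pve-cluster
journalctl -u pve-cluster -b --no-pager | tail -n 50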
 
Here you go.

Code:
● pve-ha-crm.service - PVE Cluster Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-crm.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2017-09-07 02:11:34 BST; 10h ago
  Process: 24537 ExecStop=/usr/sbin/pve-ha-crm stop (code=exited, status=0/SUCCESS)
  Process: 24550 ExecStart=/usr/sbin/pve-ha-crm start (code=exited, status=0/SUCCESS)
 Main PID: 24552 (pve-ha-crm)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/pve-ha-crm.service
           └─24552 pve-ha-crm

Sep 07 02:11:33 c1-h6 systemd[1]: Starting PVE Cluster Ressource Manager Daemon...
Sep 07 02:11:34 c1-h6 pve-ha-crm[24552]: starting server
Sep 07 02:11:34 c1-h6 pve-ha-crm[24552]: status change startup => wait_for_quorum
Sep 07 02:11:34 c1-h6 systemd[1]: Started PVE Cluster Ressource Manager Daemon.
Sep 07 02:11:39 c1-h6 pve-ha-crm[24552]: status change wait_for_quorum => slave
Sep 07 02:12:39 c1-h6 pve-ha-crm[24552]: ipcc_send_rec failed: Transport endpoint is not connected

● pve-ha-lrm.service - PVE Local HA Ressource Manager Daemon
   Loaded: loaded (/lib/systemd/system/pve-ha-lrm.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2017-09-04 00:24:16 BST; 3 days ago
  Process: 7800 ExecStop=/usr/sbin/pve-ha-lrm stop (code=exited, status=0/SUCCESS)
  Process: 8704 ExecStart=/usr/sbin/pve-ha-lrm start (code=exited, status=0/SUCCESS)
 Main PID: 8737 (pve-ha-lrm)
    Tasks: 1 (limit: 4915)
   CGroup: /system.slice/pve-ha-lrm.service
           └─8737 pve-ha-lrm

Sep 07 01:53:30 c1-h6 pve-ha-lrm[8737]: service 'ct:167' without node
Sep 07 01:53:30 c1-h6 pve-ha-lrm[8737]: service 'vm:105' without node
Sep 07 01:53:30 c1-h6 pve-ha-lrm[8737]: service 'vm:104' without node
Sep 07 01:53:30 c1-h6 pve-ha-lrm[8737]: service 'vm:127' without node
Sep 07 02:11:30 c1-h6 pve-ha-lrm[8737]: ipcc_send_rec failed: Transport endpoint is not connected
Sep 07 02:12:40 c1-h6 pve-ha-lrm[8737]: ipcc_send_rec failed: Transport endpoint is not connected
Sep 07 04:16:03 c1-h6 pve-ha-lrm[8658]: starting service vm:145
Sep 07 04:16:03 c1-h6 pve-ha-lrm[8659]: start VM 145: UPID:c1-h6:000021D3:01A6CE56:59B0B9F3:qmstart:145:root@pam:
Sep 07 04:16:03 c1-h6 pve-ha-lrm[8658]: <root@pam> starting task UPID:c1-h6:000021D3:01A6CE56:59B0B9F3:qmstart:145:root@pam:
Sep 07 04:16:05 c1-h6 pve-ha-lrm[8658]: <root@pam> end task UPID:c1-h6:000021D3:01A6CE56:59B0B9F3:qmstart:145:root@pam: OK

● corosync.service - Corosync Cluster Engine
   Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2017-09-07 02:11:19 BST; 10h ago
     Docs: man:corosync
           man:corosync.conf
           man:corosync_overview
 Main PID: 24472 (corosync)
    Tasks: 2 (limit: 4915)
   CGroup: /system.slice/corosync.service
           └─24472 /usr/sbin/corosync -f

Sep 07 02:11:19 c1-h6 corosync[24472]:  [QUORUM] Members[1]: 2
Sep 07 02:11:19 c1-h6 corosync[24472]:  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 07 02:11:19 c1-h6 corosync[24472]: notice  [TOTEM ] A new membership (10.0.0.14:232) was formed. Members joined: 4 6 7 8 9 3 1 5
Sep 07 02:11:19 c1-h6 corosync[24472]:  [TOTEM ] A new membership (10.0.0.14:232) was formed. Members joined: 4 6 7 8 9 3 1 5
Sep 07 02:11:19 c1-h6 corosync[24472]: notice  [QUORUM] This node is within the primary component and will provide service.
Sep 07 02:11:19 c1-h6 corosync[24472]: notice  [QUORUM] Members[9]: 4 6 7 8 9 2 3 1 5
Sep 07 02:11:19 c1-h6 corosync[24472]: notice  [MAIN  ] Completed service synchronization, ready to provide service.
Sep 07 02:11:19 c1-h6 corosync[24472]:  [QUORUM] This node is within the primary component and will provide service.
Sep 07 02:11:19 c1-h6 corosync[24472]:  [QUORUM] Members[9]: 4 6 7 8 9 2 3 1 5
Sep 07 02:11:19 c1-h6 corosync[24472]:  [MAIN  ] Completed service synchronization, ready to provide service.

It is worth noting that I did a restart which sorted that out, but now this node (from the output above) just remains on "starting" when a VM on it is added to HA. Moving the same VM elsewhere and enabling HA there goes from starting to started.
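
For reference, the "starting"/"started" I'm describing are the per-service states shown by the HA manager, along these lines:
Code:
# overall HA view: quorum, current master, per-node LRM status and per-service state
ha-manager status

# the configured HA resources and their requested states
ha-manager config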
 
