[SOLVED] Quorum Dissolved

RobFantini

May 24, 2012
Hello

We have a 4-node cluster. One of the nodes is also an NFS server used to store backups.

At 4 AM a backup of all VMs on all nodes was set to run for the first time, using the NFS server.

Now the quorum is dissolved. pvecm nodes returns this:

Code:
fbc240 s009 ~ # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M    760   2012-05-23 18:56:21  fbc240
   3   X    924                        fbc100
   4   X    780                        fbc241
   5   X    876                        fbc1


fbc241 s012 ~ # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   X    780                        fbc240
   3   X    924                        fbc100
   4   M    772   2012-05-23 19:19:20  fbc241
   5   X    876                        fbc1

I can ping each node from the others.

So it looks like the cluster broke due to very heavy network traffic.

My question: how do I reconnect the cluster?
 
restart cman and pve-cluster:

What is the output of:

Code:
/etc/init.d/cman stop
/etc/init.d/cman start

if that works, try:

Code:
/etc/init.d/pve-cluster restart

Think about using an extra network for storage and cluster communication.
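
For example, a dedicated cluster network on a spare NIC could look roughly like this (eth1 and the 10.10.10.0/24 subnet are placeholders, not taken from this setup):

Code:
# /etc/network/interfaces (excerpt) -- sketch only
auto eth1
iface eth1 inet static
        address 10.10.10.240
        netmask 255.255.255.0

# cman/corosync binds to the address the node name resolves to, so
# /etc/hosts would also need to point fbc240 etc. at the new subnet
# for cluster traffic to actually move off the storage network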
 
restart cman and pve-cluster:

What is the output of:

Code:
/etc/init.d/cman stop
/etc/init.d/cman start

output:
Code:
fbc100 s001 /etc/pve # /etc/init.d/cman stop
Stopping cluster: 
   Leaving fence domain... [  OK  ]
   Stopping dlm_controld... [  OK  ]
   Stopping fenced... [  OK  ]
   Stopping cman... [  OK  ]
   Waiting for corosync to shutdown:[  OK  ]
   Unloading kernel modules... [  OK  ]
   Unmounting configfs... [  OK  ]


fbc100 s001 /etc/pve # /etc/init.d/cman start
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
[FAILED]
 
The question is: why is there a timeout, and why is cluster communication not working?

Post your /etc/pve/cluster.conf.

Also check syslog for any useful logs.
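
For example, something like this would pull the relevant entries (keywords and paths are only a suggestion):

Code:
# recent cluster-related messages from syslog
grep -iE 'corosync|cman|pmxcfs|quorum' /var/log/syslog | tail -n 50
grep -iE 'corosync|cman|pmxcfs|quorum' /var/log/daemon.log | tail -n 50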
 
Code:
<?xml version="1.0"?>
<cluster config_version="79" name="fbcluster">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_apc" ipaddr="10.100.100.11" login="fbcadmin" name="apc11" passwd="032scali"/>
    <fencedevice agent="fence_apc" ipaddr="10.100.100.78" login="fbcadmin" name="apc78" passwd="032scali"/>
    <fencedevice agent="fence_apc" ipaddr="10.100.100.88" login="fbcadmin" name="apc88" passwd="032scali"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="fbc241" nodeid="4" votes="1">
      <fence>
        <method name="power">
          <device name="apc11" port="4" secure="on"/>
          <device name="apc78" port="4" secure="on"/>
          <device name="apc88" port="4" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fbc240" nodeid="1" votes="1">
      <fence>
        <method name="power">
          <device name="apc11" port="2" secure="on"/>
          <device name="apc78" port="2" secure="on"/>
          <device name="apc88" port="2" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fbc100" nodeid="3" votes="1">
      <fence>
        <method name="power">
          <device name="apc78" port="3" secure="on"/>
          <device name="apc88" port="3" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fbc1" nodeid="5" votes="1"/>
  </clusternodes>
  <rm>
    <failoverdomains>
      <failoverdomain name="fbc240-fbc241" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="fbc240" priority="1"/>
        <failoverdomainnode name="fbc241" priority="100"/>
      </failoverdomain>
      <failoverdomain name="fbc241-fbc240" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="fbc241" priority="1"/>
        <failoverdomainnode name="fbc240" priority="100"/>
      </failoverdomain>
    </failoverdomains>
    <pvevm autostart="1" domain="fbc240-fbc241" vmid="1023"/>
    <pvevm autostart="1" domain="fbc241-fbc240" vmid="101"/>
    <pvevm autostart="1" domain="fbc241-fbc240" vmid="104"/>
    <pvevm autostart="1" domain="fbc241-fbc240" vmid="115"/>
    <pvevm autostart="1" domain="fbc241-fbc240" vmid="105"/>
    <pvevm autostart="1" vmid="102"/>
  </rm>
</cluster>
 
The question is: why is there a timeout, and why is cluster communication not working?

You simply do not have quorum if you start 2 out of 4 nodes.

Either start "/etc/init.d/cman start" on all nodes (at same time), or set expected votes:

# pvecm expected 1
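
For reference, the quorum arithmetic and a way to check it (pvecm expected bypasses the normal majority rule, so use it with care):

Code:
# show Expected votes / Total votes / Quorum on the local node
pvecm status

# with 4 votes total, quorum = 4/2 + 1 = 3 votes;
# two running nodes only provide 2 votes, hence no quorum.
# lowering the expected votes makes a single node quorate again:
pvecm expected 1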
 
Tom and Dietmar: thank you for the help.

The issue was caused by the switch losing its multicast settings.

I must have set it for multicast, applied the setting, but did not save. Then, with all the traffic from 4 nodes doing a backup to NFS at the same time, the switch somehow reset itself. [The quorum and cluster were working fine for 3 months before that.]

Also, I added info to the wiki showing how to save [not just apply] the multicast settings on a Netgear switch.

Next we'll add NICs and use bond0 for our vmbr0 on the nodes.

In addition, we will not back up all nodes at the same time.
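
For anyone finding this later: multicast between the nodes can be verified end-to-end with omping (the omping package needs to be installed; the node names below are ours, adjust as needed):

Code:
# run on all nodes at roughly the same time
omping -c 60 -i 1 fbc240 fbc241 fbc100 fbc1
# each node should report close to 0% loss for both unicast and multicast;
# multicast loss points at IGMP snooping / querier settings on the switch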
 
Or use a separate network for cluster communication (or storage).

Two of the nodes are located away from the room with our server rack.

By a different network, do you mean a different subnet is enough, or should we use completely separate network hardware? It is not hard to add the hardware, as we already have extra network cables in place.
 
or should we use completely separate network hardware?

At least use separate NIC/cables (so that storage traffic does not slow down cluster communication).

Also, using bonding on two different switches makes sense.
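
A rough sketch of what that could look like in /etc/network/interfaces (interface names and addresses are placeholders; active-backup works across two independent switches, unlike 802.3ad/LACP, which needs stacked or MLAG-capable switches):

Code:
# sketch only -- eth0/eth1 and the addresses are placeholders
auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_mode active-backup
        bond_miimon 100

auto vmbr0
iface vmbr0 inet static
        address 10.100.100.240
        netmask 255.255.255.0
        gateway 10.100.100.1
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0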
 
