[SOLVED] Quorum Dissolved

RobFantini

May 24, 2012
Hello

We have a 4-node cluster. One of the nodes is also an NFS server used to store backups.

At 4 AM a backup of all VMs on all nodes was set to run for the first time, using the NFS server.

Now the quorum is dissolved. pvecm nodes returns this:

Code:
fbc240 s009 ~ # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   M    760   2012-05-23 18:56:21  fbc240
   3   X    924                        fbc100
   4   X    780                        fbc241
   5   X    876                        fbc1


fbc241 s012 ~ # pvecm nodes
Node  Sts   Inc   Joined               Name
   1   X    780                        fbc240
   3   X    924                        fbc100
   4   M    772   2012-05-23 19:19:20  fbc241
   5   X    876                        fbc1

I can ping each node from the others.

So it looks like the cluster broke due to very heavy network traffic.

My question: how do I reconnect the cluster?
 
restart cman and pve-cluster:

What is the output of:

Code:
/etc/init.d/cman stop
/etc/init.d/cman start

if that works, try:

Code:
/etc/init.d/pve-cluster restart

Think about using an extra network for storage and cluster communication.
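
For example, a dedicated cluster network on a spare NIC could look roughly like this (eth1 and the 10.10.10.0/24 subnet are placeholders, not taken from this setup):

Code:
# /etc/network/interfaces (excerpt) -- sketch only
auto eth1
iface eth1 inet static
        address 10.10.10.240
        netmask 255.255.255.0

# cman/corosync binds to the address the node name resolves to, so
# /etc/hosts would also need to point fbc240 etc. at the new subnet
# for cluster traffic to actually move off the storage network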
 
restart cman and pve-cluster:

What is the output of:

Code:
/etc/init.d/cman stop
/etc/init.d/cman start

output:
Code:
fbc100 s001 /etc/pve # /etc/init.d/cman stop
Stopping cluster: 
   Leaving fence domain... [  OK  ]
   Stopping dlm_controld... [  OK  ]
   Stopping fenced... [  OK  ]
   Stopping cman... [  OK  ]
   Waiting for corosync to shutdown:[  OK  ]
   Unloading kernel modules... [  OK  ]
   Unmounting configfs... [  OK  ]


fbc100 s001 /etc/pve # /etc/init.d/cman start
Starting cluster: 
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... Timed-out waiting for cluster
[FAILED]
 
The question is: why is there a timeout, and why is cluster communication not working?

Post your /etc/pve/cluster.conf.

Also check syslog for any useful logs.
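
For example, something like this would pull the relevant entries (keywords and paths are only a suggestion):

Code:
# recent cluster-related messages from syslog
grep -iE 'corosync|cman|pmxcfs|quorum' /var/log/syslog | tail -n 50
grep -iE 'corosync|cman|pmxcfs|quorum' /var/log/daemon.log | tail -n 50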
 
Code:
<?xml version="1.0"?>
<cluster config_version="79" name="fbcluster">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
    <fencedevice agent="fence_apc" ipaddr="10.100.100.11" login="fbcadmin" name="apc11" passwd="032scali"/>
    <fencedevice agent="fence_apc" ipaddr="10.100.100.78" login="fbcadmin" name="apc78" passwd="032scali"/>
    <fencedevice agent="fence_apc" ipaddr="10.100.100.88" login="fbcadmin" name="apc88" passwd="032scali"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="fbc241" nodeid="4" votes="1">
      <fence>
        <method name="power">
          <device name="apc11" port="4" secure="on"/>
          <device name="apc78" port="4" secure="on"/>
          <device name="apc88" port="4" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fbc240" nodeid="1" votes="1">
      <fence>
        <method name="power">
          <device name="apc11" port="2" secure="on"/>
          <device name="apc78" port="2" secure="on"/>
          <device name="apc88" port="2" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fbc100" nodeid="3" votes="1">
      <fence>
        <method name="power">
          <device name="apc78" port="3" secure="on"/>
          <device name="apc88" port="3" secure="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="fbc1" nodeid="5" votes="1"/>
  </clusternodes>
  <rm>
    <failoverdomains>
      <failoverdomain name="fbc240-fbc241" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="fbc240" priority="1"/>
        <failoverdomainnode name="fbc241" priority="100"/>
      </failoverdomain>
      <failoverdomain name="fbc241-fbc240" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="fbc241" priority="1"/>
        <failoverdomainnode name="fbc240" priority="100"/>
      </failoverdomain>
    </failoverdomains>
    <pvevm autostart="1" domain="fbc240-fbc241" vmid="1023"/>
    <pvevm autostart="1" domain="fbc241-fbc240" vmid="101"/>
    <pvevm autostart="1" domain="fbc241-fbc240" vmid="104"/>
    <pvevm autostart="1" domain="fbc241-fbc240" vmid="115"/>
    <pvevm autostart="1" domain="fbc241-fbc240" vmid="105"/>
    <pvevm autostart="1" vmid="102"/>
  </rm>
</cluster>
 
The question is: why is there a timeout, and why is cluster communication not working?

You simply do not have quorum if you start 2 out of 4 nodes.

Either start "/etc/init.d/cman start" on all nodes (at same time), or set expected votes:

# pvecm expected 1
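
For reference, the quorum arithmetic and a way to check it (pvecm expected bypasses the normal majority rule, so use it with care):

Code:
# show Expected votes / Total votes / Quorum on the local node
pvecm status

# with 4 votes total, quorum = 4/2 + 1 = 3 votes;
# two running nodes only provide 2 votes, hence no quorum.
# lowering the expected votes makes a single node quorate again:
pvecm expected 1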
 
Tom and Dietmar: thank you for the help.

The issue was caused by the switch losing its multicast settings.

I must have set it for multicast, applied the setting, but did not save. Then, with all the traffic from 4 nodes doing a backup to NFS at the same time, the switch somehow reset itself. [The quorum and cluster were working fine for 3 months before that.]

Also, I added info to the wiki showing how to save [not just apply] the multicast settings on a Netgear switch.

Next we'll add NICs and use bond0 for our vmbr0 on the nodes.

In addition, we will not back up all nodes at the same time.
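
For anyone finding this later: multicast between the nodes can be verified end-to-end with omping (the omping package needs to be installed; the node names below are ours, adjust as needed):

Code:
# run on all nodes at roughly the same time
omping -c 60 -i 1 fbc240 fbc241 fbc100 fbc1
# each node should report close to 0% loss for both unicast and multicast;
# multicast loss points at IGMP snooping / querier settings on the switch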
 
Or use a separate network for cluster communication (or storage).

Two of the nodes are located away from the room with our server rack.

By a different network, do you mean a different subnet is enough, or should we use completely separate network hardware? It is not hard to add the hardware, as we already have extra network cables in place.
 
or should we use completely separate network hardware?

At least use separate NIC/cables (so that storage traffic does not slow down cluster communication).

Also, using bonding on two different switches makes sense.
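
A rough sketch of what that could look like in /etc/network/interfaces (interface names and addresses are placeholders; active-backup works across two independent switches, unlike 802.3ad/LACP, which needs stacked or MLAG-capable switches):

Code:
# sketch only -- eth0/eth1 and the addresses are placeholders
auto bond0
iface bond0 inet manual
        slaves eth0 eth1
        bond_mode active-backup
        bond_miimon 100

auto vmbr0
iface vmbr0 inet static
        address 10.100.100.240
        netmask 255.255.255.0
        gateway 10.100.100.1
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0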
 
