Problem with cman and rgmanager

jamaicada · Oct 9, 2013

Hello!
I have a problem with HA on a 3-server cluster. Each node has 3 ports: management, a bridge for VMs, and a network for test iSCSI storage.
I manually shut down the management port for node3 on the switch to simulate network problems on node3. Fencing works fine: I can see 2 ports (bridge for VMs and iSCSI storage) disabled on the network switch.
Then I re-enable the management port and try to restart the services like this:

Code:

root@proxmox3:~# /etc/init.d/pve-cluster restart
Restarting pve cluster filesystem: pve-cluster.
root@proxmox3:~# /etc/init.d/pvedaemon restart
Restarting PVE Daemon: pvedaemon.
root@proxmox3:~# /etc/init.d/cman restart
Stopping cluster:
   Leaving fence domain... [  OK  ]
   Stopping dlm_controld... [  OK  ]
   Stopping fenced... [  OK  ]
   Stopping cman... [  OK  ]
   Unloading kernel modules... [  OK  ]
   Unmounting configfs... [  OK  ]
Starting cluster:
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... [  OK  ]
   Starting fenced... [  OK  ]
   Starting dlm_controld... [  OK  ]
   Tuning DLM kernel config... [  OK  ]
   Unfencing self... fence_node: cannot connect to cman
[FAILED]

Rgmanager just hangs. There is nothing I can do to bring cman and rgmanager back; the only way is to reboot.
/etc/pve/cluster.conf

Code:

<?xml version="1.0"?><cluster config_version="7" name="rnet-cluster">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
   <fencedevice agent="fence_ifmib" community="test-fencing" ipaddr="192.168.100.1" name="test-switch" snmp_version="2c"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="proxmox1" nodeid="1" votes="1">
     <fence>
       <method name="1">
         <device action="off" name="test-switch" port="FastEthernet0/33"/>
         <device action="off" name="test-switch" port="FastEthernet0/36"/>
       </method>
     </fence>
    </clusternode>
    <clusternode name="proxmox2" nodeid="2" votes="1">
     <fence>
       <method name="1">
         <device action="off" name="test-switch" port="FastEthernet0/34"/>
         <device action="off" name="test-switch" port="FastEthernet0/37"/>
       </method>
     </fence>
    </clusternode>
    <clusternode name="proxmox3" nodeid="3" votes="1">
     <fence>
       <method name="1">
         <device action="off" name="test-switch" port="FastEthernet0/35"/>
         <device action="off" name="test-switch" port="FastEthernet0/38"/>
       </method>
     </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="101"/>
  </rm>
</cluster>

Here the first port is the bridge for VMs and the second is the network for the test iSCSI storage.
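
To double-check which switch ports each node's fence method turns off, the config can be parsed with a few lines of Python. This is only a rough sketch: `CLUSTER_CONF` is a trimmed copy of the file above (XML declaration, `<cman>`, `<fencedevices>` and `<rm>` left out), and `fenced_ports` is a made-up helper name, not part of any Proxmox or cman tool.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf from this post (illustration only).
CLUSTER_CONF = """
<cluster config_version="7" name="rnet-cluster">
  <clusternodes>
    <clusternode name="proxmox1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="test-switch" port="FastEthernet0/33"/>
          <device action="off" name="test-switch" port="FastEthernet0/36"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="proxmox2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="test-switch" port="FastEthernet0/34"/>
          <device action="off" name="test-switch" port="FastEthernet0/37"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="proxmox3" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="test-switch" port="FastEthernet0/35"/>
          <device action="off" name="test-switch" port="FastEthernet0/38"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
</cluster>
"""


def fenced_ports(conf_text):
    """Map each node name to the switch ports its fence devices switch off."""
    root = ET.fromstring(conf_text)
    result = {}
    for node in root.iter("clusternode"):
        # Collect every <device action="off"> under this node, in file order.
        ports = [
            dev.get("port")
            for dev in node.iter("device")
            if dev.get("action") == "off"
        ]
        result[node.get("name")] = ports
    return result


if __name__ == "__main__":
    for name, ports in sorted(fenced_ports(CLUSTER_CONF).items()):
        print(name, ports)
```

The management port used for cluster communication does not appear in any of these lists, which matches the setup described here: only the VM bridge port and the storage port are fenced.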

Code:

root@proxmox3:~# fence_tool ls
root@proxmox3:~# clustat
Could not connect to CMAN: No such file or directory
root@proxmox3:~# pvecm status
cman_tool: Cannot open connection to cman, is it running ?

What can I do to bring node3 back into the cluster? Is it possible without a reboot?
P.S. FENCE_JOIN="yes" is configured on all nodes.
P.P.S. Scheme

dietmar · Oct 9, 2013

jamaicada said:
What can I do to bring node3 back into the cluster? Is it possible without a reboot?

Your fencing does not work (else the other nodes would have fenced node3)?

jamaicada · Oct 9, 2013

Sorry, I didn't understand. Is that a question or a statement? As far as I can see, node3 is fenced because ports 35 and 38 become disabled.

mir · Oct 9, 2013

Is the interface which is not fenced the one the cluster communication flows over?

jamaicada · Oct 9, 2013

mir said:
Is the interface which is not fenced the one the cluster communication flows over?

Yes, it's the only interface for cluster communication.

dietmar · Oct 9, 2013

jamaicada said:
Sorry, I didn't understand. Is that a question or a statement? As far as I can see, node3 is fenced because ports 35 and 38 become disabled.

Oh, you only fence one network port? I think this kind of fencing is quite dangerous, because the VMs are still running and have access to network/storage on other ports. rgmanager will start the same VM on another node, so you end up with 2 instances running on the same storage!

jamaicada · Oct 9, 2013

dietmar said:
Oh, you only fence one network port? I think this kind of fencing is quite dangerous, because the VMs are still running and have access to network/storage on other ports. rgmanager will start the same VM on another node, so you end up with 2 instances running on the same storage!

No! I fence the bridge port for VMs and the port for storage. I do not fence the port for cluster communication; I just shut down this port on the switch to simulate a network problem. I wrote about this in the first message.

Here the first port is the bridge for VMs and the second is the network for the test iSCSI storage.

Scheme in the first post.

mir · Oct 9, 2013

Well, if you don't break the communication between the fenced node and the rest of the cluster, rgmanager will consider this node still active, so restarting the node will cause the rgmanager on the fenced node to try to migrate resources to other nodes. Since you have cut the connections from the resources to the storage, the migration will fail and you end up with a deadlocked rgmanager.

jamaicada · Oct 9, 2013

mir said:
Well, if you don't break the communication between the fenced node and the rest of the cluster, rgmanager will consider this node still active, so restarting the node will cause the rgmanager on the fenced node to try to migrate resources to other nodes. Since you have cut the connections from the resources to the storage, the migration will fail and you end up with a deadlocked rgmanager.

I do break the communication port, so the other machines know that node3 is down. Then I see node3 fenced. Then I re-enable the communication port... cman is down and rgmanager hangs. That's the problem.
Any ideas?

dietmar · Oct 10, 2013

Is there any hint in /var/log/syslog why cman fails to start?

jamaicada · Oct 10, 2013

dietmar said:
Is there any hint in /var/log/syslog why cman fails to start?

Code:

Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:07 proxmox3 fenced[11815]: fenced 1364188437 started
Oct 10 15:04:07 proxmox3 dlm_controld[11826]: dlm_controld 1364188437 started
Oct 10 15:04:07 proxmox3 fenced[11815]: found uncontrolled entry /sys/kernel/dlm/rgmanager
Oct 10 15:04:07 proxmox3 dlm_controld[11826]: found uncontrolled lockspace rgmanager
Oct 10 15:04:07 proxmox3 dlm_controld[11826]: telling cman to remove nodeid 3 from cluster
Oct 10 15:04:07 proxmox3 corosync[7533]: cman killed by node 3 because we were killed by cman_tool or other application

dietmar · Oct 10, 2013

Seems dlm/corosync thinks you need to reboot.
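
The lines in the log above that point in this direction are the "found uncontrolled" messages. My reading, which is an assumption and not something confirmed in this thread, is that the `rgmanager` DLM lockspace was left over in the kernel from before the restart, so the freshly started daemons refuse to take it over. A rough Python sketch to pick such lines out of a syslog excerpt (`uncontrolled_lockspaces` is a made-up helper name, and `SAMPLE` is copied from the log above):

```python
import re

# Matches the two message forms seen in the log above.
UNCONTROLLED = re.compile(r"found uncontrolled (entry|lockspace) (\S+)")

# Three lines copied from the syslog excerpt in this thread.
SAMPLE = [
    "Oct 10 15:04:07 proxmox3 fenced[11815]: found uncontrolled entry /sys/kernel/dlm/rgmanager",
    "Oct 10 15:04:07 proxmox3 dlm_controld[11826]: found uncontrolled lockspace rgmanager",
    "Oct 10 15:04:07 proxmox3 dlm_controld[11826]: telling cman to remove nodeid 3 from cluster",
]


def uncontrolled_lockspaces(lines):
    """Return (kind, name) pairs for every 'found uncontrolled ...' line."""
    found = []
    for line in lines:
        match = UNCONTROLLED.search(line)
        if match:
            found.append((match.group(1), match.group(2)))
    return found


if __name__ == "__main__":
    for kind, name in uncontrolled_lockspaces(SAMPLE):
        print(kind, name)
```

To run it against a real log, pass the lines of the file instead of `SAMPLE`, for example `uncontrolled_lockspaces(open("/var/log/syslog"))`.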

jamaicada · Oct 10, 2013

dietmar said:
Seems dlm/corosync thinks you need to reboot.

So is this normal behavior? Do I need to reboot every time the network goes down?

dietmar · Oct 10, 2013

jamaicada said:
So is this normal behavior? Do I need to reboot every time the network goes down?

Yes, you need to reboot if you lose quorum (if you use HA)!

dietmar · Oct 10, 2013

dietmar said:
Yes, you need to reboot if you lose quorum (if you use HA)!

That is one reason why you need redundant network for cluster communication.
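
For the 3-node setup in this thread, the quorum arithmetic can be sketched as below. This assumes the usual majority rule (quorum = expected votes / 2 + 1, using integer division) and one vote per node as in the posted cluster.conf; the function names are only for illustration.

```python
EXPECTED_VOTES = 3  # proxmox1..proxmox3, votes="1" each in the posted cluster.conf


def quorum_threshold(expected_votes):
    """Votes needed for quorum under a simple majority rule (assumption)."""
    return expected_votes // 2 + 1


def is_quorate(visible_votes, expected_votes):
    """True if a partition that sees `visible_votes` votes keeps quorum."""
    return visible_votes >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    print("threshold:", quorum_threshold(EXPECTED_VOTES))
    # node3 cut off from the cluster network sees only its own vote.
    print("node3 alone quorate:", is_quorate(1, EXPECTED_VOTES))
    # proxmox1 and proxmox2 still see each other.
    print("other two quorate:", is_quorate(2, EXPECTED_VOTES))
```

With a single cluster link, one switch port going down is enough to leave node3 alone with 1 of 3 votes, below the threshold of 2, while the other two nodes keep quorum and fence it.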
