Problem with cman and rgmanager

jamaicada

New Member
Oct 9, 2013
Hello!
I've got a problem with HA on a 3-node cluster. Each node has 3 network ports: management, the bridge for the VMs, and the network with the test iSCSI storage.
I manually shut down the management port for node3 on the switch, so I can simulate network problems on node3. Fencing works fine - I can see the 2 ports (bridge for VMs and iSCSI storage) disabled on the network switch.
Then I enable the management port again and try to restart the services like this:
Code:
root@proxmox3:~# /etc/init.d/pve-cluster restart
Restarting pve cluster filesystem: pve-cluster.
root@proxmox3:~# /etc/init.d/pvedaemon restart
Restarting PVE Daemon: pvedaemon.
root@proxmox3:~# /etc/init.d/cman restart
Stopping cluster:
   Leaving fence domain... [  OK  ]
   Stopping dlm_controld... [  OK  ]
   Stopping fenced... [  OK  ]
   Stopping cman... [  OK  ]
   Unloading kernel modules... [  OK  ]
   Unmounting configfs... [  OK  ]
Starting cluster:
   Checking if cluster has been disabled at boot... [  OK  ]
   Checking Network Manager... [  OK  ]
   Global setup... [  OK  ]
   Loading kernel modules... [  OK  ]
   Mounting configfs... [  OK  ]
   Starting cman... [  OK  ]
   Waiting for quorum... [  OK  ]
   Starting fenced... [  OK  ]
   Starting dlm_controld... [  OK  ]
   Tuning DLM kernel config... [  OK  ]
   Unfencing self... fence_node: cannot connect to cman
[FAILED]
rgmanager just hangs. Nothing I do brings cman and rgmanager back; the only way is a reboot.
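(For reference: the failing "Unfencing self..." step is just fence_node being run with its unfence option, so once cman is actually up again it can be retried by hand; a minimal sketch, using the node name from the cluster.conf below:)
Code:
# retry the unfence step manually once cman is running again;
# fence_node -U asks the configured fence agents to unfence this node
root@proxmox3:~# fence_node -U proxmox3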
My /etc/pve/cluster.conf:
Code:
<?xml version="1.0"?>
<cluster config_version="7" name="rnet-cluster">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey"/>
  <fencedevices>
   <fencedevice agent="fence_ifmib" community="test-fencing" ipaddr="192.168.100.1" name="test-switch" snmp_version="2c"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="proxmox1" nodeid="1" votes="1">
     <fence>
       <method name="1">
         <device action="off" name="test-switch" port="FastEthernet0/33"/>
         <device action="off" name="test-switch" port="FastEthernet0/36"/>
       </method>
     </fence>
    </clusternode>
    <clusternode name="proxmox2" nodeid="2" votes="1">
     <fence>
       <method name="1">
         <device action="off" name="test-switch" port="FastEthernet0/34"/>
         <device action="off" name="test-switch" port="FastEthernet0/37"/>
       </method>
     </fence>
    </clusternode>
    <clusternode name="proxmox3" nodeid="3" votes="1">
     <fence>
       <method name="1">
         <device action="off" name="test-switch" port="FastEthernet0/35"/>
         <device action="off" name="test-switch" port="FastEthernet0/38"/>
       </method>
     </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="101"/>
  </rm>
</cluster>
where the first port is the bridge for VMs and the second is the network with the test iSCSI storage.
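If needed, the state of the fenced ports can also be queried on the switch directly with the fence agent itself; a minimal sketch, assuming the standard fence-agents command-line options matching the attributes of the fencedevice entry above:
Code:
# ask the switch for the current state of one of node3's fenced ports,
# re-using the parameters from the fencedevice entry in cluster.conf
root@proxmox3:~# fence_ifmib --ip=192.168.100.1 --community=test-fencing \
      --snmp-version=2c --plug=FastEthernet0/35 --action=status
# the same call with --action=on should re-enable the port manually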
Code:
root@proxmox3:~# fence_tool ls
root@proxmox3:~# clustat
Could not connect to CMAN: No such file or directory
root@proxmox3:~# pvecm status
cman_tool: Cannot open connection to cman, is it running ?
What can I do to bring node3 back into the cluster? Is it possible to do this without a reboot?
P.S. FENCE_JOIN="yes" is configured on all nodes.
P.P.S. Scheme: proxmox.jpeg
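As a first sanity check it may also help to look at the cluster from one of the surviving nodes; a minimal sketch using the standard cman/rgmanager status tools (proxmox1 is just taken from the cluster.conf above):
Code:
# view of the cluster from a node that is still a member:
# cman_tool nodes shows the membership state of each node,
# clustat shows what rgmanager thinks about the nodes and the pvevm service
root@proxmox1:~# cman_tool nodes
root@proxmox1:~# clustat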
 
Sorry, I didn't understand. Is that a question or a statement? As far as I can see, node3 is fenced, because ports 35 and 38 become disabled.
 

Oh, you only fence one network port? I think this kind of fencing is quite dangerous, because the VMs are still running and have access to network/storage on the other ports. rgmanager will start the same VM on another node, so you end up with 2 instances running on the same storage!
 
No! I fence the bridge port for the VMs and the port for storage. And I do _not_ fence the port for cluster communication; I just shut down that port on the switch to simulate a network problem. I wrote about this in the first message:
"where the first port is the bridge for VMs and the second is the network with the test iSCSI storage."
The scheme is in the first post.
 
Well, if you don't break the communication between the fenced node and the rest of the cluster, rgmanager will consider this node still active, so restarting the node will cause the rgmanager on the fenced node to try to migrate resources to the other nodes. Since you have cut the connections from the resources to the storage, the migration will fail and you end up with a deadlocked rgmanager.
 
I do break the communication port, so the other machines know that node3 is down. Then I see node3 fenced. Then I enable the communication port again... and cman is down and rgmanager hangs - that's the problem.
Any ideas?
 
Is there any hint in /var/log/syslog why cman fails to start?
Code:
Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:00 proxmox3 pmxcfs[6898]: [status] crit: cpg_send_message failed: 9
Oct 10 15:04:07 proxmox3 fenced[11815]: fenced 1364188437 started
Oct 10 15:04:07 proxmox3 dlm_controld[11826]: dlm_controld 1364188437 started
Oct 10 15:04:07 proxmox3 fenced[11815]: found uncontrolled entry /sys/kernel/dlm/rgmanager
Oct 10 15:04:07 proxmox3 dlm_controld[11826]: found uncontrolled lockspace rgmanager
Oct 10 15:04:07 proxmox3 dlm_controld[11826]: telling cman to remove nodeid 3 from cluster
Oct 10 15:04:07 proxmox3 corosync[7533]: cman killed by node 3 because we were killed by cman_tool or other application
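The last lines appear to be the actual reason cman goes down again: the rgmanager DLM lockspace from before the fence is still registered in the kernel, so the freshly started dlm_controld treats it as "uncontrolled" and tells cman to remove the node. A minimal sketch of how to check for the leftover lockspace, and of the restart order that would normally avoid this (rgmanager down before cman); whether the hung rgmanager can actually be stopped without a reboot is exactly the open question here:
Code:
# a leftover rgmanager lockspace makes dlm_controld kick this node out again
root@proxmox3:~# ls /sys/kernel/dlm/
root@proxmox3:~# dlm_tool ls
# the usual order would be to stop rgmanager (releasing its lockspace)
# before restarting cman -- assuming the hung rgmanager can be stopped at all
root@proxmox3:~# /etc/init.d/rgmanager stop
root@proxmox3:~# /etc/init.d/cman restart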
 
