We have installed an HA cluster with two HP servers, using their iLO interfaces for fencing. The system is fully updated and runs version 2.2-31.
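For reference, fencing is configured the standard way with fence_ilo in /etc/pve/cluster.conf; the relevant part looks roughly like this (device names, IPs, logins and passwords replaced with placeholders):
cluster.conf (fencing part)
<fencedevices>
  <fencedevice agent="fence_ilo" name="fence-prox01" ipaddr="10.0.0.201" login="Administrator" passwd="XXXX"/>
  <fencedevice agent="fence_ilo" name="fence-prox02" ipaddr="10.0.0.202" login="Administrator" passwd="XXXX"/>
</fencedevices>
<clusternodes>
  <clusternode name="prox01" nodeid="1" votes="1">
    <fence>
      <method name="1">
        <device name="fence-prox01" action="reboot"/>
      </method>
    </fence>
  </clusternode>
  <clusternode name="prox02" nodeid="2" votes="1">
    <fence>
      <method name="1">
        <device name="fence-prox02" action="reboot"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>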
From time to time the first node loses communication with the second node (or so we assume). After that the cluster appears broken and we cannot even change the settings of the VMs (we get a "Device or resource busy" message).
The prox01 logs:
rgmanager.log
Jan 09 16:41:34 rgmanager State change: prox02 DOWN
fenced.log
Jan 09 16:41:34 fenced fencing node prox02
Jan 09 16:41:36 fenced fence prox02 success
pvecm nodes
Node Sts Inc Joined Name
1 M 692 2012-12-18 12:15:57 prox01
2 X 716 prox02
The prox02 logs:
rgmanager.log
Jan 09 16:41:37 rgmanager #67: Shutting down uncleanly
fenced.log
Jan 09 16:41:37 fenced cluster is down, exiting
pvecm nodes
cman_tool: Cannot open connection to cman, is it running ?
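At this point corosync/cman itself has apparently died on prox02, which matches the "cluster is down, exiting" entry in fenced.log; a quick process check along these lines would confirm it, although at the time we only noticed it through the cman status output shown further below:
pgrep -l 'corosync|fenced|dlm_controld|rgmanager'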
We also tried to restart cman on the second node and got the following messages:
Starting cluster:
Checking if cluster has been disabled at boot... [ OK ]
Checking Network Manager... [ OK ]
Global setup... [ OK ]
Loading kernel modules... [ OK ]
Mounting configfs... [ OK ]
Starting cman... [ OK ]
Waiting for quorum... [ OK ]
Starting fenced... [ OK ]
Starting dlm_controld... [ OK ]
Tuning DLM kernel config... [ OK ]
Starting GFS2 Control Daemon: gfs_controld.
Unfencing self... fence_node: cannot connect to cman
[FAILED]
If we then check the cman status, it displays:
root@prox02:/var/run# /etc/init.d/cman status
Found stale pid file
Fencing also always works without a problem when checked from prox01, but from prox02 we only get:
fence_node: cannot connect to cman
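The check itself is nothing special; if memory serves, it is just the stock fence_node wrapper run from each side, i.e. something like:
root@prox01:~# fence_node prox02      # succeeds, prox02 is power-cycled via iLO
root@prox02:~# fence_node prox01
fence_node: cannot connect to cman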
If we reboot both nodes, HA works again without any problem until it breaks again.
Is there any way to recover the cluster without rebooting both nodes?
How can we solve this cluster problem?
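What we would like to try on prox02, instead of rebooting, is something along these lines, but we are not sure whether it is safe to do while prox01 keeps running the VMs, or whether removing the stale pid file by hand (presumably /var/run/corosync.pid, we have not verified which file the init script complains about) is the right fix:
/etc/init.d/rgmanager stop
/etc/init.d/cman stop            # stops fenced, dlm_controld and corosync as well
killall -9 corosync              # only if the stop above hangs
rm -f /var/run/corosync.pid      # assumption: this is the stale pid file
/etc/init.d/cman start
/etc/init.d/pve-cluster restart  # not sure whether pmxcfs needs a restart too
/etc/init.d/rgmanager start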