I've been playing with HA on a 3-node Proxmox 3.3 (now 3.4) cluster for about a month now.
My conclusion is that the Red Hat-related components for fencing and resource management are mostly broken, and even when they can be made to work, they work inconsistently.
Here's my scenario:
I'm running a 3-node cluster with one of the nodes also acting as an NFS server. All three nodes are Dell R610s with 48GB ECC RAM, SSDs in RAID 10 on a PERC H700, redundant power supplies, and an iDRAC6 Enterprise with IPMI over LAN enabled on a dedicated Ethernet port. I'm using IPMI over LAN (fence_ipmilan) for fencing; since each server is fed from two different UPSes, that's as good as any other fencing solution.
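For reference, the fencing and HA parts of my /etc/pve/cluster.conf follow the standard Proxmox 3.x layout and look roughly like this (names, addresses and credentials are placeholders, and the vmid is just an example):

  <fencedevices>
    <fencedevice agent="fence_ipmilan" name="ipmi-node1" ipaddr="10.10.10.11" login="fenceuser" passwd="secret" lanplus="1" power_wait="5"/>
    <!-- one fencedevice per node, pointing at that node's iDRAC -->
  </fencedevices>
  <clusternodes>
    <clusternode name="node1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="ipmi-node1"/>
        </method>
      </fence>
    </clusternode>
    <!-- node2 and node3 are declared the same way -->
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="101"/>
  </rm>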
Issues:
(1) Fencing - I can't make fence_node work out of the box; it always comes back with "agent error". I traced the issue to the check_input function in /usr/share/fence/fencing.py and removed the last section, where the check on device_opts is performed. Now unfencing (fence_node -U <node>) and fencing (fence_node <node>) work as expected, and fence_node -S <node> also returns the proper status from any of the nodes.
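If anyone wants to check the IPMI side independently of fence_node, the agent can also be called directly against the iDRAC with something like this (address and credentials are examples; -P selects LANplus, -o status just queries power state):

  fence_ipmilan -a 10.10.10.11 -l fenceuser -p secret -P -o status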
(2) CMAN / rgmanager don't start consistently when a node is powered up - I've found no solution other than logging in and starting them manually.
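Concretely, the manual workaround after every boot is just the following; checking the runlevel links is only a sanity check I'd suggest, not something that has fixed it for me:

  service cman start && service rgmanager start
  ls /etc/rc2.d/ | grep -E 'cman|rgmanager'   # are the init scripts even enabled?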
(3) When everything appears to be working - "fence_tool ls", "pvecm status", and /etc/pve/.members report the correct information on all nodes and the cluster is quorate - deliberately disabling a node that has an HA VM on it does absolutely nothing. If I go to the node running the HA VM and take its network interface down (e.g. ifconfig vmbr1 down), the other nodes report the node as down after a short period, but it never gets fenced (shut down) and the HA VM does not migrate. If I manually issue "fence_node <node>" from one of the remaining nodes, the node with the HA VM shuts down, but nothing happens to the HA VM: it stays attached to the node that has been fenced.
On top of this, the remaining nodes start behaving erratically: "pvecm status" takes 10-20 seconds to return, opening a console on a running VM doesn't work, and the web interface (which I'm accessing through one of the running nodes on the public interface, vmbr0) also becomes unresponsive.
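For anyone trying to reproduce this, the state I'm describing can be watched from the surviving nodes with something like the following (the log paths are the Red Hat cluster defaults; adjust if yours differ):

  clustat          # rgmanager's view of member nodes and HA services
  fence_tool ls    # fence domain membership
  pvecm status     # quorum and cluster membership
  tail -f /var/log/cluster/fenced.log /var/log/cluster/rgmanager.log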
What I expect:
(1) If I isolate a node that's part of the cluster, I expect it to be fenced (powered off) within 5-10 seconds and the HA VMs that were running on it to be promptly migrated.
(2) Unfencing a node should power it up, and it should rejoin the cluster automatically without me having to manually run "service cman start" and "service rgmanager start".
Has anyone actually made all of this work as expected? Any comments would be appreciated.
Regards,
Stephan.