I checked the logs - nothing really throws up a red flag. We have 2 clusters running on almost identical hardware. Each cluster has 4 nodes and 4 VMs (one per node). Multicast is working fine. IPMI is used for the fence devices and has been tested with ipmitool as well as the fence_ipmi wrapper. acpid is off (BIOS set to immediate power-off). We've used both public and private network IP addresses for the nodes (properly defined in /etc/hosts).
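For reference, this is roughly how we test fencing against each node (a minimal sketch; the BMC address and credentials are placeholders, and I'm assuming the stock fence_ipmilan agent here):

    # Check chassis power state over the IPMI LAN interface
    ipmitool -I lanplus -H 10.0.0.101 -U admin -P secret chassis power status

    # Same check through the fence agent, i.e. what fenced would actually call
    fence_ipmilan -a 10.0.0.101 -l admin -p secret -o status

Both checks come back clean on every node, which is why I don't think fencing itself is broken.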
These are the issues we're seeing on both clusters:
1) If the nodes do not all boot within 10 seconds of each other, at least one of the nodes will fail to start cman, and even if cman does start it reports a DLM lowcomms error ("no local IP address set"). That in turn keeps rgmanager from starting on some nodes, and all the VMs migrate to one node. Due to the increased load, that node freezes, IPMI then starts rebooting all the nodes concurrently, and we end up in an endless loop. (See the /etc/hosts check sketched after this list.)
2) If I power off one of the nodes (yank the power), IPMI will reboot it (the BIOS is currently set to return to last state, but I can try setting the power-management option to ON instead). The VM will move to another node and start up, but migration is a hassle and only works about half the time. The potential outcomes are:
a) The VM is migrated, but its conf stays on the old node and it shows as down even though it is running on the new node.
b) The VM and the node both show as down in the GUI, but both are actually running.
c) The node is running, the VM shows as down on the old node in the GUI, and it reports a missing conf file because the VM has actually been moved.
The only way to fix all of this is a reboot of all 4 nodes (we've written a script, sketched after this list, that uses IPMI to reboot all 4 concurrently, and since the cluster is SSD and SAN driven, they boot up at around the same time).
3) The other issue is cman/rgmanager itself. It seems that on all 4 nodes we have to manually restart cman and rgmanager in order for things to work, which makes no sense since both are set to auto-start on boot (the chkconfig settings are sketched below).
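On issue 1: as far as I understand, the DLM "no local IP address set" error usually means the node's own cluster name resolves to the wrong interface (or to 127.0.0.1) at the moment cman starts, so name resolution is the first thing we keep re-checking. Our /etc/hosts entries follow this pattern (hypothetical names and addresses):

    # The node names used in cluster.conf must resolve to the cluster interface,
    # never to the loopback address
    127.0.0.1     localhost localhost.localdomain
    10.10.10.1    node1
    10.10.10.2    node2
    10.10.10.3    node3
    10.10.10.4    node4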
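The concurrent reboot script mentioned under issue 2 is nothing fancy; it's essentially this (BMC addresses and credentials are placeholders):

    #!/bin/bash
    # Power-cycle all 4 nodes at roughly the same time via their BMCs
    for bmc in 10.0.0.101 10.0.0.102 10.0.0.103 10.0.0.104; do
        ipmitool -I lanplus -H "$bmc" -U admin -P secret chassis power cycle &
    done
    wait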
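And for issue 3, autostart is enabled the standard way (sketch below), which is why having to restart the services by hand after every boot makes no sense to us:

    # Enable the cluster services at boot (init ordering starts cman before rgmanager)
    chkconfig cman on
    chkconfig rgmanager on

    # What we end up running by hand after a boot to get a healthy clustat
    service cman restart
    service rgmanager restart
    clustat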
So these 3 issues combined turn this into an unusable HA setup. Without HA everything runs well, but then again, we could just use SolusVM if we only wanted simple virtualization.
Now, 8 times out of 10 it works fine: power goes out, the VM migrates, everything's peachy. But we need something that works 10 out of 10 times.
We may try unicast, as I still believe this is a cluster communication issue. And before you say my configs are messed up - everything has been done textbook, from clean installs, per the wiki.
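If we do go unicast, my understanding is that with cman it's just a transport attribute on the cman tag in cluster.conf (sketch below; as far as I know this needs the cman/corosync that ships with RHEL/CentOS 6.2 or later):

    <!-- switch cluster communication from multicast to UDP unicast -->
    <cman transport="udpu"/>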
cluster.conf is super simple, and the VMs are not even production, just tests - they're small files.
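To give an idea of how plain it is, the layout is basically this (a trimmed-down sketch, not the actual file; node names, paths, IPMI addresses and credentials are placeholders):

    <?xml version="1.0"?>
    <cluster name="testcluster" config_version="1">
      <clusternodes>
        <clusternode name="node1" nodeid="1">
          <fence>
            <method name="ipmi">
              <device name="ipmi-node1"/>
            </method>
          </fence>
        </clusternode>
        <!-- node2 through node4 follow the same pattern -->
      </clusternodes>
      <fencedevices>
        <fencedevice agent="fence_ipmilan" name="ipmi-node1" ipaddr="10.0.0.101" login="admin" passwd="secret" lanplus="1"/>
      </fencedevices>
      <rm>
        <!-- one small test VM per node, managed by rgmanager -->
        <vm name="testvm1" path="/etc/cluster/vm" recovery="relocate" migrate="live"/>
      </rm>
    </cluster>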
Again, the only issue is HA - everything else is fine. We get errors with migration, HA reboots, clusvcadm migrations, etc. - it's just unstable. Anyway, we hope we can find a solution soon, since we like all the other attributes, but frankly this is a big fly in the ointment for something that is supposed to be production ready. I guess for simply administering a cluster of VMs it works.
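For completeness, the manual operations that misbehave are the plain ones, e.g. (hypothetical VM and node names):

    clustat                            # overall cluster/service status
    clusvcadm -M vm:testvm1 -m node2   # live-migrate the VM service to node2
    clusvcadm -r vm:testvm1 -m node2   # or relocate (stop/start) it instead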