HA / Fencing Issue

ejc317

Member
Oct 18, 2012
On our 4-node cluster, only one server will start RGManager; the other three will not start it automatically. When I hit Start on RGManager, it fails with an init.d error.

However, if I manually restart CMAN, basically rejoin the fence domain, and then hit Start, it works fine.
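
Roughly, the equivalent of that dance on the command line is (service names from the stock init scripts):

  /etc/init.d/cman restart        # bring cman / corosync back up
  fence_tool join                 # rejoin the fence domain
  /etc/init.d/rgmanager start     # now rgmanager starts cleanly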

This is leading to other issues, like our IPMI fencing going haywire and all the VMs on our 16 node cluster moving over to one server, so it really needs to be fixed.

Any ideas?
 
You need to find out what happens on boot - check /var/log/syslog and the log files in /var/log/cluster/*
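
For example (the cluster daemons typically log to separate files under /var/log/cluster/):

  # boot-time cluster messages in the main syslog
  grep -iE 'cman|corosync|fence|dlm|rgmanager|quorum' /var/log/syslog
  # per-daemon logs
  less /var/log/cluster/corosync.log
  less /var/log/cluster/fenced.log
  less /var/log/cluster/rgmanager.log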
 
I checked the logs - nothing really raises a red flag. We have two clusters running almost identical hardware. On each cluster we have 4 VMs (one on each node). Multicast is working fine. IPMI is used for the fence devices and has been tested with ipmitool as well as the fence_ipmilan agent. acpid is off (BIOS set to immediate power-off). We've used both public and private network IP addresses for the nodes (properly defined in /etc/hosts).
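
For example, these are the kind of checks we ran against each node's BMC (the IP address and credentials below are placeholders):

  # raw IPMI query of the BMC
  ipmitool -I lanplus -H 10.0.0.11 -U ADMIN -P secret chassis power status
  # same check through the fence agent, the way the cluster would call it
  fence_ipmilan -a 10.0.0.11 -l ADMIN -p secret -P -o status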

Issues we're seeing on both:

1) If the nodes do not all boot within 10 seconds of each other, at least one of them will fail to start cman, and even if cman does start it complains about DLM lowcomms / no local IP address set. rgmanager then fails to start on some nodes and all the VMs migrate to one node. Under the increased load that node freezes, and IPMI fencing starts rebooting all the nodes concurrently, causing an endless loop.

2) If I power off one of the nodes (yank the power), IPMI will reboot it (currently the BIOS power policy is set to return to the last state, but I can try setting it to always on). The VM will move to another node and start up, but the migration is a hassle and only works about half the time. Potential outcomes are:

a) The VM is migrated, but its config stays on the old node and it shows as down even though it is running on the new node.
b) The VM and the node both show as down in the GUI, but both are actually running.
c) The node is running, the VM shows as down on the old node in the GUI, and it complains about a missing config file because the VM has actually been moved.

The only way to fix all this is to reboot all 4 nodes (we've written a script that uses IPMI to reboot all 4 concurrently - see the sketch after this list - and since the cluster is SSD and SAN driven, they boot up at around the same time).

3) The other issue is cman / rgmanager. It seems that on all 4 nodes we have to manually restart cman and rgmanager in order for HA to work, which is no good since they are set to start automatically on boot.
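
For reference, the reboot-all script mentioned under 2) is essentially a sketch like this (the BMC addresses and credentials are placeholders, not our real ones):

  #!/bin/bash
  # power-cycle every node's BMC in parallel so the nodes come back up together
  BMCS="10.0.1.11 10.0.1.12 10.0.1.13 10.0.1.14"
  for bmc in $BMCS; do
      ipmitool -I lanplus -H "$bmc" -U ADMIN -P secret chassis power cycle &
  done
  wait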

So these 3 issues combined add up to an unusable HA setup. Without HA everything runs well, but then again, we could just use SolusVM if we only wanted simple virtualization.

Now, 8 times out of 10 it works fine: power goes out, the VM migrates, everything's peachy - but we need something that works 10 out of 10 times.

We may try unicast, as I still believe this is a cluster communication issue. And before you say my configs are messed up - everything has been done by the book, from clean installs, per the wiki.

cluster.conf is super simple and the VMs are not even production, just tests - their disk files are small.

Again, the only issue is HA - everything else is fine. We get errors with migration, HA reboots, clusvcadm migrations, etc. - it's just unstable. Anyway, we hope we can find a solution soon, since we like all the other attributes, but frankly this is a big fly in the ointment for having something that is production-ready. I guess for simply administering a cluster of VMs it works.
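
To be clear about what I mean by clusvcadm migrations - commands along these lines, where the VMID and target node are just examples:

  # move the HA-managed VM 101 to node "proxmox2" through rgmanager
  clusvcadm -M pvevm:101 -m proxmox2
  # then check what rgmanager thinks happened
  clustat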
 
1) If the nodes do not all boot within 10 seconds of each other, at least one of them will fail to start cman, and even ...

This is expected behavior. You should avoid losing quorum if you run HA! You need to start the services manually.
 
This is expected behavior. You should avoid losing quorum if you run HA! You need to start the services manually.

Well, that's the other issue - it randomly loses quorum, the GUI shows nodes as down, etc., and we have to restart pvedaemon and friends to get it back.
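
When that happens, recovery is roughly along these lines (the exact set of services we have to bounce varies):

  # see what the cluster itself thinks about quorum and membership
  pvecm status
  clustat
  # kick the Proxmox daemons that feed the GUI
  service pvedaemon restart
  service pvestatd restart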
 
The other issues seem to be common too: a ghost VM left on the old node that doesn't truly migrate until a reboot - which doesn't work if you have other production VMs, since you can't be rebooting whole clusters.

What determines the status of the node / VM? I get the black computer-screen icon even though the status says running.
 
Fencing status is fine

However, I did just see this (screenshot attached). It seems like rgmanager is locking up on some of the nodes ...
 

Attachments

  • error message.jpg
The other issue I see is this:

svc: failed to register lockdv1 RPC service

And I can't kill rgmanager - either from the GUI or over SSH.
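
In case it helps with diagnosis, the stuck state can at least be inspected with the standard cman/rgmanager tools, e.g.:

  # membership and quorum as cman sees it
  cman_tool status
  cman_tool nodes
  # state of the fence / dlm / rgmanager groups (a group stuck mid-transition shows up here)
  group_tool ls
  # what rgmanager itself reports
  clustat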
 
