Network failure in HA, the VM doesn't boot anymore

Ulrar

Active Member
Feb 4, 2016
Hi,

I'm trying to set up an HA cluster on OVH servers, using a vRack and GlusterFS for the storage.
I configured everything on the 3 nodes and installed a VM for testing. It runs fine: I can live-migrate it between the hosts, and if I reboot a node cleanly (with the actual reboot command) the VM does come back up on another host.

Now I'd like to simulate a network problem, as I know the vRacks aren't as reliable as they claim. They tend to change switches without warning, so I need to be sure the cluster will behave correctly. To test that, I simply remove the node currently running the VM from the vRack, which to the cluster looks like the network went down. After a while, the separated node does try to power itself down (which it can't really do, since it's OVH, but that's fine) and one of the other two nodes tries to start the VM, but it fails. Sometimes it fails completely and the VM doesn't boot at all; sometimes it appears to start but nothing works (no network, no console, ...) and in the Proxmox logs I get:

pvedaemon: unable to connect to VM 100 qmp socket - timeout after 31 retries

I can stop and start the VM manually as much as I want; it always does that. If I get the down node to join the cluster again, everything starts working again after I (re)boot the VM. Alternatively, I can wait for some time with the VM powered down and then start it, and it boots fine on the two-node cluster.

I installed everything from the jessie repo two days ago, so I should be running the latest stable version, I guess. Is this a known bug? Am I missing some config somewhere?

Thanks
 
Hi,

so you're on Proxmox VE 4.1?

No known bug, AFAIK. What's your setup and VM config? Skim through our HA wiki page if you haven't already: http://pve.proxmox.com/wiki/High_Availability_Cluster_4.x

What does the log say on such a failure?
Check with

Code:
ha-manager status

who the current master is; the log (especially from pve-ha-crm) would be interesting.
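For reference, a quick way to collect that information on a node is sketched below; the time window is just an assumption, adjust it to when the failure happened:

```shell
# Show the HA stack's view of the cluster: current master,
# LRM state of each node, and the state of each HA-managed service.
ha-manager status

# Pull the recent logs of the HA cluster resource manager (master election,
# fencing decisions) and the local resource manager (service start/stop).
journalctl -u pve-ha-crm -u pve-ha-lrm --since "-30 min"
```

The pve-ha-crm log on the node that is master at the time of the failure is usually the most telling, since that's where the decision to fence and recover the VM is made.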
 
I can't manage to reproduce this anymore. I tried it a bunch of times, making the master fail, making the slaves fail, but there's no way to trigger the same bug.

So I guess it's solved. I'm putting this in production, but I don't like it :(
 
Figured it out! The problem is that GlusterFS locks files during a heal, and since I'm running the GlusterFS servers directly on the Proxmox nodes (small cluster), when the network goes down the VM keeps writing to its disk for a minute, forcing a heal when the node rejoins the cluster.
I was able to make this a lot better by enabling sharding on the volume: downtime went from dozens of minutes to a few seconds.
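For anyone looking for the commands, sharding is a volume option; a minimal sketch, assuming your volume is called gv0 (the name and block size here are assumptions, pick what fits your setup):

```shell
# Enable sharding on the volume; only files created afterwards are sharded.
gluster volume set gv0 features.shard on

# Optionally set the shard size (smaller shards mean finer-grained heals,
# at the cost of more backend files per image).
gluster volume set gv0 features.shard-block-size 64MB
```

With sharding on, a heal only has to synchronize the shards that changed while the node was away, instead of copying the whole multi-gigabyte disk image, which is why the downtime drops so dramatically.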

If anyone runs into this: it is possible to enable sharding on an existing volume, but it will only affect new files. To "shard" your existing files, shut down the VM and copy them (to create new files):

Code:
cp vm-100-disk-1.qcow2 vm-100-disk-1.qcow2.new
rm vm-100-disk-1.qcow2
mv vm-100-disk-1.qcow2.new vm-100-disk-1.qcow2

Careful, it can take a while; not great on a production system. It took two hours for a 200 GB file on my cluster.
I'm clearly running into this problem because I have Proxmox and GlusterFS on the same servers, which you aren't supposed to do (but I have no choice for now). But I'm pretty sure this kind of thing can happen on "regular" architectures too; it just needs a slightly more unfortunate network outage.
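If you have several disk images to convert, the copy-and-rename step above can be wrapped in a small loop; a hedged sketch, assuming the images sit in the current directory and every VM using them is powered off (the glob pattern is an assumption based on Proxmox's default naming):

```shell
#!/bin/sh
# Re-create each qcow2 image so its data is written as new (sharded) files.
# The VMs owning these images MUST be stopped before running this.
for img in vm-*-disk-*.qcow2; do
    [ -e "$img" ] || continue       # skip if the glob matched nothing
    cp "$img" "$img.new"            # the copy is written as a fresh, sharded file
    mv "$img.new" "$img"            # rename replaces the original in one step
done
```

Note that `mv` onto the existing name replaces it directly, so the separate `rm` from the manual steps isn't strictly needed; either order ends with a single sharded copy of each image.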