Hi,
I'm trying to set up an HA cluster on OVH servers, using a vRack and GlusterFS for the storage.
I configured everything on the 3 nodes and installed a VM for testing. It runs fine: I can live-migrate it between the hosts, and if I reboot a node cleanly (with the actual reboot command) the VM is restarted on another host.
Now I'd like to simulate a network problem, as I know the vRacks aren't as reliable as advertised: they tend to change switches without warning, so I need to be sure the cluster will behave correctly. To test that, I remove the node currently running the VM from the vRack, which to the cluster looks like that node's network just went down. After a while, the separated node does try to power itself down (which it can't really do, since it's an OVH server, but that's fine), and one of the other two nodes tries to start the VM, but it fails. Sometimes it fails completely and the VM doesn't boot at all; sometimes it appears to start but nothing works (no network, no console...), and in the Proxmox logs I get:
pvedaemon: unable to connect to VM 100 qmp socket - timeout after 31 retries
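For anyone wanting to reproduce this without going through the OVH panel, I believe the same partition can be simulated by just downing the vRack interface on the node; this is a sketch, and the interface name `eth1` is only an assumption (replace it with your actual vRack NIC):

```shell
# On the node currently running the VM: take the vRack interface down
# to simulate the switch change (eth1 is an assumed name).
ip link set eth1 down

# From one of the other nodes, watch the cluster react:
pvecm status        # quorum membership should drop to 2 of 3 nodes
ha-manager status   # the HA stack should fence the lost node

# To end the test, bring the link back up:
ip link set eth1 up
```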
I can stop and boot the VM manually as much as I want; it always fails the same way. If I get the down node to rejoin the cluster, everything works again after (re)booting the VM. Alternatively, if I wait some time with the VM powered down and then start it, it boots fine on the two-node cluster.
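In case it helps with diagnosing, these are the checks I would run on the surviving nodes while the VM is in that broken state; the GlusterFS volume name `gv0` is just a placeholder for whatever your volume is called:

```shell
# Corosync/cluster view: do the two remaining nodes still have quorum?
pvecm status

# GlusterFS side: with one brick unreachable, does the volume still
# report a healthy state, and are there pending heals?
# (gv0 is a placeholder volume name)
gluster volume status gv0
gluster volume heal gv0 info

# QEMU side: is the process actually running while the console hangs?
qm status 100
```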
I installed everything from the jessie repo two days ago, so I should be running the latest stable version. Is this a known bug? Am I missing some configuration somewhere?
Thanks