Failover Failure?

lonegroover · Jan 20, 2012

Hi.

I've recently inherited a Proxmox VE cluster from a sysadmin colleague who is now no longer with our company.

I've been very impressed with it so far. The web admin interface is very slick and the features are quite impressive, especially the live migration facility.

The cluster consists of two servers, each with access to a mirrored NAS array where the individual VMs are represented as logical volumes under /dev/drbdvg{0,1}, e.g. /dev/drbdvg1/vm-106-disk-1.

However, although it has run painlessly and smoothly for months now, one of the servers (called pves1) in the cluster failed yesterday.

I had expected the other (pves2) to take over the VMs automatically on detecting that the other was down - but this didn't happen. All of the (KVM) VMs hosted on pves1 simply crashed with it.

So:

1. Am I right in my assumption that pves2 should have taken over the VMs painlessly on detecting that pves1 was down, or is this not supported?

2. Assuming I am right, how can we fix this?

3. If my assumption above is wrong, is there any way that the VMs from the dead half of the cluster can be migrated and restarted while it's dead? (the data centre is unfortunately quite remote).

Grateful for any help / advice.

lonegroover · Jan 20, 2012

I should have mentioned that this is Proxmox Virtual Environment 1.8. The two physical servers in the cluster run Ubuntu 10.04.

tom · Jan 20, 2012

no, HA is not implemented in 1.x by default. I assume you need to fix this "manually".

HA is introduced in 2.0, which is currently in beta - for details about 2.0 see http://pve.proxmox.com/wiki/Category:Proxmox_VE_2.0

lonegroover · Jan 20, 2012

tom said:
no, HA is not implemented in 1.x by default. I assume you need to fix this "manually".

HA is introduced in 2.0, which is currently in beta - for details about 2.0 see http://pve.proxmox.com/wiki/Category:Proxmox_VE_2.0

Thanks Tom, appreciate the reply.

When you say it's not implemented in 1.x by default - is there a (non-default) way to organise some sort of failover scenario?

We will look to upgrading to 2.0 in the longer term.

Still interested to know if there's some sort of manual way to get the VMs over from the dead server - the storage would still be online, of course.

Could it be as simple as setting up new server instances on the surviving server using the same logical volumes as the VMs that are down, then starting them from there - would that work? Or would Proxmox disallow the same volumes to be used?

Thanks ..

tom · Jan 20, 2012

HA is not simple, there is no easy way. If you want HA, evaluate 2.0.

lonegroover · Jan 20, 2012

tom said:
HA is not simple, there is no easy way. If you want HA, evaluate 2.0.

Thanks, Tom. We will do that.

I tried a rather hacky manual resurrection by editing a new VM config file, and that worked - just did a test by powering down half the cluster.

udo · Jan 21, 2012

lonegroover said:
Thanks, Tom. We will do that.

I tried a rather hacky manual resurrection by editing a new VM config file, and that worked - just did a test by powering down half the cluster.

Hi,
for this case it's a good idea to rsync the VM configs to the other node (into a backup directory). In an emergency you then only need to move the config to /etc/qemu-server and start the VM again (this only works for VMs on shared storage).

Udo
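
A minimal sketch of the backup-and-restore idea described above, not a tested recovery procedure. It assumes the configs are files named <vmid>.conf, and it uses a plain local file copy in place of rsync so that the example is self-contained; in practice the backup directory would sit on the other node and be filled by rsync. The function names and the backup path are hypothetical. Only /etc/qemu-server comes from the thread.

```python
import shutil
from pathlib import Path


def backup_configs(config_dir, backup_dir):
    """Copy every *.conf file from config_dir into backup_dir.

    Returns the sorted list of file names that were copied.
    """
    config_dir, backup_dir = Path(config_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for conf in sorted(config_dir.glob("*.conf")):
        shutil.copy2(conf, backup_dir / conf.name)  # keeps timestamps
        copied.append(conf.name)
    return copied


def restore_config(vmid, backup_dir, config_dir):
    """Move one saved config into config_dir so the VM can be started there.

    Refuses to overwrite a config that already exists on this node.
    Assumes the file is named <vmid>.conf (an assumption, see above).
    """
    src = Path(backup_dir) / f"{vmid}.conf"
    dst = Path(config_dir) / f"{vmid}.conf"
    if not src.is_file():
        raise FileNotFoundError(f"no saved config for VM {vmid} in {backup_dir}")
    if dst.exists():
        raise FileExistsError(f"{dst} already exists, not overwriting")
    shutil.move(str(src), str(dst))
    return dst


if __name__ == "__main__":
    # Hypothetical backup location; /etc/qemu-server is the path named in the thread.
    print(backup_configs("/etc/qemu-server", "/root/qemu-server-backup"))
```

On the surviving node, the emergency step would then be something like restore_config(106, "/root/qemu-server-backup", "/etc/qemu-server"), followed by starting the VM. Make sure the failed node really is down first, since two nodes running the same VM against the same logical volume would corrupt the disk.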
