Failover Failure?

lonegroover

Guest
Hi.

I've recently inherited a Proxmox VE cluster from a sysadmin colleague who is now no longer with our company.

I've been very impressed with it so far. The web admin interface is very slick and the features are quite impressive, especially the live migration facility.

The cluster consists of two servers, each with access to a mirrored NAS array where the individual VMs are represented as logical volumes under /dev/drbdvg{0,1}, e.g. /dev/drbdvg1/vm-106-disk-1.

However, although it has run painlessly and smoothly for months now, one of the servers (called pves1) in the cluster failed yesterday.

I had expected the other (pves2) to take over the VMs automatically on detecting that the other was down - but this didn't happen. All of the (KVM) VMs hosted on pves1 simply crashed with it.

So:

1. Am I right in my assumption that pves2 should have taken over the VMs painlessly on detecting that pves1 was down, or is this not supported?

2. Assuming I am right, how can we fix this?

3. If my assumption above is wrong, is there any way that the VMs from the dead half of the cluster can be migrated and restarted while it's dead? (the data centre is unfortunately quite remote).

Grateful for any help / advice.
 
Should have mentioned that this is Proxmox Virtual Environment 1.8. The two physical servers in the cluster run Ubuntu 10.4.
 
No, HA is not implemented in 1.x by default. I assume you need to fix this "manually".

HA is introduced in 2.0, which is currently in beta - for details about 2.0 see http://pve.proxmox.com/wiki/Category:Proxmox_VE_2.0

Thanks Tom, appreciate the reply.

When you say it's not implemented in 1.x by default - is there a (non-default) way to organise some sort of failover scenario?

We will look to upgrading to 2.0 in the longer term.

Still interested to know if there's some sort of manual way to get the VMs over from the dead server - the storage would still be online, of course.

Could it be as simple as setting up new server instances on the surviving server using the same logical volumes as the VMs that are down, then starting them from there - would that work? Or would Proxmox disallow the same volumes to be used?

Thanks ..
 
HA is not simple, there is no easy way. If you want HA, evaluate 2.0.
 
Thanks, Tom. We will do that.

I tried a rather hacky manual resurrection by editing a new VM config file, and that worked - just did a test by powering down half the cluster.
 
Hi,
for this case it's a good idea to rsync the VM configs to the other node (to a backup directory). In an emergency you then only need to move the config to /etc/qemu-server and start the VM again (this only works for VMs on shared storage).

Udo
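
The restore step described above could be sketched roughly as follows. This is only an illustration under several assumptions: that the configs are stored as <vmid>.conf, that they have already been synced to a backup directory on the surviving node (the path /root/peer-vm-configs is hypothetical), and that a Python 3 interpreter is available on the node. The function names are made up for this example. It copies the config rather than moving it, so the backup stays intact.

```python
import shutil
import subprocess
from pathlib import Path

# /etc/qemu-server is the directory named in the post above.
# The backup path is a hypothetical example; use wherever the configs are synced to.
CONFIG_DIR = Path("/etc/qemu-server")
BACKUP_DIR = Path("/root/peer-vm-configs")


def restore_config(vmid, backup_dir=BACKUP_DIR, config_dir=CONFIG_DIR):
    """Copy <vmid>.conf from the backup directory into the live config directory.

    Assumes configs are named <vmid>.conf. Refuses to overwrite an existing
    config so a VM already defined on this node is never clobbered.
    """
    source = Path(backup_dir) / f"{vmid}.conf"
    target = Path(config_dir) / f"{vmid}.conf"
    if not source.is_file():
        raise FileNotFoundError(f"no backed-up config for VM {vmid}: {source}")
    if target.exists():
        raise FileExistsError(f"VM {vmid} already has a config here: {target}")
    shutil.copy2(source, target)
    return target


def start_vm(vmid):
    """Start the VM with the qm command-line tool on this node."""
    subprocess.run(["qm", "start", str(vmid)], check=True)
```

For VM 106 that would mean calling restore_config(106) followed by start_vm(106) on the surviving node. Only do this once you are sure the failed node really is down: if both nodes run the same VM against the same shared volume, the disk will be corrupted.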
 
