Proxmox 2.3 HA Problem

rjbick

New Member
Jul 11, 2014
I googled around and came up empty; please help if you can.

I have a two-node cluster with active-active DRBD, and one of the host machines failed during backups. Migration over to the other machine didn't occur (for what it's worth, I only run on one machine at a time). I found the system in a state where two of the VMs were off and the VM whose backup took down the host was still on. I tried qm migrate over to the other host, which failed. In the past, when that's the case, I just three-finger-salute the machine that's still running. This worked for the two VMs that were off when I found that state, but the active one was stuck in:
Service Name        Owner (Last)        State
------- ----        ----- ------        -----
pvevm:100           none                recovering

I can't find any reference material on how to fix this. I tried clusvcadm -e pvevm:100, which has done the trick in the past, but it didn't in this case. So to get the machine started again I took it out of the HA cluster, started it, and re-added it. But the state is still "recovering". Does anyone have any insight on how to fix this?
 
You need at least 3 nodes for HA because you lose quorum if you have only 2 nodes. Try to disable the service to fix your problem.
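
You can check whether the cluster still has quorum with the standard tools on Proxmox 2.x, for example:

# both show the expected vs. current vote count and whether the cluster is quorate
pvecm status
cman_tool status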
 
I have a qdisk (iSCSI on a Raspberry Pi) for the 3rd quorum vote. It has worked well in the past. So you suggest:

Disabling
clusvcadm -d pvevm:100

then

Enabling
clusvcadm -e pvevm:100
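
I'll check the state with clustat before and after, so the whole sequence would look roughly like this (just a sketch with the standard rgmanager tools):

# show current HA service states and quorum
clustat

# disable the stuck service, then re-enable it
clusvcadm -d pvevm:100
clusvcadm -e pvevm:100

# confirm pvevm:100 has left the "recovering" state
clustat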
 
Please can you also post the VM config from one machine which does not start - maybe the backup lock is still set in the config?
 
I am going to; I'm just trying to settle on storage and datacenter failover first. We definitely need to upgrade, especially since the Squeeze repos aren't available from Debian anymore. Here is the config for VM 100:

bootdisk: virtio0
cores: 2
memory: 4096
name: breaking
net0: virtio={OMITTED},bridge=vmbr0
ostype: l26
sockets: 2
virtio0: Breaking:vm-100-disk-1,size=32G
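
There is no lock line in there. For reference, a quick way to check for and clear a leftover backup lock on Proxmox would be something like this (assuming the standard /etc/pve/qemu-server config location):

# look for a leftover "lock: backup" line in the VM config
grep lock /etc/pve/qemu-server/100.conf

# if it is there, clear the lock before starting the VM
qm unlock 100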

I just thought of something: I recall reading somewhere that virtio can have some problems on shared storage, but maybe I'm thinking of something else.
 
We have another problem on the same cluster: the cluster.conf version is different from when I removed the VM from HA to start it, and now the other host has the wrong conf. Any thoughts?

At this point only the version number of the conf is different.
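
From what I've read, the usual route on the cman stack seems to be to bump config_version on the good copy and push it out again, something like the sketch below; the Proxmox-documented way is apparently to edit /etc/pve/cluster.conf.new and activate it from the HA tab in the GUI. Does that sound right?

# after raising config_version in the <cluster ...> tag of the good cluster.conf:
ccs_config_validate        # sanity-check the file
cman_tool version -r       # ask cman to reload and propagate the new version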