rgmanager marking VMs as failed

athompso

Member
Sep 13, 2013
On a fairly regular basis, but unpredictably, when I attempt to migrate a VM from one host to another, I get failures with no useful error information.
Digging deeper reveals that the problem is at the rgmanager level somehow, with the VM resources being marked as "failed" even though they're still running!
The only 100% reliable way I've found to clear the problem is to manually kill the KVM process, run "clusvcadm -d" to disable the offending resource, and then run "clusvcadm -e" to re-enable it, at which point it automatically migrates to another host.
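For reference, the sequence I run looks roughly like this (VMID 101 is just a placeholder; the rgmanager resource names here follow the usual pvevm:<vmid> convention, and the PID file location is the standard qemu-server default):

    kill $(cat /var/run/qemu-server/101.pid)   # stop the stuck KVM process on the current host
    clusvcadm -d pvevm:101                     # disable the "failed" resource in rgmanager
    clusvcadm -e pvevm:101                     # re-enable it; rgmanager then starts it on another node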

I have a 4-node cluster, all configured identically: four 1 GbE interfaces bonded together, the management interface on a VLAN interface, and VMs mostly on other VLANs. PVE-hosted Ceph, running on the same four nodes, is the underlying datastore for all VMs.

The network switched to OVS a few weeks ago, but this problem has been happening both before and after that change.

What should I be looking for to further diagnose this problem?

FYI, corosync remains happy throughout this problem - only rgmanager appears to be affected.

(Previously reported in bug #297, but Martin doesn't think it's a bug... and I don't know where to look next.)

-Adam
 
