Fairly regularly, but unpredictably, when I attempt to migrate a VM from one host to another, the migration fails with no useful error information.
Digging deeper reveals that the problem is at the rgmanager level somehow, with the VM resources being marked as "failed" even though they're still running!
The only 100% guaranteed way I've found to clear the problem is to manually kill the KVM process, then run "clusvcadm -d" to disable the offending resource, then "clusvcadm -e" to re-enable it, which then automatically migrates it to another host.
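For anyone who wants to reproduce the workaround, it can be sketched as a small shell function. This is only a sketch under assumptions: `recover_vm` and `run` are hypothetical helper names, the rgmanager service name is assumed to be `pvevm:<vmid>`, and the pid file is assumed to live at /var/run/qemu-server/<vmid>.pid. Adjust all of these to match the actual cluster.

```shell
#!/bin/sh
# Sketch of the manual recovery steps described above.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# recover_vm is a hypothetical helper, not part of PVE or rgmanager.
recover_vm() {
    vmid="$1"
    svc="pvevm:${vmid}"                          # assumed rgmanager service name
    pidfile="/var/run/qemu-server/${vmid}.pid"   # assumed pid file location

    # 1. Kill the KVM process that rgmanager has marked as failed.
    #    Skipped if no readable pid file exists on this node.
    if [ -r "$pidfile" ]; then
        run kill "$(cat "$pidfile")"
    fi

    # 2. Disable the offending resource, then re-enable it.
    run clusvcadm -d "$svc"
    run clusvcadm -e "$svc"
}

# Example (dry run, VM ID 101 is made up): DRY_RUN=1 recover_vm 101
```

Running it with DRY_RUN=1 first shows exactly which commands would be issued before anything is killed.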
I have a 4-node cluster, all nodes configured identically. Each node has four 1 Gb Ethernet ports bonded together, with the mgmt interface on a VLAN interface and the VMs mostly on other VLANs. PVE-hosted Ceph on the same 4 nodes is the underlying data store for all VMs.
The network uses OVS (as of a few weeks ago), but this problem has been happening both before and after switching to OVS.
What should I be looking for to further diagnose this problem?
FYI, corosync remains happy throughout this problem - only rgmanager appears to be affected.
(Previously reported as bug #297, but Martin doesn't think it's a bug... and I don't know where to look next.)
-Adam