HA migration not working when losing CEPH

Fred Saunier

Hi all,

In a 5-node cluster with 3 VLANs (prod, corosync, ceph) and all VMs living on CEPH, if a node loses its connection to CEPH (while prod and corosync are still up), the VMs hosted on that node are not automatically migrated by HA. I have to manually stop the VMs on the node, manually move their .conf files to another node, and start them there. My understanding is that this could be because the node is still visible to corosync.

Any idea how I can get HA to migrate those VMs when the node loses access to CEPH?

Thanks,
Fred
 
Do you have multiple links on your servers? If yes, which one are you using for corosync?
 
Yes, there are multiple links on each node:
  • NIC1 goes to the "prod" VLAN
  • NIC2 goes to the "coro" VLAN
  • NIC3 goes to the "ceph" VLAN
In addition, corosync also has a redundant link on the "prod" VLAN.
 
The best way would be to create a bond from 2 or all 3 of the NICs. Then a single failed link is no problem, and if the whole bond fails, the node gets fenced.

Your problem is that corosync still has a connection but the storage does not, so PVE sees the node as still alive and will not fence it. Maybe you can use a script that checks whether the storage is available and reacts when it is not.

But in general, every link on the node should be redundant.
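
A rough sketch of what such a storage check script could look like, in Python. This is only an illustration and nothing that ships with PVE. It assumes `ceph health` is a usable probe on the node. The thresholds are arbitrary, and `self_fence` is a hypothetical placeholder: what the node should actually do once the storage is gone (for example reboot itself so HA can recover the VMs elsewhere) is for you to decide and test.

```python
import subprocess
import time
from typing import Callable, Sequence

# Assumptions: probe command and thresholds are examples, adjust to your setup.
CHECK_CMD = ["ceph", "health"]
CHECK_TIMEOUT_S = 10   # give up on a single probe after this many seconds
MAX_FAILURES = 3       # consecutive failed probes before reacting
INTERVAL_S = 20        # pause between probes


def storage_reachable(cmd: Sequence[str] = CHECK_CMD,
                      timeout: float = CHECK_TIMEOUT_S) -> bool:
    """Return True if the probe command exits with 0 within the timeout."""
    try:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        # A hanging or missing command counts as "storage not reachable".
        return False
    return result.returncode == 0


def watch(check: Callable[[], bool], on_failure: Callable[[], None],
          max_failures: int = MAX_FAILURES, interval: float = INTERVAL_S,
          sleep: Callable[[float], None] = time.sleep,
          max_rounds=None) -> bool:
    """Probe repeatedly; call on_failure after max_failures misses in a row.

    Returns True if on_failure was called, False if max_rounds ran out first.
    """
    failures = 0
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        if check():
            failures = 0          # any successful probe resets the counter
        else:
            failures += 1
            if failures >= max_failures:
                on_failure()
                return True
        sleep(interval)
    return False


def self_fence() -> None:
    # Hypothetical placeholder: put your own reaction here.
    print("storage unreachable: this node should be fenced now")


# To run it as a long-lived service:
# watch(storage_reachable, self_fence)
```

Requiring several failed probes in a row keeps a short network hiccup from taking the node down. You would run this as a service on every node.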
 
