VMs not migrating when Ceph is degraded in 3-Node Full-Mesh Cluster

Aug 20, 2025
Hello Community,

I am currently setting up our new 3-node Proxmox cluster and am pretty new to Proxmox itself. We are using a full mesh with 25 Gbit/s cards for Ceph, 10 Gbit/s cards for Coro/VMBR, and 18 SATA 6G enterprise SSDs (6 per node). Ceph performance took a bit of testing, but we are now at a point where we are happy with it (the PG count made a huge difference: we disabled the PG autoscaler and went from 32 to 512). The servers were installed with Proxmox VE 9, and the full mesh was configured using the SDN Fabrics feature.
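For reference, the PG change was done with the standard ceph CLI calls, roughly along these lines (shown here as a throwaway Python wrapper; the pool name is just a placeholder, not our real pool):

```python
#!/usr/bin/env python3
"""Sketch of the PG change described above, not the exact commands we ran.

Assumptions: the RBD pool is called 'vm-pool' (placeholder) and the
target of 512 PGs matches what worked for us.
"""
import subprocess

POOL = "vm-pool"   # hypothetical pool name, replace with the real one
PG_NUM = 512

def ceph(*args):
    # Thin wrapper around the ceph CLI; raises if a command fails.
    subprocess.run(["ceph", *args], check=True)

# Stop the autoscaler from reverting the manual value.
ceph("osd", "pool", "set", POOL, "pg_autoscale_mode", "off")

# Raise pg_num/pgp_num from 32 to 512; Ceph rebalances afterwards.
ceph("osd", "pool", "set", POOL, "pg_num", str(PG_NUM))
ceph("osd", "pool", "set", POOL, "pgp_num", str(PG_NUM))
```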

Now I am testing HA functionality and am running into a bit of a problem.

Scenario A: Full Node failure

This works fine: when I simulate the failure by unplugging all network connections on a node, the VMs that are added as HA resources migrate to the other two nodes as expected.

Scenario B: Coro/VMBR Uplink failure

Same as above, VMs migrate as expected.

Scenario C: Single Ceph port failure

When I pull the connection between two nodes, e.g. pve1 and pve2, Ceph still works as expected and the VMs are not affected. This tells me the fabric is working as intended.

Scenario D: Full Ceph network card failure

This is where it gets confusing. If I pull the cables on both Ceph ports of, let's say, node pve2, Ceph goes into a degraded state as expected and reports 6 OSDs, 1 host and 1 MON down. But since the node is still reachable on the cluster network, the VMs stall and stay on that node. The HA resources show they are stuck in the "recovery" state, and I can't migrate them manually; when I try, the task eventually ends with the status "migration aborted". I thought that shutting down the node might allow me to start the VMs on one of the remaining ones, but when I try to migrate them, I just get the message "no route to host". That is expected, since pve2 is down, but shouldn't the VM data be available on the two other nodes because of Ceph?
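Assuming the pool uses the usual size 3 / min_size 2 replication, the data should indeed still be fully readable and writable on the two healthy nodes. A quick sketch to confirm the pool settings (pool name is again a placeholder):

```python
#!/usr/bin/env python3
"""Sketch: check that the pool really keeps a replica on every node.

Assumption: the pool is called 'vm-pool' (placeholder). With size=3 and
min_size=2, the two surviving nodes still hold a complete, writable copy
of every object, so the data itself is not what blocks the migration.
"""
import subprocess

POOL = "vm-pool"  # hypothetical pool name

for setting in ("size", "min_size", "crush_rule"):
    out = subprocess.run(
        ["ceph", "osd", "pool", "get", POOL, setting],
        check=True, capture_output=True, text=True,
    )
    print(out.stdout.strip())
```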

I found a thread from 2017 that seems to ask about the same problem, but it was never resolved:

It looks like the Proxmox quorum (Corosync) and the Ceph MONs don't talk to each other, so the former thinks everything is OK (because Corosync works as expected) while the latter sees that Ceph is (obviously) degraded. Maybe this is a configuration issue, but I could not find anything pertaining to this. Is this not a potential SPOF?
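As far as I can tell, the mismatch could at least be detected from outside the HA stack. A very rough, untested sketch of a per-node check (just Python around the standard pvecm/ceph CLIs; what to do once it fires, alert, fence or migrate, is a separate question):

```python
#!/usr/bin/env python3
"""Rough sketch: detect the 'Corosync fine, local Ceph dead' mismatch.

Proxmox HA only watches Corosync membership, so a Ceph-only NIC failure
never triggers recovery on its own. Run this on each node, e.g. from a
systemd timer, to at least surface the condition.
"""
import subprocess

def cmd_ok(args, timeout):
    # True if the command exits 0 within the timeout.
    try:
        subprocess.run(args, check=True, capture_output=True, timeout=timeout)
        return True
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired,
            FileNotFoundError):
        return False

# Corosync/PVE cluster view from this node. A stricter check would
# parse the "Quorate:" line instead of relying on the exit code.
corosync_ok = cmd_ok(["pvecm", "status"], timeout=10)

# Ceph MON reachability from this node over the Ceph network; if the
# local Ceph NIC is dead, this fails on this node only.
ceph_ok = cmd_ok(["ceph", "-s", "--connect-timeout", "10"], timeout=20)

if corosync_ok and not ceph_ok:
    # Exactly the Scenario D situation: HA thinks the node is healthy,
    # but the node has lost its Ceph network.
    print("WARNING: quorate in Corosync but Ceph unreachable from this node")
```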

Maybe someone with a bit more experience can shed some light on this?
 
Maybe someone with a bit more experience can shed some light on this?
Unfortunately no light from me, but yes, I ran into the same thing. I simply avoided the problem by not using a mesh network for Ceph and going through two switches instead. If both switches fail, everything fails, so at least there are no unexpected problems.