VM stuck on starting after moving nodes

CosmicDog

New Member
Feb 2, 2025
Hello all,
I am trying to set up a 3-node cluster with failover. Nodes 1 and 2 hold my OSDs, and nodes 1, 2, and 3 are monitors. Nodes 1 and 2 are my production servers; node 3 is just for negotiation. If I gracefully power off node 1, the VM moves to node 2 (using HA failover) and starts normally. But if I simulate a crash by forcefully shutting node 1 down, the VM moves over but gets stuck on "starting" indefinitely until I start node 1 again, almost as if the disk were locked. I have tried removing the lock on the RBD image, with no luck. Also, I can't do anything in the shell while node 1 is offline, but as soon as I start node 1 again everything is fine.
 
nodes 1 and 2 have my OSDs
A Ceph cluster with two nodes? You are a brave man...

You probably did not keep the default "size=3, min_size=2", did you? That would require (at least) three nodes with OSDs!

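When one of your two OSD nodes fails, every placement group that drops below min_size stops serving I/O until the node comes back, so the whole Ceph storage effectively freezes. Everything stalls!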
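
To make the size/min_size point concrete, here is a rough Python sketch of the rule. It is a simplified model, not Ceph code: it assumes a host-level failure domain (at most one replica per node), looks only at the worst-case placement group, and ignores recovery/backfill. The function names are made up for illustration.

```python
def surviving_replicas(size: int, osd_hosts: int, failed_hosts: int) -> int:
    """Worst-case number of replicas left for a placement group.

    Assumption: one replica per host, so Ceph cannot place more
    replicas than there are hosts with OSDs.
    """
    placed = min(size, osd_hosts)
    return max(0, placed - failed_hosts)


def pg_serves_io(size: int, min_size: int, osd_hosts: int, failed_hosts: int) -> bool:
    """A placement group keeps serving I/O only while replicas >= min_size."""
    return surviving_replicas(size, osd_hosts, failed_hosts) >= min_size


# Two OSD nodes, one of them crashes:
print(pg_serves_io(size=3, min_size=2, osd_hosts=2, failed_hosts=1))  # False
print(pg_serves_io(size=2, min_size=2, osd_hosts=2, failed_hosts=1))  # False
# Three OSD nodes with the defaults, one of them crashes:
print(pg_serves_io(size=3, min_size=2, osd_hosts=3, failed_hosts=1))  # True
```

In this model a two-node setup only survives a node loss with min_size=1, which leaves a single copy of the data accepting writes and is generally discouraged.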

There are some more pitfalls and caveats: https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/

Also, I can't even do anything in the shell when node 1 goes offline but as soon as I start node 1 back everything is fine.
Did you check that quorum stays active? You have three nodes, so with a single node down "pvecm status" should still report "Flags: Quorate".

Note also that the PVE quorum (corosync) and the Ceph monitor quorum are completely independent of each other.
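
Both mechanisms rely on a strict majority of their own voters, which is why three nodes can tolerate one failure on that front. A minimal Python sketch of the arithmetic, assuming one vote per node and no tie-breaker device (the function names are hypothetical):

```python
def votes_needed(total_voters: int) -> int:
    """Strict majority: more than half of all configured voters."""
    return total_voters // 2 + 1


def has_quorum(total_voters: int, voters_online: int) -> bool:
    """True while the voters still online form a strict majority."""
    return voters_online >= votes_needed(total_voters)


# Three PVE nodes (or three Ceph monitors), one node down:
print(votes_needed(3))   # 2
print(has_quorum(3, 2))  # True
# A second node going down loses the majority:
print(has_quorum(3, 1))  # False
```

The same check applies separately to the corosync voters and to the Ceph monitors. Neither one says anything about whether the OSDs still hold enough replicas to serve I/O.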