Help needed: VMs stuck on "Booting from Hard Disk" with Ceph RBD

petwri

New Member
Jan 27, 2025
I am using Ceph RBD as my VM image storage backend. Just recently one of my Ceph OSDs crashed and I had to replace it. A recovery is now running and a few PGs are degraded.

This resulted in VMs not being able to boot anymore. I don't know if it's just a coincidence that it happened at the same time or if it's related to Ceph doing a recovery, but some VMs are stuck at "Booting from Hard Disk" and nothing happens. I already tried playing around with the HDD options of the VMs (skip replication, boot order, caching), but it doesn't make any difference. I can migrate VMs between nodes (which happens instantly, which makes sense since the images are on Ceph RBD), but regardless of the node I use, I cannot start any of them.

I am a little stuck here, since I don't really know how to proceed or what logs I should check. I don't want to restart any other VMs because I am afraid none of them will come up. The RBD images are definitely there.
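A minimal starting point for this situation would be something like the following checks (pool and disk names are placeholders for whatever your storage.cfg and VM config actually use):

# Overall cluster state: degraded/undersized PGs, slow ops, blocked requests
ceph -s
ceph health detail

# Which OSDs are down/out and where the recovery stands
ceph osd tree
ceph pg dump_stuck

# The VM disk itself (placeholder pool/image names)
rbd ls -p <pool>
rbd status <pool>/vm-100-disk-0    # lists watchers, i.e. whether KVM has the image open

# On the node running the VM, recent Ceph/RBD messages in the journal
journalctl -b --since "1 hour ago" | grep -i -e rbd -e ceph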

Update: I just tried to export the image to local storage, and the export gets stuck after a few MBs, so it seems like my Ceph RBD storage is compromised. I'll let Ceph finish the recovery and then check again.
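To confirm that the stalled export is really caused by the degraded PGs, the image's objects can be mapped to PGs, roughly like this (pool and disk names are again placeholders):

# Get the object prefix of the image
rbd info <pool>/vm-100-disk-0    # note the block_name_prefix, e.g. rbd_data.abc123

# See which PG (and which OSDs) a given object of that image maps to
ceph osd map <pool> rbd_data.abc123.0000000000000000

# List PGs that are not active+clean
ceph pg dump_stuck unclean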
 
You did not tell us how many nodes and how many OSDs you have.

Just for your future planning: a stable Ceph cluster may need some more resources than the bare minimum: https://forum.proxmox.com/threads/fabu-can-i-use-ceph-in-a-_very_-small-cluster.159671/

Hi, 12 OSDs and 3 nodes. smartmontools reported the disk was about to die, so I marked it out, let the recovery finish, then took it down. That's when the PGs became degraded.
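For reference, the usual sequence for retiring a failing OSD while keeping the cluster healthy is roughly the following sketch (osd.<id> is a placeholder for the failing OSD):

# Mark the OSD out and let Ceph rebalance the data away from it
ceph osd out osd.<id>

# Watch until all PGs are active+clean again
ceph -s

# Only then stop the daemon (on the node hosting it) and remove it from the cluster
systemctl stop ceph-osd@<id>
ceph osd purge <id> --yes-i-really-mean-it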
Hello, I went through the discussion at the link but couldn't find a solution to my current problem. I am currently experiencing the same issue as @petwri.

We have a 7-node cluster. Node 4 went offline due to NIC issues; because of that, the NTP synchronization across nodes became skewed and Ceph went offline. We configured node 1 as the primary NTP server and configured all other nodes to synchronize to it. The clock skew issue was resolved and Ceph came back online, but most of my critical machines are not booting up. They get stuck during boot. Here are the details of my setup below; can you be of any help? Thank you.
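For the clock skew side, the monitors themselves report their view of time sync, and chrony can be cross-checked on each node. A rough sketch (assuming chrony, the Proxmox default; adjust if you run systemd-timesyncd or ntpd):

# Ask the monitors how far apart their clocks are
ceph time-sync-status

# On each node, verify it is actually syncing against the internal NTP server
chronyc tracking
chronyc sources -v
timedatectl status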


Nodes: 7 total
  Up: 6 (ve01, ve02, ve03, ve05, ve06, ve07)
  Down: 1 (ve04 — offline, missing Mellanox NIC)
MONs: 7 configured
  In quorum: 6 (ve01, ve02, ve03, ve05, ve06, ve07)
  Down: 1 (mon.ve04 — out of quorum)
MGRs: 6
  Active: 1 (ve01)
  Standby: 5 (ve02, ve03, ve05, ve06, ve07)
OSDs: 84 total
  Up and in: 67
  Down: 17
    12 on ve04 (offline with the node)
    5 orphaned (osd.52, 53, 54, 55 on ve05; osd.63 on ve06 — no backing storage, pending cleanup)
Placement Groups: 673 total
  Active + clean: 672 (99.85%)
  Down: 1 (PG 2.136 — contains zero objects; see the commands sketched after this list)
MDS: 6
  Active: 1
  Standby: 5
Pools: 4
  .mgr (1 PG)
  wacren-ve-pool (512 PGs) — RBD for VMs
  cephfs_data (128 PGs)
  cephfs_metadata (32 PGs)
Storage:
  Raw capacity: 234 TiB
  Used: 54 TiB (23%)
  Available: 180 TiB
  Replication: size=3, min_size=2 (host-level failure domain)
Cluster Health: HEALTH_WARN (down from HEALTH_ERR)
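Regarding the one down PG (2.136) and the orphaned OSDs: before forcing anything, it is worth seeing why the PG is down and which OSDs it is waiting for. A cautious sketch (the OSD ids are taken from the list above; double-check that nothing backs them before purging):

# Why is the PG down and which OSDs does it want?
ceph health detail
ceph pg 2.136 query
ceph pg map 2.136

# Orphaned OSDs with no backing storage can be removed from the cluster map
ceph osd purge 52 --yes-i-really-mean-it
# ...repeat for 53, 54, 55 and 63 once you are sure they have no data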