Dear all,
After a short power loss, my three-node cluster is not coming up again.
System facts: Proxmox 8.3.3, Ceph Reef 18.2.4; each node has 3 SSDs (4 TB SATA, 2 TB NVMe, 1 TB SATA with an OSD partition) carrying 3 OSDs, for a total of 9 OSDs.
The system tries to activate the OSDs, then decides that ALL OSDs of a node are too bad and marks them down.
I've tried to disable the down-marking of the OSDs, but the cluster shows no signs of recovery.
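In case it matters, this is roughly how I set the flags (I'm assuming nodown/noout are the right cluster flags for keeping OSDs from being marked down while debugging):

```shell
# Tell the monitors not to mark OSDs down or out while I debug.
# Standard Ceph cluster flags; run on any node with the admin keyring.
ceph osd set nodown
ceph osd set noout

# To revert later:
# ceph osd unset nodown
# ceph osd unset noout
```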
root@f4:~# ceph health
HEALTH_WARN 6 osds down; 2 hosts (6 osds) down; Reduced data availability: 65 pgs inactive, 65 pgs down
Log entries look like this:
2025-02-19T18:10:21.086911+0100 mon.f4 (mon.0) 5579 : cluster 3 Health check update: 44 slow ops, oldest one blocked for 357 sec, daemons [osd.0,osd.5,osd.6,mon.fuji4] have slow ops. (SLOW_OPS)
2025-02-19T18:10:21.087248+0100 mon.f4 (mon.0) 5580 : cluster 1 osd.1 failed (root=default,host=f5) (2 reporters from different host after 326.382033 >= grace 324.756848)
2025-02-19T18:10:21.087292+0100 mon.f4 (mon.0) 5581 : cluster 1 osd.6 failed (root=default,host=f5) (2 reporters from different host after 326.381985 >= grace 325.887730)
2025-02-19T18:10:21.087318+0100 mon.f4 (mon.0) 5582 : cluster 1 osd.8 failed (root=default,host=f5) (2 reporters from different host after 326.381968 >= grace 325.646970)
2025-02-19T18:10:21.087847+0100 mon.f4 (mon.0) 5583 : cluster 3 Health check update: 6 osds down (OSD_DOWN)
2025-02-19T18:10:21.087860+0100 mon.f4 (mon.0) 5584 : cluster 3 Health check update: 2 hosts (6 osds) down (OSD_HOST_DOWN)
2025-02-19T18:10:21.101109+0100 mon.fuji4 (mon.0) 5585 : cluster 0 osdmap e3638: 9 total, 3 up, 9 in
2025-02-19T18:10:21.759021+0100 osd.6 (osd.6) 4647 : cluster 3 2 slow requests (by type [ 'delayed' : 2 ] most affected pool [ 'pool1' : 1 ])
2025-02-19T18:10:22.108540+0100 mon.fuji4 (mon.0) 5588 : cluster 0 osdmap e3639: 9 total, 3 up, 9 in
2025-02-19T18:10:22.457209+0100 mgr.fuji4 (mgr.84034215) 4603 : cluster 0 pgmap v4772: 65 pgs: 26 stale+peering, 39 peering; 184 GiB data, 524 GiB used, 18 TiB / 18 TiB avail
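For reference, these are the standard commands I can run to gather more info (using osd.1, one of the failed OSDs from the log above, as the example) — happy to post whichever output helps:

```shell
# Cluster-wide view: which OSDs are up/down and where they sit in the CRUSH tree
ceph -s
ceph osd tree

# Currently set cluster flags (e.g. nodown/noout)
ceph osd dump | grep flags

# Per-OSD daemon state and recent logs on the affected node
systemctl status ceph-osd@1
journalctl -u ceph-osd@1 --since "1 hour ago"
```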
Please give me tips on how to debug or fix this!
Thanks,
engineer5