Hi,
thanks to the high temperatures yesterday, all nodes of my little Proxmox cluster were forced to shut down due to overheating.
After rebooting, all Ceph OSDs are stuck flapping between down and peering.
Now I am looking for documentation on how to repair the Ceph storage.
Code:
# ceph -s
  cluster:
    id:     ed66bf94-647e-4f5d-9ebc-e4dae28c49a7
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            Reduced data availability: 129 pgs inactive, 129 pgs peering
            11 slow ops, oldest one blocked for 46 sec, daemons [osd.0,osd.7,mon.pve1] have slow ops.

  services:
    mon: 3 daemons, quorum pve1,pve2,pve3 (age 2h)
    mgr: pve1(active, since 2h), standbys: pve3, pve2
    osd: 9 osds: 9 up (since 47s), 9 in (since 7h)
         flags noout,norebalance

  data:
    pools:   2 pools, 129 pgs
    objects: 194.92k objects, 740 GiB
    usage:   2.8 TiB used, 5.9 TiB / 8.7 TiB avail
    pgs:     100.000% pgs not active
             129 peering
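The noout and norebalance flags shown above are still set. If I read the Ceph docs correctly, clearing them again (once peering actually works) should just be the following, but I have not run it yet:

Code:
# ceph osd unset noout
# ceph osd unset norebalance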
Code:
# ceph health detail
HEALTH_WARN noout,norebalance flag(s) set; Reduced data availability: 129 pgs inactive, 129 pgs peering; 56 slow ops, oldest one blocked for 121 sec, daemons [osd.0,osd.7,mon.pve1] have slow ops.
[WRN] OSDMAP_FLAGS: noout,norebalance flag(s) set
[WRN] PG_AVAILABILITY: Reduced data availability: 129 pgs inactive, 129 pgs peering
pg 1.0 is stuck peering for 11h, current state peering, last acting [7,5,3]
pg 4.0 is stuck peering for 8h, current state peering, last acting [3,7,8]
pg 4.1 is stuck peering for 12h, current state peering, last acting [1,2,0]
pg 4.2 is stuck peering for 12h, current state peering, last acting [1,4,6]
pg 4.3 is stuck peering for 12h, current state peering, last acting [1,7,6]
pg 4.4 is stuck peering for 8h, current state peering, last acting [6,8,7]
pg 4.5 is stuck peering for 10h, current state peering, last acting [7,1,0]
pg 4.6 is stuck peering for 12h, current state peering, last acting [5,3,7]
pg 4.7 is stuck peering for 8h, current state peering, last acting [6,5,7]
pg 4.8 is stuck peering for 12h, current state peering, last acting [5,7,3]
pg 4.9 is stuck peering for 12h, current state peering, last acting [1,0,2]
pg 4.a is stuck peering for 10h, current state peering, last acting [2,6,1]
pg 4.b is stuck peering for 8h, current state peering, last acting [0,4,5]
pg 4.c is stuck peering for 10h, current state peering, last acting [4,3,1]
pg 4.d is stuck inactive for 8h, current state peering, last acting [4,1,3]
pg 4.19 is stuck peering for 12h, current state peering, last acting [1,3,2]
pg 4.1a is stuck peering for 10h, current state peering, last acting [4,6,5]
pg 4.1b is stuck peering for 10h, current state peering, last acting [4,3,5]
pg 4.1c is stuck peering for 12h, current state peering, last acting [1,6,4]
pg 4.1d is stuck peering for 12h, current state peering, last acting [5,7,3]
pg 4.1e is stuck peering for 8h, current state peering, last acting [0,7,5]
pg 4.1f is stuck peering for 12h, current state peering, last acting [1,4,3]
pg 4.20 is stuck peering for 8h, current state peering, last acting [3,7,8]
pg 4.21 is stuck peering for 12h, current state peering, last acting [1,2,0]
pg 4.22 is stuck peering for 12h, current state peering, last acting [1,4,6]
pg 4.23 is stuck peering for 12h, current state peering, last acting [1,7,6]
pg 4.24 is stuck peering for 8h, current state peering, last acting [6,8,7]
pg 4.25 is stuck peering for 10h, current state peering, last acting [7,1,0]
pg 4.26 is stuck peering for 12h, current state peering, last acting [5,3,7]
pg 4.27 is stuck peering for 8h, current state peering, last acting [6,5,7]
pg 4.28 is stuck peering for 12h, current state peering, last acting [5,7,3]
pg 4.29 is stuck peering for 12h, current state peering, last acting [1,0,2]
pg 4.2a is stuck peering for 10h, current state peering, last acting [2,6,1]
pg 4.2b is stuck peering for 8h, current state peering, last acting [0,4,5]
pg 4.2c is stuck peering for 10h, current state peering, last acting [4,3,1]
pg 4.2d is stuck peering for 10h, current state peering, last acting [4,1,3]
pg 4.2e is stuck peering for 10h, current state peering, last acting [4,5,0]
pg 4.2f is stuck peering for 8h, current state peering, last acting [3,8,4]
pg 4.30 is stuck peering for 8h, current state peering, last acting [3,8,2]
pg 4.31 is stuck peering for 8h, current state peering, last acting [3,7,8]
pg 4.32 is stuck peering for 8h, current state peering, last acting [0,1,7]
pg 4.33 is stuck peering for 10h, current state peering, last acting [4,8,3]
pg 4.34 is stuck peering since forever, current state peering, last acting [5,0,7]
pg 4.35 is stuck peering for 12h, current state peering, last acting [5,3,7]
pg 4.36 is stuck peering for 8h, current state peering, last acting [6,7,5]
pg 4.37 is stuck peering for 8h, current state peering, last acting [3,7,5]
pg 4.38 is stuck peering for 10h, current state peering, last acting [2,6,8]
pg 4.39 is stuck peering for 12h, current state peering, last acting [1,3,2]
pg 4.7b is stuck peering for 10h, current state peering, last acting [4,3,5]
pg 4.7e is stuck peering for 8h, current state peering, last acting [0,7,5]
pg 4.7f is stuck peering for 12h, current state peering, last acting [1,4,3]
[WRN] SLOW_OPS: 56 slow ops, oldest one blocked for 121 sec, daemons [osd.0,osd.7,mon.pve1] have slow ops.
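If it would help, I can also post the output of a query against one of the stuck placement groups, e.g. for pg 1.0 from the list above:

Code:
# ceph pg 1.0 query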
Code:
# cat /var/log/ceph/ceph-osd.1.log
2025-07-03T02:18:36.759+0200 77144b5cd6c0 1 osd.1 10262 is_healthy false -- only 0/4 up peers (less than 33%)
2025-07-03T02:18:36.759+0200 77144b5cd6c0 1 osd.1 10262 not healthy; waiting to boot
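The "only 0/4 up peers" message makes me suspect the OSDs cannot reach each other over the cluster network, so my next step is to check basic connectivity and MTU between the nodes, roughly along these lines (the 10.10.10.x addresses are just placeholders for my cluster network IPs):

Code:
# ping -c 3 10.10.10.2
# ping -c 3 -M do -s 8972 10.10.10.2    (non-fragmenting ping, only relevant if MTU 9000 / jumbo frames are configured)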
Is there a way to bring one of the peers of each OSD back to active and use it as a new "master" OSD, so that replication can restart?