Hello all!
Recently we experienced a power outage and a loss of network connectivity (a Juniper switch used by the Ceph cluster). Some Proxmox/Ceph nodes were restarted as well. Network connectivity and the nodes have since been restored, but the cluster is in a critical state.
On the monitors we can see the PG list, but the cluster is effectively frozen and its status does not change.
This is the health status:
Code:
ceph -s
  cluster:
    id:     xxxx
    health: HEALTH_ERR
            noscrub,nodeep-scrub flag(s) set
            3 nearfull osd(s)
            3 pool(s) nearfull
            no active mgr
            BlueFS spillover detected on 2 OSD(s)
            Reduced data availability: 4083 pgs inactive, 31 pgs down, 289 pgs peering, 2 pgs stale
            Degraded data redundancy: 48877/1358004 objects degraded (3.599%), 25 pgs degraded, 48 pgs undersized
            242 pgs not deep-scrubbed in time
            1 pgs not scrubbed in time
            2 daemons have recently crashed
            18 slow requests are blocked > 32 sec
            6 stuck requests are blocked > 4096 sec
            24 slow ops, oldest one blocked for 47745 sec, daemons [osd.12,osd.13] have slow ops.
            mons are allowing insecure global_id reclaim

  services:
    mon: 3 daemons, quorum mon3,mon4,mon5 (age 2h)
    mgr: no daemons active (since 5h)
    osd: 37 osds: 35 up, 30 in; 230 remapped pgs
         flags noscrub,nodeep-scrub

  data:
    pools:   14 pools, 4352 pgs
    objects: 679.00k objects, 2.5 TiB
    usage:   5.1 TiB used, 4.1 TiB / 9.1 TiB avail
    pgs:     85.777% pgs unknown
             8.042% pgs not active
             48877/1358004 objects degraded (3.599%)
             8181/1358004 objects misplaced (0.602%)
             3733 unknown
             268  peering
             226  active+clean
             31   down
             23   activating
             23   active+undersized
             18   remapped+peering
             16   active+undersized+degraded
             6    undersized+degraded+peered
             2    active+undersized+degraded+remapped+backfill_wait
             2    stale+remapped+peering
             1    creating+peering
             1    active+remapped+backfill_wait
             1    activating+remapped
Some OSDs are shown as "down", but those were marked down intentionally a while ago and are meant to be replaced; since then there had been no issues with the cluster.
The cluster status does not change at all; it looks like everything is stuck in peering. PGs marked as unknown switch to peering after the MGR starts, but after reaching roughly 200 active PGs nothing progresses and everything hangs.
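For anyone wanting to reproduce where it gets stuck, a minimal set of checks that (as far as we know) does not depend on the MGR would be the following; the PG id is only an example from pool 15, and `ceph pg query` is answered by the PG's primary OSD, so it may hang as well:
Code:
# Mon-side views that should not require an active MGR
ceph health detail
ceph osd tree
# Per-PG query (goes to the primary OSD, so it may also hang);
# 15.1b is only an example PG id from pool 15
ceph pg 15.1b query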
This is a snippet from one of the OSDs with extended debug verbosity (most of the OSDs report the same):
Code:
2025-03-30 21:00:20.922 7fd71b27a700 1 --1- [v2:192.168.17.125:6806/1759437,v1:192.168.17.125:6807/1759437] >> conn(0x5566ad690000 0x5566adb65000 :6807 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner read peer banner and addr failed
2025-03-30 21:00:20.922 7fd71a278700 1 -- [v2:192.168.17.125:6804/1759437,v1:192.168.17.125:6805/1759437] >> conn(0x5566adacb200 legacy=0x5566add50000 unknown :6805 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 135
2025-03-30 21:00:20.922 7fd71a278700 1 -- [v2:192.168.17.125:6804/1759437,v1:192.168.17.125:6805/1759437] >> conn(0x5566adacb200 legacy=0x5566add50000 unknown :6805 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2025-03-30 21:00:20.922 7fd71a278700 1 --1- [v2:192.168.17.125:6804/1759437,v1:192.168.17.125:6805/1759437] >> conn(0x5566adacb200 0x5566add50000 :6805 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner read peer banner and addr failed
2025-03-30 21:00:20.922 7fd71a278700 1 -- [v2:192.168.17.125:6804/1759437,v1:192.168.17.125:6805/1759437] reap_dead start
2025-03-30 21:00:20.922 7fd71b27a700 1 --1- [v2:192.168.17.125:6804/1759437,v1:192.168.17.125:6805/1759437] >> conn(0x5566ab5fe880 0x5566a707b000 :6805 s=ACCEPTING pgs=0 cs=0 l=0).send_server_banner sd=85 legacy v1:192.168.17.125:6805/1759437 socket_addr v1:192.168.17.125:6805/1759437 target_addr v1:192.168.17.119:39209/0
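Since the banner failures above involve another node on the Ceph network (192.168.17.119), one thing worth ruling out on our side is an MTU or connectivity problem left over from the switch outage. A minimal check of that kind would look like this (interface name and payload size are placeholders for our setup):
Code:
# Interface name is a placeholder; check the MTU on the Ceph cluster network
ip link show ens18 | grep -o 'mtu [0-9]*'
# Ping with "don't fragment" at the largest payload the MTU allows
# (1472 for a 1500-byte MTU, 8972 for 9000-byte jumbo frames)
ping -M do -s 1472 -c 3 192.168.17.119
ping -M do -s 1472 -c 3 192.168.17.125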
This is a snippet from one of the OSDs without extended debugging. The OSD practically floods the log with entries like this:
Code:
2025-03-30 21:06:20.957 7f0e27f85700 0 log_channel(cluster) log [WRN] : slow request osd_pg_create(e128618 15.1b:115413 15.31:115413 15.40:115413 15.6c:115413 15.6e:115413 15.71:115413 15.75:115413 15.7a:115413 15.84:115413 15.9f:115413 15.a5:115413 15.c9:115413 15.d4:115413 15.da:115413 15.fc:115413) initiated 2025-03-30 20:22:38.487018 currently started
2025-03-30 21:06:20.957 7f0e27f85700 -1 osd.36 128650 get_health_metrics reporting 1 slow ops, oldest is osd_pg_create(e128618 15.1b:115413 15.31:115413 15.40:115413 15.6c:115413 15.6e:115413 15.71:115413 15.75:115413 15.7a:115413 15.84:115413 15.9f:115413 15.a5:115413 15.c9:115413 15.d4:115413 15.da:115413 15.fc:115413)
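If more detail is needed, the blocked ops can also be dumped directly from the OSD's admin socket on its host, which bypasses the MON/MGR path entirely (example for osd.36; this assumes the admin socket still responds in our state):
Code:
# Run on the node hosting osd.36; assumes the admin socket still responds
ceph daemon osd.36 dump_blocked_ops
ceph daemon osd.36 dump_ops_in_flight
ceph daemon osd.36 status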
Running `ceph pg dump` just hangs. Tracing the command with `strace` shows only timeouts.
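For reference, the invocation was roughly the following (exact strace filters reconstructed from memory, so treat it as approximate):
Code:
# Wrapped in timeout so the shell gets control back; the command itself never completes
timeout 60 strace -f -tt -e trace=network ceph pg dump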
Has anyone encountered anything similar before? Is there still a chance to recover the data in this state?