[SOLVED] 3 Node Proxmox and Ceph Storage, Power Outage

cbug

New Member
Jul 3, 2025
Hi,
thanks to yesterday's high temperatures, all nodes of my little Proxmox cluster were forced to shut down due to heat.
After rebooting, all Ceph OSDs are stuck flipping between down and peering.

Now I am looking for documentation on how to fix the Ceph storage.

Code:
# ceph -s
  cluster:
    id:     ed66bf94-647e-4f5d-9ebc-e4dae28c49a7
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            Reduced data availability: 129 pgs inactive, 129 pgs peering
            11 slow ops, oldest one blocked for 46 sec, daemons [osd.0,osd.7,mon.pve1] have slow ops.

  services:
    mon: 3 daemons, quorum pve1,pve2,pve3 (age 2h)
    mgr: pve1(active, since 2h), standbys: pve3, pve2
    osd: 9 osds: 9 up (since 47s), 9 in (since 7h)
         flags noout,norebalance

  data:
    pools:   2 pools, 129 pgs
    objects: 194.92k objects, 740 GiB
    usage:   2.8 TiB used, 5.9 TiB / 8.7 TiB avail
    pgs:     100.000% pgs not active
             129 peering
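
In case it helps: this is roughly how I have been digging into individual PGs, alongside the health detail below (pg 1.0 is just the first entry from that list; these are standard Ceph commands, nothing specific to my setup):

Code:
# ask one stuck PG why it is not peering; the "recovery_state" section
# at the end usually names the OSDs it is waiting for
ceph pg 1.0 query

# list OSDs that are reported as blocking peering
ceph osd blocked-by

# double-check that all OSDs are up and placed on the expected hosts
ceph osd tree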

Code:
# ceph health detail
HEALTH_WARN noout,norebalance flag(s) set; Reduced data availability: 129 pgs inactive, 129 pgs peering; 56 slow ops, oldest one blocked for 121 sec, daemons [osd.0,osd.7,mon.pve1] have slow ops.
[WRN] OSDMAP_FLAGS: noout,norebalance flag(s) set
[WRN] PG_AVAILABILITY: Reduced data availability: 129 pgs inactive, 129 pgs peering
    pg 1.0 is stuck peering for 11h, current state peering, last acting [7,5,3]
    pg 4.0 is stuck peering for 8h, current state peering, last acting [3,7,8]
    pg 4.1 is stuck peering for 12h, current state peering, last acting [1,2,0]
    pg 4.2 is stuck peering for 12h, current state peering, last acting [1,4,6]
    pg 4.3 is stuck peering for 12h, current state peering, last acting [1,7,6]
    pg 4.4 is stuck peering for 8h, current state peering, last acting [6,8,7]
    pg 4.5 is stuck peering for 10h, current state peering, last acting [7,1,0]
    pg 4.6 is stuck peering for 12h, current state peering, last acting [5,3,7]
    pg 4.7 is stuck peering for 8h, current state peering, last acting [6,5,7]
    pg 4.8 is stuck peering for 12h, current state peering, last acting [5,7,3]
    pg 4.9 is stuck peering for 12h, current state peering, last acting [1,0,2]
    pg 4.a is stuck peering for 10h, current state peering, last acting [2,6,1]
    pg 4.b is stuck peering for 8h, current state peering, last acting [0,4,5]
    pg 4.c is stuck peering for 10h, current state peering, last acting [4,3,1]
    pg 4.d is stuck inactive for 8h, current state peering, last acting [4,1,3]
    pg 4.19 is stuck peering for 12h, current state peering, last acting [1,3,2]
    pg 4.1a is stuck peering for 10h, current state peering, last acting [4,6,5]
    pg 4.1b is stuck peering for 10h, current state peering, last acting [4,3,5]
    pg 4.1c is stuck peering for 12h, current state peering, last acting [1,6,4]
    pg 4.1d is stuck peering for 12h, current state peering, last acting [5,7,3]
    pg 4.1e is stuck peering for 8h, current state peering, last acting [0,7,5]
    pg 4.1f is stuck peering for 12h, current state peering, last acting [1,4,3]
    pg 4.20 is stuck peering for 8h, current state peering, last acting [3,7,8]
    pg 4.21 is stuck peering for 12h, current state peering, last acting [1,2,0]
    pg 4.22 is stuck peering for 12h, current state peering, last acting [1,4,6]
    pg 4.23 is stuck peering for 12h, current state peering, last acting [1,7,6]
    pg 4.24 is stuck peering for 8h, current state peering, last acting [6,8,7]
    pg 4.25 is stuck peering for 10h, current state peering, last acting [7,1,0]
    pg 4.26 is stuck peering for 12h, current state peering, last acting [5,3,7]
    pg 4.27 is stuck peering for 8h, current state peering, last acting [6,5,7]
    pg 4.28 is stuck peering for 12h, current state peering, last acting [5,7,3]
    pg 4.29 is stuck peering for 12h, current state peering, last acting [1,0,2]
    pg 4.2a is stuck peering for 10h, current state peering, last acting [2,6,1]
    pg 4.2b is stuck peering for 8h, current state peering, last acting [0,4,5]
    pg 4.2c is stuck peering for 10h, current state peering, last acting [4,3,1]
    pg 4.2d is stuck peering for 10h, current state peering, last acting [4,1,3]
    pg 4.2e is stuck peering for 10h, current state peering, last acting [4,5,0]
    pg 4.2f is stuck peering for 8h, current state peering, last acting [3,8,4]
    pg 4.30 is stuck peering for 8h, current state peering, last acting [3,8,2]
    pg 4.31 is stuck peering for 8h, current state peering, last acting [3,7,8]
    pg 4.32 is stuck peering for 8h, current state peering, last acting [0,1,7]
    pg 4.33 is stuck peering for 10h, current state peering, last acting [4,8,3]
    pg 4.34 is stuck peering since forever, current state peering, last acting [5,0,7]
    pg 4.35 is stuck peering for 12h, current state peering, last acting [5,3,7]
    pg 4.36 is stuck peering for 8h, current state peering, last acting [6,7,5]
    pg 4.37 is stuck peering for 8h, current state peering, last acting [3,7,5]
    pg 4.38 is stuck peering for 10h, current state peering, last acting [2,6,8]
    pg 4.39 is stuck peering for 12h, current state peering, last acting [1,3,2]
    pg 4.7b is stuck peering for 10h, current state peering, last acting [4,3,5]
    pg 4.7e is stuck peering for 8h, current state peering, last acting [0,7,5]
    pg 4.7f is stuck peering for 12h, current state peering, last acting [1,4,3]
[WRN] SLOW_OPS: 56 slow ops, oldest one blocked for 121 sec, daemons [osd.0,osd.7,mon.pve1] have slow ops.


Code:
cat /var/log/ceph/ceph-osd.1.log
2025-07-03T02:18:36.759+0200 77144b5cd6c0  1 osd.1 10262 is_healthy false -- only 0/4 up peers (less than 33%)
2025-07-03T02:18:36.759+0200 77144b5cd6c0  1 osd.1 10262 not healthy; waiting to boot

Is there a way to bring one of each OSD's peers back to active and use it as a new "master" OSD to restart replication?
 
A few hours of sleep later, I was able to fix it.

It turned out that I was able to ping all cluster_network IPs (and falsely relied on that), but connecting via SSH failed.
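
For anyone hitting the same thing, a quick way to get more signal than a plain ping or a silently failing SSH login is a verbose SSH attempt (the IP below is a masked placeholder for another node's cluster_network address); hanging right after the key exchange starts is a classic symptom of an MTU black hole, since those are the first packets too large to fit:

Code:
# ping succeeds because its packets are tiny; ssh -v shows where a real
# TCP session gets stuck (hanging after "SSH2_MSG_KEXINIT sent" points
# towards an MTU/jumbo-frame problem rather than a routing one)
ssh -v root@X.X.50.12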

I am running the cluster_network on a separate physical network interface.
(Cluster X.X.50.0/24 on SFP+, corosync X.X.40.0 on RJ45)
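
For reference, the cluster_network side of that split is just the standard setting in ceph.conf (on Proxmox: /etc/pve/ceph.conf); a minimal excerpt with the subnet masked the same way:

Code:
# /etc/pve/ceph.conf (excerpt)
[global]
    # OSD replication/recovery traffic over the dedicated SFP+ network
    cluster_network = X.X.50.0/24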

Digging a bit deeper, I noticed the SFP+ link had an MTU of 9000; setting it to 1500 brought the cluster back to HEALTH_OK almost instantly, and for the past hour it has been remapping and backfilling.
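
A rough sketch of the check and the workaround, with enp5s0 standing in for the actual SFP+ NIC name and X.X.50.12 for another node's cluster IP (the permanent setting belongs in /etc/network/interfaces on each node):

Code:
# current MTU on the SFP+ interface
ip link show enp5s0

# test whether 9000-byte frames actually survive the switch:
# 8972 bytes payload + 8 ICMP + 20 IP header = 9000, fragmentation forbidden
ping -M do -s 8972 -c 3 X.X.50.12

# drop back to 1500 right away (takes effect immediately, lost on reboot)
ip link set enp5s0 mtu 1500

# to make it stick, set "mtu 1500" on the interface/bridge
# in /etc/network/interfaces and reload with ifupdown2:
ifreload -a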

The power outage also rebooted the switch the servers are connected to; I currently suspect some configuration changes made on it in the past that only took effect with this reboot. The switch had an uptime of 6 years before yesterday's outage.

After 3 hours of remapping and scrubbing, the cluster is now active+clean again.
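
One follow-up in case someone lands here with the same health output: the noout and norebalance flags do not clear on their own. If they are still set once recovery has finished, they can be removed with:

Code:
ceph osd unset noout
ceph osd unset norebalance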