Ceph - power outage and recovery

Anddevvsss · Mar 30, 2025

Hello all!

Recently we have experienced a power outage and loss network connectivity (Junpier switch that was used by Ceph cluster). Some proxmox/ceph nodes have been restarted as well. Network traffic and nodes have been restored but our cluster is in critical condition.

On the monitors we are able to see the PG list but the cluster has practically suspended and does not change status.

This is the health status:

Code:

ceph -s
  cluster:
    id:     xxxx
    health: HEALTH_ERR
            noscrub,nodeep-scrub flag(s) set
            3 nearfull osd(s)
            3 pool(s) nearfull
            no active mgr
            BlueFS spillover detected on 2 OSD(s)
            Reduced data availability: 4083 pgs inactive, 31 pgs down, 289 pgs peering, 2 pgs stale
            Degraded data redundancy: 48877/1358004 objects degraded (3.599%), 25 pgs degraded, 48 pgs undersized
            242 pgs not deep-scrubbed in time
            1 pgs not scrubbed in time
            2 daemons have recently crashed
            18 slow requests are blocked > 32 sec
            6 stuck requests are blocked > 4096 sec
            24 slow ops, oldest one blocked for 47745 sec, daemons [osd.12,osd.13] have slow ops.
            mons are allowing insecure global_id reclaim
 
  services:
    mon: 3 daemons, quorum mon3,mon4,mon5 (age 2h)
    mgr: no daemons active (since 5h)
    osd: 37 osds: 35 up, 30 in; 230 remapped pgs
         flags noscrub,nodeep-scrub
 
  data:
    pools:   14 pools, 4352 pgs
    objects: 679.00k objects, 2.5 TiB
    usage:   5.1 TiB used, 4.1 TiB / 9.1 TiB avail
    pgs:     85.777% pgs unknown
             8.042% pgs not active
             48877/1358004 objects degraded (3.599%)
             8181/1358004 objects misplaced (0.602%)
             3733 unknown
             268  peering
             226  active+clean
             31   down
             23   activating
             23   active+undersized
             18   remapped+peering
             16   active+undersized+degraded
             6    undersized+degraded+peered
             2    active+undersized+degraded+remapped+backfill_wait
             2    stale+remapped+peering
             1    creating+peering
             1    active+remapped+backfill_wait
             1    active+undersized+degraded+remapped+backfilling
             1    activating+remapped

some osds there are shown as "down" but these were marked as down intentionally a while ago and intendent to be replaced - after that there weren't any issue related to the cluster.

Cluster status doesn't change at all, looks like everything is stuck in peering. PGs marked as unknown change to peering after MGR starts but after reaching about ~200 active PGs nothing changes and hangs.

This is the snipped from one of the OSD with extended debug verbosity (most of OSD returns the same)

Code:

2025-03-30 21:00:20.922 7fd71b27a700  1 --1- [v2:192.168.17.125:6806/1759437,v1:192.168.17.125:6807/1759437] >>  conn(0x5566ad690000 0x5566adb65000 :6807 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner read peer banner and addr failed
2025-03-30 21:00:20.922 7fd71a278700  1 -- [v2:192.168.17.125:6804/1759437,v1:192.168.17.125:6805/1759437] >>  conn(0x5566adacb200 legacy=0x5566add50000 unknown :6805 s=STATE_CONNECTION_ESTABLISHED l=0).read_bulk peer close file descriptor 135
2025-03-30 21:00:20.922 7fd71a278700  1 -- [v2:192.168.17.125:6804/1759437,v1:192.168.17.125:6805/1759437] >>  conn(0x5566adacb200 legacy=0x5566add50000 unknown :6805 s=STATE_CONNECTION_ESTABLISHED l=0).read_until read failed
2025-03-30 21:00:20.922 7fd71a278700  1 --1- [v2:192.168.17.125:6804/1759437,v1:192.168.17.125:6805/1759437] >>  conn(0x5566adacb200 0x5566add50000 :6805 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner read peer banner and addr failed
2025-03-30 21:00:20.922 7fd71a278700  1 -- [v2:192.168.17.125:6804/1759437,v1:192.168.17.125:6805/1759437] reap_dead start
2025-03-30 21:00:20.922 7fd71b27a700  1 --1- [v2:192.168.17.125:6804/1759437,v1:192.168.17.125:6805/1759437] >>  conn(0x5566ab5fe880 0x5566a707b000 :6805 s=ACCEPTING pgs=0 cs=0 l=0).send_server_banner sd=85 legacy v1:192.168.17.125:6805/1759437 socket_addr v1:192.168.17.125:6805/1759437 target_addr v1:192.168.17.119:39209/0

This is the snipped from one of the OSD without extended debug. Osd practically floods the logs with entries like this:

Code:

2025-03-30 21:06:20.957 7f0e27f85700  0 log_channel(cluster) log [WRN] : slow request osd_pg_create(e128618 15.1b:115413 15.31:115413 15.40:115413 15.6c:115413 15.6e:115413 15.71:115413 15.75:115413 15.7a:115413 15.84:115413 15.9f:115413 15.a5:115413 15.c9:115413 15.d4:115413 15.da:115413 15.fc:115413) initiated 2025-03-30 20:22:38.487018 currently started
2025-03-30 21:06:20.957 7f0e27f85700 -1 osd.36 128650 get_health_metrics reporting 1 slow ops, oldest is osd_pg_create(e128618 15.1b:115413 15.31:115413 15.40:115413 15.6c:115413 15.6e:115413 15.71:115413 15.75:115413 15.7a:115413 15.84:115413 15.9f:115413 15.a5:115413 15.c9:115413 15.d4:115413 15.da:115413 15.fc:115413)

Trying to run `ceph pg dump` causes a hang. Checking the command with `strace` we can see only timeouts.

Has anyone encountered anything similar before? Is there still a chance to recover the data in this state?

alexskysilk · Mar 30, 2025

1. why is your mgr down?
2. post the content of ceph osd tree and ceph osd df (if you even can)
3. would be good to know crush rules for all 14(!!) of your pools. (why do you have 14 pools?!)

Generally speaking, it looks like you dont have enough OSDs for all your pgs. what you provide will provide insight how to recover (if possible.)

Anddevvsss · Mar 31, 2025

alexskysilk said:
1. why is your mgr down?
2. post the content of ceph osd tree and ceph osd df (if you even can)
3. would be good to know crush rules for all 14(!!) of your pools. (why do you have 14 pools?!)

Generally speaking, it looks like you dont have enough OSDs for all your pgs. what you provide will provide insight how to recover (if possible.)

1. It restarts itself after some time because of high memory usage - OOM kills it. Already started

2. osd tree

Code:

ceph osd tree
ID  CLASS WEIGHT   TYPE NAME        STATUS REWEIGHT PRI-AFF
 -6        8.09579 root ssd                               
-31        3.15976     host NODE-21                       
 20   ssd  0.43660         osd.20       up  0.95000 1.00000
 21   ssd  0.43660         osd.21       up  1.00000 1.00000
 35   ssd  0.40999         osd.35       up  1.00000 1.00000
 36   ssd  0.43660         osd.36       up  1.00000 1.00000
 37   ssd  0.39999         osd.37       up  1.00000 1.00000
 38   ssd  0.37000         osd.38       up  1.00000 1.00000
 39   ssd  0.34999         osd.39       up  1.00000 1.00000
 40   ssd  0.31999         osd.40       up  1.00000 1.00000
-13        2.06638     host NODE-22                       
 27   ssd  0.31998         osd.27       up  0.79999 1.00000
 28   ssd  0.43660         osd.28       up  1.00000 1.00000
 29   ssd  0.43660         osd.29       up  1.00000 1.00000
 41   ssd  0.43660         osd.41       up  0.79999 1.00000
 42   ssd  0.43660         osd.42       up  1.00000 1.00000
-17        2.86966     host NODE-23                       
 43   sdd  0.39000         osd.43       up        0 1.00000
 44   sdd  0.43660         osd.44       up        0 1.00000
 15   ssd  0.40999         osd.15     down        0 1.00000
 24   ssd  0.43649         osd.24     down        0 1.00000
 26   ssd  0.40999         osd.26       up        0 1.00000
 30   ssd  0.34999         osd.30       up        0 1.00000
 31   ssd  0.43660         osd.31       up        0 1.00000
 -1       61.20140 root default                           
 -7              0     host NODE-05                       
-10              0     host NODE-08                       
 -2              0     host NODE-11                       
-11        8.43958     host NODE-14                       
  8   hdd  2.39999         osd.8        up  1.00000 1.00000
 12   hdd  2.39999         osd.12       up  1.00000 1.00000
 13   hdd  3.63959         osd.13       up  1.00000 1.00000
-14        7.27917     host NODE-15                       
 18   hdd  3.63959         osd.18       up  0.95000 1.00000
 19   hdd  3.63959         osd.19       up  1.00000 1.00000
 -3       12.73447     host NODE-17                       
  2   hdd  5.45799         osd.2        up  1.00000 1.00000
  3   hdd  3.63689         osd.3        up  1.00000 1.00000
  9   hdd  3.63959         osd.9        up  1.00000 1.00000
 -4       10.91606     host NODE-18                       
 45   hdd  3.63869         osd.45       up  1.00000 1.00000
 46   hdd  3.63869         osd.46       up  1.00000 1.00000
 47   hdd  3.63869         osd.47       up  1.00000 1.00000
 -5       10.91606     host NODE-19                       
  6   hdd  3.63869         osd.6        up  1.00000 1.00000
  7   hdd  3.63869         osd.7        up  1.00000 1.00000
 11   hdd  3.63869         osd.11       up  1.00000 1.00000
-12        7.27737     host NODE-24                       
 34   hdd  3.63869         osd.34       up  1.00000 1.00000
 51   hdd  3.63869         osd.51       up  1.00000 1.00000
-43        3.63869     host NODE-25                       
  0   hdd  3.63869         osd.0        up  1.00000 1.00000

osd df (I'm able to run it for a short time after MGR restart). I'm noticed that osd 43, 44, 26, 30, 31 don't report their usage to MGR for some reason.

Code:

ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE    RAW USE DATA    OMAP    META     AVAIL   %USE  VAR  PGS STATUS
20   ssd 0.43660  0.95000 450 GiB 291 GiB 287 GiB  39 MiB  985 MiB 159 GiB 64.70 1.22  88     up
21   ssd 0.43660  1.00000 450 GiB 282 GiB 278 GiB  37 MiB  987 MiB 168 GiB 62.64 1.18 115     up
35   ssd 0.40999  1.00000 450 GiB 282 GiB 278 GiB  23 MiB 1001 MiB 168 GiB 62.68 1.18  96     up
36   ssd 0.43660  1.00000 450 GiB 227 GiB 223 GiB  27 MiB  997 MiB 224 GiB 50.33 0.95  84     up
37   ssd 0.39999  1.00000 447 GiB 293 GiB 292 GiB  35 MiB  989 MiB 154 GiB 65.50 1.23  89     up
38   ssd 0.37000  1.00000 447 GiB 231 GiB 230 GiB  34 MiB  990 MiB 216 GiB 51.73 0.97  82     up
39   ssd 0.34999  1.00000 447 GiB 275 GiB 274 GiB  29 MiB  995 MiB 172 GiB 61.50 1.16  90     up
40   ssd 0.31999  1.00000 447 GiB 162 GiB 161 GiB  20 MiB 1004 MiB 286 GiB 36.13 0.68  88     up
27   ssd 0.31998  0.79999 447 GiB 403 GiB 402 GiB  44 MiB  980 MiB  44 GiB 90.06 1.69  40     up
28   ssd 0.43660  1.00000 447 GiB 375 GiB 374 GiB  24 MiB 1000 MiB  72 GiB 83.95 1.58  64     up
29   ssd 0.43660  1.00000 447 GiB 341 GiB 340 GiB  26 MiB  998 MiB 106 GiB 76.19 1.43  58     up
41   ssd 0.43660  0.79999 447 GiB 393 GiB 392 GiB  25 MiB  999 MiB  54 GiB 87.99 1.65  55     up
42   ssd 0.43660  1.00000 447 GiB 278 GiB 277 GiB  18 MiB 1006 MiB 169 GiB 62.11 1.17  62     up
43   sdd 0.39000        0     0 B     0 B     0 B     0 B      0 B     0 B     0    0  16     up
44   sdd 0.43660        0     0 B     0 B     0 B     0 B      0 B     0 B     0    0   8     up
15   ssd 0.40999        0     0 B     0 B     0 B     0 B      0 B     0 B     0    0   0   down
24   ssd 0.43649        0     0 B     0 B     0 B     0 B      0 B     0 B     0    0   0   down
26   ssd 0.40999        0     0 B     0 B     0 B     0 B      0 B     0 B     0    0   3     up
30   ssd 0.34999        0     0 B     0 B     0 B     0 B      0 B     0 B     0    0   4     up
31   ssd 0.43660        0     0 B     0 B     0 B     0 B      0 B     0 B     0    0   5     up
 8   hdd 2.39999  1.00000 1.8 TiB 1.1 TiB 1.1 TiB  72 MiB  1.5 GiB 762 GiB 59.14 1.11 254     up
12   hdd 2.39999  1.00000 1.8 TiB 1.3 TiB 1.3 TiB  60 MiB  1.8 GiB 559 GiB 69.99 1.32 283     up
13   hdd 3.63959  1.00000 3.6 TiB 2.0 TiB 2.0 TiB 185 MiB  3.7 GiB 1.7 TiB 53.71 1.01 443     up
18   hdd 3.63959  0.95000 3.6 TiB 2.0 TiB 2.0 TiB 146 MiB  4.1 GiB 1.7 TiB 54.41 1.02 343     up
19   hdd 3.63959  1.00000 3.6 TiB 1.9 TiB 1.9 TiB 117 MiB  4.0 GiB 1.8 TiB 51.26 0.96 337     up
 2   hdd 5.45799  1.00000 5.5 TiB 2.4 TiB 2.4 TiB 133 MiB  4.4 GiB 3.0 TiB 44.44 0.84 346     up
 3   hdd 3.63689  1.00000 3.6 TiB 1.9 TiB 1.8 TiB  67 MiB      0 B 1.8 TiB 51.61 0.97 389     up
 9   hdd 3.63959  1.00000 3.6 TiB 2.0 TiB 2.0 TiB 116 MiB  3.6 GiB 1.7 TiB 54.50 1.02 301     up
45   hdd 3.63869  1.00000 3.7 TiB 2.0 TiB 2.0 TiB 112 MiB  3.9 GiB 1.6 TiB 55.43 1.04 361     up
46   hdd 3.63869  1.00000 3.7 TiB 2.0 TiB 1.9 TiB  82 MiB  3.9 GiB 1.7 TiB 54.01 1.02 297     up
47   hdd 3.63869  1.00000 3.7 TiB 1.9 TiB 1.8 TiB  88 MiB  3.7 GiB 1.8 TiB 50.49 0.95 388     up
 6   hdd 3.63869  1.00000 3.7 TiB 2.0 TiB 2.0 TiB 149 MiB  3.6 GiB 1.7 TiB 54.80 1.03 289     up
 7   hdd 3.63869  1.00000 3.7 TiB 1.9 TiB 1.8 TiB 104 MiB  2.5 GiB 1.8 TiB 51.77 0.97 398     up
11   hdd 3.63869  1.00000 3.7 TiB 2.0 TiB 1.9 TiB 107 MiB  2.6 GiB 1.8 TiB 52.67 0.99 304     up
34   hdd 3.63869  1.00000 3.6 TiB 1.6 TiB 1.6 TiB 134 MiB  2.4 GiB 2.0 TiB 44.79 0.84 370     up
51   hdd 3.63869  1.00000 3.6 TiB 1.8 TiB 1.8 TiB 165 MiB  3.6 GiB 1.9 TiB 48.58 0.91 371     up
 0   hdd 3.63869  1.00000 3.7 TiB 1.8 TiB 1.8 TiB 110 MiB  2.5 GiB 1.9 TiB 49.16 0.92 346     up
                    TOTAL  66 TiB  35 TiB  35 TiB 2.3 GiB   65 GiB  31 TiB 53.21               
MIN/MAX VAR: 0.68/1.69  STDDEV: 12.97

3. We used to have a small tenants there and we wanted to separate them at pools level.

Code:

ceph osd pool ls detail
pool 3 'ssd_pool' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 129502 flags hashpspool,nearfull,selfmanaged_snaps stripe_width 0 application rbd
        removed_snaps [1~21e,221~1c,23e~b,24a~1c,268~16,287~1,289~1,28b~2]
pool 4 'ceph_data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode warn last_change 115444 lfor 0/0/14476 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
        removed_snaps [1~e,10~2,13~1d,35~8,3f~8,48~2,4b~20,6e~32,a2~10,b4~8,bd~1,bf~6,c6~7,cf~1,d2~3,d6~6,e0~2,e4~3,e8~12,10f~6,118~7,127~2,12b~1,136~1,138~2,140~1,142~3]
pool 5 '.rgw.root' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 102311 flags hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.control' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 102311 flags hashpspool stripe_width 0 application rgw
pool 7 'default.rgw.meta' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 102311 flags hashpspool stripe_width 0 application rgw
pool 8 'default.rgw.log' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 102311 flags hashpspool stripe_width 0 application rgw
pool 9 'rgw_pool' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 102311 flags hashpspool max_bytes 2000000000000 stripe_width 0 application rgw
pool 10 'default.rgw.buckets.index' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 102311 flags hashpspool stripe_width 0 application rgw
pool 11 'default.rgw.buckets.data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 102311 flags hashpspool stripe_width 0 application rgw
pool 12 'default.rgw.buckets.non-ec' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 102311 flags hashpspool stripe_width 0 application rgw
pool 13 'rbd_images_pool' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 102311 flags hashpspool,selfmanaged_snaps max_bytes 2000000000000 stripe_width 0 application rbd
        removed_snaps [1~8,a~3]
pool 14 'ssd_logicsystems' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 129502 flags hashpspool,nearfull,selfmanaged_snaps max_bytes 1400000000000 stripe_width 0 application rbd
        removed_snaps [1~11]
pool 15 'ssd_ntenant' replicated size 2 min_size 2 crush_rule 1 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 129502 flags hashpspool,nearfull,selfmanaged_snaps stripe_width 0 application rbd
        removed_snaps [1~3]
pool 16 'sata_ntenant' replicated size 2 min_size 2 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode warn last_change 115440 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
        removed_snaps [1~3]

And another thing we noticed in the mon logs which keeps showing up:

Code:

2025-03-31 16:51:37.876 7fc885196700  1 --2- [v2:192.168.17.54:3300/0,v1:192.168.17.54:6789/0] >> 192.168.17.125:0/1373804076 conn(0x561077fcc400 0x56104bd64300 crc :-1 s=READY pgs=6 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)

alexskysilk · Mar 31, 2025

Anddevvsss said:
It restarts itself after some time because of high memory usage - OOM kills it. Already started

Fix that problem first. Why are you running out of memory?

alexskysilk · Mar 31, 2025

Anddevvsss said:
We used to have a small tenants there and we wanted to separate them at pools level.

looking at your layout... you are BRAVE. I wouldnt go to production with such a lopsided deployment, and without any room to self heal.

Anddevvsss said:
replicated size 2 min_size 1

brave is a... diplomatic word.

Anddevvsss · Mar 31, 2025

alexskysilk said:
Fix that problem first. Why are you running out of memory?

okay, I will try to figure out why MGR is being killed after some period of time. For now MGR bash been up for couple of hours.

I expanded the log details and looked at the monitor and individual OSD logs and I still see this type of information:

Code:

2025-03-31 19:45:05.913 7f17e717b700 1 --2- [v2:192.168.17.53:3300/0,v1:192.168.17.53:6789/0] >> 192.168.17.124:0/1253958478 conn(0x5634ed474400 0x5634f04fee00 crc :-1 s=READY pgs=1 cs=0 l=1 rev1=1 rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)

2025-03-31 19:45:05.933 7f17e717b700 1 --1- [v2:192.168.17.53:3300/0,v1:192.168.17.53:6789/0] >> v1:192.168.17.124:0/2092370780 conn(0x5634e7e68480 0x5634e8f60800 :6789 s=OPENED pgs=1 cs=1 l=1).handle_message read tag failed
2025-03-31 19:45:05.933 7f17e717b700 1 --1- [v2:192.168.17.53:3300/0,v1:192.168.17.53:6789/0] >> v1:192.168.17.124:0/2092370780 conn(0x5634e7e68480 0x5634e8f60800 :6789 s=OPENED pgs=1 cs=1 l=1).fault on lossy channel, failing

in OSD logs:

Code:

2025-03-31 19:50:35.187 7fd891232700  1 --1- [v2:192.168.17.125:6806/3119803,v1:192.168.17.125:6807/3119803] >>  conn(0x55d2e692a000 0x55d2d20a2000 :6807 s=ACCEPTING pgs=0 cs=0 l=0).handle_client_banner read peer banner and addr failed
2025-03-31 19:50:35.187 7fd890a31700  1 --1- 192.168.17.125:0/3119803 >> v1:192.168.17.114:6813/400744 conn(0x55d2b9c24400 0x55d2ba49d800 :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).fault on lossy channel, failing
2025-03-31 19:50:35.187 7fd8801dd700 10 osd.0 130204 heartbeat_reset failed hb con 0x55d2b9c24400 for osd.8, reopening
2025-03-31 19:50:35.187 7fd8801dd700  1 -- 192.168.17.125:0/3119803 >> v1:192.168.17.114:6815/400744 conn(0x55d2e1a52880 legacy=0x55d2bb832800 unknown :-1 s=STATE_CONNECTING l=1).mark_down
2025-03-31 19:50:35.187 7fd890a31700  1 --1- 192.168.17.125:0/3119803 >> v1:192.168.17.114:6813/400744 conn(0x55d2b9c25a80 0x55d2e7471800 :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).fault on lossy channel, failing

looks like a communication problem between OSDs and monitors but I don't know where to look for the cause

At this point ceph osd df hangs.

alexskysilk · Mar 31, 2025

at this point it may be worthwhile to see how your network is set up.

Do you want to post the content of your /etc/network/interfaces for your nodes, and describe how they are physically interconnected?

Anddevvsss · Apr 1, 2025

alexskysilk said:
at this point it may be worthwhile to see how your network is set up.

Do you want to post the content of your /etc/network/interfaces for your nodes, and describe how they are physically interconnected?

thank you for reply @alexskysilk

/etc/network/interfaces below:

Code:

auto lo
iface lo inet loopback

auto eno1
iface eno1 inet manual

auto ens1f0
iface ens1f0 inet manual

auto ens1f1
iface ens1f1 inet manual

auto eno2
iface eno2 inet manual

auto bond0
iface bond0 inet manual
        bond-slaves eno1 eno2
        bond-miimon 100
        bond-mode 802.3ad
        bond-lacp-rate slow

auto bond0.16
iface bond0.16 inet manual
        vlan-raw-device bond0



auto bond0.17
iface bond0.17 inet manual
        vlan-raw-device bond0


auto bond0.28
iface bond0.28 inet manual

auto vmbr0
iface vmbr0 inet manual
        bridge_ports bond0
        bridge_stp off
        bridge_fd 0
#TRUNK BRIDGE

auto vmbr1
iface vmbr1 inet static
        address  192.168.16.125
        netmask  255.255.255.0
        gateway  192.168.16.1
        broadcast  192.168.16.255
        bridge_ports bond0.16
        dns-nameservers 213.184.16.1
        bridge_stp off
        bridge_fd 0
# VLAN 16 PROXMOX

auto vmbr2
iface vmbr2 inet static
        address  192.168.17.125
        netmask  255.255.255.0
        broadcast  192.168.17.255
        bridge_ports bond0.17
        bridge_stp off
        bridge_fd 0
# VLAN 17 CEPH

we have two separate vlans: 10 Gbps for Ceph and 1 Gbps for proxmox. In meantime I asked our network guy for more detailed description.

alexskysilk · Apr 1, 2025

Anddevvsss said:
we have two separate vlans

Hate to break it to you, but you only have one pair of interfaces in a lagg; while I cant see what speed the underlying interfaces are connected at, they will not be different per vlan.

You are also comingling this same bond for all your disparate traffic types (ceph private, ceph public, corosync, migration, vm traffic, etc.) a backup with a fast enough target can literally kill your entire cluster.

Anddevvsss · Apr 3, 2025

alexskysilk said:
Hate to break it to you, but you only have one pair of interfaces in a lagg; while I cant see what speed the underlying interfaces are connected at, they will not be different per vlan.

You are also comingling this same bond for all your disparate traffic types (ceph private, ceph public, corosync, migration, vm traffic, etc.) a backup with a fast enough target can literally kill your entire cluster.

thanks for the support @alexskysilk

Now we know that we have a lot to improve. In the meantime we managed to partially "restore" the ceph cluster - at least most of the PGs are active + clean. But despite this we have a dozen or so PGs in the unknown and incomplete status, which probably block our cluster operation anyway because we can't start any LXC or KVM.

I would like to try to recover those marked as "incomplete" somehow. I exported them from acting OSD to check if there is data - I see a lot of them so they are definitely not empty.

This is the query output of one of incomplete PG: https://pastebin.com/Qd5U2VaT

I can see it's blocked by OSD.33 which is completely lost and we aren't able to bring it back. Do you know if it is possible to unblock this PG and use it's data even if it's incomplete?

alexskysilk · Apr 3, 2025

Anddevvsss said:
Do you know if it is possible to unblock this PG and use it's data even if it's incomplete?

Best I can tell is that the pg WAS served by osd 33 at some point, but isnt anymore. the remaining two shards dont agree, which prevents the subsystem from activating the pg.

if you havent done so already, try ceph pg repair 4.370

It may take multiple tries, and it may not do anything anyway. If pg repair doesnt revive it, the resolution would be to mark one of the remaining shards lost and force activation, but understand that there is no guarantee that the data you activate will be valid.

If not abundantly clear yet, dont run anything you care about without backups.

Search

Search

Ceph - power outage and recovery

Anddevvsss

New Member

alexskysilk

Distinguished Member

Anddevvsss

New Member

alexskysilk

Distinguished Member

alexskysilk

Distinguished Member

Anddevvsss

New Member

alexskysilk

Distinguished Member

Anddevvsss

New Member

alexskysilk

Distinguished Member

Anddevvsss

New Member

alexskysilk

Distinguished Member

We value your privacy