Proxmox 6 Ceph Nautilus not recovering, sits idle after OSD removal

Haider Jarral

Hello Experts,

I recently set up a new Proxmox 6 environment with Ceph. I removed 3 OSDs from one of the nodes, and since then Ceph has stopped recovering even though it reports degraded data redundancy. I can't figure out why it won't recover.

Code:
# pveversion
pve-manager/6.1-8/806edfe1 (running kernel: 5.3.18-2-pve)
# ceph -v
ceph version 14.2.8 (0f8245b6b446e041c5cdc40ec82d4d2263c2895b) nautilus (stable)

I have tried ceph pg repair, restarting the OSDs, and scrubbing; nothing works. Here are the relevant outputs.

Code:
# ceph -w
  cluster:
    id:     01bcf2d5-6e96-4d50-81ec-cd6bb55c500e
    health: HEALTH_WARN
            Degraded data redundancy: 440934/1611642 objects degraded (27.359%), 420 pgs degraded, 420 pgs undersized
 
  services:
    mon: 3 daemons, quorum dellr730-1,dellr730-2,hp1 (age 7h)
    mgr: dellr730-1(active, since 7h)
    osd: 14 osds: 14 up (since 5m), 14 in (since 46m); 92 remapped pgs
 
  data:
    pools:   1 pools, 512 pgs
    objects: 537.21k objects, 2.0 TiB
    usage:   4.5 TiB used, 8.2 TiB / 13 TiB avail
    pgs:     440934/1611642 objects degraded (27.359%)
             96280/1611642 objects misplaced (5.974%)
             420 active+undersized+degraded
             92  active+clean+remapped
 
  io:
    client:   3.7 MiB/s rd, 5.9 MiB/s wr, 74 op/s rd, 135 op/s wr


Code:
# ceph osd crush tree
ID  CLASS WEIGHT   TYPE NAME           
 -1       12.73599 root default       
-10        6.36800     host dellr730-1
 10   ssd  0.90999         osd.10     
 11   ssd  0.90999         osd.11     
 12   ssd  0.90999         osd.12     
 13   ssd  0.90999         osd.13     
 14   ssd  0.90999         osd.14     
 15   ssd  0.90999         osd.15     
 16   ssd  0.90999         osd.16     
 -7        6.36800     host dellr730-2
  3   ssd  0.90999         osd.3       
  4   ssd  0.90999         osd.4       
  5   ssd  0.90999         osd.5       
  6   ssd  0.90999         osd.6       
  7   ssd  0.90999         osd.7       
  8   ssd  0.90999         osd.8       
  9   ssd  0.90999         osd.9       
 -3              0     host hp1       
root@dellr730-1:~#

Code:
# ceph health detail
HEALTH_WARN Degraded data redundancy: 440934/1611642 objects degraded (27.359%), 420 pgs degraded, 420 pgs undersized
PG_DEGRADED Degraded data redundancy: 440934/1611642 objects degraded (27.359%), 420 pgs degraded, 420 pgs undersized
    pg 1.1bb is active+undersized+degraded, acting [14,6]
    pg 1.1bc is stuck undersized for 6352.642140, current state active+undersized+degraded, last acting [12,5]
    pg 1.1bd is stuck undersized for 918.274133, current state active+undersized+degraded, last acting [4,11]
    pg 1.1be is stuck undersized for 1081.936833, current state active+undersized+degraded, last acting [8,10]
    pg 1.1bf is stuck undersized for 6356.685735, current state active+undersized+degraded, last acting [7,14]
    pg 1.1c0 is stuck undersized for 918.276848, current state active+undersized+degraded, last acting [7,11]
    pg 1.1c2 is stuck undersized for 6356.699629, current state active+undersized+degraded, last acting [8,15]
    pg 1.1c3 is stuck undersized for 6360.756731, current state active+undersized+degraded, last acting [3,14]
    pg 1.1c4 is stuck undersized for 6356.702880, current state active+undersized+degraded, last acting [4,13]
    pg 1.1c5 is stuck undersized for 6356.699650, current state active+undersized+degraded, last acting [5,13]
    pg 1.1c6 is stuck undersized for 6356.698034, current state active+undersized+degraded, last acting [8,14]
    pg 1.1c8 is stuck undersized for 1081.938997, current state active+undersized+degraded, last acting [5,10]
    pg 1.1c9 is stuck undersized for 6356.702559, current state active+undersized+degraded, last acting [4,16]
    pg 1.1ca is stuck undersized for 6360.764538, current state active+undersized+degraded, last acting [13,4]
    pg 1.1cb is stuck undersized for 1081.928116, current state active+undersized+degraded, last acting [10,8]
    pg 1.1cd is stuck undersized for 6352.641605, current state active+undersized+degraded, last acting [12,6]
    pg 1.1cf is stuck undersized for 6356.699738, current state active+undersized+degraded, last acting [5,16]
    pg 1.1d1 is stuck undersized for 6356.701899, current state active+undersized+degraded, last acting [16,9]
    pg 1.1d5 is stuck undersized for 918.277097, current state active+undersized+degraded, last acting [9,11]
    pg 1.1d6 is stuck undersized for 918.263862, current state active+undersized+degraded, last acting [11,9]
    pg 1.1d7 is stuck undersized for 918.265974, current state active+undersized+degraded, last acting [11,3]
    pg 1.1d8 is stuck undersized for 918.259707, current state active+undersized+degraded, last acting [11,6]
    pg 1.1d9 is stuck undersized for 6360.765421, current state active+undersized+degraded, last acting [13,7]
    pg 1.1da is stuck undersized for 918.277137, current state active+undersized+degraded, last acting [9,11]
    pg 1.1dd is stuck undersized for 6352.625143, current state active+undersized+degraded, last acting [14,8]
    pg 1.1de is stuck undersized for 918.266185, current state active+undersized+degraded, last acting [11,4]
    pg 1.1df is stuck undersized for 6352.644790, current state active+undersized+degraded, last acting [16,9]
    pg 1.1e1 is stuck undersized for 932.058143, current state active+undersized+degraded, last acting [9,14]
    pg 1.1e2 is stuck undersized for 1081.926149, current state active+undersized+degraded, last acting [10,7]
    pg 1.1e7 is stuck undersized for 1081.935331, current state active+undersized+degraded, last acting [4,10]
    pg 1.1e8 is stuck undersized for 918.262368, current state active+undersized+degraded, last acting [11,7]
    pg 1.1e9 is stuck undersized for 6360.749248, current state active+undersized+degraded, last acting [14,5]
    pg 1.1ea is stuck undersized for 6356.700886, current state active+undersized+degraded, last acting [8,13]
    pg 1.1eb is stuck undersized for 6352.644904, current state active+undersized+degraded, last acting [6,16]
    pg 1.1ec is stuck undersized for 6360.753516, current state active+undersized+degraded, last acting [8,14]
    pg 1.1ed is stuck undersized for 6356.701374, current state active+undersized+degraded, last acting [8,12]
    pg 1.1ee is stuck undersized for 6360.755001, current state active+undersized+degraded, last acting [8,16]
    pg 1.1ef is stuck undersized for 6352.640852, current state active+undersized+degraded, last acting [12,5]
    pg 1.1f0 is stuck undersized for 6356.686830, current state active+undersized+degraded, last acting [14,5]
    pg 1.1f2 is stuck undersized for 1098.912748, current state active+undersized+degraded, last acting [13,3]
    pg 1.1f3 is stuck undersized for 6352.621163, current state active+undersized+degraded, last acting [14,9]
    pg 1.1f4 is stuck undersized for 6356.700767, current state active+undersized+degraded, last acting [16,7]
    pg 1.1f5 is stuck undersized for 6360.753729, current state active+undersized+degraded, last acting [8,12]
    pg 1.1f7 is stuck undersized for 918.273982, current state active+undersized+degraded, last acting [4,11]
    pg 1.1f9 is stuck undersized for 6356.701051, current state active+undersized+degraded, last acting [12,4]
    pg 1.1fa is stuck undersized for 6356.701741, current state active+undersized+degraded, last acting [16,7]
    pg 1.1fb is stuck undersized for 6360.756217, current state active+undersized+degraded, last acting [15,6]
    pg 1.1fc is stuck undersized for 6356.704673, current state active+undersized+degraded, last acting [13,3]
    pg 1.1fd is stuck undersized for 6360.755293, current state active+undersized+degraded, last acting [16,9]
    pg 1.1fe is stuck undersized for 6360.759098, current state active+undersized+degraded, last acting [9,15]
    pg 1.1ff is stuck undersized for 6352.646041, current state active+undersized+degraded, last acting [13,7]
root@dellr730-1:~#

Code:
# ceph osd crush tree --show-shadow
ID  CLASS WEIGHT   TYPE NAME               
-11  ssd2        0 root default~ssd2       
 -8  ssd2        0     host dellr730-1~ssd2
 -4  ssd2        0     host dellr730-2~ssd2
 -2  ssd2        0     host hp1~ssd2       
 -6   ssd 12.73984 root default~ssd         
-12   ssd  6.36992     host dellr730-1~ssd 
 10   ssd  0.90999         osd.10           
 11   ssd  0.90999         osd.11           
 12   ssd  0.90999         osd.12           
 13   ssd  0.90999         osd.13           
 14   ssd  0.90999         osd.14           
 15   ssd  0.90999         osd.15           
 16   ssd  0.90999         osd.16           
 -9   ssd  6.36992     host dellr730-2~ssd 
  3   ssd  0.90999         osd.3           
  4   ssd  0.90999         osd.4           
  5   ssd  0.90999         osd.5           
  6   ssd  0.90999         osd.6           
  7   ssd  0.90999         osd.7           
  8   ssd  0.90999         osd.8           
  9   ssd  0.90999         osd.9           
 -5   ssd        0     host hp1~ssd         
 -1       12.73599 root default             
-10        6.36800     host dellr730-1     
 10   ssd  0.90999         osd.10           
 11   ssd  0.90999         osd.11           
 12   ssd  0.90999         osd.12           
 13   ssd  0.90999         osd.13           
 14   ssd  0.90999         osd.14           
 15   ssd  0.90999         osd.15           
 16   ssd  0.90999         osd.16           
 -7        6.36800     host dellr730-2     
  3   ssd  0.90999         osd.3           
  4   ssd  0.90999         osd.4           
  5   ssd  0.90999         osd.5           
  6   ssd  0.90999         osd.6           
  7   ssd  0.90999         osd.7           
  8   ssd  0.90999         osd.8           
  9   ssd  0.90999         osd.9           
 -3              0     host hp1             
root@dellr730-1:~#
 
Degraded data redundancy: 440934/1611642 objects degraded (27.359%), 420 pgs degraded, 420 pgs undersized
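Your pool appears to use size 3 with min_size 2 (3/2). If so, it will keep serving data but will never recover on its own: the crush tree shows OSDs on only two hosts (hp1 has weight 0), and, assuming the default rule that places each replica on a different host, there is nowhere to put the third copy. Add OSDs on a third node to restore all three copies.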
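
The size 3 guess can be cross-checked from the figures in the status output alone. The sketch below is only arithmetic on numbers copied from the post; it does not query the cluster. The variable names are made up for illustration, and 537210 is the rounded "537.21k objects" figure.

```shell
#!/bin/sh
# Numbers copied from the `ceph -w` output in the question.
object_copies=1611642   # denominator of the degraded/misplaced ratios
objects=537210          # "537.21k objects" (rounded in the output)
pgs_total=512
pgs_undersized=420

# Copies per object implied by the totals; this should equal the pool size.
implied_size=$(awk -v c="$object_copies" -v o="$objects" \
    'BEGIN { printf "%.0f", c / o }')

# If every undersized PG lacks exactly one of its copies, this share of
# all object copies is missing (assumes objects are spread evenly over PGs).
expected_degraded_pct=$(awk -v u="$pgs_undersized" -v t="$pgs_total" -v s="$implied_size" \
    'BEGIN { printf "%.2f", u / t / s * 100 }')

echo "implied pool size:        $implied_size"
echo "expected degraded share:  ${expected_degraded_pct}%"
```

The estimate lands close to the 27.359% that Ceph reports, which fits a size 3 pool whose 420 undersized PGs each hold two of three copies; the two-OSD acting sets in ceph health detail show the same thing. On the cluster itself, `ceph osd pool ls detail` lists the pool's size and min_size, and `ceph osd crush rule dump` shows which failure domain the rule uses.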
 
