Ceph remains in HEALTH_WARN after OSD removal

l.ansaloni

I have a cluster of 3 servers with Ceph storage over 9 disks (3 per server).


One OSD went down/out, so I removed it. After that the system started to rebalance data over the remaining OSDs, but after some hours the rebalance stopped with 1 PG stuck unclean:


Code:
     # ceph -s
     cluster 79796df2-0dc6-4a2d-8c63-5be76c25c12b
     health HEALTH_WARN 1 pgs backfilling; 1 pgs stuck unclean; recovery 710/1068853 objects degraded (0.066%)
     monmap e3: 3 mons at {0=10.10.10.1:6789/0,1=10.10.10.2:6789/0,2=10.10.10.3:6789/0}, election epoch 50, quorum 0,1,2 0,1,2
     osdmap e1251: 8 osds: 8 up, 8 in
      pgmap v16360830: 1216 pgs, 5 pools, 1383 GB data, 347 kobjects
            4048 GB used, 2652 GB / 6701 GB avail
            710/1068853 objects degraded (0.066%)
                1215 active+clean
                   1 active+remapped+backfilling
  client io 34820 kB/s rd, 126 kB/s wr, 335 op/s

This is the osd tree:


Code:
    # ceph osd tree
    # id    weight    type name    up/down    reweight
    -1    6.56    root default
    -2    2.46        host proxmox00
    0    0.82            osd.0    up    1    
    1    0.82            osd.1    up    1    
    2    0.82            osd.2    up    1    
    -3    2.46        host proxmox01
    3    0.82            osd.3    up    1    
    4    0.82            osd.4    up    1    
    5    0.82            osd.5    up    1    
    -4    1.64        host proxmox02
    6    0.82            osd.6    up    1    
    8    0.82            osd.8    up    1
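
(For reference, the usual OSD removal sequence looks roughly like the commands below; osd.7 is only an assumption here, since it is the id missing from the tree above, and the stop command depends on the init system.)

Code:
    # mark the OSD out and stop its daemon on the node (osd.7 is an assumption)
    ceph osd out 7
    service ceph stop osd.7
    # remove it from the CRUSH map, delete its auth key and the OSD entry
    ceph osd crush remove osd.7
    ceph auth del osd.7
    ceph osd rm 7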

Why does the cluster not return to the HEALTH_OK state?
 
Hi,
how full are your OSDs?
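
(One way to check, assuming Ceph Hammer or newer for the first command; on older releases, looking at the OSD data mounts directly works too.)

Code:
# per-OSD usage and weight (available from Hammer onwards)
ceph osd df
# version-independent alternative: check the OSD data mounts on each node
df -h /var/lib/ceph/osd/ceph-*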

Any hints from
Code:
ceph health detail
1216 placement groups for 8 OSDs is a lot...
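
(To see which PG is stuck and on which OSDs it sits, something like the following should help; <pgid> is a placeholder for the id reported by the first command.)

Code:
# list the PG(s) stuck unclean together with their acting OSDs
ceph pg dump_stuck unclean
# then query that PG for its backfill/recovery state
# (<pgid> is a placeholder for the id printed above)
ceph pg <pgid> query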

What is the output of the following command (on proxmox00, since osd.1 runs there):
Code:
ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config show | grep -i backfil
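
(If osd_max_backfills turns out to be throttling the backfill, it can be raised at runtime with injectargs; the value 2 below is only an example.)

Code:
# raise the backfill limit on all OSDs at runtime (example value)
ceph tell osd.* injectargs '--osd-max-backfills 2'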
Udo