Ceph remains in HEALTH_WARN after OSD removal

l.ansaloni

I have a cluster of 3 servers with Ceph storage over 9 disks (3 per server).


One OSD went down/out, so I removed it. After that the system started to rebalance data over the remaining OSDs, but after some hours the rebalance stopped with 1 PG stuck unclean:


Code:
     # ceph -s
     cluster 79796df2-0dc6-4a2d-8c63-5be76c25c12b
     health HEALTH_WARN 1 pgs backfilling; 1 pgs stuck unclean; recovery 710/1068853 objects degraded (0.066%)
     monmap e3: 3 mons at {0=10.10.10.1:6789/0,1=10.10.10.2:6789/0,2=10.10.10.3:6789/0}, election epoch 50, quorum 0,1,2 0,1,2
     osdmap e1251: 8 osds: 8 up, 8 in
      pgmap v16360830: 1216 pgs, 5 pools, 1383 GB data, 347 kobjects
            4048 GB used, 2652 GB / 6701 GB avail
            710/1068853 objects degraded (0.066%)
                1215 active+clean
                   1 active+remapped+backfilling
  client io 34820 kB/s rd, 126 kB/s wr, 335 op/s

This is the osd tree:


Code:
    #ceph osd tree
    # id    weight    type name    up/down    reweight
    -1    6.56    root default
    -2    2.46        host proxmox00
    0    0.82            osd.0    up    1    
    1    0.82            osd.1    up    1    
    2    0.82            osd.2    up    1    
    -3    2.46        host proxmox01
    3    0.82            osd.3    up    1    
    4    0.82            osd.4    up    1    
    5    0.82            osd.5    up    1    
    -4    1.64        host proxmox02
    6    0.82            osd.6    up    1    
    8    0.82            osd.8    up    1
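Looking at the tree, osd.7 (apparently the third disk on proxmox02) is the one that was removed. For reference, a typical removal sequence, sketched here with osd.7 as an example, is roughly:

Code:
    # mark the OSD out and stop its daemon (adjust to your init system)
    ceph osd out 7
    service ceph stop osd.7
    # remove it from the CRUSH map, then delete its key and the OSD entry
    ceph osd crush remove osd.7
    ceph auth del osd.7
    ceph osd rm 7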

Why does the cluster not return to the HEALTH_OK state?
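For reference, the stuck PG can be inspected in more detail with something like this (the PG id in the last command is just a placeholder, use the one reported by the first two):

Code:
    ceph health detail
    ceph pg dump_stuck unclean
    # then query the PG reported above, e.g.:
    ceph pg 2.1f query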
 
Hi,
how full are your OSDs?
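You can check that for example like this (ceph osd df only if your Ceph version already has it; ceph df should work everywhere):
Code:
    ceph df
    ceph osd df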

Any hints with
Code:
ceph health detail
1216 placement groups for 8 OSDs is a lot...
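As a rough calculation (assuming the default replica size of 3 for your pools):
Code:
    # 1216 PGs x 3 replicas / 8 OSDs = 456 PG copies per OSD
    # the commonly recommended target is only around 100 per OSD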

What is the output of the following command (on proxmox00, because of osd.1):
Code:
ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config show | grep -i backfil
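If that shows osd_max_backfills at 1 and the backfill is crawling, you could try temporarily raising the limits, e.g. (example values only):
Code:
    ceph tell osd.* injectargs '--osd-max-backfills 2 --osd-recovery-max-active 2'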
Udo
 
