CEPH OSD stuck degraded

DataScouting

New Member
Sep 29, 2017
Greetings,

Our Proxmox cluster uses a Ceph pool replicated across 3 OSDs on 3 servers. After a power failure, however, one of the OSDs appears to refuse to sync at all, and all objects on it appear degraded. In fact I freed some space on the cluster by erasing an old, no longer used VM, yet that OSD still shows its old usage percentage (~85.5%), while the other two have dropped to the proper new level (~63% each, as seen in the Ceph OSD tab).
The OS reports that the OSD is mounted rw (via the mount command), and in fact the server has been rebooted normally a few times since.
Cluster status report is:
cluster 5a36c253-d38a-4e7e-bbb2-929f58639662
health HEALTH_WARN
64 pgs backfill_toofull
64 pgs degraded
64 pgs stuck degraded
64 pgs stuck unclean
64 pgs stuck undersized
64 pgs undersized
recovery 149738/449178 objects degraded (33.336%)
recovery 149726/449178 objects misplaced (33.333%)
1 near full osd(s)
monmap e3: 3 mons at {0=10.2.2.242:6789/0,1=10.2.2.240:6789/0,2=10.2.2.243:6789/0}
election epoch 488, quorum 0,1,2 1,0,2
osdmap e394: 3 osds: 3 up, 3 in; 64 remapped pgs
pgmap v32555208: 64 pgs, 1 pools, 575 GB data, 146 kobjects
1959 GB used, 818 GB / 2778 GB avail
149738/449178 objects degraded (33.336%)
149726/449178 objects misplaced (33.333%)
64 active+undersized+degraded+remapped+backfill_toofull
client io 274 kB/s rd, 625 kB/s wr, 153 op/s

and ceph versions are:
osd.0: {
"version": "ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)"
}
osd.1: {
"version": "ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)"
}
osd.2: {
"version": "ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)"
}


How can we restore this node?
 
Crush map is as follows in case it is needed:
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host hydra01
{
id -2 # do not change unnecessarily
# weight 0.900
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.900
}
host hydra02
{
id -3 # do not change unnecessarily
# weight 0.900
alg straw
hash 0 # rjenkins1
item osd.1 weight 0.900
}
host hydra03
{
id -4 # do not change unnecessarily
# weight 0.900
alg straw
hash 0 # rjenkins1
item osd.2 weight 0.900
}
root default
{
id -1 # do not change unnecessarily
# weight 2.700
alg straw
hash 0 # rjenkins1
item hydra01 weight 0.900
item hydra02 weight 0.900
item hydra03 weight 0.900
}

# rules
rule replicated_ruleset
{
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
 
Your OSDs are not able to recover because the cluster hit the state below.
Code:
PG_DEGRADED_FULL

Data redundancy may be reduced or at risk for some data due to a lack of free space in the cluster. Specifically, one or more PGs has the backfill_toofull or recovery_toofull flag set, meaning that the cluster is unable to migrate or recover data because one or more OSDs is above the backfillfull threshold.

See the discussion for OSD_BACKFILLFULL or OSD_FULL above for steps to resolve this condition.
http://docs.ceph.com/docs/master/rados/operations/health-checks/
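
To narrow down which OSD and which PGs are affected, commands along these lines should also work on your Hammer (0.94.x) cluster; nothing below changes any data:
Code:
# which OSD is near/backfill-full, and which PGs are blocked
ceph health detail

# per-OSD utilisation
ceph osd df tree

# list the PGs that are stuck unclean / degraded
ceph pg dump_stuck unclean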

What is 'ceph osd df tree' showing?
 
ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR TYPE NAME
-1 2.69998 - 2778G 1959G 818G 70.52 1.00 root default
-2 0.89999 - 926G 792G 133G 85.60 1.21 host hydra01
0 0.89999 1.00000 926G 792G 133G 85.60 1.21 osd.0
-3 0.89999 - 926G 583G 342G 62.99 0.89 host hydra02
1 0.89999 1.00000 926G 583G 342G 62.99 0.89 osd.1
-4 0.89999 - 926G 583G 342G 62.99 0.89 host hydra03
2 0.89999 1.00000 926G 583G 342G 62.99 0.89 osd.2
TOTAL 2778G 1959G 818G 70.52
MIN/MAX VAR: 0.89/1.21 STDDEV: 10.66
 
By the way,
ceph osd dump | grep full_ratio
returns nothing; perhaps this is due to the older version, but no full/near-full thresholds appear anywhere in the output of ceph osd dump.
 
Can you please use the code tags when posting terminal output? [ CODE]code here[ /CODE], under the little plus. It makes it way more readable. ;)

osd.0 is at 85.6% (near_full) and is not able to participate in the recovery, and as there are only three OSDs on three nodes, the recovery stalled. Is there anything visible in the Ceph logs as to why it holds much more data than the other two OSDs? I guess the deletion didn't go through on that OSD.

Take the following with caution:
One way is to find out which objects exist only on that OSD and on no other; those objects could then be deleted (assuming they belong to the old VM) and the recovery should continue.
Another way would be to destroy the OSD and create a new one (you could also add another disk temporarily); see the command sketch further down.

Do backups first, in case something goes wrong.
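
For the second option, a rough sketch of the usual removal / re-creation sequence (IDs and the device name are placeholders here, double-check everything against your setup before wiping anything):
Code:
# mark the OSD out and stop it (run on hydra01)
ceph osd out 0
systemctl stop ceph-osd@0      # or via the init script, depending on your setup

# remove it from CRUSH, auth and the OSD map
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0

# then wipe the disk and re-create the OSD, e.g. via
# pveceph createosd /dev/sdX   (replace /dev/sdX with the correct device!)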
 
ceph osd dump | grep full_ratio
The full_ratio is 95% by default and the near_full_ratio 85%. As you have only three OSDs, one per host, I assume that destroying the OSD and recreating it should work without problems. But as I stated above, do backups first.
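
If you want to verify what the running daemons actually use, you can query them over the admin socket (run on the respective host; the daemon names below match your IDs):
Code:
# on a monitor host (use the local monitor's ID)
ceph daemon mon.0 config get mon_osd_nearfull_ratio
ceph daemon mon.0 config get mon_osd_full_ratio

# on hydra01, for the backfill threshold of osd.0
ceph daemon osd.0 config get osd_backfill_full_ratio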
 
Would setting the near_full ratio to e.g. 87% (as mentioned in the provided link) allow the recovery to complete automatically (and clear up the extra objects as well) without further intervention on the cluster?
 
I don't believe so, but you can try. Even if the recovery works, the OSD will just stay near_full, and adding any further data will fill the disk up.
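
If you do want to try it anyway, on Hammer the pre-Luminous syntax would be roughly this (values are examples; settings injected this way are not persistent across restarts):
Code:
# raise the cluster-wide near-full warning threshold
ceph pg set_nearfull_ratio 0.87

# allow backfill slightly above the default 85% backfill threshold
ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.90'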
 
Greetings,
We had a major failure on the above system. Due to internal issues, the restoration was never performed. After a sudden power failure today we lost one of the two healthy servers (both of its system disks were fried). The other two servers were shut down properly after all VMs had stopped. After starting up the two machines, the Ceph data appears inaccessible, although Ceph only reports a health warning. Is there a way to force-restore the disks from the (hopefully) correct data on the third server?
 
The Ceph disk is still healthy on the failed server, so is there a way to do some sort of reinstall on that server and reinsert its Ceph disk into the cluster?
 
the Ceph data appears inaccessible, although Ceph only reports a health warning. Is there a way to force-restore the disks from the (hopefully) correct data on the third server?
This is probably due to the Ceph cluster hitting the PG_DEGRADED_FULL state described above.

The Ceph disk is still healthy on the failed server, so is there a way to do some sort of reinstall on that server and reinsert its Ceph disk into the cluster?
Off the top of my head:
I assume you re-installed the OS and added the server to the Proxmox cluster. Then you need to install the Ceph packages (pveceph install) too.

If the server also had a monitor, the easiest way is to remove the old monitor entry from /etc/pve/ceph.conf and run 'pveceph createmon'. Then you should have a working third monitor (incl. manager) again.

After that, reboot the server to see if all services come up again; this should also trigger the start of the OSDs.

If something is not working, check the journal for error messages from the services.
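
Roughly, the steps above as commands (unit names and IDs are placeholders, adapt them to the node):
Code:
# on the re-installed node, after it has joined the Proxmox cluster
pveceph install                          # install the ceph packages

# if the node hosted a monitor: remove its old entry from
# /etc/pve/ceph.conf first, then re-create it
pveceph createmon

# after the reboot, check that the services came back up
systemctl status ceph-mon@<id> ceph-osd@<id>
journalctl -u ceph-osd@<id> -b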