OSDs failed due to XFS errors

Allen Chen

New Member
Sep 21, 2018
Hi,

I have 2 OSDs (osd.13 and osd.17) that failed. There are a lot of error messages in dmesg, and df just hangs.

[53623085.515011] XFS (sdc1): xfs_log_force: error -5 returned.
[53623088.039118] XFS (sdd1): xfs_log_force: error -5 returned.
[53623115.516356] XFS (sdc1): xfs_log_force: error -5 returned.
[53623118.040473] XFS (sdd1): xfs_log_force: error -5 returned.

sdc 8:32 0 1.8T 0 disk
└─sdc1 8:33 0 1.8T 0 part /var/lib/ceph/osd/ceph-13
sdd 8:48 0 1.8T 0 disk
└─sdd1 8:49 0 1.8T 0 part /var/lib/ceph/osd/ceph-17

13 1.81850 osd.13 down 0 1.00000
17 1.81850 osd.17 down 0 1.00000


# uname -a
Linux ceph-ric-sata-03.mgt.tpgtelecom.com.au 4.4.1-1.el7.elrepo.x86_64 #1 SMP Sun Jan 31 16:49:23 EST 2016 x86_64 x86_64 x86_64 GNU/Linux


smartctl comes back clean, so there should be no issue with the physical drives. Could XFS corruption be what brought the OSDs down? Any advice on how to fix this problem? Please comment.

Thanks so much.
 
You can try running xfs_repair on the 2 OSDs.

Make sure that the OSD process/service is fully stopped for both OSDs first before doing so.

*Warning* There is a chance that xfs_repair will fix the XFS filesystem enough for the OSD to start, but it may still cause data corruption / missing files at the Ceph level.
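Roughly, the process would look like the below (device and mount paths are taken from your lsblk output, so double-check them first; the -n pass is a dry run that only reports problems, and you may need xfs_repair -L if the log itself is unreadable):

systemctl stop ceph-osd@13 ceph-osd@17      # make sure the daemons are fully stopped
umount /var/lib/ceph/osd/ceph-13            # the filesystem must be unmounted for xfs_repair
umount /var/lib/ceph/osd/ceph-17
xfs_repair -n /dev/sdc1                     # dry run first, report only
xfs_repair -n /dev/sdd1
xfs_repair /dev/sdc1                        # actual repair
xfs_repair /dev/sdd1
mount /dev/sdc1 /var/lib/ceph/osd/ceph-13   # re-mount and try the OSDs again
mount /dev/sdd1 /var/lib/ceph/osd/ceph-17
systemctl start ceph-osd@13 ceph-osd@17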
 
Thanks so much for your advice.
Currently the OSDs are down and the data has been rebalanced; the OSD service status is like below. I will stop the OSD services. Is it OK to run xfs_repair against the 2 OSDs?


ceph-osd@13.service loaded failed failed Ceph object storage daemon
ceph-osd@17.service loaded failed failed Ceph object storage daemon
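My rough plan before running the repair is below, mainly to make sure nothing restarts the daemons or still holds the mounts; please correct me if anything is off:

systemctl stop ceph-osd@13 ceph-osd@17
systemctl disable ceph-osd@13 ceph-osd@17   # so they do not come back mid-repair
fuser -vm /var/lib/ceph/osd/ceph-13         # confirm nothing is still using the mount
fuser -vm /var/lib/ceph/osd/ceph-17
umount /var/lib/ceph/osd/ceph-13
umount /var/lib/ceph/osd/ceph-17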
 
When you say the data is balanced, have you let Ceph repair by re-creating the PGs on other OSDs?

Can you provide the full ceph status output? If Ceph is in a fully healthy state, I'd suggest not trying to fix the two OSDs; instead remove, wipe, and re-add them as two fresh OSDs.
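The removal itself would be along the usual lines below (double-check the IDs, and only do it once you have confirmed the cluster is otherwise healthy):

ceph osd out 13
ceph osd out 17
systemctl stop ceph-osd@13 ceph-osd@17
ceph osd crush remove osd.13
ceph osd crush remove osd.17
ceph auth del osd.13
ceph auth del osd.17
ceph osd rm 13
ceph osd rm 17
umount /var/lib/ceph/osd/ceph-13
umount /var/lib/ceph/osd/ceph-17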
 
Thanks. ceph status is OK now. The PGs have been migrated to other OSDs.


[root@ceph-ric-sata-03 log]# ceph -s
cluster 0245417b-c40b-435a-a09e-
health HEALTH_OK
monmap e3: 3 mons at { }
election epoch 1694, quorum 0,1,2 ceph-ric-fec-01,ceph-ric-fec-02,ceph-ric-sata-01
osdmap e1138: 40 osds: 38 up, 38 in
flags sortbitwise,require_jewel_osds
pgmap v47225558: 1024 pgs, 1 pools, 14849 GB data, 3712 kobjects
29689 GB used, 41070 GB / 70760 GB avail
1024 active+clean
client io 12222 B/s rd, 22449 kB/s wr, 1 op/s rd, 757 op/s wr
 
Then your best bet is to remove the OSDs fully, wipe the disks, give them a good health check, and add them back as 2 fresh OSDs.

Even if you were to fix the OSDs, there is a chance some data may be corrupt and could just cause you further issues down the line.
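Something along these lines for the health check, wipe and re-add (this assumes the OSDs were deployed with ceph-disk, the usual tool on Jewel; if you used ceph-deploy or another method, use that instead, and let the SMART long self-test finish before re-adding):

smartctl -t long /dev/sdc     # long self-test, check the result afterwards with smartctl -a /dev/sdc
smartctl -t long /dev/sdd
sgdisk --zap-all /dev/sdc     # wipe the partition tables
sgdisk --zap-all /dev/sdd
wipefs --all /dev/sdc         # clear any leftover filesystem signatures
wipefs --all /dev/sdd
ceph-disk prepare --fs-type xfs /dev/sdc
ceph-disk prepare --fs-type xfs /dev/sdd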
 
Thanks so much, mate.