problem with ceph storage pgs stuck unclean

halnabriss

Renowned Member
Mar 2, 2014
Hi everyone,
I have a problem with my Ceph storage: it is showing a "pgs stuck unclean" warning. I tried to repair the PGs and to restart the monitors and OSDs, but nothing worked. The trouble started after an OSD became 95% full and everything in my cluster got stuck. I then ran the command ceph pg set_full_ratio 0.98 and deleted some unused machines, after which everything worked again, but the warnings remain.
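For reference, these are roughly the steps I took when the OSD filled up (ceph osd df is just how I checked per-OSD utilization; set_full_ratio is the pre-Luminous syntax I used):

# ceph osd df
# ceph pg set_full_ratio 0.98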

More details about my case:

# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 7.23996 root default
-2 2.71999 host node01
0 2.71999 osd.0 up 1.00000 1.00000
-3 2.71999 host node02
1 2.71999 osd.1 up 1.00000 1.00000
-4 1.79999 host node03
2 1.79999 osd.2 up 0.55003 1.00000


# ceph health
HEALTH_WARN 77 pgs stuck unclean; recovery 46/949785 objects degraded (0.005%); recovery 152987/949785 objects misplaced (16.108%)

# ceph -s
cluster f877d510-6946-4a66-bfbb-06b0ee12ae28
health HEALTH_WARN
77 pgs stuck unclean
recovery 46/949785 objects degraded (0.005%)
recovery 152987/949785 objects misplaced (16.108%)
monmap e3: 3 mons at {0=10.1.1.1:6789/0,1=10.1.1.2:6789/0,2=10.1.1.3:6789/0}
election epoch 70, quorum 0,1,2 0,1,2
osdmap e304: 3 osds: 3 up, 3 in; 77 remapped pgs
pgmap v2222751: 160 pgs, 2 pools, 1201 GB data, 309 kobjects
3676 GB used, 3756 GB / 7433 GB avail
46/949785 objects degraded (0.005%)
152987/949785 objects misplaced (16.108%)
83 active+clean
77 active+remapped
client io 66399 kB/s rd, 851 kB/s wr, 1221 op/s

# ceph health detail
HEALTH_WARN 77 pgs stuck unclean; recovery 46/949785 objects degraded (0.005%); recovery 152987/949785 objects misplaced (16.108%)
pg 4.25 is stuck unclean for 121582.717747, current state active+remapped, last acting [0,1,2]
pg 5.1a is stuck unclean for 118635.513579, current state active+remapped, last acting [0,1,2]
pg 4.1b is stuck unclean for 121589.276017, current state active+remapped, last acting [1,0,2]
pg 4.1a is stuck unclean for 121587.037792, current state active+remapped, last acting [1,0,2]
pg 5.1b is stuck unclean for 118676.177113, current state active+remapped, last acting [0,1,2]
pg 4.7a is stuck unclean for 116027.140499, current state active+remapped, last acting [1,0,2]
pg 4.79 is stuck unclean for 115386.851628, current state active+remapped, last acting [1,0,2]
pg 5.1e is stuck unclean for 116462.007267, current state active+remapped, last acting [1,0,2]
pg 4.78 is stuck unclean for 121555.036604, current state active+remapped, last acting [0,1,2]
pg 4.1e is stuck unclean for 116520.298145, current state active+remapped, last acting [1,0,2]
pg 4.1d is stuck unclean for 121587.158490, current state active+remapped, last acting [0,1,2]
pg 4.7e is stuck unclean for 121586.939474, current state active+remapped, last acting [1,0,2]
pg 4.1c is stuck unclean for 121586.202691, current state active+remapped, last acting [1,0,2]
pg 4.13 is stuck unclean for 115386.853358, current state active+remapped, last acting [1,0,2]
pg 5.12 is stuck unclean for 116462.007466, current state active+remapped, last acting [1,0,2]
pg 4.7c is stuck unclean for 121581.825483, current state active+remapped, last acting [0,1,2]
pg 5.10 is stuck unclean for 121596.099742, current state active+remapped, last acting [1,0,2]
pg 4.10 is stuck unclean for 116027.202342, current state active+remapped, last acting [1,0,2]
pg 4.71 is stuck unclean for 121586.364382, current state active+remapped, last acting [1,0,2]
pg 5.16 is stuck unclean for 121591.441230, current state active+remapped, last acting [1,0,2]
pg 4.77 is stuck unclean for 121584.143843, current state active+remapped, last acting [0,1,2]
pg 5.14 is stuck unclean for 119195.905471, current state active+remapped, last acting [0,1,2]
pg 4.75 is stuck unclean for 121584.384698, current state active+remapped, last acting [0,1,2]
pg 4.b is stuck unclean for 120632.338610, current state active+remapped, last acting [0,1,2]
pg 5.b is stuck unclean for 118672.980616, current state active+remapped, last acting [0,1,2]
pg 4.a is stuck unclean for 121590.361216, current state active+remapped, last acting [1,0,2]
pg 4.6a is stuck unclean for 116520.297389, current state active+remapped, last acting [1,0,2]
pg 4.9 is stuck unclean for 121581.842716, current state active+remapped, last acting [0,1,2]
pg 5.9 is stuck unclean for 119866.168159, current state active+remapped, last acting [0,1,2]
pg 5.e is stuck unclean for 118641.998274, current state active+remapped, last acting [0,1,2]
pg 5.f is stuck unclean for 115816.478902, current state active+remapped, last acting [1,0,2]
pg 5.c is stuck unclean for 116035.945866, current state active+remapped, last acting [1,0,2]
pg 4.d is stuck unclean for 121583.616507, current state active+remapped, last acting [0,1,2]
pg 4.6d is stuck unclean for 120850.772815, current state active+remapped, last acting [0,1,2]
pg 4.c is stuck unclean for 116520.297148, current state active+remapped, last acting [1,0,2]
pg 4.6c is stuck unclean for 121590.714610, current state active+remapped, last acting [0,1,2]
pg 4.3 is stuck unclean for 121556.453100, current state active+remapped, last acting [0,1,2]
pg 4.63 is stuck unclean for 121582.568779, current state active+remapped, last acting [0,1,2]
pg 5.3 is stuck unclean for 116035.902051, current state active+remapped, last acting [1,0,2]
pg 4.2 is stuck unclean for 121581.835128, current state active+remapped, last acting [0,1,2]
pg 4.62 is stuck unclean for 116027.098725, current state active+remapped, last acting [1,0,2]
pg 5.0 is stuck unclean for 118685.737689, current state active+remapped, last acting [0,1,2]
pg 4.1 is stuck unclean for 121585.405808, current state active+remapped, last acting [1,0,2]
pg 4.61 is stuck unclean for 121581.947941, current state active+remapped, last acting [0,1,2]
pg 4.0 is stuck unclean for 121582.869185, current state active+remapped, last acting [0,1,2]
pg 4.60 is stuck unclean for 121603.161066, current state active+remapped, last acting [0,1,2]
pg 5.6 is stuck unclean for 116462.006376, current state active+remapped, last acting [1,0,2]
pg 4.7 is stuck unclean for 116027.087510, current state active+remapped, last acting [1,0,2]
pg 4.6 is stuck unclean for 120751.693971, current state active+remapped, last acting [0,1,2]
pg 4.65 is stuck unclean for 116027.086255, current state active+remapped, last acting [1,0,2]
pg 4.5a is stuck unclean for 121584.771439, current state active+remapped, last acting [0,1,2]
pg 4.5c is stuck unclean for 121584.108782, current state active+remapped, last acting [0,1,2]
pg 4.53 is stuck unclean for 121582.627265, current state active+remapped, last acting [1,0,2]
pg 4.52 is stuck unclean for 115290.593727, current state active+remapped, last acting [1,0,2]
pg 4.51 is stuck unclean for 121555.698662, current state active+remapped, last acting [0,1,2]
pg 4.57 is stuck unclean for 121582.464896, current state active+remapped, last acting [0,1,2]
pg 4.4b is stuck unclean for 121582.762554, current state active+remapped, last acting [1,0,2]
pg 4.4a is stuck unclean for 121595.675892, current state active+remapped, last acting [0,1,2]
pg 4.49 is stuck unclean for 121581.922555, current state active+remapped, last acting [0,1,2]
pg 4.48 is stuck unclean for 119258.014499, current state active+remapped, last acting [0,1,2]
pg 4.4f is stuck unclean for 121594.400713, current state active+remapped, last acting [1,0,2]
pg 4.4c is stuck unclean for 116520.297840, current state active+remapped, last acting [1,0,2]
pg 4.43 is stuck unclean for 116520.297863, current state active+remapped, last acting [1,0,2]
pg 4.41 is stuck unclean for 116027.068146, current state active+remapped, last acting [1,0,2]
pg 4.40 is stuck unclean for 116520.297938, current state active+remapped, last acting [1,0,2]
pg 4.38 is stuck unclean for 120226.454185, current state active+remapped, last acting [0,1,2]
pg 4.3e is stuck unclean for 121581.861168, current state active+remapped, last acting [1,0,2]
pg 4.31 is stuck unclean for 121583.502541, current state active+remapped, last acting [1,0,2]
pg 4.36 is stuck unclean for 121582.880836, current state active+remapped, last acting [1,0,2]
pg 4.2b is stuck unclean for 121582.990050, current state active+remapped, last acting [1,0,2]
pg 4.29 is stuck unclean for 121582.880635, current state active+remapped, last acting [1,0,2]
pg 4.28 is stuck unclean for 121587.158553, current state active+remapped, last acting [0,1,2]
pg 4.2e is stuck unclean for 121582.880683, current state active+remapped, last acting [1,0,2]
pg 4.2d is stuck unclean for 121553.777639, current state active+remapped, last acting [0,1,2]
pg 4.23 is stuck unclean for 116520.298495, current state active+remapped, last acting [1,0,2]
pg 4.21 is stuck unclean for 116520.298558, current state active+remapped, last acting [1,0,2]
pg 4.27 is stuck unclean for 121582.065714, current state active+remapped, last acting [0,1,2]
recovery 46/949785 objects degraded (0.005%)
recovery 152987/949785 objects misplaced (16.108%)

The logs show:

#tail -f /var/log/ceph/ceph-mon.0.log
2017-05-18 03:53:50.539006 7f826e394700 0 log_channel(cluster) log [INF] : pgmap v2222822: 160 pgs: 77 active+remapped, 83 active+clean; 1201 GB data, 3676 GB used, 3756 GB / 7433 GB avail; 14667 kB/s rd, 284 kB/s wr, 300 op/s; 46/949785 objects degraded (0.005%); 152987/949785 objects misplaced (16.108%)
2017-05-18 03:53:51.545615 7f826e394700 0 log_channel(cluster) log [INF] : pgmap v2222823: 160 pgs: 77 active+remapped, 83 active+clean; 1201 GB data, 3676 GB used, 3756 GB / 7433 GB avail; 31352 kB/s rd, 824 kB/s wr, 675 op/s; 46/949785 objects degraded (0.005%); 152987/949785 objects misplaced (16.108%)
2017-05-18 03:53:52.552068 7f826e394700 0 log_channel(cluster) log [INF] : pgmap v2222824: 160 pgs: 77 active+remapped, 83 active+clean; 1201 GB data, 3676 GB used, 3756 GB / 7433 GB avail; 89268 kB/s rd, 2659 kB/s wr, 1947 op/s; 46/949785 objects degraded (0.005%); 152987/949785 objects misplaced (16.108%)
2017-05-18 03:53:55.627772 7f826e394700 0 log_channel(cluster) log [INF] : pgmap v2222825: 160 pgs: 77 active+remapped, 83 active+clean; 1201 GB data, 3676 GB used, 3756 GB / 7433 GB avail; 26179 kB/s rd, 1133 kB/s wr, 624 op/s; 46/949785 objects degraded (0.005%); 152987/949785 objects misplaced (16.108%)
2017-05-18 03:53:56.633148 7f826e394700 0 log_channel(cluster) log [INF] : pgmap v2222826: 160 pgs: 77 active+remapped, 83 active+clean; 1201 GB data, 3676 GB used, 3756 GB / 7433 GB avail; 22057 kB/s rd, 1042 kB/s wr, 542 op/s; 46/949785 objects degraded (0.005%); 152987/949785 objects misplaced (16.108%)
2017-05-18 03:53:57.636356 7f826e394700 0 log_channel(cluster) log [INF] : pgmap v2222827: 160 pgs: 77 active+remapped, 83 active+clean; 1201 GB data, 3676 GB used, 3756 GB / 7433 GB avail; 69140 kB/s rd, 2991 kB/s wr, 1597 op/s; 46/949785 objects degraded (0.005%); 152987/949785 objects misplaced (16.108%)


How can I fix this warning??

Thanks in advance for your replies.
 
I had already read that tutorial; I tried to repair and force-recreate the PGs, but nothing changed. The query command shows the following:
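For completeness, the repair and recreate attempts were along these lines (using pg 4.27 as an example, the same PG whose query output is below):

# ceph pg repair 4.27
# ceph pg force_create_pg 4.27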

# ceph pg 4.27 query
"state": "active+remapped",
"snap_trimq": "[b~1,1d~1]",
"epoch": 304,
"up": [
0,
1
],
"acting": [
0,
1,
2
],
"actingbackfill": [
"0",
"1",
"2"
],
...

"peer": "1",
"pgid": "4.27",
"last_update": "304'3461628",
"last_complete": "296'3456616",

"peer": "2",
"pgid": "4.27",
"last_update": "304'3461628",
"last_complete": "304'3461628",

"recovery_state": [
{
"name": "Started\/Primary\/Active",
"enter_time": "2017-05-18 01:06:32.417795",
"might_have_unfound": [
{
"osd": "1",
"status": "already probed"
},
{
"osd": "2",
"status": "already probed"
}
],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "-1\/0\/\/0",
"backfill_info": {
"begin": "-1\/0\/\/0",
"end": "-1\/0\/\/0",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"pull_from_peer": [],
"pushing": []
}
},
 
Thank you jeffwadsworth for your replies.
Unfortunately the answer in the referenced URL does not apply to my case, because the OSDs here are working fine:
#ceph osd stat
osdmap e304: 3 osds: 3 up, 3 in; 77 remapped pgs

There are no network issues: each OSD can reach the others, and ping and telnet to the monitors' IPs/ports work fine.
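Roughly the checks I did, with the monitor addresses taken from the monmap above:

# ping -c 3 10.1.1.2
# ping -c 3 10.1.1.3
# telnet 10.1.1.1 6789
# telnet 10.1.1.2 6789
# telnet 10.1.1.3 6789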
 
Thank you all,
the problem occurred after running reweight-by-utilization, which changed the reweight of osd.2 to 0.55003:
2 1.79999 osd.2 up 0.55003 1.00000

With only three OSDs and a replica size of 3, that apparently left CRUSH unable to keep osd.2 in the up set of those PGs (the pg query above shows up [0,1] but acting [0,1,2]), which is why they sat in active+remapped. When I set the reweight back to 1, everything went back to normal:
ceph osd reweight 2 1
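To verify the fix I watched the override and the recovery with something like:

# ceph osd df
# ceph -s

ceph osd df shows the REWEIGHT column back at 1.00000 for osd.2, and ceph -s shows the misplaced count dropping as the remapped PGs return to active+clean.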


Thanks
closed thread :)