Problem with Ceph storage: pgs stuck unclean

halnabriss

Renowned Member
Mar 2, 2014
Hi everyone,
I have a problem with my Ceph storage: it is showing a "pgs stuck unclean" warning. I tried to repair the PGs and to restart the monitors and OSDs, but nothing worked. The problems started after an OSD became 95% full and everything in my cluster got stuck. I then ran the command (ceph pg set_full_ratio 0.98) and deleted unused machines, and everything worked again, but I still have these warnings.
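For reference, this is roughly what that looked like (ceph osd df / ceph df only inspect utilization; on this release the full ratio is changed with ceph pg set_full_ratio, the default being 0.95, while newer releases use ceph osd set-full-ratio instead):

# find the full OSD and overall usage
ceph osd df
ceph df
# temporarily raise the full ratio so the cluster unblocks again (revert once space is freed)
ceph pg set_full_ratio 0.98
# back to the default after cleanup
ceph pg set_full_ratio 0.95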

More details about my case are shown below:

# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 7.23996 root default
-2 2.71999 host node01
0 2.71999 osd.0 up 1.00000 1.00000
-3 2.71999 host node02
1 2.71999 osd.1 up 1.00000 1.00000
-4 1.79999 host node03
2 1.79999 osd.2 up 0.55003 1.00000


# ceph health
HEALTH_WARN 77 pgs stuck unclean; recovery 46/949785 objects degraded (0.005%); recovery 152987/949785 objects misplaced (16.108%)

# ceph -s
cluster f877d510-6946-4a66-bfbb-06b0ee12ae28
health HEALTH_WARN
77 pgs stuck unclean
recovery 46/949785 objects degraded (0.005%)
recovery 152987/949785 objects misplaced (16.108%)
monmap e3: 3 mons at {0=10.1.1.1:6789/0,1=10.1.1.2:6789/0,2=10.1.1.3:6789/0}
election epoch 70, quorum 0,1,2 0,1,2
osdmap e304: 3 osds: 3 up, 3 in; 77 remapped pgs
pgmap v2222751: 160 pgs, 2 pools, 1201 GB data, 309 kobjects
3676 GB used, 3756 GB / 7433 GB avail
46/949785 objects degraded (0.005%)
152987/949785 objects misplaced (16.108%)
83 active+clean
77 active+remapped
client io 66399 kB/s rd, 851 kB/s wr, 1221 op/s

# ceph health detail
HEALTH_WARN 77 pgs stuck unclean; recovery 46/949785 objects degraded (0.005%); recovery 152987/949785 objects misplaced (16.108%)
pg 4.25 is stuck unclean for 121582.717747, current state active+remapped, last acting [0,1,2]
pg 5.1a is stuck unclean for 118635.513579, current state active+remapped, last acting [0,1,2]
pg 4.1b is stuck unclean for 121589.276017, current state active+remapped, last acting [1,0,2]
pg 4.1a is stuck unclean for 121587.037792, current state active+remapped, last acting [1,0,2]
pg 5.1b is stuck unclean for 118676.177113, current state active+remapped, last acting [0,1,2]
pg 4.7a is stuck unclean for 116027.140499, current state active+remapped, last acting [1,0,2]
pg 4.79 is stuck unclean for 115386.851628, current state active+remapped, last acting [1,0,2]
pg 5.1e is stuck unclean for 116462.007267, current state active+remapped, last acting [1,0,2]
pg 4.78 is stuck unclean for 121555.036604, current state active+remapped, last acting [0,1,2]
pg 4.1e is stuck unclean for 116520.298145, current state active+remapped, last acting [1,0,2]
pg 4.1d is stuck unclean for 121587.158490, current state active+remapped, last acting [0,1,2]
pg 4.7e is stuck unclean for 121586.939474, current state active+remapped, last acting [1,0,2]
pg 4.1c is stuck unclean for 121586.202691, current state active+remapped, last acting [1,0,2]
pg 4.13 is stuck unclean for 115386.853358, current state active+remapped, last acting [1,0,2]
pg 5.12 is stuck unclean for 116462.007466, current state active+remapped, last acting [1,0,2]
pg 4.7c is stuck unclean for 121581.825483, current state active+remapped, last acting [0,1,2]
pg 5.10 is stuck unclean for 121596.099742, current state active+remapped, last acting [1,0,2]
pg 4.10 is stuck unclean for 116027.202342, current state active+remapped, last acting [1,0,2]
pg 4.71 is stuck unclean for 121586.364382, current state active+remapped, last acting [1,0,2]
pg 5.16 is stuck unclean for 121591.441230, current state active+remapped, last acting [1,0,2]
pg 4.77 is stuck unclean for 121584.143843, current state active+remapped, last acting [0,1,2]
pg 5.14 is stuck unclean for 119195.905471, current state active+remapped, last acting [0,1,2]
pg 4.75 is stuck unclean for 121584.384698, current state active+remapped, last acting [0,1,2]
pg 4.b is stuck unclean for 120632.338610, current state active+remapped, last acting [0,1,2]
pg 5.b is stuck unclean for 118672.980616, current state active+remapped, last acting [0,1,2]
pg 4.a is stuck unclean for 121590.361216, current state active+remapped, last acting [1,0,2]
pg 4.6a is stuck unclean for 116520.297389, current state active+remapped, last acting [1,0,2]
pg 4.9 is stuck unclean for 121581.842716, current state active+remapped, last acting [0,1,2]
pg 5.9 is stuck unclean for 119866.168159, current state active+remapped, last acting [0,1,2]
pg 5.e is stuck unclean for 118641.998274, current state active+remapped, last acting [0,1,2]
pg 5.f is stuck unclean for 115816.478902, current state active+remapped, last acting [1,0,2]
pg 5.c is stuck unclean for 116035.945866, current state active+remapped, last acting [1,0,2]
pg 4.d is stuck unclean for 121583.616507, current state active+remapped, last acting [0,1,2]
pg 4.6d is stuck unclean for 120850.772815, current state active+remapped, last acting [0,1,2]
pg 4.c is stuck unclean for 116520.297148, current state active+remapped, last acting [1,0,2]
pg 4.6c is stuck unclean for 121590.714610, current state active+remapped, last acting [0,1,2]
pg 4.3 is stuck unclean for 121556.453100, current state active+remapped, last acting [0,1,2]
pg 4.63 is stuck unclean for 121582.568779, current state active+remapped, last acting [0,1,2]
pg 5.3 is stuck unclean for 116035.902051, current state active+remapped, last acting [1,0,2]
pg 4.2 is stuck unclean for 121581.835128, current state active+remapped, last acting [0,1,2]
pg 4.62 is stuck unclean for 116027.098725, current state active+remapped, last acting [1,0,2]
pg 5.0 is stuck unclean for 118685.737689, current state active+remapped, last acting [0,1,2]
pg 4.1 is stuck unclean for 121585.405808, current state active+remapped, last acting [1,0,2]
pg 4.61 is stuck unclean for 121581.947941, current state active+remapped, last acting [0,1,2]
pg 4.0 is stuck unclean for 121582.869185, current state active+remapped, last acting [0,1,2]
pg 4.60 is stuck unclean for 121603.161066, current state active+remapped, last acting [0,1,2]
pg 5.6 is stuck unclean for 116462.006376, current state active+remapped, last acting [1,0,2]
pg 4.7 is stuck unclean for 116027.087510, current state active+remapped, last acting [1,0,2]
pg 4.6 is stuck unclean for 120751.693971, current state active+remapped, last acting [0,1,2]
pg 4.65 is stuck unclean for 116027.086255, current state active+remapped, last acting [1,0,2]
pg 4.5a is stuck unclean for 121584.771439, current state active+remapped, last acting [0,1,2]
pg 4.5c is stuck unclean for 121584.108782, current state active+remapped, last acting [0,1,2]
pg 4.53 is stuck unclean for 121582.627265, current state active+remapped, last acting [1,0,2]
pg 4.52 is stuck unclean for 115290.593727, current state active+remapped, last acting [1,0,2]
pg 4.51 is stuck unclean for 121555.698662, current state active+remapped, last acting [0,1,2]
pg 4.57 is stuck unclean for 121582.464896, current state active+remapped, last acting [0,1,2]
pg 4.4b is stuck unclean for 121582.762554, current state active+remapped, last acting [1,0,2]
pg 4.4a is stuck unclean for 121595.675892, current state active+remapped, last acting [0,1,2]
pg 4.49 is stuck unclean for 121581.922555, current state active+remapped, last acting [0,1,2]
pg 4.48 is stuck unclean for 119258.014499, current state active+remapped, last acting [0,1,2]
pg 4.4f is stuck unclean for 121594.400713, current state active+remapped, last acting [1,0,2]
pg 4.4c is stuck unclean for 116520.297840, current state active+remapped, last acting [1,0,2]
pg 4.43 is stuck unclean for 116520.297863, current state active+remapped, last acting [1,0,2]
pg 4.41 is stuck unclean for 116027.068146, current state active+remapped, last acting [1,0,2]
pg 4.40 is stuck unclean for 116520.297938, current state active+remapped, last acting [1,0,2]
pg 4.38 is stuck unclean for 120226.454185, current state active+remapped, last acting [0,1,2]
pg 4.3e is stuck unclean for 121581.861168, current state active+remapped, last acting [1,0,2]
pg 4.31 is stuck unclean for 121583.502541, current state active+remapped, last acting [1,0,2]
pg 4.36 is stuck unclean for 121582.880836, current state active+remapped, last acting [1,0,2]
pg 4.2b is stuck unclean for 121582.990050, current state active+remapped, last acting [1,0,2]
pg 4.29 is stuck unclean for 121582.880635, current state active+remapped, last acting [1,0,2]
pg 4.28 is stuck unclean for 121587.158553, current state active+remapped, last acting [0,1,2]
pg 4.2e is stuck unclean for 121582.880683, current state active+remapped, last acting [1,0,2]
pg 4.2d is stuck unclean for 121553.777639, current state active+remapped, last acting [0,1,2]
pg 4.23 is stuck unclean for 116520.298495, current state active+remapped, last acting [1,0,2]
pg 4.21 is stuck unclean for 116520.298558, current state active+remapped, last acting [1,0,2]
pg 4.27 is stuck unclean for 121582.065714, current state active+remapped, last acting [0,1,2]
recovery 46/949785 objects degraded (0.005%)
recovery 152987/949785 objects misplaced (16.108%)

The monitor log shows:

#tail -f /var/log/ceph/ceph-mon.0.log
2017-05-18 03:53:50.539006 7f826e394700 0 log_channel(cluster) log [INF] : pgmap v2222822: 160 pgs: 77 active+remapped, 83 active+clean; 1201 GB data, 3676 GB used, 3756 GB / 7433 GB avail; 14667 kB/s rd, 284 kB/s wr, 300 op/s; 46/949785 objects degraded (0.005%); 152987/949785 objects misplaced (16.108%)
2017-05-18 03:53:51.545615 7f826e394700 0 log_channel(cluster) log [INF] : pgmap v2222823: 160 pgs: 77 active+remapped, 83 active+clean; 1201 GB data, 3676 GB used, 3756 GB / 7433 GB avail; 31352 kB/s rd, 824 kB/s wr, 675 op/s; 46/949785 objects degraded (0.005%); 152987/949785 objects misplaced (16.108%)
2017-05-18 03:53:52.552068 7f826e394700 0 log_channel(cluster) log [INF] : pgmap v2222824: 160 pgs: 77 active+remapped, 83 active+clean; 1201 GB data, 3676 GB used, 3756 GB / 7433 GB avail; 89268 kB/s rd, 2659 kB/s wr, 1947 op/s; 46/949785 objects degraded (0.005%); 152987/949785 objects misplaced (16.108%)
2017-05-18 03:53:55.627772 7f826e394700 0 log_channel(cluster) log [INF] : pgmap v2222825: 160 pgs: 77 active+remapped, 83 active+clean; 1201 GB data, 3676 GB used, 3756 GB / 7433 GB avail; 26179 kB/s rd, 1133 kB/s wr, 624 op/s; 46/949785 objects degraded (0.005%); 152987/949785 objects misplaced (16.108%)
2017-05-18 03:53:56.633148 7f826e394700 0 log_channel(cluster) log [INF] : pgmap v2222826: 160 pgs: 77 active+remapped, 83 active+clean; 1201 GB data, 3676 GB used, 3756 GB / 7433 GB avail; 22057 kB/s rd, 1042 kB/s wr, 542 op/s; 46/949785 objects degraded (0.005%); 152987/949785 objects misplaced (16.108%)
2017-05-18 03:53:57.636356 7f826e394700 0 log_channel(cluster) log [INF] : pgmap v2222827: 160 pgs: 77 active+remapped, 83 active+clean; 1201 GB data, 3676 GB used, 3756 GB / 7433 GB avail; 69140 kB/s rd, 2991 kB/s wr, 1597 op/s; 46/949785 objects degraded (0.005%); 152987/949785 objects misplaced (16.108%)


How can I fix this warning?

Thanks in advance for your replies.
 
I read this tutorial before and tried to repair and force-recreate the PGs, but nothing changed (the commands I tried are sketched after the query output below). The query command shows the following:

# ceph pg 4.27 query
"state": "active+remapped",
"snap_trimq": "[b~1,1d~1]",
"epoch": 304,
"up": [
0,
1
],
"acting": [
0,
1,
2
],
"actingbackfill": [
"0",
"1",
"2"
],
.
.
.
.

"peer": "1",
"pgid": "4.27",
"last_update": "304'3461628",
"last_complete": "296'3456616",

"peer": "2",
"pgid": "4.27",
"last_update": "304'3461628",
"last_complete": "304'3461628",

"recovery_state": [
{
"name": "Started\/Primary\/Active",
"enter_time": "2017-05-18 01:06:32.417795",
"might_have_unfound": [
{
"osd": "1",
"status": "already probed"
},
{
"osd": "2",
"status": "already probed"
}
],
"recovery_progress": {
"backfill_targets": [],
"waiting_on_backfill": [],
"last_backfill_started": "-1\/0\/\/0",
"backfill_info": {
"begin": "-1\/0\/\/0",
"end": "-1\/0\/\/0",
"objects": []
},
"peer_backfill_info": [],
"backfills_in_flight": [],
"recovering": [],
"pg_backend": {
"pull_from_peer": [],
"pushing": []
}
},
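For completeness, the repair and force-recreate attempts were roughly the following (force_create_pg is a destructive last resort; none of this changed anything in my case):

# list the PGs that are stuck unclean
ceph pg dump_stuck unclean
# ask the primary OSD to scrub and repair a stuck PG
ceph pg repair 4.27
# last resort: force recreation of the PG (did not help here)
ceph pg force_create_pg 4.27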
 
Thank you jeffwadsworth for your replies.
Unfortunately I couldn't use the answer in the referenced URL; the OSDs in my case are working fine:
#ceph osd stat
osdmap e304: 3 osds: 3 up, 3 in; 77 remapped pgs

There are no network issues: each OSD can reach the others, and ping and telnet to the monitors' IPs/ports work fine.
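The checks were along these lines, using the monitor addresses from the monmap above (6789 is the monitor port):

# from each node, check the other nodes and the monitors
ping -c 3 10.1.1.2
ping -c 3 10.1.1.3
telnet 10.1.1.1 6789
telnet 10.1.1.2 6789
telnet 10.1.1.3 6789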
 
Thank you all,
the problem occurred after using reweight-by-utilization: the reweight of osd.2 had been changed to 0.55003
2 1.79999 osd.2 up 0.55003 1.00000

When I set the reweight back to 1, everything went back to normal:
# ceph osd reweight 2 1
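For anyone who hits the same thing: with only three OSDs and 3-replica pools, down-weighting one OSD means CRUSH sometimes cannot place the third copy on osd.2 anymore (the query above shows "up": [0,1] but "acting": [0,1,2]), so those PGs stay active+remapped. A rough sketch of how to check and revert this (the test- command is only a dry run and may not exist on older releases):

# see utilization and the current reweight column
ceph osd df
# on releases that have it, dry-run instead of reweighting directly
ceph osd test-reweight-by-utilization
# put osd.2 back to full weight
ceph osd reweight 2 1.0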


Thanks
closed thread :)
 
