ceph health issue

Hello,
I've run into the following on our 3-node Ceph cluster:


Code:
# ceph health detail

HEALTH_WARN 32 pgs degraded; 92 pgs down; 92 pgs peering; 92 pgs stuck inactive; 192 pgs stuck unclean; 3 requests are blocked > 32 sec; 2 osds have slow requests; recovery 46790/456882 objects degraded (10.241%); 1 mons down, quorum 0,1,2 0,2,1
pg 1.20 is stuck inactive for 74762.284833, current state down+peering, last acting [10,6]
pg 1.21 is stuck inactive for 65922.715915, current state down+peering, last acting [10,7]
pg 1.1e is stuck inactive for 65690.198886, current state down+peering, last acting [4,8]
pg 2.1c is stuck inactive for 65810.745600, current state down+peering, last acting [10,7]
pg 0.1e is stuck inactive for 65690.198691, current state down+peering, last acting [7,8]
pg 1.1f is stuck inactive for 65906.255850, current state down+peering, last acting [8,4]
pg 0.1d is stuck inactive for 65690.210385, current state down+peering, last acting [5,10]
pg 1.1d is stuck inactive for 65690.210323, current state down+peering, last acting [5,10]
pg 1.1a is stuck inactive for 65690.198144, current state down+peering, last acting [4,10]
pg 1.1b is stuck inactive for 65906.287501, current state down+peering, last acting [8,4]
pg 2.1b is stuck inactive since forever, current state down+peering, last acting [5,9]
pg 1.18 is stuck inactive for 65922.709773, current state down+peering, last acting [10,4]
pg 2.1a is stuck inactive since forever, current state down+peering, last acting [10,4]
pg 0.17 is stuck inactive since forever, current state down+peering, last acting [6,10]
pg 1.16 is stuck inactive for 65690.192023, current state down+peering, last acting [6,8]
pg 2.14 is stuck inactive for 65594.732013, current state down+peering, last acting [6,9]
pg 2.17 is stuck inactive for 65813.617834, current state down+peering, last acting [10,5]
pg 2.16 is stuck inactive for 65234.560829, current state down+peering, last acting [5,10]
pg 0.14 is stuck inactive since forever, current state down+peering, last acting [9,4]
pg 1.15 is stuck inactive for 65906.288367, current state down+peering, last acting [8,5]
pg 2.11 is stuck inactive for 65819.547115, current state down+peering, last acting [6,8]
pg 0.13 is stuck inactive for 65690.190726, current state down+peering, last acting [7,10]
pg 1.12 is stuck inactive since forever, current state down+peering, last acting [5,10]
...
pg 2.2f is down+peering, acting [7,9]
pg 2.2e is down+peering, acting [7,9]
pg 0.2c is down+peering, acting [8,7]
pg 2.29 is down+peering, acting [6,10]
pg 1.2a is down+peering, acting [5,10]
pg 2.28 is down+peering, acting [5,9]
pg 2.2b is down+peering, acting [7,8]
pg 2.2a is down+peering, acting [9,5]
pg 0.28 is down+peering, acting [7,8]
...
pg 0.2f is down+peering, acting [8,5]
pg 2.2c is down+peering, acting [8,6]
pg 1.2f is down+peering, acting [5,8]
1 ops are blocked > 4194.3 sec
2 ops are blocked > 2097.15 sec
1 ops are blocked > 4194.3 sec on osd.8
2 ops are blocked > 2097.15 sec on osd.10
2 osds have slow requests
recovery 46790/456882 objects degraded (10.241%)
mon.3 (rank 3) addr 10.11.12.240:6789/0 is down (out of quorum)

Code:
# ceph health 

HEALTH_WARN 
   32 pgs degraded; 
   92 pgs down; 92 pgs peering; 
   92 pgs stuck inactive; 
   192 pgs stuck unclean;
   3 requests are blocked > 32 sec; 
   recovery 46790/456882 objects degraded (10.241%); 
   1 mons down, quorum 0,1,2 0,2,1

I'm checking the docs at http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/, but to jump-start the solution, any suggestions are appreciated.

thanks
Rob
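
A common first step for PGs stuck in down+peering is to query one of them and look at which OSDs it is waiting on; a rough sketch, using pg 1.20 from the output above:
Code:
# list all PGs currently stuck inactive
ceph pg dump_stuck inactive

# ask one stuck PG why it is blocked (see the 'recovery_state' section of the output)
ceph pg 1.20 query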
 
Looks like you have one monitor down and possibly 2 OSDs with issues.

Do you see anything in the Proxmox Web interface for the Ceph Monitors and OSD pages?

Serge
 

One node was removed before it was removed as a mon... that monitor still shows up on the Monitor tab in PVE.

For OSDs: one node shows all OSDs as down/out.
 
So you HAD a 3-node Ceph cluster. You removed one node and now have a (possibly broken) 2-node Ceph cluster.

How did you remove the node?

Please describe your cluster BEFORE and AFTER removing the node. Also, can you tell us what you now see on the Ceph Monitor and OSD pages?

Serge
 
Hi,
are there any hints in the log if you try to start the OSDs on the node with the failed OSDs:
Code:
/etc/init.d/ceph start osd repair
Is there enough free space on the disks? (check the log)
What does the output of the following look like?
Code:
ceph osd tree
Udo
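
For the free-space and log checks, something along these lines should do; the paths are the default Debian/Proxmox locations, and osd.8 is taken from the slow-requests lines in the health output:
Code:
# free space on the OSD data partitions (default mount points)
df -h /var/lib/ceph/osd/ceph-*

# last entries in the log of one of the OSDs reporting slow requests
tail -n 50 /var/log/ceph/ceph-osd.8.log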
 
In PVE, for each down/out OSD, I clicked the OSD and then 'start'.

The Ceph health is getting better:

Code:
ceph4-ib  ~ # ceph  health detail
HEALTH_WARN 4 pgs backfilling; 4 pgs stuck unclean; recovery 2787/542792 objects degraded (0.513%); too few pgs per osd (17 < min 20); 1 mons down, quorum 0,1,2 0,2,1
pg 2.0 is stuck unclean for 71736.217307, current state active+remapped+backfilling, last acting [3,5,7]
pg 2.22 is stuck unclean for 71733.980573, current state active+remapped+backfilling, last acting [3,4,6]
pg 2.31 is stuck unclean for 71734.218929, current state active+remapped+backfilling, last acting [9,1,5]
pg 2.12 is stuck unclean for 71742.212600, current state active+remapped+backfilling, last acting [3,7,5]
pg 2.22 is active+remapped+backfilling, acting [3,4,6]
pg 2.12 is active+remapped+backfilling, acting [3,7,5]
pg 2.0 is active+remapped+backfilling, acting [3,5,7]
pg 2.31 is active+remapped+backfilling, acting [9,1,5]
recovery 2787/542792 objects degraded (0.513%)
too few pgs per osd (17 < min 20)
mon.3 (rank 3) addr 10.11.12.240:6789/0 is down (out of quorum)

But this part concerns me: too few pgs per osd (17 < min 20).

I'll give that more time to heal, then try to start up some of the KVMs.
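
To see where the 'too few pgs per osd' warning comes from, the pool's current pg_num and replica count can be checked; a small sketch (the pool name rbd is an assumption here, adjust to the actual pool):
Code:
# list the pools, then check pg_num and replica size for the rbd pool
ceph osd lspools
ceph osd pool get rbd pg_num
ceph osd pool get rbd size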
 

It was a 4-node cluster. The 4th node did not have any OSDs assigned yet... Due to a hardware issue I deleted the node using 'pve delnode'.

The OSDs are all up.

For monitors, 3 are up with quorum set to Yes.
 
There is/was plenty of space; last time I checked we had 18 or 20T free.

I'll try '/etc/init.d/ceph start osd repair' if what I did already [starting the OSDs in the PVE web page] does not fix the issue.

Here is the tree:
Code:
ceph4-ib  ~ # ceph osd tree
# id    weight  type name       up/down reweight
-1      20.02   root default
-2      7.28            host ceph4-ib
0       1.82                    osd.0   up      1
1       1.82                    osd.1   up      1
2       1.82                    osd.2   up      1
3       1.82                    osd.3   up      1
-3      7.28            host ceph3-ib
4       1.82                    osd.4   up      1
5       1.82                    osd.5   up      1
6       1.82                    osd.6   up      1
7       1.82                    osd.7   up      1
-4      5.46            host ceph2-ib
8       1.82                    osd.8   up      1
9       1.82                    osd.9   up      1
10      1.82                    osd.10  up      1
 
Now the health is:

Code:
ceph4-ib  ~ # ceph  health detail
HEALTH_WARN too few pgs per osd (17 < min 20); 1 mons down, quorum 0,1,2 0,2,1
too few pgs per osd (17 < min 20)
mon.3 (rank 3) addr 10.11.12.240:6789/0 is down (out of quorum)

I tried to run the repair, but that did not work:
Code:
ceph4-ib  ~ # /etc/init.d/ceph start osd repair
/etc/init.d/ceph: rep.air not found (/etc/ceph/ceph.conf defines osd.0 osd.1 osd.2 osd.3 mon.0 mon.1 mon.2 mon.3, /var/lib/ceph defines osd.0 osd.1 osd.2 osd.3)
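
The init script does not accept 'repair' here; if a repair is still wanted, the ceph CLI can target an OSD directly. A sketch, using osd.8 and osd.10 from the earlier slow-requests lines:
Code:
# tell specific OSDs to repair their placement groups
ceph osd repair 8
ceph osd repair 10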
 
Hmm,
it looks like Ceph changed something - that worked on an older Ceph version...

But anyway - ceph osd tree shows all OSDs up. So the problem is more likely too few PGs in the pool?!

With 11 OSDs and 2 replicas (?) your pool should have around 550 PGs (OSDs * 100 / replicas).

Udo
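
Plugging the numbers from the osd tree above into that rule of thumb (assuming a replica count of 2):
Code:
# 11 OSDs * 100 / 2 replicas = 550 target PGs for the pool
# rounded to a power of two: 512 (or 1024 to leave room for growth)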
 
Hi,
it's strongly recommended to increase the pg_num + pgp_num values to something like 512 or 1024 (Ceph recommends a power-of-2 value):
Code:
ceph osd pool set rbd pg_num 512
ceph osd pool set rbd pgp_num 512
Allow some time for the rebalance.

Udo
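
To confirm the new values and follow the rebalance afterwards, something like this should be enough (pool name rbd as in the commands above):
Code:
# verify that both values were applied
ceph osd pool get rbd pg_num
ceph osd pool get rbd pgp_num

# watch the cluster until it returns to HEALTH_OK
ceph -w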